Master of Science in Computer Science (MSCS)
Given a collection of strings and a query string, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to the query string. In this thesis, we focus on edit distance as the measure of similarity between two strings. Existing q-gram based methods use inverted lists to index the q-grams of the given string collection. These methods first generate the q-grams of the query string, either disjoint or overlapping, and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed to segment the inverted lists into shorter lists, thus reducing the merging cost. The filtering technique we propose in this thesis, called position restricted alignment, combines the well-known length filtering and position filtering techniques to provide more aggressive pruning. We then provide an indexing scheme that integrates inverted-list storage with the proposed filter, so that the inverted lists are filtered automatically. We evaluate the effectiveness of the proposed approach experimentally.
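To make the background concrete, the following is a minimal sketch of the generic q-gram pipeline the abstract refers to: an inverted index over overlapping q-grams, length filtering and position filtering applied while scanning the lists, and a final edit distance check. It is not the position restricted alignment filter or the indexing scheme proposed in the thesis. The function names (`qgrams`, `build_index`, `edit_distance`, `search`), the posting layout, and the sample strings are hypothetical, and the classical count bound |query| - q + 1 - k*q is assumed as the merge threshold.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping q-grams of s as (position, gram) pairs."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to postings of (string id, string length, position).

    Hypothetical layout chosen for illustration only.
    """
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in qgrams(s, q):
            index[gram].append((sid, len(s), pos))
    return index


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def search(strings, index, query, q, k):
    """Return ids of strings within edit distance k of query."""
    # Assumed count bound: at most k edits destroy at most k*q query q-grams.
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        # The count bound cannot prune, so fall back to length filtering only.
        candidates = [sid for sid, s in enumerate(strings)
                      if abs(len(s) - len(query)) <= k]
    else:
        counts = defaultdict(int)
        for qpos, gram in qgrams(query, q):
            matched = set()
            for sid, length, pos in index.get(gram, ()):
                # Length filter: lengths may differ by at most k.
                # Position filter: a shared q-gram may shift by at most k.
                if abs(length - len(query)) <= k and abs(pos - qpos) <= k:
                    matched.add(sid)
            for sid in matched:
                counts[sid] += 1  # each query q-gram counts once per string
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    # Verify the surviving candidates with the exact edit distance.
    return sorted(sid for sid in candidates
                  if edit_distance(strings[sid], query) <= k)


strings = ["approximate", "appropriate", "alignment", "approximately", "sequence"]
index = build_index(strings, q=2)
print(search(strings, index, "aproximate", q=2, k=1))  # prints [0]
```

In this sketch the two filters are evaluated posting by posting at query time. An index that partitions the inverted lists by length and position in advance, which is the direction the abstract describes, would avoid reading the pruned postings at all.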
Document Availability at the Time of Submission
Release the entire work immediately for access worldwide.
Cai, Xuanting, "Approximate sequence alignment" (2013). LSU Master's Theses. 983.