Academic
Publications
Efficient Merging and Filtering Algorithms for Approximate String Searches

Efficient Merging and Filtering Algorithms for Approximate String Searches,10.1109/ICDE.2008.4497434,Chen Li,Jiaheng Lu,Yiming Lu

Efficient Merging and Filtering Algorithms for Approximate String Searches   (Citations: 36)
BibTex | RIS | RefWorks Download
We study the following problem: how to efficiently find in a collection of strings those similar to a given query string? Various similarity functions can be used, such as edit distance, Jaccard similarity, and cosine similarity. This problem is of great interests to a variety of applications that need a high real-time performance, such as data cleaning, query relaxation, and spellchecking. Several algorithms have been proposed based on the idea of merging inverted lists of grams generated from the strings. In this paper we make two contributions. First, we develop several algorithms that can greatly improve the performance of existing algorithms. Second, we study how to integrate existing filtering techniques with these algorithms, and show that they should be used together judiciously, since the way to do the integration can greatly affect the performance. We have conducted experiments on several real data sets to evaluate the proposed techniques.
Cumulative Annual
View Publication
The following links allow you to view full publications. These links are maintained by other sources not affiliated with Microsoft Academic Search.
    • ...In combination with the DivideSkip technique on the inverted list index that was developed by Chen Li et al. [5], the exact PG-DS algorithm outperforms the approximate algorithm of Mazeika and B¨ ohlen for reasonable sample sizes...

    Michail Kazimianecet al. PG-Skip: Proximity Graph Based Clustering of Long Strings

    • ...the similarity, a lower bound on the number of common grams between two similar strings can be computed [15]...
    • ...There are two main families of algorithms for answering approximate string queries: those based on trees (e.g., suffix trees) [16], [12], [21], and those based on inverted indexes on q-grams [11], [15], [5]...
    • ...Several recent papers have focused on approximate selection queries [11], [15], [5] using in-memory indexes...
    • ...Similar observations exist for other similarity functions such as Jaccard and Dice [15]...
    • ...Analogous findings exist for other similarity functions [15]...
    • ...We use a structure called FilterTree [15] to facilitate such a partitioning...
    • ...“Simple” retrieves all the inverted lists of a query string’s grams and then solves the T-occurrence problem with an efficient in-memory algorithm (we used DivideSkip [15])...

    Alexander Behmet al. Answering approximate string queries on large data sets using external...

    • ...A wide variety of main-memory-single-node solutions have been proposed in the literature for various kinds of fuzzy queries (selection queries with similarity thresholds [41], ranking queries [58], and join queries [60])...

    Alexander Behmet al. ASTERIX: towards a scalable, semistructured data platform for evolving...

    • ...There are also many other studies on string similarity join [7], [15], [2], [3], [22], [20], [18], [17], which focus on either character-based similarity or token-based similarity, and approximate string searching [14], [8], [13], [9], [23], [12], which given a query string and a set of strings, finds all similar strings of the query string in the string set...

    Jiannan Wanget al. Fast-join: An efficient method for fuzzy token matching based string s...

    • ...Plaintext fuzzy keyword search solutions have been proposed [15-17]...
    • ...In [15], the authors propose some efficient merging algorithms to merge inverted lists of grams generated from stored strings so as to obtain approximately matched strings faster...

    M. Chuahet al. Privacy-Aware BedTree Based Solution for Fuzzy Multi-keyword Search ov...

Sort by: