String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsiste...String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.展开更多
Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching,that is,the problem of matching place names that share a common referent.In this articl...Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching,that is,the problem of matching place names that share a common referent.In this article,we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task.We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics,which has the natural advantage of avoiding the manual tuning of similarity thresholds.Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small,and that carefully tuning the similarity threshold is important for achieving good results.The methods based on supervised machine learning,particularly when considering ensembles of decision trees,can achieve good results on this task,significantly outperforming the individual similarity metrics.展开更多
String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and...String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics.The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.展开更多
基金This work was partly supported by the National Grand Fundamental Research 973 Program of China (2015CB358700), the National Natural Science Foundation of China (Grant Nos. 61422205, 61472198), Beijing Higher Education Young Elite Teacher Project(YETP0105), Tsinghua-Tencent Joint Laboratory for Internet In- novation Technology, "NEXT Research Center", Singapore (WBS:R-252- 300-001-490), Huawei, Shenzhou, FDCT/ll6/2013/A3, MYRG105(Y1- L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Tech- nology (2013zx01039-002-002).
文摘String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.
基金the Trans-Atlantic Platform for the Social Sciences and Humanities,through the Digging into Data project with reference HJ-253525also through the Reassembling the Republic of Letters networking programme(EU COST Action IS1310)+1 种基金The researchers from INESC-ID also had financial support from Fundação para a Ciência e a Tecnologia(FCT),through project grants with references PTDC/EEI-SCR/1743/2014(Saturn)CMUP-ERI/TIC/0046/2014(GoLocal),as well as through the INESC-ID multi-annual funding from the PIDDAC programme(UID/CEC/50021/2013).
文摘Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching,that is,the problem of matching place names that share a common referent.In this article,we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task.We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics,which has the natural advantage of avoiding the manual tuning of similarity thresholds.Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small,and that carefully tuning the similarity threshold is important for achieving good results.The methods based on supervised machine learning,particularly when considering ensembles of decision trees,can achieve good results on this task,significantly outperforming the individual similarity metrics.
文摘String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics.The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.