Abstract
Detecting approximately duplicate records has become an important branch of data cleaning and an important way to eliminate data redundancy and improve data quality, with practical applications in data statistics, data analysis, data warehousing, artificial intelligence, and data mining. This paper studies current methods for detecting approximately duplicate records. Because many of them suffer from low detection accuracy and slow running time, the proposed method groups records by K-Modes clustering and uses information entropy theory to determine attribute weights and reduce the attribute dimensions. In the record matching phase, the records in each cluster are compared attribute by attribute in order of attribute importance, and their similarity is judged against a threshold, which avoids the time cost of matching on whole records. Once every data set has been checked, the approximately duplicate records are eliminated. Experiments show that the method effectively narrows the range of data to be checked and improves matching efficiency, improves detection accuracy and time efficiency, and achieves high recall and precision.
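The matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the weighting rule (Shannon entropy of each attribute's value distribution, normalized to sum to 1), the exact-equality attribute comparison, the sample records, the threshold of 0.6, and the function names `entropy_weights` and `is_similar` are all assumptions made for illustration. The K-Modes grouping step is omitted, so the records are assumed to belong to a single cluster already.

```python
import math
from collections import Counter


def entropy_weights(records):
    """Weight each attribute by the Shannon entropy of its values (assumed rule)."""
    n_attrs = len(records[0])
    total = len(records)
    entropies = []
    for j in range(n_attrs):
        counts = Counter(rec[j] for rec in records)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    s = sum(entropies)
    if s == 0:
        # Every attribute is constant: fall back to equal weights.
        return [1.0 / n_attrs] * n_attrs
    return [h / s for h in entropies]


def is_similar(a, b, weights, threshold):
    """Compare attributes from most to least important, stopping early.

    The loop returns as soon as the accumulated weight of matching
    attributes reaches the threshold, or as soon as the remaining
    attributes could no longer lift the score up to it.
    """
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    score = 0.0
    remaining = sum(weights)
    for j in order:
        remaining -= weights[j]
        if a[j] == b[j]:  # exact match on a categorical value (assumption)
            score += weights[j]
        if score >= threshold:
            return True
        if score + remaining < threshold:
            return False
    return score >= threshold


# Hypothetical records in one cluster: (name, city, gender).
records = [
    ("Li Wei", "Xi'an", "M"),
    ("Li Wei", "Xi'an", "F"),
    ("Wang Fang", "Beijing", "F"),
    ("Zhao Lei", "Xi'an", "M"),
]
weights = entropy_weights(records)

print(is_similar(records[0], records[1], weights, 0.6))  # True
print(is_similar(records[0], records[3], weights, 0.6))  # False
```

In this sample the name attribute receives the largest weight, so the comparison of the first and fourth records is rejected after a single attribute comparison, which is the kind of saving the abstract attributes to avoiding whole-record matching.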
Authors
CHEN Yanping (陈彦萍)
HONG Mingjie (洪明杰)
YANG Xiaobao (杨小宝)
Xi'an University of Posts and Telecommunications, Xi'an 710121
Source
Computer & Digital Engineering (《计算机与数字工程》)
2019, No. 12, pp. 2966-2972 (7 pages)
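Funding
Shaanxi Province Science and Technology Coordination and Innovation Project, Key Industry Innovation Chain, Industrial Field (No. 2016KTZDGY04-01)
Special Scientific Research Program of the Shaanxi Provincial Department of Education (No. 16JK1701)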
Keywords
approximately duplicate records
K-Modes clustering algorithm
information entropy
similarity detection