Abstract
Identifying and eliminating approximately duplicate records is a central problem in cleaning dirty data in data warehouses. To handle duplicate Chinese address information, this paper proposes a word segmentation strategy based on feature characters. A metadata database containing the segmentation rules is established, and on that basis a feature-character-based segmentation algorithm is described. Experimental results show that segmentation time changes little as the data set grows. The segmentation method can therefore be applied to detecting approximately duplicate Chinese address records without increasing the detection time.
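As a rough illustration of what segmentation by feature characters could look like, the Python sketch below cuts an address after each character that typically ends an address component (province, city, district, street, house number). The character set `FEATURE_CHARS` and the function `segment_address` are hypothetical names chosen for this example; the rule set actually stored in the paper's metadata database is not given in the abstract, so this is a minimal sketch under those assumptions and not the authors' algorithm.

```python
# Hypothetical feature characters that usually close an address component.
# The real rules live in the paper's metadata database and are not shown
# in the abstract; this set is an assumption for illustration only.
FEATURE_CHARS = set("省市区县镇乡村路街巷号")


def segment_address(address: str) -> list[str]:
    """Split an address into components, cutting after each feature character."""
    segments = []
    current = ""
    for ch in address:
        current += ch
        if ch in FEATURE_CHARS:
            # A feature character closes the current component.
            segments.append(current)
            current = ""
    if current:
        # Keep any trailing text that has no closing feature character.
        segments.append(current)
    return segments


if __name__ == "__main__":
    print(segment_address("辽宁省沈阳市皇姑区黄河北大街37号"))
```

The sketch makes a single pass over each address, so its cost grows with the length of the address string and not with the number of records already processed. A rule this simple will mis-split place names that themselves contain a feature character, which is one reason a rule database would be needed in practice.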
Source
《沈阳航空工业学院学报》
2008, No. 4, pp. 63-66 (4 pages)
Journal of Shenyang Institute of Aeronautical Engineering
Keywords
Approximately duplicate records (相似重复记录)
Chinese addresses (中文地址)
Feature characters (特征字符)
Word segmentation (分词)