Abstract
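A new approach is proposed for evaluating the word segmentation accuracy of large-scale Chinese text corpora: text samples are clustered on the basis of stratified sampling. Clustering can improve test precision or reduce the sample size. The method uses a new sample similarity measure that jointly considers the distance between sample vectors and the linear correlation among the components of the sample vectors. By dynamically evaluating the clustering results and adjusting the number of clusters and the similarity factor, the efficiency and quality of clustering are improved.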
A testing model for the word segmentation of large-scale corpora is proposed. The model adopts a sample clustering method based on stratified sampling. Clustering is carried out with a new similarity measure for samples that takes into account both the distance between sample vectors and the linear correlation among the components of the sample vectors. Through dynamic evaluation of the clustering results, the clustering parameters are adjusted, which improves clustering efficiency and quality. Compared with random sampling, the sample clustering method reduces the sample size by 63.3% for large-scale corpora. The experiment also shows that the method improves testing precision by 60%.
Source
《计算机学报》
EI
CSCD
PKU Core (Peking University Core Journals)
2004, No. 2, pp. 192-196 (5 pages)
Chinese Journal of Computers
Funding
Supported by the National High Technology Research and Development Program of China (863 Program) (2001AA114031)
Keywords
Chinese language
Corpus
Word segmentation evaluation
Similarity factor
Sample clustering
Linguistics
Stratified sampling
Classification (of information)
Computer selection and evaluation
Indexing (of information)
Sampled data control systems
Sampling
Vectors