
Evaluation Method of the Corpus Segmentation Based on Clustering (基于聚类的语料库分词评价方法研究)

Cited by: 4
Abstract: This paper proposes a new approach to evaluating the word segmentation accuracy of large-scale Chinese text corpora: text samples are clustered on top of stratified sampling. Clustering can either improve the testing precision or reduce the required sample size. The method uses a new sample similarity measure that combines the distance between sample vectors with the linear correlation among the components of the sample vectors. By dynamically evaluating the clustering results, the number of clusters and the similarity factor are adjusted, which improves the efficiency and quality of the clustering. Compared with random sampling, the sample clustering method reduces the sample size by 63.3% for large-scale corpora. The experiments also show that the method improves the testing precision by 60%.
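This record does not reproduce the paper's similarity formula, so the following Python sketch only illustrates the general idea of blending vector distance with linear correlation. The function name `combined_similarity`, the mapping of distance to a closeness score, and the weight `alpha` (a hypothetical stand-in for the "similarity factor") are all assumptions made for illustration, not the authors' definition.

```python
import math


def combined_similarity(x, y, alpha=0.5):
    """Illustrative similarity in [0, 1]; NOT the formula from the paper.

    Blends a distance-based closeness term with the Pearson correlation
    of the two vectors. `alpha` is a hypothetical weighting parameter.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Distance term: Euclidean distance mapped into (0, 1], 1 meaning identical.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    closeness = 1.0 / (1.0 + dist)

    # Correlation term: Pearson correlation rescaled from [-1, 1] to [0, 1].
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as uncorrelated.
    corr = cov / (norm_x * norm_y) if norm_x > 0 and norm_y > 0 else 0.0
    corr01 = (corr + 1.0) / 2.0

    return alpha * closeness + (1.0 - alpha) * corr01
```

Under these assumptions, two identical sample vectors score close to 1, while vectors that are far apart and negatively correlated score close to 0. A clustering procedure could then group samples whose pairwise score exceeds a chosen threshold.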
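Source: Chinese Journal of Computers (《计算机学报》), 2004, No. 2, pp. 192-196 (5 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: Supported by the National High Technology Research and Development Program of China (863 Program), grant 2001AA114031.
Keywords: Chinese; corpus; word segmentation evaluation; similarity factor; sample clustering; linguistics; stratified sampling; Classification (of information); Computer selection and evaluation; Indexing (of information); Sampled data control systems; Sampling; Vectors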

References (8)

  • 1. Kirk M. Wolter. Introduction to Variance Estimation. Beijing: Statistics Press of China, 1998 (in Chinese; translated by Wang Jili, Li Yi, et al.).
  • 2. Lars Bretzner, Ivan Laptev, Tony Lindeberg. Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering. In: Proceedings of Face and Gesture 2002, Washington DC, 2002, 423-428.
  • 3. Yang Junlong, Jin Yongjin. Application of stratified sampling in the audit of accounts receivable. 经济经纬, 2002, 19(5): 88-90.
  • 4. Feng Shi-Yong, Ni Jia-Xun, Zou Guo-Hua. Theories and Methods of Sampling Survey. Beijing: Statistics Press of China, 1998 (in Chinese).
  • 5. Liu Shaohui, Dong Mingkai, Zhang Haijun, Li Rong, Shi Zhongzhi. A multi-level text classification method based on the vector space model. Journal of Chinese Information Processing (中文信息学报), 2002, 16(3): 8-14.
  • 6. Sun Ji-Xiang et al. Modern Pattern Recognition. Changsha: Press of National University of Defence Technology, 2002 (in Chinese).
  • 7. Hall D.J., Ball D.J. ISODATA: A novel method of data analysis and pattern classification. Stanford Research Institute, Menlo Park, CA: Technical Report AD 699616, 1965.
  • 8. Judith T. Lessler, William D. Kalsbeek. Nonsampling Error in Surveys. Beijing: Statistics Press of China, 1997 (in Chinese; translated by Jin Yongjin).
