Trie tree based new word discovery algorithm using left-right entropy and mutual information (cited by: 12)
Abstract: The emergence of large numbers of new words makes Chinese text analysis considerably harder, so new word discovery has become both a hot topic and a difficult problem in Chinese natural language processing. This paper proposes a Trie tree based new word discovery algorithm that uses the left-right entropy of words and mutual information. First, following word-formation rules, stop words and non-Chinese characters are filtered out of the text, and each character is paired with its right-hand neighbor to form a two-character tuple (bigram). Left-right information entropy and mutual information are then used to compute a word-formation probability, and new words are selected according to this probability and the word frequency. Three experiments were designed to verify the effectiveness and feasibility of the algorithm. The results show that the algorithm achieves relatively high word-formation accuracy, is considerably more time-efficient than other new word discovery algorithms, and plays an important role in optimizing Chinese word segmentation results.
Authors: GUO Li (郭理); ZHANG Hengxu (张恒旭); WANG Jiaqi (王嘉岐); QIN Huaibin (秦怀斌) (College of Information Science and Technology, Shihezi University, Shihezi 832000, China)
Source: Modern Electronics Technique (《现代电子技术》), Peking University Core Journal (北大核心), 2020, No. 6, pp. 65-69, 5 pages
Funding: National Social Science Fund of China project (14XXW004).
Keywords: new word discovery algorithm; left-right entropy; mutual information; Trie tree; algorithm design; comparison validation
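
The scoring idea summarized in the abstract (pair each character with its right neighbor, then judge each pair by frequency, mutual information, and left-right entropy) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes pointwise mutual information log(p(xy) / (p(x)p(y))) and Shannon entropy over the neighboring characters, stores counts in plain `Counter` objects instead of a Trie, and omits stop-word filtering. The function names and all threshold values (`min_freq`, `min_pmi`, `min_entropy`) are hypothetical choices made only for this example.

```python
import math
import re
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor frequencies."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_bigrams(text):
    """Return {bigram: (freq, pmi, left_entropy, right_entropy)}."""
    # Keep only runs of common CJK characters; anything else is a boundary.
    segments = re.findall(r"[\u4e00-\u9fff]+", text)
    char_freq = Counter()
    bigram_freq = Counter()
    left_nbrs = defaultdict(Counter)   # characters seen just before a bigram
    right_nbrs = defaultdict(Counter)  # characters seen just after a bigram
    for seg in segments:
        char_freq.update(seg)
        for i in range(len(seg) - 1):
            bg = seg[i:i + 2]
            bigram_freq[bg] += 1
            if i > 0:
                left_nbrs[bg][seg[i - 1]] += 1
            if i + 2 < len(seg):
                right_nbrs[bg][seg[i + 2]] += 1
    n_chars = sum(char_freq.values())
    n_bigrams = sum(bigram_freq.values())
    scores = {}
    for bg, freq in bigram_freq.items():
        p_xy = freq / n_bigrams
        p_x = char_freq[bg[0]] / n_chars
        p_y = char_freq[bg[1]] / n_chars
        pmi = math.log(p_xy / (p_x * p_y))
        scores[bg] = (freq, pmi, entropy(left_nbrs[bg]), entropy(right_nbrs[bg]))
    return scores


def discover(text, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep bigrams that pass all three (hypothetical) thresholds."""
    return sorted(
        bg
        for bg, (freq, pmi, h_left, h_right) in score_bigrams(text).items()
        if freq >= min_freq and pmi >= min_pmi and min(h_left, h_right) >= min_entropy
    )


if __name__ == "__main__":
    sample = "我吃苹果了。他买苹果吗,你洗苹果吧!"
    print(discover(sample))
```

A candidate that occurs often, whose two characters co-occur far more than chance would predict (high mutual information), and that appears with varied characters on both sides (high left and right entropy) is more likely to be a free-standing word. A Trie, as used in the paper, would serve the same counting role as the dictionaries here while sharing prefixes between candidates.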