Abstract
The emergence of large numbers of new words makes Chinese text analysis considerably more difficult, so new word discovery has become both an active and a difficult problem in Chinese natural language processing. This paper proposes a Trie-based new word discovery algorithm that uses left-right entropy and mutual information. Stop words and non-Chinese characters are first filtered out of the text according to word-formation rules, and each character is paired with its right-hand neighbor to form a bigram. The word-formation probability is then computed from the left-right information entropy and the mutual information, and new words are selected according to the computed probability and the word frequency. Three experiments were designed to verify the effectiveness and feasibility of the algorithm. The experimental results show that the algorithm achieves high word-formation accuracy, is considerably more time-efficient than other new word discovery algorithms, and plays an important role in optimizing Chinese word segmentation results.
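The pipeline outlined in the abstract (filtering, bigram formation, scoring by mutual information and left-right entropy, then thresholding on score and frequency) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Plain `Counter` objects stand in for the Trie used in the paper, stop words are simplified to single stop characters, and the function names (`split_segments`, `entropy`, `find_new_words`) and the threshold parameters (`min_freq`, `min_pmi`, `min_entropy`) are all assumptions made for illustration. The abstract does not state how the two scores are combined into a word-formation probability, so this sketch simply requires each one to pass its own threshold.

```python
import math
import re
from collections import Counter, defaultdict

# Runs of CJK Unified Ideographs; everything else is treated as a boundary.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_segments(text, stop_chars=frozenset()):
    """Drop non-Chinese and stop characters, keeping the pieces between them."""
    segments = []
    for run in CHINESE_RUN.findall(text):
        current = []
        for ch in run:
            if ch in stop_chars:
                if current:
                    segments.append("".join(current))
                current = []
            else:
                current.append(ch)
        if current:
            segments.append("".join(current))
    return segments


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def find_new_words(text, stop_chars=frozenset(),
                   min_freq=2, min_pmi=1.0, min_entropy=1.0):
    """Return bigrams that pass the (assumed) frequency, PMI and entropy thresholds."""
    char_counts = Counter()
    bigram_counts = Counter()
    left = defaultdict(Counter)   # characters seen just before each bigram
    right = defaultdict(Counter)  # characters seen just after each bigram

    for seg in split_segments(text, stop_chars):
        char_counts.update(seg)
        for i in range(len(seg) - 1):
            bigram = seg[i:i + 2]  # a character plus its right-hand neighbor
            bigram_counts[bigram] += 1
            if i > 0:
                left[bigram][seg[i - 1]] += 1
            if i + 2 < len(seg):
                right[bigram][seg[i + 2]] += 1

    n_chars = sum(char_counts.values())
    n_bigrams = sum(bigram_counts.values())
    results = {}
    for bigram, freq in bigram_counts.items():
        # Pointwise mutual information between the two characters.
        p_xy = freq / n_bigrams
        p_x = char_counts[bigram[0]] / n_chars
        p_y = char_counts[bigram[1]] / n_chars
        pmi = math.log2(p_xy / (p_x * p_y))
        # Use the weaker of the two sides as the boundary-freedom score.
        boundary = min(entropy(left[bigram]), entropy(right[bigram]))
        if freq >= min_freq and pmi >= min_pmi and boundary >= min_entropy:
            results[bigram] = {"freq": freq, "pmi": pmi, "entropy": boundary}
    return results
```

Calling `find_new_words(text)` on a corpus returns a dictionary that maps each surviving bigram to its frequency, mutual information, and boundary entropy. In practice the thresholds would need tuning against a labeled sample, and candidates longer than two characters would require extending the counting step.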
Authors
郭理
张恒旭
王嘉岐
秦怀斌
GUO Li; ZHANG Hengxu; WANG Jiaqi; QIN Huaibin (College of Information Science and Technology, Shihezi University, Shihezi 832000, China)
Source
《现代电子技术》
PKU Core Journal (北大核心)
2020, Issue 6, pp. 65-69 (5 pages)
Modern Electronics Technique
Funding
National Social Science Fund of China (14XXW004).
Keywords
new word discovery algorithm
left-right entropy
mutual information
Trie tree
algorithm design
comparison validation