Recognition of Prosodic Phrases Based on Unlabeled Corpus and “Adhesion” Culling Strategy
Abstract Manually annotating prosodic structure to obtain a large-scale corpus is difficult and problematic. Exploiting the fact that punctuation marks indicate pauses, this paper proposes a prosodic phrase recognition method that uses an unlabeled corpus and a word "adhesion" culling strategy. Punctuation marks are graded into levels and assigned different weights when they are used to simulate prosodic boundaries. A maximum entropy model is built on the unlabeled corpus, and a Top-K method is used to predict the prosodic phrase boundaries of a sentence automatically. The mutual information between the parts of speech of adjacent grammatical words is computed to bundle the words of a sentence into "adhesion" units, and any prosodic boundary that falls inside such a unit is removed, which yields automatic recognition of prosodic phrases. Experimental results show that grading punctuation when building the unlabeled corpus and applying the "adhesion" culling strategy both clearly improve model performance, and that the method achieves good recognition results.
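The "adhesion" culling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact mutual information formula, the threshold, or the data layout, so the pointwise mutual information estimate, the `threshold` parameter, and the function names `pos_pmi`, `adhesion_links`, and `cull_boundaries` are all assumptions made for illustration.

```python
import math
from collections import Counter


def pos_pmi(tagged_sentences):
    """Estimate pointwise mutual information for adjacent POS tag pairs.

    tagged_sentences: list of POS tag sequences, one per sentence.
    The PMI form log2(p(a,b) / (p(a) * p(b))) is an assumption; the
    abstract only says "mutual information" between adjacent POS tags.
    """
    unigram = Counter()
    bigram = Counter()
    for tags in tagged_sentences:
        unigram.update(tags)
        bigram.update(zip(tags, tags[1:]))
    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())
    pmi = {}
    for (a, b), count in bigram.items():
        p_ab = count / n_bi
        p_a = unigram[a] / n_uni
        p_b = unigram[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def adhesion_links(tags, pmi, threshold):
    """links[i] is True when word i and word i+1 share an adhesion unit.

    threshold is a hypothetical tuning parameter; unseen tag pairs are
    treated as not adhering.
    """
    return [
        pmi.get((a, b), float("-inf")) >= threshold
        for a, b in zip(tags, tags[1:])
    ]


def cull_boundaries(predicted, links):
    """Remove predicted boundaries that fall inside an adhesion unit.

    predicted: iterable of indices i meaning "boundary after word i".
    A boundary after the last word has no link and is always kept.
    """
    return sorted(i for i in predicted if i >= len(links) or not links[i])
```

In this sketch the boundaries passed to `cull_boundaries` would come from the maximum entropy model with Top-K selection; here they are simply given as a set of word indices.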
Source 《计算机科学》 (Computer Science), CSCD, Peking University Core Journal, 2016, Issue 2, pp. 51-56 (6 pages)
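Funding Supported by the National Natural Science Foundation of China Youth Program (61005053, 61100138), the Shanxi Province Youth Science and Technology Research Fund (2012021012-1), the Natural Science Foundation of Shanxi Province (2011011016-2), and the Shanxi Province Research Project for Returned Overseas Scholars (2013-022).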
Keywords Unlabeled corpus; Prosodic phrase boundary; Maximum entropy (ME); Mutual information