Abstract
Manually annotating prosodic structure is difficult, which makes large-scale annotated corpora hard to obtain. Exploiting the fact that punctuation marks indicate pauses, this paper proposes a prosodic phrase recognition method that uses an unlabeled corpus and an "adhesion" culling strategy. Punctuation marks are graded into levels and assigned different weights when they are used to simulate prosodic boundaries. A maximum entropy model is built on the unlabeled corpus, and a Top-K method is used to predict the prosodic phrase boundaries of a sentence automatically. The mutual information between the parts of speech of adjacent grammatical words is then computed to bind words into "adhesion" units, and any prosodic boundary that falls inside a unit is removed, which yields the recognized prosodic phrases. Experimental results show that grading punctuation when building the unlabeled corpus and applying the "adhesion" culling strategy both clearly improve model performance, and that the method achieves good recognition results.
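The "adhesion" culling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes pointwise mutual information over adjacent part-of-speech tag pairs as the mutual information measure, a single fixed cutoff (the hypothetical parameter `threshold`), and boundaries given as word indices. The function names `pos_pmi` and `cull_boundaries` are invented for this sketch.

```python
import math
from collections import Counter


def pos_pmi(tagged_sentences):
    """Estimate pointwise mutual information for adjacent POS-tag pairs.

    tagged_sentences is a list of POS-tag sequences, one per sentence.
    PMI is an assumed stand-in for the mutual information in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in tagged_sentences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def cull_boundaries(tags, boundaries, pmi, threshold):
    """Drop predicted boundaries that fall inside an adhesion unit.

    boundaries holds indices i meaning "a boundary after word i".
    Words i and i+1 are treated as adhered when the PMI of their POS
    tags is at least `threshold` (an assumed cutoff, not from the paper).
    """
    kept = []
    for i in boundaries:
        has_next = i + 1 < len(tags)
        if has_next and pmi.get((tags[i], tags[i + 1]), float("-inf")) >= threshold:
            continue  # boundary lies inside an adhesion unit, so remove it
        kept.append(i)
    return kept
```

In this sketch, the candidate boundaries passed to `cull_boundaries` would come from the maximum entropy model's Top-K predictions; tag pairs never seen in the corpus are treated as not adhered, so their boundaries are kept.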
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal
2016, No. 2, pp. 51-56 (6 pages)
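Funding
Young Scientists Fund of the National Natural Science Foundation of China (61005053, 61100138)
Shanxi Province Youth Science and Technology Research Fund (2012021012-1)
Shanxi Province Natural Science Foundation (2011011016-2)
Shanxi Province Research Project for Returned Overseas Scholars (2013-022)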
Keywords
Unlabeled corpus
Prosodic phrase boundary
Maximum entropy (ME)
Mutual information