Abstract
Implicit discourse relation identification has long been a difficult problem in information extraction (IE). Existing supervised methods for discourse relation identification depend on large amounts of manually annotated data. To address this, we propose a semi-supervised model for automatic implicit discourse relation identification based on a self-training strategy. Using only a small number of labeled examples, the model aims for identification accuracy comparable to that of supervised methods, which opens a new opportunity for future real-time, big-data discourse relation identification. In addition, to further improve accuracy, we analyze combinations of 9 kinds of discourse relation features, including word-pair, production-rule, and verb features, and use these combinations to build the knowledge representation of candidate discourse relation instances and optimize the model. Experiments on the Penn Discourse Treebank (PDTB2.0) corpus show that the model improves accuracy and F-score by 5.2% and 13.5%, respectively, over a traditional supervised method.
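The self-training strategy named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the classifier, the confidence measure, or the stopping rule, so the nearest-centroid classifier, the `threshold` and `max_rounds` parameters, and the two-dimensional feature vectors below are all assumptions chosen to keep the example self-contained. Only the general loop (train on labeled data, label the unlabeled pool, keep confident predictions, retrain) reflects self-training as a technique.

```python
from collections import defaultdict
import math


def train_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per label.

    Stand-in classifier (assumption); the paper's classifier is not given here.
    """
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence), with confidence a softmax over negative distances."""
    weights = {lab: math.exp(-math.dist(x, c)) for lab, c in centroids.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Generic self-training loop (hypothetical parameters, not from the paper)."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = train_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                # Confident prediction: promote to the labeled set as a pseudo-label.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # No new pseudo-labels, so further rounds would not change the model.
    return train_centroids(X_lab, y_lab), pool


# Toy data: two labeled seeds and three unlabeled instances (illustrative only).
seeds = [[0.0, 0.0], [10.0, 10.0]]
seed_labels = ["Comparison", "Contingency"]
unlabeled = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
model, leftover = self_train(seeds, seed_labels, unlabeled)
```

In this toy run the two instances near a seed are pseudo-labeled and absorbed into the training set, while the instance equidistant from both classes never reaches the confidence threshold and stays in the unlabeled pool. In a real setting the feature vectors would come from the feature combinations the abstract describes.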
Source
《厦门大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal
2014, No. 2, pp. 182-189 (8 pages)
Journal of Xiamen University: Natural Science
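Funding
National Natural Science Foundation of China (60803078)
Natural Science Foundation of Fujian Province (2010J01351)
Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education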
Keywords
implicit discourse relation identification
semi-supervised learning
self-training
combined features