Abstract
Multiword expressions are pervasive in natural language, and their automatic extraction plays a vital role in many natural language processing tasks. This study uses the 1- to 5-gram corpus of public web pages released by Google (the Web 1T 5-gram Corpus) as the source of frequency statistics and combines it with automatic part-of-speech tagging to extract multiword expressions from the written portion of the British National Corpus. The pilot study shows that the approach makes full use of the Google corpus's precise frequency information as a reliable and ample source of co-occurrence data, thereby improving the precision of multiword expression extraction and mitigating the problems caused by data sparseness.
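As a rough illustration of the kind of pipeline the abstract describes (n-gram frequency counts combined with part-of-speech filtering), the sketch below scores adjacent word pairs with pointwise mutual information. The abstract does not name the association measure the authors used, so PMI is an assumed stand-in here. The toy counts, the tag set, the allowed tag patterns, the threshold, and the function names `pmi` and `extract_candidates` are likewise hypothetical and do not come from the paper or from the Web 1T data.

```python
import math

# Hypothetical toy counts standing in for Web 1T unigram/bigram frequencies.
UNIGRAM_COUNTS = {"the": 50000, "red": 3000, "tape": 400, "of": 40000}
BIGRAM_COUNTS = {("red", "tape"): 150, ("of", "the"): 9000}
TOTAL_TOKENS = 1_000_000  # assumed corpus size for the toy counts

# Assumed part-of-speech patterns that a candidate pair must match.
ALLOWED_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def pmi(w1, w2):
    """Pointwise mutual information (log base 2) of an adjacent word pair."""
    joint = BIGRAM_COUNTS.get((w1, w2), 0)
    c1 = UNIGRAM_COUNTS.get(w1, 0)
    c2 = UNIGRAM_COUNTS.get(w2, 0)
    if joint == 0 or c1 == 0 or c2 == 0:
        # Unseen pair or word: no evidence of association.
        return float("-inf")
    p_joint = joint / TOTAL_TOKENS
    p1 = c1 / TOTAL_TOKENS
    p2 = c2 / TOTAL_TOKENS
    return math.log2(p_joint / (p1 * p2))


def extract_candidates(tagged_tokens, threshold=3.0):
    """Return (pair, score) for adjacent pairs that match an allowed
    tag pattern and whose PMI reaches the threshold."""
    results = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) not in ALLOWED_PATTERNS:
            continue
        score = pmi(w1.lower(), w2.lower())
        if score >= threshold:
            results.append(((w1, w2), score))
    return results


# Usage: a pre-tagged token sequence (the tags are assumed to come from any tagger).
sentence = [("the", "DET"), ("red", "ADJ"), ("tape", "NOUN"),
            ("of", "ADP"), ("the", "DET")]
candidates = extract_candidates(sentence)
```

In this toy setup only the pair "red tape" passes both the tag filter and the score threshold. The frequent pair "of the" is rejected by the tag filter before any score is computed, which is the role the part-of-speech information plays in this sketch.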
Source
Bulletin of Science and Technology (《科技通报》)
Peking University Core Journal (北大核心)
2013, No. 10, pp. 171-173 (3 pages)
Funding
Supported by the Fundamental Research Funds for the Central Universities (2012HGXJ0109, 2012HGXJ0110)
Keywords
multiword expression
natural language processing
data sparseness