Abstract
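This paper proposes a new knowledge acquisition method: starting from a raw corpus with no annotation of any kind, labeled training data are constructed automatically under the NA assumption, and noun phrase structure rules are then acquired automatically using a multi-feature similarity estimation technique. The method has two characteristics: (1) because the labeled training data are derived automatically from an unannotated raw corpus, the labeled data set can be very large, and labeled corpora for different domains are easy to build; (2) the acquired phrase structure rules carry probabilities and can be used for noun phrase extraction in applications such as classification and retrieval. To demonstrate the method's effectiveness, it was tested on real automotive-parts text from the US company Berlitz; the accuracy of the top 50 noun phrase structure rules reached 80%.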
This paper presents a new approach to acquiring noun phrase (NP) structure rules from corpora that have neither bracketing nor nonterminal labels, based on multi-feature similarity estimation. The system computes the distance between each rule and all feature rules from their local contextual information, then sorts the rules by distance: the smaller the distance, the greater the similarity. Experiments on the Berlitz corpus show that the approach reaches 80% accuracy on the first 50 rules. This result indicates that training data acquisition based on the NA assumption is effective for rule acquisition and parsing.
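The ranking step described above can be illustrated with a minimal sketch. The abstract does not give the actual distance function or feature set, so everything below is an assumption made for illustration only: local context is represented as a sparse feature-weight dictionary, the distance is Euclidean, and a candidate's score is its mean distance to the feature rules. The names `distance`, `rank_rules`, and the toy data are hypothetical.

```python
from math import sqrt


def distance(u, v):
    """Euclidean distance between two sparse context vectors (assumed metric)."""
    keys = set(u) | set(v)
    return sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def rank_rules(candidates, feature_rules):
    """Sort candidate rules by mean distance to the feature rules.

    candidates: dict mapping a rule string to its context vector.
    feature_rules: non-empty list of context vectors for the feature rules.
    A smaller distance is read as a greater similarity.
    """
    scored = []
    for rule, context in candidates.items():
        total = sum(distance(context, f) for f in feature_rules)
        scored.append((rule, total / len(feature_rules)))
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data: the feature names are invented for this sketch.
feature_rules = [{"left:DET": 1.0, "right:VERB": 1.0}]
candidates = {
    "NP -> ADJ NOUN": {"left:DET": 1.0, "right:VERB": 1.0},
    "NP -> VERB ADV": {"left:NOUN": 1.0},
}

ranked = rank_rules(candidates, feature_rules)
for rule, dist in ranked:
    print(f"{rule}\t{dist:.3f}")
```

In this toy setting the candidate whose context matches the feature rule sorts first; taking the top N entries of `ranked` corresponds to evaluating "the first 50 rules" in the reported experiment.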
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals
1999, No. 5, pp. 601-607 (7 pages)
Journal of Computer Research and Development
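Funding
National Natural Science Foundation of China
Doctoral Program Foundation of the State Education Commission of China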
Keywords
phrase structure rule
natural language processing
automatic acquisition
noun phrase structure rule, distance function, multi-feature-based similarity estimation