Abstract
This paper presents an unsupervised method for acquiring verb semantic knowledge from a small corpus and a machine-readable dictionary (MRD). The method does not require a sense-tagged corpus. Instead, for each sense of a polysemous verb it learns typical usages from the usage examples attached to the MRD definitions, and it combines them with verb-object (V+N) collocations acquired from the corpus. Data sparseness is addressed in two ways. First, the word similarity measure is extended from direct co-occurrences to co-occurrences of co-occurring words, so that similarity is computed over co-occurrence clusters rather than over individual co-occurring words. Second, IS-A relations between nouns are acquired from the MRD definitions, which makes a rough clustering of the nouns possible. With these two techniques, two words can be judged similar even if they share no co-occurring word. Experiments show that the method can learn from a very small training corpus and reaches a disambiguation accuracy of 85.7% without restricting the senses of the words.
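The effect of comparing verbs over noun clusters rather than over individual nouns can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the verb-object pairs, the IS-A map, the helper names `verb_vectors` and `cosine`, and the choice of cosine similarity, which the abstract does not specify. It is not the paper's actual measure or data.

```python
from collections import Counter
from math import sqrt

# Hypothetical verb-object pairs standing in for corpus V+N collocations.
PAIRS = [
    ("drink", "tea"), ("drink", "water"),
    ("sip", "coffee"), ("sip", "juice"),
    ("read", "book"), ("read", "letter"),
]

# Hypothetical IS-A links standing in for relations taken from MRD definitions.
IS_A = {
    "tea": "beverage", "water": "beverage",
    "coffee": "beverage", "juice": "beverage",
    "book": "document", "letter": "document",
}


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def verb_vectors(pairs, noun_map=None):
    """Count each verb's objects, optionally replacing nouns by their cluster."""
    vectors = {}
    for verb, noun in pairs:
        key = noun_map.get(noun, noun) if noun_map else noun
        vectors.setdefault(verb, Counter())[key] += 1
    return vectors


by_word = verb_vectors(PAIRS)           # features are individual object nouns
by_cluster = verb_vectors(PAIRS, IS_A)  # features are IS-A clusters of nouns

direct_sim = cosine(by_word["drink"], by_word["sip"])
cluster_sim = cosine(by_cluster["drink"], by_cluster["sip"])
unrelated_sim = cosine(by_cluster["drink"], by_cluster["read"])
```

In this toy data "drink" and "sip" share no object noun, so their word-level similarity is zero, yet both take only objects from the same cluster, so their cluster-level similarity is nonzero. "drink" and "read" remain dissimilar under both views.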
Source
《中文信息学报》
CSCD
Peking University Core Journal
2004, No. 6, pp. 23-29 (7 pages)
Journal of Chinese Information Processing
Funding
Supported by the Shanxi Province Youth Foundation (20001017)
Keywords
artificial intelligence
natural language processing
machine-readable dictionary (MRD)
dual distribution
semantics
knowledge acquisition