Abstract
Term recognition is widely used in ontology construction, dictionary construction, and related fields, and term weighting is a key step in term recognition. This paper improves the TF-IDF formula by treating the length of a term as one of the weighting factors, while also taking into account the term's domain relevance within the document set. The whole process is implemented on the MapReduce programming model, and the weights of candidate domain terms are computed in a distributed manner on a Hadoop cloud platform. Experimental results show that the proposed method not only simplifies the steps of term weighting but also improves the execution efficiency of the algorithm.
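The abstract does not state the improved formula, so the following is only a minimal single-machine sketch of the general idea: a map step that emits term occurrences, a reduce step that sums them, and a TF-IDF weight scaled by a length factor. The function names, the smoothed IDF, and the length factor `log2(1 + number of words)` are all assumptions made for illustration, not the paper's definitions. The domain-relevance factor is left out because the abstract does not define it.

```python
import math
from collections import defaultdict


def map_phase(doc_id, terms):
    """Map step: emit ((term, doc_id), 1) for each candidate term occurrence."""
    for term in terms:
        yield (term, doc_id), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (term, doc_id) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def weight_terms(docs):
    """Return a weight for every (term, doc_id) pair.

    docs maps a document id to its list of candidate terms, where each
    term is a tuple of words. All formula choices below are assumptions.
    """
    pairs = [kv for doc_id, terms in docs.items()
             for kv in map_phase(doc_id, terms)]
    tf = reduce_counts(pairs)

    # Number of candidate-term occurrences per document, used to normalise TF.
    doc_len = {doc_id: len(terms) for doc_id, terms in docs.items()}

    # Document frequency: in how many documents each term appears.
    df = defaultdict(int)
    for term, _doc_id in tf:
        df[term] += 1

    n_docs = len(docs)
    weights = {}
    for (term, doc_id), count in tf.items():
        tf_norm = count / doc_len[doc_id]
        # Smoothed IDF (assumed form, not taken from the paper).
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        # Assumed length factor: longer multi-word terms get more weight.
        length_factor = math.log2(1 + len(term))
        weights[(term, doc_id)] = tf_norm * idf * length_factor
    return weights
```

In an actual Hadoop job the map and reduce functions would run on separate nodes and the document-frequency pass would typically be a second job; here everything runs in one process so that the arithmetic is easy to follow.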
Source
《电信科学》
Peking University Core Journal (北大核心)
2011, Issue 11, pp. 62-65 (4 pages)
Telecommunications Science
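Funding
National Natural Science Foundation of China (No. 60872133)
Beijing Natural Science Foundation (No. 4092015)
Science and Technology Development Program of the Beijing Municipal Education Commission (No. KM201010772023)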