Abstract
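This paper proposes a topic-sentence extraction method based on a combined approach, with emphasis on extracting a text's topic concepts and on the associated weighting scheme. Using the relationships among concepts, synonymous concepts are semantically merged and hypernym/hyponym concepts are semantically focused, emulating the principle that expert human indexers follow when analyzing a text's topics: "cover every aspect of the subject while still giving some aspects more emphasis." When the weights of hypernym and hyponym topic concepts are adjusted, the method accounts for the reinforcement that hyponyms lend to their hypernym while ensuring that the adjustment does not alter the overall topic distribution of the text, so that the topic concepts are extracted more precisely. Several weighting measures are combined to assess how well each sentence reflects the topics. On this basis, a topic-sentence selection algorithm ties the number of extracted topic sentences to the number of topics in the text, guarantees that every major topic has a corresponding topic sentence selected, and removes duplicate topic sentences, further improving the topic coverage and generality of the extracted sentences.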
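The weight adjustment described above can be illustrated with a minimal Python sketch. It is not the paper's formula: the function name `adjust_concept_weights`, the `synonym_of` and `hypernym_of` lookup tables, and the `boost` fraction are all hypothetical, and "not altering the overall topic distribution" is interpreted here, as an assumption, to mean that the total weight stays constant.

```python
from collections import defaultdict


def adjust_concept_weights(term_weights, synonym_of, hypernym_of, boost=0.5):
    """Merge synonyms, then shift part of each hyponym's weight to its hypernym.

    term_weights: dict mapping term -> raw weight in the text.
    synonym_of:   dict mapping term -> canonical concept (hypothetical table).
    hypernym_of:  dict mapping concept -> its hypernym (hypothetical table).
    boost:        assumed fraction of a hyponym's weight moved to its hypernym.
    """
    # Semantic merging: fold every term into its canonical concept.
    merged = defaultdict(float)
    for term, weight in term_weights.items():
        merged[synonym_of.get(term, term)] += weight

    # Semantic focusing: a hyponym reinforces its hypernym, but only when
    # the hypernym itself occurs in the text. Each transfer subtracts and
    # adds the same amount, so the total weight is unchanged.
    adjusted = dict(merged)
    for concept, weight in merged.items():
        parent = hypernym_of.get(concept)
        if parent in merged and parent != concept:
            transfer = boost * weight
            adjusted[concept] -= transfer
            adjusted[parent] += transfer
    return adjusted


weights = {"car": 2.0, "automobile": 1.0, "vehicle": 1.0, "tree": 1.0}
result = adjust_concept_weights(
    weights,
    synonym_of={"automobile": "car"},
    hypernym_of={"car": "vehicle"},
)
print(result)
```

In this toy input, "automobile" is folded into "car", and half of the merged "car" weight is then moved to "vehicle", while the sum of all weights stays the same.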
An extraction method for subject sentences of text is proposed. The method is based mainly on concept equivalence and concept hierarchies, with emphasis on the synonymy and hyponymy/hypernymy relations among terms. The concepts and the relationships among them are represented and organized in a concept base designed for automatic text processing, and a prototype system was implemented. Based on concept analysis, the weights of subject terms are recalculated so that the adjusted weight more accurately reflects each term's contribution to the document topic. Sentences are ranked by a weighted combination of various metrics. Non-redundant key sentences are then extracted by the rotary choice algorithm, which trades off maximal topic coverage against minimal redundancy among key sentences. The algorithm tries to achieve the following goals simultaneously: (1) choose the most significant sentences as topic sentences for each topic; (2) remove redundant sentences from the candidate topic sentences.
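The selection step can be sketched as a round-robin pass over topics. This is only an illustration under stated assumptions, not the paper's rotary choice algorithm: the function `select_topic_sentences`, the per-sentence concept scores, the Jaccard overlap test, and the `max_overlap` threshold are all hypothetical choices made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two concept sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_topic_sentences(sentence_scores, topic_weights, max_sentences,
                           max_overlap=0.5):
    """Pick sentence indices by visiting topics in turn, heaviest first.

    sentence_scores: list of dicts, one per sentence, concept -> score.
    topic_weights:   dict mapping topic concept -> weight.
    max_overlap:     assumed redundancy threshold on concept-set overlap.
    """
    topics = sorted(topic_weights, key=topic_weights.get, reverse=True)
    chosen = []
    progress = True
    # Keep cycling through the topics until the quota is met or a full
    # cycle adds no sentence.
    while len(chosen) < max_sentences and progress:
        progress = False
        for topic in topics:
            if len(chosen) >= max_sentences:
                break
            # Unchosen sentences that mention this topic, best score first.
            candidates = sorted(
                (i for i, s in enumerate(sentence_scores)
                 if topic in s and i not in chosen),
                key=lambda i: sentence_scores[i][topic],
                reverse=True,
            )
            for i in candidates:
                concepts = set(sentence_scores[i])
                # Skip a candidate that overlaps too much with a chosen one.
                if all(jaccard(concepts, set(sentence_scores[j])) <= max_overlap
                       for j in chosen):
                    chosen.append(i)
                    progress = True
                    break
    return chosen


sentence_scores = [
    {"car": 0.9, "engine": 0.4},
    {"car": 0.8, "engine": 0.5},   # same concepts as sentence 0
    {"tree": 0.7},
    {"tree": 0.3, "forest": 0.6},
]
picked = select_topic_sentences(
    sentence_scores, {"car": 3.0, "tree": 2.0}, max_sentences=3
)
print(picked)
```

Because each cycle gives every topic one pick before any topic gets a second, each major topic is represented first, and sentence 1 is never chosen because it duplicates the concepts of sentence 0.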
Source
《上海交通大学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2006, Issue 5, pp. 771-774, 782 (5 pages)
Journal of Shanghai Jiaotong University
Funding
National High Technology Research and Development Program of China (863 Program), Grant No. 2002AA119050
Keywords
subject sentence
subject extraction
text compression