Multi-Document Summarization Based on LSA and pLSA (cited by: 6)
Abstract: This paper proposes a multi-document summarization strategy based on latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA). First, the documents are split into paragraphs, which serve as the units for clustering. A new feature extraction method is used to construct a word-paragraph matrix. LSA then applies singular value decomposition to this matrix, which filters out unimportant information and maps the high-dimensional vector space representation to a low-dimensional representation in the latent semantic space. Next, pLSA converts the co-occurrence data into a probabilistic model. During summary generation, a centroid-based method selects the summary sentences. Experimental results show that the proposed method improves the quality of the generated summaries.
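The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not specify the new feature extraction method, how the LSA and pLSA stages are combined, or how sentences are scored, so the sketch assumes raw term counts as features, runs LSA and pLSA independently on the same matrix, and ranks whole paragraphs by cosine similarity to their centroid. All function names (`term_paragraph_matrix`, `lsa_project`, `plsa`, `rank_by_centroid`) are hypothetical.

```python
import numpy as np

def term_paragraph_matrix(paragraphs):
    """Build a word-paragraph count matrix (rows: words, columns: paragraphs).
    Raw counts are an assumption; the paper uses its own feature extraction."""
    vocab = sorted({w for p in paragraphs for w in p.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(paragraphs)))
    for j, p in enumerate(paragraphs):
        for w in p.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def lsa_project(A, k):
    """LSA step: truncated SVD, returning one k-dimensional row per paragraph."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    return (s[:k, None] * Vt[:k]).T  # shape (n_paragraphs, k)

def plsa(A, n_topics, n_iter=50, seed=0):
    """pLSA step: EM for P(w|z) and P(z|d) on the co-occurrence counts A."""
    rng = np.random.default_rng(seed)
    n_w, n_d = A.shape
    p_w_z = rng.random((n_w, n_topics))
    p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((n_topics, n_d))
    p_z_d /= p_z_d.sum(axis=0)
    for _ in range(n_iter):
        # E-step: P(z|w,d) is proportional to P(w|z) * P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]  # shape (w, z, d)
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate both distributions from expected counts
        weighted = A[:, None, :] * post
        p_w_z = weighted.sum(axis=2)
        p_w_z /= np.maximum(p_w_z.sum(axis=0), 1e-12)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= np.maximum(p_z_d.sum(axis=0), 1e-12)
    return p_w_z, p_z_d

def rank_by_centroid(vectors):
    """Centroid-based selection: order rows by cosine similarity to their mean."""
    centroid = vectors.mean(axis=0)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    sims = vectors @ centroid / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims), sims
```

Under these assumptions, calling `term_paragraph_matrix` on a list of paragraph strings, then `lsa_project(A, k)`, then `rank_by_centroid` yields paragraph indices ordered from most to least central. The columns of the `p_z_d` matrix returned by `plsa` could be passed to `rank_by_centroid` in the same way.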
Author: Yu Hui (俞辉)
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal (北大核心), 2009, No. 9, pp. 108-111 (4 pages)
Keywords: multi-document summarization; latent semantic analysis; singular value decomposition