Abstract
Short texts in user-generated content carry weak semantic features, and the traditional K-means algorithm is sensitive to the choice of initial cluster centers. To address both problems, this paper extends the features of short texts using the concepts, link structure, and category system of Wikipedia, which supplements their semantic information. A weighted complex network of the text set is then built from the semantic relations between texts. Initial cluster centers are selected according to the comprehensive characteristics of the network nodes, and K-means is used to partition the nodes into communities, which yields the short-text clusters. Experimental results show that the proposed method effectively improves short-text clustering.
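The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the "comprehensive characteristics" of a node, so the sketch assumes node strength (weighted degree) as a stand-in, and it assumes the pairwise semantic similarities are already available as a symmetric matrix. The function names and the `max_seed_sim` threshold are hypothetical. Each node is represented by its row of the similarity matrix, and ordinary K-means (Lloyd iterations) is run from the selected seeds.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def node_strength(sim: Matrix) -> List[float]:
    """Weighted degree of each node: sum of its edge weights, self-loop excluded."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]


def pick_seeds(sim: Matrix, k: int, max_seed_sim: float = 0.5) -> List[int]:
    """Greedy seed choice (assumed heuristic): visit nodes by descending
    strength and accept a node only if it is weakly similar to every seed
    chosen so far."""
    strength = node_strength(sim)
    order = sorted(range(len(sim)), key=lambda i: -strength[i])
    seeds: List[int] = []
    for i in order:
        if len(seeds) == k:
            break
        if all(sim[i][s] < max_seed_sim for s in seeds):
            seeds.append(i)
    # Fallback: if the threshold was too strict, fill with the strongest rest.
    for i in order:
        if len(seeds) == k:
            break
        if i not in seeds:
            seeds.append(i)
    return seeds


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_on_network(sim: Matrix, k: int, max_iter: int = 50) -> Tuple[List[int], List[int]]:
    """K-means over similarity-matrix rows, initialised from network seeds.
    Returns (cluster label per node, seed node indices)."""
    seeds = pick_seeds(sim, k)
    centroids = [list(sim[s]) for s in seeds]
    labels = [-1] * len(sim)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in sim
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [sim[i] for i in range(len(sim)) if labels[i] == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, seeds
```

Because the seeds are chosen deterministically from the network, repeated runs on the same similarity matrix give the same partition, which is the sensitivity that random initialisation suffers from. A richer node score (for example one that also uses clustering coefficient or betweenness) could replace `node_strength` without changing the rest of the sketch.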
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal (北大核心)
2013, No. 9, pp. 88-92 (5 pages)
New Technology of Library and Information Service