Abstract
Clustering Web documents can effectively shrink the search space, speed up retrieval, and improve query precision. This paper proposes a clustering algorithm for Web documents. The algorithm first represents topics with the vector space model (VSM) and then represents documents in terms of those topics. Next, it treats the relation between documents and topics as transactional data, with each document as a transaction and each topic as an item, and applies an association rule mining algorithm to find frequent topic sets; the documents corresponding to each frequent topic set form an initial cluster. Finally, clusters are merged according to a threshold on the distance between clusters and split according to a threshold on the connection strength among the documents within a cluster. Experiments on real Web documents show that the algorithm is effective, that it handles the overlap inherent among document clusters, and that it has some practical value.
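The step from documents-as-transactions to initial clusters can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `frequent_topic_sets`, the parameters `min_support` and `max_size`, and the toy topic labels are all hypothetical, and a plain Apriori-style level-wise search stands in for whichever association rule mining algorithm the paper uses. The merge and split steps are not covered.

```python
from itertools import combinations


def frequent_topic_sets(doc_topics, min_support, max_size=3):
    """Map each frequent topic set to the documents that contain all of its topics.

    doc_topics: dict of document id -> set of topics (one transaction per document).
    min_support: minimum number of documents a topic set must occur in (assumed parameter).
    max_size: largest topic set considered (assumed cap, to keep the sketch small).
    """
    result = {}
    current = []
    # Level 1: single topics that occur in enough documents.
    all_topics = sorted({t for topics in doc_topics.values() for t in topics})
    for topic in all_topics:
        docs = frozenset(d for d, topics in doc_topics.items() if topic in topics)
        if len(docs) >= min_support:
            itemset = frozenset([topic])
            result[itemset] = docs
            current.append(itemset)

    size = 1
    while current and size < max_size:
        size += 1
        # Join frequent sets of the previous level into candidates one topic larger.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        current = []
        for cand in candidates:
            # Apriori pruning: every (size - 1)-subset must already be frequent.
            if any(frozenset(sub) not in result for sub in combinations(cand, size - 1)):
                continue
            docs = frozenset(d for d, topics in doc_topics.items() if cand <= topics)
            if len(docs) >= min_support:
                result[cand] = docs
                current.append(cand)
    return result


# Toy data (hypothetical): each document is a transaction, each topic an item.
doc_topics = {
    "d1": {"sports", "football"},
    "d2": {"sports", "football"},
    "d3": {"sports", "finance"},
    "d4": {"finance", "stocks"},
    "d5": {"finance", "stocks"},
}
clusters = frequent_topic_sets(doc_topics, min_support=2)
for topic_set, docs in clusters.items():
    print(sorted(topic_set), sorted(docs))
```

In this toy data, document `d3` falls into both the `sports` cluster and the `finance` cluster, which is the kind of overlap between clusters that the abstract says the algorithm can handle.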
Source
《软件学报》
EI
CSCD
Peking University Core Journals
2002, No. 3, pp. 417-423 (7 pages)
Journal of Software
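Funding
Supported by the National Natural Science Foundation of China (60173058)
Supported by the National 863 Program Youth Foundation of China (863-306-QN2000-5)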