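Research on a Document Clustering Method Based on XML Schema Elements

(Original title: 基于模式元素的文档聚类方法研究; published English title: A Research on Clustering Method Based on Element of XML Schema)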

Abstract: The key to clustering is grouping similar things together, so computing similarity is the first problem to solve in document clustering. An XML schema expresses the structure of XML documents, so XML documents can be clustered by clustering their schemas. This paper proposes a document clustering method based on XML schema elements: because elements are the main body of XML, documents are clustered by computing the similarity between schema elements. The approach takes into account both the structural and the semantic information of the elements in a schema, which makes the similarity calculation more precise and the clustering more accurate. It also makes it easy to extract a common XML schema for each cluster.
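
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes: score element pairs on both name (semantic) and position (structural) similarity, aggregate those scores into a schema-level similarity, and group schemas whose similarity passes a threshold. Everything here is an assumption made for illustration: the schema representation (element name mapped to its tuple of ancestor names), the weight `alpha`, the threshold, the use of `difflib.SequenceMatcher` as a stand-in for a real semantic measure, and the greedy grouping step. None of it is taken from the paper.

```python
from difflib import SequenceMatcher

# Hypothetical representation: a schema is a non-empty dict that maps each
# element name to the tuple of its ancestor element names (root first).


def name_similarity(a, b):
    """Stand-in for semantic similarity: string similarity of element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def path_similarity(p, q):
    """Structural similarity: how well the two ancestor paths align."""
    return SequenceMatcher(None, p, q).ratio()


def element_similarity(e1, e2, alpha=0.5):
    """Weighted mix of name and path similarity (alpha is an assumed weight)."""
    (n1, p1), (n2, p2) = e1, e2
    return alpha * name_similarity(n1, n2) + (1 - alpha) * path_similarity(p1, p2)


def schema_similarity(s1, s2, alpha=0.5):
    """Average best-match element similarity, taken in both directions."""
    def directed(src, dst):
        best = [max(element_similarity(e, f, alpha) for f in dst.items())
                for e in src.items()]
        return sum(best) / len(best)
    return (directed(s1, s2) + directed(s2, s1)) / 2


def cluster_schemas(schemas, threshold=0.6):
    """Greedy grouping: join the first cluster with a similar enough member."""
    clusters = []
    for i, schema in enumerate(schemas):
        for members in clusters:
            if any(schema_similarity(schema, schemas[j]) >= threshold
                   for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy schemas, invented for this example.
book_a = {"book": (), "title": ("book",), "author": ("book",), "price": ("book",)}
book_b = {"book": (), "title": ("book",), "writer": ("book",), "isbn": ("book",)}
order = {"order": (), "item": ("order",), "quantity": ("order", "item"),
         "customer": ("order",)}

clusters = cluster_schemas([book_a, book_b, order], threshold=0.6)
print(clusters)  # lists of indices into the input list
```

In this toy setup the two book schemas share most element names and paths, so they score much higher against each other than against the order schema. A real implementation would replace `name_similarity` with a proper lexical or ontology-based measure, and would likely use a standard clustering algorithm instead of the greedy pass shown here.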

Authors: Sun Xia (孙霞), Zhang Yusheng (张玉生)

Affiliation: School of Computer Science and Engineering, Changshu Institute of Technology

Source: Journal of Changshu Institute of Technology (《常熟理工学院学报》), 2012, No. 8, pp. 94-98 (5 pages)

Keywords: element; schema; similarity; clustering

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
