Information extraction using tree automata inference technique (采用树自动机推理技术的信息抽取方法)

Cited by: 2

Abstract: This paper proposes an information extraction technique based on an improved k-contextual tree automata inference algorithm. The core idea is to convert structured or semi-structured documents into trees, and then to use an improved k-contextual tree language, called the KLH tree language, to construct an unranked tree automaton that accepts the samples. Whether a piece of web page information is extracted is decided by whether the automaton ends in an accepting or a rejecting state. The method makes full use of the tree structure of web documents and, through tree automata, combines traditional extraction methods that rely on document structure alone with grammatical inference principles to obtain extraction rules. Experiments show that, compared with similar extraction methods, both the sample learning time and the extraction time are reduced.
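The accept/reject idea in the abstract can be illustrated with a minimal sketch. It is not the KLH algorithm, whose details the abstract does not give. As a stand-in it infers a much simpler local tree language (the set of parent/children label patterns seen in the samples) and extracts a text leaf only if the document is accepted when that leaf is marked. The names `MARK`, `TEXT`, `shape`, `forks`, `learn`, `accepts`, and `extract`, and the sample pages, are all hypothetical.

```python
import itertools

MARK = "x"      # hypothetical label that marks the field to extract
TEXT = "#text"  # hypothetical label that replaces every other text leaf

# A document tree is either a text leaf (str) or a pair (tag, [children]).


def leaves(tree):
    """Return the text leaves of a document tree in document order."""
    if isinstance(tree, str):
        return [tree]
    return [text for child in tree[1] for text in leaves(child)]


def shape(tree, target, counter=None):
    """Abstract away text: leaf number `target` becomes MARK, the rest TEXT."""
    if counter is None:
        counter = itertools.count()
    if isinstance(tree, str):
        return (MARK if next(counter) == target else TEXT, [])
    label, children = tree
    return (label, [shape(child, target, counter) for child in children])


def forks(tree):
    """Collect every (parent label, tuple of child labels) pattern."""
    label, children = tree
    found = {(label, tuple(child[0] for child in children))}
    for child in children:
        found |= forks(child)
    return found


def learn(samples):
    """Infer allowed root labels and forks from (tree, target index) pairs."""
    shaped = [shape(tree, target) for tree, target in samples]
    roots = {s[0] for s in shaped}
    allowed = set().union(*(forks(s) for s in shaped))
    return roots, allowed


def accepts(model, shaped_tree):
    """Accept a tree if its root and all of its forks were seen in training."""
    roots, allowed = model
    return shaped_tree[0] in roots and forks(shaped_tree) <= allowed


def extract(model, tree):
    """Mark each text leaf in turn; keep those for which the tree is accepted."""
    return [text for i, text in enumerate(leaves(tree))
            if accepts(model, shape(tree, i))]


# One training row whose first text leaf (index 0) is the target field.
model = learn([(("tr", [("td", [("b", ["Lamp"])]), ("td", ["12.00"])]), 0)])
page = ("tr", [("td", [("b", ["Desk"])]), ("td", ["85.50"])])
other = ("tr", [("td", ["Desk"]), ("td", ["85.50"])])
print(extract(model, page))
print(extract(model, other))
```

A local language of this kind generalizes very little. A k-contextual language constrains larger tree fragments, and the paper's KLH variant presumably refines that further, so this sketch should be read only as an illustration of extraction by acceptance.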

Authors: 谭鹏许 (Tan Pengxu), 张来顺 (Zhang Laishun)

Affiliation: Institute of Electronic Technology, PLA Information Engineering University (解放军信息工程大学电子技术学院)

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2010, Issue 16, pp. 153-156 (4 pages)

Keywords: tree automata inference algorithm; (semi-)structured documents; unranked tree automata; information extraction; KLH tree language

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
