To make use of the data available on the Internet, it must first be extracted from web pages. This paper presents HTT, a tree model for representing HTML pages, and proposes AGW, a wrapper generation algorithm based on that model. AGW uses a compare-and-correct technique that exploits the native structure of the HTT tree to generate the wrapper. The algorithm not only generates wrappers automatically, but also makes it easy to rebuild the data schema and reduces computational complexity.
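The abstract does not define the HTT model or the steps of AGW, so the following is only a generic sketch of the underlying idea: parse two pages generated from the same template into tag trees, then compare them to find the positions where the text varies (candidate data slots). The names `Node`, `TreeBuilder`, `build_tree`, and `varying_paths` are hypothetical and are not taken from the paper; the sketch also assumes well-nested HTML with explicit closing tags.

```python
from html.parser import HTMLParser


class Node:
    """One element of a simple tag tree (illustrative, not the paper's HTT)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def varying_paths(a, b, path=""):
    """Yield paths of nodes whose text differs between two same-shaped trees."""
    if a.text != b.text:
        yield path or "/"
    for i, (ca, cb) in enumerate(zip(a.children, b.children)):
        if ca.tag == cb.tag:
            yield from varying_paths(ca, cb, f"{path}/{ca.tag}[{i}]")


page1 = "<html><body><h1>Books</h1><p>Title A</p><p>10</p></body></html>"
page2 = "<html><body><h1>Books</h1><p>Title B</p><p>12</p></body></html>"
slots = list(varying_paths(build_tree(page1), build_tree(page2)))
print(slots)
```

In this toy example the heading is identical on both pages and is treated as template text, while the two `<p>` elements differ and are reported as data slots. A real wrapper generator would also have to handle trees whose shapes differ, which this sketch does not attempt.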
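Previous wrappers mainly target web pages that contain a single data block and cannot handle web pages that contain multiple information blocks, referred to as MIB (Multiple Information Block) web pages. This paper proposes a new extraction rule that combines the advantages of extraction rules based on document structure with those of extraction rules based on feature pattern matching, and that can effectively extract information from MIB web pages.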
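The abstract gives no details of the proposed rule, so the following is only a minimal sketch of how a structure-based step and a pattern-matching step might be combined on a page with several information blocks. It assumes, purely for illustration, that each block is a non-nested `<ul>` and that each record looks like `name - $price`; the names `BlockCollector`, `PRICE_ITEM`, and `extract` are hypothetical.

```python
import re
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Structure step: collect the text of each <li>, grouped by its <ul>."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one list of item strings per <ul>
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.blocks.append([])
        elif tag == "li" and self.blocks:
            self.blocks[-1].append("")
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.blocks[-1][-1] += data


# Pattern step: keep only items that look like "name - $price" (assumed format).
PRICE_ITEM = re.compile(r"^\s*(?P<name>.+?)\s*-\s*\$(?P<price>\d+(?:\.\d{2})?)\s*$")


def extract(html):
    """Return one list of (name, price) records per information block."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    result = []
    for items in collector.blocks:
        records = []
        for item in items:
            match = PRICE_ITEM.match(item)
            if match:
                records.append((match.group("name"), float(match.group("price"))))
        result.append(records)
    return result


page = (
    "<ul><li>Pen - $1.50</li><li>Ad: click here</li></ul>"
    "<p>noise</p>"
    "<ul><li>Desk - $80</li></ul>"
)
records = extract(page)
print(records)
```

The structural step keeps the two blocks separate, and the pattern step filters out items inside a block that do not look like records, such as the advertisement line.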
Funding: the National Grand Fundamental Research 973 Program of China (G1998030414)