Abstract
XML, a semi-structured language, is widely used to convert unstructured information into structured information because its tags can be defined in advance. In this work, complex unstructured data from the web is converted into semi-structured XML data using POI, and the XML files are then parsed with SAX to produce structured data, so that users can easily query the information they need. In addition, the parsing efficiency of SAX and DOM is compared experimentally, which the authors present as the first such comparison. The experiments show that, for XML files of the same size, SAX parses more efficiently than DOM, and that this gap widens as the XML file grows.
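The SAX versus DOM comparison described above can be illustrated with a minimal sketch. It uses Python's standard-library `xml.sax` and `xml.dom.minidom` modules and is not the authors' implementation or benchmark. The synthetic `<records>` document, the `<record>` element name, and the helper names (`make_xml`, `count_with_sax`, `count_with_dom`, `time_call`) are all assumptions introduced for illustration.

```python
import time
import xml.sax
from xml.dom import minidom


def make_xml(n_records: int) -> bytes:
    """Build a synthetic XML document (hypothetical schema) with n_records <record> elements."""
    rows = "".join(
        f"<record><id>{i}</id><name>item{i}</name></record>" for i in range(n_records)
    )
    return f"<records>{rows}</records>".encode("utf-8")


class RecordCounter(xml.sax.ContentHandler):
    """SAX handler: counts <record> start tags as parse events stream past."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1


def count_with_sax(data: bytes) -> int:
    """Event-driven parse: no tree is kept in memory."""
    handler = RecordCounter()
    xml.sax.parseString(data, handler)
    return handler.count


def count_with_dom(data: bytes) -> int:
    """Tree-based parse: the whole document is loaded before it is queried."""
    doc = minidom.parseString(data)
    return len(doc.getElementsByTagName("record"))


def time_call(fn, data):
    """Return (result, elapsed seconds) for a single call of fn(data)."""
    start = time.perf_counter()
    result = fn(data)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Document sizes are arbitrary; timings depend on the machine and parser build.
    for n in (1_000, 5_000, 20_000):
        data = make_xml(n)
        _, sax_t = time_call(count_with_sax, data)
        _, dom_t = time_call(count_with_dom, data)
        print(f"{n} records: SAX {sax_t:.4f}s, DOM {dom_t:.4f}s")
```

Both functions return the same record count, so any timing difference reflects the parsing model rather than the work done on the data. A single run per size is only indicative; repeated runs would be needed for a careful measurement.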
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2017, Issue B11, pp. 414-417 (4 pages)
Funding
Supported by the Key Project of the Hubei Provincial Statistical Research Program (HB131-32)
Keywords
Big data
Unstructured data
Extensible markup language
Document parsing technology