22,760 articles found
Deep Learning Multimodal for Unstructured and Semi-Structured Textual Documents Classification (Cited by: 1)
1
Authors: Nany Katamesh, Osama Abu-Elnasr, Samir Elmougy. Source: Computers, Materials & Continua (SCIE, EI), 2021, Issue 7, pp. 589-606, 18 pages
Due to the availability of a huge number of electronic text documents from a variety of sources representing unstructured and semi-structured information, the document classification task has become an interesting area for controlling data behavior. This paper presents a document classification multimodal for categorizing textual semi-structured and unstructured documents. The multimodal implements several individual deep learning models such as Deep Neural Networks (DNN), Recurrent Convolutional Neural Networks (RCNN) and Bidirectional LSTM (Bi-LSTM). A stacked-ensemble meta-model technique is used to combine the results of the individual classifiers to produce better results than those reached by any of the above-mentioned models individually. A series of textual preprocessing steps is executed to normalize the input corpus, followed by text vectorization techniques. These techniques use Term Frequency-Inverse Document Frequency (TFIDF) or Continuous Bag of Words (CBOW) to convert text data into a numeric form that deep learning models can process. The proposed model is validated using a dataset collected from several spaces with a huge number of documents in every class. The experimental results show that the proposed model achieves effective performance. For PDF document classification, the proposed model achieves accuracy up to 0.9045 and 0.959 for the TFIDF and CBOW features, respectively. For JSON document classification, it achieves accuracy up to 0.914 and 0.956 for the TFIDF and CBOW features, respectively. For XML document classification, it achieves accuracy up to 0.92 and 0.959 for the TFIDF and CBOW features, respectively.
Keywords: document classification; deep learning; text vectorization; convolutional neural network; bi-directional neural network; stacked ensemble
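The TFIDF vectorization step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, the function name `tfidf`, and the three-document toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, not data from the paper.
corpus = [
    "invoice total amount",
    "invoice shipping address",
    "shipping manifest cargo",
]
weights = tfidf(corpus)
```

A term that occurs in fewer documents ("total") receives a higher weight than one shared across documents ("invoice"), which is what makes the resulting vectors useful as classifier input.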
Construction of a Maritime Knowledge Graph Using GraphRAG for Entity and Relationship Extraction from Maritime Documents (Cited by: 1)
2
Authors: Yi Han, Tao Yang, Meng Yuan, Pinghua Hu, Chen Li. Source: Journal of Computer and Communications, 2025, Issue 2, pp. 68-93, 26 pages
In the international shipping industry, digital intelligence transformation has become essential, with both governments and enterprises actively working to integrate diverse datasets. The domain of maritime and shipping is characterized by a vast array of document types, filled with complex, large-scale, and often chaotic knowledge and relationships. Effectively managing these documents is crucial for developing a Large Language Model (LLM) in the maritime domain, enabling practitioners to access and leverage valuable information. A Knowledge Graph (KG) offers a state-of-the-art solution for enhancing knowledge retrieval, providing more accurate responses and enabling context-aware reasoning. This paper presents a framework for utilizing maritime and shipping documents to construct a knowledge graph using GraphRAG, a hybrid tool combining graph-based retrieval and generation capabilities. The extraction of entities and relationships from these documents and the KG construction process are detailed. Furthermore, the KG is integrated with an LLM to develop a Q&A system, demonstrating that the system significantly improves answer accuracy compared to traditional LLMs. Additionally, the KG construction process is up to 50% faster than conventional LLM-based approaches, underscoring the efficiency of the method. This study provides a promising approach to digital intelligence in shipping, advancing knowledge accessibility and decision-making.
Keywords: Maritime Knowledge Graph; GraphRAG; Entity and Relationship Extraction; Document Management
IEC releases 3 technical documents for electrical energy storage
3
Source: China Standardization, 2025, Issue 4, p. 14, 1 page
In the process of building a new power system dominated by new energy sources, power storage is a key supporting technology that ensures the safe and stable operation of the power grid, enables the flexible regulation of the system, and raises the level of new energy consumption. It is also key to achieving carbon peak and neutrality as well as energy transformation.
Keywords: carbon peak and neutrality; new energy sources; power storage; new energy consumption; technical documents; power grid; flexible regulation; electrical energy storage; new power system
Lossless Mapping from Semi-Structured Data to Structured Data (Cited by: 2)
4
Authors: 李文武, 金远平, 童咪娜. Source: Journal of Southeast University (English Edition) (EI, CAS), 2002, Issue 1, pp. 46-53, 8 pages
Most semi-structured data exhibit a certain structural regularity. Once stored as structured data in a relational database (RDB), they can be effectively managed by a database management system (DBMS). Some semi-structured data are difficult to transform because of their irregular structures. We design an efficient algorithm and data structure that ensure lossless transformation. We put forward an approach to schema extraction through data mining, in which different kinds of elements are transformed separately so that lossless mapping from semi-structured data to structured data can be achieved.
Keywords: semi-structured data; DTD; RDB; schema mapping; overflow data
Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering (Cited by: 4)
5
Authors: Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass. Source: Journal of Data and Information Science (CSCD), 2021, Issue 3, pp. 99-122, 24 pages
Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions regarding the establishment of scientific fields. It also supports better collaboration with governments and businesses. This study aims to investigate the development of research fields over time, translating it into a topic detection problem. Design/methodology/approach: To achieve the objectives, we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches are utilized to transform documents into vector-based representations. The proposed method is evaluated by comparing it with a combination of different embedding and clustering approaches and the classical topic modeling algorithms (i.e., LDA) against a benchmark dataset. A case study is also conducted exploring the evolution of Artificial Intelligence (AI), detecting the research topics or sub-fields in related AI publications. Findings: Evaluating the performance of the proposed method using clustering performance indicators shows that it outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, taking advantage of a keyword extraction method for cluster tagging and labeling, demonstrating the context of the topics. Research limitations: We noticed that it is not possible to generalize one solution for all downstream tasks. Hence, the solutions must be fine-tuned or optimized for each task and even each dataset. In addition, interpretation of cluster labels can be subjective and vary based on the readers' opinions. It is also very difficult to evaluate the labeling techniques, which further limits the explanation of the clusters. Practical implications: As demonstrated in the case study, we show, in a real-world example, how the proposed method would enable researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and all related organizations in fast and effective analysis of the fields, by establishing and explaining the topics. Originality/value: In this study, we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach. The effectiveness of the method has been evaluated in a case study of AI publications, where we analyze the AI topics during the past three decades.
Keywords: Dynamics of science; Science mapping; Document clustering; Artificial intelligence; Deep learning
INFORMATION RETRIEVAL FOR SHORT DOCUMENTS (Cited by: 2)
6
Authors: Qi Haoliang, Li Mu, Gao Jianfeng, Li Sheng. Source: Journal of Electronics (China), 2006, Issue 6, pp. 933-936, 4 pages
The major problem with most current approaches to information models is that individual words provide unreliable evidence about the content of the texts. When the document is short, e.g. only the abstract is available, the word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve the problem, a new approach to short document retrieval named Reference Document Model (RDM) is put forward in this letter. RDM obtains the statistical semantics of the query and the document by pseudo feedback from reference documents, applied to both the query and the document. The contributions of this model are three-fold: (1) pseudo feedback for both the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the task of ad hoc retrieval on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
Keywords: Information retrieval; Short documents; Reference Document Model (RDM)
Hadoop-Based Similarity Computation System for Composed Documents (Cited by: 1)
7
Authors: Xiaoming Zhang, Zhipeng Qin, Xuwei Liu, Qianyun Hou, Baishuang Zhang, Jie Wu. Source: Journal of Computer and Communications, 2015, Issue 5, pp. 196-202, 7 pages
Universities produce a large number of composed documents in the teaching process. Most of them must be checked for similarity as part of validation. A similarity computation system is constructed for composed documents containing images and text. First, each document is split into two parts, images and text information. Then, the documents are compared by computing the similarities of their images and text contents independently. Using the Hadoop system, the text contents are easily and quickly separated. Experimental results show that the proposed system is efficient and practical.
Keywords: Similarity computation; Composed documents; MapReduce; System integration
Automatic Table Recognition and Extraction from Heterogeneous Documents (Cited by: 1)
8
Authors: Florence Folake Babatunde, Bolanle Adefowoke Ojokoh, Samuel Adebayo Oluwadare. Source: Journal of Computer and Communications, 2015, Issue 12, pp. 100-110, 11 pages
This paper examines automatic recognition and extraction of tables from a large collection of heterogeneous documents. The heterogeneous documents are initially pre-processed and converted to HTML code, after which an algorithm recognises the table portion of the documents. A Hidden Markov Model (HMM) is then applied to the HTML code in order to extract the tables. The model was trained and tested with five hundred and twenty-six (526) self-generated tables: three hundred and twenty-one (321) tables for training and two hundred and five (205) tables for testing. The Viterbi algorithm was implemented for the testing part. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall evaluation results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is good at solving the problem of table extraction.
Keywords: Hidden Markov Model; Table Recognition and Extraction; HyperText Markup Language; Heterogeneous Documents
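The abstract above names the Viterbi algorithm as the decoding step for the HMM. The following is a generic sketch of Viterbi decoding, not the authors' implementation: the two hidden states (`text`, `table`), the tag observations (`p`, `td`), and every probability are hypothetical values chosen only to make the example runnable.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: is each HTML tag part of running text or of a table?
states = ["text", "table"]
start_p = {"text": 0.6, "table": 0.4}
trans_p = {
    "text": {"text": 0.7, "table": 0.3},
    "table": {"text": 0.2, "table": 0.8},
}
emit_p = {
    "text": {"p": 0.8, "td": 0.2},
    "table": {"p": 0.1, "td": 0.9},
}
decoded = viterbi(["p", "td", "td", "p"], states, start_p, trans_p, emit_p)
# decoded == ["text", "table", "table", "text"]
```

For long sequences the products of probabilities underflow, so a practical version would usually sum log-probabilities instead of multiplying raw values.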
Performance Assessment of Nanocellulose Hydroxypropyl Methyl Cellulose Composite on Role of Nano-CaCO₃ for the Preservation of Paper Documents (Cited by: 4)
9
Authors: Xiaochun Ma, Altaf Halim, Xiaohong Li, Huiming Fan, Shiyu Fu. Source: Paper And Biomaterials (CAS), 2022, Issue 2, pp. 1-9, 9 pages
Deacidification and self-cleaning are important for the preservation of paper documents. In this study, nano-CaCO₃ was used as a deacidification agent and stabilized by nanocellulose (CNC) and hydroxypropyl methylcellulose (HPMC) to form a uniform dispersion. A hydrophobic coating for self-cleaning purposes was then constructed by polydimethylsiloxane (PDMS) treatment followed by chemical vapor deposition (CVD) of methyltrimethoxysilane (MTMS). The pH value of the treated paper was approximately 8.20, and the static contact angle was as high as 152.29°. Compared to the untreated paper, the tensile strength of the treated paper increased by 12.6%. This treatment method endows the paper with a good deacidification effect and self-cleaning property, which are beneficial for its long-term preservation.
Keywords: paper documents; nanocellulose; self-cleaning; nano-CaCO₃; superhydrophobicity; deacidification
Stochastic Model for Multiple Classes and Subclasses Simple Documents Processing (Cited by: 1)
10
Authors: Pierre Moukeli Mbindzoukou, Arsène Roland Moukoukou, Marius Massala. Source: Intelligent Information Management, 2021, Issue 2, pp. 124-140, 17 pages
The issue of document management has been raised for a long time, especially with the appearance of office automation in the 1980s, which led to dematerialization and Electronic Document Management (EDM). In the same period, workflow management experienced significant development but became more focused on industry. However, it seems to us that document workflows have not attracted the same interest from the scientific community. Nowadays, the emergence and supremacy of the Internet in electronic exchanges are leading to a massive dematerialization of documents, which requires a conceptual reconsideration of the organizational framework for processing these documents in both public and private administrations. This problem seems open to us and deserves the interest of the scientific community. Indeed, EDM has mainly focused on the storage (referencing) and circulation (traceability) of documents. It has paid little attention to the overall behavior of the system in processing documents. The purpose of our research is to model document processing systems. In previous work, we proposed a general model and its specialization to the case of small documents (any document processed by a single person at a time during its processing life cycle), which represent 70% of documents processed by administrations, according to our study. In this contribution, we extend the model for processing small documents to the case where they are managed in a system comprising document classes organized in subclasses, which is the case for most administrations. We have observed that this model is a network of Markovian $M^{L \times K}/M^{L \times K}/1$ queues. We have analyzed the constraints of this model and deduced certain characteristics and metrics. In fine, the ultimate objective of our work is to design a document workflow management system that integrates a component for global behavior prediction.
Keywords: Document Processing; Workflow; Hierarchic Chart; Counting Processes; Stochastic Models; Waiting Lines; Markov Processes; Priority Queues; Multiple Class and Subclass Queues
Supporting B2B Business Documents in XML Web Services (Cited by: 3)
11
Author: KIM Hyoungdo. Source: Journal of Electronic Science and Technology of China, 2004, Issue 3, pp. 53-57, 73, 6 pages
While XML web services have become recognized as a solution for business-to-business transactions, many problems remain to be solved. For example, it is not easy to manipulate business documents of existing standards such as RosettaNet and UN/EDIFACT EDI, traditionally regarded as an important resource for managing B2B relationships. As a starting point for the complete implementation of B2B web services, this paper deals with how to support B2B business documents in XML web services. In the first phase, basic requirements for driving XML web services by business documents are introduced. As a solution, this paper presents how to express B2B business documents in WSDL, a core standard for XML web services. This approach facilitates the reuse of existing business documents and enhances interoperability between implemented web services. Furthermore, it suggests how to link with other conceptual modeling frameworks such as ebXML/UMM, built on a rich heritage of electronic business experience.
Keywords: business document; XML web service; ebXML
Study on Multi-Label Classification of Medical Dispute Documents (Cited by: 2)
12
Authors: Baili Zhang, Shan Zhou, Le Yang, Jianhua Lv, Mingjun Zhong. Source: Computers, Materials & Continua (SCIE, EI), 2020, Issue 12, pp. 1975-1986, 12 pages
The Internet of Medical Things (IoMT) will come to be of great importance in the mediation of medical disputes, as it is emerging as the core of intelligent medical treatment. First, IoMT can track the entire medical treatment process in order to provide detailed trace data for medical dispute resolution. Second, IoMT can infiltrate the ongoing treatment and provide timely intelligent decision support to medical staff. This support includes recommendation of similar historical cases, guidance for medical treatment, alerts about hired dispute profiteers, etc. The multi-label classification of medical dispute documents (MDDs) plays an important role as a front-end process for intelligent decision support, especially in the recommendation of similar historical cases. However, MDDs usually appear as long texts containing a large amount of redundant information, and there is a serious distribution imbalance in the dataset, which directly leads to weaker classification performance. Accordingly, in this paper, a multi-label classification method based on key sentence extraction is proposed for MDDs. The method is divided into two parts. First, an attention-based hierarchical bi-directional long short-term memory (BiLSTM) model is used to extract key sentences from documents; second, random comprehensive sampling Bagging (RCS-Bagging), an ensemble multi-label classification model, is employed to classify MDDs based on key sentence sets. This approach greatly improves the classification performance. Experiments show that the performance of the two models proposed in this paper is remarkably better than that of the baseline methods.
Keywords: Internet of Medical Things (IoMT); medical disputes; medical dispute document (MDD); multi-label classification (MLC); key sentence extraction; class imbalance
EDCMS: A Content Management System for Engineering Documents
13
Authors: Chris McMahon, Mansur Darlington, Steve Culley, Peter Wild. Source: International Journal of Automation and Computing (EI), 2007, Issue 1, pp. 56-70, 15 pages
Engineers often need to look for the right pieces of information by sifting through long engineering documents. It is a very tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways to help information users, including engineers, to access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (study of document structure), document mark-up (with EXtensible Mark-up Language (XML), HyperText Mark-up Language (HTML), and Scalable Vector Graphics (SVG)), and a facetted classification mechanism. Document content extraction is implemented via computer programming (with Java). An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that, as information providers, we can make document content more accessible to information users, including engineers. The main features of the EDCMS system are: 1) EDCMS enables users, especially engineers, to access and retrieve information at content rather than document level. In other words, it provides the right pieces of information that answer specific questions, so that engineers do not need to waste time sifting through the whole document to obtain the required piece of information. 2) Users can access engineering document content via both the data and the metadata of a document. 3) Users can access and retrieve content objects, i.e. text, images and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents, a textbook of CADCAM, and a set of project posters in the Engineering Design domain. Experimental results show that the system provides information users with a powerful solution for accessing document content.
Keywords: document content management; engineering design; decomposition schemes; document mark-up; facetted classification
Study on Documents of Campus Network
14
Authors: 万伟太, 杨林, 宋为. Source: International Journal of Mining Science and Technology (SCIE, EI), 1997, Issue 2, pp. 66-69, 4 pages
Establishing a campus network belongs to the field of system engineering and requires cooperation among departments. Standardization is the key to solving the problem, and its core is the standardization of documents. This paper therefore concentrates on the relevant problems, drawing on our campus network practice.
Keywords: campus network; engineering; document
Word Segmentation for Chinese Judicial Documents (Cited by: 1)
15
Authors: Linxia Yao, Jidong Ge, Chuanyi Li, Yuan Yao, Zhenhao Li, Jin Zeng, Bin Luo, Victor Chang. Source: 国际计算机前沿大会会议论文集, 2019, Issue 1, pp. 476-478, 3 pages
Word segmentation is an integral step in many knowledge discovery applications. However, existing word segmentation methods have problems when applied to Chinese judicial documents: (1) existing methods rely on large-scale labeled data, which is typically unavailable for judicial documents, and (2) judicial documents have their own language features and writing formats. In this paper, a word segmentation method is proposed for Chinese judicial documents. The proposed method consists of two steps: (1) automatically generating some labeled data as legal dictionaries, and (2) applying a hybrid multilayer neural network that incorporates the legal dictionaries to perform word segmentation. Experiments conducted on a dataset of Chinese judicial documents show that the proposed model achieves better results than the existing methods.
Keywords: Chinese word segmentation; knowledge discovery; judicial documents
THE EARLIEST SLAVERY DOCUMENTS FROM MESOPOTAMIA (Cited by: 1)
16
Author: Wu Yuhong (IHAC, Northeast Normal University, Changchun). Source: Journal of Ancient Civilizations, 2009, Issue 1, pp. 1-33, 33 pages
In his ice-breaking article "The Smell of the Cage" in Cuneiform Digital Library Journal 2009/4, Robert K. Englund discusses some archaic lists of slaves from Uruk III and Jemdet Nasr (Ancient Ni-ru?), about 3100-2900 B.C.
Keywords: the earliest slavery documents from Mesopotamia
A Novel Method for Transforming XML Documents to Time Series and Clustering Them Based on Delaunay Triangulation
17
Author: Narges Shafieian. Source: Applied Mathematics, 2015, Issue 6, pp. 1076-1085, 10 pages
Exchanging data in XML format has become increasingly popular and widely applied because XML documents are simple to maintain and transfer. Accelerating search within such documents therefore improves a search engine's efficiency. In this paper, we propose a technique for detecting similarity in the structure of XML documents; we then cluster the documents with the Delaunay Triangulation method. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. We can then use the Discrete Fourier Transform as a simple method to analyze these signals in the frequency domain and build similarity matrices through a distance measure, in order to group the documents into clusters. We exploit Delaunay Triangulation as a clustering method to cluster the d-dimensional points representing the XML documents. The results show significant efficiency and accuracy compared with common methods.
Keywords: XML mining; document clustering; XML clustering; schema matching; similarity measures; Delaunay triangulation; cluster
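The tag-sequence-as-time-series idea in the abstract above can be sketched as follows. This covers only the encoding and the frequency-domain comparison, not the Delaunay-triangulation clustering, and it is an illustration rather than the paper's method: the integer tag encoding, the fixed transform length `n_points`, the Euclidean distance on DFT magnitudes, and all function names are assumptions.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def tag_series(xml_text, vocab):
    """Encode the document-order sequence of tags as integer impulses."""
    root = ET.fromstring(xml_text)
    # Each previously unseen tag gets the next free id (1, 2, 3, ...).
    return [vocab.setdefault(el.tag, len(vocab) + 1) for el in root.iter()]


def dft_magnitudes(series, n_points):
    """DFT magnitudes of the series, zero-padded or truncated to n_points."""
    x = (series + [0] * n_points)[:n_points]
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n_points)
                for t in range(n_points)))
        for k in range(n_points)
    ]


def structural_distance(xml_a, xml_b, n_points=16):
    """Euclidean distance between the DFT magnitude spectra of two documents."""
    vocab = {}  # shared so the same tag maps to the same id in both documents
    mag_a = dft_magnitudes(tag_series(xml_a, vocab), n_points)
    mag_b = dft_magnitudes(tag_series(xml_b, vocab), n_points)
    return math.dist(mag_a, mag_b)


# Hypothetical documents: a and b share a structure, c does not.
doc_a = "<book><title>A</title><author>B</author></book>"
doc_b = "<book><title>X</title><author>Y</author></book>"
doc_c = "<book><title>A</title><chapter><title>T</title></chapter></book>"
same = structural_distance(doc_a, doc_b)
different = structural_distance(doc_a, doc_c)
```

Documents that differ only in text content get a distance of zero because only tags are encoded. Note that a document with more than `n_points` tags is truncated in this sketch, so `n_points` should be at least the longest tag sequence being compared.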
Automatic generation of context-aware workflow documents for ubiquitous robot services
18
Authors: Yongseong Cho, Jongsun Choi, Jaeyoung Choi. Source: Journal of Measurement Science and Instrumentation (CAS), 2012, Issue 3, pp. 260-267, 8 pages
Intelligent robots in a ubiquitous computing environment should be able to receive a variety of surrounding information and provide users with appropriate services. A developer can describe the robot services that suit users' environments by using his or her various environments, and process them through the execution engine. However, it is difficult for a developer to describe and develop robot services while knowing all the surrounding information, which is called context information. A method for describing and documenting robot services in intuitive expressions, that is, through graphical user interfaces (GUIs), would be very helpful. This paper suggests that robot service developers describe robot services using intuitive GUIs with context-awareness. Robot services that developers have made with intuitive GUIs can be automatically generated into workflow documents by using the object modeling technique (OMT). Developers can describe robot services based on the context-aware workflow language (CAWL). For testing, scenario-based robot services are described using a CAWL-based development tool, and their workflow documents are automatically generated.
Keywords: ubiquitous robot service; automatic generation; context-aware workflow document (CAWL)
Semi-structured Data Extraction and Schema Knowledge Mining
19
Authors: 陈恩红, WANG Xufa. Source: High Technology Letters (EI, CAS), 2001, Issue 1, pp. 1-5, 5 pages
A semi-structured data extraction method is proposed to obtain the useful information embedded in a group of relevant web pages and store it with OEM (Object Exchange Model). A data mining method is then adopted to discover the schema knowledge implicit in the semi-structured data. This knowledge can help users understand the information structure on the web more deeply and thoroughly. At the same time, it can also provide an effective schema for querying web information.
Keywords: semi-structured data; schema; data extraction
Mathematical Expression Extraction in Text Fields of Documents Based on HMM
20
Authors: Xuedong Tian, Ruihan Bai, Fang Yang, Jinyuan Bai, Xinfu Li. Source: Journal of Computer and Communications, 2017, Issue 14, pp. 1-13, 13 pages
To address the problem that mathematical expressions in unstructured text fields of documents are hard to extract automatically, rapidly and effectively, a method based on the Hidden Markov Model (HMM) is proposed. First, the HMM is trained using the symbol combination features of mathematical expressions. Then, preprocessing such as removing labels and filtering words is carried out. Finally, the preprocessed text is converted into an observation sequence that serves as the input of the HMM, which determines which parts are mathematical expressions and extracts them. The experimental results show that the proposed method can effectively extract mathematical expressions from the text fields of documents and achieves relatively high accuracy and recall.
Keywords: mathematical expression extraction; Hidden Markov Model; text fields; documents; symbol combination features