Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss abo...Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss about multi-document summarization that differs from the single one in which the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Since the number and variety of online medical news make them difficult for experts in the medical field to read all of the medical news, an automatic multi-document summarization can be useful for easy study of information on the web. Hence we propose a new approach based on machine learning meta-learner algorithm called AdaBoost that is used for summarization. We treat a document as a set of sentences, and the learning algorithm must learn to classify as positive or negative examples of sentences based on the score of the sentences. For this learning task, we apply AdaBoost meta-learning algorithm where a C4.5 decision tree has been chosen as the base learner. In our experiment, we use 450 pieces of news that are downloaded from different medical websites. Then we compare our results with some existing approaches.展开更多
Independent XML storage based on XSD (XML Schema Document) is adopted in NXD(Native XML Data base), XMI. storage structure based on tree-structure disassemble and the algorithm used in dynamically updating XML doc...Independent XML storage based on XSD (XML Schema Document) is adopted in NXD(Native XML Data base), XMI. storage structure based on tree-structure disassemble and the algorithm used in dynamically updating XML document are provided in this paper. The main idea is that in term of data model of XML document, XML document is parsed to Document Structure-Tree with Hierarchical Model and Leaf-Data with Relation Model for storage. Simultaneously Proxy node is imported in order to solve the problem that XML data store in cross-blocks. And with XSD model information, sparse index is constructed to save storage space. It is proved that this storage structure could improve efficiency of XML document operation.展开更多
In the XML community, exact queries allow users to specify exactly what they want to check and/or retrieve in an XML document. When they are applied to a semi-structured document or to a document with an overly comple...In the XML community, exact queries allow users to specify exactly what they want to check and/or retrieve in an XML document. When they are applied to a semi-structured document or to a document with an overly complex model, the lack or the ignorance of the explicit document model (DTD—Document Type Definition, Schema, etc.) increases the risk of obtaining an empty result set when the query is too specific, or, too large result set when it is too vague (e.g. it contains wildcards such as “*”). The reason is that in both cases, users write queries according to the document model they have in mind;this can be very far from the one that can actually be extracted from the document. Opposed to exact queries, preference queries are more flexible and can be relaxed to expand the search space during their evaluations. Indeed, during their evaluation, certain constraints (the preferences they contain) can be relaxed if necessary to avoid precisely empty results;moreover, the returned answers can be filtered to retain only the best ones. This paper presents an algorithm for evaluating such queries inspired by the TreeMatch algorithm proposed by Yao et al. for exact queries. In the proposed algorithm, the best answers are obtained by using an adaptation of the Skyline operator (defined in relational databases) in the context of documents (trees) to incrementally filter into the partial solutions set, those which satisfy the maximum of preferential constraints. The only restriction imposed on documents is No-Self-Containment.展开更多
随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及...随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及文本语义相似度难以度量的问题。提出一种改进的文本相似度计算方法,从大量的特征空间中选择出具有代表性的元数据特征向量元素,以降低向量空间的维度;构建领域概念树并设计基于领域概念树的文本相似度算法,对领域概念中广泛存在的同义词进行处理,以提高文本之间语义相似度度量的性能。实验结果表明:通过降维和概念相似度计算可提高文本相似度计算的性能。展开更多
文摘Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss about multi-document summarization that differs from the single one in which the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Since the number and variety of online medical news make them difficult for experts in the medical field to read all of the medical news, an automatic multi-document summarization can be useful for easy study of information on the web. Hence we propose a new approach based on machine learning meta-learner algorithm called AdaBoost that is used for summarization. We treat a document as a set of sentences, and the learning algorithm must learn to classify as positive or negative examples of sentences based on the score of the sentences. For this learning task, we apply AdaBoost meta-learning algorithm where a C4.5 decision tree has been chosen as the base learner. In our experiment, we use 450 pieces of news that are downloaded from different medical websites. Then we compare our results with some existing approaches.
基金Supported by the National Natural Science Foun-dation of China (60073045)
文摘Independent XML storage based on XSD (XML Schema Document) is adopted in NXD(Native XML Data base), XMI. storage structure based on tree-structure disassemble and the algorithm used in dynamically updating XML document are provided in this paper. The main idea is that in term of data model of XML document, XML document is parsed to Document Structure-Tree with Hierarchical Model and Leaf-Data with Relation Model for storage. Simultaneously Proxy node is imported in order to solve the problem that XML data store in cross-blocks. And with XSD model information, sparse index is constructed to save storage space. It is proved that this storage structure could improve efficiency of XML document operation.
文摘In the XML community, exact queries allow users to specify exactly what they want to check and/or retrieve in an XML document. When they are applied to a semi-structured document or to a document with an overly complex model, the lack or the ignorance of the explicit document model (DTD—Document Type Definition, Schema, etc.) increases the risk of obtaining an empty result set when the query is too specific, or, too large result set when it is too vague (e.g. it contains wildcards such as “*”). The reason is that in both cases, users write queries according to the document model they have in mind;this can be very far from the one that can actually be extracted from the document. Opposed to exact queries, preference queries are more flexible and can be relaxed to expand the search space during their evaluations. Indeed, during their evaluation, certain constraints (the preferences they contain) can be relaxed if necessary to avoid precisely empty results;moreover, the returned answers can be filtered to retain only the best ones. This paper presents an algorithm for evaluating such queries inspired by the TreeMatch algorithm proposed by Yao et al. for exact queries. In the proposed algorithm, the best answers are obtained by using an adaptation of the Skyline operator (defined in relational databases) in the context of documents (trees) to incrementally filter into the partial solutions set, those which satisfy the maximum of preferential constraints. The only restriction imposed on documents is No-Self-Containment.
文摘随着网络与信息技术的快速发展,导致网络上产生了大量的电子文本,而文本间的相似度计算是文本处理的一种重要手段。对于大规模的文本集,通常采用向量空间模型(vector space model,VSM)进行文本表示,但是该方法面临着文本向量维度较高及文本语义相似度难以度量的问题。提出一种改进的文本相似度计算方法,从大量的特征空间中选择出具有代表性的元数据特征向量元素,以降低向量空间的维度;构建领域概念树并设计基于领域概念树的文本相似度算法,对领域概念中广泛存在的同义词进行处理,以提高文本之间语义相似度度量的性能。实验结果表明:通过降维和概念相似度计算可提高文本相似度计算的性能。