Funding: Supported by the R&D Program of Beijing Municipal Education Commission (No. KM202310025020), the Project of Cultivation for Young Top-notch Talents of Beijing Municipal Institutions (No. BPHR202203113), and the National Natural Science Foundation of China (Nos. 62103397 and 82100265).
Abstract: Medical text representation is crucial for medical natural language processing (NLP) applications. Bidirectional encoder representations from transformers (BERT) has achieved state-of-the-art performance in general-domain text representation. However, limited by the design of its pretraining task and the frequency of knowledge occurrence, it lacks understanding of medical knowledge. To overcome these problems, we proposed a selective knowledge extraction and fusion framework to enhance medical text representation. In the knowledge extraction phase, we first designed a semantic importance evaluation metric to extract internal knowledge. We then used large language models (LLMs) to extract external knowledge from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). In the knowledge fusion phase, we used an attention mechanism and a Siamese network to integrate internal and external knowledge. By extracting knowledge through LLMs and integrating it into five different types of BERT models, we achieved significant improvements on the task of pulmonary disease text classification.
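The abstract above names an attention mechanism as one part of the knowledge fusion phase but does not give its form. As a rough illustration only, the sketch below uses generic scaled dot-product attention to weight a set of knowledge vectors against a text vector and fuse them into a single vector. The function names, the two-dimensional toy vectors, and the choice of scaled dot-product scoring are all assumptions made for this example; it is not the authors' architecture and it omits the Siamese network entirely.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, knowledge):
    """Fuse knowledge vectors into one vector, weighted by similarity to the query.

    query: list of floats (hypothetical text representation)
    knowledge: list of vectors with the same length as query
    """
    dim = len(query)
    # Scaled dot-product score between the query and each knowledge vector.
    scores = [
        sum(q * k for q, k in zip(query, vec)) / math.sqrt(dim)
        for vec in knowledge
    ]
    weights = softmax(scores)
    # Weighted sum of the knowledge vectors, one coordinate at a time.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, knowledge))
        for i in range(dim)
    ]
    return weights, fused


# Toy inputs (assumed): one text vector and two knowledge vectors.
text_vec = [1.0, 0.0]
knowledge_vecs = [[1.0, 0.0], [0.0, 1.0]]
weights, fused = attend(text_vec, knowledge_vecs)
```

In this toy case the first knowledge vector points in the same direction as the text vector, so it receives the larger weight and dominates the fused result.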
Abstract: Social media websites allow users to exchange short texts such as tweets via microblogs and user statuses in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy and bring new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. We then propose a novel framework that performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. Given its significant performance improvement, we further investigate potential factors that contribute to the improved performance.
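The enrichment idea in the abstract above, adding features from other languages by translating a short text, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the lookup table `TOY_EN_TO_ES` stands in for a real machine translation system, the `en:` and `es:` prefixes are a hypothetical way of keeping the feature spaces apart, and the matrix factorization step that the abstract describes is not shown.

```python
from collections import Counter

# Hypothetical stand-in for a machine translation system (assumption).
TOY_EN_TO_ES = {"good": "bueno", "movie": "pelicula", "great": "genial"}


def translate(tokens, table):
    """Translate the tokens found in the table; unknown tokens are dropped."""
    return [table[t] for t in tokens if t in table]


def enrich(text):
    """Build a bag-of-words that holds original and translated features."""
    tokens = text.lower().split()
    features = Counter(f"en:{t}" for t in tokens)
    features.update(f"es:{t}" for t in translate(tokens, TOY_EN_TO_ES))
    return features


features = enrich("Good movie")
```

A two-word text now yields four features instead of two. A feature-reduction step would then operate on the enlarged feature space.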
Funding: Supported by the Hi-Tech Research and Development Program of China (2009AA01Z430), the National Natural Science Foundation of China (60972077, 60821001), the National S&T Major Program (2010ZX03003-003-01), the Fundamental Research Funds for the Central Universities (BUPT2011RC0210), and the Science and Technology on Electronic Control Laboratory.
Abstract: An effective text representation scheme largely determines the performance of a text categorization system. However, because they assume that terms are independent, the traditional schemes that rely on term frequency (TF) and document frequency (DF) capture too little of a document's information and result in poor performance. To overcome this limitation, we investigate the relationships between different terms with the same class tendency and how to measure the importance of a term that is repeated within a document. In this paper, a group of novel term weighting factors is proposed to enhance the category contribution of each term. Then, based on a novel strategy for generating passages from a document, we present two schemes to improve text representation: the weighted co-contributions of different terms corresponding to the class tendency, and the weighted co-contributions of each term in different passages. The former scheme works in a dimensionality reduction mode, while the latter runs in the conventional way. Using a support vector machine (SVM) classifier, experiments on four benchmark corpora show that the proposed schemes achieve consistently better performance than the conventional methods in both efficiency and accuracy. Further analysis also suggests some promising directions for future work.
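For context, the conventional TF and DF statistics that the abstract above contrasts against are commonly combined as TF-IDF. The sketch below computes that baseline weighting on a toy corpus. It illustrates the traditional scheme only, not the proposed weighting factors or passage-based schemes; the function name `tf_idf` and the unsmoothed `log(N / df)` form are choices made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. The weight is tf * log(N / df), where tf is the
    raw count of the term in the document, df is the number of documents that
    contain the term, and N is the number of documents.
    """
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (assumed for illustration).
docs = [["cat", "sat", "cat"], ["dog", "sat"]]
w = tf_idf(docs)
```

Each term is weighted on its own here: "sat" appears in every document and gets weight zero, while nothing links related terms such as those sharing a class tendency. That independence is the limitation the abstract points to.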