Funding: Supported by Agridata, a sub-program of the National Science and Technology Infrastructure Program (Grant No. 2005DKA31800).
Abstract: Feature representation is one of the key issues in data clustering. Existing feature representations of scientific data are insufficient, which to some extent degrades the results of scientific data clustering. The paper therefore proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method uses different feature weighting algorithms to represent candidate features from two types of data sources, then combines and finally strengthens the two feature sets. Experiments show that the method is more effective than traditional feature representation methods and can significantly improve the performance of biomedical data clustering.
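The weight, combine, and strengthen steps described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: the two sources are taken to be a record's own description and related literature text, the two weighting schemes are smoothed TF-IDF and plain term frequency, and the names `alpha` and `boost` are hypothetical parameters. The abstract does not say which algorithms or combination rule the authors actually use.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return weights


def term_frequency(docs):
    """Return one {term: weight} dict per document using normalized counts."""
    weights = []
    for doc in docs:
        tokens = doc.lower().split()
        total = len(tokens) or 1
        weights.append({t: c / total for t, c in Counter(tokens).items()})
    return weights


def combine(primary, secondary, alpha=0.6, boost=1.5):
    """Merge two weight dicts; terms present in both sources are strengthened.

    alpha and boost are hypothetical parameters chosen for illustration.
    """
    merged = {}
    for term in set(primary) | set(secondary):
        weight = alpha * primary.get(term, 0.0) + (1 - alpha) * secondary.get(term, 0.0)
        if term in primary and term in secondary:
            weight *= boost  # strengthen features confirmed by both sources
        merged[term] = weight
    return merged


# Hypothetical records: a data description and related literature text.
descriptions = ["gene expression liver tissue", "protein structure kinase binding"]
literature = ["liver gene regulation study", "kinase inhibitor binding assay"]

ctd_features = [
    combine(d, l) for d, l in zip(tfidf(descriptions), term_frequency(literature))
]
```

In this toy setup, a term such as `gene`, which appears in both sources for the first record, ends up with a higher weight than `tissue`, which appears in only one. The resulting per-record dictionaries could then be vectorized and passed to any clustering algorithm.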
Funding: Support from the Schmidt DataX Fund at Princeton University, made possible through a major gift from the Schmidt Futures Foundation. Adji Bousso Dieng acknowledges support from the National Science Foundation, Office of Advanced Cyberinfrastructure (OAC) #2118201, and from the Schmidt Futures AI2050 Early Career Fellowship.
Abstract: The prediction of crystal properties plays a crucial role in materials science and applications. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). However, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. In this paper, we develop and make public a benchmark dataset (TextEdge) that contains crystal text descriptions with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of large language models (LLMs) to predict properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based methods by approximately 8% on predicting band gap, 3% on classifying whether the band gap is direct or indirect, and 65% on predicting unit cell volume, and yields comparable performance on predicting formation energy per atom, energy per atom, and energy above hull. LLM-Prop also outperforms the fine-tuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. We further fine-tune the LLM-Prop model directly on CIF files and on condensed structure information generated by Robocrystallographer, and find that LLM-Prop fine-tuned on text descriptions performs better on average. Our empirical results highlight the importance of natural language input to LLMs for accurately predicting crystal properties, and the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction.
Abstract: With the ever-increasing number of natural disaster warning documents in document databases, the document database is becoming an economical and efficient way for enterprise staff to learn and understand the contents of natural disaster warnings by searching for the necessary text documents. Generally, a document database recommends a large number of documents by analyzing the precisely typed keywords that staff members enter. In practice, these recommended documents place a heavy burden on enterprise staff, who must study and select among them despite having little background knowledge about natural disaster warnings. As a result, staff often fail to retrieve and select the documents appropriate to their goals. Considering these drawbacks, in this paper we propose a fuzzy keywords-driven Natural Disasters Warning Documents retrieval approach (named NDWDkeyword). Through text description mining of documents and fuzzy keyword search technology, the retrieval approach can precisely capture the target requirements of enterprise staff and then return the necessary documents to them. Finally, a case study is run to explain our retrieval approach step by step and to demonstrate the effectiveness and feasibility of our proposal.
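The general idea of fuzzy keyword retrieval over document text descriptions can be sketched as follows. This is only an illustration under stated assumptions and not the NDWDkeyword algorithm, whose details the abstract does not give. It uses approximate string matching from Python's standard `difflib` module. The function name `fuzzy_search`, the `cutoff` value, and the sample documents are all hypothetical.

```python
import difflib


def fuzzy_search(query, documents, cutoff=0.7):
    """Rank documents by how many query keywords approximately match a word
    in their text description. cutoff is a hypothetical similarity threshold."""
    results = []
    for doc_id, description in documents.items():
        vocabulary = sorted(set(description.lower().split()))
        hits = 0
        for keyword in query.lower().split():
            # get_close_matches tolerates misspelled or imprecise keywords.
            if difflib.get_close_matches(keyword, vocabulary, n=1, cutoff=cutoff):
                hits += 1
        if hits:
            results.append((doc_id, hits))
    # Most matching keywords first; ties broken by document id.
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical text descriptions of warning documents.
documents = {
    "doc-1": "typhoon warning coastal evacuation procedures",
    "doc-2": "earthquake emergency response checklist",
    "doc-3": "flood warning river level monitoring",
}

# A query with imprecisely typed keywords.
matches = fuzzy_search("tyfoon warnng", documents)
```

Here the misspelled query still ranks the typhoon document first, because both keywords find a close match in its description, while documents with no close match are left out. A real system would likely add synonym handling and description mining on top of this string-level matching.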