Advancing the integration of artificial intelligence and polymer science requires high-quality, open-source, and large-scale datasets. However, existing polymer databases often suffer from data sparsity, a lack of polymer-property labels, and limited accessibility, which hinders systematic modeling across property prediction tasks. Here, we present OpenPoly, a curated experimental polymer database derived from extensive literature mining and manual validation, comprising 3985 unique polymer-property data points spanning 26 key properties. We further develop a multi-task benchmarking framework that evaluates property prediction using four encoding methods and eight representative models. Our results show that the optimized degree-of-polymerization encoding coupled with Morgan fingerprints achieves the best trade-off between computational cost and accuracy. Under data-scarce conditions, XGBoost outperforms deep learning models on key properties such as dielectric constant, glass transition temperature, melting point, and mechanical strength, achieving R² scores of 0.65-0.87. To further showcase the practical utility of the database, we propose candidate polymers for two energy-relevant applications: high-temperature polymer dielectrics and fuel cell membranes. By offering a consistent and accessible benchmark and database, OpenPoly paves the way for more accurate polymer-property modeling and fosters data-driven advances in polymer genome engineering.
The molecular structures of hydrocarbons in straight-run gasoline were numerically coded. A nonlinear quantitative structure-retention relationship (QSRR) between the gas chromatography (GC) retention indices of the hydrocarbons and their molecular structures was established using an error back-propagation (BP) algorithm. The GC retention indices of 150 hydrocarbons were then predicted by removing 15 compounds as a test set and using the remaining 135 molecules as a calibration set. Repeating this procedure, all compounds in the data set were predicted in groups of 15. The results obtained by BP, with a correlation coefficient of 0.9934 and a standard deviation of 16.54, are satisfactory.
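The hold-out procedure described in this abstract (150 compounds, predicted in groups of 15 while the other 135 serve as the calibration set) can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the single descriptor is invented, and an ordinary least-squares line stands in for the BP network, which the abstract does not specify in enough detail to reproduce. All function and variable names are hypothetical.

```python
import math
import random


def leave_group_out_predictions(xs, ys, group_size, fit, seed=0):
    """Predict every sample exactly once, holding out `group_size` samples at a time."""
    n = len(xs)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random assignment of samples to groups
    preds = [None] * n
    for start in range(0, n, group_size):
        test_idx = order[start:start + group_size]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        # Calibrate on the remaining samples only.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(xs[i])
    return preds


def fit_line(xs, ys):
    """Stand-in model: least-squares line (an assumption, NOT the abstract's BP network)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data: 150 "compounds", one descriptor, noisy linear response.
rng = random.Random(42)
descriptors = [rng.uniform(4.0, 12.0) for _ in range(150)]
retention = [100.0 * d + rng.gauss(0.0, 15.0) for d in descriptors]

predicted = leave_group_out_predictions(descriptors, retention, 15, fit_line)
r = pearson(predicted, retention)
residuals = [p - y for p, y in zip(predicted, retention)]
# Sample standard deviation of the prediction errors.
mean_res = sum(residuals) / len(residuals)
sd = math.sqrt(sum((e - mean_res) ** 2 for e in residuals) / (len(residuals) - 1))
print(f"correlation = {r:.4f}, residual SD = {sd:.2f}")
```

Because each compound is predicted only by a model that never saw it during calibration, the pooled correlation coefficient and standard deviation reflect predictive rather than fitted performance.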
Grain proteins are essential for human health and food security. However, accurately predicting their biological functions remains challenging because it is difficult to capture local structural details and global positional information from protein data simultaneously. To address this issue, a novel graph-based deep learning model, SPE-GTN (Structural and Positional Encoding Graph Transformer Network), is proposed for grain protein function prediction. In this model, structural and positional encodings are embedded into a protein complex graph to enable the joint extraction of local and global features. Graph Convolutional Networks (GCNs) are employed to aggregate neighborhood information, whereas Transformer mechanisms are used to model long-range dependencies among protein nodes. The proposed SPE-GTN model is evaluated on four datasets comprising proteins from wheat, soybean, maize, and indica rice, representing a diverse range of grain types. Experimental results demonstrate that SPE-GTN achieves a 13.6% improvement in prediction accuracy and a 9.4% improvement in F1-score compared with state-of-the-art methods. Theoretical analysis further validates its capacity to capture complex relationships within protein interaction networks. These findings highlight the effectiveness and generalizability of SPE-GTN in real-world grain protein function prediction tasks and provide novel insights into protein bioinformatics and agricultural genomics.
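As an illustration of what "aggregating neighborhood information" with a GCN means, the sketch below implements one propagation step in the widely used symmetric-normalized form, ReLU(D^-1/2 (A + I) D^-1/2 X W). It is a generic textbook layer, not the SPE-GTN architecture, whose details the abstract does not give. The three-node graph, the feature values, and all names are hypothetical.

```python
import math


def gcn_layer(adj, feats, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    n_in = len(feats[0])
    n_out = len(weight[0])
    # Add self-loops so each node keeps a share of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate features from each node's neighborhood.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    # Apply the linear map, then ReLU.
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]


# Hypothetical path graph 0 - 1 - 2 with two-dimensional node features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # identity weights isolate the aggregation effect

hidden = gcn_layer(adjacency, features, identity)
for node, row in enumerate(hidden):
    print(node, [round(v, 3) for v in row])
```

With identity weights, each output row is simply a degree-weighted mix of a node's own features and those of its direct neighbors, which is why a single layer captures only local structure and why the abstract pairs GCNs with a Transformer for long-range dependencies.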
Funding: Financially supported by the National Natural Science Foundation of China (Nos. 92372126, 52373203) and the Excellent Young Scientists Fund Program.
Funding: Project supported by the National Key Research and Development Program Project of China (2023YFF1103404, 2023YFF1103403), the Domestic Science and Technology Cooperation Projects of Shanghai, China (23015820700), and the Shanghai Agricultural Science and Technology Innovation Program (I2023007).