Abstract: The emergence of Medical Large Language Models (Med-LLMs) has significantly transformed healthcare. Med-LLMs serve as transformative tools that enhance clinical practice through applications in decision support, documentation, and diagnostics. This evaluation examines the performance of leading Med-LLMs, including GPT-4Med, Med-PaLM, MEDITRON, PubMedGPT, and MedAlpaca, across diverse medical datasets. It provides graphical comparisons of their effectiveness in distinct healthcare domains. The study introduces a domain-specific categorization system that aligns these models with optimal applications in clinical decision-making, documentation, drug discovery, research, patient interaction, and public health. The paper addresses deployment challenges of Med-LLMs, emphasizing trustworthiness and explainability as essential requirements for healthcare AI. It presents current evaluation techniques that improve model transparency in high-stakes medical contexts and analyzes regulatory frameworks using benchmarking datasets such as MedQA, MedMCQA, PubMedQA, and MIMIC. By identifying ongoing challenges in bias mitigation, reliability, and ethical compliance, this work serves as a resource for selecting appropriate Med-LLMs and outlines future directions in the field. This analysis offers a roadmap for developing Med-LLMs that balance technological innovation with the trust and transparency required for clinical integration, a perspective often overlooked in existing literature.
Funding: Financially supported by the National Natural Science Foundation of China (Nos. 92372126, 52373203) and the Excellent Young Scientists Fund Program.
Abstract: Advancing the integration of artificial intelligence and polymer science requires high-quality, open-source, and large-scale datasets. However, existing polymer databases often suffer from data sparsity, a lack of polymer-property labels, and limited accessibility, hindering systematic modeling across property prediction tasks. Here, we present OpenPoly, a curated experimental polymer database derived from extensive literature mining and manual validation, comprising 3985 unique polymer-property data points spanning 26 key properties. We further develop a multi-task benchmarking framework that evaluates property prediction using four encoding methods and eight representative models. Our results highlight that the optimized degree-of-polymerization encoding coupled with Morgan fingerprints achieves an optimal trade-off between computational cost and accuracy. Under data-scarce conditions, XGBoost outperforms deep learning models on key properties such as dielectric constant, glass transition temperature, melting point, and mechanical strength, achieving R² scores of 0.65-0.87. To further showcase the practical utility of the database, we propose potential polymers for two energy-relevant applications: high-temperature polymer dielectrics and fuel cell membranes. By offering a consistent and accessible benchmark and database, OpenPoly paves the way for more accurate polymer-property modeling and fosters data-driven advances in polymer genome engineering.
Funding: Supported by the National Natural Science Foundation of China (No. 12347126).
Abstract: Prompt fission neutron spectra (PFNS) play a significant role in nuclear science and technology. In this study, the PFNS for ²³⁹Pu are evaluated using both differential and integral experimental data. A method that leverages integral criticality benchmark experiments to constrain the PFNS data is introduced. The measured central values of the PFNS are perturbed by constructing a covariance matrix. The PFNS are sampled using two types of covariance matrices, either generated with an assumed correlation matrix and incorporating experimental uncertainties or derived directly from experimental reports. The joint Monte Carlo transport code is employed to perform transport simulations on five criticality benchmark assemblies using the perturbed PFNS data. Extensive simulations result in an optimized PFNS that shows improved agreement with the integral criticality benchmark experiments. This study introduces a novel approach for optimizing differential experimental data through integral experiments, particularly when a covariance matrix is not provided.
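The covariance-based perturbation summarized in the abstract above can be illustrated with a minimal sketch. Several details are assumptions that the abstract does not state: the correlation between energy bins is taken to decay exponentially with the distance between bin indices, the uncertainties are a flat 3 % of each central value, and samples are drawn from a multivariate normal distribution through a Cholesky factor. All function names and numbers are hypothetical and chosen only for illustration.

```python
import math
import random


def build_covariance(sigmas, corr_length):
    """Covariance C[i][j] = s_i * s_j * rho_ij, with an ASSUMED exponential correlation."""
    n = len(sigmas)
    return [
        [sigmas[i] * sigmas[j] * math.exp(-abs(i - j) / corr_length) for j in range(n)]
        for i in range(n)
    ]


def cholesky(matrix):
    """Return lower-triangular L with L * L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_spectrum(central, lower, rng):
    """Draw one perturbed spectrum: central + L z, where z is standard normal."""
    n = len(central)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [central[i] + sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


# Hypothetical central values and 3 % relative uncertainties (illustrative only).
central = [0.10, 0.30, 0.35, 0.20, 0.05]
sigmas = [0.03 * value for value in central]
covariance = build_covariance(sigmas, corr_length=2.0)
lower = cholesky(covariance)
rng = random.Random(0)  # fixed seed so the draw is reproducible
samples = [sample_spectrum(central, lower, rng) for _ in range(1000)]
```

In a workflow like the one described, each sampled spectrum would then be passed to a transport calculation and scored against the benchmark results; that step depends on the transport code and is not sketched here.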
Abstract: Most material distribution-based topology optimization methods work on a relaxed form of the optimization problem and then push the solution toward the binary limits. However, when benchmarking these methods, researchers use known solutions to only a single form of the benchmark problem. This paper proposes a comparison platform for systematic benchmarking of topology optimization methods using both binary and relaxed forms. A greyness measure is implemented to evaluate how far a solution is from the desired binary form. The well-known Zhou-Rozvany (ZR) problem is selected as the benchmarking problem here, making use of available global solutions for both its relaxed and binary forms. The recently developed non-penalization Smooth-edged Material Distribution for Optimizing Topology (SEMDOT), the well-established Solid Isotropic Material with Penalization (SIMP), and continuation methods are studied on this platform. Interestingly, in most cases, the grayscale solutions obtained by SEMDOT perform better on the ZR problem than those obtained by SIMP. The reasons are investigated and attributed to the use of two different regularization techniques, namely, the Heaviside smooth function in SEMDOT and the power-law penalty in SIMP. More importantly, a simple-to-use benchmarking graph is proposed for evaluating newly developed topology optimization methods.
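The abstract above does not define its greyness measure. As a hedged illustration, the sketch below uses one widely used choice, the measure of non-discreteness (the mean of 4ρ(1−ρ) over all element densities ρ), which may differ from the definition adopted in the paper. The function name and the sample designs are hypothetical.

```python
def greyness(densities):
    """Mean of 4*rho*(1-rho): 0 for a fully binary design, 1 when every density is 0.5."""
    if not densities:
        raise ValueError("densities must not be empty")
    if any(rho < 0.0 or rho > 1.0 for rho in densities):
        raise ValueError("densities must lie in [0, 1]")
    return sum(4.0 * rho * (1.0 - rho) for rho in densities) / len(densities)


# Hypothetical element densities (illustrative only).
binary_design = [0.0, 1.0, 1.0, 0.0]   # every element is void or solid
relaxed_design = [0.5, 0.5, 0.5, 0.5]  # every element is intermediate
mixed_design = [0.0, 1.0, 0.5, 0.5]    # half binary, half intermediate
```

A measure of this kind gives a single number for comparing how close solutions from different methods are to a binary layout.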
Funding: Partially supported by the National Institutes of Health’s National Center for Complementary and Integrative Health under grant number R01AT009457, the National Institute on Aging under grant number R01AG078154, and the National Cancer Institute under grant number R01CA287413.
Abstract: Background: Large language models (LLMs) have shown promise in educational applications, but their performance on high-stakes admissions tests, such as the Dental Admission Test (DAT), remains unclear. Understanding the capabilities and limitations of these models is critical for determining their suitability in test preparation. Methods: This study evaluated the ability of 16 LLMs, including general-purpose models (e.g., GPT-3.5, GPT-4, GPT-4o, GPT-o1, Google’s Bard, mistral-large, and Claude), domain-specific fine-tuned models (e.g., DentalGPT, MedGPT, and BioGPT), and open-source models (e.g., Llama2-7B, Llama2-13B, Llama2-70B, Llama3-8B, and Llama3-70B), to answer questions from a sample DAT. Quantitative analysis was performed to assess model accuracy in different sections, and qualitative thematic analysis by subject matter experts examined specific challenges encountered by the models. Results: GPT-4o and GPT-o1 outperformed others in text-based questions assessing knowledge and comprehension, with GPT-o1 achieving perfect scores in the natural sciences (NS) and reading comprehension (RC) sections. Open-source models such as Llama3-70B also performed competitively in RC tasks. However, all models, including GPT-4o, struggled substantially with perceptual ability (PA) items, highlighting a persistent limitation in handling image-based tasks requiring visual-spatial reasoning. Fine-tuned medical models (e.g., DentalGPT, MedGPT, and BioGPT) demonstrated moderate success in text-based tasks but underperformed in areas requiring critical thinking and reasoning. Thematic analysis identified key challenges, including difficulties with stepwise problem-solving, transferring knowledge, comprehending intricate questions, and hallucinations, particularly on advanced items. Conclusions: While LLMs show potential for reinforcing factual knowledge and supporting learners, their limitations in handling higher-order cognitive tasks and image-based reasoning underscore the need for judicious integration with instructor-led guidance and targeted practice. This study provides valuable insights into the capabilities and limitations of current LLMs in preparing prospective dental students and highlights pathways for future innovations to improve performance across all cognitive skills assessed by the DAT.
Funding: Project supported by the Fundamental Research Funds for the Central Universities (No. 2042025kf0052).
Abstract: In molybdenum chemistry, the oxidative addition of o-quinone or 1,2-dicarbonyl compounds to molybdenum has been widely used in Mo-catalyzed C–C bond construction. The carbonyl oxidative addition to Mo(0) or Mo(II) is the critical elementary reaction of molybdenum catalysis. However, the relevant density functional theory (DFT) studies are relatively scarce, especially regarding the rational selection of functionals. In this work, 14 functionals were employed to investigate the Mo-catalyzed carbonyl oxidative addition step. A benchmark study was carried out to evaluate their performance in structure optimization and energy calculation. Analyses of mean absolute error (MAE) and mean squared error (MSE) indicated that the B3LYP-D3(BJ), TPSSh, and ωB97X-D functionals exhibited superior performance in structure optimization. Using the DLPNO-CCSD(T) method as the reference, the M06, M06-L, and MN15-L functionals exhibited good performance for energy calculation based on the structures optimized using the B3LYP-D3(BJ) functional. In particular, MN15-L provided the best performance with the smallest MAE and MSE.
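The two error metrics named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' workflow: the function names, the candidate labels, and every number are hypothetical placeholders, and the units are left unspecified.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over all data points."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def mean_squared_error(predicted, reference):
    """Average of (predicted - reference)^2 over all data points."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical reference values and two hypothetical candidates (illustrative only).
reference = [10.0, 12.0, 8.0]
candidates = {
    "functional_a": [11.0, 11.0, 9.0],
    "functional_b": [10.5, 12.5, 8.0],
}

# Rank the candidates by MAE against the reference; the smallest error wins.
best = min(candidates, key=lambda name: mean_absolute_error(candidates[name], reference))
```

MSE weights large deviations more heavily than MAE does, so reporting both shows whether a candidate's error is spread evenly or dominated by a few outliers.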
Funding: Guangzhou Science and Technology Program, Grant/Award Numbers: 2025B03J0110, 2024A03J1074, 2024A03J0927.
Abstract: Large language models (LLMs) show considerable potential to revolutionize healthcare through their performance across diverse clinical applications. Given the inherent constraints of LLMs and the critical nature of medical practice, a rigorous and systematic evaluation of their medical competence is imperative. This study presents a comprehensive review of the established methodologies and benchmarks for evaluating the medical competence of LLMs, encompassing a thorough analysis of current assessment practices across medical knowledge, clinical practice competence, and ethical-safety considerations. By integrating clinician competency assessment frameworks into LLM evaluation, we propose a structured tri-dimensional framework that systematically organizes existing evaluation approaches according to medical theoretical knowledge, clinical practice ability, and ethical-safety considerations. Furthermore, this research provides critical insights into future developmental trajectories while establishing foundational frameworks and standardization protocols for the integration of LLMs into medical practice.