Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.
Purpose: Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. Design/methodology/approach: This article assesses which ChatGPT inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. Findings: The optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). Research limitations: The data is a convenience sample of the work of a single author, it only includes one field, and the scores are self-evaluations. Practical implications: The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing. Originality/value: This is the first systematic comparison of the impact of different prompts, parameters, and inputs for ChatGPT research quality evaluations.
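The evaluation recipe this abstract describes — average repeated ChatGPT scores per paper, then map them onto the human scale with linear regression — can be sketched as follows. This is a minimal illustration only; the function names and the toy numbers are assumptions, not the paper's code or data.

```python
import statistics


def mean_llm_score(iteration_scores):
    """Average repeated LLM quality scores for one paper
    (the study averaged 30 scoring iterations per paper)."""
    return statistics.mean(iteration_scores)


def fit_linear_map(llm_means, human_scores):
    """Ordinary least squares y = a*x + b mapping mean LLM scores
    onto the human quality scale."""
    n = len(llm_means)
    mx = sum(llm_means) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(llm_means, human_scores))
    var = sum((x - mx) ** 2 for x in llm_means)
    a = cov / var          # slope
    b = my - a * mx        # intercept
    return a, b
```

Once fitted, `a * mean_llm_score(scores) + b` gives a calibrated estimate on the human scale; the paper reports this calibration being 31% more accurate than guessing.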
The integration of artificial intelligence (AI) technology, particularly large language models (LLMs), has become essential across various sectors due to their advanced language comprehension and generation capabilities. Despite their transformative impact in fields such as machine translation and intelligent dialogue systems, LLMs face significant challenges. These challenges include safety, security, and privacy concerns that undermine their trustworthiness and effectiveness, such as hallucinations, backdoor attacks, and privacy leakage. Previous works often conflated safety issues with security concerns. In contrast, our study provides clearer and more reasonable definitions for safety, security, and privacy within the context of LLMs. Building on these definitions, we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety, security, and privacy in LLMs. Additionally, we explore the unique research challenges posed by LLMs and suggest potential avenues for future research, aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.
Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, researchers have recently explored the potential of using large language models (LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit testing, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in each stage. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.
BACKGROUND Inflammatory bowel disease (IBD) is a global health burden that affects millions of individuals worldwide, necessitating extensive patient education. Large language models (LLMs) hold promise for addressing patient information needs. However, LLM use to deliver accurate and comprehensible IBD-related medical information has yet to be thoroughly investigated. AIM To assess the utility of three LLMs (ChatGPT-4.0, Claude-3-Opus, and Gemini-1.5-Pro) as a reference point for patients with IBD. METHODS In this comparative study, two gastroenterology experts generated 15 IBD-related questions that reflected common patient concerns. These questions were used to evaluate the performance of the three LLMs. The answers provided by each model were independently assessed by three IBD-related medical experts using a Likert scale focusing on accuracy, comprehensibility, and correlation. Simultaneously, three patients were invited to evaluate the comprehensibility of the answers. Finally, a readability assessment was performed. RESULTS Overall, each of the LLMs achieved satisfactory levels of accuracy, comprehensibility, and completeness when answering IBD-related questions, although their performance varied. All of the investigated models demonstrated strengths in providing basic disease information, such as the definition of IBD as well as its common symptoms and diagnostic methods. Nevertheless, when dealing with more complex medical advice, such as medication side effects, dietary adjustments, and complication risks, the quality of answers was inconsistent between the LLMs. Notably, Claude-3-Opus generated answers with better readability than the other two models. CONCLUSION LLMs have potential as educational tools for patients with IBD; however, there are discrepancies between the models. Further optimization and the development of specialized models are necessary to ensure the accuracy and safety of the information provided.
ChatGPT is a powerful artificial intelligence (AI) language model that has demonstrated significant improvements in various natural language processing (NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on the security of using ChatGPT, covering aspects of bias, disinformation, ethics, misuse, attacks, and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions. Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.
Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges, including layout complexity in unstructured formats, limitations in recognizing visual elements, the correlation between different parts of the documents, and domain-specific semantics. Simply extracting text is not sufficient; advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately. This paper aims to evaluate the ability of large language models (LLMs) to correctly answer questions about various types of charts, comparing their performance when using images as input versus directly parsing PDF files. To retrieve the images from the PDF, ColPali, a model leveraging state-of-the-art visual language models, is used to identify the relevant page containing the appropriate chart for each question. Google's Gemini multimodal models were used to answer a set of questions through two approaches: 1) processing images derived from PDF documents and 2) directly utilizing the content of the same PDFs. Our findings underscore the limitations of traditional OCR-based approaches in visually rich document understanding (VrDU) and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks. Through structured benchmarking of chart question answering (CQA) across input formats, our work contributes to the advancement of chart understanding (CU) and the broader field of multimodal document analysis. Using two diverse and information-rich sources, the World Health Statistics 2024 report by the World Health Organisation and the Global Banking Annual Review 2024 by McKinsey & Company, we examine the performance of multimodal LLMs across different input modalities, comparing their effectiveness in processing charts as images versus parsing directly from PDF content. These documents were selected due to their multimodal nature, combining dense textual analysis with varied visual representations, thus presenting realistic challenges for vision-language models. This comparison is aimed at assessing how advanced models perform with different input formats and at determining whether an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
This critical review provides an in-depth analysis of Large Language Models (LLMs), encompassing their foundational principles, diverse applications, and advanced training methodologies. We critically examine the evolution from Recurrent Neural Networks (RNNs) to Transformer models, highlighting the significant advancements and innovations in LLM architectures. The review explores state-of-the-art techniques such as in-context learning and various fine-tuning approaches, with an emphasis on optimizing parameter efficiency. We also discuss methods for aligning LLMs with human preferences, including reinforcement learning frameworks and human feedback mechanisms. The emerging technique of retrieval-augmented generation, which integrates external knowledge into LLMs, is also evaluated. Additionally, we address the ethical considerations of deploying LLMs, stressing the importance of responsible and mindful application. By identifying current gaps and suggesting future research directions, this review provides a comprehensive and critical overview of the present state and potential advancements in LLMs. This work serves as an insightful guide for researchers and practitioners in artificial intelligence, offering a unified perspective on the strengths, limitations, and future prospects of LLMs.
With the widespread application of large language models (LLMs) in natural language processing and code generation, traditional High-Level Language Programming courses are facing unprecedented challenges and opportunities. As a core programming language for computer science majors, the C language remains irreplaceable due to its foundational nature and engineering adaptability. Against the backdrop of the rapid development of large model technologies, this paper proposes a systematic reform design for C language teaching, focusing on teaching objectives, content structure, teaching methods, and evaluation systems. The article suggests a teaching framework centered on “human-computer collaborative programming,” integrating prompt training, AI-assisted debugging, and code generation analysis, aiming to enhance students’ problem modeling ability, programming expression skills, and AI collaboration literacy.
Background: With the rapid development of artificial intelligence (AI), large language models (LLMs) have emerged as a potent tool for invigorating ophthalmology across clinical, educational, and research fields, and their accuracy and reliability have been tested. This bibliometric analysis aims to provide an overview of research on LLMs in ophthalmology from both thematic and geographical perspectives. Methods: All existing and highly cited LLM-related ophthalmology research papers published in English up to 24th April 2025 were sourced from Scopus, PubMed, and Web of Science. The characteristics of these publications, including publication output, authors, journals, countries, institutions, citations, and research domains, were analyzed using Biblioshiny and VOSviewer software. Results: A total of 277 articles from 1,459 authors and 89 journals were included in this study. Although relevant publications began to appear in 2019, there was a significant increase starting from 2023. He M and Shi D are the most prolific authors, while Investigative Ophthalmology & Visual Science stands out as the most prominent journal. Most of the top-publishing countries are high-income economies, with the USA taking the lead, and the University of California is the leading institution. VOSviewer identified 5 clusters in the keyword co-occurrence analysis, indicating that current research focuses on the clinical applications of LLMs, particularly in diagnosis and patient education. Conclusions: While LLMs have demonstrated effectiveness in retaining knowledge, their accuracy in image-based diagnosis remains limited. Therefore, future research should investigate fine-tuning strategies and domain-specific adaptations to close this gap. Although research on the applications of LLMs in ophthalmology is still in its early stages, it holds significant potential for advancing the field.
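The keyword co-occurrence analysis mentioned above starts from pairwise co-occurrence counts, which clustering tools such as VOSviewer then visualize. A minimal sketch of that counting step (the example keywords are illustrative, not the study's data):

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(papers):
    """Count how often each pair of keywords appears together in a paper.

    `papers` is a list of keyword lists, one per paper. Pairs are stored in
    sorted order so (a, b) and (b, a) count as the same pair.
    """
    pairs = Counter()
    for keywords in papers:
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs
```

A clustering tool groups keywords whose pair counts are high relative to their individual frequencies; the counts themselves are the whole input.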
Cardiac rehabilitation is a crucial multidisciplinary approach to improving patient outcomes. There is a growing body of evidence suggesting that these programs contribute to reducing cardiovascular mortality and recurrence. Despite this, cardiac rehabilitation is underutilized, and poor adherence to these programs has been a demonstrated barrier to achieving these outcomes. As a result, there is a growing focus on innovating these programs, especially from the standpoint of digital health and personalized medicine. This editorial discusses the possible roles of large language models, such as ChatGPT, in further personalizing cardiac rehabilitation programs through simplifying medical jargon and employing motivational interviewing techniques, thus boosting patient engagement and adherence. However, these possibilities must be further investigated in the clinical literature. Likewise, the integration of large language models in cardiac rehabilitation will be challenging in its nascent stages, particularly in ensuring accurate and ethical information delivery.
Design patterns offer reusable solutions for common software issues, enhancing software quality. The recent introduction of generative large language models (LLMs) like ChatGPT and Copilot has demonstrated significant promise in software development. They assist with a variety of tasks including code generation, modeling, bug fixing, and testing, leading to enhanced efficiency and productivity. Although initial uses of these LLMs have had a positive effect on software development, their potential influence on the application of design patterns remains unexplored. This study introduces a method to quantify LLMs’ ability to implement design patterns, using the Role-Based Metamodeling Language (RBML) for a rigorous specification of a pattern’s problem, solution, and transformation rules. The method evaluates the pattern applicability of a software application using the pattern’s problem specification. If deemed applicable, the application is input to the LLM for pattern application. The resulting application is assessed for conformance to the pattern’s solution specification and for completeness against the pattern’s transformation rules. Evaluating the method with ChatGPT 4 across three applications reveals ChatGPT’s high proficiency, achieving averages of 98% in conformance and 87% in completeness, thereby demonstrating the effectiveness of the method. Using RBML, this study confirms that LLMs, specifically ChatGPT 4, have great potential for the effective and efficient application of design patterns with high conformance and completeness. This opens avenues for further integrating LLMs into complex software engineering processes.
The advent of large language models (LLMs) has made knowledge acquisition and content creation increasingly easier and cheaper, which in turn redefines learning and urges transformation in software engineering education. To this end, there is a need to understand the impact of LLMs on software engineering education. In this paper, we conducted a preliminary case study on three software requirements engineering classes in which students were allowed to use LLMs to assist in their projects. Based on the students’ experience, performance, and feedback from a survey conducted at the end of the courses, we characterized the challenges and benefits of applying LLMs in software engineering education. This research contributes to the ongoing discourse on the integration of LLMs in education, emphasizing both their prominent potential and the need for balanced, mindful usage.
Generative artificial intelligence, represented by large language models, holds vast application scenarios and significant development potential in the field of language teaching. This study employs large language models such as ChatGPT-4o, ERNIE Bot, and Spark Cognition to explore, through practical cases, how they empower teachers in international Chinese language teaching. It focuses on various aspects of international Chinese language teaching and language skills training, examining the application effects of large language models in generating tailored teaching content and converting textual content into multimodal teaching materials. Finally, the study proposes that teachers should rationally recognize the opportunities and challenges that large language models bring to the teaching ecosystem: while acknowledging the models’ efficiency in empowering teachers’ instruction, it is crucial to fully recognize their essential nature as tools, uphold teachers’ subjectivity, and pay close attention to the boundaries of their development and application.
AIM: To assess the performance of five distinct large language models (LLMs; ChatGPT-3.5, ChatGPT-4, PaLM2, Claude 2, and SenseNova) in comparison to two human cohorts (a group of funduscopic disease experts and a group of ophthalmologists) on the specialized subject of funduscopic disease. METHODS: Five distinct LLMs and two distinct human groups independently completed a 100-item funduscopic disease test. The performance of these entities was assessed by comparing their average scores, response stability, and answer confidence, thereby establishing a basis for evaluation. RESULTS: Among all the LLMs, ChatGPT-4 and PaLM2 exhibited the most substantial average correlation. Additionally, ChatGPT-4 achieved the highest average score and demonstrated the utmost confidence during the exam. In comparison to the human cohorts, ChatGPT-4 exhibited performance comparable to that of the ophthalmologists, albeit falling short of the expertise demonstrated by the funduscopic disease specialists. CONCLUSION: The study provides evidence of the exceptional performance of ChatGPT-4 in the domain of funduscopic disease. With continued enhancements, validated LLMs have the potential to yield unforeseen advantages in enhancing healthcare for both patients and physicians.
Background: Large Language Models (LLMs) have gained much attention and, in part, have replaced common search engines as a popular channel for obtaining information due to their contextually relevant responses. Osteoarthritis (OA) is a common topic in skeletal muscle disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries. Methods: We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs’ replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as “satisfactory” were evaluated for comprehensiveness on a 5-point scale. Results: ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as “excellent”, compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson’s χ² test with Fisher’s exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for “treatment and prevention”. However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity, garnering 53.8% “excellent” ratings (Pearson’s χ² test with Fisher’s exact test, all p < 0.001). Conclusion: Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.
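The majority-consensus step described above — three raters, a final rating per response — can be sketched as follows. The tie-breaking rule (fall back to the median category) is an assumption for illustration; the abstract does not specify how ties were resolved.

```python
from collections import Counter


def consensus_rating(ratings):
    """Combine independent rater scores by majority vote.

    `ratings` holds one integer per rater on the study's 4-point accuracy
    scale (encoding assumed here as 1..4). If no single value has a strict
    majority of the top count, fall back to the median category.
    """
    counts = Counter(ratings).most_common()
    top_value, top_count = counts[0]
    if sum(1 for _, c in counts if c == top_count) == 1:
        return top_value  # unambiguous majority
    return sorted(ratings)[len(ratings) // 2]  # tie: median category
```

With three raters a two-vote agreement always wins; the median branch only fires when all three disagree.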
AIM: To assess the possibility of using different large language models (LLMs) in ocular surface diseases by selecting five LLMs and testing their accuracy in answering specialized questions related to ocular surface diseases: ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova. METHODS: A group of experienced ophthalmology professors was asked to develop a 100-item single-choice examination on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam includes questions on the following topics: keratitis (20 questions); keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases (20 questions); conjunctivitis (20 questions); trachoma, pterygium, and conjunctival tumor diseases (20 questions); and dry eye disease (20 questions). The total score of each LLM was then calculated, and their mean scores, mean correlations, variances, and confidence levels were compared. RESULTS: GPT-4 exhibited the highest performance among the LLMs. Comparing the average scores of the LLM group with the four human groups (chief physicians, attending physicians, regular trainees, and graduate students), it was found that except for ChatGPT-4, the total scores of the remaining LLMs were lower than that of the graduate student group, which had the lowest score among the human groups. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, leaving very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave the wrong answer 28% of the time. CONCLUSION: The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In terms of answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Even though ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting that its potential for application in this field is enormous, perhaps as a valuable resource for medical students and clinicians in the future.
Psychiatric disorders constitute a complex health issue, primarily manifesting as significant disturbances in cognition, emotional regulation, and behavior. However, due to limited resources within health care systems, only a minority of patients can access effective treatment and care services, highlighting an urgent need for improvement. Large language models (LLMs), with their natural language understanding and generation capabilities, are gradually penetrating the entire process of psychiatric diagnosis and treatment, including outpatient reception, diagnosis and therapy, clinical nursing, medication safety, and prognosis follow-up. They hold promise for alleviating the current severe shortage of health system resources and promoting equal access to mental health care. This article reviews the application scenarios and research progress of LLMs and explores optimization methods for LLMs in psychiatry. Based on the research findings, we propose a clinical LLM for mental health using the Mixture of Experts framework to improve the accuracy of psychiatric diagnosis and therapeutic interventions.
Despite broad consensus on the importance of enterprise digital transformation, significant discrepancies persist regarding its actual effects. This divergence stems primarily from two key measurement challenges: (1) a lack of clear and consistent definitions of enterprise digital transformation, and (2) a lack of rigorous and accurate measurement methodologies. These shortcomings lead to research findings that are incomparable, difficult to replicate, and often conflicting. To effectively address the aforementioned challenges, this paper employs machine learning and large language models (LLMs) to construct a novel set of indicators for enterprise digital transformation. The work begins by manually annotating sentences from the annual reports of listed companies in China from 2006 to 2020. These labeled sentences are then used to train and fine-tune several machine learning models, including LLMs. The ERNIE model, demonstrating the best classification performance among the models tested, is selected as the sentence classifier to predict sentence labels across the full text of the annual reports, ultimately constructing the enterprise digital transformation metrics. Both theoretical analysis and multiple data cross-validations demonstrate that the metrics developed in this paper are more accurate than existing approaches. Based on these metrics, the paper empirically examines the impact of enterprise digital transformation on financial performance. Our findings reveal three key points: (1) enterprise digital transformation significantly enhances financial performance, with big data, AI, mobile internet, cloud computing, and the Internet of Things (IoT) all playing a significant role; however, blockchain technology does not show a significant effect; (2) the significant positive effect of digital transformation on financial performance is primarily observed in firms with weaker initial financial performance; and (3) enterprise digital transformation improves financial performance mainly through enhancing efficiency and reducing costs. This research has practical implications for promoting enterprise digital transformation and fostering high-quality economic development.
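The metric-construction pipeline described above — classify every annual-report sentence, then aggregate the labels into a firm-level indicator — can be sketched as follows. The aggregation (share of positive sentences) and the keyword rule are simplifying assumptions for illustration; the paper's actual classifier is a fine-tuned ERNIE model, not a keyword match.

```python
def transformation_index(sentences, classify):
    """Firm-level digital-transformation indicator: the share of
    annual-report sentences the classifier labels as transformation-related.
    `classify` stands in for the fine-tuned sentence classifier."""
    if not sentences:
        return 0.0
    positive = sum(1 for s in sentences if classify(s))
    return positive / len(sentences)


def toy_classifier(sentence):
    """Hypothetical keyword rule used only to demonstrate the pipeline."""
    keywords = ("cloud computing", "big data", "artificial intelligence",
                "Internet of Things", "blockchain")
    return any(k in sentence for k in keywords)
```

In the paper's setup, the manually annotated sentences supply the training labels and the trained classifier replaces `toy_classifier` when scoring the full report text.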
Artificial intelligence is reshaping radiology by enabling automated report generation, yet evaluating the clinical accuracy and relevance of these reports is a challenging task, as traditional natural language generation metrics like BLEU and ROUGE prioritize lexical overlap over clinical relevance. To address this gap, we propose a novel semantic assessment framework for evaluating the accuracy of artificial intelligence-generated radiology reports against ground truth references. We trained the R2GenRL model on 5,229 image–report pairs from the Indiana University chest X-ray dataset and generated a benchmark dataset from the test data of the Indiana University chest X-ray and MIMIC-CXR datasets. These datasets were selected for their public availability, large scale, and comprehensive coverage of diverse clinical cases in chest radiography, enabling robust evaluation and comparison with prior work. Results demonstrate that the Mistral model, particularly with task-oriented prompting, achieves superior performance (up to 91.9% accuracy), surpassing other models and closely aligning with established metrics like BERTScore-F1 (88.1%) and CLIP-Score (88.7%). Statistical analyses, including paired t-tests (p < 0.01) and analysis of variance (p < 0.05), confirm significant improvements driven by structured prompting. Failure case analysis reveals limitations, such as over-reliance on lexical similarity, underscoring the need for domain-specific fine-tuning. This framework advances the evaluation of artificial intelligence-driven (AI-driven) radiology report generation, offering a robust, clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.
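The paired t-tests mentioned above compare two evaluation metrics scored on the same set of generated reports. A minimal sketch of the test statistic itself (the score values below are illustrative, not the paper's data):

```python
import math
import statistics


def paired_t_statistic(metric_a, metric_b):
    """Paired t statistic for two metrics evaluated on the same items.

    Each argument is a list of per-report scores; the statistic is the mean
    of the paired differences divided by its standard error, with n - 1
    degrees of freedom for the p-value lookup.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n))
```

The resulting statistic is compared against the t distribution with n − 1 degrees of freedom; a large |t| at the study's sample size corresponds to the reported p < 0.01.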
Funding: Supported by the National Natural Science Foundation of China (62376219 and 62006194), the Foundational Research Project in Specialized Discipline (Grant No. G2024WD0146), and the Faculty Construction Project (Grant No. 24GH0201148).
Abstract: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes the multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.
Abstract: Purpose: Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. Design/methodology/approach: This article assesses which ChatGPT inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. Findings: The optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). Research limitations: The data is a convenience sample of the work of a single author, it only includes one field, and the scores are self-evaluations. Practical implications: The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into human-scale scores, which is 31% more accurate than guessing. Originality/value: This is the first systematic comparison of the impact of different prompts, parameters, and inputs for ChatGPT research quality evaluations.
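The scoring pipeline this abstract describes, averaging repeated ChatGPT scores per paper, correlating them with human ratings, and fitting a linear regression to map model scores onto the human scale, can be sketched as follows. The score values, scales, and variable names here are illustrative assumptions, not the paper's actual data.

```python
import numpy as np

# Hypothetical scores for six papers: human quality ratings on a 1-4 scale
# and the mean of repeated LLM score estimates (e.g. 30 iterations each).
human = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0])
llm_mean = np.array([3.1, 4.0, 4.6, 6.2, 6.0, 8.1])

# Pearson correlation between the averaged LLM scores and the human scores.
r = np.corrcoef(human, llm_mean)[0, 1]

# Least-squares linear regression converting LLM scores to the human scale,
# as the abstract describes.
slope, intercept = np.polyfit(llm_mean, human, 1)
predicted = slope * llm_mean + intercept
```

A property of the least-squares fit is that the converted scores preserve the mean of the human ratings, so the mapping recenters the LLM scale without distorting the average.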
Funding: Supported by the National Key R&D Program of China under Grant No. 2022YFB3103500; the National Natural Science Foundation of China under Grants No. 62402087 and No. 62020106013; the Sichuan Science and Technology Program under Grant No. 2023ZYD0142; the Chengdu Science and Technology Program under Grant No. 2023-XT00-00002-GX; the Fundamental Research Funds for Chinese Central Universities under Grants No. ZYGX2020ZB027 and No. Y030232063003002; and the Postdoctoral Innovation Talents Support Program under Grant No. BX20230060.
Abstract: The integration of artificial intelligence (AI) technology, particularly large language models (LLMs), has become essential across various sectors due to their advanced language comprehension and generation capabilities. Despite their transformative impact in fields such as machine translation and intelligent dialogue systems, LLMs face significant challenges. These include safety, security, and privacy concerns that undermine their trustworthiness and effectiveness, such as hallucinations, backdoor attacks, and privacy leakage. Previous works often conflated safety issues with security concerns. In contrast, our study provides clearer and more reasonable definitions for safety, security, and privacy within the context of LLMs. Building on these definitions, we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety, security, and privacy in LLMs. Additionally, we explore the unique research challenges posed by LLMs and suggest potential avenues for future research, aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.
Abstract: Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from advancements in deep learning technologies. Due to the successful use of deep learning in software security, researchers have recently explored the potential of using large language models (LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit testing, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in each stage. We also discuss future directions for using LLMs in software security, including directions for the existing uses of LLMs and extensions from conventional deep learning research.
Funding: Supported by the China Health Promotion Foundation Young Doctors' Research Foundation for Inflammatory Bowel Disease; the Taishan Scholars Program of Shandong Province, China, No. tsqn202306343; and the National Natural Science Foundation of China, No. 82270578.
Abstract: BACKGROUND: Inflammatory bowel disease (IBD) is a global health burden that affects millions of individuals worldwide, necessitating extensive patient education. Large language models (LLMs) hold promise for addressing patient information needs. However, the use of LLMs to deliver accurate and comprehensible IBD-related medical information has yet to be thoroughly investigated. AIM: To assess the utility of three LLMs (ChatGPT-4.0, Claude-3-Opus, and Gemini-1.5-Pro) as a reference point for patients with IBD. METHODS: In this comparative study, two gastroenterology experts generated 15 IBD-related questions that reflected common patient concerns. These questions were used to evaluate the performance of the three LLMs. The answers provided by each model were independently assessed by three IBD-related medical experts using a Likert scale focusing on accuracy, comprehensibility, and correlation. Simultaneously, three patients were invited to evaluate the comprehensibility of the answers. Finally, a readability assessment was performed. RESULTS: Overall, each of the LLMs achieved satisfactory levels of accuracy, comprehensibility, and completeness when answering IBD-related questions, although their performance varied. All of the investigated models demonstrated strengths in providing basic disease information, such as the definition of IBD and its common symptoms and diagnostic methods. Nevertheless, when dealing with more complex medical advice, such as medication side effects, dietary adjustments, and complication risks, the quality of answers was inconsistent between the LLMs. Notably, Claude-3-Opus generated answers with better readability than the other two models. CONCLUSION: LLMs have potential as educational tools for patients with IBD; however, there are discrepancies between the models. Further optimization and the development of specialized models are necessary to ensure the accuracy and safety of the information provided.
Abstract: ChatGPT is a powerful artificial intelligence (AI) language model that has demonstrated significant improvements in various natural language processing (NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on the security of using ChatGPT, covering bias, disinformation, ethics, misuse, attacks, and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions. Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.
Funding: Supported by a grant from the Ministry of Research, Innovation and Digitization, CNCS/CCCDI-UEFISCDI, project number COFUND-CETP-SMART-LEM-1, within PNCDI IV.
Abstract: Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges, including layout complexity in unstructured formats, limitations in recognizing visual elements, the correlation between different parts of the documents, and domain-specific semantics. Simply extracting text is not sufficient; advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately. This paper aims to evaluate the ability of Large Language Models (LLMs) to correctly answer questions about various types of charts, comparing their performance when using images as input versus directly parsing PDF files. To retrieve the images from the PDF, ColPali, a model leveraging state-of-the-art visual language models, is used to identify the relevant page containing the appropriate chart for each question. Google's Gemini multimodal models were used to answer a set of questions through two approaches: 1) processing images derived from PDF documents, and 2) directly utilizing the content of the same PDFs. Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding (VrDU) and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks. Through structured benchmarking of chart question answering (CQA) across input formats, our work contributes to the advancement of chart understanding (CU) and the broader field of multimodal document analysis. Using two diverse and information-rich sources, the World Health Statistics 2024 report by the World Health Organization and the Global Banking Annual Review 2024 by McKinsey & Company, we examine the performance of multimodal LLMs across different input modalities, comparing their effectiveness in processing charts as images versus parsing content directly from PDFs. These documents were selected for their multimodal nature, combining dense textual analysis with varied visual representations, thus presenting realistic challenges for vision-language models. This comparison is aimed at assessing how advanced models perform with different input formats and at determining whether an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
Abstract: This critical review provides an in-depth analysis of Large Language Models (LLMs), encompassing their foundational principles, diverse applications, and advanced training methodologies. We critically examine the evolution from Recurrent Neural Networks (RNNs) to Transformer models, highlighting the significant advancements and innovations in LLM architectures. The review explores state-of-the-art techniques such as in-context learning and various fine-tuning approaches, with an emphasis on optimizing parameter efficiency. We also discuss methods for aligning LLMs with human preferences, including reinforcement learning frameworks and human feedback mechanisms. The emerging technique of retrieval-augmented generation, which integrates external knowledge into LLMs, is also evaluated. Additionally, we address the ethical considerations of deploying LLMs, stressing the importance of responsible and mindful application. By identifying current gaps and suggesting future research directions, this review provides a comprehensive and critical overview of the present state and potential advancements in LLMs. This work serves as an insightful guide for researchers and practitioners in artificial intelligence, offering a unified perspective on the strengths, limitations, and future prospects of LLMs.
Funding: Education and Teaching Research Project of Beijing University of Technology (ER2024KCB08).
Abstract: With the widespread application of large language models (LLMs) in natural language processing and code generation, traditional High-Level Language Programming courses are facing unprecedented challenges and opportunities. As a core programming language for computer science majors, the C language remains irreplaceable due to its foundational nature and engineering adaptability. Based on the rapid development of large-model technologies, this paper proposes a systematic reform design for C language teaching, focusing on teaching objectives, content structure, teaching methods, and evaluation systems. The article suggests a teaching framework centered on "human-computer collaborative programming," integrating prompt training, AI-assisted debugging, and code-generation analysis, aiming to enhance students' problem-modeling ability, programming expression skills, and AI collaboration literacy.
Funding: Supported by the Health and Medical Research Fund, Hong Kong (11220386, 12230246).
Abstract: Background: With the rapid development of artificial intelligence (AI), large language models (LLMs) have emerged as a potent tool for invigorating ophthalmology across the clinical, educational, and research fields, and their accuracy and reliability have been tested. This bibliometric analysis aims to provide an overview of research on LLMs in ophthalmology from both thematic and geographical perspectives. Methods: All existing and highly cited LLM-related ophthalmology research papers published in English up to 24th April 2025 were sourced from Scopus, PubMed, and Web of Science. The characteristics of these publications, including publication output, authors, journals, countries, institutions, citations, and research domains, were analyzed using Biblioshiny and VOSviewer software. Results: A total of 277 articles from 1,459 authors and 89 journals were included in this study. Although relevant publications began to appear in 2019, there was a significant increase starting from 2023. He M and Shi D are the most prolific authors, while Investigative Ophthalmology & Visual Science stands out as the most prominent journal. Most of the top-publishing countries are high-income economies, with the USA taking the lead, and the University of California is the leading institution. VOSviewer identified 5 clusters in the keyword co-occurrence analysis, indicating that current research focuses on the clinical applications of LLMs, particularly in diagnosis and patient education. Conclusions: While LLMs have demonstrated effectiveness in retaining knowledge, their accuracy in image-based diagnosis remains limited. Therefore, future research should investigate fine-tuning strategies and domain-specific adaptations to close this gap. Although research on the applications of LLMs in ophthalmology is still in its early stages, it holds significant potential for advancing the field.
Abstract: Cardiac rehabilitation is a crucial multidisciplinary approach to improving patient outcomes. A growing body of evidence suggests that these programs contribute to reducing cardiovascular mortality and recurrence. Despite this, cardiac rehabilitation is underutilized, and adherence to these programs has been a demonstrated barrier to achieving these outcomes. As a result, there is a growing focus on innovating these programs, especially from the standpoint of digital health and personalized medicine. This editorial discusses the possible roles of large language models, such as the one underlying ChatGPT, in further personalizing cardiac rehabilitation programs by simplifying medical jargon and employing motivational interviewing techniques, thus boosting patient engagement and adherence. However, these possibilities must be further investigated in the clinical literature. Likewise, integrating large language models into cardiac rehabilitation will be challenging in its nascent stages, particularly in ensuring accurate and ethical information delivery.
Abstract: Design patterns offer reusable solutions for common software issues, enhancing quality. The recent introduction of generative large language models (LLMs) such as ChatGPT and Copilot has demonstrated significant promise in software development: they assist with a variety of tasks including code generation, modeling, bug fixing, and testing, leading to enhanced efficiency and productivity. Although initial uses of these LLMs have had a positive effect on software development, their potential influence on the application of design patterns remains unexplored. This study introduces a method to quantify LLMs' ability to implement design patterns, using the Role-Based Metamodeling Language (RBML) for a rigorous specification of a pattern's problem, solution, and transformation rules. The method evaluates the pattern applicability of a software application using the pattern's problem specification. If deemed applicable, the application is input to the LLM for pattern application. The resulting application is assessed for conformance to the pattern's solution specification and for completeness against the pattern's transformation rules. Evaluating the method with ChatGPT 4 across three applications reveals ChatGPT's high proficiency, achieving averages of 98% in conformance and 87% in completeness, thereby demonstrating the effectiveness of the method. Using RBML, this study confirms that LLMs, specifically ChatGPT 4, have great potential for effective and efficient application of design patterns with high conformance and completeness. This opens avenues for further integrating LLMs into complex software engineering processes.
Funding: Supported in part by the Teaching Reform Project of Chongqing University of Posts and Telecommunications, China, under Grant No. XJG23234; the Chongqing Municipal Higher Education Teaching Reform Research Project under Grant No. 203399; and the Doctoral Direct Train Project of the Chongqing Science and Technology Bureau under Grant No. CSTB2022BSXM-JSX0007.
Abstract: The advent of large language models (LLMs) has made knowledge acquisition and content creation increasingly easier and cheaper, which in turn redefines learning and urges transformation in software engineering education. To that end, there is a need to understand the impact of LLMs on software engineering education. In this paper, we conducted a preliminary case study on three software requirements engineering classes in which students were allowed to use LLMs to assist in their projects. Based on the students' experience, performance, and feedback from a survey conducted at the end of the courses, we characterize the challenges and benefits of applying LLMs in software engineering education. This research contributes to the ongoing discourse on the integration of LLMs in education, emphasizing both their prominent potential and the need for balanced, mindful usage.
Abstract: Generative artificial intelligence, represented by large language models, holds vast application scenarios and significant development potential in the field of language teaching. This study employs large language models such as ChatGPT-4o, ERNIE Bot, and Spark Cognition to explore, through practical cases, how they empower teachers in international Chinese language teaching. It focuses on various aspects of international Chinese language teaching and language skills training, examining the effects of large language models in generating tailored teaching content and converting textual content into multimodal teaching materials. Finally, the study proposes that teachers should rationally recognize the opportunities and challenges that large language models bring to the teaching ecosystem: while acknowledging the models' efficiency in supporting instruction, it is crucial to fully recognize their essential nature as tools, uphold teachers' agency, and pay close attention to the boundaries of their development and application.
Funding: Supported by the National Natural Science Foundation of China (No. 82160195); the Science and Technology Project of the Jiangxi Provincial Department of Education (No. GJJ200169); the Science and Technology Project of the Jiangxi Province Health Commission of Traditional Chinese Medicine (No. 2020A0087); and the Science and Technology Project of the Jiangxi Health Commission (No. 202130210).
Abstract: AIM: To assess the performance of five distinct large language models (LLMs; ChatGPT-3.5, ChatGPT-4, PaLM2, Claude 2, and SenseNova) in comparison to two human cohorts (a group of funduscopic disease experts and a group of ophthalmologists) on the specialized subject of funduscopic disease. METHODS: Five distinct LLMs and two distinct human groups independently completed a 100-item funduscopic disease test. The performance of these entities was assessed by comparing their average scores, response stability, and answer confidence, thereby establishing a basis for evaluation. RESULTS: Among all the LLMs, ChatGPT-4 and PaLM2 exhibited the most substantial average correlation. Additionally, ChatGPT-4 achieved the highest average score and demonstrated the utmost confidence during the exam. In comparison to the human cohorts, ChatGPT-4 exhibited performance comparable to the ophthalmologists, albeit falling short of the expertise demonstrated by the funduscopic disease specialists. CONCLUSION: The study provides evidence of the exceptional performance of ChatGPT-4 in the domain of funduscopic disease. With continued enhancements, validated LLMs have the potential to yield unforeseen advantages in enhancing healthcare for both patients and physicians.
Funding: Supported by the Health and Medical Research Fund of the Food and Health Bureau of the Government of the Hong Kong Special Administrative Region (HMRF/19202461) and by a direct grant (2022/044) from the Chinese University of Hong Kong.
Abstract: Background: Large Language Models (LLMs) have gained much attention and, in part, have replaced common search engines as a popular channel for obtaining information due to their contextually relevant responses. Osteoarthritis (OA) is a common topic in skeletal muscle disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries. Methods: We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as "satisfactory" were evaluated for comprehensiveness on a 5-point scale. Results: ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention." However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001). Conclusion: Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.
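The majority-consensus step this abstract describes, three specialists independently rating each reply with the final rating being the one most raters agree on, can be sketched as below. The tie-breaking rule (falling back to the median when no strict majority exists) is an assumption for illustration; the study does not specify one.

```python
from collections import Counter

def majority_rating(ratings):
    """Return the consensus rating from a list of independent rater scores.

    If a strict majority of raters agree on one rating, that rating wins;
    otherwise fall back to the median rating (an assumed tie-break policy).
    """
    rating, freq = Counter(ratings).most_common(1)[0]
    if freq > len(ratings) // 2:
        return rating
    # No strict majority: use the median of the sorted ratings.
    return sorted(ratings)[len(ratings) // 2]

# Two of three raters give 4, one gives 3 → consensus is 4.
final = majority_rating([4, 4, 3])
```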
Funding: Supported by the National Natural Science Foundation of China (No. 82160195, No. 82460203) and the Degree and Postgraduate Education Teaching Reform Project of Jiangxi Province (No. JXYJG-2020-026).
Abstract: AIM: To assess the possibility of using different large language models (LLMs) in ocular surface diseases by selecting five different LLMs and testing their accuracy in answering specialized questions related to ocular surface diseases: ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova. METHODS: A group of experienced ophthalmology professors was asked to develop a 100-question single-choice examination on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam includes questions on the following topics: keratitis (20 questions); keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases (20 questions); conjunctivitis (20 questions); trachoma, pterygium, and conjunctival tumor diseases (20 questions); and dry eye disease (20 questions). The total score of each LLM was then calculated, and their mean scores, mean correlations, variances, and confidence levels were compared. RESULTS: GPT-4 exhibited the highest performance among the LLMs. Comparing the average scores of the LLM group with the four human groups (chief physicians, attending physicians, regular trainees, and graduate students), it was found that, except for ChatGPT-4, the total scores of the remaining LLMs were lower than that of the graduate student group, which had the lowest score among the human groups. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, with very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave a wrong answer 28% of the time. CONCLUSION: The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In terms of answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Although ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting enormous potential for application in this field, perhaps as a valuable resource for medical students and clinicians in the future.
Funding: Supported by the STI2030-Major Projects, No. 2021ZD0203400 and No. 2021ZD0200800, and the National Natural Science Foundation of China, No. 82171477.
Abstract: Psychiatric disorders constitute a complex health issue, primarily manifesting as significant disturbances in cognition, emotional regulation, and behavior. However, due to limited resources within health care systems, only a minority of patients can access effective treatment and care services, highlighting an urgent need for improvement. Large language models (LLMs), with their natural language understanding and generation capabilities, are gradually penetrating the entire process of psychiatric diagnosis and treatment, including outpatient reception, diagnosis and therapy, clinical nursing, medication safety, and prognosis follow-up. They hold promise for alleviating the current severe shortage of health system resources and promoting equal access to mental health care. This article reviews the application scenarios and research progress of LLMs and explores optimization methods for LLMs in psychiatry. Based on the research findings, we propose a clinical LLM for mental health using the Mixture of Experts framework to improve the accuracy of psychiatric diagnosis and therapeutic interventions.
Funding: Supported by the Fundamental Research Funds for the Central Universities and the following projects: the Major Project of the National Social Science Fund of China (NSSFC), "Research on the Synergistic Mechanisms of Innovation and Governance for High-Quality Development of the Digital Economy" (Grant No. 22&ZD070); the Youth Project of the National Natural Science Foundation of China (NSFC), "Research on Risk-Taking of Zombie Enterprises from a Government-Enterprise Interaction Perspective: Tendency, Behavioral Patterns, and Economic Consequences" (Grant No. 72002213); and the General Program of the National Natural Science Foundation of China (NSFC), "Reshaping Enterprise Nature, Boundaries, and Internal Organization in the Digital Economy" (Grant No. 72273144).
Abstract: Despite broad consensus on the importance of enterprise digital transformation, significant discrepancies persist regarding its actual effects. This divergence stems primarily from two key measurement challenges: (1) a lack of clear and consistent definitions of enterprise digital transformation, and (2) a lack of rigorous and accurate measurement methodologies. These shortcomings lead to research findings that are incomparable, difficult to replicate, and often conflicting. To effectively address these challenges, this paper employs machine learning and large language models (LLMs) to construct a novel set of indicators for enterprise digital transformation. The work begins by manually annotating sentences from the annual reports of listed companies in China from 2006 to 2020. These labeled sentences are then used to train and fine-tune several machine learning models, including LLMs. The ERNIE model, demonstrating the best classification performance among the models tested, is selected as the sentence classifier to predict sentence labels across the full text of the annual reports, ultimately constructing the enterprise digital transformation metrics. Both theoretical analysis and multiple data cross-validations demonstrate that the metrics developed in this paper are more accurate than existing approaches. Based on these metrics, the paper empirically examines the impact of enterprise digital transformation on financial performance. Our findings reveal three key points: (1) enterprise digital transformation significantly enhances financial performance, with big data, AI, mobile internet, cloud computing, and the Internet of Things (IoT) all playing a significant role, whereas blockchain technology does not show a significant effect; (2) the significant positive effect of digital transformation on financial performance is primarily observed in firms with weaker initial financial performance; and (3) enterprise digital transformation improves financial performance mainly through enhancing efficiency and reducing costs. This research has practical implications for promoting enterprise digital transformation and fostering high-quality economic development.
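The measurement pipeline this abstract describes, labeling sentences from annual reports, fitting a sentence classifier, then scoring a firm by its predicted digital-transformation sentences, can be sketched as below, with scikit-learn standing in for the fine-tuned ERNIE model used in the paper. The toy English sentences and the share-of-positive-sentences metric are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually labeled training sentences: 1 = digital-transformation-related.
labeled = [
    ("we invested in cloud computing platforms", 1),
    ("big data analytics drives our operations", 1),
    ("artificial intelligence supports product design", 1),
    ("the board approved the annual dividend payout", 0),
    ("revenue grew due to new retail store openings", 0),
    ("we repaid the outstanding bank loans early", 0),
]
texts, labels = zip(*labeled)

# Stand-in sentence classifier (the paper fine-tunes ERNIE instead).
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Score one firm's report as the share of predicted digital sentences.
report = [
    "we deployed an internet of things platform in our factories",
    "the dividend payout was increased this year",
]
preds = clf.predict(report)
digital_metric = sum(preds) / len(preds)
```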
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Innovative Human Resource Development for Local Intellectualization program, funded by the Korea government (MSIT) (IITP-2024-RS-2024-00436773).
Abstract: Artificial intelligence is reshaping radiology by enabling automated report generation, yet evaluating the clinical accuracy and relevance of these reports is a challenging task, as traditional natural language generation metrics like BLEU and ROUGE prioritize lexical overlap over clinical relevance. To address this gap, we propose a novel semantic assessment framework for evaluating the accuracy of artificial intelligence-generated radiology reports against ground-truth references. We trained the R2GenRL model on 5229 image–report pairs from the Indiana University chest X-ray dataset and generated a benchmark dataset from the test data of the Indiana University chest X-ray and MIMIC-CXR datasets. These datasets were selected for their public availability, large scale, and comprehensive coverage of diverse clinical cases in chest radiography, enabling robust evaluation and comparison with prior work. Results demonstrate that the Mistral model, particularly with task-oriented prompting, achieves superior performance (up to 91.9% accuracy), surpassing other models and closely aligning with established metrics like BERTScore-F1 (88.1%) and CLIP-Score (88.7%). Statistical analyses, including paired t-tests (p < 0.01) and analysis of variance (p < 0.05), confirm significant improvements driven by structured prompting. Failure case analysis reveals limitations, such as over-reliance on lexical similarity, underscoring the need for domain-specific fine-tuning. This framework advances the evaluation of artificial-intelligence-driven (AI-driven) radiology report generation, offering a robust, clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.
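The significance test this abstract reports, a paired t-test comparing per-report scores under two prompting styles, can be sketched as below. A paired test is appropriate because both scores come from the same set of reports. The accuracy values are illustrative assumptions, not the paper's measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical per-report accuracy scores (%) for the same six test reports
# under a baseline prompt and a task-oriented prompt.
baseline = np.array([85.2, 84.1, 86.0, 83.7, 85.5, 84.8])
task_oriented = np.array([91.0, 90.3, 92.1, 89.8, 91.5, 90.7])

# Paired t-test on the per-report differences, as the abstract describes.
t_stat, p_value = stats.ttest_rel(task_oriented, baseline)
```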