We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of t...We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of these models and their ability to perform the task of abstractive text summarization in the healthcare field.The research hypothesis was that large language models could perform high-quality abstractive text summarization on German technical healthcare texts,even if the model is not specifically trained in that language.Through experiments,the research questions explore the performance of transformer language models in dealing with complex syntax constructs,the difference in performance between models trained in English and German,and the impact of translating the source text to English before conducting the summarization.We conducted an evaluation of four PLMs(GPT-3,a translation-based approach also utilizing GPT-3,a German language Model,and a domain-specific bio-medical model approach).The evaluation considered the informativeness using 3 types of metrics based on Recall-Oriented Understudy for Gisting Evaluation(ROUGE)and the quality of results which is manually evaluated considering 5 aspects.The results show that text summarization models could be used in the German healthcare domain and that domain-independent language models achieved the best results.The study proves that text summarization models can simplify the search for pre-existing German knowledge in various domains.展开更多
Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and langua...Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.展开更多
Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether ...Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether Large Language Models(LLMs)can play a role in this process.Design/methodology/approach:This article assesses which ChatGPT inputs(full text without tables,figures,and references;title and abstract;title only)produce better quality score estimates,and the extent to which scores are affected by ChatGPT models and system prompts.Findings:The optimal input is the article title and abstract,with average ChatGPT scores based on these(30 iterations on a dataset of 51 papers)correlating at 0.67 with human scores,the highest ever reported.ChatGPT 4o is slightly better than 3.5-turbo(0.66),and 4o-mini(0.66).Research limitations:The data is a convenience sample of the work of a single author,it only includes one field,and the scores are self-evaluations.Practical implications:The results suggest that article full texts might confuse LLM research quality evaluations,even though complex system instructions for the task are more effective than simple ones.Thus,whilst abstracts contain insufficient information for a thorough assessment of rigour,they may contain strong pointers about originality and significance.Finally,linear regression can be used to convert the model scores into the human scale scores,which is 31%more accurate than guessing.Originality/value:This is the first systematic comparison of the impact of different prompts,parameters and inputs for ChatGPT research quality evaluations.展开更多
BACKGROUND Inflammatory bowel disease(IBD)is a global health burden that affects millions of individuals worldwide,necessitating extensive patient education.Large language models(LLMs)hold promise for addressing patie...BACKGROUND Inflammatory bowel disease(IBD)is a global health burden that affects millions of individuals worldwide,necessitating extensive patient education.Large language models(LLMs)hold promise for addressing patient information needs.However,LLM use to deliver accurate and comprehensible IBD-related medical information has yet to be thoroughly investigated.AIM To assess the utility of three LLMs(ChatGPT-4.0,Claude-3-Opus,and Gemini-1.5-Pro)as a reference point for patients with IBD.METHODS In this comparative study,two gastroenterology experts generated 15 IBD-related questions that reflected common patient concerns.These questions were used to evaluate the performance of the three LLMs.The answers provided by each model were independently assessed by three IBD-related medical experts using a Likert scale focusing on accuracy,comprehensibility,and correlation.Simultaneously,three patients were invited to evaluate the comprehensibility of their answers.Finally,a readability assessment was performed.RESULTS Overall,each of the LLMs achieved satisfactory levels of accuracy,comprehensibility,and completeness when answering IBD-related questions,although their performance varies.All of the investigated models demonstrated strengths in providing basic disease information such as IBD definition as well as its common symptoms and diagnostic methods.Nevertheless,when dealing with more complex medical advice,such as medication side effects,dietary adjustments,and complication risks,the quality of answers was inconsistent between the LLMs.Notably,Claude-3-Opus generated answers with better readability than the other two models.CONCLUSION LLMs have the potential as educational tools for patients with IBD;however,there are discrepancies between the models.Further optimization and the development of specialized models are necessary to ensure the accuracy and safety of the information provided.展开更多
The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.De...The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.展开更多
Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, ...Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.展开更多
ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential sec...ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.展开更多
The application of generative artificial intelligence(AI)is bringing about notable changes in anime creation.This paper surveys recent advancements and applications of diffusion and language models in anime generation...The application of generative artificial intelligence(AI)is bringing about notable changes in anime creation.This paper surveys recent advancements and applications of diffusion and language models in anime generation,focusing on their demonstrated potential to enhance production efficiency through automation and personalization.Despite these benefits,it is crucial to acknowledge the substantial initial computational investments required for training and deploying these models.We conduct an in-depth survey of cutting-edge generative AI technologies,encompassing models such as Stable Diffusion and GPT,and appraise pivotal large-scale datasets alongside quantifiable evaluation metrics.Review of the surveyed literature indicates the achievement of considerable maturity in the capacity of AI models to synthesize high-quality,aesthetically compelling anime visual images from textual prompts,alongside discernible progress in the generation of coherent narratives.However,achieving perfect long-form consistency,mitigating artifacts like flickering in video sequences,and enabling fine-grained artistic control remain critical ongoing challenges.Building upon these advancements,research efforts have increasingly pivoted towards the synthesis of higher-dimensional content,such as video and three-dimensional assets,with recent studies demonstrating significant progress in this burgeoning field.Nevertheless,formidable challenges endure amidst these advancements.Foremost among these are the substantial computational exigencies requisite for training and deploying these sophisticated models,particularly pronounced in the realm of high-dimensional generation such as video synthesis.Additional persistent hurdles include maintaining spatial-temporal consistency across complex scenes and mitigating ethical considerations surrounding bias and the preservation of human creative autonomy.This research underscores the transformative potential and inherent complexities of AI-driven synergy within the creative industries.We posit that future research should be dedicated to the synergistic fusion of diffusion and autoregressive models,the integration of multimodal inputs,and the balanced consideration of ethical implications,particularly regarding bias and the preservation of human creative autonomy,thereby establishing a robust foundation for the advancement of anime creation and the broader landscape of AI-driven content generation.展开更多
Design patterns offer reusable solutions for common software issues,enhancing quality.The advent of generative large language models(LLMs)marks progress in software development,but their efficacy in applying design pa...Design patterns offer reusable solutions for common software issues,enhancing quality.The advent of generative large language models(LLMs)marks progress in software development,but their efficacy in applying design patterns is not fully assessed.The recent introduction of generative large language models(LLMs)like ChatGPT and CoPilot has demonstrated significant promise in software development.They assist with a variety of tasks including code generation,modeling,bug fixing,and testing,leading to enhanced efficiency and productivity.Although initial uses of these LLMs have had a positive effect on software development,their potential influence on the application of design patterns remains unexplored.This study introduces a method to quantify LLMs’ability to implement design patterns,using Role-Based Metamodeling Language(RBML)for a rigorous specification of the pattern’s problem,solution,and transformation rules.The method evaluates the pattern applicability of a software application using the pattern’s problem specification.If deemed applicable,the application is input to the LLM for pattern application.The resulting application is assessed for conformance to the pattern’s solution specification and for completeness against the pattern’s transformation rules.Evaluating the method with ChatGPT 4 across three applications reveals ChatGPT’s high proficiency,achieving averages of 98%in conformance and 87%in completeness,thereby demonstrating the effectiveness of the method.Using RBML,this study confirms that LLMs,specifically ChatGPT 4,have great potential in effective and efficient application of design patterns with high conformance and completeness.This opens avenues for further integrating LLMs into complex software engineering processes.展开更多
Cardiac rehabilitation is a crucial multidisciplinary approach to improve patient outcomes.There is a growing body of evidence that suggests that these programs contribute towards reducing cardiovascular mortality and...Cardiac rehabilitation is a crucial multidisciplinary approach to improve patient outcomes.There is a growing body of evidence that suggests that these programs contribute towards reducing cardiovascular mortality and recurrence.Despite this,cardiac rehabilitation is underutilized and adherence to these programs has been a demonstrated barrier in achieving these outcomes.As a result,there is a growing focus on innovating these programs,especially from the standpoint of digital health and personalized medicine.This editorial discusses the possible roles of large language models,such as their role in ChatGPT,in further personalizing cardiac rehabilitation programs through simplifying medical jargon and employing motivational interviewing techniques,thus boosting patient engagement and adherence.However,these possibilities must be further investigated in the clinical literature.Likewise,the integration of large language models in cardiac rehabilitation will be challenging in its nascent stages to ensure accurate and ethical information delivery.展开更多
Psychiatric disorders constitute a complex health issue,primarily manifesting as significant disturbances in cognition,emotional regulation,and behavior.However,due to limited resources within health care systems,only...Psychiatric disorders constitute a complex health issue,primarily manifesting as significant disturbances in cognition,emotional regulation,and behavior.However,due to limited resources within health care systems,only a minority of patients can access effective treatment and care services,highlighting an urgent need for improvement.large language models(LLMs),with their natural language understanding and generation capabilities,are gradually penetrating the entire process of psychiatric diagnosis and treatment,including outpatient reception,diagnosis and therapy,clinical nursing,medication safety,and prognosis follow-up.They hold promise for improving the current severe shortage of health system resources and promoting equal access to mental health care.This article reviews the application scenarios and research progress of LLMs.It explores optimization methods for LLMs in psychiatry.Based on the research findings,we propose a clinical LLM for mental health using the Mixture of Experts framework to improve the accuracy of psychiatric diagnosis and therapeutic interventions.展开更多
This article evaluates the transformative potential of large language models(LLMs)as patient education tools for managing inflammatory bowel disease.The discussion highlights their ability to deliver nuanced and perso...This article evaluates the transformative potential of large language models(LLMs)as patient education tools for managing inflammatory bowel disease.The discussion highlights their ability to deliver nuanced and personalized infor-mation,addressing limitations in traditional educational materials.Key consider-ations include the necessity for domain-specific fine-tuning to enhance accuracy,the adoption of robust evaluation metrics beyond readability,and the integration of LLMs with clinical decision support systems to improve real-time patient education.Ethical and accessibility challenges,such as algorithmic bias,data privacy,and digital literacy,are also examined.Recommendations emphasize the importance of interdisciplinary collaboration to optimize LLM integration,en-suring equitable access and improved patient outcomes.By advancing LLM technology,healthcare can empower patients with accurate and personalized information,enhancing engagement and disease management.展开更多
Despite broad consensus on the importance of enterprise digital transformation,significant discrepancies persist regarding its actual effects.This divergence stems primarily from two key measurement challenges:(1)a la...Despite broad consensus on the importance of enterprise digital transformation,significant discrepancies persist regarding its actual effects.This divergence stems primarily from two key measurement challenges:(1)a lack of clear and consistent definitions of enterprise digital transformation,and(2)a lack of rigorous and accurate measurement methodologies.These shortcomings lead to research findings that are incomparable,difficult to replicate,and often conflicting.To effectively address the aforementioned challenges,this paper employs machine learning and large language models(LLMs)to construct a novel set of indicators for enterprise digital transformation.The work begins by manually annotating sentences from annual reports of listed companies in China from 2006 to 2020.These labeled sentences are then used to train and fine-tune several machine learning models,including LLMs.The ERNIE model,demonstrating the best classification performance among the models tested,is selected as the sentence classifier to predict sentence labels across the full text of the annual reports,ultimately constructing the enterprise digital transformation metrics.Both theoretical analysis and multiple data cross-validations demonstrate that the metrics developed in this paper are more accurate than existing approaches.Based on these metrics,the paper empirically examines the impact of enterprise digital transformation on financial performance.Our findings reveal three key points:(1)enterprise digital transformation significantly enhances financial performance,with big data,AI,mobile internet,cloud computing,and the Internet of Things(IoT)all playing a significant role;however,blockchain technology does not show a significant effect;(2)the significant positive effect of digital transformation on financial performance is primarily observed in firms with weaker initial financial performance;and(3)enterprise digital transformation improves financial performance mainly through enhancing efficiency and reducing costs.This research has practical implications for promoting enterprise digital transformation and fostering high-quality economic development.展开更多
This critical review provides an in-depth analysis of Large Language Models(LLMs),encompassing their foundational principles,diverse applications,and advanced training methodologies.We critically examine the evolution...This critical review provides an in-depth analysis of Large Language Models(LLMs),encompassing their foundational principles,diverse applications,and advanced training methodologies.We critically examine the evolution from Recurrent Neural Networks(RNNs)to Transformer models,highlighting the significant advancements and innovations in LLM architectures.The review explores state-of-the-art techniques such as in-context learning and various fine-tuning approaches,with an emphasis on optimizing parameter efficiency.We also discuss methods for aligning LLMs with human preferences,including reinforcement learning frameworks and human feedback mechanisms.The emerging technique of retrieval-augmented generation,which integrates external knowledge into LLMs,is also evaluated.Additionally,we address the ethical considerations of deploying LLMs,stressing the importance of responsible and mindful application.By identifying current gaps and suggesting future research directions,this review provides a comprehensive and critical overview of the present state and potential advancements in LLMs.This work serves as an insightful guide for researchers and practitioners in artificial intelligence,offering a unified perspective on the strengths,limitations,and future prospects of LLMs.展开更多
Artificial intelligence is reshaping radiology by enabling automated report generation,yet evaluating the clinical accuracy and relevance of these reports is a challenging task,as traditional natural language generati...Artificial intelligence is reshaping radiology by enabling automated report generation,yet evaluating the clinical accuracy and relevance of these reports is a challenging task,as traditional natural language generation metrics like BLEU and ROUGE prioritize lexical overlap over clinical relevance.To address this gap,we propose a novel semantic assessment framework for evaluating the accuracy of artificial intelligence-generated radiology reports against ground truth references.We trained 5229 image–report pairs from the Indiana University chest X-ray dataset on the R2GenRL model and generated a benchmark dataset on test data from the Indiana University chest X-ray and MIMIC-CXR datasets.These datasets were selected for their public availability,large scale,and comprehensive coverage of diverse clinical cases in chest radiography,enabling robust evaluation and comparison with prior work.Results demonstrate that the Mistral model,particularly with task-oriented prompting,achieves superior performance(up to 91.9%accuracy),surpassing other models and closely aligning with established metrics like BERTScore-F1(88.1%)and CLIP-Score(88.7%).Statistical analyses,including paired t-tests(p<0.01)and analysis of variance(p<0.05),confirm significant improvements driven by structured prompting.Failure case analysis reveals limitations,such as over-reliance on lexical similarity,underscoring the need for domain-specific fine-tuning.This framework advances the evaluation of artificial intelligence-driven(AI-driven)radiology report generation,offering a robust,clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.展开更多
Domain Generation Algorithms(DGAs)continue to pose a significant threat inmodernmalware infrastructures by enabling resilient and evasive communication with Command and Control(C&C)servers.Traditional detection me...Domain Generation Algorithms(DGAs)continue to pose a significant threat inmodernmalware infrastructures by enabling resilient and evasive communication with Command and Control(C&C)servers.Traditional detection methods-rooted in statistical heuristics,feature engineering,and shallow machine learning-struggle to adapt to the increasing sophistication,linguistic mimicry,and adversarial variability of DGA variants.The emergence of Large Language Models(LLMs)marks a transformative shift in this landscape.Leveraging deep contextual understanding,semantic generalization,and few-shot learning capabilities,LLMs such as BERT,GPT,and T5 have shown promising results in detecting both character-based and dictionary-based DGAs,including previously unseen(zeroday)variants.This paper provides a comprehensive and critical review of LLM-driven DGA detection,introducing a structured taxonomy of LLM architectures,evaluating the linguistic and behavioral properties of benchmark datasets,and comparing recent detection frameworks across accuracy,latency,robustness,and multilingual performance.We also highlight key limitations,including challenges in adversarial resilience,model interpretability,deployment scalability,and privacy risks.To address these gaps,we present a forward-looking research roadmap encompassing adversarial training,model compression,cross-lingual benchmarking,and real-time integration with SIEM/SOAR platforms.This survey aims to serve as a foundational resource for advancing the development of scalable,explainable,and operationally viable LLM-based DGA detection systems.展开更多
Background:With the rapid development of artificial intelligence(AI),large language models(LLMs)have emerged as a potent tool for invigorating ophthalmology across clinical,educational,and research fields.Their accura...Background:With the rapid development of artificial intelligence(AI),large language models(LLMs)have emerged as a potent tool for invigorating ophthalmology across clinical,educational,and research fields.Their accuracy and reliability have undergone tested.This bibliometric analysis aims to provide an overview of research on LLMs in ophthalmology from both thematic and geographical perspectives.Methods:All existing and highly cited LLM-related ophthalmology research papers published in English up to 24th April 2025 were sourced from Scopus,PubMed,and Web of Science.The characteristics of these publications,including publication output,authors,journals,countries,institutions,citations,and research domains,were analyzed using Biblioshiny and VOSviewer software.Results:A total of 277 articles from 1,459 authors and 89 journals were included in this study.Although relevant publications began to appear in 2019,there was a significant increase starting from 2023.He M and Shi D are the most prolific authors,while Investigative Ophthalmology&Visual Science stands out as the most prominent journal.Most of the top-publishing countries are high-income economies,with the USA taking the lead,and the University of California is the leading institution.VOSviewer identified 5 clusters in the keyword co-occurrence analysis,indicating that current research focuses on the clinical applications of LLMs,particularly in diagnosis and patient education.Conclusions:While LLMs have demonstrated effectiveness in retaining knowledge,their accuracy in image-based diagnosis remains limited.Therefore,future research should investigate fine-tuning strategies and domain-specific adaptations to close this gap.Although research on the applications of LLMs in ophthalmology is still in its early stages,it holds significant potential for advancing the field.展开更多
The advent of large language models(LLMs)has made knowledge acquisition and content creation increasingly easier and cheaper,which in turn redefines learning and urges transformation in software engineering education....The advent of large language models(LLMs)has made knowledge acquisition and content creation increasingly easier and cheaper,which in turn redefines learning and urges transformation in software engineering education.To do so,there is a need to understand the impact of LLMs on software engineering education.In this paper,we conducted a preliminary case study on three software requirements engineering classes where students are allowed to use LLMs to assist in their projects.Based on the students’experience,performance,and feedback from a survey conducted at the end of the courses,we characterized the challenges and benefits of applying LLMs in software engineering education.This research contributes to the ongoing discourse on the integration of LLMs in education,emphasizing both their prominent potential and the need for balanced,mindful usage.展开更多
Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing...Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing visual elements,and the correlation between different parts of the documents,as well as domain-specific semantics.Simply extracting text is not sufficient;advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately.This paper aims to evaluate the ability of the Large Language Models(LLMs)to correctly answer questions about various types of charts,comparing their performance when using images as input versus directly parsing PDF files.To retrieve the images from the PDF,ColPali,a model leveraging state-of-the-art visual languagemodels,is used to identify the relevant page containing the appropriate chart for each question.Google’s Gemini multimodal models were used to answer a set of questions through two approaches:1)processing images derived from PDF documents and 2)directly utilizing the content of the same PDFs.Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding(VrDU)and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks.Through structured benchmarking of chart question answering(CQA)across input formats,our work contributes to the advancement of chart understanding(CU)and the broader field of multimodal document analysis.Using two diverse and information-rich sources:the World Health Statistics 2024 report by theWorld Health Organisation and the Global Banking Annual Review 2024 by McKinsey&Company,we examine the performance ofmultimodal LLMs across different input modalities,comparing their effectiveness in processing charts as images versus parsing directly from PDF content.These documents were selected due to their multimodal nature,combining dense textual analysis with varied visual representations,thus presenting realistic challenges for vision-language models.This comparison is aimed at assessing how advanced models perform with different input formats and to determine if an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.展开更多
AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surfa...AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surface diseases:ChatGPT-4,ChatGPT-3.5,Claude 2,PaLM2,and SenseNova.METHODS:A group of experienced ophthalmology professors were asked to develop a 100-question singlechoice question on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions.The exam includes questions on the following topics:keratitis disease(20 questions),keratoconus,keratomalaciac,corneal dystrophy,corneal degeneration,erosive corneal ulcers,and corneal lesions associated with systemic diseases(20 questions),conjunctivitis disease(20 questions),trachoma,pterygoid and conjunctival tumor diseases(20 questions),and dry eye disease(20 questions).Then the total score of each LLMs and compared their mean score,mean correlation,variance,and confidence were calculated.RESULTS:GPT-4 exhibited the highest performance in terms of LLMs.Comparing the average scores of the LLMs group with the four human groups,chief physician,attending physician,regular trainee,and graduate student,it was found that except for ChatGPT-4,the total score of the rest of the LLMs is lower than that of the graduate student group,which had the lowest score in the human group.Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers,giving very little chance of an incorrect answer.ChatGPT-4 showed higher credibility when answering questions,with a success rate of 59%,but gave the wrong answer to the question 28% of the time.CONCLUSION:GPT-4 model exhibits excellent performance in both answer relevance and confidence.PaLM2 shows a positive correlation(up to 0.8)in terms of answer accuracy during the exam.In terms of answer confidence,PaLM2 is second only to GPT4 and surpasses Claude 2,SenseNova,and GPT-3.5.Despite the fact that ocular surface disease is a highly specialized discipline,GPT-4 still exhibits superior performance,suggesting that its potential and ability to be applied in this field is enormous,perhaps with the potential to be a valuable resource for medical students and clinicians in the future.展开更多
文摘We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of these models and their ability to perform the task of abstractive text summarization in the healthcare field.The research hypothesis was that large language models could perform high-quality abstractive text summarization on German technical healthcare texts,even if the model is not specifically trained in that language.Through experiments,the research questions explore the performance of transformer language models in dealing with complex syntax constructs,the difference in performance between models trained in English and German,and the impact of translating the source text to English before conducting the summarization.We conducted an evaluation of four PLMs(GPT-3,a translation-based approach also utilizing GPT-3,a German language Model,and a domain-specific bio-medical model approach).The evaluation considered the informativeness using 3 types of metrics based on Recall-Oriented Understudy for Gisting Evaluation(ROUGE)and the quality of results which is manually evaluated considering 5 aspects.The results show that text summarization models could be used in the German healthcare domain and that domain-independent language models achieved the best results.The study proves that text summarization models can simplify the search for pre-existing German knowledge in various domains.
基金supported by National Natural Science Foundation of China(62376219 and 62006194)Foundational Research Project in Specialized Discipline(Grant No.G2024WD0146)Faculty Construction Project(Grant No.24GH0201148).
文摘Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.
文摘Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether Large Language Models(LLMs)can play a role in this process.Design/methodology/approach:This article assesses which ChatGPT inputs(full text without tables,figures,and references;title and abstract;title only)produce better quality score estimates,and the extent to which scores are affected by ChatGPT models and system prompts.Findings:The optimal input is the article title and abstract,with average ChatGPT scores based on these(30 iterations on a dataset of 51 papers)correlating at 0.67 with human scores,the highest ever reported.ChatGPT 4o is slightly better than 3.5-turbo(0.66),and 4o-mini(0.66).Research limitations:The data is a convenience sample of the work of a single author,it only includes one field,and the scores are self-evaluations.Practical implications:The results suggest that article full texts might confuse LLM research quality evaluations,even though complex system instructions for the task are more effective than simple ones.Thus,whilst abstracts contain insufficient information for a thorough assessment of rigour,they may contain strong pointers about originality and significance.Finally,linear regression can be used to convert the model scores into the human scale scores,which is 31%more accurate than guessing.Originality/value:This is the first systematic comparison of the impact of different prompts,parameters and inputs for ChatGPT research quality evaluations.
基金Supported by the China Health Promotion Foundation Young Doctors'Research Foundation for Inflammatory Bowel Disease,the Taishan Scholars Program of Shandong Province,China,No.tsqn202306343National Natural Science Foundation of China,No.82270578.
文摘BACKGROUND Inflammatory bowel disease(IBD)is a global health burden that affects millions of individuals worldwide,necessitating extensive patient education.Large language models(LLMs)hold promise for addressing patient information needs.However,LLM use to deliver accurate and comprehensible IBD-related medical information has yet to be thoroughly investigated.AIM To assess the utility of three LLMs(ChatGPT-4.0,Claude-3-Opus,and Gemini-1.5-Pro)as a reference point for patients with IBD.METHODS In this comparative study,two gastroenterology experts generated 15 IBD-related questions that reflected common patient concerns.These questions were used to evaluate the performance of the three LLMs.The answers provided by each model were independently assessed by three IBD-related medical experts using a Likert scale focusing on accuracy,comprehensibility,and correlation.Simultaneously,three patients were invited to evaluate the comprehensibility of their answers.Finally,a readability assessment was performed.RESULTS Overall,each of the LLMs achieved satisfactory levels of accuracy,comprehensibility,and completeness when answering IBD-related questions,although their performance varies.All of the investigated models demonstrated strengths in providing basic disease information such as IBD definition as well as its common symptoms and diagnostic methods.Nevertheless,when dealing with more complex medical advice,such as medication side effects,dietary adjustments,and complication risks,the quality of answers was inconsistent between the LLMs.Notably,Claude-3-Opus generated answers with better readability than the other two models.CONCLUSION LLMs have the potential as educational tools for patients with IBD;however,there are discrepancies between the models.Further optimization and the development of specialized models are necessary to ensure the accuracy and safety of the information provided.
基金supported by the National Key R&D Program of China under Grant No.2022YFB3103500the National Natural Science Foundation of China under Grants No.62402087 and No.62020106013+3 种基金the Sichuan Science and Technology Program under Grant No.2023ZYD0142the Chengdu Science and Technology Program under Grant No.2023-XT00-00002-GXthe Fundamental Research Funds for Chinese Central Universities under Grants No.ZYGX2020ZB027 and No.Y030232063003002the Postdoctoral Innovation Talents Support Program under Grant No.BX20230060.
文摘The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.
文摘Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.
文摘ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.
基金supported by the National Natural Science Foundation of China(Grant No.62202210).
文摘The application of generative artificial intelligence(AI)is bringing about notable changes in anime creation.This paper surveys recent advancements and applications of diffusion and language models in anime generation,focusing on their demonstrated potential to enhance production efficiency through automation and personalization.Despite these benefits,it is crucial to acknowledge the substantial initial computational investments required for training and deploying these models.We conduct an in-depth survey of cutting-edge generative AI technologies,encompassing models such as Stable Diffusion and GPT,and appraise pivotal large-scale datasets alongside quantifiable evaluation metrics.Review of the surveyed literature indicates the achievement of considerable maturity in the capacity of AI models to synthesize high-quality,aesthetically compelling anime visual images from textual prompts,alongside discernible progress in the generation of coherent narratives.However,achieving perfect long-form consistency,mitigating artifacts like flickering in video sequences,and enabling fine-grained artistic control remain critical ongoing challenges.Building upon these advancements,research efforts have increasingly pivoted towards the synthesis of higher-dimensional content,such as video and three-dimensional assets,with recent studies demonstrating significant progress in this burgeoning field.Nevertheless,formidable challenges endure amidst these advancements.Foremost among these are the substantial computational exigencies requisite for training and deploying these sophisticated models,particularly pronounced in the realm of high-dimensional generation such as video synthesis.Additional persistent hurdles include maintaining spatial-temporal consistency across complex scenes and mitigating ethical considerations surrounding bias and the preservation of human creative autonomy.This research underscores the transformative potential and inherent complexities of AI-driven synergy within the creative industries.We posit that future research should be dedicated to the synergistic fusion of diffusion and autoregressive models,the integration of multimodal inputs,and the balanced consideration of ethical implications,particularly regarding bias and the preservation of human creative autonomy,thereby establishing a robust foundation for the advancement of anime creation and the broader landscape of AI-driven content generation.
文摘Design patterns offer reusable solutions for common software issues,enhancing quality.The advent of generative large language models(LLMs)marks progress in software development,but their efficacy in applying design patterns is not fully assessed.The recent introduction of generative large language models(LLMs)like ChatGPT and CoPilot has demonstrated significant promise in software development.They assist with a variety of tasks including code generation,modeling,bug fixing,and testing,leading to enhanced efficiency and productivity.Although initial uses of these LLMs have had a positive effect on software development,their potential influence on the application of design patterns remains unexplored.This study introduces a method to quantify LLMs’ability to implement design patterns,using Role-Based Metamodeling Language(RBML)for a rigorous specification of the pattern’s problem,solution,and transformation rules.The method evaluates the pattern applicability of a software application using the pattern’s problem specification.If deemed applicable,the application is input to the LLM for pattern application.The resulting application is assessed for conformance to the pattern’s solution specification and for completeness against the pattern’s transformation rules.Evaluating the method with ChatGPT 4 across three applications reveals ChatGPT’s high proficiency,achieving averages of 98%in conformance and 87%in completeness,thereby demonstrating the effectiveness of the method.Using RBML,this study confirms that LLMs,specifically ChatGPT 4,have great potential in effective and efficient application of design patterns with high conformance and completeness.This opens avenues for further integrating LLMs into complex software engineering processes.
文摘Cardiac rehabilitation is a crucial multidisciplinary approach to improve patient outcomes.There is a growing body of evidence that suggests that these programs contribute towards reducing cardiovascular mortality and recurrence.Despite this,cardiac rehabilitation is underutilized and adherence to these programs has been a demonstrated barrier in achieving these outcomes.As a result,there is a growing focus on innovating these programs,especially from the standpoint of digital health and personalized medicine.This editorial discusses the possible roles of large language models,such as their role in ChatGPT,in further personalizing cardiac rehabilitation programs through simplifying medical jargon and employing motivational interviewing techniques,thus boosting patient engagement and adherence.However,these possibilities must be further investigated in the clinical literature.Likewise,the integration of large language models in cardiac rehabilitation will be challenging in its nascent stages to ensure accurate and ethical information delivery.
基金Supported by the STI2030-Major Projects,No.2021ZD0203400 and No.2021ZD0200800the National Natural Science Foundation of China,No.82171477。
文摘Psychiatric disorders constitute a complex health issue,primarily manifesting as significant disturbances in cognition,emotional regulation,and behavior.However,due to limited resources within health care systems,only a minority of patients can access effective treatment and care services,highlighting an urgent need for improvement.large language models(LLMs),with their natural language understanding and generation capabilities,are gradually penetrating the entire process of psychiatric diagnosis and treatment,including outpatient reception,diagnosis and therapy,clinical nursing,medication safety,and prognosis follow-up.They hold promise for improving the current severe shortage of health system resources and promoting equal access to mental health care.This article reviews the application scenarios and research progress of LLMs.It explores optimization methods for LLMs in psychiatry.Based on the research findings,we propose a clinical LLM for mental health using the Mixture of Experts framework to improve the accuracy of psychiatric diagnosis and therapeutic interventions.
文摘This article evaluates the transformative potential of large language models(LLMs)as patient education tools for managing inflammatory bowel disease.The discussion highlights their ability to deliver nuanced and personalized infor-mation,addressing limitations in traditional educational materials.Key consider-ations include the necessity for domain-specific fine-tuning to enhance accuracy,the adoption of robust evaluation metrics beyond readability,and the integration of LLMs with clinical decision support systems to improve real-time patient education.Ethical and accessibility challenges,such as algorithmic bias,data privacy,and digital literacy,are also examined.Recommendations emphasize the importance of interdisciplinary collaboration to optimize LLM integration,en-suring equitable access and improved patient outcomes.By advancing LLM technology,healthcare can empower patients with accurate and personalized information,enhancing engagement and disease management.
基金supported by the Fundamental Research Funds for the Central Universitiesfollowing projects:the Major Project of the National Social Science Fund of China(NSSFC)“Research on the Synergistic Mechanisms of Innovation and Governance for High-Quality Development of the Digital Economy”(Grant No.22&ZD070)+1 种基金the Youth Project of the National Natural Science Foundation of China(NSFC)“Research on Risk-Taking of Zombie Enterprises from a Government-Enterprise Interaction Perspective:Tendency,Behavioral Patterns,and Economic Consequences”(Grant No.72002213)the General Program of the National Natural Science Foundation of China(NSFC)“Reshaping Enterprise Nature,Boundaries,and Internal Organization in the Digital Economy”(Grant No.72273144).
文摘Despite broad consensus on the importance of enterprise digital transformation,significant discrepancies persist regarding its actual effects.This divergence stems primarily from two key measurement challenges:(1)a lack of clear and consistent definitions of enterprise digital transformation,and(2)a lack of rigorous and accurate measurement methodologies.These shortcomings lead to research findings that are incomparable,difficult to replicate,and often conflicting.To effectively address the aforementioned challenges,this paper employs machine learning and large language models(LLMs)to construct a novel set of indicators for enterprise digital transformation.The work begins by manually annotating sentences from annual reports of listed companies in China from 2006 to 2020.These labeled sentences are then used to train and fine-tune several machine learning models,including LLMs.The ERNIE model,demonstrating the best classification performance among the models tested,is selected as the sentence classifier to predict sentence labels across the full text of the annual reports,ultimately constructing the enterprise digital transformation metrics.Both theoretical analysis and multiple data cross-validations demonstrate that the metrics developed in this paper are more accurate than existing approaches.Based on these metrics,the paper empirically examines the impact of enterprise digital transformation on financial performance.Our findings reveal three key points:(1)enterprise digital transformation significantly enhances financial performance,with big data,AI,mobile internet,cloud computing,and the Internet of Things(IoT)all playing a significant role;however,blockchain technology does not show a significant effect;(2)the significant positive effect of digital transformation on financial performance is primarily observed in firms with weaker initial financial performance;and(3)enterprise digital transformation improves financial performance mainly through enhancing efficiency and reducing costs.This research has practical implications for promoting enterprise digital transformation and fostering high-quality economic development.
文摘This critical review provides an in-depth analysis of Large Language Models(LLMs),encompassing their foundational principles,diverse applications,and advanced training methodologies.We critically examine the evolution from Recurrent Neural Networks(RNNs)to Transformer models,highlighting the significant advancements and innovations in LLM architectures.The review explores state-of-the-art techniques such as in-context learning and various fine-tuning approaches,with an emphasis on optimizing parameter efficiency.We also discuss methods for aligning LLMs with human preferences,including reinforcement learning frameworks and human feedback mechanisms.The emerging technique of retrieval-augmented generation,which integrates external knowledge into LLMs,is also evaluated.Additionally,we address the ethical considerations of deploying LLMs,stressing the importance of responsible and mindful application.By identifying current gaps and suggesting future research directions,this review provides a comprehensive and critical overview of the present state and potential advancements in LLMs.This work serves as an insightful guide for researchers and practitioners in artificial intelligence,offering a unified perspective on the strengths,limitations,and future prospects of LLMs.
基金supported by the Institute of Information&Communications Technology Planning&Evaluation(IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government(MSIT)(IITP-2024-RS-2024-00436773).
文摘Artificial intelligence is reshaping radiology by enabling automated report generation,yet evaluating the clinical accuracy and relevance of these reports is a challenging task,as traditional natural language generation metrics like BLEU and ROUGE prioritize lexical overlap over clinical relevance.To address this gap,we propose a novel semantic assessment framework for evaluating the accuracy of artificial intelligence-generated radiology reports against ground truth references.We trained 5229 image–report pairs from the Indiana University chest X-ray dataset on the R2GenRL model and generated a benchmark dataset on test data from the Indiana University chest X-ray and MIMIC-CXR datasets.These datasets were selected for their public availability,large scale,and comprehensive coverage of diverse clinical cases in chest radiography,enabling robust evaluation and comparison with prior work.Results demonstrate that the Mistral model,particularly with task-oriented prompting,achieves superior performance(up to 91.9%accuracy),surpassing other models and closely aligning with established metrics like BERTScore-F1(88.1%)and CLIP-Score(88.7%).Statistical analyses,including paired t-tests(p<0.01)and analysis of variance(p<0.05),confirm significant improvements driven by structured prompting.Failure case analysis reveals limitations,such as over-reliance on lexical similarity,underscoring the need for domain-specific fine-tuning.This framework advances the evaluation of artificial intelligence-driven(AI-driven)radiology report generation,offering a robust,clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.
基金the Deanship of Scientific Research at King Khalid University for funding this work through large group under grant number(GRP.2/663/46).
文摘Domain Generation Algorithms(DGAs)continue to pose a significant threat inmodernmalware infrastructures by enabling resilient and evasive communication with Command and Control(C&C)servers.Traditional detection methods-rooted in statistical heuristics,feature engineering,and shallow machine learning-struggle to adapt to the increasing sophistication,linguistic mimicry,and adversarial variability of DGA variants.The emergence of Large Language Models(LLMs)marks a transformative shift in this landscape.Leveraging deep contextual understanding,semantic generalization,and few-shot learning capabilities,LLMs such as BERT,GPT,and T5 have shown promising results in detecting both character-based and dictionary-based DGAs,including previously unseen(zeroday)variants.This paper provides a comprehensive and critical review of LLM-driven DGA detection,introducing a structured taxonomy of LLM architectures,evaluating the linguistic and behavioral properties of benchmark datasets,and comparing recent detection frameworks across accuracy,latency,robustness,and multilingual performance.We also highlight key limitations,including challenges in adversarial resilience,model interpretability,deployment scalability,and privacy risks.To address these gaps,we present a forward-looking research roadmap encompassing adversarial training,model compression,cross-lingual benchmarking,and real-time integration with SIEM/SOAR platforms.This survey aims to serve as a foundational resource for advancing the development of scalable,explainable,and operationally viable LLM-based DGA detection systems.
基金supported by Health and Medical Research Fund,Hong Kong(11220386,12230246).
文摘Background:With the rapid development of artificial intelligence(AI),large language models(LLMs)have emerged as a potent tool for invigorating ophthalmology across clinical,educational,and research fields.Their accuracy and reliability have undergone tested.This bibliometric analysis aims to provide an overview of research on LLMs in ophthalmology from both thematic and geographical perspectives.Methods:All existing and highly cited LLM-related ophthalmology research papers published in English up to 24th April 2025 were sourced from Scopus,PubMed,and Web of Science.The characteristics of these publications,including publication output,authors,journals,countries,institutions,citations,and research domains,were analyzed using Biblioshiny and VOSviewer software.Results:A total of 277 articles from 1,459 authors and 89 journals were included in this study.Although relevant publications began to appear in 2019,there was a significant increase starting from 2023.He M and Shi D are the most prolific authors,while Investigative Ophthalmology&Visual Science stands out as the most prominent journal.Most of the top-publishing countries are high-income economies,with the USA taking the lead,and the University of California is the leading institution.VOSviewer identified 5 clusters in the keyword co-occurrence analysis,indicating that current research focuses on the clinical applications of LLMs,particularly in diagnosis and patient education.Conclusions:While LLMs have demonstrated effectiveness in retaining knowledge,their accuracy in image-based diagnosis remains limited.Therefore,future research should investigate fine-tuning strategies and domain-specific adaptations to close this gap.Although research on the applications of LLMs in ophthalmology is still in its early stages,it holds significant potential for advancing the field.
基金supported in part by the Teaching Reform Project of Chongqing University of Posts and Telecommunications,China under Grant No.XJG23234Chongqing Municipal Higher Education Teaching Reform Research Project under Grant No.203399the Doctoral Direct Train Project of Chongqing Science and Technology Bureau under Grant No.CSTB2022BSXM-JSX0007。
文摘The advent of large language models(LLMs)has made knowledge acquisition and content creation increasingly easier and cheaper,which in turn redefines learning and urges transformation in software engineering education.To do so,there is a need to understand the impact of LLMs on software engineering education.In this paper,we conducted a preliminary case study on three software requirements engineering classes where students are allowed to use LLMs to assist in their projects.Based on the students’experience,performance,and feedback from a survey conducted at the end of the courses,we characterized the challenges and benefits of applying LLMs in software engineering education.This research contributes to the ongoing discourse on the integration of LLMs in education,emphasizing both their prominent potential and the need for balanced,mindful usage.
基金supported by a grant from the Ministry of Research,Innovation and Digitization,CNCS/CCCDI-UEFISCDI,project number COFUND-CETP-SMART-LEM-1,within PNCDI Ⅳ.
文摘Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing visual elements,and the correlation between different parts of the documents,as well as domain-specific semantics.Simply extracting text is not sufficient;advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately.This paper aims to evaluate the ability of the Large Language Models(LLMs)to correctly answer questions about various types of charts,comparing their performance when using images as input versus directly parsing PDF files.To retrieve the images from the PDF,ColPali,a model leveraging state-of-the-art visual languagemodels,is used to identify the relevant page containing the appropriate chart for each question.Google’s Gemini multimodal models were used to answer a set of questions through two approaches:1)processing images derived from PDF documents and 2)directly utilizing the content of the same PDFs.Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding(VrDU)and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks.Through structured benchmarking of chart question answering(CQA)across input formats,our work contributes to the advancement of chart understanding(CU)and the broader field of multimodal document analysis.Using two diverse and information-rich sources:the World Health Statistics 2024 report by theWorld Health Organisation and the Global Banking Annual Review 2024 by McKinsey&Company,we examine the performance ofmultimodal LLMs across different input modalities,comparing their effectiveness in processing charts as images versus parsing directly from PDF content.These documents were selected due to their multimodal nature,combining dense textual analysis with varied visual representations,thus presenting realistic challenges for vision-language models.This comparison is aimed at assessing how advanced models perform with different input formats and to determine if an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
基金Supported by National Natural Science Foundation of China(No.82160195,No.82460203)Degree and Postgraduate Education Teaching Reform Project of Jiangxi Province(No.JXYJG-2020-026).
文摘AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surface diseases:ChatGPT-4,ChatGPT-3.5,Claude 2,PaLM2,and SenseNova.METHODS:A group of experienced ophthalmology professors were asked to develop a 100-question singlechoice question on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions.The exam includes questions on the following topics:keratitis disease(20 questions),keratoconus,keratomalaciac,corneal dystrophy,corneal degeneration,erosive corneal ulcers,and corneal lesions associated with systemic diseases(20 questions),conjunctivitis disease(20 questions),trachoma,pterygoid and conjunctival tumor diseases(20 questions),and dry eye disease(20 questions).Then the total score of each LLMs and compared their mean score,mean correlation,variance,and confidence were calculated.RESULTS:GPT-4 exhibited the highest performance in terms of LLMs.Comparing the average scores of the LLMs group with the four human groups,chief physician,attending physician,regular trainee,and graduate student,it was found that except for ChatGPT-4,the total score of the rest of the LLMs is lower than that of the graduate student group,which had the lowest score in the human group.Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers,giving very little chance of an incorrect answer.ChatGPT-4 showed higher credibility when answering questions,with a success rate of 59%,but gave the wrong answer to the question 28% of the time.CONCLUSION:GPT-4 model exhibits excellent performance in both answer relevance and confidence.PaLM2 shows a positive correlation(up to 0.8)in terms of answer accuracy during the exam.In terms of answer confidence,PaLM2 is second only to GPT4 and surpasses Claude 2,SenseNova,and GPT-3.5.Despite the fact that ocular surface disease is a highly specialized discipline,GPT-4 still exhibits superior performance,suggesting that its potential and ability to be applied in this field is enormous,perhaps with the potential to be a valuable resource for medical students and clinicians in the future.