In the process of constructing domain-specific knowledge graphs,the task of relational triple extraction plays a critical role in transforming unstructured text into structured information.Existing relational triple e...In the process of constructing domain-specific knowledge graphs,the task of relational triple extraction plays a critical role in transforming unstructured text into structured information.Existing relational triple extraction models facemultiple challenges when processing domain-specific data,including insufficient utilization of semantic interaction information between entities and relations,difficulties in handling challenging samples,and the scarcity of domain-specific datasets.To address these issues,our study introduces three innovative components:Relation semantic enhancement,data augmentation,and a voting strategy,all designed to significantly improve the model’s performance in tackling domain-specific relational triple extraction tasks.We first propose an innovative attention interaction module.This method significantly enhances the semantic interaction capabilities between entities and relations by integrating semantic information fromrelation labels.Second,we propose a voting strategy that effectively combines the strengths of large languagemodels(LLMs)and fine-tuned small pre-trained language models(SLMs)to reevaluate challenging samples,thereby improving the model’s adaptability in specific domains.Additionally,we explore the use of LLMs for data augmentation,aiming to generate domain-specific datasets to alleviate the scarcity of domain data.Experiments conducted on three domain-specific datasets demonstrate that our model outperforms existing comparative models in several aspects,with F1 scores exceeding the State of the Art models by 2%,1.6%,and 0.6%,respectively,validating the effectiveness and generalizability of our approach.展开更多
We present an approach to classify medical text at a sentence level automatically.Given the inherent complexity of medical text classification,we employ adapters based on pre-trained language models to extract informa...We present an approach to classify medical text at a sentence level automatically.Given the inherent complexity of medical text classification,we employ adapters based on pre-trained language models to extract information from medical text,facilitating more accurate classification while minimizing the number of trainable parameters.Extensive experiments conducted on various datasets demonstrate the effectiveness of our approach.展开更多
Current experimental and computational methods have limitations in accurately and efficiently classifying ion channels within vast protein spaces.Here we have developed a deep learning algorithm,GPT2 Ion Channel Class...Current experimental and computational methods have limitations in accurately and efficiently classifying ion channels within vast protein spaces.Here we have developed a deep learning algorithm,GPT2 Ion Channel Classifier(GPT2-ICC),which effectively distinguishing ion channels from a test set containing approximately 239 times more non-ion-channel proteins.GPT2-ICC integrates representation learning with a large language model(LLM)-based classifier,enabling highly accurate identification of potential ion channels.Several potential ion channels were predicated from the unannotated human proteome,further demonstrating GPT2-ICC’s generalization ability.This study marks a significant advancement in artificial-intelligence-driven ion channel research,highlighting the adaptability and effectiveness of combining representation learning with LLMs to address the challenges of imbalanced protein sequence data.Moreover,it provides a valuable computational tool for uncovering previously uncharacterized ion channels.展开更多
We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of t...We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of these models and their ability to perform the task of abstractive text summarization in the healthcare field.The research hypothesis was that large language models could perform high-quality abstractive text summarization on German technical healthcare texts,even if the model is not specifically trained in that language.Through experiments,the research questions explore the performance of transformer language models in dealing with complex syntax constructs,the difference in performance between models trained in English and German,and the impact of translating the source text to English before conducting the summarization.We conducted an evaluation of four PLMs(GPT-3,a translation-based approach also utilizing GPT-3,a German language Model,and a domain-specific bio-medical model approach).The evaluation considered the informativeness using 3 types of metrics based on Recall-Oriented Understudy for Gisting Evaluation(ROUGE)and the quality of results which is manually evaluated considering 5 aspects.The results show that text summarization models could be used in the German healthcare domain and that domain-independent language models achieved the best results.The study proves that text summarization models can simplify the search for pre-existing German knowledge in various domains.展开更多
Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interact...Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interactions to predict future items of interest.However,many current methods rely on unique user and item IDs,limiting their ability to represent users and items effectively,especially in zero-shot learning scenarios where training data is scarce.With the rapid development of Large Language Models(LLMs),researchers are exploring their potential to enhance recommendation systems.However,there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems,where items are typically indexed by IDs.Moreover,most research focuses on item representations,neglecting personalized user modeling.To address these issues,we propose a sequential recommendation framework using LLMs,called CIT-Rec,a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations.Specifically,by aligning intuitive image information with text containing semantic features,we can more accurately represent items,improving item representation quality.We focus not only on item representations but also on user representations.To more precisely capture users’personalized preferences,we use traditional sequential recommendation models to train on users’historical interaction data,effectively capturing behavioral patterns.Finally,by combining LLMs and traditional sequential recommendation models,we allow the LLM to understand linguistic semantics while capturing collaborative semantics.Extensive evaluations on real-world datasets show that our model outperforms baseline methods,effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.展开更多
This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to use...This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.展开更多
Knowledge distillation has become a standard technique for compressing large language models into efficient student models,but existing methods often struggle to balance prediction accuracy with explanation quality.Re...Knowledge distillation has become a standard technique for compressing large language models into efficient student models,but existing methods often struggle to balance prediction accuracy with explanation quality.Recent approaches such as Distilling Step-by-Step(DSbS)introduce explanation supervision,yet they apply it in a uniform manner that may not fully exploit the different learning dynamics of prediction and explanation.In this work,we propose a task-structured curriculum learning(TSCL)framework that structures training into three sequential phases:(i)prediction-only,to establish stable feature representations;(ii)joint prediction-explanation,to align task outputs with rationale generation;and(iii)explanation-only,to refine the quality of rationales.This design provides a simple but effective modification to DSbS,requiring no architectural changes and adding negligible training cost.We justify the phase scheduling with ablation studies and convergence analysis,showing that an initial prediction-heavy stage followed by a balanced joint phase improves both stability and explanation alignment.Extensive experiments on five datasets(e-SNLI,ANLI,CommonsenseQA,SVAMP,and MedNLI)demonstrate that TSCL consistently outperforms strong baselines,achieving gains of+1.7-2.6 points in accuracy and 0.8-1.2 in ROUGE-L,corresponding to relative error reductions of up to 21%.Beyond lexical metrics,human evaluation and ERASERstyle faithfulness diagnostics confirm that TSCL produces more faithful and informative explanations.Comparative training curves further reveal faster convergence and lower variance across seeds.Efficiency analysis shows less than 3%overhead in wall-clock training time and no additional inference cost,making the approach practical for realworld deployment.This study demonstrates that a simple task-structured curriculum can significantly improve the effectiveness of knowledge distillation.By separating and sequencing objectives,TSCL achieves a better balance between accuracy,stability,and explanation quality.The framework generalizes across domains,including medical NLI,and offers a principled recipe for future applications in multimodal reasoning and reinforcement learning.展开更多
The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in S...The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources.In this paper,we compare two predominant AI-based approaches for the forensic detection of malicious hate speech:(1)finetuning encoder-only models that have been trained in Spanish and(2)In-Context Learning techniques(Zero-and Few-Shot Learning)with large-scale language models.Our approach goes beyond binary classification,proposing a comprehensive,multidimensional evaluation that labels each text by:(1)type of speech,(2)recipient,(3)level of intensity(ordinal)and(4)targeted group(multi-label).Performance is evaluated using an annotated Spanish corpus,standard metrics such as precision,recall and F1-score and stability-oriented metrics to evaluate the stability of the transition from zero-shot to few-shot prompting(Zero-to-Few Shot Retention and Zero-to-Few Shot Gain)are applied.The results indicate that fine-tuned encoder-only models(notably MarIA and BETO variants)consistently deliver the strongest and most reliable performance:in our experiments their macro F1-scores lie roughly in the range of approximately 46%–66%depending on the task.Zero-shot approaches are much less stable and typically yield substantially lower performance(observed F1-scores range approximately 0%–39%),often producing invalid outputs in practice.Few-shot prompting(e.g.,Qwen 38B,Mistral 7B)generally improves stability and recall relative to pure zero-shot,bringing F1-scores into a moderate range of approximately 20%–51%but still falling short of fully fine-tuned models.These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.展开更多
War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient an...War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient and inflexible,with particularly pronounced limitations in command and decision-making.The overwhelming volume of information and high decision complexity hinder the realization of autonomous and agile command and control.To address this challenge,an intelligent warfare simulation framework named Command-Agent is proposed,which deeply integrates large language models(LLMs)with digital twin battlefields.By constructing a highly realistic battlefield environment through real-time simulation and multi-source data fusion,the natural language interaction capabilities of LLMs are leveraged to lower the command threshold and to enable autonomous command through the Observe-Orient-Decide-Act(OODA)feedback loop.Within the Command-Agent framework,a multimodel collaborative architecture is further adopted to decouple the decision-generation and command-execution functions of LLMs.By combining specialized models such as Deep Seek-R1 and MCTool,the limitations of single-model capabilities are overcome.MCTool is a lightweight execution model fine-tuned for military Function Calling tasks.The framework also introduces a Vector Knowledge Base to mitigate hallucinations commonly exhibited by LLMs.Experimental results demonstrate that Command-Agent not only enables natural language-driven simulation and control but also deeply understands commander intent.Leveraging the multi-model collaborative architecture,during red-blue UAV confrontations involving 2 to 8 UAVs,the integrated score is improved by an average of 41.8%compared to the single-agent system(MCTool),accompanied by a 161.8%optimization in the battle loss ratio.Furthermore,when compared with multi-agent systems lacking the knowledge base,the inclusion of the Vector Knowledge Base further improves overall performance by 16.8%.In comparison with the general model(Qwen2.5-7B),the fine-tuned MCTool leads by 5%in execution efficiency.Therefore,the proposed Command-Agent introduces a novel perspective to the military command system and offers a feasible solution for intelligent battlefield decision-making.展开更多
Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that man...Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that manipulate model behavior through malicious instructions.Following Kitchenham’s guidelines,this systematic review synthesizes 128 peer-reviewed studies from 2022 to 2025 to provide a unified understanding of this rapidly evolving threat landscape.Our findings reveal a swift progression from simple direct injections to sophisticated multimodal attacks,achieving over 90%success rates against unprotected systems.In response,defense mechanisms show varying effectiveness:input preprocessing achieves 60%–80%detection rates and advanced architectural defenses demonstrate up to 95%protection against known patterns,though significant gaps persist against novel attack vectors.We identified 37 distinct defense approaches across three categories,but standardized evaluation frameworks remain limited.Our analysis attributes these vulnerabilities to fundamental LLM architectural limitations,such as the inability to distinguish instructions from data and attention mechanism vulnerabilities.This highlights critical research directions such as formal verification methods,standardized evaluation protocols,and architectural innovations for inherently secure LLM designs.展开更多
Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lack...Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lacks formal task definitions and corresponding benchmarks.To bridge this gap,we define the Automated Online Public Opinion Report Generation(OPOR-Gen)task and construct OPOR-Bench,an event-centric dataset with 463 crisis events across 108 countries(comprising 8.8 K news articles and 185 K tweets).To evaluate report quality,we propose OPOR-Eval,a novel agent-based framework that simulates human expert evaluation.Validation experiments show OPOR-Eval achieves a high Spearman’s correlation(ρ=0.70)with human judgments,though challenges in temporal reasoning persist.This work establishes an initial foundation for advancing automated public opinion reporting research.展开更多
Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstan...Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstanding performance across various domains,thereby prompting researchers to investigate their applicability in recommendation systems.However,due to the lack of task-specific knowledge and an inefficient feature extraction process,LLMs still have suboptimal performance in recommendation tasks.Therefore,external knowledge sources,such as knowledge graphs(KGs)and knowledge bases(KBs),are often introduced to address the issue of data sparsity.Compared to KGs,KBs possess higher retrieval efficiency,making them more suitable for scenarios where LLMs serve as recommenders.To this end,we introduce a novel framework integrating LLMs with KBs for enhanced retrieval generation,namely LLMKB.LLMKB initially leverages structured knowledge to create mapping dictionaries,extracting entity-relation information from heterogeneous knowledge to construct KBs.Then,LLMKB achieves the embedding calibration between user information representations and documents in KBs through retrieval model fine-tuning.Finally,LLMKB employs retrievalaugmented generation to produce recommendations based on fused text inputs,followed by post-processing.Experiment results on two public CRS datasets demonstrate the effectiveness of our framework.Our code is publicly available at the link:https://anonymous.4open.science/r/LLMKB-6FD0.展开更多
Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLM...Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.展开更多
Medical named entity recognition(NER)is an area in which medical named entities are recognized from medical texts,such as diseases,drugs,surgery reports,anatomical parts,and examination documents.Conventional medical ...Medical named entity recognition(NER)is an area in which medical named entities are recognized from medical texts,such as diseases,drugs,surgery reports,anatomical parts,and examination documents.Conventional medical NER methods do not make full use of un-labelled medical texts embedded in medical documents.To address this issue,we proposed a medical NER approach based on pre-trained language models and a domain dictionary.First,we constructed a medical entity dictionary by extracting medical entities from labelled medical texts and collecting medical entities from other resources,such as the YiduN4 K data set.Second,we employed this dictionary to train domain-specific pre-trained language models using un-labelled medical texts.Third,we employed a pseudo labelling mechanism in un-labelled medical texts to automatically annotate texts and create pseudo labels.Fourth,the BiLSTM-CRF sequence tagging model was used to fine-tune the pre-trained language models.Our experiments on the un-labelled medical texts,which were extracted from Chinese electronic medical records,show that the proposed NER approach enables the strict and relaxed F1 scores to be 88.7%and 95.3%,respectively.展开更多
Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and langua...Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.展开更多
Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, ...Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.展开更多
ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential sec...ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.展开更多
The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.De...The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.展开更多
Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether ...Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether Large Language Models(LLMs)can play a role in this process.Design/methodology/approach:This article assesses which ChatGPT inputs(full text without tables,figures,and references;title and abstract;title only)produce better quality score estimates,and the extent to which scores are affected by ChatGPT models and system prompts.Findings:The optimal input is the article title and abstract,with average ChatGPT scores based on these(30 iterations on a dataset of 51 papers)correlating at 0.67 with human scores,the highest ever reported.ChatGPT 4o is slightly better than 3.5-turbo(0.66),and 4o-mini(0.66).Research limitations:The data is a convenience sample of the work of a single author,it only includes one field,and the scores are self-evaluations.Practical implications:The results suggest that article full texts might confuse LLM research quality evaluations,even though complex system instructions for the task are more effective than simple ones.Thus,whilst abstracts contain insufficient information for a thorough assessment of rigour,they may contain strong pointers about originality and significance.Finally,linear regression can be used to convert the model scores into the human scale scores,which is 31%more accurate than guessing.Originality/value:This is the first systematic comparison of the impact of different prompts,parameters and inputs for ChatGPT research quality evaluations.展开更多
AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surfa...AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surface diseases:ChatGPT-4,ChatGPT-3.5,Claude 2,PaLM2,and SenseNova.METHODS:A group of experienced ophthalmology professors were asked to develop a 100-question singlechoice question on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions.The exam includes questions on the following topics:keratitis disease(20 questions),keratoconus,keratomalaciac,corneal dystrophy,corneal degeneration,erosive corneal ulcers,and corneal lesions associated with systemic diseases(20 questions),conjunctivitis disease(20 questions),trachoma,pterygoid and conjunctival tumor diseases(20 questions),and dry eye disease(20 questions).Then the total score of each LLMs and compared their mean score,mean correlation,variance,and confidence were calculated.RESULTS:GPT-4 exhibited the highest performance in terms of LLMs.Comparing the average scores of the LLMs group with the four human groups,chief physician,attending physician,regular trainee,and graduate student,it was found that except for ChatGPT-4,the total score of the rest of the LLMs is lower than that of the graduate student group,which had the lowest score in the human group.Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers,giving very little chance of an incorrect answer.ChatGPT-4 showed higher credibility when answering questions,with a success rate of 59%,but gave the wrong answer to the question 28% of the time.CONCLUSION:GPT-4 model exhibits excellent performance in both answer relevance and confidence.PaLM2 shows a positive correlation(up to 0.8)in terms of answer accuracy during the exam.In terms of answer confidence,PaLM2 is second only to GPT4 and surpasses Claude 2,SenseNova,and GPT-3.5.Despite the fact that ocular surface disease is a highly specialized discipline,GPT-4 still exhibits superior performance,suggesting that its potential and ability to be applied in this field is enormous,perhaps with the potential to be a valuable resource for medical students and clinicians in the future.展开更多
基金Science and Technology Innovation 2030-Major Project of“New Generation Artificial Intelligence”granted by Ministry of Science and Technology,Grant Number 2020AAA0109300.
文摘In the process of constructing domain-specific knowledge graphs,the task of relational triple extraction plays a critical role in transforming unstructured text into structured information.Existing relational triple extraction models facemultiple challenges when processing domain-specific data,including insufficient utilization of semantic interaction information between entities and relations,difficulties in handling challenging samples,and the scarcity of domain-specific datasets.To address these issues,our study introduces three innovative components:Relation semantic enhancement,data augmentation,and a voting strategy,all designed to significantly improve the model’s performance in tackling domain-specific relational triple extraction tasks.We first propose an innovative attention interaction module.This method significantly enhances the semantic interaction capabilities between entities and relations by integrating semantic information fromrelation labels.Second,we propose a voting strategy that effectively combines the strengths of large languagemodels(LLMs)and fine-tuned small pre-trained language models(SLMs)to reevaluate challenging samples,thereby improving the model’s adaptability in specific domains.Additionally,we explore the use of LLMs for data augmentation,aiming to generate domain-specific datasets to alleviate the scarcity of domain data.Experiments conducted on three domain-specific datasets demonstrate that our model outperforms existing comparative models in several aspects,with F1 scores exceeding the State of the Art models by 2%,1.6%,and 0.6%,respectively,validating the effectiveness and generalizability of our approach.
文摘We present an approach to classify medical text at a sentence level automatically.Given the inherent complexity of medical text classification,we employ adapters based on pre-trained language models to extract information from medical text,facilitating more accurate classification while minimizing the number of trainable parameters.Extensive experiments conducted on various datasets demonstrate the effectiveness of our approach.
基金funded by grants from the National Key Research and Development Program of China(Grant Nos.:2022YFE0205600 and 2022YFC3400504)the National Natural Science Foundation of China(Grant Nos.:82373792 and 82273857)the Fundamental Research Funds for the Central Universities,China,and the East China Normal University Medicine and Health Joint Fund,China(Grant No.:2022JKXYD07001).
文摘Current experimental and computational methods have limitations in accurately and efficiently classifying ion channels within vast protein spaces.Here we have developed a deep learning algorithm,GPT2 Ion Channel Classifier(GPT2-ICC),which effectively distinguishing ion channels from a test set containing approximately 239 times more non-ion-channel proteins.GPT2-ICC integrates representation learning with a large language model(LLM)-based classifier,enabling highly accurate identification of potential ion channels.Several potential ion channels were predicated from the unannotated human proteome,further demonstrating GPT2-ICC’s generalization ability.This study marks a significant advancement in artificial-intelligence-driven ion channel research,highlighting the adaptability and effectiveness of combining representation learning with LLMs to address the challenges of imbalanced protein sequence data.Moreover,it provides a valuable computational tool for uncovering previously uncharacterized ion channels.
文摘We analyze the suitability of existing pre-trained transformer-based language models(PLMs)for abstractive text summarization on German technical healthcare texts.The study focuses on the multilingual capabilities of these models and their ability to perform the task of abstractive text summarization in the healthcare field.The research hypothesis was that large language models could perform high-quality abstractive text summarization on German technical healthcare texts,even if the model is not specifically trained in that language.Through experiments,the research questions explore the performance of transformer language models in dealing with complex syntax constructs,the difference in performance between models trained in English and German,and the impact of translating the source text to English before conducting the summarization.We conducted an evaluation of four PLMs(GPT-3,a translation-based approach also utilizing GPT-3,a German language Model,and a domain-specific bio-medical model approach).The evaluation considered the informativeness using 3 types of metrics based on Recall-Oriented Understudy for Gisting Evaluation(ROUGE)and the quality of results which is manually evaluated considering 5 aspects.The results show that text summarization models could be used in the German healthcare domain and that domain-independent language models achieved the best results.The study proves that text summarization models can simplify the search for pre-existing German knowledge in various domains.
基金supported by the National Key R&D Program of China[2022YFF0902703]the State Administration for Market Regulation Science and Technology Plan Project(2024MK033).
文摘Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interactions to predict future items of interest.However,many current methods rely on unique user and item IDs,limiting their ability to represent users and items effectively,especially in zero-shot learning scenarios where training data is scarce.With the rapid development of Large Language Models(LLMs),researchers are exploring their potential to enhance recommendation systems.However,there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems,where items are typically indexed by IDs.Moreover,most research focuses on item representations,neglecting personalized user modeling.To address these issues,we propose a sequential recommendation framework using LLMs,called CIT-Rec,a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations.Specifically,by aligning intuitive image information with text containing semantic features,we can more accurately represent items,improving item representation quality.We focus not only on item representations but also on user representations.To more precisely capture users’personalized preferences,we use traditional sequential recommendation models to train on users’historical interaction data,effectively capturing behavioral patterns.Finally,by combining LLMs and traditional sequential recommendation models,we allow the LLM to understand linguistic semantics while capturing collaborative semantics.Extensive evaluations on real-world datasets show that our model outperforms baseline methods,effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
基金funded by the Office of the Vice-President for Research and Development of Cebu Technological University.
文摘This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.
文摘Knowledge distillation has become a standard technique for compressing large language models into efficient student models,but existing methods often struggle to balance prediction accuracy with explanation quality.Recent approaches such as Distilling Step-by-Step(DSbS)introduce explanation supervision,yet they apply it in a uniform manner that may not fully exploit the different learning dynamics of prediction and explanation.In this work,we propose a task-structured curriculum learning(TSCL)framework that structures training into three sequential phases:(i)prediction-only,to establish stable feature representations;(ii)joint prediction-explanation,to align task outputs with rationale generation;and(iii)explanation-only,to refine the quality of rationales.This design provides a simple but effective modification to DSbS,requiring no architectural changes and adding negligible training cost.We justify the phase scheduling with ablation studies and convergence analysis,showing that an initial prediction-heavy stage followed by a balanced joint phase improves both stability and explanation alignment.Extensive experiments on five datasets(e-SNLI,ANLI,CommonsenseQA,SVAMP,and MedNLI)demonstrate that TSCL consistently outperforms strong baselines,achieving gains of+1.7-2.6 points in accuracy and 0.8-1.2 in ROUGE-L,corresponding to relative error reductions of up to 21%.Beyond lexical metrics,human evaluation and ERASERstyle faithfulness diagnostics confirm that TSCL produces more faithful and informative explanations.Comparative training curves further reveal faster convergence and lower variance across seeds.Efficiency analysis shows less than 3%overhead in wall-clock training time and no additional inference cost,making the approach practical for realworld deployment.This study demonstrates that a simple task-structured curriculum can significantly improve the effectiveness of knowledge distillation.By separating and sequencing objectives,TSCL achieves a better balance between accuracy,stability,and explanation quality.The framework generalizes across domains,including medical NLI,and offers a principled recipe for future applications in multimodal reasoning and reinforcement learning.
基金the research project LaTe4PoliticES(PID2022-138099OB-I00)funded by MCIN/AEI/10.13039/501100011033 and the European Fund for Regional Development(ERDF)-a way to make Europe.Tomás Bernal-Beltrán is supported by University of Murcia through the predoctoral programme.
文摘The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources.In this paper,we compare two predominant AI-based approaches for the forensic detection of malicious hate speech:(1)finetuning encoder-only models that have been trained in Spanish and(2)In-Context Learning techniques(Zero-and Few-Shot Learning)with large-scale language models.Our approach goes beyond binary classification,proposing a comprehensive,multidimensional evaluation that labels each text by:(1)type of speech,(2)recipient,(3)level of intensity(ordinal)and(4)targeted group(multi-label).Performance is evaluated using an annotated Spanish corpus,standard metrics such as precision,recall and F1-score and stability-oriented metrics to evaluate the stability of the transition from zero-shot to few-shot prompting(Zero-to-Few Shot Retention and Zero-to-Few Shot Gain)are applied.The results indicate that fine-tuned encoder-only models(notably MarIA and BETO variants)consistently deliver the strongest and most reliable performance:in our experiments their macro F1-scores lie roughly in the range of approximately 46%–66%depending on the task.Zero-shot approaches are much less stable and typically yield substantially lower performance(observed F1-scores range approximately 0%–39%),often producing invalid outputs in practice.Few-shot prompting(e.g.,Qwen 38B,Mistral 7B)generally improves stability and recall relative to pure zero-shot,bringing F1-scores into a moderate range of approximately 20%–51%but still falling short of fully fine-tuned models.These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
文摘War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient and inflexible,with particularly pronounced limitations in command and decision-making.The overwhelming volume of information and high decision complexity hinder the realization of autonomous and agile command and control.To address this challenge,an intelligent warfare simulation framework named Command-Agent is proposed,which deeply integrates large language models(LLMs)with digital twin battlefields.By constructing a highly realistic battlefield environment through real-time simulation and multi-source data fusion,the natural language interaction capabilities of LLMs are leveraged to lower the command threshold and to enable autonomous command through the Observe-Orient-Decide-Act(OODA)feedback loop.Within the Command-Agent framework,a multimodel collaborative architecture is further adopted to decouple the decision-generation and command-execution functions of LLMs.By combining specialized models such as Deep Seek-R1 and MCTool,the limitations of single-model capabilities are overcome.MCTool is a lightweight execution model fine-tuned for military Function Calling tasks.The framework also introduces a Vector Knowledge Base to mitigate hallucinations commonly exhibited by LLMs.Experimental results demonstrate that Command-Agent not only enables natural language-driven simulation and control but also deeply understands commander intent.Leveraging the multi-model collaborative architecture,during red-blue UAV confrontations involving 2 to 8 UAVs,the integrated score is improved by an average of 41.8%compared to the single-agent system(MCTool),accompanied by a 161.8%optimization in the battle loss ratio.Furthermore,when compared with multi-agent systems lacking the knowledge base,the inclusion of the Vector Knowledge Base further improves overall performance by 16.8%.In comparison with the general model(Qwen2.5-7B),the fine-tuned MCTool leads by 5%in execution efficiency.Therefore,the proposed Command-Agent introduces a novel perspective to the military command system and offers a feasible solution for intelligent battlefield decision-making.
基金supported by 2023 Higher Education Scientific Research Planning Project of China Society of Higher Education(No.23PG0408)2023 Philosophy and Social Science Research Programs in Jiangsu Province(No.2023SJSZ0993)+2 种基金Nantong Science and Technology Project(No.JC2023070)Key Project of Jiangsu Province Education Science 14th Five-Year Plan(Grant No.B-b/2024/02/41)the Open Fund of Advanced Cryptography and System Security Key Laboratory of Sichuan Province(Grant No.SKLACSS-202407).
文摘Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that manipulate model behavior through malicious instructions.Following Kitchenham’s guidelines,this systematic review synthesizes 128 peer-reviewed studies from 2022 to 2025 to provide a unified understanding of this rapidly evolving threat landscape.Our findings reveal a swift progression from simple direct injections to sophisticated multimodal attacks,achieving over 90%success rates against unprotected systems.In response,defense mechanisms show varying effectiveness:input preprocessing achieves 60%–80%detection rates and advanced architectural defenses demonstrate up to 95%protection against known patterns,though significant gaps persist against novel attack vectors.We identified 37 distinct defense approaches across three categories,but standardized evaluation frameworks remain limited.Our analysis attributes these vulnerabilities to fundamental LLM architectural limitations,such as the inability to distinguish instructions from data and attention mechanism vulnerabilities.This highlights critical research directions such as formal verification methods,standardized evaluation protocols,and architectural innovations for inherently secure LLM designs.
基金supported by the Fundamental Research Funds for the Central Universities(No.CUC25SG013)the Foundation of Key Laboratory of Education Informatization for Nationalities(Yunnan Normal University),Ministry of Education(No.EIN2024C006).
文摘Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lacks formal task definitions and corresponding benchmarks.To bridge this gap,we define the Automated Online Public Opinion Report Generation(OPOR-Gen)task and construct OPOR-Bench,an event-centric dataset with 463 crisis events across 108 countries(comprising 8.8 K news articles and 185 K tweets).To evaluate report quality,we propose OPOR-Eval,a novel agent-based framework that simulates human expert evaluation.Validation experiments show OPOR-Eval achieves a high Spearman’s correlation(ρ=0.70)with human judgments,though challenges in temporal reasoning persist.This work establishes an initial foundation for advancing automated public opinion reporting research.
文摘Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstanding performance across various domains,thereby prompting researchers to investigate their applicability in recommendation systems.However,due to the lack of task-specific knowledge and an inefficient feature extraction process,LLMs still have suboptimal performance in recommendation tasks.Therefore,external knowledge sources,such as knowledge graphs(KGs)and knowledge bases(KBs),are often introduced to address the issue of data sparsity.Compared to KGs,KBs possess higher retrieval efficiency,making them more suitable for scenarios where LLMs serve as recommenders.To this end,we introduce a novel framework integrating LLMs with KBs for enhanced retrieval generation,namely LLMKB.LLMKB initially leverages structured knowledge to create mapping dictionaries,extracting entity-relation information from heterogeneous knowledge to construct KBs.Then,LLMKB achieves the embedding calibration between user information representations and documents in KBs through retrieval model fine-tuning.Finally,LLMKB employs retrievalaugmented generation to produce recommendations based on fused text inputs,followed by post-processing.Experiment results on two public CRS datasets demonstrate the effectiveness of our framework.Our code is publicly available at the link:https://anonymous.4open.science/r/LLMKB-6FD0.
文摘Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.
基金This work is supported in part by the Guangdong Science and Technology grant(No.2016A010101033)the Hong Kong and Macao joint research and development grant with Wuyi University(No.2019WGAH21).
文摘Medical named entity recognition(NER)is an area in which medical named entities are recognized from medical texts,such as diseases,drugs,surgery reports,anatomical parts,and examination documents.Conventional medical NER methods do not make full use of un-labelled medical texts embedded in medical documents.To address this issue,we proposed a medical NER approach based on pre-trained language models and a domain dictionary.First,we constructed a medical entity dictionary by extracting medical entities from labelled medical texts and collecting medical entities from other resources,such as the YiduN4 K data set.Second,we employed this dictionary to train domain-specific pre-trained language models using un-labelled medical texts.Third,we employed a pseudo labelling mechanism in un-labelled medical texts to automatically annotate texts and create pseudo labels.Fourth,the BiLSTM-CRF sequence tagging model was used to fine-tune the pre-trained language models.Our experiments on the un-labelled medical texts,which were extracted from Chinese electronic medical records,show that the proposed NER approach enables the strict and relaxed F1 scores to be 88.7%and 95.3%,respectively.
基金supported by National Natural Science Foundation of China(62376219 and 62006194)Foundational Research Project in Specialized Discipline(Grant No.G2024WD0146)Faculty Construction Project(Grant No.24GH0201148).
文摘Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.
文摘Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.
文摘ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.
基金supported by the National Key R&D Program of China under Grant No.2022YFB3103500the National Natural Science Foundation of China under Grants No.62402087 and No.62020106013+3 种基金the Sichuan Science and Technology Program under Grant No.2023ZYD0142the Chengdu Science and Technology Program under Grant No.2023-XT00-00002-GXthe Fundamental Research Funds for Chinese Central Universities under Grants No.ZYGX2020ZB027 and No.Y030232063003002the Postdoctoral Innovation Talents Support Program under Grant No.BX20230060.
文摘The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.
文摘Purpose:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises,appointments and promotion.It is therefore important to investigate whether Large Language Models(LLMs)can play a role in this process.Design/methodology/approach:This article assesses which ChatGPT inputs(full text without tables,figures,and references;title and abstract;title only)produce better quality score estimates,and the extent to which scores are affected by ChatGPT models and system prompts.Findings:The optimal input is the article title and abstract,with average ChatGPT scores based on these(30 iterations on a dataset of 51 papers)correlating at 0.67 with human scores,the highest ever reported.ChatGPT 4o is slightly better than 3.5-turbo(0.66),and 4o-mini(0.66).Research limitations:The data is a convenience sample of the work of a single author,it only includes one field,and the scores are self-evaluations.Practical implications:The results suggest that article full texts might confuse LLM research quality evaluations,even though complex system instructions for the task are more effective than simple ones.Thus,whilst abstracts contain insufficient information for a thorough assessment of rigour,they may contain strong pointers about originality and significance.Finally,linear regression can be used to convert the model scores into the human scale scores,which is 31%more accurate than guessing.Originality/value:This is the first systematic comparison of the impact of different prompts,parameters and inputs for ChatGPT research quality evaluations.
基金Supported by National Natural Science Foundation of China(No.82160195,No.82460203)Degree and Postgraduate Education Teaching Reform Project of Jiangxi Province(No.JXYJG-2020-026).
文摘AIM:To assess the possibility of using different large language models(LLMs)in ocular surface diseases by selecting five different LLMS to test their accuracy in answering specialized questions related to ocular surface diseases:ChatGPT-4,ChatGPT-3.5,Claude 2,PaLM2,and SenseNova.METHODS:A group of experienced ophthalmology professors were asked to develop a 100-question singlechoice question on ocular surface diseases designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions.The exam includes questions on the following topics:keratitis disease(20 questions),keratoconus,keratomalaciac,corneal dystrophy,corneal degeneration,erosive corneal ulcers,and corneal lesions associated with systemic diseases(20 questions),conjunctivitis disease(20 questions),trachoma,pterygoid and conjunctival tumor diseases(20 questions),and dry eye disease(20 questions).Then the total score of each LLMs and compared their mean score,mean correlation,variance,and confidence were calculated.RESULTS:GPT-4 exhibited the highest performance in terms of LLMs.Comparing the average scores of the LLMs group with the four human groups,chief physician,attending physician,regular trainee,and graduate student,it was found that except for ChatGPT-4,the total score of the rest of the LLMs is lower than that of the graduate student group,which had the lowest score in the human group.Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers,giving very little chance of an incorrect answer.ChatGPT-4 showed higher credibility when answering questions,with a success rate of 59%,but gave the wrong answer to the question 28% of the time.CONCLUSION:GPT-4 model exhibits excellent performance in both answer relevance and confidence.PaLM2 shows a positive correlation(up to 0.8)in terms of answer accuracy during the exam.In terms of answer confidence,PaLM2 is second only to GPT4 and surpasses Claude 2,SenseNova,and GPT-3.5.Despite the fact that ocular surface disease is a highly specialized discipline,GPT-4 still exhibits superior performance,suggesting that its potential and ability to be applied in this field is enormous,perhaps with the potential to be a valuable resource for medical students and clinicians in the future.