739 articles found
Hepatitis C Patient Education: Large Language Models Show Promise in Disseminating Guidelines
1
Authors: Jinyan Chen, Ruijie Zhao, Chiyu He, Huigang Li, Yajie You, Zuyuan Lin, Ze Xiang, Jianyong Zhuo, Wei Shen, Zhihang Hu, Shusen Zheng, Xiao Xu, Di Lu. Journal of Clinical and Translational Hepatology, 2026, No. 1, pp. 116-119 (4 pages)
This study evaluated the accuracy, completeness, and comprehensibility of responses from mainstream large language models (LLMs) to hepatitis C virus (HCV)-related questions, aiming to assess their performance in addressing patient queries about disease and lifestyle behaviors. The models selected were ChatGPT-4o, Gemini 2.0 Pro, Claude 3.5 Sonnet, and DeepSeek V3, with 12 questions chosen by two HCV experts from the domains of prevention, diagnosis, and treatment.
Keywords: addressing patient queries; disease; lifestyle behaviors; large language models (LLMs); guidelines; hepatitis C; accuracy; patient education; comprehensibility
CIT-Rec: Enhancing Sequential Recommendation System with Large Language Models
2
Authors: Ziyu Li, Zhen Chen, Xuejing Fu, Tong Mo, Weiping Li. Computers, Materials & Continua, 2026, No. 3, pp. 2328-2343 (16 pages)
Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can more accurately represent items, improving item representation quality. We focus not only on item representations but also on user representations. To more precisely capture users’ personalized preferences, we use traditional sequential recommendation models to train on users’ historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
Keywords: large language models; vision language models; sequential recommendation; instruction tuning
Classification of Job Offers into Job Positions Using O*NET and BERT Language Models
3
Authors: Lino Gonzalez-Garcia, Miguel-Angel Sicilia, Elena García-Barriocanal. Computers, Materials & Continua, 2026, No. 2, pp. 2133-2147 (15 pages)
Classifying job offers into occupational categories is a fundamental task in human resource information systems, as it improves and streamlines indexing, search, and matching between openings and job seekers. Comprehensive occupational databases such as O*NET or ESCO provide detailed taxonomies of interrelated positions that can be leveraged to align the textual content of postings with occupational categories, thereby facilitating standardization, cross-system interoperability, and access to metadata for each occupation (e.g., tasks, knowledge, skills, and abilities). In this work, we explore the effectiveness of fine-tuning existing language models (LMs) to classify job offers with occupational descriptors from O*NET. This enables a more precise assessment of candidate suitability by identifying the specific knowledge and skills required for each position, and helps automate recruitment processes by mitigating human bias and subjectivity in candidate selection. We evaluate three representative BERT-like models: BERT, RoBERTa, and DeBERTa. BERT serves as the baseline encoder-only architecture; RoBERTa incorporates advances in pretraining objectives and data scale; and DeBERTa introduces architectural improvements through disentangled attention mechanisms. The best performance was achieved with the DeBERTa model, although the other models also produced strong results, and no statistically significant differences were observed across models. We also find that these models typically reach optimal performance after only a few training epochs, and that training with smaller, balanced datasets is effective. Consequently, comparable results can be obtained with models that require fewer computational resources and less training time, facilitating deployment and practical use.
Keywords: occupational databases; job offer classification; language models; O*NET; BERT; RoBERTa; DeBERTa
PROMPTx-PE: Adaptive Optimization of Prompt Engineering Strategies for Accuracy and Robustness in Large Language Models
4
Authors: Talha Farooq Khan, Fahad Ali, Majid Hussain, Lal Khan, Hsien-Tsung Chang. Computers, Materials & Continua, 2026, No. 5, pp. 685-715 (31 pages)
The rapid growth in applications of large language models (LLMs) demonstrates the significance of adaptive and efficient prompt engineering tactics. Existing methods may lack adaptability, robustness, and efficiency across domains. This study introduces a prompt optimization framework, named PROMPTx-PE, intended to yield greater accuracy and robustness on LLM-based tasks. The proposed system features a prompt selection scheme informed by reinforcement learning, a contextual layer, and a dynamic weighting module regulated by Lyapunov-based stability criteria. PROMPTx-PE dynamically balances exploration and exploitation of the prompt space, depending on real-time feedback and multi-objective reward development. Extensive testing on both benchmark (GLUE, SuperGLUE) and domain-specific data (Healthcare-QA and Industrial-NER) demonstrates a best performance of 89.4% and strong robustness with under 3% computational overhead. The results confirm the effectiveness, consistency, and scalability of PROMPTx-PE as a platform for adaptive prompt engineering in recent uses of LLMs.
Keywords: prompt engineering; large language models; adaptive optimization; robustness; multi-objective optimization; reinforcement learning; natural language processing
Detection of Maliciously Disseminated Hate Speech in Spanish Using Fine-Tuning and In-Context Learning Techniques with Large Language Models
5
Authors: Tomás Bernal-Beltrán, Ronghao Pan, José Antonio García-Díaz, María del Pilar Salas-Zárate, Mario Andrés Paredes-Valverde, Rafael Valencia-García. Computers, Materials & Continua, 2026, No. 4, pp. 353-390 (38 pages)
The malicious dissemination of hate speech via compromised accounts, automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern. Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources. In this paper, we compare two predominant AI-based approaches for the forensic detection of malicious hate speech: (1) fine-tuning encoder-only models that have been trained in Spanish and (2) In-Context Learning techniques (Zero- and Few-Shot Learning) with large-scale language models. Our approach goes beyond binary classification, proposing a comprehensive, multidimensional evaluation that labels each text by: (1) type of speech, (2) recipient, (3) level of intensity (ordinal) and (4) targeted group (multi-label). Performance is evaluated on an annotated Spanish corpus using standard metrics such as precision, recall and F1-score, together with stability-oriented metrics that assess the transition from zero-shot to few-shot prompting (Zero-to-Few Shot Retention and Zero-to-Few Shot Gain). The results indicate that fine-tuned encoder-only models (notably MarIA and BETO variants) consistently deliver the strongest and most reliable performance: in our experiments their macro F1-scores lie roughly in the range of 46%–66% depending on the task. Zero-shot approaches are much less stable and typically yield substantially lower performance (observed F1-scores range approximately 0%–39%), often producing invalid outputs in practice. Few-shot prompting (e.g., Qwen3 8B, Mistral 7B) generally improves stability and recall relative to pure zero-shot, bringing F1-scores into a moderate range of approximately 20%–51% but still falling short of fully fine-tuned models. These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
Keywords: hate speech detection; malicious communication campaigns; AI-driven cybersecurity; social media analytics; large language models; prompt-tuning; fine-tuning; in-context learning; natural language processing
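The macro F1-score reported in the abstract above is the unweighted mean of per-class F1 values. A minimal Python sketch follows, assuming single-label predictions; the `labels` parameter and the example labels are illustrative and not taken from the paper. Predictions outside the label set (such as the invalid outputs the abstract mentions for zero-shot prompting) count as misses for the true class.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1.

    labels: classes to average over (hypothetical parameter for this sketch).
    Defaults to every label seen in y_true or y_pred.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # wrong (or invalid) prediction
            fn[truth] += 1  # the true class was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["hate", "hate", "none", "none"]
    preds = ["hate", "none", "none", "none"]
    print(round(macro_f1(truth, preds), 4))
```

Averaging per class rather than per sample means rare classes weigh as much as frequent ones, which matters for imbalanced label sets.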
Semantic Causality Evaluation of Correlation Analysis Utilizing Large Language Models
6
Author: Adam Dudáš. Computers, Materials & Continua, 2026, No. 5, pp. 2246-2269 (24 pages)
It is known that correlation does not imply causality. Some relationships identified in the analysis of data are coincidental or unknown, and some are produced by real-world causality of the situation, which is problematic, since there is a need to differentiate between these two scenarios. Until recently, the proper (semantic) causality of a relationship could be determined only by human experts from the area of expertise of the studied data. This has changed with the advance of large language models, which are often utilized as surrogates for such human experts, making the process automated and readily available to all data analysts. This motivates the main objective of this work, which is to introduce the design and implementation of a large language model-based semantic causality evaluator based on correlation analysis, together with its visual analysis model called Causal heatmap. After the implementation itself, the model is evaluated from the point of view of the quality of the visual model, from the point of view of the quality of causal evaluation based on large language models, and from the point of view of comparative analysis, while the results reached in the study highlight the usability of large language models in the task and the potential of the proposed approach in the analysis of unknown datasets. The results of the experimental evaluation demonstrate the usefulness of the Causal heatmap method, supported by the evident highlighting of interesting relationships, while suppressing irrelevant ones.
Keywords: correlation; causality; correlation analysis; large language models; visualization
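The general idea in the abstract above (run correlation analysis first, then ask a language model whether a strong correlation is plausibly causal) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the 0.7 threshold, the function names, and the prompt wording are all hypothetical, and no actual LLM call is made.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold.

    threshold is an assumed cut-off chosen for this sketch.
    """
    found = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


def causality_prompt(a, b, r):
    """Build a hypothetical question that could be sent to an LLM."""
    return (
        f"The attributes '{a}' and '{b}' have a Pearson correlation of {r:.2f}. "
        "Is a real-world causal relationship between them plausible? "
        "Answer 'causal', 'coincidental', or 'unknown' and explain briefly."
    )


if __name__ == "__main__":
    data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, -1, 1, -1]}
    for a, b, r in strong_pairs(data):
        print(causality_prompt(a, b, r))
```

Filtering by correlation strength before prompting keeps the number of model queries proportional to the number of interesting pairs rather than to all pairs.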
SDNet: A self-supervised bird recognition method based on large language models and diffusion models for improving long-term bird monitoring
7
Authors: Zhongde Zhang, Nan Su, Chenxun Deng, Yandong Zhao, Weiping Liu, Qiaoling Han. Avian Research, 2026, No. 1, pp. 200-215 (16 pages)
The collection and annotation of large-scale bird datasets are resource-intensive and time-consuming processes that significantly limit the scalability and accuracy of biodiversity monitoring systems. While self-supervised learning (SSL) has emerged as a promising approach for leveraging unannotated data, current SSL methods face two critical challenges in bird species recognition: (1) long-tailed data distributions that result in poor performance on underrepresented species; and (2) domain shift issues caused by data augmentation strategies designed to mitigate class imbalance. Here we present SDNet, a novel SSL-based bird recognition framework that integrates diffusion models with large language models (LLMs) to overcome these limitations. SDNet employs LLMs to generate semantically rich textual descriptions for tail-class species by prompting the models with species taxonomy, morphological attributes, and habitat information, producing detailed natural language priors that capture fine-grained visual characteristics (e.g., plumage patterns, body proportions, and distinctive markings). These textual descriptions are subsequently used by a conditional diffusion model to synthesize new bird image samples through cross-attention mechanisms that fuse textual embeddings with intermediate visual feature representations during the denoising process, ensuring generated images preserve species-specific morphological details while maintaining photorealistic quality. Additionally, we incorporate a Swin Transformer as the feature extraction backbone whose hierarchical window-based attention mechanism and shifted windowing scheme enable multi-scale local feature extraction that proves particularly effective at capturing fine-grained discriminative patterns (such as beak shape and feather texture) while mitigating domain shift between synthetic and original images through consistent feature representations across both data sources. SDNet is validated on both a self-constructed dataset (Bird_BXS) and a publicly available benchmark (Birds_25), demonstrating substantial improvements over conventional SSL approaches. Our results indicate that the synergistic integration of LLMs, diffusion models, and the Swin Transformer architecture contributes significantly to recognition accuracy, particularly for rare and morphologically similar species. These findings highlight the potential of SDNet for addressing fundamental limitations of existing SSL methods in avian recognition tasks and establishing a new paradigm for efficient self-supervised learning in large-scale ornithological vision applications.
Keywords: biodiversity conservation; bird intelligent monitoring; diffusion models; large-scale language models; long-tailed learning; self-supervised learning
Assessing Large Language Models for Early Article Identification in Otolaryngology—Head and Neck Surgery Systematic Reviews
8
Authors: Ajibola B. Bakare, Young Lee, Jhuree Hong, Claus-Peter Richter, Jonathan P. Kuriakose. Health Care Science, 2026, No. 1, pp. 19-28 (10 pages)
Background: Assess ChatGPT and Bard's effectiveness in the initial identification of articles for Otolaryngology–Head and Neck Surgery systematic literature reviews. Methods: Three PRISMA-based systematic reviews (Jabbour et al. 2017, Wong et al. 2018, and Wu et al. 2021) were replicated using ChatGPT v3.5 and Bard. Outputs (author, title, publication year, and journal) were compared to the original references and cross-referenced with medical databases for authenticity and recall. Results: Several themes emerged when comparing Bard and ChatGPT across the three reviews. Bard generated more outputs and had greater recall in Wong et al.'s review, with a broader date range in Jabbour et al.'s review. In Wu et al.'s review, ChatGPT-2 had higher recall and identified more authentic outputs than Bard-2. Conclusion: Large language models (LLMs) failed to fully replicate peer-reviewed methodologies, producing outputs with inaccuracies but identifying relevant, especially recent, articles missed by the references. While human-led PRISMA-based reviews remain the gold standard, refining LLMs for literature reviews shows potential.
Keywords: artificial intelligence; Bard; ChatGPT; large language models; systematic review
When Large Language Models and Machine Learning Meet Multi-Criteria Decision Making: Fully Integrated Approach for Social Media Moderation
9
Authors: Noreen Fuentes, Janeth Ugang, Narcisan Galamiton, Suzette Bacus, Samantha Shane Evangelista, Fatima Maturan, Lanndon Ocampo. Computers, Materials & Continua, 2026, No. 1, pp. 2137-2162 (26 pages)
This study demonstrates a novel integration of large language models, machine learning, and multicriteria decision-making to investigate self-moderation in small online communities, a topic under-explored compared to user behavior and platform-driven moderation on social media. The proposed methodological framework (1) utilizes large language models for social media post analysis and categorization, (2) employs k-means clustering for content characterization, and (3) incorporates the TODIM (Tomada de Decisão Interativa Multicritério) method to determine moderation strategies based on expert judgments. In general, the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems. When applied in social media moderation, this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location. The application of this framework is demonstrated within Facebook groups. Eight distinct content clusters encompassing safety, harassment, diversity, and misinformation are identified. Analysis revealed a preference for content removal across all clusters, suggesting a cautious approach towards potentially harmful content. However, the framework also highlights the use of other moderation actions, like account suspension, depending on the content category. These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.
Keywords: self-moderation; user-generated content; k-means clustering; TODIM; large language models
Automating the Initial Development of Intent-Based Task-Oriented Dialog Systems Using Large Language Models: Experiences and Challenges
10
Authors: Ksenia Kharitonova, David Pérez-Fernández, Zoraida Callejas, David Griol. Computers, Materials & Continua, 2026, No. 5, pp. 1021-1062 (42 pages)
Building reliable intent-based, task-oriented dialog systems typically requires substantial manual effort: designers must derive intents, entities, responses, and control logic from raw conversational data, then iterate until the assistant behaves consistently. This paper investigates how far large language models (LLMs) can automate this development. We use two reference corpora, Let’s Go (English, public transport) and MEDIA (French, hotel booking), to prompt four LLM families (GPT-4o, Claude, Gemini, Mistral Small) and generate the core specifications required by the Rasa platform. These include intent sets with example utterances, entity definitions with slot mappings, response templates, and basic dialog flows. To structure this process, we introduce a model- and platform-agnostic pipeline with two phases. The first normalizes and validates LLM-generated artifacts, enforcing cross-file consistency and making slot usage explicit. The second uses a lightweight dialog harness that runs scripted tests and incrementally patches failure points until conversations complete reliably. Across eight projects, all models required some targeted repairs before training. After applying our pipeline, all reached ≥70% task completion (many above 84%), while NLU performance ranged from mid-0.6 to 1.0 macro-F1 depending on domain breadth. These results show that, with modest guidance, current LLMs can produce workable end-to-end dialog prototypes directly from raw transcripts. Our main contributions are: (i) a reusable bootstrap method aligned with industry domain-specific languages (DSLs), (ii) a small set of high-impact corrective patterns, and (iii) a simple but effective harness for closed-loop refinement across conversational platforms.
Keywords: task-oriented dialog systems; large language models (LLMs); Rasa; dialog automation; natural language understanding (NLU); slot filling; conversational AI; human-in-the-loop NLP
Task-Structured Curriculum Learning for Multi-Task Distillation: Enhancing Step-by-Step Knowledge Transfer in Language Models
11
Authors: Ahmet Ezgi, Aytug Onan. Computers, Materials & Continua, 2026, No. 3, pp. 1647-1673 (27 pages)
Knowledge distillation has become a standard technique for compressing large language models into efficient student models, but existing methods often struggle to balance prediction accuracy with explanation quality. Recent approaches such as Distilling Step-by-Step (DSbS) introduce explanation supervision, yet they apply it in a uniform manner that may not fully exploit the different learning dynamics of prediction and explanation. In this work, we propose a task-structured curriculum learning (TSCL) framework that structures training into three sequential phases: (i) prediction-only, to establish stable feature representations; (ii) joint prediction-explanation, to align task outputs with rationale generation; and (iii) explanation-only, to refine the quality of rationales. This design provides a simple but effective modification to DSbS, requiring no architectural changes and adding negligible training cost. We justify the phase scheduling with ablation studies and convergence analysis, showing that an initial prediction-heavy stage followed by a balanced joint phase improves both stability and explanation alignment. Extensive experiments on five datasets (e-SNLI, ANLI, CommonsenseQA, SVAMP, and MedNLI) demonstrate that TSCL consistently outperforms strong baselines, achieving gains of +1.7 to 2.6 points in accuracy and 0.8 to 1.2 in ROUGE-L, corresponding to relative error reductions of up to 21%. Beyond lexical metrics, human evaluation and ERASER-style faithfulness diagnostics confirm that TSCL produces more faithful and informative explanations. Comparative training curves further reveal faster convergence and lower variance across seeds. Efficiency analysis shows less than 3% overhead in wall-clock training time and no additional inference cost, making the approach practical for real-world deployment. This study demonstrates that a simple task-structured curriculum can significantly improve the effectiveness of knowledge distillation. By separating and sequencing objectives, TSCL achieves a better balance between accuracy, stability, and explanation quality. The framework generalizes across domains, including medical NLI, and offers a principled recipe for future applications in multimodal reasoning and reinforcement learning.
Keywords: knowledge distillation; curriculum learning; language models; multi-task learning; step-by-step learning
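The three sequential phases described in the abstract above (prediction-only, joint, explanation-only) amount to a step-dependent weighting of two losses. A minimal Python sketch follows. The phase lengths and the 0.5/0.5 joint weighting are assumptions made for illustration; the abstract does not state the schedule the authors actually use.

```python
def tscl_weights(step, phase_steps=(300, 400, 300)):
    """Return (prediction_weight, explanation_weight) for a training step.

    phase_steps: assumed lengths, in steps, of the prediction-only, joint,
    and explanation-only phases (hypothetical values for this sketch).
    """
    pred_only, joint, _expl_only = phase_steps
    if step < pred_only:
        return (1.0, 0.0)  # phase (i): prediction-only
    if step < pred_only + joint:
        return (0.5, 0.5)  # phase (ii): joint prediction-explanation
    return (0.0, 1.0)      # phase (iii): explanation-only


def combined_loss(pred_loss, expl_loss, step, phase_steps=(300, 400, 300)):
    """Weighted sum of the two task losses under the phase schedule."""
    w_pred, w_expl = tscl_weights(step, phase_steps)
    return w_pred * pred_loss + w_expl * expl_loss


if __name__ == "__main__":
    for step in (0, 300, 700):
        print(step, tscl_weights(step), combined_loss(2.0, 4.0, step))
```

Because only the loss weights change, a schedule like this needs no architectural changes, which is consistent with the abstract's description of the method as a modification to DSbS.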
Beyond Accuracy: Evaluating and Explaining the Capability Boundaries of Large Language Models in Syntax-Preserving Code Translation
12
Authors: Yaxin Zhao, Qi Han, Hui Shu, Yan Guang. Computers, Materials & Continua, 2026, No. 2, pp. 1371-1394 (24 pages)
Large Language Models (LLMs) are increasingly applied in the field of code translation. However, existing evaluation methodologies suffer from two major limitations: (1) the high overlap between test data and pretraining corpora, which introduces significant bias in performance evaluation; and (2) mainstream metrics focus primarily on surface-level accuracy, failing to uncover the underlying factors that constrain model capabilities. To address these issues, this paper presents TCode (Translation-Oriented Code Evaluation benchmark), a complexity-controllable, contamination-free benchmark dataset for code translation, alongside a dedicated static feature sensitivity evaluation framework. The dataset is carefully designed to control complexity along multiple dimensions, including syntactic nesting and expression intricacy, enabling both broad coverage and fine-grained differentiation of sample difficulty. This design supports precise evaluation of model capabilities across a wide spectrum of translation challenges. The proposed evaluation framework introduces a correlation-driven analysis mechanism based on static program features, enabling predictive modeling of translation success from two perspectives: Code Form Complexity (e.g., code length and character density) and Semantic Modeling Complexity (e.g., syntactic depth, control-flow nesting, and type system complexity). Empirical evaluations across representative LLMs, including Qwen2.5-72B and Llama3.3-70B, demonstrate that while state-of-the-art models achieve over 80% compilation success on simple samples, their accuracy drops sharply below 40% on complex cases. Further correlation analysis indicates that Semantic Modeling Complexity alone is correlated with up to 60% of the variance in translation success, with static program features exhibiting nonlinear threshold effects that highlight clear capability boundaries. This study departs from the traditional accuracy-centric evaluation paradigm and, for the first time, systematically characterizes the capabilities of large language models in translation tasks through the lens of program static features. The findings provide actionable insights for model refinement and training strategy development.
Keywords: large language models (LLMs); code translation; compiler testing; program analysis; complexity-based evaluation
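Static features of the kind named in the abstract above, such as code length and control-flow nesting, can be read from a program without running it. The sketch below uses Python's standard `ast` module on Python source purely as an illustration; the languages and exact feature definitions used by TCode are not given in the abstract, and the function names here are hypothetical.

```python
import ast

# Statement types counted as control flow in this sketch (an assumed choice).
_CONTROL_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def control_flow_depth(source):
    """Maximum nesting depth of control-flow statements in Python source.

    Note: an `elif` is parsed as an `If` nested in the `orelse` of its parent,
    so it counts as one level deeper.
    """
    def depth(node, current):
        if isinstance(node, _CONTROL_NODES):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(ast.parse(source), 0)


def static_features(source):
    """Collect a few simple static features of a code sample."""
    return {
        "length_chars": len(source),
        "length_lines": len(source.splitlines()),
        "control_flow_depth": control_flow_depth(source),
    }


if __name__ == "__main__":
    sample = (
        "for i in range(3):\n"
        "    if i:\n"
        "        while False:\n"
        "            pass\n"
    )
    print(static_features(sample))
```

Features like these could then be correlated with per-sample translation success, which is the kind of analysis the abstract describes.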
Decision-making performance of large language models vs. human physicians in challenging lung cancer cases: A real-world case-based study
13
Authors: Ning Yang, Kailai Li, Baiyang Liu, Xiting Chen, Aimin Jiang, Chang Qi, Wenyi Gan, Lingxuan Zhu, Weiming Mou, Dongqiang Zeng, Mingjia Xiao, Guangdi Chu, Shengkun Peng, Hank ZH Wong, Lin Zhang, Hengguo Zhang, Xinpei Deng, Quan Cheng, Bufu Tang, Anqi Lin, Juan Zhou, Peng Luo. Intelligent Oncology, 2026, No. 1, pp. 15-24 (10 pages)
Background: Despite the promise shown by large language models (LLMs) for standardized tasks, their multidimensional performance in real-world oncology decision-making remains unevaluated. This study aims to introduce a framework for evaluating LLMs and physician decisions in challenging lung cancer cases. Methods: We curated 50 challenging lung cancer cases (25 local and 25 published) classified as complex, rare, or refractory. Blinded three-dimensional, five-point Likert evaluations (1–5 for comprehensiveness, specificity, and readability) compared standalone LLMs (DeepSeek R1, Claude 3.5, Gemini 1.5, and GPT-4o), physicians by experience level (junior, intermediate, and senior), and AI-assisted juniors; intergroup differences and augmentation effects were analyzed statistically. Results: Of 50 challenging cases (18 complex, 17 rare, and 15 refractory) rated by three experts, DeepSeek R1 achieved scores of 3.95±0.33, 3.71±0.53, and 4.26±0.18 for comprehensiveness, specificity, and readability, respectively, positioning it between intermediate (3.68, 3.68, 3.75) and senior (4.50, 4.64, 4.53) physicians. GPT-4o and Claude 3.5 reached intermediate physician-level comprehensiveness (3.76±0.39, 3.60±0.39) but junior-to-intermediate physician-level specificity (3.39±0.39, 3.39±0.49). All LLMs scored higher on rare cases than intermediate physicians but fell below junior physicians in refractory-case specificity. AI-assisted junior physicians showed marked gains in rare cases, with comprehensiveness rising from 2.32 to 4.29 (84.8%), specificity from 2.24 to 4.26 (90.8%), and readability from 2.76 to 4.59 (66.0%), while specificity declined by 3.2% (3.17 to 3.07) in refractory cases. Error analysis showed complementary strengths, with physicians demonstrating reasoning stability and LLMs excelling in knowledge updating and risk management. Conclusions: LLMs performed variably in clinical decision-making tasks depending on case type, performing better in rare cases and worse in refractory cases requiring longitudinal reasoning. Complementary strengths between LLMs and physicians support case- and task-tailored human-AI collaboration.
Keywords: large language models; clinical evaluation; decision-making; lung cancer
Command-agent: Reconstructing warfare simulation and command decision-making using large language models
14
Authors: Mengwei Zhang, Minchi Kuang, Heng Shi, Jihong Zhu, Jingyu Zhu, Xiao Jiang. Defence Technology (防务技术), 2026, No. 2, pp. 294-313 (20 pages)
War rehearsals have become increasingly important in national security due to the growing complexity of international affairs. However, traditional rehearsal methods, such as military chess simulations, are inefficient and inflexible, with particularly pronounced limitations in command and decision-making. The overwhelming volume of information and high decision complexity hinder the realization of autonomous and agile command and control. To address this challenge, an intelligent warfare simulation framework named Command-Agent is proposed, which deeply integrates large language models (LLMs) with digital twin battlefields. By constructing a highly realistic battlefield environment through real-time simulation and multi-source data fusion, the natural language interaction capabilities of LLMs are leveraged to lower the command threshold and to enable autonomous command through the Observe-Orient-Decide-Act (OODA) feedback loop. Within the Command-Agent framework, a multi-model collaborative architecture is further adopted to decouple the decision-generation and command-execution functions of LLMs. By combining specialized models such as DeepSeek-R1 and MCTool, the limitations of single-model capabilities are overcome. MCTool is a lightweight execution model fine-tuned for military Function Calling tasks. The framework also introduces a Vector Knowledge Base to mitigate hallucinations commonly exhibited by LLMs. Experimental results demonstrate that Command-Agent not only enables natural language-driven simulation and control but also deeply understands commander intent. Leveraging the multi-model collaborative architecture, during red-blue UAV confrontations involving 2 to 8 UAVs, the integrated score is improved by an average of 41.8% compared to the single-agent system (MCTool), accompanied by a 161.8% optimization in the battle loss ratio. Furthermore, when compared with multi-agent systems lacking the knowledge base, the inclusion of the Vector Knowledge Base further improves overall performance by 16.8%. In comparison with the general model (Qwen2.5-7B), the fine-tuned MCTool leads by 5% in execution efficiency. Therefore, the proposed Command-Agent introduces a novel perspective to the military command system and offers a feasible solution for intelligent battlefield decision-making.
Keywords: digital twin battlefield; large language models; multi-agent system; military command
Prompt Injection Attacks on Large Language Models: A Survey of Attack Methods, Root Causes, and Defense Strategies
15
Authors: Tongcheng Geng, Zhiyuan Xu, Yubin Qu, W. Eric Wong. Computers, Materials & Continua, 2026, Issue 4, pp. 134-185 (52 pages)
Large language models (LLMs) have revolutionized AI applications across diverse domains. However, their widespread deployment has introduced critical security vulnerabilities, particularly prompt injection attacks that manipulate model behavior through malicious instructions. Following Kitchenham’s guidelines, this systematic review synthesizes 128 peer-reviewed studies from 2022 to 2025 to provide a unified understanding of this rapidly evolving threat landscape. Our findings reveal a swift progression from simple direct injections to sophisticated multimodal attacks, achieving over 90% success rates against unprotected systems. In response, defense mechanisms show varying effectiveness: input preprocessing achieves 60%–80% detection rates and advanced architectural defenses demonstrate up to 95% protection against known patterns, though significant gaps persist against novel attack vectors. We identified 37 distinct defense approaches across three categories, but standardized evaluation frameworks remain limited. Our analysis attributes these vulnerabilities to fundamental LLM architectural limitations, such as the inability to distinguish instructions from data and attention mechanism vulnerabilities. This highlights critical research directions such as formal verification methods, standardized evaluation protocols, and architectural innovations for inherently secure LLM designs.
Keywords: Prompt injection attacks; large language models; defense mechanisms; security evaluation
OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation
16
Authors: Jinzheng Yu, Yang Xu, Haozhen Li, Junqi Li, Ligu Zhu, Hao Shen, Lei Shi. Computers, Materials & Continua, 2026, Issue 4, pp. 1403-1427 (25 pages)
Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models (LLMs) enable automated report generation, this specific domain lacks formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-Gen) task and construct OPOR-Bench, an event-centric dataset with 463 crisis events across 108 countries (comprising 8.8K news articles and 185K tweets). To evaluate report quality, we propose OPOR-Eval, a novel agent-based framework that simulates human expert evaluation. Validation experiments show OPOR-Eval achieves a high Spearman’s correlation (ρ = 0.70) with human judgments, though challenges in temporal reasoning persist. This work establishes an initial foundation for advancing automated public opinion reporting research.
Keywords: Online public opinion reports; crisis management; large language models; agent-based evaluation
Addressing Prompt Injection in Large Language Models via In-Context Learning
17
Authors: Go Sato, Shusaku Egami, Yasuyuki Tahara, Akihiko Ohsuga, Yuichi Sei. Computers, Materials & Continua, 2026, Issue 5, pp. 2270-2306 (37 pages)
While Large Language Models (LLMs) possess the capability to perform a wide range of tasks, security attacks known as prompt injection and jailbreaking remain critical challenges. Existing defense approaches addressing this problem face challenges such as the over-refusal of prompts that contain harmful vocabulary but are semantically benign, and the limited accuracy improvement in machine learning-based approaches due to the ease of distinguishing benign prompts in existing datasets. Therefore, we propose a multi-LLM agent framework aimed at achieving both the accurate rejection of harmful prompts and appropriate responses to benign prompts. Distinct from prior studies, the proposed method adopts In-Context Learning (ICL) during the learning phase, presenting a novel approach that obviates the need for computationally expensive parameter updates required by conventional fine-tuning. To demonstrate the proposed method’s capability for rapid and easy deployment, this study targets LLMs with insufficient alignment. In the experiments, macro-averaged binary classification metrics were used to comprehensively evaluate harmfulness detection. Experimental results using three LLMs demonstrated that the proposed method achieved performance that surpassed four baselines across all evaluation metrics for the target LLMs, evidencing significant effectiveness with an average improvement of 16.6 points in F1-score compared to the vanilla models. The significance of this study lies in the proposal of a novel approach based on ICL that does not require parameter updates. This framework offers high sustainability in practical deployment, as it allows for the adaptive enhancement of detection performance against continuously evolving attack methods solely through the accumulation of logs, without the necessity of retraining the LLM itself. By mitigating the trade-off between safety and utility, this research contributes to the implementation of robust LLMs.
Keywords: Large language models (LLMs); prompt injection; in-context learning (ICL); multi-agent system
A Survey on Medical Competence Evaluation Benchmarks for Large Language Models
18
Authors: Qiting Wang, Huiru Zou, Haobin Zhang, Yongshun Huang, Junzhang Tian, Weibin Cheng. Health Care Science, 2026, Issue 1, pp. 4-18 (15 pages)
Large language models (LLMs) show considerable potential to revolutionize healthcare through their performance across diverse clinical applications. Given the inherent constraints of LLMs and the critical nature of medical practice, a rigorous and systematic evaluation of their medical competence is imperative. This study presents a comprehensive review of the established methodologies and benchmarks for evaluating the medical competence of LLMs, encompassing a thorough analysis of current assessment practices across medical knowledge, clinical practice competence, and ethical-safety considerations. By integrating clinician competency assessment frameworks into LLM evaluation, we propose a structured tri-dimensional framework that systematically organizes existing evaluation approaches according to medical theoretical knowledge, clinical practice ability, and ethical-safety considerations. Furthermore, this research provides critical insights into future developmental trajectories while establishing foundational frameworks and standardization protocols for the integration of LLMs into medical practice.
Keywords: BENCHMARK; large language model; medical competence; ABSTRACT
LLMKB: Large Language Models with Knowledge Base Augmentation for Conversational Recommendation
19
Authors: FANG Xiu, QIU Sijia, SUN Guohao, LU Jinhu. Journal of Donghua University (English Edition), 2026, Issue 1, pp. 91-103 (13 pages)
Conversational recommender systems (CRSs) focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history. Large language models (LLMs) have shown outstanding performance across various domains, thereby prompting researchers to investigate their applicability in recommendation systems. However, due to the lack of task-specific knowledge and an inefficient feature extraction process, LLMs still have suboptimal performance in recommendation tasks. Therefore, external knowledge sources, such as knowledge graphs (KGs) and knowledge bases (KBs), are often introduced to address the issue of data sparsity. Compared to KGs, KBs possess higher retrieval efficiency, making them more suitable for scenarios where LLMs serve as recommenders. To this end, we introduce a novel framework integrating LLMs with KBs for enhanced retrieval generation, namely LLMKB. LLMKB initially leverages structured knowledge to create mapping dictionaries, extracting entity-relation information from heterogeneous knowledge to construct KBs. Then, LLMKB achieves the embedding calibration between user information representations and documents in KBs through retrieval model fine-tuning. Finally, LLMKB employs retrieval-augmented generation to produce recommendations based on fused text inputs, followed by post-processing. Experimental results on two public CRS datasets demonstrate the effectiveness of our framework. Our code is publicly available at https://anonymous.4open.science/r/LLMKB-6FD0.
Keywords: recommender system; large language model (LLM); knowledge base (KB)
Benchmarking of Large Language Models for the Dental Admission Test
20
Authors: Yu Hou, Jay Patel, Liya Dai, Emily Zhang, Yang Liu, Zaifu Zhan, Pooja Gangwani, Rui Zhang. Health Data Science, 2025, Issue 1, pp. 216-224 (9 pages)
Background: Large language models (LLMs) have shown promise in educational applications, but their performance on high-stakes admissions tests, such as the Dental Admission Test (DAT), remains unclear. Understanding the capabilities and limitations of these models is critical for determining their suitability in test preparation. Methods: This study evaluated the ability of 16 LLMs, including general-purpose models (e.g., GPT-3.5, GPT-4, GPT-4o, GPT-o1, Google’s Bard, mistral-large, and Claude), domain-specific fine-tuned models (e.g., DentalGPT, MedGPT, and BioGPT), and open-source models (e.g., Llama2-7B, Llama2-13B, Llama2-70B, Llama3-8B, and Llama3-70B), to answer questions from a sample DAT. Quantitative analysis was performed to assess model accuracy in different sections, and qualitative thematic analysis by subject matter experts examined specific challenges encountered by the models. Results: GPT-4o and GPT-o1 outperformed others in text-based questions assessing knowledge and comprehension, with GPT-o1 achieving perfect scores in the natural sciences (NS) and reading comprehension (RC) sections. Open-source models such as Llama3-70B also performed competitively in RC tasks. However, all models, including GPT-4o, struggled substantially with perceptual ability (PA) items, highlighting a persistent limitation in handling image-based tasks requiring visual-spatial reasoning. Fine-tuned medical models (e.g., DentalGPT, MedGPT, and BioGPT) demonstrated moderate success in text-based tasks but underperformed in areas requiring critical thinking and reasoning. Thematic analysis identified key challenges, including difficulties with stepwise problem-solving, transferring knowledge, comprehending intricate questions, and hallucinations, particularly on advanced items. Conclusions: While LLMs show potential for reinforcing factual knowledge and supporting learners, their limitations in handling higher-order cognitive tasks and image-based reasoning underscore the need for judicious integration with instructor-led guidance and targeted practice. This study provides valuable insights into the capabilities and limitations of current LLMs in preparing prospective dental students and highlights pathways for future innovations to improve performance across all cognitive skills assessed by the DAT.
Keywords: capabilities limitations models dental admission test language models llms BENCHMARKING performance dental admission large language models high stakes tests