Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interact...Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interactions to predict future items of interest.However,many current methods rely on unique user and item IDs,limiting their ability to represent users and items effectively,especially in zero-shot learning scenarios where training data is scarce.With the rapid development of Large Language Models(LLMs),researchers are exploring their potential to enhance recommendation systems.However,there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems,where items are typically indexed by IDs.Moreover,most research focuses on item representations,neglecting personalized user modeling.To address these issues,we propose a sequential recommendation framework using LLMs,called CIT-Rec,a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations.Specifically,by aligning intuitive image information with text containing semantic features,we can more accurately represent items,improving item representation quality.We focus not only on item representations but also on user representations.To more precisely capture users’personalized preferences,we use traditional sequential recommendation models to train on users’historical interaction data,effectively capturing behavioral patterns.Finally,by combining LLMs and traditional sequential recommendation models,we allow the LLM to understand linguistic semantics while capturing collaborative semantics.Extensive evaluations on real-world datasets show that our model outperforms baseline methods,effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.展开更多
This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to use...This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.展开更多
Background:Assess ChatGPT and Bard's effectiveness in the initial identification of articles for Otolaryngology—Head and Neck Surgery systematic literature reviews.Methods:Three PRISMA-based systematic reviews(Ja...Background:Assess ChatGPT and Bard's effectiveness in the initial identification of articles for Otolaryngology—Head and Neck Surgery systematic literature reviews.Methods:Three PRISMA-based systematic reviews(Jabbour et al.2017,Wong et al.2018,and Wu et al.2021)were replicated using ChatGPTv3.5 and Bard.Outputs(author,title,publication year,and journal)were compared to the original references and cross-referenced with medical databases for authenticity and recall.Results:Several themes emerged when comparing Bard and ChatGPT across the three reviews.Bard generated more outputs and had greater recall in Wong et al.'s review,with a broader date range in Jabbour et al.'s review.In Wu et al.'s review,ChatGPT-2 had higher recall and identified more authentic outputs than Bard-2.Conclusion:Large language models(LLMs)failed to fully replicate peer-reviewed methodologies,producing outputs with inaccuracies but identifying relevant,especially recent,articles missed by the references.While human-led PRISMA-based reviews remain the gold standard,refining LLMs for literature reviews shows potential.展开更多
Large language models(LLMs)show considerable potential to revolutionize healthcare through their performance across diverse clinical applications.Given the inherent constraints of LLMs and the critical nature of medic...Large language models(LLMs)show considerable potential to revolutionize healthcare through their performance across diverse clinical applications.Given the inherent constraints of LLMs and the critical nature of medical practice,a rigorous and systematic evaluation of their medical competence is imperative.This study presents a comprehensive review of the established methodologies and benchmarks for evaluating the medical competence of LLMs,encompassing a thorough analysis of current assessment practices across medical knowledge,clinical practice competence,and ethical-safety considerations.By integrating clinician competency assessment frameworks into LLMs evaluation,we propose a structured tri-dimensional framework that systematically organizes existing evaluation approaches according to medical theoretical knowledge,clinical practice ability,and ethical-safety considerations.Furthermore,this research provides critical insights into future developmental trajectories while establishing foundational frameworks and standardization protocols for the integration of LLMs into medical practice.展开更多
War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient an...War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient and inflexible,with particularly pronounced limitations in command and decision-making.The overwhelming volume of information and high decision complexity hinder the realization of autonomous and agile command and control.To address this challenge,an intelligent warfare simulation framework named Command-Agent is proposed,which deeply integrates large language models(LLMs)with digital twin battlefields.By constructing a highly realistic battlefield environment through real-time simulation and multi-source data fusion,the natural language interaction capabilities of LLMs are leveraged to lower the command threshold and to enable autonomous command through the Observe-Orient-Decide-Act(OODA)feedback loop.Within the Command-Agent framework,a multimodel collaborative architecture is further adopted to decouple the decision-generation and command-execution functions of LLMs.By combining specialized models such as Deep Seek-R1 and MCTool,the limitations of single-model capabilities are overcome.MCTool is a lightweight execution model fine-tuned for military Function Calling tasks.The framework also introduces a Vector Knowledge Base to mitigate hallucinations commonly exhibited by LLMs.Experimental results demonstrate that Command-Agent not only enables natural language-driven simulation and control but also deeply understands commander intent.Leveraging the multi-model collaborative architecture,during red-blue UAV confrontations involving 2 to 8 UAVs,the integrated score is improved by an average of 41.8%compared to the single-agent system(MCTool),accompanied by a 161.8%optimization in the battle loss ratio.Furthermore,when compared with multi-agent systems lacking the knowledge base,the inclusion of the Vector Knowledge Base further improves overall performance by 16.8%.In comparison with the general model(Qwen2.5-7B),the fine-tuned MCTool leads by 5%in execution efficiency.Therefore,the proposed Command-Agent introduces a novel perspective to the military command system and offers a feasible solution for intelligent battlefield decision-making.展开更多
Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that man...Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that manipulate model behavior through malicious instructions.Following Kitchenham’s guidelines,this systematic review synthesizes 128 peer-reviewed studies from 2022 to 2025 to provide a unified understanding of this rapidly evolving threat landscape.Our findings reveal a swift progression from simple direct injections to sophisticated multimodal attacks,achieving over 90%success rates against unprotected systems.In response,defense mechanisms show varying effectiveness:input preprocessing achieves 60%–80%detection rates and advanced architectural defenses demonstrate up to 95%protection against known patterns,though significant gaps persist against novel attack vectors.We identified 37 distinct defense approaches across three categories,but standardized evaluation frameworks remain limited.Our analysis attributes these vulnerabilities to fundamental LLM architectural limitations,such as the inability to distinguish instructions from data and attention mechanism vulnerabilities.This highlights critical research directions such as formal verification methods,standardized evaluation protocols,and architectural innovations for inherently secure LLM designs.展开更多
Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lack...Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lacks formal task definitions and corresponding benchmarks.To bridge this gap,we define the Automated Online Public Opinion Report Generation(OPOR-Gen)task and construct OPOR-Bench,an event-centric dataset with 463 crisis events across 108 countries(comprising 8.8 K news articles and 185 K tweets).To evaluate report quality,we propose OPOR-Eval,a novel agent-based framework that simulates human expert evaluation.Validation experiments show OPOR-Eval achieves a high Spearman’s correlation(ρ=0.70)with human judgments,though challenges in temporal reasoning persist.This work establishes an initial foundation for advancing automated public opinion reporting research.展开更多
LargeLanguageModels(LLMs)are increasingly appliedinthe fieldof code translation.However,existing evaluation methodologies suffer from two major limitations:(1)the high overlap between test data and pretraining corpora...LargeLanguageModels(LLMs)are increasingly appliedinthe fieldof code translation.However,existing evaluation methodologies suffer from two major limitations:(1)the high overlap between test data and pretraining corpora,which introduces significant bias in performance evaluation;and(2)mainstream metrics focus primarily on surface-level accuracy,failing to uncover the underlying factors that constrain model capabilities.To address these issues,this paper presents TCode(Translation-Oriented Code Evaluation benchmark)—a complexity-controllable,contamination-free benchmark dataset for code translation—alongside a dedicated static feature sensitivity evaluation framework.The dataset is carefully designed to control complexity along multiple dimensions—including syntactic nesting and expression intricacy—enabling both broad coverage and fine-grained differentiation of sample difficulty.This design supports precise evaluation of model capabilities across a wide spectrum of translation challenges.The proposed evaluation framework introduces a correlation-driven analysis mechanism based on static program features,enabling predictive modeling of translation success from two perspectives:Code Form Complexity(e.g.,code length and character density)and Semantic Modeling Complexity(e.g.,syntactic depth,control-flow nesting,and type system complexity).Empirical evaluations across representative LLMs—including Qwen2.5-72B and Llama3.3-70B—demonstrate that even state-of-the-art models achieve over 80% compilation success on simple samples,but their accuracy drops sharply below 40% on complex cases.Further correlation analysis indicates that Semantic Modeling Complexity alone is correlated with up to 60% of the variance in translation success,with static program features exhibiting nonlinear threshold effects that highlight clear capability boundaries.This study departs fromthe traditional accuracy-centric evaluation paradigm and,for the first time,systematically characterizes the capabilities of large languagemodels in translation tasks through the lens of programstatic features.The findings provide actionable insights for model refinement and training strategy development.展开更多
Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstan...Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstanding performance across various domains,thereby prompting researchers to investigate their applicability in recommendation systems.However,due to the lack of task-specific knowledge and an inefficient feature extraction process,LLMs still have suboptimal performance in recommendation tasks.Therefore,external knowledge sources,such as knowledge graphs(KGs)and knowledge bases(KBs),are often introduced to address the issue of data sparsity.Compared to KGs,KBs possess higher retrieval efficiency,making them more suitable for scenarios where LLMs serve as recommenders.To this end,we introduce a novel framework integrating LLMs with KBs for enhanced retrieval generation,namely LLMKB.LLMKB initially leverages structured knowledge to create mapping dictionaries,extracting entity-relation information from heterogeneous knowledge to construct KBs.Then,LLMKB achieves the embedding calibration between user information representations and documents in KBs through retrieval model fine-tuning.Finally,LLMKB employs retrievalaugmented generation to produce recommendations based on fused text inputs,followed by post-processing.Experiment results on two public CRS datasets demonstrate the effectiveness of our framework.Our code is publicly available at the link:https://anonymous.4open.science/r/LLMKB-6FD0.展开更多
The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in S...The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources.In this paper,we compare two predominant AI-based approaches for the forensic detection of malicious hate speech:(1)finetuning encoder-only models that have been trained in Spanish and(2)In-Context Learning techniques(Zero-and Few-Shot Learning)with large-scale language models.Our approach goes beyond binary classification,proposing a comprehensive,multidimensional evaluation that labels each text by:(1)type of speech,(2)recipient,(3)level of intensity(ordinal)and(4)targeted group(multi-label).Performance is evaluated using an annotated Spanish corpus,standard metrics such as precision,recall and F1-score and stability-oriented metrics to evaluate the stability of the transition from zero-shot to few-shot prompting(Zero-to-Few Shot Retention and Zero-to-Few Shot Gain)are applied.The results indicate that fine-tuned encoder-only models(notably MarIA and BETO variants)consistently deliver the strongest and most reliable performance:in our experiments their macro F1-scores lie roughly in the range of approximately 46%–66%depending on the task.Zero-shot approaches are much less stable and typically yield substantially lower performance(observed F1-scores range approximately 0%–39%),often producing invalid outputs in practice.Few-shot prompting(e.g.,Qwen 38B,Mistral 7B)generally improves stability and recall relative to pure zero-shot,bringing F1-scores into a moderate range of approximately 20%–51%but still falling short of fully fine-tuned models.These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.展开更多
Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLM...Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.展开更多
Background:Large language models(LLMs)have shown considerable promise in supporting clinical decision-making.However,their adoption and evaluation in dermatology remains limited.This study aimed to explore the prefere...Background:Large language models(LLMs)have shown considerable promise in supporting clinical decision-making.However,their adoption and evaluation in dermatology remains limited.This study aimed to explore the preferences of Chinese dermatologists regarding LLM-generated responses in clinical psoriasis scenarios and to assess how they prioritize key quality dimensions,including accuracy,traceability,and logicality.Methods:A cross-sectional,web-based survey was conducted between December 25,2024,and January 22,2025,following the Checklist for Reporting Results of Internet E-Surveys guidelines.A total of 1247 valid responses were collected from practicing dermatologists across 33 of China's provincial-level administrative divisions.Participants evaluated responses to five categories of clinical questions(etiology,clinical presentation,differential diagnosis,treatment,and case study)generated by five LLMs:ChatGPT-4o,Kimi.ai,Doubao,ZuoYiGPT,and Lingyi-agent.Statistical associations between participant characteristics and model preferences were examined using chi-square tests.Results:ChatGPT-4o(Model 1)emerged as the most preferred model across all clinical tasks,consistently receiving the highest number of votes in case study(n=740),clinical presentation(n=666),differential diagnosis(n=707),etiology(n=602),and treatment(n=656).Significant variation in model preference by professional title was observed only for the differential diagnosis task(χ^(2)=21.13,df=12,p=0.0485),while no significant differences were found across hospital tiers(p>0.05).In terms of evaluation dimensions,accuracy was most frequently rated as“very important”(n=635).A significant association existed between hospital tier and the most valued dimension(χ^(2)=27.667,df=9,p=0.0011),with dermatologists in primary hospitals prioritizing traceability more than their peers in higher-tier hospitals.No significant associations were found across professional titles(p=0.127).Conclusions:Chinese dermatologists suggest a strong preference for ChatGPT-4o over domestic LLMs in psoriasis-related clinical tasks.While accuracy remains the primary criterion,traceability and logicality are also critical,particularly for clinicians in lower-tier hospitals.These findings suggest that future clinical LLMs should prioritize not only content accuracy but also source transparency and structural clarity to meet the diverse needs of different clinical settings.展开更多
Accurate forecasting of tropical cyclone(TC)tracks and intensities is essential.Although the TianXing large weather model,a six-hourly forecasting model surpassing operational forecasts,exhibits superior performance,i...Accurate forecasting of tropical cyclone(TC)tracks and intensities is essential.Although the TianXing large weather model,a six-hourly forecasting model surpassing operational forecasts,exhibits superior performance,its TC forecasts still require enhancement.Prediction errors persist due to biases in the training data and smoothing effects in data-driven methods.To address this,we introduce CycloneBCNet,a deep-learning model designed to correct TianXing’s TC forecast biases by leveraging spatial and temporal data.CycloneBCNet utilizes the SimVP(simpler yet better video prediction)framework with spatial attention to highlight cyclone core regions in forecast fields.It also incorporates TC trend information(center position,maximum wind speed,and minimum sea level pressure)via an LSTM(long short-term memory)module.These TC vectors are derived from post-processed TianXing forecasts.By fusing features from forecast fields and TC vectors,CycloneBCNet corrects biases across multiple lead times.At a 96-h lead time,the track error reduces from 162.4 to 86.4 km,the wind speed error from 17.2 to 6.69 m s^(-1),and the pressure error from 22.2 to 9.36 hPa.Interpretability analysis shows that CycloneBCNet adjusts its attention across forecast lead times.Intensity corrections prioritize inner-core dynamics,particularly the eye and eyewall,while track corrections shift from lower-level variables and the cyclone’s core to broader environmental factors and mid-to upper-level features as the forecast duration increases.These findings demonstrate that CycloneBCNet effectively captures key TC dynamics consistent with meteorological principles,including the dominance of near-surface conditions for intensity and the increasing influence of steering currents on track prediction.展开更多
Objective To develop a clinical decision and prescription generation system(CDPGS)specifically for diarrhea in traditional Chinese medicine(TCM),utilizing a specialized large language model(LLM),Qwen-TCM-Dia,to standa...Objective To develop a clinical decision and prescription generation system(CDPGS)specifically for diarrhea in traditional Chinese medicine(TCM),utilizing a specialized large language model(LLM),Qwen-TCM-Dia,to standardize diagnostic processes and prescription generation.Methods Two primary datasets were constructed:an evaluation benchmark and a fine-tuning dataset consisting of fundamental diarrhea knowledge,medical records,and chain-ofthought(CoT)reasoning datasets.After an initial evaluation of 16 open-source LLMs across inference time,accuracy,and output quality,Qwen2.5 was selected as the base model due to its superior overall performance.We then employed a two-stage low-rank adaptation(LoRA)fine-tuning strategy,integrating continued pre-training on domain-specific knowledge with instruction fine-tuning using CoT-enriched medical records.This approach was designed to embed the clinical logic(symptoms→pathogenesis→therapeutic principles→prescriptions)into the model’s reasoning capabilities.The resulting fine-tuned model,specialized for TCM diarrhea,was designated as Qwen-TCM-Dia.Model performance was evaluated for disease diagnosis and syndrome type differentiation using accuracy,precision,recall,and F1-score.Furthermore,the quality of the generated prescriptions was compared with that of established open-source TCM LLMs.Results Qwen-TCM-Dia achieved peak performance compared to both the base Qwen2.5 model and five other open-source TCM LLMs.It achieved 97.05%accuracy and 91.48%F1-score in disease diagnosis,and 74.54%accuracy and 74.21%F1-score in syndrome type differentiation.Compared with existing open-source TCM LLMs(BianCang,HuangDi,LingDan,TCMLLM-PR,and ZhongJing),Qwen-TCM-Dia exhibited higher fidelity in reconstructing the“symptoms→pathogenesis→therapeutic principles→prescriptions”logic chain.It provided complete prescriptions,whereas other models often omitted dosages or generated mismatched prescriptions.Conclusion By integrating continued pre-training,CoT reasoning,and a two-stage fine-tuning strategy,this study establishes a CDPGS for diarrhea in TCM.The results demonstrate the synergistic effect of strengthening domain representation through pre-training and activating logical reasoning via CoT.This research not only provides critical technical support for the standardized diagnosis and treatment of diarrhea but also offers a scalable paradigm for the digital inheritance of expert TCM experience and the intelligent transformation of TCM.展开更多
Magnesium hydride(MgH_(2)),a promising high-capacity hydrogen storage material,is hindered by slow dehydrogenation kinetics.AIdriven catalyst discovery to address this is often hampered by the laborious extraction of ...Magnesium hydride(MgH_(2)),a promising high-capacity hydrogen storage material,is hindered by slow dehydrogenation kinetics.AIdriven catalyst discovery to address this is often hampered by the laborious extraction of data from unstructured literature.To overcome this,we introduce a transformative“LLM to Agent”framework that synergistically integrates Large Language Models(LLMs)for automated data curation with Machine Learning(ML)for predictive design.We automatically constructed a comprehensive database of 809 MgH_(2)catalysts(6555 data rows)with high fidelity and an~40-fold acceleration over manual methods.The resulting ML models achieved high accuracy(average R^(2)>0.91)in predicting dehydrogenation temperature and activation energy,subsequently guiding a Genetic Algorithm(GA)in an exploratory inverse design that autonomously uncovered key design principles for high-performance catalysts.Encouragingly,a strong alignment was found between these AI-discovered principles and the design strategies of recently reported,state-of-the-art experimental systems,providing substantial evidence for the validity of our approach.The framework culminates in Cat-Advisor,a novel,domain-adapted multi-agent system.Cat-Advisor translates ML predictions and retrieval-augmented knowledge into actionable design guidance,demonstrating capabilities that surpass those of general-purpose LLMs in this specialized domain.This work delivers a practical AI toolkit for accelerated materials discovery and advances the emerging Agent-based paradigm for designing next-generation energy technologies.展开更多
Objective To develop QingNangTCM,a specialized large language model(LLM)tailored for expert-level traditional Chinese medicine(TCM)question-answering and clinical reasoning,addressing the scarcity of domain-specific c...Objective To develop QingNangTCM,a specialized large language model(LLM)tailored for expert-level traditional Chinese medicine(TCM)question-answering and clinical reasoning,addressing the scarcity of domain-specific corpora and specialized alignment.Methods We constructed QnTCM_Dataset,a corpus of 100000 entries,by integrating data from ShenNong_TCM_Dataset and SymMap v2.0,and synthesizing additional samples via retrieval-augmented generation(RAG)and persona-driven generation.The dataset comprehensively covers diagnostic inquiries,prescriptions,and herbal knowledge.Utilizing P-Tuning v2,we fine-tuned the GLM-4-9B-Chat backbone to develop QingNangTCM.A multidimensional evaluation framework,assessing accuracy,coverage,consistency,safety,professionalism,and fluency,was established using metrics such as bilingual evaluation understudy(BLEU),recall-oriented understudy for gisting evaluation(ROUGE),metric for evaluation of translation with explicit ordering(METEOR),and LLM-as-a-Judge with expert review.Qualitative analysis was conducted across four simulated clinical scenarios:symptom analysis,disease treatment,herb inquiry,and failure cases.Baseline models included GLM-4-9BChat,DeepSeek-V2,HuatuoGPT-II(7B),and GLM-4-9B-Chat(freeze-tuning).Results QingNangTCM achieved the highest scores in BLEU-1/2/3/4(0.425/0.298/0.137/0.064),ROUGE-1/2(0.368/0.157),and METEOR(0.218),demonstrating a balanced and superior normalized performance profile of 0.900 across the dimensions of accuracy,coverage,and consistency.Although its ROUGE-L score(0.299)was lower than that of HuatuoGPT-II(7B)(0.351),it significantly outperformed domain-specific models in expert-validated win rates for professionalism(86%)and safety(73%).Qualitative analysis confirmed that the model strictly adheres to the“symptom-syndrome-pathogenesis-treatment”reasoning chain,though occasional misclassifications and hallucinations persisted when dealing with rare medicinal materials and uncommon syndromes.展开更多
Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and langua...Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.展开更多
Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, ...Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.展开更多
ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential sec...ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.展开更多
The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.De...The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.展开更多
基金supported by the National Key R&D Program of China[2022YFF0902703]the State Administration for Market Regulation Science and Technology Plan Project(2024MK033).
文摘Recommendation systems are key to boosting user engagement,satisfaction,and retention,particularly on media platforms where personalized content is vital.Sequential recommendation systems learn from user-item interactions to predict future items of interest.However,many current methods rely on unique user and item IDs,limiting their ability to represent users and items effectively,especially in zero-shot learning scenarios where training data is scarce.With the rapid development of Large Language Models(LLMs),researchers are exploring their potential to enhance recommendation systems.However,there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems,where items are typically indexed by IDs.Moreover,most research focuses on item representations,neglecting personalized user modeling.To address these issues,we propose a sequential recommendation framework using LLMs,called CIT-Rec,a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations.Specifically,by aligning intuitive image information with text containing semantic features,we can more accurately represent items,improving item representation quality.We focus not only on item representations but also on user representations.To more precisely capture users’personalized preferences,we use traditional sequential recommendation models to train on users’historical interaction data,effectively capturing behavioral patterns.Finally,by combining LLMs and traditional sequential recommendation models,we allow the LLM to understand linguistic semantics while capturing collaborative semantics.Extensive evaluations on real-world datasets show that our model outperforms baseline methods,effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
基金funded by the Office of the Vice-President for Research and Development of Cebu Technological University.
文摘This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.
文摘Background:Assess ChatGPT and Bard's effectiveness in the initial identification of articles for Otolaryngology—Head and Neck Surgery systematic literature reviews.Methods:Three PRISMA-based systematic reviews(Jabbour et al.2017,Wong et al.2018,and Wu et al.2021)were replicated using ChatGPTv3.5 and Bard.Outputs(author,title,publication year,and journal)were compared to the original references and cross-referenced with medical databases for authenticity and recall.Results:Several themes emerged when comparing Bard and ChatGPT across the three reviews.Bard generated more outputs and had greater recall in Wong et al.'s review,with a broader date range in Jabbour et al.'s review.In Wu et al.'s review,ChatGPT-2 had higher recall and identified more authentic outputs than Bard-2.Conclusion:Large language models(LLMs)failed to fully replicate peer-reviewed methodologies,producing outputs with inaccuracies but identifying relevant,especially recent,articles missed by the references.While human-led PRISMA-based reviews remain the gold standard,refining LLMs for literature reviews shows potential.
基金Guangzhou Science and Technology Program,Grant/Award Numbers:2025B03J0110,2024A03J1074,2024A03J0927。
文摘Large language models(LLMs)show considerable potential to revolutionize healthcare through their performance across diverse clinical applications.Given the inherent constraints of LLMs and the critical nature of medical practice,a rigorous and systematic evaluation of their medical competence is imperative.This study presents a comprehensive review of the established methodologies and benchmarks for evaluating the medical competence of LLMs,encompassing a thorough analysis of current assessment practices across medical knowledge,clinical practice competence,and ethical-safety considerations.By integrating clinician competency assessment frameworks into LLMs evaluation,we propose a structured tri-dimensional framework that systematically organizes existing evaluation approaches according to medical theoretical knowledge,clinical practice ability,and ethical-safety considerations.Furthermore,this research provides critical insights into future developmental trajectories while establishing foundational frameworks and standardization protocols for the integration of LLMs into medical practice.
文摘War rehearsals have become increasingly important in national security due to the growing complexity of international affairs.However,traditional rehearsal methods,such as military chess simulations,are inefficient and inflexible,with particularly pronounced limitations in command and decision-making.The overwhelming volume of information and high decision complexity hinder the realization of autonomous and agile command and control.To address this challenge,an intelligent warfare simulation framework named Command-Agent is proposed,which deeply integrates large language models(LLMs)with digital twin battlefields.By constructing a highly realistic battlefield environment through real-time simulation and multi-source data fusion,the natural language interaction capabilities of LLMs are leveraged to lower the command threshold and to enable autonomous command through the Observe-Orient-Decide-Act(OODA)feedback loop.Within the Command-Agent framework,a multimodel collaborative architecture is further adopted to decouple the decision-generation and command-execution functions of LLMs.By combining specialized models such as Deep Seek-R1 and MCTool,the limitations of single-model capabilities are overcome.MCTool is a lightweight execution model fine-tuned for military Function Calling tasks.The framework also introduces a Vector Knowledge Base to mitigate hallucinations commonly exhibited by LLMs.Experimental results demonstrate that Command-Agent not only enables natural language-driven simulation and control but also deeply understands commander intent.Leveraging the multi-model collaborative architecture,during red-blue UAV confrontations involving 2 to 8 UAVs,the integrated score is improved by an average of 41.8%compared to the single-agent system(MCTool),accompanied by a 161.8%optimization in the battle loss ratio.Furthermore,when compared with multi-agent systems lacking the knowledge base,the inclusion of the Vector Knowledge Base further improves overall performance by 16.8%.In comparison with the general model(Qwen2.5-7B),the fine-tuned MCTool leads by 5%in execution efficiency.Therefore,the proposed Command-Agent introduces a novel perspective to the military command system and offers a feasible solution for intelligent battlefield decision-making.
基金supported by 2023 Higher Education Scientific Research Planning Project of China Society of Higher Education(No.23PG0408)2023 Philosophy and Social Science Research Programs in Jiangsu Province(No.2023SJSZ0993)+2 种基金Nantong Science and Technology Project(No.JC2023070)Key Project of Jiangsu Province Education Science 14th Five-Year Plan(Grant No.B-b/2024/02/41)the Open Fund of Advanced Cryptography and System Security Key Laboratory of Sichuan Province(Grant No.SKLACSS-202407).
文摘Large language models(LLMs)have revolutionized AI applications across diverse domains.However,their widespread deployment has introduced critical security vulnerabilities,particularly prompt injection attacks that manipulate model behavior through malicious instructions.Following Kitchenham’s guidelines,this systematic review synthesizes 128 peer-reviewed studies from 2022 to 2025 to provide a unified understanding of this rapidly evolving threat landscape.Our findings reveal a swift progression from simple direct injections to sophisticated multimodal attacks,achieving over 90%success rates against unprotected systems.In response,defense mechanisms show varying effectiveness:input preprocessing achieves 60%–80%detection rates and advanced architectural defenses demonstrate up to 95%protection against known patterns,though significant gaps persist against novel attack vectors.We identified 37 distinct defense approaches across three categories,but standardized evaluation frameworks remain limited.Our analysis attributes these vulnerabilities to fundamental LLM architectural limitations,such as the inability to distinguish instructions from data and attention mechanism vulnerabilities.This highlights critical research directions such as formal verification methods,standardized evaluation protocols,and architectural innovations for inherently secure LLM designs.
基金supported by the Fundamental Research Funds for the Central Universities(No.CUC25SG013)the Foundation of Key Laboratory of Education Informatization for Nationalities(Yunnan Normal University),Ministry of Education(No.EIN2024C006).
文摘Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises.While large language models(LLMs)enable automated report generation,this specific domain lacks formal task definitions and corresponding benchmarks.To bridge this gap,we define the Automated Online Public Opinion Report Generation(OPOR-Gen)task and construct OPOR-Bench,an event-centric dataset with 463 crisis events across 108 countries(comprising 8.8 K news articles and 185 K tweets).To evaluate report quality,we propose OPOR-Eval,a novel agent-based framework that simulates human expert evaluation.Validation experiments show OPOR-Eval achieves a high Spearman’s correlation(ρ=0.70)with human judgments,though challenges in temporal reasoning persist.This work establishes an initial foundation for advancing automated public opinion reporting research.
文摘LargeLanguageModels(LLMs)are increasingly appliedinthe fieldof code translation.However,existing evaluation methodologies suffer from two major limitations:(1)the high overlap between test data and pretraining corpora,which introduces significant bias in performance evaluation;and(2)mainstream metrics focus primarily on surface-level accuracy,failing to uncover the underlying factors that constrain model capabilities.To address these issues,this paper presents TCode(Translation-Oriented Code Evaluation benchmark)—a complexity-controllable,contamination-free benchmark dataset for code translation—alongside a dedicated static feature sensitivity evaluation framework.The dataset is carefully designed to control complexity along multiple dimensions—including syntactic nesting and expression intricacy—enabling both broad coverage and fine-grained differentiation of sample difficulty.This design supports precise evaluation of model capabilities across a wide spectrum of translation challenges.The proposed evaluation framework introduces a correlation-driven analysis mechanism based on static program features,enabling predictive modeling of translation success from two perspectives:Code Form Complexity(e.g.,code length and character density)and Semantic Modeling Complexity(e.g.,syntactic depth,control-flow nesting,and type system complexity).Empirical evaluations across representative LLMs—including Qwen2.5-72B and Llama3.3-70B—demonstrate that even state-of-the-art models achieve over 80% compilation success on simple samples,but their accuracy drops sharply below 40% on complex cases.Further correlation analysis indicates that Semantic Modeling Complexity alone is correlated with up to 60% of the variance in translation success,with static program features exhibiting nonlinear threshold effects that highlight clear capability boundaries.This study departs fromthe traditional accuracy-centric evaluation paradigm and,for the first time,systematically characterizes the capabilities of large languagemodels in translation tasks through the lens of programstatic features.The findings provide actionable insights for model refinement and training strategy development.
文摘Conversational recommender systems(CRSs)focus on refining preferences and providing personalized recommendations through natural language interactions and dialogue history.Large language models(LLMs)have shown outstanding performance across various domains,thereby prompting researchers to investigate their applicability in recommendation systems.However,due to the lack of task-specific knowledge and an inefficient feature extraction process,LLMs still have suboptimal performance in recommendation tasks.Therefore,external knowledge sources,such as knowledge graphs(KGs)and knowledge bases(KBs),are often introduced to address the issue of data sparsity.Compared to KGs,KBs possess higher retrieval efficiency,making them more suitable for scenarios where LLMs serve as recommenders.To this end,we introduce a novel framework integrating LLMs with KBs for enhanced retrieval generation,namely LLMKB.LLMKB initially leverages structured knowledge to create mapping dictionaries,extracting entity-relation information from heterogeneous knowledge to construct KBs.Then,LLMKB achieves the embedding calibration between user information representations and documents in KBs through retrieval model fine-tuning.Finally,LLMKB employs retrievalaugmented generation to produce recommendations based on fused text inputs,followed by post-processing.Experiment results on two public CRS datasets demonstrate the effectiveness of our framework.Our code is publicly available at the link:https://anonymous.4open.science/r/LLMKB-6FD0.
基金the research project LaTe4PoliticES(PID2022-138099OB-I00)funded by MCIN/AEI/10.13039/501100011033 and the European Fund for Regional Development(ERDF)-a way to make Europe.Tomás Bernal-Beltrán is supported by University of Murcia through the predoctoral programme.
文摘The malicious dissemination of hate speech via compromised accounts,automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern.Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources.In this paper,we compare two predominant AI-based approaches for the forensic detection of malicious hate speech:(1)finetuning encoder-only models that have been trained in Spanish and(2)In-Context Learning techniques(Zero-and Few-Shot Learning)with large-scale language models.Our approach goes beyond binary classification,proposing a comprehensive,multidimensional evaluation that labels each text by:(1)type of speech,(2)recipient,(3)level of intensity(ordinal)and(4)targeted group(multi-label).Performance is evaluated using an annotated Spanish corpus,standard metrics such as precision,recall and F1-score and stability-oriented metrics to evaluate the stability of the transition from zero-shot to few-shot prompting(Zero-to-Few Shot Retention and Zero-to-Few Shot Gain)are applied.The results indicate that fine-tuned encoder-only models(notably MarIA and BETO variants)consistently deliver the strongest and most reliable performance:in our experiments their macro F1-scores lie roughly in the range of approximately 46%–66%depending on the task.Zero-shot approaches are much less stable and typically yield substantially lower performance(observed F1-scores range approximately 0%–39%),often producing invalid outputs in practice.Few-shot prompting(e.g.,Qwen 38B,Mistral 7B)generally improves stability and recall relative to pure zero-shot,bringing F1-scores into a moderate range of approximately 20%–51%but still falling short of fully fine-tuned models.These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
文摘Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.
基金National Key Research and Development Program of China,Grant/Award Number:2024YFF0507404Special Clinical Business Fund for High-Level Hospitals of China-Japan Friendship Hospital,Grant/Award Number:2024-NHLHCRF-TS-01。
文摘Background:Large language models(LLMs)have shown considerable promise in supporting clinical decision-making.However,their adoption and evaluation in dermatology remains limited.This study aimed to explore the preferences of Chinese dermatologists regarding LLM-generated responses in clinical psoriasis scenarios and to assess how they prioritize key quality dimensions,including accuracy,traceability,and logicality.Methods:A cross-sectional,web-based survey was conducted between December 25,2024,and January 22,2025,following the Checklist for Reporting Results of Internet E-Surveys guidelines.A total of 1247 valid responses were collected from practicing dermatologists across 33 of China's provincial-level administrative divisions.Participants evaluated responses to five categories of clinical questions(etiology,clinical presentation,differential diagnosis,treatment,and case study)generated by five LLMs:ChatGPT-4o,Kimi.ai,Doubao,ZuoYiGPT,and Lingyi-agent.Statistical associations between participant characteristics and model preferences were examined using chi-square tests.Results:ChatGPT-4o(Model 1)emerged as the most preferred model across all clinical tasks,consistently receiving the highest number of votes in case study(n=740),clinical presentation(n=666),differential diagnosis(n=707),etiology(n=602),and treatment(n=656).Significant variation in model preference by professional title was observed only for the differential diagnosis task(χ^(2)=21.13,df=12,p=0.0485),while no significant differences were found across hospital tiers(p>0.05).In terms of evaluation dimensions,accuracy was most frequently rated as“very important”(n=635).A significant association existed between hospital tier and the most valued dimension(χ^(2)=27.667,df=9,p=0.0011),with dermatologists in primary hospitals prioritizing traceability more than their peers in higher-tier hospitals.No significant associations were found across professional titles(p=0.127).Conclusions:Chinese dermatologists suggest a strong preference for ChatGPT-4o over domestic LLMs in psoriasis-related clinical tasks.While accuracy remains the primary criterion,traceability and logicality are also critical,particularly for clinicians in lower-tier hospitals.These findings suggest that future clinical LLMs should prioritize not only content accuracy but also source transparency and structural clarity to meet the diverse needs of different clinical settings.
基金supported by the Meteorological Joint Funds of the National Natural Science Foundation of China(Grant No.U2142211)the National Natural Science Foundation of China(Grant Nos.42075141,42341202 and 62088101)+1 种基金the National Key Research and Development Program of China(Grant No.2020YFA0608000)the Shanghai Municipal Science and Technology Major Project(Grant No.2021SHZDZX0100).
文摘Accurate forecasting of tropical cyclone(TC)tracks and intensities is essential.Although the TianXing large weather model,a six-hourly forecasting model surpassing operational forecasts,exhibits superior performance,its TC forecasts still require enhancement.Prediction errors persist due to biases in the training data and smoothing effects in data-driven methods.To address this,we introduce CycloneBCNet,a deep-learning model designed to correct TianXing’s TC forecast biases by leveraging spatial and temporal data.CycloneBCNet utilizes the SimVP(simpler yet better video prediction)framework with spatial attention to highlight cyclone core regions in forecast fields.It also incorporates TC trend information(center position,maximum wind speed,and minimum sea level pressure)via an LSTM(long short-term memory)module.These TC vectors are derived from post-processed TianXing forecasts.By fusing features from forecast fields and TC vectors,CycloneBCNet corrects biases across multiple lead times.At a 96-h lead time,the track error reduces from 162.4 to 86.4 km,the wind speed error from 17.2 to 6.69 m s^(-1),and the pressure error from 22.2 to 9.36 hPa.Interpretability analysis shows that CycloneBCNet adjusts its attention across forecast lead times.Intensity corrections prioritize inner-core dynamics,particularly the eye and eyewall,while track corrections shift from lower-level variables and the cyclone’s core to broader environmental factors and mid-to upper-level features as the forecast duration increases.These findings demonstrate that CycloneBCNet effectively captures key TC dynamics consistent with meteorological principles,including the dominance of near-surface conditions for intensity and the increasing influence of steering currents on track prediction.
基金National Key Research and Development Program of China(2024YFC3505400)Capital Clinical Project of Beijing Municipal Science&Technology Commission(Z221100007422092)Capital’s Funds for Health Improvement and Research(2024-1-2231).
文摘Objective To develop a clinical decision and prescription generation system(CDPGS)specifically for diarrhea in traditional Chinese medicine(TCM),utilizing a specialized large language model(LLM),Qwen-TCM-Dia,to standardize diagnostic processes and prescription generation.Methods Two primary datasets were constructed:an evaluation benchmark and a fine-tuning dataset consisting of fundamental diarrhea knowledge,medical records,and chain-ofthought(CoT)reasoning datasets.After an initial evaluation of 16 open-source LLMs across inference time,accuracy,and output quality,Qwen2.5 was selected as the base model due to its superior overall performance.We then employed a two-stage low-rank adaptation(LoRA)fine-tuning strategy,integrating continued pre-training on domain-specific knowledge with instruction fine-tuning using CoT-enriched medical records.This approach was designed to embed the clinical logic(symptoms→pathogenesis→therapeutic principles→prescriptions)into the model’s reasoning capabilities.The resulting fine-tuned model,specialized for TCM diarrhea,was designated as Qwen-TCM-Dia.Model performance was evaluated for disease diagnosis and syndrome type differentiation using accuracy,precision,recall,and F1-score.Furthermore,the quality of the generated prescriptions was compared with that of established open-source TCM LLMs.Results Qwen-TCM-Dia achieved peak performance compared to both the base Qwen2.5 model and five other open-source TCM LLMs.It achieved 97.05%accuracy and 91.48%F1-score in disease diagnosis,and 74.54%accuracy and 74.21%F1-score in syndrome type differentiation.Compared with existing open-source TCM LLMs(BianCang,HuangDi,LingDan,TCMLLM-PR,and ZhongJing),Qwen-TCM-Dia exhibited higher fidelity in reconstructing the“symptoms→pathogenesis→therapeutic principles→prescriptions”logic chain.It provided complete prescriptions,whereas other models often omitted dosages or generated mismatched prescriptions.Conclusion By integrating continued pre-training,CoT reasoning,and a two-stage fine-tuning strategy,this study establishes a CDPGS for diarrhea in TCM.The results demonstrate the synergistic effect of strengthening domain representation through pre-training and activating logical reasoning via CoT.This research not only provides critical technical support for the standardized diagnosis and treatment of diarrhea but also offers a scalable paradigm for the digital inheritance of expert TCM experience and the intelligent transformation of TCM.
基金supported by the Natural Science Foundation of Hebei Province(E2023502006)Fundamental Research Fund for the Central Universities(2025MS131).
文摘Magnesium hydride(MgH_(2)),a promising high-capacity hydrogen storage material,is hindered by slow dehydrogenation kinetics.AIdriven catalyst discovery to address this is often hampered by the laborious extraction of data from unstructured literature.To overcome this,we introduce a transformative“LLM to Agent”framework that synergistically integrates Large Language Models(LLMs)for automated data curation with Machine Learning(ML)for predictive design.We automatically constructed a comprehensive database of 809 MgH_(2)catalysts(6555 data rows)with high fidelity and an~40-fold acceleration over manual methods.The resulting ML models achieved high accuracy(average R^(2)>0.91)in predicting dehydrogenation temperature and activation energy,subsequently guiding a Genetic Algorithm(GA)in an exploratory inverse design that autonomously uncovered key design principles for high-performance catalysts.Encouragingly,a strong alignment was found between these AI-discovered principles and the design strategies of recently reported,state-of-the-art experimental systems,providing substantial evidence for the validity of our approach.The framework culminates in Cat-Advisor,a novel,domain-adapted multi-agent system.Cat-Advisor translates ML predictions and retrieval-augmented knowledge into actionable design guidance,demonstrating capabilities that surpass those of general-purpose LLMs in this specialized domain.This work delivers a practical AI toolkit for accelerated materials discovery and advances the emerging Agent-based paradigm for designing next-generation energy technologies.
基金Hebei Province Higher Education Scientific Research Project(QN2025367)Zhangjiakou City 2022 Municipal Science and Technology Plan Self-raised Fund Project(221105D)Hebei Province Education Science“14th Five-Year Plan”Project(2404224).
文摘Objective To develop QingNangTCM,a specialized large language model(LLM)tailored for expert-level traditional Chinese medicine(TCM)question-answering and clinical reasoning,addressing the scarcity of domain-specific corpora and specialized alignment.Methods We constructed QnTCM_Dataset,a corpus of 100000 entries,by integrating data from ShenNong_TCM_Dataset and SymMap v2.0,and synthesizing additional samples via retrieval-augmented generation(RAG)and persona-driven generation.The dataset comprehensively covers diagnostic inquiries,prescriptions,and herbal knowledge.Utilizing P-Tuning v2,we fine-tuned the GLM-4-9B-Chat backbone to develop QingNangTCM.A multidimensional evaluation framework,assessing accuracy,coverage,consistency,safety,professionalism,and fluency,was established using metrics such as bilingual evaluation understudy(BLEU),recall-oriented understudy for gisting evaluation(ROUGE),metric for evaluation of translation with explicit ordering(METEOR),and LLM-as-a-Judge with expert review.Qualitative analysis was conducted across four simulated clinical scenarios:symptom analysis,disease treatment,herb inquiry,and failure cases.Baseline models included GLM-4-9BChat,DeepSeek-V2,HuatuoGPT-II(7B),and GLM-4-9B-Chat(freeze-tuning).Results QingNangTCM achieved the highest scores in BLEU-1/2/3/4(0.425/0.298/0.137/0.064),ROUGE-1/2(0.368/0.157),and METEOR(0.218),demonstrating a balanced and superior normalized performance profile of 0.900 across the dimensions of accuracy,coverage,and consistency.Although its ROUGE-L score(0.299)was lower than that of HuatuoGPT-II(7B)(0.351),it significantly outperformed domain-specific models in expert-validated win rates for professionalism(86%)and safety(73%).Qualitative analysis confirmed that the model strictly adheres to the“symptom-syndrome-pathogenesis-treatment”reasoning chain,though occasional misclassifications and hallucinations persisted when dealing with rare medicinal materials and uncommon syndromes.
基金supported by National Natural Science Foundation of China(62376219 and 62006194)Foundational Research Project in Specialized Discipline(Grant No.G2024WD0146)Faculty Construction Project(Grant No.24GH0201148).
文摘Large language models(LLMs)have undergone significant expansion and have been increasingly integrated across various domains.Notably,in the realm of robot task planning,LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions.However,for embodied tasks,where robots interact with complex environments,textonly LLMs often face challenges due to a lack of compatibility with robotic visual perception.This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks.Additionally,we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.Our results,based on diverse datasets,indicate that GPT-4V effectively enhances robot performance in embodied tasks.This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.
文摘Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently,researchers have explored the potential of using large language models(LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.
文摘ChatGPT is a powerful artificial intelligence(AI)language model that has demonstrated significant improvements in various natural language processing(NLP) tasks. However, like any technology, it presents potential security risks that need to be carefully evaluated and addressed. In this survey, we provide an overview of the current state of research on security of using ChatGPT, with aspects of bias, disinformation, ethics, misuse,attacks and privacy. We review and discuss the literature on these topics and highlight open research questions and future directions.Through this survey, we aim to contribute to the academic discourse on AI security, enriching the understanding of potential risks and mitigations. We anticipate that this survey will be valuable for various stakeholders involved in AI development and usage, including AI researchers, developers, policy makers, and end-users.
基金supported by the National Key R&D Program of China under Grant No.2022YFB3103500the National Natural Science Foundation of China under Grants No.62402087 and No.62020106013+3 种基金the Sichuan Science and Technology Program under Grant No.2023ZYD0142the Chengdu Science and Technology Program under Grant No.2023-XT00-00002-GXthe Fundamental Research Funds for Chinese Central Universities under Grants No.ZYGX2020ZB027 and No.Y030232063003002the Postdoctoral Innovation Talents Support Program under Grant No.BX20230060.
文摘The integration of artificial intelligence(AI)technology,particularly large language models(LLMs),has become essential across various sectors due to their advanced language comprehension and generation capabilities.Despite their transformative impact in fields such as machine translation and intelligent dialogue systems,LLMs face significant challenges.These challenges include safety,security,and privacy concerns that undermine their trustworthiness and effectiveness,such as hallucinations,backdoor attacks,and privacy leakage.Previous works often conflated safety issues with security concerns.In contrast,our study provides clearer and more reasonable definitions for safety,security,and privacy within the context of LLMs.Building on these definitions,we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety,security,and privacy in LLMs.Additionally,we explore the unique research challenges posed by LLMs and suggest potential avenues for future research,aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.