Abstract: Model evaluation using benchmark datasets is an important method to measure the capability of large language models (LLMs) in specific domains, and it is mainly used to assess the knowledge and reasoning abilities of LLMs. Therefore, in order to better assess the capability of LLMs in the agricultural domain, Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture. The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain: crop science, horticulture, plant protection, animal husbandry, forest science, aquaculture science, and grass science, and contained a total of 2283 questions. Among domestic general-purpose LLMs, DeepSeek R1 performed best with an accuracy rate of 75.49%. Among international general-purpose LLMs, Gemini 2.0 pro exp 0205 stood out as the top performer, achieving an accuracy rate of 74.28%. As a vertical-domain agricultural LLM, Shennong V2.0 outperformed all domestic LLMs, and its accuracy on agricultural knowledge questions exceeded that of all existing general-purpose LLMs. The launch of Agri-Eval helps LLM developers comprehensively evaluate model capability in the agricultural field through a variety of tasks and tests, promoting the development of LLMs in this field.
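To make the scoring concrete, the sketch below shows how per-discipline accuracy over a multiple-choice benchmark of this kind could be computed; the record fields and the answer-matching rule are illustrative assumptions, not Agri-Eval's published format or protocol.

```python
# Minimal sketch of per-discipline accuracy scoring for a multiple-choice
# benchmark. The record fields ("discipline", "answer", "prediction") are
# illustrative assumptions, not the benchmark's actual schema.
from collections import defaultdict

def accuracy_by_discipline(records):
    """records: iterable of dicts with 'discipline', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["discipline"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["discipline"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Example usage with toy data
demo = [
    {"discipline": "crop science", "answer": "A", "prediction": "A"},
    {"discipline": "crop science", "answer": "C", "prediction": "B"},
    {"discipline": "horticulture", "answer": "D", "prediction": "D"},
]
print(accuracy_by_discipline(demo))  # {'crop science': 0.5, 'horticulture': 1.0}
```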
Funding: This study is financed by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, Project No. BG-RRP-2.013-0001.
Abstract: Covert timing channels (CTC) exploit network resources to establish hidden communication pathways, posing significant risks to data security and policy compliance. Therefore, detecting such hidden and dangerous threats remains one of the key security challenges. This paper proposes LinguTimeX, a new framework that combines natural language processing with artificial intelligence, along with explainable Artificial Intelligence (AI), not only to detect CTC but also to provide insights into the decision process. LinguTimeX performs multidimensional feature extraction by fusing linguistic attributes with temporal network patterns to identify covert channels precisely. LinguTimeX demonstrates strong effectiveness in detecting CTC across multiple languages, namely English, Arabic, and Chinese. Specifically, the LSTM and RNN models achieved F1-scores of 90% on the English dataset, 89% on the Arabic dataset, and 88% on the Chinese dataset, showcasing their superior performance and ability to generalize across multiple languages. This highlights their robustness in detecting CTCs within security systems, regardless of the language or cultural context of the data. In contrast, the DeepForest model produced F1-scores ranging from 86% to 87% across the same datasets, further confirming its effectiveness in CTC detection. Although other algorithms also showed reasonable accuracy, the LSTM and RNN models consistently outperformed them in multilingual settings, suggesting that deep learning models might be better suited for this particular problem.
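For readers unfamiliar with the classifier side of such a pipeline, the snippet below is a hedged sketch of an LSTM binary classifier over sequences of fused linguistic and timing features; the input shape, layer sizes, and training call are illustrative assumptions, not the LinguTimeX implementation.

```python
# Illustrative sketch (not the paper's implementation) of an LSTM classifier
# over sequences that fuse linguistic features with inter-packet timing features.
# Input shape and layer sizes are assumptions for demonstration only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_features = 64, 12   # e.g., 12 fused linguistic + timing features per step

model = keras.Sequential([
    layers.Input(shape=(seq_len, n_features)),
    layers.LSTM(64),                       # temporal pattern encoder
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"), # covert (1) vs. overt (0) channel
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy training call with random data, just to show the expected tensor shapes
X = np.random.rand(128, seq_len, n_features).astype("float32")
y = np.random.randint(0, 2, size=(128, 1))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```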
Abstract: As an ordinary Yunnan local, I never imagined becoming so closely connected to the exotic land of Laos. The luckiest event of my life was probably my choice to tick a box on a 2007 college entrance examination application form, indicating my willingness to enroll in a major other than my preference, which led me into the world of the Lao language.
Funding: Supported by the National Key R&D Program of China [2022YFF0902703] and the State Administration for Market Regulation Science and Technology Plan Project (2024MK033).
Abstract: Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can more accurately represent items, improving item representation quality. We focus not only on item representations but also on user representations. To more precisely capture users' personalized preferences, we use traditional sequential recommendation models to train on users' historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
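As an illustration of the general idea of building a single item representation from image and text embeddings, the following sketch projects both modalities into a shared space and fuses them; the dimensions and the simple projection-plus-concatenation design are assumptions for illustration, not the CIT-Rec architecture.

```python
# Illustrative sketch: fuse a visual embedding and a textual embedding into one
# item representation by projection and concatenation. Dimensions and the simple
# concatenation strategy are assumptions, not the CIT-Rec design.
import torch
import torch.nn as nn

class ItemFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, out_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.txt_proj = nn.Linear(txt_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, img_emb, txt_emb):
        z = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.fuse(torch.relu(z))   # unified item representation

fusion = ItemFusion()
item_vec = fusion(torch.randn(4, 512), torch.randn(4, 768))
print(item_vec.shape)  # torch.Size([4, 256])
```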
Funding: Supported by the National Science and Technology Council (NSTC), Taiwan, under grant number 114-2221-E-182-041-MY3, and by Chang Gung University and Chang Gung Memorial Hospital under project number NERPD4Q0021.
Abstract: The rapid growth in applications of large language models (LLMs) underscores the importance of adaptive and efficient prompt engineering strategies. Existing methods are often not flexible, robust, or streamlined across different domains. This study introduces a prompt optimization framework, named PROMPTx-PE, designed to yield greater precision and robustness on LLM-based tasks. The proposed system features a prompt selection scheme informed by reinforcement learning, a contextual layer, and a dynamic weighting module regulated by Lyapunov-based stability guidelines. PROMPTx-PE dynamically balances exploration and exploitation of the prompt space based on real-time feedback and a multi-objective reward design. Extensive testing on both benchmark datasets (GLUE, SuperGLUE) and domain-specific data (Healthcare-QA and Industrial-NER) demonstrates a best performance of 89.4% and strong robustness with under 3% additional computation expense. The results confirm the effectiveness, consistency, and scalability of PROMPTx-PE as a platform for adaptive prompt engineering in current LLM applications.
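To illustrate what a reinforcement-learning-informed prompt selection scheme can look like in its simplest form, the sketch below uses an epsilon-greedy bandit over candidate prompts; this is a generic stand-in, not the PROMPTx-PE algorithm, and the reward signal is a placeholder.

```python
# Hedged sketch: a simple epsilon-greedy bandit over candidate prompts, as one
# way a reinforcement-learning-informed prompt selection scheme can balance
# exploration and exploitation. This illustrates the general idea only; the
# reward function below is a stand-in, not a real evaluation signal.
import random

class PromptBandit:
    def __init__(self, prompts, epsilon=0.1):
        self.prompts = prompts
        self.epsilon = epsilon
        self.counts = [0] * len(prompts)
        self.values = [0.0] * len(prompts)   # running mean reward per prompt

    def select(self):
        if random.random() < self.epsilon:            # explore
            return random.randrange(len(self.prompts))
        return max(range(len(self.prompts)), key=lambda i: self.values[i])  # exploit

    def update(self, i, reward):
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]

bandit = PromptBandit(["Answer concisely: {q}", "Think step by step, then answer: {q}"])
for _ in range(100):
    i = bandit.select()
    reward = random.random()      # stand-in for task accuracy on a validation item
    bandit.update(i, reward)
```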
Funding: National Key Research and Development Program of China (2024YFC3505400), Capital Clinical Project of Beijing Municipal Science & Technology Commission (Z221100007422092), and Capital's Funds for Health Improvement and Research (2024-1-2231).
Abstract: Objective To develop a clinical decision and prescription generation system (CDPGS) specifically for diarrhea in traditional Chinese medicine (TCM), utilizing a specialized large language model (LLM), Qwen-TCM-Dia, to standardize diagnostic processes and prescription generation. Methods Two primary datasets were constructed: an evaluation benchmark and a fine-tuning dataset consisting of fundamental diarrhea knowledge, medical records, and chain-of-thought (CoT) reasoning datasets. After an initial evaluation of 16 open-source LLMs across inference time, accuracy, and output quality, Qwen2.5 was selected as the base model due to its superior overall performance. We then employed a two-stage low-rank adaptation (LoRA) fine-tuning strategy, integrating continued pre-training on domain-specific knowledge with instruction fine-tuning using CoT-enriched medical records. This approach was designed to embed the clinical logic (symptoms → pathogenesis → therapeutic principles → prescriptions) into the model's reasoning capabilities. The resulting fine-tuned model, specialized for TCM diarrhea, was designated as Qwen-TCM-Dia. Model performance was evaluated for disease diagnosis and syndrome type differentiation using accuracy, precision, recall, and F1-score. Furthermore, the quality of the generated prescriptions was compared with that of established open-source TCM LLMs. Results Qwen-TCM-Dia achieved peak performance compared to both the base Qwen2.5 model and five other open-source TCM LLMs. It achieved 97.05% accuracy and 91.48% F1-score in disease diagnosis, and 74.54% accuracy and 74.21% F1-score in syndrome type differentiation. Compared with existing open-source TCM LLMs (BianCang, HuangDi, LingDan, TCMLLM-PR, and ZhongJing), Qwen-TCM-Dia exhibited higher fidelity in reconstructing the "symptoms → pathogenesis → therapeutic principles → prescriptions" logic chain. It provided complete prescriptions, whereas other models often omitted dosages or generated mismatched prescriptions. Conclusion By integrating continued pre-training, CoT reasoning, and a two-stage fine-tuning strategy, this study establishes a CDPGS for diarrhea in TCM. The results demonstrate the synergistic effect of strengthening domain representation through pre-training and activating logical reasoning via CoT. This research not only provides critical technical support for the standardized diagnosis and treatment of diarrhea but also offers a scalable paradigm for the digital inheritance of expert TCM experience and the intelligent transformation of TCM.
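For orientation, the snippet below shows what a LoRA adaptation setup with the Hugging Face peft library can look like; the base checkpoint, rank, target modules, and the staging described in the comments are assumptions for illustration, not the paper's actual configuration.

```python
# Hedged sketch of LoRA adaptation with Hugging Face peft; the base model name,
# rank, and target modules are illustrative assumptions, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# In a two-stage scheme of the kind described above, stage 1 would continue
# pre-training on domain text (diarrhea knowledge) and stage 2 would run
# instruction fine-tuning on CoT-enriched medical records, e.g., with the
# transformers Trainer or the TRL SFTTrainer applied to each dataset in turn.
```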
Funding: The research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MCIN/AEI/10.13039/501100011033 and the European Fund for Regional Development (ERDF), a way to make Europe. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.
Abstract: The malicious dissemination of hate speech via compromised accounts, automated bot networks, and malware-driven social media campaigns has become a growing cybersecurity concern. Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources. In this paper, we compare two predominant AI-based approaches for the forensic detection of malicious hate speech: (1) fine-tuning encoder-only models that have been trained in Spanish and (2) In-Context Learning techniques (Zero- and Few-Shot Learning) with large-scale language models. Our approach goes beyond binary classification, proposing a comprehensive, multidimensional evaluation that labels each text by: (1) type of speech, (2) recipient, (3) level of intensity (ordinal), and (4) targeted group (multi-label). Performance is evaluated using an annotated Spanish corpus and standard metrics such as precision, recall, and F1-score, and stability-oriented metrics (Zero-to-Few Shot Retention and Zero-to-Few Shot Gain) are applied to evaluate the stability of the transition from zero-shot to few-shot prompting. The results indicate that fine-tuned encoder-only models (notably MarIA and BETO variants) consistently deliver the strongest and most reliable performance: in our experiments their macro F1-scores lie roughly in the range of 46%–66% depending on the task. Zero-shot approaches are much less stable and typically yield substantially lower performance (observed F1-scores range approximately 0%–39%), often producing invalid outputs in practice. Few-shot prompting (e.g., Qwen 38B, Mistral 7B) generally improves stability and recall relative to pure zero-shot, bringing F1-scores into a moderate range of approximately 20%–51%, but still falls short of fully fine-tuned models. These findings highlight the importance of supervised adaptation, and we discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
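As a concrete illustration of the first approach, fine-tuning a Spanish encoder-only model for the multi-label "targeted group" dimension, the sketch below sets up multi-label classification with Hugging Face transformers; the checkpoint, number of labels, and example text are assumptions, not the paper's experimental setup.

```python
# Hedged sketch of setting up a Spanish encoder-only model for the multi-label
# "targeted group" dimension; the checkpoint, label count, and hyperparameters
# are illustrative assumptions, not the paper's configuration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"  # BETO (assumed checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=4,                               # e.g., 4 assumed target groups
    problem_type="multi_label_classification",  # sigmoid outputs, BCE loss per label
)

batch = tokenizer(["texto de ejemplo"], return_tensors="pt", truncation=True)
logits = model(**batch).logits                   # shape: (1, 4), one score per group
```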
Funding: Funded by the National Key Research and Development Program of China (No. 2021YFA1100500), the National Natural Science Foundation of China (No. 82370662), and the Key Research & Development Plan of Zhejiang Province (No. 2024C03051).
Abstract: This study evaluated the accuracy, completeness, and comprehensibility of responses from mainstream large language models (LLMs) to hepatitis C virus (HCV)-related questions, aiming to assess their performance in addressing patient queries about the disease and lifestyle behaviors. The models selected were ChatGPT-4o, Gemini 2.0 Pro, Claude 3.5 Sonnet, and DeepSeek V3, with 12 questions chosen by two HCV experts from the domains of prevention, diagnosis, and treatment.
Funding: This publication is part of the TrustBoost project, which has received funding from MICIU/AEI/10.13039/501100011033 and from FEDER, UE. It is a coordinated project by a multidisciplinary team from the Universidad Politécnica de Madrid (UPM) and the University of Granada (UGR), with two subprojects that address TrustBoost's objectives: "Enhancing Trustworthiness in Conversational AI through Multimodal Affective Awareness" (TrustBoost-UPM, ref. PID2023-150584OB-C21) and "Breaking the Duality of Conversational AI: Going beyond Guided Conversations While Ensuring Compliance with Domain Rules and Constraints" (TrustBoost-UGR, ref. PID2023-150584OB-C22).
Abstract: Building reliable intent-based, task-oriented dialog systems typically requires substantial manual effort: designers must derive intents, entities, responses, and control logic from raw conversational data, then iterate until the assistant behaves consistently. This paper investigates how far large language models (LLMs) can automate this development. We use two reference corpora, Let's Go (English, public transport) and MEDIA (French, hotel booking), to prompt four LLM families (GPT-4o, Claude, Gemini, Mistral Small) and generate the core specifications required by the Rasa platform. These include intent sets with example utterances, entity definitions with slot mappings, response templates, and basic dialog flows. To structure this process, we introduce a model- and platform-agnostic pipeline with two phases. The first normalizes and validates LLM-generated artifacts, enforcing cross-file consistency and making slot usage explicit. The second uses a lightweight dialog harness that runs scripted tests and incrementally patches failure points until conversations complete reliably. Across eight projects, all models required some targeted repairs before training. After applying our pipeline, all reached ≥ 70% task completion (many above 84%), while NLU performance ranged from mid-0.6 to 1.0 macro-F1 depending on domain breadth. These results show that, with modest guidance, current LLMs can produce workable end-to-end dialog prototypes directly from raw transcripts. Our main contributions are: (i) a reusable bootstrap method aligned with industry domain-specific languages (DSLs), (ii) a small set of high-impact corrective patterns, and (iii) a simple but effective harness for closed-loop refinement across conversational platforms.
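To make the first phase more tangible, the sketch below shows one cross-file consistency check of the kind such a pipeline might enforce, verifying that every intent referenced in the generated dialog flows also has NLU training data; the YAML layout follows common Rasa conventions, but the example content is invented for illustration and is not taken from the paper.

```python
# Minimal sketch of a cross-file consistency check: every intent referenced in
# the dialog flows must have NLU training examples. The YAML content is invented
# for illustration; only the general Rasa-style layout is assumed.
import yaml

NLU_YAML = """
nlu:
- intent: request_schedule
  examples: |
    - when is the next bus to downtown
- intent: greet
  examples: |
    - hello there
"""

STORIES_YAML = """
stories:
- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: request_schedule
  - action: action_lookup_schedule
- story: unpatched flow
  steps:
  - intent: cancel_trip          # referenced but never defined in NLU data
  - action: utter_cancel
"""

nlu = yaml.safe_load(NLU_YAML)
stories = yaml.safe_load(STORIES_YAML)

defined = {item["intent"] for item in nlu["nlu"] if "intent" in item}
referenced = {
    step["intent"]
    for story in stories["stories"]
    for step in story["steps"]
    if "intent" in step
}
print("intents needing repair:", sorted(referenced - defined))  # ['cancel_trip']
```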
Funding: Supported by the University Grant Agency of Matej Bel University in Banská Bystrica, project number UGA-14-PDS-2025.
Abstract: It is known that correlation does not imply causality. Some relationships identified in data analysis are coincidental or unexplained, while others are produced by real-world causality, which is problematic, since there is a need to differentiate between these two scenarios. Until recently, the true, semantic causality of a relationship could be determined only by human experts in the domain of the studied data. This has changed with the advance of large language models, which are often utilized as surrogates for such human experts, making the process automated and readily available to all data analysts. This motivates the main objective of this work: to introduce the design and implementation of a large language model-based semantic causality evaluator built on correlation analysis, together with its visual analysis model called the Causal heatmap. After the implementation itself, the approach is evaluated in terms of the quality of the visual model, the quality of the LLM-based causal evaluation, and a comparative analysis; the results of the study highlight the usability of large language models in this task and the potential of the proposed approach in the analysis of unknown datasets. The results of the experimental evaluation demonstrate the usefulness of the Causal heatmap method, supported by its evident highlighting of interesting relationships while suppressing irrelevant ones.
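A minimal sketch of the general pattern implied here: compute pairwise correlations, ask an LLM whether each sufficiently strong correlation is plausibly causal, and draw only the retained cells as a heatmap. The llm_is_causal stub, the threshold, and the plotting choices are assumptions, not the paper's implementation.

```python
# Hedged sketch: correlation matrix masked by LLM-judged causality, drawn as a
# heatmap. llm_is_causal() is a placeholder for a real LLM call; the prompt,
# threshold, and plotting choices are illustrative assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def llm_is_causal(col_a: str, col_b: str) -> bool:
    """Placeholder for an LLM query such as:
    'Is a change in {col_a} a plausible real-world cause of a change in {col_b}?'"""
    return True  # stubbed; a real system would parse the model's yes/no answer

def causal_heatmap(df: pd.DataFrame, corr_threshold: float = 0.3):
    corr = df.corr(numeric_only=True)
    mask = corr.copy()
    for a in corr.columns:
        for b in corr.columns:
            keep = a != b and abs(corr.loc[a, b]) >= corr_threshold and llm_is_causal(a, b)
            if not keep:
                mask.loc[a, b] = np.nan   # suppress weak or non-causal cells
    plt.imshow(mask, cmap="coolwarm", vmin=-1, vmax=1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.colorbar(label="correlation (LLM-retained cells only)")
    plt.show()

# Toy usage with an invented dataset
causal_heatmap(pd.DataFrame({"temp": [20, 25, 30, 35], "sales": [3, 5, 8, 9], "id": [1, 2, 3, 4]}))
```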
Funding: National Key Research and Development Program of China, Grant/Award Number: 2024YFF0507404; Special Clinical Business Fund for High-Level Hospitals of China-Japan Friendship Hospital, Grant/Award Number: 2024-NHLHCRF-TS-01.
Abstract: Background: Large language models (LLMs) have shown considerable promise in supporting clinical decision-making. However, their adoption and evaluation in dermatology remain limited. This study aimed to explore the preferences of Chinese dermatologists regarding LLM-generated responses in clinical psoriasis scenarios and to assess how they prioritize key quality dimensions, including accuracy, traceability, and logicality. Methods: A cross-sectional, web-based survey was conducted between December 25, 2024, and January 22, 2025, following the Checklist for Reporting Results of Internet E-Surveys guidelines. A total of 1247 valid responses were collected from practicing dermatologists across 33 of China's provincial-level administrative divisions. Participants evaluated responses to five categories of clinical questions (etiology, clinical presentation, differential diagnosis, treatment, and case study) generated by five LLMs: ChatGPT-4o, Kimi.ai, Doubao, ZuoYiGPT, and Lingyi-agent. Statistical associations between participant characteristics and model preferences were examined using chi-square tests. Results: ChatGPT-4o (Model 1) emerged as the most preferred model across all clinical tasks, consistently receiving the highest number of votes in case study (n=740), clinical presentation (n=666), differential diagnosis (n=707), etiology (n=602), and treatment (n=656). Significant variation in model preference by professional title was observed only for the differential diagnosis task (χ²=21.13, df=12, p=0.0485), while no significant differences were found across hospital tiers (p>0.05). In terms of evaluation dimensions, accuracy was most frequently rated as "very important" (n=635). A significant association existed between hospital tier and the most valued dimension (χ²=27.667, df=9, p=0.0011), with dermatologists in primary hospitals prioritizing traceability more than their peers in higher-tier hospitals. No significant associations were found across professional titles (p=0.127). Conclusions: Chinese dermatologists show a strong preference for ChatGPT-4o over domestic LLMs in psoriasis-related clinical tasks. While accuracy remains the primary criterion, traceability and logicality are also critical, particularly for clinicians in lower-tier hospitals. These findings suggest that future clinical LLMs should prioritize not only content accuracy but also source transparency and structural clarity to meet the diverse needs of different clinical settings.
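For reference, the chi-square test of independence reported here can be computed as in the sketch below; the contingency counts are invented for illustration and are not the survey's data.

```python
# Hedged sketch of a chi-square test of independence between a participant
# characteristic and a preference; the counts below are toy values, not the
# survey's actual contingency table.
from scipy.stats import chi2_contingency

# rows: hospital tier, columns: most-valued dimension (invented counts)
table = [
    [120, 40, 30, 10],   # tertiary hospitals
    [ 80, 55, 25, 15],   # secondary hospitals
    [ 40, 60, 20, 12],   # primary hospitals
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, df={dof}, p={p:.4f}")
```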
Funding: "A Study on the Value and Path of Integrating Excellent Traditional Chinese Culture Into Intercultural Communication Courses" (ZD2024), a project of the Beijing Higher Education Association, as well as "A Study on the Path of Empowering the Integration of Excellent Traditional Chinese Culture Into Intercultural Communication Courses With Generative AI" (2024), an institutional project of Beijing International Studies University.
Abstract: This paper provides a systematic review of the development of research on integrating Chinese culture into foreign language education in China from the 1980s to 2025, dividing it into three stages: cultural attachment, cultural compensation, and cultural symbiosis, and reveals the logical shift of the research from the dominance of target-language culture to the construction of the subjectivity of Chinese culture. Through quantitative and qualitative analysis of 435 CSSCI papers, three core themes are extracted: what to integrate, why to integrate, and how to integrate. The paper critically analyzes three pairs of contradictions: the imbalance between instrumentality and humanism, the separation of national narrative and individual expression, and the disconnection between traditional inheritance and modern transformation. It proposes that future research should reconstruct the educational logic based on the Chinese context, integrate the national and individual dimensions, and build a dialogue mechanism between tradition and modernity, so as to provide theoretical and practical reference for the construction of a foreign language education system with Chinese characteristics.