Can large language models (LLMs) generate metaphor expressions as humans do? To address this question, we present CMGBench, a Chinese Metaphor Generation Benchmark specifically designed to evaluate the ability of LLMs to generate metaphors. CMGBench offers a high-quality dataset comprising 810 examples, 3,354 annotations, and two types of metaphor expressions: direct metaphor expressions and implicit metaphor expressions, the latter of which has received limited attention in previous research. To assess the quality of metaphors generated by LLMs, we introduce three evaluation criteria. The first criterion measures the disparity between the vehicles in the LLM-generated metaphors and those used by humans. The second criterion evaluates whether an LLM-generated metaphor contains unconventional semantic collocations. The third criterion calculates the proportion of implicit metaphor expressions within the LLM-generated metaphors. We conducted extensive experiments on both proprietary and open-source LLMs. The results demonstrate that, compared to human-generated metaphors, LLM-generated metaphors display a lack of variety in using the attributes of a vehicle, show limited innovation in semantic collocation, and tend to use direct expressions. Even top-performing LLMs such as GPT-4, when not explicitly prompted to generate implicit metaphors, produced them at a rate of only 23.9%. This highlights a significant gap between human and LLM capabilities in metaphor generation.
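As a rough illustration of how the first and third criteria could be operationalized, the sketch below computes a vehicle-overlap disparity and an implicit-metaphor rate over toy annotations. The function names, field names, and the Jaccard-style disparity measure are illustrative assumptions only; the abstract does not specify the paper's actual metric definitions.

```python
# Hypothetical sketch of two CMGBench-style criteria; names and the exact
# disparity measure are assumptions, not the paper's implementation.

def vehicle_disparity(llm_vehicles: set[str], human_vehicles: set[str]) -> float:
    """Jaccard distance between the vehicle sets used by the LLM and by humans
    (0.0 = identical vehicle usage, 1.0 = no overlap)."""
    union = llm_vehicles | human_vehicles
    if not union:
        return 0.0
    return 1.0 - len(llm_vehicles & human_vehicles) / len(union)

def implicit_rate(generated: list[dict]) -> float:
    """Proportion of generated metaphors annotated as implicit expressions."""
    if not generated:
        return 0.0
    return sum(1 for m in generated if m.get("type") == "implicit") / len(generated)

# Toy annotations, purely for demonstration:
metaphors = [
    {"text": "Time is a thief.", "type": "direct", "vehicle": "thief"},
    {"text": "The years quietly picked our pockets.", "type": "implicit", "vehicle": "thief"},
]
print(vehicle_disparity({"thief"}, {"thief", "river", "arrow"}))  # ~0.667
print(implicit_rate(metaphors))                                    # 0.5
```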
The emergence of Medical Large Language Models (Med-LLMs) has significantly transformed healthcare. Med-LLMs serve as transformative tools that enhance clinical practice through applications in decision support, documentation, and diagnostics. This evaluation examines the performance of leading Med-LLMs, including GPT-4Med, Med-PaLM, MEDITRON, PubMedGPT, and MedAlpaca, across diverse medical datasets, and provides graphical comparisons of their effectiveness in distinct healthcare domains. The study introduces a domain-specific categorization system that aligns these models with optimal applications in clinical decision-making, documentation, drug discovery, research, patient interaction, and public health. The paper addresses deployment challenges of Med-LLMs, emphasizing trustworthiness and explainability as essential requirements for healthcare AI. It presents current evaluation techniques that improve model transparency in high-stakes medical contexts and analyzes regulatory frameworks using benchmarking datasets such as MedQA, MedMCQA, PubMedQA, and MIMIC. By identifying ongoing challenges in bias mitigation, reliability, and ethical compliance, this work serves as a resource for selecting appropriate Med-LLMs and outlines future directions in the field. This analysis offers a roadmap for developing Med-LLMs that balance technological innovation with the trust and transparency required for clinical integration, a perspective often overlooked in existing literature.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2024YFE0203000).