Can large language models (LLMs) generate metaphor expressions as humans do? To address this question, we present CMGBench, a Chinese Metaphor Generation Benchmark specifically designed to evaluate the ability of LLMs to generate metaphors. CMGBench offers a high-quality dataset comprising 810 examples, 3,354 annotations, and two types of metaphor expressions: direct metaphor expressions and implicit metaphor expressions, the latter of which has received limited attention in previous research. To assess the quality of metaphors generated by LLMs, we introduce three evaluation criteria. The first criterion measures the disparity between the vehicles in the LLM-generated metaphors and those used by humans. The second criterion evaluates whether an LLM-generated metaphor contains unconventional semantic collocations. The third criterion calculates the proportion of implicit metaphor expressions within the LLM-generated metaphors. We conducted extensive experiments on both proprietary and open-source LLMs. The results demonstrate that, compared to human-generated metaphors, LLM-generated metaphors display a lack of variety in using the attributes of a vehicle, show limited innovation in semantic collocation, and tend to use direct expressions. Even top-performing LLMs such as GPT-4, when not explicitly prompted to generate implicit metaphors, produced them at a rate of only 23.9%. This highlights a significant gap between human and LLM capabilities in metaphor generation.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2024YFE0203000).
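The third criterion described above reduces to a simple ratio: the number of implicit metaphors among all LLM-generated metaphors. A minimal sketch is shown below; the `is_implicit` classifier here is a hypothetical placeholder (a surface-marker heuristic), not CMGBench's actual annotation procedure, which is not detailed in this abstract.

```python
def implicit_ratio(metaphors, is_implicit):
    """Return the fraction of generated metaphors judged implicit.

    metaphors   -- list of generated metaphor strings
    is_implicit -- predicate labeling a metaphor as implicit
    """
    if not metaphors:
        return 0.0
    return sum(1 for m in metaphors if is_implicit(m)) / len(metaphors)


# Toy heuristic for illustration only: treat a metaphor as "direct"
# if it contains the explicit comparison marker 像 ("like/as"),
# and as "implicit" otherwise.
marker_based = lambda m: "像" not in m

outputs = ["她的笑容像阳光", "时间是小偷", "他是老狐狸"]
print(implicit_ratio(outputs, marker_based))  # 2 of 3 lack the marker
```

Under this metric, GPT-4's reported 23.9% would correspond to roughly one implicit metaphor per four generations when not explicitly prompted for implicitness.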