2 articles found
1. Data Mixing for Large Language Models Pretraining: A Survey and Outlook
Authors: Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong. Data Intelligence, 2026, Issue 1, pp. 15-55 (41 pages).
Large language models (LLMs) rely on pretraining on massive and highly heterogeneous corpora, where the composition of training data has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex, clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this bilevel formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along a top-level dichotomy: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class of methods, we summarize representative approaches and analyze their characteristics, strengths, and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight key challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets; unstandardized evaluation protocols and benchmarks; and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning, inverse data mixing, and pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
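As context for the bilevel formulation mentioned above, a generic statement of data mixture optimization over K domains can be sketched as follows (illustrative notation only; the paper's own symbols may differ). With domains D_1, ..., D_K, mixture weights w on the probability simplex, a per-sample loss ℓ, and a validation loss L_val:

\min_{w \in \Delta^{K-1}} \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(w) = \arg\min_{\theta} \sum_{i=1}^{K} w_i \, \mathbb{E}_{x \sim \mathcal{D}_i}\bigl[\ell(\theta; x)\bigr]

The outer problem searches over domain sampling weights, while the inner problem trains model parameters under the mixture those weights induce; as the abstract notes, practical methods approximate this nested optimization, for example via proxy models or online reweighting.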
Keywords: Data mixing; Large language model; Pretraining; Domain reweight; Survey
2. CMGBench: Benchmarking Chinese Metaphor Generation for Large Language Models
Authors: Yan Liu, Renren Jin, Tianhao Shen, Deyi Xiong. Data Intelligence, 2025, Issue 4, pp. 1270-1290 (21 pages).
Can large language models (LLMs) generate metaphor expressions as humans do? To address this question, we present CMGBench, a Chinese Metaphor Generation Benchmark specifically designed to evaluate the ability of LLMs to generate metaphors. CMGBench offers a high-quality dataset comprising 810 examples, 3,354 annotations, and two types of metaphor expressions: direct metaphor expressions and implicit metaphor expressions, the latter of which has received limited attention in previous research. To assess the quality of metaphors generated by LLMs, we introduce three evaluation criteria. The first criterion measures the disparity between the vehicles in the LLM-generated metaphors and those used by humans. The second criterion evaluates whether an LLM-generated metaphor contains unconventional semantic collocations. The third criterion calculates the proportion of implicit metaphor expressions within the LLM-generated metaphors. We conducted extensive experiments on both proprietary and open-source LLMs. The results demonstrate that, compared to human-generated metaphors, LLM-generated metaphors display a lack of variety in using the attributes of a vehicle, show limited innovation in semantic collocation, and tend to use direct expressions. Even top-performing LLMs such as GPT-4, when not explicitly prompted to generate implicit metaphors, produced them at a rate of only 23.9%. This highlights a significant gap between human and LLM capabilities in metaphor generation.
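To make the third criterion concrete, here is a minimal Python sketch (not the authors' code; the function name and the implicitness check are hypothetical stand-ins for however implicitness is actually judged, e.g., by human annotation or a trained classifier):

# Illustrative sketch: computing the proportion of implicit metaphor
# expressions among a set of LLM-generated metaphors.
from typing import Callable, List

def implicit_metaphor_rate(
    generations: List[str],
    is_implicit: Callable[[str], bool],
) -> float:
    """Return the fraction of generated metaphors judged implicit."""
    if not generations:
        return 0.0
    implicit_count = sum(1 for g in generations if is_implicit(g))
    return implicit_count / len(generations)

# Demo with a crude surface heuristic: treat generations lacking the
# explicit comparison marker "像" (like/as) as implicit. Real evaluation
# would rely on annotation or a classifier, not this heuristic.
demo = ["她的笑容像春天的阳光", "时间是小偷"]
rate = implicit_metaphor_rate(demo, is_implicit=lambda g: "像" not in g)
print(f"implicit rate: {rate:.1%}")  # -> implicit rate: 50.0%

The division of labor here mirrors the criterion itself: the metric is a simple ratio, and all the difficulty lives in the implicitness judgment passed in as a callable.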
Keywords: Large language model; Benchmark for large language models; Metaphor generation; Metaphor datasets; Metaphor generation evaluation