Funding: Supported by the National Natural Science Foundation of China (62462019, 62172350), the Guangdong Basic and Applied Basic Research Foundation (2023A1515012846), the Guangxi Science and Technology Major Program (AA24263010), the Key Research and Development Program of Guangxi (AB24010085), the Key Laboratory of Equipment Data Security and Guarantee Technology, Ministry of Education (GDZB2024060500), the 2024 Higher Education Scientific Research Planning Project (No. 24NL0419), the Nantong Science and Technology Project (No. JC2023070), and the Open Fund of the Advanced Cryptography and System Security Key Laboratory of Sichuan Province (Grant No. SKLACSS-202407); sponsored by the Cultivation of Young and Middle-aged Academic Leaders in the “Qing Lan Project” of Jiangsu Province and the 2025 Outstanding Teaching Team in the “Qing Lan Project” of Jiangsu Province.
Abstract: Large language models (LLMs) represent significant advancements in artificial intelligence. However, their increasing capabilities come with a serious challenge: misalignment, which refers to the deviation of model behavior from the designers' intentions and human values. This review aims to synthesize the current understanding of the LLM misalignment issue and provide researchers and practitioners with a comprehensive overview. We define the concept of misalignment and elaborate on its various manifestations, including generating harmful content, factual errors (hallucinations), propagating biases, failing to follow instructions, emerging deceptive behaviors, and emergent misalignment. We explore the multifaceted causes of misalignment, systematically analyzing factors from surface-level technical issues (e.g., training data, objective function design, model scaling) to deeper fundamental challenges (e.g., difficulties formalizing values, discrepancies between training signals and real intentions). This review covers existing and emerging techniques for detecting and evaluating the degree of misalignment, such as benchmark tests, red-teaming, and formal safety assessments. Subsequently, we examine strategies to mitigate misalignment, focusing on mainstream alignment techniques such as RLHF, Constitutional AI (CAI), and instruction fine-tuning, as well as novel approaches that address scalability and robustness. In particular, we analyze recent advances in misalignment attack research, including system prompt modifications, supervised fine-tuning, self-supervised representation attacks, and model editing, which challenge the robustness of model alignment. We categorize and analyze the surveyed literature, highlighting major findings, persistent limitations, and current points of contention. Finally, we identify key open questions and propose several promising future research directions, including constructing high-quality alignment datasets, exploring novel alignment methods, coordinating diverse values, and delving into the deeper philosophical aspects of alignment. This work underscores the complexity and multidimensionality of LLM misalignment issues, calling for interdisciplinary approaches to reliably align LLMs with human values.
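To make the RLHF stage mentioned above concrete, the sketch below shows the pairwise (Bradley-Terry) preference loss commonly used to train the reward model that RLHF pipelines then optimize against. The function name and toy reward values are illustrative assumptions, not code from any of the surveyed systems.

```python
# Minimal sketch of the pairwise preference loss behind reward modeling in RLHF.
# Names and toy values are assumptions for illustration only.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the human-preferred response outranks the rejected one.

    chosen_rewards / rejected_rewards: scalar reward-model outputs, shape (batch,).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    # Toy example: rewards a hypothetical reward model assigned to paired responses.
    chosen = torch.tensor([1.2, 0.4, 2.0])     # human-preferred responses
    rejected = torch.tensor([0.3, 0.9, -0.5])  # dispreferred responses
    loss = reward_model_loss(chosen, rejected)
    print(f"pairwise preference loss: {loss.item():.4f}")
```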
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 32271110, 62441614) and the Tsinghua University Initiative Scientific Research Program (Grant No. 20235080047).
Abstract: While advanced Large Language Models (LLMs) can simulate human-like prosocial behaviors, the degree to which they align with human prosocial values and the underlying affective mechanisms remain unclear. This study addressed these gaps using the third-party punishment (TPP) paradigm, comparing LLM agents (GPT and DeepSeek series) with human participants (n = 100). The LLM agents (n = 500, 100 agents per model) were constructed one-to-one from the demographic and psychological features of the human participants. Prompt engineering was employed to initiate TPP games and record punitive decisions and affective responses in the LLM agents. Results revealed that: (1) the GPT-4o, DeepSeek-V3, and DeepSeek-R1 models demonstrated stronger fairness value alignment, choosing punitive options more frequently than humans in TPP games; (2) all LLMs replicated the human pathway from unfairness through negative affective response to punitive decisions, with stronger mediation effects of negative emotions observed in DeepSeek models than in GPT models; (3) only DeepSeek-R1 exhibited the human-like positive feedback loop from previous punitive decisions to positive affective feedback and subsequent punitive choices; (4) most LLMs (excluding GPT-3.5) showed significant representational similarity to human affect-decision patterns; (5) notably, all LLMs displayed rigid affective dynamics, characterized by lower affective variability and higher affective inertia than the flexible, context-sensitive fluctuations observed in humans. These findings highlight notable advances in prosocial value alignment but underscore the necessity of enhancing LLMs' affective dynamics to foster robust, adaptive prosocial LLMs. Such advancements could not only accelerate LLMs' alignment with human values but also provide empirical support for the broader applicability of prosocial theories to LLM agents.
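One way to make the affective-dynamics contrast in point (5) concrete is to compute within-agent variability and inertia from a series of affect ratings. The sketch below uses a common operationalization (standard deviation for variability, lag-1 autocorrelation for inertia) on simulated AR(1) series; this is an assumed formulation for illustration and may differ from the paper's exact measures.

```python
# Illustrative sketch (an assumed operationalization, not code from the paper):
# affective variability as the within-agent standard deviation of affect ratings,
# and affective inertia as the lag-1 autocorrelation of the same series.
import numpy as np

def affective_variability(affect: np.ndarray) -> float:
    """Standard deviation of one agent's affect ratings across trials."""
    return float(np.std(affect, ddof=1))

def affective_inertia(affect: np.ndarray) -> float:
    """Lag-1 autocorrelation: how strongly affect at trial t predicts trial t+1."""
    return float(np.corrcoef(affect[:-1], affect[1:])[0, 1])

def simulate_ar1(n: int, mean: float, phi: float, noise_sd: float, rng) -> np.ndarray:
    """Toy AR(1) affect series: x_t = mean + phi * (x_{t-1} - mean) + noise."""
    x = np.empty(n)
    x[0] = mean
    for t in range(1, n):
        x[t] = mean + phi * (x[t - 1] - mean) + rng.normal(0.0, noise_sd)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 30-trial negative-affect series: the human-like parameters fluctuate
    # more but carry less over between trials; the LLM-like ones are flat and persistent.
    human_like = simulate_ar1(30, mean=5.0, phi=0.3, noise_sd=1.5, rng=rng)
    llm_like = simulate_ar1(30, mean=5.0, phi=0.9, noise_sd=0.2, rng=rng)
    for name, series in [("human-like", human_like), ("LLM-like", llm_like)]:
        print(f"{name}: variability={affective_variability(series):.2f}, "
              f"inertia={affective_inertia(series):.2f}")
```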
Abstract: We present a new algorithm for manifold learning and nonlinear dimensionality reduction. Based on a set of unorganized data points sampled with noise from a parameterized manifold, the local geometry of the manifold is learned by constructing an approximation of the tangent space at each point, and those tangent spaces are then aligned to give the global coordinates of the data points with respect to the underlying manifold. We also present an error analysis of our algorithm showing that reconstruction errors can be quite small in some cases. We illustrate our algorithm using curves and surfaces in both 2D/3D Euclidean spaces and higher-dimensional Euclidean spaces. We also address several theoretical and algorithmic issues for further research and improvement.
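The two stages described above (local tangent-space estimation followed by global alignment) can be illustrated with a compact LTSA-style sketch: local PCA on each neighborhood yields tangent-space coordinates, and an accumulated alignment matrix is then solved for the eigenvectors of its smallest nonzero eigenvalues to obtain global coordinates. Normalization and weighting details here are simplified assumptions rather than the authors' exact formulation.

```python
# Minimal numpy sketch of tangent-space estimation plus alignment.
# Simplified LTSA-style formulation; details are illustrative assumptions.
import numpy as np

def tangent_space_alignment(X: np.ndarray, n_components: int = 2,
                            n_neighbors: int = 10) -> np.ndarray:
    """X: (n_samples, n_features) noisy samples from a d-dimensional manifold."""
    n, d = X.shape[0], n_components
    # Pairwise distances -> k nearest neighbors (including the point itself).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, :n_neighbors]

    B = np.zeros((n, n))  # alignment matrix accumulated over neighborhoods
    for i in range(n):
        idx = neighbors[i]
        Xi = X[idx] - X[idx].mean(axis=0)            # centered neighborhood
        # Local PCA: top-d left singular vectors give tangent-space coordinates.
        U, _, _ = np.linalg.svd(Xi, full_matrices=False)
        Gi = np.hstack([np.ones((n_neighbors, 1)) / np.sqrt(n_neighbors), U[:, :d]])
        # Penalize disagreement between local and (unknown) global coordinates.
        B[np.ix_(idx, idx)] += np.eye(n_neighbors) - Gi @ Gi.T

    # Global coordinates: eigenvectors of B for the smallest nonzero eigenvalues
    # (the very first eigenvector is the constant vector and is skipped).
    _, eigvecs = np.linalg.eigh(B)
    return eigvecs[:, 1:d + 1]

if __name__ == "__main__":
    # Toy example: a noisy 1-D curve embedded in 3-D Euclidean space.
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, np.pi, 200))
    X = np.c_[np.cos(t), np.sin(t), 0.3 * t] + 0.01 * rng.normal(size=(200, 3))
    Y = tangent_space_alignment(X, n_components=1, n_neighbors=12)
    print("embedding shape:", Y.shape)  # (200, 1), ordered roughly like t
```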