The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning
Authors: Weixiang ZHAO, Yulin HU, Xingyu SUI, Zhuojun LI, Yang DENG, Yanyan ZHAO, Bing QIN, Wanxiang CHE. Frontiers of Computer Science, 2026, Issue 2, pp. 125-149 (25 pages).
Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements by steering them to forget specific harmful content. Despite the significant progress in previous studies, we argue that the current evaluation criteria, which focus solely on safety evaluation, are impractical and biased, raising concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, we first construct a novel benchmark, MUBENCH, comprising 18 related datasets, where safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. Furthermore, we examine whether MU introduces side effects, focusing on over-safety and utility loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma in achieving safety alignment without side effects, indicating that there is still considerable room for further exploration. MUBENCH serves as a comprehensive benchmark, fostering future research on MU for safety alignment of LLMs.
Keywords: machine unlearning; safety alignment; large language models