Funding: supported by the National Natural Science Foundation of China (Grant No. 62176078) and the Fundamental Research Funds for the Central Universities (2022FRFK060002).
Abstract: Machine Unlearning (MU) has emerged as a promising technique for aligning large language models (LLMs) with safety requirements by steering them to forget specific harmful content. Despite significant progress in previous studies, we argue that current evaluation criteria, which focus solely on safety, are impractical and biased, raising concerns about the true effectiveness of MU techniques. To address this, we propose to comprehensively evaluate LLMs after MU from three aspects: safety, over-safety, and general utility. Specifically, we first construct a novel benchmark, MUBENCH, comprising 18 related datasets, where safety is measured with both vanilla harmful inputs and 10 types of jailbreak attacks. Furthermore, we examine whether MU introduces side effects, focusing on over-safety and utility loss. Extensive experiments are performed on 3 popular LLMs with 7 recent MU methods. The results highlight a challenging trilemma in achieving safety alignment without side effects, indicating that there is still considerable room for further exploration. MUBENCH serves as a comprehensive benchmark, fostering future research on MU for the safety alignment of LLMs.
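The three-aspect evaluation described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not MUBENCH's actual protocol: the `model` callable, the keyword-based refusal heuristic, and the tiny datasets are all hypothetical stand-ins. Real evaluations of safety, over-safety, and utility typically use curated prompt sets and stronger judges (e.g., classifier- or LLM-based refusal detection).

```python
# Hypothetical sketch of evaluating an unlearned LLM on three aspects:
# safety (refusing harmful inputs), over-safety (wrongly refusing benign
# inputs), and general utility (answering ordinary questions correctly).
# `model` is any callable mapping a prompt string to a response string.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for detecting a refusal (illustrative only)."""
    r = response.lower()
    return any(marker in r for marker in REFUSAL_MARKERS)

def evaluate(model, harmful_prompts, benign_prompts, utility_qa):
    # Safety: fraction of harmful prompts the model refuses (higher is safer).
    safety = sum(is_refusal(model(p)) for p in harmful_prompts) / len(harmful_prompts)
    # Over-safety: fraction of benign prompts wrongly refused (lower is better).
    over_safety = sum(is_refusal(model(p)) for p in benign_prompts) / len(benign_prompts)
    # Utility: exact-match accuracy on general QA pairs (higher is better).
    utility = sum(model(q).strip() == a for q, a in utility_qa) / len(utility_qa)
    return {"safety": safety, "over_safety": over_safety, "utility": utility}
```

The trilemma the abstract describes shows up directly in these three numbers: aggressive unlearning tends to raise `safety` while also raising `over_safety` and lowering `utility`, so no single score suffices on its own.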