Abstract
Artificial intelligence generated content (AIGC) has seriously undermined the authenticity and reliability of information, giving rise to technical and social problems such as data pollution, property ownership disputes, and a crisis of credibility. Existing machine-generated text detection methods mostly target specific domains and achieve relatively low accuracy, and they are even harder to apply to cross-domain data that is sensitive, private, or available only in small samples. To address this problem, a highly available cross-domain machine-generated text detection method is proposed. The method first selects class-center samples from any single domain to train a domain-specific encoder, using domain features to sharpen the decision boundary. It then constructs an orthogonal loss function and uses it, together with the domain-specific encoder, to train a domain-general encoder, which reinforces the features shared by machine-generated text and supports detection across multiple domains. Experiments on real-world data show that a detection model trained on a single domain achieves high detection accuracy in other domains without fine-tuning, indicating broad applicability and strong practicality.
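The abstract does not give the form of the orthogonal loss. As a minimal sketch, assuming the loss discourages overlap between the domain-specific and domain-general representations of the same text, one common formulation penalizes the squared cosine similarity between the two feature vectors. The function name `orthogonality_penalty` and the plain list-of-vectors input are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import math


def orthogonality_penalty(specific, general):
    """Mean squared cosine similarity between paired feature vectors.

    `specific` and `general` are equal-sized batches (lists) of float
    vectors: one domain-specific and one domain-general feature per text.
    The penalty is 0 when every pair is orthogonal and approaches 1 when
    every pair is parallel. Illustrative assumption, not the paper's loss.
    """
    if len(specific) != len(general) or not specific:
        raise ValueError("expected two non-empty batches of equal size")
    total = 0.0
    for s, g in zip(specific, general):
        dot = sum(a * b for a, b in zip(s, g))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_g = math.sqrt(sum(b * b for b in g))
        # Small epsilon guards against division by zero for all-zero vectors.
        cos = dot / (norm_s * norm_g + 1e-12)
        total += cos * cos
    return total / len(specific)


if __name__ == "__main__":
    # Orthogonal pair: no shared direction, so no penalty.
    print(orthogonality_penalty([[1.0, 0.0]], [[0.0, 1.0]]))
    # Parallel pair: fully overlapping direction, penalty close to 1.
    print(orthogonality_penalty([[2.0, 0.0]], [[3.0, 0.0]]))
```

In a training loop, a term of this kind would typically be added to the classification loss with a weighting coefficient, so that the domain-general encoder is pushed toward features the domain-specific encoder does not already capture. The weighting and the exact combination are not specified in the abstract.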
Authors
罗森林
杨宗源
潘丽敏
周瑾洁
门元昊
李晔
LUO Senlin; YANG Zongyuan; PAN Limin; ZHOU Jinjie; MEN Yuanhao; LI Ye (School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; China Network Coordination Emergency Response Team/China Coordination Center, Beijing 100029, China)
Source
Transactions of Beijing Institute of Technology (《北京理工大学学报》)
Peking University Core Journal
2025, No. 12, pp. 1296-1304 (9 pages)
Funding
National "242" Information Security Program (2020A065).
Keywords
machine-generated text detection
domain generalization
pre-trained language model