面向开放世界的半监督特征选择算法

Semi-Supervised Feature Selection Algorithm for Open-World


Abstract: Most existing semi-supervised learning methods follow the closed-world assumption: class information stays fixed during model training, and the labeled data cover every class. In practice this assumption is often hard to satisfy, because unlabeled data usually contain a large number of samples from unknown classes. In recent years researchers have therefore proposed a highly challenging research direction: extending semi-supervised learning so that it not only identifies unlabeled samples of known classes accurately but also discovers and learns new, previously unknown classes, thereby building a semi-supervised learning mechanism for open-world scenarios. To address this challenge, this paper proposes an open-world semi-supervised feature selection algorithm for categorical data (OpenSSFS). The algorithm introduces coupled learning into the similarity measurement of categorical samples and into the analysis of class relationships, yielding a new sample similarity measure and a new class correlation measure. Based on these measures, it builds three core modules in sequence: an adaptive pseudo-label generation algorithm for unlabeled known-class data, a granulation and novel-class discovery algorithm for unlabeled unknown-class data, and a feature selection algorithm based on class correlation. For a given open-world dataset, the algorithm first computes a feature selection result on the known-class samples, assigns pseudo-labels to the unlabeled known-class samples with the pseudo-label generation algorithm, and then updates the feature selection result using all known-class samples. Second, it identifies new classes among the unlabeled unknown-class samples and computes a feature selection result on those new classes. Finally, it fuses the effective feature subsets obtained from the known-class and unknown-class samples to determine the final feature selection result. To validate the proposed algorithm, experiments are carried out in a simulated open-world data environment, testing its performance under different proportions of known and unknown classes and under different proportions of labeled and unlabeled samples. The results show that OpenSSFS achieves good classification performance in a variety of scenarios. First, on datasets with 50% known classes, 50% unknown classes, and 50% labeled samples, the classification accuracy of the new algorithm improves by up to nearly 70%, significantly outperforming the compared algorithms. Second, as the proportion of labeled samples decreases from 90% to 10%, the new algorithm still outperforms the other algorithms and shows no significant degradation, indicating strong robustness. Finally, even when the proportion of known classes is low, OpenSSFS maintains good performance, which makes it suitable for more open task scenarios. The experimental analysis also examines and discusses the parameter thresholds used in the algorithm.
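The three-stage procedure described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' method: the abstract does not define the coupled similarity or the class correlation measure, so plain attribute overlap and mutual information are swapped in as stand-ins. The greedy granulation rule, all function names, the parameters `k` and `threshold`, and the toy data are likewise assumptions made for this sketch.

```python
from collections import Counter
from math import log

def overlap_similarity(a, b, features=None):
    """Fraction of (selected) attributes on which two categorical samples agree.
    Stand-in for the paper's coupled similarity, which the abstract does not define."""
    idx = list(range(len(a))) if features is None else list(features)
    return sum(a[j] == b[j] for j in idx) / len(idx)

def pseudo_label(labeled, unlabeled, features, threshold):
    """Give each unlabeled sample the label of its most similar labeled sample,
    or set it aside as a possible unknown-class sample if no match is close enough."""
    accepted, leftovers = [], []
    for u in unlabeled:
        sample, label = max(labeled, key=lambda p: overlap_similarity(u, p[0], features))
        if overlap_similarity(u, sample, features) >= threshold:
            accepted.append((u, label))
        else:
            leftovers.append(u)
    return accepted, leftovers

def discover_new_classes(samples, threshold):
    """Greedy granulation (assumed rule): join the first granule whose seed is
    similar enough, otherwise start a new granule and treat it as a new class."""
    seeds, assigned = [], []
    for s in samples:
        for i, seed in enumerate(seeds):
            if overlap_similarity(s, seed) >= threshold:
                assigned.append((s, f"new_{i}"))
                break
        else:
            seeds.append(s)
            assigned.append((s, f"new_{len(seeds) - 1}"))
    return assigned

def mutual_information(values, labels):
    """Mutual information between one categorical feature and the labels."""
    n = len(values)
    joint, pv, pl = Counter(zip(values, labels)), Counter(values), Counter(labels)
    return sum(c / n * log(c * n / (pv[v] * pl[l])) for (v, l), c in joint.items())

def rank_features(pairs, k):
    """Indices of the k features sharing the most information with the labels."""
    if not pairs:
        return []
    labels = [label for _, label in pairs]
    n_features = len(pairs[0][0])
    scores = [mutual_information([s[j] for s, _ in pairs], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]

def open_world_select(labeled, unlabeled, k, threshold):
    # Step 1: select on labeled known-class data, pseudo-label, then update.
    initial = rank_features(labeled, k)
    accepted, leftovers = pseudo_label(labeled, unlabeled, initial, threshold)
    known = rank_features(labeled + accepted, k)
    # Step 2: discover new classes among the leftovers and select on them.
    novel = rank_features(discover_new_classes(leftovers, threshold), k)
    # Step 3: fuse both feature subsets (a plain union in this sketch).
    return sorted(set(known) | set(novel))

# Toy data: feature 0 separates the known classes, feature 2 the unknown ones.
labeled = [(("a", "x", "p"), "A"), (("a", "x", "p"), "A"),
           (("b", "x", "p"), "B"), (("b", "x", "p"), "B")]
unlabeled = [("a", "x", "p"), ("b", "x", "p"),
             ("c", "x", "q"), ("c", "x", "q"),
             ("c", "x", "r"), ("c", "x", "r")]
selected = open_world_select(labeled, unlabeled, k=1, threshold=0.8)
print(selected)
```

On this toy data the known-class stage keeps feature 0 and the novel-class stage keeps feature 2, so the fused result contains both. Taking the union is the simplest possible fusion rule; the abstract does not say how the paper combines the two subsets.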

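Authors: WANG Feng (王锋); WU Wen-Qiang (武文强); LIANG Ji-Ye (梁吉业) (School of Computer and Information Technology, Shanxi University, Taiyuan 030006)

Affiliation: School of Computer and Information Technology, Shanxi University

Source: Chinese Journal of Computers (《计算机学报》), PKU Core Journal, 2025, No. 6, pp. 1273-1289 (17 pages)

Funding: Supported by the General Program of the National Natural Science Foundation of China (62276158, 62376141).

Keywords: semi-supervised learning; open-world learning; feature selection; coupled learning; pairwise similarity

Classification number: TP182 [Automation and Computer Technology: Control Theory and Control Engineering]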

