Abstract
Most existing semi-supervised learning methods follow the closed-world assumption: category information remains fixed during training, and the labeled data cover all categories. In practice this assumption is often hard to satisfy, because unlabeled data usually contain a large number of samples from unknown classes. In recent years researchers have therefore proposed a highly challenging research direction: extending semi-supervised learning so that it not only identifies unlabeled samples of known classes accurately but also discovers and learns new, previously unknown classes, thereby establishing a semi-supervised learning mechanism for open-world scenarios. To address this challenge, this paper proposes an open-world semi-supervised feature selection algorithm for categorical data (OpenSSFS). The algorithm introduces coupled learning into both the similarity measurement of categorical samples and the analysis of class relevance, yielding a new sample similarity measure and a new class correlation measure. Based on these measures, it builds three core modules in sequence: an adaptive pseudo-label generation algorithm for unlabeled known-class data, a granulation and novel-class discovery algorithm for unlabeled unknown-class data, and a feature selection algorithm based on class correlation. For a given open-world dataset, the algorithm first computes the feature selection result on the known-class samples, assigns pseudo-labels to the unlabeled known-class samples with the pseudo-label generation algorithm, and then updates the feature selection result using all known-class samples. Second, it identifies new classes among the unlabeled unknown-class samples and computes the feature selection result on those new classes. Finally, it fuses the effective feature subsets obtained from the known-class and unknown-class samples to determine the final feature selection result. To validate the proposed algorithm, experiments are conducted in a simulated open-world data environment, testing the algorithm under different proportions of known and unknown classes and different proportions of labeled and unlabeled samples. The results show that OpenSSFS achieves good classification performance across a range of scenarios. First, on datasets with 50% known classes, 50% unknown classes, and 50% labeled samples, the algorithm improves classification accuracy by up to nearly 70%, significantly outperforming the compared algorithms. Second, as the proportion of labeled samples decreases from 90% to 10%, the algorithm continues to outperform the other algorithms without a marked drop in performance, which indicates strong robustness. Finally, even when the proportion of known classes is low, OpenSSFS maintains good performance, making it suitable for more open task scenarios. The experiments also analyze and discuss the threshold parameters used in the algorithm.
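The simulated open-world setting described in the abstract (a chosen proportion of known classes, and a chosen proportion of labeled samples within them) can be illustrated with a small sketch. This is not the authors' experimental code: the function name `simulate_open_world`, its parameters, and the rounding rule are all assumptions made for illustration, and the abstract does not specify how the splits were actually drawn.

```python
import random
from collections import defaultdict


def simulate_open_world(labels, known_class_ratio=0.5, labeled_ratio=0.5, seed=0):
    """Partition sample indices into an assumed open-world split.

    Hypothetical helper: classes are divided into known and unknown ones,
    and only a fraction of the known-class samples keep their labels.
    """
    rng = random.Random(seed)

    # Choose which classes are treated as known.
    classes = sorted(set(labels))
    rng.shuffle(classes)
    n_known = max(1, round(known_class_ratio * len(classes)))
    known_classes = set(classes[:n_known])

    # Group sample indices by their true class.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    labeled, unlabeled_known, unlabeled_unknown = [], [], []
    for y, idxs in by_class.items():
        if y not in known_classes:
            # Samples of unknown classes never expose their labels.
            unlabeled_unknown.extend(idxs)
            continue
        rng.shuffle(idxs)
        n_labeled = max(1, round(labeled_ratio * len(idxs)))
        labeled.extend(idxs[:n_labeled])
        unlabeled_known.extend(idxs[n_labeled:])

    return {
        "known_classes": known_classes,
        "labeled": sorted(labeled),
        "unlabeled_known": sorted(unlabeled_known),
        "unlabeled_unknown": sorted(unlabeled_unknown),
    }


# Toy data: four classes with ten samples each.
toy_labels = [c for c in "abcd" for _ in range(10)]
split = simulate_open_world(toy_labels, known_class_ratio=0.5, labeled_ratio=0.5, seed=1)
```

Varying `known_class_ratio` and `labeled_ratio` in such a helper would correspond to the two experimental axes the abstract mentions; a feature selection method would see labels only for the indices in `labeled`.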
Authors
WANG Feng (王锋)
WU Wen-Qiang (武文强)
LIANG Ji-Ye (梁吉业)
School of Computer and Information Technology, Shanxi University, Taiyuan 030006
Source
Chinese Journal of Computers (《计算机学报》)
Peking University Core Journal (北大核心)
2025, Issue 6, pp. 1273-1289 (17 pages)
Funding
Supported by the General Program of the National Natural Science Foundation of China (62276158, 62376141).
Keywords
semi-supervised learning
open-world learning
feature selection
coupled learning
pairwise similarity