Clustering of High-Dimensional Data Based on Few-Shot Learning in Heterogeneous Environments

Cited by: 1

Abstract: In heterogeneous environments, data sources differ in semantic expression and representation form, which creates a semantic gap between them. Moreover, the data change dynamically, so assumptions about the data distribution become invalid over time. To improve clustering accuracy in heterogeneous environments, a high-dimensional data clustering method based on few-shot learning is proposed. First, data features with relatively high similarity are grouped into the same category. A feature importance scoring function is defined based on isometric mapping and a sparse coefficient matrix, and the highest-scoring representative feature in each feature cluster is selected to form the feature subset. Then, because few-shot learning does not require assumptions about the complex distribution of all the data, the distribution assumptions are simplified, which addresses the problem of those assumptions becoming invalid over time. After a small number of samples are labeled, a deep three-dimensional neural network is used to obtain a projection function that converts high-dimensional data features into low-dimensional semantic representations. Finally, semantic commonalities are learned by computing the relation score between each embedding and the class prototypes in the embedding space. During clustering, data with similar semantics but different language representations are grouped into one cluster, which addresses the semantic gap in heterogeneous environments and achieves effective clustering of high-dimensional data. Experiments show that the method achieves good clustering performance on high-dimensional data.
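
The final step of the abstract (scoring each embedding against class prototypes and grouping by the best match) can be illustrated with a minimal sketch. The abstract does not specify how the relation score is computed, so cosine similarity is used here purely as a stand-in; the function names, the toy embeddings, and the labels "A" and "B" are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def class_prototypes(support_emb, support_labels):
    """Return {label: mean embedding} over the few labeled support samples."""
    groups = defaultdict(list)
    for vec, label in zip(support_emb, support_labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def relation_score(a, b):
    """Cosine similarity, an assumed stand-in for the paper's relation score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(query_emb, support_emb, support_labels):
    """Assign each unlabeled embedding to the prototype with the highest score."""
    protos = class_prototypes(support_emb, support_labels)
    return [
        max(protos, key=lambda label: relation_score(q, protos[label]))
        for q in query_emb
    ]


# Toy low-dimensional embeddings (hypothetical values for illustration only).
support_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["A", "A", "B", "B"]
query_emb = [[0.8, 0.2], [0.2, 0.7]]

clusters = assign_clusters(query_emb, support_emb, support_labels)
print(clusters)  # ['A', 'B']
```

In this sketch the embeddings are given directly; in the described method they would come from the learned projection function, and the relation score may itself be learned rather than fixed.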

Authors: YANG Fan (杨帆); LIU Lu (刘璐) (Shandong Taishan Pumped Storage Co., Ltd., Tai'an, Shandong 271000, China)

Affiliation: Shandong Taishan Pumped Storage Co., Ltd. (山东泰山抽水蓄能有限公司)

Source: Computer Simulation (《计算机仿真》), 2025, No. 11, pp. 262-265, 346 (5 pages)

Funding: Tai'an Science and Technology Innovation Action Plan Project (2018D1396).

Keywords: Heterogeneous Environment; Few-Shot Learning; Data Distribution Assumption; High-Dimensional Data; Data Clustering; Deep Three-Dimensional Neural Network

Classification code: TP391.9 [Automation and Computer Technology: Computer Application Technology]