Abstract
To address the rise in the false negative rate caused by class imbalance in DNA methylation data from The Cancer Genome Atlas (TCGA), this paper proposes an improved classification method for imbalanced TCGA data. The synthetic minority oversampling technique (SMOTE) is combined with the Tomek Link algorithm for hybrid sampling to resolve the imbalance. The training set, after feature selection, is then fed into the improved model for training, learning, and classification. Experiments on TCGA DNA methylation data for six cancers show that the improved method significantly improves classification performance on minority-class samples and also yields some improvement on majority-class samples.
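The hybrid sampling step named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the neighbour count `k`, the choice to balance the classes exactly, and the choice to drop both members of each Tomek link are all assumptions made for illustration. It also assumes a binary problem with at least two minority samples.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: euclidean(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: a pair of samples with
    different labels that are each other's nearest neighbour."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: euclidean(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    drop = {
        i for i in range(len(X))
        if nn[nn[i]] == i and y[i] != y[nn[i]]
    }
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class up to the majority size (assumption),
    then clean the class boundary by removing Tomek links."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    synthetic = smote(minority, n_majority - len(minority), k, seed)
    X_res = list(X) + synthetic
    y_res = list(y) + [minority_label] * len(synthetic)
    return remove_tomek_links(X_res, y_res)
```

In practice, resampling of this kind is applied to the training set only, so that the test set keeps the original class distribution.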
Authors
HOU Weiyan (侯维岩)
LIU Chao (刘超)
SONG Yang (宋杨)
SUN Yi (孙燚)
Affiliations: School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; School of Mechatronic and Automation, Shanghai University, Shanghai 200072, China
Source
Journal of Anhui University (Natural Science Edition)
Indexed in: CAS; PKU Core (北大核心)
2020, No. 1, pp. 37-43 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (61573237).