Abstract
With the rapid development of big data technology, healthcare big data has grown explosively and is multi-source, heterogeneous, multi-typed, and highly interrelated. It also exhibits the characteristic 5V features: volume, velocity, variety, value, and veracity. This growth has raised security concerns, and protecting patients' private data from leakage has become a research hotspot. This paper studies patient privacy protection and the associated data analysis problems. Taking the PCA-GRA Datafly algorithm as the object of study, and aiming to address both the excessive generalization of quasi-identifier (QI) attributes in traditional algorithms and the local-optimum problem of the K-means algorithm, it proposes the PCA-GRA-BK algorithm (principal component analysis, grey relational analysis, BiK-means k-anonymity algorithm). First, PCA is used to reduce the dimensionality of the healthcare data, revealing the internal relationships among the data with a small amount of data, and the QI attributes are selected. Second, GRA is applied to the QI attributes to determine their degree of correlation with the sensitive attribute, and a generalization hierarchy for the QI attributes is constructed. Then the elbow method is used to determine the best k value for the clustering algorithm, and clustering groups the healthcare data set into similar equivalence classes. Finally, the k-anonymity algorithm is used to anonymize the healthcare data. A comparison of the Datafly, PCA-GRA Datafly, PCA-GRA-KK, and PCA-GRA-BK algorithms on the anonymization of healthcare data shows that, while data validity is preserved, the PCA-GRA-BK algorithm significantly reduces the information loss rate and significantly improves the running speed, which further supports the effectiveness of the proposed PCA-GRA-BK algorithm.
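Two of the steps named in the abstract, ranking QI attributes by their grey relational grade against the sensitive attribute and checking the k-anonymity property of the result, can be illustrated with a minimal sketch. This is not the paper's implementation. The function names (`grey_relational_grades`, `is_k_anonymous`), the min-max normalization, and the distinguishing coefficient `rho=0.5` are assumptions chosen for illustration; the abstract does not specify them.

```python
from collections import Counter


def min_max(values):
    """Scale a numeric sequence to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def grey_relational_grades(reference, candidates, rho=0.5):
    """Grey relational grade of each candidate column against a reference.

    reference  -- numeric sequence (here: an encoded sensitive attribute)
    candidates -- dict mapping attribute name to a numeric sequence
    rho        -- distinguishing coefficient (0.5 is an assumed default)
    """
    ref = min_max(reference)
    # Absolute differences between the normalized reference and each column.
    deltas = {
        name: [abs(r - c) for r, c in zip(ref, min_max(col))]
        for name, col in candidates.items()
    }
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        # Every column coincides with the reference after normalization.
        return {name: 1.0 for name in deltas}
    grades = {}
    for name, ds in deltas.items():
        # Grey relational coefficient per record, averaged into a grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in ds]
        grades[name] = sum(coeffs) / len(coeffs)
    return grades


def is_k_anonymous(records, qi_names, k):
    """True if every combination of QI values occurs in at least k records."""
    counts = Counter(tuple(rec[q] for q in qi_names) for rec in records)
    return all(c >= k for c in counts.values())


# Hypothetical toy data: a numeric sensitive score and two candidate QI columns.
grades = grey_relational_grades(
    [1, 2, 3, 4], {"age": [10, 20, 30, 40], "zip": [4, 1, 3, 2]}
)
# Hypothetical generalized records forming a single equivalence class.
generalized = [
    {"age": "30-39", "zip": "4300**", "disease": "flu"},
    {"age": "30-39", "zip": "4300**", "disease": "asthma"},
]
```

In this toy data `age` tracks the reference exactly and therefore receives the higher grade, so it would be ranked as more strongly related to the sensitive attribute than `zip`. The two generalized records share the same QI values, so the set satisfies 2-anonymity but not 3-anonymity.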
Authors
WU Jun (吴珺); ZHENG Xinli (郑欣丽); ZHU Jiahui (朱嘉辉); LI Tianyi (李天意)
School of Computer Science and Technology, Hubei University of Technology, Wuhan 430068, China; School of Materials Science and Engineering, Wuhan University of Technology, Wuhan 430070, China
Source
Journal of Central China Normal University: Natural Sciences (《华中师范大学学报(自然科学版)》)
Indexed in: CAS, CSCD, PKU Core (北大核心)
2023, No. 3, pp. 364-372 (9 pages)
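Funding
National Natural Science Foundation of China (61602161, 61772180)
Key Research and Development Program of Hubei Province (2020BAB012)
Graduate Fund of Hubei University of Technology (2021046)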