Abstract
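In big data and Internet of Things (IoT) applications, Local Differential Privacy (LDP) is used to protect user privacy in clustering analysis. Existing methods, however, either cluster interactively under LDP, which consumes a large privacy budget, or fail to account jointly for the Gaussian noise in the clustering data, which reflects data quality, and the Laplacian noise injected to satisfy LDP. Both shortcomings lead to low clustering accuracy. In addition, the distance used to measure how far user-submitted data lie from cluster centers is chosen rather arbitrarily and does not fully exploit the noise patterns contained in the noisy data that users submit. To address these problems, this paper proposes mnFCM, a mixed noise-aware fuzzy C-means clustering algorithm that satisfies LDP. The main idea is to model simultaneously the Gaussian noise that reflects data quality and the Laplacian noise injected to protect user data, and then to design a mixed noise-aware distance that replaces the traditional Euclidean distance for measuring the similarity between samples and cluster centers. Specifically, the paper first designs the mixed noise-aware distance calculation method, then gives a new objective function for the algorithm, designs a solution method based on the Lagrange multiplier method, and finally analyzes the convergence of the solution algorithm theoretically. The paper further analyzes the privacy, utility, and complexity of mnFCM. The analysis shows that the algorithm strictly satisfies LDP, yields cluster centers closer to the non-private ones than the baseline algorithms do, and has complexity close to that of the non-private algorithm. Experiments on two real datasets show that mnFCM improves clustering accuracy by 10%~15% while satisfying LDP.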
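The two noise sources named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the helper names `laplace_perturb` and `simulate_report`, the clipping range [-1, 1], and the choice to spend the whole budget ε on one K-dimensional report are assumptions made for illustration. Gaussian noise stands in for imperfect data quality, and the standard Laplace mechanism supplies the LDP guarantee.

```python
import numpy as np


def laplace_perturb(x, epsilon, lower=-1.0, upper=1.0, rng=None):
    """Perturb one K-dimensional record on the user's side (hypothetical helper).

    Assumption: every coordinate is clipped to [lower, upper], so any two
    records differ by at most K * (upper - lower) in L1 norm.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    l1_sensitivity = x.size * (upper - lower)
    # Standard Laplace mechanism: noise scale = sensitivity / epsilon.
    scale = l1_sensitivity / epsilon
    return x + rng.laplace(loc=0.0, scale=scale, size=x.shape)


def simulate_report(true_x, sigma, epsilon, rng=None):
    """Simulate a user report that carries both kinds of noise.

    Gaussian noise (std sigma) models data quality and is already present in
    what the user observes; Laplacian noise is then injected for privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_x = np.asarray(true_x, dtype=float)
    observed = true_x + rng.normal(0.0, sigma, size=true_x.shape)
    return laplace_perturb(observed, epsilon, rng=rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(simulate_report([0.2, -0.5], sigma=0.1, epsilon=1.0, rng=rng))
```

Under these assumptions the Laplacian scale grows linearly with the dimension K and shrinks as ε grows, so a collector that sees only such reports is working with heavily perturbed data.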
Objective In big data and Internet of Things (IoT) applications, clustering analysis of collected data is crucial for enhancing user experience. To mitigate the privacy risks of using raw data directly, Local Differential Privacy (LDP) techniques are often employed. However, existing LDP clustering studies either require interactive execution, which consumes a significant privacy budget, or fail to balance the Gaussian noise in the clustering data with the Laplacian noise added for LDP protection, resulting in low clustering accuracy. Moreover, distance metrics for similarity measurement are chosen arbitrarily, without fully utilizing the noise characteristics of user-submitted noisy data. This study designs a hybrid noise-aware distance calculation method and integrates it into the fuzzy C-means clustering algorithm, reducing the impact of noise on clustering results while protecting data privacy. It provides a robust solution for processing sensitive information in high-dimensional data environments.

Methods This paper proposes a mixed noise-aware Fuzzy C-Means clustering algorithm (mnFCM) under LDP. The core idea is to model both the Gaussian noise (representing data quality) and the Laplacian noise (for data protection) in uploaded user data by constructing a more accurate mixed distribution model, and to design a mixed noise-aware distance that replaces the Euclidean distance for measuring similarity between samples and cluster centers. Specifically, in mnFCM, this paper first designs a mixed noise-aware distance calculation method. On this basis, a new objective function for the algorithm is proposed, and a solution method is designed based on the Lagrange multiplier method. Finally, the convergence of the solution algorithm is theoretically analyzed.

Results and Discussions The experimental results show that as the privacy budget ε increases, the performance of the various clustering algorithms generally improves. Notably, mnFCM achieves at least an 8.5% improvement in accuracy compared to the state-of-the-art PrivPro algorithm (Fig. 1). This is because mnFCM considers both Gaussian noise (reflecting data quality) and Laplacian noise (for LDP protection) and uses a hybrid noise-aware distance metric to improve sample similarity measurement, thereby protecting privacy while preserving clustering performance. Experiments on the fuzziness parameter m reveal that when m=2, all algorithms reach their peak F-Measure values and lowest Entropy values (Fig. 2), which supports m=2 as the best balance point for clustering effectiveness. Additionally, the running time of mnFCM is 1.0 to 1.4 times that of the non-privacy-preserving Nopriv algorithm (Table 2), due to its refined noise processing mechanism. Ablation experiments demonstrate that the MixDis scheme achieves the best clustering performance on both the NG and UW datasets (Fig. 4), as it considers both Laplacian and Gaussian noise, making the clustering more robust. Comparative analysis against other privacy-preserving clustering algorithms on the synthetic dataset Syn shows that DP-DPCL+ consistently outperforms DP-DPCL, and DPC+ consistently outperforms DPC (Fig. 5). In addition, varying the four adjustable parameters (privacy budget ε, sample size N, dimension K, and cluster number C) shows that mnFCM outperforms the other privacy protection schemes (Fig. 6).

Conclusions This paper addresses privacy protection in fuzzy clustering algorithms by simultaneously considering Gaussian noise (reflecting data quality) and Laplacian noise (for LDP protection), and proposes a mixed noise-aware fuzzy C-means clustering algorithm, mnFCM, that satisfies LDP and balances privacy security with clustering quality. It designs a mixed noise-aware distance calculation method, formulates a new objective function, and solves it using the Lagrange multiplier method, while theoretically analyzing the algorithm’s convergence. Theoretical analysis shows that the algorithm strictly satisfies LDP, is closer to the non-private cluster centroids than the baseline algorithms, and has complexity similar to that of non-private algorithms. Experiments demonstrate that the algorithm improves clustering accuracy by 10%~15% on real datasets compared to baseline privacy-preserving algorithms. However, a limitation of this study is that the privacy budget calculation for Laplacian noise in the mixed noise setting may be influenced by Gaussian noise. Future research will explore adaptive noise proportion allocation strategies, such as dynamically adjusting the weights of Gaussian and Laplacian noise, to optimize the privacy-utility trade-off.
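For orientation, the baseline that mnFCM modifies can be sketched in Python. The code below is standard fuzzy C-means with the squared Euclidean distance, whose membership and center updates follow from the Lagrange multiplier conditions on the constraint that each sample's memberships sum to 1. It is not the paper's algorithm: the mixed noise-aware distance is not given in this abstract, so the `dist2` parameter is only a hypothetical hook showing where a different distance would enter, and all function names are assumptions.

```python
import numpy as np


def squared_euclidean(X, centers):
    """Squared Euclidean distances between N samples and C centers, shape (N, C)."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.sum(diff ** 2, axis=2)


def fuzzy_c_means(X, n_clusters, m=2.0, dist2=squared_euclidean,
                  n_iter=200, tol=1e-6, seed=0):
    """Standard fuzzy C-means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Random initial memberships, each row normalized to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Center update: membership-weighted mean. This closed form is valid
        # for the squared Euclidean distance; another distance would generally
        # need its own center update.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        d2 = np.maximum(dist2(X, centers), 1e-12)  # guard against zero distance
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(U_new - U)) < tol
        U = U_new
        if converged:
            break
    return centers, U


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
    centers, U = fuzzy_c_means(X, n_clusters=2)
    print(centers)
```

The default `m=2.0` matches the fuzziness value that the reported experiments identify as the best-performing setting.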
Authors
张朋飞
程俊
张治坤
方贤进
孙笠
王杰
姜茸
ZHANG Pengfei; CHENG Jun; ZHANG Zhikun; FANG Xianjin; SUN Li; WANG Jie; JIANG Rong (School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China; Key Laboratory of Service Computing, Yunnan University of Finance and Economics, Kunming 650221, China; College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China; School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China; School of Safety Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China)
Source
《电子与信息学报》
Peking University Core Journal (北大核心)
2025, No. 3, pp. 739-757 (19 pages)
Journal of Electronics & Information Technology
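Funding
Anhui University of Science and Technology Scientific Research Start-up Fund for High-level Introduced Talents (2023yjrc92)
Open Project of the Yunnan Key Laboratory of Service Computing (YNSC24116)
National Natural Science Foundation of China (62202164)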
Keywords
Clustering analysis
Privacy protection
Local Differential Privacy (LDP)
Fuzzy C-Means (FCM) clustering
Laplace mechanism