The vast potential of medical big data to enhance healthcare outcomes remains underutilized due to privacy concerns, which restrict cross-center data sharing and the construction of diverse, large-scale datasets. To address this challenge, we developed a deep generative model aimed at synthesizing medical data to overcome data sharing barriers, with a focus on breast ultrasound (US) image synthesis. Specifically, we introduce CoLDiT, a conditional latent diffusion model with a transformer backbone, to generate US images of breast lesions across various Breast Imaging Reporting and Data System (BI-RADS) categories. Using a training dataset of 9,705 US images from 5,243 patients across 202 hospitals with diverse US systems, CoLDiT generated breast US images without duplicating private information, as confirmed through nearest-neighbor analysis. Blinded reader studies further validated the realism of these images, with area under the receiver operating characteristic curve (AUC) scores ranging from 0.53 to 0.77. Additionally, synthetic breast US images effectively augmented the training dataset for BI-RADS classification, achieving performance comparable to that using an equal-sized training set comprising solely real images (P = 0.81 for AUC). Our findings suggest that synthetic data, such as CoLDiT-generated images, offer a viable, privacy-preserving solution to facilitate secure medical data sharing and advance the utilization of medical big data.
Funding: Supported by the National Natural Science Foundation of China (grant no. 82071928) and the Program of Shanghai Academic/Technology Research Leader (grant no. 23XD1401300).
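The abstract states that a nearest-neighbor analysis confirmed the synthetic images do not duplicate training data, but it does not give the feature space, distance metric, or decision threshold. The following is a minimal sketch of one way such a check could be run, not the authors' procedure. It assumes each image has already been reduced to a feature vector (by any encoder), uses cosine similarity, and flags pairs above a hypothetical threshold of 0.99; the function names and the random demo data are illustrative only.

```python
import numpy as np


def nearest_training_neighbors(synthetic, training):
    """For each synthetic feature vector, return the index of its closest
    training vector and the cosine similarity to it."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    syn = synthetic / np.linalg.norm(synthetic, axis=1, keepdims=True)
    trn = training / np.linalg.norm(training, axis=1, keepdims=True)
    sims = syn @ trn.T  # shape: (n_synthetic, n_training)
    idx = sims.argmax(axis=1)
    return idx, sims[np.arange(len(syn)), idx]


def flag_possible_copies(synthetic, training, threshold=0.99):
    """List (synthetic_index, training_index, similarity) for every synthetic
    vector whose nearest training neighbor reaches the threshold.
    The default threshold is an assumption, not a value from the source."""
    idx, best = nearest_training_neighbors(synthetic, training)
    return [
        (i, int(j), float(s))
        for i, (j, s) in enumerate(zip(idx, best))
        if s >= threshold
    ]


# Demo with random stand-in features (assumption: 16-dimensional embeddings).
rng = np.random.default_rng(0)
training_features = rng.normal(size=(50, 16))
synthetic_features = rng.normal(size=(5, 16))
synthetic_features[2] = training_features[7]  # plant one exact copy

for syn_i, trn_j, sim in flag_possible_copies(synthetic_features, training_features):
    print(f"synthetic {syn_i} closely matches training {trn_j} (cosine {sim:.3f})")
```

In practice, the flagged pairs would be reviewed visually, and the threshold would be calibrated against the similarity between distinct real images rather than fixed in advance.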