On October 18, 2017, the 19th National Congress Report called for the implementation of the Healthy China Strategy. The development of biomedical data plays a pivotal role in advancing this strategy. Since the 18th National Congress of the Communist Party of China, China has vigorously promoted the integration and implementation of the Healthy China and Digital China strategies. The National Health Commission has prioritized the development of health and medical big data, issuing policies to promote standardized applications and foster innovation in "Internet + Healthcare." Biomedical data has significantly contributed to precision medicine, personalized health management, drug development, disease diagnosis, public health monitoring, and epidemic prediction capabilities.
Objectives: Electronic health records (EHRs) offer valuable real-world data (RWD) for Chinese medicine research. However, significant methodological challenges remain in developing integrative Chinese-Western medicine (ICWM) databases. This study aims to establish a best-practice methodological framework, referred to as BRIDGE, to guide the construction of ICWM databases using EHRs. Methods: We developed the methodological framework through a comprehensive process, including a systematic literature review, synthesis of empirical experiences, thematic expert discussions, and consultation with an external panel to reach consensus. Results: The BRIDGE framework outlines six core components for ICWM-EHR database development: overall design, database architecture, data extraction and linkage, data governance, data verification, and data quality evaluation. Key data elements include variables related to the study population, treatment or exposure, outcomes, and confounders. These databases support various research applications, particularly in evaluating the effectiveness and safety of integrative therapies. To demonstrate its practical value, we developed an ICWM-EHR database on women's reproductive lifespan encompassing 2,064,482 patients. This database captures women's health conditions across the life course, from reproductive age to older adulthood. Conclusions: The BRIDGE methodological framework provides a standardized approach to building high-quality ICWM-EHR databases. It offers a unique opportunity to strengthen the methodological rigor and real-world relevance of Chinese medicine research in integrated healthcare settings.
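Where the abstract lists the key data elements (study population, treatment or exposure, outcomes, and confounders), a minimal sketch of how such elements might be organized in an analysis-ready record is given below; the field names and types are illustrative assumptions, not the framework's prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ICWMRecord:
    """One hypothetical analysis-ready row distilled from linked EHR tables."""
    patient_id: str                                           # pseudonymized ID
    # Study population
    age: int
    sex: str
    diagnosis_codes: list[str] = field(default_factory=list)  # e.g., ICD-10
    # Treatment / exposure
    cm_formulas: list[str] = field(default_factory=list)      # Chinese medicine prescriptions
    western_drugs: list[str] = field(default_factory=list)    # ATC-coded drugs
    # Outcomes
    outcome_event: Optional[str] = None
    outcome_date: Optional[date] = None
    # Confounders
    comorbidities: list[str] = field(default_factory=list)
    smoking: Optional[bool] = None
```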
As smart grid technology rapidly advances, the vast amount of user data collected by smart meters presents significant challenges in data security and privacy protection. Current research emphasizes data security and user privacy concerns within smart grids. However, existing methods struggle with efficiency and security when processing large-scale data. Balancing efficient data processing with stringent privacy protection during data aggregation in smart grids remains an urgent challenge. This paper proposes an AI-based multi-type data aggregation method designed to enhance aggregation efficiency and security by standardizing and normalizing various data modalities. The approach optimizes data preprocessing, integrates Long Short-Term Memory (LSTM) networks for handling time-series data, and employs homomorphic encryption to safeguard user privacy. It also explores the application of Boneh-Lynn-Shacham (BLS) signatures for user authentication. The proposed scheme's efficiency, security, and privacy protection capabilities are validated through rigorous security proofs and experimental analysis.
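The privacy step rests on additively homomorphic encryption. Below is a minimal sketch of that principle using the Paillier cryptosystem via the `phe` Python library; the paper's actual scheme, its LSTM preprocessing, and the BLS authentication step are not reproduced, and the meter readings are placeholders.

```python
# Additively homomorphic aggregation: the aggregator sums ciphertexts
# without ever seeing individual plaintext readings.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

readings = [3.2, 1.7, 4.5]  # kWh values from three hypothetical meters
ciphertexts = [public_key.encrypt(r) for r in readings]

# Ciphertext addition corresponds to plaintext addition.
encrypted_total = sum(ciphertexts[1:], ciphertexts[0])

# Only the authority holding the private key can decrypt the aggregate.
total = private_key.decrypt(encrypted_total)
print(total)  # 9.4 (up to floating-point encoding precision)
```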
【Objective】Medical imaging data has great value, but it contains a significant amount of sensitive patient information. At present, laws and regulations regarding the de-identification of medical imaging data are not clearly defined around the world. This study aims to develop a tool that meets compliance-driven de-identification requirements tailored to diverse research needs. 【Methods】To enhance the security of medical image data, we designed and implemented a DICOM-format medical image de-identification system on the Windows operating system. 【Results】Our custom de-identification system is adaptable to the legal standards of different countries and can accommodate specific research demands. The system offers both web-based online and desktop offline de-identification capabilities, enabling customization of de-identification rules and facilitating batch processing to improve efficiency. 【Conclusions】This medical image de-identification system robustly strengthens the stewardship of sensitive medical data, aligning with data security protection requirements while facilitating the sharing and utilization of medical image data. This approach unlocks the intrinsic value inherent in such datasets.
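As a minimal sketch of the kind of rule-driven DICOM de-identification the abstract describes, the snippet below uses the `pydicom` library; the tag lists are illustrative assumptions, not the system's actual compliance rule set.

```python
# Rule-driven DICOM de-identification: some tags are removed outright,
# others replaced with fixed placeholder values.
import pydicom

REMOVE_TAGS = ["PatientBirthDate", "PatientAddress", "OtherPatientIDs"]
REPLACE_TAGS = {"PatientName": "ANONYMOUS", "PatientID": "ID-0001"}

def deidentify(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)
    for tag in REMOVE_TAGS:
        if tag in ds:
            delattr(ds, tag)
    for tag, value in REPLACE_TAGS.items():
        if tag in ds:
            setattr(ds, tag, value)
    ds.remove_private_tags()  # drop vendor-specific identifying data
    ds.save_as(out_path)

deidentify("scan.dcm", "scan_deid.dcm")  # hypothetical file names
```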
With the rise of data-intensive research, data literacy has become a critical capability for improving scientific data quality and achieving artificial intelligence (AI) readiness. In the biomedical domain, data are characterized by high complexity and privacy sensitivity, calling for robust and systematic data management skills. This paper reviews current trends in scientific data governance and the evolving policy landscape, highlighting persistent challenges such as inconsistent standards, semantic misalignment, and limited awareness of compliance. These issues are largely rooted in the lack of structured training and practical support for researchers. In response, this study builds on existing data literacy frameworks and integrates the specific demands of biomedical research to propose a comprehensive, lifecycle-oriented data literacy competency model with an emphasis on ethics and regulatory awareness. Furthermore, it outlines a tiered training strategy tailored to different research stages (undergraduate, graduate, and professional), offering theoretical foundations and practical pathways for universities and research institutions to advance data literacy education.
Since meteorological conditions are the main factor driving the transport and dispersion of air pollutants, an accurate simulation of the meteorological field directly affects the accuracy of atmospheric chemical transport models in simulating PM_(2.5). Based on the NASM joint chemical data assimilation system, the authors quantified the impacts of different meteorological fields on pollutant simulations and revealed the role of meteorological conditions in the accumulation, maintenance, and dissipation of heavy haze pollution. For the two heavy pollution episodes from 10 to 24 November 2018, the meteorological fields were obtained from NCEP FNL and ERA5 reanalysis data, each used to drive the WRF model, to analyze the differences in the simulated PM_(2.5) concentration. The results show that the meteorological field has a strong influence on the concentration levels and spatial distribution of the simulated pollution. The ERA5 group had relatively small simulation errors and yielded more accurate PM_(2.5) results: its RMSE was 11.86 μg m^(-3) lower than that of the FNL group before assimilation, and 5.77 μg m^(-3) lower after joint assimilation. The authors used the PM_(2.5) simulation results obtained with the ERA5 data to discuss the role of the wind field and circulation pattern in the pollution process, to analyze the correlations of wind speed, temperature, relative humidity, and boundary layer height with pollutant concentrations, and to further clarify the key formation mechanism of this pollution process.
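The comparison above turns on the RMSE between simulated and observed PM_(2.5); a minimal sketch of that metric, with placeholder arrays rather than the study's data, is:

```python
# RMSE between simulated and observed PM2.5; values are toy placeholders.
import numpy as np

def rmse(simulated: np.ndarray, observed: np.ndarray) -> float:
    return float(np.sqrt(np.mean((simulated - observed) ** 2)))

obs = np.array([85.0, 120.0, 140.0, 95.0])       # station PM2.5, ug/m3
sim_fnl = np.array([70.0, 100.0, 165.0, 80.0])   # FNL-driven WRF run
sim_era5 = np.array([80.0, 112.0, 148.0, 90.0])  # ERA5-driven WRF run

print(rmse(sim_fnl, obs), rmse(sim_era5, obs))   # ERA5 run scores lower here
```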
The widespread use of rechargeable batteries in portable devices, electric vehicles, and energy storage systems has underscored the importance of accurately predicting their lifetimes. However, data scarcity often limits the accuracy of prediction models, a problem exacerbated by incomplete data arising from issues such as sensor failures. To address these challenges, we propose a novel approach that compensates for data insufficiency by extracting additional information from incomplete data samples, which are usually discarded in existing studies. To fully unleash the predictive power of incomplete data, we investigate the Multiple Imputation by Chained Equations (MICE) method, which diversifies the training data by exploring potential data patterns. The experimental results demonstrate that the proposed method significantly outperforms the baselines in most of the considered scenarios, reducing the prediction root mean square error (RMSE) by up to 18.9%. Furthermore, we observe that incorporating incomplete data improves the explainability of the prediction model by facilitating feature selection.
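A minimal MICE-style sketch using scikit-learn's `IterativeImputer` is shown below; drawing several posterior imputations to diversify the training data is an assumption about how such diversification could be realized, not the paper's exact procedure, and the feature matrix is a toy placeholder.

```python
# Multiple imputation: each run with sample_posterior=True yields a different
# plausible completion of the incomplete samples instead of discarding them.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy battery features with missing entries (np.nan) from "sensor failures".
X = np.array([
    [3.9, 0.12, np.nan],
    [3.8, 0.15, 42.0],
    [np.nan, 0.11, 40.0],
    [3.7, np.nan, 38.0],
])

imputations = []
for seed in range(5):  # five imputed datasets from one incomplete dataset
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputations.append(imp.fit_transform(X))

print(np.round(np.mean(imputations, axis=0), 3))  # averaged completion
```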
Previous studies aiming to accelerate data processing have focused on enhancement algorithms, using the graphics processing unit (GPU) to speed up programs, and thread-level parallelism. These methods overlook maximizing the utilization of existing central processing unit (CPU) resources and reducing human and computational time costs via process automation. Accordingly, this paper proposes a scheme, called SSM, that combines the "Srun job submission mode", the "Sbatch job submission mode", and a "Monitor function". The SSM scheme includes three main modules: data management, command management, and resource management. Its core innovations are command splitting and parallel execution. The results show that this method effectively improves CPU utilization and reduces the time required for data processing. In terms of CPU utilization, the average value of this scheme is 89%, whereas the average CPU utilizations of the "Srun job submission mode" and the "Sbatch job submission mode" are significantly lower, at 43% and 52%, respectively. In terms of data-processing time, SSM testing on Five-hundred-meter Aperture Spherical radio Telescope (FAST) data requires only 5.5 h, compared with 8 h in the "Srun job submission mode" and 14 h in the "Sbatch job submission mode". In addition, tests on the FAST and Parkes datasets demonstrate the universality of the SSM scheme, which can process data from different telescopes. The compatibility of the SSM scheme with pulsar searches is verified using 2 days of observational data from the globular cluster M2, with the scheme successfully discovering all published pulsars in M2.
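A minimal sketch of the command-splitting idea follows: a long list of processing commands is divided into chunks, each submitted as one Slurm batch job whose tasks run in parallel, with a simple queue poll standing in for the Monitor function. The file names and the `search.py` command are hypothetical placeholders, not the SSM scheme's actual interface.

```python
# Split many commands into chunks and submit each chunk as one sbatch job.
import subprocess

commands = [f"python search.py --file obs_{i:04d}.fits" for i in range(100)]
CHUNK = 25  # commands per batch job, run as parallel background tasks

for j in range(0, len(commands), CHUNK):
    lines = ["#!/bin/bash",
             f"#SBATCH --job-name=ssm_{j // CHUNK}",
             f"#SBATCH --ntasks={CHUNK}"]
    lines += [cmd + " &" for cmd in commands[j:j + CHUNK]]
    lines.append("wait")  # keep the job alive until all tasks finish
    subprocess.run(["sbatch"], input="\n".join(lines) + "\n",
                   text=True, check=True)

# Monitoring stand-in: poll the queue for one of the submitted jobs.
subprocess.run(["squeue", "--name=ssm_0"], check=True)
```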
Data, as a factor of production, is driving profound transformations in the real economy across production objects, methods, and tools, generating significant economic effects such as industrial structure upgrading. This paper reveals the mechanisms through which data elements drive the "three transformations" (high-end, intelligent, and green) in the manufacturing sector, theoretically elucidating the intrinsic channels by which data elements influence these transformations. The study finds that data elements significantly enhance the high-end, intelligent, and green levels of China's manufacturing industry. In terms of impact pathways, data elements primarily act on the development of high-tech industries and on overall green technological innovation, thereby affecting the high-end, intelligent, and green transformation of the industry.
Lead (Pb) plays a significant role in the nuclear industry and is extensively used in radiation shielding, radiation protection, neutron moderation, radiation measurements, and various other critical functions. Consequently, the measurement and evaluation of Pb nuclear data are a high priority in nuclear research. Using the time-of-flight (ToF) method, the neutron leakage spectra from three ^(nat)Pb samples were measured at 60° and 120° on the neutronics integral experimental facility at the China Institute of Atomic Energy (CIAE). The ^(nat)Pb sample sizes were 30 cm×30 cm×5 cm, 30 cm×30 cm×10 cm, and 30 cm×30 cm×15 cm. Neutron sources were generated by a Cockcroft-Walton accelerator, producing approximately 14.5 MeV and 3.5 MeV neutrons through the T(d,n)^(4)He and D(d,n)^(3)He reactions, respectively. Leakage neutron spectra were also calculated with the Monte Carlo code MCNP-4C, using the nuclear data of Pb isotopes from four libraries individually: CENDL-3.2, JEFF-3.3, JENDL-5, and ENDF/B-VIII.0. By comparing the simulated and experimental results, improvements and deficiencies in the evaluated nuclear data of the Pb isotopes were analyzed. Most of the calculated results were consistent with the experimental results; however, a few regions did not fit well. In the (n,el) energy range, the simulated results from CENDL-3.2 were significantly overestimated; in the (n,inl)D and (n,inl)C energy regions, the results from CENDL-3.2 and ENDF/B-VIII.0 were significantly overestimated at 120°, and the results from JENDL-5 and JEFF-3.3 were underestimated at 60° in the (n,inl)D energy region. The calculated spectra were analyzed by comparison with the experimental spectra in terms of neutron spectrum shape and C/E values. The results indicate that the theoretical simulations, using different data libraries, overestimated or underestimated the measured values in certain energy ranges. Secondary neutron energies and angular distributions in the data files are presented to explain these discrepancies.
In this study, we developed a high-resolution (3 arcsec, approximately 90 m) V_(S30) map and an associated open-access dataset for the 140 km×200 km region affected by the January 2025 M6.8 Dingri, Xizang, China earthquake. This map provides significantly finer resolution than existing V_(S30) maps, which typically use a 30 arcsec grid. The V_(S30) values were estimated using the Cokriging-based V_(S30) proxy model (SCK model), which integrates V_(S30) measurements as primary constraints and utilizes topographic slope as a secondary parameter. The findings indicate that V_(S30) values range from 200 to 250 m/s in the sedimentary deposit areas near the earthquake's epicenter and from 400 to 600 m/s in the surrounding mountainous regions. This study showcases the capability of the SCK model to efficiently generate V_(S30) estimates across various spatial resolutions and demonstrates its effectiveness in producing reliable estimates in data-sparse regions.
Semantic communication (SemCom) aims to achieve high-fidelity information delivery at low communication cost by guaranteeing only semantic accuracy. Nevertheless, semantic communication still suffers from unexpected channel volatility, so developing a re-transmission mechanism (e.g., hybrid automatic repeat request [HARQ]) becomes indispensable. In that regard, instead of discarding previously transmitted information, incremental knowledge-based HARQ (IK-HARQ) is a more effective mechanism that can fully utilize the transmitted semantics. However, considering the possible existence of semantic ambiguity in image transmission, a simple bit-level cyclic redundancy check (CRC) might compromise the performance of IK-HARQ. Therefore, there is a strong incentive to rethink the CRC mechanism so as to more effectively reap the benefits of both SemCom and HARQ. In this paper, built on top of Swin Transformer-based joint source-channel coding (JSCC) and IK-HARQ, we propose a semantic image transmission framework, SC-TDA-HARQ. In particular, unlike a conventional CRC, we introduce a topological data analysis (TDA)-based error detection method, which extracts the inner topological and geometric information of images, to capture semantic information and determine the necessity of re-transmission. Extensive numerical results validate the effectiveness and efficiency of the proposed SC-TDA-HARQ framework, especially under limited bandwidth, and demonstrate the superiority of the TDA-based error detection method in image transmission.
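A minimal sketch of TDA-based comparison of a sent and a received image with the GUDHI library is given below: cubical complexes are built on the grayscale images, and their 0-dimensional persistence diagrams are compared by bottleneck distance, which then gates re-transmission. The threshold and preprocessing are illustrative assumptions, not the paper's pipeline.

```python
# Compare topological summaries of two images via persistent homology.
import numpy as np
import gudhi

def persistence_diagram(img: np.ndarray, dim: int = 0) -> np.ndarray:
    cc = gudhi.CubicalComplex(top_dimensional_cells=img)
    cc.persistence()  # must run before querying intervals
    return cc.persistence_intervals_in_dimension(dim)

rng = np.random.default_rng(0)
sent = rng.random((32, 32))
received = sent + 0.05 * rng.standard_normal((32, 32))  # channel noise

d = gudhi.bottleneck_distance(persistence_diagram(sent),
                              persistence_diagram(received))
if d > 0.1:  # hypothetical semantic-error threshold
    print("request re-transmission, bottleneck distance =", d)
else:
    print("accept frame, bottleneck distance =", d)
```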
The load profile is a key characteristic of the power grid and forms the basis for power flow control and generation scheduling. However, with the wide adoption of internet-of-things (IoT)-based metering infrastructure, the cyber vulnerability of load meters has attracted significant attention from adversaries. In this paper, we investigate the vulnerability of nodal prices to manipulation by injecting false load data into meter measurements. By taking advantage of the changing properties of real-world load profiles, we propose a deeply hidden load data attack (DH-LDA) that can evade bad data detection, clustering-based detection, and price anomaly detection. The main contributions of this work are as follows: (i) we design a stealthy attack framework that exploits historical load patterns to generate load data with minimal statistical deviation from normal measurements, thereby maximizing concealment; (ii) we identify the optimal time window for data injection to ensure that the altered nodal prices follow natural fluctuations, enhancing the undetectability of the attack in real-time market operations; (iii) we develop a resilience evaluation metric and formulate an optimization-based approach to quantify the electricity market's robustness against DH-LDAs. Our experiments show that the adversary can profit from the electricity market while remaining undetected.
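A minimal sketch of the stealth principle behind contribution (i) is shown below: an injected load value is clipped to the historical variation band of the same hour, so a simple z-score-based bad-data detector sees nothing unusual. The band width and target hour are illustrative assumptions, not the DH-LDA construction.

```python
# Perturb a load reading only within the historical band of the same hour.
import numpy as np

rng = np.random.default_rng(1)
history = 100 + 10 * rng.standard_normal((30, 24))  # 30 days x 24 hourly loads

mu, sigma = history.mean(axis=0), history.std(axis=0)
today = history[-1].copy()

# Inject a bias at hour 18 but clip to one historical sigma, keeping the
# altered value statistically indistinguishable from normal measurements.
hour, bias = 18, 15.0
injected = np.clip(today[hour] + bias,
                   mu[hour] - sigma[hour], mu[hour] + sigma[hour])
z = abs(injected - mu[hour]) / sigma[hour]
print(f"injected={injected:.1f}, z-score={z:.2f}")  # z <= 1 by construction
```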
Lunar wrinkle ridges are important stress-related geological structures that reflect the stress state and geological activity of the Moon. They provide important insights into the evolution of the Moon and are key factors in future lunar activities, such as the selection of landing sites. However, automatic extraction of lunar wrinkle ridges is a challenging task due to their complex morphology and ambiguous features, and traditional manual extraction methods are time-consuming and labor-intensive. To achieve automated and detailed detection of lunar wrinkle ridges, we constructed a lunar wrinkle ridge data set, incorporating previously unused aspect data to provide edge information, and proposed a Dual-Branch Ridge Detection Network (DBR-Net) based on deep learning. This method employs a dual-branch architecture and an Attention Complementary Feature Fusion module to address the issue of insufficient lunar wrinkle ridge features. Comparisons with various deep learning approaches demonstrate that the proposed method exhibits superior detection performance. Furthermore, the trained model was applied to lunar mare regions, generating a distribution map of lunar mare wrinkle ridges; a significant linear relationship between the length and area of the lunar wrinkle ridges was obtained through statistical analysis, and six previously unrecorded potential lunar wrinkle ridges were detected. The proposed method raises the automated extraction of lunar wrinkle ridges to pixel-level precision and verifies the effectiveness of DBR-Net in lunar wrinkle ridge detection.
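A minimal dual-branch sketch in PyTorch is given below: one branch encodes the imagery, one encodes the aspect map, and a channel-attention gate fuses them before a per-pixel ridge head. Layer sizes and the fusion rule are illustrative assumptions, not the published DBR-Net architecture.

```python
# Two input branches fused by a channel-attention gate, then a pixel-wise head.
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualBranchRidgeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_branch = conv_block(1, 16)      # image / DEM features
        self.aspect_branch = conv_block(1, 16)   # aspect-derived edge features
        self.attn = nn.Sequential(               # channel attention over fusion
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(32, 32, 1), nn.Sigmoid())
        self.head = nn.Conv2d(32, 1, 1)          # per-pixel ridge logit

    def forward(self, image: torch.Tensor, aspect: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.img_branch(image), self.aspect_branch(aspect)], 1)
        return self.head(fused * self.attn(fused))

net = DualBranchRidgeNet()
logits = net(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 1, 64, 64])
```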
To transmit customer power data collected by smart meters (SMs) to utility companies, the data must first be transmitted to the data aggregation point (DAP) corresponding to each SM. The number of DAPs installed and their installation locations greatly impact the whole network. In traditional DAP placement algorithms, the number of DAPs must be set in advance, but determining the best number of DAPs is difficult, which undoubtedly reduces the overall performance of the network. Moreover, an excessive gap between the loads of different DAPs is another important factor affecting network quality. To address these problems, this paper proposes a DAP placement algorithm, APSSA, based on an improved affinity propagation (AP) algorithm and the sparrow search algorithm (SSA), which can select the appropriate number of DAPs to install and the corresponding installation locations according to the number of SMs and their distribution in different environments. The algorithm adds an allocation mechanism to the SSA to optimize the subnetwork. APSSA is evaluated in three different areas and compared with other DAP placement algorithms. The experimental results validate that the proposed method can reduce the network cost, shorten the average transmission distance, and reduce the load gap.
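A minimal sketch of the affinity-propagation half of the idea, using scikit-learn, is shown below: AP infers both the number of DAPs and their locations from SM coordinates without a preset cluster count. The sparrow-search refinement and the load-balancing allocation mechanism of APSSA are not reproduced, and the meter layout is synthetic.

```python
# Affinity propagation picks exemplars (candidate DAP sites) automatically.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
meters = np.vstack([rng.normal((0, 0), 0.3, (40, 2)),   # three hypothetical
                    rng.normal((3, 1), 0.3, (40, 2)),   # neighborhoods of SMs
                    rng.normal((1, 4), 0.3, (40, 2))])

ap = AffinityPropagation(random_state=0).fit(meters)
print("DAPs selected:", len(ap.cluster_centers_))
print("SMs per DAP:", np.bincount(ap.labels_))  # load spread across DAPs
```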
Distributed data fusion is essential for numerous applications, yet faces significant privacy and security challenges. Federated learning (FL), as a distributed machine learning paradigm, offers enhanced data privacy protection and has attracted widespread attention. Consequently, research increasingly focuses on developing more secure FL techniques. However, in real-world scenarios involving malicious entities, the accuracy of FL results is often compromised, particularly under the threat of collusion between two servers. To address this challenge, this paper proposes an efficient and verifiable data aggregation protocol with enhanced privacy protection. After analyzing attack methods against prior schemes, we implement key improvements. Specifically, by incorporating cascaded random numbers and perturbation terms into gradients, we strengthen the privacy protection afforded by polynomial masking, effectively preventing information leakage. Furthermore, our protocol features an enhanced verification mechanism capable of detecting collusive behavior between two servers. Accuracy testing on the MNIST and CIFAR-10 datasets demonstrates that our protocol maintains accuracy comparable to the Federated Averaging algorithm. In efficiency comparisons, while incurring only a marginal increase in verification overhead relative to the baseline scheme, our protocol achieves an average improvement of 93.13% in privacy protection and verification overhead compared to the state-of-the-art scheme, highlighting its balance between overall overhead and functionality. A current limitation is that the verification mechanism cannot precisely pinpoint the source of anomalies within aggregated results when server-side malicious behavior occurs. Addressing this limitation will be a focus of future research.
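The paper's construction uses polynomial masking with cascaded random numbers; the snippet below sketches only the simpler, underlying cancellation principle with pairwise masks among three hypothetical clients, so individual gradients stay hidden from the aggregator while the aggregate stays exact.

```python
# Pairwise-mask secure aggregation: masks cancel in the sum.
import numpy as np

rng = np.random.default_rng(42)
gradients = [rng.standard_normal(4) for _ in range(3)]  # three clients

# Client i adds s_ij for each j > i and subtracts s_ji for each j < i.
n, dim = len(gradients), 4
seeds = {(i, j): rng.standard_normal(dim)
         for i in range(n) for j in range(i + 1, n)}

masked = []
for i, g in enumerate(gradients):
    m = g.copy()
    for j in range(n):
        if i < j:
            m += seeds[(i, j)]
        elif j < i:
            m -= seeds[(j, i)]
    masked.append(m)

# The server sees only masked vectors; their sum equals the true gradient sum.
assert np.allclose(sum(masked), sum(gradients))
print(np.round(sum(masked), 3))
```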
Heterogeneous federated learning (HtFL) has gained significant attention due to its ability to accommodate diverse models and data from distributed combat units. Prototype-based HtFL methods have been proposed to reduce the high communication cost of transmitting model parameters: they share only class representatives between heterogeneous clients while maintaining privacy. However, existing prototype learning approaches fail to take clients' data distributions into consideration, which results in suboptimal global prototype learning and insufficient client model personalization. To address these issues, we propose a fair trainable prototype federated learning (FedFTP) algorithm, which employs a fair sampling training prototype (FSTP) mechanism and a hyperbolic space constraints (HSC) mechanism to enhance the fairness and effectiveness of prototype learning on the server in heterogeneous environments. Furthermore, a local prototype stable update (LPSU) mechanism, based on contrastive learning, is proposed to maintain personalization while promoting global consistency. Comprehensive experimental results demonstrate that FedFTP achieves state-of-the-art performance in HtFL scenarios.
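A minimal sketch of prototype-based communication in HtFL is given below: each client sends per-class mean embeddings rather than model weights, and the server aggregates them weighted by class counts. FedFTP's FSTP, HSC, and LPSU mechanisms are not reproduced; the embeddings are synthetic.

```python
# Clients share per-class prototypes; the server averages them by class count.
import numpy as np

def local_prototypes(features: np.ndarray, labels: np.ndarray, n_cls: int):
    protos, counts = {}, {}
    for c in range(n_cls):
        mask = labels == c
        if mask.any():
            protos[c], counts[c] = features[mask].mean(axis=0), int(mask.sum())
    return protos, counts

rng = np.random.default_rng(0)
clients = [(rng.standard_normal((50, 8)), rng.integers(0, 3, 50))
           for _ in range(4)]  # four clients, 8-dim embeddings, 3 classes

agg = {c: np.zeros(8) for c in range(3)}
weight = {c: 0 for c in range(3)}
for feats, labs in clients:
    protos, counts = local_prototypes(feats, labs, 3)
    for c, p in protos.items():
        agg[c] += counts[c] * p
        weight[c] += counts[c]

global_protos = {c: agg[c] / weight[c] for c in agg if weight[c] > 0}
print({c: np.round(p[:3], 2) for c, p in global_protos.items()})
```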
This research introduces a unique approach to segmenting breast cancer images using a U-Net-based architecture. Because the computational demand of image processing is very high, we conducted this research to build a system that enables image segmentation training on low-power machines. To accomplish this, the data are divided into several segments, each trained separately. For prediction, an initial output is produced by each trained model for a given input, and the ultimate output is selected by pixel-wise majority voting over these outputs, which also preserves data privacy. In addition, this kind of distributed training allows different computers to be used simultaneously, so the training process takes less time than typical training approaches. Even after training is complete, the proposed prediction system allows a newly trained model to be added, so the prediction becomes consistently more accurate. We evaluated the effectiveness of the ultimate output using four performance metrics: average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy. The experimental results show scores of 0.9216, 0.0687, 0.9477, and 0.8674 for average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy, respectively. In addition, the proposed method was compared with four other state-of-the-art models in terms of total training time and usage of computational resources, and it outperformed all of them in these respects.
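A minimal sketch of the pixel-wise majority voting step is shown below, with three placeholder binary masks standing in for the outputs of independently trained models.

```python
# Pixel-wise majority vote over binary segmentation masks.
import numpy as np

preds = np.stack([
    np.array([[1, 0], [1, 1]]),
    np.array([[1, 0], [0, 1]]),
    np.array([[0, 1], [1, 1]]),
])  # shape: (n_models, H, W)

votes = preds.sum(axis=0)
final = (votes > preds.shape[0] / 2).astype(np.uint8)  # strict majority per pixel
print(final)  # [[1 0] [1 1]]
```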
Background: In recent years, there has been a growing trend toward observational studies that use routinely collected healthcare data (RCD). These studies rely on algorithms to identify specific health conditions (e.g., diabetes or sepsis) for statistical analyses. However, there has been substantial variation in algorithm development and validation, leading to frequently suboptimal performance and posing a significant threat to the validity of study findings. Unfortunately, these issues are often overlooked. Methods: We systematically developed guidance for the development, validation, and evaluation of algorithms designed to identify health status (DEVELOP-RCD). Our initial efforts involved conducting both a narrative review and a systematic review of published studies on the concepts and methodological issues related to algorithm development, validation, and evaluation. Subsequently, we conducted an empirical study of an algorithm for identifying sepsis. Based on these findings, we formulated a specific workflow and recommendations for algorithm development, validation, and evaluation within the guidance. Finally, the guidance underwent independent review by a panel of 20 external experts, who then convened a consensus meeting to finalize it. Results: A standardized workflow for algorithm development, validation, and evaluation was established. Guided by specific health status considerations, the workflow comprises four integrated steps: assessing an existing algorithm's suitability for the target health status; developing a new algorithm using recommended methods; validating the algorithm using prescribed performance measures; and evaluating the impact of the algorithm on study results. Additionally, 13 good practice recommendations were formulated with detailed explanations, and a practical study on sepsis identification was included to demonstrate the application of the guidance. Conclusions: This guidance is intended to aid researchers and clinicians in the appropriate and accurate development and application of algorithms for identifying health status from RCD. It has the potential to enhance the credibility of findings from observational studies involving RCD.
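A minimal sketch of the validation step (comparing an algorithm's labels against a reference standard such as chart review) is given below; the labels are toy data, and the guidance's prescribed performance measures may differ from the three computed here.

```python
# Validate a code-based phenotyping algorithm against a reference standard.
import numpy as np

truth = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])  # chart-review sepsis status
algo  = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1])  # algorithm's identification

tp = int(((algo == 1) & (truth == 1)).sum())
fp = int(((algo == 1) & (truth == 0)).sum())
fn = int(((algo == 0) & (truth == 1)).sum())
tn = int(((algo == 0) & (truth == 0)).sum())

sensitivity = tp / (tp + fn)   # 0.80
ppv = tp / (tp + fp)           # 0.80 (positive predictive value)
specificity = tn / (tn + fp)   # 0.80
print(sensitivity, ppv, specificity)
```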
Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data analysis. Traditional clustering algorithms, such as K-means, are widely used due to their simplicity and efficiency. This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution algorithm (SPPE) to improve clustering performance. The SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution (PPE) algorithm. First, a Variable Neighborhood Search (VNS) factor is incorporated to strengthen the local search capability and foster population diversity. Second, a position update model incorporating a spiral mechanism is designed to improve the algorithm's global exploration and convergence speed. Finally, a dynamic balancing factor, guided by fitness values, adjusts the search process to balance exploration and exploitation effectively. The performance of SPPE is first validated on the CEC2013 benchmark functions, where it demonstrates excellent convergence speed and superior optimization results compared with several state-of-the-art metaheuristic algorithms. To further verify its practical applicability, SPPE is combined with the K-means algorithm for data clustering and tested on seven datasets. Experimental results show that SPPE-K-means improves clustering accuracy, reduces dependency on initialization, and outperforms other clustering approaches. This study highlights SPPE's robustness and efficiency in solving both optimization and clustering challenges, making it a promising tool for complex data analysis tasks.
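A minimal sketch of the two ingredients is shown below: a logarithmic-spiral position update in the general form used by spiral-based metaheuristics (SPPE's exact update, VNS factor, and balancing factor are not reproduced), and handing the best candidate centers to K-means as its initialization.

```python
# Spiral-style metaheuristic search for cluster centers, then K-means refinement.
import numpy as np
from sklearn.cluster import KMeans

def spiral_step(x: np.ndarray, best: np.ndarray, b: float = 1.0) -> np.ndarray:
    """Move candidate x toward `best` along a logarithmic spiral."""
    theta = np.random.uniform(-1.0, 1.0)  # random angle parameter
    return np.abs(best - x) * np.exp(b * theta) * np.cos(2 * np.pi * theta) + best

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

def sse(centers: np.ndarray) -> float:  # within-cluster sum of squared distances
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

# Toy search: evolve candidate center pairs toward the best-scoring candidate.
pop = [rng.uniform(-2, 8, (2, 2)) for _ in range(10)]
for _ in range(30):
    best = min(pop, key=sse)
    pop = [spiral_step(x, best) for x in pop]

km = KMeans(n_clusters=2, init=min(pop, key=sse), n_init=1).fit(data)
print(np.round(km.cluster_centers_, 2))
```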
Funding (BRIDGE ICWM-EHR framework study): supported by the National Key Research & Development Program of China (No. 2024YFC3505800); the National Natural Science Foundation of China (Nos. 82474334, 82474335, and 72174132); the National Science Fund for Distinguished Young Scholars (No. 82225049); the Key Research & Development Projects of the Sichuan Provincial Department of Science and Technology (Nos. 2024YFFK0174 and 2024YFFK0152); the 1.3.5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (Nos. ZYYC24010 and ZYGD23004); and the Special Fund for Traditional Chinese Medicine of the Sichuan Provincial Administration of Traditional Chinese Medicine (No. 2024zd023).
Funding (AI-based smart grid data aggregation study): supported by the National Key R&D Program of China (No. 2023YFB2703700); the National Natural Science Foundation of China (Nos. U21A20465, 62302457, 62402444, and 62172292); the Fundamental Research Funds of Zhejiang Sci-Tech University (Nos. 23222092-Y and 22222266-Y); the Program for Leading Innovative Research Team of Zhejiang Province (No. 2023R01001); the Zhejiang Provincial Natural Science Foundation of China (Nos. LQ24F020008 and LQ24F020012); the Foundation of the State Key Laboratory of Public Big Data (No. [2022]417); and the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2023C01119).
Funding (medical image de-identification study): supported by the CAMS Innovation Fund for Medical Sciences (CIFMS), "Construction of an Intelligent Management and Efficient Utilization Technology System for Big Data in Population Health Science" (2021-I2M-1-057); and the Key Projects of the Innovation Fund of the National Clinical Research Center for Orthopedics and Sports Rehabilitation, "National Orthopedics and Sports Rehabilitation Real-World Research Platform System Construction" (23-NCRC-CXJJ-ZD4).
Funding (PM_(2.5) data assimilation study): supported by the Second Tibetan Plateau Scientific Expedition and Research Program of the Ministry of Science and Technology of the People's Republic of China (grant number 2022QZKK0101) and the Science and Technology Department of Tibet Program (grant number XZ202301ZY0035G).
Funding (SSM data-processing scheme study): supported by the National Natural Science Foundation of China (12363010), the Guizhou Provincial Basic Research Program (Natural Science) (ZK[2023]039), and the Key Technology R&D Program ([2023]352).
Funding (Pb nuclear data study): supported by the National Natural Science Foundation of China (Nos. 11775311 and U2067205), the Stable Support Basic Research Program Grant (BJ010261223282), and the Research and Development Project of the China National Nuclear Corporation.
Funding (V_(S30) mapping study): supported by the National Natural Science Foundation of China (No. 42120104002).
Funding (SC-TDA-HARQ study): supported in part by the National Key Research and Development Program of China under Grant 2024YFE0200600; the National Natural Science Foundation of China under Grant 62071425; the Zhejiang Key Research and Development Plan under Grant 2022C01093; the Zhejiang Provincial Natural Science Foundation of China under Grant LR23F010005; the National Key Laboratory of Wireless Communications Foundation under Grant 2023KP01601; and the Big Data and Intelligent Computing Key Lab of CQUPT under Grant BDIC-2023-B-001.
Funding (DH-LDA study): supported by the Major Scientific and Technological Special Project of Guizhou Province ([2024]014).
Funding (APSSA DAP placement study): supported by the Fujian University of Technology under Grants GYZ20016, GY-Z18183, and GY-Z19005, and partially supported by the National Science and Technology Council under Grant NSTC 113-2221-E-224-056-.
Funding (verifiable FL data aggregation protocol study): supported by the National Key R&D Program of China (2023YFB3106100), the National Natural Science Foundation of China (62102452 and 62172436), and the Natural Science Foundation of Shaanxi Province (2023-JCYB-584).
Funding: supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (No. 2022D01B187).
Abstract: Heterogeneous federated learning (HtFL) has gained significant attention for its ability to accommodate diverse models and data from distributed combat units. Prototype-based HtFL methods were proposed to reduce the high communication cost of transmitting model parameters: heterogeneous clients share only class representatives while maintaining privacy. However, existing prototype learning approaches ignore the data distribution of clients, which results in suboptimal global prototype learning and insufficient personalization of client models. To address these issues, we propose a fair trainable prototype federated learning (FedFTP) algorithm, which employs a fair sampling training prototype (FSTP) mechanism and a hyperbolic space constraints (HSC) mechanism to improve the fairness and effectiveness of prototype learning on the server in heterogeneous environments. Furthermore, a local prototype stable update (LPSU) mechanism based on contrastive learning is proposed to maintain personalization while promoting global consistency. Comprehensive experimental results demonstrate that FedFTP achieves state-of-the-art performance in HtFL scenarios.
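A hedged sketch of the basic prototype-sharing step follows: clients upload only per-class feature prototypes, and the server averages them weighted by class counts, as a crude stand-in for FedFTP's fairness-aware aggregation. The hyperbolic constraints and contrastive local update are not modeled, and all names and data are illustrative.

```python
# Sketch of prototype aggregation in HtFL: only class representatives
# cross the network; the server fuses them per class.
import numpy as np

def aggregate_prototypes(client_protos, client_counts):
    """client_protos: {client: {cls: vector}}; client_counts: {client: {cls: n}}."""
    global_protos = {}
    classes = {c for protos in client_protos.values() for c in protos}
    for cls in classes:
        vecs, weights = [], []
        for cid, protos in client_protos.items():
            if cls in protos:
                vecs.append(protos[cls])
                weights.append(client_counts[cid][cls])
        # Count-weighted mean; a fairness mechanism would reweight here.
        global_protos[cls] = np.average(vecs, axis=0, weights=weights)
    return global_protos

protos = {0: {"cat": np.ones(4)}, 1: {"cat": np.zeros(4), "dog": np.ones(4)}}
counts = {0: {"cat": 30}, 1: {"cat": 10, "dog": 20}}
print(aggregate_prototypes(protos, counts)["cat"])  # pulled toward client 0
```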
Funding: supported by the Researchers Supporting Project, King Saud University, Saudi Arabia, through Project No. RSPD2025R951.
Abstract: This research introduces a unique approach to segmenting breast cancer images using a U-Net-based architecture. Because the computational demand of image processing is very high, we designed a system that enables segmentation training on low-power machines. To accomplish this, the data are divided into several segments, each trained separately. At prediction time, each trained model produces an initial output for a given input, and the final output is selected by pixel-wise majority voting over those outputs, which also helps preserve data privacy. This distributed training scheme additionally allows different computers to be used simultaneously, so training takes considerably less time than typical approaches. Even after training is complete, the prediction system allows newly trained models to be added, so prediction accuracy keeps improving. We evaluated the final output with four performance metrics: average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy, obtaining scores of 0.9216, 0.0687, 0.9477, and 0.8674, respectively. The proposed method was also compared with four state-of-the-art models in terms of total training time and computational resource usage, and it outperformed all of them in these respects.
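The pixel-wise majority voting step can be stated precisely; the sketch below fuses binary masks from independently trained models. All inputs are synthetic, and the strict-majority rule is our own assumption about how ties are handled.

```python
# Illustrative pixel-wise majority voting over binary segmentation masks
# produced by separately trained models.
import numpy as np

def majority_vote(masks: np.ndarray) -> np.ndarray:
    """masks: (n_models, H, W) binary arrays -> (H, W) fused mask."""
    votes = masks.sum(axis=0)
    return (votes * 2 > masks.shape[0]).astype(np.uint8)  # strict majority

preds = np.stack([
    np.array([[1, 0], [1, 1]]),
    np.array([[1, 1], [0, 1]]),
    np.array([[0, 0], [1, 1]]),
])
print(majority_vote(preds))   # [[1 0], [1 1]]
```

Because the fused mask depends only on model outputs, a newly trained model can join the ensemble by simply adding its mask to the stack, matching the abstract's claim of extensibility after training.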
Funding: supported by the National Natural Science Foundation of China (82225049, 72104155), the Sichuan Provincial Central Government Guides Local Science and Technology Development Special Project (2022ZYD0127), and the 1·3·5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (ZYGD23004).
Abstract: Background: In recent years, there has been a growing trend toward observational studies that use routinely collected healthcare data (RCD). These studies rely on algorithms to identify specific health conditions (e.g., diabetes or sepsis) for statistical analyses. However, algorithm development and validation vary substantially, leading to frequently suboptimal performance and posing a significant threat to the validity of study findings; unfortunately, these issues are often overlooked. Methods: We systematically developed guidance for the development, validation, and evaluation of algorithms designed to identify health status (DEVELOP-RCD). We first conducted a narrative review and a systematic review of published studies on the concepts and methodological issues of algorithm development, validation, and evaluation. We then carried out an empirical study of an algorithm for identifying sepsis. Based on these findings, we formulated a specific workflow and recommendations for algorithm development, validation, and evaluation. Finally, the guidance was independently reviewed by a panel of 20 external experts, who convened a consensus meeting to finalize it. Results: A standardized workflow for algorithm development, validation, and evaluation was established. Guided by the target health status, the workflow comprises four integrated steps: assessing the suitability of an existing algorithm for the target health status; developing a new algorithm using recommended methods; validating the algorithm using prescribed performance measures; and evaluating the algorithm's impact on study results. Additionally, 13 good-practice recommendations were formulated with detailed explanations, and a practical study on sepsis identification illustrates the application of the guidance. Conclusions: The guidance is intended to help researchers and clinicians develop and apply algorithms for identifying health status from RCD appropriately and accurately, and it has the potential to enhance the credibility of findings from observational studies involving RCD.
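As a small illustration of the validation step, the sketch below computes common performance measures for a case-identification algorithm against a chart-review reference standard. The measures actually prescribed by DEVELOP-RCD may differ, and the data are synthetic, for demonstration only.

```python
# Hedged illustration: validating a health-status identification algorithm
# against a reference standard with standard 2x2-table measures.
def validation_measures(algo_flags, reference):
    tp = sum(a and r for a, r in zip(algo_flags, reference))
    fp = sum(a and not r for a, r in zip(algo_flags, reference))
    fn = sum(not a and r for a, r in zip(algo_flags, reference))
    tn = sum(not a and not r for a, r in zip(algo_flags, reference))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }

algo = [1, 1, 0, 0, 1, 0, 1, 0]    # algorithm flags "sepsis"
truth = [1, 0, 0, 0, 1, 1, 1, 0]   # chart-review reference standard
print(validation_measures(algo, truth))
```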
Abstract: Data clustering is an essential technique for analyzing complex datasets and remains a central research topic in data analysis. Traditional clustering algorithms such as K-means are widely used for their simplicity and efficiency. This paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution algorithm (SPPE) to improve clustering performance. SPPE introduces several enhancements to the standard Phasmatodea Population Evolution (PPE) algorithm. First, a Variable Neighborhood Search (VNS) factor strengthens the local search capability and fosters population diversity. Second, a position-update model incorporating a spiral mechanism improves the algorithm's global exploration and convergence speed. Finally, a dynamic balancing factor, guided by fitness values, adjusts the search process to balance exploration and exploitation. SPPE is first validated on the CEC2013 benchmark functions, where it shows fast convergence and superior optimization results compared with several state-of-the-art metaheuristics. To verify its practical applicability, SPPE is then combined with K-means for data clustering and tested on seven datasets. Experimental results show that SPPE-K-means improves clustering accuracy, reduces dependence on initialization, and outperforms other clustering approaches. These results highlight SPPE's robustness and efficiency in both optimization and clustering, making it a promising tool for complex data analysis tasks.
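For intuition about the spiral mechanism, here is a minimal sketch of a logarithmic-spiral position update that moves a candidate solution around the current best. The constant b, the random spiral parameter, and the update form are generic assumptions in the style of spiral-based metaheuristics, not SPPE's exact rule.

```python
# Generic logarithmic-spiral position update, sketched under assumptions:
# a candidate circles in toward the current best solution.
import numpy as np

rng = np.random.default_rng(1)

def spiral_update(pos, best, b=1.0):
    """Move a candidate toward `best` along a log spiral (assumed form)."""
    l = rng.uniform(-1, 1)                     # random spiral parameter
    dist = np.abs(best - pos)                  # per-dimension distance
    return dist * np.exp(b * l) * np.cos(2 * np.pi * l) + best

x = rng.uniform(-5, 5, size=3)                 # candidate solution
best = np.zeros(3)                             # current best (toy)
print(spiral_update(x, best))                  # new candidate position
```

In an SPPE-K-means pipeline, each candidate position would encode a set of cluster centroids and the fitness would be the clustering objective, so better centroid sets attract the rest of the population.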