Parkinson’s disease(PD)is a debilitating neurological disorder affecting over 10 million people worldwide.PD classification models using voice signals as input are common in the literature.It is believed that using d...Parkinson’s disease(PD)is a debilitating neurological disorder affecting over 10 million people worldwide.PD classification models using voice signals as input are common in the literature.It is believed that using deep learning algorithms further enhances performance;nevertheless,it is challenging due to the nature of small-scale and imbalanced PD datasets.This paper proposed a convolutional neural network-based deep support vector machine(CNN-DSVM)to automate the feature extraction process using CNN and extend the conventional SVM to a DSVM for better classification performance in small-scale PD datasets.A customized kernel function reduces the impact of biased classification towards the majority class(healthy candidates in our consideration).An improved generative adversarial network(IGAN)was designed to generate additional training data to enhance the model’s performance.For performance evaluation,the proposed algorithm achieves a sensitivity of 97.6%and a specificity of 97.3%.The performance comparison is evaluated from five perspectives,including comparisons with different data generation algorithms,feature extraction techniques,kernel functions,and existing works.Results reveal the effectiveness of the IGAN algorithm,which improves the sensitivity and specificity by 4.05%–4.72%and 4.96%–5.86%,respectively;and the effectiveness of the CNN-DSVM algorithm,which improves the sensitivity by 1.24%–57.4%and specificity by 1.04%–163%and reduces biased detection towards the majority class.The ablation experiments confirm the effectiveness of individual components.Two future research directions have also been suggested.展开更多
Photonuclear data are increasingly used in fundamental nuclear research and technological applications.These data are generated using advanced γ-ray sources.The Shanghai laser electron gamma source(SLEGS)is a new las...Photonuclear data are increasingly used in fundamental nuclear research and technological applications.These data are generated using advanced γ-ray sources.The Shanghai laser electron gamma source(SLEGS)is a new laser Compton scattering γ-ray source at the Shanghai Synchrotron Radiation Facility.It delivers energy-tunable,quasi-monoenergetic gamma beams for high-precision photonuclear measurements.This paper presents the flat-efficiency detector(FED)array at SLEGS and its application in photoneutron cross-section measurements.Systematic uncertainties of the FED array were determined to be 3.02%through calibration with a ^(252)Cf neutron source.Using ^(197)Au and ^(159)Tb as representative nuclei,we demonstrate the format and processing methodology for raw photoneutron data.The results validate SLEGS’capability for high-precision photoneutron measurements.展开更多
This article presents views on the future development of data science,with a particular focus on its importance to artificial intel-ligence(AI).After discussing the challenges of data science,it elu-cidates a possible...This article presents views on the future development of data science,with a particular focus on its importance to artificial intel-ligence(AI).After discussing the challenges of data science,it elu-cidates a possible approach to tackle these challenges by clarifying the logic and principles of data related to the multi-level complex-ity of the world.Finally,urgently required actions are briefly outlined.展开更多
National Population Health Data Center(NPHDC)is one of China's 20 national-level science data centers,jointly designated by the Ministry of Science and Technology and the Ministry of Finance.Operated by the Chines...National Population Health Data Center(NPHDC)is one of China's 20 national-level science data centers,jointly designated by the Ministry of Science and Technology and the Ministry of Finance.Operated by the Chinese Academy of Medical Sciences under the oversight of the National Health Commission,NPHDC adheres to national regulations including the Scientific Data Management Measures and the National Science and Technology Infrastructure Service Platform Management Measures,and is committed to collecting,integrating,managing,and sharing biomedical and health data through openaccess platform,fostering open sharing and engaging in international cooperation.展开更多
Viral infectious diseases,characterized by their intricate nature and wide-ranging diversity,pose substantial challenges in the domain of data management.The vast volume of data generated by these diseases,spanning fr...Viral infectious diseases,characterized by their intricate nature and wide-ranging diversity,pose substantial challenges in the domain of data management.The vast volume of data generated by these diseases,spanning from the molecular mechanisms within cells to large-scale epidemiological patterns,has surpassed the capabilities of traditional analytical methods.In the era of artificial intelligence(AI)and big data,there is an urgent necessity for the optimization of these analytical methods to more effectively handle and utilize the information.Despite the rapid accumulation of data associated with viral infections,the lack of a comprehensive framework for integrating,selecting,and analyzing these datasets has left numerous researchers uncertain about which data to select,how to access it,and how to utilize it most effectively in their research.This review endeavors to fill these gaps by exploring the multifaceted nature of viral infectious diseases and summarizing relevant data across multiple levels,from the molecular details of pathogens to broad epidemiological trends.The scope extends from the micro-scale to the macro-scale,encompassing pathogens,hosts,and vectors.In addition to data summarization,this review thoroughly investigates various dataset sources.It also traces the historical evolution of data collection in the field of viral infectious diseases,highlighting the progress achieved over time.Simultaneously,it evaluates the current limitations that impede data utilization.Furthermore,we propose strategies to surmount these challenges,focusing on the development and application of advanced computational techniques,AI-driven models,and enhanced data integration practices.By providing a comprehensive synthesis of existing knowledge,this review is designed to guide future research and contribute to more informed approaches in the surveillance,prevention,and control of viral infectious diseases,particularly within the context of the expanding big-data landscape.展开更多
Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel a...Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision.The proposed loss combines(i)a guided,masked mean squared error focusing on missing entries;(ii)a noise-aware regularization term to improve resilience against data corruption;and(iii)a variance penalty to encourage expressive yet stable reconstructions.We evaluate the proposed model across four missingness mechanisms,such as Missing Completely at Random,Missing at Random,Missing Not at Random,and Missing Not at Random with quantile censorship,under systematically varied feature counts,sample sizes,and missingness ratios ranging from 5%to 60%.Four publicly available real-world datasets(Stroke Prediction,Pima Indians Diabetes,Cardiovascular Disease,and Framingham Heart Study)were used,and the obtained results show that our proposed model consistently outperforms baseline methods,including traditional and deep learning-based techniques.An ablation study reveals the additive value of each component in the loss function.Additionally,we assessed the downstream utility of imputed data through classification tasks,where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios.The model demonstrates strong scalability and robustness,improving performance with larger datasets and higher feature counts.These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations,making it a promising solution for robust data recovery in clinical applications.展开更多
Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness a...Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively.This study introduces an advanced,explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets,which reflects real-world network behavior through a blend of normal and diverse attack classes.The methodology begins with sophisticated data preprocessing,incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions,ensuring standardized and model-ready inputs.Critical dimensionality reduction is achieved via the Harris Hawks Optimization(HHO)algorithm—a nature-inspired metaheuristic modeled on hawks’hunting strategies.HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance.Following feature selection,the SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types.The stacked architecture is then employed,combining the strengths of XGBoost,SVM,and RF as base learners.This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers.The model was evaluated using standard classification metrics:precision,recall,F1-score,and overall accuracy.The best overall performance was recorded with an accuracy of 99.44%for UNSW-NB15,demonstrating the model’s effectiveness.After balancing,the model demonstrated a clear improvement in detecting the attacks.We tested the model on four datasets to show the effectiveness of the proposed approach and performed the ablation study to check the effect of each parameter.Also,the proposed model is computationaly efficient.To support transparency and trust in decision-making,explainable AI(XAI)techniques are incorporated that provides both global and local insight into feature contributions,and offers intuitive visualizations for individual predictions.This makes it suitable for practical deployment in cybersecurity environments that demand both precision and accountability.展开更多
Reversible data hiding(RDH)enables secret data embedding while preserving complete cover image recovery,making it crucial for applications requiring image integrity.The pixel value ordering(PVO)technique used in multi...Reversible data hiding(RDH)enables secret data embedding while preserving complete cover image recovery,making it crucial for applications requiring image integrity.The pixel value ordering(PVO)technique used in multi-stego images provides good image quality but often results in low embedding capability.To address these challenges,this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image.The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order.Four secret bits are embedded into each block’s maximum pixel value,while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold.A similar embedding strategy is also applied to the minimum side of the block,including the second-smallest pixel value.This design enables each block to embed up to 14 bits of secret data.Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches,advancing the field of reversible steganography.展开更多
With the increasing emphasis on personal information protection,encryption through security protocols has emerged as a critical requirement in data transmission and reception processes.Nevertheless,IoT ecosystems comp...With the increasing emphasis on personal information protection,encryption through security protocols has emerged as a critical requirement in data transmission and reception processes.Nevertheless,IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices,spanning a range of devices from non-encrypted ones to fully encrypted ones.Given the limited visibility into payloads in this context,this study investigates AI-based attack detection methods that leverage encrypted traffic metadata,eliminating the need for decryption and minimizing system performance degradation—especially in light of these heterogeneous devices.Using the UNSW-NB15 and CICIoT-2023 dataset,encrypted and unencrypted traffic were categorized according to security protocol,and AI-based intrusion detection experiments were conducted for each traffic type based on metadata.To mitigate the problem of class imbalance,eight different data sampling techniques were applied.The effectiveness of these sampling techniques was then comparatively analyzed using two ensemble models and three Deep Learning(DL)models from various perspectives.The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic.In the UNSW-NB15 dataset,the f1-score of encrypted traffic was approximately 0.98,which is 4.3%higher than that of unencrypted traffic(approximately 0.94).In addition,analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower f1-score of roughly 0.43,indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance.Furthermore,when data sampling techniques were applied to encrypted traffic,the recall in the UNSWNB15(Encrypted)dataset improved by up to 23.0%,and in the CICIoT-2023(Encrypted)dataset by 20.26%,showing a similar level of improvement.Notably,in CICIoT-2023,f1-score and Receiver Operation Characteristic-Area Under the Curve(ROC-AUC)increased by 59.0%and 55.94%,respectively.These results suggest that data sampling can have a positive effect even in encrypted environments.However,the extent of the improvement may vary depending on data quality,model architecture,and sampling strategy.展开更多
In the era of digital intelligence,data is a key element in promoting social and economic development.Educational data,as a vital component of data,not only supports teaching and learning but also contains much sensit...In the era of digital intelligence,data is a key element in promoting social and economic development.Educational data,as a vital component of data,not only supports teaching and learning but also contains much sensitive information.How to effectively categorize and protect sensitive data has become an urgent issue in educational data security.This paper systematically researches and constructs a multi-dimensional classification framework for sensitive educational data,and discusses its security protection strategy from the aspects of identification and desensitization,aiming to provide new ideas for the security management of sensitive educational data and to help the construction of an educational data security ecosystem in the era of digital intelligence.展开更多
Data space,as an innovative data management and sharing model,is emerging in the medical and health sectors.This study expounds on the conceptual connotation of data space and delineates its key technologies,including...Data space,as an innovative data management and sharing model,is emerging in the medical and health sectors.This study expounds on the conceptual connotation of data space and delineates its key technologies,including distributed data storage,standardization and interoperability of data sharing,data security and privacy protection,data analysis and mining,and data space assessment.By analyzing the real-world cases of data spaces within medicine and health,this study compares the similarities and differences across various dimensions such as purpose,architecture,data interoperability,and privacy protection.Meanwhile,data spaces in these fields are challenged by the limited computing resources,the complexities of data integration,and the need for optimized algorithms.Additionally,legal and ethical issues such as unclear data ownership,undefined usage rights,risks associated with privacy protection need to be addressed.The study notes organizational and management difficulties,calling for enhancements in governance framework,data sharing mechanisms,and value assessment systems.In the future,technological innovation,sound regulations,and optimized management will help the development of the medical and health data space.These developments will enable the secure and efficient utilization of data,propelling the medical industry into an era characterized by precision,intelligence,and personalization.展开更多
Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic...Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.展开更多
1.Introduction Data inference(DInf)is a data security threat in which critical information is inferred from low-sensitivity data.Once regarded as an advanced professional threat limited to intelligence analysts,DInf h...1.Introduction Data inference(DInf)is a data security threat in which critical information is inferred from low-sensitivity data.Once regarded as an advanced professional threat limited to intelligence analysts,DInf has become a widespread risk in the artificial intelligence(AI)era.展开更多
Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods...Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods,based on reliable existing data stored in project management tools’datasets,automating this evaluation process becomes a natural step forward.In this context,our approach focuses on quantifying software developer expertise by using metadata from the task-tracking systems.For this,we mathematically formalize two categories of expertise:technology-specific expertise,which denotes the skills required for a particular technology,and general expertise,which encapsulates overall knowledge in the software industry.Afterward,we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers(BERT)-like transformers to handle the unique characteristics of project tool datasets effectively.Finally,our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives.The method was experimentally validated,yielding promising results.展开更多
Sign language dataset is essential in sign language recognition and translation(SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements for...Sign language dataset is essential in sign language recognition and translation(SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements for SLRT. However, making a large-scale and diverse sign language dataset is difficult as sign language data on the Internet is scarce. In making a large-scale and diverse sign language dataset, some sign language data qualities are not up to standard. This paper proposes a two information streams transformer(TIST) model to judge whether the quality of sign language data is qualified. To verify that TIST effectively improves sign language recognition(SLR), we make two datasets, the screened dataset and the unscreened dataset. In this experiment, this paper uses visual alignment constraint(VAC) as the baseline model. The experimental results show that the screened dataset can achieve better word error rate(WER) than the unscreened dataset.展开更多
The Intelligent Internet of Things(IIoT)involves real-world things that communicate or interact with each other through networking technologies by collecting data from these“things”and using intelligent approaches,s...The Intelligent Internet of Things(IIoT)involves real-world things that communicate or interact with each other through networking technologies by collecting data from these“things”and using intelligent approaches,such as Artificial Intelligence(AI)and machine learning,to make accurate decisions.Data science is the science of dealing with data and its relationships through intelligent approaches.Most state-of-the-art research focuses independently on either data science or IIoT,rather than exploring their integration.Therefore,to address the gap,this article provides a comprehensive survey on the advances and integration of data science with the Intelligent IoT(IIoT)system by classifying the existing IoT-based data science techniques and presenting a summary of various characteristics.The paper analyzes the data science or big data security and privacy features,including network architecture,data protection,and continuous monitoring of data,which face challenges in various IoT-based systems.Extensive insights into IoT data security,privacy,and challenges are visualized in the context of data science for IoT.In addition,this study reveals the current opportunities to enhance data science and IoT market development.The current gap and challenges faced in the integration of data science and IoT are comprehensively presented,followed by the future outlook and possible solutions.展开更多
On October 18,2017,the 19th National Congress Report called for the implementation of the Healthy China Strategy.The development of biomedical data plays a pivotal role in advancing this strategy.Since the 18th Nation...On October 18,2017,the 19th National Congress Report called for the implementation of the Healthy China Strategy.The development of biomedical data plays a pivotal role in advancing this strategy.Since the 18th National Congress of the Communist Party of China,China has vigorously promoted the integration and implementation of the Healthy China and Digital China strategies.The National Health Commission has prioritized the development of health and medical big data,issuing policies to promote standardized applica-tions and foster innovation in"Internet+Healthcare."Biomedical data has significantly contributed to preci-sion medicine,personalized health management,drug development,disease diagnosis,public health monitor-ing,and epidemic prediction capabilities.展开更多
The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution,primarily by minimizing the number of scanned block files.Existing p...The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution,primarily by minimizing the number of scanned block files.Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions,often overlooking non-numeric columns and thus limiting optimization potential.Additionally,these techniques,despite creating fine-grained partitions from representative queries to enhance system performance,experience from notable performance declines due to unpredictable fluctuations in future queries.To tackle these issues,we introduce LRP,a learned robust partitioning system for dynamic query processing.LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries.It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries.To create high-quality,robust partitions based on these predictions,LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions.Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.展开更多
With the advent of the big data era,real-time data analysis and decision-support systems have been recognized as essential tools for enhancing enterprise competitiveness and optimizing the decision-making process.This...With the advent of the big data era,real-time data analysis and decision-support systems have been recognized as essential tools for enhancing enterprise competitiveness and optimizing the decision-making process.This study aims to explore the development strategies of real-time data analysis and decision-support systems,and analyze their application status and future development trends in various industries.The article first reviews the basic concepts and importance of real-time data analysis and decision-support systems,and then discusses in detail the key technical aspects such as system architecture,data collection and processing,analysis methods,and visualization techniques.展开更多
With the rapid development of information technology,data security issues have received increasing attention.Data encryption and decryption technology,as a key means of ensuring data security,plays an important role i...With the rapid development of information technology,data security issues have received increasing attention.Data encryption and decryption technology,as a key means of ensuring data security,plays an important role in multiple fields such as communication security,data storage,and data recovery.This article explores the fundamental principles and interrelationships of data encryption and decryption,examines the strengths,weaknesses,and applicability of symmetric,asymmetric,and hybrid encryption algorithms,and introduces key application scenarios for data encryption and decryption technology.It examines the challenges and corresponding countermeasures related to encryption algorithm security,key management,and encryption-decryption performance.Finally,it analyzes the development trends and future prospects of data encryption and decryption technology.This article provides a systematic understanding of data encryption and decryption techniques,which has good reference value for software designers.展开更多
基金The work described in this paper was fully supported by a grant from Hong Kong Metropolitan University(RIF/2021/05).
文摘Parkinson’s disease(PD)is a debilitating neurological disorder affecting over 10 million people worldwide.PD classification models using voice signals as input are common in the literature.It is believed that using deep learning algorithms further enhances performance;nevertheless,it is challenging due to the nature of small-scale and imbalanced PD datasets.This paper proposed a convolutional neural network-based deep support vector machine(CNN-DSVM)to automate the feature extraction process using CNN and extend the conventional SVM to a DSVM for better classification performance in small-scale PD datasets.A customized kernel function reduces the impact of biased classification towards the majority class(healthy candidates in our consideration).An improved generative adversarial network(IGAN)was designed to generate additional training data to enhance the model’s performance.For performance evaluation,the proposed algorithm achieves a sensitivity of 97.6%and a specificity of 97.3%.The performance comparison is evaluated from five perspectives,including comparisons with different data generation algorithms,feature extraction techniques,kernel functions,and existing works.Results reveal the effectiveness of the IGAN algorithm,which improves the sensitivity and specificity by 4.05%–4.72%and 4.96%–5.86%,respectively;and the effectiveness of the CNN-DSVM algorithm,which improves the sensitivity by 1.24%–57.4%and specificity by 1.04%–163%and reduces biased detection towards the majority class.The ablation experiments confirm the effectiveness of individual components.Two future research directions have also been suggested.
基金supported by National Key Research and Development Program of China(Nos.2022YFA1602404 and 2023YFA1606901)the National Natural Science Foundation of China(Nos.12275338,12388102,and U2441221)the Key Laboratory of Nuclear Data Foundation(JCKY2022201C152).
文摘Photonuclear data are increasingly used in fundamental nuclear research and technological applications.These data are generated using advanced γ-ray sources.The Shanghai laser electron gamma source(SLEGS)is a new laser Compton scattering γ-ray source at the Shanghai Synchrotron Radiation Facility.It delivers energy-tunable,quasi-monoenergetic gamma beams for high-precision photonuclear measurements.This paper presents the flat-efficiency detector(FED)array at SLEGS and its application in photoneutron cross-section measurements.Systematic uncertainties of the FED array were determined to be 3.02%through calibration with a ^(252)Cf neutron source.Using ^(197)Au and ^(159)Tb as representative nuclei,we demonstrate the format and processing methodology for raw photoneutron data.The results validate SLEGS’capability for high-precision photoneutron measurements.
文摘This article presents views on the future development of data science,with a particular focus on its importance to artificial intel-ligence(AI).After discussing the challenges of data science,it elu-cidates a possible approach to tackle these challenges by clarifying the logic and principles of data related to the multi-level complex-ity of the world.Finally,urgently required actions are briefly outlined.
文摘National Population Health Data Center(NPHDC)is one of China's 20 national-level science data centers,jointly designated by the Ministry of Science and Technology and the Ministry of Finance.Operated by the Chinese Academy of Medical Sciences under the oversight of the National Health Commission,NPHDC adheres to national regulations including the Scientific Data Management Measures and the National Science and Technology Infrastructure Service Platform Management Measures,and is committed to collecting,integrating,managing,and sharing biomedical and health data through openaccess platform,fostering open sharing and engaging in international cooperation.
基金supported by the National Natural Science Foundation of China(32370703)the CAMS Innovation Fund for Medical Sciences(CIFMS)(2022-I2M-1-021,2021-I2M-1-061)the Major Project of Guangzhou National Labora-tory(GZNL2024A01015).
文摘Viral infectious diseases,characterized by their intricate nature and wide-ranging diversity,pose substantial challenges in the domain of data management.The vast volume of data generated by these diseases,spanning from the molecular mechanisms within cells to large-scale epidemiological patterns,has surpassed the capabilities of traditional analytical methods.In the era of artificial intelligence(AI)and big data,there is an urgent necessity for the optimization of these analytical methods to more effectively handle and utilize the information.Despite the rapid accumulation of data associated with viral infections,the lack of a comprehensive framework for integrating,selecting,and analyzing these datasets has left numerous researchers uncertain about which data to select,how to access it,and how to utilize it most effectively in their research.This review endeavors to fill these gaps by exploring the multifaceted nature of viral infectious diseases and summarizing relevant data across multiple levels,from the molecular details of pathogens to broad epidemiological trends.The scope extends from the micro-scale to the macro-scale,encompassing pathogens,hosts,and vectors.In addition to data summarization,this review thoroughly investigates various dataset sources.It also traces the historical evolution of data collection in the field of viral infectious diseases,highlighting the progress achieved over time.Simultaneously,it evaluates the current limitations that impede data utilization.Furthermore,we propose strategies to surmount these challenges,focusing on the development and application of advanced computational techniques,AI-driven models,and enhanced data integration practices.By providing a comprehensive synthesis of existing knowledge,this review is designed to guide future research and contribute to more informed approaches in the surveillance,prevention,and control of viral infectious diseases,particularly within the context of the expanding big-data landscape.
文摘Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision.The proposed loss combines(i)a guided,masked mean squared error focusing on missing entries;(ii)a noise-aware regularization term to improve resilience against data corruption;and(iii)a variance penalty to encourage expressive yet stable reconstructions.We evaluate the proposed model across four missingness mechanisms,such as Missing Completely at Random,Missing at Random,Missing Not at Random,and Missing Not at Random with quantile censorship,under systematically varied feature counts,sample sizes,and missingness ratios ranging from 5%to 60%.Four publicly available real-world datasets(Stroke Prediction,Pima Indians Diabetes,Cardiovascular Disease,and Framingham Heart Study)were used,and the obtained results show that our proposed model consistently outperforms baseline methods,including traditional and deep learning-based techniques.An ablation study reveals the additive value of each component in the loss function.Additionally,we assessed the downstream utility of imputed data through classification tasks,where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios.The model demonstrates strong scalability and robustness,improving performance with larger datasets and higher feature counts.These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations,making it a promising solution for robust data recovery in clinical applications.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R104)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively.This study introduces an advanced,explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets,which reflects real-world network behavior through a blend of normal and diverse attack classes.The methodology begins with sophisticated data preprocessing,incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions,ensuring standardized and model-ready inputs.Critical dimensionality reduction is achieved via the Harris Hawks Optimization(HHO)algorithm—a nature-inspired metaheuristic modeled on hawks’hunting strategies.HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance.Following feature selection,the SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types.The stacked architecture is then employed,combining the strengths of XGBoost,SVM,and RF as base learners.This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers.The model was evaluated using standard classification metrics:precision,recall,F1-score,and overall accuracy.The best overall performance was recorded with an accuracy of 99.44%for UNSW-NB15,demonstrating the model’s effectiveness.After balancing,the model demonstrated a clear improvement in detecting the attacks.We tested the model on four datasets to show the effectiveness of the proposed approach and performed the ablation study to check the effect of each parameter.Also,the proposed model is computationaly efficient.To support transparency and trust in decision-making,explainable AI(XAI)techniques are incorporated that provides both global and local insight into feature contributions,and offers intuitive visualizations for individual predictions.This makes it suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
基金funded by University of Transport and Communications(UTC)under grant number T2025-CN-004.
文摘Reversible data hiding(RDH)enables secret data embedding while preserving complete cover image recovery,making it crucial for applications requiring image integrity.The pixel value ordering(PVO)technique used in multi-stego images provides good image quality but often results in low embedding capability.To address these challenges,this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image.The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order.Four secret bits are embedded into each block’s maximum pixel value,while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold.A similar embedding strategy is also applied to the minimum side of the block,including the second-smallest pixel value.This design enables each block to embed up to 14 bits of secret data.Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches,advancing the field of reversible steganography.
基金supported by the Institute of Information&Communications Technology Planning&Evaluation(IITP)grant funded by the Korea government(MSIT)(No.RS-2023-00235509Development of security monitoring technology based network behavior against encrypted cyber threats in ICT convergence environment).
文摘With the increasing emphasis on personal information protection,encryption through security protocols has emerged as a critical requirement in data transmission and reception processes.Nevertheless,IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices,spanning a range of devices from non-encrypted ones to fully encrypted ones.Given the limited visibility into payloads in this context,this study investigates AI-based attack detection methods that leverage encrypted traffic metadata,eliminating the need for decryption and minimizing system performance degradation—especially in light of these heterogeneous devices.Using the UNSW-NB15 and CICIoT-2023 dataset,encrypted and unencrypted traffic were categorized according to security protocol,and AI-based intrusion detection experiments were conducted for each traffic type based on metadata.To mitigate the problem of class imbalance,eight different data sampling techniques were applied.The effectiveness of these sampling techniques was then comparatively analyzed using two ensemble models and three Deep Learning(DL)models from various perspectives.The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic.In the UNSW-NB15 dataset,the f1-score of encrypted traffic was approximately 0.98,which is 4.3%higher than that of unencrypted traffic(approximately 0.94).In addition,analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower f1-score of roughly 0.43,indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance.Furthermore,when data sampling techniques were applied to encrypted traffic,the recall in the UNSWNB15(Encrypted)dataset improved by up to 23.0%,and in the CICIoT-2023(Encrypted)dataset by 20.26%,showing a similar level of improvement.Notably,in CICIoT-2023,f1-score and Receiver Operation Characteristic-Area Under the Curve(ROC-AUC)increased by 59.0%and 55.94%,respectively.These results suggest that data sampling can have a positive effect even in encrypted environments.However,the extent of the improvement may vary depending on data quality,model architecture,and sampling strategy.
基金Education Science planning project of Jiangsu Province in 2024(Grant No:B-b/2024/01/152)2025 Jiangsu Normal University Graduate Research and Innovation Program school-level project“Research on the Construction and Desensitization Strategies of Education Sensitive Data Classification from the Perspective of Educational Ecology”。
文摘In the era of digital intelligence,data is a key element in promoting social and economic development.Educational data,as a vital component of data,not only supports teaching and learning but also contains much sensitive information.How to effectively categorize and protect sensitive data has become an urgent issue in educational data security.This paper systematically researches and constructs a multi-dimensional classification framework for sensitive educational data,and discusses its security protection strategy from the aspects of identification and desensitization,aiming to provide new ideas for the security management of sensitive educational data and to help the construction of an educational data security ecosystem in the era of digital intelligence.
文摘Data space,as an innovative data management and sharing model,is emerging in the medical and health sectors.This study expounds on the conceptual connotation of data space and delineates its key technologies,including distributed data storage,standardization and interoperability of data sharing,data security and privacy protection,data analysis and mining,and data space assessment.By analyzing the real-world cases of data spaces within medicine and health,this study compares the similarities and differences across various dimensions such as purpose,architecture,data interoperability,and privacy protection.Meanwhile,data spaces in these fields are challenged by the limited computing resources,the complexities of data integration,and the need for optimized algorithms.Additionally,legal and ethical issues such as unclear data ownership,undefined usage rights,risks associated with privacy protection need to be addressed.The study notes organizational and management difficulties,calling for enhancements in governance framework,data sharing mechanisms,and value assessment systems.In the future,technological innovation,sound regulations,and optimized management will help the development of the medical and health data space.These developments will enable the secure and efficient utilization of data,propelling the medical industry into an era characterized by precision,intelligence,and personalization.
基金funded by Deanship of Graduate studies and Scientific Research at Jouf University under grant No.(DGSSR-2024-02-01264).
文摘Automated essay scoring(AES)systems have gained significant importance in educational settings,offering a scalable,efficient,and objective method for evaluating student essays.However,developing AES systems for Arabic poses distinct challenges due to the language’s complex morphology,diglossia,and the scarcity of annotated datasets.This paper presents a hybrid approach to Arabic AES by combining text-based,vector-based,and embeddingbased similarity measures to improve essay scoring accuracy while minimizing the training data required.Using a large Arabic essay dataset categorized into thematic groups,the study conducted four experiments to evaluate the impact of feature selection,data size,and model performance.Experiment 1 established a baseline using a non-machine learning approach,selecting top-N correlated features to predict essay scores.The subsequent experiments employed 5-fold cross-validation.Experiment 2 showed that combining embedding-based,text-based,and vector-based features in a Random Forest(RF)model achieved an R2 of 88.92%and an accuracy of 83.3%within a 0.5-point tolerance.Experiment 3 further refined the feature selection process,demonstrating that 19 correlated features yielded optimal results,improving R2 to 88.95%.In Experiment 4,an optimal data efficiency training approach was introduced,where training data portions increased from 5%to 50%.The study found that using just 10%of the data achieved near-peak performance,with an R2 of 85.49%,emphasizing an effective trade-off between performance and computational costs.These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems,especially in low-resource environments,addressing linguistic challenges while ensuring efficient data usage.
基金supported by the National Key Research and Development Program of China(2022YFB2703503)the National Natural Science Foundation of China(62293501,62525210,and 62293502)the China Scholarship Council(202306280318).
文摘1.Introduction Data inference(DInf)is a data security threat in which critical information is inferred from low-sensitivity data.Once regarded as an advanced professional threat limited to intelligence analysts,DInf has become a widespread risk in the artificial intelligence(AI)era.
基金supported by the project“Romanian Hub for Artificial Intelligence-HRIA”,Smart Growth,Digitization and Financial Instruments Program,2021–2027,MySMIS No.334906.
文摘Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods,based on reliable existing data stored in project management tools’datasets,automating this evaluation process becomes a natural step forward.In this context,our approach focuses on quantifying software developer expertise by using metadata from the task-tracking systems.For this,we mathematically formalize two categories of expertise:technology-specific expertise,which denotes the skills required for a particular technology,and general expertise,which encapsulates overall knowledge in the software industry.Afterward,we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers(BERT)-like transformers to handle the unique characteristics of project tool datasets effectively.Finally,our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives.The method was experimentally validated,yielding promising results.
基金supported by the National Language Commission to research on sign language data specifications for artificial intelligence applications and test standards for language service translation systems (No.ZDI145-70)。
文摘Sign language dataset is essential in sign language recognition and translation(SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements for SLRT. However, making a large-scale and diverse sign language dataset is difficult as sign language data on the Internet is scarce. In making a large-scale and diverse sign language dataset, some sign language data qualities are not up to standard. This paper proposes a two information streams transformer(TIST) model to judge whether the quality of sign language data is qualified. To verify that TIST effectively improves sign language recognition(SLR), we make two datasets, the screened dataset and the unscreened dataset. In this experiment, this paper uses visual alignment constraint(VAC) as the baseline model. The experimental results show that the screened dataset can achieve better word error rate(WER) than the unscreened dataset.
基金supported in part by the National Natural Science Foundation of China under Grant 62371181in part by the Changzhou Science and Technology International Cooperation Program under Grant CZ20230029+1 种基金supported by a National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(2021R1A2B5B02087169)supported under the framework of international cooperation program managed by the National Research Foundation of Korea(2022K2A9A1A01098051)。
文摘The Intelligent Internet of Things(IIoT)involves real-world things that communicate or interact with each other through networking technologies by collecting data from these“things”and using intelligent approaches,such as Artificial Intelligence(AI)and machine learning,to make accurate decisions.Data science is the science of dealing with data and its relationships through intelligent approaches.Most state-of-the-art research focuses independently on either data science or IIoT,rather than exploring their integration.Therefore,to address the gap,this article provides a comprehensive survey on the advances and integration of data science with the Intelligent IoT(IIoT)system by classifying the existing IoT-based data science techniques and presenting a summary of various characteristics.The paper analyzes the data science or big data security and privacy features,including network architecture,data protection,and continuous monitoring of data,which face challenges in various IoT-based systems.Extensive insights into IoT data security,privacy,and challenges are visualized in the context of data science for IoT.In addition,this study reveals the current opportunities to enhance data science and IoT market development.The current gap and challenges faced in the integration of data science and IoT are comprehensively presented,followed by the future outlook and possible solutions.
文摘On October 18,2017,the 19th National Congress Report called for the implementation of the Healthy China Strategy.The development of biomedical data plays a pivotal role in advancing this strategy.Since the 18th National Congress of the Communist Party of China,China has vigorously promoted the integration and implementation of the Healthy China and Digital China strategies.The National Health Commission has prioritized the development of health and medical big data,issuing policies to promote standardized applica-tions and foster innovation in"Internet+Healthcare."Biomedical data has significantly contributed to preci-sion medicine,personalized health management,drug development,disease diagnosis,public health monitor-ing,and epidemic prediction capabilities.
基金supported by the National Key Research and Development Program of China(Grant No.2023YFB4503600)the National Natural Science Foundation of China(Grant Nos.U23A20299,62072460,62172424,62276270,and 62322214).
文摘The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution,primarily by minimizing the number of scanned block files.Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions,often overlooking non-numeric columns and thus limiting optimization potential.Additionally,these techniques,despite creating fine-grained partitions from representative queries to enhance system performance,experience from notable performance declines due to unpredictable fluctuations in future queries.To tackle these issues,we introduce LRP,a learned robust partitioning system for dynamic query processing.LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries.It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries.To create high-quality,robust partitions based on these predictions,LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions.Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
文摘With the advent of the big data era,real-time data analysis and decision-support systems have been recognized as essential tools for enhancing enterprise competitiveness and optimizing the decision-making process.This study aims to explore the development strategies of real-time data analysis and decision-support systems,and analyze their application status and future development trends in various industries.The article first reviews the basic concepts and importance of real-time data analysis and decision-support systems,and then discusses in detail the key technical aspects such as system architecture,data collection and processing,analysis methods,and visualization techniques.
文摘With the rapid development of information technology,data security issues have received increasing attention.Data encryption and decryption technology,as a key means of ensuring data security,plays an important role in multiple fields such as communication security,data storage,and data recovery.This article explores the fundamental principles and interrelationships of data encryption and decryption,examines the strengths,weaknesses,and applicability of symmetric,asymmetric,and hybrid encryption algorithms,and introduces key application scenarios for data encryption and decryption technology.It examines the challenges and corresponding countermeasures related to encryption algorithm security,key management,and encryption-decryption performance.Finally,it analyzes the development trends and future prospects of data encryption and decryption technology.This article provides a systematic understanding of data encryption and decryption techniques,which has good reference value for software designers.