With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception processes. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, spanning a range from non-encrypted to fully encrypted equipment. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted-traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed from various perspectives using two ensemble models and three Deep Learning (DL) models. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the F1-score for encrypted traffic was approximately 0.98, which is 4.3% higher than that for unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that dataset quality and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, recall on the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and on the CICIoT-2023 (Encrypted) dataset by 20.26%, a similar level of improvement. Notably, in CICIoT-2023, the F1-score and the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments; however, the extent of the improvement may vary with data quality, model architecture, and sampling strategy.
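The abstract names neither the eight sampling techniques nor the exact metadata features, so the following is a minimal sketch of the resample-then-classify pipeline only: SMOTE from imbalanced-learn stands in as one representative sampler, a random forest as one of the ensemble models, and synthetic numeric features replace the real flow metadata.

```python
# Hedged sketch: oversample the minority (attack) class, then train and
# score an ensemble model on held-out, still-imbalanced data.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for flow metadata: ~5% of samples are "attacks".
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split so evaluation stays realistic.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
pred = clf.predict(X_te)
print(f"F1: {f1_score(y_te, pred):.3f}  recall: {recall_score(y_te, pred):.3f}")
```

Comparing the same model with and without the `fit_resample` step reproduces the kind of recall gain the study reports, though the magnitude depends entirely on the data.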
The InSight mission has obtained seismic data from Mars, offering new insights into the planet's internal structure and seismic activity. However, the raw data released to the public contain various sources of noise, such as ticks and glitches, which hamper further seismological studies. This paper presents step-by-step processing of InSight's Very Broad Band seismic data, focusing on the suppression and removal of non-seismic noise. The processing stages include tick noise removal, glitch signal suppression, multicomponent synchronization, instrument response correction, and rotation of orthogonal components. The processed datasets and associated codes are openly accessible and will support ongoing efforts to explore the geophysical properties of Mars and contribute to the broader field of planetary seismology.
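Of the listed stages, instrument response correction and component rotation are routine ObsPy operations; a hedged sketch follows, with placeholder file names (the mission-specific tick and glitch removal steps are not reproduced here).

```python
# Generic ObsPy response removal and rotation; "insight_vbb.mseed" and
# "insight_vbb.xml" are hypothetical names for waveform data and the
# matching StationXML metadata.
from obspy import read, read_inventory

st = read("insight_vbb.mseed")
inv = read_inventory("insight_vbb.xml")

st.detrend("linear")
st.taper(max_percentage=0.05)
# Deconvolve the instrument response to obtain ground velocity.
st.remove_response(inventory=inv, output="VEL",
                   pre_filt=(0.005, 0.01, 8.0, 10.0))
# Rotate the as-recorded orientations to Z/N/E using angles in the inventory.
st.rotate(method="->ZNE", inventory=inv)
st.write("insight_vbb_corrected.mseed", format="MSEED")
```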
With the widespread application of Internet of Things (IoT) technology, the processing of massive real-time streaming data poses significant challenges to the computational and data-processing capabilities of systems. Although distributed streaming data processing frameworks such as Apache Flink and Apache Spark Streaming provide solutions, meeting stringent response time requirements while ensuring high throughput and resource utilization remains an urgent problem. To address this, the study proposes a formal modeling approach based on Performance Evaluation Process Algebra (PEPA), which abstracts the core components and interactions of cloud-based distributed streaming data processing systems. Additionally, a generic service flow generation algorithm is introduced, enabling the automatic extraction of service flows from the PEPA model and the computation of key performance metrics, including response time, throughput, and resource utilization. The novelty of this work lies in the integration of PEPA-based formal modeling with the service flow generation algorithm, bridging the gap between formal modeling and practical performance evaluation for IoT systems. Simulation experiments demonstrate that optimizing the execution efficiency of components can significantly improve system performance. For instance, increasing the task execution rate from 10 to 100 improves system performance by 9.53%, while further increasing it to 200 results in a 21.58% improvement. However, diminishing returns are observed when the execution rate reaches 500, with only a 0.42% gain. Similarly, increasing the number of TaskManagers from 10 to 20 improves response time by 18.49%, but the improvement slows to 6.06% when increasing from 20 to 50, highlighting the importance of co-optimizing component efficiency and resource management to achieve substantial performance gains. This study provides a systematic framework for analyzing and optimizing the performance of IoT systems for large-scale real-time streaming data processing. The proposed approach not only identifies performance bottlenecks but also offers insights into improving system efficiency under different configurations and workloads.
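PEPA tooling is specialized, but the diminishing-returns pattern reported above can be reproduced with an ordinary queueing approximation. The sketch below uses an M/M/c (Erlang-C) model relating service rate and worker count to mean response time; it is only an illustration of the metrics involved, not the paper's PEPA model, and the arrival rate is arbitrary.

```python
# Erlang-C back-of-envelope: response time vs. task execution rate (mu)
# for a fixed arrival rate and worker count. NOT the paper's model.
from math import factorial

def erlang_c(lam: float, mu: float, c: int) -> float:
    """Probability that an arriving task must wait in an M/M/c queue."""
    a = lam / mu                          # offered load
    rho = a / c                           # per-server utilization (< 1)
    s = sum(a**k / factorial(k) for k in range(c))
    top = a**c / (factorial(c) * (1 - rho))
    return top / (s + top)

def mean_response_time(lam: float, mu: float, c: int) -> float:
    return erlang_c(lam, mu, c) / (c * mu - lam) + 1.0 / mu  # wait + service

for mu in (10, 100, 200, 500):            # increasing task execution rate
    print(f"mu={mu:3d}  T={mean_response_time(lam=80, mu=mu, c=10):.5f}")
```

Raising mu from 10 to 100 cuts response time sharply, while moving from 200 to 500 barely changes it, mirroring the saturation behavior the simulations describe.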
During drilling operations, the low resolution of seismic data often limits the accurate characterization of small-scale geological bodies near the borehole and ahead of the drill bit. This study investigates high-resolution seismic data processing technologies and methods tailored for drilling scenarios. The high-resolution processing of seismic data is divided into three stages: pre-drilling processing, post-drilling correction, and while-drilling updating. By integrating seismic data from different stages, spatial ranges, and frequencies, together with information from drilled wells and while-drilling data, and applying artificial intelligence modeling techniques, a progressive high-resolution seismic data processing technology based on multi-source information fusion is developed, which performs simple and efficient seismic information updates during drilling. Case studies show that, with the gradual integration of multi-source information, the resolution and accuracy of seismic data are significantly improved, and thin-bed weak reflections are more clearly imaged. The seismic information updated while drilling demonstrates high value in predicting geological bodies ahead of the drill bit. Validation using logging, mud logging, and drilling engineering data ensures the fidelity of the high-resolution processing results. This provides clearer and more accurate stratigraphic information for drilling operations, enhancing both drilling safety and efficiency.
Three-dimensional (3D) single molecule localization microscopy (SMLM) plays an important role in biomedical applications, but its data processing is very complicated. Deep learning is a potential tool to solve this problem. As the state-of-the-art deep learning-based 3D super-resolution localization algorithm, the recently reported FD-DeepLoc algorithm still falls short of the goal of online image processing, even though it has greatly improved data-processing throughput. In this paper, a new algorithm, Lite-FD-DeepLoc, is developed on the basis of FD-DeepLoc to meet the online image processing requirements of 3D SMLM. The new algorithm uses feature compression to reduce the parameters of the model and combines it with pipeline programming to accelerate the inference process of the deep learning model. Simulated data processing results show that the image processing speed of Lite-FD-DeepLoc is about twice that of FD-DeepLoc with a slight decrease in localization accuracy, enabling real-time processing of 256×256-pixel images. Results on biological experimental data imply that Lite-FD-DeepLoc can successfully analyze data based on astigmatism and saddle-point engineering, and the global resolution of the reconstructed image is equivalent to or even better than that of the FD-DeepLoc algorithm.
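The pipelining idea, overlapping preprocessing with model inference, can be sketched with two threads and a bounded queue. Everything below is a generic stand-in (sleeps in place of real image work and a placeholder "model"); it illustrates the overlap only, not Lite-FD-DeepLoc's implementation.

```python
# Two-stage pipeline: stage 1 prepares frames while stage 2 runs inference
# on earlier frames, so the stages' latencies overlap instead of adding up.
import queue
import threading
import time

def preprocess(frames, q):
    for f in frames:
        time.sleep(0.01)              # stand-in for cropping/normalization
        q.put(f)
    q.put(None)                       # sentinel: no more frames

def infer(q, results):
    while (f := q.get()) is not None:
        time.sleep(0.01)              # stand-in for the network forward pass
        results.append(f * 2)         # placeholder "localization" output

q, results = queue.Queue(maxsize=8), []
workers = [threading.Thread(target=preprocess, args=(range(100), q)),
           threading.Thread(target=infer, args=(q, results))]
start = time.perf_counter()
for w in workers: w.start()
for w in workers: w.join()
print(f"{len(results)} frames in {time.perf_counter() - start:.2f}s")
```

With both stages at 10 ms per frame, the pipelined version finishes in roughly the time of one stage rather than the sum of both.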
The uniaxial compressive strength (UCS) of rocks is a vital geomechanical parameter widely used for rock mass classification, stability analysis, and engineering design in rock engineering. Various UCS testing methods and apparatuses have been proposed over the past few decades. The objective of the present study is to summarize the status of and developments in the theories, test apparatuses, and data processing of existing UCS testing methods. It starts by elaborating the theories of these test methods. The test apparatuses and development trends for UCS measurement are then summarized, followed by a discussion of rock specimens for the test apparatuses and of data processing methods. Next, recommendations are given for selecting a UCS measurement method. The review reveals that the rock failure mechanisms in UCS testing methods can be divided into compression-shear, compression-tension, composite failure mode, and no obvious failure mode. The trend of these apparatuses is towards automation, digitization, precision, and multi-modal testing. Two size correction methods are commonly used: one develops an empirical correlation between the measured indices and the specimen size; the other uses a standard specimen to calculate a size correction factor. Three to five input parameters are commonly utilized in soft computing models to predict the UCS of rocks. The test method for UCS measurement can be selected according to the testing scenario and the specimen size. Engineers can thus gain a comprehensive understanding of UCS testing methods and their potential developments in various rock engineering endeavors.
Open networks and heterogeneous services in the Internet of Vehicles (IoV) can lead to security and privacy challenges. One key requirement for such systems is the preservation of user privacy, ensuring a seamless experience in driving, navigation, and communication. These privacy needs are influenced by various factors, such as data collected at different intervals, trip durations, and user interactions. To address this, the paper proposes a Support Vector Machine (SVM) model designed to process large amounts of aggregated data and recommend privacy-preserving measures. The model analyzes data based on user demands and interactions with service providers or neighboring infrastructure. It aims to minimize privacy risks while ensuring service continuity and sustainability. The SVM model helps validate the system's reliability by creating a hyperplane that distinguishes between maximum and minimum privacy recommendations. The results demonstrate the effectiveness of the proposed SVM model in enhancing both privacy and service performance.
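As a rough picture of the hyperplane idea, the sketch below fits a scikit-learn SVM to synthetic two-class data standing in for "maximum" versus "minimum" privacy recommendations; the features and labels are invented for illustration only.

```python
# Fit an SVM whose decision boundary separates two recommendation classes;
# the synthetic features stand in for aggregated IoV interaction data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5_000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.3f}")
```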
This study examines the Big Data Collection and Preprocessing course at Anhui Institute of Information Engineering, implementing a hybrid teaching reform using the Bosi Smart Learning Platform. The proposed hybrid model follows a "three-stage" and "two-subject" framework, incorporating a structured design for teaching content and assessment methods before, during, and after class. Practical results indicate that this approach significantly enhances teaching effectiveness and improves students' learning autonomy.
Previous studies aiming to accelerate data processing have focused on enhancement algorithms, using the graphics processing unit (GPU) to speed up programs, and thread-level parallelism. These methods overlook maximizing the utilization of existing central processing unit (CPU) resources and reducing human and computational time costs via process automation. Accordingly, this paper proposes a scheme, called SSM, that combines the "Srun job submission mode", the "Sbatch job submission mode", and a "Monitor function". The SSM scheme includes three main modules: data management, command management, and resource management. Its core innovations are command splitting and parallel execution. The results show that this method effectively improves CPU utilization and reduces the time required for data processing. In terms of CPU utilization, the average value of this scheme is 89%. In contrast, the average CPU utilizations of the "Srun job submission mode" and the "Sbatch job submission mode" are significantly lower, at 43% and 52%, respectively. In terms of data-processing time, SSM testing on Five-hundred-meter Aperture Spherical radio Telescope (FAST) data requires only 5.5 h, compared with 8 h in the "Srun job submission mode" and 14 h in the "Sbatch job submission mode". In addition, tests on the FAST and Parkes datasets demonstrate the universality of the SSM scheme, which can process data from different telescopes. The compatibility of the SSM scheme with pulsar searches is verified using 2 days of observational data from the globular cluster M2, with the scheme successfully discovering all published pulsars in M2.
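The core mechanism, splitting one long job into chunks, launching them concurrently under Slurm, and polling for completion, can be sketched as follows. The chunked tool name is hypothetical, only standard `srun` flags are used, and this is an illustration of the idea rather than the SSM code itself.

```python
# Split a processing job into chunks, run each via srun in parallel, and
# poll until all finish (a toy version of SSM's "Monitor function").
import shlex
import subprocess
import time

chunks = [f"process_chunk --part {i}" for i in range(8)]  # hypothetical tool
procs = [subprocess.Popen(shlex.split(f"srun -n1 {cmd}")) for cmd in chunks]

while any(p.poll() is None for p in procs):               # monitor loop
    time.sleep(5)
print("all chunks finished, return codes:", [p.returncode for p in procs])
```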
The interconnection between query processing and data partitioning is pivotal for accelerating massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses to numeric columns when constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, although these techniques create fine-grained partitions from representative queries to enhance system performance, they suffer notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has become a crucial resource in astronomical research, offering a vast amount of spectral data for stars, galaxies, and quasars. This paper presents the data processing methods used by LAMOST, focusing on the classification and redshift measurement of large spectral data sets through template matching, as well as the creation of data products. Additionally, this paper details the construction of the Multiple Epoch Catalogs by integrating LAMOST spectral data with photometric data from Gaia and Pan-STARRS, and explains the creation of both low- and medium-resolution data products.
Large language models (LLMs) and natural language processing (NLP) hold significant promise for improving efficiency and refining healthcare decision-making and clinical results. Numerous domains, including healthcare, are rapidly adopting LLMs for the classification of biomedical textual data in medical research. LLMs can derive insights from intricate, extensive, unstructured training data. Variants need to be accurately identified and classified to advance genetic research, provide individualized treatment, and assist physicians in making better choices. However, the sophisticated and perplexing language of medical reports is often beyond the capabilities of the tools currently in use, which may result in incorrect diagnoses that affect a patient's prognosis and course of therapy. This study evaluated the efficacy of the proposed model on publicly accessible textual clinical data. We cleaned the clinical text using various preprocessing methods, including stemming, tokenization, and stop word removal. The important features are extracted using Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) feature engineering methods. The main motivation of this study is to predict genetic variants from clinical evidence using a novel method with minimal error. According to the experimental results, the random forest model achieved 61% accuracy with 67% precision for class 9 using TF-IDF features, and 63% accuracy and a 73% F1 score for class 9 using Bag of Words features. The accuracy of the proposed BERT (Bidirectional Encoder Representations from Transformers) model was 70% with 5-fold cross-validation and 71% with 10-fold cross-validation. The research results provide a comprehensive overview of current LLM methods in healthcare, benefiting academics as well as professionals in the discipline.
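The TF-IDF-plus-random-forest baseline maps directly onto a short scikit-learn pipeline. The clinical snippets and class labels below are invented placeholders, not data from the study.

```python
# Hedged sketch of the TF-IDF + random forest variant classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = ["missense variant observed in exon 11 ...",      # placeholder
         "frameshift deletion with loss of function ...", # placeholder
         "benign polymorphism, no clinical impact ..."]   # placeholder
labels = [1, 4, 9]                                        # placeholder classes

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
model.fit(texts, labels)
print(model.predict(["novel truncating mutation, pathogenic ..."]))
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the Bag of Words variant the abstract compares against.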
Cyber-Physical Systems (CPS) supported by Wireless Sensor Networks (WSN) help factories collect data and achieve seamless communication between physical and virtual components. Sensor nodes are energy-constrained devices, and their energy consumption is typically correlated with the amount of data collected. The purpose of data aggregation is to reduce data transmission, lower energy consumption, and reduce network congestion. For large-scale WSNs, data aggregation can greatly improve network efficiency. However, when large volumes of heterogeneous data pour into a specific area at the same time, data loss sometimes occurs, resulting in incomplete and irregular production data. This paper proposes an information processing model that encompasses the Energy-Conserving Data Aggregation Algorithm (ECDA) and the Efficient Message Reception Algorithm (EMRA). ECDA is divided into two stages: energy conservation based on global cost, and data aggregation based on ant colony optimization. EMRA comprises the Polling Message Reception Algorithm (PMRA), the Shortest Time Message Reception Algorithm (STMRA), and the Specific Condition Message Reception Algorithm (SCMRA). These algorithms not only accommodate the regularity and directionality of sensor information transmission but also satisfy the varied requirements of small factory environments. Compared with the recent HPSO-ILEACH and E-PEGASIS, ECDA can effectively reduce energy consumption. Experimental results show that STMRA consumes 1.3 times the time of SCMRA, and both optimization algorithms exhibit higher time efficiency than PMRA. Furthermore, this paper also evaluates these three algorithms using the Analytic Hierarchy Process (AHP).
Objective expertise evaluation of individuals, as a prerequisite stage for team formation, has been a long-term desideratum in large software development companies. With the rapid advancements in machine learning methods, and based on reliable existing data stored in project management tools' datasets, automating this evaluation process becomes a natural step forward. In this context, our approach focuses on quantifying software developer expertise using metadata from task-tracking systems. For this, we mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge of the software industry. Afterward, we automatically classify the zones of expertise associated with each task a developer has worked on, using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.
The big data generated by tunnel boring machines (TBMs) are widely used to reveal complex rock-machine interactions via machine learning (ML) algorithms. Data preprocessing plays a crucial role in improving ML accuracy. To this end, a TBM big data preprocessing method for ML was proposed in the present study. It emphasizes the accurate division of the TBM tunneling cycle and an optimized method of feature extraction. Based on data collected from a TBM water conveyance tunnel in China, its effectiveness was demonstrated by application to predicting TBM performance. First, the Score-Kneedle (S-K) method was proposed to divide a TBM tunneling cycle into five phases. Applied to 500 TBM tunneling cycles, the S-K method accurately divided all five phases in 458 cycles (accuracy of 91.6%), which is superior to the conventional duration division method (accuracy of 74.2%). Additionally, the S-K method accurately divided the stable phase in 493 cycles (accuracy of 98.6%), which is superior to two state-of-the-art division methods, namely the histogram discriminant method (accuracy of 94.6%) and the cumulative sum change point detection method (accuracy of 92.8%). Second, features were extracted from the divided phases. Specifically, TBM tunneling resistances were extracted from the free rotating phase and the free advancing phase; these resistances were subtracted from the total forces to represent the true rock-fragmentation forces. The secant slope and the mean value were extracted as features of the increasing phase and the stable phase, respectively. Finally, an ML model integrating a deep neural network and a genetic algorithm (GA-DNN) was established to learn the preprocessed data. The GA-DNN used six secant slope features extracted from the increasing phase to predict the mean field penetration index (FPI) and torque penetration index (TPI) in the stable phase, guiding TBM drivers to make better decisions in advance. The results indicate that the proposed TBM big data preprocessing method can improve prediction accuracy significantly (improving the R2 values of TPI and FPI on the test dataset from 0.7716 to 0.9178 and from 0.7479 to 0.8842, respectively).
1 Introduction Dissecting the dynamics of cell states is crucial for understanding various biological processes, such as tissue development and tumor drug responses. Recent advancements in single-cell lineage tracing (scLT) technologies provide effective ways to track single-cell lineages through heritable cellular barcodes, while simultaneously detecting the molecular states of cells by sequencing [1].
Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component in the loss function. Additionally, we assessed the downstream utility of the imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce imputations that are not only numerically accurate but also semantically useful, making it a promising solution for robust data recovery in clinical applications.
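The three loss components translate naturally into a few tensor operations. The PyTorch sketch below is one plausible reading of the composite loss; the weights `alpha` and `beta` and the exact forms of the noise and variance terms are assumptions, since the abstract does not specify them.

```python
# Composite imputation loss: (i) masked MSE over missing cells,
# (ii) a noise-consistency term, (iii) a variance-matching penalty.
# Weightings and term shapes are assumptions, not the paper's values.
import torch

def composite_loss(x_hat, x_true, miss_mask, x_hat_noisy,
                   alpha: float = 0.1, beta: float = 0.01):
    # (i) guided, masked MSE: penalize only the originally missing entries
    mse = ((x_hat - x_true) ** 2 * miss_mask).sum() / miss_mask.sum().clamp(min=1)
    # (ii) noise-aware term: output should be stable under input corruption
    noise_reg = torch.mean((x_hat - x_hat_noisy) ** 2)
    # (iii) variance penalty: keep per-feature variance close to the data's
    var_pen = torch.mean((x_hat.var(dim=0) - x_true.var(dim=0)) ** 2)
    return mse + alpha * noise_reg + beta * var_pen

x_true = torch.randn(64, 10)
miss_mask = (torch.rand(64, 10) < 0.3).float()    # 1 where a value was missing
x_hat = torch.randn(64, 10, requires_grad=True)   # stand-in for decoder output
x_hat_noisy = x_hat + 0.05 * torch.randn(64, 10)  # stand-in for noisy-input pass
composite_loss(x_hat, x_true, miss_mask, x_hat_noisy).backward()
```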
Modern intrusion detection systems (IDS) face persistent challenges in coping with the rapid evolution of cyber threats, high-volume network traffic, and imbalanced datasets. Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively. This study introduces an advanced, explainable machine learning framework for multi-class intrusion detection using the KDD99 and IDS datasets, which reflect real-world network behavior through a blend of normal and diverse attack classes. The methodology begins with sophisticated data preprocessing, incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions, ensuring standardized and model-ready inputs. Critical dimensionality reduction is achieved via the Harris Hawks Optimization (HHO) algorithm, a nature-inspired metaheuristic modeled on hawks' hunting strategies. HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance. Following feature selection, SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types. A stacked architecture is then employed, combining the strengths of XGBoost, SVM, and RF as base learners. This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers. The model was evaluated using standard classification metrics: precision, recall, F1-score, and overall accuracy. The best overall performance was recorded with an accuracy of 99.44% for UNSW-NB15, demonstrating the model's effectiveness. After balancing, the model demonstrated a clear improvement in detecting attacks. We tested the model on four datasets to show the effectiveness of the proposed approach and performed an ablation study to check the effect of each parameter. The proposed model is also computationally efficient. To support transparency and trust in decision-making, explainable AI (XAI) techniques are incorporated that provide both global and local insight into feature contributions and offer intuitive visualizations for individual predictions. This makes the framework suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
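The stacked architecture can be prototyped with scikit-learn's StackingClassifier. In the sketch below, GradientBoostingClassifier stands in for XGBoost so the example needs only scikit-learn (swap in `xgboost.XGBClassifier` where available); the data are synthetic, and the HHO feature selection and SMOTE steps are omitted.

```python
# Stacking sketch: RF + SVM + gradient boosting base learners, logistic
# regression meta-learner trained on out-of-fold base predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=5_000, n_features=25, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", SVC(probability=True, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.3f}")
```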
Reversible data hiding (RDH) enables secret data embedding while preserving complete cover image recovery, making it crucial for applications requiring image integrity. The pixel value ordering (PVO) technique used with multiple stego images provides good image quality but often results in low embedding capacity. To address these challenges, this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image. The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order. Four secret bits are embedded into each block's maximum pixel value, while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold. A similar embedding strategy is also applied to the minimum side of the block, including the second-smallest pixel value. This design enables each block to embed up to 14 bits of secret data. Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches, advancing the field of reversible steganography.
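For orientation, the sketch below shows the classic single-image PVO step the scheme builds on: one bit embedded into a block's maximum pixel via its prediction error. The paper's triple-stego, up-to-14-bit extension with thresholds is not reproduced here.

```python
# Classic PVO on the maximum side of one block (an illustration of the
# base technique only, not the proposed multi-bit triple-stego scheme).
def pvo_embed_max(block: list[int], bit: int) -> list[int]:
    """Reversibly embed one bit into the block's largest pixel."""
    order = sorted(range(len(block)), key=lambda i: block[i])
    i_max, i_2nd = order[-1], order[-2]
    d = block[i_max] - block[i_2nd]   # prediction error of the maximum
    out = block[:]
    if d == 1:
        out[i_max] += bit             # embeddable bin: carries the bit
    elif d > 1:
        out[i_max] += 1               # shifted bin: carries no data
    return out                        # d == 0 is left unchanged

print(pvo_embed_max([52, 57, 58, 59], bit=1))  # -> [52, 57, 58, 60]
```

Extraction inverts the mapping: a stego prediction error of 1 or 2 decodes a bit of 0 or 1, and errors above 2 are shifted back down by one.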
Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES that combines text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine-learning approach, selecting the top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R2 of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R2 to 88.95%. In Experiment 4, a data-efficient training approach was introduced, in which the training data portion increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R2 of 85.49%, highlighting an effective trade-off between performance and computational cost. These findings demonstrate the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
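Two of the three similarity families can be illustrated without a trained embedding model: a text-based measure (character-level ratio) and a vector-based measure (TF-IDF cosine). The sketch below uses English placeholder strings in place of Arabic essays; the embedding-based measure would substitute sentence embeddings for the TF-IDF vectors.

```python
# Two hybrid-feature examples: sequence-ratio (text-based) and TF-IDF
# cosine (vector-based) similarity against a reference essay.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "Renewable energy is a clean and sustainable power source."
student = "Clean renewable sources provide sustainable energy and power."

text_sim = SequenceMatcher(None, reference, student).ratio()
tfidf = TfidfVectorizer().fit_transform([reference, student])
vector_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"text-based: {text_sim:.2f}  vector-based: {vector_sim:.2f}")
```

Scores like these, computed against reference essays per thematic group, become the candidate features from which the top-N correlated ones are selected.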
基金supported by the Institute of Information&Communications Technology Planning&Evaluation(IITP)grant funded by the Korea government(MSIT)(No.RS-2023-00235509Development of security monitoring technology based network behavior against encrypted cyber threats in ICT convergence environment).
文摘With the increasing emphasis on personal information protection,encryption through security protocols has emerged as a critical requirement in data transmission and reception processes.Nevertheless,IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices,spanning a range of devices from non-encrypted ones to fully encrypted ones.Given the limited visibility into payloads in this context,this study investigates AI-based attack detection methods that leverage encrypted traffic metadata,eliminating the need for decryption and minimizing system performance degradation—especially in light of these heterogeneous devices.Using the UNSW-NB15 and CICIoT-2023 dataset,encrypted and unencrypted traffic were categorized according to security protocol,and AI-based intrusion detection experiments were conducted for each traffic type based on metadata.To mitigate the problem of class imbalance,eight different data sampling techniques were applied.The effectiveness of these sampling techniques was then comparatively analyzed using two ensemble models and three Deep Learning(DL)models from various perspectives.The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic.In the UNSW-NB15 dataset,the f1-score of encrypted traffic was approximately 0.98,which is 4.3%higher than that of unencrypted traffic(approximately 0.94).In addition,analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower f1-score of roughly 0.43,indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance.Furthermore,when data sampling techniques were applied to encrypted traffic,the recall in the UNSWNB15(Encrypted)dataset improved by up to 23.0%,and in the CICIoT-2023(Encrypted)dataset by 20.26%,showing a similar level of improvement.Notably,in CICIoT-2023,f1-score and Receiver Operation Characteristic-Area Under the Curve(ROC-AUC)increased by 59.0%and 55.94%,respectively.These results suggest that data sampling can have a positive effect even in encrypted environments.However,the extent of the improvement may vary depending on data quality,model architecture,and sampling strategy.
基金supported by the National Key R&D Program of China(Nos.2022YFF 0503203 and 2024YFF0809900)the Research Funds of the Institute of Geophysics,China Earthquake Administration(No.DQJB24X28)the National Natural Science Foundation of China(Nos.42474226 and 42441827).
文摘The InSight mission has obtained seismic data from Mars,offering new insights into the planet’s internal structure and seismic activity.However,the raw data released to the public contain various sources of noise,such as ticks and glitches,which hamper further seismological studies.This paper presents step-by-step processing of InSight’s Very Broad Band seismic data,focusing on the suppression and removal of non-seismic noise.The processing stages include tick noise removal,glitch signal suppression,multicomponent synchronization,instrument response correction,and rotation of orthogonal components.The processed datasets and associated codes are openly accessible and will support ongoing efforts to explore the geophysical properties of Mars and contribute to the broader field of planetary seismology.
基金funded by the Joint Project of Industry-University-Research of Jiangsu Province(Grant:BY20231146).
文摘With the widespread application of Internet of Things(IoT)technology,the processing of massive realtime streaming data poses significant challenges to the computational and data-processing capabilities of systems.Although distributed streaming data processing frameworks such asApache Flink andApache Spark Streaming provide solutions,meeting stringent response time requirements while ensuring high throughput and resource utilization remains an urgent problem.To address this,the study proposes a formal modeling approach based on Performance Evaluation Process Algebra(PEPA),which abstracts the core components and interactions of cloud-based distributed streaming data processing systems.Additionally,a generic service flow generation algorithmis introduced,enabling the automatic extraction of service flows fromthe PEPAmodel and the computation of key performance metrics,including response time,throughput,and resource utilization.The novelty of this work lies in the integration of PEPA-based formal modeling with the service flow generation algorithm,bridging the gap between formal modeling and practical performance evaluation for IoT systems.Simulation experiments demonstrate that optimizing the execution efficiency of components can significantly improve system performance.For instance,increasing the task execution rate from 10 to 100 improves system performance by 9.53%,while further increasing it to 200 results in a 21.58%improvement.However,diminishing returns are observed when the execution rate reaches 500,with only a 0.42%gain.Similarly,increasing the number of TaskManagers from 10 to 20 improves response time by 18.49%,but the improvement slows to 6.06% when increasing from 20 to 50,highlighting the importance of co-optimizing component efficiency and resource management to achieve substantial performance gains.This study provides a systematic framework for analyzing and optimizing the performance of IoT systems for large-scale real-time streaming data processing.The proposed approach not only identifies performance bottlenecks but also offers insights into improving system efficiency under different configurations and workloads.
基金Supported by the National Natural Science Foundation of China(U24B2031)National Key Research and Development Project(2018YFA0702504)"14th Five-Year Plan"Science and Technology Project of CNOOC(KJGG2022-0201)。
文摘During drilling operations,the low resolution of seismic data often limits the accurate characterization of small-scale geological bodies near the borehole and ahead of the drill bit.This study investigates high-resolution seismic data processing technologies and methods tailored for drilling scenarios.The high-resolution processing of seismic data is divided into three stages:pre-drilling processing,post-drilling correction,and while-drilling updating.By integrating seismic data from different stages,spatial ranges,and frequencies,together with information from drilled wells and while-drilling data,and applying artificial intelligence modeling techniques,a progressive high-resolution processing technology of seismic data based on multi-source information fusion is developed,which performs simple and efficient seismic information updates during drilling.Case studies show that,with the gradual integration of multi-source information,the resolution and accuracy of seismic data are significantly improved,and thin-bed weak reflections are more clearly imaged.The updated seismic information while-drilling demonstrates high value in predicting geological bodies ahead of the drill bit.Validation using logging,mud logging,and drilling engineering data ensures the fidelity of the processing results of high-resolution seismic data.This provides clearer and more accurate stratigraphic information for drilling operations,enhancing both drilling safety and efficiency.
基金supported by the Start-up Fund from Hainan University(No.KYQD(ZR)-20077)。
文摘Three-dimensional(3D)single molecule localization microscopy(SMLM)plays an important role in biomedical applications,but its data processing is very complicated.Deep learning is a potential tool to solve this problem.As the state of art 3D super-resolution localization algorithm based on deep learning,FD-DeepLoc algorithm reported recently still has a gap with the expected goal of online image processing,even though it has greatly improved the data processing throughput.In this paper,a new algorithm Lite-FD-DeepLoc is developed on the basis of FD-DeepLoc algorithm to meet the online image processing requirements of 3D SMLM.This new algorithm uses the feature compression method to reduce the parameters of the model,and combines it with pipeline programming to accelerate the inference process of the deep learning model.The simulated data processing results show that the image processing speed of Lite-FD-DeepLoc is about twice as fast as that of FD-DeepLoc with a slight decrease in localization accuracy,which can realize real-time processing of 256×256 pixels size images.The results of biological experimental data processing imply that Lite-FD-DeepLoc can successfully analyze the data based on astigmatism and saddle point engineering,and the global resolution of the reconstructed image is equivalent to or even better than FD-DeepLoc algorithm.
基金the National Natural Science Foundation of China(Grant Nos.52308403 and 52079068)the Yunlong Lake Laboratory of Deep Underground Science and Engineering(No.104023005)the China Postdoctoral Science Foundation(Grant No.2023M731998)for funding provided to this work.
文摘The uniaxial compressive strength(UCS)of rocks is a vital geomechanical parameter widely used for rock mass classification,stability analysis,and engineering design in rock engineering.Various UCS testing methods and apparatuses have been proposed over the past few decades.The objective of the present study is to summarize the status and development in theories,test apparatuses,data processing of the existing testing methods for UCS measurement.It starts with elaborating the theories of these test methods.Then the test apparatus and development trends for UCS measurement are summarized,followed by a discussion on rock specimens for test apparatus,and data processing methods.Next,the method selection for UCS measurement is recommended.It reveals that the rock failure mechanism in the UCS testing methods can be divided into compression-shear,compression-tension,composite failure mode,and no obvious failure mode.The trends of these apparatuses are towards automation,digitization,precision,and multi-modal test.Two size correction methods are commonly used.One is to develop empirical correlation between the measured indices and the specimen size.The other is to use a standard specimen to calculate the size correction factor.Three to five input parameters are commonly utilized in soft computation models to predict the UCS of rocks.The selection of the test methods for the UCS measurement can be carried out according to the testing scenario and the specimen size.The engineers can gain a comprehensive understanding of the UCS testing methods and its potential developments in various rock engineering endeavors.
基金supported by the Deanship of Graduate Studies and Scientific Research at University of Bisha for funding this research through the promising program under grant number(UB-Promising-33-1445).
文摘Open networks and heterogeneous services in the Internet of Vehicles(IoV)can lead to security and privacy challenges.One key requirement for such systems is the preservation of user privacy,ensuring a seamless experience in driving,navigation,and communication.These privacy needs are influenced by various factors,such as data collected at different intervals,trip durations,and user interactions.To address this,the paper proposes a Support Vector Machine(SVM)model designed to process large amounts of aggregated data and recommend privacy preserving measures.The model analyzes data based on user demands and interactions with service providers or neighboring infrastructure.It aims to minimize privacy risks while ensuring service continuity and sustainability.The SVMmodel helps validate the system’s reliability by creating a hyperplane that distinguishes between maximum and minimum privacy recommendations.The results demonstrate the effectiveness of the proposed SVM model in enhancing both privacy and service performance.
基金2024 Anqing Normal University University-Level Key Project(ZK2024062D)。
文摘This study examines the Big Data Collection and Preprocessing course at Anhui Institute of Information Engineering,implementing a hybrid teaching reform using the Bosi Smart Learning Platform.The proposed hybrid model follows a“three-stage”and“two-subject”framework,incorporating a structured design for teaching content and assessment methods before,during,and after class.Practical results indicate that this approach significantly enhances teaching effectiveness and improves students’learning autonomy.
基金supported by the National Nature Science Foundation of China(12363010)supported by the Guizhou Provincial Basic Research Program(Natural Science)(ZK[2023]039)the Key Technology R&D Program([2023]352).
文摘Previous studies aiming to accelerate data processing have focused on enhancement algorithms,using the graphics processing unit(GPU)to speed up programs,and thread-level parallelism.These methods overlook maximizing the utilization of existing central processing unit(CPU)resources and reducing human and computational time costs via process automation.Accordingly,this paper proposes a scheme,called SSM,that combines“Srun job submission mode”,“Sbatch job submission mode”,and“Monitor function”.The SSM scheme includes three main modules:data management,command management,and resource management.Its core innovations are command splitting and parallel execution.The results show that this method effectively improves CPU utilization and reduces the time required for data processing.In terms of CPU utilization,the average value of this scheme is 89%.In contrast,the average CPU utilizations of“Srun job submission mode”and“Sbatch job submission mode”are significantly lower,at 43%and 52%,respectively.In terms of the data-processing time,SSM testing on the Five-hundred-meter Aperture Spherical radio Telescope(FAST)data requires only 5.5 h,compared with 8 h in the“Srun job submission mode”and 14 h in the“Sbatch job submission mode”.In addition,tests on the FAST and Parkes datasets demonstrate the universality of the SSM scheme,which can process data from different telescopes.The compatibility of the SSM scheme for pulsar searches is verified using 2 days of observational data from the globular cluster M2,with the scheme successfully discovering all published pulsars in M2.
基金supported by the National Key Research and Development Program of China(Grant No.2023YFB4503600)the National Natural Science Foundation of China(Grant Nos.U23A20299,62072460,62172424,62276270,and 62322214).
文摘The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution,primarily by minimizing the number of scanned block files.Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions,often overlooking non-numeric columns and thus limiting optimization potential.Additionally,these techniques,despite creating fine-grained partitions from representative queries to enhance system performance,experience from notable performance declines due to unpredictable fluctuations in future queries.To tackle these issues,we introduce LRP,a learned robust partitioning system for dynamic query processing.LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries.It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries.To create high-quality,robust partitions based on these predictions,LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions.Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
基金supported by the Young Data Scientist Program of the China National Astronomical Data Center。
文摘The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has become a crucial resource in astronomical research,offering a vast amount of spectral data for stars,galaxies,and quasars.This paper presents the data processing methods used by LAMOST,focusing on the classification and redshift measurement of large spectral data sets through template matching,as well as the creation of data products.Additionally,this paper details the construction of the Multiple Epoch Catalogs by integrating LAMOST spectral data with photometric data from Gaia and Pan-STARRS,and explains the creation of both low-and medium-resolution data products.
基金funded by Princess Nourah bint Abdulrahman University and Researchers Supporting Project number(PNURSP2025R346),Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Large language models(LLMs)and natural language processing(NLP)have significant promise to improve efficiency and refine healthcare decision-making and clinical results.Numerous domains,including healthcare,are rapidly adopting LLMs for the classification of biomedical textual data in medical research.The LLM can derive insights from intricate,extensive,unstructured training data.Variants need to be accurately identified and classified to advance genetic research,provide individualized treatment,and assist physicians in making better choices.However,the sophisticated and perplexing language of medical reports is often beyond the capabilities of the devices we now utilize.Such an approach may result in incorrect diagnoses,which could affect a patient’s prognosis and course of therapy.This study evaluated the efficacy of the proposed model by looking at publicly accessible textual clinical data.We have cleaned the clinical textual data using various text preprocessing methods,including stemming,tokenization,and stop word removal.The important features are extracted using Bag of Words(BoW)and Term Frequency-Inverse Document Frequency(TFIDF)feature engineering methods.The important motive of this study is to predict the genetic variants based on the clinical evidence using a novel method with minimal error.According to the experimental results,the random forest model achieved 61%accuracy with 67%precision for class 9 using TFIDF features and 63%accuracy and a 73%F1 score for class 9 using Bag of Words features.The accuracy of the proposed BERT(Bidirectional Encoder Representations from Transformers)model was 70%with 5-fold cross-validation and 71%with 10-fold cross-validation.The research results provide a comprehensive overview of current LLM methods in healthcare,benefiting academics as well as professionals in the discipline.
基金Funds for High-Level Talents Programof Xi’an International University(Grant No.XAIU202411).
文摘The Cyber-Physical Systems (CPS) supported by Wireless Sensor Networks (WSN) helps factories collect data and achieve seamless communication between physical and virtual components. Sensor nodes are energy-constrained devices. Their energy consumption is typically correlated with the amount of data collection. The purpose of data aggregation is to reduce data transmission, lower energy consumption, and reduce network congestion. For large-scale WSN, data aggregation can greatly improve network efficiency. However, as many heterogeneous data is poured into a specific area at the same time, it sometimes causes data loss and then results in incompleteness and irregularity of production data. This paper proposes an information processing model that encompasses the Energy-Conserving Data Aggregation Algorithm (ECDA) and the Efficient Message Reception Algorithm (EMRA). ECDA is divided into two stages, Energy conservation based on the global cost and Data aggregation based on ant colony optimization. The EMRA comprises the Polling Message Reception Algorithm (PMRA), the Shortest Time Message Reception Algorithm (STMRA), and the Specific Condition Message Reception Algorithm (SCMRA). These algorithms are not only available for the regularity and directionality of sensor information transmission, but also satisfy the different requirements in small factory environments. To compare with the recent HPSO-ILEACH and E-PEGASIS, DCDA can effectively reduce energy consumption. Experimental results show that STMRA consumes 1.3 times the time of SCMRA. Both optimization algorithms exhibit higher time efficiency than PMRA. Furthermore, this paper also evaluates these three algorithms using AHP.
基金supported by the project“Romanian Hub for Artificial Intelligence-HRIA”,Smart Growth,Digitization and Financial Instruments Program,2021–2027,MySMIS No.334906.
文摘Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods,based on reliable existing data stored in project management tools’datasets,automating this evaluation process becomes a natural step forward.In this context,our approach focuses on quantifying software developer expertise by using metadata from the task-tracking systems.For this,we mathematically formalize two categories of expertise:technology-specific expertise,which denotes the skills required for a particular technology,and general expertise,which encapsulates overall knowledge in the software industry.Afterward,we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers(BERT)-like transformers to handle the unique characteristics of project tool datasets effectively.Finally,our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives.The method was experimentally validated,yielding promising results.
基金The support provided by the Natural Science Foundation of Hubei Province(Grant No.2021CFA081)the National Natural Science Foundation of China(Grant No.42277160)the fellowship of China Postdoctoral Science Foundation(Grant No.2022TQ0241)is gratefully acknowledged.
文摘The big data generated by tunnel boring machines(TBMs)are widely used to reveal complex rock-machine interactions by machine learning(ML)algorithms.Data preprocessing plays a crucial role in improving ML accuracy.For this,a TBM big data preprocessing method in ML was proposed in the present study.It emphasized the accurate division of TBM tunneling cycle and the optimization method of feature extraction.Based on the data collected from a TBM water conveyance tunnel in China,its effectiveness was demonstrated by application in predicting TBM performance.Firstly,the Score-Kneedle(S-K)method was proposed to divide a TBM tunneling cycle into five phases.Conducted on 500 TBM tunneling cycles,the S-K method accurately divided all five phases in 458 cycles(accuracy of 91.6%),which is superior to the conventional duration division method(accuracy of 74.2%).Additionally,the S-K method accurately divided the stable phase in 493 cycles(accuracy of 98.6%),which is superior to two state-of-the-art division methods,namely the histogram discriminant method(accuracy of 94.6%)and the cumulative sum change point detection method(accuracy of 92.8%).Secondly,features were extracted from the divided phases.Specifically,TBM tunneling resistances were extracted from the free rotating phase and free advancing phase.The resistances were subtracted from the total forces to represent the true rock-fragmentation forces.The secant slope and the mean value were extracted as features of the increasing phase and stable phase,respectively.Finally,an ML model integrating a deep neural network and genetic algorithm(GA-DNN)was established to learn the preprocessed data.The GA-DNN used 6 secant slope features extracted from the increasing phase to predict the mean field penetration index(FPI)and torque penetration index(TPI)in the stable phase,guiding TBM drivers to make better decisions in advance.The results indicate that the proposed TBM big data preprocessing method can improve prediction accuracy significantly(improving R2s of TPI and FPI on the test dataset from 0.7716 to 0.9178 and from 0.7479 to 0.8842,respectively).
Funding: Supported by the National Key Research and Development Program of China (Nos. 2020YFA0712403 and 2021YFF1200901), the National Natural Science Foundation of China (NSFC) (Grant Nos. 62133006 and 92268104), the Tsinghua University Initiative Scientific Research Program (No. 20221080076), and the China Postdoctoral Science Foundation (No. 2022M721839).
Abstract: Dissecting the dynamics of cell states is crucial for understanding various biological processes, such as tissue development and tumor drug responses. Recent advancements in single-cell lineage tracing (scLT) technologies provide effective ways to track single-cell lineages through heritable cellular barcodes, while simultaneously detecting the molecular states of cells by sequencing [1].
Abstract: Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the obtained results show that our proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component of the loss function. Additionally, we assessed the downstream utility of imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations, making it a promising solution for robust data recovery in clinical applications.
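A minimal PyTorch sketch of the three-part loss is given below; the exact functional forms and weights are assumptions based on the abstract's description, not the paper's implementation.

```python
import torch

def composite_loss(x_hat, x_hat_noisy, x, miss_mask,
                   lam_noise=0.1, lam_var=0.01):
    """Hedged sketch of the composite imputation loss (weights are guesses).
    x_hat       : reconstruction from the clean (zero-filled) input
    x_hat_noisy : reconstruction from a noise-corrupted copy of the input
    miss_mask   : 1.0 where an entry was (synthetically) missing, else 0.0
    """
    # (i) Guided, masked MSE: error measured only on the missing entries.
    masked_mse = (((x_hat - x) ** 2) * miss_mask).sum() / miss_mask.sum().clamp(min=1.0)
    # (ii) Noise-aware regularization: reconstructions from clean and
    # corrupted inputs should agree, promoting resilience to corruption.
    noise_reg = ((x_hat - x_hat_noisy) ** 2).mean()
    # (iii) Variance penalty: keep per-feature output variance close to the
    # input's, discouraging collapsed (constant) reconstructions.
    var_pen = (x_hat.var(dim=0) - x.var(dim=0)).abs().mean()
    return masked_mse + lam_noise * noise_reg + lam_var * var_pen
```

In training, `x_hat` and `x_hat_noisy` would be two forward passes of the same autoencoder, with the mask coming from the missingness injected under each of the four mechanisms.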
Funding: Funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Modern intrusion detection systems (MIDS) face persistent challenges in coping with the rapid evolution of cyber threats, high-volume network traffic, and imbalanced datasets. Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively. This study introduces an advanced, explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets, which reflect real-world network behavior through a blend of normal and diverse attack classes. The methodology begins with sophisticated data preprocessing, incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions, ensuring standardized and model-ready inputs. Dimensionality reduction is achieved via the Harris Hawks Optimization (HHO) algorithm, a nature-inspired metaheuristic modeled on hawks' hunting strategies; HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance. Following feature selection, SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types. A stacked architecture is then employed, combining the strengths of XGBoost, SVM, and RF as base learners; this layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers. The model was evaluated using standard classification metrics: precision, recall, F1-score, and overall accuracy. The best overall performance was recorded on UNSW-NB15, with an accuracy of 99.44%, demonstrating the model's effectiveness. After balancing, the model showed a clear improvement in detecting attacks. We tested the model on four datasets to show the effectiveness of the proposed approach and performed an ablation study to check the effect of each parameter. The proposed model is also computationally efficient. To support transparency and trust in decision-making, explainable AI (XAI) techniques are incorporated that provide both global and local insight into feature contributions and offer intuitive visualizations for individual predictions. This makes the framework suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
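The resampling-plus-stacking core of this pipeline can be sketched with standard libraries as follows; hyperparameters and the meta-learner choice are assumptions, and the HHO feature-selection step is omitted.

```python
# Hedged sketch of SMOTE + stacked XGBoost/SVM/RF, per the abstract.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

def build_stacked_ids(X_train, y_train):
    # Oversample the training split only, so evaluation data stay untouched.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    stack = StackingClassifier(
        estimators=[
            ("xgb", XGBClassifier(eval_metric="logloss")),
            ("svm", SVC(probability=True)),   # probabilities for the meta-learner
            ("rf", RandomForestClassifier(n_estimators=200)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    return stack.fit(X_res, y_res)
```

Outputs of the fitted stack could then feed the XAI layer (e.g., a SHAP-style attribution method, one plausible choice) for the global and local explanations mentioned above.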
Funding: Funded by the University of Transport and Communications (UTC) under grant number T2025-CN-004.
Abstract: Reversible data hiding (RDH) enables secret-data embedding while preserving complete cover-image recovery, making it crucial for applications requiring image integrity. The pixel value ordering (PVO) technique used with multiple stego images provides good image quality but often results in low embedding capacity. To address these challenges, this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image. The cover image is partitioned into non-overlapping blocks, with the pixels of each block sorted in ascending order. Four secret bits are embedded into each block's maximum pixel value, while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold. A similar embedding strategy is applied on the minimum side of the block, including the second-smallest pixel value. This design enables each block to embed up to 14 bits of secret data. Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches, advancing the field of reversible steganography.
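For background, the sketch below shows classic single-bit PVO embedding on a block's maximum pixel; the paper's multi-bit, triple-stego construction builds on this idea but is not reproduced here.

```python
import numpy as np

def pvo_embed_max(block: np.ndarray, bit: int) -> np.ndarray:
    """Classic PVO embedding of one bit into a block's maximum pixel.
    Shown only as background for the high-capacity scheme above."""
    flat = block.astype(np.int32).flatten()
    order = np.argsort(flat, kind="stable")   # ascending pixel order
    i_max, i_2nd = order[-1], order[-2]
    diff = flat[i_max] - flat[i_2nd]
    if diff == 1:        # expandable: the maximum carries one secret bit
        flat[i_max] += bit
    elif diff > 1:       # shifted: keeps the embedding map invertible
        flat[i_max] += 1
    # diff == 0 blocks carry no data and are left unchanged.
    return flat.reshape(block.shape)

block = np.array([[52, 55], [54, 56]])
print(pvo_embed_max(block, bit=1))   # the 56 becomes 57, hiding one bit
```

Extraction reverses the same case analysis on the sorted stego block, recovering both the bit and the original maximum, which is what makes the hiding reversible.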
Funding: Funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under grant No. DGSSR-2024-02-01264.
Abstract: Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES that combines text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine-learning approach, selecting the top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R2 of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R2 to 88.95%. In Experiment 4, an optimal data-efficiency training approach was introduced, in which training data portions were increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R2 of 85.49%, highlighting an effective trade-off between performance and computational cost. These findings demonstrate the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
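The hybrid feature idea can be sketched as three similarity scores between an essay and a reference answer, later fed to an RF regressor. The specific measures and the `embed` function below are illustrative assumptions, not the paper's actual feature set.

```python
# Hedged sketch: one text-based, one vector-based, and one embedding-based
# similarity feature per essay. `vectorizer` is a TfidfVectorizer already
# fitted on the corpus; `embed` is a hypothetical sentence-embedding
# function (e.g., from a multilingual transformer) supplied by the caller.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_features(essay, reference, vectorizer, embed):
    # Text-based: word-overlap ratio against the reference answer.
    ref_words = set(reference.split())
    overlap = len(set(essay.split()) & ref_words) / max(len(ref_words), 1)
    # Vector-based: cosine similarity of TF-IDF vectors.
    tfidf = vectorizer.transform([essay, reference])
    vec_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    # Embedding-based: cosine similarity of dense sentence embeddings.
    emb_sim = cosine_similarity([embed(essay)], [embed(reference)])[0, 0]
    return [overlap, vec_sim, emb_sim]
```

A `RandomForestRegressor` fitted on such feature rows against human scores would mirror the Experiment 2 setup in spirit.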