With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception processes. Nevertheless, IoT ecosystems comprise heterogeneous networks in which outdated systems coexist with the latest devices, spanning a range of devices from non-encrypted to fully encrypted ones. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed from various perspectives using two ensemble models and three Deep Learning (DL) models. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the F1-score of encrypted traffic was approximately 0.98, which is 4.3% higher than that of unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that dataset quality and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, recall on the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and on the CICIoT-2023 (Encrypted) dataset by 20.26%, a similar level of improvement. Notably, in CICIoT-2023, the F1-score and Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments. However, the extent of the improvement may vary depending on data quality, model architecture, and sampling strategy.
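The abstract above does not include code; as a rough, hypothetical illustration of the kind of pipeline it describes (resampling to counter class imbalance, then an ensemble detector scored on F1 and recall), the sketch below uses synthetic placeholder metadata, with SMOTE standing in for any one of the eight sampling techniques.

```python
# Minimal sketch: oversample an imbalanced metadata dataset, then train an
# ensemble detector. X (flow-metadata features) and y (attack labels) are
# synthetic placeholders; SMOTE stands in for any of the sampling techniques.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))            # placeholder metadata features
y = (rng.random(5000) < 0.05).astype(int)  # ~5% attack traffic (imbalanced)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # rebalance classes

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)
pred = clf.predict(X_te)
print("F1:", f1_score(y_te, pred), "recall:", recall_score(y_te, pred))
```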
The InSight mission has obtained seismic data from Mars, offering new insights into the planet’s internal structure and seismic activity. However, the raw data released to the public contain various sources of noise, such as ticks and glitches, which hamper further seismological studies. This paper presents step-by-step processing of InSight’s Very Broad Band seismic data, focusing on the suppression and removal of non-seismic noise. The processing stages include tick noise removal, glitch signal suppression, multicomponent synchronization, instrument response correction, and rotation of orthogonal components. The processed datasets and associated codes are openly accessible and will support ongoing efforts to explore the geophysical properties of Mars and contribute to the broader field of planetary seismology.
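As a purely illustrative sketch of one listed stage, suppressing isolated spikes such as ticks, the snippet below applies a rolling-median despiking filter in NumPy/SciPy; the window length and threshold are arbitrary assumptions, not parameters of the InSight processing chain.

```python
# Illustrative despiking sketch: flag samples that deviate strongly from a
# rolling median and replace them with that median. Window and threshold are
# arbitrary assumptions, not InSight processing parameters.
import numpy as np
from scipy.signal import medfilt

def remove_spikes(trace, window=11, k=6.0):
    """Replace isolated spikes in a 1-D trace with the local median value."""
    baseline = medfilt(trace, kernel_size=window)
    residual = trace - baseline
    mad = np.median(np.abs(residual - np.median(residual))) + 1e-12
    spikes = np.abs(residual) > k * 1.4826 * mad   # robust sigma estimate
    cleaned = trace.copy()
    cleaned[spikes] = baseline[spikes]
    return cleaned, spikes

t = np.linspace(0, 10, 2000)
trace = np.sin(2 * np.pi * 0.5 * t)
trace[::250] += 5.0                      # synthetic "tick"-like spikes
cleaned, mask = remove_spikes(trace)
print("spikes flagged:", int(mask.sum()))
```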
With the widespread application of Internet of Things (IoT) technology, the processing of massive real-time streaming data poses significant challenges to the computational and data-processing capabilities of systems. Although distributed streaming data processing frameworks such as Apache Flink and Apache Spark Streaming provide solutions, meeting stringent response time requirements while ensuring high throughput and resource utilization remains an urgent problem. To address this, the study proposes a formal modeling approach based on Performance Evaluation Process Algebra (PEPA), which abstracts the core components and interactions of cloud-based distributed streaming data processing systems. Additionally, a generic service flow generation algorithm is introduced, enabling the automatic extraction of service flows from the PEPA model and the computation of key performance metrics, including response time, throughput, and resource utilization. The novelty of this work lies in the integration of PEPA-based formal modeling with the service flow generation algorithm, bridging the gap between formal modeling and practical performance evaluation for IoT systems. Simulation experiments demonstrate that optimizing the execution efficiency of components can significantly improve system performance. For instance, increasing the task execution rate from 10 to 100 improves system performance by 9.53%, while further increasing it to 200 results in a 21.58% improvement. However, diminishing returns are observed when the execution rate reaches 500, with only a 0.42% gain. Similarly, increasing the number of TaskManagers from 10 to 20 improves response time by 18.49%, but the improvement slows to 6.06% when increasing from 20 to 50, highlighting the importance of co-optimizing component efficiency and resource management to achieve substantial performance gains. This study provides a systematic framework for analyzing and optimizing the performance of IoT systems for large-scale real-time streaming data processing. The proposed approach not only identifies performance bottlenecks but also offers insights into improving system efficiency under different configurations and workloads.
During drilling operations, the low resolution of seismic data often limits the accurate characterization of small-scale geological bodies near the borehole and ahead of the drill bit. This study investigates high-resolution seismic data processing technologies and methods tailored for drilling scenarios. The high-resolution processing of seismic data is divided into three stages: pre-drilling processing, post-drilling correction, and while-drilling updating. By integrating seismic data from different stages, spatial ranges, and frequencies, together with information from drilled wells and while-drilling data, and applying artificial intelligence modeling techniques, a progressive high-resolution seismic data processing technology based on multi-source information fusion is developed, which performs simple and efficient seismic information updates during drilling. Case studies show that, with the gradual integration of multi-source information, the resolution and accuracy of seismic data are significantly improved, and thin-bed weak reflections are more clearly imaged. The seismic information updated while drilling demonstrates high value in predicting geological bodies ahead of the drill bit. Validation using logging, mud logging, and drilling engineering data ensures the fidelity of the high-resolution processing results. This provides clearer and more accurate stratigraphic information for drilling operations, enhancing both drilling safety and efficiency.
The uniaxial compressive strength (UCS) of rocks is a vital geomechanical parameter widely used for rock mass classification, stability analysis, and engineering design in rock engineering. Various UCS testing methods and apparatuses have been proposed over the past few decades. The objective of the present study is to summarize the status of and developments in the theories, test apparatuses, and data processing of existing testing methods for UCS measurement. It starts by elaborating the theories of these test methods. Then the test apparatuses and development trends for UCS measurement are summarized, followed by a discussion of rock specimens for the test apparatuses and of data processing methods. Next, the selection of a method for UCS measurement is recommended. The review reveals that the rock failure mechanisms in UCS testing methods can be divided into compression-shear, compression-tension, composite failure mode, and no obvious failure mode. These apparatuses are trending towards automation, digitization, precision, and multi-modal testing. Two size correction methods are commonly used. One is to develop an empirical correlation between the measured indices and the specimen size; the other is to use a standard specimen to calculate a size correction factor. Three to five input parameters are commonly utilized in soft computing models to predict the UCS of rocks. The test method for UCS measurement can be selected according to the testing scenario and the specimen size. Engineers can gain a comprehensive understanding of UCS testing methods and their potential developments in various rock engineering endeavors.
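As a concrete example of the first size-correction route mentioned above (an empirical correlation between measured strength and specimen size), a widely quoted relation of the Hoek-Brown form scales strength to an equivalent 50 mm diameter; the sketch below applies it. The 0.18 exponent is the commonly cited empirical value and is an assumption here, not a figure taken from this review.

```python
# Sketch of an empirical size correction of the Hoek-Brown form:
#   sigma_c(50 mm) = sigma_c(d) / (50 / d)**0.18
# The 0.18 exponent is the commonly quoted empirical value and is an
# assumption here, not a parameter reported by this review.
def ucs_to_50mm(ucs_measured_mpa: float, diameter_mm: float, exponent: float = 0.18) -> float:
    """Convert a UCS measured on a specimen of given diameter to the
    equivalent strength of a standard 50 mm diameter specimen."""
    return ucs_measured_mpa / (50.0 / diameter_mm) ** exponent

# Example: 85 MPa measured on a 38 mm core corresponds to roughly 81 MPa at 50 mm.
print(round(ucs_to_50mm(85.0, 38.0), 1))
```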
Three-dimensional (3D) single molecule localization microscopy (SMLM) plays an important role in biomedical applications, but its data processing is very complicated. Deep learning is a potential tool to solve this problem. As the state-of-the-art deep learning-based 3D super-resolution localization algorithm, the recently reported FD-DeepLoc algorithm still falls short of the goal of online image processing, even though it has greatly improved data processing throughput. In this paper, a new algorithm, Lite-FD-DeepLoc, is developed on the basis of FD-DeepLoc to meet the online image processing requirements of 3D SMLM. The new algorithm uses feature compression to reduce the number of model parameters and combines it with pipelined programming to accelerate the inference process of the deep learning model. Results on simulated data show that the image processing speed of Lite-FD-DeepLoc is about twice as fast as that of FD-DeepLoc with a slight decrease in localization accuracy, enabling real-time processing of 256×256 pixel images. Results on biological experimental data imply that Lite-FD-DeepLoc can successfully analyze data based on astigmatism and saddle point engineering, and the global resolution of the reconstructed image is equivalent to or even better than that of FD-DeepLoc.
Open networks and heterogeneous services in the Internet of Vehicles (IoV) can lead to security and privacy challenges. One key requirement for such systems is the preservation of user privacy, ensuring a seamless experience in driving, navigation, and communication. These privacy needs are influenced by various factors, such as data collected at different intervals, trip durations, and user interactions. To address this, the paper proposes a Support Vector Machine (SVM) model designed to process large amounts of aggregated data and recommend privacy-preserving measures. The model analyzes data based on user demands and interactions with service providers or neighboring infrastructure. It aims to minimize privacy risks while ensuring service continuity and sustainability. The SVM model helps validate the system’s reliability by creating a hyperplane that distinguishes between maximum and minimum privacy recommendations. The results demonstrate the effectiveness of the proposed SVM model in enhancing both privacy and service performance.
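A minimal sketch of the classification step described above, an SVM hyperplane separating high-privacy from low-privacy recommendations, might look as follows; the feature vectors and labels are invented placeholders, not the paper's data.

```python
# Minimal sketch: a linear SVM separating "maximum privacy" from "minimum
# privacy" recommendations. Features and labels are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Placeholder features: [collection interval, trip duration, interaction count]
X = rng.normal(size=(400, 3))
y = (X @ np.array([0.8, -0.5, 1.2]) > 0).astype(int)  # 1 = recommend max privacy

svm = SVC(kernel="linear").fit(X, y)
print("hyperplane normal w:", svm.coef_[0], "bias b:", svm.intercept_[0])
print("recommendation for a new trip:", svm.predict([[0.2, -1.0, 0.5]])[0])
```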
This study examines the Big Data Collection and Preprocessing course at Anhui Institute of Information Engineering, implementing a hybrid teaching reform using the Bosi Smart Learning Platform. The proposed hybrid model follows a “three-stage” and “two-subject” framework, incorporating a structured design for teaching content and assessment methods before, during, and after class. Practical results indicate that this approach significantly enhances teaching effectiveness and improves students’ learning autonomy.
The paper utilized a standardized methodology to identify prognostic biomarkers in hepatocellular carcinoma (HCC) by analyzing transcriptomic and clinical data from The Cancer Genome Atlas (TCGA) database. The approach, which included stringent data preprocessing, differential gene expression analysis, and Kaplan-Meier survival analysis, provided valuable insights into the genetic underpinnings of HCC. The comprehensive analysis of a dataset involving 370 HCC patients uncovered correlations between survival status and pathological characteristics, including tumor size, lymph node involvement, and distant metastasis. The processed transcriptome dataset, comprising 420 samples and annotating 26,783 genes, served as a robust platform for identifying differential gene expression patterns. Among the significantly differentially expressed genes, key genes such as FBXO43, HAGLROS, CRISPLD1, LRRC3.DT, and ERN2 were pinpointed, which showed significant associations with patient survival outcomes, indicating their potential as novel prognostic biomarkers. This study not only enhances the understanding of HCC’s genetic landscape but also establishes a blueprint for a standardized process to discover prognostic biomarkers of various diseases using genetic big data. Future research should focus on validating these biomarkers through independent cohorts and exploring their utility in the development of personalized treatment strategies.
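As an illustrative sketch of the Kaplan-Meier step in such a workflow (not the paper's actual code), patients can be split by the median expression of a candidate gene and the two survival curves compared with a log-rank test; the lifelines library and the synthetic cohort below are assumptions.

```python
# Illustrative Kaplan-Meier comparison for a candidate prognostic gene.
# Synthetic data; lifelines is assumed as the survival-analysis library.
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "time": rng.exponential(1000, 370),   # days to event or censoring
    "event": rng.integers(0, 2, 370),     # 1 = death observed
    "expr": rng.normal(size=370),         # candidate gene expression
})
high = df["expr"] > df["expr"].median()   # high- vs low-expression groups

kmf = KaplanMeierFitter()
kmf.fit(df.loc[high, "time"], df.loc[high, "event"], label="high expression")
print("median survival (high group):", kmf.median_survival_time_)

res = logrank_test(df.loc[high, "time"], df.loc[~high, "time"],
                   df.loc[high, "event"], df.loc[~high, "event"])
print("log-rank p-value:", res.p_value)
```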
Previous studies aiming to accelerate data processing have focused on enhancement algorithms, using the graphics processing unit (GPU) to speed up programs, and thread-level parallelism. These methods overlook maximizing the utilization of existing central processing unit (CPU) resources and reducing human and computational time costs via process automation. Accordingly, this paper proposes a scheme, called SSM, that combines “Srun job submission mode”, “Sbatch job submission mode”, and a “Monitor function”. The SSM scheme includes three main modules: data management, command management, and resource management. Its core innovations are command splitting and parallel execution. The results show that this method effectively improves CPU utilization and reduces the time required for data processing. In terms of CPU utilization, the average value of this scheme is 89%. In contrast, the average CPU utilizations of the “Srun job submission mode” and “Sbatch job submission mode” are significantly lower, at 43% and 52%, respectively. In terms of data-processing time, SSM testing on Five-hundred-meter Aperture Spherical radio Telescope (FAST) data requires only 5.5 h, compared with 8 h in the “Srun job submission mode” and 14 h in the “Sbatch job submission mode”. In addition, tests on the FAST and Parkes datasets demonstrate the universality of the SSM scheme, which can process data from different telescopes. The compatibility of the SSM scheme with pulsar searches is verified using 2 days of observational data from the globular cluster M2, with the scheme successfully discovering all published pulsars in M2.
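The core idea named above, splitting a long command list and executing the pieces in parallel while monitoring CPU usage, can be sketched roughly as below; the srun command line, chunk naming, and worker count are illustrative assumptions rather than the SSM scheme's actual configuration.

```python
# Rough sketch of command splitting and parallel execution with CPU monitoring.
# The srun command line, chunk names, and worker count are assumptions only.
import subprocess
from concurrent.futures import ThreadPoolExecutor
import psutil

commands = [f"srun --ntasks=1 process_file.sh chunk_{i:04d}.fits" for i in range(32)]

def run(cmd: str) -> int:
    """Submit one command to the scheduler and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

with ThreadPoolExecutor(max_workers=8) as pool:   # 8 commands in flight at once
    futures = [pool.submit(run, c) for c in commands]
    # Simple "monitor function": sample overall CPU utilization while jobs run.
    print("CPU utilization snapshot:", psutil.cpu_percent(interval=1.0), "%")
    codes = [f.result() for f in futures]

print("failed commands:", sum(c != 0 for c in codes))
```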
The interconnection between query processing and data partitioning is pivotal for accelerating massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses to numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, despite creating fine-grained partitions from representative queries to enhance system performance, these techniques experience notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has become a crucial resource in astronomical research, offering a vast amount of spectral data for stars, galaxies, and quasars. This paper presents the data processing methods used by LAMOST, focusing on the classification and redshift measurement of large spectral data sets through template matching, as well as the creation of data products. Additionally, this paper details the construction of the Multiple Epoch Catalogs by integrating LAMOST spectral data with photometric data from Gaia and Pan-STARRS, and explains the creation of both low- and medium-resolution data products.
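Template matching for redshift measurement, as mentioned above, amounts to shifting a rest-frame template across a grid of trial redshifts and keeping the best chi-squared fit; the toy NumPy sketch below does exactly that on a synthetic spectrum and is unrelated to the actual LAMOST pipeline.

```python
# Toy chi-squared template matching over a redshift grid (illustrative only;
# not the LAMOST pipeline). A rest-frame template is redshifted, resampled
# onto the observed wavelength grid, scaled, and scored by chi-squared.
import numpy as np

def best_redshift(obs_wave, obs_flux, obs_err, tmpl_wave, tmpl_flux, z_grid):
    chi2 = np.empty_like(z_grid)
    for i, z in enumerate(z_grid):
        model = np.interp(obs_wave, tmpl_wave * (1.0 + z), tmpl_flux)
        scale = np.sum(model * obs_flux / obs_err**2) / np.sum(model**2 / obs_err**2)
        chi2[i] = np.sum(((obs_flux - scale * model) / obs_err) ** 2)
    return z_grid[np.argmin(chi2)], chi2

# Synthetic example: a single emission-line template observed at z = 0.05.
tmpl_wave = np.linspace(3500, 9000, 4000)
tmpl_flux = np.exp(-0.5 * ((tmpl_wave - 6563.0) / 5.0) ** 2)   # fake H-alpha line
obs_wave = np.linspace(3700, 9000, 3800)
true_flux = np.interp(obs_wave, tmpl_wave * 1.05, tmpl_flux)
obs_flux = true_flux + np.random.default_rng(0).normal(0, 0.02, obs_wave.size)
z_hat, _ = best_redshift(obs_wave, obs_flux, np.full(obs_wave.size, 0.02),
                         tmpl_wave, tmpl_flux, np.linspace(0.0, 0.2, 401))
print("estimated redshift:", round(z_hat, 3))   # ~0.05
```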
Large language models (LLMs) and natural language processing (NLP) hold significant promise for improving efficiency and refining healthcare decision-making and clinical results. Numerous domains, including healthcare, are rapidly adopting LLMs for the classification of biomedical textual data in medical research. LLMs can derive insights from intricate, extensive, unstructured training data. Variants need to be accurately identified and classified to advance genetic research, provide individualized treatment, and assist physicians in making better choices. However, the sophisticated and perplexing language of medical reports is often beyond the capabilities of the tools we now utilize, which may result in incorrect diagnoses that affect a patient’s prognosis and course of therapy. This study evaluated the efficacy of the proposed model on publicly accessible textual clinical data. We cleaned the clinical textual data using various text preprocessing methods, including stemming, tokenization, and stop word removal. The important features are extracted using Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TFIDF) feature engineering methods. The main motivation of this study is to predict genetic variants based on clinical evidence using a novel method with minimal error. According to the experimental results, the random forest model achieved 61% accuracy with 67% precision for class 9 using TFIDF features, and 63% accuracy and a 73% F1 score for class 9 using Bag of Words features. The accuracy of the proposed BERT (Bidirectional Encoder Representations from Transformers) model was 70% with 5-fold cross-validation and 71% with 10-fold cross-validation. The research results provide a comprehensive overview of current LLM methods in healthcare, benefiting academics as well as professionals in the discipline.
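A minimal sketch of the classical baseline described above (TF-IDF features feeding a random forest) is shown below; the two example documents and class labels are invented placeholders, and the stemming step is omitted for brevity.

```python
# Minimal sketch of the classical baseline: clean clinical text, extract
# TF-IDF features, and classify variants with a random forest.
# The example documents and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "missense mutation observed in the kinase domain of the gene",
    "truncating frameshift variant leading to loss of protein function",
]
labels = [1, 9]   # placeholder variant classes

model = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),  # tokenize + stop words
    RandomForestClassifier(n_estimators=300, random_state=0),
)
model.fit(docs, labels)
print(model.predict(["frameshift variant causing premature stop codon"]))
```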
Cyber-Physical Systems (CPS) supported by Wireless Sensor Networks (WSN) help factories collect data and achieve seamless communication between physical and virtual components. Sensor nodes are energy-constrained devices, and their energy consumption is typically correlated with the amount of data collected. The purpose of data aggregation is to reduce data transmission, lower energy consumption, and reduce network congestion. For large-scale WSN, data aggregation can greatly improve network efficiency. However, when large amounts of heterogeneous data are poured into a specific area at the same time, data loss sometimes occurs, resulting in incomplete and irregular production data. This paper proposes an information processing model that encompasses the Energy-Conserving Data Aggregation Algorithm (ECDA) and the Efficient Message Reception Algorithm (EMRA). ECDA is divided into two stages: energy conservation based on the global cost, and data aggregation based on ant colony optimization. EMRA comprises the Polling Message Reception Algorithm (PMRA), the Shortest Time Message Reception Algorithm (STMRA), and the Specific Condition Message Reception Algorithm (SCMRA). These algorithms not only accommodate the regularity and directionality of sensor information transmission but also satisfy different requirements in small factory environments. Compared with the recent HPSO-ILEACH and E-PEGASIS, ECDA can effectively reduce energy consumption. Experimental results show that STMRA consumes 1.3 times the time of SCMRA, and both optimization algorithms exhibit higher time efficiency than PMRA. Furthermore, this paper also evaluates these three algorithms using the Analytic Hierarchy Process (AHP).
Accurate prediction of manufacturing carbon emissions is of great significance for subsequent low-carbon optimization. To improve the accuracy of carbon emission prediction with insufficient hobbing data, a carbon emission prediction method for hobbing based on cross-process data fusion was proposed, combining the advantages of an improved algorithm and supplementary data. Firstly, we analyzed the similarity of the machining processes and manufacturing characteristics and selected milling data as the fusion material for the hobbing data. Then, adversarial learning was used to reduce the difference between data from the two processes, so as to realize data fusion at the characteristic level. After that, based on the Meta-Transfer Learning method, the carbon emission prediction model for hobbing was established. The effectiveness and superiority of the proposed method were verified by case analysis and comparison. The prediction accuracy of the proposed method is better than that of other methods across different data sizes.
Objective expertise evaluation of individuals, as a prerequisite stage for team formation, has been a long-term desideratum in large software development companies. With the rapid advancements in machine learning methods, and given the reliable existing data stored in project management tools’ datasets, automating this evaluation process becomes a natural step forward. In this context, our approach focuses on quantifying software developer expertise by using metadata from task-tracking systems. For this, we mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge in the software industry. Afterward, we automatically classify the zones of expertise associated with each task a developer has worked on, using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.
The big data generated by tunnel boring machines (TBMs) are widely used to reveal complex rock-machine interactions via machine learning (ML) algorithms. Data preprocessing plays a crucial role in improving ML accuracy. To this end, a TBM big data preprocessing method for ML was proposed in the present study, emphasizing the accurate division of the TBM tunneling cycle and an optimized feature extraction method. Based on data collected from a TBM water conveyance tunnel in China, its effectiveness was demonstrated by application to predicting TBM performance. Firstly, the Score-Kneedle (S-K) method was proposed to divide a TBM tunneling cycle into five phases. Applied to 500 TBM tunneling cycles, the S-K method accurately divided all five phases in 458 cycles (accuracy of 91.6%), which is superior to the conventional duration division method (accuracy of 74.2%). Additionally, the S-K method accurately divided the stable phase in 493 cycles (accuracy of 98.6%), which is superior to two state-of-the-art division methods, namely the histogram discriminant method (accuracy of 94.6%) and the cumulative sum change point detection method (accuracy of 92.8%). Secondly, features were extracted from the divided phases. Specifically, TBM tunneling resistances were extracted from the free rotating phase and free advancing phase, and these resistances were subtracted from the total forces to represent the true rock-fragmentation forces. The secant slope and the mean value were extracted as features of the increasing phase and stable phase, respectively. Finally, an ML model integrating a deep neural network and a genetic algorithm (GA-DNN) was established to learn from the preprocessed data. The GA-DNN used six secant slope features extracted from the increasing phase to predict the mean field penetration index (FPI) and torque penetration index (TPI) in the stable phase, guiding TBM drivers to make better decisions in advance. The results indicate that the proposed TBM big data preprocessing method can improve prediction accuracy significantly (improving the R2 values of TPI and FPI on the test dataset from 0.7716 to 0.9178 and from 0.7479 to 0.8842, respectively).
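The secant slope feature mentioned above is simply the overall rate of change of a signal across a phase, (last value minus first value) divided by the elapsed time; the small sketch below computes it, together with the stable-phase mean, on a synthetic thrust trace whose phase boundaries are assumed for illustration.

```python
# Sketch of two of the features described above: the secant slope over the
# increasing phase and the mean over the stable phase. The synthetic thrust
# trace and phase boundaries are illustrative assumptions.
import numpy as np

def secant_slope(t, y):
    """Overall rate of change across a phase: (y_end - y_start)/(t_end - t_start)."""
    return (y[-1] - y[0]) / (t[-1] - t[0])

t = np.arange(0.0, 600.0, 1.0)                       # seconds within one cycle
thrust = np.piecewise(t, [t < 200, t >= 200],
                      [lambda x: 50.0 * x, 10000.0])  # ramp-up then stable (kN)
increasing, stable = t < 200, t >= 200

features = {
    "thrust_secant_slope": secant_slope(t[increasing], thrust[increasing]),
    "thrust_stable_mean": float(thrust[stable].mean()),
}
print(features)   # slope = 50 kN/s, stable mean = 10000 kN
```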
The fracture volume gradually changes with the depletion of fracture pressure during the production process. However, few flowback models available so far can estimate the fracture volume loss using pressure transient and rate transient data. The initial flowback involves producing back the fracturing fluid after hydraulic fracturing, while the second flowback involves producing back the preloading fluid injected into the parent wells before fracturing of child wells. The main objective of this research is to compare the initial and second flowback data to capture the changes in fracture volume after the production and preload processes. Such a comparison is useful for evaluating well performance and optimizing fracturing operations. We construct rate-normalized pressure (RNP) versus material balance time (MBT) diagnostic plots using both the initial and second flowback data (FBi and FBs, respectively) of six multi-fractured horizontal wells completed in the Niobrara and Codell formations in the DJ Basin. In general, the slope of the RNP plot during the FBs period is higher than that during the FBi period, indicating a potential loss of fracture volume from the FBi to the FBs period. We estimate the changes in effective fracture volume (Vef) by analyzing the changes in the RNP slope and total compressibility between these two flowback periods. Vef during FBs is in general 3%-45% lower than that during FBi. We also compare the drive mechanisms for the two flowback periods by calculating the compaction-drive index (CDI), hydrocarbon-drive index (HDI), and water-drive index (WDI). The dominant drive mechanism during both flowback periods is compaction drive, but its contribution is reduced by 16% in the FBs period. This drop is generally compensated by a relatively higher HDI during this period. The loss of effective fracture volume might be attributed to the pressure depletion in fractures, which occurs during the production period and can extend over 800 days.
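The diagnostic quantities used above are commonly defined as rate-normalized pressure RNP = (p_i - p_wf)/q and material balance time MBT = Q/q (cumulative volume over instantaneous rate); the sketch below computes them for a synthetic flowback record under those standard definitions, which are assumptions here rather than expressions quoted from the paper.

```python
# Sketch of the diagnostic quantities used in flowback analysis, under the
# standard definitions RNP = (p_i - p_wf)/q and MBT = Q/q. The synthetic
# pressures and rates are placeholders, not field data.
import numpy as np

p_init = 5000.0                                    # initial pressure, psi
t = np.arange(1.0, 31.0)                           # days of flowback
q = 800.0 * t ** -0.5                              # declining rate, bbl/day
p_wf = p_init - 1500.0 * (1.0 - np.exp(-t / 10))   # flowing pressure, psi

Q = np.cumsum(q)                                   # cumulative volume, bbl
rnp = (p_init - p_wf) / q                          # rate-normalized pressure
mbt = Q / q                                        # material balance time, days

# The slope of log(RNP) vs log(MBT) is the diagnostic compared between FBi and FBs.
slope = np.polyfit(np.log(mbt), np.log(rnp), 1)[0]
print("log-log RNP-MBT slope:", round(slope, 3))
```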
The distillation process is an important chemical process, and the application of data-driven modelling approaches has the potential to reduce model complexity compared to mechanistic modelling, thus improving the efficiency of process optimization or monitoring studies. However, the distillation process is highly nonlinear and has multiple uncertainty perturbation intervals, which brings challenges to accurate data-driven modelling of distillation processes. This paper proposes a systematic data-driven modelling framework to solve these problems. Firstly, data segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, in order to cluster the data into perturbed and steady-state intervals for steady-state data extraction. Secondly, the maximal information coefficient (MIC) was employed to calculate the nonlinear correlation between variables and remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight update strategy, to construct a new ensemble learning algorithm, XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying this data-driven modelling framework to a real industrial propylene distillation process.
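As a rough sketch of the ensemble idea above, XGBoost regressors can be nested as base learners inside AdaBoost; the snippet below does only that and does not reproduce the paper's error-threshold weight-update modification or the KMDI/MIC preprocessing.

```python
# Rough sketch of the ensemble idea: XGBoost regressors used as the base
# learners inside AdaBoost. The paper's error-threshold (ET) weight-update
# modification is NOT reproduced; this is plain AdaBoost for illustration.
# (In scikit-learn < 1.2 the keyword is `base_estimator` instead of `estimator`.)
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                                   # placeholder process variables
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 1000)   # placeholder target

model = AdaBoostRegressor(
    estimator=XGBRegressor(n_estimators=50, max_depth=3, verbosity=0),
    n_estimators=20,
    learning_rate=0.5,
    random_state=0,
)
model.fit(X, y)
print("train R^2:", round(model.score(X, y), 3))
```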
Processes supported by process-aware information systems are subject to continuous and often subtle changes due to evolving operational, organizational, or regulatory factors. These changes, referred to as incremental concept drift, gradually alter the behavior or structure of processes, making their detection and localization a challenging task. Traditional process mining techniques frequently assume process stationarity and are limited in their ability to detect such drift, particularly from a control-flow perspective. The objective of this research is to develop an interpretable and robust framework capable of detecting and localizing incremental concept drift in event logs, with a specific emphasis on the structural evolution of control-flow semantics in processes. We propose DriftXMiner, a control-flow-aware hybrid framework that combines statistical, machine learning, and process model analysis techniques. The approach comprises three key components: (1) a Cumulative Drift Scanner that tracks directional statistical deviations to detect early drift signals; (2) a Temporal Clustering and Drift-Aware Forest Ensemble (DAFE) to capture distributional and classification-level changes in process behavior; and (3) Petri net-based process model reconstruction, which enables the precise localization of structural drift using transition deviation metrics and replay fitness scores. Experimental validation on the BPI Challenge 2017 event log demonstrates that DriftXMiner effectively identifies and localizes gradual and incremental process drift over time. The framework achieves a detection accuracy of 92.5%, a localization precision of 90.3%, and an F1-score of 0.91, outperforming competitive baselines such as CUSUM+Histograms and ADWIN+Alpha Miner. Visual analyses further confirm that identified drift points align with transitions in control-flow models and behavioral cluster structures. DriftXMiner offers a novel and interpretable solution for incremental concept drift detection and localization in dynamic, process-aware systems. By integrating statistical signal accumulation, temporal behavior profiling, and structural process mining, the framework enables fine-grained drift explanation and supports adaptive process intelligence in evolving environments. Its modular architecture supports extension to streaming data and real-time monitoring contexts.
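The Cumulative Drift Scanner is described as tracking directional statistical deviations, which is essentially the idea behind a one-sided CUSUM statistic; the generic sketch below applies that idea to a synthetic stream of per-trace statistics and is not the authors' implementation.

```python
# Generic one-sided CUSUM-style drift scanner: accumulate directional
# deviations of a per-trace statistic from its baseline mean and flag the
# point where the cumulative sum crosses a threshold. Generic illustration
# of the idea only, not DriftXMiner's actual implementation.
import numpy as np

def cusum_drift(stream, baseline_mean, slack=0.2, threshold=5.0):
    """Return the index of the first detected upward drift, or None."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - baseline_mean - slack))  # directional accumulation
        if s > threshold:
            return i
    return None

rng = np.random.default_rng(7)
# Synthetic per-trace statistic (e.g., mean trace duration), drifting upward
# gradually from trace 300 onward.
stream = np.concatenate([rng.normal(10.0, 1.0, 300),
                         rng.normal(10.0, 1.0, 300) + np.linspace(0, 3, 300)])
print("drift first flagged at trace index:", cusum_drift(stream, baseline_mean=10.0))
```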
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00235509, Development of security monitoring technology based on network behavior against encrypted cyber threats in ICT convergence environment).
Funding: Supported by the National Key R&D Program of China (Nos. 2022YFF0503203 and 2024YFF0809900), the Research Funds of the Institute of Geophysics, China Earthquake Administration (No. DQJB24X28), and the National Natural Science Foundation of China (Nos. 42474226 and 42441827).
Funding: Funded by the Joint Project of Industry-University-Research of Jiangsu Province (Grant: BY20231146).
Funding: Supported by the National Natural Science Foundation of China (U24B2031), the National Key Research and Development Project (2018YFA0702504), and the “14th Five-Year Plan” Science and Technology Project of CNOOC (KJGG2022-0201).
Funding: The National Natural Science Foundation of China (Grant Nos. 52308403 and 52079068), the Yunlong Lake Laboratory of Deep Underground Science and Engineering (No. 104023005), and the China Postdoctoral Science Foundation (Grant No. 2023M731998) provided funding for this work.
Funding: Supported by the Start-up Fund from Hainan University (No. KYQD(ZR)-20077).
Funding: Supported by the Deanship of Graduate Studies and Scientific Research at the University of Bisha through the Promising Program under grant number UB-Promising-33-1445.
Funding: 2024 Anqing Normal University University-Level Key Project (ZK2024062D).
Funding: The 2023 Inner Mongolia Public Institution High-Level Talent Introduction Scientific Research Support Project, with start-up funding from Linyi Vocational College.
Funding: Supported by the National Natural Science Foundation of China (12363010), the Guizhou Provincial Basic Research Program (Natural Science) (ZK[2023]039), and the Key Technology R&D Program ([2023]352).
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2023YFB4503600) and the National Natural Science Foundation of China (Grant Nos. U23A20299, 62072460, 62172424, 62276270, and 62322214).
Funding: Supported by the Young Data Scientist Program of the China National Astronomical Data Center.
Funding: Funded by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R346), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding: Funds for the High-Level Talents Program of Xi’an International University (Grant No. XAIU202411).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52005062) and the Chongqing Municipal Natural Science Foundation of China (Grant No. CSTB2023NSCQ-MSX0390).
Abstract: Accurate prediction of manufacturing carbon emissions is of great significance for subsequent low-carbon optimization. To improve prediction accuracy when hobbing data are insufficient, a carbon emission prediction method for hobbing based on cross-process data fusion was proposed, combining the advantages of an improved algorithm with supplementary data. First, the similarity of machining processes and manufacturing characteristics was analyzed, and milling data were selected as the fusion material for the hobbing data. Then, adversarial learning was used to reduce the difference between data from the two processes, realizing data fusion at the feature level. After that, a carbon emission prediction model for hobbing was established based on meta-transfer learning. The effectiveness and superiority of the proposed method were verified through case analysis and comparison; its prediction accuracy exceeds that of other methods across different data sizes.
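One common way to implement the adversarial feature alignment idea mentioned above is domain-adversarial training with a gradient reversal layer, sketched below in PyTorch. This is a generic illustration, not the paper's architecture: the network sizes, the lambda value, and the random milling/hobbing tensors are placeholder assumptions, and the meta-transfer learning stage is not shown.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated (scaled) gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
domain_classifier = nn.Sequential(nn.Linear(16, 2))   # distinguishes milling vs. hobbing samples

milling_x = torch.randn(64, 8)   # placeholder milling process features (abundant)
hobbing_x = torch.randn(16, 8)   # placeholder hobbing process features (scarce)
x = torch.cat([milling_x, hobbing_x])
domain_y = torch.cat([torch.zeros(64, dtype=torch.long), torch.ones(16, dtype=torch.long)])

features = feature_extractor(x)
domain_logits = domain_classifier(GradReverse.apply(features, 1.0))
loss = nn.CrossEntropyLoss()(domain_logits, domain_y)  # reversed gradient pushes features to be domain-invariant
loss.backward()
```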
Funding: Supported by the project "Romanian Hub for Artificial Intelligence-HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027, MySMIS No. 334906.
Abstract: Objective evaluation of individual expertise, as a prerequisite for team formation, has long been a desideratum in large software development companies. With rapid advances in machine learning methods and reliable data stored in project management tools, automating this evaluation process becomes a natural next step. In this context, our approach quantifies software developer expertise using metadata from task-tracking systems. We mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge of the software industry. We then automatically classify the zones of expertise associated with each task a developer has worked on, using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.
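A hedged sketch of the task-classification step: scoring a task-tracker ticket against a set of expertise zones with a BERT-like encoder via the Hugging Face transformers API. The checkpoint, the label set, and the ticket text are illustrative assumptions; the classification head here is untrained, whereas the paper's model is fine-tuned on project tool data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["backend", "frontend", "devops", "data"]   # assumed expertise zones, not the paper's taxonomy
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)      # untuned head: outputs are arbitrary until fine-tuned
)

ticket = "Fix N+1 query in the orders REST endpoint and add a composite index"
inputs = tokenizer(ticket, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])            # predicted zone of expertise for this task
```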
Funding: The support provided by the Natural Science Foundation of Hubei Province (Grant No. 2021CFA081), the National Natural Science Foundation of China (Grant No. 42277160), and the fellowship of the China Postdoctoral Science Foundation (Grant No. 2022TQ0241) is gratefully acknowledged.
Abstract: The big data generated by tunnel boring machines (TBMs) are widely used to reveal complex rock-machine interactions through machine learning (ML) algorithms, and data preprocessing plays a crucial role in improving ML accuracy. To this end, a TBM big data preprocessing method for ML was proposed in the present study, emphasizing accurate division of the TBM tunneling cycle and an optimized feature extraction method. Its effectiveness was demonstrated by applying it to TBM performance prediction using data collected from a TBM water conveyance tunnel in China. First, the Score-Kneedle (S-K) method was proposed to divide a TBM tunneling cycle into five phases. Applied to 500 TBM tunneling cycles, the S-K method accurately divided all five phases in 458 cycles (accuracy of 91.6%), outperforming the conventional duration division method (accuracy of 74.2%). It also accurately divided the stable phase in 493 cycles (accuracy of 98.6%), outperforming two state-of-the-art division methods, namely the histogram discriminant method (accuracy of 94.6%) and the cumulative sum change point detection method (accuracy of 92.8%). Second, features were extracted from the divided phases. Specifically, TBM tunneling resistances were extracted from the free rotating and free advancing phases and subtracted from the total forces to represent the true rock-fragmentation forces, while the secant slope and the mean value were extracted as features of the increasing phase and stable phase, respectively. Finally, an ML model integrating a deep neural network and a genetic algorithm (GA-DNN) was established to learn from the preprocessed data. The GA-DNN uses six secant slope features extracted from the increasing phase to predict the mean field penetration index (FPI) and torque penetration index (TPI) in the stable phase, guiding TBM drivers to make better decisions in advance. The results indicate that the proposed TBM big data preprocessing method significantly improves prediction accuracy (raising the test-set R² of TPI from 0.7716 to 0.9178 and that of FPI from 0.7479 to 0.8842).
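A minimal sketch of the feature-extraction step described above: taking a secant slope over the increasing phase and a mean over the stable phase of one tunneling cycle. The phase boundaries and the synthetic thrust signal below are assumptions for illustration; in the paper these boundaries come from the S-K division, which is not reproduced here.

```python
import numpy as np

t = np.arange(0, 300, 1.0)  # time within one tunneling cycle (s), synthetic
thrust = np.piecewise(
    t, [t < 100, (t >= 100) & (t < 200), t >= 200],
    [lambda s: 50 + 0.1 * s,          # free rotating / advancing phases (illustrative)
     lambda s: 60 + 1.2 * (s - 100),  # increasing phase
     180.0],                          # stable phase
)

increasing = thrust[(t >= 100) & (t < 200)]
stable = thrust[t >= 200]

secant_slope = (increasing[-1] - increasing[0]) / (len(increasing) - 1)  # increasing-phase feature
stable_mean = stable.mean()                                              # stable-phase feature (e.g., for FPI/TPI targets)
print(f"secant slope = {secant_slope:.2f}, stable-phase mean = {stable_mean:.1f}")
```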
Abstract: Fracture volume gradually changes as fracture pressure depletes during production. However, few flowback models available so far can estimate fracture volume loss from pressure-transient and rate-transient data. The initial flowback involves producing back the fracturing fluid after hydraulic fracturing, while the second flowback involves producing back the preloading fluid injected into parent wells before the fracturing of child wells. The main objective of this research is to compare the initial and second flowback data to capture the changes in fracture volume after the production and preload processes; such a comparison is useful for evaluating well performance and optimizing fracturing operations. We construct rate-normalized pressure (RNP) versus material balance time (MBT) diagnostic plots using both the initial and second flowback data (FB1 and FB2, respectively) of six multi-fractured horizontal wells completed in the Niobrara and Codell formations in the DJ Basin. In general, the slope of the RNP plot during the FB2 period is higher than that during the FB1 period, indicating a potential loss of fracture volume from the FB1 to the FB2 period. We estimate the change in effective fracture volume (Vef) by analyzing the changes in the RNP slope and total compressibility between the two flowback periods; Vef during FB2 is in general 3%-45% lower than that during FB1. We also compare the drive mechanisms of the two flowback periods by calculating the compaction-drive index (CDI), hydrocarbon-drive index (HDI), and water-drive index (WDI). Compaction drive is dominant during both flowback periods, but its contribution is reduced by 16% in the FB2 period, a drop generally compensated by a relatively higher HDI during this period. The loss of effective fracture volume might be attributed to pressure depletion in the fractures, which occurs during the production period and can extend over 800 days.
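A sketch of the diagnostic-plot construction described above, using the textbook definitions RNP = (p_i - p_wf) / q and MBT = Q / q and a simple linear fit for the RNP-versus-MBT slope. All numbers below are synthetic placeholders, not field data from the study.

```python
import numpy as np

p_i = 35.0                                   # initial reservoir pressure (MPa), assumed
p_wf = np.array([30.0, 28.5, 27.2, 26.1])    # flowing bottomhole pressure (MPa), placeholder
q = np.array([120.0, 95.0, 80.0, 70.0])      # flowback rate (m3/d), placeholder
Q = np.cumsum(q)                             # cumulative volume (m3), assuming unit time steps

rnp = (p_i - p_wf) / q                       # rate-normalized pressure
mbt = Q / q                                  # material-balance time

slope, intercept = np.polyfit(mbt, rnp, 1)   # a steeper RNP-vs-MBT slope suggests a smaller effective fracture volume
print(f"RNP-vs-MBT slope = {slope:.4f}")
```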
Funding: Supported by the National Key Research and Development Program of China (2023YFB3307801), the National Natural Science Foundation of China (62394343, 62373155, 62073142), the Major Science and Technology Project of Xinjiang (No. 2022A01006-4), the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, the Fundamental Research Funds for the Central Universities, Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC011), and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (Grant No. ICT2024B70).
Abstract: Distillation is an important chemical process, and data-driven modelling has the potential to reduce model complexity compared with mechanistic modelling, thus improving the efficiency of process optimization and monitoring studies. However, the distillation process is highly nonlinear and exhibits multiple uncertain perturbation intervals, which makes accurate data-driven modelling challenging. This paper proposes a systematic data-driven modelling framework to address these problems. First, data-segment variance was introduced into the K-means algorithm to form K-means data interval (KMDI) clustering, which separates the data into perturbed and steady-state intervals for steady-state data extraction. Second, the maximal information coefficient (MIC) was employed to compute the nonlinear correlation between variables and remove redundant features. Finally, extreme gradient boosting (XGBoost) was integrated as the base learner into adaptive boosting (AdaBoost), with an error threshold (ET) set to improve the weight-update strategy, yielding the new ensemble learning algorithm XGBoost-AdaBoost-ET. The superiority of the proposed framework is verified by applying it to a real industrial propylene distillation process.
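A hedged sketch of the XGBoost-inside-AdaBoost combination named above, using scikit-learn's AdaBoostRegressor with an XGBoost base learner. The paper's error-threshold (ET) weight-update modification is not included, and the synthetic data, features, and hyperparameters are placeholder assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                                          # e.g., selected tray temperatures and flows
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=500)    # synthetic product-quality target

model = AdaBoostRegressor(
    estimator=XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1),  # XGBoost as base learner
    n_estimators=10,                                                           # AdaBoost rounds
    random_state=0,                                                            # `estimator` arg assumes scikit-learn >= 1.2
)
model.fit(X, y)
print("train R^2:", round(model.score(X, y), 3))
```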
Abstract: Processes supported by process-aware information systems are subject to continuous and often subtle changes driven by evolving operational, organizational, or regulatory factors. These changes, referred to as incremental concept drift, gradually alter the behavior or structure of processes, making their detection and localization a challenging task. Traditional process mining techniques frequently assume process stationarity and are limited in their ability to detect such drift, particularly from a control-flow perspective. The objective of this research is to develop an interpretable and robust framework capable of detecting and localizing incremental concept drift in event logs, with a specific emphasis on the structural evolution of control-flow semantics. We propose DriftXMiner, a control-flow-aware hybrid framework that combines statistical, machine learning, and process model analysis techniques. The approach comprises three key components: (1) a Cumulative Drift Scanner that tracks directional statistical deviations to detect early drift signals; (2) a Temporal Clustering and Drift-Aware Forest Ensemble (DAFE) that captures distributional and classification-level changes in process behavior; and (3) Petri net-based process model reconstruction, which enables precise localization of structural drift using transition deviation metrics and replay fitness scores. Experimental validation on the BPI Challenge 2017 event log demonstrates that DriftXMiner effectively identifies and localizes gradual and incremental process drift over time. The framework achieves a detection accuracy of 92.5%, a localization precision of 90.3%, and an F1-score of 0.91, outperforming competitive baselines such as CUSUM+Histograms and ADWIN+Alpha Miner. Visual analyses further confirm that the identified drift points align with transitions in control-flow models and behavioral cluster structures. DriftXMiner thus offers a novel and interpretable solution for incremental concept drift detection and localization in dynamic, process-aware systems. By integrating statistical signal accumulation, temporal behavior profiling, and structural process mining, the framework enables fine-grained drift explanation and supports adaptive process intelligence in evolving environments. Its modular architecture supports extension to streaming data and real-time monitoring contexts.
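An illustrative sketch of the cumulative-deviation idea behind a component like the Cumulative Drift Scanner: a one-sided CUSUM over a per-window process metric (here, a synthetic replay-fitness-like score). This is a generic CUSUM detector, not DriftXMiner itself; the baseline window, slack, threshold, and data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
fitness = np.concatenate([rng.normal(0.95, 0.01, 60),   # stable process behavior
                          rng.normal(0.90, 0.01, 60)])  # incremental drift: fitness slowly degrades

baseline = fitness[:30].mean()       # reference level from an assumed drift-free warm-up window
slack, threshold = 0.01, 0.08        # tolerance per step and alarm level (assumed)

cusum, drift_at = 0.0, None
for i, x in enumerate(fitness):
    cusum = max(0.0, cusum + (baseline - x) - slack)   # accumulate only downward (directional) deviations
    if cusum > threshold and drift_at is None:
        drift_at = i                                    # first window where drift is flagged
print("drift flagged at window:", drift_at)
```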