Formation water samples from oil and gas fields may be contaminated during testing, trial production, collection, storage, transportation and analysis, so that the measured properties do not truly reflect the formation water. This paper discusses identification methods and a data credibility evaluation method for formation water in oil and gas fields of petroliferous basins within China. The results show that: (1) the identification methods for formation water include basic single-factor methods based on physical characteristics, water composition characteristics, water type characteristics, and characteristic coefficients, as well as a comprehensive data credibility evaluation method built on them, which relies mainly on correlation analysis of the sodium chloride coefficient and the desulfurization coefficient combined with an evaluation of the geological background; (2) the basic identification methods enable preliminary identification of hydrochemical data and preliminary screening of data on site, while the proposed comprehensive method performs the evaluation by classifying CaCl2-type water into types A-I to A-VI and NaHCO3-type water into types B-I to B-IV, so that researchers can evaluate the credibility of hydrochemical data in depth and analyze the influencing factors; (3) when the basic methods are used, formation water containing anions such as CO3^2-, OH- and NO3^-, or formation water whose sodium chloride coefficient and desulfurization coefficient do not match the geological setting, has been invaded by surface water or polluted by working fluid; (4) when the comprehensive method is used, the data credibility of A-I, A-II, B-I and B-II formation water, although regarded as high, can be evaluated effectively and accurately only when combined with analysis of the geological setting, including factors such as the formation environment, sampling conditions, condensate water, acid fluid, leaching of ancient weathering crust, and ancient atmospheric fresh water.
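Below is a minimal Python sketch of how the two characteristic coefficients named above are often computed from a water analysis. The milliequivalent-ratio definitions and the screening thresholds in the comments are common conventions assumed for illustration, not values taken from the paper.

```python
# Hedged sketch: two characteristic coefficients commonly used to screen
# formation-water analyses. The definitions below (milliequivalent ratios) are
# one conventional form and are assumptions, not the paper's; thresholds in the
# comments are illustrative only.

# Equivalent weights (mg per meq) for converting mg/L to meq/L.
EQUIV_WEIGHT = {"Na": 22.99, "Cl": 35.45, "SO4": 48.03}

def to_meq(conc_mg_l: float, ion: str) -> float:
    """Convert a concentration in mg/L to meq/L."""
    return conc_mg_l / EQUIV_WEIGHT[ion]

def characteristic_coefficients(na_mg_l: float, cl_mg_l: float, so4_mg_l: float):
    """Return (sodium chloride coefficient, desulfurization coefficient)."""
    r_na = to_meq(na_mg_l, "Na")
    r_cl = to_meq(cl_mg_l, "Cl")
    r_so4 = to_meq(so4_mg_l, "SO4")
    na_cl = r_na / r_cl                # rNa+/rCl-
    desulf = 100.0 * r_so4 / r_cl      # 100 * rSO4^2- / rCl-
    return na_cl, desulf

if __name__ == "__main__":
    na_cl, desulf = characteristic_coefficients(na_mg_l=21000, cl_mg_l=38000, so4_mg_l=350)
    # Well-preserved CaCl2-type formation water typically shows rNa/rCl < 1 and a
    # small desulfurization coefficient; values outside that range may hint at
    # surface-water invasion or working-fluid contamination (illustrative rule).
    print(f"rNa/rCl = {na_cl:.2f}, desulfurization coefficient = {desulf:.2f}")
```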
Formation pore pressure is the foundation of well planning, and it bears on the safety and efficiency of drilling operations in oil and gas development. However, the traditional method for predicting formation pore pressure applies post-drilling measurement data from nearby wells to the target well, which may not accurately reflect the formation pore pressure of the target well. In this paper, a novel method for predicting formation pore pressure ahead of the drill bit is proposed that embeds petrophysical theory into machine learning based on seismic and logging-while-drilling (LWD) data. Gated recurrent unit (GRU) and long short-term memory (LSTM) models were developed and validated using data from three wells in the Bohai Oilfield, and Shapley additive explanations (SHAP) were used to visualize and interpret the proposed models, providing insight into the relative importance and impact of the input features. The results show that among the eight models trained in this study, almost all prediction errors converge to within 0.05 g/cm^3, with the largest root mean square error (RMSE) being 0.03072 and the smallest 0.008964. Moreover, continuously updating the model with the growing training data during drilling operations can further improve accuracy. Compared with other approaches, this study depicts formation pore pressure accurately and precisely, while the SHAP analysis guides effective model refinement and feature engineering strategies. This work underscores the potential of integrating advanced machine learning techniques with domain-specific knowledge to enhance predictive accuracy for petroleum engineering applications.
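As a rough illustration of the modeling step described above, the following sketch sets up a small GRU regressor that maps a window of seismic/LWD-derived features to pore pressure. The feature count, window length, and training loop are placeholders, not the authors' architecture or data.

```python
# Hedged sketch (not the authors' implementation): a minimal GRU regressor that
# maps a window of seismic / LWD-derived features to formation pore pressure.
# Feature dimensions, window length, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class PorePressureGRU(nn.Module):
    def __init__(self, n_features: int = 6, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, depth_steps, n_features)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :])   # predict pressure at the last depth step

model = PorePressureGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy batch: 32 windows of 50 depth steps with 6 features each.
x = torch.randn(32, 50, 6)
y = torch.randn(32, 1)
for _ in range(5):                        # tiny training loop for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(f"RMSE on dummy batch: {loss.sqrt().item():.4f}")
```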
The widespread dolomite of the Sinian Dengying Formation in the Sichuan Basin (China) serves as one of the most important oil and gas reservoir rocks of the basin. Well WT1, an exploration well, was recently drilled in Kaijiang County, northeastern Sichuan Basin (SW China), and it penetrated the Dengying Formation dolomite over the depth interval of 7500–7580 m. In this study, samples were systematically collected from the cores of that interval and subjected to new analyses of carbon–oxygen isotopes, major elements, trace elements, rare earth elements (REEs) and electron probe microanalysis (EPMA). The Dengying Formation dolomites of Well WT1 have δ13C values of 0.37‰ to 2.91‰ and δ18O values of -5.72‰ to -2.73‰, indicating that the dolomitizing fluid was derived from contemporaneous seawater in a near-surface environment rather than a burial environment. Based on the REE patterns of the EPMA-based in-situ data, we recognize seawater-sourced components, mixed-source components and terrigenous-sourced components, indicating a marine origin of the dolomite with detrital contamination and diagenetic alteration. Moreover, high Al, Th, and Zr contents indicate significant detrital contamination derived from clay and quartz minerals, and high Sr/Ba and Sr/Cu ratios imply a relatively dry depositional environment with extremely high seawater salinity, intensive evaporation, and strong influence of terrigenous sediment.
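The Sr/Ba and Sr/Cu screening mentioned above reduces to simple ratio arithmetic; a hedged sketch follows. The cutoff values are commonly quoted rules of thumb and the element concentrations are invented, so the output is purely illustrative and not derived from the Well WT1 data.

```python
# Hedged sketch: screening depositional conditions with elemental ratios. The
# thresholds below (Sr/Cu > 10 suggesting a dry, hot climate; Sr/Ba > 1
# suggesting saline, evaporative water bodies) are rules of thumb assumed for
# illustration, not values from the paper.
def classify_environment(sr_ppm: float, ba_ppm: float, cu_ppm: float) -> str:
    sr_ba = sr_ppm / ba_ppm
    sr_cu = sr_ppm / cu_ppm
    salinity = "saline / evaporative" if sr_ba > 1.0 else "fresher water"
    climate = "dry, hot" if sr_cu > 10.0 else "warm, humid"
    return f"Sr/Ba={sr_ba:.2f} ({salinity}); Sr/Cu={sr_cu:.2f} ({climate})"

# Hypothetical trace-element concentrations in ppm.
print(classify_environment(sr_ppm=520.0, ba_ppm=60.0, cu_ppm=12.0))
```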
The carbonate-rich shale of the Permian Wujiaping Formation in the Sichuan Basin exhibits significant heterogeneity in its lithology and pore structure, which directly influences its potential for shale gas extraction. This study assesses the factors that govern pore heterogeneity by analyzing the mineral composition of the shale, as well as its pore types and their multifractal characteristics. Three primary shale facies (siliceous, mixed, and calcareous) are identified based on mineralogy, and their multifractal characteristics reveal strongly heterogeneous pore structures. The brittleness of siliceous shale, rich in quartz and pyrite, is favorable for hydraulic fracturing, while calcareous shale, with higher levels of calcite, exhibits reduced brittleness. Multifractal analysis based on nitrogen adsorption isotherms reveals complex pore structures across the shale facies, with siliceous shale showing better pore connectivity and uniformity. The pore types in these shales include organic matter pores, interparticle pores, and intraparticle pores, among which organic matter pores are the most abundant. Pore size distribution and connectivity are notably higher in siliceous shale than in calcareous shale, which exhibits a predominance of micropores and more isolated pore structures. Pore heterogeneity of the carbonate-rich shale in the Wujiaping Formation is governed primarily by its intrinsic mineral composition, carbonate diagenesis, mechanical compaction, and subsequent thermal maturation with micro-migration of organic matter. This study highlights the importance of mineral composition, especially the presence of dolomite and calcite, in shaping pore heterogeneity. These findings emphasize the critical role of shale lithofacies and pore structure in optimizing shale gas extraction methods.
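For readers unfamiliar with the multifractal treatment referenced above, the sketch below estimates generalized dimensions D_q from a pore-volume distribution by dyadic box counting. The synthetic lognormal distribution stands in for nitrogen-adsorption-derived data and is an assumption, as is the scale range.

```python
# Hedged sketch: estimating generalized (multifractal) dimensions D_q from a
# pore-volume distribution, in the spirit of the box-counting treatment often
# applied to nitrogen-adsorption data. Synthetic data; not the paper's method
# in detail.
import numpy as np

def generalized_dimensions(volume: np.ndarray, qs=(-2, -1, 0, 1, 2)):
    """volume: incremental pore volume over 2**k equal bins of the (log) size axis."""
    volume = volume / volume.sum()                   # normalized measure
    n = len(volume)
    ks = range(1, int(np.log2(n)) + 1)               # dyadic coarse-grainings
    dims = {}
    for q in qs:
        log_eps, log_chi = [], []
        for k in ks:
            boxes = volume.reshape(2 ** k, -1).sum(axis=1)  # mass per box
            boxes = boxes[boxes > 0]
            eps = 1.0 / 2 ** k
            if q == 1:                               # D_1 uses the entropy form
                log_chi.append(np.sum(boxes * np.log(boxes)))
            else:
                log_chi.append(np.log(np.sum(boxes ** q)))
            log_eps.append(np.log(eps))
        slope = np.polyfit(log_eps, log_chi, 1)[0]
        dims[q] = slope if q == 1 else slope / (q - 1)
    return dims

rng = np.random.default_rng(0)
synthetic = rng.lognormal(mean=0.0, sigma=1.0, size=256)  # stand-in pore-volume increments
print(generalized_dimensions(synthetic))
```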
Sign language datasets are essential for sign language recognition and translation (SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements for SLRT. However, building a large-scale and diverse sign language dataset is difficult because sign language data on the Internet is scarce. Moreover, when building such a dataset, some of the collected sign language data is not of acceptable quality. This paper proposes a two-information-stream transformer (TIST) model to judge whether the quality of sign language data is acceptable. To verify that TIST effectively improves sign language recognition (SLR), we build two datasets, a screened dataset and an unscreened dataset. In the experiments, visual alignment constraint (VAC) is used as the baseline model. The experimental results show that the screened dataset achieves a better word error rate (WER) than the unscreened dataset.
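Word error rate, the metric used to compare the screened and unscreened datasets, is the standard Levenshtein-based measure sketched below; the example gloss sequences are made up.

```python
# Hedged sketch: standard word error rate (WER) via edit distance. The toy
# reference/hypothesis gloss sequences are hypothetical.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(reference)

print(wer(["I", "LOVE", "SIGN", "LANGUAGE"], ["I", "LOVE", "LANGUAGE"]))  # 0.25
```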
Cyber-Physical Systems (CPS) supported by Wireless Sensor Networks (WSN) help factories collect data and achieve seamless communication between physical and virtual components. Sensor nodes are energy-constrained devices, and their energy consumption is typically correlated with the amount of data collected. The purpose of data aggregation is to reduce data transmission, lower energy consumption, and reduce network congestion. For large-scale WSN, data aggregation can greatly improve network efficiency. However, when large volumes of heterogeneous data are poured into a specific area at the same time, data loss sometimes occurs, resulting in incomplete and irregular production data. This paper proposes an information processing model that encompasses the Energy-Conserving Data Aggregation Algorithm (ECDA) and the Efficient Message Reception Algorithm (EMRA). ECDA is divided into two stages: energy conservation based on the global cost, and data aggregation based on ant colony optimization. EMRA comprises the Polling Message Reception Algorithm (PMRA), the Shortest Time Message Reception Algorithm (STMRA), and the Specific Condition Message Reception Algorithm (SCMRA). These algorithms not only accommodate the regularity and directionality of sensor information transmission, but also satisfy the differing requirements of small factory environments. Compared with the recent HPSO-ILEACH and E-PEGASIS, ECDA effectively reduces energy consumption. Experimental results show that STMRA consumes 1.3 times the time of SCMRA, and both optimization algorithms exhibit higher time efficiency than PMRA. Furthermore, this paper also evaluates the three reception algorithms using the analytic hierarchy process (AHP).
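The AHP evaluation mentioned at the end can be reproduced in a few lines: priority weights come from the principal eigenvector of a pairwise comparison matrix, with a consistency ratio as a sanity check. The comparison matrix and criteria below are hypothetical, not the paper's judgments.

```python
# Hedged sketch: AHP priority weights and consistency ratio. The pairwise
# comparison matrix and criteria are invented for illustration.
import numpy as np

def ahp_weights(pairwise: np.ndarray):
    """Return priority weights (principal eigenvector) and the consistency ratio."""
    vals, vecs = np.linalg.eig(pairwise)
    k = np.argmax(vals.real)
    w = np.abs(vecs[:, k].real)
    w = w / w.sum()
    n = pairwise.shape[0]
    ci = (vals.real[k] - n) / (n - 1)                    # consistency index
    ri = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty random index
    return w, (ci / ri if ri else 0.0)

# Criteria (hypothetical): time efficiency vs. energy cost vs. message completeness.
A = np.array([[1.0, 3.0, 2.0],
              [1 / 3, 1.0, 1 / 2],
              [1 / 2, 2.0, 1.0]])
weights, cr = ahp_weights(A)
print("weights:", np.round(weights, 3), "consistency ratio:", round(cr, 3))
```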
The Pressure-Volume-Temperature (PVT) properties of crude oil are typically determined through laboratory analysis during the early phases of exploration and field development. However, due to the extensive data required, the time-consuming nature, and the high costs, laboratory methods are often not preferred. Machine learning, with its efficiency and rapid convergence, has emerged as a promising alternative for PVT property estimation. This study employs the modified particle swarm optimization-based group method of data handling (PSO-GMDH) to develop predictive models for estimating both the oil formation volume factor (OFVF) and the bubble point pressure (Pb). Data from the Mpyo oil field in Uganda were used to create the models. The input parameters included the solution gas-oil ratio (Rs), oil American Petroleum Institute gravity (API), specific gravity (SG), and reservoir temperature (T). The results demonstrated that PSO-GMDH outperformed backpropagation neural networks (BPNN) and radial basis function neural networks (RBFNN), achieving higher correlation coefficients and lower prediction errors during training and testing. For OFVF prediction, PSO-GMDH yielded a correlation coefficient (R) of 0.9979 (training) and 0.9876 (testing), with corresponding root mean square error (RMSE) values of 0.0021 and 0.0099, and mean absolute error (MAE) values of 0.00055 and 0.00256, respectively. For Pb prediction, R was 0.9994 (training) and 0.9876 (testing), with RMSE values of 6.08 and 8.26, and MAE values of 1.35 and 2.63. The study also revealed that Rs impacts OFVF and Pb predictions significantly more than the other input parameters. The models followed physical laws and remained stable, demonstrating that PSO-GMDH is a robust and efficient method for predicting OFVF and Pb, offering a time- and cost-effective alternative.
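The reported fit statistics (R, RMSE, MAE) are standard and can be computed as in the sketch below; the PSO-GMDH model itself is not reproduced, and the bubble point pressure values are invented.

```python
# Hedged sketch: the fit statistics quoted in the abstract (R, RMSE, MAE),
# computed with NumPy on made-up bubble point pressure values.
import numpy as np

def fit_statistics(y_true: np.ndarray, y_pred: np.ndarray):
    r = np.corrcoef(y_true, y_pred)[0, 1]             # correlation coefficient R
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean square error
    mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
    return r, rmse, mae

y_true = np.array([1450.0, 1620.0, 1710.0, 1980.0, 2105.0])  # hypothetical Pb values
y_pred = np.array([1462.0, 1598.0, 1725.0, 1969.0, 2120.0])
r, rmse, mae = fit_statistics(y_true, y_pred)
print(f"R={r:.4f}, RMSE={rmse:.2f}, MAE={mae:.2f}")
```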
In this paper, we address a cross-layer resilient control issue for a class of multi-spacecraft system (MSS) under attack. Malicious attackers use false data injection (FDI) attacks to prevent the MSS from reaching consensus. To ensure the effectiveness of control, the defender embedded in the MSS preliminarily allocates defense resources among the spacecraft. The attacker then selects its target spacecraft on which to mount FDI attacks so as to achieve maximum damage. In the physical layer, a Nash equilibrium (NE) control strategy is proposed for the MSS to quantify system performance under the effect of attacks by solving a game problem. In the cyber layer, a fuzzy Stackelberg game framework is used to examine the rivalry between the attacker and the defender. The strategies of both attacker and defender are given based on the analysis of the physical and cyber layers. Finally, a simulation example is used to test the viability of the proposed cross-layer fuzzy game algorithm.
Conventional error cancellation approaches separate molecules into smaller fragments and sum the errors of all fragments to counteract the overall computational error of the parent molecules. However, these approaches may be ineffective for systems with strong localized chemical effects, as fragmenting specific substructures into simpler chemical bonds can introduce additional errors instead of mitigating them. To address this issue, we propose the Substructure-Preserved Connection-Based Hierarchy (SCBH), a method that automatically identifies and freezes substructures with significant local chemical effects prior to molecular fragmentation. SCBH is validated on gas-phase enthalpy of formation calculations for CHNO molecules. Therein, based on the atomization scheme, the reference and test values are derived at the Gaussian-4 (G4) and M062X/6-31+G(2df,p) levels, respectively. Compared to commonly used approaches, SCBH reduces the average computational error by half and requires only 15% of the computational cost of G4 to achieve comparable accuracy. Since different types of local-effect structures influence the gas-phase enthalpy of formation to different degrees, substituents with strong electronic effects should be retained preferentially. SCBH can be readily extended to diverse classes of organic compounds. Its workflow and source code allow flexible customization of molecular moieties, including azide, carboxyl, trinitromethyl, phenyl, and others. This strategy facilitates accurate, rapid, and automated computations and corrections, making it well-suited for high-throughput molecular screening and dataset construction for the gas-phase enthalpy of formation.
Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component of the loss function. Additionally, we assessed the downstream utility of the imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations, making it a promising solution for robust data recovery in clinical applications.
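A minimal sketch of the composite loss idea follows, assuming a masked MSE on observed entries, a denoising-style noise term, and a variance-matching penalty. The exact forms of terms (ii) and (iii) and the weighting factors are assumptions, not the paper's definitions.

```python
# Hedged sketch: a composite imputation loss in the spirit described above, with
# (i) masked MSE on observed entries, (ii) a noise-aware term, and (iii) a
# variance penalty. Term forms and weights are assumptions.
import torch

def composite_imputation_loss(x_hat, x, observed_mask, x_noisy_hat=None,
                              lam_noise=0.1, lam_var=0.01):
    # (i) masked MSE: reconstruct only entries that are actually observed.
    mse = ((x_hat - x) ** 2 * observed_mask).sum() / observed_mask.sum().clamp(min=1)
    # (ii) noise-aware regularization: reconstruction from a corrupted input
    # should stay close to the clean reconstruction (denoising-style term).
    noise = ((x_noisy_hat - x_hat) ** 2).mean() if x_noisy_hat is not None else 0.0
    # (iii) variance penalty: discourage collapsed, near-constant reconstructions.
    var_pen = (x.var(dim=0) - x_hat.var(dim=0)).abs().mean()
    return mse + lam_noise * noise + lam_var * var_pen

# Toy usage with random data and roughly 30% of entries treated as missing.
x = torch.randn(64, 10)
mask = (torch.rand_like(x) > 0.3).float()
x_hat = x + 0.1 * torch.randn_like(x)            # stand-in decoder output
x_noisy_hat = x_hat + 0.05 * torch.randn_like(x)
print(composite_imputation_loss(x_hat, x, mask, x_noisy_hat).item())
```

The variance penalty here matches feature-wise variances of input and reconstruction; other formulations are possible, and the weights lam_noise and lam_var would normally be tuned.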
Cholelithiasis has a complex pathogenesis, necessitating better therapeutic and preventive strategies. We recently read with interest Wang et al's study on lysine acetyltransferase 2A (KAT2A)-mediated adenosine monophosphate-activated protein kinase (AMPK) succinylation in cholelithiasis. Using mouse models and gallbladder mucosal epithelial cells, they found that KAT2A inhibits gallstone formation through AMPK K170 succinylation, thereby activating the AMPK/silent information regulator 1 pathway to reduce inflammation and pyroptosis. This study is the first to connect lysine succinylation with cholelithiasis, offering new insights and identifying succinylation as a potential therapeutic target. Future research should confirm these findings using patient samples, investigate other post-translational modifications, and use structural biology to clarify succinylation-induced conformational changes, thereby bridging basic research to clinical applications.
Pre-chamber ignition technology can address the issue of uneven in-cylinder mixture combustion in large-bore marine engines. The impact of various pre-chamber structures on the formation of the mixture and jet flames within the pre-chamber is explored. This study performed numerical simulations on a large-bore marine ammonia/hydrogen pre-chamber engine prototype, considering the pre-chamber volume, the throat diameter, the distance between the hydrogen injector and the spark plug, and the hydrogen injector angle. Compared with the original engine, when the pre-chamber volume is 73.4 ml, the throat diameter is 14 mm, the distance ratio is 0.92, and the hydrogen injector angle is 80°, the peak pressure in the pre-chamber increases by 23.1% and that in the main chamber by 46.3%. The results indicate that the performance of the original engine is greatly enhanced by altering its fuel and pre-chamber structure.
Modern intrusion detection systems (MIDS) face persistent challenges in coping with the rapid evolution of cyber threats, high-volume network traffic, and imbalanced datasets. Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively. This study introduces an advanced, explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets, which reflect real-world network behavior through a blend of normal and diverse attack classes. The methodology begins with sophisticated data preprocessing, incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions, ensuring standardized and model-ready inputs. Critical dimensionality reduction is achieved via the Harris Hawks Optimization (HHO) algorithm, a nature-inspired metaheuristic modeled on hawks' hunting strategies. HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance. Following feature selection, SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types. A stacked architecture is then employed, combining the strengths of XGBoost, SVM, and RF as base learners. This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers. The model was evaluated using standard classification metrics: precision, recall, F1-score, and overall accuracy. The best overall performance was recorded with an accuracy of 99.44% on UNSW-NB15, demonstrating the model's effectiveness. After balancing, the model demonstrated a clear improvement in detecting the attacks. We tested the model on four datasets to show the effectiveness of the proposed approach and performed an ablation study to check the effect of each parameter. The proposed model is also computationally efficient. To support transparency and trust in decision-making, explainable AI (XAI) techniques are incorporated that provide both global and local insight into feature contributions and offer intuitive visualizations for individual predictions. This makes the framework suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
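The class-balancing and stacking stages described above can be sketched with off-the-shelf libraries as follows. The HHO feature-selection step is omitted, the meta-learner (logistic regression) is an assumption, and the synthetic data merely stands in for the IDS datasets.

```python
# Hedged sketch: SMOTE on the training split plus an XGBoost / SVM / RF stack.
# Not the paper's exact pipeline; data and meta-learner are assumptions.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance training data only

stack = StackingClassifier(
    estimators=[("xgb", XGBClassifier(eval_metric="logloss")),
                ("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier(n_estimators=200))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_bal, y_bal)
print(classification_report(y_te, stack.predict(X_te)))
```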
Reversible data hiding (RDH) enables secret data embedding while preserving complete cover image recovery, making it crucial for applications requiring image integrity. The pixel value ordering (PVO) technique used in multi-stego images provides good image quality but often results in low embedding capacity. To address these challenges, this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image. The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order. Four secret bits are embedded into each block's maximum pixel value, while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold. A similar embedding strategy is also applied to the minimum side of the block, including the second-smallest pixel value. This design enables each block to embed up to 14 bits of secret data. Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches, advancing the field of reversible steganography.
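For context, the sketch below implements the classic maximum-side PVO rule that schemes of this kind build on: one bit per block, embedded only when the gap between the largest and second-largest pixel equals one. The proposed triple-stego, multi-bit extension is not reproduced here.

```python
# Hedged sketch: classic maximum-side PVO embedding (single bit, single stego
# image), not the multi-bit triple-stego scheme proposed above.
import numpy as np

def pvo_embed_max(block: np.ndarray, bit: int):
    """Embed one bit into a block's largest pixel when max - 2nd max == 1."""
    flat = block.flatten()
    order = np.argsort(flat, kind="stable")      # ascending pixel order
    i_max, i_2nd = order[-1], order[-2]
    d = int(flat[i_max]) - int(flat[i_2nd])
    if d == 1:                                   # embeddable bin: carry the bit
        flat[i_max] += bit
        embedded = True
    elif d > 1:                                  # shift to keep extraction unambiguous
        flat[i_max] += 1
        embedded = False
    else:                                        # d == 0: leave the block unchanged
        embedded = False
    return flat.reshape(block.shape), embedded

block = np.array([[52, 54], [55, 56]], dtype=np.uint8)
stego, ok = pvo_embed_max(block, bit=1)
print(stego, "bit embedded:", ok)
```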
With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, spanning a range from non-encrypted to fully encrypted equipment. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed from various perspectives using two ensemble models and three deep learning (DL) models. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the F1-score of encrypted traffic was approximately 0.98, which is 4.3% higher than that of unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, the recall in the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and in the CICIoT-2023 (Encrypted) dataset by 20.26%, a similar level of improvement. Notably, in CICIoT-2023, the F1-score and Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments. However, the extent of the improvement may vary depending on data quality, model architecture, and sampling strategy.
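A hedged sketch of the sampling comparison follows: several imblearn samplers are applied to an imbalanced, metadata-style feature set and a random-forest detector is scored with F1 and ROC-AUC. The samplers, classifier, and synthetic data are illustrative choices rather than the paper's exact pipeline.

```python
# Hedged sketch: comparing data-sampling strategies for imbalanced,
# metadata-style intrusion detection. Synthetic data; illustrative only.
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=15, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

samplers = {"none": None,
            "SMOTE": SMOTE(random_state=1),
            "oversample": RandomOverSampler(random_state=1),
            "undersample": RandomUnderSampler(random_state=1)}

for name, sampler in samplers.items():
    X_s, y_s = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_s, y_s)
    f1 = f1_score(y_te, clf.predict(X_te))
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name:12s} f1={f1:.3f} roc-auc={auc:.3f}")
```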
The increasing complexity of China's electricity market creates substantial challenges for settlement automation, data consistency, and operational scalability. Existing provincial settlement systems are fragmented, lack a unified data structure, and depend heavily on manual intervention to process high-frequency and retroactive transactions. To address these limitations, a graph-based unified settlement framework is proposed to enhance automation, flexibility, and adaptability in electricity market settlements. A flexible attribute-graph model is employed to represent heterogeneous multi-market data, enabling standardized integration, rapid querying, and seamless adaptation to evolving business requirements. An extensible operator library is designed to support configurable settlement rules, and a suite of modular tools (including dataset generation, formula configuration, billing templates, and task scheduling) facilitates end-to-end automated settlement processing. A robust refund-clearing mechanism is further incorporated, utilizing sandbox execution, data-version snapshots, dynamic lineage tracing, and real-time change-capture technologies to enable rapid and accurate recalculations under dynamic policy and data revisions. Case studies based on real-world data from regional Chinese markets validate the effectiveness of the proposed approach, demonstrating marked improvements in computational efficiency, system robustness, and automation. Moreover, enhanced settlement accuracy and high temporal granularity improve price-signal fidelity, promote cost-reflective tariffs, and incentivize energy-efficient and demand-responsive behavior among market participants. The method not only supports equitable and transparent market operations but also provides a generalizable, scalable foundation for modern electricity settlement platforms in increasingly complex and dynamic market environments.
Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES, combining text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine-learning approach, selecting the top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R2 of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R2 to 88.95%. In Experiment 4, a data-efficient training approach was introduced in which the training data portion increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R2 of 85.49%, emphasizing an effective trade-off between performance and computational cost. These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
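The hybrid-feature idea can be illustrated as follows: each essay is compared with a reference answer through a text-overlap score and a TF-IDF cosine similarity, and the resulting features feed a Random Forest regressor. The tiny English toy corpus, the invented scores, and the omission of the embedding-based feature are all assumptions.

```python
# Hedged sketch: text-based and vector-based similarity features for essay
# scoring with a Random Forest regressor. Toy data; the embedding-based feature
# used in the paper is omitted here.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_features(essay: str, reference: str, vectorizer: TfidfVectorizer):
    tokens_e, tokens_r = set(essay.split()), set(reference.split())
    overlap = len(tokens_e & tokens_r) / max(len(tokens_r), 1)   # text-based
    tfidf = vectorizer.transform([essay, reference])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]         # vector-based
    return [overlap, cosine]

reference = "solar energy is renewable and reduces pollution"
essays = ["solar energy is renewable and cuts pollution",
          "wind farms are large",
          "solar power is a renewable source that reduces pollution"]
scores = [4.5, 1.0, 5.0]                                         # hypothetical human scores

vec = TfidfVectorizer().fit(essays + [reference])
X = [similarity_features(e, reference, vec) for e in essays]
model = RandomForestRegressor(random_state=0).fit(X, scores)
print(model.predict([similarity_features("solar energy reduces pollution", reference, vec)]))
```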
Objective expertise evaluation of individuals, as a prerequisite stage for team formation, has been a long-term desideratum in large software development companies. With the rapid advancements in machine learning methods, and with reliable data already stored in project management tools' datasets, automating this evaluation process becomes a natural step forward. In this context, our approach focuses on quantifying software developer expertise by using metadata from task-tracking systems. For this, we mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge of the software industry. Afterward, we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.
With the rapid growth of biomedical data, particularly multi-omics data including genomics, transcriptomics, proteomics, metabolomics, and epigenomics, medical research and clinical decision-making confront both new opportunities and obstacles. The huge and diversified nature of these datasets cannot always be managed using traditional data analysis methods. As a consequence, deep learning has emerged as a strong tool for analysing multi-omics data due to its ability to handle complex and non-linear relationships. This paper explores the fundamental concepts of deep learning and how they are used in multi-omics medical data mining. We demonstrate how autoencoders, variational autoencoders, multimodal models, attention mechanisms, transformers, and graph neural networks enable pattern analysis and recognition across omics data. Deep learning has been found to be effective in disease classification, biomarker identification, gene network learning, and therapeutic efficacy prediction. We also consider critical problems such as data quality, model explainability, reproducibility of findings, and computational power requirements. We then consider future directions, including combining omics with clinical and imaging data, explainable AI, federated learning, and real-time diagnostics. Overall, this study emphasises the need for collaboration across disciplines to advance deep learning-based multi-omics research for precision medicine and for understanding complicated disorders.
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
基金Supported by the PetroChina Science and Technology Project(2023ZZ0202)。
文摘The formation water sample in oil and gas fields may be polluted in processes of testing, trial production, collection, storage, transportation and analysis, making the properties of formation water not be reflected truly. This paper discusses identification methods and the data credibility evaluation method for formation water in oil and gas fields of petroliferous basins within China. The results of the study show that: (1) the identification methods of formation water include the basic methods of single factors such as physical characteristics, water composition characteristics, water type characteristics, and characteristic coefficients, as well as the comprehensive evaluation method of data credibility proposed on this basis, which mainly relies on the correlation analysis sodium chloride coefficient and desulfurization coefficient and combines geological background evaluation;(2) The basic identifying methods for formation water enable the preliminary identification of hydrochemical data and the preliminary screening of data on site, the proposed comprehensive method realizes the evaluation by classifying the CaCl2-type water into types A-I to A-VI and the NaHCO3-type water into types B-I to B-IV, so that researchers can make in-depth evaluation on the credibility of hydrochemical data and analysis of influencing factors;(3) When the basic methods are used to identify the formation water, the formation water containing anions such as CO_(3)^(2-), OH- and NO_(3)^(-), or the formation water with the sodium chloride coefficient and desulphurization coefficient not matching the geological setting, are all invaded with surface water or polluted by working fluid;(4) When the comprehensive method is used, the data credibility of A-I, A-II, B-I and B-II formation water can be evaluated effectively and accurately only if the geological setting analysis in respect of the factors such as formation environment, sampling conditions, condensate water, acid fluid, leaching of ancient weathering crust, and ancient atmospheric fresh water, is combined, although such formation water is believed with high credibility.
基金supported by the National Natural Science Foundation of China(Grant numbers:52174012,52394250,52394255,52234002,U22B20126,51804322).
文摘Formation pore pressure is the foundation of well plan,and it is related to the safety and efficiency of drilling operations in oil and gas development.However,the traditional method for predicting formation pore pressure involves applying post-drilling measurement data from nearby wells to the target well,which may not accurately reflect the formation pore pressure of the target well.In this paper,a novel method for predicting formation pore pressure ahead of the drill bit by embedding petrophysical theory into machine learning based on seismic and logging-while-drilling(LWD)data was proposed.Gated recurrent unit(GRU)and long short-term memory(LSTM)models were developed and validated using data from three wells in the Bohai Oilfield,and the Shapley additive explanations(SHAP)were utilized to visualize and interpret the models proposed in this study,thereby providing valuable insights into the relative importance and impact of input features.The results show that among the eight models trained in this study,almost all model prediction errors converge to 0.05 g/cm^(3),with the largest root mean square error(RMSE)being 0.03072 and the smallest RMSE being 0.008964.Moreover,continuously updating the model with the increasing training data during drilling operations can further improve accuracy.Compared to other approaches,this study accurately and precisely depicts formation pore pressure,while SHAP analysis guides effective model refinement and feature engineering strategies.This work underscores the potential of integrating advanced machine learning techniques with domain-specific knowledge to enhance predictive accuracy for petroleum engineering applications.
基金financially supported by the Science Foundation of China University of Petroleum,Beijing(Nos.2462018YJRC030 and 2462020YXZZ020)the China Sponsorship Council(No.202306440071)。
文摘The widespread dolomite of the Sinian Dengying Formation in the Sichuan Basin(China)serves as one of the most important oil and gas reservoir rocks of the basin.Well WT1,as an exploration well,is recently drilled in the Kaijiang County,northeastern Sichuan Basin(SW China),and it drills through the Dengying Formation dolomite at the depth interval of 7500–7580 m.In this study,samples are systematically collected from the cores of that interval,followed by new analyses of carbon-oxygen isotope,major elements,trace elements,rare earth elements(REEs)and EP-MA.The Dengying Formation dolomites of Well WT1 haveδ13C values of 0.37‰to 2.91‰andδ18O values of-5.72‰to-2.73‰,indicating that the dolomitization fluid is derived from contemporary seawater in the near-surface environment,rather than the burial environment.Based on the REE patterns of EPMA-based in-situ data,we recognized the seawater-sourced components,the mixedsourced components and the terrigenous-sourced components,indicating the marine origin of the dolomite with detrital contamination and diagenetic alteration.Moreover,high Al,Th,and Zr contents indicate significant detrital contamination derived from clay and quartz minerals,and high Sr/Ba and Sr/Cu ratios imply a relatively dry depositional environment with extremely high seawater salinity,intensive evaporation,and strong influences of terrigenous sediment.
基金supported by research projects P23057 and JKK4624004 of Jianghan Oilfield Company,SINOPEC.
文摘The carbonate-rich shale of the Permian Wujiaping Formation in Sichuan Basin exhibits significant heterogeneity in its lithology and pore structure,which directly influence its potential for shale gas extraction.This study assesses the factors that govern pore heterogeneity by analyzing the mineral composition of the shale,as well as its pore types and their multifractal characteristics.Three primary shale facies-siliceous,mixed,and calcareous-are identified based on mineralogy,and their multifractal characteristics reveal strongly heterogeneous pore structures.The brittleness of siliceous shale,rich in quartz and pyrite,is favorable for hydraulic fracturing;while calcareous shale,with higher levels of calcite,exhibits reduced brittleness.Multifractal analysis,using nitrogen adsorption isotherms,reveals complex pore structures across different shale facies,with siliceous shale showing better pore connectivity and uniformity.The types of pores in shales include organic matter pores,interparticle pores,and intraparticle pores,among which organic matter pores are the most abundant.Pore size distribution and connectivity are notably higher in siliceous shale compared to calcareous shale,which exhibit a predominance of micropores and more isolated pore structures.Pore heterogeneity of the carbonate-rich shale in the Wujiaping Formation is primarily governed by its intrinsic mineral composition,carbonate diagenesis,mechanical compaction,and its subsequent thermal maturation with the micro-migration of organic matter.This study highlights the importance of mineral composition,especially the presence of dolomite and calcite,in shaping pore heterogeneity.These findings emphasize the critical role of shale lithofacies and pore structure in optimizing shale gas extraction methods.
基金supported by the National Language Commission to research on sign language data specifications for artificial intelligence applications and test standards for language service translation systems (No.ZDI145-70)。
文摘Sign language dataset is essential in sign language recognition and translation(SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements for SLRT. However, making a large-scale and diverse sign language dataset is difficult as sign language data on the Internet is scarce. In making a large-scale and diverse sign language dataset, some sign language data qualities are not up to standard. This paper proposes a two information streams transformer(TIST) model to judge whether the quality of sign language data is qualified. To verify that TIST effectively improves sign language recognition(SLR), we make two datasets, the screened dataset and the unscreened dataset. In this experiment, this paper uses visual alignment constraint(VAC) as the baseline model. The experimental results show that the screened dataset can achieve better word error rate(WER) than the unscreened dataset.
基金Funds for High-Level Talents Programof Xi’an International University(Grant No.XAIU202411).
文摘The Cyber-Physical Systems (CPS) supported by Wireless Sensor Networks (WSN) helps factories collect data and achieve seamless communication between physical and virtual components. Sensor nodes are energy-constrained devices. Their energy consumption is typically correlated with the amount of data collection. The purpose of data aggregation is to reduce data transmission, lower energy consumption, and reduce network congestion. For large-scale WSN, data aggregation can greatly improve network efficiency. However, as many heterogeneous data is poured into a specific area at the same time, it sometimes causes data loss and then results in incompleteness and irregularity of production data. This paper proposes an information processing model that encompasses the Energy-Conserving Data Aggregation Algorithm (ECDA) and the Efficient Message Reception Algorithm (EMRA). ECDA is divided into two stages, Energy conservation based on the global cost and Data aggregation based on ant colony optimization. The EMRA comprises the Polling Message Reception Algorithm (PMRA), the Shortest Time Message Reception Algorithm (STMRA), and the Specific Condition Message Reception Algorithm (SCMRA). These algorithms are not only available for the regularity and directionality of sensor information transmission, but also satisfy the different requirements in small factory environments. To compare with the recent HPSO-ILEACH and E-PEGASIS, DCDA can effectively reduce energy consumption. Experimental results show that STMRA consumes 1.3 times the time of SCMRA. Both optimization algorithms exhibit higher time efficiency than PMRA. Furthermore, this paper also evaluates these three algorithms using AHP.
基金support from the Chinese Scholarship Council(Grant No.2022GXZ005733)。
文摘The Pressure-Volume-Temperature(PVT)properties of crude oil are typically determined through laboratory analysis during the early phases of exploration and fielddevelopment.However,due to extensive data required,time-consuming nature,and high costs,laboratory methods are often not preferred.Machine learning,with its efficiencyand rapid convergence,has emerged as a promising alternative for PVT properties estimation.This study employs the modified particle swarm optimization-based group method of data handling(PSO-GMDH)to develop predictive models for estimating both the oil formation volume factor(OFVF)and bubble point pressure(P_(b)).Data from the Mpyo oil fieldin Uganda were used to create the models.The input parameters included solution gas-oil ratio(R_(s)),oil American Petroleum Institute gravity(API),specificgravity(SG),and reservoir temperature(T).The results demonstrated that PSO-GMDH outperformed backpropagation neural networks(BPNN)and radial basis function neural networks(RBFNN),achieving higher correlation coefficientsand lower prediction errors during training and testing.For OFVF prediction,PSO-GMDH yielded a correlation coefficient(R)of 0.9979(training)and 0.9876(testing),with corresponding root mean square error(RMSE)values of 0.0021 and 0.0099,and mean absolute error(MAE)values of 0.00055 and 0.00256,respectively.For P_(b)prediction,R was 0.9994(training)and 0.9876(testing),with RMSE values of 6.08 and 8.26,and MAE values of 1.35 and 2.63.The study also revealed that R_(s)significantlyimpacts OFVF and P_(b)predictions compared to other input parameters.The models followed physical laws and remained stable,demonstrating that PSO-GMDH is a robust and efficientmethod for predicting OFVF and P_(b),offering a time and cost-effective alternative.
基金supported by the Natural Science Foundation of China(62073268,62122063,62203360)the Young Star of Science and Technology in Shaanxi Province(2020KJXX-078).
文摘In this paper,we address a cross-layer resilient control issue for a kind of multi-spacecraft system(MSS)under attack.Attackers with bad intentions use the false data injection(FDI)attack to prevent the MSS from reaching the goal of consensus.In order to ensure the effectiveness of the control,the embedded defender in MSS preliminarily allocates the defense resources among spacecrafts.Then,the attacker selects its target spacecrafts to mount FDI attack to achieve the maximum damage.In physical layer,a Nash equilibrium(NE)control strategy is proposed for MSS to quantify system performance under the effect of attacks by solving a game problem.In cyber layer,a fuzzy Stackelberg game framework is used to examine the rivalry process between the attacker and defender.The strategies of both attacker and defender are given based on the analysis of physical layer and cyber layer.Finally,a simulation example is used to test the viability of the proposed cross layer fuzzy game algorithm.
基金the support of the National Natural Science Foundation of China(22575230)。
文摘Conventional error cancellation approaches separate molecules into smaller fragments and sum the errors of all fragments to counteract the overall computational error of the parent molecules.However,these approaches may be ineffective for systems with strong localized chemical effects,as fragmenting specific substructures into simpler chemical bonds can introduce additional errors instead of mitigating them.To address this issue,we propose the Substructure-Preserved Connection-Based Hierarchy(SCBH),a method that automatically identifies and freezes substructures with significant local chemical effects prior to molecular fragmentation.The SCBH is validated by the gas-phase enthalpy of formation calculation of CHNO molecules.Therein,based on the atomization scheme,the reference and test values are derived at the levels of Gaussian-4(G4)and M062X/6-31+G(2df,p),respectively.Compared to commonly used approaches,SCBH reduces the average computational error by half and requires only15%of the computational cost of G4 to achieve comparable accuracy.Since different types of local effect structures have differentiated influences on gas-phase enthalpy of formation,substituents with strong electronic effects should be retained preferentially.SCBH can be readily extended to diverse classes of organic compounds.Its workflow and source code allow flexible customization of molecular moieties,including azide,carboxyl,trinitromethyl,phenyl,and others.This strategy facilitates accurate,rapid,and automated computations and corrections,making it well-suited for high-throughput molecular screening and dataset construction for gas-phase enthalpy of formation.
文摘Missing data presents a crucial challenge in data analysis,especially in high-dimensional datasets,where missing data often leads to biased conclusions and degraded model performance.In this study,we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision.The proposed loss combines(i)a guided,masked mean squared error focusing on missing entries;(ii)a noise-aware regularization term to improve resilience against data corruption;and(iii)a variance penalty to encourage expressive yet stable reconstructions.We evaluate the proposed model across four missingness mechanisms,such as Missing Completely at Random,Missing at Random,Missing Not at Random,and Missing Not at Random with quantile censorship,under systematically varied feature counts,sample sizes,and missingness ratios ranging from 5%to 60%.Four publicly available real-world datasets(Stroke Prediction,Pima Indians Diabetes,Cardiovascular Disease,and Framingham Heart Study)were used,and the obtained results show that our proposed model consistently outperforms baseline methods,including traditional and deep learning-based techniques.An ablation study reveals the additive value of each component in the loss function.Additionally,we assessed the downstream utility of imputed data through classification tasks,where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios.The model demonstrates strong scalability and robustness,improving performance with larger datasets and higher feature counts.These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations,making it a promising solution for robust data recovery in clinical applications.
基金Supported by Wenzhou Science and Technology Bureau,No.Y20240207.
文摘Cholelithiasis has a complex pathogenesis,necessitating better therapeutic and preventive strategies.We recently read with interest Wang et al’s study on lysine acetyltransferase 2A(KAT2A)-mediated adenosine monophosphate-activated protein kinase(AMPK)succinylation in cholelithiasis.Using mouse models and gallbladder mucosal epithelial cells,they found that KAT2A inhibits gallstones through AMPK K170 succinylation,thereby activating the AMPK/silent information regulator 1 pathway to reduce inflammation and pyroptosis.This study is the first to connect lysine succinylation with cholelithiasis,offering new insights and identifying succinylation as a potential therapeutic target.Future research should confirm these findings using patient samples,investigate other posttranslational modifications,and use structural biology to clarify succinylationinduced conformational changes,thereby bridging basic research to clinical applications.
基金Supported by the Priority Academic Program Development of Jiangsu Higher Education Institutions under Grant No.014000319/2018-00391.
文摘Pre-chamber ignition technology can address the issue of uneven in-cylinder mixture combustion in large-bore marine engines.The impact of various pre-chamber structures on the formation of the mixture and jet flames within the pre-chamber is explored.This study performed numerical simulations on a large-bore marine ammonia/hydrogen pre-chamber engine prototype,considering pre-chamber volume,throat diameter,the distance between the hydrogen injector and the spark plug,and the hydrogen injector angle.Compared with the original engine,when the pre-chamber volume is 73.4 ml,the throat diameter is 14 mm,the distance ratio is 0.92,and the hydrogen injector angle is 80°.Moreover,the peak pressure in the pre-chamber increased by 23.1%,and that in the main chamber increased by 46.3%.The results indicate that the performance of the original engine is greatly enhanced by altering its fuel and pre-chamber structure.
基金funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2025R104)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Modern intrusion detection systems(MIDS)face persistent challenges in coping with the rapid evolution of cyber threats,high-volume network traffic,and imbalanced datasets.Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively.This study introduces an advanced,explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets,which reflects real-world network behavior through a blend of normal and diverse attack classes.The methodology begins with sophisticated data preprocessing,incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions,ensuring standardized and model-ready inputs.Critical dimensionality reduction is achieved via the Harris Hawks Optimization(HHO)algorithm—a nature-inspired metaheuristic modeled on hawks’hunting strategies.HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance.Following feature selection,the SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types.The stacked architecture is then employed,combining the strengths of XGBoost,SVM,and RF as base learners.This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers.The model was evaluated using standard classification metrics:precision,recall,F1-score,and overall accuracy.The best overall performance was recorded with an accuracy of 99.44%for UNSW-NB15,demonstrating the model’s effectiveness.After balancing,the model demonstrated a clear improvement in detecting the attacks.We tested the model on four datasets to show the effectiveness of the proposed approach and performed the ablation study to check the effect of each parameter.Also,the proposed model is computationaly efficient.To support transparency and trust in decision-making,explainable AI(XAI)techniques are incorporated that provides both global and local insight into feature contributions,and offers intuitive visualizations for individual predictions.This makes it suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
基金funded by University of Transport and Communications(UTC)under grant number T2025-CN-004.
文摘Reversible data hiding(RDH)enables secret data embedding while preserving complete cover image recovery,making it crucial for applications requiring image integrity.The pixel value ordering(PVO)technique used in multi-stego images provides good image quality but often results in low embedding capability.To address these challenges,this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image.The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order.Four secret bits are embedded into each block’s maximum pixel value,while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold.A similar embedding strategy is also applied to the minimum side of the block,including the second-smallest pixel value.This design enables each block to embed up to 14 bits of secret data.Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches,advancing the field of reversible steganography.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00235509, Development of security monitoring technology based on network behavior against encrypted cyber threats in ICT convergence environment).
Abstract: With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, spanning a range from non-encrypted to fully encrypted equipment. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate class imbalance, eight data sampling techniques were applied, and their effectiveness was comparatively analyzed from various perspectives using two ensemble models and three deep learning (DL) models. The experimental results confirm that metadata-based attack detection is feasible using only encrypted traffic. On the UNSW-NB15 dataset, the F1-score of encrypted traffic was approximately 0.98, about 4.3% higher than that of unencrypted traffic (approximately 0.94). Analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that dataset quality and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, recall improved by up to 23.0% on the UNSW-NB15 (Encrypted) dataset and by 20.26% on the CICIoT-2023 (Encrypted) dataset, a similar level of improvement. Notably, on CICIoT-2023, the F1-score and Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments, although the extent of the improvement may vary with data quality, model architecture, and sampling strategy.
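A minimal sketch of how data-sampling strategies can be compared for metadata-based detection, assuming a synthetic, heavily imbalanced flow-metadata matrix and only two of the eight samplers (SMOTE and random undersampling via imbalanced-learn); it illustrates the evaluation pattern, not the study's exact protocol.

```python
# Compare no sampling, oversampling, and undersampling on an imbalanced
# metadata-style dataset using F1 and recall, as in the abstract's comparison.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))                               # flow metadata (duration, bytes, ...)
y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 1.8).astype(int)  # ~5% "attack" flows

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("none", None),
                      ("SMOTE", SMOTE(random_state=0)),
                      ("undersample", RandomUnderSampler(random_state=0))]:
    Xs, ys = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xs, ys)
    pred = clf.predict(X_te)
    print(f"{name:12s} f1={f1_score(y_te, pred):.3f} recall={recall_score(y_te, pred):.3f}")
```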
Funding: Funded by the Science and Technology Project of State Grid Corporation of China (5108-202355437A-3-2-ZN).
Abstract: The increasing complexity of China's electricity market creates substantial challenges for settlement automation, data consistency, and operational scalability. Existing provincial settlement systems are fragmented, lack a unified data structure, and depend heavily on manual intervention to process high-frequency and retroactive transactions. To address these limitations, a graph-based unified settlement framework is proposed to enhance automation, flexibility, and adaptability in electricity market settlements. A flexible attribute-graph model represents heterogeneous multi-market data, enabling standardized integration, rapid querying, and seamless adaptation to evolving business requirements. An extensible operator library supports configurable settlement rules, and a suite of modular tools (dataset generation, formula configuration, billing templates, and task scheduling) facilitates end-to-end automated settlement processing. A robust refund-clearing mechanism is further incorporated, using sandbox execution, data-version snapshots, dynamic lineage tracing, and real-time change-capture technologies to enable rapid and accurate recalculation under dynamic policy and data revisions. Case studies based on real-world data from regional Chinese markets validate the effectiveness of the proposed approach, demonstrating marked improvements in computational efficiency, system robustness, and automation. Moreover, enhanced settlement accuracy and high temporal granularity improve price-signal fidelity, promote cost-reflective tariffs, and incentivize energy-efficient and demand-responsive behavior among market participants. The method not only supports equitable and transparent market operations but also provides a generalizable, scalable foundation for modern electricity settlement platforms in increasingly complex and dynamic market environments.
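The sketch below illustrates the attribute-graph idea with networkx, assuming hypothetical node names, trade attributes, and a toy price-times-energy settlement operator; the framework's actual graph schema and operator library are not reproduced here.

```python
# Attribute graph: market participants as nodes, traded positions as edges
# whose attributes carry heterogeneous per-market data; a settlement "operator"
# is then just a function over a participant's edges.
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("plant_A", kind="generator")
g.add_node("retailer_B", kind="retailer")
# One edge per traded position; attributes hold market segment, volume, and price.
g.add_edge("plant_A", "retailer_B", market="day_ahead", energy_mwh=120.0, price=410.0)
g.add_edge("plant_A", "retailer_B", market="contract", energy_mwh=80.0, price=380.0)

def settle(graph, seller):
    """Toy settlement operator: sum energy * price over the seller's outgoing trades."""
    return sum(d["energy_mwh"] * d["price"] for _, _, d in graph.out_edges(seller, data=True))

print("plant_A receivable:", settle(g, "plant_A"))
```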
Funding: Funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under grant No. (DGSSR-2024-02-01264).
Abstract: Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES that combines text-based, vector-based, and embedding-based similarity measures to improve scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine-learning approach, selecting the top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R2 of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R2 to 88.95%. Experiment 4 introduced a data-efficient training approach in which the training portion was increased from 5% to 50%; using just 10% of the data achieved near-peak performance, with an R2 of 85.49%, an effective trade-off between performance and computational cost. These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
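A small sketch of the hybrid-feature idea, assuming toy English essays, TF-IDF cosine similarity as the vector-based measure, and a difflib ratio as the text-based measure, with the embedding-based measure omitted; a real Arabic AES system would use Arabic-specific embeddings and far more data than shown here.

```python
# Score essays by similarity to a reference answer: build two similarity
# features per essay and regress teacher grades with a Random Forest.
import difflib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestRegressor

reference = "machine learning models evaluate essays automatically"
essays = ["models evaluate essays with machine learning",
          "the weather is nice today",
          "automatic essay evaluation uses learning models"]
scores = np.array([4.5, 1.0, 4.0])                     # toy teacher-assigned grades

tfidf = TfidfVectorizer().fit([reference] + essays)
ref_vec = tfidf.transform([reference])

def features(text):
    vec_sim = cosine_similarity(tfidf.transform([text]), ref_vec)[0, 0]  # vector-based
    txt_sim = difflib.SequenceMatcher(None, text, reference).ratio()     # text-based
    return [vec_sim, txt_sim]

X = np.array([features(e) for e in essays])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, scores)
print(model.predict([features("essays are graded by machine learning models")]))
```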
Funding: Supported by the project "Romanian Hub for Artificial Intelligence-HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027, MySMIS No. 334906.
Abstract: Objective expertise evaluation of individuals, as a prerequisite stage for team formation, has been a long-term desideratum in large software development companies. With rapid advancements in machine learning methods and reliable existing data stored in project management tools' datasets, automating this evaluation process becomes a natural step forward. In this context, our approach quantifies software developer expertise using metadata from task-tracking systems. We mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge of the software industry. We then automatically classify the zones of expertise associated with each task a developer has worked on, using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.
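An illustrative sketch of tagging task summaries with expertise zones and aggregating them per developer, assuming a generic zero-shot transformer from the Hugging Face transformers library, hypothetical zone labels, and toy task records; the paper's fine-tuned BERT-like classifier and expertise formulas are not reproduced.

```python
# Classify each task summary into a technology zone, then count zones per
# developer as a rough proxy for technology-specific exposure.
from collections import Counter, defaultdict
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
zones = ["backend", "frontend", "databases", "devops"]        # assumed zone labels

tasks = [{"developer": "dev_1", "summary": "Fix SQL deadlock in the orders repository"},
         {"developer": "dev_1", "summary": "Add Kubernetes liveness probes to the API"},
         {"developer": "dev_2", "summary": "Rework the React checkout form validation"}]

expertise = defaultdict(Counter)
for task in tasks:
    result = classifier(task["summary"], candidate_labels=zones)
    expertise[task["developer"]][result["labels"][0]] += 1    # top-scoring zone

for dev, counts in expertise.items():
    print(dev, dict(counts))
```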
Abstract: With the rapid growth of biomedical data, particularly multi-omics data including genomics, transcriptomics, proteomics, metabolomics, and epigenomics, medical research and clinical decision-making confront both new opportunities and obstacles. The huge and diversified nature of these datasets cannot always be managed using traditional data analysis methods. As a consequence, deep learning has emerged as a strong tool for analysing multi-omics data owing to its ability to handle complex and non-linear relationships. This paper explores the fundamental concepts of deep learning and how they are used in multi-omics medical data mining. We demonstrate how autoencoders, variational autoencoders, multimodal models, attention mechanisms, transformers, and graph neural networks enable pattern analysis and recognition across omics data. Deep learning has been found to be effective in disease classification, biomarker identification, gene network learning, and therapeutic efficacy prediction. We also consider critical issues such as data quality, model explainability, reproducibility of findings, and computational power requirements, as well as future directions for combining omics with clinical and imaging data, explainable AI, federated learning, and real-time diagnostics. Overall, this study emphasises the need for collaboration across disciplines to advance deep learning-based multi-omics research for precision medicine and for understanding complicated disorders.
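As one concrete example of the architectures surveyed, the sketch below compresses a random high-dimensional omics-like matrix with a small Keras autoencoder; the layer sizes, latent dimension, and synthetic expression matrix are illustrative assumptions only.

```python
# Autoencoder for omics dimensionality reduction: encode ~2000 features into a
# 32-dimensional latent representation and train by reconstruction loss.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 2000).astype("float32")        # e.g., 2000 gene-expression features

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(2000,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu")])      # latent embedding
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2000, activation="sigmoid")])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

latent = encoder.predict(X)                              # compressed representation
print(latent.shape)                                      # (500, 32)
```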
Abstract: High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).