Cyberbullying on social media poses significant psychological risks, yet most detection systems oversimplify the task by focusing on binary classification, ignoring nuanced categories such as passive-aggressive remarks or indirect slurs. To address this gap, we propose a hybrid framework combining Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) based models for multi-class cyberbullying detection. Our approach integrates TF-IDF for lexical specificity and Word2Vec for semantic relationships, fused with BERT's contextual embeddings to capture syntactic and semantic complexities. We evaluate the framework on a publicly available dataset of 47,000 annotated social media posts across five cyberbullying categories: age, ethnicity, gender, religion, and indirect aggression. Among the BERT variants tested, BERT Base Uncased achieved the highest performance with 93% accuracy (±1% standard deviation across 5-fold cross-validation) and an average AUC of 0.96, outperforming standalone TF-IDF (78%) and Word2Vec (82%) models. Notably, it achieved near-perfect AUC scores (0.99) for age- and ethnicity-based bullying. A comparative analysis with state-of-the-art benchmarks, including Generative Pre-trained Transformer 2 (GPT-2) and Text-to-Text Transfer Transformer (T5) models, highlights BERT's superiority in handling ambiguous language. This work advances cyberbullying detection by demonstrating how hybrid feature extraction and transformer models improve multi-class classification, offering a scalable solution for moderating nuanced harmful content.
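The hybrid-feature idea above can be illustrated with a minimal sketch (not the paper's implementation): compute TF-IDF vectors for a toy corpus from scratch and fuse them with a placeholder dense vector by concatenation, standing in for the Word2Vec/BERT embeddings. The corpus and the dense vector are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "you are so annoying",
    "great game last night",
    "nobody likes you",
]

# Build vocabulary and document frequencies over the toy corpus.
vocab = sorted({w for d in docs for w in d.split()})
df = Counter(w for d in docs for w in set(d.split()))
N = len(docs)

def tfidf(doc):
    counts = Counter(doc.split())
    total = len(doc.split())
    # Smoothed IDF as used by many toolkits: log((1+N)/(1+df)) + 1.
    return [(counts[w] / total) * (math.log((1 + N) / (1 + df[w])) + 1)
            for w in vocab]

def fuse(doc, dense_vec):
    # Concatenate sparse lexical features with a dense semantic vector
    # (here a stand-in for a learned Word2Vec/BERT embedding).
    return tfidf(doc) + list(dense_vec)

vec = fuse(docs[0], [0.1, -0.2, 0.3])
print(len(vocab), len(vec))
```

In the paper's setting the dense part would come from a trained embedding model; concatenation is only one possible fusion strategy.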
The complex sand-casting process, combined with interactions between process parameters, makes casting quality difficult to control, resulting in a high scrap rate. A strategy based on a data-driven model was proposed to reduce casting defects and improve production efficiency; it comprises a random forest (RF) classification model, feature importance analysis, and process parameter optimization with Monte Carlo simulation. The collected data, covering four defect types and the corresponding process parameters, were used to construct the RF model. Classification results show a recall rate above 90% for all categories. The Gini Index was used to assess the importance of the process parameters in the formation of the various defects in the RF model. Finally, the classification model was applied to different production conditions for quality prediction. In the case of process parameter optimization for gas porosity defects, the model serves as a surrogate experiment within the Monte Carlo method to estimate a better temperature distribution. When applied in the factory, the prediction model greatly improved the efficiency of defect detection. Results show that the scrap rate decreased from 10.16% to 6.68%.
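The Gini criterion behind RF feature importance can be sketched briefly: the impurity decrease achieved when splitting on a parameter is the quantity an RF aggregates to rank parameters. The toy labels and threshold below are invented for illustration.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(labels, left, right):
    # Weighted impurity drop from splitting the parent into two children.
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# Toy data: defective vs. acceptable castings split by a hypothetical
# pouring-temperature threshold.
labels = ["defect", "defect", "ok", "ok", "ok", "defect"]
left   = ["defect", "defect", "defect"]   # below threshold
right  = ["ok", "ok", "ok"]               # above threshold
print(gini(labels), impurity_decrease(labels, left, right))
```

A split that perfectly separates defects yields the maximum possible decrease, marking that parameter as highly important.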
In this study, eight varieties of maize seeds were used as the research objects. Eighty-one combinations of preprocessing methods were applied to the original spectra. Through comparison, Savitzky-Golay (SG) smoothing combined with multivariate scattering correction (MSC) and maximum-minimum normalization (MN) was identified as the optimal preprocessing technique. The competitive adaptive reweighted sampling (CARS) and successive projections algorithm (SPA) methods, as well as their combination, were employed to extract feature wavelengths. Classification models based on back propagation (BP), support vector machine (SVM), random forest (RF), and partial least squares (PLS) were established using full-band data and feature wavelengths. Among all models, the (CARS-SPA)-BP model achieved the highest accuracy of 98.44%. This study offers novel insights and methodologies for the rapid and accurate identification of maize seeds as well as other crop seeds.
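The normalization step (MN) of such a pipeline is easy to sketch; here a simple moving average stands in for Savitzky-Golay smoothing (the real SG filter fits a local polynomial, so this is only an illustrative proxy), and the spectrum values are made up.

```python
def moving_average(spectrum, window=3):
    # Crude smoothing proxy for the SG filter: average over a local window.
    half = window // 2
    return [sum(spectrum[max(0, i - half): i + half + 1]) /
            len(spectrum[max(0, i - half): i + half + 1])
            for i in range(len(spectrum))]

def minmax_normalize(spectrum):
    # Maximum-minimum normalization (MN): rescale to the [0, 1] range.
    lo, hi = min(spectrum), max(spectrum)
    return [(x - lo) / (hi - lo) for x in spectrum]

raw = [0.42, 0.45, 0.44, 0.60, 0.58, 0.55, 0.70]   # toy reflectance values
smoothed = moving_average(raw)
norm = minmax_normalize(smoothed)
print(min(norm), max(norm))
```

In practice MSC would be applied between smoothing and normalization, correcting each spectrum against a reference mean spectrum.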
A comprehensive understanding of village development patterns and the identification of different village types are crucial for formulating tailored planning for rural revitalization. However, a model for large-scale village classification to support such tailored planning is still lacking. This study develops a large-scale village classification model using Gaussian Mixture Models (GMM) to support tailored rural revitalization efforts. First, we propose a multi-dimensional index system to capture the diverse features of massive numbers of villages. Second, the GMM clustering algorithm is applied to identify distinct village types based on their features. The model was employed to classify the 25,409 villages of Hubei province, China, into four classes. Villages in these classes exhibit discernible differences in spatial distribution, topography, location, economic development level, industrial structure, infrastructure, and resource endowment. In addition, the GMM-based village classification model demonstrates a high level of agreement with evaluations made by planning experts, confirming its accuracy and reliability. In the empirical study, our model achieves an overall accuracy of 95.29%, signifying substantial concordance between the classifications made by planning experts and the results generated by our model. Based on the identified features, tailored paths are proposed for each village class for rural revitalization efforts.
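As a toy analogue of GMM clustering, the sketch below runs expectation-maximization for a two-component, one-dimensional Gaussian mixture in pure Python. The paper's model is multi-dimensional over many indices; this reduces it to a single invented feature to show the E-step/M-step mechanics.

```python
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, iters=50):
    mu = [min(data), max(data)]   # crude but effective initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, data)) / nk)
    return mu, var, pi

random.seed(0)
data = ([random.gauss(0, 0.5) for _ in range(100)] +
        [random.gauss(5, 0.5) for _ in range(100)])
mu, var, pi = em_gmm(data)
print(sorted(mu))
```

Each point's final responsibilities give the soft cluster assignment, which in the paper's setting becomes the village class.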
Detecting brain tumours is complex due to the natural variation in their location, shape, and intensity in images. Although accurate detection and segmentation of brain tumours would be highly beneficial, current methods have yet to solve this problem despite the numerous available approaches. Precise analysis of Magnetic Resonance Imaging (MRI) is crucial for detecting, segmenting, and classifying brain tumours in medical diagnostics, and it requires precise, efficient, careful, and reliable image analysis techniques. The authors developed a Deep Learning (DL) fusion model to classify brain tumours reliably. Because DL models require large amounts of training data to achieve good results, the researchers used data augmentation techniques to increase the dataset size for training. VGG16, ResNet50, and convolutional deep belief networks extracted deep features from the MRI images. Softmax was used as the classifier, and the training set was supplemented with intentionally created MRI images of brain tumours in addition to the genuine ones. The features of two DL models were combined in the proposed model to generate a fusion model, which significantly increased classification accuracy. A publicly accessible dataset was used to test the model's performance, and the experimental results showed that the proposed fusion model achieved a classification accuracy of 98.98%. Finally, the results were compared with existing methods, and the proposed model outperformed them significantly.
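The final stage described above, concatenated deep features scored by a softmax classifier, can be sketched as follows. Every feature value and weight here is invented; in the paper they would come from the trained backbones and a learned linear layer.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Made-up deep features from two backbones, fused by concatenation.
vgg_features = [0.8, 0.1]
resnet_features = [0.4, 0.9, 0.2]
fused = vgg_features + resnet_features

# A made-up linear layer mapping 5 fused features to 3 tumour classes.
weights = [[0.5, -0.2, 0.1, 0.3, 0.0],
           [0.1, 0.4, -0.1, 0.2, 0.6],
           [-0.3, 0.2, 0.5, 0.0, 0.1]]
logits = [sum(w * f for w, f in zip(row, fused)) for row in weights]
probs = softmax(logits)
print(probs)
```

The softmax output is a proper probability distribution over classes, which is what makes it the conventional final layer for this kind of fusion model.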
This study highlights the complexity and importance of water quality as a measure of the health and sustainability of ecosystems that directly influence biodiversity, human health, and the world economy. The predictability of water quality thus plays a crucial role in managing ecosystems, enabling informed decisions and proper environmental management. This study addresses these challenges by proposing an effective machine learning methodology applied to the public "Water Quality" dataset. The methodology models the dataset for prediction and classification analysis with high values of the evaluation metrics such as accuracy, sensitivity, and specificity. The proposed methodology is based on two approaches: (a) the SMOTE method to deal with unbalanced data and (b) carefully applied classical machine learning models. This paper uses Random Forests, Decision Trees, XGBoost, and Support Vector Machines because they can handle large datasets, cope with skewed class distributions, and provide high accuracy in water quality classification. A key contribution of this work is the use of custom sampling strategies within the SMOTE approach, which significantly enhanced performance metrics and improved class imbalance handling. The results demonstrate significant improvements in predictive performance, achieving the highest reported metrics: accuracy (98.92% vs. 96.06%), sensitivity (98.3% vs. 71.26%), and F1 score (98.37% vs. 79.74%) using the XGBoost model. These improvements underscore the effectiveness of our custom SMOTE sampling strategies in addressing class imbalance. The findings contribute to environmental management by enabling ecology specialists to develop more accurate strategies for monitoring, assessing, and managing drinking water quality, ensuring better ecosystem and public health outcomes.
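The core SMOTE step that the custom sampling strategies build on can be sketched in a few lines: a synthetic minority sample is placed at a random point on the segment between a minority instance and one of its nearest minority neighbours. The points below are invented.

```python
import random

def nearest_neighbor(point, others):
    # Nearest minority neighbour by squared Euclidean distance.
    return min(others,
               key=lambda q: sum((a - b) ** 2 for a, b in zip(point, q)))

def smote_sample(point, neighbors, rng):
    # Interpolate between the point and its nearest neighbour.
    nb = nearest_neighbor(point, neighbors)
    t = rng.random()                     # position along the segment
    return [a + t * (b - a) for a, b in zip(point, nb)]

rng = random.Random(42)
minority = [[1.0, 2.0], [1.2, 2.1], [0.5, 1.0]]
synthetic = smote_sample(minority[0], minority[1:], rng)
print(synthetic)
```

Real SMOTE implementations sample among the k nearest neighbours rather than only the single closest one; restricting or reweighting that choice is one natural place for the custom strategies mentioned above.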
Automated and accurate movie genre classification is crucial for content organization, recommendation systems, and audience targeting in the film industry. Although most existing approaches focus on audiovisual features such as trailers and posters, text-based classification remains underexplored despite its accessibility and semantic richness. This paper introduces the Genre Attention Model (GAM), a deep learning architecture that integrates transformer models with a hierarchical attention mechanism to extract and leverage contextual information from movie plots for multi-label genre classification. To assess its effectiveness, we evaluate multiple transformer-based models, including Bidirectional Encoder Representations from Transformers (BERT), A Lite BERT (ALBERT), Distilled BERT (DistilBERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), XLNet, and Decoding-enhanced BERT with Disentangled Attention (DeBERTa). Experimental results demonstrate the superior performance of the DeBERTa-based GAM, which employs a two-tier hierarchical attention mechanism: word-level attention highlights key terms, while sentence-level attention captures critical narrative segments, ensuring a refined and interpretable representation of movie plots. Evaluated on three benchmark datasets, Trailers12K, the Large Movie Trailer Dataset-9 (LMTD-9), and MovieLens37K, GAM achieves micro-average precision scores of 83.63%, 83.32%, and 83.34%, respectively, surpassing state-of-the-art models. Additionally, GAM is computationally efficient, requiring just 6.10 giga floating-point operations (GFLOPs), making it a scalable and cost-effective solution. These results highlight the growing potential of text-based deep learning models in genre classification and GAM's effectiveness in improving predictive accuracy while maintaining computational efficiency. With its robust performance, GAM offers a versatile and scalable framework for content recommendation, film indexing, and media analytics, providing an interpretable alternative to traditional audiovisual-based classification techniques.
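The word-level tier of a hierarchical attention mechanism can be sketched as follows: each word vector is scored against a context vector, the scores are softmax-normalized, and the sentence representation is the weighted sum. The vectors here are made up; in GAM they would be learned during training.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(word_vectors, context):
    # Score each word against the context vector, normalize, and pool.
    scores = [sum(w * c for w, c in zip(vec, context)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    sentence = [sum(wt * vec[d] for wt, vec in zip(weights, word_vectors))
                for d in range(dim)]
    return weights, sentence

words = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]   # toy 2-d word embeddings
weights, sent = attend(words, context=[1.0, 0.0])
print(weights, sent)
```

A second tier applies the same operation to sentence vectors, which is what lets the model surface both key terms and key narrative segments for interpretation.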
The rapid growth of digital data necessitates advanced natural language processing (NLP) models like BERT (Bidirectional Encoder Representations from Transformers), known for its superior performance in text classification. However, BERT's size and computational demands limit its practicality, especially in resource-constrained settings. This research compresses the BERT base model for Bengali emotion classification through knowledge distillation (KD), pruning, and quantization. Although Bengali is the sixth most spoken language globally, NLP research on it remains limited. Our approach addresses this gap by creating an efficient BERT-based model for Bengali text. We explored 20 combinations of KD, quantization, and pruning, resulting in greater speedup, fewer parameters, and reduced memory size. Our best results demonstrate significant improvements in both speed and efficiency. For instance, for mBERT we achieved a 3.87× speedup and a 4× compression ratio with a Distil+Prune+Quant combination that reduced the parameter count from 178 M to 46 M, while the memory size decreased from 711 MB to 178 MB. These results offer scalable solutions for NLP tasks in various languages and advance the field of model compression, making such models suitable for real-world applications in resource-limited environments.
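Of the three compression techniques combined above, post-training quantization is the simplest to sketch: float weights are mapped to signed 8-bit integers with a per-tensor scale, giving roughly a 4× memory reduction versus float32. The weight values below are invented.

```python
def quantize_int8(weights):
    # Symmetric linear quantization: one scale for the whole tensor.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.31, -0.8, 0.05, 0.66, -0.12]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 4))
```

The rounding error is bounded by half the scale, which is why quantization typically costs little accuracy while each weight shrinks from 4 bytes to 1.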
Effectively handling imbalanced datasets remains a fundamental challenge in computational modeling and machine learning, particularly when class overlap significantly deteriorates classification performance. Traditional oversampling methods often generate synthetic samples without considering density variations, leading to redundant or misleading instances that exacerbate class overlap in high-density regions. To address these limitations, we propose Wasserstein Generative Adversarial Network Variational Density Estimation (WGAN-VDE), a computationally efficient, density-aware adversarial resampling framework that enhances minority class representation while strategically reducing class overlap. The originality of WGAN-VDE lies in its density-aware sample refinement, which positions synthetic samples in underrepresented regions, thereby improving class distinctiveness. By applying structured feature representation, targeted sample generation, and density-based selection mechanisms, the proposed framework ensures the generation of well-separated and diverse synthetic samples, improving class separability and reducing redundancy. Experimental evaluation on 20 benchmark datasets demonstrates that this approach outperforms 11 state-of-the-art rebalancing techniques, achieving superior results on F1-score, accuracy, G-mean, and AUC metrics. These results establish the proposed method as an effective and robust computational approach, suitable for diverse engineering and scientific applications involving imbalanced data classification and computational modeling.
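A hedged sketch of the density-based selection idea: score candidate synthetic samples by their mean distance to the k nearest existing minority points and keep those in the sparsest (most underrepresented) regions. The actual WGAN-VDE pipeline generates candidates adversarially; here the candidates are simply given.

```python
def mean_knn_distance(point, population, k=2):
    # Mean Euclidean distance to the k nearest points: large value
    # means the point sits in a low-density region.
    dists = sorted(sum((a - b) ** 2 for a, b in zip(point, q)) ** 0.5
                   for q in population)
    return sum(dists[:k]) / k

def select_sparse(candidates, population, keep=1, k=2):
    # Keep the candidates farthest from existing minority mass.
    scored = sorted(candidates,
                    key=lambda c: mean_knn_distance(c, population, k),
                    reverse=True)
    return scored[:keep]

minority = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0]]
candidates = [[0.05, 0.05],   # dense region, near existing points
              [1.0, 1.0]]     # sparse region, underrepresented
print(select_sparse(candidates, minority))
```

Rejecting candidates that fall in already-dense regions is what prevents the redundancy and overlap amplification attributed to plain oversampling above.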
Medical image classification is crucial in disease diagnosis, treatment planning, and clinical decision-making. We introduce a novel medical image classification approach that integrates Bayesian Random Semantic Data Augmentation (BSDA) with a Vision Mamba-based medical image classification model (MedMamba), enhanced by residual connection blocks; we name the resulting model BSDA-Mamba. BSDA augments medical image data semantically, enhancing the model's generalization ability and classification performance. MedMamba, a deep learning-based state space model, excels at capturing long-range dependencies in medical images. By incorporating residual connections, BSDA-Mamba further improves feature extraction. Through comprehensive experiments on eight medical image datasets, we demonstrate that BSDA-Mamba outperforms existing models in accuracy, area under the curve, and F1-score. Our results highlight BSDA-Mamba's potential as a reliable tool for medical image analysis, particularly in handling diverse imaging modalities from X-rays to MRI. The open-sourcing of our model's code and datasets will facilitate the reproduction and extension of our work.
Purpose – This study aims to enhance the accuracy of key entity extraction from railway accident report texts and address challenges such as complex domain-specific semantics, data sparsity, and strong inter-sentence semantic dependencies. A robust entity extraction method tailored for accident texts is proposed. Design/methodology/approach – The method is implemented through a dual-branch multi-task mutual learning model named R-MLP, which jointly performs entity recognition and accident phase classification. The model leverages a shared BERT encoder to extract contextual features and incorporates a sentence span indexing module to align feature granularity. A cross-task mutual learning mechanism is also introduced to strengthen semantic representation. Findings – R-MLP effectively mitigates the impact of semantic complexity and data sparsity in domain entities and enhances the model's ability to capture inter-sentence semantic dependencies. Experimental results show that R-MLP achieves a maximum F1-score of 0.736 in extracting six types of key railway accident entities, significantly outperforming baseline models such as RoBERTa and MacBERT. Originality/value – The proposed method shows superior generalization and accuracy in domain-specific entity extraction tasks, confirming its effectiveness and practical value.
Texture analysis methods offer substantial advantages and potential for examining the macro-topographic features of dunes. Despite these advantages, comprehensive approaches that integrate digital elevation models (DEMs) with quantitative texture features have not been fully developed. This study introduces an automatic dune classification framework that combines texture and topographic features and validates it on a typical coastal aeolian landform, the dunes of the Namib Desert. A three-stage approach was outlined: (1) dune units were segmented using digital terrain analysis; (2) six texture features (angular second moment, contrast, correlation, variance, entropy, and inverse difference moment) were extracted from the gray-level co-occurrence matrix (GLCM) and quantified; and (3) texture-topographic indices were integrated into a random forest (RF) model for classification. The results show that the RF model fused with texture features can accurately identify dune morphological characteristics; through accuracy evaluation and remote sensing image verification, the overall accuracy reaches 78.0% (kappa coefficient = 0.72), outperforming traditional spectral-based methods. In addition, spatial analysis reveals that coastal dunes exhibit complex texture patterns, with texture homogeneity closely linked to dune-type transitions. Specifically, homogeneous textures correspond to simple and stable forms such as barchans, while heterogeneous textures are associated with complex or composite dunes. The complexity, periodicity, and directionality of the texture features are highly consistent with the spatial distribution of the dunes. Validation using high-resolution remote sensing imagery (Sentinel-2) further confirms that the method effectively clusters similar dunes and distinguishes different dune types. Additionally, the dune classification results correspond well with changes in near-surface wind regimes. Overall, the findings suggest that texture features derived from DEMs can accurately capture the dynamic characteristics of dune morphology, offering a novel approach to automatic dune classification. Compared with traditional methods, the developed approach facilitates large-scale, high-precision dune mapping while reducing the workload of manual interpretation, thus advancing research on aeolian geomorphology.
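A minimal sketch of the GLCM and three of the six texture features named above (angular second moment, contrast, entropy), computed for a single horizontal offset on a tiny made-up "elevation" grid:

```python
from collections import Counter
import math

def glcm_horizontal(image):
    # Co-occurrence probabilities of horizontally adjacent gray levels.
    pairs = Counter()
    for row in image:
        for a, b in zip(row, row[1:]):
            pairs[(a, b)] += 1
    total = sum(pairs.values())
    return {k: v / total for k, v in pairs.items()}

def texture_features(glcm):
    asm = sum(p ** 2 for p in glcm.values())                  # angular second moment
    contrast = sum(((i - j) ** 2) * p for (i, j), p in glcm.items())
    entropy = -sum(p * math.log(p) for p in glcm.values())
    return asm, contrast, entropy

uniform = [[1, 1, 1], [1, 1, 1]]   # homogeneous surface
dune    = [[0, 1, 2], [2, 1, 0]]   # varied surface
print(texture_features(glcm_horizontal(uniform)))
print(texture_features(glcm_horizontal(dune)))
```

The homogeneous grid yields maximal ASM and zero contrast and entropy, while the varied grid does the opposite, the same contrast the study exploits between simple barchans and complex composite dunes. A full implementation would also average over multiple offsets and directions.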
We apply stochastic seismic inversion and Bayesian facies classification for porosity modeling and igneous rock identification in the presalt interval of the Santos Basin. This integration of seismic and well-derived information enhances reservoir characterization. Stochastic inversion and Bayesian classification are powerful tools because they permit addressing the uncertainties in the model. We used the ES-MDA algorithm to achieve realizations equivalent to the P10, P50, and P90 percentiles of acoustic impedance, a novel method for acoustic inversion in the presalt. The facies were divided into five classes: reservoir 1, reservoir 2, tight carbonates, clayey rocks, and igneous rocks. To deal with the overlaps in the acoustic impedance values of the facies, we included geological information through a priori probabilities, indicating that structural highs are reservoir-dominated. To illustrate our approach, we conducted porosity modeling using facies-related rock-physics models for rock-physics inversion in an area with a well drilled in a coquina bank, and we evaluated the thickness and extension of an igneous intrusion near the carbonate-salt interface. The modeled porosity and the classified seismic facies are in good agreement with those observed in the wells. Notably, the coquina bank shows an improvement in porosity towards the top. The a priori probability model was crucial for limiting the clayey rocks to the structural lows. In Well B, the hit rate for igneous rock in the three scenarios is higher than 60%, showing excellent thickness-prediction capability.
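The Bayesian classification step can be sketched as posterior(facies | AI) ∝ likelihood(AI | facies) × prior(facies), with Gaussian likelihoods per facies. All means, standard deviations, and priors below are invented for illustration; in the study they would be derived from well logs and structural position.

```python
import math

def normal_pdf(x, mu, sigma):
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def classify(ai, likelihood_params, prior):
    # Posterior over facies given one acoustic-impedance value.
    post = {f: prior[f] * normal_pdf(ai, mu, sd)
            for f, (mu, sd) in likelihood_params.items()}
    z = sum(post.values())
    return {f: p / z for f, p in post.items()}

# Hypothetical per-facies impedance distributions (mean, std).
params = {"reservoir": (9000, 600), "tight": (12000, 800), "igneous": (14000, 900)}
# A structural-high prior favouring reservoir facies.
prior = {"reservoir": 0.6, "tight": 0.3, "igneous": 0.1}
posterior = classify(9500, params, prior)
print(max(posterior, key=posterior.get))
```

Varying the prior with structural position is exactly how the a priori model above confines clayey rocks to the structural lows despite overlapping impedance ranges.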
Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study aims to leverage a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of COVID-19 outbreak-related tweets. Diverse topic modeling approaches were employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the use of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for topic modeling, given its proficiency in capturing intricate textual relationships. For text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT showed outstanding performance, achieving accuracy and weighted F1-scores exceeding 80%. These results significantly surpassed other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60% to 70%.
Sentence classification is the process of categorizing a sentence based on its context. Sentence categorization requires more semantic highlights than tasks such as dependency parsing, which requires more syntactic elements. Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence, recognizing its progress, or comparing impacts. An ensemble of pre-trained language models is employed here to classify conversational sentences from a conversation corpus. The conversational sentences are classified into four categories: information, question, directive, and commission. These classification label sequences are used for analyzing conversation progress and predicting the pecking order of the conversation. An ensemble of Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Generative Pre-trained Transformer (GPT), DistilBERT, and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models is trained on the conversation corpus. Hyperparameter tuning is carried out for better performance on sentence classification. This Ensemble of Pre-trained Language Models with Hyperparameter Tuning (EPLM-HT) system is trained on an annotated conversation dataset. The proposed approach outperformed the base BERT, GPT, DistilBERT, and XLNet transformer models. The proposed ensemble model with fine-tuned parameters achieved an F1-score of 0.88.
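One common way to combine such models is soft voting: average the class-probability outputs of the fine-tuned models and take the argmax. The probability rows below are invented; in the paper's setting they would come from BERT, RoBERTa, GPT, DistilBERT, and XLNet.

```python
CLASSES = ["information", "question", "directive", "commission"]

def soft_vote(model_probs):
    # Average per-class probabilities across models, then argmax.
    n = len(model_probs)
    avg = [sum(row[c] for row in model_probs) / n
           for c in range(len(model_probs[0]))]
    return avg, max(range(len(avg)), key=avg.__getitem__)

probs = [
    [0.10, 0.70, 0.10, 0.10],   # model 1 favours "question"
    [0.20, 0.50, 0.20, 0.10],   # model 2 favours "question"
    [0.40, 0.35, 0.15, 0.10],   # model 3 disagrees: "information"
]
avg, winner = soft_vote(probs)
print(CLASSES[winner])
```

Whether the paper uses soft voting, hard voting, or another fusion rule is not stated in the abstract; this is one plausible realization of the ensemble step.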
Machine learning (ML) and data mining are used in various fields such as data analysis, prediction, and image processing, and especially in healthcare. Over the past decade, researchers have focused on applying ML and data mining to generate conclusions from historical data in order to improve healthcare systems by making predictions about outcomes. Using ML algorithms, researchers have developed applications for decision support, analyzed clinical aspects, extracted informative information from historical data, predicted outcomes, and categorized diseases, all of which help physicians make better decisions. Large differences are observed between women depending on their region and social circumstances. These differences have encouraged scholars to conduct studies at a local level in order to better understand the factors that affect maternal health and the expected child. In this study, an ensemble modeling technique is applied to classify birth outcomes as either cesarean section (C-section) or normal delivery. A voting ensemble model for the classification of a birth dataset was built using a Random Forest (RF), Gradient Boosting Classifier, Extra Trees Classifier, and Bagging Classifier as base learners. The voting ensemble of the proposed classifiers provides the best accuracy, 94.78%, compared to the individual classifiers. Ensemble models make ML algorithms more accurate by reducing variance and classification errors. Once a suitable classification model has been developed for birth classification, decision support systems can be created to enable clinicians to gain in-depth insights into the patterns in the datasets. Developing such a system will not only allow health organizations to improve maternal health assessment processes, but also open doors for interdisciplinary research between two different fields in the region.
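The voting step itself is small enough to show directly: each base learner casts one label and the most common label wins (hard majority voting; the abstract does not specify hard versus soft voting, so this is one plausible reading). The predictions below are made up.

```python
from collections import Counter

def majority_vote(predictions):
    # Return the most frequent label among the base learners' votes.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from RF, gradient boosting, extra trees, and bagging.
base_predictions = ["c_section", "normal", "c_section", "c_section"]
print(majority_vote(base_predictions))
```

Because the base learners err on different samples, the aggregated vote reduces variance, which is the mechanism behind the accuracy gain reported above.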
We present an approach to automatically classify medical text at the sentence level. Given the inherent complexity of medical text classification, we employ adapters based on pre-trained language models to extract information from medical text, facilitating more accurate classification while minimizing the number of trainable parameters. Extensive experiments conducted on various datasets demonstrate the effectiveness of our approach.
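The adapter idea can be sketched as a bottleneck block: down-project, nonlinearity, up-project, plus a residual connection. With the up-projection initialized to zero the adapter starts as the identity, which is why only its small matrices need training while the frozen backbone is untouched. All dimensions and weights below are invented.

```python
def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def relu(vec):
    return [max(0.0, v) for v in vec]

def adapter(x, w_down, w_up):
    hidden = relu(matvec(w_down, x))            # d -> bottleneck
    up = matvec(w_up, hidden)                   # bottleneck -> d
    return [xi + ui for xi, ui in zip(x, up)]   # residual connection

x = [0.5, -1.0, 2.0, 0.3]                       # a 4-dim token feature
w_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]                 # 4 -> 2 bottleneck
w_up_zero = [[0.0, 0.0]] * 4                    # zero init: identity adapter
print(adapter(x, w_down, w_up_zero))
```

With a bottleneck far smaller than the model dimension, the trainable parameter count is a tiny fraction of full fine-tuning, matching the parameter-efficiency claim above.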
A new arrival and departure flight classification method based on the transitive closure algorithm (TCA) is proposed. First, fuzzy set theory and the transitive closure algorithm are introduced. Then four different factors are selected to establish the flight classification model, and a method is given to calculate the delay cost for each class. Finally, the proposed method is applied to flight sequencing problems in a terminal area, and the results are compared with those of the traditional classification method (TCM). Results show that the new classification model is effective in reducing the cost of flight delays, thus optimizing the sequences of arrival and departure flights and improving the efficiency of air traffic control.
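The transitive closure step of such a method can be sketched directly: repeatedly compose a fuzzy similarity matrix with itself under max-min composition until it stops changing; the resulting transitive fuzzy relation is then thresholded (lambda-cut) to form classes. The similarity values below are made up.

```python
def max_min_compose(a, b):
    # Fuzzy max-min composition of two square relations.
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n))
             for j in range(n)] for i in range(n)]

def transitive_closure(r):
    # Iterate R -> R o R until a fixed point is reached.
    while True:
        r2 = max_min_compose(r, r)
        if r2 == r:
            return r
        r = r2

R = [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
T = transitive_closure(R)
print(T)
```

Cutting T at lambda = 0.8 here groups the first two elements together while the third stays separate; sweeping lambda produces the nested classification the method relies on.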
This paper presents a fuzzy logic approach to efficiently perform unsupervised character classification, improving the robustness, correctness, and speed of a character recognition system. The characters are first split into eight typographical categories. The classification scheme uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. The fuzzy unsupervised character classification, which is natural in the repre…
In order to reduce the amount of data storage and improve the processing capacity of the system, this paper proposes a new method for classifying data sources that combines a phase synchronization model from network clustering with the cloud model. First, taking the data sources as a complex network, after the network topology is obtained, the cloud model of each node's data is determined by the fuzzy analytic hierarchy process (AHP). Second, by calculating the expectation, entropy, and hyper-entropy of the cloud model, a comprehensive coupling strength is obtained and then used as the edge weight of the topology. Finally, a distribution curve is obtained by iterating the phase of each node by means of the phase synchronization model, completing the classification of the data sources. This method not only provides convenience for the storage, cleaning, and compression of data, but also improves the efficiency of data analysis.
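The three cloud-model digital characteristics named above can be estimated from samples with a backward cloud generator; the sketch below uses the common certainty-free form (Ex from the mean, En from the mean absolute deviation, He from the gap between sample variance and En squared). The sample values are invented, and variants of this estimator exist.

```python
import math

def backward_cloud(samples):
    # Estimate expectation Ex, entropy En, and hyper-entropy He.
    n = len(samples)
    ex = sum(samples) / n
    mean_abs_dev = sum(abs(x - ex) for x in samples) / n
    en = math.sqrt(math.pi / 2) * mean_abs_dev
    variance = sum((x - ex) ** 2 for x in samples) / (n - 1)
    he = math.sqrt(abs(variance - en ** 2))
    return ex, en, he

data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]   # toy node attribute samples
ex, en, he = backward_cloud(data)
print(round(ex, 3), round(en, 3), round(he, 3))
```

In the proposed method these three numbers, computed per node, feed the coupling-strength calculation that weights the network's edges.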
Funding: Scientific Research Deanship at University of Hail, Saudi Arabia, through Project Number RG-23092.
Abstract: Cyberbullying on social media poses significant psychological risks, yet most detection systems over-simplify the task by focusing on binary classification, ignoring nuanced categories like passive-aggressive remarks or indirect slurs. To address this gap, we propose a hybrid framework combining Term Frequency-Inverse Document Frequency (TF-IDF), word-to-vector (Word2Vec), and Bidirectional Encoder Representations from Transformers (BERT) based models for multi-class cyberbullying detection. Our approach integrates TF-IDF for lexical specificity and Word2Vec for semantic relationships, fused with BERT's contextual embeddings to capture syntactic and semantic complexities. We evaluate the framework on a publicly available dataset of 47,000 annotated social media posts across five cyberbullying categories: age, ethnicity, gender, religion, and indirect aggression. Among the BERT variants tested, BERT Base Uncased achieved the highest performance with 93% accuracy (±1% standard deviation across 5-fold cross-validation) and an average AUC of 0.96, outperforming standalone TF-IDF (78%) and Word2Vec (82%) models. Notably, it achieved near-perfect AUC scores (0.99) for age- and ethnicity-based bullying. A comparative analysis with state-of-the-art benchmarks, including Generative Pre-trained Transformer 2 (GPT-2) and Text-to-Text Transfer Transformer (T5) models, highlights BERT's superiority in handling ambiguous language. This work advances cyberbullying detection by demonstrating how hybrid feature extraction and transformer models improve multi-class classification, offering a scalable solution for moderating nuanced harmful content.
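The lexical/semantic fusion idea can be sketched in plain numpy: a toy TF-IDF matrix computed by hand, concatenated with dense vectors standing in for averaged Word2Vec embeddings. The corpus and the random embedding stand-ins are illustrative assumptions, not the paper's pipeline; BERT's contextual embeddings would be concatenated the same way.

```python
import math
import numpy as np

def tfidf_matrix(docs):
    """Toy TF-IDF: term frequency weighted by log inverse document frequency."""
    vocab = sorted({w for d in docs for w in d.split()})
    col = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    X = np.zeros((n_docs, len(vocab)))
    for r, d in enumerate(docs):
        words = d.split()
        for w in set(words):
            tf = words.count(w) / len(words)
            X[r, col[w]] = tf * math.log(n_docs / df[w])
    return X, vocab

docs = ["you are so annoying", "great match today", "so annoying and rude"]
tfidf, vocab = tfidf_matrix(docs)
# Stand-in for per-post averaged Word2Vec embeddings (assumed 8-dimensional).
emb = np.random.default_rng(0).normal(size=(len(docs), 8))
fused = np.hstack([tfidf, emb])  # lexical + semantic views, side by side
print(fused.shape)
```

The fused matrix simply places the sparse lexical view next to the dense semantic view, leaving the downstream classifier to weight them.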
Funding: National Key Research and Development Program of China (2022YFB3706800, 2020YFB1710100) and the National Natural Science Foundation of China (51821001, 52090042, 52074183).
Abstract: The complex sand-casting process, combined with the interactions between process parameters, makes it difficult to control casting quality, resulting in a high scrap rate. A strategy based on a data-driven model was proposed to reduce casting defects and improve production efficiency, which includes a random forest (RF) classification model, feature importance analysis, and process parameter optimization with Monte Carlo simulation. The collected data, covering four types of defects and the corresponding process parameters, were used to construct the RF model. Classification results show a recall rate above 90% for all categories. The Gini Index was used to assess the importance of the process parameters in the formation of the various defects in the RF model. Finally, the classification model was applied to different production conditions for quality prediction. In the case of process parameter optimization for gas porosity defects, the model serves as the experimental process in the Monte Carlo method to estimate a better temperature distribution. The prediction model, when applied in the factory, greatly improved the efficiency of defect detection. Results show that the scrap rate decreased from 10.16% to 6.68%.
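The Gini-based importance used here can be illustrated with a small numpy sketch: a split's contribution is the impurity of the parent node minus the weighted impurity of its children, which RF accumulates per feature. The temperature values, threshold, and defect labels below are illustrative, not factory data.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(feature, labels, threshold):
    """Impurity decrease achieved by splitting on `threshold` -- the quantity
    a random forest accumulates per feature to rank process parameters."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    w_l, w_r = len(left) / len(labels), len(right) / len(labels)
    return gini(labels) - (w_l * gini(left) + w_r * gini(right))

# Toy example: pouring temperature vs. a gas-porosity defect flag
temp = np.array([680, 690, 700, 720, 730, 740], dtype=float)
defect = np.array([1, 1, 1, 0, 0, 0])
print(split_gain(temp, defect, 710))  # perfect split: gain = 0.5
```

A pure split removes all of the parent's impurity, so the gain equals the parent Gini (0.5 for a balanced binary node).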
Funding: Science and Technology Development Plan Project of the Jilin Provincial Department of Science and Technology (No. 20220203112S) and the Jilin Provincial Department of Education Science and Technology Research Project (No. JJKH20210039KJ).
Abstract: In this study, eight different varieties of maize seeds were used as the research objects. Eighty-one combinations of preprocessing were applied to the original spectra; through comparison, Savitzky-Golay (SG) smoothing with multivariate scattering correction (MSC) and maximum-minimum normalization (MN) was identified as the optimal preprocessing technique. The competitive adaptive reweighted sampling (CARS), successive projections algorithm (SPA), and their combination were employed to extract feature wavelengths. Classification models based on back propagation (BP), support vector machine (SVM), random forest (RF), and partial least squares (PLS) were established using full-band data and the feature wavelengths. Among all models, the (CARS-SPA)-BP model achieved the highest accuracy, 98.44%. This study offers novel insights and methodologies for the rapid and accurate identification of maize seeds as well as other crop seeds.
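Two of the named preprocessing steps can be sketched in numpy; Savitzky-Golay smoothing is omitted here (in practice `scipy.signal.savgol_filter` provides it). The synthetic "spectra" below are illustrative: each is the same shape scaled and shifted, which is exactly the scatter effect MSC is meant to remove.

```python
import numpy as np

def msc(spectra):
    """Multiplicative/multivariate scatter correction: regress each spectrum
    against the mean spectrum and remove the fitted slope and offset."""
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, deg=1)
        corrected[i] = (s - offset) / slope
    return corrected

def minmax(spectra):
    """Maximum-minimum normalization of each spectrum to [0, 1]."""
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    return (spectra - lo) / (hi - lo)

base = np.sin(np.linspace(0, np.pi, 50))  # underlying "true" spectrum
spectra = np.array([a * base + b for a, b in [(1.2, 0.1), (0.8, -0.05), (1.0, 0.2)]])
out = minmax(msc(spectra))
print(out.min(), out.max())  # each row normalized to span [0, 1]
```

After MSC all three rows collapse onto the reference spectrum, and MN then rescales each to the unit interval.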
Funding: National Natural Science Foundation of China, No. 42293271 and No. 41971336.
Abstract: A comprehensive understanding of village development patterns and the identification of different village types are crucial for formulating tailored planning for rural revitalization. However, a model for large-scale village classification to support such planning is still lacking. This study develops a large-scale village classification model using Gaussian Mixture Models (GMM) to support tailored rural revitalization efforts. Firstly, we propose a multi-dimensional index system to capture the diverse features of massive numbers of villages. Secondly, the GMM clustering algorithm is applied to identify distinct village types based on their features. The model was employed to classify the 25,409 villages in Hubei Province, China, into four classes. Villages in these classes exhibit discernible differences in spatial distribution, topography, location, economic development level, industrial structure, infrastructure, and resource endowment. In addition, the GMM-based village classification model demonstrates a high level of agreement with evaluations made by planning experts, confirming its accuracy and reliability. In the empirical study, our model achieves an overall accuracy of 95.29%, signifying substantial concordance between the classifications made by planning experts and the results generated by our model. Based on the identified features, tailored paths are proposed for each village class for rural revitalization efforts.
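The Gaussian mixture idea behind the classification can be shown with a compact one-dimensional EM sketch (the actual model clusters villages on a multi-dimensional index system). The data, component count, and initialization are illustrative assumptions.

```python
import numpy as np

def em_gmm_1d(x, iters=60):
    """Minimal EM for a two-component 1-D Gaussian mixture:
    the E-step soft-assigns points, the M-step re-estimates parameters."""
    mu = np.array([x.min(), x.max()])       # spread-out initialization
    sigma = np.full(2, x.std())
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, spreads, and mixing proportions
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])
mu, sigma, pi = em_gmm_1d(x)
print(np.sort(mu).round(1))
```

Each point ends up with a soft membership in every class, which is what allows GMM to separate village types whose indicator distributions overlap.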
Funding: Ministry of Education, Youth and Sports of the Czech Republic, Grant/Award Numbers: SP2023/039, SP2023/042; the European Union under the REFRESH project, Grant/Award Number: CZ.10.03.01/00/22_003/0000048.
Abstract: Detecting brain tumours is complex due to the natural variation in their location, shape, and intensity in images. Accurate detection and segmentation of brain tumours would be highly beneficial, but current methods have yet to solve this problem despite the numerous available approaches. Precise analysis of Magnetic Resonance Imaging (MRI) is crucial for detecting, segmenting, and classifying brain tumours in medical diagnostics, and it requires precise, careful, efficient, and reliable image analysis techniques. The authors developed a Deep Learning (DL) fusion model to classify brain tumours reliably. DL models require large amounts of training data to achieve good results, so the researchers utilised data augmentation techniques to increase the dataset size. VGG16, ResNet50, and convolutional deep belief networks extracted deep features from the MRI images. Softmax was used as the classifier, and the training set was supplemented with intentionally created MRI images of brain tumours in addition to the genuine ones. The features of two DL models were combined in the proposed model to generate a fusion model, which significantly increased classification accuracy. An openly accessible dataset from the internet was used to test the model's performance, and the experimental results showed that the proposed fusion model achieved a classification accuracy of 98.98%. Finally, the results were compared with existing methods, and the proposed model outperformed them significantly.
Abstract: This study demonstrates the complexity and importance of water quality as a measure of the health and sustainability of ecosystems that directly influence biodiversity, human health, and the world economy. The predictability of water quality thus plays a crucial role in managing ecosystems, making informed decisions, and hence proper environmental management. This study addresses these challenges by proposing an effective machine learning methodology applied to the "Water Quality" public dataset. The methodology models the dataset for prediction and classification analysis with high values of the evaluation parameters such as accuracy, sensitivity, and specificity. The proposed methodology is based on two novel approaches: (a) the SMOTE method to deal with unbalanced data and (b) carefully applied classical machine learning models. This paper uses Random Forests, Decision Trees, XGBoost, and Support Vector Machines because they can handle large datasets, cope with skewed datasets, and provide high accuracy in water quality classification. A key contribution of this work is the use of custom sampling strategies within the SMOTE approach, which significantly enhanced performance metrics and improved class imbalance handling. The results demonstrate significant improvements in predictive performance, achieving the highest reported metrics: accuracy (98.92% vs. 96.06%), sensitivity (98.3% vs. 71.26%), and F1 score (98.37% vs. 79.74%) using the XGBoost model. These improvements underscore the effectiveness of our custom SMOTE sampling strategies in addressing class imbalance. The findings contribute to environmental management by enabling ecology specialists to develop more accurate strategies for monitoring, assessing, and managing drinking water quality, ensuring better ecosystem and public health outcomes.
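The core SMOTE step — synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbours — can be sketched in numpy. The paper's custom sampling strategies (and the imbalanced-learn machinery typically used in practice) are not reproduced here; the four-point minority set is illustrative.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate synthetic minority samples by linear interpolation between a
    randomly chosen point and a random one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                   # position along the segment
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like(X_min, n_new=6)
print(synth.shape)
```

Because every synthetic point lies on a segment between two real minority points, SMOTE densifies the minority region rather than inventing samples outside it.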
Funding: The authors would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025).
Abstract: Automated and accurate movie genre classification is crucial for content organization, recommendation systems, and audience targeting in the film industry. Although most existing approaches focus on audiovisual features such as trailers and posters, text-based classification remains underexplored despite its accessibility and semantic richness. This paper introduces the Genre Attention Model (GAM), a deep learning architecture that integrates transformer models with a hierarchical attention mechanism to extract and leverage contextual information from movie plots for multi-label genre classification. To assess its effectiveness, we evaluate multiple transformer-based models, including Bidirectional Encoder Representations from Transformers (BERT), A Lite BERT (ALBERT), Distilled BERT (DistilBERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), XLNet, and Decoding-enhanced BERT with Disentangled Attention (DeBERTa). Experimental results demonstrate the superior performance of the DeBERTa-based GAM, which employs a two-tier hierarchical attention mechanism: word-level attention highlights key terms, while sentence-level attention captures critical narrative segments, ensuring a refined and interpretable representation of movie plots. Evaluated on three benchmark datasets, Trailers12K, Large Movie Trailer Dataset-9 (LMTD-9), and MovieLens37K, GAM achieves micro-average precision scores of 83.63%, 83.32%, and 83.34%, respectively, surpassing state-of-the-art models. Additionally, GAM is computationally efficient, requiring just 6.10 Giga Floating Point Operations Per Second (GFLOPS), making it a scalable and cost-effective solution. These results highlight the growing potential of text-based deep learning models in genre classification and GAM's effectiveness in improving predictive accuracy while maintaining computational efficiency. With its robust performance, GAM offers a versatile and scalable framework for content recommendation, film indexing, and media analytics, providing an interpretable alternative to traditional audiovisual-based classification techniques.
Abstract: The rapid growth of digital data necessitates advanced natural language processing (NLP) models like BERT (Bidirectional Encoder Representations from Transformers), known for its superior performance in text classification. However, BERT's size and computational demands limit its practicality, especially in resource-constrained settings. This research compresses the BERT base model for Bengali emotion classification through knowledge distillation (KD), pruning, and quantization techniques. Despite Bengali being the sixth most spoken language globally, NLP research in this area is limited. Our approach addresses this gap by creating an efficient BERT-based model for Bengali text. We explored 20 combinations of KD, quantization, and pruning, resulting in improved speedup, fewer parameters, and reduced memory size. Our best results demonstrate significant improvements in both speed and efficiency. For instance, in the case of mBERT, we achieved a 3.87× speedup and a 4× compression ratio with a Distil+Prune+Quant combination that reduced parameters from 178 M to 46 M, while the memory size decreased from 711 MB to 178 MB. These results offer scalable solutions for NLP tasks in various languages and advance the field of model compression, making these models suitable for real-world applications in resource-limited environments.
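Two of the three compression techniques can be illustrated directly in numpy: magnitude pruning zeroes the smallest weights, and symmetric uniform 8-bit quantization maps floats to small integers plus one scale factor. Knowledge distillation needs a training loop and is omitted; the random weight matrix is a stand-in for a real layer.

```python
import numpy as np

def prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w, bits=8):
    """Symmetric uniform quantization: ints in [-127, 127] plus one scale."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))        # stand-in for one weight matrix
w_pruned = prune(w)
q, scale = quantize(w_pruned)
recon = q * scale                    # dequantized view used at inference
print(np.mean(w_pruned == 0), q.dtype)
```

Storing `q` (1 byte per weight) plus one float scale instead of float32 weights is where the roughly 4× memory reduction comes from; pruning adds sparsity on top.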
Funding: Ongoing Research Funding Program (ORF-2025-488), King Saud University, Riyadh, Saudi Arabia.
Abstract: Effectively handling imbalanced datasets remains a fundamental challenge in computational modeling and machine learning, particularly when class overlap significantly deteriorates classification performance. Traditional oversampling methods often generate synthetic samples without considering density variations, leading to redundant or misleading instances that exacerbate class overlap in high-density regions. To address these limitations, we propose the Wasserstein Generative Adversarial Network with Variational Density Estimation (WGAN-VDE), a computationally efficient density-aware adversarial resampling framework that enhances minority class representation while strategically reducing class overlap. The originality of WGAN-VDE lies in its density-aware sample refinement, which ensures that synthetic samples are positioned in underrepresented regions, thereby improving class distinctiveness. By applying structured feature representation, targeted sample generation, and density-based selection mechanisms, the proposed framework ensures the generation of well-separated and diverse synthetic samples, improving class separability and reducing redundancy. The experimental evaluation on 20 benchmark datasets demonstrates that this approach outperforms 11 state-of-the-art rebalancing techniques, achieving superior results in F1-score, Accuracy, G-Mean, and AUC metrics. These results establish the proposed method as an effective and robust computational approach, suitable for diverse engineering and scientific applications involving imbalanced data classification and computational modeling.
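The density-aware selection idea — keep only synthetic candidates that fall in low-density regions of the existing data — can be sketched with a plain Gaussian kernel density estimate. The WGAN generator itself is not reproduced; random candidates stand in for generator output, and the bandwidth and median cutoff are illustrative assumptions.

```python
import numpy as np

def kde_density(points, query, bandwidth=0.5):
    """Plain Gaussian kernel density estimate at the query locations."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2).mean(axis=1)

rng = np.random.default_rng(0)
minority = rng.normal(0, 1, size=(100, 2))      # existing minority points
candidates = rng.uniform(-3, 3, size=(200, 2))  # stand-in for generator output
dens = kde_density(minority, candidates)
# Keep the half of the candidates lying in the least dense regions,
# steering synthetic samples away from already crowded areas.
keep = candidates[dens <= np.median(dens)]
print(keep.shape)
```

The retained candidates necessarily have a lower average estimated density than the full candidate set, which is the "underrepresented regions" property the abstract describes.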
Abstract: Medical image classification is crucial in disease diagnosis, treatment planning, and clinical decision-making. We introduce a novel medical image classification approach that integrates Bayesian Random Semantic Data Augmentation (BSDA) with a Vision Mamba-based model for medical image classification (MedMamba), enhanced by residual connection blocks; we name the resulting model BSDA-Mamba. BSDA augments medical image data semantically, enhancing the model's generalization ability and classification performance. MedMamba, a deep learning-based state space model, excels in capturing long-range dependencies in medical images. By incorporating residual connections, BSDA-Mamba further improves feature extraction capabilities. Through comprehensive experiments on eight medical image datasets, we demonstrate that BSDA-Mamba outperforms existing models in accuracy, area under the curve, and F1-score. Our results highlight BSDA-Mamba's potential as a reliable tool for medical image analysis, particularly in handling diverse imaging modalities from X-rays to MRI. The open-sourcing of our model's code and datasets will facilitate the reproduction and extension of our work.
Funding: Technology Research and Development Plan Program of China State Railway Group Co., Ltd. (No. Q2024T001) and the Foundation of China Academy of Railway Sciences Co., Ltd. (No. 2024YJ259).
Abstract: Purpose–This study aims to enhance the accuracy of key entity extraction from railway accident report texts and address challenges such as complex domain-specific semantics, data sparsity, and strong inter-sentence semantic dependencies. A robust entity extraction method tailored for accident texts is proposed. Design/methodology/approach–The method is implemented through a dual-branch multi-task mutual learning model named R-MLP, which jointly performs entity recognition and accident phase classification. The model leverages a shared BERT encoder to extract contextual features and incorporates a sentence span indexing module to align feature granularity. A cross-task mutual learning mechanism is also introduced to strengthen semantic representation. Findings–R-MLP effectively mitigates the impact of semantic complexity and data sparsity in domain entities and enhances the model's ability to capture inter-sentence semantic dependencies. Experimental results show that R-MLP achieves a maximum F1-score of 0.736 in extracting six types of key railway accident entities, significantly outperforming baseline models such as RoBERTa and MacBERT. Originality/value–This demonstrates the proposed method's superior generalization and accuracy in domain-specific entity extraction tasks, confirming its effectiveness and practical value.
Funding: National Natural Science Foundation of China (42271421).
Abstract: Texture analysis methods offer substantial advantages and potential in examining macro-topographic features of dunes. Despite these advantages, comprehensive approaches that integrate digital elevation models (DEM) with quantitative texture features have not been fully developed. This study introduced an automatic classification framework for dunes that combines texture and topographic features and validated it on a typical aeolian landform, the coastal dunes of the Namib Desert. A three-stage approach was outlined: (1) segmentation of dune units using digital terrain analysis; (2) extraction and quantification of six texture features (angular second moment, contrast, correlation, variance, entropy, and inverse difference moment) from the gray-level co-occurrence matrix (GLCM); and (3) integration of the texture-topographic indices into a random forest (RF) model for classification. The results show that the RF model fused with texture features can accurately identify dune morphological characteristics; through accuracy evaluation and remote sensing image verification, the overall accuracy reaches 78.0% (kappa coefficient = 0.72), outperforming traditional spectral-based methods. In addition, spatial analysis reveals that coastal dunes exhibit complex texture patterns, with texture homogeneity closely linked to dune-type transitions. Specifically, homogeneous textures correspond to simple and stable forms such as barchans, while heterogeneous textures are associated with complex or composite dunes. The complexity, periodicity, and directionality of texture features are highly consistent with the spatial distribution of dunes. Validation using high-resolution remote sensing imagery (Sentinel-2) further confirms that the method effectively clusters similar dunes and distinguishes different dune types. Additionally, the dune classification results correspond well with changes in near-surface wind regimes. Overall, the findings suggest that texture features derived from DEM can accurately capture the dynamic characteristics of dune morphology, offering a novel approach for automatic dune classification. Compared with traditional methods, the developed approach facilitates large-scale and high-precision dune mapping while reducing the workload of manual interpretation, thus advancing research on aeolian geomorphology.
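The GLCM and three of the six statistics named above can be computed from first principles in a few lines of numpy (libraries such as scikit-image offer optimized versions). The 4×4 image quantized to four gray levels is illustrative, not DEM data.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=4):
    """Gray-level co-occurrence matrix for one pixel offset, normalized to probabilities."""
    g = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            g[img[y, x], img[y + dy, x + dx]] += 1
    return g / g.sum()

def texture_features(p):
    """Three of the six GLCM statistics used in the study."""
    i, j = np.indices(p.shape)
    asm = np.sum(p ** 2)                          # angular second moment
    contrast = np.sum(p * (i - j) ** 2)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return asm, contrast, entropy

# Toy gray image quantized to 4 levels (stand-in for a DEM-derived raster)
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
p = glcm(img)
print(texture_features(p))
```

On this blocky image the horizontal co-occurrences are spread evenly over six level pairs, so ASM is 1/6, contrast is 1/3, and entropy is ln 6 — a texture with larger homogeneous patches would push ASM up and entropy down, the homogeneity signal the study links to simple dune forms.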
Funding: Equinor for financing the R&D project; the Institute of Science and Technology of Petroleum Geophysics of Brazil for supporting this research.
Abstract: We apply stochastic seismic inversion and Bayesian facies classification for porosity modeling and igneous rock identification in the presalt interval of the Santos Basin. This integration of seismic and well-derived information enhances reservoir characterization. Stochastic inversion and Bayesian classification are powerful tools because they permit addressing the uncertainties in the model. We used the ES-MDA algorithm to achieve realizations equivalent to the P10, P50, and P90 percentiles of acoustic impedance, a novel method for acoustic inversion in the presalt. The facies were divided into five: reservoir 1, reservoir 2, tight carbonates, clayey rocks, and igneous rocks. To deal with the overlaps in the acoustic impedance values of the facies, we included geological information using an a priori probability, indicating that structural highs are reservoir-dominated. To illustrate our approach, we conducted porosity modeling using facies-related rock-physics models for rock-physics inversion in an area with a well drilled in a coquina bank and evaluated the thickness and extension of an igneous intrusion near the carbonate-salt interface. The modeled porosity and the classified seismic facies are in good agreement with those observed in the wells. Notably, the coquina bank presents an improvement in porosity towards the top. The a priori probability model was crucial for limiting the clayey rocks to the structural lows. In Well B, the hit rate of the igneous rock in the three scenarios is higher than 60%, showing an excellent thickness-prediction capability.
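The Bayesian step — combining an impedance likelihood per facies with a spatially varying prior — reduces to Bayes' rule. A numpy sketch with two facies: the Gaussian likelihoods, impedance value, and prior weights are illustrative assumptions, not Santos Basin statistics.

```python
import numpy as np

def classify(impedance, priors, means, stds):
    """Posterior facies probabilities from Gaussian likelihoods and a prior."""
    like = np.exp(-0.5 * ((impedance - means) / stds) ** 2) / stds
    post = priors * like
    return post / post.sum()

means = np.array([6.0, 7.5])   # illustrative impedances: reservoir vs. tight carbonate
stds = np.array([0.5, 0.5])
# Same ambiguous impedance, different structural position:
post_high = classify(6.7, np.array([0.8, 0.2]), means, stds)  # structural high
post_low = classify(6.7, np.array([0.2, 0.8]), means, stds)   # structural low
print(post_high.round(2), post_low.round(2))
```

With overlapping likelihoods the prior is decisive: the same impedance is classified as reservoir on a structural high but not on a low, which is exactly how the a priori model keeps clayey rocks confined to the lows.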
Abstract: Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study aims to leverage a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of COVID-19 outbreak-related tweets. Diverse topic modeling approaches were employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the utilization of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for conducting topic modeling, given its proficiency in capturing intricate textual relationships. In the context of text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT showed outstanding performance, achieving accuracy and weighted F1-score values exceeding 80%. These results significantly surpassed other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60 to 70 percent.
Abstract: Sentence classification is the process of categorizing a sentence based on its context. Sentence categorization requires more semantic highlights than other tasks such as dependency parsing, which requires more syntactic elements. Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence, recognizing its progress, or comparing impacts. An ensemble of pre-trained language models is taken up here to classify sentences from a conversation corpus. The conversational sentences are classified into four categories: information, question, directive, and commission. These classification label sequences are used to analyze conversation progress and predict the pecking order of the conversation. An ensemble of Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT pretraining Approach (RoBERTa), Generative Pre-trained Transformer (GPT), DistilBERT, and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models is trained on the conversation corpus, and a hyperparameter tuning approach is carried out for better performance on sentence classification. This Ensemble of Pre-trained Language Models with Hyperparameter Tuning (EPLM-HT) system is trained on an annotated conversation dataset. The proposed approach outperformed the base BERT, GPT, DistilBERT, and XLNet transformer models. The proposed ensemble model with the fine-tuned parameters achieved an F1_score of 0.88.
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC) and New Brunswick Innovation Foundation (NBIF) for the financial support of the global project. These granting agencies did not contribute to the design of the study or the collection, analysis, and interpretation of data.
Abstract: Machine learning (ML) and data mining are used in various fields such as data analysis, prediction, image processing, and especially healthcare. Researchers in the past decade have focused on applying ML and data mining to generate conclusions from historical data in order to improve healthcare systems by making predictions about outcomes. Using ML algorithms, researchers have developed applications for decision support, analyzed clinical aspects, extracted informative information from historical data, predicted outcomes, and categorized diseases, helping physicians make better decisions. Substantial differences are observed among women depending on their region and social lives. Due to these differences, scholars have been encouraged to conduct studies at a local level in order to better understand the factors that affect maternal health and the expected child. In this study, an ensemble modeling technique is applied to classify birth outcomes as either cesarean section (C-section) or normal delivery. A voting ensemble model for the classification of a birth dataset was built using Random Forest (RF), Gradient Boosting, Extra Trees, and Bagging classifiers as base learners. The voting ensemble of the proposed classifiers provides the best accuracy, i.e., 94.78%, compared to the individual classifiers. Ensemble models make ML algorithms more accurate by reducing variance and classification errors. When a suitable classification model has been developed for birth classification, decision support systems can be created to enable clinicians to gain in-depth insights into the patterns in the datasets. Developing such a system will not only allow health organizations to improve maternal health assessment processes, but also open doors for interdisciplinary research in two different fields in the region.
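The voting step itself is simple: collect each base learner's prediction and take the majority (hard voting). A minimal stdlib sketch with stand-in rule-based "learners" — the feature names and rules are invented for illustration and are not the study's trained RF, boosting, extra-trees, and bagging models.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over the base learners' predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]

# Stand-in base learners: each maps a feature dict to a delivery class.
learners = [
    lambda x: "c_section" if x["prior_csection"] else "normal",
    lambda x: "c_section" if x["age"] > 35 else "normal",
    lambda x: "c_section" if x["complications"] > 1 else "normal",
]

patient = {"prior_csection": True, "age": 29, "complications": 2}
votes = [f(patient) for f in learners]
print(hard_vote(votes))  # → c_section (2 of 3 learners agree)
```

Because each learner errs on different samples, the majority vote cancels part of the individual variance, which is why the ensemble's 94.78% beats every single classifier.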
Abstract: We present an approach to classify medical text at the sentence level automatically. Given the inherent complexity of medical text classification, we employ adapters based on pre-trained language models to extract information from medical text, facilitating more accurate classification while minimizing the number of trainable parameters. Extensive experiments conducted on various datasets demonstrate the effectiveness of our approach.
Abstract: A new arrival and departure flight classification method based on the transitive closure algorithm (TCA) is proposed. Firstly, fuzzy set theory and the transitive closure algorithm are introduced. Then four different factors are selected to establish the flight classification model, and a method is given to calculate the delay cost for each class. Finally, the proposed method is applied to the sequencing problems of flights in a terminal area, and the results are compared with those of the traditional classification method (TCM). Results show that the new classification model is effective in reducing the expenses of flight delays, thus optimizing the sequences of arrival and departure flights and improving the efficiency of air traffic control.
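The transitive closure of a fuzzy similarity relation is computed by repeatedly squaring the matrix under max-min composition until it stops changing; thresholding the closure (a lambda-cut) then partitions the flights into classes. A numpy sketch with an illustrative 3-flight similarity matrix:

```python
import numpy as np

def max_min_compose(a, b):
    """Fuzzy composition: (a o b)[i, j] = max_k min(a[i, k], b[k, j])."""
    return np.max(np.minimum(a[:, :, None], b[None, :, :]), axis=1)

def transitive_closure(r):
    """Square the fuzzy relation until it is transitive (r o r == r)."""
    while True:
        r2 = max_min_compose(r, r)
        if np.array_equal(r2, r):
            return r
        r = r2

R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
T = transitive_closure(R)
classes = T >= 0.8   # lambda-cut: flights i, j share a class if T[i, j] >= 0.8
print(T)
print(classes.astype(int))
```

Here the closure raises the weak direct similarity T[0, 2] from 0.3 to 0.5 via flight 1, and the 0.8 cut groups flights 0 and 1 together while leaving flight 2 in its own class.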
Abstract: This paper presents a fuzzy logic approach to efficiently perform unsupervised character classification for improvement in the robustness, correctness, and speed of a character recognition system. The characters are first split into eight typographical categories. The classification scheme uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. The fuzzy unsupervised character classification, which is natural in the repre...
Funding: National Natural Science Foundation of China (No. 61171057, No. 61503345); Science Foundation for North University of China (No. 110246); Specialized Research Fund for Doctoral Program of Higher Education of China (No. 20121420110004); International Office of Shanxi Province Education Department of China; and Basic Research Project in Shanxi Province (Young Foundation).
Abstract: In order to reduce the amount of data storage and improve the processing capacity of the system, this paper proposes a new classification method for data sources that combines the phase synchronization model from network clustering with the cloud model. Firstly, taking the data source as a complex network, after the topology of the network is obtained, the cloud model of each node's data is determined by the fuzzy analytic hierarchy process (AHP). Secondly, by calculating the expectation, entropy, and hyper-entropy of the cloud model, a comprehensive coupling strength is obtained and then regarded as the edge weight of the topology. Finally, a distribution curve is obtained by iterating the phase of each node by means of the phase synchronization model, completing the classification of the data source. This method not only provides convenience for the storage, cleaning, and compression of data, but also improves the efficiency of data analysis.
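The phase-iteration step can be sketched with a Kuramoto-style update: each node's phase is pulled toward its neighbours in proportion to the edge weight (here standing in for the cloud-model coupling strength), so strongly coupled nodes phase-lock and fall into the same class. The weight matrix below is illustrative; node 2 is left uncoupled so it keeps its own phase.

```python
import numpy as np

def iterate_phases(theta, w, coupling=0.5, steps=200):
    """Kuramoto-style update: theta_i += K * sum_j w[i, j] * sin(theta_j - theta_i)."""
    for _ in range(steps):
        diff = theta[None, :] - theta[:, None]   # diff[i, j] = theta_j - theta_i
        theta = theta + coupling * (w * np.sin(diff)).sum(axis=1)
    return theta

# Edge weights standing in for cloud-model coupling strengths:
# nodes 0 and 1 are strongly coupled; node 2 is disconnected.
w = np.array([[0.0, 0.9, 0.0],
              [0.9, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
theta0 = np.random.default_rng(0).uniform(0, 2 * np.pi, 3)
theta = iterate_phases(theta0, w)
# Nodes 0 and 1 phase-lock into one class; node 2 stays apart.
print(np.round(np.sin(theta - theta[0]), 3))
```

Reading off the final phase distribution (the "distribution curve" of the abstract) groups nodes with matching phases into one data-source class.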