Abstract: The inherent class imbalance within textual data poses a significant challenge for machine learning-based techniques, as the available data often fails to adequately represent all classes. This scarcity of instances becomes even more challenging when regions of different classes overlap. To address these limitations, this study introduces a refinement model for textual data classification with imbalanced datasets. The proposed approach, refined classification using overlap data with bagging and genetic algorithms (ReCO-BGA), aims to refine the classification predictions through a two-tier classification process. First, a bagging model is employed, incorporating three distinct classes: majority, minority, and an additional extracted class specifically for overlapping instances. Second, we propose to rectify the predicted overlap instances using a genetic-based oversampling technique. To evaluate the performance of ReCO-BGA, we conducted several experiments, focusing on two practical use cases: hate speech detection and sentiment analysis. The results demonstrated the effectiveness of the proposed method and showed that it outperforms state-of-the-art methods.
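The building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how overlap instances are extracted, which base learners the bagging model uses, or how the genetic operators are defined. The sketch assumes that texts are already encoded as numeric feature vectors, that an instance counts as "overlap" when most of its k nearest neighbours carry a different label, that the bagged learners are 1-nearest-neighbour voters, and that oversampling uses uniform crossover with Gaussian mutation. The names `mark_overlap`, `bagged_predict`, and `genetic_oversample` are hypothetical.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mark_overlap(X, y, k=3):
    """Relabel a point as 'overlap' when most of its k nearest neighbours
    carry a different label (an assumed criterion, not from the paper)."""
    labels = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: dist2(xi, X[j]))[:k]
        disagree = sum(y[j] != y[i] for j in nearest)
        labels.append("overlap" if 2 * disagree > k else y[i])
    return labels


def bagged_predict(X_train, y_train, x, rng, n_models=15):
    """Bagging with 1-nearest-neighbour base learners (assumed): each learner
    votes using its own bootstrap sample of the training set."""
    votes = Counter()
    n = len(X_train)
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        nearest = min(sample, key=lambda j: dist2(x, X_train[j]))
        votes[y_train[nearest]] += 1
    return votes.most_common(1)[0][0]


def genetic_oversample(parents, n_new, rng, mutation_scale=0.05):
    """Create n_new synthetic vectors: uniform crossover of two distinct
    parents, followed by Gaussian mutation of every coordinate."""
    children = []
    for _ in range(n_new):
        p1, p2 = rng.sample(parents, 2)
        children.append(
            [rng.choice(genes) + rng.gauss(0.0, mutation_scale)
             for genes in zip(p1, p2)]
        )
    return children


# Toy 2-D data standing in for text embeddings (illustrative only).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [0.4, 0.6]]
y = ["majority"] * 5 + ["minority"] * 4

rng = random.Random(0)
labels3 = mark_overlap(X, y)                              # three-class labels
first_tier = bagged_predict(X, labels3, [5.2, 5.1], rng)  # tier-one vote
minority_points = [x for x, label in zip(X, y) if label == "minority"]
synthetic = genetic_oversample(minority_points, 3, rng)   # extra minority data
```

In a full two-tier pipeline along these lines, a query that the first tier labels "overlap" would be passed to a second classifier trained on the overlap instances with their original labels, after the under-represented class has been topped up with `genetic_oversample`. How ReCO-BGA actually performs that rectification step is not specified in the abstract.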