An imbalanced dataset often challenges machine learning, particularly classification methods. Underrepresented minority classes can result in biased and inaccurate models. The Synthetic Minority Over-Sampling Technique (SMOTE) was developed to address this problem. Over time, several weaknesses of SMOTE in generating synthetic minority-class data have been identified, such as overlapping, noise, and small disjuncts. However, prior studies generally focus on only one of these weaknesses, either noise or overlapping. This study therefore addresses both issues simultaneously, tackling noise and overlapping in SMOTE-generated data through a combined approach of filtering, clustering, and distance modification. Filtering removes minority-class samples (noise) located in majority-class regions, using the k-NN method. This Noise Reduction (NR) step, which removes samples considered noise before applying SMOTE, has a positive impact on overcoming data imbalance. Clustering establishes decision boundaries by partitioning the data into clusters, allowing SMOTE with a modified distance metric to generate minority-class samples within each cluster. This combination of clustering and distance modification aims to minimize overlap in the synthetic minority data that could otherwise introduce noise.
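The k-NN noise-filtering step can be sketched as follows. This is a minimal illustration in plain NumPy, assuming a minority sample is treated as noise when all of its k nearest neighbours belong to the majority class; the paper's exact removal criterion and choice of k may differ.

```python
import numpy as np

def knn_noise_filter(X, y, minority=1, k=3):
    """Drop minority samples whose k nearest neighbours (Euclidean) are all
    majority-class: a simple k-NN noise-removal rule, used here for illustration."""
    keep = np.ones(len(y), dtype=bool)
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)   # distances to every sample
        d[i] = np.inf                          # exclude the sample itself
        nn = np.argsort(d)[:k]                 # indices of k nearest neighbours
        if np.all(y[nn] != minority):          # surrounded by majority => noise
            keep[i] = False
    return X[keep], y[keep]

# toy data: one minority point sits inside the majority region
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.05, 0.05],                    # minority point among majority
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_f, y_f = knn_noise_filter(X, y, minority=1, k=3)
print(len(y_f), int(y_f.sum()))  # 7 samples remain, 3 of them minority
```

The isolated minority point at (0.05, 0.05) is removed because all three of its nearest neighbours are majority-class, while the minority points in the well-separated region survive.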
The proposed method is called "NR-Clustering SMOTE" and balances the data in three stages: (1) filtering out minority-class samples close to the majority class (noise) using the k-NN method; (2) clustering the data with K-means to establish decision boundaries by partitioning it into several clusters; (3) applying SMOTE oversampling with the Manhattan distance within each cluster. Test results indicate that NR-Clustering SMOTE achieves the best performance across all evaluation metrics for classification methods such as Random Forest, SVM, and Naïve Bayes, compared to the original data and traditional SMOTE. It improves accuracy by 15.34% on the Pima dataset and 20.96% on the Haberman dataset compared to SMOTE-LOF; by 3.16% (Pima) and 13.24% (Haberman) compared to Radius-SMOTE; and by 15.56% (Pima) and 19.84% (Haberman) compared to RN-SMOTE. These results imply that the proposed method delivers consistent performance improvements over traditional SMOTE and its recent variants (SMOTE-LOF, Radius-SMOTE, and RN-SMOTE) on imbalanced binary-class health data.
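Stages (2) and (3) above can be sketched as follows. This is a minimal NumPy illustration on a toy two-blob dataset; the `kmeans` and `smote_manhattan` helpers are simplified stand-ins for the method, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k=2, iters=50):
    """Tiny K-means (stage 2): partition the data into clusters."""
    C = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C) ** 2).sum(axis=2), axis=1)
        C = np.array([X[labels == c].mean(axis=0) if (labels == c).any() else C[c]
                      for c in range(k)])
    return labels

def smote_manhattan(X_min, n_new, k=2):
    """SMOTE-style interpolation toward a Manhattan-nearest neighbour (stage 3)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.abs(X_min - X_min[i]).sum(axis=1)       # Manhattan (L1) distance
        d[i] = np.inf
        nn = np.argsort(d)[:min(k, len(X_min) - 1)]
        j = rng.choice(nn)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

# toy data: two well-separated regions, minority class (1) underrepresented in each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.2, 0.8],        # region A
              [9, 9], [9, 10], [10, 9], [10, 10], [9.5, 9.5]])   # region B
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])

labels = kmeans(X, k=2)
X_parts, y_parts = [X], [y]
for c in range(2):                       # oversample the minority per cluster
    n_maj = ((labels == c) & (y == 0)).sum()
    X_min = X[(labels == c) & (y == 1)]
    synth = smote_manhattan(X_min, n_new=n_maj - len(X_min))
    X_parts.append(synth)
    y_parts.append(np.ones(len(synth), dtype=int))
X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
print(np.bincount(y_bal))                # classes balanced: [6 6]
```

Because interpolation happens only between minority neighbours inside the same cluster, no synthetic sample is generated between the two regions, which is exactly the overlap the clustering step is meant to prevent.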
Spam detection has long been a research hotspot in big data and artificial intelligence. This paper presents a complete data-processing workflow for a spam dataset from the Kaggle platform, covering data preprocessing, text feature construction, and the building of spam detection models. Because ham and spam are extremely imbalanced in the dataset, the SMOTE algorithm is used to augment the spam class, after which four learning algorithms (logistic regression, SVM, decision tree, and random forest) are used to build detection models. The performance of the four models is compared before and after SMOTE, in particular accuracy, precision, recall, F1-score, and the confusion matrix. The experimental results show that SMOTE effectively improves spam detection accuracy, and the SMOTE-based spam detection model performs well.
Funded by Universitas Negeri Malang, contract number 4.4.841/UN32.14.1/LT/2024.