Spam detection has long been a research focus in big data and artificial intelligence. This paper presents a complete data-processing pipeline for a spam data set from the Kaggle platform, covering data preprocessing, text feature construction, and construction of the spam detection model. Because ham (legitimate mail) and spam are extremely imbalanced in the data set, the SMOTE algorithm is used to augment the spam samples; four learning algorithms (logistic regression, support vector machine (SVM), decision tree, and random forest) are then used to build the detection models. The performance of the four models is compared before and after SMOTE, in particular accuracy, precision, recall, F1-Score, and the confusion matrix. The experimental results show that SMOTE effectively improves the accuracy of spam detection and that the SMOTE-based spam detection models perform well.
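The abstract does not say how SMOTE was implemented. As a minimal sketch of the core idea only (each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority-class neighbours), the following uses nothing but the standard library. The function name `smote`, the parameters `k` and `seed`, and the toy `spam` feature vectors are assumptions made for illustration, not taken from the paper.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: create n_new synthetic points from minority samples.

    Requires at least two minority samples. Each new point lies on the
    segment between a randomly chosen sample and one of its k nearest
    minority-class neighbours.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-D feature vectors for the minority (spam) class.
spam = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(spam, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class ratio.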
LightGBM is an open-source, distributed, high-performance gradient boosting (GB) framework developed by Microsoft. Its advantages include fast training speed, efficient parallelism, and the ability to handle large volumes of data. Using the open credit card data set from Taiwan, this paper compares five data mining methods: logistic regression, SVM, neural network, XGBoost, and LightGBM. The results show that LightGBM achieves the best AUC, F1-Score, and correct prediction ratio, with XGBoost second. This indicates that LightGBM and XGBoost perform well in predicting categorical response variables and have good application value in the big data era.
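AUC, one of the metrics compared above, can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and are unrelated to the paper's data.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

This pairwise form is quadratic in the number of samples, so it is suited to explanation rather than to large data sets, where a rank-based computation would be used instead.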
The rise of fake news on social media has had a detrimental effect on society. Researchers in this area have previously carried out numerous performance evaluations of classifiers for detecting fake news. In this study, we first assessed the performance of 14 different classifiers. Second, we examined how soft voting and hard voting classifiers performed when combining distinct individual classifiers. Finally, we used heuristics to create 9 stacking classifier models. Performance was assessed using F1 score, precision, recall, and accuracy. Models 6 and 7 achieved the best accuracy of 96.13, at the cost of greater computational complexity. Other individual classifiers were also tested for benchmarking purposes.
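The difference between the hard and soft voting mentioned above can be shown with a small sketch: hard voting takes the majority of the predicted labels, whereas soft voting sums the predicted class probabilities and picks the largest total. The function names `hard_vote` and `soft_vote` and the three hypothetical classifier outputs are assumptions for illustration; the study's actual classifiers and settings are not given in the abstract.

```python
from collections import Counter

def hard_vote(label_predictions):
    """Majority vote over one predicted label per classifier."""
    return Counter(label_predictions).most_common(1)[0][0]

def soft_vote(prob_predictions):
    """Sum per-class probabilities across classifiers and take the largest."""
    totals = {}
    for probs in prob_predictions:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)

# Hypothetical outputs of three classifiers for a single news item.
probs = [
    {"fake": 0.45, "real": 0.55},
    {"fake": 0.45, "real": 0.55},
    {"fake": 0.95, "real": 0.05},
]
labels = [max(p, key=p.get) for p in probs]  # each classifier's top label

print(hard_vote(labels))  # real: two of three classifiers say "real"
print(soft_vote(probs))   # fake: one very confident vote outweighs two weak ones
```

The example is constructed so that the two schemes disagree, which is the case where the choice between them matters.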
Microfinance institutions in Kenya play a unique role in promoting financial inclusion, loans, and savings provision, especially for low-income individuals and small-scale entrepreneurs. Despite these benefits, most of their products and programs in Machakos County have been declining because of repayment challenges, which threatens their financial ability to extend further credit. This could be attributed to ineffective credit scoring models that cannot capture the nuanced, non-linear repayment behavior and patterns of loan applicants. The research objective was to enhance credit risk scoring for microfinance institutions in Machakos County using supervised machine learning algorithms. The study adopted a mixed research design under a supervised machine learning approach. It randomly sampled 6771 loan application account records and their repayment history. RStudio and the Python programming language were used for data pre-processing and analysis. The logistic regression algorithm, XG Boosting, and the random forest ensemble method were used. The evaluation metrics included accuracy, area under the curve (AUC), and F1-Score. According to the study findings, XG Boosting was the best performer, with 83.3% accuracy and a Brier score of 0.202. The study recommended developing a legal framework to govern the ethical and open use of machine learning assessment. It also recommended similar research using different machine learning algorithms, locations, and institutions to ascertain the validity, reliability, and generalizability of the findings.
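The Brier score reported above measures the quality of predicted probabilities: it is the mean squared difference between the predicted probability of the positive outcome and the observed 0/1 outcome, so lower is better. A minimal sketch follows; the function name `brier_score` and the toy labels and probabilities are illustrative assumptions and do not come from the study's loan records.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical default indicators (1 = default) and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]
brier = brier_score(y_true, y_prob)
print(round(brier, 4))  # 0.0925

# Reference point: always predicting 0.5 scores 0.25 regardless of the labels.
baseline = brier_score(y_true, [0.5] * len(y_true))
```

The 0.25 baseline of a constant 0.5 prediction gives a rough yardstick for reading a reported value such as 0.202.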
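Because traditional physical analysis methods cannot solve the problem of predicting conductor galloping, this work applies machine learning algorithms in combination. Existing historical galloping data are screened and preprocessed, and useful information is mined from them. The one-class SVM algorithm is used to address the lack of negative samples in the galloping data. The Bagging algorithm from ensemble learning is used to build the classifier: the data are randomly sampled into different groups of data sets that are trained independently of one another, which avoids overfitting the galloping data and improves the noise resistance and generalization ability of the machine learning algorithm. The model is validated with k-fold cross-validation, and the F1-score is used to describe the performance of the conductor galloping early-warning model, which verifies the effectiveness of the method for galloping prediction.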
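The validation step described above, k-fold cross-validation scored with F1, can be sketched in plain Python as follows. The helper names `k_fold_indices` and `f1_score`, the contiguous (unshuffled) fold layout, and the toy labels are assumptions for illustration; the abstract does not give the value of k or any implementation details.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

folds = list(k_fold_indices(10, 3))
print([len(test) for _, test in folds])  # [4, 3, 3]

# Hypothetical galloping labels (1 = galloping) and predictions for one fold.
f1 = f1_score([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
print(f1)  # 0.75
```

In practice a model would be trained on each `train_idx` subset, evaluated on the matching `test_idx` subset, and the k F1 values averaged; the data are usually shuffled before being split.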