Abstract: This paper investigates the impact of reducing feature-vector dimensionality on the performance of machine learning (ML) models. Dimensionality reduction and feature selection techniques can improve the computational efficiency, accuracy, robustness, transparency, and interpretability of ML models. In high-dimensional data, where features outnumber training instances, redundant or irrelevant features introduce noise, hindering model generalization and accuracy. This study explores the effects of dimensionality reduction methods on binary classifier performance using network traffic data for cybersecurity applications. The paper examines how dimensionality reduction techniques influence classifier operation and performance across diverse performance metrics for seven ML models. Four dimensionality reduction methods are evaluated: principal component analysis (PCA), singular value decomposition (SVD), univariate feature selection (UFS) using chi-square statistics, and feature selection based on mutual information (MI). The results suggest that direct feature selection can be more effective than data-projection methods in some applications. Direct selection offers lower computational complexity and, in some cases, superior classifier performance. This study emphasizes that the evaluation and comparison of binary classifiers depend on specific performance metrics, each providing insight into different aspects of ML model operation. Using open-source network traffic data, this paper demonstrates that dimensionality reduction can be a valuable tool. It reduces computational overhead, enhances model interpretability and transparency, and maintains or even improves the performance of trained classifiers. The study also reveals that direct feature selection can be a more effective strategy than feature engineering in specific scenarios.
Funding: Funded by the US Army Combat Capabilities Development Command (CCDC) Aviation & Missile Center, https://www.avmc.army.mil/ (accessed on 5 February 2024), contract number W31P4Q-18-D-0002, through the Georgia Tech Research Institute and AAMU-RISE.
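The contrast the abstract draws between data projection (PCA, SVD) and direct feature selection (chi-square UFS, MI) can be sketched with scikit-learn. This is a minimal illustration only: the synthetic dataset, the target dimensionality k = 10, and the logistic-regression probe are assumptions for demonstration, not the paper's actual data, models, or metrics.

```python
# Sketch of the four dimensionality-reduction strategies named in the
# abstract, applied to a synthetic stand-in for high-dimensional data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary-classification data (placeholder for network traffic).
X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

k = 10  # target dimensionality (illustrative choice)
reducers = {
    "PCA (projection)": PCA(n_components=k),
    "SVD (projection)": TruncatedSVD(n_components=k),
    "UFS chi2 (selection)": SelectKBest(chi2, k=k),
    "MI (selection)": SelectKBest(mutual_info_classif, k=k),
}

for name, red in reducers.items():
    Z_tr = red.fit_transform(X_tr, y_tr)  # projections ignore y; selectors use it
    Z_te = red.transform(X_te)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, clf.predict(Z_te)):.3f}")
```

One practical difference the sketch surfaces: the selection methods keep k of the original columns, so the reduced features remain directly interpretable, whereas PCA and SVD produce linear combinations of all inputs, which is one reason the abstract links direct selection to transparency.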