In this paper, sixty-eight research articles published between 2000 and 2017, together with textbooks, were reviewed, all of which employed four classification algorithms as their main statistical tools: K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN). The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performance, strengths, and shortcomings of each algorithm were examined, and a conclusion was reached on which one performs best. The literature reviewed indicates that RF is too sensitive to small changes in the training dataset, is occasionally unstable, and tends to overfit the model. KNN is easy to implement and understand, but it has the major drawback of becoming significantly slower as the size of the data grows, and the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise or overtraining, which shows their ability to deal with unbalanced data. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF. Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter-tuning procedure, its high computational complexity, the numerous NN architectures to choose from, and the many algorithms used for training, most researchers recommend SVM and RF as easier and more widely used methods that repeatedly achieve high accuracy and are often faster to implement.
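To make the kind of comparison described above concrete, the following sketch shows one possible way to benchmark the four classifiers on accuracy and runtime with scikit-learn. The dataset, hyperparameters, and cross-validation setup are illustrative assumptions only, not the protocol used in the reviewed studies.

```python
# Illustrative benchmark of KNN, SVM, RF, and NN classifiers.
# Dataset, hyperparameters, and scoring are assumptions for demonstration only.
from time import perf_counter

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "NN": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}

for name, clf in classifiers.items():
    start = perf_counter()
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    elapsed = perf_counter() - start
    print(f"{name}: mean accuracy = {scores.mean():.3f}, total time = {elapsed:.2f}s")
```

A setup like this makes the trade-offs in the abstract visible in practice: the timing loop exposes how fit-and-predict cost grows with data size for KNN and NN, while the cross-validated scores give a rough accuracy comparison under equal conditions.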
An all-k-nearest-neighbor (AkNN) query finds the k nearest neighbors for each query object. This problem arises naturally in many areas, such as GIS (geographic information systems), multimedia retrieval, and recommender systems. To support the various data types and flexible distance metrics involved in real applications, we study AkNN retrieval in metric spaces, namely, metric AkNN (MAkNN) search. We consider the case where underlying indexes on the query set and the object set may not exist, which is natural in many scenarios. For example, the query set and the object set could be the results of other queries, and thus the underlying indexes cannot be built in advance. To support MAkNN search on datasets without any underlying index, we propose an efficient disk-based algorithm, termed the Partition-Based MAkNN Algorithm (PMA), which follows a partition-search framework and employs a series of pruning rules to accelerate the search. In addition, we extend our techniques to tackle an interesting variant of MAkNN queries, i.e., metric self-AkNN (MSAkNN) search, where the query set is identical to the object set. Extensive experiments using both real and synthetic datasets demonstrate the effectiveness of our pruning rules and the efficiency of the proposed algorithms, compared with state-of-the-art MAkNN and MSAkNN algorithms.
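As a point of reference, the sketch below gives a minimal brute-force MAkNN baseline over a user-supplied distance function. It only illustrates the query semantics (each query object receives its k nearest objects under an arbitrary metric); it is not the partition-based PMA algorithm or its pruning rules, and all names and sample data in it are hypothetical.

```python
# Minimal brute-force MAkNN baseline: for every query object, scan the whole
# object set and keep the k nearest under a user-supplied metric. This is only
# a naive reference implementation, not the PMA algorithm, which partitions
# the data and prunes distance computations.
import heapq
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def brute_force_maknn(
    queries: Sequence[T],
    objects: Sequence[T],
    k: int,
    dist: Callable[[T, T], float],
) -> list[list[tuple[float, T]]]:
    """Return, for each query, its k nearest objects as (distance, object) pairs."""
    results = []
    for q in queries:
        # heapq.nsmallest keeps the k pairs with the smallest distances.
        neighbors = heapq.nsmallest(
            k, ((dist(q, o), o) for o in objects), key=lambda pair: pair[0]
        )
        results.append(neighbors)
    return results

# Example with 2-D points under the Euclidean metric (any metric works).
if __name__ == "__main__":
    euclidean = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    qs = [(0.0, 0.0), (5.0, 5.0)]
    objs = [(1.0, 1.0), (2.0, 2.0), (6.0, 5.0), (0.5, -0.5)]
    for q, nbrs in zip(qs, brute_force_maknn(qs, objs, k=2, dist=euclidean)):
        print(q, "->", nbrs)
```

Setting the query list equal to the object list in this sketch corresponds to the MSAkNN variant; a practical algorithm such as PMA would additionally exclude each object from its own neighbor list and avoid the quadratic scan through partitioning and pruning.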
Funding: This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2015CB352502, the National Natural Science Foundation of China under Grant Nos. 61522208, 61379033, and 61472348, and the Fundamental Research Funds for the Central Universities of China under Grant Nos. 2015XZZX004-18 and 2015XZZX005-07.