In this paper, sixty-eight research articles published between 2000 and 2017, as well as textbooks, that employed four classification algorithms: K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN), as their main statistical tools were reviewed. The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performance, strengths, and shortcomings of each algorithm were examined, and a conclusion was drawn on which one performs best. It was evident from the literature reviewed that RF is too sensitive to small changes in the training dataset, is occasionally unstable, and tends to overfit the model. KNN is easy to implement and understand but has the major drawback of becoming significantly slower as the size of the data grows, and the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise and overtraining, which shows their ability to deal with unbalanced data. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF.
Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm; however, because of its time-consuming parameter-tuning procedure, high computational complexity, the numerous NN architectures to choose from, and the many algorithms used for training, most researchers recommend SVM and RF as easier, more widely usable methods that repeatedly achieve highly accurate results and are often faster to implement.
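Two of the review's observations about KNN, that every prediction scans the whole training set and that the choice of K must be tuned by hand, can be seen in a minimal sketch. The classifier below is illustrative only and is not taken from any of the reviewed papers; the toy points, labels, and query are invented for the example.

```python
# Minimal k-nearest-neighbor classifier (illustrative sketch, not any of the
# reviewed implementations). Each prediction scans the full training set,
# which is why KNN slows down as the data grows, and K must be tuned by hand.
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # O(n log n) per query: distance to every training point, then sort.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two small clusters; the query sits much closer to class "a".
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
labels = ["a", "a", "a", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5), k=3))  # prints "a"
```

With k=5 the vote would include both "b" points, showing how the result can shift with K even on tiny data.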
Clustering is a key technique for analyzing large-scale, high-dimensional vector data. In recent years, the density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) has been widely applied in data analysis because it requires no pre-specified number of clusters, can discover complex cluster structures, and effectively identifies noise points. However, existing density-based clustering algorithms incur extremely high time costs on high-dimensional vector data and suffer from the curse of dimensionality, making them difficult to deploy in real-world scenarios. Moreover, as information technology advances, the scale of high-dimensional vector data has grown rapidly, so CPU-based clustering of high-dimensional vectors faces even greater challenges in running time and scalability. To address this, a GPU-accelerated high-dimensional vector clustering algorithm is proposed that accelerates DBSCAN by introducing a K-nearest-neighbor (KNN) graph index. First, a GPU-accelerated parallel KNN-graph construction algorithm is designed, which significantly reduces the cost of building the KNN-graph index. Second, a K-means-tree partitioning algorithm based on inter-level parallelism and a parallel clustering algorithm based on breadth-first search and the core-neighbor graph are proposed, restructuring the computation flow of DBSCAN and achieving highly concurrent vector clustering. Finally, extensive experiments on real vector datasets compare the proposed method with existing approaches. The results show that, while preserving clustering accuracy, the proposed method improves the efficiency of large-scale vector clustering by 5.7x to 2822.5x.
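The core idea, driving DBSCAN-style clustering from a KNN graph with breadth-first search over core points, can be sketched on the CPU. This is a simplified stand-in for the paper's GPU kernels: the brute-force graph build replaces the parallel construction step, and `k`, `eps`, and `min_pts` are illustrative parameters, not values from the paper.

```python
# CPU sketch of the general approach: use a k-nearest-neighbor graph to drive
# DBSCAN-style clustering. The paper's GPU kernels are replaced here by
# brute-force loops; parameter values are illustrative only.
from collections import deque
import math

def knn_graph(points, k):
    """Brute-force kNN graph (the step the paper builds in parallel on GPU)."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append(dists[:k])
    return graph

def dbscan_via_knn(points, k, eps, min_pts):
    graph = knn_graph(points, k)
    # A point is "core" if at least min_pts of its kNN fall within eps.
    core = [sum(d <= eps for d, _ in nbrs) >= min_pts for nbrs in graph]
    labels = [-1] * len(points)          # -1 marks noise
    cluster = 0
    for seed in range(len(points)):
        if not core[seed] or labels[seed] != -1:
            continue
        # BFS over the core-neighbor graph expands one cluster at a time.
        queue = deque([seed])
        labels[seed] = cluster
        while queue:
            i = queue.popleft()
            for d, j in graph[i]:
                if d <= eps and labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:          # only core points keep expanding
                        queue.append(j)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan_via_knn(pts, k=3, eps=2.0, min_pts=2))  # two clusters + one noise point
```

Note that with small `k` the kNN lists only approximate the full eps-neighborhoods, which is the trade-off the KNN-graph index makes in exchange for speed.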
An all-k-nearest-neighbor (AkNN) query finds k nearest neighbors for each query object. This problem arises naturally in many areas, such as GIS (geographic information systems), multimedia retrieval, and recommender systems. To support the various data types and flexible distance metrics involved in real applications, we study AkNN retrieval in metric spaces, namely, metric AkNN (MAkNN) search. We consider the case where underlying indexes on the query set and the object set do not exist, which is natural in many scenarios: for example, the query set and the object set could be the results of other queries, so indexes cannot be built in advance. To support MAkNN search on datasets without any underlying index, we propose an efficient disk-based algorithm, termed the Partition-Based MAkNN Algorithm (PMA), which follows a partition-search framework and employs a series of pruning rules to accelerate the search. In addition, we extend our techniques to tackle an interesting variant of MAkNN queries, metric self-AkNN (MSAkNN) search, where the query set is identical to the object set. Extensive experiments using both real and synthetic datasets demonstrate the effectiveness of our pruning rules and the efficiency of the proposed algorithms compared with state-of-the-art MAkNN and MSAkNN algorithms.
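As a point of reference for what PMA improves on, a brute-force MAkNN baseline needs nothing but a distance function, which is what makes the metric-space setting so general. The sketch below is this naive baseline, not the paper's algorithm; edit distance is used as one example of a non-Euclidean metric, and the string data is invented for illustration.

```python
# Brute-force metric AkNN baseline (NOT the paper's PMA algorithm): for every
# query object, scan all objects under an arbitrary metric. Edit distance is
# one example of a metric space where vector indexes do not apply.
import heapq

def levenshtein(a, b):
    """Edit distance: a classic metric over strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def aknn(queries, objects, k, dist):
    """For each query, return its k nearest objects as (distance, object) pairs."""
    return [heapq.nsmallest(k, ((dist(q, o), o) for o in objects)) for q in queries]

objects = ["cat", "cart", "dog", "dot"]
result = aknn(["cot"], objects, k=2, dist=levenshtein)
print(result[0])  # the two objects closest to "cot" in edit distance
```

This baseline costs O(|Q| x |O|) distance computations per run; PMA's partition-search framework and pruning rules exist precisely to avoid most of these computations without any pre-built index.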
Funding: This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2015CB352502, the National Natural Science Foundation of China under Grant Nos. 61522208, 61379033, and 61472348, and the Fundamental Research Funds for the Central Universities of China under Grant Nos. 2015XZZX004-18 and 2015XZZX005-07.