Funding: the National Natural Science Foundation of China (Nos. 61073117 and 61175046); the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province (No. KJ2013A016); the Academic Innovative Research Projects of Anhui University Graduate Students (No. 10117700183); the 211 Project of Anhui University.
Abstract: Mining from ambiguous data is an important task in data mining. This paper addresses one such task, the multi-instance problem. In the multi-instance problem, each pattern is a labeled bag consisting of a number of unlabeled instances. A bag is negative if all of its instances are negative, and positive if it contains at least one positive instance. Because the instances in a positive bag are not labeled, each positive bag is ambiguous. The goal is to classify unseen bags. Most existing multi-instance algorithms try to find the true positive instances in positive bags, convert the multi-instance problem into a supervised one, and derive the labels of test bags from the predicted labels of their instances. This paper approaches multi-instance data from another point of view: excluding the false positive instances in positive bags and predicting the label of an unknown bag as a whole. We propose an algorithm called Multi-Instance Covering kNN (MICkNN) for mining multi-instance data. Briefly, a constructive covering algorithm is first used to restructure the original multi-instance data. The kNN algorithm is then applied to discriminate the false positive instances. In the test stage, an unseen bag is labeled directly according to its similarity to the sphere neighbors obtained from the previous two steps. Experimental results show that the proposed algorithm is competitive with most state-of-the-art multi-instance methods in both classification accuracy and running time.
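The bag-labeling rule stated in the abstract, and the idea of labeling a whole bag by its nearest neighbors, can be illustrated with a minimal sketch. This is not MICkNN: the constructive covering step is omitted, and the bag-to-bag distance used here (the minimum distance over all instance pairs, as in Citation-kNN style methods) is an assumption chosen for brevity. The function names and toy bags are hypothetical.

```python
from collections import Counter
from math import dist


def bag_label(instance_labels):
    """Standard multi-instance rule: a bag is positive iff any instance is positive."""
    return any(instance_labels)


def min_bag_distance(bag_a, bag_b):
    """Assumed bag distance: the smallest distance between any pair of instances."""
    return min(dist(a, b) for a in bag_a for b in bag_b)


def knn_bag_predict(train_bags, train_labels, query_bag, k=3):
    """Label a whole bag by majority vote among its k nearest training bags."""
    order = sorted(
        range(len(train_bags)),
        key=lambda i: min_bag_distance(train_bags[i], query_bag),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): positive bags hold one instance near (5, 5)
# plus a "false positive" instance near the origin.
train_bags = [
    [(0.0, 0.0), (0.2, 0.1)],  # negative bag
    [(0.1, 0.3), (0.3, 0.2)],  # negative bag
    [(0.1, 0.1), (5.0, 5.0)],  # positive bag
    [(0.4, 0.0), (5.2, 4.9)],  # positive bag
]
train_labels = [False, False, True, True]

positive_query = [(0.3, 0.3), (5.1, 5.0)]
negative_query = [(0.1, 0.0), (0.2, 0.3)]
```

Note that the instances near the origin inside the positive bags sit as close to a negative query as the genuinely negative bags do. These are the false positive instances that the abstract proposes to exclude before labeling.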
Abstract: The use of machine learning algorithms to identify characteristics of Distributed Denial of Service (DDoS) attacks has emerged as a powerful approach in cybersecurity. DDoS attacks, which aim to overwhelm a network or service with a flood of malicious traffic, pose significant threats to online systems. Traditional methods of detection and mitigation often struggle to keep pace with the evolving nature of these attacks. Machine learning, with its ability to analyze vast amounts of data and recognize patterns, offers a robust solution to this challenge. The aim of the paper is to demonstrate the application of two ensemble ML algorithms, K-Means and KNN, as a dual clustering mechanism with PySpark that reaches 99% accuracy. Used together, the algorithms identify distinctive features of DDoS attacks that closely reflect real traffic, which makes them a good combination for this purpose. After preprocessing, both algorithms, running on PySpark, reached 99% accuracy when tuned on the features of a large DDoS dataset. The semi-supervised dataset tabulates traffic anomalies in terms of the packet size distribution in relation to Flow Duration. By training K-Means clustering and then applying KNN to the dataset, the algorithms learn to characterize traffic activity in more detail and make its density easy to display. The study evaluates the effectiveness of K-Means clustering combined with KNN as ensemble algorithms that adapt well to detecting complex patterns. Overall, the results indicate that ML-based approaches significantly improve detection rates compared to traditional methods. Furthermore, ensemble learning methods, which combine two or more models to improve prediction accuracy, perform well in handling the complexity and variability of big data sets, especially when implemented with PySpark. The findings suggest that the gain in accuracy derives from newer software designed to reflect real conditions. However, challenges remain in the deployment of these systems, including the need for large, high-quality datasets and the potential for adversarial attacks that attempt to deceive the ML models. Future research should continue to improve the robustness and efficiency of combined algorithms and integrate them with existing security frameworks to provide comprehensive protection against DDoS attacks and in other areas. The dataset was originally created by the University of New Brunswick to analyze DDoS data. It is based on logs from the university's servers, which recorded various DoS attacks over the publicly available period; in total it has 80 attributes and is 6.40 GB in size. The binary label in the last column, which distinguishes normal traffic from attack traffic, is central to the final classification. Further analysis is left for future investigation. Finally, malicious-traffic alert software, for example, should be trained on the dependence of packet influx on Flow Duration, which provides a mathematical basis for working with averages. In achieving such high accuracy, the project serves as an illustration (referenced in the form of excerpts from my Google Colab account) of many tuning attempts. Cybersecurity calls for more work on the characteristics of brute-force attack traffic and normal traffic overall, since most of our investments as humans are digitally based, in work, recreational, and social environments.
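One way to read "training K-Means and then applying KNN" is cluster-then-classify: cluster the flows first, then let KNN vote only among labeled flows in the query's cluster. The sketch below follows that reading as an assumption; the abstract does not spell out how the two algorithms are combined. It uses plain Python rather than PySpark so it runs without a cluster, and the two features (normalized packet size and Flow Duration), the toy values, and all function names are hypothetical.

```python
from collections import Counter
from math import dist


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic, evenly spaced initialization."""
    step = max(1, len(points) // k)
    centroids = list(points[::step][:k])
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep empty ones unchanged.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def cluster_then_knn(train, labels, centroids, query, k=3):
    """KNN vote restricted to training flows that share the query's cluster."""
    cid = nearest_centroid(query, centroids)
    members = [i for i, p in enumerate(train) if nearest_centroid(p, centroids) == cid]
    if not members:  # fall back to all training flows if the cluster is empty
        members = list(range(len(train)))
    members.sort(key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in members[:k])
    return votes.most_common(1)[0][0]


# Toy flows (hypothetical): (normalized packet size, normalized Flow Duration)
train = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15),
         (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
labels = ["benign", "benign", "benign", "attack", "attack", "attack"]

centroids = kmeans(train, k=2)
print(cluster_then_knn(train, labels, centroids, (0.82, 0.88)))
```

On a dataset of the size described (6.40 GB), the clustering step would normally be delegated to a distributed implementation; the logic of the two stages stays the same.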
Funding: This research was supported by the National Natural Science Foundation of China (Grant Numbers 61702310 and 61401260).
Abstract: The implementation of content-based image retrieval (CBIR) mainly depends on two key technologies: image feature extraction and image feature matching. In this paper, we extract color features based on the Global Color Histogram (GCH) and texture features based on the Gray Level Co-occurrence Matrix (GLCM). To obtain effective and representative image features, we apply a fuzzy mathematical algorithm in both color feature extraction and texture feature extraction. The fuzzy color feature vector and the fuzzy texture feature vector are then combined in a prescribed way to form the comprehensive fuzzy feature vector of the image. Image feature matching mainly depends on the similarity between two image feature vectors. We propose a novel similarity measure based on k-Nearest Neighbors (kNN) and a fuzzy mathematical algorithm (SBkNNF). First, the k nearest neighbor images of the query image are found in the image data set according to an appropriate similarity measure. Next, the k similarity values between the query image and its k neighbor images form a new k-dimensional fuzzy feature vector for the query image. Likewise, the k similarity values between the retrieved image and the k neighbor images of the query image form a new k-dimensional fuzzy feature vector for the retrieved image. Finally, the similarity between the two k-dimensional fuzzy feature vectors is calculated with a fuzzy similarity algorithm to measure the similarity between the query image and the retrieved image. Extensive experiments are carried out on three data sets: the WANG data set, the Corel-5k data set, and the Corel-10k data set. The experimental results show that the proposed CBIR system outperforms the other CBIR systems in retrieval performance.
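The four matching steps above can be sketched as follows. The abstract does not name the base similarity or the fuzzy similarity algorithm, so both are assumptions here: histogram intersection stands in for the base measure, and the ratio of summed minima to summed maxima stands in for the fuzzy similarity. The function names and the three-bin toy histograms are hypothetical.

```python
def hist_intersection(a, b):
    """Assumed base similarity between two normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def fuzzy_similarity(u, v):
    """Assumed fuzzy similarity: sum of minima over sum of maxima, in [0, 1]."""
    num = sum(min(x, y) for x, y in zip(u, v))
    den = sum(max(x, y) for x, y in zip(u, v))
    return num / den if den else 1.0


def sbknn_similarity(query, candidate, database, k, base_sim=hist_intersection):
    """Compare query and candidate through the query's k nearest neighbor images."""
    # Step 1: the k images in the data set most similar to the query.
    neighbors = sorted(database, key=lambda img: base_sim(query, img), reverse=True)[:k]
    # Step 2: k-dimensional vector of similarities for the query image.
    q_vec = [base_sim(query, n) for n in neighbors]
    # Step 3: k-dimensional vector for the candidate, using the SAME neighbors.
    c_vec = [base_sim(candidate, n) for n in neighbors]
    # Step 4: fuzzy similarity between the two k-dimensional vectors.
    return fuzzy_similarity(q_vec, c_vec)


# Toy feature vectors (hypothetical three-bin histograms)
database = [
    (0.7, 0.2, 0.1),
    (0.6, 0.3, 0.1),
    (0.1, 0.2, 0.7),
    (0.2, 0.1, 0.7),
    (0.3, 0.4, 0.3),
]
query = (0.65, 0.25, 0.10)

close_score = sbknn_similarity(query, (0.6, 0.3, 0.1), database, k=2)
far_score = sbknn_similarity(query, (0.1, 0.2, 0.7), database, k=2)
print(close_score > far_score)
```

Because both vectors are measured against the query's own neighborhood, a candidate scores highly only if it relates to that neighborhood the way the query does, not merely if it is close to the query in the raw feature space.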