Imbalanced multiclass datasets pose challenges for machine learning algorithms. They often contain minority classes that are important for accurate predictions. However, when the data is sparsely distributed and overlaps with data points from other classes, it introduces noise. As a result, existing resampling methods may fail to preserve the original data patterns, further degrading data quality and reducing model performance. This paper introduces Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO), a hybrid method that integrates a data displacement strategy with a resampling technique to achieve data balance. It begins by computing the average distance of noisy data points to their neighbors and adjusting their positions toward the center before applying random oversampling. Extensive evaluations compare 14 alternatives on nine classifiers across synthetic and 20 real-world datasets with varying imbalance ratios. The evaluation was structured into two test groups. First, the effects of k-neighbor variations and distance metrics are evaluated, followed by a comparison of resampled data distributions against alternatives and, finally, a determination of the most suitable oversampling technique for data balancing. Second, the overall performance of the NDESO algorithm is assessed, focusing on G-mean and statistical significance. The results demonstrate that the method is robust to a wide range of variations in these parameters and achieves an average G-mean score of 0.90, which is among the highest. Additionally, it attains the lowest mean rank of 2.88, indicating statistically significant improvements over existing approaches. This advantage underscores its potential for effectively handling data imbalance in practical scenarios.
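The G-mean reported above is commonly defined, for multiclass problems, as the geometric mean of the per-class recalls. The abstract does not state which variant the authors use, so the following is a minimal sketch under that assumption; the function name `g_mean` and the toy labels are illustrative only.

```python
from collections import Counter
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed multiclass G-mean definition)."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / support[c] for c in support]
    return prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced problem: 8 majority samples, 2 minority samples.
y_true = ["major"] * 8 + ["minor"] * 2
always_major = ["major"] * 10                      # ignores the minority class
half_minor = ["major"] * 8 + ["minor", "major"]    # finds one of two minority samples

print(g_mean(y_true, always_major))
print(g_mean(y_true, half_minor))
```

Because the recalls are multiplied, a classifier that ignores a minority class scores zero regardless of its overall accuracy, which is why the metric suits imbalanced evaluations.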
Since the efficiency of photovoltaic (PV) power is closely related to the weather, many PV enterprises install weather instruments to monitor the working state of the PV power system. With the development of soft measurement technology, the instrumental method seems obsolete and involves high cost. This paper proposes a novel method for predicting weather types based on PV power data and partial meteorological data. With this method, the weather types are deduced by data analysis instead of by weather instruments. Better fault detection is obtained by using support vector machines (SVM) and comparing the predicted and the actual weather. The weather prediction model is established by a direct SVM for training multiclass predictors. Although SVM is suitable for classification, the classification results depend on the type of kernel, the kernel parameters, and the soft margin coefficient, which are difficult to choose. In this paper, these parameters are optimized by the particle swarm optimization (PSO) algorithm so that good prediction results can be achieved. Prediction results show that this method is feasible and effective.
In this study, a salting-out assisted liquid-liquid extraction method combined with high performance liquid chromatography with diode array detection (SALLE-HPLC-DAD) was developed and validated for the simultaneous analysis of carbaryl, atrazine, propazine, chlorothalonil, dimethametryn and terbutryn in environmental water samples. Parameters affecting the extraction efficiency, such as type and volume of extraction solvent, sample volume, salt type and amount, centrifugation speed and time, and sample pH, were optimized. Under the optimum extraction conditions the method was linear over the range of 10 - 100 μg/L (carbaryl), 8 - 100 μg/L (atrazine), 7 - 100 μg/L (propazine) and 9 - 100 μg/L (chlorothalonil, terbutryn and dimethametryn), with correlation coefficients (R²) between 0.99 and 0.999. Limits of detection and quantification ranged from 2.0 to 2.8 μg/L and 6.7 to 9.5 μg/L, respectively. The extraction recoveries obtained for ground, lake and river waters ranged from 75.5% to 106.6%, with intra-day and inter-day relative standard deviations lower than 3.4% for all the target analytes. None of the target analytes was detected in these samples. The proposed SALLE-HPLC-DAD method is therefore simple, rapid, cheap and environmentally friendly for the determination of the aforementioned herbicide, insecticide and fungicide residues in environmental water samples.
Support vector machines (SVMs) were initially designed for binary classification. How to effectively extend them to multiclass classification is still an ongoing research topic. A multiclass classifier, named DAGSVM^light, is constructed by combining the SVM^light algorithm with the directed acyclic graph SVM (DAGSVM) method. A new method is proposed to select the working set, which is identical to the working set selected by the SVM^light approach. Experimental results indicate that DAGSVM^light is competitive with DAGSMO and is more suitable for practical use. Owing to its good performance, it may be an especially useful tool for large-scale multiclass classification problems and may lead to more widespread use of SVMs in the engineering community.
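The DAGSVM decision procedure referred to above is usually described as a list elimination: each pairwise classifier compares the first and last remaining classes and discards the loser, so K classes need K - 1 evaluations. The sketch below illustrates only that evaluation step; the pairwise rules are hypothetical nearest-center stand-ins rather than trained SVMs, and names such as `dag_predict` and `make_rule` are assumptions for illustration.

```python
def dag_predict(x, classes, pairwise):
    """Walk the decision DAG: every pairwise test eliminates exactly one class."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise[(first, last)](x)
        if winner == first:
            remaining.pop()       # 'last' is eliminated
        else:
            remaining.pop(0)      # 'first' is eliminated
    return remaining[0]


# Hypothetical 1-D problem: each pairwise rule picks the nearer class center.
centers = {"low": 0.0, "mid": 1.0, "high": 2.0}
classes = list(centers)


def make_rule(a, b):
    # Stand-in for a trained binary SVM separating class a from class b.
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


pairwise = {
    (a, b): make_rule(a, b)
    for i, a in enumerate(classes)
    for b in classes[i + 1:]
}

print(dag_predict(0.9, classes, pairwise))  # mid
```

In a real system each entry of `pairwise` would wrap the decision function of a binary SVM trained on the two classes it separates.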
In this study, a miniaturized analytical technique based on high-density solvent-based dispersive liquid-liquid microextraction (HD-DLLME) was developed for the extraction of trace residues of multiclass pesticides, including three s-triazine herbicides, two organophosphate insecticides and two organochlorine fungicides, from environmental water and sugarcane juice samples. The analytical method was validated and found to offer good linearity (R² ≥ 0.991); repeatability varied from 0.73% to 5.28%, reproducibility varied from 1.14% to 8.74%, and the limit of detection ranged from 0.005 to 0.02 μg/L. Moreover, the accuracy of the optimized method was evaluated, and recoveries ranged from 80.39% to 114.05%. Application of this method to environmental waters and sugarcane juice samples indicated the presence of trace residues of ametryn in the lake water and sugarcane juice samples. Atrazine and ametryn were also detected in irrigation water.
It is quite common for both categorical and continuous covariates to appear in the data. However, most feature screening methods for ultrahigh-dimensional classification assume the covariates are continuous, so applicable feature screening methods are very limited. To handle this non-trivial situation, we propose a model-free feature screening procedure for ultrahigh-dimensional multi-classification with both categorical and continuous covariates. The proposed feature screening method is based on Gini impurity to evaluate the prediction power of covariates. Under certain regularity conditions, it is proved that the proposed screening procedure possesses the sure screening and ranking consistency properties. We demonstrate the finite-sample performance of the proposed procedure by simulation studies and illustrate it using real data analysis.
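To make the idea of scoring covariates by Gini impurity concrete, the sketch below ranks categorical covariates by how much they reduce the Gini impurity of the class label and keeps the top-ranked ones. This is an illustrative assumption about the form of the screening utility, not the statistic defined in the paper; the abstract does not say how continuous covariates are handled, so they are left out here. All names (`gini`, `gini_gain`, `screen`) are hypothetical.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(x, y):
    """Impurity of y minus the weighted impurity of y within each level of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n = len(y)
    within = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(y) - within


def screen(columns, y, keep):
    """Return the names of the `keep` covariates with the largest Gini gain."""
    scores = {name: gini_gain(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Toy data: one covariate determines the class, the other is unrelated to it.
y = [0, 0, 1, 1, 2, 2]
columns = {
    "noise": ["u", "v", "u", "v", "u", "v"],
    "informative": ["a", "a", "b", "b", "c", "c"],
}
print(screen(columns, y, keep=1))  # ['informative']
```

The score depends only on label proportions within each covariate level, which is what makes this kind of utility model-free.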
It is common for datasets to contain both categorical and continuous variables. However, many feature screening methods designed for high-dimensional classification assume that the variables are continuous, which limits their applicability in this mixed setting. To address this issue, we propose a model-free feature screening approach for ultra-high-dimensional multi-classification that can handle both categorical and continuous variables. The proposed method uses the Maximal Information Coefficient to assess the predictive power of the variables. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. To validate the approach, we conduct simulation studies and provide real data analysis examples that demonstrate its finite-sample performance. In summary, the proposed method offers a solution for effectively screening features in ultra-high-dimensional datasets with a mixture of categorical and continuous covariates.
Rapid and accurate optical cable fault detection is of paramount importance for ensuring the stability and reliability of optical networks. The emergence of novel faults in optical networks has introduced new challenges, significantly compromising their normal operation. Machine learning has emerged as a highly promising approach. Consequently, it is imperative to develop an automated and reliable algorithm that uses telemetry data acquired from Optical Time-Domain Reflectometers (OTDR) to enable real-time fault detection and diagnosis in optical fibers. In this paper, we introduce a multi-scale Convolutional Neural Network–Bidirectional Long Short-Term Memory (CNN-BiLSTM) deep learning model for accurate optical fiber fault detection. The proposed multi-scale CNN-BiLSTM comprises three variants: the Independent Multi-scale CNN-BiLSTM (IMC-BiLSTM), the Combined Multi-scale CNN-BiLSTM (CMC-BiLSTM), and the Shared Multi-scale CNN-BiLSTM (SMC-BiLSTM). These models employ convolutional kernels of varying sizes to extract spatial features from time-series data, while leveraging BiLSTM to enhance the capture of global event characteristics. Experiments were conducted using the publicly available OTDR_data dataset, and comparisons with existing methods demonstrate the effectiveness of our approach. The results show that IMC-BiLSTM, CMC-BiLSTM, and SMC-BiLSTM achieve (i) F1-scores of 97.37%, 97.25%, and 97.1% and (ii) accuracies of 97.36%, 97.23%, and 97.12%, respectively. These results surpass those of traditional techniques.
Imbalanced data is a common and serious problem in many biomedical classification tasks. It biases the training of classifiers and results in lower prediction accuracy for minority classes. This problem has attracted a lot of research interest in the past decade. Unfortunately, most research efforts concentrate only on 2-class problems. In this paper, we study a new method of formulating a multiclass Support Vector Machine (SVM) problem for imbalanced biomedical data to improve classification performance. The proposed method applies a cost-sensitive approach and the ramp loss function to the Crammer and Singer multiclass SVM formulation. Experimental results on multiple biomedical datasets show that the proposed solution can effectively address the problem when the datasets are noisy and highly imbalanced.
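The ramp loss mentioned above is usually written as a difference of two hinge functions, which caps the penalty that any single (possibly mislabeled) sample can contribute. The sketch below combines it with a Crammer-Singer style margin (true-class score minus the best rival score) and a per-class cost weight. The truncation point `s`, the cost values, and all function names are illustrative assumptions; the paper's exact formulation is not given in the abstract.

```python
def hinge(z, a=1.0):
    """Hinge function H_a(z) = max(0, a - z)."""
    return max(0.0, a - z)


def ramp(z, s=0.0):
    """Ramp loss H_1(z) - H_s(z): equals the hinge near the margin, capped at 1 - s."""
    return hinge(z, 1.0) - hinge(z, s)


def cs_margin(scores, y):
    """Crammer-Singer style margin: score of the true class minus the best rival."""
    rival = max(value for label, value in scores.items() if label != y)
    return scores[y] - rival


def weighted_ramp_loss(scores, y, class_cost, s=0.0):
    """Cost-sensitive ramp loss: errors on costly (minority) classes weigh more."""
    return class_cost[y] * ramp(cs_margin(scores, y), s)


# Hypothetical per-class scores for one sample whose true class is the rare "a".
scores = {"a": 0.2, "b": 1.0, "c": -0.5}
class_cost = {"a": 5.0, "b": 1.0, "c": 1.0}
print(weighted_ramp_loss(scores, "a", class_cost))

# A badly misclassified outlier: the hinge keeps growing, the ramp is capped.
print(hinge(-10.0), ramp(-10.0))
```

The cap is what gives robustness to noise: an outlier with margin -10 costs 11 under the hinge but at most 1 under the ramp with `s = 0`.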
Botnets based on the Domain Generation Algorithm (DGA) mechanism pose great challenges to current mainstream detection methods because of their strong concealment and robustness. However, the complexity of the DGA family and the imbalance of samples continue to impede research on DGA detection. In existing work, the sample size of each DGA family is regarded as the most important determinant of the resampling proportion; thus, differences in the characteristics of various samples are ignored, and the optimal resampling effect is not achieved. In this paper, a Long Short-Term Memory-based Property and Quantity Dependent Optimization (LSTM.PQDO) method is proposed. This method takes advantage of LSTM to automatically mine the comprehensive features of DGA domain names. It iterates the resampling proportion toward the optimal solution based on a comprehensive consideration of the original number and characteristics of the samples, heuristically searching for a better solution around the initial solution in the right direction; thus, dynamic optimization of the resampling proportion is realized. The experimental results show that the LSTM.PQDO method achieves better performance than existing models in overcoming the difficulties of unbalanced datasets; moreover, it can serve as a reference for resampling tasks in similar scenarios.
Head pose estimation has been considered an important and challenging task in computer vision. In this paper we propose a novel method to estimate head pose based on a deep convolutional neural network (DCNN) for 2D face images. We design an effective and simple method to roughly crop the face from the input image, maintaining the individual-relative facial features ratio. The method can be used in various poses. Two convolutional neural networks are then set up to train the head pose classifier and compared with each other. The simpler one has six layers. It performs well on seven yaw poses but is somewhat unsatisfactory when two pitch poses are mixed in. The other has eight layers and more pixels in the input layer. It performs better on more poses and with more training samples. Before training the network, two strategies, shift and zoom, are applied to prepare the training samples. Finally, the feature extraction filters are optimized together with the weights of the classification component through training, to minimize the classification error. Our method has been evaluated on the CAS-PEAL-R1, CMU PIE, and CUBIC FacePix databases. It performs better than state-of-the-art methods for head pose estimation.
Since traditional machine learning methods are sensitive to skewed distributions and do not consider the characteristics of multiclass imbalance problems, the skewed distribution of multiclass data poses a major challenge to machine learning algorithms. To tackle such issues, we propose a new splitting criterion for decision trees based on the one-against-all-based Hellinger distance (OAHD). OAHD includes two crucial elements. First, the one-against-all scheme is integrated into the computation of the Hellinger distance, thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem. Second, for the multiclass imbalance problem, the distribution and the number of distinct classes are taken into account, and a modified Gini index is designed. Moreover, we give theoretical proofs for the properties of OAHD, including skew insensitivity and the ability to seek a purer node in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental and statistical results show that OAHD significantly improves performance compared with five other well-known decision trees in terms of Precision, F-measure, and multiclass area under the receiver operating characteristic curve (MAUC). Moreover, the Friedman and Nemenyi tests are used to demonstrate the statistical advantage of OAHD over the five other decision trees.
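The one-against-all idea above can be sketched as follows: for each class, treat it as "positive" and all other classes as "negative", compute the binary Hellinger distance between the two groups' distributions over the branches of a candidate split, and then combine the per-class distances. Averaging the distances is an assumption made here for illustration; the abstract does not state how OAHD aggregates them, nor how the modified Gini index enters. Function names are hypothetical.

```python
from math import sqrt


def hellinger(pos_counts, neg_counts):
    """Binary Hellinger distance between the branch distributions of two groups."""
    n_pos, n_neg = sum(pos_counts), sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        return 0.0
    return sqrt(sum(
        (sqrt(p / n_pos) - sqrt(q / n_neg)) ** 2
        for p, q in zip(pos_counts, neg_counts)
    ))


def one_against_all_hellinger(branches, classes):
    """Mean of the class-vs-rest Hellinger distances (aggregation is an assumption).

    `branches` is a list with one {class: count} dict per branch of the split.
    """
    distances = []
    for c in classes:
        pos = [b.get(c, 0) for b in branches]
        neg = [sum(v for k, v in b.items() if k != c) for b in branches]
        distances.append(hellinger(pos, neg))
    return sum(distances) / len(distances)


# Two-class check of skew insensitivity: scaling one class leaves the value unchanged.
balanced = [{"a": 9, "b": 1}, {"a": 1, "b": 9}]
skewed = [{"a": 900, "b": 1}, {"a": 100, "b": 9}]
print(one_against_all_hellinger(balanced, ["a", "b"]))
print(one_against_all_hellinger(skewed, ["a", "b"]))
```

Each group is normalized by its own size, so the binary distance depends only on within-group proportions. That is the source of the skew insensitivity in the two-class case shown here; with three or more classes, rescaling one class changes the composition of the "rest" groups, so the sketch above makes no such claim there.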
Digital display instrument identification is a crucial approach for automating the collection of digital display data. In this study, we propose a digital display area detection algorithm, CTPNpro, to address the problem of recognizing multiclass digital display instruments. We developed a multiclass digital display instrument recognition algorithm by combining it with a character recognition network built from a convolutional neural network and a bidirectional variable-length long short-term memory (LSTM) network. First, the CTPNpro digital display region detection framework was designed based on the CTPN network architecture by introducing feature fusion and a residual structure. Next, the digital display instrument identification network was constructed based on a convolutional neural network using a two-way LSTM and connectionist temporal classification (CTC) for variable-length sequences. Finally, an automatic calibration system for digital display instruments was built, and a multiclass digital display instrument dataset was constructed by sampling in the system. We compared the performance of the CTPNpro algorithm with other methods using this dataset to validate the effectiveness and robustness of the proposed algorithm.
Precisely understanding the business relationships between autonomous systems (ASes) is essential for studying the Internet structure. To date, many inference algorithms, which mainly focus on binary classification into peer-to-peer (P2P) and provider-to-customer (P2C) relationships, have been proposed to classify AS relationships and have achieved excellent results. However, business-based sibling relationships and structure-based exchange relationships have become an increasingly non-negligible part of the Internet market in recent years. Existing algorithms often struggle to infer these relationships because of their high similarity to P2P or P2C relationships. In this study, we focus on the multiclassification of AS relationships for the first time. We first summarize the differences between AS relationships in terms of structural and attribute features, and the reasons why multiclass relationships are difficult to infer. We then introduce new features and propose a graph convolutional network (GCN) framework, AS-GCN, to solve this multiclassification problem in complex scenarios. The proposed framework considers the global network structure and local link features concurrently. Experiments on real Internet topological data validate the effectiveness of AS-GCN. The proposed method achieves comparable results on the binary classification task and outperforms a series of baselines on the more difficult multiclassification task, with overall metrics above 95%.
Funding (PV weather-type prediction paper): Supported by the National Natural Science Foundation of China (61433004, 61473069); the IAPI Fundamental Research Funds (2013ZCX14); the Development Project of Key Laboratory of Liaoning Province; and the Enterprise Postdoctoral Fund Projects of Liaoning Province.
Funding (optical fiber fault detection paper): Supported in part by the Guangxi Science and Technology Department Key Research and Development Project (Grant No. 23026149) and in part by the Guangxi Key Research and Development Plan Project (Grant No. AB24010073).
Funding (imbalanced biomedical data paper): Supported by GSU Molecular Basis of Disease Graduate Fellow, 2011-2012.
Funding (DGA detection paper): Partially funded by the National Natural Science Foundation of China (No. 61272447); the National Entrepreneurship & Innovation Demonstration Base of China (No. C700011); and the Key Research & Development Project of Sichuan Province of China (No. 2018G20100).
Funding: Project supported by the National Key Scientific Instrument and Equipment Development Project of China (No. 2013YQ49087903), the National Natural Science Foundation of China (No. 61402307), and the Educational Commission of Sichuan Province, China (No. 15ZA0007).
Abstract: Head pose estimation has been considered an important and challenging task in computer vision. In this paper we propose a novel method to estimate head pose based on a deep convolutional neural network (DCNN) for 2D face images. We design an effective and simple method to roughly crop the face from the input image while maintaining the individual-relative ratio of facial features. The method can be used in various poses. Two convolutional neural networks are then set up to train the head pose classifier and are compared with each other. The simpler one has six layers. It performs well on seven yaw poses but is somewhat unsatisfactory when two pitch poses are mixed in. The other has eight layers and more pixels in its input layer. It performs better on more poses and with more training samples. Before training the network, two reasonable strategies, shift and zoom, are applied to prepare training samples. Finally, the feature extraction filters are optimized together with the weights of the classification component through training, to minimize the classification error. Our method has been evaluated on the CAS-PEAL-R1, CMU PIE, and CUBIC FacePix databases. It performs better than state-of-the-art methods for head pose estimation.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61802085 and 61563012), the Guangxi Provincial Natural Science Foundation, China (Nos. 2021GXNSFAA220074 and 2020GXNSFAA159038), the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation, China (No. 2018A-04), and the Guangxi Key Laboratory of Trusted Software Foundation, China (No. kx202011).
Abstract: Since traditional machine learning methods are sensitive to skewed distributions and do not consider the characteristics of multiclass imbalance problems, the skewed distribution of multiclass data poses a major challenge to machine learning algorithms. To tackle such issues, we propose a new splitting criterion for the decision tree based on the one-against-all-based Hellinger distance (OAHD). Two crucial elements are included in OAHD. First, the one-against-all scheme is integrated into the process of computing the Hellinger distance in OAHD, thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem. Second, for the multiclass imbalance problem, the distribution and the number of distinct classes are taken into account, and a modified Gini index is designed. Moreover, we give theoretical proofs for the properties of OAHD, including skew insensitivity and the ability to seek a purer node in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental and statistical results show that OAHD significantly improves performance compared with five other well-known decision trees in terms of Precision, F-measure, and multiclass area under the receiver operating characteristic curve (MAUC). Moreover, the Friedman and Nemenyi tests are used to confirm the statistical advantage of OAHD over the five other decision trees.
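To make the one-against-all idea concrete, here is a minimal sketch that scores a binary split by computing the Hellinger distance once per class (that class versus all others). Averaging the per-class distances is an assumption made here for illustration; the abstract does not state how OAHD aggregates them, and the modified Gini index it mentions is not reproduced. The function names are hypothetical.

```python
import numpy as np

def hellinger_binary(branch, pos_mask):
    """Hellinger distance between how positives and negatives fall into branches."""
    n_pos = pos_mask.sum()
    n_neg = (~pos_mask).sum()
    if n_pos == 0 or n_neg == 0:
        return 0.0
    total = 0.0
    for b in np.unique(branch):
        in_branch = branch == b
        p = (in_branch & pos_mask).sum() / n_pos   # share of positives in branch b
        q = (in_branch & ~pos_mask).sum() / n_neg  # share of negatives in branch b
        total += (np.sqrt(p) - np.sqrt(q)) ** 2
    return float(np.sqrt(total))

def oa_hellinger(feature, labels, threshold):
    """Score the split `feature <= threshold` with a one-against-all scheme.

    Each class in turn is treated as positive and the rest as negative;
    the per-class distances are averaged (an assumed aggregation).
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    branch = feature <= threshold
    dists = [hellinger_binary(branch, labels == c) for c in np.unique(labels)]
    return float(np.mean(dists))
```

Because each distance uses within-class proportions rather than raw counts, replicating the samples of one class leaves the score unchanged, which illustrates the skew insensitivity the abstract refers to.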
Funding: Supported by the National Key R&D Program of China (2022YFB4701502), the "Leading Goose" R&D Program of Zhejiang (2023C01177), the Key Research Project of Zhejiang Lab (2021NB0AL03), and the Key R&D Project on Agriculture and Social Development in Hangzhou City (Asian Games) (20230701 A05).
Abstract: Digital display instrument identification is a crucial approach for automating the collection of digital display data. In this study, we propose a digital display area detection algorithm, CTPNpro, to address the problem of recognizing multiclass digital display instruments. We developed a multiclass digital display instrument recognition algorithm by combining it with a character recognition network constructed using a convolutional neural network and a bidirectional variable-length long short-term memory (LSTM). First, the digital display region detection network framework, CTPNpro, was designed based on the CTPN network architecture by introducing feature fusion and a residual structure. Next, the digital display instrument identification network was constructed based on a convolutional neural network using a two-way LSTM and connectionist temporal classification (CTC) of indefinite length. Finally, an automatic calibration system for digital display instruments was built, and a multiclass digital display instrument dataset was constructed by sampling in the system. We compared the performance of the CTPNpro algorithm with other methods using this dataset to validate the effectiveness and robustness of the proposed algorithm.
Funding: This work was partially supported by the Key R&D Program of Zhejiang (Grant No. 2022C01018), the National Natural Science Foundation of China (Grant Nos. U21B2001 and 61973273), the Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LY21F030017 and LR19F030001), and the Major Key Project of PCL (Grant Nos. PCL2022A03, PCL2021A02, and PCL2021A09).
Abstract: Precisely understanding the business relationships between autonomous systems (ASes) is essential for studying the Internet structure. To date, many inference algorithms, which mainly focus on peer-to-peer (P2P) and provider-to-customer (P2C) binary classification, have been proposed to classify AS relationships and have achieved excellent results. However, business-based sibling relationships and structure-based exchange relationships have become an increasingly nonnegligible part of the Internet market in recent years. Existing algorithms often struggle to infer these relationships because of their high similarity to P2P or P2C relationships. In this study, we focus on the multiclassification of AS relationships for the first time. We first summarize the differences between AS relationships in terms of structural and attribute features, and the reasons why multiclass relationships are difficult to infer. We then introduce new features and propose a graph convolutional network (GCN) framework, AS-GCN, to solve this multiclassification problem in complex scenarios. The proposed framework considers the global network structure and local link features concurrently. Experiments on real Internet topological data validate the effectiveness of our method, AS-GCN. The proposed method achieves comparable results on the binary classification task and outperforms a series of baselines on the more difficult multiclassification task, with overall metrics above 95%.
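For readers unfamiliar with graph convolution, the sketch below shows one generic GCN propagation step with symmetric normalization and self-loops, the widely used rule ReLU(D^-1/2 (A + I) D^-1/2 H W). It is not the AS-GCN architecture, whose layers and link features the abstract does not specify; `gcn_layer` and its arguments are hypothetical names.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One generic graph-convolution step (illustrative, not AS-GCN itself).

    adj:      (n, n) adjacency matrix of the graph, e.g. an AS-level topology
    features: (n, d_in) node feature matrix H
    weight:   (d_in, d_out) weight matrix W (learned in a real model)
    """
    adj = np.asarray(adj, dtype=float)
    features = np.asarray(features, dtype=float)
    weight = np.asarray(weight, dtype=float)

    # Add self-loops so each node keeps its own features.
    a_hat = adj + np.eye(adj.shape[0])

    # Symmetric normalization: D^-1/2 (A + I) D^-1/2.
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Aggregate neighbor features, apply the linear map, then ReLU.
    return np.maximum(0.0, a_norm @ features @ weight)
```

Stacking such layers lets each node's representation depend on progressively larger neighborhoods, which is one way a model can combine global structure with local information.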