In recent years,with the rapid development of software systems,the continuous expansion of software scale and the increasing complexity of systems have led to the emergence of a growing number of software metrics.Defe...In recent years,with the rapid development of software systems,the continuous expansion of software scale and the increasing complexity of systems have led to the emergence of a growing number of software metrics.Defect prediction methods based on software metric elements highly rely on software metric data.However,redundant software metric data is not conducive to efficient defect prediction,posing severe challenges to current software defect prediction tasks.To address these issues,this paper focuses on the rational clustering of software metric data.Firstly,multiple software projects are evaluated to determine the preset number of clusters for software metrics,and various clustering methods are employed to cluster the metric elements.Subsequently,a co-occurrence matrix is designed to comprehensively quantify the number of times that metrics appear in the same category.Based on the comprehensive results,the software metric data are divided into two semantic views containing different metrics,thereby analyzing the semantic information behind the software metrics.On this basis,this paper also conducts an in-depth analysis of the impact of different semantic view of metrics on defect prediction results,as well as the performance of various classification models under these semantic views.Experiments show that the joint use of the two semantic views can significantly improve the performance of models in software defect prediction,providing a new understanding and approach at the semantic view level for defect prediction research based on software metrics.展开更多
Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software ...Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.展开更多
Software defect prediction plays a critical role in software development and quality assurance processes. Effective defect prediction enables testers to accurately prioritize testing efforts and enhance defect detecti...Software defect prediction plays a critical role in software development and quality assurance processes. Effective defect prediction enables testers to accurately prioritize testing efforts and enhance defect detection efficiency. Additionally, this technology provides developers with a means to quickly identify errors, thereby improving software robustness and overall quality. However, current research in software defect prediction often faces challenges, such as relying on a single data source or failing to adequately account for the characteristics of multiple coexisting data sources. This approach may overlook the differences and potential value of various data sources, affecting the accuracy and generalization performance of prediction results. To address this issue, this study proposes a multivariate heterogeneous hybrid deep learning algorithm for defect prediction (DP-MHHDL). Initially, Abstract Syntax Tree (AST), Code Dependency Network (CDN), and code static quality metrics are extracted from source code files and used as inputs to ensure data diversity. Subsequently, for the three types of heterogeneous data, the study employs a graph convolutional network optimization model based on adjacency and spatial topologies, a Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) hybrid neural network model, and a TabNet model to extract data features. These features are then concatenated and processed through a fully connected neural network for defect prediction. Finally, the proposed framework is evaluated using ten promise defect repository projects, and performance is assessed with three metrics: F1, Area under the curve (AUC), and Matthews correlation coefficient (MCC). The experimental results demonstrate that the proposed algorithm outperforms existing methods, offering a novel solution for software defect prediction.展开更多
The purpose of software defect prediction is to identify defect-prone code modules to assist software quality assurance teams with the appropriate allocation of resources and labor.In previous software defect predicti...The purpose of software defect prediction is to identify defect-prone code modules to assist software quality assurance teams with the appropriate allocation of resources and labor.In previous software defect prediction studies,transfer learning was effective in solving the problem of inconsistent project data distribution.However,target projects often lack sufficient data,which affects the performance of the transfer learning model.In addition,the presence of uncorrelated features between projects can decrease the prediction accuracy of the transfer learning model.To address these problems,this article propose a software defect prediction method based on stable learning(SDP-SL)that combines code visualization techniques and residual networks.This method first transforms code files into code images using code visualization techniques and then constructs a defect prediction model based on these code images.During the model training process,target project data are not required as prior knowledge.Following the principles of stable learning,this paper dynamically adjusted the weights of source project samples to eliminate dependencies between features,thereby capturing the“invariance mechanism”within the data.This approach explores the genuine relationship between code defect features and labels,thereby enhancing defect prediction performance.To evaluate the performance of SDP-SL,this article conducted comparative experiments on 10 open-source projects in the PROMISE dataset.The experimental results demonstrated that in terms of the F-measure,the proposed SDP-SL method outperformed other within-project defect prediction methods by 2.11%-44.03%.In cross-project defect prediction,the SDP-SL method provided an improvement of 5.89%-25.46% in prediction performance compared to other cross-project defect prediction methods.Therefore,SDP-SL can effectively enhance within-and cross-project defect predictions.展开更多
The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of par...The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of parallelapplications. Detecting and correcting these defects is crucial, yet there is a lack of published models specificallydesigned for correctingMPI defects. To address this, we propose a model for detecting and correcting MPI defects(DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blockingpoint-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defectsaddressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and messagemismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a datasetconsisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes,resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI modelranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correctingdefects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. TheDC_MPImodel fills an important research gap and provides a valuable tool for improving the quality ofMPI-basedparallel computing systems.展开更多
Developing successful software with no defects is one of the main goals of software projects.In order to provide a software project with the anticipated software quality,the prediction of software defects plays a vita...Developing successful software with no defects is one of the main goals of software projects.In order to provide a software project with the anticipated software quality,the prediction of software defects plays a vital role.Machine learning,and particularly deep learning,have been advocated for predicting software defects,however both suffer from inadequate accuracy,overfitting,and complicated structure.In this paper,we aim to address such issues in predicting software defects.We propose a novel structure of 1-Dimensional Convolutional Neural Network(1D-CNN),a deep learning architecture to extract useful knowledge,identifying and modelling the knowledge in the data sequence,reduce overfitting,and finally,predict whether the units of code are defects prone.We design large-scale empirical studies to reveal the proposed model’s effectiveness by comparing four established traditional machine learning baseline models and four state-of-the-art baselines in software defect prediction based on the NASA datasets.The experimental results demonstrate that in terms of f-measure,an optimal and modest 1DCNN with a dropout layer outperforms baseline and state-of-the-art models by 66.79%and 23.88%,respectively,in ways that minimize overfitting and improving prediction performance for software defects.According to the results,1D-CNN seems to be successful in predicting software defects and may be applied and adopted for a practical problem in software engineering.This,in turn,could lead to saving software development resources and producing more reliable software.展开更多
Software defect prediction is a research hotspot in the field of software engineering.However,due to the limitations of current machine learning algorithms,we can’t achieve good effect for defect prediction by only u...Software defect prediction is a research hotspot in the field of software engineering.However,due to the limitations of current machine learning algorithms,we can’t achieve good effect for defect prediction by only using machine learning algorithms.In previous studies,some researchers used extreme learning machine(ELM)to conduct defect prediction.However,the initial weights and biases of the ELM are determined randomly,which reduces the prediction performance of ELM.Motivated by the idea of search based software engineering,we propose a novel software defect prediction model named KAEA based on kernel principal component analysis(KPCA),adaptive genetic algorithm,extreme learning machine and Adaboost algorithm,which has three main advantages:(1)KPCA can extract optimal representative features by leveraging a nonlinear mapping function;(2)We leverage adaptive genetic algorithm to optimize the initial weights and biases of ELM,so as to improve the generalization ability and prediction capacity of ELM;(3)We use the Adaboost algorithm to integrate multiple ELM basic predictors optimized by adaptive genetic algorithm into a strong predictor,which can further improve the effect of defect prediction.To effectively evaluate the performance of KAEA,we use eleven datasets from large open source projects,and compare the KAEA with four machine learning basic classifiers,ELM and its three variants.The experimental results show that KAEA is superior to these baseline models in most cases.展开更多
Software defect prediction plays an important role in software quality assurance.However,the performance of the prediction model is susceptible to the irrelevant and redundant features.In addition,previous studies mos...Software defect prediction plays an important role in software quality assurance.However,the performance of the prediction model is susceptible to the irrelevant and redundant features.In addition,previous studies mostly regard software defect prediction as a single objective optimization problem,and multi-objective software defect prediction has not been thoroughly investigated.For the above two reasons,we propose the following solutions in this paper:(1)we leverage an advanced deep neural network-Stacked Contractive AutoEncoder(SCAE)to extract the robust deep semantic features from the original defect features,which has stronger discrimination capacity for different classes(defective or non-defective).(2)we propose a novel multi-objective defect prediction model named SMONGE that utilizes the Multi-Objective NSGAII algorithm to optimize the advanced neural network-Extreme learning machine(ELM)based on state-of-the-art Pareto optimal solutions according to the features extracted by SCAE.We mainly consider two objectives.One objective is to maximize the performance of ELM,which refers to the benefit of the SMONGE model.Another objective is to minimize the output weight norm of ELM,which is related to the cost of the SMONGE model.We compare the SCAE with six state-of-the-art feature extraction methods and compare the SMONGE model with multiple baseline models that contain four classic defect predictors and the MONGE model without SCAE across 20 open source software projects.The experimental results verify that the superiority of SCAE and SMONGE on seven evaluation metrics.展开更多
Software defect prediction(SDP)is used to perform the statistical analysis of historical defect data to find out the distribution rule of historical defects,so as to effectively predict defects in the new software.How...Software defect prediction(SDP)is used to perform the statistical analysis of historical defect data to find out the distribution rule of historical defects,so as to effectively predict defects in the new software.However,there are redundant and irrelevant features in the software defect datasets affecting the performance of defect predictors.In order to identify and remove the redundant and irrelevant features in software defect datasets,we propose ReliefF-based clustering(RFC),a clusterbased feature selection algorithm.Then,the correlation between features is calculated based on the symmetric uncertainty.According to the correlation degree,RFC partitions features into k clusters based on the k-medoids algorithm,and finally selects the representative features from each cluster to form the final feature subset.In the experiments,we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration(NASA)software defect prediction datasets in terms of area under curve(AUC)and Fvalue.The experimental results show that RFC can effectively improve the performance of SDP.展开更多
Software defect prevention is an important way to reduce the defect introduction rate.As the primary cause of software defects,human error can be the key to understanding and preventing software defects.This paper pro...Software defect prevention is an important way to reduce the defect introduction rate.As the primary cause of software defects,human error can be the key to understanding and preventing software defects.This paper proposes a defect prevention approach based on human error mechanisms:DPe HE.The approach includes both knowledge and regulation training in human error prevention.Knowledge training provides programmers with explicit knowledge on why programmers commit errors,what kinds of errors tend to be committed under different circumstances,and how these errors can be prevented.Regulation training further helps programmers to promote the awareness and ability to prevent human errors through practice.The practice is facilitated by a problem solving checklist and a root cause identification checklist.This paper provides a systematic framework that integrates knowledge across disciplines,e.g.,cognitive science,software psychology and software engineering to defend against human errors in software development.Furthermore,we applied this approach in an international company at CMM Level 5 and a software development institution at CMM Level 1 in the Chinese Aviation Industry.The application cases show that the approach is feasible and effective in promoting developers' ability to prevent software defects,independent of process maturity levels.展开更多
Software defect prediction plays a very important role in software quality assurance,which aims to inspect as many potentially defect-prone software modules as possible.However,the performance of the prediction model ...Software defect prediction plays a very important role in software quality assurance,which aims to inspect as many potentially defect-prone software modules as possible.However,the performance of the prediction model is susceptible to high dimensionality of the dataset that contains irrelevant and redundant features.In addition,software metrics for software defect prediction are almost entirely traditional features compared to the deep semantic feature representation from deep learning techniques.To address these two issues,we propose the following two solutions in this paper:(1)We leverage a novel non-linear manifold learning method-SOINN Landmark Isomap(SL-Isomap)to extract the representative features by selecting automatically the reasonable number and position of landmarks,which can reveal the complex intrinsic structure hidden behind the defect data.(2)We propose a novel defect prediction model named DLDD based on hybrid deep learning techniques,which leverages denoising autoencoder to learn true input features that are not contaminated by noise,and utilizes deep neural network to learn the abstract deep semantic features.We combine the squared error loss function of denoising autoencoder with the cross entropy loss function of deep neural network to achieve the best prediction performance by adjusting a hyperparameter.We compare the SL-Isomap with seven state-of-the-art feature extraction methods and compare the DLDD model with six baseline models across 20 open source software projects.The experimental results verify that the superiority of SL-Isomap and DLDD on four evaluation indicators.展开更多
The software engineering field has long focused on creating high-quality software despite limited resources.Detecting defects before the testing stage of software development can enable quality assurance engineers to ...The software engineering field has long focused on creating high-quality software despite limited resources.Detecting defects before the testing stage of software development can enable quality assurance engineers to con-centrate on problematic modules rather than all the modules.This approach can enhance the quality of the final product while lowering development costs.Identifying defective modules early on can allow for early corrections and ensure the timely delivery of a high-quality product that satisfies customers and instills greater confidence in the development team.This process is known as software defect prediction,and it can improve end-product quality while reducing the cost of testing and maintenance.This study proposes a software defect prediction system that utilizes data fusion,feature selection,and ensemble machine learning fusion techniques.A novel filter-based metric selection technique is proposed in the framework to select the optimum features.A three-step nested approach is presented for predicting defective modules to achieve high accuracy.In the first step,three supervised machine learning techniques,including Decision Tree,Support Vector Machines,and Naïve Bayes,are used to detect faulty modules.The second step involves integrating the predictive accuracy of these classification techniques through three ensemble machine-learning methods:Bagging,Voting,and Stacking.Finally,in the third step,a fuzzy logic technique is employed to integrate the predictive accuracy of the ensemble machine learning techniques.The experiments are performed on a fused software defect dataset to ensure that the developed fused ensemble model can perform effectively on diverse datasets.Five NASA datasets are integrated to create the fused dataset:MW1,PC1,PC3,PC4,and CM1.According to the results,the proposed system exhibited superior performance to other advanced techniques for predicting software defects,achieving a remarkable accuracy rate of 92.08%.展开更多
Software systems have grown significantly and in complexity.As a result of these qualities,preventing software faults is extremely difficult.Software defect prediction(SDP)can assist developers in finding potential bu...Software systems have grown significantly and in complexity.As a result of these qualities,preventing software faults is extremely difficult.Software defect prediction(SDP)can assist developers in finding potential bugs and reducing maintenance costs.When it comes to lowering software costs and assuring software quality,SDP plays a critical role in software development.As a result,automatically forecasting the number of errors in software modules is important,and it may assist developers in allocating limited resources more efficiently.Several methods for detecting and addressing such flaws at a low cost have been offered.These approaches,on the other hand,need to be significantly improved in terms of performance.Therefore in this paper,two deep learning(DL)models Multilayer preceptor(MLP)and deep neural network(DNN)are proposed.The proposed approaches combine the newly established Whale optimization algorithm(WOA)with the complementary Firefly algorithm(FA)to establish the emphasized metaheuristic search EMWS algorithm,which selects fewer but closely related representative features.To find the best-implemented classifier in terms of prediction achievement measurement factor,classifiers were applied to five PROMISE repository datasets.When compared to existing methods,the proposed technique for SDP outperforms,with 0.91%for the JM1 dataset,0.98%accuracy for the KC2 dataset,0.91%accuracy for the PC1 dataset,0.93%accuracy for the MC2 dataset,and 0.92%accuracy for KC3.展开更多
Software defect prediction is a critical component in maintaining software quality,enabling early identification and resolution of issues that could lead to system failures and significant financial losses.With the in...Software defect prediction is a critical component in maintaining software quality,enabling early identification and resolution of issues that could lead to system failures and significant financial losses.With the increasing reliance on user-generated content,social media reviews have emerged as a valuable source of real-time feedback,offering insights into potential software defects that traditional testing methods may overlook.However,existing models face challenges like handling imbalanced data,high computational complexity,and insufficient inte-gration of contextual information from these reviews.To overcome these limitations,this paper introduces the SESDP(Sentiment Analysis-Based Early Software Defect Prediction)model.SESDP employs a Transformer-Based Multi-Task Learning approach using Robustly Optimized Bidirectional Encoder Representations from Transformers Approach(RoBERTa)to simultaneously perform sentiment analysis and defect prediction.By integrating text embedding extraction,sentiment score computation,and feature fusion,the model effectively captures both the contextual nuances and sentiment expressed in user reviews.Experimental results show that SESDP achieves superior performance with an accuracy of 96.37%,precision of 94.7%,and recall of 95.4%,particularly excelling in handling imbalanced datasets compared to baseline models.This approach offers a scalable and efficient solution for early software defect detection,enhancing proactive software quality assurance.展开更多
Purpose-Software defect prediction(SDP)is a critical aspect of software quality assurance,aiming to identify and manage potential defects in software systems.In this paper,we have proposed a novel hybrid approach that...Purpose-Software defect prediction(SDP)is a critical aspect of software quality assurance,aiming to identify and manage potential defects in software systems.In this paper,we have proposed a novel hybrid approach that combines Grey Wolf Optimization with Feature Selection(GWOFS)and multilayer perceptron(MLP)for SDP.The GWOFS-MLP hybrid model is designed to optimize feature selection,ultimately enhancing the accuracy and efficiency of SDP.Grey Wolf Optimization,inspired by the social hierarchy and hunting behavior of grey wolves,is employed to select a subset of relevant features from an extensive pool of potential predictors.This study investigates the key challenges that traditional SDP approaches encounter and proposes promising solutions to overcome time complexity and the curse of the dimensionality reduction problem.Design/methodology/approach-The integration of GWOFS and MLP results in a robust hybrid model that can adapt to diverse software datasets.This feature selection process harnesses the cooperative hunting behavior of wolves,allowing for the exploration of critical feature combinations.The selected features are then fed into an MLP,a powerful artificial neural network(ANN)known for its capability to learn intricate patterns within software metrics.MLP serves as the predictive engine,utilizing the curated feature set to model and classify software defects accurately.Findings-The performance evaluation of the GWOFS-MLP hybrid model on a real-world software defect dataset demonstrates its effectiveness.The model achieves a remarkable training accuracy of 97.69%and a testing accuracy of 97.99%.Additionally,the receiver operating characteristic area under the curve(ROC-AUC)score of 0.89 highlights themodel’s ability to discriminate between defective and defect-free software components.Originality/value-Experimental implementations using machine learning-based techniques with feature reduction are conducted to validate the proposed solutions.The goal is to enhance SDP’s accuracy,relevance and efficiency,ultimately improving software quality assurance processes.The confusion matrix further illustrates the model’s performance,with only a small number of false positives and false negatives.展开更多
The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manp...The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manpower. Within-project defect prediction (WPDP) is a widely used method in SDP. Despite various improvements, current methods still face challenges such as coarse-grained prediction and ineffective handling of data drift due to differences in project distribution. To address these issues, we propose a fine-grained SDP method called DIDP (drift-immune defect prediction), based on drift-immune graph neural networks (DI-GNN). DIDP converts source code into graph representations and uses DI-GNN to mitigate data drift at the model level. It also analyses key statements leading to file defects for a more detailed SDP approach. We evaluated the performance of DIDP in WPDP by examining its file-level and statement-level accuracy compared to state-of-the-art methods, and by examining its cross-project prediction accuracy. The results of the experiment show that DIDP showed significant improvements in F1-score and Recall@Top20%LOC compared to existing methods, even with large software version changes. DIDP also performed well in cross-project SDP. Our study demonstrates that DIDP achieves impressive prediction results in WPDP, effectively mitigating data drift and accurately predicting defective files. Additionally, DIDP can rank the risk of statements in defective files, aiding developers and testers in identifying potential code issues.展开更多
Software defect detection aims to automatically identify defective software modules for efficient software test in order to improve the quality of a software system. Although many machine learning methods have been su...Software defect detection aims to automatically identify defective software modules for efficient software test in order to improve the quality of a software system. Although many machine learning methods have been successfully applied to the task, most of them fail to consider two practical yet important issues in software defect detection. First, it is rather difficult to collect a large amount of labeled training data for learning a well-performing model; second, in a software system there are usually much fewer defective modules than defect-free modules, so learning would have to be conducted over an imbalanced data set. In this paper~ we address these two practical issues simultaneously by proposing a novel semi-supervised learning approach named Rocus. This method exploits the abundant unlabeled examples to improve the detection accuracy, as well as employs under-sampling to tackle the class-imbalance problem in the learning process. Experimental results of real-world software defect detection tasks show that Rocus is effective for software defect detection. Its performance is better than a semi-supervised learning method that ignores the class-imbalance nature of the task and a class-imbalance learning method that does not make effective use of unlabeled data.展开更多
Cross-project defect prediction (CPDP) uses the labeled data from external source software projects to com- pensate the shortage of useful data in the target project, in order to build a meaningful classification mo...Cross-project defect prediction (CPDP) uses the labeled data from external source software projects to com- pensate the shortage of useful data in the target project, in order to build a meaningful classification model. However, the distribution gap between software features extracted from the source and the target projects may be too large to make the mixed data useful for training. In this paper, we propose a cluster-based novel method FeSCH (Feature Selection Using Clusters of Hybrid-Data) to alleviate the distribution differences by feature selection. FeSCH includes two phases. Tile feature clustering phase clusters features using a density-based clustering method, and the feature selection phase selects features from each cluster using a ranking strategy. For CPDP, we design three different heuristic ranking strategies in the second phase. To investigate the prediction performance of FeSCH, we design experiments based on real-world software projects, and study the effects of design options in FeSCH (such as ranking strategy, feature selection ratio, and classifiers). The experimental results prove the effectiveness of FeSCH. Firstly, compared with the state-of-the-art baseline methods, FeSCH achieves better performance and its performance is less affected by the classifiers used. Secondly, FeSCH enhances the performance by effectively selecting features across feature categories, and provides guidelines for selecting useful features for defect prediction.展开更多
Software Defect Prediction(SDP) technology is an effective tool for improving software system quality that has attracted much attention in recent years.However,the prediction of cross-project data remains a challenge ...Software Defect Prediction(SDP) technology is an effective tool for improving software system quality that has attracted much attention in recent years.However,the prediction of cross-project data remains a challenge for the traditional SDP method due to the different distributions of the training and testing datasets.Another major difficulty is the class imbalance issue that must be addressed in Cross-Project Defect Prediction(CPDP).In this work,we propose a transfer-leaning algorithm(TSboostDF) that considers both knowledge transfer and class imbalance for CPDP.The experimental results demonstrate that the performance achieved by TSboostDF is better than those of existing CPDP methods.展开更多
Software defect prediction is aimed to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules. However, some of these features may ...Software defect prediction is aimed to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules. However, some of these features may be more relevant to the class (defective or non-defective), but others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the feature ranking list in sequence. Finally, all feature subsets are evaluated on a k-nearest neighbor (KNN) model and measured by an area under curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or is comparable to the compared feature selection approaches in terms of classification performance.展开更多
基金supported by the CCF-NSFOCUS‘Kunpeng’Research Fund(CCF-NSFOCUS2024012).
文摘In recent years,with the rapid development of software systems,the continuous expansion of software scale and the increasing complexity of systems have led to the emergence of a growing number of software metrics.Defect prediction methods based on software metric elements highly rely on software metric data.However,redundant software metric data is not conducive to efficient defect prediction,posing severe challenges to current software defect prediction tasks.To address these issues,this paper focuses on the rational clustering of software metric data.Firstly,multiple software projects are evaluated to determine the preset number of clusters for software metrics,and various clustering methods are employed to cluster the metric elements.Subsequently,a co-occurrence matrix is designed to comprehensively quantify the number of times that metrics appear in the same category.Based on the comprehensive results,the software metric data are divided into two semantic views containing different metrics,thereby analyzing the semantic information behind the software metrics.On this basis,this paper also conducts an in-depth analysis of the impact of different semantic view of metrics on defect prediction results,as well as the performance of various classification models under these semantic views.Experiments show that the joint use of the two semantic views can significantly improve the performance of models in software defect prediction,providing a new understanding and approach at the semantic view level for defect prediction research based on software metrics.
文摘Software defect prediction(SDP)aims to find a reliable method to predict defects in specific software projects and help software engineers allocate limited resources to release high-quality software products.Software defect prediction can be effectively performed using traditional features,but there are some redundant or irrelevant features in them(the presence or absence of this feature has little effect on the prediction results).These problems can be solved using feature selection.However,existing feature selection methods have shortcomings such as insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset.In order to reduce the impact of these shortcomings,this paper proposes a new feature selection method Cubic TraverseMa Beluga whale optimization algorithm(CTMBWO)based on the improved Beluga whale optimization algorithm(BWO).The goal of this study is to determine how well the CTMBWO can extract the features that are most important for correctly predicting software defects,improve the accuracy of fault prediction,reduce the number of the selected feature and mitigate the risk of overfitting,thereby achieving more efficient resource utilization and better distribution of test workload.The CTMBWO comprises three main stages:preprocessing the dataset,selecting relevant features,and evaluating the classification performance of the model.The novel feature selection method can effectively improve the performance of SDP.This study performs experiments on two software defect datasets(PROMISE,NASA)and shows the method’s classification performance using four detailed evaluation metrics,Accuracy,F1-score,MCC,AUC and Recall.The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and has significant improvement over the baseline models.
文摘Software defect prediction plays a critical role in software development and quality assurance processes. Effective defect prediction enables testers to accurately prioritize testing efforts and enhance defect detection efficiency. Additionally, this technology provides developers with a means to quickly identify errors, thereby improving software robustness and overall quality. However, current research in software defect prediction often faces challenges, such as relying on a single data source or failing to adequately account for the characteristics of multiple coexisting data sources. This approach may overlook the differences and potential value of various data sources, affecting the accuracy and generalization performance of prediction results. To address this issue, this study proposes a multivariate heterogeneous hybrid deep learning algorithm for defect prediction (DP-MHHDL). Initially, Abstract Syntax Tree (AST), Code Dependency Network (CDN), and code static quality metrics are extracted from source code files and used as inputs to ensure data diversity. Subsequently, for the three types of heterogeneous data, the study employs a graph convolutional network optimization model based on adjacency and spatial topologies, a Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) hybrid neural network model, and a TabNet model to extract data features. These features are then concatenated and processed through a fully connected neural network for defect prediction. Finally, the proposed framework is evaluated using ten promise defect repository projects, and performance is assessed with three metrics: F1, Area under the curve (AUC), and Matthews correlation coefficient (MCC). The experimental results demonstrate that the proposed algorithm outperforms existing methods, offering a novel solution for software defect prediction.
基金supported by the NationalNatural Science Foundation of China(Grant No.61867004)the Youth Fund of the National Natural Science Foundation of China(Grant No.41801288).
文摘The purpose of software defect prediction is to identify defect-prone code modules to assist software quality assurance teams with the appropriate allocation of resources and labor.In previous software defect prediction studies,transfer learning was effective in solving the problem of inconsistent project data distribution.However,target projects often lack sufficient data,which affects the performance of the transfer learning model.In addition,the presence of uncorrelated features between projects can decrease the prediction accuracy of the transfer learning model.To address these problems,this article propose a software defect prediction method based on stable learning(SDP-SL)that combines code visualization techniques and residual networks.This method first transforms code files into code images using code visualization techniques and then constructs a defect prediction model based on these code images.During the model training process,target project data are not required as prior knowledge.Following the principles of stable learning,this paper dynamically adjusted the weights of source project samples to eliminate dependencies between features,thereby capturing the“invariance mechanism”within the data.This approach explores the genuine relationship between code defect features and labels,thereby enhancing defect prediction performance.To evaluate the performance of SDP-SL,this article conducted comparative experiments on 10 open-source projects in the PROMISE dataset.The experimental results demonstrated that in terms of the F-measure,the proposed SDP-SL method outperformed other within-project defect prediction methods by 2.11%-44.03%.In cross-project defect prediction,the SDP-SL method provided an improvement of 5.89%-25.46% in prediction performance compared to other cross-project defect prediction methods.Therefore,SDP-SL can effectively enhance within-and cross-project defect predictions.
基金the Deanship of Scientific Research at King Abdulaziz University,Jeddah,Saudi Arabia under the Grant No.RG-12-611-43.
文摘The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of parallelapplications. Detecting and correcting these defects is crucial, yet there is a lack of published models specificallydesigned for correctingMPI defects. To address this, we propose a model for detecting and correcting MPI defects(DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blockingpoint-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defectsaddressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and messagemismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a datasetconsisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes,resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI modelranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correctingdefects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. TheDC_MPImodel fills an important research gap and provides a valuable tool for improving the quality ofMPI-basedparallel computing systems.
文摘Developing successful software with no defects is one of the main goals of software projects.In order to provide a software project with the anticipated software quality,the prediction of software defects plays a vital role.Machine learning,and particularly deep learning,have been advocated for predicting software defects,however both suffer from inadequate accuracy,overfitting,and complicated structure.In this paper,we aim to address such issues in predicting software defects.We propose a novel structure of 1-Dimensional Convolutional Neural Network(1D-CNN),a deep learning architecture to extract useful knowledge,identifying and modelling the knowledge in the data sequence,reduce overfitting,and finally,predict whether the units of code are defects prone.We design large-scale empirical studies to reveal the proposed model’s effectiveness by comparing four established traditional machine learning baseline models and four state-of-the-art baselines in software defect prediction based on the NASA datasets.The experimental results demonstrate that in terms of f-measure,an optimal and modest 1DCNN with a dropout layer outperforms baseline and state-of-the-art models by 66.79%and 23.88%,respectively,in ways that minimize overfitting and improving prediction performance for software defects.According to the results,1D-CNN seems to be successful in predicting software defects and may be applied and adopted for a practical problem in software engineering.This,in turn,could lead to saving software development resources and producing more reliable software.
基金This work is supported in part by the National Science Foundation of China(61672392,61373038)in part by the National Key Research and Development Program of China(No.2016YFC1202204).
文摘Software defect prediction is a research hotspot in the field of software engineering.However,due to the limitations of current machine learning algorithms,we can’t achieve good effect for defect prediction by only using machine learning algorithms.In previous studies,some researchers used extreme learning machine(ELM)to conduct defect prediction.However,the initial weights and biases of the ELM are determined randomly,which reduces the prediction performance of ELM.Motivated by the idea of search based software engineering,we propose a novel software defect prediction model named KAEA based on kernel principal component analysis(KPCA),adaptive genetic algorithm,extreme learning machine and Adaboost algorithm,which has three main advantages:(1)KPCA can extract optimal representative features by leveraging a nonlinear mapping function;(2)We leverage adaptive genetic algorithm to optimize the initial weights and biases of ELM,so as to improve the generalization ability and prediction capacity of ELM;(3)We use the Adaboost algorithm to integrate multiple ELM basic predictors optimized by adaptive genetic algorithm into a strong predictor,which can further improve the effect of defect prediction.To effectively evaluate the performance of KAEA,we use eleven datasets from large open source projects,and compare the KAEA with four machine learning basic classifiers,ELM and its three variants.The experimental results show that KAEA is superior to these baseline models in most cases.
基金This work is supported in part by the National Science Foundation of China(Grant Nos.61672392,61373038)in part by the National Key Research and Development Program of China(Grant No.2016YFC1202204).
文摘Software defect prediction plays an important role in software quality assurance.However,the performance of the prediction model is susceptible to the irrelevant and redundant features.In addition,previous studies mostly regard software defect prediction as a single objective optimization problem,and multi-objective software defect prediction has not been thoroughly investigated.For the above two reasons,we propose the following solutions in this paper:(1)we leverage an advanced deep neural network-Stacked Contractive AutoEncoder(SCAE)to extract the robust deep semantic features from the original defect features,which has stronger discrimination capacity for different classes(defective or non-defective).(2)we propose a novel multi-objective defect prediction model named SMONGE that utilizes the Multi-Objective NSGAII algorithm to optimize the advanced neural network-Extreme learning machine(ELM)based on state-of-the-art Pareto optimal solutions according to the features extracted by SCAE.We mainly consider two objectives.One objective is to maximize the performance of ELM,which refers to the benefit of the SMONGE model.Another objective is to minimize the output weight norm of ELM,which is related to the cost of the SMONGE model.We compare the SCAE with six state-of-the-art feature extraction methods and compare the SMONGE model with multiple baseline models that contain four classic defect predictors and the MONGE model without SCAE across 20 open source software projects.The experimental results verify that the superiority of SCAE and SMONGE on seven evaluation metrics.
基金supported by the National Key Research and Development Program of China(2018YFB1003702)the National Natural Science Foundation of China(62072255).
文摘Software defect prediction(SDP)is used to perform the statistical analysis of historical defect data to find out the distribution rule of historical defects,so as to effectively predict defects in the new software.However,there are redundant and irrelevant features in the software defect datasets affecting the performance of defect predictors.In order to identify and remove the redundant and irrelevant features in software defect datasets,we propose ReliefF-based clustering(RFC),a clusterbased feature selection algorithm.Then,the correlation between features is calculated based on the symmetric uncertainty.According to the correlation degree,RFC partitions features into k clusters based on the k-medoids algorithm,and finally selects the representative features from each cluster to form the final feature subset.In the experiments,we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration(NASA)software defect prediction datasets in terms of area under curve(AUC)and Fvalue.The experimental results show that RFC can effectively improve the performance of SDP.
文摘Software defect prevention is an important way to reduce the defect introduction rate.As the primary cause of software defects,human error can be the key to understanding and preventing software defects.This paper proposes a defect prevention approach based on human error mechanisms:DPe HE.The approach includes both knowledge and regulation training in human error prevention.Knowledge training provides programmers with explicit knowledge on why programmers commit errors,what kinds of errors tend to be committed under different circumstances,and how these errors can be prevented.Regulation training further helps programmers to promote the awareness and ability to prevent human errors through practice.The practice is facilitated by a problem solving checklist and a root cause identification checklist.This paper provides a systematic framework that integrates knowledge across disciplines,e.g.,cognitive science,software psychology and software engineering to defend against human errors in software development.Furthermore,we applied this approach in an international company at CMM Level 5 and a software development institution at CMM Level 1 in the Chinese Aviation Industry.The application cases show that the approach is feasible and effective in promoting developers' ability to prevent software defects,independent of process maturity levels.
基金This work is supported in part by the National Science Foundation of China(Grant Nos.61672392,61373038)in part by the National Key Research and Development Program of China(Grant No.2016YFC1202204).
文摘Software defect prediction plays a very important role in software quality assurance,which aims to inspect as many potentially defect-prone software modules as possible.However,the performance of the prediction model is susceptible to high dimensionality of the dataset that contains irrelevant and redundant features.In addition,software metrics for software defect prediction are almost entirely traditional features compared to the deep semantic feature representation from deep learning techniques.To address these two issues,we propose the following two solutions in this paper:(1)We leverage a novel non-linear manifold learning method-SOINN Landmark Isomap(SL-Isomap)to extract the representative features by selecting automatically the reasonable number and position of landmarks,which can reveal the complex intrinsic structure hidden behind the defect data.(2)We propose a novel defect prediction model named DLDD based on hybrid deep learning techniques,which leverages denoising autoencoder to learn true input features that are not contaminated by noise,and utilizes deep neural network to learn the abstract deep semantic features.We combine the squared error loss function of denoising autoencoder with the cross entropy loss function of deep neural network to achieve the best prediction performance by adjusting a hyperparameter.We compare the SL-Isomap with seven state-of-the-art feature extraction methods and compare the DLDD model with six baseline models across 20 open source software projects.The experimental results verify that the superiority of SL-Isomap and DLDD on four evaluation indicators.
基金supported by the Center for Cyber-Physical Systems,Khalifa University,under Grant 8474000137-RC1-C2PS-T5.
文摘The software engineering field has long focused on creating high-quality software despite limited resources.Detecting defects before the testing stage of software development can enable quality assurance engineers to con-centrate on problematic modules rather than all the modules.This approach can enhance the quality of the final product while lowering development costs.Identifying defective modules early on can allow for early corrections and ensure the timely delivery of a high-quality product that satisfies customers and instills greater confidence in the development team.This process is known as software defect prediction,and it can improve end-product quality while reducing the cost of testing and maintenance.This study proposes a software defect prediction system that utilizes data fusion,feature selection,and ensemble machine learning fusion techniques.A novel filter-based metric selection technique is proposed in the framework to select the optimum features.A three-step nested approach is presented for predicting defective modules to achieve high accuracy.In the first step,three supervised machine learning techniques,including Decision Tree,Support Vector Machines,and Naïve Bayes,are used to detect faulty modules.The second step involves integrating the predictive accuracy of these classification techniques through three ensemble machine-learning methods:Bagging,Voting,and Stacking.Finally,in the third step,a fuzzy logic technique is employed to integrate the predictive accuracy of the ensemble machine learning techniques.The experiments are performed on a fused software defect dataset to ensure that the developed fused ensemble model can perform effectively on diverse datasets.Five NASA datasets are integrated to create the fused dataset:MW1,PC1,PC3,PC4,and CM1.According to the results,the proposed system exhibited superior performance to other advanced techniques for predicting software defects,achieving a remarkable accuracy rate of 92.08%.
文摘Software systems have grown significantly and in complexity.As a result of these qualities,preventing software faults is extremely difficult.Software defect prediction(SDP)can assist developers in finding potential bugs and reducing maintenance costs.When it comes to lowering software costs and assuring software quality,SDP plays a critical role in software development.As a result,automatically forecasting the number of errors in software modules is important,and it may assist developers in allocating limited resources more efficiently.Several methods for detecting and addressing such flaws at a low cost have been offered.These approaches,on the other hand,need to be significantly improved in terms of performance.Therefore in this paper,two deep learning(DL)models Multilayer preceptor(MLP)and deep neural network(DNN)are proposed.The proposed approaches combine the newly established Whale optimization algorithm(WOA)with the complementary Firefly algorithm(FA)to establish the emphasized metaheuristic search EMWS algorithm,which selects fewer but closely related representative features.To find the best-implemented classifier in terms of prediction achievement measurement factor,classifiers were applied to five PROMISE repository datasets.When compared to existing methods,the proposed technique for SDP outperforms,with 0.91%for the JM1 dataset,0.98%accuracy for the KC2 dataset,0.91%accuracy for the PC1 dataset,0.93%accuracy for the MC2 dataset,and 0.92%accuracy for KC3.
基金funded by a grant from the Center of Excellence in Information Assurance(CoEIA),King Saud University(KSU).
文摘Software defect prediction is a critical component in maintaining software quality,enabling early identification and resolution of issues that could lead to system failures and significant financial losses.With the increasing reliance on user-generated content,social media reviews have emerged as a valuable source of real-time feedback,offering insights into potential software defects that traditional testing methods may overlook.However,existing models face challenges like handling imbalanced data,high computational complexity,and insufficient inte-gration of contextual information from these reviews.To overcome these limitations,this paper introduces the SESDP(Sentiment Analysis-Based Early Software Defect Prediction)model.SESDP employs a Transformer-Based Multi-Task Learning approach using Robustly Optimized Bidirectional Encoder Representations from Transformers Approach(RoBERTa)to simultaneously perform sentiment analysis and defect prediction.By integrating text embedding extraction,sentiment score computation,and feature fusion,the model effectively captures both the contextual nuances and sentiment expressed in user reviews.Experimental results show that SESDP achieves superior performance with an accuracy of 96.37%,precision of 94.7%,and recall of 95.4%,particularly excelling in handling imbalanced datasets compared to baseline models.This approach offers a scalable and efficient solution for early software defect detection,enhancing proactive software quality assurance.
文摘Purpose-Software defect prediction(SDP)is a critical aspect of software quality assurance,aiming to identify and manage potential defects in software systems.In this paper,we have proposed a novel hybrid approach that combines Grey Wolf Optimization with Feature Selection(GWOFS)and multilayer perceptron(MLP)for SDP.The GWOFS-MLP hybrid model is designed to optimize feature selection,ultimately enhancing the accuracy and efficiency of SDP.Grey Wolf Optimization,inspired by the social hierarchy and hunting behavior of grey wolves,is employed to select a subset of relevant features from an extensive pool of potential predictors.This study investigates the key challenges that traditional SDP approaches encounter and proposes promising solutions to overcome time complexity and the curse of the dimensionality reduction problem.Design/methodology/approach-The integration of GWOFS and MLP results in a robust hybrid model that can adapt to diverse software datasets.This feature selection process harnesses the cooperative hunting behavior of wolves,allowing for the exploration of critical feature combinations.The selected features are then fed into an MLP,a powerful artificial neural network(ANN)known for its capability to learn intricate patterns within software metrics.MLP serves as the predictive engine,utilizing the curated feature set to model and classify software defects accurately.Findings-The performance evaluation of the GWOFS-MLP hybrid model on a real-world software defect dataset demonstrates its effectiveness.The model achieves a remarkable training accuracy of 97.69%and a testing accuracy of 97.99%.Additionally,the receiver operating characteristic area under the curve(ROC-AUC)score of 0.89 highlights themodel’s ability to discriminate between defective and defect-free software components.Originality/value-Experimental implementations using machine learning-based techniques with feature reduction are conducted to validate the proposed solutions.The goal is to enhance SDP’s accuracy,relevance and efficiency,ultimately improving software quality assurance processes.The confusion matrix further illustrates the model’s performance,with only a small number of false positives and false negatives.
基金The authors would like to express appreciation to the National Natural Science Foundation of China(Grant No.61762067)for their financial support.
文摘The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manpower. Within-project defect prediction (WPDP) is a widely used method in SDP. Despite various improvements, current methods still face challenges such as coarse-grained prediction and ineffective handling of data drift due to differences in project distribution. To address these issues, we propose a fine-grained SDP method called DIDP (drift-immune defect prediction), based on drift-immune graph neural networks (DI-GNN). DIDP converts source code into graph representations and uses DI-GNN to mitigate data drift at the model level. It also analyses key statements leading to file defects for a more detailed SDP approach. We evaluated the performance of DIDP in WPDP by examining its file-level and statement-level accuracy compared to state-of-the-art methods, and by examining its cross-project prediction accuracy. The results of the experiment show that DIDP showed significant improvements in F1-score and Recall@Top20%LOC compared to existing methods, even with large software version changes. DIDP also performed well in cross-project SDP. Our study demonstrates that DIDP achieves impressive prediction results in WPDP, effectively mitigating data drift and accurately predicting defective files. Additionally, DIDP can rank the risk of statements in defective files, aiding developers and testers in identifying potential code issues.
基金supported by the National Natural Science Foundation of China under Grant Nos. 60975043,60903103,and 60721002
文摘Software defect detection aims to automatically identify defective software modules for efficient software test in order to improve the quality of a software system. Although many machine learning methods have been successfully applied to the task, most of them fail to consider two practical yet important issues in software defect detection. First, it is rather difficult to collect a large amount of labeled training data for learning a well-performing model; second, in a software system there are usually much fewer defective modules than defect-free modules, so learning would have to be conducted over an imbalanced data set. In this paper~ we address these two practical issues simultaneously by proposing a novel semi-supervised learning approach named Rocus. This method exploits the abundant unlabeled examples to improve the detection accuracy, as well as employs under-sampling to tackle the class-imbalance problem in the learning process. Experimental results of real-world software defect detection tasks show that Rocus is effective for software defect detection. Its performance is better than a semi-supervised learning method that ignores the class-imbalance nature of the task and a class-imbalance learning method that does not make effective use of unlabeled data.
文摘Cross-project defect prediction (CPDP) uses the labeled data from external source software projects to com- pensate the shortage of useful data in the target project, in order to build a meaningful classification model. However, the distribution gap between software features extracted from the source and the target projects may be too large to make the mixed data useful for training. In this paper, we propose a cluster-based novel method FeSCH (Feature Selection Using Clusters of Hybrid-Data) to alleviate the distribution differences by feature selection. FeSCH includes two phases. Tile feature clustering phase clusters features using a density-based clustering method, and the feature selection phase selects features from each cluster using a ranking strategy. For CPDP, we design three different heuristic ranking strategies in the second phase. To investigate the prediction performance of FeSCH, we design experiments based on real-world software projects, and study the effects of design options in FeSCH (such as ranking strategy, feature selection ratio, and classifiers). The experimental results prove the effectiveness of FeSCH. Firstly, compared with the state-of-the-art baseline methods, FeSCH achieves better performance and its performance is less affected by the classifiers used. Secondly, FeSCH enhances the performance by effectively selecting features across feature categories, and provides guidelines for selecting useful features for defect prediction.
基金supported by the Army Weapons and Equipment Internal Research (No. LJ20191C080690)。
文摘Software Defect Prediction(SDP) technology is an effective tool for improving software system quality that has attracted much attention in recent years.However,the prediction of cross-project data remains a challenge for the traditional SDP method due to the different distributions of the training and testing datasets.Another major difficulty is the class imbalance issue that must be addressed in Cross-Project Defect Prediction(CPDP).In this work,we propose a transfer-leaning algorithm(TSboostDF) that considers both knowledge transfer and class imbalance for CPDP.The experimental results demonstrate that the performance achieved by TSboostDF is better than those of existing CPDP methods.
基金Project supported by the National Natural Science Foundation of China (Nos. 61673384 and 61502497), the Guangxi Key Laboratory of Trusted Software (No. kx201530), the China Postdoctoral Science Foundation (No. 2015M581887), and the Scientific Research Innovation Project for Graduate Students of Jiangsu Province, China (No. KYLX15 1443)
文摘Software defect prediction is aimed to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules. However, some of these features may be more relevant to the class (defective or non-defective), but others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the feature ranking list in sequence. Finally, all feature subsets are evaluated on a k-nearest neighbor (KNN) model and measured by an area under curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or is comparable to the compared feature selection approaches in terms of classification performance.