In recent years, the rapid development of software systems, the continuous expansion of software scale, and the increasing complexity of systems have led to a growing number of software metrics. Defect prediction methods based on software metric elements rely heavily on software metric data. However, redundant metric data hinders efficient defect prediction and poses severe challenges to current software defect prediction tasks. To address these issues, this paper focuses on the rational clustering of software metric data. First, multiple software projects are evaluated to determine the preset number of clusters for software metrics, and various clustering methods are employed to cluster the metric elements. Subsequently, a co-occurrence matrix is designed to quantify the number of times that metrics appear in the same category. Based on the combined results, the software metric data are divided into two semantic views containing different metrics, which makes it possible to analyze the semantic information behind the software metrics. On this basis, this paper also conducts an in-depth analysis of the impact of the different semantic views of metrics on defect prediction results, as well as the performance of various classification models under these semantic views. Experiments show that the joint use of the two semantic views can significantly improve the performance of models in software defect prediction, providing a new understanding and approach at the semantic view level for defect prediction research based on software metrics.
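The co-occurrence step described in this abstract can be illustrated with a minimal Python sketch. It assumes that each clustering method yields one list of cluster labels per metric; the function name `co_occurrence` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def co_occurrence(labelings):
    """Count, for every pair of metrics, how many clusterings
    assign both metrics to the same cluster."""
    n = len(labelings[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                matrix[i][j] += 1
                matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical cluster labels for 4 metrics from 3 clustering methods.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
matrix = co_occurrence(labelings)
```

Pairs with high counts are metrics that the clustering methods consistently group together, which is the kind of evidence the abstract says is used to split the metrics into views.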
Software defect prediction (SDP) aims to find a reliable method to predict defects in specific software projects and to help software engineers allocate limited resources to release high-quality software products. SDP can be performed effectively using traditional features, but some of these features are redundant or irrelevant (their presence or absence has little effect on the prediction results). Feature selection can solve these problems. However, existing feature selection methods have shortcomings, such as insignificant dimensionality reduction and low classification accuracy of the selected optimal feature subset. To reduce the impact of these shortcomings, this paper proposes a new feature selection method, the Cubic TraverseMa Beluga whale optimization algorithm (CTMBWO), based on an improved Beluga whale optimization algorithm (BWO). The goal of this study is to determine how well CTMBWO can extract the features that are most important for correctly predicting software defects, improve the accuracy of fault prediction, reduce the number of selected features, and mitigate the risk of overfitting, thereby achieving more efficient resource utilization and better distribution of the test workload. CTMBWO comprises three main stages: preprocessing the dataset, selecting relevant features, and evaluating the classification performance of the model. The novel feature selection method can effectively improve the performance of SDP. This study performs experiments on two software defect datasets (PROMISE, NASA) and reports the method's classification performance using the evaluation metrics Accuracy, F1-score, MCC, AUC, and Recall. The results indicate that the proposed approach achieves outstanding classification performance on both datasets and a significant improvement over the baseline models.
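Two of the evaluation metrics named in this abstract, F1-score and MCC, follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function names and the zero-denominator convention (returning 0.0) are assumptions made for illustration.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC uses all four cells of the confusion matrix, which is why it is often reported next to F1 on imbalanced defect datasets.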
The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manpower. Within-project defect prediction (WPDP) is a widely used method in SDP. Despite various improvements, current methods still face challenges such as coarse-grained prediction and ineffective handling of data drift due to differences in project distribution. To address these issues, we propose a fine-grained SDP method called DIDP (drift-immune defect prediction), based on drift-immune graph neural networks (DI-GNN). DIDP converts source code into graph representations and uses DI-GNN to mitigate data drift at the model level. It also analyses the key statements leading to file defects for a more detailed SDP approach. We evaluated the performance of DIDP in WPDP by comparing its file-level and statement-level accuracy with state-of-the-art methods, and by examining its cross-project prediction accuracy. The experimental results show that DIDP achieved significant improvements in F1-score and Recall@Top20%LOC compared to existing methods, even with large software version changes. DIDP also performed well in cross-project SDP. Our study demonstrates that DIDP achieves impressive prediction results in WPDP, effectively mitigating data drift and accurately predicting defective files. Additionally, DIDP can rank the risk of statements in defective files, aiding developers and testers in identifying potential code issues.
Software defect prediction aims to use measurement data of code and historical defects to predict potential problems and to optimize testing resources and defect management. However, current methods face two challenges: (1) Coarse-grained file-level detection cannot accurately locate specific defects. (2) Fine-grained line-level defect prediction methods rely solely on local information from a single line of code; they fail to analyze the semantic context of the code line in depth and ignore the heuristic impact of line-level context on the code line, which makes it difficult to capture the interaction between global and local information. Therefore, this paper proposes a telecontext-enhanced recursive interactive attention fusion method for line-level defect prediction (TRIA-LineDP). First, a bidirectional hierarchical attention network is used to extract semantic features and contextual information from the original code lines as the basis. Then, the extracted contextual information is forwarded to the telecontext capture module to aggregate the global context, thereby enhancing the understanding of broader code dynamics. Finally, a recursive interaction model is used to simulate the interaction between code lines and line-level context, passing information layer by layer to enhance local and global information exchange and thereby achieve accurate defect localization. Experimental results from within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) conducted on nine different projects (encompassing a total of 32 versions) demonstrate that, within the same project, the proposed method improves recall at the top 20% of lines of code (Recall@Top20%LOC) and effort at top 20% recall (Effort@Top20%Recall) by 11%–52% and 23%–77%, respectively. Across different projects, improvements of 9%–60% and 18%–77% are achieved; the method is superior to existing advanced methods and has good detection performance.
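Recall@Top20%LOC, as reported in this abstract, can be illustrated with a short Python sketch. It assumes the common line-level definition: rank lines by predicted risk, inspect the top 20% of lines, and report the share of truly defective lines found. The function name, the rounding of the inspection budget, and the toy scores are assumptions, not details from the paper.

```python
def recall_at_top_loc(scores, is_defective, fraction=0.2):
    """Share of defective lines found when inspecting the
    highest-scored `fraction` of all lines."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    budget = int(len(scores) * fraction)  # assumption: round the budget down
    found = sum(is_defective[i] for i in order[:budget])
    total = sum(is_defective)
    return found / total if total else 0.0


# Hypothetical risk scores for 10 lines; lines 0 and 5 are defective.
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.05, 0.15, 0.25, 0.35]
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
recall = recall_at_top_loc(scores, labels)
```

In this toy case the two inspected lines are 0 and 2, so one of the two defective lines is found.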
Software defect prediction plays a critical role in software development and quality assurance processes. Effective defect prediction enables testers to accurately prioritize testing efforts and enhance defect detection efficiency. Additionally, this technology provides developers with a means to quickly identify errors, thereby improving software robustness and overall quality. However, current research in software defect prediction often faces challenges, such as relying on a single data source or failing to adequately account for the characteristics of multiple coexisting data sources. Such approaches may overlook the differences and potential value of various data sources, affecting the accuracy and generalization performance of prediction results. To address this issue, this study proposes a multivariate heterogeneous hybrid deep learning algorithm for defect prediction (DP-MHHDL). Initially, Abstract Syntax Tree (AST), Code Dependency Network (CDN), and code static quality metrics are extracted from source code files and used as inputs to ensure data diversity. Subsequently, for the three types of heterogeneous data, the study employs a graph convolutional network optimization model based on adjacency and spatial topologies, a Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) hybrid neural network model, and a TabNet model to extract data features. These features are then concatenated and processed through a fully connected neural network for defect prediction. Finally, the proposed framework is evaluated using ten projects from the PROMISE defect repository, and performance is assessed with three metrics: F1, area under the curve (AUC), and Matthews correlation coefficient (MCC). The experimental results demonstrate that the proposed algorithm outperforms existing methods, offering a novel solution for software defect prediction.
The purpose of software defect prediction is to identify defect-prone code modules to assist software quality assurance teams with the appropriate allocation of resources and labor. In previous software defect prediction studies, transfer learning was effective in solving the problem of inconsistent project data distribution. However, target projects often lack sufficient data, which affects the performance of the transfer learning model. In addition, the presence of uncorrelated features between projects can decrease the prediction accuracy of the transfer learning model. To address these problems, this article proposes a software defect prediction method based on stable learning (SDP-SL) that combines code visualization techniques and residual networks. This method first transforms code files into code images using code visualization techniques and then constructs a defect prediction model based on these code images. During the model training process, target project data are not required as prior knowledge. Following the principles of stable learning, the method dynamically adjusts the weights of source project samples to eliminate dependencies between features, thereby capturing the "invariance mechanism" within the data. This approach explores the genuine relationship between code defect features and labels, thereby enhancing defect prediction performance. To evaluate the performance of SDP-SL, this article conducted comparative experiments on 10 open-source projects in the PROMISE dataset. The experimental results demonstrated that, in terms of the F-measure, the proposed SDP-SL method outperformed other within-project defect prediction methods by 2.11%-44.03%. In cross-project defect prediction, the SDP-SL method provided an improvement of 5.89%-25.46% in prediction performance compared to other cross-project defect prediction methods. Therefore, SDP-SL can effectively enhance within- and cross-project defect prediction.
Cross-Project Defect Prediction (CPDP) is a method that utilizes historical data from other source projects to train predictive models for defect prediction in the target project. However, existing CPDP methods only consider linear correlations between features (indicators) of the source and target projects. These models are not capable of evaluating non-linear correlations between features when they exist, for example, when there are differences in data distributions between the source and target projects. As a result, the performance of such CPDP models is compromised. This paper proposes a novel CPDP method based on the Synthetic Minority Oversampling Technique (SMOTE) and Deep Canonical Correlation Analysis (DCCA), referred to as S-DCCA. Canonical Correlation Analysis (CCA) is employed to address the issue of non-linear correlations between features of the source and target projects. S-DCCA extends CCA by incorporating the MlpNet model for feature extraction from the dataset. The redundant features are then eliminated by maximizing the correlated feature subset using the CCA loss function. Finally, cross-project defect prediction is achieved through the application of the SMOTE data sampling technique. Area Under Curve (AUC) and F1 score (F1) are used as evaluation metrics. Experiments were conducted on 27 projects from four public datasets to validate the proposed method. The results demonstrate that, on average, the method outperforms all baseline approaches by at least 1.2% in AUC and 5.5% in F1 score. This indicates that the proposed method exhibits favorable performance characteristics.
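The SMOTE step mentioned in this abstract creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a minimal, standard-library sketch of that interpolation idea only; the function name `smote_sample`, the default `k`, and the toy points are hypothetical, and a production setup would more likely use an established library implementation.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic sample on the line segment between a random
    minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class (defective) feature vectors.
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sample = smote_sample(minority, rng=random.Random(42))
```

Because the new point is a convex combination of two existing minority points, it always stays inside the region those points span.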
When customers use software, defects may occur that can be removed in updated versions of the software. Hence, the present work provides a robust examination of cross-project software defect prediction through an innovative hybrid machine learning framework. The proposed technique combines an advanced deep neural network architecture with ensemble models such as Support Vector Machine (SVM), Random Forest (RF), and XGBoost. The study evaluates performance on multiple software projects, such as CM1, JM1, KC1, and PC1, using datasets from the PROMISE Software Engineering Repository. Three hybrid models are compared: Hybrid Model-1 (SVM, RandomForest, XGBoost, Neural Network), Hybrid Model-2 (GradientBoosting, DecisionTree, LogisticRegression, Neural Network), and Hybrid Model-3 (KNeighbors, GaussianNB, Support Vector Classification (SVC), Neural Network). Hybrid Model-3 surpasses the others in terms of recall, F1-score, accuracy, ROC AUC, and precision. The presented work offers valuable insights into the effectiveness of hybrid techniques for cross-project defect prediction, providing a comparative perspective on early defect identification and mitigation strategies.
Data available in software engineering for many applications contains variability, and it is not possible to say in advance which variables help in the prediction process. Most work in software defect prediction focuses on selecting the best prediction techniques. For this purpose, deep learning and ensemble models have shown promising results. In contrast, very few studies deal with cleaning the training data and selecting the best parameter values from the data. Sometimes the data available for training the models has high variability, and this variability may decrease model accuracy. To deal with this problem, we used the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to select the best variables for training the model. A simple ANN model with one input layer, one output layer, and two hidden layers was used for training instead of a very deep and complex model. AIC and BIC values are calculated, and the combination with the minimum AIC and BIC values is selected as the best model. At first, the variables were narrowed down to a smaller number using correlation values. Then subsets for all the possible variable combinations were formed. In the end, an artificial neural network (ANN) model was trained for each subset, and the best model was selected on the basis of the smallest AIC and BIC values. It was found that the combination of only two variables, ns and entropy, is best for software defect prediction, as it gives the minimum AIC and BIC values, whereas nm and npt is the worst combination and gives the maximum AIC and BIC values.
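The subset-selection procedure in this abstract can be sketched in Python. The sketch assumes the common least-squares (Gaussian error) forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), where RSS is the residual sum of squares, n the number of samples, and k the number of fitted parameters; the abstract does not state which form the authors used. The function names are hypothetical, and only the variable names come from the abstract.

```python
import math
from itertools import combinations


def aic_bic(rss, n, k):
    """AIC and BIC under a Gaussian-error assumption (constants dropped)."""
    fit_term = n * math.log(rss / n)
    return fit_term + 2 * k, fit_term + k * math.log(n)


def all_subsets(variables):
    """Every non-empty combination of candidate variables."""
    return [
        subset
        for size in range(1, len(variables) + 1)
        for subset in combinations(variables, size)
    ]


# Variable names taken from the abstract; no scores are implied here.
subsets = all_subsets(["ns", "entropy", "nm", "npt"])
```

In a full pipeline, one model would be trained per subset and the subset with the smallest AIC and BIC kept; both criteria penalize extra parameters, with BIC penalizing them more strongly once n exceeds about 7.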
The fuzzy measure and fuzzy integral are applied to the classification of software defects in this paper. The fuzzy measures of software attributes and attribute sets are determined by a genetic algorithm, and the software attributes are then fused by the Choquet fuzzy integral algorithm. Finally, the class labels of software modules are output. Experimental results show that there are interactions between the characteristic attributes of software modules, and that the fuzzy integral fusion method using Fuzzy Measure based on Genetic Algorithm (GA-FM) can significantly improve the accuracy of software defect prediction.
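The Choquet integral used for fusion in this abstract can be illustrated with a small Python sketch based on the standard discrete definition: sort the attribute values in ascending order and weight each increment by the fuzzy measure of the set of attributes whose values are at least that large. In the paper the fuzzy measure is learned by a genetic algorithm; here it is hand-picked, and the attribute names and numbers are hypothetical.

```python
def choquet(values, mu):
    """Discrete Choquet integral.

    values: dict mapping attribute name -> score
    mu:     dict mapping frozenset of attribute names -> fuzzy measure
    """
    order = sorted(values, key=values.get)  # ascending by score
    total, previous = 0.0, 0.0
    for idx, name in enumerate(order):
        coalition = frozenset(order[idx:])  # attributes scoring >= current one
        total += (values[name] - previous) * mu[coalition]
        previous = values[name]
    return total


# Hypothetical normalized attribute scores and a hand-picked fuzzy measure.
values = {"loc": 0.2, "complexity": 0.6}
mu = {
    frozenset({"loc", "complexity"}): 1.0,
    frozenset({"complexity"}): 0.5,
    frozenset({"loc"}): 0.3,
}
fused = choquet(values, mu)
```

Because the measure is defined on sets rather than on single attributes, it can express interactions between attributes, which is the property the abstract highlights.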
With the continuous expansion of software scale, software update and maintenance have become more and more important. However, frequent code updates make the software more likely to introduce new defects. How to predict defects quickly and accurately on software changes has therefore become an important problem for software developers. Current defect prediction methods often cannot reflect the feature information of defects comprehensively, and their detection performance is not ideal. Therefore, we propose a novel defect prediction model named ITNB (Improved Transfer Naive Bayes), based on an improved transfer Naive Bayes algorithm, which mainly considers the following two aspects: (1) Considering that the edge data of the test set may affect the similarity calculation and the final prediction result, we remove the edge data of the test set when calculating the data similarity between the training set and the test set; (2) Considering that each feature dimension has a different effect on defect prediction, we construct the formula for training data weights based on feature dimension weight and data gravity, and then calculate the prior probability and the conditional probability of the training data from the weight information, so as to construct a weighted Bayesian classifier for software defect prediction. To evaluate the performance of the ITNB model, we use six datasets from large open source projects, namely Bugzilla, Columba, Mozilla, JDT, Platform and PostgreSQL. We compare the ITNB model with the transfer Naive Bayes (TNB) model. The experimental results show that our ITNB model achieves better results than the TNB model in terms of accuracy, precision and pd for within-project and cross-project defect prediction.
Developing successful software with no defects is one of the main goals of software projects. The prediction of software defects plays a vital role in providing a software project with the anticipated software quality. Machine learning, and particularly deep learning, has been advocated for predicting software defects; however, both suffer from inadequate accuracy, overfitting, and complicated structure. In this paper, we aim to address such issues in predicting software defects. We propose a novel structure of 1-Dimensional Convolutional Neural Network (1D-CNN), a deep learning architecture, to extract useful knowledge, identify and model the knowledge in the data sequence, reduce overfitting, and finally predict whether the units of code are defect-prone. We design large-scale empirical studies to reveal the proposed model's effectiveness by comparing it against four established traditional machine learning baseline models and four state-of-the-art baselines in software defect prediction based on the NASA datasets. The experimental results demonstrate that, in terms of f-measure, an optimal and modest 1D-CNN with a dropout layer outperforms the baseline and state-of-the-art models by 66.79% and 23.88%, respectively, while minimizing overfitting and improving prediction performance for software defects. According to the results, 1D-CNN seems to be successful in predicting software defects and may be applied and adopted for practical problems in software engineering. This, in turn, could lead to saving software development resources and producing more reliable software.
Software defect prediction is a research hotspot in the field of software engineering. However, due to the limitations of current machine learning algorithms, good defect prediction results cannot be achieved by using machine learning algorithms alone. In previous studies, some researchers used the extreme learning machine (ELM) to conduct defect prediction. However, the initial weights and biases of the ELM are determined randomly, which reduces its prediction performance. Motivated by the idea of search-based software engineering, we propose a novel software defect prediction model named KAEA based on kernel principal component analysis (KPCA), adaptive genetic algorithm, extreme learning machine and Adaboost algorithm, which has three main advantages: (1) KPCA can extract optimal representative features by leveraging a nonlinear mapping function; (2) We leverage the adaptive genetic algorithm to optimize the initial weights and biases of ELM, so as to improve the generalization ability and prediction capacity of ELM; (3) We use the Adaboost algorithm to integrate multiple ELM basic predictors optimized by the adaptive genetic algorithm into a strong predictor, which can further improve defect prediction performance. To effectively evaluate the performance of KAEA, we use eleven datasets from large open source projects and compare KAEA with four basic machine learning classifiers, ELM, and its three variants. The experimental results show that KAEA is superior to these baseline models in most cases.
Software defect prediction plays an important role in software quality assurance. However, the performance of the prediction model is susceptible to irrelevant and redundant features. In addition, previous studies mostly regard software defect prediction as a single-objective optimization problem, and multi-objective software defect prediction has not been thoroughly investigated. For these two reasons, we propose the following solutions in this paper: (1) We leverage an advanced deep neural network, the Stacked Contractive AutoEncoder (SCAE), to extract robust deep semantic features from the original defect features, which have stronger discrimination capacity for different classes (defective or non-defective). (2) We propose a novel multi-objective defect prediction model named SMONGE that utilizes the multi-objective NSGAII algorithm to optimize an advanced neural network, the Extreme Learning Machine (ELM), based on state-of-the-art Pareto optimal solutions according to the features extracted by SCAE. We mainly consider two objectives. One objective is to maximize the performance of ELM, which refers to the benefit of the SMONGE model. The other objective is to minimize the output weight norm of ELM, which is related to the cost of the SMONGE model. We compare SCAE with six state-of-the-art feature extraction methods and compare the SMONGE model with multiple baseline models, comprising four classic defect predictors and the MONGE model without SCAE, across 20 open source software projects. The experimental results verify the superiority of SCAE and SMONGE on seven evaluation metrics.
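The two objectives in this abstract (maximize predictor performance, minimize output weight norm) define a Pareto dominance relation between candidate models. The sketch below shows only the non-dominated filtering step that underlies Pareto-based selection; it is not NSGA-II itself, and the function name and candidate values are hypothetical.

```python
def pareto_front(candidates):
    """Keep the non-dominated (performance, weight_norm) pairs.

    Higher performance is better; lower weight norm is better.
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (performance, output weight norm) pairs for candidate models.
candidates = [(0.80, 2.0), (0.75, 1.0), (0.70, 1.5), (0.80, 3.0)]
front = pareto_front(candidates)
```

Here (0.70, 1.5) is dominated by (0.75, 1.0) and (0.80, 3.0) by (0.80, 2.0), so two trade-off solutions remain.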
Software defect prediction (SDP) performs statistical analysis of historical defect data to find the distribution rule of historical defects, so as to effectively predict defects in new software. However, redundant and irrelevant features in software defect datasets affect the performance of defect predictors. To identify and remove the redundant and irrelevant features in software defect datasets, we propose ReliefF-based clustering (RFC), a cluster-based feature selection algorithm. Then, the correlation between features is calculated based on symmetric uncertainty. According to the degree of correlation, RFC partitions the features into k clusters using the k-medoids algorithm and finally selects the representative features from each cluster to form the final feature subset. In the experiments, we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration (NASA) software defect prediction datasets in terms of area under curve (AUC) and F-value. The experimental results show that RFC can effectively improve the performance of SDP.
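The symmetric uncertainty used in this abstract to measure feature correlation has the standard form SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. A minimal Python sketch for discrete (or already discretized) features follows; the function names and the convention of returning 0.0 for two constant features are assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both features are constant
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2 * mutual_information / (hx + hy)
```

A value of 1 means one feature fully determines the other, and 0 means the two are independent; 1 − SU can serve as the distance that a k-medoids clustering of features operates on.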
Cross-project software defect prediction (CPDP) aims to enhance defect prediction in target projects with limited or no historical data by leveraging information from related source projects. The existing CPDP approaches rely on static metrics or dynamic syntactic features, which have shown limited effectiveness in CPDP due to their inability to capture higher-level system properties, such as complex design patterns, relationships between multiple functions, and dependencies in different software projects, that are important for CPDP. This paper introduces a novel approach, a graph-based feature learning model for CPDP (GB-CPDP), that utilizes NetworkX to extract features and learn representations of program entities from control flow graphs (CFGs) and data dependency graphs (DDGs). These graphs capture the structural and data dependencies within the source code. The proposed approach employs Node2Vec to transform CFGs and DDGs into numerical vectors and leverages Long Short-Term Memory (LSTM) networks to learn predictive models. The process involves graph construction, feature learning through graph embedding and LSTM, and defect prediction. Experimental evaluation using nine open-source Java projects from the PROMISE dataset demonstrates that GB-CPDP outperforms state-of-the-art CPDP methods in terms of F1-measure and Area Under the Curve (AUC). The results showcase the effectiveness of GB-CPDP in improving the performance of cross-project defect prediction.
Cross-project defect prediction (CPDP) aims to predict defects in a target project by using a prediction model built on source projects. The main problem in CPDP is the large distribution gap between the source project and the target project, which prevents the prediction model from performing well. Most existing methods overlook the class discrimination of the learned features. Seeking an effective transferable model from the source project to the target project for CPDP is challenging. In this paper, we propose an unsupervised domain adaptation approach based on discriminative subspace learning (DSL) for CPDP. DSL treats the data from two projects as being from two domains and maps the data into a common feature space. It employs cross-domain alignment with discriminative information from different projects to reduce the distribution difference of the data between projects and incorporates the class discriminative information. Specifically, DSL first utilizes subspace-learning-based domain adaptation to reduce the distribution gap of data between different projects. Then, it makes full use of the class label information of the source project and transfers the discrimination ability of the source project to the target project in the common space. Comprehensive experiments on five projects verify that DSL can build an effective prediction model and improve performance over the related competing methods by at least 7.10% and 11.08% in terms of G-measure and AUC, respectively.
Software defect prediction plays a very important role in software quality assurance and aims to inspect as many potentially defect-prone software modules as possible. However, the performance of the prediction model is susceptible to the high dimensionality of datasets that contain irrelevant and redundant features. In addition, software metrics for software defect prediction are almost entirely traditional features, in contrast to the deep semantic feature representations obtained from deep learning techniques. To address these two issues, we propose the following two solutions in this paper: (1) We leverage a novel non-linear manifold learning method, SOINN Landmark Isomap (SL-Isomap), to extract representative features by automatically selecting a reasonable number and position of landmarks, which can reveal the complex intrinsic structure hidden behind the defect data. (2) We propose a novel defect prediction model named DLDD based on hybrid deep learning techniques, which leverages a denoising autoencoder to learn true input features that are not contaminated by noise and utilizes a deep neural network to learn abstract deep semantic features. We combine the squared error loss function of the denoising autoencoder with the cross-entropy loss function of the deep neural network to achieve the best prediction performance by adjusting a hyperparameter. We compare SL-Isomap with seven state-of-the-art feature extraction methods and compare the DLDD model with six baseline models across 20 open source software projects. The experimental results verify the superiority of SL-Isomap and DLDD on four evaluation indicators.
With the continuous expansion of software applications, requirements for software quality keep rising. Software defect prediction is an important technique for improving software quality. It typically encodes software into a set of features and applies machine learning to build defect prediction classifiers, which estimate whether a software area is clean or buggy. However, current encoding methods are mainly based on traditional manual features or on the abstract syntax tree (AST) of the source code. Traditional manual features struggle to reflect the deep semantics of programs, and the AST contains a lot of noise, which hampers the expression of semantic features. To overcome these deficiencies, we combine Convolutional Neural Networks (CNN) with a novel compiler Intermediate Representation (IR) based program encoding method for software defect prediction (CIR-CNN). First, our program encoding method is based on the compiler IR, which eliminates a large amount of noise in the syntactic structure of the source code and facilitates the acquisition of more accurate semantic information. Second, with the help of data flow analysis, a Data Dependency Graph (DDG) is constructed on the compiler IR, which helps capture deeper semantic information about the program. Finally, we use the widely used CNN model to build the software defect prediction model, which increases the adaptive ability of the method. To evaluate CIR-CNN, we set up comparative experiments on seven projects from the PROMISE datasets. The experimental results show that, in within-project defect prediction (WPDP), CIR-CNN improves prediction accuracy by 12% compared with the AST-encoded CNN-based model and by 20.9% compared with the traditional-feature-based LR model. In cross-project defect prediction (CPDP), it improves on the AST-encoded DBN-based model by 9.1% and on the traditional-feature-based TCA+ model by 19.2%.
The software engineering field has long focused on creating high-quality software despite limited resources. Detecting defects before the testing stage of software development enables quality assurance engineers to concentrate on problematic modules rather than on all modules. This approach can enhance the quality of the final product while lowering development costs. Identifying defective modules early allows for early corrections and helps ensure the timely delivery of a high-quality product that satisfies customers and instills greater confidence in the development team. This process is known as software defect prediction, and it can improve end-product quality while reducing the cost of testing and maintenance. This study proposes a software defect prediction system that utilizes data fusion, feature selection, and ensemble machine learning fusion techniques. A novel filter-based metric selection technique is proposed in the framework to select the optimum features. A three-step nested approach is presented for predicting defective modules with high accuracy. In the first step, three supervised machine learning techniques (Decision Tree, Support Vector Machines, and Naïve Bayes) are used to detect faulty modules. The second step integrates the predictive accuracy of these classification techniques through three ensemble machine learning methods: Bagging, Voting, and Stacking. Finally, in the third step, a fuzzy logic technique is employed to integrate the predictive accuracy of the ensemble machine learning techniques. The experiments are performed on a fused software defect dataset to ensure that the developed fused ensemble model performs effectively on diverse datasets. Five NASA datasets are integrated to create the fused dataset: MW1, PC1, PC3, PC4, and CM1. According to the results, the proposed system outperformed other advanced techniques for predicting software defects, achieving an accuracy of 92.08%.
Funding: Supported by the CCF-NSFOCUS 'Kunpeng' Research Fund (CCF-NSFOCUS2024012).
Abstract: In recent years, with the rapid development of software systems, the continuous expansion of software scale and the increasing complexity of systems have led to a growing number of software metrics. Defect prediction methods based on software metric elements rely heavily on software metric data. However, redundant software metric data is not conducive to efficient defect prediction and poses severe challenges to current software defect prediction tasks. To address these issues, this paper focuses on the rational clustering of software metric data. First, multiple software projects are evaluated to determine the preset number of clusters for software metrics, and various clustering methods are employed to cluster the metric elements. Subsequently, a co-occurrence matrix is designed to quantify the number of times that metrics appear in the same category. Based on the combined results, the software metric data are divided into two semantic views containing different metrics, which allows the semantic information behind the software metrics to be analyzed. On this basis, the paper also conducts an in-depth analysis of the impact of the different semantic views of metrics on defect prediction results, as well as the performance of various classification models under these semantic views. Experiments show that the joint use of the two semantic views can significantly improve model performance in software defect prediction, providing a new understanding and approach at the semantic view level for defect prediction research based on software metrics.
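The co-occurrence counting described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `co_occurrence` and the three example clusterings are hypothetical, and each clustering is assumed to be a list of cluster labels, one per metric.

```python
from itertools import combinations


def co_occurrence(labelings):
    """Count, for every pair of metrics, how many clusterings place them in the same cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1  # keep the matrix symmetric
    return counts


# Hypothetical output of three clustering methods over four metrics.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
matrix = co_occurrence(labelings)
```

Pairs with high counts, such as metrics 0 and 1 here, are grouped together by most methods regardless of how each method names its clusters, which is the kind of agreement a split into views could be based on.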
Abstract: Software defect prediction (SDP) aims to find a reliable method to predict defects in specific software projects and to help software engineers allocate limited resources to release high-quality software products. Software defect prediction can be performed effectively using traditional features, but some of these features are redundant or irrelevant (their presence or absence has little effect on the prediction results). These problems can be solved using feature selection. However, existing feature selection methods have shortcomings such as an insignificant dimensionality reduction effect and low classification accuracy of the selected optimal feature subset. To reduce the impact of these shortcomings, this paper proposes a new feature selection method, the Cubic TraverseMa Beluga whale optimization algorithm (CTMBWO), based on an improved Beluga whale optimization algorithm (BWO). The goal of this study is to determine how well CTMBWO can extract the features that are most important for correctly predicting software defects, improve the accuracy of fault prediction, reduce the number of selected features, and mitigate the risk of overfitting, thereby achieving more efficient resource utilization and a better distribution of the test workload. CTMBWO comprises three main stages: preprocessing the dataset, selecting relevant features, and evaluating the classification performance of the model. The novel feature selection method can effectively improve the performance of SDP. This study performs experiments on two software defect datasets (PROMISE, NASA) and reports the method's classification performance using five evaluation metrics: Accuracy, F1-score, MCC, AUC, and Recall. The results indicate that the approach presented in this paper achieves outstanding classification performance on both datasets and a significant improvement over the baseline models.
Funding: The authors would like to express appreciation to the National Natural Science Foundation of China (Grant No. 61762067) for their financial support.
Abstract: The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manpower. Within-project defect prediction (WPDP) is a widely used method in SDP. Despite various improvements, current methods still face challenges such as coarse-grained prediction and ineffective handling of data drift due to differences in project distribution. To address these issues, we propose a fine-grained SDP method called DIDP (drift-immune defect prediction), based on drift-immune graph neural networks (DI-GNN). DIDP converts source code into graph representations and uses DI-GNN to mitigate data drift at the model level. It also analyses the key statements leading to file defects for a more detailed SDP approach. We evaluated the performance of DIDP in WPDP by examining its file-level and statement-level accuracy compared to state-of-the-art methods, and by examining its cross-project prediction accuracy. The experimental results show that DIDP achieves significant improvements in F1-score and Recall@Top20%LOC compared to existing methods, even with large software version changes. DIDP also performed well in cross-project SDP. Our study demonstrates that DIDP achieves impressive prediction results in WPDP, effectively mitigating data drift and accurately predicting defective files. Additionally, DIDP can rank the risk of statements in defective files, aiding developers and testers in identifying potential code issues.
Funding: Supported by the National Natural Science Foundation of China (No. 62376240).
Abstract: Software defect prediction aims to use measurement data of code and historical defects to predict potential problems and to optimize testing resources and defect management. However, current methods face two challenges: (1) Coarse-grained file-level detection cannot accurately locate specific defects. (2) Fine-grained line-level defect prediction methods rely solely on the local information of a single line of code, fail to deeply analyze the semantic context of the code line, and ignore the heuristic impact of line-level context on the code line, making it difficult to capture the interaction between global and local information. Therefore, this paper proposes a telecontext-enhanced recursive interactive attention fusion method for line-level defect prediction (TRIA-LineDP). First, a bidirectional hierarchical attention network extracts semantic features and contextual information from the original code lines as the basis. Then, the extracted contextual information is forwarded to the telecontext capture module to aggregate the global context, thereby enhancing the understanding of broader code dynamics. Finally, a recursive interaction model simulates the interaction between code lines and line-level context, passing information layer by layer to enhance local and global information exchange and thereby achieve accurate defect localization. Experiments on within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) were conducted on nine different projects (encompassing a total of 32 versions). Within the same project, the proposed method improved recall at the top 20% of lines of code (Recall@Top20%LOC) and effort at top 20% recall (Effort@Top20%Recall) by 11%–52% and 23%–77%, respectively. Across different projects, it achieved improvements of 9%–60% and 18%–77%. These results are superior to existing advanced methods and indicate good detection performance.
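The two effort-aware metrics named in this abstract can be sketched as follows, assuming the definitions commonly used in line-level defect prediction: Recall@Top20%LOC is the share of defective lines found within the top 20% of lines ranked by predicted risk, and Effort@Top20%Recall is the share of lines that must be inspected to find 20% of the defective lines. The function names and the ten-line example are hypothetical, and the abstract does not state whether lines are ranked per file or pooled across files.

```python
import math


def _ranked(scores):
    """Indices of lines sorted from highest to lowest predicted risk."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_top_loc(scores, is_defective, fraction=0.2):
    """Share of defective lines found in the top `fraction` of ranked lines."""
    total = sum(is_defective)
    if total == 0:
        return 0.0
    budget = int(len(scores) * fraction)
    found = sum(is_defective[i] for i in _ranked(scores)[:budget])
    return found / total


def effort_at_top_recall(scores, is_defective, fraction=0.2):
    """Share of lines inspected before `fraction` of defective lines is found."""
    total = sum(is_defective)
    if total == 0:
        return 0.0
    target = math.ceil(total * fraction)
    found = 0
    for inspected, i in enumerate(_ranked(scores), start=1):
        found += is_defective[i]
        if found >= target:
            return inspected / len(scores)
    return 1.0


# Hypothetical risk scores and ground truth for ten lines of code.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_defective = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
recall = recall_at_top_loc(scores, is_defective)
effort = effort_at_top_recall(scores, is_defective)
```

Higher is better for the recall metric and lower is better for the effort metric, which is why both can be reported as improvements.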
Abstract: Software defect prediction plays a critical role in software development and quality assurance processes. Effective defect prediction enables testers to accurately prioritize testing efforts and enhance defect detection efficiency. Additionally, this technology provides developers with a means to quickly identify errors, thereby improving software robustness and overall quality. However, current research in software defect prediction often faces challenges, such as relying on a single data source or failing to adequately account for the characteristics of multiple coexisting data sources. This may overlook the differences and potential value of the various data sources, affecting the accuracy and generalization performance of prediction results. To address this issue, this study proposes a multivariate heterogeneous hybrid deep learning algorithm for defect prediction (DP-MHHDL). Initially, the Abstract Syntax Tree (AST), the Code Dependency Network (CDN), and code static quality metrics are extracted from source code files and used as inputs to ensure data diversity. Subsequently, for the three types of heterogeneous data, the study employs a graph convolutional network optimization model based on adjacency and spatial topologies, a Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) hybrid neural network model, and a TabNet model to extract data features. These features are then concatenated and processed through a fully connected neural network for defect prediction. Finally, the proposed framework is evaluated on ten projects from the PROMISE defect repository, and performance is assessed with three metrics: F1, area under the curve (AUC), and Matthews correlation coefficient (MCC). The experimental results demonstrate that the proposed algorithm outperforms existing methods, offering a novel solution for software defect prediction.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 61867004) and the Youth Fund of the National Natural Science Foundation of China (Grant No. 41801288).
Abstract: The purpose of software defect prediction is to identify defect-prone code modules to assist software quality assurance teams with the appropriate allocation of resources and labor. In previous software defect prediction studies, transfer learning was effective in solving the problem of inconsistent project data distribution. However, target projects often lack sufficient data, which affects the performance of the transfer learning model. In addition, the presence of uncorrelated features between projects can decrease the prediction accuracy of the transfer learning model. To address these problems, this article proposes a software defect prediction method based on stable learning (SDP-SL) that combines code visualization techniques and residual networks. The method first transforms code files into code images using code visualization techniques and then constructs a defect prediction model based on these code images. During model training, target project data are not required as prior knowledge. Following the principles of stable learning, the method dynamically adjusts the weights of source project samples to eliminate dependencies between features, thereby capturing the "invariance mechanism" within the data. This approach explores the genuine relationship between code defect features and labels, thereby enhancing defect prediction performance. To evaluate the performance of SDP-SL, comparative experiments were conducted on 10 open-source projects in the PROMISE dataset. The experimental results demonstrate that, in terms of the F-measure, the proposed SDP-SL method outperformed other within-project defect prediction methods by 2.11%-44.03%. In cross-project defect prediction, SDP-SL provided an improvement of 5.89%-25.46% in prediction performance compared to other cross-project defect prediction methods. Therefore, SDP-SL can effectively enhance both within-project and cross-project defect prediction.
Funding: National Natural Science Foundation of China, Grant/Award Number: 61867004; National Natural Science Foundation of China Youth Fund, Grant/Award Number: 41801288.
Abstract: Cross-Project Defect Prediction (CPDP) is a method that utilizes historical data from other source projects to train predictive models for defect prediction in the target project. However, existing CPDP methods only consider linear correlations between features (indicators) of the source and target projects. These models cannot evaluate non-linear correlations between features when they exist, for example, when there are differences in data distributions between the source and target projects. As a result, the performance of such CPDP models is compromised. This paper proposes a novel CPDP method based on the Synthetic Minority Oversampling Technique (SMOTE) and Deep Canonical Correlation Analysis (DCCA), referred to as S-DCCA. Canonical Correlation Analysis (CCA) is employed to address the issue of non-linear correlations between features of the source and target projects. S-DCCA extends CCA by incorporating the MlpNet model for feature extraction from the dataset. The redundant features are then eliminated by maximizing the correlated feature subset using the CCA loss function. Finally, cross-project defect prediction is achieved through the application of the SMOTE data sampling technique. Area Under Curve (AUC) and F1 score (F1) are used as evaluation metrics. Experiments were conducted on 27 projects from four public datasets to validate the proposed method. The results demonstrate that, on average, the method outperforms all baseline approaches by at least 1.2% in AUC and 5.5% in F1 score, indicating favorable performance.
Abstract: Defects can surface while customers use software, and they can be removed in updated versions. Hence, the present work examines cross-project software defect prediction through a hybrid machine learning framework. The proposed technique combines an advanced deep neural network architecture with ensemble models such as Support Vector Machine (SVM), Random Forest (RF), and XGBoost. The study evaluates performance on multiple software projects, such as CM1, JM1, KC1, and PC1, using datasets from the PROMISE Software Engineering Repository. Three hybrid models are compared: Hybrid Model-1 (SVM, RandomForest, XGBoost, Neural Network), Hybrid Model-2 (GradientBoosting, DecisionTree, LogisticRegression, Neural Network), and Hybrid Model-3 (KNeighbors, GaussianNB, Support Vector Classification (SVC), Neural Network). Hybrid Model-3 surpasses the others in terms of recall, F1-score, accuracy, ROC AUC, and precision. The work offers insights into the effectiveness of hybrid techniques for cross-project defect prediction, providing a comparative perspective on early defect identification and mitigation strategies.
Abstract: Data available in software engineering for many applications contains variability, and it is not possible to say in advance which variables help the prediction process. Most existing work in software defect prediction focuses on selecting the best prediction technique, and deep learning and ensemble models have shown promising results for this purpose. In contrast, very few studies deal with cleaning the training data and selecting the best parameter values from the data. Sometimes the data available for training the models has high variability, and this variability may decrease model accuracy. To deal with this problem, we used the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to select the best variables for training the model. A simple ANN model with one input layer, one output layer, and two hidden layers was used for training instead of a very deep and complex model. AIC and BIC values are calculated, and the combination with the minimum AIC and BIC values is selected as the best model. At first, the variables were narrowed down to a smaller number using correlation values. Then subsets for all possible variable combinations were formed. In the end, an artificial neural network (ANN) model was trained for each subset, and the best model was selected on the basis of the smallest AIC and BIC values. It was found that the combination of only two variables, ns and entropy, is best for software defect prediction, as it gives the minimum AIC and BIC values. In contrast, nm and npt form the worst combination and give the maximum AIC and BIC values.
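The selection rule in this abstract (train one model per variable subset, keep the subset with the smallest AIC and BIC) can be sketched as follows. The abstract does not give the likelihood the authors used, so this sketch assumes the standard least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with additive constants dropped. The subset names, residual sums of squares, and parameter counts are hypothetical placeholders, not results from the paper.

```python
import math


def aic_bic(rss, n, k):
    """Least-squares AIC and BIC (additive constants dropped) for a model with k parameters."""
    fit = n * math.log(rss / n)
    return fit + 2 * k, fit + k * math.log(n)


# Hypothetical residual sums of squares for models trained on two variable subsets.
n_samples = 100
candidates = {
    "subset_a": {"rss": 20.0, "k": 3},
    "subset_b": {"rss": 19.5, "k": 6},
}
scores = {name: aic_bic(c["rss"], n_samples, c["k"]) for name, c in candidates.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

In this toy case `subset_b` fits slightly better but uses twice as many parameters, so both criteria prefer `subset_a`; BIC penalizes the extra parameters more strongly because ln(100) is greater than 2.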
Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2013FL034).
Abstract: This paper applies the fuzzy measure and the fuzzy integral to the classification of software defects. The fuzzy measures of software attributes and of attribute sets are determined by a genetic algorithm, and the software attributes are then fused by the Choquet fuzzy integral algorithm. Finally, the class labels of software modules are output. Experimental results show that there are interactions between the characteristic attributes of software modules, and that the fuzzy integral fusion method using a Fuzzy Measure based on Genetic Algorithm (GA-FM) can significantly improve the accuracy of software defect prediction.
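The fusion step can be illustrated with the discrete Choquet integral: sort the attribute scores in ascending order and weight each increment by the fuzzy measure of the set of attributes whose score is at least that large. This is a minimal sketch; the attribute names `loc` and `cc` and the measure values are hypothetical and hand-picked, whereas the paper obtains the measure with a genetic algorithm.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of attribute scores with respect to a fuzzy measure.

    values:  dict mapping attribute name -> score
    measure: dict mapping frozenset of attribute names -> fuzzy measure of that set
    """
    order = sorted(values, key=values.get)  # ascending by score
    total, previous = 0.0, 0.0
    for idx, name in enumerate(order):
        coalition = frozenset(order[idx:])  # attributes scoring at least values[name]
        total += (values[name] - previous) * measure[coalition]
        previous = values[name]
    return total


# Hypothetical fuzzy measure over two attributes (not learned, chosen by hand).
measure = {
    frozenset({"loc"}): 0.3,
    frozenset({"cc"}): 0.5,
    frozenset({"loc", "cc"}): 1.0,
}
values = {"loc": 0.2, "cc": 0.6}
fused = choquet_integral(values, measure)
```

Because the measure of the pair (1.0) exceeds the sum of the singletons (0.8), this example encodes a positive interaction between the two attributes, which a plain weighted average cannot express.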
Funding: This work is supported in part by the National Science Foundation of China (Nos. 61672392, 61373038) and in part by the National Key Research and Development Program of China (No. 2016YFC1202204).
Abstract: With the continuous expansion of software scale, software update and maintenance have become increasingly important. However, frequent code updates make software more likely to introduce new defects, so predicting defects in software changes quickly and accurately has become an important problem for software developers. Current defect prediction methods often cannot reflect the feature information of defects comprehensively, and their detection results are not ideal. Therefore, we propose a novel defect prediction model named ITNB (Improved Transfer Naive Bayes), based on an improved transfer Naive Bayes algorithm, which mainly considers the following two aspects: (1) Because the edge data of the test set may affect the similarity calculation and the final prediction result, we remove the edge data of the test set when calculating the data similarity between the training set and the test set. (2) Because each feature dimension has a different effect on defect prediction, we construct the formula for the training data weight based on the feature dimension weight and data gravity, and then calculate the prior probability and the conditional probability of the training data from the weight information, so as to construct a weighted Bayesian classifier for software defect prediction. To evaluate the performance of the ITNB model, we use six datasets from large open source projects, namely Bugzilla, Columba, Mozilla, JDT, Platform, and PostgreSQL. We compare the ITNB model with the transfer Naive Bayes (TNB) model. The experimental results show that ITNB achieves better results than TNB in terms of accuracy, precision, and pd for both within-project and cross-project defect prediction.
Abstract: Developing successful software with no defects is one of the main goals of software projects. The prediction of software defects plays a vital role in providing a software project with the anticipated software quality. Machine learning, and particularly deep learning, has been advocated for predicting software defects; however, both suffer from inadequate accuracy, overfitting, and complicated structure. In this paper, we aim to address these issues. We propose a novel 1-Dimensional Convolutional Neural Network (1D-CNN) structure, a deep learning architecture that extracts useful knowledge by identifying and modelling the knowledge in the data sequence, reduces overfitting, and finally predicts whether units of code are defect-prone. We design large-scale empirical studies to reveal the proposed model's effectiveness by comparing it with four established traditional machine learning baseline models and four state-of-the-art baselines in software defect prediction on the NASA datasets. The experimental results demonstrate that, in terms of F-measure, an optimal and modest 1D-CNN with a dropout layer outperforms baseline and state-of-the-art models by 66.79% and 23.88%, respectively, while minimizing overfitting and improving prediction performance for software defects. According to the results, 1D-CNN appears to be successful in predicting software defects and may be applied to practical problems in software engineering. This, in turn, could save software development resources and produce more reliable software.
Funding: This work is supported in part by the National Science Foundation of China (61672392, 61373038) and in part by the National Key Research and Development Program of China (No. 2016YFC1202204).
Abstract: Software defect prediction is a research hotspot in the field of software engineering. However, due to the limitations of current machine learning algorithms, good defect prediction results cannot be achieved by using machine learning algorithms alone. In previous studies, some researchers used the extreme learning machine (ELM) for defect prediction. However, the initial weights and biases of the ELM are determined randomly, which reduces its prediction performance. Motivated by the idea of search-based software engineering, we propose a novel software defect prediction model named KAEA, based on kernel principal component analysis (KPCA), an adaptive genetic algorithm, the extreme learning machine, and the Adaboost algorithm. It has three main advantages: (1) KPCA can extract optimal representative features by leveraging a nonlinear mapping function. (2) We leverage the adaptive genetic algorithm to optimize the initial weights and biases of the ELM, so as to improve its generalization ability and prediction capacity. (3) We use the Adaboost algorithm to integrate multiple basic ELM predictors optimized by the adaptive genetic algorithm into a strong predictor, which can further improve defect prediction. To evaluate the performance of KAEA, we use eleven datasets from large open source projects and compare KAEA with four basic machine learning classifiers, ELM, and three ELM variants. The experimental results show that KAEA is superior to these baseline models in most cases.
Funding: This work is supported in part by the National Science Foundation of China (Grant Nos. 61672392, 61373038) and in part by the National Key Research and Development Program of China (Grant No. 2016YFC1202204).
Abstract: Software defect prediction plays an important role in software quality assurance. However, the performance of the prediction model is susceptible to irrelevant and redundant features. In addition, previous studies mostly regard software defect prediction as a single-objective optimization problem, and multi-objective software defect prediction has not been thoroughly investigated. For these two reasons, we propose the following solutions in this paper: (1) We leverage an advanced deep neural network, the Stacked Contractive AutoEncoder (SCAE), to extract robust deep semantic features from the original defect features, which have stronger discrimination capacity for the different classes (defective or non-defective). (2) We propose a novel multi-objective defect prediction model named SMONGE that utilizes the multi-objective NSGA-II algorithm to optimize an advanced neural network, the extreme learning machine (ELM), based on state-of-the-art Pareto optimal solutions according to the features extracted by SCAE. We mainly consider two objectives. One objective is to maximize the performance of the ELM, which corresponds to the benefit of the SMONGE model. The other objective is to minimize the output weight norm of the ELM, which is related to the cost of the SMONGE model. We compare SCAE with six state-of-the-art feature extraction methods, and we compare the SMONGE model with multiple baseline models, comprising four classic defect predictors and the MONGE model without SCAE, across 20 open source software projects. The experimental results verify the superiority of SCAE and SMONGE on seven evaluation metrics.
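The two-objective trade-off in this abstract (maximize ELM performance, minimize the output weight norm) rests on Pareto dominance, which can be sketched as follows. This shows only the dominance filter; NSGA-II itself adds non-dominated sorting into ranked fronts, crowding distance, and genetic operators. The function names and candidate values are hypothetical.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and strictly better on one.

    Each candidate is (performance, weight_norm): performance is maximized, weight_norm minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(other, c) for other in candidates)]


# Hypothetical (performance, output weight norm) pairs for four ELM configurations.
candidates = [(0.80, 5.0), (0.75, 3.0), (0.70, 4.0), (0.85, 9.0)]
front = pareto_front(candidates)
```

Here `(0.70, 4.0)` is dropped because `(0.75, 3.0)` is better on both objectives, while the remaining three represent different benefit and cost trade-offs.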
Funding: Supported by the National Key Research and Development Program of China (2018YFB1003702) and the National Natural Science Foundation of China (62072255).
Abstract: Software defect prediction (SDP) performs statistical analysis of historical defect data to find the distribution rule of historical defects, so as to effectively predict defects in new software. However, redundant and irrelevant features in software defect datasets affect the performance of defect predictors. To identify and remove these redundant and irrelevant features, we propose ReliefF-based clustering (RFC), a cluster-based feature selection algorithm. Then, the correlation between features is calculated based on the symmetric uncertainty. According to the degree of correlation, RFC partitions features into k clusters using the k-medoids algorithm, and finally selects the representative features from each cluster to form the final feature subset. In the experiments, we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration (NASA) software defect prediction datasets in terms of area under curve (AUC) and F-value. The experimental results show that RFC can effectively improve the performance of SDP.
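The feature correlation measure named in this abstract, symmetric uncertainty, is defined for discrete variables as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is entropy and I is mutual information. A minimal sketch follows; it assumes the features have already been discretized, which the abstract does not describe, and the function names and toy features are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 (independent) to 1 (fully redundant)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both features are constant
    mutual_information = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual_information / (hx + hy)


# Hypothetical discretized features over four modules.
feature_a = [0, 0, 1, 1]
feature_b = [0, 0, 1, 1]  # identical to feature_a
feature_c = [0, 1, 0, 1]  # independent of feature_a
su_redundant = symmetric_uncertainty(feature_a, feature_b)
su_independent = symmetric_uncertainty(feature_a, feature_c)
```

A distance such as 1 - SU could then be fed to k-medoids so that highly redundant features land in the same cluster; that conversion is an assumption here, not something the abstract states.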
Funding: Supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155885).
Abstract: Cross-project software defect prediction (CPDP) aims to enhance defect prediction in target projects with limited or no historical data by leveraging information from related source projects. Existing CPDP approaches rely on static metrics or dynamic syntactic features, which have shown limited effectiveness because they cannot capture higher-level system properties that are important for CPDP, such as complex design patterns, relationships between multiple functions, and dependencies across different software projects. This paper introduces a novel approach, a graph-based feature learning model for CPDP (GB-CPDP), that utilizes NetworkX to extract features and learn representations of program entities from control flow graphs (CFGs) and data dependency graphs (DDGs). These graphs capture the structural and data dependencies within the source code. The proposed approach employs Node2Vec to transform CFGs and DDGs into numerical vectors and leverages Long Short-Term Memory (LSTM) networks to learn predictive models. The process involves graph construction, feature learning through graph embedding and LSTM, and defect prediction. Experimental evaluation on nine open-source Java projects from the PROMISE dataset demonstrates that GB-CPDP outperforms state-of-the-art CPDP methods in terms of F1-measure and Area Under the Curve (AUC). The results showcase the effectiveness of GB-CPDP in improving the performance of cross-project defect prediction.
Funding: This paper was supported by the National Natural Science Foundation of China (61772286, 61802208, and 61876089), China Postdoctoral Science Foundation Grant 2019M651923, and the Natural Science Foundation of Jiangsu Province of China (BK0191381).
Abstract: Cross-project defect prediction (CPDP) aims to predict defects in a target project by using a prediction model built on source projects. The main problem in CPDP is the large distribution gap between the source project and the target project, which prevents the prediction model from performing well. Most existing methods overlook the class discrimination of the learned features. Finding an effective model that transfers from the source project to the target project is challenging. In this paper, we propose an unsupervised domain adaptation approach based on discriminative subspace learning (DSL) for CPDP. DSL treats the data from two projects as coming from two domains and maps the data into a common feature space. It employs cross-domain alignment with discriminative information from different projects to reduce the distribution difference between the projects' data and incorporates the class discriminative information. Specifically, DSL first utilizes subspace-learning-based domain adaptation to reduce the distribution gap between the data of different projects. Then, it makes full use of the class label information of the source project and transfers the discrimination ability of the source project to the target project in the common space. Comprehensive experiments on five projects verify that DSL can build an effective prediction model and improve performance over related competing methods by at least 7.10% and 11.08% in terms of G-measure and AUC, respectively.
Funding: This work is supported in part by the National Science Foundation of China (Grant Nos. 61672392, 61373038) and in part by the National Key Research and Development Program of China (Grant No. 2016YFC1202204).
Abstract: Software defect prediction plays a very important role in software quality assurance, aiming to inspect as many potentially defect-prone software modules as possible. However, the performance of the prediction model is susceptible to the high dimensionality of datasets that contain irrelevant and redundant features. In addition, the software metrics used for defect prediction are almost entirely traditional features, in contrast to the deep semantic feature representations produced by deep learning techniques. To address these two issues, we propose the following two solutions in this paper: (1) We leverage a novel non-linear manifold learning method, SOINN Landmark Isomap (SL-Isomap), to extract representative features by automatically selecting a reasonable number and position of landmarks, which can reveal the complex intrinsic structure hidden behind the defect data. (2) We propose a novel defect prediction model named DLDD based on hybrid deep learning techniques, which leverages a denoising autoencoder to learn true input features that are not contaminated by noise and uses a deep neural network to learn abstract deep semantic features. We combine the squared error loss function of the denoising autoencoder with the cross-entropy loss function of the deep neural network and achieve the best prediction performance by adjusting a hyperparameter. We compare SL-Isomap with seven state-of-the-art feature extraction methods and compare the DLDD model with six baseline models across 20 open-source software projects. The experimental results verify the superiority of SL-Isomap and DLDD on four evaluation indicators.
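The abstract does not state how the two losses are combined. One plausible reading, sketched below purely as an assumption, is a weighted sum in which a hyperparameter (here called `lam`) scales the autoencoder's squared reconstruction error against the classifier's binary cross-entropy. The function name, argument names, and default weight are all hypothetical.

```python
import math


def combined_loss(x, x_reconstructed, y_true, p_defect, lam=0.5):
    """Weighted sum of reconstruction error and classification loss.

    Assumed form: cross_entropy + lam * squared_error. The actual DLDD
    formulation may differ; the abstract only mentions one hyperparameter.
    """
    # Squared error between the clean input and its reconstruction.
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    # Binary cross-entropy, clamped to avoid log(0).
    eps = 1e-12
    p = min(max(p_defect, eps), 1 - eps)
    cross_entropy = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
    return cross_entropy + lam * squared_error


# Hypothetical single sample: squared error 0.25, cross-entropy ln 2.
example_loss = combined_loss([1.0, 0.0], [0.5, 0.0], y_true=1, p_defect=0.5, lam=2.0)
```

Under this reading, a larger `lam` pushes training toward faithful reconstruction, while a smaller one prioritizes separating defective from clean modules.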
Funding: This work was supported by the Universities Natural Science Research Project of Jiangsu Province under Grants 20KJB520026 and 20KJA520002, the Foundation for Young Teachers of Nanjing Auditing University under Grant 19QNPY018, and the National Natural Science Foundation of China under Grants 71972102 and 61902189.
Abstract: With the continuous expansion of software applications, requirements for software quality are increasing. Software defect prediction is an important technology for improving software quality. It typically encodes the software into several features and applies machine learning to build defect prediction classifiers, which estimate whether a software area is clean or buggy. However, current encoding methods are mainly based on traditional manual features or on the abstract syntax tree (AST) of the source code. Traditional manual features struggle to reflect the deep semantics of programs, and the AST contains a lot of noise, which affects the expression of semantic features. To overcome these deficiencies, we combine Convolutional Neural Networks (CNN) with a novel compiler Intermediate Representation (IR) based program encoding method for software defect prediction (CIR-CNN). First, our program encoding method is based on the compiler IR, which can eliminate a large amount of noise in the syntax structure of the source code and facilitate the acquisition of more accurate semantic information. Second, with the help of data flow analysis, a Data Dependency Graph (DDG) is constructed on the compiler IR, which helps capture deeper semantic information about the program. Finally, we use the widely used CNN model to build the software defect prediction model, which increases the adaptive ability of the method. To evaluate the performance of CIR-CNN, we use seven projects from the PROMISE datasets to set up comparative experiments. The experimental results show that, in within-project defect prediction (WPDP), CIR-CNN improves prediction accuracy by 12% over the AST-encoded CNN-based model and by 20.9% over the traditional-feature-based LR model. In cross-project defect prediction (CPDP), it improves on the AST-encoded DBN-based model by 9.1% and on the traditional-feature-based TCA+ model by 19.2%.
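To make the data dependency graph step concrete, the sketch below builds def-use edges over a hypothetical straight-line, three-address IR. It is an illustration under strong simplifying assumptions (no branches, no memory operations, a made-up instruction format) and not the CIR-CNN implementation; `TOY_IR` and `build_ddg` are names invented here.

```python
# Hypothetical three-address IR: (defined variable, operands it reads).
TOY_IR = [
    ("a", []),          # 0: a = input()
    ("b", ["a"]),       # 1: b = a + 1
    ("c", ["a", "b"]),  # 2: c = a * b
    ("a", ["c"]),       # 3: a = c - 2
    ("d", ["a", "b"]),  # 4: d = a + b
]


def build_ddg(instructions):
    """Return def-use edges (defining index, using index).

    Handles straight-line code only: each operand depends on the most
    recent earlier instruction that defined it.
    """
    last_definition = {}
    edges = []
    for index, (dest, operands) in enumerate(instructions):
        for operand in operands:
            if operand in last_definition:
                edges.append((last_definition[operand], index))
        # Record this instruction as the latest definition of dest.
        last_definition[dest] = index
    return edges


ddg_edges = build_ddg(TOY_IR)
```

Note that instruction 4 depends on instruction 3 rather than 0 for `a`, because the redefinition at index 3 kills the earlier value. A real compiler IR with control flow would need reaching-definitions analysis instead of this single pass.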
Funding: Supported by the Center for Cyber-Physical Systems, Khalifa University, under Grant 8474000137-RC1-C2PS-T5.
Abstract: The software engineering field has long focused on creating high-quality software despite limited resources. Detecting defects before the testing stage of software development enables quality assurance engineers to concentrate on problematic modules rather than on all modules. This approach can enhance the quality of the final product while lowering development costs. Identifying defective modules early allows for early corrections and ensures the timely delivery of a high-quality product that satisfies customers and instills greater confidence in the development team. This process is known as software defect prediction, and it can improve end-product quality while reducing the cost of testing and maintenance. This study proposes a software defect prediction system that uses data fusion, feature selection, and ensemble machine learning fusion techniques. A novel filter-based metric selection technique is proposed in the framework to select the optimum features. A three-step nested approach is presented for predicting defective modules with high accuracy. In the first step, three supervised machine learning techniques, Decision Tree, Support Vector Machines, and Naïve Bayes, are used to detect faulty modules. The second step integrates the predictive accuracy of these classification techniques through three ensemble machine learning methods: Bagging, Voting, and Stacking. Finally, in the third step, a fuzzy logic technique is employed to integrate the predictive accuracy of the ensemble machine learning techniques. The experiments are performed on a fused software defect dataset to ensure that the developed fused ensemble model can perform effectively on diverse datasets. Five NASA datasets are integrated to create the fused dataset: MW1, PC1, PC3, PC4, and CM1. According to the results, the proposed system outperformed other advanced techniques for predicting software defects, achieving an accuracy of 92.08%.
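Of the three ensemble methods named in the second step, voting is the simplest to illustrate. The sketch below shows hard (majority) voting over the outputs of three base classifiers. It is a generic illustration, not the authors' system: the prediction lists are made up, and the bagging, stacking, and fuzzy logic stages are not shown.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard majority voting.

    predictions: one list of labels per classifier, all the same length.
    With an odd number of binary classifiers there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical outputs for four modules (1 = defective, 0 = clean).
decision_tree = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
naive_bayes = [0, 1, 1, 0]

fused = majority_vote([decision_tree, svm, naive_bayes])
```

Each module is labeled defective when at least two of the three classifiers agree, so a single classifier's mistake on a module can be outvoted by the other two.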