Duplicate bug reporting is a critical problem in the area of mining software repositories. Duplicate bug reports can lead to redundant effort, wasted resources, and delayed software releases. Thus, their accurate identification is essential for streamlining the bug triage process. Several researchers have explored classical information retrieval, natural language processing, text and data mining, and machine learning approaches. The emergence of large language models (LLMs) (ChatGPT and Huggingface) has presented a new line of models for semantic textual similarity (STS). Although LLMs have shown remarkable advancements, there remains a need for longitudinal studies to determine whether performance improvements are due to the scale of the models or to the unique embeddings they produce compared with classical encoding models. This study systematically investigates this issue by comparing classical word embedding techniques against LLM-based embeddings for duplicate bug detection. We propose an amalgamation of models that detects duplicate bug reports using both textual and non-textual information about bug reports. The empirical evaluation was performed on open-source datasets using established metrics: mean reciprocal rank (MRR), mean average precision (MAP), and recall rate. The experimental results show that combined LLMs can outperform (recall-rate@k 68%–74%) other individual models for duplicate bug detection. These findings highlight the effectiveness of amalgamating multiple techniques in improving duplicate bug report detection accuracy.
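The ranking metrics named in the abstract above (MRR, MAP, and recall-rate@k) can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes each query bug report comes with a ranked list of candidate report IDs and a set of known duplicates, and it assumes recall-rate@k means the fraction of queries with at least one true duplicate among the top k candidates. All function names, report IDs, and data below are hypothetical.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], duplicates: Set[str]) -> float:
    """Return 1/rank of the first true duplicate, or 0.0 if none is retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in duplicates:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], duplicates: Set[str]) -> float:
    """Average the precision measured at each rank where a true duplicate appears."""
    if not duplicates:
        return 0.0
    hits = 0
    total = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in duplicates:
            hits += 1
            total += hits / rank
    return total / len(duplicates)


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         truth: Dict[str, Set[str]]) -> float:
    """Mean of the reciprocal ranks over all query reports."""
    return sum(reciprocal_rank(r, truth[q]) for q, r in rankings.items()) / len(rankings)


def mean_average_precision(rankings: Dict[str, List[str]],
                           truth: Dict[str, Set[str]]) -> float:
    """Mean of the average precision over all query reports."""
    return sum(average_precision(r, truth[q]) for q, r in rankings.items()) / len(rankings)


def recall_rate_at_k(rankings: Dict[str, List[str]],
                     truth: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one true duplicate in the top k (assumed definition)."""
    found = sum(
        1 for q, ranked in rankings.items()
        if any(candidate in truth[q] for candidate in ranked[:k])
    )
    return found / len(rankings)


# Hypothetical toy data: three query reports and their ranked candidates.
RANKINGS = {
    "q1": ["b3", "b1", "b7"],
    "q2": ["b9", "b2", "b4"],
    "q3": ["b5", "b6", "b8"],
}
TRUTH = {
    "q1": {"b1"},
    "q2": {"b9", "b4"},
    "q3": {"b0"},  # the true duplicate is never retrieved
}

if __name__ == "__main__":
    print("MRR:", mean_reciprocal_rank(RANKINGS, TRUTH))
    print("MAP:", mean_average_precision(RANKINGS, TRUTH))
    print("recall-rate@2:", recall_rate_at_k(RANKINGS, TRUTH, 2))
```

On this toy data, q1 finds its duplicate at rank 2, q2 at rank 1, and q3 never does, so MRR is (1/2 + 1 + 0) / 3 = 0.5.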
Software bugs are unavoidable in software development and maintenance. Many methods have been discussed in the literature, but they fail to achieve efficient software bug detection and classification. In this paper, an efficient Adaptive Deep Learning Model (ADLM) is developed for automatic duplicate bug report detection and classification. The proposed ADLM combines Conditional Random Fields decoding with Long Short-Term Memory (CRF-LSTM) and the Dingo Optimizer (DO). In the CRF, the DO is used to select efficient weight values in the network. The proposed automatic bug report detection proceeds in three stages: pre-processing, feature extraction, and bug detection with classification. Initially, the bug report input dataset is gathered from an online source system. In the pre-processing phase, unwanted information is removed from the input data by cleaning the text, converting data types, and replacing null values. The pre-processed data is then passed to the feature extraction phase, in which four types of features are extracted: contextual, categorical, temporal, and textual. Finally, the features are passed to the proposed ADLM for automatic duplicate bug report detection and classification. The methodology has two phases, training and testing. Based on this working process, bugs are detected and classified from the input data. The proposed technique is assessed using performance metrics such as accuracy, precision, recall, F-measure, and kappa.
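The pre-processing steps named in the abstract above (text cleaning, data type conversion, and null value replacement) might look roughly like the following sketch. It is an illustration under assumptions, not the authors' pipeline. The field names (`summary`, `component`, `priority`, `created`), the date format, and the default values used for missing data are all hypothetical.

```python
import re
from datetime import datetime
from typing import Dict, Optional


def clean_text(text: str) -> str:
    """Strip HTML-like tags, lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove markup remnants
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # keep only letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()


def preprocess(report: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Return a cleaned copy of one raw bug report (hypothetical schema)."""
    raw_priority = report.get("priority")
    raw_created = report.get("created")
    return {
        # Text cleaning; a missing summary becomes an empty string.
        "summary": clean_text(report.get("summary") or ""),
        # Null value replacement for a categorical field (assumed default).
        "component": report.get("component") or "unknown",
        # Data type conversion: string to int, with 0 as an assumed default.
        "priority": int(raw_priority) if raw_priority else 0,
        # Data type conversion: string to datetime, None when absent.
        "created": datetime.strptime(raw_created, "%Y-%m-%d") if raw_created else None,
    }


if __name__ == "__main__":
    raw = {
        "summary": "<b>Crash</b> on  Start-up!!",
        "component": None,
        "priority": "3",
        "created": "2021-05-04",
    }
    print(preprocess(raw))
```

The cleaned record could then feed a feature extraction stage, where the summary supplies textual features, the component a categorical feature, and the creation date a temporal one.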
Locating bug code snippets (BugCode for short) has been a complex problem throughout the history of software security, mainly because the constraints that define BugCode are obscure and hard to summarize. Previously, security analysts attempted to define such constraints manually (e.g., limiting buffer size to detect overflow), but were limited to certain types of BugCode. Recent research addresses this problem by extracting constraints from program documentation, which shows potential for detecting API misuse. For bugs beyond the scope of API misuse, however, such an approach becomes less effective because the corresponding constraints are not defined in the documentation, not to mention programs that have no documentation at all. In this paper, inspired by the fact that expert programmers often correct BugCode on open forums such as StackOverflow, we design an approach that automatically extracts knowledge from StackOverflow and leverages it to detect BugCode. The content on StackOverflow comes from ordinary developers. Their writing tends to be loosely organized and varied in style, which makes it more challenging to analyze than program documentation. To address these challenges, we design a custom tokenization approach to segment sentences and employ sentiment analysis to find the Controversial Sentences (CSs) that typically contain the constraints we need for code analysis. We then use constituency parsing to extract knowledge from the CSs, which helps locate BugCode. We evaluated our system on 41,144 comments from questions tagged with Java and Android. The results show that our approach achieves 95.5% precision in discovering CSs. We discovered 276 pieces of BugCode that were confirmed through manual validation, including one that was assigned a CVE. 89.3% of the discovered bugs remained in the current version of the answers, unknown to users.