Nikola Luburić - TechRxiv

Nikola Luburić

Public Documents 5

Automatic detection of Feature Envy and Data Class code smells using machine learning

Milica Škipina

and 3 more

March 27, 2023

A code smell is a surface indication that usually corresponds to a deeper problem in the system. Detecting and removing code smells is crucial for sustainable software development. However, manual detection can be daunting and time-consuming. Machine learning (ML) is a promising approach towards the automation of code smell detection. The first ML-based methods were classifiers trained on feature vectors comprising software metrics extracted by off-the-shelf tools. Determining the optimal set of metrics is a complex problem that requires both ML and software engineering expertise. Recently source code embedding models emerged as a viable feature-inferring alternative. However, their potential is yet to be fully explored. To that aim, we compare state-of-the-art source code embedding models (CuBERT and CodeT5) with the models trained on metrics returned by the CK Tool and RepositoryMiner tools. We focus on detecting the Data Class and Feature Envy code smells within a large-scale, manually labeled, publicly available dataset. After extensive experiments (51 test/train splits), we found that source code embedding models have comparable performances with software metrics, a that they indeed can capture important characteristics of the source code. We discuss our findings in detail in the paper.

Using ChatGPT to Annotate a Dataset: A Case Study in Intelligent Tutoring Systems

Aleksandar Vujinović

and 3 more

July 10, 2023

Large language models like ChatGPT can learn in-context (ICL) from examples. Studies showed that, due to ICL, ChatGPT achieves impressive performance in various natural language processing tasks. However, to the best of our knowledge, this is the first study that assesses ChatGPT’s effectiveness in annotating a dataset for training instructor models in intelligent tutoring systems (ITSs). The task of an ITS’s instructor model is to mimic the human instructor by providing an effective tutoring action for a given student’s state. The instructor models are typically implemented as hardcoded rules, limiting their ability to personalize instruction. This problem could be mitigated by utilizing machine learning (ML). However, training supervised ML models requires a large dataset of student states annotated by corresponding tutoring actions. Using human experts to annotate such datasets is expensive, time-consuming, and requires pedagogical expertise. Thus, this study explores ChatGPT’s potential to act as a pedagogy expert annotator. Using prompt engineering, we created a list of actions a tutor could recommend to a student. We manually filtered this list and instructed ChatGPT to select the appropriate action from the list for the given student’s state. We manually analyzed ChatGPT’s responses that could be considered incorrect labels. Our results indicate that using ChatGPT as an annotator is an effective alternative to human experts. The contributions of our work are (1) a novel dataset annotation methodology for the ITS context, (2) a publicly available dataset of student states annotated with tutoring advice, and (3) a list of possible pedagogical actions.

Automatic detection of code smells using metrics and CodeT5 embeddings: a case study...

Aleksandar Kovačević

and 6 more

August 09, 2022

Code smells are code structures that harm the software’s quality. An obstacle to developing automatic detectors is the available datasets’ limitations. Furthermore, researchers developed many solutions for Java while neglecting other programming languages. Recently, we created the code smell dataset for C# by following an annotation procedure inspired by the established annotation practices in Natural Language Processing. This paper evaluates Machine Learning (ML) code smell detection approaches on our novel dataset. We consider two feature representations to train ML models: (1) code metrics and (2) CodeT5 embeddings. This study is the first to consider the CodeT5 state-of-the-art neural source code embedding for code smell detection in C#. To prove the effectiveness of ML, we consider multiple metrics-based heuristics as alternatives. In our experiments, the best-performing approach was the ML classifier trained on code metrics (F-measure of 0.87 for Long Method and 0.91 for Large Class detection). However, the performance improvement over CodeT5 features is negligible if we consider the advantages of automatically inferring features. We showed that our model exceeds human performance and could be helpful to developers. To the best of our knowledge, this is the first study to compare the performance of automatic smell detectors against human performance.

Automatic detection of Long Method and God Class code smells through neural source co...

Aleksandar Kovačević

and 6 more

December 22, 2021

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging and researchers proposed many automatic code smell detectors. Most of the studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors using small-scale case studies and an inconsistent experimental setting. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection. This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT). We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem – we classify the code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as a measure of performance. In our experiments, the ML classifier trained using CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform the error analysis to discuss the advantages of the CuBERT approach. This study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection to the best of our knowledge. A secondary contribution of our study is the systematic evaluation of the effectiveness of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.

Towards a systematic approach to manual annotation of code smells

Nikola Luburić

and 6 more

August 14, 2023

This is a preprint of an article published in the Science of Computer Programming. The final peer-reviewed publication is available online at: https://doi.org/10.1016/j.scico.2023.102999 Code smells are structures in code that indicate the presence of maintainability issues. A significant problem with code smells is their ambiguity. They are challenging to define, and software engineers have a different understanding of what a code smell is and which code suffers from code smells. A solution to this problem could be an AI digital assistant that understands code smells and can detect (and perhaps resolve) them. However, it is challenging to develop such an assistant as there are few usable datasets of code smells on which to train and evaluate it. Furthermore, the existing datasets suffer from issues that mostly arise from an unsystematic approach used for their construction. Through this work, we address this issue by developing a procedure for the systematic manual annotation of code smells. We use this procedure to build a dataset of code smells. During this process, we refine the procedure and identify recommendations and pitfalls for its use. The primary contribution is the proposed annotation model and procedure and the annotators’ experience report. The dataset and supporting tool are secondary contributions of our study. Notably, our dataset includes open-source projects written in the C# programming language, while almost all manually annotated datasets contain projects written in Java.