Automatic detection of code smells using metrics and CodeT5 embeddings: a case study in C#
preprintposted on 04.05.2022, 21:23 authored by Aleksandar Kovačević, Nikola LuburićNikola Luburić, Jelena SlivkaJelena Slivka, Simona Prokić, Katarina-Glorija Grujić, Dragan Vidaković, Goran Sladić
Code smells are code structures that harm the software’s quality. An obstacle to developing automatic detectors is the available datasets' limitations. Furthermore, researchers developed many solutions for Java while neglecting other programming languages. Recently, we created the code smell dataset for C# by following an annotation procedure inspired by the established annotation practices in Natural Language Processing. This paper evaluates Machine Learning (ML) code smell detection approaches on our novel dataset. We consider two feature representations to train ML models: (1) code metrics and (2) CodeT5 embeddings. This study is the first to consider the CodeT5 state-of-the-art neural source code embedding for code smell detection in C#. To prove the effectiveness of ML, we consider multiple metrics-based heuristics as alternatives. In our experiments, the best-performing approach was the ML classifier trained on code metrics (F-measure of 0.87 for Long Method and 0.91 for Large Class detection). However, the performance improvement over CodeT5 features is negligible if we consider the advantages of automatically inferring features. We showed that our model exceeds human performance and could be helpful to developers. To the best of our knowledge, this is the first study to compare the performance of automatic smell detectors against human performance.