TechRxiv

Automatic detection of code smells using metrics and CodeT5 embeddings: a case study in C#

preprint
posted on 04.05.2022, 21:23 by Aleksandar Kovačević, Nikola Luburić, Jelena Slivka, Simona Prokić, Katarina-Glorija Grujić, Dragan Vidaković, Goran Sladić
Code smells are code structures that harm software quality. One obstacle to developing automatic detectors is the limitations of the available datasets. Furthermore, researchers have developed many solutions for Java while neglecting other programming languages. Recently, we created a code smell dataset for C# by following an annotation procedure inspired by established annotation practices in Natural Language Processing. This paper evaluates Machine Learning (ML) code smell detection approaches on this novel dataset. We consider two feature representations for training ML models: (1) code metrics and (2) CodeT5 embeddings. This study is the first to consider CodeT5, a state-of-the-art neural source code embedding, for code smell detection in C#. To assess the effectiveness of ML, we consider multiple metrics-based heuristics as alternatives. In our experiments, the best-performing approach was the ML classifier trained on code metrics (F-measure of 0.87 for Long Method and 0.91 for Large Class detection). However, its performance improvement over CodeT5 features is negligible given the advantages of automatically inferring features. We showed that our model exceeds human performance and could be helpful to developers. To the best of our knowledge, this is the first study to compare the performance of automatic smell detectors against human performance.
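
As an illustration of the kind of metrics-based heuristic and F-measure evaluation the abstract refers to, the following is a minimal sketch and not the authors' implementation. The metric names, the thresholds (50 lines of code, cyclomatic complexity of 10), the method names, and the labels are all hypothetical values chosen for the example; none of them come from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass
class MethodMetrics:
    """Hypothetical per-method code metrics (not the paper's feature set)."""
    name: str
    lines_of_code: int
    cyclomatic_complexity: int


def is_long_method(m, loc_threshold=50, cc_threshold=10):
    """Flag a method as a Long Method candidate.

    The thresholds are assumptions for illustration only.
    """
    return (m.lines_of_code > loc_threshold
            or m.cyclomatic_complexity > cc_threshold)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall for the positive (smelly) class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up methods and made-up annotator labels (True = Long Method).
methods = [
    MethodMetrics("ParseConfig", 120, 14),
    MethodMetrics("GetName", 3, 1),
    MethodMetrics("ValidateOrder", 45, 12),
    MethodMetrics("RenderReport", 60, 4),
]
labels = [True, False, False, True]

predictions = [is_long_method(m) for m in methods]
score = f_measure(predictions, labels)
print(predictions, round(score, 2))
```

In this toy example the heuristic produces one false positive, which lowers precision while recall stays at 1. An ML classifier trained on the same metrics would learn such decision boundaries from annotated data instead of relying on fixed thresholds.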

Funding

This research was supported by the Science Fund of the Republic of Serbia, Grant No 6521051, AI-Clean CaDET.

Email Address of Submitting Author

slivkaje@uns.ac.rs

ORCID of Submitting Author

https://orcid.org/0000-0003-0351-1183

Submitting Author's Institution

University of Novi Sad, Faculty of Technical Sciences

Submitting Author's Country

Serbia
