Automatic detection of code smells using metrics and CodeT5 embeddings: a case study in C#
preprint
posted on 2022-08-09, 04:58 authored by Aleksandar Kovačević, Nikola Luburić, Jelena Slivka, Simona Prokić, Katarina-Glorija Grujić, Dragan Vidaković, Goran Sladić
Code smells are code structures that harm the software’s quality. An obstacle to
developing automatic detectors is the available datasets' limitations. Furthermore,
researchers have developed many solutions for Java while neglecting other
programming languages. Recently, we created a code smell dataset for C# by
following an annotation procedure inspired by the established annotation
practices in Natural Language Processing. This paper evaluates Machine Learning
(ML) code smell detection approaches on our novel dataset. We consider two feature
representations to train ML models: (1) code metrics and (2) CodeT5 embeddings.
This study is the first to consider the CodeT5 state-of-the-art
neural source code embedding for code smell detection in C#. To assess
the effectiveness of ML, we compare it against multiple metrics-based heuristics.
In our experiments, the best-performing approach was the ML classifier trained on
code metrics (F-measure of 0.87 for Long Method and 0.91 for Large Class detection).
However, its performance improvement over CodeT5 features is negligible given
the advantage of inferring features automatically. We show that our model
exceeds human performance and could be
helpful to developers. To the best of our knowledge, this is the first study to
compare the performance of automatic smell detectors against human performance.
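
To make the comparison above concrete, the following is a minimal sketch of what a metrics-based heuristic for Long Method and an F-measure evaluation might look like. It is an illustration under stated assumptions, not the detectors or data used in the study: the `MethodMetrics` type, the two thresholds, and the toy labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MethodMetrics:
    """Code metrics for one method (hypothetical container, not from the paper)."""
    name: str
    lines_of_code: int
    cyclomatic_complexity: int


# Hypothetical thresholds chosen only for illustration.
LOC_THRESHOLD = 50
COMPLEXITY_THRESHOLD = 10


def is_long_method(m: MethodMetrics) -> bool:
    """Flag a method as Long Method when both metrics exceed their thresholds."""
    return (m.lines_of_code > LOC_THRESHOLD
            and m.cyclomatic_complexity > COMPLEXITY_THRESHOLD)


def f_measure(predicted: list[bool], actual: list[bool]) -> float:
    """Harmonic mean of precision and recall for the positive (smelly) class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: metrics plus a made-up annotator label for each method.
methods = [
    MethodMetrics("A", 120, 25),
    MethodMetrics("B", 80, 12),
    MethodMetrics("C", 60, 4),
    MethodMetrics("D", 10, 1),
]
labels = [True, False, True, False]

predictions = [is_long_method(m) for m in methods]
score = f_measure(predictions, labels)
```

An ML detector would replace `is_long_method` with a classifier trained on the same metrics (or on embeddings), and would be scored with the same F-measure so the two approaches are directly comparable.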
Funding
This research was supported by the Science Fund of the Republic of Serbia, Grant No 6521051, AI-Clean CaDET.
History
Email Address of Submitting Author
slivkaje@uns.ac.rs
ORCID of Submitting Author
https://orcid.org/0000-0003-0351-1183
Submitting Author's Institution
University of Novi Sad, Faculty of Technical Sciences
Submitting Author's Country
- Serbia