Automatic detection of code smells using metrics and CodeT5 embeddings:
a case study in C#
Abstract
Code smells are code structures that harm software quality. The
limitations of available datasets are an obstacle to developing
automatic detectors. Furthermore, researchers have developed many
solutions for Java while neglecting other programming languages.
Recently, we created a code smell dataset for C# by following an
annotation procedure inspired
by the established annotation practices in Natural Language Processing.
This paper evaluates Machine Learning (ML) code smell detection
approaches on our novel dataset. We consider two feature representations
to train ML models: (1) code metrics and (2) CodeT5 embeddings. This
study is the first to consider the CodeT5 state-of-the-art neural source
code embedding for code smell detection in C#. To assess the
effectiveness of ML, we evaluate multiple metrics-based heuristics as
alternatives. In our experiments, the best-performing approach was the
ML classifier trained on code metrics (F-measure of 0.87 for Long Method
and 0.91 for Large Class detection). However, its improvement over the
model trained on CodeT5 embeddings is negligible when weighed against
the advantage of inferring features automatically. We show that our
model exceeds human performance and could be helpful to developers. To the
best of our knowledge, this is the first study to compare the
performance of automatic smell detectors against human performance.