Automatic detection of Feature Envy and Data Class code smells using
machine learning
Abstract
A code smell is a surface indication that usually corresponds to a
deeper problem in the system. Detecting and removing code smells is
crucial for sustainable software development. However, manual detection
can be daunting and time-consuming. Machine learning (ML) is a promising
approach towards the automation of code smell detection. The first
ML-based methods were classifiers trained on feature vectors comprising
software metrics extracted by off-the-shelf tools. Determining the
optimal set of metrics is a complex problem that requires both ML and
software engineering expertise. Recently, source code embedding models
have emerged as a viable feature-inferring alternative. However, their
potential is yet to be fully explored. To that end, we compare
state-of-the-art source code embedding models (CuBERT and CodeT5) with
models trained on metrics extracted by the CK Tool and
RepositoryMiner. We focus on detecting the Data Class and Feature
Envy code smells within a large-scale, manually labeled, publicly
available dataset. After extensive experiments (51 test/train splits),
we found that source code embedding models perform comparably to
software metrics and that they can indeed capture important
characteristics of the source code. We discuss our findings in detail in
the paper.
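
The evaluation protocol summarized above (repeated train/test splits, with the same splits reused for each feature representation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the feature vectors are synthetic, the nearest-centroid classifier is a stand-in for the actual models, and the 80/20 stratified split, the F1 score as the metric, and all function names are hypothetical choices.

```python
import random
import statistics


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_predict(X, y, train_idx, test_idx):
    """Nearest-centroid classifier (assumed stand-in, not the paper's model)."""
    classes = sorted(set(y))
    cents = {c: centroid([X[i] for i in train_idx if y[i] == c]) for c in classes}
    return [min(classes, key=lambda c: sq_dist(X[i], cents[c])) for i in test_idx]


def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive (smelly) class; 0.0 when it is undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(X, y, n_splits=51, test_fraction=0.2, seed=0):
    """F1 per split. Splits depend only on (y, seed), so two feature sets
    evaluated with the same seed are compared on identical splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train_idx, test_idx = stratified_split(y, test_fraction, rng)
        pred = fit_predict(X, y, train_idx, test_idx)
        scores.append(f1_score([y[i] for i in test_idx], pred))
    return scores


def make_synthetic(labels, dim, shift, seed):
    """Synthetic vectors: the positive class is shifted by `shift` per dimension."""
    rng = random.Random(seed)
    return [[rng.gauss(shift * lab, 1.0) for _ in range(dim)] for lab in labels]


# Hypothetical data: 40 smelly and 160 clean code snippets.
labels = [1] * 40 + [0] * 160
X_metrics = make_synthetic(labels, dim=8, shift=1.5, seed=1)   # "metrics" vectors
X_embed = make_synthetic(labels, dim=32, shift=0.8, seed=2)    # "embedding" vectors

for name, X in [("metrics", X_metrics), ("embeddings", X_embed)]:
    scores = evaluate(X, labels)
    print(name, round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))
```

Reusing the same seed for both representations keeps the comparison paired, so a difference in mean F1 reflects the features rather than the luck of a particular split.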