Automatic detection of Feature Envy and Data Class code smells using machine learning

Milica Škipina; Jelena Slivka; Nikola Luburić; Aleksandar Kovačević

doi:10.36227/techrxiv.21732059.v2

Abstract

A code smell is a surface indication that usually corresponds to a deeper problem in the system. Detecting and removing code smells is crucial for sustainable software development. However, manual detection can be daunting and time-consuming. Machine learning (ML) is a promising approach to automating code smell detection. The first ML-based methods were classifiers trained on feature vectors comprising software metrics extracted by off-the-shelf tools. Determining the optimal set of metrics is a complex problem that requires both ML and software engineering expertise. Recently, source code embedding models have emerged as a viable feature-inferring alternative. However, their potential is yet to be fully explored. To that end, we compare state-of-the-art source code embedding models (CuBERT and CodeT5) with models trained on metrics returned by the CK Tool and RepositoryMiner tools. We focus on detecting the Data Class and Feature Envy code smells within a large-scale, manually labeled, publicly available dataset. After extensive experiments (51 train/test splits), we found that source code embedding models perform comparably to models trained on software metrics, and that they can indeed capture important characteristics of the source code. We discuss our findings in detail in the paper.

Jun 2024. Published in Expert Systems with Applications, volume 243, 122855. doi:10.1016/j.eswa.2023.122855

Peer review status: Published