Automatic detection of Feature Envy and Data Class code smells using machine learning
A code smell is a surface indication that usually corresponds to a deeper problem in the system. Detecting and removing code smells is crucial for sustainable software development. However, manual detection can be daunting and time-consuming. Machine learning (ML) is a promising approach to automating code smell detection. The first ML-based methods were classifiers trained on feature vectors comprising software metrics extracted by off-the-shelf tools. Determining the optimal set of metrics is a complex problem that requires both ML and software engineering expertise. Recently, source code embedding models have emerged as a viable alternative that infers features directly from the code. However, their potential is yet to be fully explored. To that end, we compare state-of-the-art source code embedding models (CuBERT and CodeT5) with models trained on metrics extracted by the CK Tool and RepositoryMiner. We focus on detecting the Data Class and Feature Envy code smells within a large-scale, manually labeled, publicly available dataset. After extensive experiments (51 test/train splits), we found that source code embedding models perform comparably to models trained on software metrics, and that they can indeed capture important characteristics of the source code. We discuss our findings in detail in the paper.
Science Fund of the Republic of Serbia, Grant No 6521051, AI-Clean CaDET
Ministry of Science, Technological Development and Innovation through project no. 451-03-47/2023-01/200156 “Innovative scientific and artistic research from the FTS (activity) domain”
Email Address of Submitting Author: slivkaje@uns.ac.rs
ORCID of Submitting Author: 0000-0003-0351-1183
Submitting Author's Institution: University of Novi Sad, Faculty of Technical Sciences
Submitting Author's Country