Abstract
Code smells are structures in code that indicate the presence of
maintainability issues. A significant problem with code smells is their
ambiguity: they are challenging to define, and software engineers differ
in their understanding of what a code smell is and of which code suffers
from code smells.
A solution to this problem could be an AI digital assistant that
understands code smells and can detect (and perhaps resolve) them.
However, developing such an assistant is challenging because there are
few usable datasets of code smells on which to train and evaluate it.
Furthermore, the existing datasets suffer from issues that mostly stem
from the unsystematic approaches used to construct them.
In this work, we address these issues by developing a procedure for
the systematic manual annotation of code smells. We use this procedure
to build a dataset of code smells, and during this process we refine the
procedure and identify recommendations and pitfalls for its use. The
primary contributions are the proposed annotation model and procedure,
together with the annotators’ experience report. The dataset and
supporting tool are secondary contributions of our study. Notably, our
dataset comprises open-source projects written in the C# programming
language, whereas almost all existing manually annotated datasets
contain projects written in Java.