Improving Transfer Learning for Cross Project Defect Prediction
preprint
posted on 2022-07-25, 21:29 authored by Osayande Pascal Omondiagbe, Sherlock Licorish, Stephen G. MacDonell
Cross-project defect prediction (CPDP) makes use of cross-project (CP) data to overcome the lack of data necessary to
train well-performing software defect prediction (SDP) classifiers in the early stage of new software projects. Since the CP data (known
as the source) may be different from the new project’s data (known as the target), this makes it difficult for CPDP classifiers to perform
well. In particular, it is a mismatch of data distributions between source and target that creates this difficulty. Transfer learning-based
CPDP classifiers are designed to minimize these distribution differences. The first transfer learning-based CPDP classifiers treated
the marginal and conditional distribution differences equally, thereby degrading prediction performance. To address this, recent research has proposed the Weighted
Balanced Distribution Adaptation (W-BDA) method, which weights the importance of both distribution differences to improve classification
performance. Although W-BDA has been shown to improve model performance in CPDP, research to date has failed to consider model
performance in light of increasing target data or variances in data sampling. We provide the first investigation of when, and to what
extent, increasing the target data and using various sampling techniques affect performance when leveraging the importance of both
distribution differences. We extend the initial W-BDA method and call this extension the W-BDA+ method. To evaluate the effectiveness
of W-BDA+ for improving CPDP performance, we conduct eight experiments on 18 projects from four datasets where data sampling
was performed with different sampling methods. We evaluate our method using four complementary indicators (i.e., Balanced
Accuracy, AUC, F-measure and G-Measure). Our findings reveal an average improvement of 6%, 7.5%, 10% and 12%, respectively, for these four
indicators when W-BDA+ is compared to five other baseline methods (including W-BDA), for all four of the sampling methods used.
Also, as the target to source ratio is increased with different sampling methods, we observe a decrease in performance for the original
W-BDA, with our W-BDA+ approach outperforming the original W-BDA in most cases. Our results highlight the importance of adjusting
for data imbalance and having an awareness of the effect of the increasing availability of target data in CPDP scenarios.
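
The four indicators named above can be computed from a binary confusion matrix and a set of ranking scores. The following is a minimal Python sketch, not the authors' evaluation code. The function names are hypothetical, label 1 is assumed to mean "defective", and G-Measure is assumed to be the harmonic mean of recall and specificity, a definition common in defect prediction work that the abstract itself does not spell out.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = defective (assumed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def _safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Arithmetic mean of recall and specificity."""
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    return (recall + specificity) / 2


def f_measure(tp, tn, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return _safe_div(2 * precision * recall, precision + recall)


def g_measure(tp, tn, fp, fn):
    """Harmonic mean of recall and specificity (assumed definition)."""
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    return _safe_div(2 * recall * specificity, recall + specificity)


def auc(y_true, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if a class is missing.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Balanced Accuracy and G-Measure both combine recall with specificity, so neither is inflated by a large majority of non-defective modules, which is why such indicators are often reported alongside F-measure and AUC on imbalanced defect data.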
History
Email Address of Submitting Author
omondiagbep@landcareresearch.co.nz
Submitting Author's Institution
Landcare Research
Submitting Author's Country
New Zealand