Monitoring forest health using hyperspectral imagery: Does feature
selection improve the performance of machine-learning techniques?
Abstract
This study analyzed highly correlated, feature-rich datasets from
hyperspectral remote sensing data using multiple machine-learning and
statistical-learning methods.
The effects of filter-based feature-selection methods on predictive
performance were compared.
In addition, the effect of multiple expert-based and data-driven feature
sets, derived from the reflectance data, was investigated.
Defoliation of trees (%) was modeled as a function of reflectance, and
variable importance was assessed using permutation-based feature
importance.
Overall, the support vector machine (SVM) outperformed other learners
such as random forest (RF), extreme gradient boosting (XGBoost), lasso
(L1) and ridge (L2) regression by at least three percentage points.
The combination of certain feature sets showed small increases in
predictive performance while no substantial differences between
individual feature sets were observed.
For some combinations of learners and feature sets, filter methods
achieved better predictive performance than the unfiltered feature
sets, while ensemble filters did not have a substantial impact on
performance.
Permutation-based feature importance identified features around the red
edge as the most important for the models.
However, the presence of features in the near-infrared region (800 nm to
1000 nm) was essential to achieve the best performance.
More training data and replication in similar benchmarking studies are
needed for more generalizable conclusions.
Filter methods have the potential to be helpful in high-dimensional
situations and can improve the interpretation of feature effects in
fitted models, which is an essential requirement in environmental
modeling studies.
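
The two techniques named in this abstract, filter-based feature
selection and permutation-based feature importance, can be illustrated
with a minimal sketch. It is not the study's implementation: it assumes
Python with scikit-learn, uses synthetic data in place of the
reflectance features, and picks a univariate F-test filter, a fixed
number of retained features, and an RBF-kernel SVM purely for
illustration. The band indices 10 and 40 and all parameter values are
hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for hyperspectral features (assumption, not study data):
# 200 observations, 60 "bands", of which only bands 10 and 40 drive the response.
rng = np.random.default_rng(0)
n_samples, n_bands = 200, 60
X = rng.normal(size=(n_samples, n_bands))
y = 3.0 * X[:, 10] + 2.0 * X[:, 40] + rng.normal(scale=0.1, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Filter step: score each feature against the response on its own
# (univariate F-test) and keep the 10 highest-scoring features
# before the learner is fitted.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=10),
    SVR(kernel="rbf", C=10.0),
)
model.fit(X_train, y_train)

# Indices of the features that passed the filter.
selected = model.named_steps["selectkbest"].get_support(indices=True)

# Permutation importance: shuffle one feature at a time on held-out data
# and record how much the model score (R^2 by default) drops.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
top_feature = int(np.argmax(result.importances_mean))

print("features kept by the filter:", selected)
print("most important feature:", top_feature)
```

Because the filter sits inside the pipeline, features that it discards
receive a permutation importance of exactly zero, so the importance
ranking is restricted to the retained features.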