Monitoring forest health using hyperspectral imagery: Does feature selection improve the performance of machine-learning techniques?
preprint
posted on 2020-08-14, 15:32 authored by Patrick Schratz, Jannes Muenchow, Eugenia Iturritxa, José Cortés, Bernd Bischl, Alexander Brenning
This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple machine- and statistical-learning methods.
The effect of filter-based feature-selection methods on predictive performance was compared.
Also, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated.
Defoliation of trees (%) was modeled as a function of reflectance, and variable importance was assessed using permutation-based feature importance.
Overall, the support vector machine (SVM) outperformed the other learners, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regression, by at least three percentage points.
The combination of certain feature sets showed small increases in predictive performance while no substantial differences between individual feature sets were observed.
For some combinations of learners and feature sets, filter methods achieved better predictive performances than the unfiltered feature sets, while ensemble filters did not have a substantial impact on performance.
Permutation-based feature importance estimated features around the red edge to be most important for the models.
However, the presence of features in the near-infrared region (800 nm - 1000 nm) was essential to achieve the best performances.
More training data and replication in similar benchmarking studies are needed for more generalizable conclusions.
Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies.
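The two techniques named in the abstract, filter-based feature selection and permutation-based feature importance, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the filter is a plain absolute-correlation ranking (only one of many possible filters, and not necessarily one used in the study), and `predict` is a hypothetical hand-set linear model rather than any of the fitted learners compared here.

```python
import math
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def correlation_filter(X, y, k):
    """Return the (sorted) indices of the k features with the highest |correlation| to y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(mean(increases))
    return importances


# Synthetic toy data (not from the study): features 0 and 1 drive the target,
# feature 2 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [60 * row[0] + 30 * row[1] + data_rng.gauss(0, 1) for row in X]

# Filter step: keep the two features most correlated with the target.
selected = correlation_filter(X, y, k=2)


def predict(row):
    """Hypothetical stand-in for a fitted model; it ignores feature 2."""
    return 60 * row[0] + 30 * row[1]


# Importance step: a larger RMSE increase means the model relies more on that feature.
importances = permutation_importance(predict, X, y)
```

In this toy setting the filter is computed before any model is fitted, while the permutation importance is computed afterwards from the model's predictions, which is why the two can disagree in practice when features are strongly correlated.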
Funding
LIFE14 ENV/ES/000179
History
Email Address of Submitting Author
patrick.schratz@gmail.com
ORCID of Submitting Author
0000-0003-0748-6624
Submitting Author's Institution
Friedrich-Schiller-University Jena
Submitting Author's Country
Germany