Improving Rare Tree Species Classification Using Domain Knowledge

Forest inventory forms the foundation of forest management. Remote sensing (RS) is an efficient means of measuring forest parameters at scale. Remotely sensed species classification can be used to estimate species abundances, distributions, and to better approximate metrics such as aboveground biomass. State-of-the-art methods of RS species classification rely on deep-learning models such as convolutional neural networks (CNNs). These models have two major drawbacks: they require large samples of each species to classify well and they lack explainability. Therefore, rare species are poorly classified causing poor approximations of their associated parameters. We show that the classification of rare species can be improved by as much as eight F1-points using a neuro-symbolic (NS) approach that combines CNNs with an NS framework. The framework allows for the incorporation of domain knowledge into the model through the use of mathematically represented rules, improving model explainability.


I. INTRODUCTION
Forests play a vital role in maintaining life on Earth. They store carbon, are a habitat for countless animals, and provide fuel and production materials for numerous industries. As a result, governments and the forestry industry invest heavily in forest monitoring and management. Traditional inventory methods rely on manual field surveys that are used to estimate forest parameters such as biomass, tree mortality rates, species abundances, and species distributions based on sampling plots within the forest [1]. Though standard field survey plots are 1 hectare or less in area, manual sampling is labor-intensive and, therefore, the number of plots inventoried is limited by available people power. Limited sampling ability hampers high-precision estimates of forest parameters at scale.
Since the 1970s, remotely sensed data products have become readily available [2]. Remote-sensing (RS) data products can include optical images such as red, green, and blue (RGB) and hyperspectral (HS), as well as light detection and ranging (LiDAR) point clouds and synthetic aperture radar (SAR) returns. With the help of automation, these data products are used for forest monitoring at scales of tens to thousands of hectares [3], [4].
Recognizing species from RS data products is termed species classification. Accurate species classification is particularly important for measuring species abundances, species distributions, and biodiversity. Metrics that are not species-specific, such as aboveground biomass and basal area, may also be estimated more accurately when species are taken into account [5]. Methods for classifying species based on LiDAR, HS images, RGB images, SAR returns, and almost every combination of these modalities have been developed [6]. Here, we focus on optical imagery.
Early methods of species classification used parametric statistical models such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) or methods like maximum likelihood estimation (MLE) [6]. Most modern methods use decision tree-based classifiers or neural models on LiDAR, RGB, or HS data [6], [7], [8], with studies suggesting that using HS data gives superior performance. Advanced deep models for species classification include CNNs (2-D and 3-D), CNNs with attention, and transformers [9], [10].
Neural models have several drawbacks. Most prominently, they typically require large training sets, can be computationally intensive to train, and have low explainability [11], [12]. Guidelines for training deep neural models suggest thousands of instances of each class for optimal performance [12], [13]. Unfortunately, datasets are built by sampling from the real world, and ecological systems like forests typically contain a few common species and many rare species [14]. Trees that are rare in a forest are likely to be rare within the dataset. This means that neural models for species classification are typically poor at recognizing rare species. Depending on the application, a species' frequency within the dataset may or may not positively correlate with the importance of its recognition to the user. An analysis of challenges inherent in rare species classification can be found in [15].
One approach to reducing dataset size requirements and improving explainability is neuro-symbolics (NS). NS architectures are a combination of neural and symbolic models [16]. Symbolic models use logical formalisms or distance metrics to make inferences. Domain knowledge in symbolic models is usually represented as a rule, an equation, a knowledge base, or a knowledge graph. First-order logic (FOL) and propositional logic are two commonly used formalisms for creating models where inferences are made by reasoning over dataset instances with a set of rules [17], [18].
While neural models are good at learning from labeled examples, the "reasoning" behind their inferences is generally unclear to humans. By comparison, models that make inferences based on symbolic representations of data tend to have higher explainability, but may learn poorly from examples. The idea behind NS is that by combining the two approaches, we can capture the best of both worlds: the high explainability of symbolic models with the learning capacity of neural models. Studies have also shown that NS models are better able to learn in data-constrained settings compared to purely neural models [19]. In this work, we leverage this property to improve the classification of rare species.
The use of NS models for species classification is not new to ecology. Xu et al. [20] combine a convolutional neural network (CNN) with a knowledge graph and text embeddings to classify bird species from RGB images. Sumbul et al. [21] combine a CNN with text embeddings to classify tree species from RGB images. However, the frameworks and methods used by [20] and [21] are not easily applied to other models and require the user to find auxiliary data in the form of text or knowledge graphs to embed for semantic reasoning.
To address these shortcomings, we propose using a modified version of DeepCTRL, an NS framework created by Google that uses a form of semantic regularization [22]. DeepCTRL allows the user to create rules as equations that incorporate domain knowledge into a neural network through its loss function. By incorporating a rule as a term in the loss function, the model is penalized during optimization for both incorrect inferences and inferences that break rules. Therefore, during training, optimum performance occurs when correct inferences are made without breaking the rule. Because the model is forced to follow a known rule, and the degree to which a rule is followed can be estimated from the training loss, the model becomes more explainable. Ideally, the model would be able to learn the rule solely from the training data, but due to noise and other factors, this is not always the case. Our method gives a simple way for users to build NS and thus explainability into their models.

II. DATA
The dataset for our study comes from the Tea Kettle Experimental Forest (TEAK). TEAK is one of 81 sites monitored by the National Ecological Observatory Network (NEON). TEAK is a mixed coniferous forest in the Sierra National Forest east of Fresno, California, at 36°58′ N latitude and 119°1′ W longitude (see [23] and [24] for a full description of its ecological characteristics).
NEON surveys monitored forests from an airborne observation platform that is instrumented with RGB and HS cameras and both discrete and full-waveform LiDAR. Flights occur annually over monitored sites when the ecosystem is in a period of peak greenness. The resolutions of the RGB and HS data products are 0.1 and 1 m, respectively [25].
The dataset we use was curated for [9]. It consists of HS and RGB rasters, along with a coregistered canopy height model (CHM; see [9] for more information on dataset curation). The curated dataset has eight classes: white fir (Abies concolor), red fir (Abies magnifica), incense cedar (Calocedrus decurrens), Jeffrey pine (Pinus jeffreyi), sugar pine (Pinus lambertiana), black oak (Quercus kelloggii), lodgepole pine (Pinus contorta), and "dead." Standing dead trees of any species are assigned the "dead" label. Table I gives the number of trees in each class and its abbreviation.
Using the CHM and a digital elevation model (DEM), we identified differences in the structural traits and topographic preferences of the species within this dataset to be used as the foundation for symbolic rules. The left plot in Fig. 1 shows the distribution of each species' height as represented by the dataset. At this site, black oak (quke) and lodgepole pine (pico) are shorter than the other species in the dataset and distinct from each other in overall height distribution. Therefore, we use maximum crown height from the training data as the foundation for a pair of rules demonstrating how to leverage the structural traits of species (Rules 1 and 2). The right plot in Fig. 1 gives the distribution of each species' elevation range within the dataset. A number of species show distinctive elevational distributions at the site. We chose the minimum elevation for red fir (abma) as the basis for a rule demonstrating how to leverage topographic distribution limits (Rule 3). Finally, we also demonstrate the use of a rule based only on the imagery itself to differentiate between living and dead trees using the green leaf index (GLI; Rule 4) [29].

III. METHODOLOGY
For classification, we use the model from [9], an eight-layer fully-convolutional CNN. The model architecture is shown in Fig. 2.
We combine the Fricker CNN with the DeepCTRL framework. DeepCTRL is a model and data-agnostic NS framework that is easy to use. The framework is composed of a task encoder, a rule encoder, and a decision block [see Fig. 2(b)]. The loss function is a linear combination of task loss and rule loss, where task loss is the loss contributed by the model's failure to predict a label and the rule loss is contributed by the model's failure to follow a rule.
Following the protocol from [9], we create train, validation, and test sets. The dataset is composed of 15 × 15 pixel patches drawn by stratified sampling from the set of tree crowns. Again following the protocol in [9], we use tenfold cross-validation and report the mean of the macro-F1 score across folds and the mean F1 score for the class on which each rule is based.
For our study, we focus on RGB images. While it is possible to apply our approach to HS images, RGB imagery is much more widely available, and the model we used made few mistakes that are correctable with domain knowledge when trained on HS images. We use DeepCTRL as described in [30] with some modifications. Because DeepCTRL is data-agnostic, it can be made to work with any type of input. In our case, the input is a 15 × 15 patch of an RGB image created from the aforementioned NEON GeoTIFFs. We concatenate the image with auxiliary data, a 15 × 15 patch of a coregistered CHM or DEM raster. In the case of the DEM, the raster is scaled by one-tenth so that its values are of the same order as the values of the RGB GeoTIFF.
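As a minimal sketch of the input construction described above, the following pure-Python example appends an auxiliary layer to an RGB patch as a fourth channel. The function name `stack_patch`, the nested-list layout (channel, row, column), and the toy pixel values are assumptions made for illustration; only the 15 × 15 patch size and the one-tenth DEM scaling come from the text.

```python
PATCH = 15  # patch side length in pixels, as stated in the text


def stack_patch(rgb, aux, aux_scale=1.0):
    """Append a scaled auxiliary layer (CHM or DEM) to an RGB patch.

    rgb: list of 3 channels, each a PATCH x PATCH list of rows.
    aux: one PATCH x PATCH layer coregistered with the RGB patch.
    Returns a new list of 4 channels; the inputs are not modified.
    """
    if len(rgb) != 3:
        raise ValueError("expected three RGB channels")
    for layer in list(rgb) + [aux]:
        if len(layer) != PATCH or any(len(row) != PATCH for row in layer):
            raise ValueError("every layer must be PATCH x PATCH")
    scaled = [[value * aux_scale for value in row] for row in aux]
    return [[list(row) for row in channel] for channel in rgb] + [scaled]


# Toy data (hypothetical): constant RGB values and a flat 2100 m elevation.
rgb_patch = [[[c * 10.0] * PATCH for _ in range(PATCH)] for c in range(3)]
dem_patch = [[2100.0] * PATCH for _ in range(PATCH)]

# The DEM is scaled by one-tenth, as in the text; a CHM would use 1.0.
x = stack_patch(rgb_patch, dem_patch, aux_scale=0.1)
```

In a real pipeline the same stacking would normally be done on arrays or tensors; nested lists are used here only to keep the sketch free of dependencies.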
After removing the final output stage, we use the CNN from the Fricker model as both the task and rule encoders. $z_d$ and $z_r$ are the outputs of convolution layer 5 [shown in Fig. 2(a)] from the task and rule encoders, respectively. The decision block is composed of a convolutional layer with an input dimension of 256 and an output dimension of 8. Finally, the output of the decision block is passed through a softmax layer.
In the original design, during training, $z_d$ and $z_r$ are scaled by the constants $\alpha$ and $1 - \alpha$. $\alpha$ is sampled from a beta distribution. This allows the model to learn varying degrees of rule enforcement during training. At inference, the user can vary the value of $\alpha$ depending on the strength of their belief in how much the rule is followed in the test set. We obtained better results by fixing $\alpha$ at 0.4 for both training and inference. By fixing $\alpha$, the model loses its ability to alter how strongly the rule is adhered to after training, but gains in performance. A pseudocode description of the algorithm is given in [30].
We use the following notation. Dataset $D$ consists of tuples of inputs from set $X$ and labels from set $Y$, where $X$ is the set of pixel patches and $Y$ is the set of their species labels: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Each label $y_i$ is an eight-way one-hot encoding. Each model prediction, $\hat{y}$, is a point on the eight-way probability simplex.
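The scaling of the two encodings with a fixed α can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the encodings are flattened to short lists, they are combined by concatenation, the task encoding is the one scaled by α (reading the sentence order literally), and the eight logits are made-up stand-ins for the decision block output. Only α = 0.4 and the final softmax come from the text.

```python
import math

ALPHA = 0.4  # fixed value reported in the text


def fuse(z_d, z_r, alpha=ALPHA):
    """Scale the task and rule encodings, then concatenate them.

    ASSUMPTION: z_d is scaled by alpha and z_r by (1 - alpha), and the
    two scaled encodings are joined by concatenation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * v for v in z_d] + [(1.0 - alpha) * v for v in z_r]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy encodings; the real ones are feature maps from convolution layer 5.
z_d = [1.0, 2.0]
z_r = [3.0, 4.0]
z = fuse(z_d, z_r)

# Eight hypothetical logits standing in for the decision block output.
y_hat = softmax([0.5, 1.5, -1.0, 0.0, 2.0, 0.3, -0.7, 1.1])
```

Fixing `alpha` in `fuse` mirrors the design choice in the text: the trade-off between the two encoders is frozen at training time and can no longer be tuned at inference.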
The loss function is a linear combination of two simpler loss functions, $L_{\text{rule}}$ and $L_{\text{task}}$. $L_{\text{task}}$ is the cross-entropy loss between $y$ and $\hat{y}$:
$$L_{\text{task}} = -\sum_{k=1}^{8} y_k \log \hat{y}_k. \tag{1}$$
We define $L_{\text{rule}}$ as the cross-entropy loss between a function, $\varphi$, and $\hat{y}$, where $x$ is a training instance and
$$\varphi : x \rightarrow u \in (0, 1). \tag{2}$$
In the resulting expression for $L_{\text{rule}}$ (3), $\mathbb{1}(\cdot)$ is an indicator function, $\hat{y}_k$ is the $k$th element of $\hat{y}$, and the superscript $+$ indicates that the $k$th class is predicted. We define $\varphi$ as the composition of two functions. The inner function, $f$ (4), quantifies how much $x$ is in compliance with its respective rule. The outer function, $\sigma$, is a differentiable function that maps $f(x)$ to a value in $[0, 1]$. We use the sigmoid function
$$\sigma(z) = \frac{1}{1 + e^{-z}}. \tag{5}$$
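To make the two-term loss concrete, here is a hedged sketch in plain Python. The task loss is the standard cross-entropy described above. The rule loss is only one plausible reading of the description: it is written as a binary cross-entropy between u = φ(x), treated as the rule's soft target for class k, and the predicted probability of class k. That form, the mixing `weight`, and all example numbers are assumptions.

```python
import math

EPS = 1e-12  # small constant that guards against log(0)


def task_loss(y, y_hat):
    """Cross-entropy between a one-hot label y and a prediction y_hat."""
    return -sum(t * math.log(p + EPS) for t, p in zip(y, y_hat))


def rule_loss(u, y_hat, k):
    """ASSUMED form: binary cross-entropy between u = phi(x) and y_hat[k]."""
    p = y_hat[k]
    return -(u * math.log(p + EPS) + (1.0 - u) * math.log(1.0 - p + EPS))


def total_loss(y, y_hat, u, k, weight=0.5):
    """Linear combination of rule and task losses; `weight` is hypothetical."""
    return weight * rule_loss(u, y_hat, k) + (1.0 - weight) * task_loss(y, y_hat)


# Toy example: the true class is 0 and the rule says class 5 is unlikely.
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
u = 0.01
follows_rule = [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.025, 0.025]
breaks_rule = [0.05, 0.05, 0.05, 0.05, 0.05, 0.70, 0.025, 0.025]

loss_ok = total_loss(y, follows_rule, u, k=5)
loss_bad = total_loss(y, breaks_rule, u, k=5)
```

Under these assumptions, a prediction that is both wrong and in conflict with the rule (`breaks_rule`) receives a larger loss than one that is correct and consistent with it (`follows_rule`).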
For each rule, there is a threshold that we represent as a translation of the sigmoid along the x-axis. Depending on the domain knowledge, the presence or absence of the species of interest may only occur above or below the threshold. The function used for $f$ varies with each rule. $\varphi$ then becomes the composition of $\sigma$ and $f$:
$$\varphi = \sigma \circ f. \tag{6}$$
For rules 1-3, the inner function, $f$, takes the maximum value of the auxiliary data layer. For rule 4, which uses no auxiliary data, $f$ calculates the GLI of $x$ as
$$\text{GLI} = \frac{2G - R - B}{2G + R + B} \tag{7}$$
where $R$, $G$, and $B$ are the pixel values in each RGB channel. The equations for each rule are given in Section IV-A. Rules 1-3 come from examining presence-absence cutoffs in the CHM and DEM distributions. Rule 4 comes from examining errors in the validation set confusion matrix referenced against GLI.
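The composition of a sigmoid with a rule-specific inner function can be sketched as below. The helper names, the `sign` argument that selects whether the function is high above or below the threshold, the unit slope of the sigmoid, and the toy patch are assumptions; the maximum over the auxiliary layer, the GLI formula, and the idea of translating the sigmoid to a threshold follow the description above. The 46 m value is the rule 1 height threshold.

```python
import math


def sigmoid(z):
    """Sigmoid written to avoid overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def max_value(layer):
    """Inner function for rules 1-3: maximum of the auxiliary layer."""
    return max(max(row) for row in layer)


def gli(r, g, b):
    """Green leaf index from red, green, and blue values."""
    denom = 2.0 * g + r + b
    return (2.0 * g - r - b) / denom if denom else 0.0


def phi(f_value, threshold, sign=1.0):
    """Sigmoid translated along the x-axis so it is centered on the threshold.

    ASSUMPTION: sign=+1 gives values near 1 above the threshold and
    sign=-1 gives values near 1 below it; no slope factor is applied.
    """
    return sigmoid(sign * (f_value - threshold))


# Toy CHM patch (hypothetical): mostly 30 m, with one 50 m pixel.
chm = [[30.0] * 15 for _ in range(15)]
chm[7][7] = 50.0

# Rule 1 reading: a crown taller than 46 m is unlikely to be a black oak,
# so the black oak target should be small for this patch.
u_rule1 = phi(max_value(chm), threshold=46.0, sign=-1.0)

# GLI of a toy pixel with a strong green response.
g_index = gli(50.0, 100.0, 50.0)
```

Whether the GLI is evaluated per pixel or on patch-level channel means is not stated in the text, so `gli` is left agnostic and simply takes three numbers.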

A. Experimental Setup
Following the protocol from [9], stratified sampling was used to create ten folds of 15 × 15 pixel patches from the RGB, CHM, and DEM rasters. We created four rules. In natural language, rule 1 states that if the height of a tree crown is over 46 m, it is unlikely to be a black oak; it is written mathematically as (8). Rule 2 states that trees taller than 53.2 m are unlikely to be lodgepole pine and is written mathematically as (9). Rule 3 states that trees growing at an elevation less than 2072 m are unlikely to be red fir and is written mathematically as (10). Rule 4 states that trees with a GLI less than 0.1 are unlikely to be incense cedar and is written mathematically as (11).
For rules 1 and 2, the RGB raster is augmented with the CHM by adding the CHM as a fourth channel. Similarly, for rule 3, the DEM is added as the fourth channel. These channels are also available to the baseline neural model when making comparisons. We use the patch classifier from [9] trained on the RGB image with auxiliary data as a baseline. Both baseline and experimental models are trained for five epochs using the Adam optimizer with L2 regularization and a learning rate of $1 \times 10^{-4}$.
Finally, we perform an ablation study to determine how much each rule contributes to the change in model performance. For each rule, we set a random threshold value for the CHM, DEM, or GLI between the minimum and maximum values present in the training dataset. The randomized values are selected from a uniform distribution. We repeat the ablation study 30 times for each rule and report the average difference between the experimental model trained with the threshold used in its respective rule and the experimental model trained with the randomized threshold.
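The ablation procedure described above can be sketched as a short loop. The function names, the seed, and the `train_and_score` callable are hypothetical; in particular, `toy_score` is a made-up stand-in for training the experimental model and returning its F1 score. The uniform draw between the training minimum and maximum, the 30 repetitions, and the averaging of differences follow the text.

```python
import random
import statistics


def random_threshold(values, rng):
    """Draw a threshold uniformly between the min and max training values."""
    return rng.uniform(min(values), max(values))


def ablation_effect(score_with_rule, train_and_score, values, repeats=30, seed=0):
    """Mean difference between the rule's score and randomized-threshold scores.

    train_and_score: hypothetical callable mapping a threshold to an F1 score.
    """
    rng = random.Random(seed)
    diffs = [
        score_with_rule - train_and_score(random_threshold(values, rng))
        for _ in range(repeats)
    ]
    return statistics.mean(diffs)


# Toy stand-in for training: the score decays with distance from 46 m.
def toy_score(threshold):
    return 0.80 - 0.001 * abs(threshold - 46.0)


heights = [5.0, 20.0, 46.0, 60.0]  # toy crown heights in meters
effect = ablation_effect(toy_score(46.0), toy_score, heights)
```

A positive `effect` indicates that the domain-derived threshold outperforms randomly placed thresholds on average, which is the quantity the ablation study is designed to isolate.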

B. Results and Analysis
Compared to the baseline, the rules had a mostly positive effect on performance. Fig. 3 shows that rules 1 and 2 improved both the overall F1 and the rule's class F1, while rule 3 caused a reduction in the overall F1 but still improved its class F1. Differences are quantified as F1-points, where a 0.01 change in F1 is a change of 1 F1-point. Rule 1 improved overall F1 by 0.63 F1-points. The rule's class F1 was improved by 8.3 F1-points. For rule 2, overall F1 and class F1 improved by 0.43 and 1.84 F1-points, respectively. Rule 3 worsened the overall F1 by 0.97 F1-points, but still increased class F1 by 0.6 F1-points. Rule 4 improved overall F1 by 0.59 F1-points and class F1 by 1.1 F1-points. Fig. 4 shows the changes in the confusion matrices between the baseline model and the experimental models for each rule. The recall columns are normalized by row and the precision columns are normalized by column. For rule 1, both precision and recall are improved. For class 5, black oaks, the precision is improved by 3 points and the recall by 12 points. The rule has the largest negative impact on the precision of class 1, which is reduced by 5 points.
Fig. 3. Change in macro-F1 and the class-specific F1 for each rule. Rule 1 had the biggest impact on performance.
Fig. 4. Change in the confusion matrices for baseline and experimental models normalized by column for precision and row for recall.
Rule 2, which was designed to affect class 6, improves both precision and recall. Class precision improves by 2 points and class recall by 5 points. Rule 2 has the largest negative impact on the precision of class 1, reducing it by 4 points. Rule 2 also has a positive effect on class 5, improving its recall by 11 points.
Rule 3, which was written around class 1, improves class 1's precision and recall by 1 and 3 points, respectively. It has a negative impact on the precision and recall of class 5. This is contrary to rules 1 and 2 which both improve class 5.
Rule 4, which is designed around class 2, improves class precision by 1 point. The overall F1 is improved by 0.59 points, while class F1 improves by 1.1 F1-points. This rule improves the precision of class 5 by 2 points, while reducing class 5's recall by 3 points.
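For reference, the precision, recall, macro-F1, and F1-point quantities used in this analysis can be computed from a confusion matrix as sketched below. The function names and the two-class toy matrices are assumptions for illustration and are unrelated to the reported results; the convention that a 0.01 change in F1 equals 1 F1-point, and the use of column sums for precision and row sums for recall, follow the text.

```python
def precision_recall_f1(cm, k):
    """Per-class metrics from a confusion matrix indexed cm[true][predicted]."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)  # column sum, used for precision
    actual = sum(cm[k])                    # row sum, used for recall
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    return sum(precision_recall_f1(cm, k)[2] for k in range(len(cm))) / len(cm)


def f1_points(f1_new, f1_old):
    """Express a change in F1 in F1-points (0.01 F1 = 1 F1-point)."""
    return 100.0 * (f1_new - f1_old)


# Toy two-class confusion matrices (hypothetical counts).
baseline = [[8, 2], [4, 6]]
with_rule = [[8, 2], [2, 8]]
gain = f1_points(macro_f1(with_rule), macro_f1(baseline))
```

Because macro-F1 weights every class equally, an improvement on a rare class moves the overall score as much as the same improvement on a common class.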
The rarest species was most affected by the inclusion of domain knowledge. We hypothesize that this effect is most profound when rules derived from domain knowledge are applicable to the dataset, but the model, due to noise, data imbalance, or other reasons, is unable to learn the rule from the data alone.
By rarity, the species are ordered 5, 1, 6, and 2, but by ascending base model class F1, they are ordered 1, 5, 6, and 2. Rule 3, which affects class 1, had the largest ratio of rule-correctable incorrect predictions to total predictions, while rule 1 had the second largest. Rule 3, which is designed for the second rarest species with the worst base model performance, is significantly less effective than rule 1, suggesting that the domain knowledge behind rule 1 may be a better differentiator between species than the domain knowledge behind rule 3.
The results of the ablation study are shown in Fig. 5. The results suggest that the influence of domain knowledge is strongest for rule 1, which is likely due to the rarity of black oak in the dataset. Nevertheless, each species for which a rule was created was impacted by the inclusion of domain knowledge. As in [30], the study suggests that there is a slight boost in performance when the model is placed in an NS framework and that this boost is independent of additional domain knowledge.

V. CONCLUSION
In this work, we show that domain knowledge can be encoded through a function and then injected into a species classification neural network. This method is more accessible than other NS frameworks that use formalisms such as FOL, knowledge bases, or text embeddings. Our results show that model performance on rare species can be significantly improved through the inclusion of domain knowledge using our method, which simply applies a slight modification to the original model architecture and adds an additional term to the loss function.