Initialize: Set appropriate values for the minimum purity Pmin and the minimum number of samples Smin that a node must have in order to be split.
Input: Training dataset D = {(x1, y1), (x2, y2), …, (xn, yn)}, where each xi is a feature vector of size m and each yi is a real-valued target variable.
Construct a root node N with the training dataset D.
For each node N:
Calculate the purity of the target variable in node N.
If the purity of node N is less than Pmin, or the number of instances in D is below Smin, mark N as a leaf node and return the mean of the target values in D as the prediction.
Otherwise, for each feature i, calculate the purity loss obtained by splitting the instances in D according to the values of feature i. Select the feature with the highest purity loss as the splitting criterion for node N.
Split the instances in D into two subsets, Dleft and Dright, based on the selected feature and splitting value.
Create two child nodes for N: Nleft and Nright.
Recursively apply the above steps to each child node, using the
corresponding subset of instances.
Return the decision tree T.
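The procedure above can be sketched in Python. The source does not reproduce its purity formula, so as an assumption this sketch uses the variance of the target values as the purity measure and the drop in weighted variance as the purity loss (a common choice for regression trees); the threshold values P_MIN and S_MIN are likewise placeholders.

```python
import numpy as np

P_MIN = 0.01  # assumed value for Pmin (minimum purity required to split)
S_MIN = 2     # assumed value for Smin (minimum samples required to split)

def purity(y):
    # Assumption: purity is measured as the variance of the targets.
    return float(np.var(y))

def purity_loss(y, y_left, y_right):
    # Reduction in weighted variance achieved by a candidate split.
    n = len(y)
    return (purity(y)
            - (len(y_left) / n) * purity(y_left)
            - (len(y_right) / n) * purity(y_right))

def best_split(X, y):
    # Scan every feature and candidate threshold; return the split
    # with the highest purity loss, or None if no valid split exists.
    best = None
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[:-1]:
            mask = X[:, i] <= t
            loss = purity_loss(y, y[mask], y[~mask])
            if best is None or loss > best[0]:
                best = (loss, i, float(t), mask)
    return best

def build_tree(X, y):
    # Leaf condition from the algorithm: node is pure enough or too small.
    if purity(y) < P_MIN or len(y) < S_MIN:
        return {"leaf": True, "value": float(np.mean(y))}
    split = best_split(X, y)
    if split is None:  # all feature values identical: cannot split further
        return {"leaf": True, "value": float(np.mean(y))}
    _, i, t, mask = split
    return {"leaf": False, "feature": i, "threshold": t,
            "left": build_tree(X[mask], y[mask]),
            "right": build_tree(X[~mask], y[~mask])}

def predict(tree, x):
    # Walk from the root to a leaf and return its mean target value.
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```

For example, fitting the sketch to X = [[0], [1], [2], [3]] with targets y = [0, 0, 10, 10] splits the root at x ≤ 1 and yields leaf predictions of 0 and 10.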