DecisionTree

Introduction to Decision Tree

In this post, I will discuss the Decision Tree, which is the fundamental algorithm behind the well-known tree-based algorithms. This post will also cover two metrics that are employed in Decision Trees and in many other machine learning algorithms, even in deep learning.

1. Metric

Why do we need a “metric” in tree-based algorithms?

The answer is: to find the best splits in the tree. We use a metric to calculate a value for each candidate split, measure the quality of the split with that value, and choose the best split at each step of growing the tree. Many different metrics or score functions are used to find the “best” split in different tree-based algorithms. Two metrics will be discussed here: Gini Impurity (or Gini Index) and Information Gain with Entropy.

1.1 Gini Index

Suppose we have a dependent categorical variable, which has $J$ classes. Let $i \in \{1,2,\dots,J\}$, and let $p_i$ be the proportion of observations that belong to class $i$, i.e. the probability of each factor level of the dependent variable.

Then, the Gini Index, or Gini Impurity is defined as below,

$$I_{\mathcal{G}}(p) = \sum_{i=1}^{J}p_i \sum_{k \neq i}p_k = \sum_{i=1}^{J}p_i(1-p_i) = \sum_{i=1}^{J}(p_i-p_i^2) = \sum_{i=1}^{J}p_i - \sum_{i=1}^{J}p_i^2 \\ = 1-\sum_{i=1}^{J}p_i^2$$

For example,

suppose we have a binary variable that follows below.

$X = [FALSE,FALSE,TRUE,TRUE,….,TRUE,FALSE]$

Then, the Gini Impurity will be the following, by the definition:

$$1 - (\frac{8}{20})^2 - (\frac{12}{20})^2 = 0.48$$

What about the Gini Impurity of a binary variable with 1 observation of one class and 19 of the other?

Then, the Gini Impurity will be

$$1 - (\frac{1}{20})^2 - (\frac{19}{20})^2 = 0.095$$

Now we see that the smaller the Gini Impurity is, the better the split point we have found.

Therefore, at each step of finding the best split in the tree, we want the smallest Gini Impurity score we can get.
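To make this concrete, here is a minimal R sketch (separate from the implementation later in this post; the helper name gini_impurity is just for illustration) that computes the Gini Impurity of a single categorical vector:

#Gini impurity of a categorical vector: 1 minus the sum of squared class proportions
gini_impurity <- function(x){
  p <- table(x) / length(x) #class proportions
  1 - sum(p^2)
}

x1 <- c(rep(FALSE, 8), rep(TRUE, 12))
gini_impurity(x1) #0.48
x2 <- c(rep(FALSE, 1), rep(TRUE, 19))
gini_impurity(x2) #0.095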

Now, suppose we have two binary variables of 20 observations and we want to look at their factor levels together, like below,

$$Y = [FALSE, FALSE, … , TRUE] \\ X = [YES, YES, … , NO]$$

And we want to predict $Y$ with $X$. Counting the factor levels gives the following contingency table:

            X = YES   X = NO
Y = FALSE         5        3
Y = TRUE          2       10

We will have only one stump in the Decision Tree for this table. The root node, which is the very top of the tree, will be the predictor $X$.

The tree looks something like this: a single split on $X$ at the root, with one leaf node for $X = YES$ and one leaf node for $X = NO$.

The typical tree in the real world is much more complicated, with more nodes and splits, since there are many predictors.
A tree with only one split is sometimes called a stump, and stumps are often used as weak learners in AdaBoost.
But I will just call it a tree for this example.

Then, the process of finding the Gini Impurity is as follows:

1) Find the conditional probabilities

The conditional probabilities of $Y$ given $X$ are

$$P(Y = FALSE |X = YES) = \frac{5}{7} \\ P(Y = TRUE | X = YES) = \frac{2}{7} \\ P(Y = FALSE | X = NO) = \frac{3}{13} \\ P(Y = TRUE | X = NO) = \frac{10}{13}$$
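As a quick sketch (not part of the implementation below), the same conditional probabilities can be reproduced in R from the contingency table; tab here is just the table above entered by hand:

#the contingency table above: rows are Y, columns are X
tab <- matrix(c(5, 2, 3, 10), nrow = 2,
              dimnames = list(Y = c("FALSE", "TRUE"), X = c("YES", "NO")))

#conditional probabilities P(Y | X): divide each column by its column sum
prop.table(tab, margin = 2) #columns give 5/7, 2/7 and 3/13, 10/13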

2) Find the Gini Index for each conditional distribution

$$I_{Gini}(Y|X=YES) = 1 - (\frac{5}{7})^2 - (\frac{2}{7})^2 = 0.4081633 \\ I_{Gini}(Y|X=NO) = 1 - (\frac{3}{13})^2 - (\frac{10}{13})^2 = 0.3550296 $$

These are the Gini Impurity scores for the leaf nodes of this tree, which are the very bottom of the tree, i.e. the final classification.

3) Find the marginal probabilities

The marginal probabilities of $X$ are

$$P_X(X=YES) = \frac{7}{20} \\ P_X(X=NO) = \frac{13}{20}$$

Since the tree is split by $X$, we then find the Gini Impurity of the whole tree.

4) Multiply each Gini Index by the marginal probability of the corresponding branch and add the results. This gives the Gini Impurity score for the split.

Then, the Total Gini Impurity Score for this tree is,

$$I_{Gini}(Y|X=YES) \times P_X(X=YES) + I_{Gini}(Y|X=NO) \times P_X(X=NO) = \\ 0.4081633 \times \frac{7}{20} + 0.3550296 \times \frac{13}{20} = 0.3736264$$

We have now found the total Gini Impurity score for this tree.
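As a numeric check, here is a short sketch reusing the table tab entered above (again, separate from the implementation below):

gini_leaf <- function(p) 1 - sum(p^2)

p_x <- colSums(tab) / sum(tab) #marginal probabilities of X: 7/20 and 13/20
leaf_gini <- apply(tab, 2, function(col) gini_leaf(col / sum(col))) #Gini of each leaf
sum(p_x * leaf_gini) #0.3736264, the total Gini Impurity of the split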

In real-world problems, we have many predictors and many nodes, and possibly many trees, as in a Random Forest.

With this process of computing the Gini Impurity, we compare the Gini scores of the candidate splits and find the best split point of the best predictor.

1.2 Entropy

Entropy is defined as below;

$$H(T) = -\sum_{i=1}^{J}p_i\log_2(p_i)$$

And the process of finding the best split in a tree is to find the Information Gain.

The Information Gain is often called the Kullback-Leibler divergence, and it is defined as below;

$$IG(Y,X) = H(Y) - H(Y|X)$$

where $IG(Y,X)$ is the Information Gain of $Y$ given $X$, $H(Y)$ is the parent entropy of $Y$, and $H(Y|X)$ is the children entropy of $Y$ given $X$, i.e. the entropy of $Y$ within each child node weighted by the size of that node; I will refer to $H(Y|X)$ as the cross-entropy in this post.

The process of finding the Information Gain is very similar to that of the Gini score, but here we look for the bigger value of IG.

I will use the same example problem as above,

$$Y = [FALSE, FALSE, … , TRUE] \\ X = [YES, YES, … , NO]$$

Then, as before, we start by finding the conditional probability.

1) Find the conditional probability given $X$

$$P(Y = FALSE |X = YES) = \frac{5}{7} \\ P(Y = TRUE | X = YES) = \frac{2}{7} \\ P(Y = FALSE | X = NO) = \frac{3}{13} \\ P(Y = TRUE | X = NO) = \frac{10}{13}$$

2) Find the entropy of $Y$ given each value of $X$

$$I_{Entropy}(Y|X=YES) = -\frac{5}{7}\log_2\left(\frac{5}{7}\right) - \frac{2}{7}\log_2\left(\frac{2}{7}\right) = 0.8631206 \\ I_{Entropy}(Y|X=NO) = -\frac{3}{13}\log_2\left(\frac{3}{13}\right) - \frac{10}{13}\log_2\left(\frac{10}{13}\right) = 0.7793498$$

3) Find the marginal probability

The marginal probabilities of $X$ are

$$P_X(X=YES) = \frac{7}{20} \\ P_X(X=NO) = \frac{13}{20}$$

4) Calculate the Total Entropy and Cross-Entropy

The Total Entropy for the tree split by $X$ is

$$-P_X(X=YES) \times \log_2(P_X(X=YES))-P_X(X=NO) \times \log_2(P_X(X=NO)) \\ = -\frac{7}{20}\log_2\left(\frac{7}{20}\right) - \frac{13}{20}\log_2\left(\frac{13}{20}\right) \\ = 0.9340681$$

The Cross Entropy, weighted over the leaf nodes, is

$$I_{Entropy}(Y|X=YES) \times P_X(X=YES) + I_{Entropy}(Y|X=NO) \times P_X(X=NO) \\ = 0.8631206 \times \frac{7}{20} + 0.7793498 \times \frac{13}{20} \\ = 0.8086696$$

5) Find Information Gain with Total Entropy and Cross Entropy

Information Gain = Total Entropy - Cross Entropy

$$= 0.9340681 - 0.8086696 = 0.1253985$$
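The same numbers can be checked with a short sketch (again reusing tab from above, and following this post's convention that the Total Entropy is the entropy of the marginal distribution of $X$):

entropy_vec <- function(p) -sum(ifelse(p > 0, p * log2(p), 0)) #treats 0*log2(0) as 0

p_x <- colSums(tab) / sum(tab) #P(X = YES), P(X = NO)
leaf_entropy <- apply(tab, 2, function(col) entropy_vec(col / sum(col)))

total_entropy <- entropy_vec(p_x) #0.9340681
cross_entropy <- sum(p_x * leaf_entropy) #0.8086696
total_entropy - cross_entropy #0.1253985, the Information Gain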

For another example that shows why a bigger Information Gain is better,

suppose we have another dataset of 20 observations in which $X$ splits the data into two equal groups: when $X = NO$, all 10 observations of $Y$ are $TRUE$, and when $X = YES$, 9 observations are $FALSE$ and only 1 is $TRUE$.

It is obviously a better split of $Y$ by $X$ than in the above example, since this split makes only one mistake.

If we find the Information Gain of this data,

The Total Entropy for the tree split by $X$ is 1, since $X$ splits the observations into two equal groups of 10 and 10.

$$-P_X(X=YES) \times \log_2(P_X(X=YES))-P_X(X=NO) \times \log_2(P_X(X=NO)) \\ = -\frac{10}{20}\log_2\left(\frac{10}{20}\right) - \frac{10}{20}\log_2\left(\frac{10}{20}\right) \\ = 0.5 + 0.5 = 1$$

The Cross Entropy, weighted over the leaf nodes, is

$$I_{Entropy}(Y|X=YES) \times P_X(X=YES) + I_{Entropy}(Y|X=NO) \times P_X(X=NO) \\ = 0.4689956 \times \frac{10}{20} + 0 \times \frac{10}{20} \\ = 0.2344978$$

calculated by the same process as above.

Then, the Information Gain is

$$1 - 0.2344978 = 0.7655022$$
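And a check of this second example (a sketch reusing entropy_vec from above):

#X = YES: 9 FALSE and 1 TRUE;  X = NO: 0 FALSE and 10 TRUE
tab2 <- matrix(c(9, 1, 0, 10), nrow = 2,
               dimnames = list(Y = c("FALSE", "TRUE"), X = c("YES", "NO")))

p_x2 <- colSums(tab2) / sum(tab2) #0.5 and 0.5
leaf_entropy2 <- apply(tab2, 2, function(col) entropy_vec(col / sum(col)))
entropy_vec(p_x2) - sum(p_x2 * leaf_entropy2) #0.7655022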

Therefore, we look for the biggest Information Gain we can find at each split.

In summary, the Decision Tree algorithm is:

1) Find the metric score for each candidate split of each predictor.

2) Compare the metric scores across all splits and all predictors and find the best split of the best predictor.

3) The predictor with the best metric score becomes the root node, and we keep building the tree with the metric scores.

4) The very bottom of the tree, the leaf nodes, gives our final classification.

If we have a continuous predictor, there is no natural split point like “YES” or “NO”. Therefore,

1) we sort the data by the numerical predictor,

2) calculate the average of every pair of adjacent values of the sorted predictor,

3) find the Gini Impurity for each of these adjacent averages as a candidate split point,

4) and compare the Gini Impurity values to find the best split point among the adjacent averages (see the sketch below).
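Here is a minimal sketch of that procedure for a single numeric predictor (the helper name best_split_gini is hypothetical and separate from the full implementation below):

#best split point of one numeric predictor by Gini, using adjacent midpoints
best_split_gini <- function(x, y){
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2 #averages of adjacent sorted values
  gini_of <- function(v){ p <- table(v) / length(v); 1 - sum(p^2) }
  score <- sapply(candidates, function(thr){
    left <- y[x < thr]; right <- y[x >= thr]
    length(left) / length(y) * gini_of(left) + length(right) / length(y) * gini_of(right)
  })
  candidates[which.min(score)] #the threshold with the smallest weighted Gini
}

#e.g. on the binary iris subset used below: best_split_gini(iris1$Petal.Width, iris1$Species) gives 0.8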

Below is the implementation in R. I used the iris dataset, and a dataset that I created.
Some of the code is skipped; you can refer to my GitHub website to see the full algorithm.

DavidGithub

#libraries used in this post
library(rpart)
library(rpart.plot)

#removing one of the classes of the target variable, virginica, in the iris dataset
#to make the target a binary variable
iris1 <- iris[which(iris$Species != "virginica"),]

#removing the factor level that no longer has any observations
iris1$Species <- as.factor(as.character(iris1$Species))

#quick decision tree using the rpart package
#('training' is a training subset of iris1; the train/test split code is skipped here)
tr <- rpart(Species ~ ., data = training)
rpart.plot(tr)

This is the prediction by the Decision Tree built with rpart in R.

Here is the Decision Tree algorithm that I built with the two metrics introduced above.

#average function to create adjacent average between predictor indexes
avg <- function(x1,x2){sum(x1,x2)/2}

#Gini impurity for one candidate split
#x is a 2x2 contingency table: rows = target levels, columns = split condition (FALSE/TRUE)
gini <- function(x){

if(any(dim(x) < 2)){
#degenerate split: every observation falls on one side, so return the worst score
return(1)
}
else{
#conditional probability given predictor (row : target, column : predictor)
p11<-x[1,1]/sum(x[,1])
p21<-x[2,1]/sum(x[,1])
p12<-x[1,2]/sum(x[,2])
p22<-x[2,2]/sum(x[,2])

a.false.gini <- 1-p11^2-p21^2
a.true.gini <- 1-p12^2-p22^2

#marginal probability of each side of the split (column sums)
a.false.prob <- sum(x[,1]) / sum(x)
a.true.prob <- sum(x[,2]) / sum(x)

gini.imp <- a.false.prob * a.false.gini + a.true.prob * a.true.gini
return(gini.imp)
}
}


#log function with base 2 for entropy
log2 <- function(x){
if(x!=0){
return(log(x,base=2))
}
if(x==0){
return(0)
}
}


#entropy/Information Gain for one candidate split
#x is a 2x2 contingency table: rows = target levels, columns = split condition (FALSE/TRUE)
entropy <- function(x){

if(any(dim(x) < 2)){
#degenerate split: every observation falls on one side, so there is no Information Gain
return(0)
}
else{
#conditional probability given predictor (row : target, column : predictor)
p11<-x[1,1]/sum(x[,1])
p21<-x[2,1]/sum(x[,1])
p12<-x[1,2]/sum(x[,2])
p22<-x[2,2]/sum(x[,2])

#entropy of the target within each side of the split (the leaf nodes at the bottom of the tree)
a.false.entropy <- -(p11*log2(p11)+p21*log2(p21))
a.true.entropy <- -(p12*log2(p12)+p22*log2(p22))

#marginal probability of each side of the split (column sums)
a.false.prob <- (x[1,1]+x[2,1]) / sum(x)
a.true.prob <- (x[1,2]+x[2,2]) / sum(x)

#cross entropy
weighted.entropy <- a.true.prob*a.true.entropy + a.false.prob*a.false.entropy

#total entropy
total.entropy <- -(a.false.prob*log2(a.false.prob) + a.true.prob*log2(a.true.prob))

#Information Gain, the score used to find the best split
#the bigger this value is, the better the split (maximum value is 1 for a binary target)
IG <- total.entropy - weighted.entropy

return(IG)
}
}


#Calculate the impurity (Gini) or Information Gain (entropy) for every candidate split point of one numeric predictor x
var.impurity <- function(x, dat, fun){
imp.dat <- data.frame(matrix(0, nrow=nrow(dat)-1, ncol=3))
colnames(imp.dat) <- c("index", "impurity", "adj.avg")


for(i in 1:(nrow(dat)-1)){
imp.dat[i,1] <- paste0("between ", i, " and ", i+1)
#average value of the adjacent values
a <- avg(x[i], x[i+1])

predictor.name <- colnames(dat)[which(sapply(dat, function(x,want) isTRUE(all.equal(x,want)),x)==TRUE)]

#2x2 table of the candidate split: rows = target levels, columns = split condition (predictor < a)
mat <- as.matrix(table(dat[,target], dat[,predictor.name] < a))

#apply the metric, Gini or Entropy
imp.dat[i,2] <- fun(mat)
imp.dat[i,3] <- a
}
return(imp.dat)
}


#this function gives the best split score for each predictor
impurity.fun <- function(dat, fun){
predictors <- colnames(dat)[!colnames(dat) %in% target]
var.impur.dat <- data.frame(matrix(0, nrow=length(predictors),ncol=2))
colnames(var.impur.dat) <- c("var", "impurity")

for(i in 1:(ncol(dat)-1)){ #assumes the target is the last column of dat
var.impur.dat[i,1] <- predictors[i]
if(fun == "gini"){
#the smallest Gini impurity is the best split
var.impur.dat[i,2] <- min(var.impurity(dat[,i], dat, gini)$impurity)
}
if(fun == "entropy"){
#the greatest Information Gain is the best split
var.impur.dat[i,2] <- max(var.impurity(dat[,i], dat, entropy)$impurity)
}
}
return(var.impur.dat)
}


#gives the best predictor to split on, which becomes the top of the tree
topTree.predictor <- function(x,fun){
if(fun == "entropy"){
return(which.max(impurity.fun(x, "entropy")[,2]))
}
if(fun == "gini"){
return(which.min(impurity.fun(x, "gini")[,2]))
}
}

#The best split point associated with the best predictor
impurityOfbest <- function(dat, best.pred, fun){
if(fun == "entropy"){
impurity.pred <- var.impurity(dat[,best.pred], dat, entropy)$adj.avg[which.max(var.impurity(dat[,best.pred], dat, entropy)$impurity)]
}
if(fun == "gini"){
impurity.pred <- var.impurity(dat[,best.pred], dat, gini)$adj.avg[which.min(var.impurity(dat[,best.pred], dat, gini)$impurity)]
}

print(paste0("Best predictor (top tree node) is ", colnames(dat)[best.pred], ", with best split at ", impurity.pred, ", by the metric ", fun))
return(impurity.pred)
}
#libraries for the pipe and the plots below
library(dplyr)
library(ggplot2)
library(gridExtra)

#by the Entropy metric, we want to find the split with the maximum Information Gain
fun <- "entropy"
target <- "Species"

imp.pred <- topTree.predictor(iris1, fun) #This is the best predictor that splits our target the best.
best.impur <- impurityOfbest(iris1, imp.pred, fun) #This is the best split point for the best predictor.


#Let's see how well the split value found by the function predicts
table(iris1[,imp.pred] < best.impur, iris1$Species)
#perfectly predicted in training set


pred.by <- as.factor(ifelse(iris1[,imp.pred] < best.impur, "setosa","versicolor"))

t1 <- iris1 %>%
ggplot(aes(x=Petal.Width, y=Petal.Length ,col=Species)) +
geom_jitter() +
ggtitle("Actual")
t2 <- iris1 %>%
ggplot(aes(x=Petal.Width, y=Petal.Length ,col=pred.by)) +
geom_jitter() +
geom_vline(xintercept = best.impur, colour="blue", linetype="dashed") +
annotate(geom="text", label=best.impur, x=best.impur, y=0, vjust=-1) +
ggtitle("Predicted")

grid.arrange(t1,t2)