Introduction to Decision Trees
In this post, I will discuss the Decision Tree, the fundamental building block of the well-known tree-based algorithms. This post will also cover two metrics that are used in Decision Trees and in many other machine learning algorithms, even in deep learning.
1. Metric
Why do we need a “metric” in tree-based algorithms?
The answer is: to find the best splits in the tree. At each step, we use the metric to score every candidate split, and we choose the split with the best score. Many different metrics or score functions are used to find the “best” split in different tree-based algorithms. Two of them will be discussed here: Gini Impurity (or Gini Index) and Information Gain with Entropy.
1.1 Gini Index
Suppose we have a dependent categorical variable with $J$ classes. Let $i \in \{1,2,…,J\}$ and let $p_i$ be the probability of the $i$-th factor level of the dependent variable.
Then, the Gini Index, or Gini Impurity, is defined as below,
$$I_{\mathcal{G}}(p) = \sum_{i=1}^{J}p_i \sum_{k \neq i}p_k = \sum_{i=1}^{J}p_i(1-p_i) = \sum_{i=1}^{J}(p_i-p_i^2) = \sum_{i=1}^{J}p_i - \sum_{i=1}^{J}p_i^2 \\ = 1-\sum_{i=1}^{J}p_i^2$$
For example, suppose we have a binary variable with 20 observations, 8 FALSE and 12 TRUE:
$X = [FALSE, FALSE, TRUE, TRUE, … , TRUE, FALSE]$
Then, by the definition, the Gini Impurity will be the following:
$$1 - (\frac{8}{20})^2 - (\frac{12}{20})^2 = 0.48$$
What about the Gini Impurity of a much more lopsided binary variable, say one with 1 observation in the first class and 19 in the second?
Then, the Gini Impurity will be
$$1 - (\frac{1}{20})^2 - (\frac{19}{20})^2 = 0.095$$
Now we see that the smaller the Gini Impurity is, the better the split point is.
Therefore, at each step of finding a split in the tree, we want the smallest Gini Impurity score we can get.
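As a quick check, here is a minimal R sketch of the definition (the function name `gini_impurity` is my own, not from the code linked later) that reproduces the two numbers above:

```r
# Gini Impurity of a categorical vector: 1 - sum(p_i^2)
gini_impurity <- function(x) {
  p <- table(x) / length(x)   # class proportions
  1 - sum(p^2)
}

x1 <- c(rep(FALSE, 8), rep(TRUE, 12))   # 8 FALSE, 12 TRUE
x2 <- c(rep(FALSE, 1), rep(TRUE, 19))   # 1 FALSE, 19 TRUE
gini_impurity(x1)   # 0.48
gini_impurity(x2)   # 0.095
```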
Now, suppose we have two binary variables and make a table of counts for each factor level, like the one below:
$$Y = [FALSE, FALSE, … , TRUE] \\ X = [YES, YES, … , NO]$$

|         | Y = FALSE | Y = TRUE |
|---------|-----------|----------|
| X = YES | 5         | 2        |
| X = NO  | 3         | 10       |

We want to predict $Y$ with $X$.
For this table, the Decision Tree will have only one split. The root node, which is the very top of the tree, will be the predictor $X$.
The tree is something like this: the root node asks whether $X$ is YES or NO, and each branch ends in a leaf node that classifies $Y$.
A typical tree in the real world is much more complicated, with more nodes and splits, since it has many predictors.
A tree with only one split is sometimes called a stump, and stumps are often used as weak learners in AdaBoost.
But I will just call it a tree for this example.
Then, the process of finding the Gini Impurity for this split is as follows:
1) Find the conditional probabilities
The conditional probabilities of $Y$ given $X$ are
$$P(Y = FALSE |X = YES) = \frac{5}{7} \\ P(Y = TRUE | X = YES) = \frac{2}{7} \\ P(Y = FALSE | X = NO) = \frac{3}{13} \\ P(Y = TRUE | X = NO) = \frac{10}{13}$$
2) Find the Gini Index for each set of conditional probabilities
$$I_{Gini}(Y|X=YES) = 1 - (\frac{5}{7})^2 - (\frac{2}{7})^2 = 0.4081633 \\ I_{Gini}(Y|X=NO) = 1 - (\frac{3}{13})^2 - (\frac{10}{13})^2 = 0.3550296 $$
This gives a Gini Impurity score for each leaf node in the tree, that is, the very bottom of the tree where the final classification is made.
3) Find the marginal probabilities
The marginal probabilities of $X$ are
$$P_X(X=YES) = \frac{7}{20} \\ P_X(X=NO) = \frac{13}{20}$$
Since the tree is split by $X$, we can now find the Gini Impurity of the whole split.
4) Multiply each Gini Index by the marginal probability of the corresponding node and add the results. This gives the Gini Impurity score for the split.
Then, the Total Gini Impurity Score for this tree is,
$$I_{Gini}(Y|X=YES) \times P_X(X=YES) + I_{Gini}(Y|X=NO) \times P_X(X=NO) = \\ 0.4081633 \times \frac{7}{20} + 0.3550296 \times \frac{13}{20} = 0.3736264$$
This is how we find the total Gini Impurity score for a split.
In real-world problems we have many predictors and many more nodes, and possibly many trees, as in a Random Forest.
With this process, we compute the Gini score for every candidate split of every predictor, compare them, and choose the split with the smallest score.
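The whole calculation above fits in a few lines of R. This is only a sketch for this specific table, and the helper name `gini_split` is mine:

```r
# weighted Gini Impurity of a split, from a table of counts
# (rows = levels of the predictor X, columns = levels of the target Y)
gini_split <- function(counts) {
  node_gini <- apply(counts, 1, function(r) 1 - sum((r / sum(r))^2))  # Gini per leaf
  node_prob <- rowSums(counts) / sum(counts)                          # marginal P(X)
  sum(node_gini * node_prob)                                          # weighted Gini
}

counts <- rbind("X = YES" = c("FALSE" = 5, "TRUE" = 2),
                "X = NO"  = c("FALSE" = 3, "TRUE" = 10))
gini_split(counts)   # 0.3736264
```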
1.2 Entropy
Entropy is defined as follows:
$$H(Y) = -\sum_{i=1}^{J}p_ilog_2(p_i)$$
where $p_i$ is again the probability of the $i$-th class of $Y$.
The process of finding the best split in a tree is then to find the split with the largest Information Gain.
The Information Gain is closely related to the Kullback-Leibler divergence and is defined as below;
$$IG(Y,X) = H(Y) - H(Y|X)$$
where $IG(Y,X)$ is the Information Gain of $Y$ given $X$, $H(Y)$ is the parent Entropy of $Y$, and $H(Y|X)$ is the children Entropy of $Y$ given $X$, i.e., the conditional Entropy: the weighted average of the Entropies of the child nodes.
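Before going through the example, here is a minimal R sketch of these two definitions, which I will reuse below (the names `entropy` and `info_gain` are mine):

```r
# Shannon Entropy of a categorical vector, in bits
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information Gain of y given a split variable x:
# parent Entropy minus the weighted Entropy of the child nodes
info_gain <- function(y, x) {
  child  <- tapply(y, x, entropy)     # Entropy of y within each level of x
  weight <- table(x) / length(x)      # marginal probabilities of x
  entropy(y) - sum(weight * child)
}
```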
The process of finding the Information Gain is very similar to that for the Gini score, but here we look for the bigger value of IG.
I will use the same example problem as above:
$$Y = [FALSE, FALSE, … , TRUE] \\ X = [YES, YES, … , NO]$$
Then, as before,
1) Find the conditional probabilities given $X$
$$P(Y = FALSE |X = YES) = \frac{5}{7} \\ P(Y = TRUE | X = YES) = \frac{2}{7} \\ P(Y = FALSE | X = NO) = \frac{3}{13} \\ P(Y = TRUE | X = NO) = \frac{10}{13}$$
2) Find the Entropy of each child node given $X$
$$I_{Entropy}(Y|X=YES) = -\frac{5}{7}log_2(\frac{5}{7}) - \frac{2}{7}log_2(\frac{2}{7}) = 0.8631206\\ I_{Entropy}(Y|X=NO) = -\frac{3}{13}log_2(\frac{3}{13}) - \frac{10}{13}log_2(\frac{10}{13}) = 0.7793498 $$
3) Find the marginal probabilities
The marginal probabilities of $X$ are
$$P_X(X=YES) = \frac{7}{20} \\ P_X(X=NO) = \frac{13}{20}$$
4) Calculate the parent Entropy and the conditional Entropy
The parent Entropy of $Y$, before the split, is
$$H(Y) = -\frac{8}{20}log_2(\frac{8}{20}) - \frac{12}{20}log_2(\frac{12}{20}) = 0.9709506$$
The conditional Entropy, i.e., the weighted Entropy of the leaf nodes, is
$$I_{Entropy}(Y|X=YES) \times P_X(X=YES) + I_{Entropy}(Y|X=NO) \times P_X(X=NO) \\ = 0.8631206 \times \frac{7}{20} + 0.7793498 \times \frac{13}{20} \\ = 0.8086696$$
5) Find the Information Gain with the parent Entropy and the conditional Entropy
Information Gain = parent Entropy - conditional Entropy
$$= 0.9709506 - 0.8086696 = 0.162281$$
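Using the `entropy` and `info_gain` sketches from above, these numbers can be reproduced as follows (the data is reconstructed from the counts in the table):

```r
# X = YES: 5 FALSE / 2 TRUE,  X = NO: 3 FALSE / 10 TRUE
y <- c(rep(FALSE, 5), rep(TRUE, 2), rep(FALSE, 3), rep(TRUE, 10))
x <- c(rep("YES", 7), rep("NO", 13))

entropy(y)        # parent Entropy of Y: 0.9709506
info_gain(y, x)   # Information Gain:    0.162281
```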
For another example that shows why a bigger Information Gain is better, suppose we have another dataset of 20 observations in which $X$ splits the data into two groups of 10, and the split separates $Y$ perfectly except for a single observation.
This split is obviously better than the one in the example above, since it makes only one mistake.
If we find the Information Gain of this data,
The parent Entropy of $Y$ here is close to 1, since $Y$ is split almost evenly, 11 to 9:
$$H(Y) = -\frac{11}{20}log_2(\frac{11}{20}) - \frac{9}{20}log_2(\frac{9}{20}) = 0.9927745$$
The conditional Entropy of the leaf nodes is
$$I_{Entropy}(Y|X=YES) \times P_X(X=YES) + I_{Entropy}(Y|X=NO) \times P_X(X=NO) \\ = 0.234478$$
calculated by the same process as above.
Then, the Information Gain is
$$0.9927745 - 0.234478 = 0.7582965$$
Therefore, at each step we look for the split with the biggest Information Gain we can find; the second split scores much higher than the first, as expected.
In summary, the Decision Tree algorithm is:
1) Compute the metric score for every candidate split of every predictor.
2) Compare the metric scores across all splits and all predictors and find the best split of the best predictor (see the sketch after this list).
3) The predictor with the best metric score becomes the root node, and we keep building the tree by repeating the same scoring at each node.
4) The very bottom of the tree, the leaf nodes, gives our final classification.
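As a rough illustration of steps 1) and 2), here is a minimal sketch in R that scores every categorical predictor of a data frame by its weighted Gini Impurity and returns the best one. All names here are mine; this is not the full algorithm from my GitHub code:

```r
# weighted Gini Impurity of splitting target y by a categorical predictor x
gini_score <- function(y, x) {
  node_gini <- tapply(y, x, function(g) 1 - sum((table(g) / length(g))^2))
  weight    <- table(x) / length(x)
  sum(weight * node_gini)
}

# score every candidate predictor and return the one with the smallest Gini
best_predictor <- function(data, target, predictors) {
  scores <- sapply(predictors, function(p) gini_score(data[[target]], data[[p]]))
  scores[which.min(scores)]   # named value: best predictor and its score
}
```

For example, `best_predictor(df, "Y", c("X1", "X2"))` would return the name and weighted Gini of the better of two hypothetical predictors `X1` and `X2` in a data frame `df`.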
If we have continuous predictors, there is no natural split point like “YES” or “NO”. Therefore, we
1) sort the data by the numerical predictor,
2) calculate the average of every pair of adjacent values of the sorted predictor,
3) find the Gini Impurity of the split at each of these adjacent averages,
4) and compare the Gini Impurities to find the best split point among the adjacent averages (see the sketch below).
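Here is a minimal sketch of that procedure, reusing the `gini_score` helper from the previous block (the name `best_numeric_split` is mine):

```r
# best threshold for a numeric predictor x, chosen by weighted Gini Impurity
best_numeric_split <- function(y, x) {
  x_sorted   <- sort(unique(x))
  thresholds <- (head(x_sorted, -1) + tail(x_sorted, -1)) / 2   # adjacent averages
  scores     <- sapply(thresholds, function(t) gini_score(y, x <= t))
  c(threshold = thresholds[which.min(scores)], gini = min(scores))
}
```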
Below is the implementation in R. I used the iris dataset and a dataset that I created.
Some of the code is skipped, and you can refer to my GitHub website to see the full algorithm.
First, I remove one of the classes of the target variable, virginica, from the iris dataset.
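Since that code is skipped here, the following is only a minimal sketch of the step, assuming the rpart package for the built-in tree (the original code may use a different package):

```r
library(rpart)

# keep only two classes: drop virginica and its unused factor level
iris2 <- droplevels(subset(iris, Species != "virginica"))

# fit a classification tree with the built-in CART implementation
fit  <- rpart(Species ~ ., data = iris2, method = "class")
pred <- predict(fit, iris2, type = "class")
table(pred, iris2$Species)   # confusion table of the predictions
```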
This is the prediction by the Decision Tree built in R.
Here is the Decision Tree algorithm that I built with the two metrics introduced above.
It starts with an average function that creates the adjacent averages between predictor indexes.
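The function itself is skipped, but a minimal sketch of such a helper could look like this (the name `adjacent_average` is mine and may not match the repository):

```r
# adjacent averages of a numeric predictor: the candidate split points
adjacent_average <- function(x) {
  x <- sort(unique(x))
  (head(x, -1) + tail(x, -1)) / 2
}

adjacent_average(c(3, 1, 2, 5))   # 1.5 2.5 4.0
```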
Using the Entropy metric, we want to find the split with the maximum score, that is, the maximum Information Gain.
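The body of that part is skipped as well; here is a rough sketch of such a search, reusing the `info_gain` and `adjacent_average` helpers sketched above (all names are mine):

```r
# best split of a numeric predictor x for target y, chosen by Information Gain
best_entropy_split <- function(y, x) {
  thresholds <- adjacent_average(x)
  gains      <- sapply(thresholds, function(t) info_gain(y, x <= t))
  c(threshold = thresholds[which.max(gains)], info_gain = max(gains))
}
```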