Decision Tree Classification: Program

Q. Decision Tree Classification

Problem Description

As you know, a decision tree is built by splitting nodes level by level, trying to classify as accurately as possible.

You are given a feature (1-d array) and a label (1-d array, the target), and you must determine which value of the feature is best to split on at the root level when building a decision tree.

The feature takes continuous values, whereas the target is binary. The main task is to determine which value/threshold is best to split on for this classification task, taking entropy as the loss and maximizing information gain.
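For reference, the standard definitions apply here. For a set of labels S with class proportions p_c, the entropy is

H(S) = - sum_c p_c * log2(p_c)

and for a split of a parent node into a left and a right child, the information gain is

IG = H(parent) - (n_left / n) * H(left) - (n_right / n) * H(right)

where n is the number of parent samples and n_left, n_right are the child sizes.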

Input Format

Two inputs:

1. 1-d array of feature

2. 1-d array of label

Output Format

Return the threshold value

Example Input

feature: [0.58 0.9  0.45 0.18 0.5  0.12 0.31 0.09 0.24 0.83]

label: [1 0 0 0 0 0 1 0 1 1]

Example Output

0.18

Example Explanation

If you calculate the information gain for every feature value, the (threshold, information gain) pairs come out as:

(0.09, 0.08), (0.12, 0.17), (0.18, 0.28), (0.24, 0.05), (0.31, 0.00), (0.45, 0.02), (0.5, 0.09), (0.58, 0.01), (0.83, 0.08), (0.9, 0.08)

The information gain is maximized at the threshold 0.18, so that value is the required answer.
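As a quick check for the 0.18 threshold: the left child labels[features <= 0.18] = [0, 0, 0] has entropy 0, and the right child [1, 0, 0, 0, 1, 1, 1] has entropy -(4/7)*log2(4/7) - (3/7)*log2(3/7) ≈ 0.985. The parent (4 ones, 6 zeros) has entropy ≈ 0.971, so the gain is 0.971 - (0.3 * 0 + 0.7 * 0.985) ≈ 0.28, matching the table above.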

 

Program:

import numpy as np

def entropy(s):
    '''
    Calculates the entropy of a list of binary target variables
    '''
    # Write your code here
   
    # Calculate entropy
    entropy = 0
   
   
    #Your code ends here
   
    return -entropy
   

def information_gain(parent, left_child, right_child):
   
    '''
    Compute the information gain given the left child target variables (list), the right child target variables (list), and their parent targets (list)
    '''
   
    info_gain=None
    # Write your code here
   
   
    #Your code ends here
    return info_gain
   
def best_split(features,labels):
    '''
    inputs:
        features: nd-array
        labels: nd-array
    output:
        float value determining best threshold for decision tree classification
    '''
   
    best_threshold=None
    best_info_gain = -1
   
    # For every unique value of that feature
    for threshold in np.unique(features):
       
        y_left = _____________  #list of labels in left child
        y_right = _____________  #list of labels in right child
       
        if len(y_left) > 0 and len(y_right) > 0:
            gain = ____________                 # Calculate the information gain and save the split parameters if the current split is better than the previous best

            if gain > best_info_gain:
                best_threshold = threshold
                best_info_gain = gain
   
    return best_threshold




Final Program:


import numpy as np


def entropy(s):
    '''
    Calculates the entropy of a list of binary target variables
    '''
    # Write your code here
    counts = np.bincount(s)
    probabilities = counts / len(s)
    # Calculate entropy
    entropy = 0
    for p in probabilities:
        if p > 0:
            entropy -= p * np.log2(p)
   
    #Your code ends here
   
    return entropy
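
# Note: np.bincount assumes non-negative integer labels, which holds here
# because the target is binary. A minimal vectorized alternative sketch
# (entropy_vec is an illustrative name, not part of the required solution):
def entropy_vec(s):
    # np.unique drops classes that never occur, so every probability is > 0
    _, counts = np.unique(s, return_counts=True)
    probabilities = counts / len(s)
    return -np.sum(probabilities * np.log2(probabilities))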
   

def information_gain(parent, left_child, right_child):
   
    '''
    Compute the information gain given the left child target variables (list), the right child target variables (list), and their parent targets (list)
    '''
   
    info_gain=None
    # Write your code here
    parent_entropy = entropy(parent)
    left_entropy = entropy(left_child)
    right_entropy = entropy(right_child)
   
    # Weighted average of the child entropies
    left_weight = len(left_child) / len(parent)
    right_weight = len(right_child) / len(parent)
   
    weighted_entropy = left_weight * left_entropy + right_weight * right_entropy
   
    # Information gain is the difference in entropy
    info_gain = parent_entropy - weighted_entropy
   
    #Your code ends here
    return info_gain
   
def best_split(features,labels):
    '''
    inputs:
        features: nd-array
        labels: nd-array
    output:
        float value determining best threshold for decision tree classification
    '''
   
    best_threshold=None
    best_info_gain = -1
   
    # For every unique value of that feature
    for threshold in np.unique(features):
       
        y_left = labels[features <= threshold]  #list of labels in left child
        y_right = labels[features > threshold]  #list of labels in right child
       
        if len(y_left) > 0 and len(y_right) > 0:
            gain = information_gain(labels, y_left, y_right)  # Calculate the information gain and save the split parameters if the current split is better than the previous best

            if gain > best_info_gain:
                best_threshold = threshold
                best_info_gain = gain
   
    return best_threshold
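
A quick sanity check on the example input (assuming the three functions above are defined in the same script):

feature = np.array([0.58, 0.9, 0.45, 0.18, 0.5, 0.12, 0.31, 0.09, 0.24, 0.83])
label = np.array([1, 0, 0, 0, 0, 0, 1, 0, 1, 1])

print(best_split(feature, label))  # prints 0.18, matching the example output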
