Q. Decision Tree Classification
Problem Description
A decision tree works by splitting nodes at successive levels, choosing splits that classify the data as accurately as possible.
You are given a feature (1-d array) and a label (1-d array, the target). Your task is to determine which value of the feature is the best one to split on at the root level when building a decision tree.
The feature holds continuous values, whereas the target is binary. So the main task is to determine the value/threshold that is best to split on for this classification task, taking entropy as the loss and maximizing Information Gain.
Input Format
Two inputs:
1. 1-d array of feature
2. 1-d array of label
Output Format
Return the best threshold value (float)
Example Input
feature: [0.58 0.9 0.45 0.18 0.5 0.12 0.31 0.09 0.24 0.83]
label: [1 0 0 0 0 0 1 0 1 1]
Example Output
0.18
Example Explanation
If you calculate the Information Gain for all of the feature values, the (threshold, Information Gain) pairs come out as:
(0.09, 0.08) (0.12, 0.17) (0.18, 0.28) (0.24, 0.05) (0.31, 0.00) (0.45, 0.02) (0.5, 0.09) (0.58, 0.01) (0.83, 0.08) (0.9, 0.08)
Here the Information Gain is maximum at the 0.18 threshold, so that value is the required answer.
Program:
import numpy as np

def entropy(s):
    '''
    Calculates the entropy given a list of (binary) target variables
    '''
    # Write your code here
    # Calculate entropy
    entropy = 0
    # Your code ends here
    return entropy

def information_gain(parent, left_child, right_child):
    '''
    Compute information gain given left_child target variables (list),
    right_child target variables (list) and their parent targets (list)
    '''
    info_gain = None
    # Write your code here
    # Your code ends here
    return info_gain

def best_split(features, labels):
    '''
    inputs:
        features: nd-array
        labels: nd-array
    output:
        float value determining the best threshold for decision tree classification
    '''
    best_threshold = None
    best_info_gain = -1
    # For every unique value of the feature
    for threshold in np.unique(features):
        y_left = _____________   # labels in the left child
        y_right = _____________  # labels in the right child
        if len(y_left) > 0 and len(y_right) > 0:
            # Calculate the information gain and save the split parameters
            # if the current split is better than the previous best
            gain = ____________
            if gain > best_info_gain:
                best_threshold = threshold
                best_info_gain = gain
    return best_threshold
Final Program:
import numpy as np

def entropy(s):
    '''
    Calculates the entropy given a list of (binary) target variables
    '''
    # Write your code here
    counts = np.bincount(s)
    probabilities = counts / len(s)
    # Calculate entropy: -sum(p * log2(p)) over the class probabilities
    entropy = 0
    for p in probabilities:
        if p > 0:
            entropy -= p * np.log2(p)
    # Your code ends here
    return entropy

def information_gain(parent, left_child, right_child):
    '''
    Compute information gain given left_child target variables (list),
    right_child target variables (list) and their parent targets (list)
    '''
    info_gain = None
    # Write your code here
    parent_entropy = entropy(parent)
    left_entropy = entropy(left_child)
    right_entropy = entropy(right_child)
    # Weighted average of the child entropies
    left_weight = len(left_child) / len(parent)
    right_weight = len(right_child) / len(parent)
    weighted_entropy = left_weight * left_entropy + right_weight * right_entropy
    # Information gain is the reduction in entropy
    info_gain = parent_entropy - weighted_entropy
    # Your code ends here
    return info_gain

def best_split(features, labels):
    '''
    inputs:
        features: nd-array
        labels: nd-array
    output:
        float value determining the best threshold for decision tree classification
    '''
    best_threshold = None
    best_info_gain = -1
    # For every unique value of the feature
    for threshold in np.unique(features):
        y_left = labels[features <= threshold]  # labels in the left child
        y_right = labels[features > threshold]  # labels in the right child
        if len(y_left) > 0 and len(y_right) > 0:
            # Calculate the information gain and save the split parameters
            # if the current split is better than the previous best
            gain = information_gain(labels, y_left, y_right)
            if gain > best_info_gain:
                best_threshold = threshold
                best_info_gain = gain
    return best_threshold
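As a quick sanity check, the same search can be written as a compact standalone snippet (the `H` helper name is illustrative; the logic mirrors the solution above) and run on the example input:

```python
import numpy as np

def H(y):
    # Binary entropy of a label array
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_split(features, labels):
    best_t, best_gain = None, -1.0
    for t in np.unique(features):
        yl, yr = labels[features <= t], labels[features > t]
        if len(yl) > 0 and len(yr) > 0:
            # Information gain = parent entropy - weighted child entropy
            gain = H(labels) - (len(yl) * H(yl) + len(yr) * H(yr)) / len(labels)
            if gain > best_gain:
                best_t, best_gain = t, gain
    return best_t

feature = np.array([0.58, 0.9, 0.45, 0.18, 0.5, 0.12, 0.31, 0.09, 0.24, 0.83])
label = np.array([1, 0, 0, 0, 0, 0, 1, 0, 1, 1])
print(best_split(feature, label))  # 0.18
```

This matches the expected output from the example above.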