The decision tree is one of the most commonly used machine learning algorithms and can be applied to both classification and regression problems. It is simple to understand and use. Here is a light-hearted illustration of how decision trees and related algorithms (random forest, etc.) are versatile in practice.
Figure 1. Trees and Forests
In this post, you will learn about decision trees with a focus on the popular C5.0 algorithm, which is used to build a decision tree for classification. In another post, we will look at the CART methodology for building a decision tree model for classification.
The post also presents a set of practice questions to help you test your knowledge of decision tree fundamentals and concepts. It could prove very useful if you are preparing for a machine learning engineer, intern, fresher, or data scientist interview.
A decision tree is a machine learning algorithm used for modeling a dependent or response variable by passing the values of independent variables through logical statements represented as nodes and leaves. The logical statements are learned by the algorithm from the training data. Like the support vector machine (SVM), the decision tree algorithm can be used for both classification and regression tasks, and even for multi-output tasks.
The decision tree algorithm is implemented in the Scikit-learn Python package through classes such as DecisionTreeClassifier (for classification) and DecisionTreeRegressor (for regression).
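To make the regression side mentioned above concrete, here is a minimal sketch (not from the original post) that fits a DecisionTreeRegressor on a small made-up dataset; the data and hyperparameter choices are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a noisy quadratic relationship between X and y
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.RandomState(0).normal(scale=0.3, size=50)

# Fit a shallow regression tree; max_depth=3 is an illustrative choice
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

# Predict the response for a new value of the independent variable
print(reg.predict([[1.5]]))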
Decision trees are fundamental components of random forests.
The splitting criterion used in the C5.0 algorithm is entropy, or information gain, which is described later in this post.
Figure 2. Entropy plot
Let's understand the concept of a pure data segment from the diagram below. In decision tree 2, note that the decision node (age > 16) splits the data segment and creates a pure data segment, or homogeneous node (students whose age is not greater than 16). The overall information gain in decision tree 2 is therefore greater than in decision tree 1.
The entropy of a data segment is calculated as the sum, over all classes, of -p * log2(p), where p is the proportion of data instances belonging to a given class.
Thus, for a data segment containing data belonging to two classes A (say, head) and B (say, tail), where the proportion of class A (probability p(A)) is 0.3 and that of class B (p(B)) is 0.7, the entropy can be calculated as follows:
-(0.3)*log2 (0.3) - (0.7)*log2 (0.7) = - (-0.5211) - (-0.3602) = 0.8813
For a data segment with a 50-50 split, the value of entropy is as follows (the expected value is 1, the maximum):
-(0.5)*log2 (0.5) - (0.5)*log2 (0.5) = - (0.5)*(-1) - (0.5)*(-1) = 0.5 + 0.5 = 1
For a data segment with a 90-10 split (highly homogeneous, nearly pure data), the value of entropy is as follows (the expected value is closer to 0):
-(0.1)*log2 (0.1) - (0.9)*log2 (0.9) = - (0.1)*(-3.3219) - (0.9)*(-0.1520) = 0.3322 + 0.1368 = 0.4690
For a completely pure data segment, the value of entropy is 0, as expected:
-(1)*log2 (1) - (0)*log2 (0) = - (1)*(0) - 0 = 0 (by convention, 0*log2(0) is taken as 0, since p*log2(p) tends to 0 as p tends to 0)
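To tie the above calculations together, here is a minimal sketch (the helper function is my own assumption, not part of the original post) that computes the entropy of a data segment from its class proportions and reproduces the numbers above.

import math

# Entropy of a data segment given the proportion of instances in each class.
def entropy(proportions):
    # By convention, a class with proportion 0 contributes 0 to the entropy
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.3, 0.7]))   # ~0.8813 (30-70 split)
print(entropy([0.5, 0.5]))   # 1.0 (50-50 split, maximum impurity)
print(entropy([0.1, 0.9]))   # ~0.4690 (90-10 split)
print(entropy([1.0, 0.0]))   # 0.0 (completely pure segment)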
Based on the above calculations, one can see that entropy varies with the class proportion as shown in the entropy plot (Figure 2 above).
The information gain from a split is calculated as InfoGain = E(S1) - E(S2), where E(S1) is the entropy of the data segment before the split (the parent node) and E(S2) is the combined (size-weighted) entropy of the data segments after the split (the children nodes).
Figure 3. Decision Tree Visualization
A decision node, or feature, can be considered suitable or valid when the data split results in children nodes whose data has higher homogeneity, or lower entropy.
A data segment is said to be pure if it contains data instances belonging to just one class. The goal while building a decision tree is to reach a state where the leaves (leaf nodes) attain a pure state.
The goal of feature selection is to find the features or attributes whose split produces children nodes whose combined (weighted) entropy is lower than the entropy of the data segment before the split.
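As a rough sketch of this idea (the helper names and the example split below are my own assumptions, loosely modeled on the age > 16 example above), the information gain of a split can be computed as the parent entropy minus the size-weighted entropy of the children:

import math

def entropy(proportions):
    # Entropy from class proportions; classes with proportion 0 contribute 0
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def info_gain(parent_labels, children_labels):
    # InfoGain = E(parent) - weighted sum of E(child) over all children
    def segment_entropy(labels):
        counts = [labels.count(c) for c in set(labels)]
        total = len(labels)
        return entropy([c / total for c in counts])

    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * segment_entropy(child) for child in children_labels
    )
    return segment_entropy(parent_labels) - weighted_child_entropy

# Hypothetical split: the second child node is pure, so the split gains information
parent = ['A', 'A', 'A', 'B', 'B', 'B']
children = [['A', 'A', 'A', 'B'], ['B', 'B']]
print(info_gain(parent, children))  # positive value -> the split reduces entropy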
In this section, you will learn how to train a decision tree classifier using Scikit-learn. The IRIS data set is used.
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
#
# Load IRIS dataset
#
iris = datasets.load_iris()
X = iris.data
y = iris.target
#
# Create training and test data split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
#
# Create decision tree classifier
#
dt_clfr = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
dt_clfr.fit(X_train, y_train)
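As an optional next step (not shown in the original post), you could check how well the trained classifier generalizes by scoring it on the held-out test split; this snippet simply continues from the code above.

#
# Evaluate the trained classifier on the test split (optional sanity check)
#
print("Test accuracy:", dt_clfr.score(X_test, y_test))
print("Sample predictions:", dt_clfr.predict(X_test[:5]))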
Once the above code is executed, you can also draw the decision tree using the following code in a Google Colab notebook:
from sklearn.tree import export_graphviz
import os

# Where to save the figures
PROJECT_ROOT_DIR = "."

def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "sample_data", fig_id)

export_graphviz(
    dt_clfr,
    out_file=image_path("iris_tree.dot"),
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

from graphviz import Source
Source.from_file("./sample_data/iris_tree.dot")
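If Graphviz is not available in your environment, a possible alternative (my own suggestion, not part of the original post) is Scikit-learn's built-in plot_tree function, which only needs matplotlib; this snippet reuses dt_clfr and iris from the code above.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the same fitted tree without requiring Graphviz
plt.figure(figsize=(8, 6))
plot_tree(
    dt_clfr,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True
)
plt.show()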
The following are some of the questions that can be asked in interviews. The answers can be found in the text above:
[wp_quiz id="6368"]
In this post, you learned about some of the following: what a decision tree is, how entropy and information gain drive the choice of splits in the C5.0 algorithm, what pure (homogeneous) data segments are, and how to train and visualize a decision tree classifier with Scikit-learn.
Did you find this article useful? Do you have any questions about this article, or about the decision tree algorithm and its related concepts and terminology? Leave a comment and ask your questions, and I will do my best to address your queries.