Decision Tree

Decision Tree is a model that makes predictions by recursively splitting the data on feature thresholds. It supports both classification and regression tasks.

DecisionTree Class

class lostml.tree.decision_tree.DecisionTree(max_depth=None, min_samples_split=2, min_samples_leaf=1, criterion='gini')[source]

Bases: object

Decision Tree for classification and regression.

Parameters:
  • max_depth (int, default=None) – Maximum depth of the tree. If None, tree grows until all leaves are pure.

  • min_samples_split (int, default=2) – Minimum number of samples required to split a node.

  • min_samples_leaf (int, default=1) – Minimum number of samples required in a leaf node.

  • criterion (str, default='gini') – Splitting criterion. ‘gini’ for classification, ‘mse’ for regression.

class Node[source]

Bases: object

Internal node class to represent tree nodes.

__init__()[source]

__init__(max_depth=None, min_samples_split=2, min_samples_leaf=1, criterion='gini')[source]
fit(X, y)[source]

Build the decision tree from training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data

  • y (array-like of shape (n_samples,)) – Target values

predict(X)[source]

Predict target values for samples in X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples

Returns:
  Predicted values

Return type:
  ndarray of shape (n_samples,)

Parameters

  • max_depth: Maximum depth of the tree. If None, tree grows until all leaves are pure (default: None)

  • min_samples_split: Minimum number of samples required to split a node (default: 2)

  • min_samples_leaf: Minimum number of samples required in a leaf node (default: 1)

  • criterion: Splitting criterion (default: 'gini'); a short impurity sketch follows this list
      - 'gini': Gini impurity for classification
      - 'mse': Mean squared error (variance) for regression
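To make the two criteria concrete, here is a minimal NumPy sketch of how they are typically computed for a set of labels. The function names are illustrative, not part of lostml's API:

import numpy as np

def gini_impurity(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def mse_impurity(y):
    # MSE criterion: variance of the targets around their mean
    return np.mean((y - np.mean(y)) ** 2)

print(gini_impurity(np.array([0, 0, 1, 1, 1])))   # 0.48
print(mse_impurity(np.array([2, 4, 6, 10, 12])))  # 13.76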

Examples

Classification

from lostml.tree import DecisionTree
import numpy as np

# Toy two-feature dataset with binary class labels
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7]])
y = np.array([0, 0, 1, 1, 1])

# 'gini' is the splitting criterion for classification
tree = DecisionTree(max_depth=5, criterion='gini')
tree.fit(X, y)
predictions = tree.predict(X)

Regression

from lostml.tree import DecisionTree
import numpy as np

# Single-feature dataset with a roughly linear target (y = 2x)
X = np.array([[1], [2], [3], [5], [6]])
y = np.array([2, 4, 6, 10, 12])

# 'mse' is the splitting criterion for regression
tree = DecisionTree(max_depth=3, criterion='mse')
tree.fit(X, y)
predictions = tree.predict(X)

Controlling Overfitting

Unconstrained decision trees easily overfit: by default they keep splitting until every leaf is pure, effectively memorizing the training data. Use these parameters to control tree growth:

# Prevent overfitting with constraints
tree = DecisionTree(
    max_depth=5,              # Limit tree depth
    min_samples_split=10,     # Require more samples to split
    min_samples_leaf=5        # Ensure leaves have enough samples
)
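A quick way to see the effect is to fit a constrained and an unconstrained tree on the same data and compare training accuracy. The dataset below is synthetic, and the sketch uses only the fit/predict API documented above:

import numpy as np
from lostml.tree import DecisionTree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy binary labels

deep = DecisionTree()  # no limits: grows until leaves are pure
shallow = DecisionTree(max_depth=5, min_samples_split=10, min_samples_leaf=5)

for name, tree in (("deep", deep), ("shallow", shallow)):
    tree.fit(X, y)
    print(name, "training accuracy:", np.mean(tree.predict(X) == y))

# The deep tree typically reaches ~100% training accuracy by memorizing noise;
# the constrained tree gives up some training accuracy for better generalization.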

How It Works

  1. Splitting: At each node, the algorithm finds the best feature and threshold to split on

  2. Impurity: Uses Gini impurity (classification) or variance (regression) to measure how mixed a node is

  3. Information Gain: Chooses the split that maximizes information gain, i.e. the drop from the parent's impurity to the weighted average impurity of the two children (see the sketch after this list)

  4. Recursion: Recursively builds left and right subtrees

  5. Stopping: Stops when stopping conditions are met (max_depth, min_samples, pure nodes)

  6. Prediction: Traverses tree from root to leaf, returns leaf value
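The split search in steps 1-3 can be illustrated with a short standalone sketch. This is not lostml's implementation, just a minimal version of the same idea: try every feature and candidate threshold, and keep the split with the highest gain.

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Return the (feature, threshold) pair with the largest impurity reduction
    parent = gini(y)
    best_gain, best = 0.0, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Information gain: parent impurity minus weighted child impurity
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - child > best_gain:
                best_gain, best = parent - child, (j, t)
    return best

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7]])
y = np.array([0, 0, 1, 1, 1])
print(best_split(X, y))  # feature 0, threshold 2 for this toy data

A full tree builder applies best_split recursively to the resulting left and right subsets until a stopping condition from step 5 is met.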

Tips

  • max_depth: Start with 5-10. Too deep = overfitting, too shallow = underfitting

  • min_samples_split: Higher values = simpler trees (default: 2)

  • min_samples_leaf: Higher values = more conservative splits (default: 1)

  • criterion: Use ‘gini’ for classification, ‘mse’ for regression

  • Interpretability: Decision trees are highly interpretable; you can trace the decision path from root to leaf (a small printing sketch follows below)
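If the internal Node objects expose the usual attributes (feature, threshold, left, right, and a leaf value), a recursive printer like the sketch below can render the learned rules. These attribute names, and the root attribute on the fitted tree, are assumptions; the Node class is documented above only as an internal representation, so adapt the names to the actual implementation:

def print_tree(node, depth=0):
    # Assumed Node attributes: feature, threshold, left, right, value
    indent = "    " * depth
    if node.left is None and node.right is None:  # leaf: print its prediction
        print(f"{indent}predict {node.value}")
        return
    print(f"{indent}if X[{node.feature}] <= {node.threshold}:")
    print_tree(node.left, depth + 1)
    print(f"{indent}else:")
    print_tree(node.right, depth + 1)

# Usage (assuming the fitted tree stores its root in a `root` attribute):
# print_tree(tree.root)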