Random Forest

Random Forest is an ensemble method that combines multiple decision trees using bootstrap aggregating (bagging) and random feature selection. It reduces overfitting and improves generalization compared to a single decision tree.

RandomForest Class

class lostml.tree.random_forest.RandomForest(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]

Bases: object

Random Forest for classification and regression.

An ensemble method that combines multiple decision trees using:

  • Bootstrap aggregating (bagging): each tree trains on a random subset of the data

  • Random feature selection: each split considers a random subset of the features

Parameters:
  • n_estimators (int, default=100) – Number of trees in the forest.

  • max_depth (int, default=None) – Maximum depth of each tree. If None, trees grow until all leaves are pure.

  • min_samples_split (int, default=2) – Minimum number of samples required to split a node in each tree.

  • min_samples_leaf (int, default=1) – Minimum number of samples required in a leaf node in each tree.

  • max_features (int, float, str, or None, default='sqrt') – Number of features to consider when looking for the best split:

      - If int, then consider max_features features at each split.
      - If float, then max_features is a fraction and int(max_features * n_features) features are considered.
      - If 'sqrt', then max_features = sqrt(n_features).
      - If 'log2', then max_features = log2(n_features).
      - If None, then max_features = n_features.

  • criterion (str, default='gini') – Splitting criterion. ‘gini’ for classification, ‘mse’ for regression.

  • bootstrap (bool, default=True) – Whether to use bootstrap sampling when building trees.

  • random_state (int or None, default=None) – Random seed for reproducibility.

__init__(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]
fit(X, y)[source]

Build a forest of trees from training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data

  • y (array-like of shape (n_samples,)) – Target values

predict(X)[source]

Predict target values for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples

Returns:

Predicted values

Return type:

ndarray of shape (n_samples,)

predict_proba(X)[source]

Predict class probabilities for classification.

Only available for classification (criterion=’gini’).

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples

Returns:

Class probabilities

Return type:

ndarray of shape (n_samples, n_classes)
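
A short usage sketch. It reuses the toy data from the classification example below; the assumption that probability columns follow the sorted order of the class labels, so that a per-row argmax recovers the predicted class, is not stated in this reference:

from lostml.tree import RandomForest
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForest(n_estimators=50, criterion='gini', random_state=0)
rf.fit(X, y)

proba = rf.predict_proba(X)                              # shape: (n_samples, n_classes)
best_column = np.argmax(proba, axis=1)                   # most probable class index per sample
confidence = proba[np.arange(len(proba)), best_column]   # probability of that class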

Parameters

  • n_estimators: Number of trees in the forest (default: 100)

  • max_depth: Maximum depth of each tree. If None, trees grow until all leaves are pure (default: None)

  • min_samples_split: Minimum number of samples required to split a node in each tree (default: 2)

  • min_samples_leaf: Minimum number of samples required in a leaf node in each tree (default: 1)

  • max_features: Number of features to consider when looking for the best split (default: 'sqrt')

      - 'sqrt': sqrt(n_features) features
      - 'log2': log2(n_features) features
      - None: all features
      - int: exact number of features
      - float: fraction of features (e.g., 0.5 = 50% of features)

  • criterion: Splitting criterion

      - 'gini': Gini impurity for classification
      - 'mse': Mean squared error (variance) for regression

  • bootstrap: Whether to use bootstrap sampling when building trees (default: True)

  • random_state: Random seed for reproducibility (default: None)

Examples

Classification

from lostml.tree import RandomForest
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForest(n_estimators=100, criterion='gini', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)

# Get class probabilities
probabilities = rf.predict_proba(X)

Regression

from lostml.tree import RandomForest
import numpy as np

X = np.array([[1], [2], [3], [5], [6], [7]])
y = np.array([2, 4, 6, 10, 12, 14])

rf = RandomForest(n_estimators=100, criterion='mse', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)

Customizing the Forest

# Control tree complexity and feature selection
rf = RandomForest(
    n_estimators=200,        # More trees = better but slower
    max_depth=10,            # Limit depth of each tree
    max_features='sqrt',     # Use sqrt(n_features) at each split
    min_samples_split=5,     # Require more samples to split
    min_samples_leaf=2,      # Ensure leaves have enough samples
    bootstrap=True,          # Use bootstrap sampling
    random_state=42          # For reproducibility
)

How It Works

  1. Bootstrap Sampling: Each tree is trained on a random subset of data (with replacement)

  2. Random Feature Selection: At each split, only a random subset of features is considered

  3. Tree Building: Each tree is built independently using the DecisionTree algorithm

  4. Ensemble Prediction (see the sketch below):

      - Classification: majority voting across all trees
      - Regression: average of predictions from all trees
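
A minimal from-scratch sketch of these four steps, using numpy only. It is illustrative rather than the library's implementation: a one-split "stump" stands in for the real DecisionTree, the feature subset is drawn once per stump (equivalent to per-split here, since a stump makes a single split), and all helper names are hypothetical.

import numpy as np

rng = np.random.default_rng(42)

def fit_stump(X, y, feature_idxs):
    # Hypothetical stand-in for the real DecisionTree: the best single
    # feature/threshold split among the allowed features.
    best = (0.0, feature_idxs[0], -np.inf, 0, 1)   # (accuracy, feature, threshold, left label, right label)
    for j in feature_idxs:
        for t in X[:, j]:
            right = X[:, j] > t
            for left_label, right_label in ((0, 1), (1, 0)):
                pred = np.where(right, right_label, left_label)
                acc = (pred == y).mean()
                if acc > best[0]:
                    best = (acc, j, t, left_label, right_label)
    return best[1:]

def stump_predict(stump, X):
    j, t, left_label, right_label = stump
    return np.where(X[:, j] > t, right_label, left_label)

def fit_forest(X, y, n_estimators=25):
    n_samples, n_features = X.shape
    k = max(1, int(np.sqrt(n_features)))                        # max_features='sqrt'
    forest = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n_samples, size=n_samples)       # 1. bootstrap sample (with replacement)
        cols = rng.choice(n_features, size=k, replace=False)    # 2. random feature subset
        forest.append(fit_stump(X[rows], y[rows], cols))        # 3. build each tree independently
    return forest

def forest_predict(forest, X):
    votes = np.stack([stump_predict(s, X) for s in forest])             # (n_estimators, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])     # 4. majority vote

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(forest_predict(fit_forest(X, y), X))   # typically [0 0 0 1 1 1]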

Key Concepts

Bootstrap Aggregating (Bagging)
  • Each tree sees a different random sample of the training data

  • On average, about 63% of the unique training samples appear in each bootstrap sample, since 1 - (1 - 1/n)^n approaches 1 - 1/e ≈ 0.632; the rest are left "out-of-bag" (see the quick check below)

  • Reduces variance and overfitting
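
A quick empirical check of that ~63% figure (numpy only, illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                     # number of training samples
sample = rng.integers(0, n, size=n)            # one bootstrap sample (with replacement)
unique_fraction = np.unique(sample).size / n   # fraction of distinct samples drawn
print(unique_fraction)                         # close to 1 - 1/e ~= 0.632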

Random Feature Selection
  • At each split, only a subset of features is considered

  • Increases diversity among trees

  • Default 'sqrt' is a good balance between diversity and performance (the sketch below shows how each setting resolves to a feature count)
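
A sketch of how each max_features setting could resolve to a per-split feature count, mirroring the parameter description above (the helper name is hypothetical, not part of the library):

import numpy as np

def resolve_max_features(max_features, n_features):
    # Hypothetical helper mirroring the documented max_features options.
    if max_features is None:
        return n_features                              # use all features
    if max_features == 'sqrt':
        return max(1, int(np.sqrt(n_features)))        # sqrt(n_features)
    if max_features == 'log2':
        return max(1, int(np.log2(n_features)))        # log2(n_features)
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # fraction of features
    return int(max_features)                           # exact number of features

print(resolve_max_features('sqrt', 100))   # 10
print(resolve_max_features(0.5, 100))      # 50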

Ensemble Voting
  • Classification: Most common prediction wins (majority vote)

  • Regression: Average of all tree predictions

  • More trees = more stable predictions (see the aggregation sketch below)
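
A minimal sketch of both aggregation rules, applied to a made-up matrix of per-tree predictions (rows are trees, columns are samples):

import numpy as np

class_votes = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [0, 1, 1]])
majority = np.array([np.bincount(col).argmax() for col in class_votes.T])   # [0, 1, 1]

reg_preds = np.array([[1.8,  9.5],
                      [2.2, 10.5],
                      [2.0, 10.0]])
average = reg_preds.mean(axis=0)                                            # [2.0, 10.0]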

Advantages

  • Reduces Overfitting: Ensemble of trees is less prone to overfitting than a single tree

  • Handles Non-linearity: Can capture complex non-linear relationships

  • Feature Importance: Can identify important features (via tree splits)

  • Robust: Less sensitive to outliers and noise

  • No Feature Scaling: Works well without feature normalization

Tips

  • n_estimators: Start with 100-200. More trees generally improve accuracy and stability, but training takes longer (see the tuning sketch after this list)

  • max_features: ‘sqrt’ is a good default. Use ‘log2’ for high-dimensional data

  • max_depth: Limit depth to prevent overfitting (default: None allows full growth)

  • bootstrap: Keep True for better generalization (default: True)

  • random_state: Set for reproducible results

  • Classification: Use predict_proba() to get class probabilities

  • Regression: Predictions are averaged across all trees
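
A small tuning sketch using only the constructor parameters documented above; the toy data and the train-set accuracy check are illustrative stand-ins (in practice, score on a held-out validation split):

from lostml.tree import RandomForest
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

for n_estimators in (50, 100, 200):
    for max_depth in (None, 5, 10):
        rf = RandomForest(n_estimators=n_estimators, max_depth=max_depth,
                          max_features='sqrt', random_state=42)
        rf.fit(X, y)
        accuracy = (rf.predict(X) == y).mean()
        print(n_estimators, max_depth, accuracy)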

Comparison with Decision Tree

  • Decision Tree: Single tree, can overfit easily, fast training

  • Random Forest: Ensemble of trees, reduces overfitting, slower training but better generalization

Use Random Forest when:

  • You want better accuracy than a single decision tree

  • You have enough data and computational resources

  • You need robust predictions that handle noise well
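
One way to see the contrast without assuming a separate DecisionTree import: configure the forest itself as a single unbagged tree (a rough stand-in for a single decision tree, not an exact equivalent) and compare it with a full ensemble.

from lostml.tree import RandomForest
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_estimators=1 with bootstrap=False and all features approximates a single tree.
single_tree = RandomForest(n_estimators=1, bootstrap=False, max_features=None,
                           random_state=42)
forest = RandomForest(n_estimators=100, random_state=42)

for name, model in (('single tree', single_tree), ('forest', forest)):
    model.fit(X, y)
    print(name, (model.predict(X) == y).mean())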