Random Forest¶
Random Forest is an ensemble method that combines multiple decision trees using bootstrap aggregating (bagging) and random feature selection. It reduces overfitting and improves generalization compared to a single decision tree.
RandomForest Class¶
- class lostml.tree.random_forest.RandomForest(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]¶
Bases: object
Random Forest for classification and regression.
An ensemble method that combines multiple decision trees using:
- Bootstrap aggregating (bagging): each tree trains on a random subset of the data
- Random feature selection: each split considers a random subset of the features
- Parameters:
n_estimators (int, default=100) – Number of trees in the forest.
max_depth (int, default=None) – Maximum depth of each tree. If None, trees grow until all leaves are pure.
min_samples_split (int, default=2) – Minimum number of samples required to split a node in each tree.
min_samples_leaf (int, default=1) – Minimum number of samples required in a leaf node in each tree.
max_features (int, float, str, or None, default='sqrt') – Number of features to consider when looking for the best split:
  - If int, then consider max_features features at each split.
  - If float, then max_features is a fraction and int(max_features * n_features) features are considered.
  - If 'sqrt', then max_features = sqrt(n_features).
  - If 'log2', then max_features = log2(n_features).
  - If None, then max_features = n_features.
criterion (str, default='gini') – Splitting criterion. ‘gini’ for classification, ‘mse’ for regression.
bootstrap (bool, default=True) – Whether to use bootstrap sampling when building trees.
random_state (int or None, default=None) – Random seed for reproducibility.
- __init__(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]¶
- fit(X, y)[source]¶
Build a forest of trees from training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data
y (array-like of shape (n_samples,)) – Target values
Parameters¶
n_estimators: Number of trees in the forest (default: 100)
max_depth: Maximum depth of each tree. If None, trees grow until all leaves are pure (default: None)
min_samples_split: Minimum number of samples required to split a node in each tree (default: 2)
min_samples_leaf: Minimum number of samples required in a leaf node in each tree (default: 1)
max_features: Number of features to consider when looking for the best split (default: 'sqrt')
  - 'sqrt': sqrt(n_features) features
  - 'log2': log2(n_features) features
  - None: all features
  - int: exact number of features
  - float: fraction of features (e.g., 0.5 = 50% of features)
criterion: Splitting criterion
  - 'gini': Gini impurity for classification
  - 'mse': Mean squared error (variance) for regression
bootstrap: Whether to use bootstrap sampling when building trees (default: True)
random_state: Random seed for reproducibility (default: None)
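To make the max_features options concrete, the snippet below shows how each setting could resolve to a feature count. The helper name resolve_max_features is purely illustrative and is not part of lostml.
import math

def resolve_max_features(max_features, n_features):
    # Illustrative helper: map a max_features setting to a concrete feature count
    if max_features is None:
        return n_features                               # consider every feature
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))   # fraction of features
    return min(max_features, n_features)                # already an int

# With n_features = 100: 'sqrt' -> 10, 'log2' -> 6, 0.5 -> 50, None -> 100
for setting in ('sqrt', 'log2', 0.5, None):
    print(setting, resolve_max_features(setting, 100))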
Examples¶
Classification¶
from lostml.tree import RandomForest
import numpy as np
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
rf = RandomForest(n_estimators=100, criterion='gini', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)
# Get class probabilities
probabilities = rf.predict_proba(X)
Regression¶
from lostml.tree import RandomForest
import numpy as np
X = np.array([[1], [2], [3], [5], [6], [7]])
y = np.array([2, 4, 6, 10, 12, 14])
rf = RandomForest(n_estimators=100, criterion='mse', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)
Customizing the Forest¶
# Control tree complexity and feature selection
rf = RandomForest(
n_estimators=200, # More trees = better but slower
max_depth=10, # Limit depth of each tree
max_features='sqrt', # Use sqrt(n_features) at each split
min_samples_split=5, # Require more samples to split
min_samples_leaf=2, # Ensure leaves have enough samples
bootstrap=True, # Use bootstrap sampling
random_state=42 # For reproducibility
)
How It Works¶
Bootstrap Sampling: Each tree is trained on a random subset of data (with replacement)
Random Feature Selection: At each split, only a random subset of features is considered
Tree Building: Each tree is built independently using the DecisionTree algorithm
Ensemble Prediction:
- Classification: Majority voting across all trees
- Regression: Average of predictions from all trees
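As a rough, self-contained illustration of these steps, the sketch below implements bagging and ensemble prediction with NumPy around a generic tree object exposing fit/predict. It is a conceptual outline, not lostml's internal implementation; make_tree is an assumed factory for a DecisionTree-like estimator.
import numpy as np

rng = np.random.default_rng(42)

def fit_forest(X, y, n_estimators, make_tree):
    # Conceptual sketch: draw a bootstrap sample and fit one tree per sample
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample rows with replacement
        tree = make_tree()                           # assumed factory; each tree handles random feature selection internally
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X, task='classification'):
    # Collect per-tree predictions, shape (n_trees, n_samples)
    all_preds = np.array([tree.predict(X) for tree in trees])
    if task == 'classification':
        # Majority vote for each sample (assumes integer class labels)
        return np.array([np.bincount(col.astype(int)).argmax() for col in all_preds.T])
    return all_preds.mean(axis=0)                    # regression: average across trees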
Key Concepts¶
- Bootstrap Aggregating (Bagging)
Each tree sees a different random sample of the training data
About 63% of the unique training samples appear in each bootstrap sample, on average
Reduces variance and overfitting
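The 63% figure follows from sampling n items with replacement: each sample is left out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368, so roughly 63.2% of the unique samples are included. A quick, purely illustrative NumPy check:
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Draw a bootstrap sample of size n with replacement
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(bootstrap_idx)) / n
print(f"{unique_fraction:.3f}")  # ≈ 0.632, i.e. about 63% of samples appear at least once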
- Random Feature Selection
At each split, only a subset of features is considered
Increases diversity among trees
Default ‘sqrt’ is a good balance between diversity and performance
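A minimal sketch of what feature selection at a single split could look like with the default 'sqrt' setting and 16 features (illustrative only, not lostml internals):
import numpy as np

rng = np.random.default_rng(0)
n_features = 16
k = int(np.sqrt(n_features))   # 'sqrt' -> 4 candidate features per split
# Each split evaluates only a random subset of feature indices
candidate_features = rng.choice(n_features, size=k, replace=False)
print(candidate_features)      # a different subset is drawn at every split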
- Ensemble Voting
Classification: Most common prediction wins (majority vote)
Regression: Average of all tree predictions
More trees = more stable predictions
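For a single sample, the two aggregation rules can be illustrated with hypothetical per-tree predictions:
import numpy as np

# Hypothetical predictions for one sample from five trees
clf_votes = np.array([1, 0, 1, 1, 0])             # classification: predicted labels
reg_preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # regression: predicted values

print(np.bincount(clf_votes).argmax())  # 1 -> majority vote wins
print(reg_preds.mean())                 # 2.12 -> averaged prediction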
Advantages¶
Reduces Overfitting: Ensemble of trees is less prone to overfitting than a single tree
Handles Non-linearity: Can capture complex non-linear relationships
Feature Importance: Can identify important features (via tree splits)
Robust: Less sensitive to outliers and noise
No Feature Scaling: Works well without feature normalization
Tips¶
n_estimators: Start with 100-200. More trees = better but slower
max_features: ‘sqrt’ is a good default. Use ‘log2’ for high-dimensional data
max_depth: Limit depth to prevent overfitting (default: None allows full growth)
bootstrap: Keep True for better generalization (default: True)
random_state: Set for reproducible results
Classification: Use predict_proba() to get class probabilities
Regression: Predictions are averaged across all trees
Comparison with Decision Tree¶
Decision Tree: Single tree, can overfit easily, fast training
Random Forest: Ensemble of trees, reduces overfitting, slower training but better generalization
Use Random Forest when:
- You want better accuracy than a single decision tree
- You have enough data and computational resources
- You need robust predictions that handle noise well