Random Forest¶
Random Forest is an ensemble method that combines multiple decision trees using bootstrap aggregating (bagging) and random feature selection. It reduces overfitting and improves generalization compared to a single decision tree.
RandomForest Class¶
- class lostml.tree.random_forest.RandomForest(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]¶
Bases: object
Random Forest for classification and regression.
An ensemble method that combines multiple decision trees using:
- Bootstrap aggregating (bagging): each tree trains on a random subset of the data
- Random feature selection: each split considers a random subset of the features
- Parameters:
n_estimators (int, default=100) – Number of trees in the forest.
max_depth (int, default=None) – Maximum depth of each tree. If None, trees grow until all leaves are pure.
min_samples_split (int, default=2) – Minimum number of samples required to split a node in each tree.
min_samples_leaf (int, default=1) – Minimum number of samples required in a leaf node in each tree.
max_features (int, float, str, or None, default='sqrt') – Number of features to consider when looking for the best split:
  - If int, then consider max_features features at each split.
  - If float, then max_features is a fraction and int(max_features * n_features) features are considered.
  - If 'sqrt', then max_features = sqrt(n_features).
  - If 'log2', then max_features = log2(n_features).
  - If None, then max_features = n_features.
criterion (str, default='gini') – Splitting criterion. ‘gini’ for classification, ‘mse’ for regression.
bootstrap (bool, default=True) – Whether to use bootstrap sampling when building trees.
random_state (int or None, default=None) – Random seed for reproducibility.
- __init__(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', criterion='gini', bootstrap=True, random_state=None)[source]¶
- fit(X, y)[source]¶
Build a forest of trees from training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data
y (array-like of shape (n_samples,)) – Target values
Parameters¶
n_estimators: Number of trees in the forest (default: 100)
max_depth: Maximum depth of each tree. If None, trees grow until all leaves are pure (default: None)
min_samples_split: Minimum number of samples required to split a node in each tree (default: 2)
min_samples_leaf: Minimum number of samples required in a leaf node in each tree (default: 1)
max_features: Number of features to consider when looking for the best split (default: 'sqrt')
  - 'sqrt': sqrt(n_features) features
  - 'log2': log2(n_features) features
  - None: all features
  - int: exact number of features
  - float: fraction of features (e.g., 0.5 = 50% of features)
criterion: Splitting criterion
  - 'gini': Gini impurity for classification
  - 'mse': Mean squared error (variance) for regression
bootstrap: Whether to use bootstrap sampling when building trees (default: True)
random_state: Random seed for reproducibility (default: None)
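To make the max_features options concrete, the snippet below shows how each setting could resolve to a feature count. The helper name resolve_max_features is purely illustrative and is not part of lostml.
import math

def resolve_max_features(max_features, n_features):
    # Illustrative helper: map a max_features setting to a concrete feature count
    if max_features is None:
        return n_features                               # consider every feature
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))   # fraction of features
    return min(max_features, n_features)                # already an int

# With n_features = 100: 'sqrt' -> 10, 'log2' -> 6, 0.5 -> 50, None -> 100
for setting in ('sqrt', 'log2', 0.5, None):
    print(setting, resolve_max_features(setting, 100))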
Examples¶
Classification¶
from lostml.tree import RandomForest
import numpy as np
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
rf = RandomForest(n_estimators=100, criterion='gini', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)
# Get class probabilities
probabilities = rf.predict_proba(X)
Regression¶
from lostml.tree import RandomForest
import numpy as np
X = np.array([[1], [2], [3], [5], [6], [7]])
y = np.array([2, 4, 6, 10, 12, 14])
rf = RandomForest(n_estimators=100, criterion='mse', random_state=42)
rf.fit(X, y)
predictions = rf.predict(X)
Customizing the Forest¶
# Control tree complexity and feature selection
rf = RandomForest(
n_estimators=200, # More trees = better but slower
max_depth=10, # Limit depth of each tree
max_features='sqrt', # Use sqrt(n_features) at each split
min_samples_split=5, # Require more samples to split
min_samples_leaf=2, # Ensure leaves have enough samples
bootstrap=True, # Use bootstrap sampling
random_state=42 # For reproducibility
)
How It Works¶
Bootstrap Sampling: Each tree is trained on a random subset of data (with replacement)
Random Feature Selection: At each split, only a random subset of features is considered
Tree Building: Each tree is built independently using the DecisionTree algorithm
Ensemble Prediction:
- Classification: Majority voting across all trees
- Regression: Average of predictions from all trees
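As a rough, self-contained illustration of these steps, the sketch below implements bagging and ensemble prediction with NumPy around a generic tree object exposing fit/predict. It is a conceptual outline, not lostml's internal implementation; make_tree is an assumed factory for a DecisionTree-like estimator.
import numpy as np

rng = np.random.default_rng(42)

def fit_forest(X, y, n_estimators, make_tree):
    # Conceptual sketch: draw a bootstrap sample and fit one tree per sample
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample rows with replacement
        tree = make_tree()                           # assumed factory; each tree handles random feature selection internally
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X, task='classification'):
    # Collect per-tree predictions, shape (n_trees, n_samples)
    all_preds = np.array([tree.predict(X) for tree in trees])
    if task == 'classification':
        # Majority vote for each sample (assumes integer class labels)
        return np.array([np.bincount(col.astype(int)).argmax() for col in all_preds.T])
    return all_preds.mean(axis=0)                    # regression: average across trees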
Key Concepts¶
- Bootstrap Aggregating (Bagging)
Each tree sees a different random sample of the training data
About 63% of the unique training samples appear in each bootstrap sample, on average
Reduces variance and overfitting
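The 63% figure follows from sampling n items with replacement: each sample is left out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368, so roughly 63.2% of the unique samples are included. A quick, purely illustrative NumPy check:
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Draw a bootstrap sample of size n with replacement
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(bootstrap_idx)) / n
print(f"{unique_fraction:.3f}")  # ≈ 0.632, i.e. about 63% of samples appear at least once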
- Random Feature Selection
At each split, only a subset of features is considered
Increases diversity among trees
Default ‘sqrt’ is a good balance between diversity and performance
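A minimal sketch of what feature selection at a single split could look like with the default 'sqrt' setting and 16 features (illustrative only, not lostml internals):
import numpy as np

rng = np.random.default_rng(0)
n_features = 16
k = int(np.sqrt(n_features))   # 'sqrt' -> 4 candidate features per split
# Each split evaluates only a random subset of feature indices
candidate_features = rng.choice(n_features, size=k, replace=False)
print(candidate_features)      # a different subset is drawn at every split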
- Ensemble Voting
Classification: Most common prediction wins (majority vote)
Regression: Average of all tree predictions
More trees = more stable predictions
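For a single sample, the two aggregation rules can be illustrated with hypothetical per-tree predictions:
import numpy as np

# Hypothetical predictions for one sample from five trees
clf_votes = np.array([1, 0, 1, 1, 0])             # classification: predicted labels
reg_preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # regression: predicted values

print(np.bincount(clf_votes).argmax())  # 1 -> majority vote wins
print(reg_preds.mean())                 # 2.12 -> averaged prediction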
Advantages¶
Reduces Overfitting: Ensemble of trees is less prone to overfitting than a single tree
Handles Non-linearity: Can capture complex non-linear relationships
Feature Importance: Can identify important features (via tree splits)
Robust: Less sensitive to outliers and noise
No Feature Scaling: Works well without feature normalization
Tips¶
n_estimators: Start with 100-200. More trees = better but slower
max_features: ‘sqrt’ is a good default. Use ‘log2’ for high-dimensional data
max_depth: Limit depth to prevent overfitting (default: None allows full growth)
bootstrap: Keep True for better generalization (default: True)
random_state: Set for reproducible results
Classification: Use predict_proba() to get class probabilities
Regression: Predictions are averaged across all trees
Comparison with Decision Tree¶
Decision Tree: Single tree, can overfit easily, fast training
Random Forest: Ensemble of trees, reduces overfitting, slower training but better generalization
Use Random Forest when:
- You want better accuracy than a single decision tree
- You have enough data and computational resources
- You need robust predictions that handle noise well