Random Forest
=============

Random Forest is an ensemble method that combines multiple decision trees using bootstrap aggregating (bagging) and random feature selection. It reduces overfitting and improves generalization compared to a single decision tree.

RandomForest Class
------------------

.. autoclass:: lostml.tree.random_forest.RandomForest
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Parameters
----------

- ``n_estimators``: Number of trees in the forest (default: 100)
- ``max_depth``: Maximum depth of each tree. If None, trees grow until all leaves are pure (default: None)
- ``min_samples_split``: Minimum number of samples required to split a node in each tree (default: 2)
- ``min_samples_leaf``: Minimum number of samples required in a leaf node in each tree (default: 1)
- ``max_features``: Number of features to consider when looking for the best split (default: 'sqrt')

  - ``'sqrt'``: sqrt(n_features) features
  - ``'log2'``: log2(n_features) features
  - ``None``: All features
  - ``int``: Exact number of features
  - ``float``: Fraction of features (e.g., 0.5 = 50% of features)

- ``criterion``: Splitting criterion

  - ``'gini'``: Gini impurity for classification
  - ``'mse'``: Mean squared error (variance) for regression

- ``bootstrap``: Whether to use bootstrap sampling when building trees (default: True)
- ``random_state``: Random seed for reproducibility (default: None)

Examples
--------

Classification
~~~~~~~~~~~~~~

.. code-block:: python

   from lostml.tree import RandomForest
   import numpy as np

   X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
   y = np.array([0, 0, 0, 1, 1, 1])

   rf = RandomForest(n_estimators=100, criterion='gini', random_state=42)
   rf.fit(X, y)
   predictions = rf.predict(X)

   # Get class probabilities
   probabilities = rf.predict_proba(X)

Regression
~~~~~~~~~~

.. code-block:: python

   from lostml.tree import RandomForest
   import numpy as np

   X = np.array([[1], [2], [3], [5], [6], [7]])
   y = np.array([2, 4, 6, 10, 12, 14])

   rf = RandomForest(n_estimators=100, criterion='mse', random_state=42)
   rf.fit(X, y)
   predictions = rf.predict(X)

Customizing the Forest
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from lostml.tree import RandomForest

   # Control tree complexity and feature selection
   rf = RandomForest(
       n_estimators=200,      # More trees: more stable, but slower to train
       max_depth=10,          # Limit the depth of each tree
       max_features='sqrt',   # Consider sqrt(n_features) features at each split
       min_samples_split=5,   # Require more samples before splitting a node
       min_samples_leaf=2,    # Ensure leaves contain enough samples
       bootstrap=True,        # Use bootstrap sampling
       random_state=42        # For reproducibility
   )

How It Works
------------

1. **Bootstrap Sampling**: Each tree is trained on a random subset of the data, drawn with replacement
2. **Random Feature Selection**: At each split, only a random subset of features is considered
3. **Tree Building**: Each tree is built independently using the DecisionTree algorithm
4. **Ensemble Prediction** (illustrated by the sketch below):

   - **Classification**: Majority voting across all trees
   - **Regression**: Average of the predictions from all trees
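The four steps above can be traced in a small, self-contained NumPy sketch. This is only an illustration of the mechanics, not the library's implementation: the "tree" is replaced by a stub that predicts the majority class of its bootstrap sample, and the feature subset is drawn once per tree rather than once per split, purely to keep the example short.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(42)

   X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
   y = np.array([0, 0, 0, 1, 1, 1])

   n_samples, n_features = X.shape
   n_estimators = 5
   max_features = max(1, int(np.sqrt(n_features)))  # the 'sqrt' strategy

   per_tree_predictions = []
   for _ in range(n_estimators):
       # 1. Bootstrap sampling: draw n_samples rows with replacement
       sample_idx = rng.integers(0, n_samples, size=n_samples)
       X_boot, y_boot = X[sample_idx], y[sample_idx]

       # 2. Random feature selection: a real forest redraws this subset at
       #    every split; one subset per tree keeps the sketch short
       feature_idx = rng.choice(n_features, size=max_features, replace=False)

       # 3. Tree building (stub): the actual library fits a decision tree on
       #    X_boot[:, feature_idx]; here we just take the majority class of
       #    the bootstrap sample as each "tree's" prediction
       majority_class = np.bincount(y_boot).argmax()
       per_tree_predictions.append(np.full(n_samples, majority_class))

   # 4. Ensemble prediction: majority vote for classification
   #    (for regression this would be np.mean(votes, axis=0))
   votes = np.stack(per_tree_predictions)
   final_prediction = np.apply_along_axis(
       lambda column: np.bincount(column).argmax(), axis=0, arr=votes
   )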
Key Concepts
------------

**Bootstrap Aggregating (Bagging)**

- Each tree sees a different random sample of the training data
- On average, about 63% of the distinct training samples appear in each bootstrap sample
- Reduces variance and overfitting

**Random Feature Selection**

- At each split, only a subset of features is considered
- Increases diversity among trees
- The default ``'sqrt'`` is a good balance between diversity and performance

**Ensemble Voting**

- Classification: the most common prediction wins (majority vote)
- Regression: the average of all tree predictions
- More trees yield more stable predictions

Advantages
----------

- **Reduces Overfitting**: An ensemble of trees is less prone to overfitting than a single tree
- **Handles Non-linearity**: Can capture complex non-linear relationships
- **Feature Importance**: Can identify important features (via tree splits)
- **Robust**: Less sensitive to outliers and noise
- **No Feature Scaling**: Works well without feature normalization

Tips
----

- **n_estimators**: Start with 100-200. More trees generally improve stability at the cost of training time
- **max_features**: ``'sqrt'`` is a good default. Use ``'log2'`` for high-dimensional data
- **max_depth**: Limit depth to prevent overfitting (the default ``None`` allows full growth)
- **bootstrap**: Keep ``True`` for better generalization (default: True)
- **random_state**: Set it for reproducible results
- **Classification**: Use ``predict_proba()`` to get class probabilities
- **Regression**: Predictions are averaged across all trees

Comparison with Decision Tree
-----------------------------

- **Decision Tree**: A single tree; fast to train, but can overfit easily
- **Random Forest**: An ensemble of trees; slower to train, but reduces overfitting and generalizes better (see the sketch at the end of this page)

Use Random Forest when:

- You want better accuracy than a single decision tree
- You have enough data and computational resources
- You need robust predictions that handle noise well
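The trade-off above can be checked directly on noisy data. The sketch below is only a suggestion: it assumes a ``DecisionTree`` class is importable from ``lostml.tree`` with the same ``fit``/``predict`` interface as ``RandomForest`` (the import path and default constructor are assumptions, not confirmed API), and it uses a simple 2/3 train, 1/3 test split rather than proper cross-validation.

.. code-block:: python

   # Assumption: lostml.tree exposes a DecisionTree class with fit/predict;
   # adjust the import if the path differs in your version of the library.
   from lostml.tree import DecisionTree, RandomForest
   import numpy as np

   rng = np.random.default_rng(0)

   # Noisy non-linear toy data: label 1 inside a circle, plus 10% label noise
   X = rng.uniform(-3, 3, size=(300, 2))
   y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 4).astype(int)
   flip = rng.random(len(y)) < 0.1
   y[flip] = 1 - y[flip]

   # Simple train/test split
   X_train, X_test = X[:200], X[200:]
   y_train, y_test = y[:200], y[200:]

   tree = DecisionTree()  # single tree, default settings (assumed constructor)
   forest = RandomForest(n_estimators=100, criterion='gini', random_state=0)

   tree.fit(X_train, y_train)
   forest.fit(X_train, y_train)

   print("single tree accuracy  :", np.mean(tree.predict(X_test) == y_test))
   print("random forest accuracy:", np.mean(forest.predict(X_test) == y_test))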