How are subsamples drawn in Random Forest?

Question

claude-haiku-4.5 · Accepted Answer

## Subsampling in Random Forest: the full mechanism

**Random Forest** relies on two sources of randomization:
1. **Bootstrap sampling** (row subsampling): a random draw of samples, with replacement, for each tree
2. **Random feature selection** (feature subsampling): a random draw of features at each split

Together these two mechanisms decorrelate the trees, which is the key to the strength and robustness of Random Forest.

### 1. Bootstrap Sampling (sampling with replacement)

#### How it works:

```python
import numpy as np

# Toy data so the example runs on its own: 100 samples, 4 features
rng = np.random.default_rng(42)
n_samples = 100
X = rng.normal(size=(n_samples, 4))
y = rng.integers(0, 2, size=n_samples)
original_indices = np.arange(n_samples)  # [0, 1, 2, ..., 99]

# A bootstrap sample is drawn WITH replacement
bootstrap_indices = rng.choice(
    original_indices,
    size=n_samples,  # by default, as many draws as there are original samples
    replace=True,    # the key argument: sampling with replacement
)

print(f"Original indices: {original_indices[:10]}")
print(f"Bootstrap indices: {bootstrap_indices[:10]}")
print(f"Number of unique indices: {len(np.unique(bootstrap_indices))}")
# Typically ~63% of the samples appear at least once; the other ~37% are out-of-bag (OOB)

# Training data for one tree
X_tree = X[bootstrap_indices]
y_tree = y[bootstrap_indices]
```

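**Probability that a sample is selected:**
```
P(sample drawn at least once) = 1 - (1 - 1/n)^n ≈ 1 - e^(-1) ≈ 0.632 (for large n)
```

Each of the n draws misses a given sample with probability 1 - 1/n, so all n independent draws miss it with probability (1 - 1/n)^n. In other words:
- ~63% of the original samples appear in a given bootstrap sample at least once
- ~37% of the samples are left out of that tree's training set (Out-of-Bag, OOB)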
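
To make the 63% figure concrete, here is a minimal sketch that uses only the Python standard library. It evaluates the formula for a few values of n and compares it with one simulated bootstrap draw. The function names `p_selected` and `simulated_unique_fraction` are illustrative and not part of any library.

```python
import math
import random

def p_selected(n: int) -> float:
    """Probability that a given sample appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_unique_fraction(n: int, seed: int = 0) -> float:
    """Fraction of distinct indices in one simulated bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement, duplicates collapse
    return len(drawn) / n

for n in (2, 10, 100, 10_000):
    print(f"n={n:>6}: formula={p_selected(n):.4f}, simulated={simulated_unique_fraction(n):.4f}")

print(f"limit 1 - 1/e = {1 - math.exp(-1):.4f}")
```

For very small n the fraction is noticeably higher (0.75 for n = 2), and it settles near 0.632 as n grows.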

#### Visualizing the process:

```python
import matplotlib.pyplot as plt
import numpy as np

# A small sample makes the effect easy to see
n_samples = 10
n_rounds = 1000
rep_counts = np.zeros(n_samples)

# Draw 1000 bootstrap samples and count how often each index is picked
for _ in range(n_rounds):
    bootstrap_indices = np.random.choice(
        n_samples,
        size=n_samples,
        replace=True
    )
    for idx in bootstrap_indices:
        rep_counts[idx] += 1

# Show how many times each sample is drawn on average
plt.figure(figsize=(10, 5))
plt.bar(range(n_samples), rep_counts / n_rounds)
plt.xlabel("Sample Index")
plt.ylabel("Average Times Selected per Bootstrap")
plt.title("Bootstrap Selection Distribution (averaged over 1000 draws)")
plt.axhline(y=1.0, color="red", linestyle="--", label="Expected value = 1.0")
plt.legend()
plt.show()
```

### 2. Feature Subsampling (selecting features)

#### How it works:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data so the example runs on its own
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=42)

# At every split in a tree, a random subset of features is considered
model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider sqrt(n_features) features at each split
    bootstrap=True,       # use bootstrap sampling
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train)

# max_features can be:
# - "sqrt": sqrt(n_features), the usual choice for classification
# - "log2": log2(n_features)
# - None: all features
# - int: an exact number of features
# - float: a fraction of the features

print(f"Total features: {X_train.shape[1]}")
print(f"Features per split (sqrt): {int(np.sqrt(X_train.shape[1]))}")
```
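
The options listed in the comments above each map to a concrete number of candidate features per split. The helper below is a hypothetical illustration (`resolve_max_features` is not a scikit-learn function); it assumes the rounding rules described in the scikit-learn documentation, where string and fractional settings are truncated to an integer and never drop below one feature.

```python
import math

def resolve_max_features(max_features, n_features: int) -> int:
    """Translate a max_features setting into a number of candidate features (illustrative)."""
    if max_features is None:
        return n_features  # consider every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # treated as a fraction
    if isinstance(max_features, int):
        return max_features  # treated as an exact count
    raise ValueError(f"Unsupported max_features: {max_features!r}")

for setting in ("sqrt", "log2", None, 5, 0.5):
    print(f"max_features={setting!r:>6} -> {resolve_max_features(setting, 20)} of 20 features")
```

With 20 features, both `"sqrt"` and `"log2"` resolve to 4, so the two settings only diverge on wider datasets.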

#### A step-by-step example:

```python
import numpy as np

# A simplified simulation of one Random Forest tree (for illustration only)
class SimpleRandomTreeNode:
    def __init__(self, X, y, max_features, depth=0, max_depth=5):
        self.X = X
        self.y = y
        self.n_samples = len(X)
        self.n_features = X.shape[1]

        # Draw a fresh random subset of features for this split
        if max_features == "sqrt":
            n_features_to_use = max(1, int(np.sqrt(self.n_features)))
        else:
            n_features_to_use = max_features

        self.feature_indices = np.random.choice(
            self.n_features,
            size=n_features_to_use,
            replace=False  # features are drawn without replacement
        )

        self.max_features = max_features
        self.depth = depth
        self.max_depth = max_depth
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.value = None

        # Find the best split
        self._find_best_split()

    def _find_best_split(self):
        """Find the best split among the randomly chosen features."""
        # Leaf: depth limit reached or the node is already pure
        if self.depth >= self.max_depth or len(np.unique(self.y)) == 1:
            self.value = np.mean(self.y)  # share of class 1 for binary 0/1 labels
            return

        best_gini = float("inf")
        best_feature = None
        best_threshold = None

        # Search only within the selected features
        for feature_idx in self.feature_indices:
            thresholds = np.unique(self.X[:, feature_idx])

            for threshold in thresholds:
                left_mask = self.X[:, feature_idx] <= threshold
                right_mask = ~left_mask

                if len(self.y[left_mask]) == 0 or len(self.y[right_mask]) == 0:
                    continue

                # Weighted Gini impurity of the two children
                gini = self._gini_impurity(self.y[left_mask], self.y[right_mask])

                if gini < best_gini:
                    best_gini = gini
                    best_feature = feature_idx
                    best_threshold = threshold

        if best_feature is None:
            self.value = np.mean(self.y)  # leaf: no valid split exists
            return

        self.feature = best_feature
        self.threshold = best_threshold

        # Recursively build the left and right subtrees
        left_mask = self.X[:, best_feature] <= best_threshold
        right_mask = ~left_mask

        self.left = SimpleRandomTreeNode(
            self.X[left_mask],
            self.y[left_mask],
            max_features=self.max_features,
            depth=self.depth + 1,
            max_depth=self.max_depth
        )
        self.right = SimpleRandomTreeNode(
            self.X[right_mask],
            self.y[right_mask],
            max_features=self.max_features,
            depth=self.depth + 1,
            max_depth=self.max_depth
        )

    def _gini_impurity(self, y_left, y_right):
        """Size-weighted Gini impurity of the two child nodes."""
        def gini(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        n_left, n_right = len(y_left), len(y_right)
        n_total = n_left + n_right
        return (n_left * gini(y_left) + n_right * gini(y_right)) / n_total

# Usage
# tree = SimpleRandomTreeNode(X, y, max_features="sqrt")
```
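
The split criterion in the tree sketch above is the Gini impurity of the two children, weighted by their sizes. As a worked illustration in plain Python (the function names `gini` and `weighted_gini` are chosen here and are not library calls), the following compares a split that separates nothing with one that separates the classes perfectly.

```python
from collections import Counter

def gini(labels) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right) -> float:
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

mixed = weighted_gini([0, 1], [0, 1])  # both children are still 50/50
pure = weighted_gini([0, 0], [1, 1])   # each child holds a single class
print(f"uninformative split: {mixed}, perfect split: {pure}")
```

A lower value means a better split, which is why the search keeps the candidate with the smallest weighted impurity.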

### 3. The full Random Forest process

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create a dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=2,
    random_state=42
)

print(f"Dataset size: {X.shape}")
print("Number of trees: 100")
print("Samples per tree: 1000 draws with replacement (bootstrap)")
print("Features per split: int(sqrt(20)) = 4")
print()
print("Random Forest training process:")
print("=" * 50)

model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    max_depth=10,
    random_state=42,
    oob_score=True,  # compute OOB accuracy
    n_jobs=-1
)

model.fit(X, y)

print("\n1. For each of the 100 trees:")
print("   - Draw a bootstrap sample (with replacement): 1000 draws")
print("   - ~63% (about 630) distinct samples end up in the tree's training set")
print("   - ~37% (about 370) samples are left out (OOB)")
print()
print("2. At each split in a tree:")
print("   - Pick int(sqrt(20)) = 4 random features")
print("   - Find the best split among them")
print()
print(f"3. OOB Score (validation on the left-out samples): {model.oob_score_:.3f}")
```

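### 4. Out-of-Bag (OOB) Evaluation

A very useful property that Random Forest gets from bootstrap sampling:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs on its own
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,  # compute the OOB score
    random_state=42
)

model.fit(X_train, y_train)

# OOB score: built-in validation without a separate validation set
print(f"OOB Score: {model.oob_score_:.3f}")

# With enough data and trees, the OOB score is often close to test accuracy
test_score = model.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")

# Per-sample OOB class probabilities (scikit-learn has no oob_predict method;
# the fitted attribute oob_decision_function_ holds these estimates)
oob_proba = model.oob_decision_function_
print(f"OOB decision function shape: {oob_proba.shape}")
```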
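
To show what happens behind an OOB score, here is a minimal standard-library sketch of the bookkeeping. It is not scikit-learn's implementation: each "tree" is replaced by a 1-nearest-neighbour rule on a single feature (an assumption made only to keep the example short), and the names `nearest_label` and `oob_accuracy` are hypothetical. The essential rule is that a sample is scored only by the models whose bootstrap sample did not contain it.

```python
import random

def nearest_label(x, in_bag, xs, ys):
    """Stand-in for a tree: label of the closest in-bag sample."""
    j = min(in_bag, key=lambda i: abs(xs[i] - x))
    return ys[j]

def oob_accuracy(xs, ys, n_trees=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # per-sample vote counts for class 0 and class 1
    for _ in range(n_trees):
        in_bag = {rng.randrange(n) for _ in range(n)}  # distinct indices of one bootstrap sample
        for i in range(n):
            if i not in in_bag:  # only models that never saw sample i may vote on it
                votes[i][nearest_label(xs[i], in_bag, xs, ys)] += 1
    scored = [i for i in range(n) if sum(votes[i]) > 0]  # samples with at least one OOB vote
    correct = sum(1 for i in scored if (votes[i][1] > votes[i][0]) == (ys[i] == 1))
    return correct / len(scored)

# Toy data: one feature in [0, 1), label 1 when the feature exceeds 0.5
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(200)]
ys = [int(x > 0.5) for x in xs]
print(f"OOB accuracy: {oob_accuracy(xs, ys):.3f}")
```

Because every sample is out-of-bag for roughly 37% of the models, each one still collects many votes, which is what makes the estimate usable without a held-out set.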

### 5. How the parameters affect subsampling

#### 5.1 Bootstrap parameters

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the example runs on its own
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=42)

# Compare training with and without bootstrap
# (with bootstrap=False, every tree is built on the whole dataset)
for bootstrap in [True, False]:
    model = RandomForestClassifier(
        n_estimators=100,
        max_features="sqrt",
        bootstrap=bootstrap,
        random_state=42
    )

    scores = cross_val_score(
        model, X_train, y_train,
        cv=5,
        scoring="accuracy"
    )

    print(f"\nBootstrap={bootstrap}:")
    print(f"  Mean Accuracy: {scores.mean():.3f}")
    print(f"  Std Deviation: {scores.std():.3f}")
```

**bootstrap=True** usually gives results that are at least as good, thanks to:
- Greater diversity among the trees
- The ability to use OOB samples for validation
- A stronger regularizing effect
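
The diversity argument can be made concrete. A given row lands in one bootstrap sample with probability of about 0.632, so it lands in two independent samples with probability of about 0.632² ≈ 0.40. The standard-library sketch below checks this by simulation; `bootstrap_set` and `mean_shared_fraction` are illustrative names, not library functions.

```python
import random

def bootstrap_set(n: int, rng: random.Random) -> set:
    """Distinct row indices contained in one bootstrap sample of size n."""
    return {rng.randrange(n) for _ in range(n)}

def mean_shared_fraction(n: int = 2000, n_pairs: int = 20, seed: int = 0) -> float:
    """Average fraction of all rows that appear in both of two independent bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a, b = bootstrap_set(n, rng), bootstrap_set(n, rng)
        total += len(a & b) / n
    return total / n_pairs

print(f"simulated shared fraction: {mean_shared_fraction():.3f}")
print(f"expected 0.632^2:          {0.632 ** 2:.3f}")
```

So any two trees share only about 40% of the original rows, whereas with `bootstrap=False` they all see exactly the same rows and differ only through feature subsampling.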

#### 5.2 The max_features parameter

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the example runs on its own
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=42)

# Compare different values of max_features
for max_features in ["sqrt", "log2", None, 5, 10]:
    model = RandomForestClassifier(
        n_estimators=100,
        max_features=max_features,
        bootstrap=True,
        random_state=42
    )

    scores = cross_val_score(
        model, X_train, y_train,
        cv=5,
        scoring="accuracy"
    )

    print(f"max_features={max_features}:")
    print(f"  Mean Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Rules of thumb:
# - Classification: sqrt(n_features) or log2(n_features)
# - Regression: n_features / 3 (a common heuristic; note that scikit-learn's
#   RandomForestRegressor uses all features by default)
```

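### 6. Class Weights for Imbalanced Data

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (about 90% class 0) so the example runs on its own
X_train, y_train = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)

# If the classes are imbalanced, they can be given weights
class_weights = {
    0: 1,
    1: len(y_train[y_train == 0]) / len(y_train[y_train == 1])  # weight of the minority class
}

model = RandomForestClassifier(
    n_estimators=100,
    class_weight=class_weights,  # or "balanced"
    bootstrap=True,
    max_features="sqrt",
    random_state=42
)

model.fit(X_train, y_train)
```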
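
For reference, the `"balanced"` option mentioned in the comment above weights each class by `n_samples / (n_classes * class_count)`, according to the scikit-learn documentation. The sketch below reproduces that formula in plain Python; `balanced_class_weights` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels) -> dict:
    """Weights inversely proportional to class frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 90 majority-class labels and 10 minority-class labels
y_demo = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_demo)
print(weights)
print(f"minority / majority weight ratio: {weights[1] / weights[0]:.1f}")
```

The ratio between the two weights is 9, the same ratio as the manual dictionary above would give for this class balance; only the overall scale differs.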

### 7. Random Forest statistics

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data so the example runs on its own
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    random_state=42
)

model.fit(X, y)

# The resolved number of features per split is stored on each fitted tree,
# not on the forest itself
features_per_split = model.estimators_[0].max_features_

print("Random Forest statistics:")
print("=" * 50)
print(f"Number of trees: {len(model.estimators_)}")
print()
print("For each tree:")
print(f"  - About 63% of the {len(X)} samples are used = ~{int(0.63 * len(X))} distinct samples")
print(f"  - At each split, {features_per_split} of {X.shape[1]} features are considered")
print()
print("Benefits of this approach:")
print("  1. Diversity between the trees")
print("  2. Less overfitting (regularization)")
print("  3. Out-of-Bag evaluation becomes possible")
print("  4. Stable predictions through averaging")
print()
print("Feature Importance (impurity-based, mean decrease in impurity):")
for i, importance in enumerate(model.feature_importances_):
    if importance > 0.01:
        print(f"  Feature {i}: {importance:.3f}")
```

### 8. A practical example with visualization

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train a Random Forest
model = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",
    bootstrap=True,
    max_depth=10,
    random_state=42
)

model.fit(X, y)

# Plot the feature importances
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Feature importance
feature_names = iris.feature_names
feature_importance = model.feature_importances_
indices = np.argsort(feature_importance)[::-1]

axes[0].bar(range(X.shape[1]), feature_importance[indices])
axes[0].set_xticks(range(X.shape[1]))
axes[0].set_xticklabels([feature_names[i] for i in indices], rotation=45)
axes[0].set_title("Feature Importance in Random Forest")

# OOB vs. train accuracy
# Note: very small forests can trigger a warning that some samples have no OOB estimate
train_scores = []
oob_scores = []
n_trees = np.arange(1, 51)

for n_tree in n_trees:
    temp_model = RandomForestClassifier(
        n_estimators=int(n_tree),
        max_features="sqrt",
        bootstrap=True,
        oob_score=True,
        random_state=42
    )
    temp_model.fit(X, y)
    train_scores.append(temp_model.score(X, y))
    oob_scores.append(temp_model.oob_score_)

axes[1].plot(n_trees, train_scores, label="Train Score", marker="o")
axes[1].plot(n_trees, oob_scores, label="OOB Score", marker="s")
axes[1].set_xlabel("Number of Trees")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Effect of the number of trees")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### Summary

**Random Forest draws subsamples in two ways:**

1. **Bootstrap Sampling** (for samples):
   - Sampling WITH replacement
   - Each tree gets n draws from the n original samples
   - ~63% of the samples are distinct members of the draw, ~37% are left out (OOB)

2. **Feature Subsampling** (for features):
   - Sampling WITHOUT replacement
   - A random subset of features is drawn at every split
   - Usually sqrt(n_features) for classification

**This gives you:**
- Diversity between the trees
- Less overfitting
- Stable predictions through averaging
- OOB evaluation without a separate validation set
