How do I use a boosting model with imbalanced classes?

Question

claude-haiku-4.5 · Accepted Answer

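## Using boosting models with imbalanced classes

Class imbalance is a common problem in real-world projects. The major gradient boosting libraries (XGBoost, LightGBM, CatBoost) expose built-in parameters for handling it.

### 1. Built-in boosting mechanisms

**XGBoost with class weighting**

XGBoost offers two main ways to handle imbalance: the `scale_pos_weight` parameter and per-sample weights passed to `fit`. Use one or the other, since applying both weights the positive class twice.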

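```python
import numpy as np
from xgboost import XGBClassifier

# Ratio of negatives to positives, used to up-weight the positive (minority) class
n_negative = (y_train == 0).sum()
n_positive = (y_train == 1).sum()
scale_pos_weight = n_negative / n_positive

# Option 1: scale_pos_weight
model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    max_depth=5,
    learning_rate=0.1,
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)

# Option 2: sample_weight (on a model without scale_pos_weight,
# otherwise the positive class is weighted twice)
weights = np.where(y_train == 0, 1.0, n_negative / n_positive)
model_sw = XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, random_state=42)
model_sw.fit(X_train, y_train, sample_weight=weights)
```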
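To make the effect of this weighting concrete, here is a minimal sketch on a hypothetical toy label array (the `toy_*` names are illustrative, not part of the answer above). It checks that weighting each positive by the negative-to-positive ratio gives both classes the same total weight.

```python
import numpy as np

# Hypothetical toy labels: 95 negatives, 5 positives (1:19 imbalance)
y_toy = np.array([0] * 95 + [1] * 5)

toy_n_negative = int((y_toy == 0).sum())
toy_n_positive = int((y_toy == 1).sum())
toy_scale_pos_weight = toy_n_negative / toy_n_positive  # 95 / 5

# Per-sample view of the same idea: 1 for negatives, the ratio for positives
toy_weights = np.where(y_toy == 1, toy_scale_pos_weight, 1.0)

total_neg_weight = float(toy_weights[y_toy == 0].sum())
total_pos_weight = float(toy_weights[y_toy == 1].sum())

print(f"scale_pos_weight = {toy_scale_pos_weight}")
print(f"total weight: negatives = {total_neg_weight}, positives = {total_pos_weight}")
```

With equal total weight, errors on the rare class contribute about as much to the loss as errors on the common class, which is the intent of the ratio.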

**LightGBM with balancing**

```python
from lightgbm import LGBMClassifier

# LightGBM accepts either scale_pos_weight or is_unbalance=True, not both.
# scale_pos_weight gives explicit control over the ratio.
model = LGBMClassifier(
    scale_pos_weight=scale_pos_weight,
    max_depth=5,
    learning_rate=0.1,
    num_leaves=31,
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)
```

**CatBoost with built-in support**

```python
from catboost import CatBoostClassifier

# "Balanced" weights each class inversely to its frequency
model = CatBoostClassifier(
    auto_class_weights="Balanced",
    iterations=100,
    depth=5,
    learning_rate=0.1,
    random_state=42,
    verbose=False
)

model.fit(X_train, y_train)
```

### 2. Advanced techniques

**Stratified cross-validation**

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

scores = cross_val_score(
    model, 
    X_train, 
    y_train, 
    cv=skf, 
    scoring="roc_auc"
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
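Why stratification matters can be seen on a small example. The sketch below uses hypothetical toy data (`X_toy`, `y_toy` and the helper `positives_per_fold` are illustrative names) and counts the positives that land in each test fold with and without stratification.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data: 90 negatives followed by 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count positive labels in the test part of every fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

stratified_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
plain_counts = positives_per_fold(KFold(n_splits=5, shuffle=False))

print("StratifiedKFold:", stratified_counts)
print("KFold (no shuffle):", plain_counts)
```

On sorted labels like these, the plain split leaves most folds without any positives, and ROC-AUC is undefined on a fold that contains a single class.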

**Combining with resampling methods**

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.7, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
    ("xgb", XGBClassifier(max_depth=5, n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)
```
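For float values, `sampling_strategy` is the target minority-to-majority ratio after the given step. The sketch below works through the arithmetic for the two ratios used above on hypothetical class counts. The helper `resampled_counts` is illustrative and only approximates the library's behaviour; exact rounding may differ by a sample.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Approximate class counts after over-sampling, then under-sampling.

    Both ratios mean minority / majority after the respective step.
    """
    # Over-sampling adds minority samples until minority / majority ~= over_ratio
    n_minority_after = max(n_minority, round(n_majority * over_ratio))
    # Under-sampling drops majority samples until minority / majority ~= under_ratio
    n_majority_after = min(n_majority, round(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after

# Hypothetical starting point: 1000 negatives, 50 positives (1:20)
n_maj, n_min = resampled_counts(1000, 50, over_ratio=0.7, under_ratio=0.8)
print(f"majority: {n_maj}, minority: {n_min}, ratio: {n_min / n_maj:.2f}")
```

Under these assumptions most of the minority class ends up synthetic (650 of 700 samples), which is worth keeping in mind when judging how aggressive the ratios are.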

### 3. Choosing evaluation metrics

Accuracy is misleading on imbalanced data: a model that always predicts the majority class scores high while finding no positives. Use these instead:

```python
from sklearn.metrics import (
    roc_auc_score, 
    f1_score, 
    precision_recall_curve,
    average_precision_score,
    confusion_matrix
)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# 1. ROC-AUC (threshold-independent; can look optimistic under strong imbalance)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC: {roc_auc:.3f}")

# 2. F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1-score: {f1:.3f}")

# 3. Average Precision
ap = average_precision_score(y_test, y_pred_proba)
print(f"Average Precision: {ap:.3f}")

# 4. Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True Negatives: {tn}, False Positives: {fp}")
print(f"False Negatives: {fn}, True Positives: {tp}")

# 5. Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
```
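The problem with accuracy is easy to reproduce. This minimal sketch uses hypothetical toy labels (1% positives) and a "model" that always predicts the majority class; all names are illustrative.

```python
import numpy as np

# Hypothetical toy test set: 990 negatives, 10 positives
y_true_toy = np.array([0] * 990 + [1] * 10)

# A degenerate classifier that always predicts the majority class
y_pred_toy = np.zeros_like(y_true_toy)

toy_accuracy = float((y_pred_toy == y_true_toy).mean())

true_positives = int(((y_pred_toy == 1) & (y_true_toy == 1)).sum())
false_negatives = int(((y_pred_toy == 0) & (y_true_toy == 1)).sum())
toy_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {toy_accuracy:.2f}, recall = {toy_recall:.2f}")
```

The accuracy looks excellent even though the classifier never detects a single positive, which is why recall, F1, and precision-recall based metrics are more informative here.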

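### 4. Full pipeline example

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a test set, then carve a validation set out of the remaining data
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.25,
    stratify=y_trainval,
    random_state=42
)

# Fit on the training part only, weighting the positive class
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42)
model.fit(X_train, y_train)

# Find the threshold that maximizes F1 on the validation set
y_val_proba = model.predict_proba(X_val)[:, 1]

best_f1 = 0.0
best_threshold = 0.5

for threshold in np.linspace(0.01, 0.99, 99):
    y_pred_custom = (y_val_proba >= threshold).astype(int)
    f1 = f1_score(y_val, y_pred_custom)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

print(f"Optimal threshold: {best_threshold:.3f}, F1-score: {best_f1:.3f}")

# Apply the chosen threshold to the test set
y_pred_final = (model.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
```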
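As an alternative to scanning a fixed grid, the candidate thresholds can be taken directly from `precision_recall_curve`, which evaluates every distinct score. The sketch below is one possible way to do it; the helper `best_f1_threshold` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1 over the thresholds of the PR curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no matching threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Guard against 0/0 where both precision and recall are zero
    f1_values = np.divide(
        2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0
    )
    best = int(np.argmax(f1_values))
    return float(thresholds[best]), float(f1_values[best])

# Hypothetical validation labels and predicted probabilities
y_val_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

toy_threshold, toy_f1 = best_f1_threshold(y_val_toy, scores_toy)
print(f"threshold = {toy_threshold:.2f}, F1 = {toy_f1:.3f}")
```

As with the grid search, the threshold should be chosen on validation data and only then applied to the test set.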

### 5. Practical recommendations

**Choosing a strategy by degree of imbalance (rough guide):**

- **Mild (1:5)**: use class weighting in XGBoost
- **Moderate (1:20)**: consider adding SMOTE alongside weighting
- **Severe (1:100+)**: consider combining SMOTE, undersampling, and weighting

**Key tips:**

1. **Always check the class distribution** in train/val/test
2. **Use stratified cross-validation** for a reliable estimate
3. **Choose metrics to fit the task**: ROC-AUC is a common threshold-independent default, F1 is practical once a threshold is fixed
4. **Tune the probability threshold** to match business requirements
5. **Combine methods where validation supports it**: boosting with weighting and resampling can outperform any one of them alone, but verify the gain on held-out data
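The first tip can be turned into a quick check. This is a minimal sketch with a hypothetical helper `positive_rates` and toy label arrays; in practice the real `y_train`, `y_val`, and `y_test` would be passed instead.

```python
import numpy as np

def positive_rates(**splits):
    """Return the share of positive labels in each named split."""
    return {
        name: float(np.mean(np.asarray(labels) == 1))
        for name, labels in splits.items()
    }

# Hypothetical label arrays; the test split accidentally contains no positives
y_train_toy = np.array([0] * 95 + [1] * 5)
y_val_toy = np.array([0] * 19 + [1] * 1)
y_test_toy = np.array([0] * 20)

rates = positive_rates(train=y_train_toy, val=y_val_toy, test=y_test_toy)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} positives")

# Splits without any positives cannot be used to measure recall or ROC-AUC
splits_without_positives = [name for name, rate in rates.items() if rate == 0.0]
print("splits without positives:", splits_without_positives)
```

A large gap between the rates, or a split with no positives at all, usually means the split was not stratified.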
