How can I improve a product recommendation algorithm?

Question

claude-haiku-4.5 · Accepted Answer

## How to improve a product recommendation algorithm

**Recommender systems** are a critical part of the e-commerce stack and directly affect revenue. A data engineer should understand how these systems are built, where the bottlenecks are, and how to improve recommendation quality.

### Main types of recommendation algorithms

#### 1. Collaborative Filtering (CF)

```python
# User-User Similarity
# Idea: find similar users and recommend the products they liked

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Matrix: rows are users, columns are products, values are ratings (0 = not rated)
user_item_matrix = pd.DataFrame({
    'user_1': [5, 4, 0, 0, 1],
    'user_2': [5, 4, 0, 0, 1],  # same ratings as user_1
    'user_3': [0, 0, 5, 4, 5],  # a different group
    'user_4': [0, 0, 5, 4, 0],
}, index=['product_1', 'product_2', 'product_3', 'product_4', 'product_5']).T

# Compute pairwise similarity between users
similarity_matrix = cosine_similarity(user_item_matrix)
similarity_df = pd.DataFrame(
    similarity_matrix,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print(similarity_df)
# Result: user_1 and user_2 have identical rating vectors, so their similarity is 1.0

# For a given user, recommend products that the most similar users rated highly
# but that the user has not rated yet
```

**Advantages of CF:**
- Simple to implement
- Works for any product category
- Requires no knowledge about the products themselves

**Disadvantages:**
- Cold start problem (nothing can be recommended to new users, and new products with no interactions are never recommended)
- Sparsity (the rating matrix is very sparse)
- Scalability (pairwise user similarity is O(n²) in the number of users)
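
The snippet above stops at the similarity matrix. As a minimal sketch of the last step, here is one common way to turn user similarities into recommendations: score each unrated product by the similarity-weighted average of other users' ratings. The function names `cosine` and `recommend_user_based` and the dict-based data layout are assumptions made for this illustration; the toy ratings mirror the matrix above.

```python
import math

# Toy ratings: user -> {product: rating}; a missing product means "not rated".
ratings = {
    "user_1": {"product_1": 5, "product_2": 4, "product_5": 1},
    "user_2": {"product_1": 5, "product_2": 4, "product_5": 1},
    "user_3": {"product_3": 5, "product_4": 4, "product_5": 5},
    "user_4": {"product_3": 5, "product_4": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_user_based(ratings, user, top_n=3):
    """Score unrated products by the similarity-weighted ratings of other users."""
    target = ratings[user]
    weighted, sim_total = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(target, other_ratings)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for product, rating in other_ratings.items():
            if product in target:
                continue  # skip products the target user already rated
            weighted[product] = weighted.get(product, 0.0) + sim * rating
            sim_total[product] = sim_total.get(product, 0.0) + sim
    scores = {p: weighted[p] / sim_total[p] for p in weighted}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend_user_based(ratings, "user_4"))
print(recommend_user_based(ratings, "user_1"))
```

Note that user_2 contributes nothing for user_1 here: their ratings are identical, so there is nothing new to recommend. That is typical of toy data and a reminder that the neighbors need to have rated something the target user has not.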

#### 2. Content-Based Filtering

```python
# Idea: recommend products similar to a reference product, based on product attributes

import pandas as pd

products = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'category': ['electronics', 'electronics', 'books', 'books', 'electronics'],
    'brand': ['Apple', 'Samsung', 'Penguin', 'Random House', 'Apple'],
    'price': [999, 799, 15, 20, 1299],
    'rating': [4.8, 4.5, 4.2, 4.0, 4.9]
})

# If the user bought product_1 (an Apple iPhone),
# recommend similar products

def find_similar_products(user_item_id, products, similarity_threshold=0.7):
    reference = products[products['product_id'] == user_item_id].iloc[0]

    # Exclude the reference product and work on a copy,
    # so the caller's DataFrame is not mutated
    candidates = products[products['product_id'] != user_item_id].copy()

    # Compute similarity: category match plus relative price closeness
    candidates['category_match'] = (candidates['category'] == reference['category']).astype(int)
    candidates['price_similarity'] = (
        1 - (candidates['price'] - reference['price']).abs() / reference['price']
    ).clip(lower=0)  # prices above 2x the reference would otherwise go negative

    candidates['similarity'] = (
        candidates['category_match'] * 0.5 +
        candidates['price_similarity'] * 0.5
    )

    similar = candidates[candidates['similarity'] >= similarity_threshold]
    return similar.sort_values('similarity', ascending=False)

recommendations = find_similar_products(user_item_id=1, products=products)
print(recommendations)
```

**Advantages of content-based filtering:**
- Handles item cold start (a new product can be recommended as soon as its attributes are known)
- Interpretable (it is clear why an item was recommended)
- Scales well

**Disadvantages:**
- Requires good product metadata
- Does not discover non-obvious relationships
- Over-specialization (recommends only items similar to what the user already chose)

#### 3. Hybrid approach

```python
# Combine CF + content-based + other signals.
# Pseudocode: the helper functions below are placeholders. Each is assumed to
# return a pandas Series of scores indexed by product_id, on a comparable scale.

def hybrid_recommendation(user_id, n_recommendations=10):
    # Score 1: Collaborative Filtering
    cf_scores = collaborative_filtering(user_id)
    
    # Score 2: Content-Based
    cb_scores = content_based_filtering(user_id)
    
    # Score 3: Popularity (a high weight here amplifies popularity bias)
    popularity_scores = get_popularity_scores()
    
    # Score 4: Diversity (avoid monotonous lists)
    diversity_scores = calculate_diversity_bonus()
    
    # Score 5: Business rules (promotions, profit margin)
    business_rule_scores = get_business_rule_scores(user_id)
    
    # Combine with weights (they sum to 1.0)
    final_scores = (
        cf_scores * 0.35 +
        cb_scores * 0.25 +
        popularity_scores * 0.15 +
        diversity_scores * 0.15 +
        business_rule_scores * 0.10
    )
    
    # Rank and keep the top N
    recommendations = final_scores.nlargest(n_recommendations)
    return recommendations
```
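
A weighted sum like the one above only makes sense if every signal is on the same scale; otherwise raw purchase counts would drown out similarity scores in the 0..1 range. The following is a minimal sketch of one way to handle that, using min-max normalization before blending. The function names, the dict-based score format, and the example weights are assumptions for illustration, not part of the pseudocode above.

```python
def min_max_normalize(scores):
    """Rescale a {product_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 0.0 for p in scores}  # no spread, so no signal
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}

def blend(score_sources, weights, top_n=10):
    """Weighted sum of normalized scores; a product missing from a source gets 0 there."""
    combined = {}
    for name, scores in score_sources.items():
        for product, s in min_max_normalize(scores).items():
            combined[product] = combined.get(product, 0.0) + weights[name] * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

sources = {
    "cf": {"A": 4.8, "B": 3.0, "C": 1.2},            # predicted ratings, 1..5
    "content": {"A": 0.2, "B": 0.9, "D": 0.6},       # similarities, 0..1
    "popularity": {"A": 12000, "C": 300, "D": 50},   # raw purchase counts
}
weights = {"cf": 0.5, "content": 0.3, "popularity": 0.2}

top = blend(sources, weights, top_n=3)
print(top)
```

Min-max scaling is sensitive to outliers; rank-based or z-score normalization are common alternatives when one product dominates a signal.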

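### Practical approaches to improvement

#### 1. Analyze interaction data

```sql
-- Which products are frequently bought together?
WITH product_orders AS (
    -- Number of orders containing each product
    SELECT product_id, COUNT(DISTINCT order_id) AS orders_with_product
    FROM order_items
    GROUP BY product_id
),
order_totals AS (
    SELECT COUNT(DISTINCT order_id) AS total_orders
    FROM order_items
),
order_pairs AS (
    SELECT 
        oi1.product_id AS product_a,
        oi2.product_id AS product_b,
        COUNT(DISTINCT oi1.order_id) AS frequency
    FROM order_items oi1
    JOIN order_items oi2 ON oi1.order_id = oi2.order_id
    WHERE oi1.product_id < oi2.product_id  -- avoid duplicate and self pairs
    GROUP BY oi1.product_id, oi2.product_id
)

SELECT 
    p.product_a,
    p.product_b,
    p.frequency,
    -- lift = P(A and B) / (P(A) * P(B)); 1.0 forces non-integer division
    (p.frequency * 1.0 * t.total_orders)
        / (a.orders_with_product * b.orders_with_product) AS lift
FROM order_pairs p
JOIN product_orders a ON a.product_id = p.product_a
JOIN product_orders b ON b.product_id = p.product_b
CROSS JOIN order_totals t
WHERE p.frequency > 10  -- only pairs with enough evidence
ORDER BY p.product_a, p.frequency DESC;
```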
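
To make the pair statistics concrete, here is a small worked example in plain Python that computes support, confidence, and lift for product pairs. The toy orders and the function name `pair_metrics` are made up for illustration; lift is defined as P(A and B) / (P(A) · P(B)), so values above 1 mean the products co-occur more often than they would by chance.

```python
from collections import Counter
from itertools import combinations

# Each order is the set of products it contains (toy data).
orders = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"book"},
    {"book", "case"},
]

def pair_metrics(orders):
    """Support, confidence (A -> B), and lift for every co-occurring product pair."""
    n = len(orders)
    item_counts = Counter(item for order in orders for item in order)
    pair_counts = Counter(
        pair for order in orders for pair in combinations(sorted(order), 2)
    )
    metrics = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        metrics[(a, b)] = {
            "support": support,
            "confidence_a_to_b": count / item_counts[a],
            "lift": support / ((item_counts[a] / n) * (item_counts[b] / n)),
        }
    return metrics

metrics = pair_metrics(orders)
for pair, m in sorted(metrics.items(), key=lambda kv: kv[1]["lift"], reverse=True):
    print(pair, m)
```

Pairs are keyed in alphabetical order, so the phone and case pair is stored as `("case", "phone")`.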

#### 2. Market Basket Analysis

```python
# Association rules (Apriori algorithm)
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Assumption: `orders` is a DataFrame with one row per order line
# and at least the columns order_id and product_id.

# Build the matrix: orders as rows, products as columns,
# each cell counts how many lines of that order contain the product
order_product_matrix = pd.crosstab(
    orders['order_id'],
    orders['product_id']
)

# Convert to boolean (bought / not bought)
order_product_matrix = order_product_matrix > 0

# Find frequent itemsets
frequent_itemsets = apriori(
    order_product_matrix,
    min_support=0.02,  # itemset appears in at least 2% of orders
    use_colnames=True
)

# Generate rules
rules = association_rules(
    frequent_itemsets,
    metric='confidence',
    min_threshold=0.3
)

# Sort by lift (how much stronger the association is than chance)
rules_sorted = rules.sort_values('lift', ascending=False)
print(rules_sorted[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

#### 3. Matrix Factorization (Latent Factor Model)

```python
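# Use truncated Singular Value Decomposition (SVD)
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Matrix: users x products (svds needs a floating-point matrix)
user_item_sparse = sp.csr_matrix(user_item_matrix.values.astype(float))

# Factorize; k must be smaller than both matrix dimensions
k = min(50, min(user_item_sparse.shape) - 1)  # up to 50 latent factors
U, s, Vt = svds(user_item_sparse, k=k)

# Reconstruct the rating matrix
predicted_ratings = U @ np.diag(s) @ Vt

# Mask products the user has already rated so they are not recommended again
predicted_ratings[user_item_matrix.values > 0] = -np.inf

# For each user, find the products with the highest predicted ratings
for user_id in range(U.shape[0]):
    user_ratings = predicted_ratings[user_id]
    ranked = np.argsort(user_ratings)[::-1][:10]  # column indices, best first
    top_items = [i for i in ranked if np.isfinite(user_ratings[i])]
    print(f'User {user_id}: {top_items}')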
```

### Metrics for evaluating recommendation quality

```python
# Core metrics
import numpy as np

def evaluate_recommendations(recommendations, ground_truth, k=10):
    """
    recommendations: ranked list of products recommended to one user
    ground_truth: products the user actually bought or liked
    """
    truth = set(ground_truth)
    top_k = recommendations[:k]

    # 1. Precision@k: share of the top-k recommendations that were relevant
    relevant = len(set(top_k) & truth)
    precision_at_k = relevant / k

    # 2. Recall@k: share of all relevant products that we found
    recall_at_k = relevant / len(truth) if truth else 0.0

    # 3. AP@k (Average Precision): mean of precision at each rank with a hit,
    #    so relevant items ranked lower contribute less.
    #    MAP is this value averaged over all users.
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(top_k):
        if item in truth:
            hits += 1
            precision_sum += hits / (i + 1)
    ap = precision_sum / min(len(truth), k) if truth else 0.0

    # 4. NDCG@k: DCG divided by the DCG of an ideal ranking
    dcg = sum(1 / np.log2(i + 2) for i, rec in enumerate(top_k) if rec in truth)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(truth), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {
        'precision_at_k': precision_at_k,
        'recall_at_k': recall_at_k,
        'ap_at_k': ap,
        'ndcg_at_k': ndcg,
    }


def catalog_coverage(all_user_recommendations, total_products):
    # 5. Coverage: share of the catalog that is ever recommended to anyone
    all_recommended = set()
    for user_recs in all_user_recommendations:
        all_recommended.update(user_recs)
    return len(all_recommended) / total_products
```

### Improving recommendations: practical steps

#### 1. Add context

```python
# Season, time of year, trends
# Pseudocode: assumes hybrid_recommendation is extended to accept context arguments
recommendations = hybrid_recommendation(
    user_id=123,
    season='summer',  # recommend summer products
    trending=True,    # take trends into account
    user_preferences=user_profile  # user profile (see the Personalization step)
)
```
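
One simple way to apply context is as a re-ranking step on top of existing scores. The sketch below multiplies the score of in-season products by a boost factor; the function name `apply_context_boost`, the `product_seasons` mapping, and the factor 1.2 are all assumptions chosen for illustration and would need tuning (ideally via A/B tests).

```python
def apply_context_boost(scores, product_seasons, season, boost=1.2):
    """Multiply the score of in-season products by `boost` and re-rank."""
    boosted = {
        product: score * (boost if product_seasons.get(product) == season else 1.0)
        for product, score in scores.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)

base_scores = {"sunscreen": 0.50, "ski_gloves": 0.55, "notebook": 0.40}
product_seasons = {"sunscreen": "summer", "ski_gloves": "winter"}

ranked = apply_context_boost(base_scores, product_seasons, season="summer")
print(ranked)
```

In this toy data the summer boost moves sunscreen above ski gloves, while products with no season tag keep their original score.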

#### 2. A/B testing

```python
# Test different versions
# Pseudocode: algorithm_v1, algorithm_v2 and compare_metrics are placeholders
if user_id % 2 == 0:
    recommendations = algorithm_v1(user_id)  # old algorithm (control)
else:
    recommendations = algorithm_v2(user_id)  # new algorithm (treatment)

# Metrics: CTR, conversion rate, average order value
compare_metrics(algorithm_v1_metrics, algorithm_v2_metrics)
```
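
Splitting on `user_id % 2` puts the same users in the same group for every experiment, which can carry effects over between tests. A common alternative is to hash the user ID together with an experiment name. The sketch below shows that, plus a two-proportion z-test for comparing conversion rates; the function names, the experiment name, and the sample numbers are assumptions for illustration.

```python
import hashlib
import math

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministic bucketing: same user + experiment always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

variant = assign_variant(user_id=123, experiment="recs_v2_rollout")
print(variant)

# Made-up numbers: 100/1000 conversions in control vs 150/1000 in treatment
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The z-test assumes independent users and reasonably large samples; decide the sample size and the primary metric before starting the experiment rather than after looking at the data.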

#### 3. Personalization

```python
# Build a user profile
user_profile = {
    'favorite_categories': ['electronics', 'books'],
    'price_range': (100, 1000),
    'preferred_brands': ['Apple', 'Samsung'],
    'reviews_frequency': 'often',  # leaves reviews often
    'return_rate': 0.05  # low return rate
}

# Use the profile when recommending
# (personalized_recommendation is a placeholder for your own function)
recommendations = personalized_recommendation(
    user_id,
    user_profile,
    time_of_day='evening',
    device='mobile'
)
```

#### 4. Real-time updating

```python
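# Update recommendations as events arrive
import json
from kafka import KafkaConsumer

# Assumption: events are JSON-encoded; adjust the deserializer to your format
consumer = KafkaConsumer(
    'user-events',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)

for message in consumer:
    event = message.value  # decoded payload; the message itself is a record wrapper
    user_id = event['user_id']
    action = event['action']  # 'view', 'click', 'purchase'
    product_id = event['product_id']
    
    # Update user embeddings in real time (placeholder function)
    update_user_embedding(user_id, product_id, action)
    
    # Invalidate the cache so recommendations are recomputed on the next request
    invalidate_cache(user_id)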
```
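
The answer does not define `update_user_embedding`. One simple possibility, sketched below under stated assumptions, is an exponential moving average that nudges the user vector toward the embedding of the product they interacted with, with stronger actions moving it further. The function name `ema_update`, the action weights, and the list-based vectors are hypothetical choices for this illustration.

```python
# Hypothetical step sizes: stronger signals move the user vector further.
ACTION_WEIGHTS = {"view": 0.05, "click": 0.1, "purchase": 0.3}

def ema_update(user_vec, item_vec, action):
    """Move the user embedding toward the item embedding by an action-dependent step."""
    alpha = ACTION_WEIGHTS.get(action, 0.0)  # unknown actions leave the vector unchanged
    return [(1 - alpha) * u + alpha * v for u, v in zip(user_vec, item_vec)]

user_vec = [0.0, 0.0]
item_vec = [1.0, 1.0]

after_view = ema_update(user_vec, item_vec, "view")
after_purchase = ema_update(user_vec, item_vec, "purchase")
print(after_view, after_purchase)
```

Incremental updates like this tend to drift over time, so they are usually combined with periodic full retraining of the embeddings.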

### Conclusions

- ✅ **A hybrid approach** (CF + content + business rules) usually gives the best results
- ✅ **A/B test** changes before a full production rollout
- ✅ **Track metrics**: precision, recall, coverage, diversity
- ✅ **Take context into account**: time, device, season
- ✅ **Use feedback**: user actions improve the model
- ✅ **Update in real time** via Kafka/Redis
- ✅ **Regularly recompute** the factorization and embeddings
- ✅ **Counter bias**: avoid over-specialization, where users see only near-identical products
