Как работает Prometheus?

Question

claude-haiku-4.5 · Accepted Answer

## Как работает Prometheus

Prometheus — это **система мониторинга и алертинга** для сбора метрик с приложений и инфраструктуры. Рассмотрю архитектуру и принципы работы.

### 1. Архитектура Prometheus

```
┌─────────────────────────────────────────┐
│ Applications (Python, Go, Java, etc)    │
│ /metrics endpoint (port 9090 по умолч.) │
└──────────────┬──────────────────────────┘
               │ (pull)
┌──────────────▼──────────────────────────┐
│     Prometheus Server                    │
│  ┌──────────────────────────────────┐   │
│  │ Time Series Database (TSDB)      │   │
│  │ Stores metrics with timestamps   │   │
│  └──────────────────────────────────┘   │
└─────────────┬────────────┬───────────────┘
              │            │
      ┌───────▼──┐  ┌──────▼────────┐
      │ AlertMgr │  │ Grafana (UI) │
      │ (alerts) │  │ Visualization │
      └──────────┘  └───────────────┘
```

### 2. Pull модель (главное отличие)

Прометей **сам тянет метрики** из приложений (pull), а не ждёт, когда приложение отправит данные (push):

```python
# Python приложение с prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

# Определяем метрики
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    buckets=[0.1, 0.5, 1.0, 5.0]
)

# Запускаем HTTP сервер метрик на порту 8000
start_http_server(8000)

# В обработчике запроса
from time import time

start = time()
try:
    # обработка запроса
    request_count.labels(method='GET', endpoint='/api/users').inc()
finally:
    request_duration.labels(endpoint='/api/users').observe(time() - start)
```

Прометей периодически делает GET запрос:
```
GET http://localhost:8000/metrics
```

Ответ содержит метрики в текстовом формате:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users"} 1234
http_request_duration_seconds_bucket{endpoint="/api/users",le="0.1"} 5
http_request_duration_seconds_bucket{endpoint="/api/users",le="0.5"} 45
http_request_duration_seconds_bucket{endpoint="/api/users",le="+Inf"} 100
```

### 3. Конфигурация Prometheus (prometheus.yml)

```yaml
global:
  scrape_interval: 15s      # Как часто собирать метрики
  evaluation_interval: 15s  # Как часто проверять alert rules

scrape_configs:
  - job_name: 'python-app'
    static_configs:
      - targets: ['localhost:8000']
      - targets: ['localhost:8001']
  
  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']  # postgres_exporter
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']  # системные метрики

alert_rule_files:
  - 'alert_rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

### 4. Типы метрик

#### Counter (счётчик)
Только растёт, никогда не уменьшается:
```python
from prometheus_client import Counter

errors = Counter('app_errors_total', 'Total errors', ['error_type'])
errors.labels(error_type='ValidationError').inc()
errors.labels(error_type='ConnectionError').inc(2)  # +2
```

#### Gauge (датчик)
Может расти и падать (текущее значение):
```python
from prometheus_client import Gauge

active_connections = Gauge('active_connections', 'Active WebSocket connections')
active_connections.set(42)
active_connections.inc()  # 43
active_connections.dec(5)  # 38
```

#### Histogram (гистограмма)
Записывает распределение значений:
```python
from prometheus_client import Histogram

response_time = Histogram(
    'response_time_seconds',
    'Response time',
    buckets=[0.01, 0.1, 1, 10]  # границы интервалов
)

response_time.observe(0.05)  # 50ms
response_time.observe(0.5)   # 500ms
```

#### Summary (сводка)
Похож на Histogram, но с квантилями:
```python
from prometheus_client import Summary

request_duration = Summary(
    'request_duration_seconds',
    'Request duration',
    quantiles=[0.5, 0.95, 0.99]  # median, p95, p99
)
```

### 5. PromQL (язык запросов)

Для визуализации и алертинга:

```promql
# Текущее значение
active_connections

# За последние 5 минут (increase)
increase(http_requests_total[5m])

# Rate per second (за последнюю минуту)
rate(http_requests_total[1m])

# Группировка по labels
sum by (endpoint) (http_requests_total)

# Фильтрация
http_requests_total{method="GET", endpoint=~"/api/.*"}

# Условия (для алертов)
rate(http_errors_total[5m]) > 0.05  # больше 5% ошибок
```

### 6. Алертинг

```yaml
# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.instance }} has {{ $value }}% error rate"
      
      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        annotations:
          summary: "PostgreSQL is down"
```

AlertManager отправляет уведомления в Slack, Email, PagerDuty и т.д.

### 7. Интеграция с Django/FastAPI

```python
# Django middleware
from prometheus_client import Counter, Histogram
import time

class PrometheusMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response
        self.request_count = Counter(
            'django_http_requests_total',
            'Total HTTP requests',
            ['method', 'path', 'status']
        )
        self.request_duration = Histogram(
            'django_http_request_duration_seconds',
            'HTTP request duration',
            ['method', 'path']
        )
    
    def __call__(self, request):
        start = time.time()
        response = self.get_response(request)
        duration = time.time() - start
        
        self.request_count.labels(
            method=request.method,
            path=request.path,
            status=response.status_code
        ).inc()
        
        self.request_duration.labels(
            method=request.method,
            path=request.path
        ).observe(duration)
        
        return response
```

### 8. Преимущества и недостатки

**Плюсы:**
- Pull модель безопаснее (приложение не отправляет данные наружу)
- Встроенный TSDB оптимизирован для временных рядов
- Простой текстовый формат метрик
- Отличная экосистема (Exporters, Grafana)
- PromQL мощный язык запросов

**Минусы:**
- Не подходит для глобально распределённых систем (pull везде не достанешь)
- Нет встроенной кластеризации
- Нужно управлять хранением историических данных

### 9. Когда использовать Prometheus

- ✅ Мониторинг микросервисов
- ✅ Внутренняя инфраструктура (на одной сети)
- ✅ DevOps и SRE
- ❌ Логирование (используй ELK, Loki)
- ❌ APM трассировка (используй Jaeger)
- ❌ Дешёвое облачное хранилище (используй Datadog, NewRelic)

Прометей — de facto стандарт мониторинга в Kubernetes экосистеме!

Как работает Prometheus?

Комментарии (1)

Как работает Prometheus

1. Архитектура Prometheus

2. Pull модель (главное отличие)

3. Конфигурация Prometheus (prometheus.yml)

4. Типы метрик

Counter (счётчик)

Gauge (датчик)

Histogram (гистограмма)

Summary (сводка)

5. PromQL (язык запросов)

6. Алертинг

7. Интеграция с Django/FastAPI

8. Преимущества и недостатки

9. Когда использовать Prometheus