Как организовывался мониторинг системы?

Question

claude-haiku-4.5 · Accepted Answer

## Как организовывается мониторинг системы Мониторинг — критическая часть production систем. За 10+ лет работы я создавал мониторинг для систем с миллионами запросов в день. Расскажу о своем подходе. ### 1. Стратегия мониторинга ```python """ Уровни мониторинга (от критического к деталям): 1. ALERTS — критичные метрики (красные флаги) - Сервис недоступен - Ошибки в production - БД недоступна - High response time (>1s) 2. METRICS — основные показатели - Request rate (RPS) - Error rate - Response time (p50, p95, p99) - Database queries - Cache hit rate 3. LOGS — детальная информация - Request/response трассировка - Exception стэки - Business events 4. TRACES — full request journey - Каждая функция, каждый микросервис - Временные затраты на каждый шаг """ ``` ### 2. Система логирования ```python import logging import json from datetime import datetime # Структурированное логирование class JSONFormatter(logging.Formatter): def format(self, record): log_data = { 'timestamp': datetime.utcnow().isoformat(), 'level': record.levelname, 'logger': record.name, 'message': record.getMessage(), 'path': f'{record.pathname}:{record.lineno}', } # Добавляем exception если есть if record.exc_info: log_data['exception'] = self.formatException(record.exc_info) return json.dumps(log_data) # Настройка логирования logger = logging.getLogger('myapp') logger.setLevel(logging.INFO) handler = logging.StreamHandler() handler.setFormatter(JSONFormatter()) logger.addHandler(handler) # Использование logger.info('User logged in', extra={ 'user_id': 123, 'ip': '192.168.1.1', 'timestamp': datetime.utcnow().isoformat() }) # Вывод в системе логирования будет: # {"timestamp": "2024-01-15T...", "level": "INFO", "message": "User logged in", ...} ``` ### 3. Метрики с Prometheus ```python from prometheus_client import Counter, Histogram, Gauge, start_http_server import time # Счетчик запросов request_count = Counter( 'http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'] ) # Время ответа response_time = Histogram( 'http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'], buckets=(0.1, 0.5, 1.0, 2.0, 5.0) # Распределение по интервалам ) # Текущий размер очереди queue_size = Gauge( 'queue_size', 'Current size of processing queue' ) # Использование в middleware from fastapi import FastAPI from starlette.middleware.base import BaseHTTPMiddleware import time app = FastAPI() class MetricsMiddleware(BaseHTTPMiddleware): async def dispatch(self, request, call_next): start = time.time() response = await call_next(request) # Записываем метрики duration = time.time() - start request_count.labels( method=request.method, endpoint=request.url.path, status=response.status_code ).inc() response_time.labels( method=request.method, endpoint=request.url.path ).observe(duration) return response app.add_middleware(MetricsMiddleware) # Prometheus будет скрейпить /metrics endpoint автоматически start_http_server(8000) # Метрики доступны на :8000/metrics ``` ### 4. Трассировка (Tracing) с Jaeger ```python from jaeger_client import Config from opentelemetry import trace, metrics from opentelemetry.exporter.jaeger.thrift import JaegerExporter from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import SimpleSpanProcessor # Инициализация Jaeger jaeger_exporter = JaegerExporter( agent_host_name="localhost", agent_port=6831, ) trace.set_tracer_provider(TracerProvider()) trace.get_tracer_provider().add_span_processor( SimpleSpanProcessor(jaeger_exporter) ) tracer = trace.get_tracer(__name__) # Использование spans в коде def process_request(request_id): with tracer.start_as_current_span('process_request') as span: span.set_attribute('request_id', request_id) # Вложенный span with tracer.start_as_current_span('fetch_user_data') as child_span: child_span.set_attribute('operation', 'database_query') user = fetch_user(request_id) # Другой вложенный span with tracer.start_as_current_span('validate_data'): validate(user) return user # Jaeger визуализирует цепь вызовов и показывает: # - Общее время: 100ms # - fetch_user_data: 50ms # - validate_data: 10ms # - другое: 40ms ``` ### 5. Алертинг ```python from alertmanager_client.client import AlertManagerClient class AlertManager: def __init__(self): self.client = AlertManagerClient(['http://alertmanager:9093']) def send_alert(self, alert_name, severity='warning', description=''): """Отправляет alert в AlertManager""" alert = { 'labels': { 'alertname': alert_name, 'severity': severity, 'service': 'myapp', }, 'annotations': { 'description': description, 'dashboard': 'https://grafana.example.com/...', } } self.client.send_alert(alert) alert_manager = AlertManager() # Примеры алертов def check_system_health(): # Проверяем здоровье if not database.is_alive(): alert_manager.send_alert( 'DatabaseDown', severity='critical', description='Database connection failed' ) if error_rate > 0.05: # 5% ошибок alert_manager.send_alert( 'HighErrorRate', severity='warning', description=f'Error rate is {error_rate:.1%}' ) if response_time.p99 > 5.0: # p99 latency > 5s alert_manager.send_alert( 'HighLatency', severity='warning', description=f'p99 latency is {response_time.p99:.2f}s' ) # Запускаем проверки периодически import schedule import time schedule.every(1).minute.do(check_system_health) while True: schedule.run_pending() time.sleep(1) ``` ### 6. Dashboard в Grafana ```python """ Типичный dashboard содержит: 1. RED метрики (Google's key metrics): - Rate — requests per second - Errors — error count/percentage - Duration — latency (p50, p95, p99) 2. USE метрики (для ресурсов): - Utilization — % используется ресурса - Saturation — очереди ожидания - Errors — ошибки доступа 3. Business метрики: - User count - Revenue - Conversion rate - API calls Примеры PromQL запросов: # Requests per second rate(http_requests_total[5m]) # Error rate rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) # Response time percentiles histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # Database connection pool utilization db_connection_pool_active / db_connection_pool_max """ ``` ### 7. Log Aggregation (ELK Stack) ```python from pythonjsonlogger import jsonlogger import logging from pythonjson.ext.elasticsearch import ElasticsearchHandler # Логирование в Elasticsearch через Logstash/Filebeat logger = logging.getLogger('myapp') # JSON логирование для парсинга Logstash json_handler = logging.StreamHandler() formatter = jsonlogger.JsonFormatter() json_handler.setFormatter(formatter) logger.addHandler(json_handler) # Использование logger.info('API request', extra={ 'user_id': 123, 'endpoint': '/api/users', 'method': 'GET', 'status': 200, 'duration_ms': 45, 'request_id': 'abc-123-def-456' # для correlation }) # Kibana позволяет: # - Искать логи по всем полям # - Фильтровать по request_id # - Анализировать тренды # - Создавать алерты ``` ### 8. Health Checks ```python from fastapi import FastAPI, HTTPException app = FastAPI() class HealthChecker: def __init__(self): self.checks = {} def register(self, name, check_func): """Регистрирует функцию проверки""" self.checks[name] = check_func async def check_all(self): """Запускает все проверки""" results = {} for name, check_func in self.checks.items(): try: results[name] = { 'status': 'ok', 'message': await check_func() if asyncio.iscoroutinefunction(check_func) else check_func() } except Exception as e: results[name] = { 'status': 'failed', 'message': str(e) } return results health_checker = HealthChecker() # Регистрируем проверки async def check_database(): await db.execute('SELECT 1') return 'Database connection OK' def check_cache(): return 'Cache OK' if redis.ping() else 'Cache failed' health_checker.register('database', check_database) health_checker.register('cache', check_cache) @app.get('/health') async def health(): results = await health_checker.check_all() # Если хотя бы один failed — возвращаем 503 all_ok = all(r['status'] == 'ok' for r in results.values()) status_code = 200 if all_ok else 503 return results, status_code # LoadBalancer/Kubernetes используют /health для # определения, жив ли сервис ``` ### 9. SLA и SLO мониторинг ```python # SLA (Service Level Agreement) — договор # SLO (Service Level Objective) — цель class SLOMonitor: def __init__(self): self.metrics = { 'availability': 0.999, # 99.9% uptime 'latency_p99': 0.5, # p99 < 500ms 'error_rate': 0.001, # < 0.1% errors } def check_slo(self): """Проверяет, выполняются ли SLO""" current_availability = self._get_availability() current_latency = self._get_latency_p99() current_error_rate = self._get_error_rate() status = { 'availability': { 'target': f"{self.metrics['availability']:.1%}", 'current': f"{current_availability:.1%}", 'ok': current_availability >= self.metrics['availability'] }, 'latency_p99': { 'target': f"{self.metrics['latency_p99']}s", 'current': f"{current_latency:.3f}s", 'ok': current_latency <= self.metrics['latency_p99'] }, 'error_rate': { 'target': f"{self.metrics['error_rate']:.2%}", 'current': f"{current_error_rate:.2%}", 'ok': current_error_rate <= self.metrics['error_rate'] } } return status def _get_availability(self): # Запрашиваем из Prometheus pass def _get_latency_p99(self): pass def _get_error_rate(self): pass ``` ### 10. Мой стандартный стэк мониторинга ```python """ Проduction Monitoring Stack: ┌─────────────────────────────────────────────┐ │ Application Code │ ├─────────────────────────────────────────────┤ │ Logging (Python logging + structlog) │ → Логи │ Metrics (prometheus_client) │ → Метрики │ Tracing (OpenTelemetry) │ → Трассировка ├─────────────────────────────────────────────┤ │ Filebeat/Fluentd (агент сбора) │ ├─────────────────────────────────────────────┤ │ ELK Stack: │ │ - Elasticsearch (хранилище логов) │ │ - Logstash (обработка логов) │ │ - Kibana (визуализация логов) │ ├─────────────────────────────────────────────┤ │ Prometheus (временные ряды метрик) │ │ Grafana (дашборды) │ │ AlertManager (алертинг) │ ├─────────────────────────────────────────────┤ │ Jaeger (дистрибьютированная трассировка) │ ├─────────────────────────────────────────────┤ │ PagerDuty/Slack (уведомления) │ └─────────────────────────────────────────────┘ """ ``` ### 11. Мой Чеклист Мониторинга ```python """ Перед production deploym проверяю: [ ] Все критичные ошибки логируются [ ] Есть health check endpoints [ ] Метрики для основных операций [ ] Алерты настроены для: [ ] Database availability [ ] High error rate (>5%) [ ] High latency (p99 > 1s) [ ] High CPU/Memory usage [ ] Disk space low (<10% free) [ ] Логи структурированы (JSON) [ ] Request ID для трассировки [ ] Dashboard создан [ ] Runbook для типичных инцидентов [ ] SLO определены и мониторятся [ ] Retention политика для логов (30 дней) [ ] Backup планы """ ``` ### Заключение Мониторинг production систем — это целая наука. Основные принципы: 1. **Three Pillars**: Logs, Metrics, Traces 2. **RED метрики**: Rate, Errors, Duration 3. **Структурированное логирование**: JSON с контекстом 4. **Проактивные алерты**: До того как пользователи заметят 5. **Dashboards**: Видимость системы в реальном времени 6. **SLO мониторинг**: Гарантируем качество сервиса 7. **Automation**: Alerting → Slack → PagerDuty → Engineer Без мониторинга невозможно надежно 運運ать production систему.

Как организовывался мониторинг системы?

Комментарии (1)

Как организовывается мониторинг системы

1. Стратегия мониторинга

2. Система логирования

3. Метрики с Prometheus

4. Трассировка (Tracing) с Jaeger

5. Алертинг

6. Dashboard в Grafana

7. Log Aggregation (ELK Stack)

8. Health Checks

9. SLA и SLO мониторинг

10. Мой стандартный стэк мониторинга

11. Мой Чеклист Мониторинга

Заключение