Setting up monitoring with Prometheus and Grafana

Question

## Problem statement

Set up a monitoring system for a web application running in Kubernetes:

## Requirements

1. Deploy Prometheus to collect metrics
2. Configure a ServiceMonitor for automatic service discovery
3. Add Node Exporter for host metrics
4. Deploy Grafana with a dashboard
5. Configure Alertmanager with the following rules:
   - CPU usage > 80% for 5 minutes
   - Pod restart count > 3 per hour
   - HTTP 5xx rate > 1% of total traffic

## Task

- Write the manifests for Prometheus Operator
- Create a PrometheusRule for the alerts
- Describe the structure of the Alertmanager config

## Questions

- How does Prometheus differ from Zabbix?
- How do you scale Prometheus for a large cluster?
- What is PromQL? Give example queries.

claude-haiku-4.5 · Accepted Answer

## Solution

### 1. Installing the Prometheus stack (kube-prometheus-stack)

```bash
# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace monitoring

# Install the Prometheus stack
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml

# Verify
kubectl get all -n monitoring
```

### 2. values.yaml for the Helm chart

```yaml
# Prometheus
prometheus:
  enabled: true
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Service discovery: select ServiceMonitors in all namespaces
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}

    # Pod monitoring
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}

    # Without these flags the chart replaces an empty selector with a
    # "release: <release name>" label match, so the unlabeled
    # ServiceMonitor and PrometheusRule objects below would be ignored
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

  # Ingress
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.example.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.example.com

# Grafana
grafana:
  enabled: true
  # Example value only; in production reference an existing Secret
  # (grafana.admin.existingSecret) instead of a plain-text password
  adminPassword: SecurePassword123

  # Persistence
  persistence:
    enabled: true
    size: 10Gi

  # Ingress
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com

  # The chart already provisions Prometheus as the default data source,
  # so no extra "datasources" block is needed here (a second default
  # data source would conflict with it)

# Alertmanager
alertmanager:
  enabled: true

  # Ingress
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alertmanager.example.com

# Node Exporter: the toggle lives under nodeExporter,
# sub-chart settings under prometheus-node-exporter
nodeExporter:
  enabled: true

prometheus-node-exporter:
  rbac:
    pspEnabled: false
  tolerations:
    - effect: NoSchedule
      operator: Exists

# Kube State Metrics
kubeStateMetrics:
  enabled: true

# Prometheus Operator (the CRDs ship with the chart; the old
# manageCrds flag is not needed in current chart versions)
prometheusOperator:
  enabled: true
```

### 3. ServiceMonitor for auto-discovery

```yaml
---
# ServiceMonitor for the application
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: default
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics        # refers to the Service port name
      interval: 30s
      path: /metrics
      scrapeTimeout: 10s
---
# Service for the application
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
  namespace: default
  labels:
    app: my-app
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: my-app
---
# Deployment that exposes metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      # prometheus.io/* annotations are not needed here: Prometheus
      # Operator discovers targets through the ServiceMonitor above
    spec:
      containers:
        - name: app
          image: my-app:1.0
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
```

### 4. Node Exporter for host metrics

kube-prometheus-stack already deploys Node Exporter (see the `nodeExporter` section of values.yaml above). Apply the standalone manifests below only if the chart's Node Exporter is disabled; otherwise two DaemonSets with `hostNetwork: true` will compete for host port 9100.

```yaml
# Node Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.6.0
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
          ports:
            - name: metrics
              containerPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
---
# Service for Node Exporter
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  type: ClusterIP
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
  selector:
    app: node-exporter
---
# ServiceMonitor for Node Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
    - port: metrics
      interval: 30s
```

### 5. PrometheusRule with alerts

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: application.rules
      interval: 30s
      rules:
        # Alert 1: CPU usage > 80% for 5 minutes
        - alert: HighCPUUsage
          expr: |
            (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
          for: 5m
          labels:
            severity: warning
            component: node
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "CPU usage is {{ $value }}% on node {{ $labels.instance }}"

        # Alert 2: Pod restart count > 3 per hour
        # increase() counts restarts over the window; rate() would give
        # a per-second value that never exceeds 3
        - alert: PodRestartingTooOften
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 5m
          labels:
            severity: critical
            component: kubernetes
          annotations:
            summary: "Pod {{ $labels.pod }} restarting too often"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last hour"

        # Alert 3: HTTP 5xx rate > 1%
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
              /
              sum(rate(http_requests_total[5m])) by (job)
            ) > 0.01
          for: 5m
          labels:
            severity: critical
            component: http
          annotations:
            summary: "High HTTP error rate for {{ $labels.job }}"
            description: "5xx error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

        # Alert 4: Memory usage
        - alert: HighMemoryUsage
          expr: |
            (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
            description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

        # Alert 5: Pod pending
        - alert: PodPending
          expr: |
            kube_pod_status_phase{phase="Pending"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} pending for 10 minutes"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in Pending state"

        # Alert 6: Service down
        # "apiserver" is the job label kube-prometheus-stack assigns;
        # adjust it if your scrape job is named differently
        - alert: ServiceDown
          expr: |
            up{job="apiserver"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.job }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} is not reachable"
```

### 6. Alertmanager configuration

With kube-prometheus-stack the Alertmanager configuration is passed through `alertmanager.config` in values.yaml, and the Operator stores it in a Secret. A standalone ConfigMap is not picked up by the Operator, so the configuration is shown here as a values fragment.

```yaml
# values.yaml fragment
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      # SMTP for email notifications
      smtp_smarthost: "smtp.gmail.com:587"
      smtp_auth_username: "alerts@example.com"
      # Alertmanager does not expand environment variables: substitute
      # the ${...} placeholders before deploying (for example with envsubst)
      smtp_auth_password: "${SMTP_PASSWORD}"
      smtp_from: "alerts@example.com"

    # Notification templates
    templates:
      - '/etc/alertmanager/config/*.tmpl'

    # Root route
    route:
      receiver: "default-receiver"
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s        # wait 30 s before the first notification of a group
      group_interval: 5m     # wait 5 min before notifying about new alerts in a group
      repeat_interval: 4h    # repeat an unresolved notification every 4 hours

      # Child routes
      routes:
        # Critical alerts
        - match:
            severity: critical
          receiver: critical-receiver
          group_wait: 0s
          group_interval: 1m
          repeat_interval: 1h
          continue: true

        # Pod alerts
        - match:
            component: kubernetes
          receiver: kubernetes-receiver
          group_wait: 10s

        # HTTP alerts
        - match:
            component: http
          receiver: http-receiver
          group_wait: 5s

    # Receivers (notification destinations)
    receivers:
      # Default receiver: email
      - name: "default-receiver"
        email_configs:
          - to: "devops@example.com"
            from: "alerts@example.com"
            headers:
              Subject: "[Alert] {{ .GroupLabels.alertname }}"

      # Critical receiver: Slack + PagerDuty + email
      - name: "critical-receiver"
        slack_configs:
          - api_url: "${SLACK_WEBHOOK_URL}"
            channel: "#critical-alerts"
            title: "Critical Alert"
            text: |
              {{ range .Alerts.Firing }}
              *Alert:* {{ .Labels.alertname }}
              *Severity:* {{ .Labels.severity }}
              {{ .Annotations.summary }}
              {{ .Annotations.description }}
              {{ end }}
        pagerduty_configs:
          - service_key: "${PAGERDUTY_SERVICE_KEY}"
            description: "{{ .GroupLabels.alertname }}"
        email_configs:
          - to: "oncall@example.com"

      # Kubernetes receiver: Slack
      - name: "kubernetes-receiver"
        slack_configs:
          - api_url: "${SLACK_WEBHOOK_URL}"
            channel: "#kubernetes-alerts"

      # HTTP receiver: Slack
      - name: "http-receiver"
        slack_configs:
          - api_url: "${SLACK_WEBHOOK_URL}"
            channel: "#http-alerts"

    # Inhibition rules (alert suppression)
    inhibit_rules:
      # Do not send a warning when a matching critical alert is firing
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ['alertname', 'cluster', 'service']

      # Do not send node alerts when the node itself is down
      - source_match:
          alertname: "NodeDown"
        target_match:
          component: "node"
        equal: ['instance']
```

### 7. Example Grafana dashboard

```json
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "panels": [
      {
        "title": "CPU Usage by Node",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Pod Restart Count",
        "targets": [
          {
            "expr": "sum by (namespace, pod) (kube_pod_container_status_restarts_total)"
          }
        ],
        "type": "table"
      },
      {
        "title": "HTTP Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[1m])) by (status)"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}
```

### 8. How does Prometheus differ from Zabbix?
#### Prometheus

- **Model**: Pull-based (Prometheus scrapes metrics from targets over HTTP)
- **Format**: Simple text-based exposition format
- **Data types**: Numeric time series
- **Query language**: PromQL (purpose-built)
- **Storage**: Local time-series database (TSDB) embedded in the server
- **Scalability**: A single server scales vertically; horizontal scaling requires sharding, federation, or an add-on such as Thanos or Mimir
- **Kubernetes**: Native service discovery; ServiceMonitor and related CRDs come with Prometheus Operator
- **Best for**: Microservices, Kubernetes, cloud-native applications
- **Cost**: Open source
- **Complexity**: Easier to get started

#### Zabbix

- **Model**: Agent-based, with passive checks (the server polls the agent) and active checks (the agent pushes data)
- **Format**: Zabbix's own protocol over TCP
- **Data types**: Numeric values, strings, logs
- **Query language**: Simple trigger expressions
- **Storage**: External SQL database (PostgreSQL/MySQL)
- **Scalability**: Distributed collection through Zabbix Proxy
- **Kubernetes**: Requires additional configuration
- **Best for**: Traditional infrastructure, VM-based environments
- **Cost**: Open source, with optional commercial support
- **Complexity**: Harder to get started

#### Comparison

| Aspect | Prometheus | Zabbix |
|--------|-----------|--------|
| Architecture | Pull | Mostly push (active agent); polling also supported |
| Query language | PromQL (powerful) | Simple expressions |
| Kubernetes | ✅ Native | ⚠️ Requires extra work |
| Agent | None (exporters) | Usually required |
| Scaling | Vertical (horizontal with add-ons) | Distributed via proxies |
| Learning curve | Low | Medium |
| For microservices | ✅ Excellent | ⚠️ Weak |
| For infrastructure | ⚠️ Needs exporters | ✅ Excellent |

### 9. Scaling Prometheus for a large cluster

#### The problem with a single Prometheus

On a cluster with 1000+ nodes and 10,000+ pods:

- A single Prometheus cannot ingest the full volume of metrics
- It needs too much memory (hundreds of GB)
- Queries become slow

#### Solution 1: Prometheus federation

```yaml
# Distributed architecture
# Global Prometheus (collects metrics from the federated servers)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "federated-prometheus"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="kubernetes"}'
    static_configs:
      - targets:
          - "prometheus-dc1:9090"
          - "prometheus-dc2:9090"
          - "prometheus-dc3:9090"

# Local Prometheus (one per data center/zone)
# exposes its metrics through the /federate endpoint
```

#### Solution 2: Remote storage

```yaml
# Prometheus with remote storage (TimescaleDB, M3DB, ClickHouse);
# the URLs are placeholders for the adapter of the chosen backend
global:
  scrape_interval: 15s

remote_write:
  - url: "http://timescaledb:9009/write"
    queue_config:
      capacity: 10000
      max_shards: 200
      min_shards: 1
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms

remote_read:
  - url: "http://timescaledb:9009/read"
```

#### Solution 3: Cortex / Mimir

Cortex (and Grafana Mimir, a fork of Cortex) is a horizontally scalable, Prometheus-compatible backend.

Architecture: Prometheus → Cortex (many instances) → object storage (S3, GCS)

Advantages:

- Horizontal scaling
- High availability
- Multi-tenancy
- Long-term storage

#### Solution 4: Thanos

Thanos is a horizontally scalable layer on top of Prometheus.

Thanos components:

1. Sidecar: runs next to each Prometheus and uploads its blocks to object storage
2. Query: executes distributed queries across all sources
3. Store Gateway: serves historical data from object storage (S3)
4. Compactor: compacts and downsamples the stored data

Advantages:

- Long-term storage (years of data)
- Deduplication
- Global query engine
- HA Prometheus

### 10. PromQL: the query language

#### Basic examples

```promql
# Metric selector
node_cpu_seconds_total

# With labels
http_requests_total{job="api-server"}
http_requests_total{job="api-server", handler="/api/users"}

# Regular-expression match
http_requests_total{job=~"api.*"}

# Negative match
http_requests_total{status!="200"}
```

#### Functions

```promql
# rate: per-second rate of increase over the interval
rate(http_requests_total[5m])

# increase: total increase over the interval
increase(http_requests_total[1h])

# sum: sum of series
sum(http_requests_total)
sum by (job) (http_requests_total)            # group by label
sum without (instance) (http_requests_total)  # drop a label

# avg: average
avg(node_memory_MemAvailable_bytes)

# min/max
min(node_cpu_seconds_total)
max(node_memory_MemTotal_bytes)

# count
count(http_requests_total)

# topk/bottomk: the K largest/smallest values
topk(10, http_requests_total)
bottomk(5, http_requests_total)

# histogram_quantile: quantiles (for latency); aggregate buckets by "le" first
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

#### Advanced examples

```promql
# CPU usage in percent
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage in percent
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request latency p95
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Requests per second
sum(rate(http_requests_total[1m])) by (job)

# Pods that have restarted more than 5 times
kube_pod_container_status_restarts_total > 5

# Percentage of failed requests
sum(rate(http_requests_total{status=~"[45].."}[5m])) by (job) * 100 / sum(rate(http_requests_total[5m])) by (job)

# Node disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Pod memory usage (the label is "pod"; "pod_name" was used before Kubernetes 1.16)
container_memory_usage_bytes{pod=~"my-app.*"}

# Find pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1

# Average response time by endpoint (both sides grouped by the same label)
sum by (endpoint) (rate(http_request_duration_seconds_sum[5m])) / sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))
```

#### Binary operators

```promql
# Arithmetic
metric1 + metric2
metric1 - metric2
metric1 * metric2
metric1 / metric2
metric1 % metric2
metric1 ^ metric2

# Comparison
metric1 > 100
metric1 <= 50
metric1 == 200
metric1 != 0

# Logical/set operators
metric1 and metric2
metric1 or metric2
metric1 unless metric2

# Example: memory usage > 80% AND CPU < 50%
# (both metric names are hypothetical, e.g. produced by recording rules)
node_memory_usage_percent > 80 and node_cpu_usage_percent < 50
```

#### Aggregation

```promql
# Across all series
sum(http_requests_total)
avg(http_requests_total)
count(http_requests_total)

# By groups (by)
sum by (job, instance) (http_requests_total)

# Excluding labels (without)
sum without (instance) (http_requests_total)

# Matching modifiers (on, group_left) apply to binary operators, not to
# an aggregation itself: share of each instance in its job's traffic
sum by (job, instance) (rate(http_requests_total[5m])) / on (job) group_left sum by (job) (rate(http_requests_total[5m]))
```

#### Join operations

```promql
# One-to-one (one series matched to one series)
metric1 + on (job, instance) metric2

# Many-to-one
metric1 * on (job) group_left metric2

# One-to-many
metric1 / on (instance) group_right metric2
```

### 11. Practical examples for production

```promql
# Dashboard: Server Health

# CPU usage
avg(100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) by (instance)

# Memory usage
avg((1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) by (instance)

# Disk usage
avg((1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100) by (instance)

# Network I/O ("by" is only valid on an aggregation, so wrap rate() in sum)
sum by (instance) (rate(node_network_receive_bytes_total[5m]))
sum by (instance) (rate(node_network_transmit_bytes_total[5m]))

# Dashboard: Application Health

# Request rate
sum(rate(http_requests_total[5m])) by (job)

# Error rate
sum(rate(http_requests_total{status=~"[45].."}[5m])) by (job) * 100 / sum(rate(http_requests_total[5m])) by (job)

# Latency p99
histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

# Database query count (metric name as exposed by mysqld_exporter)
sum(rate(mysql_global_status_queries[5m])) by (instance)

# Cache hit ratio (cache_* metric names are application-specific examples)
sum by (instance) (rate(cache_hits_total[5m])) * 100 / (sum by (instance) (rate(cache_hits_total[5m])) + sum by (instance) (rate(cache_misses_total[5m])))
```
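
Dashboard queries like the ones in section 11 are re-evaluated on every panel refresh. One possible way to make them cheaper is to precompute them with recording rules, which use the same PrometheusRule resource as the alerts. The sketch below is only an illustration and not part of the original setup: the resource name, the group name, and the recorded series names are hypothetical, and it assumes the `http_requests_total` metric with a `status` label used throughout the answer, as well as a Prometheus rule selector that matches this object.

```yaml
# Sketch: recording rules that precompute some dashboard queries.
# All names below (resource, group, recorded series) are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dashboard-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: dashboard.recording
      interval: 30s
      rules:
        # CPU usage per node, in percent
        - record: instance:node_cpu_usage:percent
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

        # Request rate per job
        - record: job:http_requests:rate5m
          expr: |
            sum by (job) (rate(http_requests_total[5m]))

        # Share of 5xx responses per job (0..1)
        - record: job:http_requests_5xx:ratio_rate5m
          expr: |
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
```

A Grafana panel or an alert expression could then reference `job:http_requests_5xx:ratio_rate5m > 0.01` instead of repeating the full expression.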

