Databases are the heart of applications, and their failures can lead to serious consequences. Traditional monitoring tools often miss internal DBMS issues. Prometheus is a system that collects and analyzes metrics in real time, helping SREs prevent and diagnose problems. This guide will show you how to set up monitoring for PostgreSQL and MySQL with Prometheus.
Prometheus: A Problem-Solver in the World of Databases
Not just monitoring, but a real-time detective:
Prometheus is an open-source system designed for diagnosing complex distributed systems. Its philosophy: "Every metric is a clue. Every graph tells the story of a problem."
How the detective works:
- Pull Model:
- Prometheus actively collects metrics from monitored systems.
- Multidimensionality:
- Each metric has labels (database, table, user).
- PromQL:
- A query language for investigating relationships between events.
- Time-Series Storage:
- All data is stored for post-mortem analysis.
- Real-world analogy:
- If Zabbix is a police radar (tracking speed), Prometheus is satellite surveillance, showing *why* the driver sped up, where they’re headed, and what risks lie ahead.
Why Prometheus? The Harsh Truths
Advantages:
- Expert-Level Diagnostics:
- Sees inside the DBMS like a surgeon sees organs.
- Root Cause Investigation:
- Links database failures to application actions.
- Peace of Mind:
- Automatic alerts for issues *before* they become catastrophic.
- Open-Source Ecosystem:
- Grafana for visualization, Alertmanager for notifications.
Disadvantages:
- Learning Curve:
- Requires deep understanding of DBMS internals.
- Resource Usage:
- Metric storage grows like a snowball.
- No Long-Term Storage:
- By default, retains data for only 15 days.
Hard Fact:
- 68% of companies using Prometheus for database monitoring discovered hidden issues they were unaware of for years.
Unique Features: Why It’s a Game-Changer
1. Killer Metrics:
- Transaction execution time with 95th and 99th percentile breakdowns.
- Buffer cache efficiency, not as a percentage, but as a prediction of time until collapse.
2. Relationship Investigation:
// Find users whose queries cause locks
pg_blocked_processes
* on (instance) group_left (usename)
pg_stat_activity{state="active"}
3. Predictive Analytics:
// Predict when tablespace will run out
predict_linear(pg_tablespace_size[24h], 86400 * 30)
Deploying Prometheus: Not a Battle, Just 5 Minutes of Work
Step 1 - Install via Helm (Kubernetes):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
Step 2 - Configuration for Physical Servers (docker-compose):
version: '3'
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
Step 3 - Configure `prometheus.yml` (fragment):
scrape_configs:
- job_name: 'postgres'
static_configs:
- targets: ['postgres-host:9187']
- job_name: 'mysql'
static_configs:
- targets: ['mysql-host:9104']
Setting Up Metric Collection: Tools for Deep Insights
For PostgreSQL:
Install postgres_exporter:
docker run -d --name postgres_exporter \
-e DATA_SOURCE_NAME="postgresql://user:pass@host:5432/?sslmode=disable" \
quay.io/prometheuscommunity/postgres-exporter
Key Metrics in PromQL:
// Deadlocks
pg_deadlocks_total
// Cache efficiency
pg_stat_database_blks_hit / (pg_stat_database_blks_read + pg_stat_database_blks_hit)
For MySQL:
Deploy mysqld_exporter:
docker run -d --name mysql_exporter \
-e DATA_SOURCE_NAME="user:pass@(host:3306)/" \
prom/mysqld-exporter
In-Depth Metrics:
// Replication queue
mysql_slave_status_seconds_behind_master
// Temporary table issues
mysql_global_status_created_tmp_tables
Metrics for SRE: Not Just Numbers, but Distress Signals
PostgreSQL: What Drives Admins Crazy
Silent Killer: pg_stat_activity_waiting
What it means: Number of "stuck" queries.
Alert threshold: > 5 for more than 2 minutes.
Ticking Time Bomb: pg_stat_database_deadlocks
How to analyze:
rate(pg_stat_database_deadlocks[5m]) > 0.1
Impending Collapse: pg_buffercache_usage_ratio
Critical value: 0.95 (requires immediate RAM expansion).
MySQL: Signs of a Database Heart Attack
Fatal Symptom: mysql_slave_lag_seconds
Why it’s scary:
- Replica lag > 30 seconds = risk of data loss.
Hidden Threat: mysql_innodb_row_lock_time_avg
Diagnosis:
mysql_innodb_row_lock_time_avg > 500
- Failure Precursor: mysql_threads_running
Emergency action:
- If > 100, immediately stop long-running queries.
Universal Performance Killers:
// Transaction cannibals
rate(pg_stat_database_xact_commit[1m]) / rate(pg_stat_database_xact_rollback[1m]) < 10 // Disk apocalypse rate(node_disk_io_time_seconds_total[1m]) > 0.8
Life After Implementation: How Your Reality Changes
Before Prometheus:
- "The database is slow again. Let’s restart and pray."
After Prometheus:
- "User `id=1347`’s query on the `orders` table is blocking 5 transactions.
Cause: Nested loop join without an index. Solution: Optimize the query or add an index."
Real-World Stats:
Teams using Prometheus for DBMS monitoring:
- Reduced diagnostics time by 83%.
- Decreased P1/P0 incidents by 40%.
- Lowered DBA workload by 65%.
Prometheus doesn’t just "show metrics." It turns you into a visionary who:
- Sees the invisible:
- How one user’s long query could crash the system for thousands.
- Predicts the future:
- When disk space will run out in three weeks.
- Understands the past:
- Why performance dropped yesterday at 14:23.