Monitoring is the backbone of any reliable DevOps setup. And if you’re working with monitoring, you’ve likely used Prometheus. This open-source powerhouse has redefined how we track system performance, but are you making the most of its API?
Prometheus is the go-to solution for monitoring container-based environments, particularly in Kubernetes. Its pull-based model and flexible query language provide deep visibility into your systems.
But its real strength lies in the HTTP API—a tool that enables programmatic monitoring, automation, and seamless integration into your workflows. If you're not using it yet, you might be leaving a lot on the table.
What Makes the Prometheus API Worth Your Time?
The Prometheus API isn't just another tool in your tech stack – it's the secret weapon that unlocks next-level monitoring capabilities. With it, you can:
- Pull metrics data programmatically from any service Prometheus scrapes
- Create custom dashboards that actually make sense for your specific use cases
- Automate your alerting workflows based on complex conditions
- Integrate with your existing tools like Slack, PagerDuty, or custom webhooks
- Build automation that reacts to metrics in real-time
- Extend Prometheus capabilities beyond what's available in the UI
- Implement custom reporting for stakeholders
The API gives you direct access to everything Prometheus collects, letting you work with that data however you want. That's power.
Unlike some monitoring solutions that lock you into their visualization tools, Prometheus follows the Unix philosophy – it does one thing (collecting and storing metrics) extremely well, then exposes everything through an API that lets you build exactly what you need on top.
Practical API Use Cases
Before jumping into the technical details, let's look at how teams use the Prometheus API:
- Auto-scaling systems – Triggering infrastructure scaling based on custom metrics
- Anomaly detection – Feeding metrics into ML systems to catch unusual patterns
- Business intelligence – Correlating technical metrics with business KPIs
- Capacity planning – Analyzing long-term trends to forecast resource needs
- Custom SLO dashboards – Building service level objective tracking specific to your reliability targets
Getting Started with the Prometheus API
Setting up your first connection is straightforward. The Prometheus API runs on HTTP, making it accessible from practically anywhere.
Base URL Structure
Your Prometheus server exposes its API at:
http://<your-prometheus-server>:<port>/api/v1/
For local testing, this might look like:
http://localhost:9090/api/v1/
The API follows RESTful principles with clearly defined endpoints. All responses come in a consistent JSON format with this general structure:
{
"status": "success",
"data": {
// The actual response data varies by endpoint
}
}
For error cases, you'll get:
{
"status": "error",
"errorType": "bad_data",
"error": "The specific error message"
}
This consistency makes parsing responses straightforward across all API interactions.
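In practice, that means a thin wrapper is all you need. Here's a minimal Python sketch (using the requests library, with a localhost URL you'd swap for your own server) that checks the status field once and can then be reused for every endpoint:
import requests

PROMETHEUS_URL = "http://localhost:9090"  # adjust to your server

def prom_get(endpoint, **params):
    """Call a Prometheus API endpoint and return the 'data' payload."""
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/{endpoint}", params=params, timeout=10)
    body = response.json()
    if body.get("status") != "success":
        # error responses carry 'errorType' and 'error' fields
        raise RuntimeError(f"{body.get('errorType')}: {body.get('error')}")
    return body["data"]

# Example: current value of the 'up' metric for every target
print(prom_get("query", query="up"))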
Response Format Details
Let's break down what you'll get from different query types:
Range query responses have:
{
"resultType": "matrix",
"result": [
{
"metric": { "label1": "value1", ... },
"values": [
[ timestamp1, "string_value1" ],
[ timestamp2, "string_value2" ],
...
]
},
...
]
}
Instant query responses contain:
{
"resultType": "vector",
"result": [
{
"metric": { "label1": "value1", ... },
"value": [ timestamp, "string_value" ]
},
...
]
}
Understanding these structures is crucial for correctly parsing the data in your applications.
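Here's a rough sketch of what that parsing looks like in Python; note that sample values arrive as strings, so convert them before doing math:
def print_instant(result):
    # 'vector': one [timestamp, value] pair per series
    for series in result:
        ts, value = series["value"]
        print(series["metric"], "=", float(value), "at", ts)

def print_range(result):
    # 'matrix': a list of [timestamp, value] pairs per series
    for series in result:
        for ts, value in series["values"]:
            print(series["metric"], ts, float(value))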
Authentication Options
Prometheus keeps things simple with these authentication methods:
Method | Best For | Setup Complexity | Implementation Approach |
---|---|---|---|
No Auth | Testing, isolated networks | None | Default configuration |
Basic Auth | Standard protection | Low | Reverse proxy (Nginx, Apache) |
OAuth | Enterprise environments | Medium | OAuth2 Proxy sidecar |
TLS Client Certs | High-security needs | High | mTLS with cert management |
API Keys | Microservice architectures | Medium | Custom proxy layer |
Most teams start with Basic Auth and move to OAuth as they scale.
Prometheus ships with only minimal built-in security (recent versions can enable basic auth and TLS through a web configuration file), so most teams still deploy it behind a reverse proxy that handles authentication. Here's how to set up Basic Auth with Nginx:
server {
listen 443 ssl;
server_name prometheus.example.com;
ssl_certificate /etc/nginx/certs/prometheus.crt;
ssl_certificate_key /etc/nginx/certs/prometheus.key;
location / {
auth_basic "Prometheus";
auth_basic_user_file /etc/nginx/htpasswd/.htpasswd;
proxy_pass http://localhost:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
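With that proxy in place, API clients simply send credentials with each request. A quick sketch using Python's requests (the hostname and credentials are placeholders):
import requests

response = requests.get(
    "https://prometheus.example.com/api/v1/query",
    params={"query": "up"},
    auth=("monitoring-user", "s3cret"),  # placeholder credentials from your htpasswd file
    timeout=10,
)
print(response.json()["data"]["result"])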
For OAuth2, many teams use the oauth2-proxy project as a sidecar:
# docker-compose example
services:
oauth2-proxy:
image: quay.io/oauth2-proxy/oauth2-proxy
command:
- --provider=github
- --email-domain=*
- --upstream=http://prometheus:9090
- --cookie-secret=your-secret
- --client-id=your-github-app-id
- --client-secret=your-github-app-secret
ports:
- "4180:4180"
This setup works well for teams already using GitHub or Google for authentication.
Key Prometheus Endpoints You'll Use
The Prometheus API has several endpoints, but these five will handle 90% of your needs:
1. Query Instant Data
GET /api/v1/query
This endpoint gives you a snapshot of metrics right now. Perfect for current status checks.
Parameters:
- query (required): The PromQL expression to evaluate
- time: Evaluation timestamp (RFC3339 or Unix timestamp), defaults to the current time
- timeout: Evaluation timeout (e.g., 30s, 1m), defaults to the global timeout
Example:
curl 'http://localhost:9090/api/v1/query?query=up'
Response:
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "up",
"instance": "localhost:9090",
"job": "prometheus"
},
"value": [1675956970.123, "1"]
},
{
"metric": {
"__name__": "up",
"instance": "localhost:8080",
"job": "api-server"
},
"value": [1675956970.123, "0"]
}
]
}
}
2. Query Range Data
GET /api/v1/query_range
When you need metrics over time (like for graphs), this is your go-to.
Parameters:
- query (required): The PromQL expression to evaluate
- start (required): Start timestamp (RFC3339 or Unix timestamp)
- end (required): End timestamp
- step (required): Query resolution step width, as a duration or float seconds
- timeout: Evaluation timeout, defaults to the global timeout
Example:
curl 'http://localhost:9090/api/v1/query_range?query=rate(http_requests_total[5m])&start=2023-01-01T20:10:30.781Z&end=2023-01-01T20:11:00.781Z&step=15s'
Response:
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "http_requests_total",
"code": "200",
"handler": "query",
"instance": "localhost:9090",
"job": "prometheus"
},
"values": [
[1672602630.781, "3.4"],
[1672602645.781, "5.6"],
[1672602660.781, "4.2"]
]
}
]
}
}
The step parameter deserves special attention – it defines the resolution of your data. Too small, and you'll hit performance issues; too large, and you'll miss important details.
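One practical approach is to derive the step from the time range so every request returns roughly the same number of points. A small sketch (the 250-point target and localhost URL are just illustrative choices):
import math
import time
import requests

def query_range_auto_step(query, start, end, target_points=250):
    """Pick a step so the response has about target_points samples per series."""
    step = max(1, math.ceil((end - start) / target_points))  # seconds
    response = requests.get(
        "http://localhost:9090/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=30,
    )
    return response.json()["data"]["result"]

now = time.time()
series = query_range_auto_step("rate(http_requests_total[5m])", now - 6 * 3600, now)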
3. Series Metadata
GET /api/v1/series
This lets you discover what metrics are available and their labels.
Parameters:
- match[] (required): Repeated series selector parameters
- start: Start timestamp
- end: End timestamp
Example:
curl 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_cpu_seconds_total'
Response:
{
"status": "success",
"data": [
{
"__name__": "up",
"instance": "localhost:9090",
"job": "prometheus"
},
{
"__name__": "process_cpu_seconds_total",
"instance": "localhost:9090",
"job": "prometheus"
}
]
}
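Because match[] can repeat, pass it as a list; Python's requests library will encode the repeated parameters for you. A quick sketch against a local server:
import requests

response = requests.get(
    "http://localhost:9090/api/v1/series",
    params={"match[]": ["up", "process_cpu_seconds_total"]},  # repeated match[] params
    timeout=10,
)
for series in response.json()["data"]:
    print(series["__name__"], series.get("job"), series.get("instance"))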
4. Label Values
GET /api/v1/label/<label_name>/values
Need to know all possible values for a label? This endpoint has you covered.
Parameters:
- start: Start timestamp
- end: End timestamp
- match[]: Series selector to filter by
Example:
curl 'http://localhost:9090/api/v1/label/job/values'
Response:
{
"status": "success",
"data": [
"prometheus",
"node-exporter",
"api-gateway",
"database"
]
}
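This is handy for building dashboard dropdowns or iterating over scrape jobs. A hedged sketch that lists every job and counts how many of its targets are currently up:
import requests

BASE = "http://localhost:9090/api/v1"  # adjust to your server

jobs = requests.get(f"{BASE}/label/job/values", timeout=10).json()["data"]
for job in jobs:
    result = requests.get(
        f"{BASE}/query", params={"query": f'count(up{{job="{job}"}} == 1)'}, timeout=10
    ).json()["data"]["result"]
    healthy = result[0]["value"][1] if result else "0"
    print(f"{job}: {healthy} healthy targets")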
5. Targets
GET /api/v1/targets
This shows all targets Prometheus is scraping, with their health status.
Parameters:
- state: Filter by target state (active, dropped, or any)
Example:
curl 'http://localhost:9090/api/v1/targets?state=active'
Response:
{
"status": "success",
"data": {
"activeTargets": [
{
"discoveredLabels": {
"__address__": "localhost:9090",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "prometheus"
},
"labels": {
"instance": "localhost:9090",
"job": "prometheus"
},
"scrapePool": "prometheus",
"scrapeUrl": "http://localhost:9090/metrics",
"lastError": "",
"lastScrape": "2023-02-09T12:30:00.123456789Z",
"lastScrapeDuration": 0.012345,
"health": "up"
}
]
}
}
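A common use is a quick health sweep that flags any target that isn't up. A short sketch, relying on the field names shown in the response above:
import requests

response = requests.get(
    "http://localhost:9090/api/v1/targets", params={"state": "active"}, timeout=10
)
for target in response.json()["data"]["activeTargets"]:
    if target["health"] != "up":
        print(f'{target["scrapeUrl"]} is {target["health"]}: {target["lastError"]}')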
Additional Useful Endpoints
While the five endpoints above cover most use cases, these can be handy too:
6. Alerts
GET /api/v1/alerts
Lists all active alerts.
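If you want to script against it, the response carries an alerts list under data, each entry with its labels, annotations, and state. A small sketch that prints whatever is currently firing:
import requests

alerts = requests.get("http://localhost:9090/api/v1/alerts", timeout=10).json()["data"]["alerts"]
for alert in alerts:
    if alert["state"] == "firing":
        print(alert["labels"].get("alertname"), alert["annotations"].get("summary", ""))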
7. Rules
GET /api/v1/rules
Lists all recording and alerting rules.
8. Status Config
GET /api/v1/status/config
Dumps the current Prometheus configuration.
9. Metadata
GET /api/v1/metadata
Returns metadata about metrics (helpful for understanding units and semantics).
Working with PromQL Through the API
The real magic happens when you combine the API with PromQL queries.
Here's a comprehensive chart of essential query patterns that every DevOps engineer should know:
Query Type | Example | Use Case | Notes |
---|---|---|---|
Simple | http_requests_total | Basic metric retrieval | Returns all time series with this name |
Counter Rate | rate(http_requests_total[5m]) | Traffic patterns | Per-second rate calculated over 5m |
Counter Increase | increase(http_requests_total[1h]) | Hourly totals | Total increase over the time period |
Gauge Current | node_memory_MemFree_bytes | Current state | Point-in-time value |
Gauge Aggregation | avg_over_time(node_memory_MemFree_bytes[1h]) | Stable representation | Smooths fluctuations |
Sum | sum(node_cpu_seconds_total) | Resource utilization | Total across all instances |
By | sum by (instance) (up) | Grouped metrics | Aggregation with dimensions |
Without | sum without (job) (up) | Remove dimensions | Simplify output |
Offset | rate(http_requests_total[5m] offset 1h) | Comparison with past | Historical data points |
Delta | delta(cpu_temp_celsius[2h]) | Change detection | For gauges (vs rate for counters) |
Topk | topk(3, cpu_usage) | Hotspot identification | Find highest values |
Bottomk | bottomk(3, up) | Problem detection | Find lowest values |
Quantile | histogram_quantile(0.95, http_request_duration_seconds_bucket) | SLO tracking | Calculate percentiles |
Prediction | predict_linear(node_filesystem_free_bytes[6h], 24 * 3600) | Capacity planning | Predict future values |
Resets | resets(counter[5m]) | Service restarts | Detect counter resets |
Time Functions | http_requests_total offset 1d | Day-over-day comparison | Compare to same time yesterday |
Label Matching | http_requests_total{status=~"5..", method!="POST"} | Filtering | Multiple conditions with regex |
Binary Operators | node_memory_MemTotal_bytes - node_memory_MemFree_bytes | Derived metrics | Arithmetic between metrics |
Boolean | node_filesystem_free_bytes / node_filesystem_size_bytes < 0.10 | Threshold checks | Filters by default; add the bool modifier to return 0 or 1 |
Practical PromQL Examples
Let me break down some practical examples you'll use:
1. Error Rate Calculation
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
This query calculates your error rate – the percentage of requests returning 5xx errors. Super useful for SLOs.
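Running that query through the API turns it into something you can act on, for example failing a health check when the error rate crosses a threshold. A sketch, where the 1% threshold and the status_code label are assumptions carried over from the query above:
import requests

QUERY = (
    'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)

result = requests.get(
    "http://localhost:9090/api/v1/query", params={"query": QUERY}, timeout=10
).json()["data"]["result"]

error_ratio = float(result[0]["value"][1]) if result else 0.0
print(f"Error rate: {error_ratio:.2%}")
if error_ratio > 0.01:  # assumed 1% SLO threshold
    raise SystemExit("Error budget burn too high")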
2. Container Memory Usage by Pod
sum by (pod) (container_memory_working_set_bytes{namespace="production"})
Shows memory consumption grouped by pod name in your production namespace.
3. CPU Throttling Detection
sum by (pod) (rate(container_cpu_cfs_throttled_seconds_total[5m])) / sum by (pod) (rate(container_cpu_cfs_periods_total[5m])) > 0.1
Identifies pods experiencing more than 10% CPU throttling, indicating they need more resources.
4. Disk Space Prediction
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24 * 3600 * 7)
Predicts free disk space in 7 days based on the trend over the last 6 hours.
5. Apdex Score (Application Performance)
(sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
Calculates an Apdex score where requests under 0.3s are "satisfied" and under 1.2s are "tolerating". Because histogram buckets are cumulative, the 1.2s bucket already includes the satisfied requests, which is why the combined sum is divided by two.
5 Common PromQL Mistakes
When crafting these queries, watch out for these common pitfalls:
- Missing rate() for counters - Counters only ever increase; you almost always want rate() or increase(), not the raw value
- Incorrect time windows - Windows that are too small produce noisy data; windows that are too large smooth over important spikes
- Missing label context - Aggregating without considering cardinality explosion
- Forgetting by() in division - Division between vectors needs matching label sets on both sides
- Unescaped regex characters - Remember to escape special characters in label matchers
Advanced PromQL Tips You Need to Know
For more complex monitoring needs:
- Create recording rules for complex queries: Recording rules pre-compute expensive expressions, making dashboards faster.
- Use absent() to detect missing metrics:
absent(up{job="critical-service"})
Returns 1 if no matching series exists at all (for example, the job has disappeared from scraping entirely).
- Use subqueries for moving averages:
avg_over_time(rate(http_requests_total[5m])[1h:5m])
This gives you a smoothed rate calculated every 5 minutes over a sliding 1-hour window.
Common API Integration Patterns Worth Knowing
Grafana Integration
Grafana already works with Prometheus out of the box, but you can extend this with custom API calls through Grafana's data source plugins or visualization panels:
// Example fetch in a Grafana panel
async function queryPrometheus(query) {
const response = await fetch(`http://prometheus:9090/api/v1/query?query=${encodeURIComponent(query)}`);
const data = await response.json();
if (data.status !== 'success') {
throw new Error(`Query failed: ${data.error || 'Unknown error'}`);
}
return data.data.result;
}
// Example usage in a Grafana panel
const metricData = await queryPrometheus('sum(rate(http_requests_total[5m]))');
// Custom visualization logic using D3.js or other libraries
You can also use Grafana's Prometheus data source with variables for dynamic dashboards:
sum by (service) (rate(http_requests_total{environment="$env", datacenter="$dc"}[5m]))
Where $env and $dc are Grafana template variables that users can change.
CI/CD Pipeline Integration
Want to verify your deployment didn't break things? Check it with an API call in your deployment pipeline:
#!/bin/bash
# progressive_deployment.sh
# Deploy the new version to a canary environment
kubectl apply -f canary-deployment.yaml
# Wait for the deployment to stabilize
sleep 60
# Check error rate for the canary version
ERROR_RATE=$(curl -s -H "Authorization: Bearer $PROM_TOKEN" \
'http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{version="canary",status_code=~"5.."}[5m]))/sum(rate(http_requests_total{version="canary"}[5m]))*100' \
| jq '.data.result[0].value[1] // "0"' \
| tr -d '"')
# Check latency for the canary version
P95_LATENCY=$(curl -s -H "Authorization: Bearer $PROM_TOKEN" \
'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m]))by(le))' \
| jq '.data.result[0].value[1] // "0"' \
| tr -d '"')
# Evaluate if the deployment meets SLOs
if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )) || (( $(echo "$P95_LATENCY > 0.3" | bc -l) )); then
echo "Canary deployment failed SLO checks!"
echo "Error rate: $ERROR_RATE% (threshold: 1.0%)"
echo "P95 latency: ${P95_LATENCY}s (threshold: 0.3s)"
# Rollback the canary deployment
kubectl delete -f canary-deployment.yaml
exit 1
else
echo "Canary deployment passed SLO checks!"
echo "Error rate: $ERROR_RATE% (threshold: 1.0%)"
echo "P95 latency: ${P95_LATENCY}s (threshold: 0.3s)"
# Promote canary to production
kubectl apply -f production-deployment.yaml
fi
This script promotes a canary deployment only if error rates and latency meet your SLOs.
Custom Alerting Logic
Sometimes you need alerts based on complex conditions that aren't easily expressed in standard alerting rules:
#!/usr/bin/env python3
# advanced_alerting.py
import requests
import time
import smtplib
from email.message import EmailMessage
import logging
import os
from datetime import datetime, timedelta
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger('prometheus_alerts')
# Configuration
PROMETHEUS_URL = os.environ.get('PROMETHEUS_URL', 'http://prometheus:9090')
CHECK_INTERVAL = int(os.environ.get('CHECK_INTERVAL', 60)) # seconds
ALERT_COOLDOWN = int(os.environ.get('ALERT_COOLDOWN', 3600)) # seconds
RECIPIENTS = os.environ.get('ALERT_RECIPIENTS', '').split(',')
SMTP_SERVER = os.environ.get('SMTP_SERVER', 'smtp.example.com')
# Alert state management
last_alerts = {}
def query_prometheus(query):
"""Execute a PromQL query against the Prometheus API."""
try:
response = requests.get(
f"{PROMETHEUS_URL}/api/v1/query",
params={'query': query},
timeout=10
)
response.raise_for_status()
result = response.json()
if result['status'] != 'success':
logger.error(f"Query failed: {result.get('error', 'Unknown error')}")
return None
return result['data']['result']
except Exception as e:
logger.exception(f"Error querying Prometheus: {e}")
return None
def check_business_hours():
"""Only alert during business hours."""
now = datetime.now()
# Monday-Friday, 9 AM to 5 PM
return now.weekday() < 5 and 9 <= now.hour < 17
def check_conditions():
"""Check for complex alert conditions."""
conditions = [
# High error rate with high traffic
{
'name': 'high_error_rate',
'query': 'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05 and sum(rate(http_requests_total[5m])) > 10',
'message': 'High error rate detected with significant traffic',
'severity': 'critical',
'runbook': 'https://wiki.example.com/runbooks/high-error-rate'
},
# Database connection saturation
{
'name': 'db_connection_saturation',
'query': 'max(pg_stat_activity_count) / max(pg_settings_max_connections) > 0.8',
'message': 'Database connection pool nearing saturation',
'severity': 'warning',
'runbook': 'https://wiki.example.com/runbooks/db-connection-pool'
},
# Correlated conditions: both API latency and DB latency high
{
'name': 'service_degradation',
            'query': 'histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le)) > 1 and histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le)) > 0.5',  # 0.5s DB threshold is illustrative
            'message': 'API and database latency are degraded together',
            'severity': 'critical',
            'runbook': 'https://wiki.example.com/runbooks/service-degradation'
        },
    ]
    # Evaluate each condition; a non-empty result vector means it is currently true.
    triggered = []
    for condition in conditions:
        result = query_prometheus(condition['query'])
        if result:
            triggered.append(condition)
    return triggered

def send_alert(condition):
    """Email a triggered condition to the configured recipients (SMTP details are placeholders)."""
    msg = EmailMessage()
    msg['Subject'] = f"[{condition['severity'].upper()}] {condition['name']}"
    msg['From'] = 'prometheus-alerts@example.com'
    msg['To'] = ', '.join(r for r in RECIPIENTS if r)
    msg.set_content(f"{condition['message']}\nRunbook: {condition['runbook']}")
    with smtplib.SMTP(SMTP_SERVER) as smtp:
        smtp.send_message(msg)

def main():
    while True:
        for condition in check_conditions():
            last_sent = last_alerts.get(condition['name'], 0)
            in_cooldown = time.time() - last_sent < ALERT_COOLDOWN
            if check_business_hours() and not in_cooldown:
                logger.info(f"Alerting on {condition['name']}: {condition['message']}")
                send_alert(condition)
                last_alerts[condition['name']] = time.time()
        time.sleep(CHECK_INTERVAL)

if __name__ == '__main__':
    main()
Performance Tips for Heavy API Users
When you're making lots of API calls, keep these tips in mind:
1. Use query_range wisely – Specify reasonable step values
2. Cache common queries – Don't hammer the API with the same requests (see the caching sketch after this list)
3. Be selective with labels – The more labels, the bigger the response
4. Batch related queries – Reduce network overhead
5. Consider federation – For multi-cluster setups
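Here's what the caching tip can look like in practice: a minimal in-process TTL cache. The 30-second TTL is an arbitrary example; swap in Redis or similar if several processes share the same queries:
import time
import requests

_cache = {}  # query -> (expires_at, result)

def cached_query(query, ttl=30):
    """Return a cached result if it's fresher than ttl seconds, else refetch."""
    expires_at, result = _cache.get(query, (0, None))
    if time.time() < expires_at:
        return result
    response = requests.get(
        "http://localhost:9090/api/v1/query", params={"query": query}, timeout=10
    )
    result = response.json()["data"]["result"]
    _cache[query] = (time.time() + ttl, result)
    return result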
API Limitations and Workarounds
Let's be honest about some Prometheus API constraints:
Time Range Limits
The API can get sluggish with very large time ranges. Break these into smaller chunks:
# Instead of one big query, fetch the range in smaller chunks.
# get_prometheus_data() is assumed to wrap a /api/v1/query_range call
# (as in the earlier examples) and return the "result" list for that window.
def get_data_in_chunks(query, start_time, end_time, chunk_hours=6):
all_data = []
current = start_time
while current < end_time:
chunk_end = min(current + chunk_hours * 3600, end_time)
# API call for just this chunk
chunk_data = get_prometheus_data(query, current, chunk_end)
all_data.extend(chunk_data)
current = chunk_end
return all_data
Rate Limiting
Some environments put limits on API calls. Implement backoff logic:
def api_call_with_backoff(url, max_retries=5):
for attempt in range(max_retries):
response = requests.get(url)
if response.status_code == 429: # Too Many Requests
sleep_time = 2 ** attempt
time.sleep(sleep_time)
else:
return response
raise Exception("Max retries exceeded")
Security Best Practices
Your Prometheus API is a window into your system's health – protect it:
- Never expose it directly to the internet – Use a proxy or API gateway
- Implement proper authentication – Basic Auth is the minimum
- Use TLS everywhere – Encrypt all API traffic
- Apply RBAC – Limit who can access what data
- Audit API access – Track who's viewing your metrics
Tying It All Together
The Prometheus API transforms passive monitoring into active observability. By programmatically accessing your metrics, you can build automated responses to system conditions, create custom visualizations, and integrate monitoring into your workflow.
Last9 and Prometheus
Last9 integrates with Prometheus to enhance your monitoring experience. It connects directly to your Prometheus API, organizing metrics and turning complex data into clear, intuitive visualizations.
With Last9’s Prometheus integration, you can easily spot patterns across your infrastructure and applications—no need to wrestle with complex queries. Get the insights you need, when you need them.
Let’s talk about how we make observability simpler.