Last9

Mar 6th, '25 / 13 min read

Prometheus API: From Basics to Advanced Usage

Learn how to use the Prometheus API, from basic queries to advanced techniques, to monitor and analyze your system metrics effectively.


Monitoring your infrastructure shouldn’t be a shot in the dark. The Prometheus API helps you pull the right metrics so you actually know what’s going on. Whether you’re just getting started or trying to make sense of your current setup, this guide breaks down how to use the API to get the answers you needβ€”without the guesswork.

Prometheus API: The Key to Understanding Your Metrics

The Prometheus API isn't just another tool to add to your already overflowing tech stack – it's the secret sauce that gives you direct access to all the juicy metrics Prometheus collects.

In simple terms, the Prometheus API is how you talk to your Prometheus server. Think of it as your backstage pass to the VIP section of metrics – it lets you query, analyze, and extract the data that Prometheus scrapes from your systems.

Why should this matter to you? Because with the right API calls, you can:

  • Pull exactly the metrics you need without wading through the noise
  • Integrate Prometheus data with your other tools and dashboards
  • Automate responses to specific metric conditions
  • Build custom monitoring solutions that fit your infrastructure like a glove

The Prometheus API comes in two main flavors: the HTTP API for direct queries and the management API for handling Prometheus itself. Master both, and you're essentially the monitoring superhero your team didn't know they needed.
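To make that concrete, here's a minimal Python sketch (using the requests library, with a placeholder server address) that touches both flavors. Note the reload call only works if Prometheus was started with the --web.enable-lifecycle flag:

import requests

BASE = 'http://your-prometheus-server:9090'

# HTTP API: run an instant query against your metrics
resp = requests.get(f'{BASE}/api/v1/query', params={'query': 'up'})
print(resp.json()['status'])

# Management API: plain-text health and readiness probes
print(requests.get(f'{BASE}/-/healthy').text)
print(requests.get(f'{BASE}/-/ready').text)

# Management API: hot-reload the configuration
# (requires Prometheus to run with --web.enable-lifecycle)
requests.post(f'{BASE}/-/reload')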

πŸ’‘
Explore key Prometheus functions and learn how to use them effectively for querying and analyzing metrics: Read more.

Access Your Metrics with the Prometheus API

Let's start off with the basics. The Prometheus API speaks HTTP, which means you can start playing with it using tools you already know, like cURL or Postman.

The entry point to Prometheus's data goldmine is typically available at:

http://your-prometheus-server:9090/api/v1/

Let's break down the essential endpoints you'll be using on the daily:

Endpoint              | What It Does                            | When To Use It
/query                | Executes instant queries                | When you need the current value of a metric
/query_range          | Executes range queries                  | When you need data over a time period
/series               | Finds time series matching a label set  | When exploring what metrics exist
/labels               | Gets all label names                    | When you need to know what dimensions exist
/label/<name>/values  | Gets values for a specific label        | When filtering by a specific dimension

Here's a quick hit to get you started – fetching the current CPU usage with cURL:

curl -G 'http://your-prometheus-server:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(node_cpu_seconds_total{mode!="idle"}[1m])) by (instance)'

This command sends an HTTP request to your Prometheus server's query endpoint (the -G flag keeps it a GET request while --data-urlencode safely escapes the spaces and special characters in the PromQL). It asks for the sum of CPU usage rates (excluding idle time) over the last minute, grouped by instance – a quick snapshot of which servers are working hardest right now. The response comes back as JSON – structured, clean, and ready for parsing.
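Here's what that JSON envelope looks like and a minimal Python sketch for unpacking it (the sample values in the comment are illustrative):

import requests

resp = requests.get(
    'http://your-prometheus-server:9090/api/v1/query',
    params={'query': 'sum(rate(node_cpu_seconds_total{mode!="idle"}[1m])) by (instance)'},
).json()

# The envelope always has this shape:
# {"status": "success",
#  "data": {"resultType": "vector",
#           "result": [{"metric": {"instance": "node-1:9100"},
#                       "value": [1625097600.0, "0.42"]}]}}
for series in resp['data']['result']:
    instance = series['metric'].get('instance', 'unknown')
    timestamp, value = series['value']
    print(f'{instance}: {float(value):.2f} cores busy')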

πŸ’‘
Understand the different Prometheus metric types and how to use them effectively: Read more.

How to Query Your Metrics with Precision

The real power of the Prometheus API lies in PromQL – Prometheus Query Language. It's like SQL for your metrics, but with time-series superpowers.

Basic PromQL through the API is straightforward:

curl 'http://your-prometheus-server:9090/api/v1/query?query=up'

This query returns the "up" metric, which Prometheus uses to track whether targets are online (1) or offline (0). It's essentially a health check that shows you at a glance which systems Prometheus is scraping successfully and which might have connectivity issues – simple, but incredibly useful when you're trying to figure out why your alerts are blowing up at 3 AM.

Let's level up to something more practical – finding servers that are running low on disk space:

curl -G 'http://your-prometheus-server:9090/api/v1/query' \
  --data-urlencode 'query=node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20'

This command is your early warning system for disk space issues. It calculates the percentage of free space on each filesystem and filters for instances where it's below 20%, returning a list of systems approaching dangerous storage levels – and giving you time to clean up or expand capacity before your servers start throwing tantrums.
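The same check translates naturally to Python if you'd rather consume the results programmatically – a minimal sketch (the server address is a placeholder):

import requests

query = 'node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20'
resp = requests.get(
    'http://your-prometheus-server:9090/api/v1/query',
    params={'query': query},  # requests handles the URL encoding for you
).json()

# Each result carries the filesystem's labels plus the computed percentage
for fs in resp['data']['result']:
    labels = fs['metric']
    pct_free = float(fs['value'][1])
    print(f"{labels.get('instance')} {labels.get('mountpoint')}: {pct_free:.1f}% free")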

πŸ’‘
Learn how to write PromQL queries that actually help you understand your system: Read more.

Advanced Prometheus Techniques Your Team Will Rely On

Now that you've got the basics down, let's dive into some advanced moves that separate the monitoring rookies from the pros.

Python Automation:

Why manually check metrics when you can automate? Here's a Python snippet that fetches and processes Prometheus metrics:

import requests

def query_prometheus(query):
    url = 'http://your-prometheus-server:9090/api/v1/query'
    response = requests.get(url, params={'query': query})
    response.raise_for_status()
    return response.json()['data']['result']

# Get the 5 nodes with the highest memory usage
# (node_exporter exposes MemAvailable/MemTotal; "used" is derived from them)
high_mem_nodes = query_prometheus(
    'topk(5, 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'
)

for node in high_mem_nodes:
    instance = node['metric']['instance']
    usage = float(node['value'][1]) * 100
    print(f"{instance}: {usage:.2f}% memory used")

This Python script creates a reusable function for querying your Prometheus server. It defines a helper that handles the HTTP request and JSON parsing, then uses PromQL's topk() to find the five most memory-hungry nodes in your infrastructure.

For each node, it extracts the instance name and calculates the memory usage percentage, displaying a clean, formatted output. This is perfect for regular checks or as part of a larger monitoring script that could send alerts when memory usage patterns change.

Run this script on a schedule, and you'll always know which servers are memory-hungry before they become a problem.

DIY Alerting:

While Prometheus has its own alerting rules, sometimes you need something more dynamic. You can query the API and implement custom logic:

def check_custom_condition():
    # Query for error rate across services
    error_rates = query_prometheus(
        'sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) '
        '/ sum(rate(http_requests_total[5m])) by (service) > 0.05'
    )

    if error_rates:
        # More than 5% error rate detected
        affected_services = [rate['metric']['service'] for rate in error_rates]
        # send_slack_alert is your own notification helper (sketched below)
        send_slack_alert(f"High error rates detected in: {', '.join(affected_services)}")

This function implements a smarter alerting system than standard threshold-based alerts. It calculates the error rate for each service by dividing the rate of 5xx errors by the total request rate, and only fires when a service crosses a 5% error rate – something that might fly under the radar of alerts watching raw error counts.

When any service exceeds the threshold, it collects the names of all affected services and sends a single consolidated Slack alert. Grouping related issues this way reduces alert fatigue.
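The send_slack_alert helper is left to you. A minimal sketch using a Slack incoming webhook might look like this (the webhook URL is a placeholder you'd generate in your Slack workspace):

import requests

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

def send_slack_alert(message):
    # Slack incoming webhooks accept a JSON payload with a "text" field
    requests.post(SLACK_WEBHOOK_URL, json={'text': message}, timeout=10)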

πŸ’‘
Learn practical tips and strategies for scaling Prometheus efficiently: Read more.

Metric Discovery:

The Prometheus API isn't just for querying metrics – it can help you understand what metrics are available and what they mean:

# Get all label names
curl 'http://your-prometheus-server:9090/api/v1/labels'

# Get all values for the 'job' label
curl 'http://your-prometheus-server:9090/api/v1/label/job/values'

# Get metadata about metrics
curl 'http://your-prometheus-server:9090/api/v1/metadata'

These commands help you explore what metrics exist in your Prometheus system. The first query returns all available label names, which tells you what dimensions you can filter and group by.

The second query lists all values for the 'job' label, showing you every monitored job in your infrastructure. The final query returns metadata about all metrics, including their type (counter, gauge, histogram) and help text.

This metadata exploration is invaluable when you're new to a system or trying to find the right metrics for a specific monitoring need.
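One handy trick worth knowing: metric names are just values of the reserved __name__ label, so you can list every metric through the same label-values endpoint. A small Python sketch (server address is a placeholder):

import requests

BASE = 'http://your-prometheus-server:9090'

# Metric names live under the reserved __name__ label
names = requests.get(f'{BASE}/api/v1/label/__name__/values').json()['data']
print(f'{len(names)} metrics available')

# Metadata: type and help text per metric (showing the first ten here)
metadata = requests.get(f'{BASE}/api/v1/metadata').json()['data']
for name, entries in sorted(metadata.items())[:10]:
    print(f"{name} ({entries[0]['type']}): {entries[0]['help']}")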

How to Connect Prometheus API with Other Tools

The Prometheus API isn't meant to live in isolation – its real value comes from connecting it with your other tools.

Grafana Integration

Sure, Grafana already has a Prometheus data source, but sometimes you need more control. You can use Grafana's HTTP API with Prometheus's API to create dynamic dashboards:

# Create a new Grafana dashboard with Prometheus API data
# (GRAFANA_API_KEY is a service-account token you create in Grafana)
curl -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -d '{
  "dashboard": {
    "title": "Dynamic API Dashboard",
    "panels": [
      {
        "title": "Custom Error Rates",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.01",
            "refId": "A"
          }
        ]
      }
    ]
  },
  "overwrite": true
}' http://grafana:3000/api/dashboards/db

This command uses Grafana's HTTP API to programmatically create a dashboard that visualizes data from Prometheus.

It creates a new dashboard titled "Dynamic API Dashboard" with a single graph panel showing services with error rates above 1% – giving you visibility exactly where you need it.

The overwrite parameter ensures that if a dashboard with the same name already exists, it will be updated rather than creating a duplicate. This approach allows you to version control your dashboards as code and automatically deploy them as part of your infrastructure setup.

CI/CD Pipeline Integration

Make your deployments smarter by integrating Prometheus API checks into your CI/CD pipeline:

#!/bin/bash
# Script to verify service health post-deployment

# Deploy the service
deploy_service

# Wait for service to initialize
sleep 30

# Query error rate for the first 5 minutes after deployment
error_rate=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service="newly-deployed",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="newly-deployed"}[5m])) or vector(0)' \
  | jq -r '.data.result[0].value[1]')

# If error rate exceeds threshold, rollback
if (( $(echo "$error_rate > 0.05" | bc -l) )); then
  echo "New deployment showing high error rate: $error_rate. Rolling back."
  rollback_deployment
  exit 1
else
  echo "Deployment successful! Error rate: $error_rate"
fi

This shell script integrates Prometheus monitoring directly into your deployment pipeline, creating a safety net for your releases. After deploying a service, it waits 30 seconds for initialization, then queries Prometheus for the error rate of the newly deployed service.

The query calculates the ratio of 5xx errors to total requests, with a fallback to zero if there's no data yet. If the error rate exceeds 5%, the script automatically triggers a rollback and exits with an error code, preventing bad deployments from affecting users. The or vector(0) part is particularly clever, ensuring the query returns a valid result even if the service hasn't received any requests yet.

πŸ’‘
Find out how Prometheus Remote Write works and when to use it for long-term storage: Read more.

Avoiding Common API Headaches

Even the most experienced DevOps engineers run into issues with the Prometheus API. Here are some traps to avoid:

Timeout Troubles: When Queries Take Forever

Complex queries over large time ranges can timeout faster than your morning coffee gets cold. Solution? Break them down:

# Instead of this massive query
curl 'http://prometheus:9090/api/v1/query_range?query=sum(rate(http_requests_total[5m]))&start=1609459200&end=1625097600&step=1h'

# Split it into smaller chunks
for start in {1609459200..1625097600..86400}; do
  end=$((start + 86400))
  curl "http://prometheus:9090/api/v1/query_range?query=sum(rate(http_requests_total[5m]))&start=$start&end=$end&step=1h"
  sleep 1
done

This script demonstrates a smart approach to handling large time ranges in Prometheus queries. Rather than requesting six months of data in a single query (which would likely time out), it breaks the request into manageable daily chunks.

The script iterates through the entire time range in 86,400-second (one-day) increments, making a separate API call for each day, with a one-second sleep between requests to avoid overwhelming your Prometheus server. This approach is particularly useful for generating reports or analyzing long-term trends without running into timeout issues.

rate() vs. increase() in PromQL

The rate() and increase() functions in PromQL often trip people up when querying through the API. Here's the cheat code:

  • Use rate() for per-second averages (great for gauging current performance)
  • Use increase() for total increases over time (perfect for billing or usage reports)

# Per-second rate of requests
curl 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])'

# Total number of requests in the last hour
curl 'http://prometheus:9090/api/v1/query?query=increase(http_requests_total[1h])'

These two commands demonstrate the key difference between rate() and increase() in PromQL. The first query calculates the per-second rate of requests over a 5-minute window, giving you the current throughput of your system. This is perfect for real-time dashboards or utilization metrics.

The second query calculates the total increase in the counter over the last hour, showing you the absolute number of requests processed. This is better for billing, usage reports, or capacity planning where you need total counts rather than rates.

High Cardinality Crisis:

High cardinality metrics can make your Prometheus server slower than a Monday morning. When using the API, be mindful of how many time series your queries generate:

# Bad: Creates a time series for every possible status code and path combination
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (status_code, path)'

# Better: Group by service and status code
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"(2|3|4|5).."}[5m])) by (service, status)'

These examples illustrate how to avoid the dreaded cardinality explosion in Prometheus. The first query groups metrics by both status code and path, which could create thousands or even millions of time series if your application has many dynamic URLs.

The improved version uses a regex to match status code ranges (like 2xx, 3xx) and groups by service instead of individual paths. This dramatically reduces the number of time series while still providing actionable information about which services are experiencing errors.

The key takeaway: be mindful of high-cardinality labels when designing queries, especially for frequently used dashboards.
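If you want to see where your cardinality is actually going, recent Prometheus versions expose TSDB statistics through the API. A quick Python sketch (assuming a reasonably current Prometheus; the server address is a placeholder):

import requests

# The TSDB status endpoint reports head-block cardinality statistics
stats = requests.get(
    'http://your-prometheus-server:9090/api/v1/status/tsdb'
).json()['data']

print('Total series in the head block:', stats['headStats']['numSeries'])
print('Biggest offenders by series count:')
for entry in stats['seriesCountByMetricName']:
    print(f"  {entry['name']}: {entry['value']} series")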

πŸ’‘
If you're running into issues with Prometheus, learn how to fix cardinality, resource usage, and storage challenges: Read more here!

How to Build Custom Tools with the Prometheus API

Why settle for off-the-shelf monitoring when you can build custom solutions? Here are some ideas to get those creative juices flowing:

Custom SLO Monitoring

Service Level Objectives (SLOs) are critical, but tracking them can be a pain. This Python script creates a custom SLO tracker:

import requests
import time

def check_error_budget():
    # Calculate error rate over the last 30 days
    end_time = int(time.time())
    start_time = end_time - (30 * 24 * 60 * 60)  # 30 days in seconds
    
    query = 'sum(rate(http_requests_total{status=~"5.."}[1d])) / sum(rate(http_requests_total[1d]))'
    
    url = 'http://prometheus:9090/api/v1/query_range'
    params = {
        'query': query,
        'start': start_time,
        'end': end_time,
        'step': '1d'  # Daily samples
    }
    
    response = requests.get(url, params=params).json()
    values = response['data']['result'][0]['values']
    
    # Calculate average error rate
    error_rates = [float(v[1]) for v in values]
    avg_error_rate = sum(error_rates) / len(error_rates)
    
    # 99.9% availability target means 0.1% error budget
    error_budget = 0.001
    budget_remaining = error_budget - avg_error_rate
    budget_percent = (budget_remaining / error_budget) * 100
    
    return {
        'average_error_rate': avg_error_rate,
        'error_budget_remaining': budget_remaining,
        'error_budget_percent_remaining': budget_percent
    }

result = check_error_budget()
print(f"SLO Status: {result['error_budget_percent_remaining']:.2f}% of error budget remaining")

The above script implements a lightweight SLO monitoring system using the Prometheus API. It:

  • Calculates the error rate over the past 30 days by analyzing the ratio of 5xx errors to total requests, sampled daily.
  • Compares the average error rate to a predefined error budget (0.1%, aligned with 99.9% availability).
  • Outputs the average error rate over the period, the absolute remaining error budget, and the percentage of budget remaining.
  • Helps teams decide when to prioritize reliability vs. feature development.

If most of the error budget is used up, it signals a need to focus on stability before launching new features. Useful for assessing whether a risky deployment is within acceptable reliability limits.
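As a sketch of how you might wire that into a release decision (the 25% cutoff is an arbitrary example, not a standard):

# Gate a risky deploy on the remaining error budget
result = check_error_budget()

if result['error_budget_percent_remaining'] < 25:  # threshold is a team choice
    print('Error budget nearly exhausted - hold the deploy, focus on reliability')
else:
    print('Plenty of budget left - ship it')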

Anomaly Detection System

If you want to get ahead of problems before they become incidents, this simple anomaly detection script uses the Prometheus API to spot unusual patterns:

import time
from datetime import datetime

import numpy as np
import requests

def detect_anomalies(query, threshold=3):
    # Pull the last 24 hours of data at 5-minute resolution
    url = 'http://prometheus:9090/api/v1/query_range'
    params = {
        'query': query,
        'start': int(time.time()) - 86400,
        'end': int(time.time()),
        'step': '5m'
    }

    response = requests.get(url, params=params).json()

    anomalies = []

    for result in response['data']['result']:
        metric = result['metric']
        values = [float(v[1]) for v in result['values']]

        # Calculate Z-scores
        mean = np.mean(values)
        std = np.std(values)

        if std == 0:  # Skip flat metrics
            continue

        z_scores = [(v - mean) / std for v in values]

        # Find anomalies (Z-score exceeds threshold)
        for i, z in enumerate(z_scores):
            if abs(z) > threshold:
                timestamp = result['values'][i][0]
                value = result['values'][i][1]
                anomalies.append({
                    'metric': metric,
                    'timestamp': timestamp,
                    'value': value,
                    'z_score': z
                })

    return anomalies

# Detect unusual CPU spikes

cpu_anomalies = detect_anomalies('node_cpu_seconds_total{mode="system"}')

for anomaly in cpu_anomalies:
    print(
        f"Anomaly detected on {anomaly['metric']['instance']} "
        f"at {datetime.fromtimestamp(anomaly['timestamp'])}"
    )

This script identifies unusual metric values using Z-scores – perfect for spotting that one rogue process eating CPU or that sudden spike in error rates.

πŸ’‘
If you're unsure how the Prometheus rate() function works, this guide breaks it down with examples: Read more.

Where Prometheus API Is Headed Next

The Prometheus ecosystem keeps evolving, and the API is no exception. Here's what to keep an eye on:

  • Exemplars: Newer Prometheus versions support exemplars – specific traces that help you correlate metrics with traces.
  • Remote Write API improvements: Making it even easier to send Prometheus metrics to long-term storage.
  • Federation enhancements: Better ways to query across multiple Prometheus servers.

Your Next Steps with Prometheus API

Whether you're just getting your feet wet with monitoring or looking to push the boundaries of what's possible with Prometheus, the API is your ticket to monitoring nirvana.

Remember:

  • Start simple with basic queries
  • Automate repetitive monitoring tasks
  • Integrate with your existing tools
  • Build custom solutions for your specific needs
πŸ’‘
Join our Discord community to discuss the everyday challenges developers face, share insights, and connect with others who share your interests.

FAQs

What's the difference between Prometheus HTTP API and management API?
The HTTP API is what you'll use most often – it's for querying metrics and metadata. The management API is for managing Prometheus itself, including configuration reloads and taking snapshots. Think of the HTTP API as how you talk to your metrics and the management API as how you talk to Prometheus itself.

Can I use the Prometheus API for long-term storage?
While you can query historical data through the API, Prometheus itself isn't designed for long-term storage. For that, you'll want to use the remote write functionality to send metrics to systems like Thanos, Cortex, or cloud provider solutions that can store years of metrics data.

How do I secure the Prometheus API?
By default, the Prometheus API doesn't include authentication. For production environments, you should place Prometheus behind a reverse proxy like Nginx or use Prometheus's built-in TLS and basic auth configuration. Another approach is using OAuth2 Proxy as a sidecar container if you're running in Kubernetes.

What's the best way to handle rate limits with the API?
Prometheus doesn't have built-in rate limiting for its API, but it can get overwhelmed by too many queries. Implement client-side rate limiting in your applications, consider caching frequent queries, and use federation to distribute query load across multiple Prometheus instances.
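A minimal sketch of client-side protection – a short-lived cache in front of the query endpoint (the TTL and server address are placeholder choices):

import time
import requests

_cache = {}  # query -> (timestamp, result)
CACHE_TTL = 30  # seconds; tune to how fresh your dashboards need to be

def cached_query(query):
    # Serve repeated queries from a short-lived cache to spare the server
    now = time.time()
    if query in _cache and now - _cache[query][0] < CACHE_TTL:
        return _cache[query][1]
    resp = requests.get(
        'http://prometheus:9090/api/v1/query',
        params={'query': query},
        timeout=30,
    )
    result = resp.json()['data']['result']
    _cache[query] = (now, result)
    return result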

How can I debug slow API queries?
Add /debug/pprof to your Prometheus URL to access the profiling endpoints. For query performance specifically, add a stats=1 parameter to your query to get detailed information about how long each step took. For example:
http://prometheus:9090/api/v1/query?query=up&stats=1

Can I use the Prometheus API with other programming languages?
Absolutely! While we showed Python examples, there are official and community-maintained client libraries for most popular languages including Go, Java, Ruby, and Node.js. They all follow similar patterns for querying and parsing results.

How do I handle authentication when querying the API?
If your Prometheus instance is behind basic authentication, you can include the credentials in your API requests. For token-based authentication, add an Authorization header with your token.
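For example, with Python's requests (credentials and token are placeholders):

import requests

url = 'http://prometheus:9090/api/v1/query'

# Basic auth (e.g., behind an Nginx reverse proxy)
resp = requests.get(url, params={'query': 'up'}, auth=('user', 'password'))

# Token-based auth: send a bearer token in the Authorization header
resp = requests.get(
    url,
    params={'query': 'up'},
    headers={'Authorization': 'Bearer YOUR_TOKEN'},
)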


Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
