
Why Your Loki Metrics Are Disappearing (And How to Fix It)

Diagnose missing Loki metrics by fixing recording rule gaps, remote write failures, and high-cardinality issues in production setups.

Jul 29th, '25

Grafana Loki is up and running, log ingestion looks healthy, and dashboards are rendering without issues. But when you query logs from a few weeks ago, the data's missing.

This is a recurring problem for many teams using Loki in production: while the system handles short-term log visibility well, it often lacks the retention guarantees developers expect for historical analysis and incident review.

The Core Issue with Loki Metrics

This pattern shows up in most production setups:

  • Loki is configured with a 30-day log retention window
  • LogQL queries extract metrics directly from log data
  • Dashboards and alerts rely on these queries for SLA monitoring, capacity planning, and trend analysis
  • Everything works, until the logs begin to expire

The underlying problem: logs and metrics have different data lifecycles. When you treat logs as a persistent metrics store, you create a fragile system. Once the 30-day retention limit is hit, queries like the following fail silently:

# Works while logs are fresh
rate({service="api"}[5m])

# Breaks after log expiration
sum(rate({service="api"}[5m])) by (status_code)

As a result:

  • Dashboards return empty graphs
  • Alerts stop triggering
  • Incident timelines lose resolution

What you're left with is a broken observability chain, exactly when you need it most: during historical analysis or postmortems.

💡
If you're new to Loki or setting it up for the first time, this blog on using Loki for log management covers the basics.

The Right Way to Handle Loki Metrics

Loki isn't built to store long-term metrics, and that's by design. Logs are high-cardinality, storage-intensive, and best suited for short-term debugging. Metrics are pre-aggregated and optimized for long-term use.

The production-friendly pattern looks like this:

Logs (30 days) ──► Recording Rules ──► Metrics (1+ years)
     │                 │                    │
     └─ Detailed       └─ Real-time         └─ Aggregated
        Expensive         Extraction           Cost-efficient
        Short-lived       From Logs            Long-term visibility

Use logs for debugging. Convert them into metrics for trend analysis and alerting. Trying to skip that step is what breaks most Loki setups.

Implement Recording Rules with Remote Write

To make Loki metrics durable, you need to extract them from logs while they're still available, then store those metrics in a system designed for long-term retention. This is done through recording rules and remote write.

Step 1: Define Recording Rules

Start by writing recording rules that turn log data into time-series metrics. These rules run continuously, evaluating LogQL expressions and generating named metrics you can use in dashboards and alerts.

# loki-rules.yml
groups:
  - name: api_metrics
    rules:
      - record: api:request_rate
        expr: sum(rate({service="api"}[1m])) by (status_code, method)
      
      - record: api:error_rate
        expr: |
          sum(rate({service="api", level="error"}[1m])) 
          / 
          sum(rate({service="api"}[1m]))
      
      - record: api:response_time_p99
        expr: |
          quantile_over_time(0.99, 
            {service="api"} 
            | json 
            | __error__ = "" 
            | unwrap response_time [1m]
          )
  • The request_rate rule captures request throughput, broken down by status code and HTTP method.
  • The error_rate rule calculates the fraction of error logs relative to all requests, giving you a normalized error rate.
  • The response_time_p99 rule computes the 99th percentile latency from structured log fields, making it ideal for tracking performance regressions.

These rules create new time series that outlive the raw log data they're derived from. Once evaluated, the metrics can be stored, queried, and alerted on just like native Prometheus metrics.
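
Once a recorded series like api:request_rate lands in a Prometheus-compatible backend, it behaves like any other metric. As a quick sketch (standard PromQL, assuming the rules above are active and shipping data):

# Requests per second, split by status code
sum(api:request_rate) by (status_code)

# Week-over-week comparison - impossible with raw logs once they expire
sum(api:request_rate) - sum(api:request_rate offset 1w)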

Step 2: Configure Remote Write to Last9

Once you've defined your recording rules, the next step is to persist the generated metrics. Loki's ruler service handles this, pushing the generated series to any supported remote write backend: Last9, Prometheus, or another time-series database.

Prerequisites: To push to Last9, create a Last9 cluster and gather these details:

  • $levitate_remote_write_url - Last9's remote write endpoint
  • $levitate_remote_write_username - Your cluster ID
  • $levitate_remote_write_password - Write token for the cluster
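
The examples below assume these placeholders are exported as environment variables. The values here are illustrative; substitute the actual details from your Last9 cluster settings:

# Illustrative placeholders - replace with values from your Last9 cluster
export levitate_remote_write_url="https://<your-last9-endpoint>"
export levitate_remote_write_username="<your-cluster-id>"
export levitate_remote_write_password="<your-write-token>"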

Configuration: Update your existing Loki configuration to include the remote write config under the ruler section:

# loki-config.yml
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules

  rule_path: /tmp/loki/rules
  alertmanager_url: http://alertmanager:9093

  ring:
    kvstore:
      store: inmemory

  enable_api: true

  remote_write:
    enabled: true
    client:
      url: "$levitate_remote_write_url"
      basic_auth:
        username: "$levitate_remote_write_username"
        password: "$levitate_remote_write_password"
  • rule_path and storage point to where your recording rules live on disk.
  • remote_write.client.url uses Last9's remote write endpoint.
  • basic_auth handles authentication using your cluster ID and write token.
  • enable_api allows external inspection and debugging of rules via the Loki HTTP API.

This setup ensures your derived metrics are not only calculated but also preserved in Last9, completely decoupled from your log retention policy.
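
One deployment detail worth calling out: the rules directory has to exist inside the container at the path the config references. A minimal Docker Compose sketch (image version and host paths are assumptions; with multi-tenancy disabled, Loki typically expects rule files under a tenant subdirectory such as fake):

# docker-compose.yml (sketch, not a complete production setup)
services:
  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/config.yml
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/config.yml
      - ./rules:/etc/loki/rules/fake   # loki-rules.yml lives in ./rules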

Step 3: Restart and Verify

After setting up recording rules and remote write, restart Loki so the ruler service picks up the configuration.

Use the following commands to validate that everything is working:

# Confirm that rules are loaded and active
curl http://localhost:3100/ruler/api/v1/rules

# Check if metrics are being written to Last9
curl -u "$levitate_remote_write_username:$levitate_remote_write_password" \
  "$levitate_remote_write_url/api/v1/query?query=api:request_rate"
  • The first command lists all active rule groups and rules currently being evaluated by the ruler.
  • The second checks if your newly defined api:request_rate metric is present and populated in Last9.

If the queries return results, your pipeline is working, and your metrics will continue to exist long after your logs have been deleted. You can then explore your metric data using Last9's embedded Grafana by querying for the recording rules directly.
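
To confirm that history is actually accumulating rather than a single data point, a range query against the same Prometheus-compatible API works too (the window and step are arbitrary; date -d assumes GNU coreutils):

# Last 6 hours of api:request_rate at 5-minute resolution
curl -u "$levitate_remote_write_username:$levitate_remote_write_password" \
  -G "$levitate_remote_write_url/api/v1/query_range" \
  --data-urlencode 'query=sum(api:request_rate)' \
  --data-urlencode "start=$(date -u -d '6 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=300'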

Why Use Last9 for Loki Remote Write

Loki-derived metrics often overwhelm traditional backends. Last9 is purpose-built to handle high-cardinality workloads with better defaults and more control.

Built for High Cardinality

Loki logs frequently contain labels like user_id, trace_id, and other dynamic fields that inflate series counts. Last9 supports 20 million unique series per metric per day by default, and gives you two ways to keep that number in check:

Analyze and Fix Cardinality Issues

Use Cardinality Explorer to:

  • Identify metrics with excessive series
  • Pinpoint the exact label combinations causing growth
  • Track cardinality trends across time

Runtime Controls with Streaming Aggregations

Streaming Aggregations let you:

  • Drop high-cardinality labels like trace_id on the fly
  • Generate scoped metrics with fewer dimensions
  • Prevent blow-ups without editing upstream pipelines

This happens during ingestion; no code changes needed.
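
Semantically, dropping a label at ingest is equivalent to re-aggregating the series without it. Expressed as PromQL for illustration only (Last9's streaming aggregations are configured in the product, not written as rules like this):

# Conceptually: collapse every trace_id/user_id variant into one series per remaining label set
sum without (trace_id, user_id) (api:request_rate)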

Simple Integration

  • Add Last9’s remote write endpoint to your Loki config
  • No exporters, agents, or custom setup
  • Recording rules continue to work as-is

Embedded Grafana

Last9 includes an embedded Grafana interface, so you can keep using your existing dashboards and start exploring metrics as soon as they're ingested; no extra setup required.

💡
For a broader comparison of Loki and Prometheus and how they handle metrics differently, see this breakdown.

What Changes Once You Switch to Persistent Metrics

Recording rules and remote write fundamentally shift how you interact with Loki.

Before: Fragile Log-Driven Queries

In the original setup, dashboards and alerts rely directly on raw log data:

sum(rate({service="api"}[5m])) by (status_code)
quantile_over_time(0.99, {service="api"} | json | unwrap response_time [5m])

These queries work only as long as the logs exist. Once the 30-day retention window expires, they return no data.

Operational issues in this model:

  • Queries silently fail after log expiry
  • No way to analyze long-term trends
  • Dashboards show gaps or errors
  • Alerts stop triggering, even if traffic or error patterns persist
  • Capacity planning and SLA reporting become unreliable

After: Durable Metric-Based Queries

Once you extract metrics and push them to a long-term backend, you can rewrite dashboard queries to use the recorded series:

api:request_rate
api:response_time_p99

These are lightweight, low-cost time-series that persist independently of the original logs.

Advantages of this model:

  • Metrics remain available even after log expiration
  • Dashboards stay consistent with stable query performance
  • Historical analysis is accurate, enabling long-term trend visibility
  • Alerts continue to fire reliably, based on pre-aggregated time-series
  • Planning and capacity reports can use real data from months ago

This shift doesn't just solve broken queries; it decouples your operational metrics from the limitations of log storage. You get faster queries, more resilient observability, and metrics that survive beyond 30 days.
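
As a concrete example of that resilience, here's a sketch of an alert defined on the recorded series, written as a standard Prometheus alerting rule (the 5% threshold and 10m window are illustrative):

# alert-rules.yml (sketch)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: api:error_rate > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5% for 10 minutes"

Because api:error_rate is pre-aggregated and long-lived, this alert keeps evaluating correctly long after the logs that fed it have expired.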

Common Pitfalls with Recording Rules

Even with remote write and recording rules in place, Loki setups can run into a few recurring issues. Here’s how to spot and fix them without breaking your flow.

1. Expensive Recording Rules Slowing Down Ruler Nodes
When the LogQL expressions in your recording rules get heavy, especially with regex, label parsing, or wide fan-out, Loki's ruler can start consuming a lot of CPU and memory. Rather than beginning with a fully detailed query, start small and build it up.

# Start small
- record: api:request_rate
  expr: sum(rate({service="api"}[1m]))

# Then increase resolution only if needed
- record: api:request_rate_by_endpoint
  expr: sum(rate({service="api"}[1m])) by (endpoint)

These lightweight patterns are easier to validate and cheaper to run, and they help you surface trends gradually without overwhelming the ruler.

2. Label Explosion from Unscoped Aggregations
Using high-cardinality labels like user_id or request_id in a by() clause isn’t bad in itself, but Loki’s ruler will still have to crunch through those labels on evaluation, and that’s where it can get resource-intensive.

If your rule is for alerting or high-level trend analysis, skip the noise and stick to stable fields like method, status_code, or endpoint.

# High-cardinality labels (works, but expensive)
sum(rate({service="api"}[1m])) by (user_id, request_id)

# More efficient when you're just watching system behavior
sum(rate({service="api"}[1m])) by (status_code, method)

3. Missing Data Points After Ruler Restart
Loki’s ruler runs your recording rules at fixed intervals (e.g., every 30s). If it crashes or restarts mid-cycle, some evaluations might get skipped, causing gaps. This isn’t a bug; it’s just how scheduling works.

To reduce the impact, choose a shorter interval and monitor uptime closely.

groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: api:request_rate
        expr: sum(rate({service="api"}[1m]))

If you're running this in production, ensure your ruler setup is resilient. Keep it on a stable node, wire in health checks, and avoid accidental restarts during deployments.
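
Loki ships a readiness endpoint that's well suited for those health checks. A minimal sketch for a Docker Compose deployment (intervals are illustrative; the test assumes an image that includes wget, as the Alpine-based Loki images do):

# Health check against Loki's built-in readiness endpoint
services:
  loki:
    # ...rest of the service definition...
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3100/ready || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3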

When Metrics Don't Show Up

When metrics vanish or remote write stops working, start here:

Inspect remote write connectivity
Check credentials, endpoint availability, and encoding:

# -n prevents a trailing newline from corrupting the encoded credentials
echo -n "$levitate_remote_write_username:$levitate_remote_write_password" | base64
curl -u "$levitate_remote_write_username:$levitate_remote_write_password" \
  "$levitate_remote_write_url/api/v1/labels"

Validate rule syntax
Invalid YAML or malformed expressions will silently break evaluations. Note that promtool validates PromQL, not LogQL, so it's most useful for catching YAML and structural errors; LogQL-specific syntax such as unwrap may be flagged as invalid (Grafana's cortextool can lint Loki rules natively).

promtool check rules /etc/loki/rules/*.yml

Check the rule file loading
Ensure rule files are mounted correctly:

ls -la /etc/loki/rules/

Verify rule execution
Hit the ruler endpoint to see if your rules are even being picked up:

curl http://localhost:3100/ruler/api/v1/rules

Confirm remote write is set up correctly
Look at your Loki config file and verify the remote_write section under ruler.

grep -A 5 "remote_write" /etc/loki/config.yml

Check for evaluation errors
Look through your ruler logs for any errors during evaluation.

docker logs loki-container | grep ruler

If it still fails, simplify your rule, test locally, and reintroduce complexity incrementally. You can always widen your scope once the basics are flowing into your metrics backend.
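
logcli, Loki's own CLI, is useful for that local testing loop; you can run a rule's exact expression against a live Loki before committing it (the address assumes a local instance):

# Evaluate the rule's expression directly, outside the ruler
logcli --addr=http://localhost:3100 query 'sum(rate({service="api"}[1m]))'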

💡
Now fix missing Loki metrics faster by bringing production context (logs, metrics, and traces) into your IDE with AI and Last9 MCP.

Final Thoughts

Once you have persistent metrics, you can:

  1. Build reliable alerting that doesn't break when logs expire
  2. Create historical dashboards for capacity planning and trend analysis
  3. Implement SLA tracking with consistent long-term data
  4. Set up automated scaling based on historical patterns

💡
And if you’d like to go deeper or troubleshoot a specific use case, join our Discord; there’s a dedicated channel where you can share implementation details with other developers and debug issues.
