Grafana Loki is up and running, log ingestion looks healthy, and dashboards are rendering without issues. But when you query logs from a few weeks ago, the data's missing.
This is a recurring problem for many teams using Loki in production: while the system handles short-term log visibility well, it often lacks the retention guarantees developers expect for historical analysis and incident review.
The Core Issue with Loki Metrics
In most production setups, this is a common pattern:
- Loki is configured with a 30-day log retention window
- LogQL queries are used to extract metrics directly from log data
- Dashboards and alerts rely on these queries for SLA monitoring, capacity planning, and trend analysis
- Everything works until logs begin to expire.
The underlying problem: logs and metrics have different data lifecycles. When you treat logs as a persistent metrics store, you create a fragile system. Once the 30-day retention limit is hit, queries like the following fail silently:
# Works while logs are within retention
rate({service="api"}[5m])

# Silently returns nothing after the logs expire
sum(rate({service="api"}[5m])) by (status_code)
As a result:
- Dashboards return empty graphs
- Alerts stop triggering
- Incident timelines lose resolution
What you're left with is a broken observability chain, right when you need it most: during historical analysis and postmortems.
The Right Way to Handle Loki Metrics
Loki isn't built to store long-term metrics, and that's by design. Logs are high-cardinality, storage-intensive, and best suited for short-term debugging. Metrics are pre-aggregated and optimized for long-term use.
The production-friendly pattern looks like this:
Logs (30 days) ──► Recording Rules ──► Metrics (1+ years)
     │                   │                   │
     └─ Detailed         └─ Real-time        └─ Aggregated
        Expensive           extraction          Cost-efficient
        Short-lived         from logs           Long-term visibility
Use logs for debugging. Convert them into metrics for trend analysis and alerting. Trying to skip that step is what breaks most Loki setups.
Implement Recording Rules with Remote Write
To make Loki metrics durable, you need to extract them from logs while they're still available, then store those metrics in a system designed for long-term retention. This is done through recording rules and remote write.
Step 1: Define Recording Rules
Start by writing recording rules that turn log data into time-series metrics. These rules run continuously, evaluating LogQL expressions and generating named metrics you can use in dashboards and alerts.
# loki-rules.yml
groups:
  - name: api_metrics
    rules:
      - record: api:request_rate
        expr: sum(rate({service="api"}[1m])) by (status_code, method)
      - record: api:error_rate
        expr: |
          sum(rate({service="api", level="error"}[1m]))
          /
          sum(rate({service="api"}[1m]))
      - record: api:response_time_p99
        expr: |
          quantile_over_time(0.99,
            {service="api"}
            | json
            | __error__ = ""
            | unwrap response_time [1m]
          )
- The request_rate rule captures request throughput, broken down by status code and HTTP method.
- The error_rate rule calculates the fraction of error logs relative to all requests, giving you a normalized error rate.
- The response_time_p99 rule computes the 99th percentile latency from structured log fields, which is ideal for tracking performance regressions.
These rules create new time series that outlive the raw log data they're derived from. Once evaluated, the metrics can be stored, queried, and alerted on just like native Prometheus metrics.
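For instance, once api:error_rate is flowing into the metrics backend, an ordinary Prometheus-style alerting rule can fire on it. A minimal sketch, assuming a 5% threshold and standard Alertmanager routing (both are placeholders, not values from this setup):

# alert-rules.yml (evaluated by the metrics backend, not by Loki)
groups:
  - name: api_alerts
    rules:
      - alert: HighApiErrorRate
        # api:error_rate is the series produced by the Loki recording rule above
        expr: api:error_rate > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API error rate has stayed above 5% for 5 minutes"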
Step 2: Configure Remote Write to Last9
Once you've defined your recording rules, the next step is to persist the generated metrics. Loki’s ruler service supports this by pushing them to any supported remote write backend, such as Last9, Prometheus, or other time-series databases.
Prerequisites: To push to Last9, create a Last9 cluster and gather these details:
- $levitate_remote_write_url - Last9's remote write endpoint
- $levitate_remote_write_username - Your cluster ID
- $levitate_remote_write_password - Write token for the cluster
Configuration: Update your existing Loki configuration to include the remote write config under the ruler section:
# loki-config.yml
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  rule_path: /tmp/loki/rules
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  remote_write:
    enabled: true
    client:
      url: "$levitate_remote_write_url"
      basic_auth:
        username: "$levitate_remote_write_username"
        password: "$levitate_remote_write_password"
- rule_path and storage point to where your recording rules live on disk.
- remote_write.client.url uses Last9's remote write endpoint.
- basic_auth handles authentication using your cluster ID and write token.
- enable_api allows external inspection and debugging of rules via the Loki HTTP API.
This setup ensures your derived metrics are not only calculated but also preserved in Last9, completely decoupled from your log retention policy.
Step 3: Restart and Verify
After setting up recording rules and remote write, restart Loki so the ruler service picks up the configuration.
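How you restart depends on how Loki is deployed; two common cases (the service and container names here are assumptions, adjust to your environment):

# Docker Compose deployment
docker compose restart loki

# systemd-managed deployment
sudo systemctl restart loki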
Use the following commands to validate that everything is working:
# Confirm that rules are loaded and active
curl http://localhost:3100/ruler/api/v1/rules
# Check if metrics are being written to Last9
curl -u "$levitate_remote_write_username:$levitate_remote_write_password" \
"$levitate_remote_write_url/api/v1/query?query=api:request_rate"
- The first command lists all active rule groups and rules currently being evaluated by the ruler.
- The second checks whether your newly defined api:request_rate metric is present and populated in Last9.
If the queries return results, your pipeline is working, and your metrics will continue to exist long after your logs have been deleted. You can then explore your metric data using Last9's embedded Grafana by querying for the recording rules directly.
Why Use Last9 for Loki Remote Write
Loki-derived metrics often overwhelm traditional backends. Last9 is purpose-built to handle high-cardinality workloads with better defaults and more control.
Built for High Cardinality
Loki logs frequently contain labels like user_id, trace_id, and other dynamic fields that inflate series counts. Last9 supports 20 million unique series per metric per day by default. You can:
Analyze and Fix Cardinality Issues
Use Cardinality Explorer to:
- Identify metrics with excessive series
- Pinpoint the exact label combinations causing growth
- Track cardinality trends across time
Runtime Controls with Streaming Aggregations
Streaming Aggregations let you:
- Drop high-cardinality labels like trace_id on the fly
- Generate scoped metrics with fewer dimensions
- Prevent blow-ups without editing upstream pipelines
This happens during ingestion; no code changes needed.
Simple Integration
- Add Last9’s remote write endpoint to your Loki config
- No exporters, agents, or custom setup
- Recording rules continue to work as-is
Embedded Grafana
Last9 includes an embedded Grafana interface, so you can keep using your existing dashboards and start exploring metrics as soon as they're ingested, with no extra setup required.
What Changes Once You Switch to Persistent Metrics
Recording rules and remote write fundamentally shift how you interact with Loki.
Before: Fragile Log-Driven Queries
In the original setup, dashboards and alerts rely directly on raw log data:
sum(rate({service="api"}[5m])) by (status_code)
quantile_over_time(0.99, {service="api"} | json | unwrap response_time [5m])
These queries work only as long as the logs exist. Once the 30-day retention window expires, they return no data.
Operational issues in this model:
- Queries silently fail after log expiry
- No way to analyze long-term trends
- Dashboards show gaps or errors
- Alerts stop triggering, even if traffic or error patterns persist
- Capacity planning and SLA reporting become unreliable
After: Durable Metric-Based Queries
Once you extract metrics and push them to a long-term backend, you can rewrite dashboard queries to use the recorded series:
api:request_rate
api:response_time_p99
These are lightweight, low-cost time-series that persist independently of the original logs.
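For example, a dashboard panel that previously aggregated raw logs can point at the recorded series instead. A sketch of the swap, with the new query running as PromQL against the metrics backend:

# Before: LogQL over raw logs (empty once the logs expire)
sum(rate({service="api"}[5m])) by (status_code)

# After: PromQL over the recorded series (persists in Last9)
sum(api:request_rate) by (status_code)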
Advantages of this model:
- Metrics remain available even after log expiration
- Dashboards stay consistent with stable query performance
- Historical analysis is accurate, enabling long-term trend visibility
- Alerts continue to fire reliably, based on pre-aggregated time-series
- Planning and capacity reports can use real data from months ago
This shift doesn't just solve broken queries; it decouples your operational metrics from the limitations of log storage. You get faster queries, more resilient observability, and metrics that survive beyond 30 days.
Common Pitfalls with Recording Rules
Even with remote write and recording rules in place, Loki setups can run into a few recurring issues. Here’s how to spot and fix them without breaking your flow.
1. Expensive Recording Rules Slowing Down Ruler Nodes
When LogQL expressions in your recording rules get heavy, especially with regex, label parsing, or wide fan-out, Loki’s ruler can start consuming a lot of CPU and memory. Instead of starting with a fully detailed query, start small and build it up.
# Start small
- record: api:request_rate
  expr: sum(rate({service="api"}[1m]))

# Then increase resolution only if needed
- record: api:request_rate_by_endpoint
  expr: sum(rate({service="api"}[1m])) by (endpoint)
These lightweight patterns are easier to validate and cheaper to run. They also help you gradually surface trends, without overwhelming the ruler.
2. Label Explosion from Unscoped Aggregations
Using high-cardinality labels like user_id or request_id in a by() clause isn't bad in itself, but Loki's ruler still has to crunch through those labels on every evaluation, and that's where it gets resource-intensive.
If your rule is for alerting or high-level trend analysis, skip the noise and stick to stable fields like method, status_code, or endpoint.
# High-cardinality labels (works, but expensive)
sum(rate({service="api"}[1m])) by (user_id, request_id)
# More efficient when you're just watching system behavior
sum(rate({service="api"}[1m])) by (status_code, method)
3. Missing Data Points After Ruler Restart
Loki’s ruler runs your recording rules at fixed intervals (e.g., every 30s). If it crashes or restarts mid-cycle, some evaluations might get skipped, causing gaps. This isn’t a bug; it’s just how scheduling works.
To reduce the impact, choose a shorter interval and monitor uptime closely.
groups:
  - name: api_metrics
    interval: 30s
    rules:
      - record: api:request_rate
        expr: sum(rate({service="api"}[1m]))
If you're running this in production, ensure your ruler setup is resilient. Keep it on a stable node, wire in health checks, and avoid accidental restarts during deployments.
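Loki exposes a readiness endpoint on its HTTP port that you can wire into those health checks; a minimal probe, assuming the default port 3100:

# Returns HTTP 200 once this Loki instance reports ready
curl -f http://localhost:3100/ready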
When Metrics Don't Show Up
When metrics vanish or remote write stops working, start here:
Inspect remote write connectivity
Check credentials, endpoint availability, and encoding:
# -n avoids a trailing newline corrupting the encoded credentials
echo -n "$levitate_remote_write_username:$levitate_remote_write_password" | base64
curl -u "$levitate_remote_write_username:$levitate_remote_write_password" \
"$levitate_remote_write_url/api/v1/labels"
Validate rule syntax
Invalid YAML or malformed expressions will silently break evaluations. Note that promtool parses expressions as PromQL, so it may flag valid LogQL; use it mainly to catch YAML and structural mistakes, and treat the ruler API check below as the source of truth.
promtool check rules /etc/loki/rules/*.yml
Check rule file loading
Ensure rule files are mounted correctly:
ls -la /etc/loki/rules/
Verify rule execution
Hit the ruler endpoint to see if your rules are even being picked up:
curl http://localhost:3100/ruler/api/v1/rules
Confirm remote write is set up correctly
Look at your Loki config file and verify the remote_write block under the ruler section.
grep -A 5 "remote_write" /etc/loki/config.yml
Check for evaluation errors
Look through your ruler logs for any errors during evaluation.
docker logs loki-container | grep ruler
If it still fails, simplify your rule, test locally, and reintroduce complexity incrementally. You can always widen your scope once the basics are flowing into your metrics backend.
Final Thoughts
Once you have persistent metrics, you can:
- Build reliable alerting that doesn't break when logs expire
- Create historical dashboards for capacity planning and trend analysis
- Implement SLA tracking with consistent long-term data, as sketched below
- Set up automated scaling based on historical patterns
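For the SLA tracking piece, one way it could look: an ordinary Prometheus recording rule in the metrics backend, layered on top of the durable series. A hedged sketch (the rule name and 30-day window are illustrative):

# slo-rules.yml (evaluated against the long-term metrics, not the logs)
groups:
  - name: api_slo
    rules:
      - record: api:availability_30d
        # 1 minus the average error rate over a rolling 30-day window
        expr: 1 - avg_over_time(api:error_rate[30d])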