Streaming Aggregation

[Figure: Control Plane pipeline sequence for Streaming Aggregations, showing the order of Last9’s pipeline processing.]

Streaming Aggregation transforms incoming telemetry into metrics before storage, improving query performance and cutting storage costs. It’s especially effective for managing high-cardinality time series.

For an in-depth explanation of high-cardinality challenges and the underlying concepts, see our guide on Streaming Aggregation.

Common Use Cases

Reducing cardinality while preserving information

Problem: You have a metric with high-cardinality labels (pod, pod_name) alongside lower-cardinality labels (instance, country, service).

Solution: Create two streaming aggregations:

One that drops high-cardinality labels but keeps lower-cardinality ones:

- promql: "sum by (instance, country, service) (my_metric{}[2m])"
  as: my_metric_by_location_2m

Another that preserves only essential high-cardinality information:

- promql: "sum by (pod, pod_name, service) (my_metric{}[2m])"
  as: my_metric_by_pod_2m

This approach reduces cardinality while allowing you to correlate information between the two aggregated metrics.
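
As a rough sketch of how the two outputs can be used together, the queries below rely on service, the one label both rules keep. The label value checkout is a made-up example, and the sketch assumes every source series carries all of the labels listed in the rules.

```
# Per-service value from the low-cardinality rollup
sum by (service) (my_metric_by_location_2m)

# The same per-service value from the pod-level rollup; the two should line up
sum by (service) (my_metric_by_pod_2m)

# Drill into one service to see which pods contribute (hypothetical label value)
my_metric_by_pod_2m{service="checkout"}
```

If the first two queries disagree for a service, that is a hint that some source series are missing one of the grouped labels.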

Transforming Logs to Metrics (Last9 LogMetrics)

Last9 LogMetrics turns your logs into metrics. Build them with the log query builder in the setup form below, or start from Logs Explorer; then alert on the resulting metric.

Service error rates: Count error logs per service to spot problematic services.
- Filter: level=error
- Aggregate: count as _count
- Group by: service
- Timeslice: 5 minutes
API response times: Average response time per endpoint and status code to find slow or failing endpoints.
- Filter: component=api
- Aggregate: avg(response_time) as avg_response_time
- Group by: endpoint, status_code
- Timeslice: 1 minute
Service unavailability: Track log volume from critical services — a sharp drop can signal an outage.
- Filter: service in ("payment", "authentication", "database")
- Aggregate: count as _count
- Group by: service
- Timeslice: 10 minutes

Transforming Traces to Metrics (Last9 TraceMetrics)

Last9 TraceMetrics turns your traces into metrics — latency, errors, throughput — without querying raw spans. Build them with the trace query builder in the setup form below, or start from Traces Explorer; then alert on the resulting metric.

Service latency percentiles: Track P99 latency per service to spot slow services.
- Filter: span_kind=SPAN_KIND_SERVER
- Aggregate: p99(duration) as p99_latency
- Group by: service_name
- Timeslice: 1 minute
Service error rates: Count error spans per service to drive error-rate alerts.
- Filter: status_code=STATUS_CODE_ERROR
- Aggregate: count as _count
- Group by: service_name
- Timeslice: 5 minutes
Endpoint throughput: Count requests per endpoint to track traffic.
- Aggregate: count as _count
- Group by: service_name, span_name
- Timeslice: 1 minute

Rolling up data over time

Use streaming aggregation to roll up data over longer time windows:

- promql: "sum by (service, endpoint) (api_calls_total{}[5m])"
  as: api_calls_total_5m

This creates a 5-minute rolled-up version of your metric, which is useful for longer-term trend analysis while reducing storage requirements.

Getting Started

Last9 offers two main approaches to set up streaming aggregations:

Option 1: Using the UI

Navigate to Control Plane → Streaming Aggregation
Click + NEW RULE to open the rule creation form
Choose your Telemetry source:
- Metrics: For metric-based aggregation
- Events: For event-based aggregation
- Logs: For creating metrics from logs (Last9 LogMetrics)
- Traces: For creating metrics from traces (Last9 TraceMetrics)
For Metrics & Events:
- Enter the metric name, resolution, and aggregation function
- Choose labels to include (With) or exclude (Without)
- Set the output metric name and rule name
For Logs (Last9 LogMetrics):
- Use the Builder or LogQL editor to create your query
- Set filter conditions, aggregate functions, and group by dimensions
- Define your evaluation frequency (timeslice); in Editor mode it is inherited automatically from the query
- Set the output metric name and rule name
  
  Currently, only logs from the default index are supported. Support for Physical Indexes is coming soon.
You can also create LogMetrics directly from Logs Explorer:
- Create and run a query in Logs Explorer
- Click the Create Metric button next to your visualization
- This opens a pre-filled streaming aggregation form with your query
- Verify the preview looks correct
- Set the output metric name and rule name
For Traces (Last9 TraceMetrics):
- Use the Builder to add FILTER and AGGREGATE stages
- Define your evaluation frequency (timeslice)
- Set the output metric name and rule name
You can also create TraceMetrics directly from Traces Explorer: build a query in the Query Builder and click Create Metric to open a pre-filled streaming aggregation form.
Click SAVE to activate your streaming aggregation rule

Option 2: GitOps Workflow

For teams who prefer infrastructure-as-code approaches:

Request enabling GitOps workflow

Reach out to cs@last9.io, or to us on your shared Slack/Teams channel, to be switched over to the GitOps workflow for Streaming Aggregation.

Define Your Aggregation PromQL

Identify the specific metrics you want to aggregate. Using the Explore tab in embedded Grafana, create a PromQL query that defines how you want to aggregate your data.

For example, to aggregate HTTP request durations by stack:
```
sum by (stack) (http_requests_duration_seconds_count{service="pushnotifs"}[1m])
```
This query reduces cardinality by grouping data by the stack label, making it more manageable and queryable.

Configure the Aggregation Rule

Add your aggregation rule to the YAML file for your Last9 cluster. The basic syntax is:
```
- promql: 'sum by (stack, le) (http_requests_duration_seconds_bucket{service="pushnotifs"}[2m])'
  as: pushnotifs_http_requests_duration:2m
```
This configuration:
- Takes the metric http_requests_duration_seconds_bucket filtered for the pushnotifs service
- Aggregates it over a 2-minute window
- Groups by stack and le (latency buckets)
- Creates a new metric named pushnotifs_http_requests_duration:2m

Deploy Using GitOps Workflow
1. Create a Pull Request with your updated rules to the GitHub repository
2. Wait for CI Tests to validate your streaming aggregation syntax
3. Merge the Pull Request to activate the pipeline in Last9
4. Query the New Metric in your Last9 cluster

Histogram Aggregation for Percentiles

Histograms power accurate percentiles (like p95 latency), and they rely on three related metrics: <metric_name>_bucket, <metric_name>_sum, and <metric_name>_count. To aggregate a histogram correctly, define all three:

- promql: 'sum2 by (stack, le) (http_requests_duration_seconds_bucket{service="pushnotifs"}[2m])'
  as: pushnotifs_http_requests_duration_seconds_bucket
- promql: 'sum2 by (stack) (http_requests_duration_seconds_sum{service="pushnotifs"}[2m])'
  as: pushnotifs_http_requests_duration_seconds_sum
- promql: 'sum2 by (stack) (http_requests_duration_seconds_count{service="pushnotifs"}[2m])'
  as: pushnotifs_http_requests_duration_seconds_count

You can then use histogram_quantile functions on the aggregated metrics:

[Figure: histogram_quantile functions on a stream-aggregated metric]
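
A minimal sketch of such queries against the three aggregated metrics defined above. The 0.95 quantile and the [5m] rate window are illustrative assumptions; pick values that suit your timeslice and dashboards.

```
# p95 latency per stack, computed from the aggregated buckets
histogram_quantile(
  0.95,
  sum by (stack, le) (rate(pushnotifs_http_requests_duration_seconds_bucket[5m]))
)

# Average request duration per stack, from the aggregated _sum and _count
  sum by (stack) (rate(pushnotifs_http_requests_duration_seconds_sum[5m]))
/
  sum by (stack) (rate(pushnotifs_http_requests_duration_seconds_count[5m]))
```

The le label has to survive the aggregation rule for the first query to work, which is why the _bucket rule groups by (stack, le) while the _sum and _count rules group by stack alone.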

Querying LogMetric Outputs

Metrics produced by LogMetrics and TraceMetrics (and other count-style streaming aggregations) are per-window counts, not cumulative counters. Each sample’s value is the count of matching events in that single timeslice (the evaluation window configured on the rule), so a series like 1, 1, 2, 0, 1 represents absolute per-timeslice counts — it does not monotonically increase the way a Prometheus counter does.

This changes which PromQL functions you should use to query them:

| Goal | Use this | Avoid |
| --- | --- | --- |
| Count events in a window | `sum_over_time(my_metric[5m])` | `increase(my_metric[5m])` |
| Per-second event rate | `sum_over_time(my_metric[5m]) / 300` | `rate(my_metric[5m])` |
| Alert when any event occurred recently | `sum_over_time(my_metric[5m]) > 0` | `increase(my_metric[5m]) > 0` |
| Visualize sparse events over a long range | `max_over_time(my_metric[$__interval])` | `rate(my_metric[5m])` |

rate() and increase() assume a monotonically increasing counter and will return empty or misleading values on these metrics. The effect is especially pronounced for sparse events (for example, circuit-breaker transitions or rare errors) where consecutive samples can be hours or days apart and rarely fall within the same [5m] lookback.
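
To make the arithmetic concrete, here is a small worked sketch using the example series 1, 1, 2, 0, 1. It assumes a 1-minute timeslice and that exactly those five samples fall inside the [5m] lookback; my_metric is a placeholder name.

```
# Assumed samples for my_metric, one per 1-minute timeslice: 1, 1, 2, 0, 1

# Events in the last 5 minutes: 1 + 1 + 2 + 0 + 1 = 5
sum_over_time(my_metric[5m])

# Per-second event rate over the same window: 5 / 300 ≈ 0.0167
sum_over_time(my_metric[5m]) / 300

# Largest single-timeslice count in the window: 2
max_over_time(my_metric[5m])
```

The divisor 300 is the lookback in seconds, so it should change whenever the range in brackets changes.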

The per-window-count caveat applies only to direct count outputs. Histogram aggregations (_bucket, _sum, _count with sum2) retain counter semantics, so rate() and histogram_quantile() work as expected on those.

Supported Functions

For Metrics

The following aggregation functions are available for metric-based Streaming Aggregation:

sum: Total of the sample values; use for metric types other than counters
max: Maximum value of the samples
sum2: Sum for counters, with awareness of counter resets
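
As a hedged illustration of choosing between these functions, the expressions below follow the rule syntax used elsewhere on this page and would go in the promql field of a rule. The metric names http_requests_total, active_connections, and queue_depth are hypothetical.

```
# Counter: use sum2 so counter resets are handled
sum2 by (service) (http_requests_total{}[2m])

# Gauge: sum totals the sample values across the grouped series
sum by (service) (active_connections{}[2m])

# Gauge: max keeps the largest sample value per group
max by (service) (queue_depth{}[2m])
```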

For Logs

For log-based aggregations, see the supported functions in the Logs Query Builder Aggregate Stage documentation. Note that only one aggregate function is allowed per query.

For Traces

For trace-based aggregations, see the supported functions in the Traces Query Builder Aggregate Stage documentation.

Troubleshooting

Streaming Aggregation Not Appearing
- Check that your rule was saved properly or that your PR was merged successfully
- Verify the syntax of your PromQL or log query
- Check Cardinality Explorer to ensure the cardinality is below the limit of 3M time series per hour
- For LogMetrics and TraceMetrics, ensure your query is returning numerical data
Incorrect Aggregation Results
- For counters, ensure you’re using sum2 instead of sum
- Check the time window [Nm] is appropriate for your data frequency
- Verify that your by clause or group by includes all necessary labels
- For logs, make sure your filter conditions are correctly specified
Performance Issues
- Start with longer time windows like [5m] to reduce processing load
- Limit the number of labels in your by clause
- Consider creating multiple targeted aggregations instead of one large one
- For log queries, add specific filters to narrow down the data being processed
rate() or increase() Returns No Data on LogMetric or TraceMetric Outputs

LogMetric and TraceMetric count outputs are per-window counts, not cumulative counters — rate() and increase() will not behave the way they do for Prometheus counters. See Querying LogMetric Outputs for the correct query patterns.

Please get in touch with us on Discord or by email if you have any questions.