The Anatomy of Log Data: Understanding the Data Model

Logging used to be simple: one process, one file, and grep. Maybe too simple. These days, if you're running any kind of serious system, your logs are more like a distributed novel written by thousands of authors simultaneously, each with their own idea of what constitutes "important information."

The Three-Body Problem of Observability

In an earlier guide, we looked at why LogQL was needed. Now, before we dive into LogQL's data model, we need to understand where logs fit in the modern observability landscape. Think of it as a three-body problem:

  • Metrics: Numbers that go up and down over time (request rates, queue depths)
  • Logs: A record of discrete events that happened
  • Traces: The journey of a single request through your system

LogQL's genius lies in how it bridges these worlds, particularly metrics and logs. But I'm getting ahead of myself.

The Fundamental Building Blocks

Let's break down how LogQL thinks about your log data. If you've ever used Prometheus, this might feel familiar. If you haven't, don't worry – I promise this will make more sense than your last JIRA ticket.

1. Log Streams

Think of a log stream as a series of log entries that all come from the same source. But here's the key: in LogQL, streams are identified by their labels, not their location. This is radically different from the traditional "logs live in files" model.

{app="payment-service", environment="prod", instance="pod-123"}
→ "2024-03-15 01:23:45 Payment processed for order #1234"
→ "2024-03-15 01:23:46 Payment failed for order #1235"
→ "2024-03-15 01:23:47 Connection timeout to payment gateway"

Each stream is uniquely identified by its label set. This is crucial because:

  • It's how LogQL organizes and indexes your data
  • It's how you query your logs efficiently
  • It's what makes horizontal scaling possible
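
Because streams are addressed by labels, a selector can match exactly one stream or fan out across many. A few illustrative selectors, reusing the hypothetical labels from the example above:

{app="payment-service", environment="prod", instance="pod-123"}   # exactly one stream
{app="payment-service", environment="prod"}                       # every prod instance's stream
{app="payment-service", environment=~"prod|staging"}              # regex matcher across environments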

2. Labels: The Secret Sauce

Labels are key-value pairs that describe your log stream. They're not just metadata – they're the backbone of LogQL's data model. Here's why they matter:

a. Labels are your index

When you deploy Loki (the system behind LogQL), you decide which labels to attach at ingestion, and every label becomes part of the index. This is like choosing the columns in your database's index, except:

  • You can't index everything (the index would balloon in size and cost)
  • You need to think carefully about cardinality (more on this later)
  • The right labels make queries fast, wrong ones make them painful
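
To make that concrete, here's a sketch using the labels from earlier (assuming they exist in your setup). The first query lets Loki prune down to a handful of streams before touching any log content; the second forces it to scan every prod stream for the same lines:

# Fast: the label selector narrows the search space first
{app="payment-service", environment="prod"} |= "timeout"

# Painful: a broad selector means scanning far more data
{environment="prod"} |= "payment-service" |= "timeout"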

b. Label Cardinality

This is where things get interesting. High cardinality is the nemesis of efficient log storage and querying. Let's break this down:

Good label choices:

  • app_name (dozens of values)
  • environment (dev, staging, prod)
  • component (api, worker, scheduler)

Bad label choices:

  • user_id (millions of values)
  • request_id (unique per request)
  • timestamp (infinite values)
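
In stream-selector terms, the difference looks like this (a hypothetical checkout service, purely for illustration):

# Good: a handful of streams, cheap to index and fast to find
{app="checkout", environment="prod", component="api"}

# Bad: one stream per user means millions of tiny streams
{app="checkout", user_id="u-1048576"}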

The Time Series Nature of Logs

Now here's the mind-bending part: LogQL treats your logs as time series data. Each log stream is essentially a series of events over time. This means:

  1. You can switch between logs and metrics seamlessly:

    sum(rate({app="payment-service"} |= "error" [5m])) by (instance)
    
  2. You can think about logs in terms of rates and aggregations:

    • Rates of occurrence
    • Time-based aggregations
    • Patterns over time

  3. You can apply time-series analysis to your logs, because time is a first-class citizen:

    • Time ranges are efficient
    • Time-based comparisons are natural
    • Historical analysis is built-in
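
Point 3 deserves a concrete sketch. Assuming the same payment-service labels as before, this compares the current error rate against the rate one hour earlier using LogQL's offset modifier; a result above 1 means errors are trending up:

sum(rate({app="payment-service"} |= "error" [5m]))
  /
sum(rate({app="payment-service"} |= "error" [5m] offset 1h))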

The Cardinality Conundrum

Remember when I mentioned cardinality? This is where most logging systems fall over, and it's why LogQL's data model is so important. Let's say you have:

  • 100 services
  • 1000 pods
  • 1 million users

If you label every log line with user_id, you're creating at least 1 million unique streams, and every combination with other labels multiplies that further. That's:

  • Bad for storage
  • Bad for querying
  • Bad for your cloud bill
  • Bad for your career

Instead, LogQL encourages you to:

  1. Use high-cardinality data in the log content
  2. Index on low-cardinality labels
  3. Extract high-cardinality data at query time
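
The third point is the clever part. As a sketch, assuming your log lines are JSON with a user_id field, you can still aggregate by user at query time without ever indexing it:

topk(10,
  sum by (user_id) (
    count_over_time(
      {app="payment-service", environment="prod"} |= "payment failed" | json [1h]
    )
  )
)

Here user_id exists only for the duration of the query; it never becomes a stream label, so cardinality stays under control.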

Why This Matters: A Real Example

Let's say you're debugging a payment issue.

{app="payment-service", env="prod"} 
|= "payment failed" 
| json 
| user_id="user123"

This query:

  1. First finds the relevant streams using labels
  2. Then filters for specific text
  3. Finally extracts and filters on JSON fields

It's like having an organized filing system instead of a giant pile of papers.
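
And because each stream is a time series, the same pipeline converts straight into a metric. A sketch that counts these failures per instance over 15-minute windows:

sum by (instance) (
  count_over_time(
    {app="payment-service", environment="prod"}
      |= "payment failed"
      | json
      | user_id="user123" [15m]
  )
)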

The Bottom Line

LogQL's data model isn't just an implementation detail — it's a fundamental rethinking of how we organize and query logs. It's built on three key insights:

  1. Labels are more important than log content for organization
  2. Time series are the right way to think about logs
  3. Cardinality is the key to scalability and performance

When you're working with LogQL, always remember:

  • Choose your labels wisely
  • Keep high-cardinality data in the log body
  • Think in terms of streams and time

Next time, we'll look at how to actually query this data model effectively. But remember: understanding the data model is half the battle. It's like chess — once you know how the pieces move, you can start thinking about strategy.
