As an SRE, when systems fail and alerts flood in, log data becomes your most valuable asset. But what exactly is log data, and how can you use it to improve system reliability?
What is Log Data?
Log data consists of timestamped records documenting events within your systems and applications. Think of logs as your system's diary – capturing what happened, when, and often why. From user logins to service crashes and slow database queries, these digital breadcrumbs provide essential context for:
- Troubleshooting issues during incidents
- Monitoring system health proactively
- Understanding user behavior patterns
- Meeting security and compliance requirements
Log Data Types Every SRE Should Understand
Application Logs
These document events within your software applications:
- Info logs: Normal operations like user actions and business events
- Warning logs: Non-critical issues needing attention
- Error logs: Problems requiring immediate action
- Debug logs: Detailed information for development and troubleshooting
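For a concrete sense of how these levels appear in code, here's a minimal sketch using Python's standard logging module (the logger name and messages are illustrative, not from a specific application):
import logging

# Illustrative logger; in practice handlers and formatters are configured centrally
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("checkout-service")

logger.debug("Cart contents before pricing: %s", {"sku": "A-100", "qty": 2})  # detail for troubleshooting
logger.info("Order placed successfully", extra={"order_id": "ORD-12345"})     # normal operation
logger.warning("Payment gateway latency above 2s; retrying")                  # non-critical, needs attention
logger.error("Payment gateway unreachable after 3 retries")                   # problem requiring action
In production, DEBUG is usually disabled so that only INFO and above reach storage.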
System Logs
These track events at the operating system level:
- Boot sequences and service startups
- Hardware events and resource utilization
- User authentication and session management
- Service state changes and daemon restarts
Security Logs
These record security-related events:
- Authentication attempts and access controls
- Permission changes and privilege escalations
- Sensitive data access and modification
- Network connections and potential intrusions
Network Logs
These monitor communication between systems:
- Firewall events and connection attempts
- Load balancer activity and request routing
- Bandwidth usage and network performance
- DNS queries and resolution issues
Different Log Formats You Should Know
Plain Text Logs
Simple but limited:
[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105
Structured Logs (JSON)
More machine-readable and easier to analyze:
{
  "timestamp": "2023-04-15T08:32:15Z",
  "level": "INFO",
  "message": "User logged in successfully",
  "user": "john.doe",
  "ip": "192.168.1.105",
  "service": "authentication",
  "request_id": "req_abc123"
}
This JSON structure provides:
- Precise timestamp in ISO 8601 format
- Clear log level categorization
- Human-readable message
- Rich context with user, IP, service information
- Request ID for tracing across services
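Because every line is a self-contained JSON object, ad-hoc analysis needs no custom parsing. Here's a minimal sketch (assuming logs are written one JSON object per line to a hypothetical app.log, using the request_id shown above) that pulls out every event for a single request:
import json

# Collect all log entries that share a request ID (hypothetical file and ID)
request_id = "req_abc123"
with open("app.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if event.get("request_id") == request_id:
            print(event["timestamp"], event["level"], event["message"])
The same selection becomes a one-line query once the logs land in a centralized store.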
Strategies for Efficient Log Collection
Agent-Based Collection
Installing collectors on each server:
# Installing Fluentd (td-agent) on Ubuntu
$ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-focal-td-agent4.sh | sh

# Configure to collect system and application logs
$ cat > /etc/td-agent/td-agent.conf << EOF
<source>
  @type tail
  path /var/log/syslog
  tag system.syslog
  <parse>
    @type syslog
  </parse>
</source>

<match **>
  @type forward
  <server>
    host logserver.example.com
    port 24224
  </server>
</match>
EOF
This script installs Fluentd (td-agent) and configures it to:
- Monitor system logs continuously
- Parse them according to syslog format
- Forward them to a central logging server
Kubernetes Sidecar Pattern
Using containers for log collection:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-collector
      image: fluentd:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: logs
      emptyDir: {}
This Kubernetes manifest:
- Creates a pod with your application and a log collector
- Sets up a shared volume for log files
- Allows the collector container to read and forward logs
How to Choose the Right Log Storage Solution
Elasticsearch
Powerful, searchable storage for logs:
curl -X PUT "localhost:9200/_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "message": { "type": "text" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" }
    }
  }
}
'
What this does:
- Applies to all indices whose names start with logs-*
- Configures sharding for better performance
- Defines field mappings for optimized searching
Elasticsearch is great when you need high-speed full-text search on logs, but it comes with operational overhead—managing clusters, handling scaling, and ensuring query performance.
Last9: Observability Without the Complexity
If you’re looking for a simpler, cost-efficient way to handle logs with an observability-first approach, Last9 is worth considering. Last9 is Otel-native and Prometheus compatible, making it an excellent choice for modern distributed systems.
Why Last9?
- Effortless scaling – No worrying about managing shards and replicas.
- Optimized storage – Reduces log bloat while keeping the most useful data.
- Otel-native – Seamless integration with OpenTelemetry (Otel) pipelines.
- Built for engineers – Focus on insights, not just storage.
Unlike Elasticsearch, which requires you to fine-tune indexing and retention manually, Last9 automates much of that work while giving you better cost predictability.
Which One Should You Choose?
- Elasticsearch: If you need powerful, customizable log searches and are comfortable managing infrastructure.
- Last9: If you want a hassle-free, observability-driven log storage solution that works out of the box with OpenTelemetry.
Object Storage with Lifecycle Management
Cost-effective long-term storage:
# Setting up an S3 bucket with lifecycle policies
aws s3api create-bucket --bucket my-logs-bucket --region us-east-1

# Configure lifecycle rules (each rule needs a Filter or Prefix to be accepted)
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "log-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}'
This configuration:
- Creates an S3 bucket for logs
- Moves logs to cheaper storage after 30 days
- Archives to Glacier after 90 days
- Deletes logs after one year
How to Turn Raw Logs into Actionable Insights
Basic CLI Analysis
Essential commands for quick investigations:
# Finding error patterns
$ grep -i error /var/log/application.log | sort | uniq -c | sort -rn | head -10
# Tracing a request through multiple logs
$ grep "request-abc123" /var/log/*/application.log | sort -k1,1
These commands help you:
- Identify the most common error messages
- Follow a single request through different services
Advanced Elasticsearch Queries
Finding patterns across distributed systems:
# Finding correlation between error spikes
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "error_timeline": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1m"
          }
        }
      }
    }
  }
}
'
This query:
- Finds all errors in the past hour
- Breaks them down by service
- Shows the timeline of errors per service
Using Logs to Solve Practical Problems
The Problem: Intermittent Payment Failures
Users report that some payment transactions are failing sporadically. The issue is not consistent, making it challenging to diagnose without deeper investigation.
The Logs: Gathering Clues
To uncover the root cause, we examine logs from multiple sources: application logs, system logs, and network logs.
Application Logs (Payment Service)
2023-04-15T14:32:10Z [INFO] [payment-service] Payment request received for order ORD-12345
2023-04-15T14:32:11Z [INFO] [payment-service] Calling payment gateway for order ORD-12345
2023-04-15T14:32:18Z [ERROR] [payment-service] Payment gateway timeout for order ORD-12345
- The logs indicate that the payment service receives the request and attempts to contact the payment gateway.
- However, the call to the payment gateway results in a timeout.
System Logs (Database Server)
2023-04-15T14:30:05Z [WARNING] [system] High CPU utilization (92%) on payment-service-db-01
2023-04-15T14:31:15Z [WARNING] [system] High CPU utilization (95%) on payment-service-db-01
- The database server supporting the payment service is experiencing high CPU usage.
- This could be causing slow queries or performance degradation, impacting dependent services.
Network Logs (Connectivity)
2023-04-15T14:32:11Z [INFO] [network] Request from 10.0.1.15 to payment-gateway.example.com timed out after 5000ms
- The payment service attempts to contact the payment gateway, but the request times out.
- This suggests potential network latency or service delays in the backend infrastructure.
The Analysis: Connecting the Dots
By correlating these logs, we can uncover a possible chain reaction:
- The payment service initiates a request to the payment gateway.
- The request times out due to a delay in backend processing.
- The system logs reveal high CPU usage on the database server, likely causing slow response times.
- The network logs confirm that payment gateway requests are failing due to these delays.
The Conclusion: A Database Bottleneck
- The database server is overloaded, leading to slow query responses.
- This slows down the payment service’s processing time, causing calls to the payment gateway to time out.
- The issue is not in the payment gateway itself but in the infrastructure supporting the payment service.
The Fix: Optimizing Database Performance
To resolve the issue, consider the following:
- Scale up the database resources – Increase CPU or memory allocation to handle peak loads.
- Optimize database queries – Identify slow queries and improve indexing or caching strategies.
- Load balancing – Distribute traffic more effectively across multiple database instances.
- Monitor system health proactively – Set up alerts for high CPU usage to prevent failures before they impact users.
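Logs themselves can also drive the early warning. Here's a minimal sketch (the file name, message text, and threshold are assumptions, and it presumes the payment service emits the JSON format shown earlier) that counts payment-gateway timeouts per minute and flags spikes worth alerting on:
import json
from collections import Counter

# Count payment-gateway timeout errors per minute from structured logs
TIMEOUTS_PER_MINUTE_THRESHOLD = 5  # illustrative alert threshold
counts = Counter()

with open("payment-service.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("level") == "ERROR" and "gateway timeout" in event.get("message", "").lower():
            minute = event["timestamp"][:16]  # e.g. "2023-04-15T14:32"
            counts[minute] += 1

for minute, count in sorted(counts.items()):
    if count >= TIMEOUTS_PER_MINUTE_THRESHOLD:
        print(f"ALERT: {count} gateway timeouts at {minute}")
In practice this kind of aggregation runs in your log platform rather than a script, but the logic is the same.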
Best Practices for Building a Log Strategy
Consistent Log Formatting
Standardize your logging approach:
# Python structured logging example
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name
        }
        # Add extra fields from the record
        for key, value in record.__dict__.items():
            if key.startswith('ctx_'):
                log_record[key[4:]] = value
        return json.dumps(log_record)

# Set up a logger with the JSON formatter
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
def process_payment(user_id, amount):
    logger.info("Processing payment",
                extra={
                    "ctx_user_id": user_id,
                    "ctx_amount": amount,
                    "ctx_service": "payment"
                })
This code creates a standardized JSON logging system that:
- Provides consistent timestamp formatting
- Includes contextual information with each log
- Makes logs machine-readable for automated analysis
Strategic Retention Policies
Balance data needs with storage costs:
| Log Type    | Hot Storage | Warm Storage | Cold Storage |
|-------------|-------------|--------------|--------------|
| Application | 7 days      | 30 days      | 1 year       |
| System      | 3 days      | 14 days      | 90 days      |
| Security    | 30 days     | 90 days      | 7 years      |
| Access      | 1 day       | 7 days       | 30 days      |
Distributed Tracing Integration
Connect logs across service boundaries:
// Adding trace context to logs
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.util.UUID;

public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
    private final PaymentService paymentService; // injected collaborator

    public OrderService(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    public void processOrder(String orderId, double amount) {
        // Generate or reuse an existing trace ID
        String traceId = UUID.randomUUID().toString();
        MDC.put("traceId", traceId);

        logger.info("Starting order processing");
        try {
            // Call other services with the same trace context
            paymentService.processPayment(orderId, amount, traceId);
            logger.info("Order processed successfully");
        } catch (Exception e) {
            logger.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
This Java code:
- Generates a unique trace ID for the request
- Adds it to the Mapped Diagnostic Context (MDC)
- Passes the trace ID to other services
- Ensures all logs from this transaction are linked
Next Steps
As your systems grow, focus on these key areas:
- Centralization: Move from siloed logs to a central repository
- Standardization: Create consistent logging practices across teams
- Automation: Implement anomaly detection and automated analysis
- Integration: Connect logs with metrics and traces for full observability
Conclusion
Log data is only as useful as the practices built around it. The key is a thoughtful approach to collecting, storing, and analyzing logs that matches your team's needs and system complexity.
FAQs
What is log data retention, and how long should we keep logs?
Log data retention refers to how long you store logs before deletion. The optimal retention period depends on several factors:
- Compliance requirements: Some industries require keeping certain logs for years (finance, healthcare)
- Operational needs: Most operational troubleshooting requires only recent logs (7-30 days)
- Storage costs: Longer retention increases costs, so consider tiered storage strategies
- Security investigations: Security logs often need longer retention periods (3-12 months)
Best practice is to implement a tiered strategy where recent logs are kept in fast storage and older logs are moved to lower-cost options.
How do I balance comprehensive logging with performance?
Excessive logging can impact application performance. To find the right balance:
- Use appropriate log levels (ERROR, WARN, INFO, DEBUG) and configure production environments to log at INFO level or above
- Implement sampling for high-volume logs (e.g., log only 10% of non-error requests), as sketched after this list
- Use asynchronous logging to minimize impact on request processing
- Consider context-aware logging that increases detail only when issues occur
- Benchmark your application with and without verbose logging enabled
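To make the sampling point concrete, here's a minimal sketch of a level-aware sampling filter in Python (the 10% rate and logger names are assumptions; most logging libraries and collectors offer an equivalent mechanism):
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING and above; sample lower-level records at a fixed rate."""
    def __init__(self, sample_rate=0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))  # keep roughly 10% of non-error records
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
Warnings and errors always pass through, so sampling never hides the events you page on.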
What's the difference between logs, metrics, and traces?
These three pillars of observability serve different purposes:
- Logs: Event-based records with detailed context (who, what, when, why)
- Metrics: Numerical measurements sampled over time (counters, gauges, histograms)
- Traces: Records of request paths through distributed systems
While logs provide rich context about specific events, metrics offer better performance for aggregation and alerting, and traces connect the dots between services. A complete observability strategy uses all three.
How can I make log parsing more efficient?
Efficient log parsing strategies include:
- Use structured logging formats (JSON) from the start to avoid parsing complexity
- For legacy text logs, create standardized patterns and test parsing rules thoroughly
- Implement parsing as close to the source as possible
- Use specialized tools like Logstash, Fluentd, or Vector that optimize parsing
- Cache parsed results when possible
- Consider hardware acceleration for high-volume parsing
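For the legacy-text case, pattern-based parsing can stay simple if the format is consistent. Here's a minimal sketch against the plain-text format shown earlier in this article (the field names are assumptions):
import re

# Parse the plain-text format shown earlier:
# [2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105
LINE_PATTERN = re.compile(r"\[(?P<timestamp>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<message>.*)")

def parse_line(line):
    match = LINE_PATTERN.match(line)
    if not match:
        return None  # hand unparsed lines to a fallback path
    return match.groupdict()

print(parse_line("[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105"))
# {'timestamp': '2023-04-15 08:32:15', 'level': 'INFO', 'message': 'User john.doe logged in successfully from 192.168.1.105'}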
What are common log data security concerns?
Critical security considerations for log data include:
- Sensitive data: Ensure PII, passwords, tokens, and other sensitive data aren't logged
- Access controls: Implement least-privilege access to log storage and analysis tools
- Integrity: Protect logs from unauthorized modification (especially security logs)
- Transport security: Encrypt logs in transit between systems
- Retention alignment: Ensure retention periods meet security investigation needs
- Audit trail: Maintain logs of who accessed log data for sensitive systems
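For the first point, scrubbing can be enforced inside the logging pipeline itself. Here's a minimal sketch of a redaction filter in Python (the patterns are illustrative and far from exhaustive; real deployments need a reviewed, tested list):
import logging
import re

# Illustrative patterns only; expand and test against your own data
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),           # email addresses
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer <token>"),   # bearer tokens
]

class RedactingFilter(logging.Filter):
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in REDACTIONS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = ()  # message is already fully formatted
        return True

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.warning("Login failed for john.doe@example.com with token Bearer abc123")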
How do I troubleshoot missing log data?
If logs are disappearing, check these common issues:
- Disk space: Verify log directories haven't filled up, causing logging failures
- Log rotation: Ensure rotation isn't removing logs before they're shipped
- Collection agents: Check if collection agents are running and configured correctly
- Rate limiting: Look for dropped logs due to rate limiting in your logging pipeline
- Permissions: Verify write permissions on log files and directories
- Configuration: Confirm log levels are set appropriately
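A quick triage sketch for the first few checks (the paths and thresholds are assumptions; adjust them for your agent and distribution):
import os
import shutil
import time

LOG_DIR = "/var/log"                           # hypothetical log directory
AGENT_LOG = "/var/log/td-agent/td-agent.log"   # hypothetical collection-agent log

# 1. Disk space: flag when the log volume is nearly full
usage = shutil.disk_usage(LOG_DIR)
if usage.free / usage.total < 0.05:
    print(f"WARNING: less than 5% free on {LOG_DIR}")

# 2. Permissions: can we write where the application logs?
if not os.access(LOG_DIR, os.W_OK):
    print(f"WARNING: no write permission on {LOG_DIR}")

# 3. Collection agent: has its own log gone quiet?
if os.path.exists(AGENT_LOG):
    idle_seconds = time.time() - os.path.getmtime(AGENT_LOG)
    if idle_seconds > 600:
        print(f"WARNING: {AGENT_LOG} not updated for {idle_seconds:.0f}s")
else:
    print(f"WARNING: {AGENT_LOG} not found; is the agent installed and running?")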
What tools are recommended for SREs managing log data?
Popular tools in the log management ecosystem include:
- Collection: Fluentd, Fluent Bit, Logstash, Vector
- Processing: Kafka, ELK, OpenSearch, Last9
- Storage: Elasticsearch, Loki, Amazon S3, Google Cloud Storage
- Analysis: Kibana, Grafana, Splunk, Last9
- Alerting: Alertmanager, PagerDuty, OpsGenie, Last9
The right toolset depends on your scale, budget, and specific requirements.