
What is Log Data? The SRE's Essential Guide

Learn how log data helps SREs debug issues, monitor performance, and understand system behavior effectively.

As an SRE, when systems fail and alerts flood in, log data becomes your most valuable asset. But what exactly is log data, and how can you use it to improve system reliability?

What is Log Data?

Log data consists of timestamped records documenting events within your systems and applications. Think of logs as your system's diary – capturing what happened, when, and often why. From user logins to service crashes and slow database queries, these digital breadcrumbs provide essential context for:

  • Troubleshooting issues during incidents
  • Monitoring system health proactively
  • Understanding user behavior patterns
  • Meeting security and compliance requirements
💡
Learn how log levels help SREs filter important events and reduce noise in this guide on common log level questions.

Log Data Types Every SRE Should Understand

Application Logs

These document events within your software applications (a quick example follows the list):

  • Info logs: Normal operations like user actions and business events
  • Warning logs: Non-critical issues needing attention
  • Error logs: Problems requiring immediate action
  • Debug logs: Detailed information for development and troubleshooting
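
If your application logs through Python's standard logging module, these levels map directly onto the built-in ones. A minimal sketch (the service name and messages are illustrative):

import logging

logging.basicConfig(level=logging.INFO)  # production typically keeps INFO and above
logger = logging.getLogger("checkout")

logger.debug("Cart totals recalculated")                 # debug: development detail, filtered out here
logger.info("Order ORD-12345 placed by user 42")         # info: normal business event
logger.warning("Inventory for SKU-99 below threshold")   # warning: needs attention, not urgent
logger.error("Payment provider returned HTTP 503")       # error: requires immediate action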

System Logs

These track events at the operating system level:

  • Boot sequences and service startups
  • Hardware events and resource utilization
  • User authentication and session management
  • Service state changes and daemon restarts

Security Logs

These record security-related events:

  • Authentication attempts and access controls
  • Permission changes and privilege escalations
  • Sensitive data access and modification
  • Network connections and potential intrusions

Network Logs

These monitor communication between systems:

  • Firewall events and connection attempts
  • Load balancer activity and request routing
  • Bandwidth usage and network performance
  • DNS queries and resolution issues
💡
Understand how long to store logs and manage log retention effectively in this guide on log retention.

Different Log Formats You Should Know

Plain Text Logs

Simple but limited:

[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105

Structured Logs (JSON)

More machine-readable and easier to analyze:

{
  "timestamp": "2023-04-15T08:32:15Z",
  "level": "INFO",
  "message": "User logged in successfully",
  "user": "john.doe",
  "ip": "192.168.1.105",
  "service": "authentication",
  "request_id": "req_abc123"
}

This JSON structure provides:

  • Precise timestamp in ISO 8601 format
  • Clear log level categorization
  • Human-readable message
  • Rich context with user, IP, service information
  • Request ID for tracing across services
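
That structure pays off at analysis time: each line can be loaded as a dictionary and filtered on fields instead of regex-matched. A quick sketch, assuming a file named app.log with one JSON object per line:

import json

# Count error events per service from a newline-delimited JSON log file
counts = {}
with open("app.log") as f:
    for line in f:
        event = json.loads(line)
        if event.get("level") == "ERROR":
            service = event.get("service", "unknown")
            counts[service] = counts.get(service, 0) + 1

for service, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {count} errors")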

Strategies for Efficient Log Collection

Agent-Based Collection

Installing collectors on each server:

# Installing Fluentd on Ubuntu
$ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-focal-td-agent4.sh | sh

# Configure to collect system and application logs
$ cat > /etc/td-agent/td-agent.conf << EOF
<source>
  @type tail
  path /var/log/syslog
  tag system.syslog
  <parse>
    @type syslog
  </parse>
</source>

<match **>
  @type forward
  <server>
    host logserver.example.com
    port 24224
  </server>
</match>
EOF

This script installs Fluentd (td-agent) and configures it to:

  • Monitor system logs continuously
  • Parse them according to syslog format
  • Forward them to a central logging server

Kubernetes Sidecar Pattern

Using containers for log collection:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-collector
    image: fluentd:latest
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: logs
    emptyDir: {}

This Kubernetes manifest:

  • Creates a pod with your application and a log collector
  • Sets up a shared volume for log files
  • Allows the collector container to read and forward logs
💡
See how sidecar containers can help with log collection and processing in this guide on sidecar containers in Kubernetes.

How to Choose the Right Log Storage Solution

Elasticsearch

Powerful, searchable storage for logs:

curl -X PUT "localhost:9200/_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "message": { "type": "text" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" }
    }
  }
}
'

What this does:

  • Applies to all indices matching the logs-* pattern
  • Configures sharding for better performance
  • Defines field mappings for optimized searching

Elasticsearch is great when you need high-speed full-text search on logs, but it comes with operational overhead—managing clusters, handling scaling, and ensuring query performance.

Last9: Observability Without the Complexity

If you’re looking for a simpler, cost-efficient way to handle logs with an observability-first approach, Last9 is worth considering. Last9 is Otel-native and Prometheus compatible, making it an excellent choice for modern distributed systems.

Why Last9?

  • Effortless scaling – No worrying about managing shards and replicas.
  • Optimized storage – Reduces log bloat while keeping the most useful data.
  • Otel-native – Seamless integration with OpenTelemetry (Otel) pipelines.
  • Built for engineers – Focus on insights, not just storage.

Unlike Elasticsearch, which requires you to fine-tune indexing and retention manually, Last9 automates much of that work while giving you better cost predictability.

Which One Should You Choose?

  • Elasticsearch: If you need powerful, customizable log searches and are comfortable managing infrastructure.
  • Last9: If you want a hassle-free, observability-driven log storage solution that works out of the box with OpenTelemetry.
💡
Learn how SIEM systems use logs for security monitoring and threat detection in this guide on SIEM logs.

Object Storage with Lifecycle Management

Cost-effective long-term storage:

# Setting up S3 bucket with lifecycle policies
aws s3api create-bucket --bucket my-logs-bucket --region us-east-1

# Configure lifecycle rules
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket --lifecycle-configuration '{
  "Rules": [
    {
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}'

This configuration:

  • Creates an S3 bucket for logs
  • Moves logs to cheaper storage after 30 days
  • Archives to Glacier after 90 days
  • Deletes logs after one year
💡
Understand the role of system logs in monitoring, debugging, and maintaining reliability in this guide on system logs.

How to Turn Raw Logs into Actionable Insights

Basic CLI Analysis

Essential commands for quick investigations:

# Finding error patterns
$ grep -i error /var/log/application.log | sort | uniq -c | sort -rn | head -10

# Tracing a request through multiple logs
$ grep "request-abc123" /var/log/*/application.log | sort -k1,1

These commands help you:

  • Identify the most common error messages
  • Follow a single request through different services

Advanced Elasticsearch Queries

Finding patterns across distributed systems:

# Finding correlation between error spikes
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "error_timeline": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1m"
          }
        }
      }
    }
  }
}
'

This query:

  • Finds all errors in the past hour
  • Breaks them down by service
  • Shows the timeline of errors per service
💡
Improve log quality and troubleshooting with practical tips from this guide on logging best practices.

Using Logs to Solve Practical Problems

The Problem: Intermittent Payment Failures

Users report that some payment transactions are failing sporadically. The issue is not consistent, making it challenging to diagnose without deeper investigation.

The Logs: Gathering Clues

To uncover the root cause, we examine logs from multiple sources: application logs, system logs, and network logs.

Application Logs (Payment Service)

2023-04-15T14:32:10Z [INFO] [payment-service] Payment request received for order ORD-12345
2023-04-15T14:32:11Z [INFO] [payment-service] Calling payment gateway for order ORD-12345
2023-04-15T14:32:18Z [ERROR] [payment-service] Payment gateway timeout for order ORD-12345
  • The logs indicate that the payment service receives the request and attempts to contact the payment gateway.
  • However, the call to the payment gateway results in a timeout.

System Logs (Database Server)

2023-04-15T14:30:05Z [WARNING] [system] High CPU utilization (92%) on payment-service-db-01
2023-04-15T14:31:15Z [WARNING] [system] High CPU utilization (95%) on payment-service-db-01
  • The database server supporting the payment service is experiencing high CPU usage.
  • This could be causing slow queries or performance degradation, impacting dependent services.

Network Logs (Connectivity)

2023-04-15T14:32:11Z [INFO] [network] Request from 10.0.1.15 to payment-gateway.example.com timed out after 5000ms
  • The payment service attempts to contact the payment gateway, but the request times out.
  • This suggests potential network latency or service delays in the backend infrastructure.

The Analysis: Connecting the Dots

By correlating these logs, we can uncover a possible chain reaction (a short script that performs the same correlation follows the list):

  1. The payment service initiates a request to the payment gateway.
  2. The request times out due to a delay in backend processing.
  3. The system logs reveal high CPU usage on the database server, likely causing slow response times.
  4. The network logs confirm that payment gateway requests are failing due to these delays.
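
Doing this by hand works for a single incident, but the same correlation can be scripted: merge events from the different sources into one timeline and scan the minutes leading up to the failure. A rough sketch, assuming the entries above have already been parsed into (timestamp, source, message) tuples:

from datetime import datetime

# Events gathered from the three log sources (normally parsed from the files themselves)
events = [
    ("2023-04-15T14:30:05Z", "system",  "High CPU utilization (92%) on payment-service-db-01"),
    ("2023-04-15T14:32:10Z", "payment", "Payment request received for order ORD-12345"),
    ("2023-04-15T14:32:11Z", "network", "Request to payment-gateway.example.com timed out after 5000ms"),
    ("2023-04-15T14:32:18Z", "payment", "Payment gateway timeout for order ORD-12345"),
]

# Sort into one timeline and print everything in the window before the error
error_time = datetime.fromisoformat("2023-04-15T14:32:18+00:00")
for ts, source, message in sorted(events):
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if 0 <= (error_time - t).total_seconds() <= 150:
        print(f"{ts} [{source}] {message}")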

The Conclusion: A Database Bottleneck

  • The database server is overloaded, leading to slow query responses.
  • This slows down the payment service’s processing time, causing calls to the payment gateway to time out.
  • The issue is not in the payment gateway itself but in the infrastructure supporting the payment service.

The Fix: Optimizing Database Performance

To resolve the issue, consider the following:

  1. Scale up the database resources – Increase CPU or memory allocation to handle peak loads.
  2. Optimize database queries – Identify slow queries and improve indexing or caching strategies.
  3. Load balancing – Distribute traffic more effectively across multiple database instances.
  4. Monitor system health proactively – Set up alerts for high CPU usage to prevent failures before they impact users.

Best Practices for Building a Log Strategy

Consistent Log Formatting

Standardize your logging approach:

# Python structured logging example
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name
        }
        
        # Add extra fields from record
        for key, value in record.__dict__.items():
            if key.startswith('ctx_'):
                log_record[key[4:]] = value
                
        return json.dumps(log_record)

# Setup logger with JSON formatter
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
def process_payment(user_id, amount):
    logger.info("Processing payment", 
                extra={
                    "ctx_user_id": user_id,
                    "ctx_amount": amount,
                    "ctx_service": "payment"
                })

This code creates a standardized JSON logging system that:

  • Provides consistent timestamp formatting
  • Includes contextual information with each log
  • Makes logs machine-readable for automated analysis

Strategic Retention Policies

Balance data needs with storage costs:

Log Type    | Hot Storage | Warm Storage | Cold Storage
Application | 7 days      | 30 days      | 1 year
System      | 3 days      | 14 days      | 90 days
Security    | 30 days     | 90 days      | 7 years
Access      | 1 day       | 7 days       | 30 days
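
One way to enforce a policy like this in a log pipeline is a small routing helper that picks the storage tier from the log type and age. A minimal sketch, where the tier names and day counts simply mirror the table above; adapt them to your own policy:

# Map (log type, age in days) to a storage tier based on the retention table
RETENTION = {
    "application": (7, 30, 365),
    "system": (3, 14, 90),
    "security": (30, 90, 7 * 365),
    "access": (1, 7, 30),
}

def storage_tier(log_type, age_days):
    hot, warm, cold = RETENTION[log_type]
    if age_days <= hot:
        return "hot"
    if age_days <= warm:
        return "warm"
    if age_days <= cold:
        return "cold"
    return "delete"

print(storage_tier("security", 120))  # -> "cold"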
💡
Explore the common pitfalls and solutions in tracking requests across services in this guide on challenges of distributed tracing.

Distributed Tracing Integration

Connect logs across service boundaries:

// Adding trace context to logs
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.util.UUID;

public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
    // Collaborating payment service (assumed to be provided elsewhere, e.g. injected)
    private PaymentService paymentService;
    
    public void processOrder(String orderId, double amount) {
        // Generate or use existing trace ID
        String traceId = UUID.randomUUID().toString();
        MDC.put("traceId", traceId);
        
        logger.info("Starting order processing");
        
        try {
            // Call other services with same trace context
            paymentService.processPayment(orderId, amount, traceId);
            
            logger.info("Order processed successfully");
        } catch (Exception e) {
            logger.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}

This Java code:

  • Generates a unique trace ID for the request
  • Adds it to the Mapped Diagnostic Context (MDC)
  • Passes the trace ID to other services
  • Ensures all logs from this transaction are linked

Next Steps

As your systems grow, focus on these key areas:

  1. Centralization: Move from siloed logs to a central repository
  2. Standardization: Create consistent logging practices across teams
  3. Automation: Implement anomaly detection and automated analysis
  4. Integration: Connect logs with metrics and traces for full observability

Conclusion

Log data is only as valuable as your ability to find and act on it during an incident. The key is building a thoughtful approach to collecting, storing, and analyzing logs that matches your team's needs and system complexity.

💡
If you want to chat more about log data strategies, join our Discord Community where you can connect with SREs like you.

FAQs

What is log data retention, and how long should we keep logs?

Log data retention refers to how long you store logs before deletion. The optimal retention period depends on several factors:

  • Compliance requirements: Some industries require keeping certain logs for years (finance, healthcare)
  • Operational needs: Most operational troubleshooting requires only recent logs (7-30 days)
  • Storage costs: Longer retention increases costs, so consider tiered storage strategies
  • Security investigations: Security logs often need longer retention periods (3-12 months)

Best practice is to implement a tiered strategy where recent logs are kept in fast storage and older logs are moved to lower-cost options.

How do I balance comprehensive logging with performance?

Excessive logging can impact application performance. To find the right balance:

  • Use appropriate log levels (ERROR, WARN, INFO, DEBUG) and configure production environments to log at INFO level or above
  • Implement sampling for high-volume logs (e.g., log only 10% of non-error requests; see the sketch after this list)
  • Use asynchronous logging to minimize impact on request processing
  • Consider context-aware logging that increases detail only when issues occur
  • Benchmark your application with and without verbose logging enabled
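
For the sampling point above, Python's logging filters make this straightforward. A minimal sketch that keeps every warning and error but only about 10% of lower-severity records (the rate is illustrative):

import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING-and-above records, sample the rest at a fixed rate."""
    def __init__(self, rate=0.1):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.rate

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.1))
logger.addHandler(handler)
logger.setLevel(logging.INFO)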

What's the difference between logs, metrics, and traces?

These three pillars of observability serve different purposes:

  • Logs: Event-based records with detailed context (who, what, when, why)
  • Metrics: Numerical measurements sampled over time (counters, gauges, histograms)
  • Traces: Records of request paths through distributed systems

While logs provide rich context about specific events, metrics offer better performance for aggregation and alerting, and traces connect the dots between services. A complete observability strategy uses all three.

How can I make log parsing more efficient?

Efficient log parsing strategies include:

  • Use structured logging formats (JSON) from the start to avoid parsing complexity
  • For legacy text logs, create standardized patterns and test parsing rules thoroughly
  • Implement parsing as close to the source as possible
  • Use specialized tools like Logstash, Fluentd, or Vector that optimize parsing
  • Cache parsed results when possible
  • Consider hardware acceleration for high-volume parsing
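
For the legacy text-log case above, the usual approach is a tested regular expression that turns each line into structured fields. A minimal sketch against the plain-text format shown earlier in this guide:

import re

# Matches lines like: [2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105
LINE = re.compile(r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<message>.*)$")

def parse_line(line):
    match = LINE.match(line.strip())
    return match.groupdict() if match else None

# Prints a dict with timestamp, level, and message fields
print(parse_line("[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105"))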

What are common log data security concerns?

Critical security considerations for log data include:

  • Sensitive data: Ensure PII, passwords, tokens, and other sensitive data aren't logged (a redaction sketch follows this list)
  • Access controls: Implement least-privilege access to log storage and analysis tools
  • Integrity: Protect logs from unauthorized modification (especially security logs)
  • Transport security: Encrypt logs in transit between systems
  • Retention alignment: Ensure retention periods meet security investigation needs
  • Audit trail: Maintain logs of who accessed log data for sensitive systems
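
For the sensitive-data point above, one common safeguard is a redaction step in the logging pipeline itself, so secrets never reach storage. A minimal sketch using a Python logging filter with a couple of illustrative patterns; a real deployment needs a reviewed, broader pattern list:

import logging
import re

# Illustrative patterns only; extend for your own sensitive fields
PATTERNS = [
    (re.compile(r"password=\S+"), "password=[REDACTED]"),
    (re.compile(r"\b\d{16}\b"), "[REDACTED_CARD]"),
]

class RedactingFilter(logging.Filter):
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = ()
        return True

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.warning("Login failed for user=jane password=hunter2")  # password value is redacted on output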

How do I troubleshoot missing log data?

If logs are disappearing, check these common issues:

  • Disk space: Verify log directories haven't filled up, causing logging failures
  • Log rotation: Ensure rotation isn't removing logs before they're shipped
  • Collection agents: Check if collection agents are running and configured correctly
  • Rate limiting: Look for dropped logs due to rate limiting in your logging pipeline
  • Permissions: Verify write permissions on log files and directories
  • Configuration: Confirm log levels are set appropriately
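
A quick first pass on the disk-space and staleness checks can be scripted. A minimal sketch that flags a nearly full log partition and log files that have not been written to recently (the path and thresholds are illustrative):

import os
import shutil
import time

LOG_DIR = "/var/log"

# Flag the partition if less than 10% free space remains
usage = shutil.disk_usage(LOG_DIR)
if usage.free / usage.total < 0.10:
    print(f"WARNING: {LOG_DIR} has less than 10% free space")

# Flag .log files that have not been modified in the last 10 minutes
now = time.time()
for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if name.endswith(".log") and os.path.isfile(path):
        if now - os.path.getmtime(path) > 600:
            print(f"Stale log file: {path}")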

What tools are commonly used for log management?

Popular tools in the log management ecosystem include:

  • Collection: Fluentd, Fluent Bit, Logstash, Vector
  • Processing: Kafka, ELK, OpenSearch, Last9
  • Storage: Elasticsearch, Loki, Amazon S3, Google Cloud Storage
  • Analysis: Kibana, Grafana, Splunk, Last9
  • Alerting: Alertmanager, PagerDuty, OpsGenie, Last9

The right toolset depends on your scale, budget, and specific requirements.

Authors

Anjali Udasi

Helping to make the tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.