As an SRE, when systems fail and alerts flood in, log data becomes your most valuable asset. But what exactly is log data, and how can you use it to improve system reliability?
What is Log Data?
Log data consists of timestamped records documenting events within your systems and applications. Think of logs as your system's diary – capturing what happened, when, and often why. From user logins to service crashes and slow database queries, these digital breadcrumbs provide essential context for:
- Troubleshooting issues during incidents
- Monitoring system health proactively
- Understanding user behavior patterns
- Meeting security and compliance requirements
Log Data Types Every SRE Should Understand
Application Logs
These document events within your software applications:
- Info logs: Normal operations like user actions and business events
- Warning logs: Non-critical issues needing attention
- Error logs: Problems requiring immediate action
- Debug logs: Detailed information for development and troubleshooting
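For a concrete sense of how these levels appear in code, here's a minimal sketch using Python's standard logging module (the logger name and messages are illustrative, not from a specific application):
import logging

# Illustrative logger; in practice handlers and formatters are configured centrally
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("checkout-service")

logger.debug("Cart contents before pricing: %s", {"sku": "A-100", "qty": 2})  # detail for troubleshooting
logger.info("Order placed successfully", extra={"order_id": "ORD-12345"})     # normal operation
logger.warning("Payment gateway latency above 2s; retrying")                  # non-critical, needs attention
logger.error("Payment gateway unreachable after 3 retries")                   # problem requiring action
In production, DEBUG is usually disabled so that only INFO and above reach storage.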
System Logs
These track events at the operating system level:
- Boot sequences and service startups
- Hardware events and resource utilization
- User authentication and session management
- Service state changes and daemon restarts
Security Logs
These record security-related events:
- Authentication attempts and access controls
- Permission changes and privilege escalations
- Sensitive data access and modification
- Network connections and potential intrusions
Network Logs
These monitor communication between systems:
- Firewall events and connection attempts
- Load balancer activity and request routing
- Bandwidth usage and network performance
- DNS queries and resolution issues
Different Log Formats You Should Know
Plain Text Logs
Simple but limited:
[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105
Structured Logs (JSON)
More machine-readable and easier to analyze:
{
  "timestamp": "2023-04-15T08:32:15Z",
  "level": "INFO",
  "message": "User logged in successfully",
  "user": "john.doe",
  "ip": "192.168.1.105",
  "service": "authentication",
  "request_id": "req_abc123"
}
This JSON structure provides:
- Precise timestamp in ISO 8601 format
- Clear log level categorization
- Human-readable message
- Rich context with user, IP, service information
- Request ID for tracing across services
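Because every line is a self-contained JSON object, ad-hoc analysis needs no custom parsing. Here's a minimal sketch (assuming logs are written one JSON object per line to a hypothetical app.log, using the request_id shown above) that pulls out every event for a single request:
import json

# Collect all log entries that share a request ID (hypothetical file and ID)
request_id = "req_abc123"
with open("app.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if event.get("request_id") == request_id:
            print(event["timestamp"], event["level"], event["message"])
The same selection becomes a one-line query once the logs land in a centralized store.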
Strategies for Efficient Log Collection
Agent-Based Collection
Installing collectors on each server:
# Installing Fluentd (td-agent) on Ubuntu
$ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-focal-td-agent4.sh | sh

# Configure to collect system and application logs
$ cat > /etc/td-agent/td-agent.conf << EOF
<source>
  @type tail
  path /var/log/syslog
  tag system.syslog
  <parse>
    @type syslog
  </parse>
</source>

<match **>
  @type forward
  <server>
    host logserver.example.com
    port 24224
  </server>
</match>
EOF
This script installs Fluentd (td-agent) and configures it to:
- Monitor system logs continuously
- Parse them according to syslog format
- Forward them to a central logging server
Kubernetes Sidecar Pattern
Using containers for log collection:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-collector
      image: fluentd:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: logs
      emptyDir: {}
This Kubernetes manifest:
- Creates a pod with your application and a log collector
- Sets up a shared volume for log files
- Allows the collector container to read and forward logs
How to Choose the Right Log Storage Solution
Elasticsearch
Powerful, searchable storage for logs:
curl -X PUT "localhost:9200/_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "message": { "type": "text" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" }
    }
  }
}
'
What this does:
- Applies to all indices whose names start with logs-*
- Configures sharding for better performance
- Defines field mappings for optimized searching
Elasticsearch is great when you need high-speed full-text search on logs, but it comes with operational overhead—managing clusters, handling scaling, and ensuring query performance.
Last9: Observability Without the Complexity
If you’re looking for a simpler, cost-efficient way to handle logs with an observability-first approach, Last9 is worth considering. Last9 is Otel-native and Prometheus compatible, making it an excellent choice for modern distributed systems.
Why Last9?
- Effortless scaling – No worrying about managing shards and replicas.
- Optimized storage – Reduces log bloat while keeping the most useful data.
- Otel-native – Seamless integration with OpenTelemetry (Otel) pipelines.
- Built for engineers – Focus on insights, not just storage.
Unlike Elasticsearch, which requires you to fine-tune indexing and retention manually, Last9 automates much of that work while giving you better cost predictability.
Which One Should You Choose?
- Elasticsearch: If you need powerful, customizable log searches and are comfortable managing infrastructure.
- Last9: If you want a hassle-free, observability-driven log storage solution that works out of the box with OpenTelemetry.
Object Storage with Lifecycle Management
Cost-effective long-term storage:
# Setting up an S3 bucket with lifecycle policies
aws s3api create-bucket --bucket my-logs-bucket --region us-east-1

# Configure lifecycle rules (each rule needs a Filter or Prefix to be accepted)
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "log-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}'
This configuration:
- Creates an S3 bucket for logs
- Moves logs to cheaper storage after 30 days
- Archives to Glacier after 90 days
- Deletes logs after one year
How to Turn Raw Logs into Actionable Insights
Basic CLI Analysis
Essential commands for quick investigations:
# Finding error patterns
$ grep -i error /var/log/application.log | sort | uniq -c | sort -rn | head -10
# Tracing a request through multiple logs
$ grep "request-abc123" /var/log/*/application.log | sort -k1,1
These commands help you:
- Identify the most common error messages
- Follow a single request through different services
Advanced Elasticsearch Queries
Finding patterns across distributed systems:
# Finding correlation between error spikes
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "error_timeline": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1m"
          }
        }
      }
    }
  }
}
'
This query:
- Finds all errors in the past hour
- Breaks them down by service
- Shows the timeline of errors per service
Using Logs to Solve Practical Problems
The Problem: Intermittent Payment Failures
Users report that some payment transactions are failing sporadically. The issue is not consistent, making it challenging to diagnose without deeper investigation.
The Logs: Gathering Clues
To uncover the root cause, we examine logs from multiple sources: application logs, system logs, and network logs.
Application Logs (Payment Service)
2023-04-15T14:32:10Z [INFO] [payment-service] Payment request received for order ORD-12345
2023-04-15T14:32:11Z [INFO] [payment-service] Calling payment gateway for order ORD-12345
2023-04-15T14:32:18Z [ERROR] [payment-service] Payment gateway timeout for order ORD-12345
- The logs indicate that the payment service receives the request and attempts to contact the payment gateway.
- However, the call to the payment gateway results in a timeout.
System Logs (Database Server)
2023-04-15T14:30:05Z [WARNING] [system] High CPU utilization (92%) on payment-service-db-01
2023-04-15T14:31:15Z [WARNING] [system] High CPU utilization (95%) on payment-service-db-01
- The database server supporting the payment service is experiencing high CPU usage.
- This could be causing slow queries or performance degradation, impacting dependent services.
Network Logs (Connectivity)
2023-04-15T14:32:11Z [INFO] [network] Request from 10.0.1.15 to payment-gateway.example.com timed out after 5000ms
- The payment service attempts to contact the payment gateway, but the request times out.
- This suggests potential network latency or service delays in the backend infrastructure.
The Analysis: Connecting the Dots
By correlating these logs, we can uncover a possible chain reaction:
- The payment service initiates a request to the payment gateway.
- The request times out due to a delay in backend processing.
- The system logs reveal high CPU usage on the database server, likely causing slow response times.
- The network logs confirm that payment gateway requests are failing due to these delays.
The Conclusion: A Database Bottleneck
- The database server is overloaded, leading to slow query responses.
- This slows down the payment service’s processing time, causing calls to the payment gateway to time out.
- The issue is not in the payment gateway itself but in the infrastructure supporting the payment service.
The Fix: Optimizing Database Performance
To resolve the issue, consider the following:
- Scale up the database resources – Increase CPU or memory allocation to handle peak loads.
- Optimize database queries – Identify slow queries and improve indexing or caching strategies.
- Load balancing – Distribute traffic more effectively across multiple database instances.
- Monitor system health proactively – Set up alerts for high CPU usage to prevent failures before they impact users.
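Logs themselves can also drive the early warning. Here's a minimal sketch (the file name, message text, and threshold are assumptions, and it presumes the payment service emits the JSON format shown earlier) that counts payment-gateway timeouts per minute and flags spikes worth alerting on:
import json
from collections import Counter

# Count payment-gateway timeout errors per minute from structured logs
TIMEOUTS_PER_MINUTE_THRESHOLD = 5  # illustrative alert threshold
counts = Counter()

with open("payment-service.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("level") == "ERROR" and "gateway timeout" in event.get("message", "").lower():
            minute = event["timestamp"][:16]  # e.g. "2023-04-15T14:32"
            counts[minute] += 1

for minute, count in sorted(counts.items()):
    if count >= TIMEOUTS_PER_MINUTE_THRESHOLD:
        print(f"ALERT: {count} gateway timeouts at {minute}")
In practice this kind of aggregation runs in your log platform rather than a script, but the logic is the same.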
Best Practices for Building a Log Strategy
Consistent Log Formatting
Standardize your logging approach:
# Python structured logging example
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name
        }
        # Add extra fields from the record
        for key, value in record.__dict__.items():
            if key.startswith('ctx_'):
                log_record[key[4:]] = value
        return json.dumps(log_record)

# Set up a logger with the JSON formatter
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
def process_payment(user_id, amount):
    logger.info("Processing payment",
                extra={
                    "ctx_user_id": user_id,
                    "ctx_amount": amount,
                    "ctx_service": "payment"
                })
This code creates a standardized JSON logging system that:
- Provides consistent timestamp formatting
- Includes contextual information with each log
- Makes logs machine-readable for automated analysis
Strategic Retention Policies
Balance data needs with storage costs:
| Log Type    | Hot Storage | Warm Storage | Cold Storage |
|-------------|-------------|--------------|--------------|
| Application | 7 days      | 30 days      | 1 year       |
| System      | 3 days      | 14 days      | 90 days      |
| Security    | 30 days     | 90 days      | 7 years      |
| Access      | 1 day       | 7 days       | 30 days      |
Distributed Tracing Integration
Connect logs across service boundaries:
// Adding trace context to logs
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.util.UUID;

public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
    private final PaymentService paymentService; // injected collaborator

    public OrderService(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    public void processOrder(String orderId, double amount) {
        // Generate or reuse an existing trace ID
        String traceId = UUID.randomUUID().toString();
        MDC.put("traceId", traceId);

        logger.info("Starting order processing");
        try {
            // Call other services with the same trace context
            paymentService.processPayment(orderId, amount, traceId);
            logger.info("Order processed successfully");
        } catch (Exception e) {
            logger.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
This Java code:
- Generates a unique trace ID for the request
- Adds it to the Mapped Diagnostic Context (MDC)
- Passes the trace ID to other services
- Ensures all logs from this transaction are linked
Next Steps
As your systems grow, focus on these key areas:
- Centralization: Move from siloed logs to a central repository
- Standardization: Create consistent logging practices across teams
- Automation: Implement anomaly detection and automated analysis
- Integration: Connect logs with metrics and traces for full observability
Conclusion
Log data is only as useful as the practices built around it. The key is a thoughtful approach to collecting, storing, and analyzing logs that matches your team's needs and system complexity.
FAQs
What is log data retention, and how long should we keep logs?
Log data retention refers to how long you store logs before deletion. The optimal retention period depends on several factors:
- Compliance requirements: Some industries require keeping certain logs for years (finance, healthcare)
- Operational needs: Most operational troubleshooting requires only recent logs (7-30 days)
- Storage costs: Longer retention increases costs, so consider tiered storage strategies
- Security investigations: Security logs often need longer retention periods (3-12 months)
Best practice is to implement a tiered strategy where recent logs are kept in fast storage and older logs are moved to lower-cost options.
How do I balance comprehensive logging with performance?
Excessive logging can impact application performance. To find the right balance:
- Use appropriate log levels (ERROR, WARN, INFO, DEBUG) and configure production environments to log at INFO level or above
- Implement sampling for high-volume logs (e.g., log only 10% of non-error requests), as sketched after this list
- Use asynchronous logging to minimize impact on request processing
- Consider context-aware logging that increases detail only when issues occur
- Benchmark your application with and without verbose logging enabled
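To make the sampling point concrete, here's a minimal sketch of a level-aware sampling filter in Python (the 10% rate and logger names are assumptions; most logging libraries and collectors offer an equivalent mechanism):
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING and above; sample lower-level records at a fixed rate."""
    def __init__(self, sample_rate=0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))  # keep roughly 10% of non-error records
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
Warnings and errors always pass through, so sampling never hides the events you page on.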
What's the difference between logs, metrics, and traces?
These three pillars of observability serve different purposes:
- Logs: Event-based records with detailed context (who, what, when, why)
- Metrics: Numerical measurements sampled over time (counters, gauges, histograms)
- Traces: Records of request paths through distributed systems
While logs provide rich context about specific events, metrics offer better performance for aggregation and alerting, and traces connect the dots between services. A complete observability strategy uses all three.
How can I make log parsing more efficient?
Efficient log parsing strategies include:
- Use structured logging formats (JSON) from the start to avoid parsing complexity
- For legacy text logs, create standardized patterns and test parsing rules thoroughly
- Implement parsing as close to the source as possible
- Use specialized tools like Logstash, Fluentd, or Vector that optimize parsing
- Cache parsed results when possible
- Consider hardware acceleration for high-volume parsing
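For the legacy-text case, pattern-based parsing can stay simple if the format is consistent. Here's a minimal sketch against the plain-text format shown earlier in this article (the field names are assumptions):
import re

# Parse the plain-text format shown earlier:
# [2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105
LINE_PATTERN = re.compile(r"\[(?P<timestamp>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<message>.*)")

def parse_line(line):
    match = LINE_PATTERN.match(line)
    if not match:
        return None  # hand unparsed lines to a fallback path
    return match.groupdict()

print(parse_line("[2023-04-15 08:32:15] INFO User john.doe logged in successfully from 192.168.1.105"))
# {'timestamp': '2023-04-15 08:32:15', 'level': 'INFO', 'message': 'User john.doe logged in successfully from 192.168.1.105'}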
What are common log data security concerns?
Critical security considerations for log data include:
- Sensitive data: Ensure PII, passwords, tokens, and other sensitive data aren't logged
- Access controls: Implement least-privilege access to log storage and analysis tools
- Integrity: Protect logs from unauthorized modification (especially security logs)
- Transport security: Encrypt logs in transit between systems
- Retention alignment: Ensure retention periods meet security investigation needs
- Audit trail: Maintain logs of who accessed log data for sensitive systems
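For the first point, scrubbing can be enforced inside the logging pipeline itself. Here's a minimal sketch of a redaction filter in Python (the patterns are illustrative and far from exhaustive; real deployments need a reviewed, tested list):
import logging
import re

# Illustrative patterns only; expand and test against your own data
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),           # email addresses
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer <token>"),   # bearer tokens
]

class RedactingFilter(logging.Filter):
    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in REDACTIONS:
            message = pattern.sub(replacement, message)
        record.msg = message
        record.args = ()  # message is already fully formatted
        return True

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.warning("Login failed for john.doe@example.com with token Bearer abc123")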
How do I troubleshoot missing log data?
If logs are disappearing, check these common issues:
- Disk space: Verify log directories haven't filled up, causing logging failures
- Log rotation: Ensure rotation isn't removing logs before they're shipped
- Collection agents: Check if collection agents are running and configured correctly
- Rate limiting: Look for dropped logs due to rate limiting in your logging pipeline
- Permissions: Verify write permissions on log files and directories
- Configuration: Confirm log levels are set appropriately
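A quick triage sketch for the first few checks (the paths and thresholds are assumptions; adjust them for your agent and distribution):
import os
import shutil
import time

LOG_DIR = "/var/log"                           # hypothetical log directory
AGENT_LOG = "/var/log/td-agent/td-agent.log"   # hypothetical collection-agent log

# 1. Disk space: flag when the log volume is nearly full
usage = shutil.disk_usage(LOG_DIR)
if usage.free / usage.total < 0.05:
    print(f"WARNING: less than 5% free on {LOG_DIR}")

# 2. Permissions: can we write where the application logs?
if not os.access(LOG_DIR, os.W_OK):
    print(f"WARNING: no write permission on {LOG_DIR}")

# 3. Collection agent: has its own log gone quiet?
if os.path.exists(AGENT_LOG):
    idle_seconds = time.time() - os.path.getmtime(AGENT_LOG)
    if idle_seconds > 600:
        print(f"WARNING: {AGENT_LOG} not updated for {idle_seconds:.0f}s")
else:
    print(f"WARNING: {AGENT_LOG} not found; is the agent installed and running?")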
What tools are recommended for SREs managing log data?
Popular tools in the log management ecosystem include:
- Collection: Fluentd, Fluent Bit, Logstash, Vector
- Processing: Kafka, ELK, OpenSearch, Last9
- Storage: Elasticsearch, Loki, Amazon S3, Google Cloud Storage
- Analysis: Kibana, Grafana, Splunk, Last9
- Alerting: Alertmanager, PagerDuty, OpsGenie, Last9
The right toolset depends on your scale, budget, and specific requirements.