Monitoring used to be simple—set up some dashboards, configure alerts, and call it a day. But with microservices and cloud-native systems, things aren’t so straightforward anymore. Keeping track of everything can feel like an endless game of whack-a-mole.
That’s where observability comes in. If you’re just getting started or looking to refine your approach, this guide answers the most common (and important) questions.
FAQs
What is observability and how is it different from monitoring?
Monitoring tells you when something's broken. Observability tells you why.
Think of monitoring as checking your car's dashboard lights—it alerts you to problems. Observability is like having x-ray vision into your engine while driving. It gives you context about what's happening under the hood.
Traditional monitoring collects predefined metrics you think you'll need. Observability collects high-cardinality data allowing you to ask questions you hadn't thought of yet.
The technical distinction comes from control theory: a system is observable if you can determine its internal state from its outputs alone. In practical terms, this means having enough telemetry data to understand any state your system might get into, even unexpected ones.
Key Differences Between Monitoring and Observability
Monitoring | Observability |
---|---|
Known-unknowns | Unknown-unknowns |
Alert-driven | Query-driven |
Pre-defined dashboards | Dynamic exploration |
Low cardinality | High cardinality |
Fixed thresholds | Anomaly detection |
What are the three pillars of observability?
The three pillars that form your observability strategy are:
- Logs – Text records of events (what happened and when)
- Metrics – Numerical measurements over time (how much, how many)
- Traces – Request paths through your distributed system (where and how long)
Think of them as complementary tools. Logs give you detailed events, metrics show patterns over time, and traces connect the dots across services.
Logs
Structured logs have transformed traditional text logging. JSON-formatted logs with consistent fields enable query-based analysis that was impossible with plain text. Tools like Elasticsearch, Loki, and Splunk can index terabytes of log data for fast retrieval.
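To make that concrete, here's a minimal sketch of structured logging with Python's standard logging module. The order_id and user_id fields are hypothetical examples of the consistent, queryable fields structured logs enable.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent, queryable fields."""
    EXTRA_FIELDS = ("order_id", "user_id")  # hypothetical business fields

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Anything passed via `extra=` lands on the record; pick up known fields
            **{k: v for k, v in record.__dict__.items() if k in self.EXTRA_FIELDS},
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A business event emitted as a single JSON line, ready for indexing
logger.info("order placed", extra={"order_id": "ord_123", "user_id": "u_42"})
```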
Key log types include:
- Application logs (business events)
- Access logs (API/frontend requests)
- Error logs (exceptions, crashes)
- Audit logs (security/compliance events)
Metrics
Metrics shine for time-series analysis and alerting. They're compact, efficient, and perfect for dashboards.
Four types of metrics matter:
- Counters (always increasing, like request count)
- Gauges (can go up/down, like memory usage)
- Histograms (distribution of values in buckets)
- Summaries (similar to histograms but with quantiles)
Cardinality—the number of unique time series—is crucial. A single metric like http_requests_total can explode into thousands of time series when labeled with dimensions like endpoint, status code, and customer ID.
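As an illustration of how labels multiply series, here's a small sketch using the Python prometheus_client library; the metric and label names are examples, not a recommendation.

```python
from prometheus_client import Counter

# One metric name, but each unique label combination becomes its own time series.
http_requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["endpoint", "status_code"],  # keep label sets small and bounded
)

# Two endpoints x two status codes = four series so far; adding an unbounded
# label such as customer_id would multiply that by the number of customers.
http_requests_total.labels(endpoint="/checkout", status_code="200").inc()
http_requests_total.labels(endpoint="/checkout", status_code="500").inc()
http_requests_total.labels(endpoint="/login", status_code="200").inc()
http_requests_total.labels(endpoint="/login", status_code="500").inc()
```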
Traces
Distributed tracing connects the dots across your microservices. A trace represents a request's journey through your system, with each service adding a "span"—a unit of work with timing and metadata.
Key concepts in tracing:
- Trace context propagation (passing trace IDs between services)
- Span attributes (adding metadata to spans)
- Sampling strategies (collecting enough traces without breaking the bank)
- Service maps (visualizing dependencies between services)
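For a feel of what spans look like in code, here's a minimal manual-instrumentation sketch using the OpenTelemetry Python API. Attribute names like order.id are illustrative, and it assumes a TracerProvider has been configured elsewhere (without one, the API silently no-ops).

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str, item_count: int) -> None:
    # Each unit of work is a span, with timing captured automatically.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)           # span attributes = metadata
        span.set_attribute("order.item_count", item_count)
        # Nested spans form the request's tree within the trace.
        with tracer.start_as_current_span("charge_payment"):
            pass  # downstream call would go here
```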
Key Comparison of Observability Pillars
Pillar | Storage Requirements | Query Patterns | Retention Strategy | Sampling Approach |
---|---|---|---|---|
Logs | High (10-100x metrics) | Full-text search, structured fields | Short-term hot, cold archive | Filter by severity or service |
Metrics | Low (compressed time-series) | Range queries, aggregations | Long-term, downsampled | Pre-aggregation |
Traces | Medium-High | Request path analysis, span queries | Short-term, sampled | Head-based, tail-based, or attribute-based |
Observability isn't just about collecting data—it's about making sense of it. By leveraging logs, metrics, and traces effectively, you can build a system that not only detects issues but also provides the insights needed to resolve them quickly.
Do I really need all three pillars for good observability?
You don't need all three to start, but you'll want them eventually.
Many teams begin with logging and metrics, then add distributed tracing as they grow. Each pillar answers different questions:
- Logs answer: What events happened in detail?
- Metrics answer: What's the overall health and performance?
- Traces answer: How do requests flow through your services?
The pillar integration maturity model looks like this:
- Level 0: Siloed tools, manual correlation
- Level 1: Common timestamp format, basic cross-referencing
- Level 2: Exemplars linking metrics to traces
- Level 3: Trace IDs in logs, metrics derived from spans
- Level 4: Unified data model with automatic correlation
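Level 3 (trace IDs in logs) is often the cheapest correlation win. Here's one possible sketch, assuming the OpenTelemetry Python API is available: a logging filter that stamps each record with the active trace and span IDs so logs can be joined to traces in your backend.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace/span IDs to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)
logging.getLogger().addHandler(handler)
```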
Cutting-edge approaches like OpenTelemetry are now blurring the boundaries between these pillars. The OTLP (OpenTelemetry Protocol) allows unified collection while specialized backends handle storage and querying.
How do I know if I have an observability problem?
You're dealing with observability issues if:
- You're playing "guess the root cause" during incidents
- Engineers spend hours digging through logs manually
- You find out about problems from users, not your systems
- You can see that something's wrong but not why it's wrong
- Different teams argue about whose service is causing issues
- Your post-mortems repeatedly cite "insufficient visibility" as a factor
- You maintain shadow systems for monitoring critical functions
- Your on-call rotation has become a dreaded assignment
Quantitative signals include:
- MTTR (Mean Time To Resolution) exceeding SLO targets
- Increasing time spent on false-positive alerts
- A growing percentage of incidents discovered by customers
- Expanding gap between 90th and 99th percentile latencies
Engineering sentiment surveys often reveal observability gaps before metrics do – ask your team if they feel confident troubleshooting in production.
What is OpenTelemetry and how is it transforming observability?
OpenTelemetry (OTel) has become the industry standard for generating and collecting telemetry data – a universal language of observability that works across vendors and tools.
Core Components of OpenTelemetry
The OTel project consists of:
- API: The interface applications use to generate telemetry
- SDK: The implementation of those interfaces
- Semantic Conventions: Standardized naming and attributes
- Instrumentation Libraries: Auto-instrumentation for popular frameworks
- Collector: Agent for processing and exporting telemetry
- OTLP: OpenTelemetry Protocol for data transmission
The technical architecture typically includes:
- Instrumentation layer: Auto and manual instrumentation generate telemetry (OTel SDKs)
- Collection layer: Agents or collectors receive, process, and forward data (OTel Collector)
- Pipeline layer: Filtering, sampling, enrichment, transformation (OTel Processor)
- Export layer: Data sent to backends (OTel Exporters)
- Storage/Analysis layer: Time-series DB, log stores, trace backends (Vendor or OSS)
- Visualization layer: Dashboards, query interfaces (Vendor or OSS)
OTel handles the first four layers, creating a clean separation of concerns.
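As a rough sketch of how those layers map to code, here's a minimal Python SDK setup that generates spans, batches them, and exports over OTLP to a local Collector. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and a Collector is listening on localhost:4317.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Instrumentation layer: the SDK generates spans for this service.
resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)

# Export layer: batch spans and ship them over OTLP/gRPC to the Collector,
# which handles processing, sampling, and forwarding to your backend of choice.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```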
OpenTelemetry Adoption Strategies
Implementation approaches include:
- Full OTel + OSS: OTel with Prometheus, Loki, Jaeger, etc.
- OTel + Vendor: OTel instrumentation with commercial backends
- Hybrid: OTel for some services, vendor SDKs for others
- Vendor-only: Proprietary agents and instrumentation
The trend is clear – OTel adoption has increased between 2022 and 2025, with 76% of enterprises now using it for at least some services.
What are the best OpenTelemetry practices for different languages?
OpenTelemetry implementation varies across programming languages and frameworks. Here's how to approach it for popular stacks:
Java
Auto-instrumentation strategy:
- Use the Java agent JAR with a single JVM parameter
- Cover Spring, Hibernate, JDBC, Kafka, etc. automatically
- Add custom annotations for business-level spans
Manual instrumentation best practices:
- Leverage @WithSpan annotations for service methods
- Use span processors for common cross-cutting concerns
- Implement custom samplers for high-volume services
Common pitfalls:
- Too many spans in high-throughput loops
- Unhandled exceptions causing orphaned spans
- Heavy payloads in span attributes impacting performance
JavaScript/TypeScript (Node.js)
Auto-instrumentation approach:
- Use the Node.js SDK with auto-instrumentations
- Register instrumentations early in the application lifecycle
- Configure context propagation for async operations
Best practices:
- Create dedicated tracing modules for reusable instrumentation
- Use resource detection for cloud-specific metadata
- Implement custom propagators for legacy systems
Performance considerations:
- Use batch span processors in production
- Implement sampling for high-volume APIs
- Keep span attributes concise
Python
Instrumentation approach:
- Combine auto-instrumentation with manual spans
- Use context managers for scope management
- Leverage ASGI/WSGI middleware for web frameworks
Integration patterns:
- Add span context to logging records
- Use span processors for common attributes
- Implement custom propagators for RPC frameworks
Common issues:
- Context propagation in async code
- Resource leaks with unclosed spans
- Inconsistent attribute naming
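A minimal sketch tying these Python patterns together, assuming opentelemetry-instrumentation-flask is installed and a TracerProvider is configured as in the earlier OTLP example: WSGI-level auto-instrumentation plus a manual span managed by a context manager.

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # one auto-generated span per request

tracer = trace.get_tracer(__name__)

@app.route("/orders/<order_id>")
def get_order(order_id: str):
    # Manual span (the context manager handles scope), nested under the request span.
    with tracer.start_as_current_span("load_order") as span:
        span.set_attribute("order.id", order_id)  # illustrative attribute name
        return {"order_id": order_id}
```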
Go
Instrumentation strategy:
- Explicit context passing through function calls
- Middleware for standard HTTP handlers
- Custom tracers for specific service boundaries
Best practices:
- Use context propagation consistently
- Create helper functions for common instrumentation patterns
- Leverage attribute conventions for consistent naming
Performance optimization:
- Implement tail sampling for high-cardinality services
- Use batch span exporters with appropriate buffer sizes
- Optimize attribute value serialization
Polyglot Environments
For organizations with multiple languages:
- Standardize on OpenTelemetry Collector deployment
- Create language-agnostic instrumentation guidelines
- Use semantic conventions consistently across languages
- Implement cross-service correlation through B3 or W3C context propagation
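To show what cross-service correlation looks like in practice, here's a hedged sketch of W3C trace context propagation using the OpenTelemetry Python propagation API. The inventory URL is hypothetical; the traceparent header it sends is understood by OTel SDKs in any language.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_inventory_service():
    # Caller: inject the current trace context into outgoing headers
    # (W3C traceparent by default) so the next service continues the trace.
    with tracer.start_as_current_span("call_inventory"):
        headers = {}
        inject(headers)
        requests.get("http://inventory.internal/stock", headers=headers)

def handle_request(incoming_headers: dict):
    # Callee: extract the context and start spans as children of the caller's span.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("check_stock", context=ctx):
        ...
```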
What's the difference between OpenTelemetry and vendor solutions?
The observability ecosystem consists of open standards like OpenTelemetry and vendor-specific implementations.
OpenTelemetry vs. Vendor SDKs
Aspect | OpenTelemetry | Vendor SDKs |
---|---|---|
Portability | High (vendor-neutral) | Low (vendor lock-in) |
Feature release | Community-driven pace | Vendor's roadmap |
Customization | Highly customizable | Vendor-dependent |
Support | Community + commercial | Vendor-provided |
Language coverage | Broad but varies by maturity | Depends on vendor focus |
Integration Considerations
Vendors generally fall into three categories in their OTel approach:
- Native OTel Support: Direct ingestion of OTLP data
- Partial Support: OTel collectors with vendor-specific exporters
- Minimal Support: Requiring data transformation or bridges
When evaluating vendor solutions, consider:
- Native OTLP ingest capability
- Support for OTel semantic conventions
- Exemplar support for metrics-to-traces correlation
- Custom attributes and dimensions handling
- Performance impact and overhead
The ideal scenario combines OTel's standardized instrumentation with your choice of backend – allowing you to switch vendors without reinstrumenting your code.
How much data should I collect?
Not "as much as possible" – that's a rookie mistake that leads to skyrocketing costs.
Instead, consider:
Essential vs. Nice-to-Have Signals
Essential:
- Standard infrastructure metrics (CPU, memory, disk, network)
- Service-level RED metrics (Requests, Errors, Duration)
- Key business transactions (logins, checkouts, API calls)
- Critical user journeys (sign-up flow, core features)
- Error logs with context (stack traces, request IDs)
- Traces for high-value transactions
Nice-to-Have:
- Debug logs in production
- 100% trace sampling
- Raw resource metrics (vs. aggregates)
- User behavior analytics
Instrumentation Hierarchy
Create an instrumentation hierarchy following this pattern:
- Infrastructure layer: Virtual machines, containers, databases
- Platform layer: Service meshes, API gateways, messaging
- Application layer: Services, functions, batch jobs
- Business layer: Transactions, user journeys, revenue events
The hierarchy helps prioritize – each higher layer depends on lower layers, so ensure you have good coverage at the foundation.
Data Collection Strategy
For each service, define:
- Sampling strategy: Which traces to collect (e.g., always sample errors)
- Log levels: When to emit DEBUG vs INFO vs ERROR
- Metric resolution: How often to collect metrics (10s, 30s, 1m)
- Span attributes: What context to include in traces
- Log context: What metadata to include with log events
Document these decisions in a data collection plan with clear ownership and review cycles.
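For the sampling piece, here's one way it might look with the OpenTelemetry Python SDK: keep 10% of new traces at the root and respect the parent's decision downstream so traces stay complete. ("Always sample errors" is typically handled by tail-based sampling in the Collector rather than in the SDK.)

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: the decision is made when the root span starts.
sampler = ParentBased(root=TraceIdRatioBased(0.10))  # 10% of new traces
provider = TracerProvider(sampler=sampler)
```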

How do I reduce observability costs?
Is your observability bill climbing faster than gas prices? Try these tactics:
Intelligent Data Reduction
- Implement head-based sampling – Sample traces at the entry point based on criteria like customer tier
- Use tail-based sampling – Collect interesting traces (errors, slow) and discard others
- Apply dynamic log levels – Adjust log verbosity in real time based on conditions
- Create focused metrics aggregations – Pre-aggregate high-cardinality metrics instead of storing raw data
- Prune noisy logs – Identify and filter repetitive, low-value log entries
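As one example of pruning noisy logs at the source, here's a small hypothetical Python logging filter that suppresses repeats of the same message within a time window:

```python
import logging
import time

class RepeatSuppressionFilter(logging.Filter):
    """Drop repeats of the same message within a window to cut low-value log volume."""

    def __init__(self, window_seconds: float = 60.0):
        super().__init__()
        self.window = window_seconds
        self._last_seen: dict[str, float] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = f"{record.name}:{record.levelno}:{record.getMessage()}"
        now = time.monotonic()
        if now - self._last_seen.get(key, float("-inf")) < self.window:
            return False  # suppress the duplicate
        self._last_seen[key] = now
        return True

logging.getLogger().addFilter(RepeatSuppressionFilter(window_seconds=60))
```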
Technical Cost Optimizations
- Compress data in transit – Enable GZIP/Snappy compression in your collectors
- Use efficient serialization – Protobuf often reduces payload size by 30%+ vs. JSON
- Implement local aggregation – Aggregate metrics at the collector level
- Optimize retention policies – Implement tiered storage with decreasing resolution
- Manage cardinality – Set limits on label values, especially for high-volume metrics
Examples of Cost Reduction Impact
Technique | Before | After | Savings |
---|---|---|---|
Error-only logs in production | 2TB/day | 200GB/day | 90% |
5% trace sampling | $12,000/mo | $1,800/mo | 85% |
Metric cardinality limits | 5M series | 500K series | 90% |
Tiered storage policy | $8,000/mo | $3,200/mo | 60% |
Benchmarks from our customers show an average cost reduction of 40-60% through these techniques, without meaningful loss of visibility.
What makes a good alert?
Good alerts are like good friends – they speak up when it matters and stay quiet when it doesn't.
Your alerts should be:
- Actionable – Someone knows exactly what to do when it fires
- Meaningful – Tied to user experience, not just technical metrics
- Clear – Anyone on-call can understand what's wrong
- Precise – Low false-positive rate
- Documented – Links to runbooks and relevant dashboards
Alert Design Patterns
Multi-level alerting:
- L1 (Warning): Might require action soon
- L2 (Error): Requires action within SLA
- L3 (Critical): Requires immediate action
Alert ownership matrix:
- Define clear ownership of each alert by team
- Create escalation paths for cross-functional issues
- Document handoff procedures between teams
Alert consolidation:
- Group related alerts to prevent alert storms
- Implement alert suppression during known issues
- Create parent/child relationships between alerts
Advanced Alerting Techniques
- Anomaly detection: ML-based alerting for complex patterns
- Composite alerts: Trigger only when multiple conditions are true
- SLO-based alerting: Alert on the burn rate of the error budget
- Business-impact alerting: Correlate technical issues to revenue or user impact
- Seasonality-aware thresholds: Account for time-of-day and day-of-week patterns
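To make SLO-based alerting concrete, here's a back-of-the-envelope Python sketch of multi-window burn-rate alerting. The error rates are made up, and the 14.4 threshold is the commonly cited fast-burn value from the Google SRE Workbook (2% of a 30-day budget spent in one hour).

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed; 1.0 spends it exactly over the SLO window."""
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget

slo = 0.999
long_window = burn_rate(error_rate=0.02, slo=slo)    # error rate over the last hour
short_window = burn_rate(error_rate=0.025, slo=slo)  # error rate over the last 5 minutes

# Page only when both windows agree, which filters out short blips.
if long_window > 14.4 and short_window > 14.4:
    print("PAGE: error budget burning far faster than sustainable")
```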
How do I build an observability culture?
Tools are just 50% of observability success – culture is the other half.
Organizational Models
Three common observability organizational models:
- Centralized: A dedicated observability team owns all tooling, standards, and practices
  - Pros: Consistency, specialized expertise
  - Cons: Potential bottleneck, disconnect from application teams
- Federated: A platform team provides the foundation; application teams handle their own instrumentation
  - Pros: Scalability, application-specific knowledge
  - Cons: Inconsistent implementation, duplication of effort
- Community of Practice: Observability champions across teams, supported by a central center of excellence
  - Pros: Knowledge sharing, grassroots adoption
  - Cons: Relies on individual champions, potential lack of resources
Cultural Implementation Strategies
Technical practices:
- Make observability part of your definition of "done" (no feature ships without proper instrumentation)
- Include observability in architecture reviews
- Add observability champions to each team
- Create "Dark Launch" patterns with observability gates
Team practices:
- Conduct regular "observability reviews" alongside code reviews
- Include observability in post-mortems ("Could better observability have prevented/reduced this incident?")
- Create observability skill ladders for career development
- Add observability KPIs to team goals
Organizational practices:
- Celebrate when good observability helps solve incidents faster
- Share observability wins and lessons learned
- Create cross-team observability working groups
- Tie observability improvements to business outcomes
Maturity Model
Level | Description | Characteristics |
---|---|---|
1 - Reactive | Basic monitoring | Siloed tools, alert-driven, limited visibility |
2 - Proactive | Coordinated monitoring | Shared tools, better coverage, still threshold-driven |
3 - Integrated | Basic observability | Three pillars partially integrated, some exploration capabilities |
4 - Optimized | Advanced observability | Full integration, SLO-driven, business-aligned |
5 - Predictive | Autonomous observability | AI-assisted, predictive capabilities, self-healing systems |
The best observability culture happens when teams see it as a superpower, not a chore. Show concrete examples of how it makes their lives better, rather than treating it as a compliance exercise.
How do managed observability solutions compare to self-hosted options?
Managed observability platforms offer turnkey solutions with varying levels of integration, scalability, and cost models.
Types of Managed Observability Solutions
Cloud provider native services:
- AWS CloudWatch, Azure Monitor, Google Cloud Monitoring
- Strengths: Deep integration with cloud services, familiar billing
- Weaknesses: Vendor lock-in, cross-cloud limitations
- Best for: Single-cloud deployments
Observability-focused vendors:
- Last9, Dynatrace, Honeycomb, Lightstep
- Strengths: Purpose-built features, integrated experience
- Weaknesses: Potential cost scaling issues, vendor lock-in
- Best for: Teams wanting an integrated experience
Open source as managed service:
- Grafana Cloud, InfluxData Cloud, Elastic Cloud
- Strengths: Familiar tools with managed convenience, open formats
- Weaknesses: Less integrated than purpose-built platforms
- Best for: Teams with existing open-source experience
Should I Build or Buy My Observability Stack?
The build vs. buy question comes down to your core business, resources, and specific requirements.
Key Decision Factors
Cost considerations:
- Build: High upfront engineering cost, ongoing maintenance
- Buy: Predictable per-GB or per-host pricing, but with the potential for surprise bills
- Hybrid: Controlled costs for basics, premium for specialized needs
Scaling factors:
- Build: Requires dedicated scaling expertise but can optimize for your workloads
- Buy: Vendors handle scaling but may impose limits or cost penalties
- Hybrid: Use open source for high-volume basics, and vendors for specialized needs
Integration needs:
- Build: Maximum flexibility but requires custom integration work
- Buy: Pre-built integrations but potential vendor lock-in
- Hybrid: Standard formats (OpenTelemetry) with vendor backends
Decision Framework
Factor | Weight Toward Build | Weight Toward Buy |
---|---|---|
Team size | Large engineering team | Small/medium team |
Observability expertise | Deep in-house knowledge | Limited expertise |
Data volume | Very high (>100TB/day) | Low to moderate |
Compliance needs | Highly specialized | Standard requirements |
Cost sensitivity | Long-term investment view | Predictable OpEx |
Integration needs | Unique system landscape | Standard cloud/tools |
What Observability Metrics Matter?
Skip vanity metrics and focus on these:
Technical Fundamentals
The Four Golden Signals:
- Latency: How long requests take (p50, p90, p99)
- Traffic: How many requests you're serving (RPS)
- Errors: How often requests fail (error rate %)
- Saturation: How "full" your system is (resource usage %)
Service-Level Indicators (SLIs):
- Availability: Percentage of successful requests
- Latency: Request processing time
- Throughput: Requests handled per second
- Correctness: Business-logic errors
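A tiny sketch of computing two of these SLIs from raw request records (the records below are made up):

```python
import statistics

# Hypothetical request records: (latency in ms, success flag)
requests_log = [(120, True), (95, True), (940, False), (210, True), (180, True)]

latencies = sorted(latency for latency, _ in requests_log)
availability = sum(ok for _, ok in requests_log) / len(requests_log)

# statistics.quantiles returns 99 cut points for n=100; index 49 ~ p50, index 98 ~ p99.
cuts = statistics.quantiles(latencies, n=100)
p50, p99 = cuts[49], cuts[98]

print(f"availability={availability:.2%}  p50={p50:.0f}ms  p99={p99:.0f}ms")
```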
User-Centric Metrics
Frontend performance:
- Time to First Byte (TTFB)
- First Contentful Paint (FCP)
- Largest Contentful Paint (LCP)
- First Input Delay (FID)
- Cumulative Layout Shift (CLS)
User journey metrics:
- Funnel completion rates
- User frustration signals (rage clicks, form abandonment)
- Session success rate
- Feature usage frequency
Business Metrics
Direct business impact:
- Revenue per minute
- Transactions per second
- Active users
- Conversion rate
Indirect business impact:
- Customer satisfaction scores
- Net Promoter Score (NPS)
- Support ticket volume
- Customer retention rates
SRE-Focused Metrics
Operational health:
- Mean Time to Detection (MTTD)
- Mean Time to Resolution (MTTR)
- Change failure rate
- Deployment frequency
SLO metrics:
- Error budget consumption
- SLO compliance percentage
- SLI degradation trends
- SLA violations
How Do I Measure Observability ROI?
Quantify your observability ROI with these metrics:
Primary ROI Categories
Incident reduction:
- MTTD/MTTR reduction – How much faster do you detect and resolve issues?
- Incident frequency reduction – Fewer incidents due to better detection of early signals
- Incident severity reduction – Lower impact due to faster response
Engineering efficiency:
- Debugging time reduction – Hours saved per incident or bug
- On-call burden reduction – Fewer pages, shorter incident durations
- Development velocity improvement – Faster deployments with confidence
Business impact:
- Avoided downtime – Issues caught before they impact users
- Customer satisfaction improvement – Fewer outages, happier customers
- Revenue protection – Prevented losses from outages or performance degradation
ROI Calculation Frameworks
Basic ROI calculation:
$\text{ROI} = \frac{\text{Benefit} - \text{Cost}}{\text{Cost}} \times 100\%$, which helps measure the profitability of an investment.
Benefit components:
- Incident hours saved × average hourly cost of incidents
- Engineer hours saved × fully loaded engineering cost
- Downtime avoided × cost per minute of downtime
Cost components:
- Observability platform costs (vendor or infrastructure)
- Engineering time for instrumentation and maintenance
- Training and operational overhead
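Putting the formula and components together, here's a hypothetical worked example in Python; every figure below is invented purely to show the arithmetic.

```python
def observability_roi(benefit: float, cost: float) -> float:
    """ROI as a percentage: (benefit - cost) / cost * 100."""
    return (benefit - cost) / cost * 100

# Annual benefit: incident hours, engineer hours, and downtime avoided (all hypothetical)
benefit = (
    40 * 8_000      # incident hours saved x average hourly cost of an incident
    + 1_200 * 120   # engineer hours saved x fully loaded hourly engineering cost
    + 90 * 2_500    # downtime minutes avoided x cost per minute of downtime
)
# Annual cost: platform spend + instrumentation effort + training (also hypothetical)
cost = 150_000 + 60_000 + 15_000

print(f"ROI: {observability_roi(benefit, cost):.0f}%")  # roughly 206% with these numbers
```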
Where Should Observability Live in My Organization?
Observability isn’t just for operations teams anymore. It should be embedded across multiple functions to ensure reliability, performance, and business impact.
Organizational Models
Centralized observability team:
- Dedicated team responsible for tools, standards, and best practices
- Works closely with platform engineering
- Provides consulting to application teams
- Manages observability budgets and costs
Embedded observability engineers:
- Specialists embedded within development teams
- Focus on service-specific instrumentation
- Build domain-specific dashboards and alerts
- Share learnings across teams
Platform team with observability function:
- Observability as a platform capability
- Self-service tools for development teams
- Standardized instrumentation libraries
- Centralized expertise with distributed implementation
Community of practice:
- Observability champions across teams
- Central knowledge sharing and standards
- Grassroots adoption and advocacy
- Regular cross-team sharing sessions
Responsibility Matrix
The most successful organizations treat observability as a shared responsibility across teams:
Team | Responsibility | Examples |
---|---|---|
Platform | Foundation, tooling, standards | Collection infrastructure, data storage, base dashboards |
Development | Service instrumentation, custom dashboards | App metrics, logs, traces, service SLOs |
SRE/Ops | Alerting, incident response, capacity planning | Alert rules, runbooks, SLO definitions |
Security | Security-focused observability | Audit logs, anomaly detection, compliance monitoring |
Business/Product | Business metrics, user journey monitoring | Conversion funnels, user experience metrics, revenue impact |
What are common challenges when implementing OpenTelemetry?
OpenTelemetry brings great benefits, but implementation comes with hurdles.
Technical Challenges
- Context propagation issues: Lost trace context in async operations, missing spans, incomplete trace trees.
- Performance overhead: CPU/memory impact, network bandwidth, high storage costs.
- Complexity management: Consistent instrumentation, collector deployment, configuration drift.
Organizational Challenges
- Skill gaps: Limited expertise, steep learning curve, complex troubleshooting.
- Cross-team coordination: Standardization, sampling strategies, attribute naming.
- Migration complexity: Moving off vendor SDKs, maintaining compatibility, and running dual instrumentation during the transition.
Solutions & Best Practices
- Technical fixes: Use auto-instrumentation, OpenTelemetry Collector, global interceptors, and shared libraries.
- Organizational strategies: Form a working group, set guidelines, build reusable components, and enforce verification in CI/CD.
- Migration plan: Start with new services, use the collector for format translation, and migrate one signal at a time.
How do I ensure data quality in my observability pipeline?
Poor data quality weakens observability. Here's how to maintain reliable telemetry.
Common Data Quality Issues
- Inconsistent metadata: Service names, attribute names, and cardinality control vary.
- Incomplete context: Missing links between logs, metrics, and traces; lack of business/environmental context.
- Reliability problems: Data loss under load, incomplete traces, clock sync issues.
Strategies to Improve Data Quality
- Standardization: Use OpenTelemetry semantic conventions, enforce consistent naming, centralize configs.
- Pipeline validation: Run synthetic transactions, add telemetry coverage tests, implement meta-monitoring.
- Data enrichment: Use OTel processors for metadata, automate service discovery, add business context.
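For the standardization point, here's a short sketch of what consistent OpenTelemetry semantic conventions look like at the resource level (the values are illustrative). The same Resource would be handed to the TracerProvider shown earlier so every signal carries the same identity.

```python
from opentelemetry.sdk.resources import Resource

# Standard semantic-convention attribute keys, so telemetry from different
# services lines up in queries and correlates cleanly across signals.
resource = Resource.create({
    "service.name": "checkout-service",
    "service.namespace": "payments",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})
```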
Observability Data Governance
- Data lifecycle management: Define retention policies, use tiered storage, apply sampling.
- Quality metrics: Monitor telemetry volume, trace completion rates, and context propagation success.
- Continuous improvement: Regular reviews, dashboards for data quality, track instrumentation coverage.
What's Next for Observability?
Keep an eye on these trends shaping the future of observability.
Technical Innovations
- AI-assisted troubleshooting: Automated anomaly detection, root cause analysis, and predictive alerts.
- eBPF-based observability: Kernel-level insights, real-time network visibility, and zero-instrumentation tracing.
- Continuous verification: Chaos engineering integration, SLO-driven deployments, and synthetic canaries.
Industry Trends
- OpenTelemetry dominance: Becoming the standard for vendor-neutral, unified telemetry.
- Observability-driven development: Systems built with observability as a core requirement.
- Unified observability platforms: Logs, metrics, and traces in one place with context-preserving correlation.
Emerging Technologies
- Web3/blockchain observability: Monitoring distributed ledgers, smart contracts, and cross-chain transactions.
- Edge computing observability: Low-overhead instrumentation, handling intermittent connectivity, and local data processing.
- Quantum computing metrics: Tracking qubit states, circuit performance, and error correction.
The gap between leaders and laggards in observability is widening—which side do you want to be on?
Conclusion
The observability landscape continues to evolve rapidly, with new tools and techniques emerging constantly.
Remember that observability is ultimately about outcomes – faster incident resolution, better user experience, and more reliable systems. Keep those goals in mind as you build your observability strategy.