What needs to change in software monitoring?

Software monitoring is stuck in the Stone Age. It’s the one space that has not kept abreast of the growth of mobile, the proliferation of microservices, and the demand from customers for stellar products.

If you’re a startup gaining traction and building a Monitoring team, where do you start? If you’re an enterprise that wants to embark on a Digital Transformation journey, software monitoring is a vital cog in that journey. This post is for those who want to reinvent their monitoring practices, but don’t know where to start.

Instrumentation

The first and foremost problem to be tackled in software monitoring is Instrumentation. This is where things go both right and wrong. The wrong part hurts when it matters the most: when your service is down, and you’re scrambling to find out what broke where, when, how, and why.

This is important enough to deserve a deep dive in a separate post. More on this later.

Two points to keep in mind while thinking of Instrumentation:

Deep Instrumentation: Ensure all parts of the application are covered and well-instrumented. This includes business logic, database queries, external API calls, and front-end interactions. The more comprehensive the instrumentation, the better your monitoring.
Automated Instrumentation: Utilize tools that automatically instrument code without requiring significant developer effort. This can cover more code paths and reduce manual errors.
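
To make the idea of automated instrumentation concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory METRICS store and a hypothetical decorator named instrumented; a real setup would rely on an instrumentation library and export to a metrics backend instead. The point is only that wrapping a function once records calls, errors, and duration without touching its body.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store, used here only for illustration.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Wrap a function so every call records count, errors, and duration."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise  # never swallow the error, only count it
            finally:
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


# Hypothetical database call, instrumented with a single line.
@instrumented("db.fetch_user")
def fetch_user(user_id):
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

Calling fetch_user a few times and then reading METRICS["db.fetch_user"] gives the call count, error count, and cumulative duration for that code path.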

Advanced Metrics

There are RED (Rate, Errors, Duration) metrics, and there are business-specific metrics. Go to the drawing board and decide to measure only what matters, because monitoring and measuring everything gets complicated quickly. Have first-order and second-order Advanced Metrics clearly defined.

Most orgs fall into this pit of adding alerts to everything, and then nothing really matters.

Granular Metrics: Collect fine-grained metrics at various levels, such as application, system, and network. This includes latency, error rates, throughput, and resource utilization.
Custom Metrics: Define and monitor custom metrics that are specific to business processes and application logic.
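
As a rough illustration of RED metrics, the sketch below computes rate, error ratio, and a p95 duration from a list of request records for one time window. The Request type and the red_summary function are hypothetical names made up for this example, and the percentile uses a simple nearest-rank method rather than whatever your metrics backend implements.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float    # seconds since epoch
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def red_summary(requests, window_seconds):
    """Return rate (req/s), error ratio, and p95 duration for one window."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: ceil(0.95 * n), computed with integers to avoid float drift.
    rank = max(1, (95 * len(durations) + 99) // 100)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }
```

A business-specific metric would follow the same shape: pick the records that matter (orders placed, payments failed) and summarize them per window.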

Real-Time Monitoring and Alerts

Alerting is where monitoring culminates in action. But most alerting setups spiral into fatigue and are eventually rendered useless. You need a real-time view, and dynamic alerting to aid your monitoring journey.

Contextual information propagating through the system helps. Important capabilities like anomalous pattern detection help get better alerting systems in place.

Real-Time Dashboards: Create real-time dashboards that provide a live view of system health and performance.
Dynamic Alerts: Set up alerts that can adapt based on historical data and trends rather than static thresholds, reducing alert fatigue.
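
One simple way to picture a dynamic alert is a threshold derived from recent history instead of a fixed number. The sketch below assumes a hypothetical is_anomalous helper that flags a value when it sits more than a chosen number of standard deviations away from the historical mean. Real anomaly detection usually also accounts for seasonality and trend, so treat this as an illustration of the idea rather than a recommended detector.

```python
import statistics


def is_anomalous(history, current, sigmas=3.0, min_points=5):
    """Flag `current` if it deviates from the historical mean by more
    than `sigmas` population standard deviations."""
    if len(history) < min_points:
        return False  # not enough history to judge; stay quiet
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A perfectly flat history: any change at all is unusual.
        return current != mean
    return abs(current - mean) > sigmas * stdev
```

With a latency history hovering around 100 ms, a reading of 103 ms stays quiet while 120 ms fires, and the boundary moves on its own as the history changes.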

Collaboration and Culture

All Site Reliability Engineering comes down to collaboration and culture. It defines how one thinks of software monitoring: as an essential, or as an annoying deterrent to building features and capabilities.

The ex-CTO of Gojek has written a fantastic post on repaying tech debt, and it highlights the need to focus on rebuilding DevOps culture. It’s a brilliant read.

Teams that don’t develop a culture of training and education around the importance of software monitoring tend to be penalized brutally for it over time. There’s enough said about this, so I won’t rant more. Actually, I have here. 😛

Visualization that means something

Work with folks who understand Shannon Limits, and what they mean for building visualizations. Good UI and UX is the first step to better visual elements in your monitoring stack. Machines need to convey logic in a manner that is readable and actionable to humans.

Most Dashboards are built for the developers building them, and not for the users consuming this information. Dashboards also tend to be slow if they aren’t built on warehousing principles.

Rich Visualization: Utilize advanced visualization tools to make sense of the collected data. This can include heat maps, flame graphs, and dependency graphs.
User-Friendly Dashboards: Design dashboards that are intuitive and customizable based on user roles and needs. (Different teams have different monitoring needs, from Customer Support to Product Management.)

Data Retention and Management

Think Data Warehouse, NOT Database.

This is an important point to drive home, so bear with me on this tiny rant…

Monitoring involves tracking metric data over time. Recording data as a time series allows for the continuous observation of changes and trends. One can then associate these data points with specific timestamps, making it easier to correlate events and activities with system performance, and therefore enabling better diagnosis of issues.

Monitoring data needs to be recorded as a time series because the temporal aspect is crucial for analyzing, understanding, and acting upon system performance and behavior. Time series data enables trend analysis, anomaly detection, capacity planning, root cause analysis, alerting, automation, visualization, and compliance reporting, all of which are essential components of effective monitoring.

Open source has steered monitoring towards TSDBs (time series databases) such as Prometheus. But that has turned out to be a massive issue. Let me explain:

Databases are used for transactional processing (OLTP - Online Transaction Processing). They are designed to handle a large number of short transactions such as insertions, updates, and deletions.

Question: When collecting monitoring data do we use it for OLTP?
No. Not really.

We actually need it for OLAP (Online Analytical Processing). Data warehouses are designed to handle large volumes of data and complex queries to support business intelligence and decision-making processes.

Monitoring data needs to be denormalized and optimized for read performance. Guess what? That is precisely what a Data Warehouse offers as opposed to a Database where data is typically normalized to reduce redundancy and improve data integrity.
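
To illustrate what "optimized for read performance" can look like, here is a small sketch that pre-aggregates raw samples into fixed-width time buckets, so a dashboard query reads a handful of summary rows instead of scanning every point. The rollup function and the shape of its output are assumptions made for this example and do not describe how any particular warehouse or TSDB stores data.

```python
from collections import defaultdict


def rollup(samples, bucket_seconds):
    """Aggregate (timestamp, value) samples into fixed-width time buckets.

    Returns a dict keyed by bucket start time, each holding a small
    denormalized summary that is cheap to read back.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }
```

The trade-off is the usual analytical one: you give up per-sample detail in the rolled-up rows in exchange for much cheaper queries over long time ranges.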

Understand the first principles of data usage, and choose the foundation of your monitoring stack wisely.

For example, Levitate comes with Blaze, Hot, and Cold tiers to help with fast queries and better cost management. Data Tiering in a Time Series Database is lacking, and needs better solutions.

Machine Learning and AI Integration

These are early days, but teams should have an AI/ML strategy around monitoring. It could be around help in writing better PromQL queries or support for data retrieval. The ability to spot recurring issues and patterns is majorly lacking, and is one area where AI tools can help significantly.

I don’t want to shoot myself in the foot here, but there are already tonnes of use cases out there. Which one is relevant and an absolute must-have depends entirely on your monitoring stack and the trade-offs you can live with. Will dive into this in a separate post. Still tinkering with some use cases. ✌️

Feel free to chat with us on our Discord or reach out to me if you want to discuss DevOps/SRE.

You can also book a demo with us to understand Last9, or even give us feedback, suggestions et al. ✌️
