A checklist to choose a monitoring system

As a ‘managed’ monitoring partner, we speak to a tonne of clients about how to go about choosing the right monitoring system. While a good chunk of our conversations revolve around transparency and costs (many of the folks we speak with are bruised by large bills from incumbents), a lot of the technical day-to-day product necessities get lost in translation.

We’ve had examples of ‘upper management’ buying monitoring products only to realise that folks on the floor are not comfortable with the features (taming high cardinality), or with the basic UI/UX (think dashboards, health, metrics usage, etc.) that should guide and help an SRE do better.

So here’s a simple checklist of sorts anyone can use before narrowing in on a monitoring system:

Scalability — High Cardinality

If you’re blessed with more customers and have to add more engineers to match this growth, the need to scale monitoring becomes apparent as systems struggle to keep pace with the business. A monitoring product has to work at both ends: it needs to be simple and intuitive for the engineering team, and it needs to match the growth and scale of the customers being onboarded.

So the first step is to check how well a monitoring solution can scale.

Sidenote: Think monitoring for 40+ million concurrent users. (We call this “Cricket Scale”)
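
One rough way to see what ‘high cardinality’ means in practice is to count the distinct label combinations each metric produces, since every unique combination is a separate time series the system has to store and query. The sketch below is only an illustration; the metric names, labels and the `series_cardinality` helper are made up for this example and are not taken from any particular product.

```python
from collections import defaultdict


def series_cardinality(samples):
    """Count distinct label combinations (time series) per metric name."""
    seen = defaultdict(set)
    for metric, labels in samples:
        # A series is identified by its metric name plus its full label set.
        seen[metric].add(frozenset(labels.items()))
    return {metric: len(series) for metric, series in seen.items()}


# Hypothetical samples: (metric name, labels) pairs.
samples = [
    ("http_requests_total", {"service": "checkout", "status": "200"}),
    ("http_requests_total", {"service": "checkout", "status": "500"}),
    ("http_requests_total", {"service": "search", "status": "200"}),
    ("http_requests_total", {"service": "checkout", "status": "200"}),  # repeat
    ("queue_depth", {"queue": "orders"}),
]

cardinality = series_cardinality(samples)
```

Adding a label such as a user ID or request ID multiplies these counts quickly, which is usually where a monitoring system starts to struggle.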

Reliability

All monitoring systems should be accountable for their own uptime guarantees and how reliable they can be. ‘Who’s monitoring the monitoring system?’ is a question I believe not enough folks ask. With guarantees and clawbacks in place, customers can hold a managed system accountable for its performance.

And no, Open Source does not mean it’s cheap.

After all, there’s no point in having a monitoring system if it can’t guarantee its own uptime. For example: At Last9, we offer clawbacks when our monitoring system is down. This guarantees, and improves, our accountability to our customers.
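
To make an uptime guarantee concrete, it helps to translate the percentage into a downtime budget. The sketch below assumes a 30-day period and a 99.9% guarantee purely for illustration; the function names are hypothetical, and the actual clawback terms depend on whatever your contract says.

```python
def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime permitted by an uptime guarantee over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def guarantee_breached(downtime_minutes, uptime_percent, period_days=30):
    """True when observed downtime exceeds what the guarantee allows."""
    return downtime_minutes > allowed_downtime_minutes(uptime_percent, period_days)


# Illustrative numbers only: a 99.9% guarantee over 30 days.
budget = allowed_downtime_minutes(99.9)
breached = guarantee_breached(downtime_minutes=60, uptime_percent=99.9)
```

Under these assumptions the budget works out to roughly 43 minutes a month, so a single hour-long outage would already breach the guarantee.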

Mean Time To Detect (MTTD)

The core of a good monitoring system is to reduce MTTD. A well-thought-out product will have simple and easy alerting, strong pattern matching and anomaly detection, and will help guide an SRE to eke out more from their system. Products that are obsessed with a UX-first approach tend to reduce MTTD and costs for practitioners.

Among all the points mentioned, this is the one that will matter the most for an SRE on the floor who’s tasked with scaling and managing a monitoring system.
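
If you want to compare vendors on this point, it helps to measure MTTD the same way everywhere: the average gap between when an incident started and when it was detected. Here is a minimal sketch, assuming you can export (start, detection) timestamps from your incident tracker; the timestamps below are invented for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average of (detected - started) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    delays = [detected - started for started, detected in incidents]
    return sum(delays, timedelta()) / len(delays)


# Hypothetical incidents: (started, detected) pairs.
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 4)),
    (datetime(2023, 1, 9, 14, 30), datetime(2023, 1, 9, 14, 42)),
    (datetime(2023, 1, 17, 9, 15), datetime(2023, 1, 17, 9, 17)),
]

mttd = mean_time_to_detect(incidents)
```

Tracking this number before and after a trial period gives you something firmer than a demo to judge alerting quality on.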

Data Exploration

Storing data is the easy part; providing avenues to explore and dissect it is the hard one. A monitoring system should be able to accelerate queries and return results quickly, and not crash when multiple people are using it.

These issues are extremely prevalent in e-commerce, ride-sharing, video streaming, and foodtech companies. Most business and product teams are dissuaded from exploring their own data because of how fragile the system is.

Exploration should be rated in two categories:

Ability to rake in Cardinality.
A visual health board of Alerting rules.

Engineering Overheads

This is one of the most overlooked points when choosing a monitoring solution. A good monitoring solution should not just reduce storage and exploration costs. It should ultimately be intuitive and simple enough to reduce engineering overheads.

Most orgs we talk with don’t factor in the cost of the engineering salaries it takes to run, scale and manage a monitoring system. These costs should be slashed if you have a managed partner who navigates these complexities for you.
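
A back-of-the-envelope calculation makes the point. The sketch below adds the engineering time spent on monitoring to the infra or vendor bill; every number in it is a made-up placeholder, so plug in your own salaries, headcount and invoices.

```python
def annual_monitoring_cost(platform_cost, engineers, time_fraction, loaded_salary):
    """Platform bill plus the share of engineering salaries spent on monitoring."""
    return platform_cost + engineers * time_fraction * loaded_salary


# Hypothetical figures, for illustration only.
self_hosted = annual_monitoring_cost(
    platform_cost=60_000,   # infra to run the stack yourself
    engineers=3,
    time_fraction=0.5,      # half of each engineer's time
    loaded_salary=150_000,
)
managed = annual_monitoring_cost(
    platform_cost=120_000,  # vendor fee
    engineers=1,
    time_fraction=0.25,
    loaded_salary=150_000,
)
```

With these placeholder figures the self-hosted option has the smaller platform bill but the larger total, which is exactly the cost that tends to be left out of the comparison.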

Automation

A monitoring system will grow as the business grows. Any new infra or component needs to be covered under existing monitoring practices. This has to happen automatically, given the pace at which engineering will grow as customers grow. The cascading nature of microservices means dependencies are intertwined. Manually monitoring and mapping infra will only hamper end-to-end monitoring.
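
A simple way to check whether coverage is keeping up is to diff what exists against what is monitored. The sketch below assumes you can get both lists from somewhere (a service registry and your monitoring config, say); the service names and the `unmonitored` helper are hypothetical.

```python
def unmonitored(discovered, monitored):
    """Components that exist in the infra but have no monitoring coverage."""
    return sorted(set(discovered) - set(monitored))


# Hypothetical inventories.
discovered = ["checkout", "search", "payments", "notifications"]
monitored = ["checkout", "search"]

gaps = unmonitored(discovered, monitored)
```

If a check like this has to be run by hand, or the list of gaps keeps growing, that is a sign the monitoring system is not discovering new components automatically.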

Onboarding time/Migration from existing workflows

A good monitoring partner has a simple and easy onboarding flow and helps migrate existing workflows regardless of how complicated or varied they are. Not only should the onboarding time be dramatically reduced to ensure business continuity, existing workflows should not be hampered. Learning a new language, protocol or a locked-in solution only adds more stress and time to the engineering team.

This was one of the key reasons why a large customer chose us as a partner; business continuity in itself is a prohibitive factor in changing monitoring partners, given the complexities a migration creates.

OTel Compatibility

OTel is important, and potentially the way out of the quagmire ‘Observability’ finds itself in. Being vendor-locked into closed platforms kills innovation, hampers interoperability and means you miss out on all the helpful features that make life easier for an SRE.

At this juncture, this point is a no-brainer. Any monitoring solution needs to be OTel compatible, support open standards such as OpenMetrics and OpenTelemetry, and integrate with open-source tools such as Prometheus, VictoriaMetrics, InfluxDB, Telegraf, and StatsD.

Customer Support

One of the pain points of using Open Source solutions is that you have to rely on the community for answers. This is good, but when things go south, it can be pretty bad. Large companies can’t afford to depend on a community to debug a problem, and customer support for OSS is not only expensive but also unreliable during crunch times.

A vendor solution should have a clear escalation matrix and a ‘time taken to respond’ framework in place to bring in accountability. Late support means loss of revenue from customers, and defeats the very purpose of having a robust monitoring solution.

Tooling fatigue

A monitoring solution should not only be interoperable, but should match the needs of your organization to have essentials under one roof. For example: A TSDB should have an alerting solution so you don’t have to depend on third-party tools for alerting.

The more you have under one roof, the likelier you are to bring in accountability, and to reduce knowledge transfer times and the demand for customized features that help your engineering teams.

Excessive third-party integrations only accelerate tooling fatigue and eventually add to costs.

I may have missed some points as I hastily wrote this and got some input from my colleagues. Are there other factors to consider in particular I may have missed out on? Please do let me know, will add them 🙏