Prometheus has become the tool of choice for monitoring highly dynamic cloud environments. But managing Prometheus in-house can quickly become complex, especially as organizations look to scale. Scaling Prometheus often involves deploying hundreds of Prometheus servers and instances, and introduces challenges with managing infrastructure, visibility, and access control.
A managed Prometheus service is a fully managed solution that lets you use the open source Prometheus project to collect and analyze metrics without having to manage the underlying Prometheus infrastructure. A managed Prometheus solution can reduce the complexity of managing Prometheus and metric data in a distributed environment.
This article discusses some challenges of self-managed Prometheus and how a managed Prometheus solution resolves those challenges.
Challenges of Self-Managed Prometheus at Scale
Many of the issues with self-managed Prometheus occur as organizations scale, increasing the complexity of the ecosystem and operational burden. Let's look at some of these challenges:
Increased Management Overhead
A single Prometheus installation comprises several components, including the Prometheus server, exporters (such as the Blackbox exporter), Alertmanager, and Pushgateway. For each environment (development, staging, and production), you need to deploy Prometheus servers, exporters, and Grafana for building dashboards and alerts. Each of these servers requires frequent upgrades, configuration, and federation.
In a distributed environment, you have to manage all these components with data split over several instances, multiple exporter options to set up, and metrics exposed from several endpoints. Your DevOps and site reliability engineering (SRE) teams can spend several hours every day managing and maintaining multiple servers and instances instead of focusing on core business needs.
Hard to Scale Horizontally
As you add more applications, increase cluster size, or serve new regions, a single Prometheus instance will eventually reach capacity, and you'll need to scale out to balance the workload. Scaling Prometheus horizontally involves deploying Prometheus on multiple servers using sharding. Sharding splits scrape targets into multiple groups on several Prometheus servers, with each group small enough to be handled by a single Prometheus instance.
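As a rough illustration of the sharding idea described above, the following Python sketch hashes each scrape target and takes the result modulo the number of shards, so every target lands on exactly one Prometheus server. Prometheus offers a `hashmod` relabeling action for this purpose; the sketch only mimics the idea and is not guaranteed to reproduce Prometheus's exact shard assignment. The function names and target addresses are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer,
    # then take the modulus to pick a shard.
    return int.from_bytes(digest[8:], "big") % num_shards


def assign_targets(targets: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group targets by shard so each Prometheus server scrapes only its own group."""
    groups: dict[int, list[str]] = defaultdict(list)
    for target in targets:
        groups[shard_for(target, num_shards)].append(target)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical target addresses, for illustration only.
    targets = [f"app-{i}.example.internal:9100" for i in range(12)]
    for shard, members in sorted(assign_targets(targets, 3).items()):
        print(shard, members)
```

Because the assignment depends only on the target address and the shard count, every server can compute it independently. Note that changing the shard count reshuffles many targets, which is part of the operational cost of sharding.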
But sharding makes the infrastructure more complex, bringing back much of the management overhead discussed earlier. It also makes querying and troubleshooting more difficult, as developers now need to query multiple instances to get all the metrics they need. Teams can use federation to aggregate data into a single Prometheus instance for querying. But if you need to aggregate a large number of time series, you're again limited by the memory and storage capacity of the server on which that instance resides.
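To make the "query multiple instances" burden concrete, here is a minimal Python sketch of the client-side work involved: building an instant-query URL for each shard and merging the responses into one view. It assumes the standard Prometheus HTTP API endpoint `/api/v1/query` and its JSON response shape for instant vectors; the shard URL and function names are hypothetical, and the actual HTTP calls are left out so the example stays self-contained.

```python
from urllib.parse import urlencode


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def merge_instant_vectors(responses: list[dict]) -> dict[tuple, float]:
    """Merge instant-vector results from several shards, keyed by label set."""
    merged: dict[tuple, float] = {}
    for response in responses:
        for series in response["data"]["result"]:
            # A sorted tuple of label pairs gives a hashable, order-independent key.
            key = tuple(sorted(series["metric"].items()))
            _timestamp, value = series["value"]
            merged[key] = float(value)  # the API encodes sample values as strings
    return merged
```

In practice you would fetch `query_url(shard, "up")` for every shard, handle timeouts and partial failures, and only then merge. That extra plumbing is the overhead a single global query endpoint removes.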
Data siloed in local storage on multiple Prometheus instances presents an additional challenge. Teams have limited insight into the health and security of the environment at both the application and hardware levels.
When a failure occurs, debugging and querying data on multiple services manually is time-consuming. It also affects your ability to get your application back up and running as soon as possible. While global visibility can be implemented in Prometheus using configuration and external services, it is complicated and time-consuming to set up and manage.
Storing metrics long-term is important for analytics, disaster recovery, and governance. By default, Prometheus retains metrics for only fifteen days, but it supports several open source long-term storage solutions, including Cortex and Thanos.
However, these solutions introduce additional complexity. Cortex relies on third-party services, like Amazon DynamoDB and Bigtable, each of which has to be scaled separately. And to use the Thanos sidecar, you need to disable Prometheus data compaction, which may decrease performance.
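To get a feel for why retention length matters, the sketch below applies the rough capacity-planning formula from the Prometheus storage documentation: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (commonly estimated at 1 to 2 bytes). The ingestion rate used in the comments is an assumed figure for illustration, not a measurement.

```python
def needed_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Estimate local disk usage for a given retention window."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    rate = 100_000  # assumed ingestion rate, in samples per second
    for days in (15, 365, 730):
        gigabytes = needed_disk_bytes(days, rate) / 1e9
        print(f"{days} days of retention: about {gigabytes:,.1f} GB")
```

At the assumed rate, fifteen days comes to roughly 259 GB, while two years is on the order of 12.6 TB. That is the kind of growth that pushes teams toward external long-term storage.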
Lack of Authentication and Access Control
Prometheus does not have built-in security or user management features. Users or components with access to the network can also access telemetry data.
Prometheus also allows open, unauthenticated access to integrated components. Communication between Prometheus and integrations with services like Grafana is secure as long as they are in the same cluster. But outside of that, you need to ensure a secure connection and access control for Prometheus installations.
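One common way to close this gap in a self-managed setup is to place Prometheus behind an authenticating reverse proxy served over HTTPS. The Python sketch below shows only the client side of that arrangement: building a query request that carries HTTP Basic credentials. The proxy URL, username, and password are hypothetical, and the proxy itself is an assumption; it still has to be deployed and maintained separately.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def authenticated_query_request(
    proxy_url: str, promql: str, username: str, password: str
) -> urllib.request.Request:
    """Build a request for /api/v1/query with an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    url = f"{proxy_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request
```

The request can then be sent with `urllib.request.urlopen(request)`. Every client, dashboard, and integration outside the cluster needs equivalent handling, which is part of the overhead described above.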
What Is a Managed Prometheus Solution?
A managed Prometheus solution removes much of the time and manual effort that comes with implementing a self-managed Prometheus installation. It's a fully managed service that relieves you of the worries of hosting infrastructure, maintaining integrations, and employing an engineering team to manage the ecosystem.
You can still run Prometheus Query Language (PromQL) queries and integrate with Grafana for customized dashboards (several providers also offer a hosted Grafana service), but with the added benefit of consistent global visualization and little to no learning curve.
As you scale, you can increase or decrease infrastructure and resources for metric ingestion, long-term storage, and querying with very little effort. The managed service provider handles the everyday responsibilities of monitoring, configuring, and scaling your endpoints, allowing your in-house teams to focus on solving business problems and developing new products.
Benefits of a Managed Prometheus Solution
A managed Prometheus solution has several benefits:
A managed Prometheus solution allows you to provision and scale Prometheus up, down, or out without increasing the complexity of your environment. Managed Prometheus acts as a fully scalable, in-place replacement for Prometheus without you having to make any infrastructure changes. It makes it super easy to set up globally scalable data stores that support global queries and alerts.
Because it's set in a cloud environment, managed Prometheus also provides durable long-term storage, with some providers offering as many as two years, much more than the default fifteen days offered by open source Prometheus.
The greatest challenge with managing Prometheus manually is the time your team spends configuring and maintaining the system.
A managed Prometheus solution significantly decreases the time to run and manage Prometheus at scale. Configuration and scaling tasks become the vendor's responsibility. Your teams can focus on building features and gathering insight instead of setting up and maintaining Prometheus infrastructure.
With a fully managed Prometheus solution, you have a global view of all your applications and data, allowing you to monitor and set up alerts on all your workloads. Metric correlation makes it easier to pinpoint the source or impact of a problem, and centralized dashboards enable you to query all your metrics to troubleshoot issues faster.
Many of the managed Prometheus services offered by the major cloud providers, such as Amazon Web Services (AWS), Google Cloud, and Azure, allow you to integrate Prometheus with their identity and access management (IAM) security services to help you audit and control access to your data.
With a managed Prometheus solution, developers and app owners don't need to worry about setting up and managing duplicate Prometheus instances for redundancy and high availability. The managed Prometheus vendor provides highly redundant networking infrastructure with low latency, high throughput, and service level agreements (SLAs) of 99.9% uptime or more. Features such as multiple availability zone deployments and automatic replication and failover between zones ensure that your application stays up and running when there is a service or infrastructure failure.
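To put an uptime SLA in perspective, this short Python sketch converts an availability percentage into the downtime it permits over a period. The 30-day period is an assumption for illustration; actual SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime, in minutes, that an availability target permits."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

A 99.9% target leaves about 43 minutes of downtime in a 30-day month, while 99.99% leaves a little over 4 minutes.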
Prometheus can be arduous to set up, configure, and manage. Configuration is done in YAML files with a complex syntax, and PromQL, although a powerful query language, has a steep learning curve. While there are many integrations available to make up for the lack of built-in features for visualization, long-term storage, service discovery, and alerting, having to learn and manage each one adds complexity to your setup.
A managed Prometheus provider handles your Prometheus infrastructure, setup, configuration, and maintenance of integration points. Many solutions come with default configurations, alerts, and dashboards, making it easy to get started. Global dashboards and alerts give users a homogeneous set of visualizations to work with instead of a piecemeal solution created from multiple integrated components.
Prometheus-as-a-Service vs. Managed Prometheus
If you're moving from self-managed Prometheus and looking for a managed Prometheus solution, you may also come across providers offering Prometheus-as-a-service. Although there is some overlap between managed Prometheus and Prometheus-as-a-service, they're not quite the same.
In both cases, the provider sets up the infrastructure, runs Prometheus on their servers, and gives customers access to a hosted Prometheus solution. Customers have access to all the features of self-managed Prometheus, such as PromQL and Grafana integration, with added benefits, such as on-demand resource scaling and global querying and visualization.
With Prometheus-as-a-service, customers have access to the Prometheus instance, typically via a user account with limited access and control over configuration files, security, and other settings. The service provider manages only the backend Prometheus infrastructure, while the customer is left to configure and manage the service.
A managed Prometheus service goes a step further. It uses a blackboxed approach where the provider gives the customer restricted access to the Prometheus setup but takes over full responsibility for managing the Prometheus installation, including configuration, security, and monitoring.
A managed Prometheus solution offers the full package: management, maintenance, security, compliance, storage, and integration of Prometheus services. It is suitable for large production workloads where managing Prometheus infrastructure can become time-consuming and complex as the infrastructure grows.
Self-managed Prometheus setups present several challenges, especially as you scale. Challenges include an increasingly complex environment; considerable time and effort for maintenance, monitoring, and management; infrastructure and storage limitations as you need to scale horizontally; and decreased visibility and security.
A managed Prometheus solution not only takes the everyday management of Prometheus off your hands, but offers additional benefits for security, scalability, and availability.
Last9 can help your business by improving your visibility into the Rube Goldberg machine of microservices. If you're looking to simplify your Prometheus monitoring solution, Levitate is a managed time series database built to handle scale and reduce the complexity of time series database management. With horizontal scalability, high availability, and global visibility into your metrics, you can improve the reliability of your distributed systems without the guesswork.