Monitoring and observability give teams a view across a company's entire IT infrastructure, helping them identify issues, troubleshoot problems, and improve performance. Prometheus and Cortex are two popular time series database (TSDB) systems, commonly used by SRE and DevOps teams.

In this article, we will dive deep into both of these tools. We will discuss:

What is Prometheus
What is Cortex
Advantages of using Prometheus over Cortex
Benefits of using Cortex over Prometheus
Key features of Prometheus
Key features of Cortex
Integration with Grafana
Challenges in running Prometheus
Challenges in running Cortex

Prometheus

Prometheus, originally created at SoundCloud, is an open-source monitoring and alerting toolkit built around a time series database. It collects and stores metrics as time series data, with each series identified by a metric name and a set of key-value labels. This data model supports robust querying: metrics can be analyzed and retrieved over specific time ranges or filtered by labels. As a result, Prometheus gives users deep insight into the behavior of their systems and applications, and it simplifies identifying performance trends, spotting anomalies, and troubleshooting issues.
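To make the idea of querying by label and time range concrete, here is a minimal sketch that builds a request URL for the Prometheus HTTP API range-query endpoint (/api/v1/query_range). The server address, metric name, and label values are assumptions chosen for illustration, and the sketch only builds the URL rather than sending it.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step):
    """Return a URL for the Prometheus /api/v1/query_range endpoint.

    start and end are Unix timestamps in seconds; step is the spacing
    in seconds between the points returned for each series.
    """
    params = urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical metric and labels: per-instance rate of HTTP 500 responses
# for the "api" job, averaged over 5-minute windows.
QUERY = 'sum by (instance) (rate(http_requests_total{job="api", status="500"}[5m]))'

# One hour of data at one-minute resolution from an assumed local server.
url = build_range_query_url("http://localhost:9090", QUERY, 1700000000, 1700003600, 60)
print(url)
```

The label matchers inside the braces do the filtering, while start, end, and step select the time range and resolution.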

The Prometheus server, client libraries, Alertmanager, and other related components can be found at the Prometheus GitHub organization. The main repository is: https://github.com/prometheus/prometheus


Prometheus leverages a diverse ecosystem of exporters, which are components that gather metrics from various sources and expose them for scraping. This lets it monitor an extensive range of targets, including servers, containers, and databases. Prometheus combines efficient local storage, an expressive query language, and extensive integration options.
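As a rough illustration of what an exporter hands to Prometheus, the sketch below renders a metric in the Prometheus text exposition format using only the standard library. The metric name, labels, and values are made up for the example; a real exporter would normally use an official client library and serve this text over HTTP.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical gauge with two labelled series.
text = render_metric(
    "queue_depth",
    "Jobs waiting in the queue.",
    "gauge",
    [
        ({"queue": "email", "region": "eu"}, 12),
        ({"queue": "sms", "region": "eu"}, 3),
    ],
)
print(text)
```

Each line after the HELP and TYPE comments is one time series sample: a metric name, a label set, and the current value.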

Prometheus can be deployed on Kubernetes using a Prometheus Helm chart or the Prometheus Operator. The Prometheus Operator simplifies the deployment and configuration of Prometheus, Alertmanager, and related monitoring components. It also provides Kubernetes custom resources to deploy and manage Prometheus and Alertmanager instances.

Cortex

Cortex, originally created at Weaveworks and now a CNCF open-source project, provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. Cortex lets users run Prometheus-compatible monitoring at scale using a service-based architecture that can ingest millions of samples per second. Designed for cloud-native environments, Cortex supports PromQL and offers capabilities such as multi-tenancy and long-term storage that are beyond the scope of a single Prometheus instance. Prometheus sends its data to Cortex through the remote write protocol, which makes Cortex a storage backend for one or many Prometheus servers.

Cortex code is available on GitHub at https://github.com/cortexproject/cortex, along with docs, tutorials, and use cases.

A key feature of Cortex is its integration with object storage systems like Amazon S3 or Google Cloud Storage. This allows organizations to store monitoring data for extended periods cost-effectively while keeping it easily accessible for analysis.

Despite its enhancements, Cortex maintains compatibility with the Prometheus ecosystem, including using PromQL for querying, alerting, and recording rules. This ensures a seamless transition and compatibility with existing Prometheus setups. Cortex complements Prometheus by addressing scalability and long-term storage challenges. It enables organizations to scale their monitoring infrastructure, retain data for extended periods, and gain deeper insights into systems and applications.

There are several ways to deploy Cortex on Kubernetes, including using raw YAML files, Helm charts, or the Kubernetes Operator.

Advantages of using Prometheus over Cortex

Simplicity: Prometheus is known for its simplicity and ease of use. Prometheus makes it quick to set up and start monitoring your systems. It does not require any significant configuration or learning curve.
Real-Time Monitoring: Prometheus excels in real-time monitoring and alerting. It scrapes targets at short intervals and keeps the most recent samples in memory before persisting them to local disk. This allows immediate access to fresh data and enables fast detection and response to issues.
Rich Ecosystem: Prometheus has a mature and extensive ecosystem with many integrations and exporters. Many popular tools, frameworks, and cloud services have native Prometheus support. This makes collecting metrics from various sources easier without additional configuration or customization.
Powerful Querying and Analysis: Prometheus offers a powerful querying language, PromQL, which allows for flexible and expressive metrics analysis. It enables users to gain deep insights into system behavior, detect anomalies, and identify performance trends.
Community Support: Prometheus has a large and active community of developers and users who contribute to its ongoing development, provide support, and share knowledge. This vibrant community ensures continuous improvement, frequent updates, and a wealth of resources for troubleshooting and learning.
Compatibility: Prometheus is compatible with the broader ecosystem of tools and services that support Prometheus metrics, making it easier to integrate with existing monitoring and observability systems.
Real-Time Data Storage: Prometheus keeps recent metrics in memory and older data on local disk, with no network hop to a separate storage service. This enables fast access and query performance, which is particularly advantageous for monitoring dynamic systems and environments.
Self-Contained Solution: Prometheus is a self-contained monitoring solution that includes data collection, storage, querying, and alerting capabilities. It eliminates additional components or dependencies, simplifying the monitoring setup.
Easy Setup and Maintenance: Setting up and maintaining Prometheus is simple. It does not require complex distributed architectures or extensive configuration, making it more accessible for small to medium-sized monitoring deployments.
Proven Track Record: Prometheus has been widely adopted by many companies and organizations since its inception, proving its reliability and effectiveness in real-world monitoring scenarios.

Advantages of Cortex over Prometheus

Cortex offers many advantages, making it a powerful choice for managing and analyzing large-scale monitoring data. Let us discuss them in detail.

Scalability: Cortex offers horizontal scalability, allowing you to easily handle growing monitoring data by adding more Cortex nodes to the cluster.
Long-term Storage: With Cortex, you can store monitoring data for extended periods using object storage systems like Amazon S3 or Google Cloud Storage. This provides cost-effective and scalable long-term storage.
Federated Queries: Cortex lets you execute queries across multiple Cortex nodes. This gives you a global view of your monitoring data and simplifies analysis across multiple Prometheus instances.
Efficient Resource Utilization: Cortex's sharding capabilities distribute data across multiple storage instances. This feature optimizes resource utilization and query performance.
Compatibility: Cortex maintains compatibility with the Prometheus ecosystem. This feature can help you use the existing PromQL queries and configurations without significant changes.
Community Support: Cortex benefits from an active developer community, which brings ongoing improvements, support, and a wealth of resources that help you get the most out of the tool.
Integration: Cortex integrates well with other cloud-native technologies, enabling seamless integration into your existing monitoring and observability ecosystem.
Reliability: Cortex uses a distributed architecture to ensure high availability and fault tolerance, minimizing the risk of data loss or downtime.
Data Retention: Cortex's object storage integration allows you to retain monitoring data for compliance or historical analysis purposes, providing valuable insights into past trends and patterns.
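The scalability, sharding, and reliability points above all rest on spreading series across nodes and keeping more than one copy of each. The following is a generic consistent-hashing sketch of that idea, not Cortex's actual ring implementation; the node names, the token count, and the replication factor of 3 are assumptions for illustration.

```python
import bisect
import hashlib


def _hash32(text):
    """Map a string to a stable 32-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")


class HashRing:
    """Toy consistent-hash ring that assigns series to several nodes."""

    def __init__(self, nodes, tokens_per_node=64):
        # Each node owns many tokens so load spreads evenly around the ring.
        self._ring = sorted(
            (_hash32(f"{node}-{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]
        self._node_count = len(set(nodes))

    def owners(self, series_key, replication_factor=3):
        """Return the distinct nodes that should hold this series."""
        wanted = min(replication_factor, self._node_count)
        start = bisect.bisect_right(self._tokens, _hash32(series_key))
        found = []
        # Walk clockwise from the series hash, skipping repeated nodes.
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == wanted:
                    break
        return found


ring = HashRing(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
print(ring.owners('team-a/http_requests_total{job="api"}'))
```

Adding a node only claims the slices of the ring next to its new tokens, so most series keep their existing owners. That is what makes adding capacity cheap, while the replicas are what let queries survive the loss of a node.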

Key Features - Prometheus

Flexible Querying: Prometheus offers PromQL, a flexible query language for analyzing time-series data.
Multi-dimensional Data Model: It uses a powerful data model that allows efficient filtering and grouping of metrics based on labels.
Alerting and Alertmanager: Prometheus evaluates alerting rules itself and hands firing alerts to Alertmanager, which manages grouping, deduplication, and routing of notifications.
Rich Ecosystem and Integrations: It has a vibrant ecosystem with many exporters, libraries, and integrations available.
Dynamic Service Discovery: Prometheus supports dynamic service discovery, simplifying the monitoring setup in dynamic environments.

Key Features - Cortex

Scalability: Cortex can scale horizontally across multiple machines, surpassing the limitations of a single machine, and can handle advanced workloads.
High Availability: Cortex can replicate data between machines in a clustered setup. This feature ensures that graphs remain uninterrupted even during machine failures.
Multi-tenancy: Cortex supports isolating data and queries from independent Prometheus sources within a single cluster. This means that multiple untrusted parties can securely share the same cluster.
Long-Term Storage: Cortex supports long-term storage using popular object storage services such as Amazon S3, Google Cloud Storage, OpenStack Swift, and Microsoft Azure storage. This allows you to store metric data durably beyond the lifespan of any individual machine.
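Multi-tenancy in Cortex is driven by a tenant identifier carried on every request in the X-Scope-OrgID HTTP header. The sketch below builds, without sending, an instant-query request scoped to one tenant. The host name, the tenant ID, and the /prometheus path prefix are assumptions; the prefix is configurable, so check your deployment before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def tenant_query_request(cortex_url, tenant_id, promql):
    """Build (but do not send) an instant-query request for one tenant."""
    params = urlencode({"query": promql})
    return Request(
        # Assumed query path; the Prometheus API prefix is configurable.
        f"{cortex_url.rstrip('/')}/prometheus/api/v1/query?{params}",
        # The tenant header decides whose data the query is allowed to see.
        headers={"X-Scope-OrgID": tenant_id},
    )


# Hypothetical endpoint and tenant name.
request = tenant_query_request("http://cortex.example.internal:9009", "team-a", "up")
print(request.full_url)
```

Two teams sending the same PromQL with different tenant IDs get answers computed only from their own series, which is how untrusted parties can share one cluster. In practice the header is usually set by an authenticating proxy rather than by the client itself.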

The Cost of Running Prometheus vs. Cortex

Although Prometheus and Cortex are open source, running them at scale carries significant costs. A managed Prometheus-as-a-service offering is one way to reduce monitoring costs and operational toil.

Infrastructure Costs

Prometheus is a single-node system with a smaller infrastructure footprint than a Cortex cluster. However, as your scale increases, a single Prometheus node may not be able to handle the load, leading to the need for additional Prometheus instances and thus increasing costs.

Cortex is designed to handle many metrics across a distributed system. Managing this effectively requires substantial computational, network, and storage resources. As your scale increases, so does the need for more powerful and numerous machines, directly impacting your infrastructure costs.

External Services

Prometheus, by default, does not rely on external services for storage, as it stores data locally. This can simplify operations and reduce costs, but it also limits Prometheus's scalability and durability.

Cortex relies on external services for its index and chunk storage: for example, Amazon DynamoDB, Google Bigtable, or Cassandra for the index, and Amazon S3, Google Cloud Storage, or Azure Blob Storage for the chunks. The costs of these services can add up quickly, particularly for large deployments or heavy usage. Furthermore, network costs for data transfer to and from these services can also be significant.

Data Transfer Costs

Prometheus typically has lower data transfer costs than Cortex, especially if it and the services it monitors are running in the same network region. On the other hand, the distributed nature of Cortex can result in higher data transfer costs.

Substantial data transfer occurs between various system components in a distributed system like Cortex. Depending on your cloud provider's pricing model, you may incur costs for data transfer, especially when it crosses regions or leaves the provider's network.

Operational Costs

Operating a single Prometheus instance is simpler and, thus, often less costly than operating a Cortex cluster. However, as the scale and complexity of your monitoring needs grow, managing multiple Prometheus instances can become a complex task.

The complexity of managing and operating a Cortex cluster includes costs not directly linked to resources or services. These include the costs of the operational team, the time and effort spent on setup, maintenance, monitoring, debugging, and optimizing the system, and the learning curve associated with understanding and managing the system effectively.

Redundancy Costs

Prometheus doesn't provide built-in data redundancy, which has pros and cons. On the one hand, it saves on storage costs compared to Cortex, which stores multiple data replicas. On the other hand, it means that if your Prometheus node fails, you could lose your metrics data.

To ensure high availability and durability, Cortex stores multiple data replicas. While this is crucial for production-grade reliability, it multiplies your storage costs.
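A back-of-the-envelope calculation shows how replication multiplies storage. The sketch uses the rule of thumb from the Prometheus storage documentation (disk space is roughly retention time × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample). The ingestion rate, the 15-day retention, and the replication factor of 3 are assumptions, and the replicated figure is a naive upper bound: compaction or deduplication in the storage backend may reduce it.

```python
def estimated_disk_bytes(samples_per_second, retention_days,
                         bytes_per_sample=2.0, replicas=1):
    """Rough disk estimate: rate x retention x bytes per sample x copies."""
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample * replicas


# Assumed workload: 100,000 samples per second kept for 15 days.
single = estimated_disk_bytes(100_000, 15)
replicated = estimated_disk_bytes(100_000, 15, replicas=3)

print(f"single copy:    {single / 1e9:.1f} GB")      # 259.2 GB
print(f"three replicas: {replicated / 1e9:.1f} GB")  # 777.6 GB
```

The same arithmetic applies to longer retention: keeping a year of data instead of 15 days scales both numbers by roughly 24, which is why long-term data usually goes to cheaper object storage.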

Cost of Scaling

Scaling Prometheus as your monitoring needs grow can become challenging. You may need to set up additional Prometheus instances and federation, which can increase costs and complexity. While more costly to operate overall, Cortex is designed for scalability, allowing you to handle much larger volumes of metrics data without the same level of increased complexity.

Cortex is designed to be scalable, but scaling is not cost-free. As your monitoring needs grow, you'll need to scale your Cortex cluster, storage systems, and network capacity. This scaling will increase your costs in a near-linear fashion.

Integration with Grafana

Prometheus and Cortex are powerful monitoring tools on their own. Paired with Grafana, a popular open-source platform for time series analytics and visualization, as the frontend, they provide a comprehensive, scalable, and visually pleasing monitoring solution.

Here's how you can use Prometheus, Cortex, and Grafana together:

Prometheus and Cortex Configuration:

Ensure that Prometheus is configured to use Cortex as its remote storage backend through remote write. Once Prometheus is scraping metrics from your systems and pushing the data to Cortex for long-term storage, you can connect this pipeline to Grafana.
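As a sketch of what that configuration looks like, the helper below prints a remote_write fragment for prometheus.yml that points at Cortex's push endpoint (/api/v1/push) and, optionally, sets the tenant header. The Cortex address and tenant ID are hypothetical; whether the X-Scope-OrgID header is needed depends on how authentication is set up in front of your Cortex cluster.

```python
def remote_write_fragment(cortex_url, tenant_id=None):
    """Return a prometheus.yml fragment that forwards samples to Cortex."""
    lines = [
        "remote_write:",
        f"  - url: {cortex_url.rstrip('/')}/api/v1/push",
    ]
    if tenant_id is not None:
        # Only needed when Cortex expects the tenant ID from the client.
        lines += ["    headers:", f"      X-Scope-OrgID: {tenant_id}"]
    return "\n".join(lines) + "\n"


# Hypothetical Cortex address and tenant.
fragment = remote_write_fragment("http://cortex.example.internal:9009", "team-a")
print(fragment)
```

Merging the printed fragment into the top level of prometheus.yml and reloading Prometheus starts the push. Prometheus keeps writing to its local storage as well, so recent data remains queryable from both places.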

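Next, add Prometheus and Cortex as data sources in Grafana. Because Cortex exposes a Prometheus-compatible query API, it is added as a Prometheus-type data source that points at the Cortex query endpoint. With both data sources added, Grafana can be your one-stop shop for generating dashboards and visualizations of your metrics. Here's how: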

- Creating Dashboards: Grafana lets you create dashboards composed of different panels, with each panel displaying a particular visualization (like a graph, gauge, or heatmap) based on a PromQL query run against either Prometheus or Cortex.

- Querying Data: When setting up a panel, you can write a query to specify the data you want to display. You can choose whether to execute the query against Prometheus or Cortex, depending on the time range and the scope of the data you're interested in.

- Alerting: Grafana also has built-in alerting functionality. You can set up alerts based on your Prometheus/Cortex data, and Grafana will notify you when the alert conditions are met.
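The choice between the two data sources when querying often comes down to whether the requested time range still fits inside Prometheus's local retention. A minimal sketch of that rule follows; the data source names are hypothetical, and the 15-day window assumes Prometheus is running with its default retention.

```python
from datetime import datetime, timedelta, timezone

# Prometheus keeps 15 days of local data unless retention is reconfigured.
LOCAL_RETENTION = timedelta(days=15)


def pick_data_source(query_start, now=None, local_retention=LOCAL_RETENTION):
    """Choose a data source name based on how far back the query reaches."""
    now = now or datetime.now(timezone.utc)
    if now - query_start <= local_retention:
        return "Prometheus"  # recent data: served from local storage
    return "Cortex"          # older data: served from long-term storage


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(pick_data_source(now - timedelta(hours=6), now=now))  # Prometheus
print(pick_data_source(now - timedelta(days=90), now=now))  # Cortex
```

A query that needs a global view across several Prometheus servers would go to Cortex regardless of the time range, since a single Prometheus only knows about its own targets.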

Challenges in running Prometheus

While Prometheus is a powerful and widely used monitoring tool, there are a few challenges and potential issues that users may encounter when deploying and managing Prometheus at scale:

1. Scalability Issues: Prometheus is a single-node system that scrapes and stores time series data. As the number of nodes in your infrastructure increases, Prometheus can struggle to scale, especially when dealing with a high volume of metrics.

2. Data Retention: Prometheus stores data on a local disk by default, and its data retention period is limited by available disk space. Storage can become an issue if you need to store metrics data for a long time or the amount of metrics data grows rapidly.

3. High Availability: Prometheus is a single-node system that doesn't provide built-in high availability. If the Prometheus instance fails, you lose visibility into your metrics until the instance is restored.

4. Long-Term Storage: Out of the box, Prometheus does not provide a solution for long-term storage of metrics. You can configure Prometheus to send samples to a remote storage system through remote write, but that means deploying and operating a separate storage backend alongside Prometheus.

5. Federation Limitations: Prometheus provides federation, where one Prometheus server can scrape selected time series from another. However, federation is not recommended for large-scale deployments as it can introduce new performance and management challenges.

6. Multi-Tenancy: Prometheus doesn’t natively support multi-tenancy. This could present an issue for companies or projects requiring isolation between different teams or departments.

7. Resource Usage: Prometheus, especially in large deployments, can consume significant CPU and memory resources due to the nature of the time series database and the number of metrics it's processing.
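To show what the federation mentioned in item 5 involves, the sketch below builds the URL a higher-level Prometheus would scrape on a lower-level one. The /federate endpoint takes one or more match[] selectors naming the series to pull; the host name and selectors here are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(source_url, selectors):
    """Return the /federate URL that selects series from another server."""
    # Each selector becomes its own match[] parameter.
    params = urlencode([("match[]", selector) for selector in selectors])
    return f"{source_url.rstrip('/')}/federate?{params}"


# Hypothetical source server; pull one job plus aggregated recording rules.
selectors = ['{job="api"}', '{__name__=~"job:.*"}']
print(federate_url("http://prometheus-eu.example.internal:9090", selectors))
```

Every series matched by the selectors is copied on each scrape, so broad selectors make the federating server re-ingest most of what the source already stores. Pulling only pre-aggregated series keeps that overhead manageable.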

It's important to note that many of these challenges can be mitigated or addressed using additional tools in conjunction with Prometheus. For example, Levitate can be a scalable, multi-tenant, and long-term storage backend for Prometheus, addressing several of the above-mentioned issues.

Challenges in running Cortex

Cortex is a powerful, highly scalable, and highly available solution for Prometheus metrics storage and querying. However, it does come with a set of unique challenges that users may encounter during deployment and management:

1. Complex Setup and Configuration: Cortex uses a micro-services architecture, which means setting it up involves deploying and configuring multiple components, each with its own configuration options. The setup process can be complicated and daunting, particularly for users who are new to such systems.

2. Operational Complexity: Because of the distributed nature of its architecture, operating a Cortex cluster is more complex than operating a single-node system like Prometheus. This includes managing data replication, sharding, handling failovers, and ensuring high availability.

3. Resource Usage: Cortex is designed to handle large-scale metric data and has substantial computational, network, and storage resource requirements. Especially in large deployments, the cost associated with running a Cortex cluster can be significant.

4. Dependency on External Services: Cortex requires an external data store for its index (like Amazon DynamoDB, Google Bigtable, or Cassandra) and an object store for its chunks (like Amazon S3, Google Cloud Storage, or Azure Blob Storage). This adds another layer of complexity and potential points of failure.

5. Monitoring and Debugging: Given its distributed nature, monitoring and debugging a Cortex deployment can be challenging. Ensuring all components are functioning correctly, troubleshooting issues, and optimizing performance requires a deep understanding of the system's internals.

6. Data Migration: If you plan to switch from Prometheus to Cortex, the lack of a straightforward data migration pathway can be a significant challenge. Currently, there isn't an easy way to migrate old data from Prometheus to Cortex.


Levitate commits to a 99.9% write availability and a 99.5% read availability and provides a managed offering without the hassle of maintaining your observability and monitoring infrastructure. Get started today.

Conclusion

Prometheus and Cortex are powerful monitoring systems that contribute to monitoring and observability in different ways.

Prometheus excels in real-time monitoring, simplicity, and a rich ecosystem of integrations. It offers a straightforward setup, a powerful query language, and a proven track record. It is an excellent choice for small to medium-scale deployments that prioritize real-time monitoring and simplicity.

On the other hand, Cortex addresses the challenges of scalability and long-term data retention. It provides horizontal scalability, long-term storage with object storage integration, and federated queries for a global view of monitoring data. Cortex is suitable for larger-scale monitoring environments that require distributed architectures and efficient long-term data storage. However, it comes with its own challenges of operational complexity and dependency on external services.
