This is the abridged developer documentation for Last9

# Introduction

> Learn how to get started with Last9

## [Getting Started](/docs/introduction/)

[Understand what Last9 is and how to quickly start sending data](/docs/introduction/)

## [Control Plane](/docs/control-plane/)

[Manage your telemetry data, its configurations, and its lifecycle](/docs/control-plane/)

## [Discover](/docs/discover-services/)

[Auto-discover and monitor services, background jobs, infra and more with OpenTelemetry](/docs/discover-services/)

## [Logs](/docs/logs/)

[Explore your logs data, its details, and related telemetry](/docs/logs/)

## [Traces](/docs/traces/)

[Explore your trace spans, their dependencies and timeline charts, and span details](/docs/traces/)

## [Real User Monitoring](/docs/discover-applications/)

[Monitor your web application's performance from your users' perspective](/docs/discover-applications/)

## [Alerting](/docs/alerting-overview/)

[Set up alerts, pattern matching, receive notifications, an IaC tool for alerting](/docs/alerting-overview/)

## [Instrumentation](/docs/integrations/)

[Send data via OpenTelemetry, Prometheus, AWS Cloudwatch, and more](/docs/integrations/)

## [SLOs](/docs/slos/)

[Monitor and manage your service reliability with SLOs and SLIs](/docs/slos/)

## [Tutorials](/docs/howto/)

[Common how-tos for Prometheus, Kubernetes, VictoriaMetrics, etc.](/docs/howto/)

## [FAQs](/docs/faqs/)

[Frequently asked questions about Last9 — what, why, how](/docs/faqs/)

## Other Resources

## [Changelog](https://last9.io/changelog/)

## [Blog](https://last9.io/blog/)

## [Community](https://discord.com/invite/Q3p2EEucx9/)

## [X / Twitter](https://x.com/last9io/)

## [Youtube](https://youtube.com/@last9/)

# Access Policies

> Leverage Last9's access policies to perform traffic shaping of time series data in real-time.

Last9 supports automatic data tiering of metrics based on retention policies. Each data tier has a different retention policy: for example, the Blaze Tier stores data for the last two hours, whereas the Hot Tier stores data for the last six months. Depending on the use case, metrics can be served from a faster or slower tier.

It is crucial that traffic for real-time alerting is always prioritized and served from the fastest tier, the Blaze tier, while Grafana queries can be served from the Hot tier without conflicting with alerting. Access Policies let you create these guardrails so that metrics data is accessed from a specific tier based on its purpose.

Each Access Policy is associated with a Token created for a Tier. Tokens act as ACLs for time series data, granting access to specific tiers for `read`, `write`, or both operations.

## Setting up Tokens

To achieve this, create a read token first from Settings -> Tokens.

[Creating a Read Token in Levitate](https://www.youtube.com/embed/qfdUYwAMZvw)

## Creating Access Policy

Once the Token is created, you can create an Access Policy from Settings -> Policies.

[Creating Access Policy in Levitate](https://www.youtube.com/embed/0j_N9CKyigY)

That's it: you don't have to change anything else. Use the token associated with the Alerting policy to configure Alertmanager, and the token associated with the visualization policy to [configure Grafana](/docs/grafana-config/). Last9 takes care of performing traffic shaping in real-time.
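To make the read-token flow concrete, here is a minimal sketch of querying a Last9 cluster over the standard Prometheus HTTP API using a read token. The endpoint URL, username variable, and basic-auth scheme below are illustrative placeholders, not confirmed values; copy the exact endpoint and credentials from your cluster's data source settings and the [Grafana configuration guide](/docs/grafana-config/).

```bash
# Illustrative placeholders: copy the real endpoint and credentials from your
# Last9 cluster's data source settings (see /docs/grafana-config/).
LAST9_READ_ENDPOINT="https://<your-cluster-endpoint>"
LAST9_USERNAME="<cluster-username>"
LAST9_READ_TOKEN="<read-token>"

# A standard Prometheus HTTP API query, authenticated with the read token.
# The same credentials can back a Grafana Prometheus data source.
curl -s -u "${LAST9_USERNAME}:${LAST9_READ_TOKEN}" \
  "${LAST9_READ_ENDPOINT}/api/v1/query" \
  --data-urlencode 'query=up'
```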
*** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # AI Assistant > Ask questions about your infrastructure in plain English and get instant insights from logs, metrics, traces, and alerts. Last9’s AI Assistant brings conversational observability directly into your workflow. Ask questions like “What’s causing the recent spike in errors?” or “Show me the system health overview” and get instant, actionable insights from your telemetry data. Access the AI Assistant in multiple ways: * **In the Last9 dashboard** for detailed investigation * **In Slack** via @mentions for team collaboration during incidents * **In your IDE** through the [Last9 MCP](/docs/integrations/mcp/) for debugging while coding ![AI Assistant Welcome Screen](/_astro/ai-assistant-welcome.CJ-jILtY_Z2kiWUV.webp) ## AI Assistant vs Last9 MCP vs Slack App Last9 offers three complementary AI-powered interfaces for working with your observability data: | | AI Assistant | Last9 MCP | Slack App | | ---------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------------- | | **Where** | Built into the Last9 dashboard | Your IDE (Claude Web/Code/Desktop, Cursor, VS Code, Windsurf) | Your Slack workspace | | **Best for** | Investigating incidents, reviewing system health, exploring alerts | Debugging production issues while coding, correlating code changes with production | Team collaboration, quick queries during incidents in Slack | | **How it works** | Natural language queries against your telemetry data in the browser | MCP protocol connects your AI coding agent to Last9’s observability APIs | @mention @Last9 in any channel to query observability data | The **AI Assistant** is ideal when you’re in the Last9 dashboard and want to quickly interrogate your infrastructure — check system health, investigate errors, or review alerts without writing queries. **[Last9 MCP](/docs/integrations/mcp/)** brings production context into your development environment. When you’re debugging code in your IDE, MCP lets your AI coding agent query exceptions, logs, traces, and metrics without leaving the editor. The **Slack App** enables team-wide collaboration on incidents directly in Slack. When an alert fires in your incident channel, anyone can @mention @Last9 to investigate without context switching. Use them together: respond to alerts in Slack with @Last9, investigate deeply in the dashboard with the AI Assistant, then switch to MCP in your IDE when you’re ready to write the fix. 
## Key Features ### Natural Language Queries Ask questions in plain English about your infrastructure: * “What errors are happening currently?” * “Show me the p95 latency for my services” * “Are there any 5xx errors?” * “Analyze the performance of my API” ### Quick Actions The AI Assistant provides quick action cards to help you get started: | Quick Action | Description | | ------------------- | -------------------------------------------------------- | | **Recent Errors** | Latest error patterns and incidents across your services | | **Performance** | P95 latency and response metrics for your applications | | **Active Alerts** | Current system alerts status and firing alerts | | **Database Health** | Performance and usage insights for your databases | | **Trace Analysis** | Identify slow request patterns and bottlenecks | | **System Overview** | Comprehensive health report across all services | ### Query Execution Progress When you ask a question, the AI Assistant shows real-time progress as it queries your observability data: * **Query Logs** - Searching log data for relevant information * **Query Exceptions** - Analyzing application exceptions and errors * **Query Metrics** - Fetching performance metrics * **Query Traces** - Examining distributed traces Each step shows a completion status, so you can see exactly what data sources are being analyzed. ### Deep Links Every query the AI Assistant executes includes a **View in …** link that takes you directly to the underlying data in Last9. ![AI Assistant Deep Links](/_astro/ai-assistant-deep-links.C6ESWb3a_1X8agl.webp) This allows you to: * Continue your investigation with the full query interface * Explore the underlying data in more detail * Share specific queries with your team * Build on the AI-generated queries with additional filters Deep links preserve the exact query, time range, and filters, making it easy to transition from conversational exploration to detailed analysis. ### Intelligent Analysis The AI Assistant doesn’t just return raw data—it provides intelligent analysis including: * **Service Health Tables**: Visual overview of all services with throughput, error rates, response times, and health status * **Key Findings**: Automatically identified critical issues and positive indicators * **Structured Error Details**: Service name, environment, error type, and error messages for quick diagnosis * **Alert Configuration Summary**: Overview of configured alert rules and active instances * **Recommended Actions**: Prioritized action items (Immediate, High Priority, Monitor, Review) ## Getting Started 1. **Navigate to AI Assistant** Click on **AI Assistant** in the left sidebar of your Last9 dashboard, or click the AI Assistant card on the Home page. 2. **Ask a question or use Quick Actions** Type your question in the input field, or click one of the quick action cards to get started immediately. 3. **Review the analysis** The assistant will gather information from your observability data and present: * Service health status * Active alerts and their severity * Key findings and recommendations * Actionable next steps 4. **Continue the conversation** Ask follow-up questions to dive deeper into specific services, alerts, or issues. The assistant maintains context throughout your conversation. ## Example Use Cases ### System Health Overview ```plaintext "Give me a system health overview" ``` The assistant provides: * Available environments (Production, Staging, etc.) 
* Service health table with throughput, error rates, and response times * Critical issues with firing alerts * Positive indicators (no exceptions, healthy services) * Recommended actions prioritized by urgency ### Error Investigation ```plaintext "Which errors are happening currently?" ``` The assistant analyzes: * Recent exceptions across services * Error patterns and frequencies * Affected endpoints and services * Suggested investigation steps ### Performance Analysis ```plaintext "What is the p95 latency for my services?" ``` The assistant returns: * P95 response times for each service * Services exceeding thresholds (marked as Slow or Critical) * Performance degradation details * Comparison with configured thresholds ### Alert Review ```plaintext "Show me active alerts" ``` The assistant displays: * Currently firing alerts with severity levels * Alert duration and trigger times * Threshold values vs current values * Recommended remediation steps ## Chat History Your conversations are saved in the chat history sidebar, allowing you to: * Resume previous investigations * Reference past analyses * Track recurring issues over time To start a fresh conversation, click the **+ New chat** button. ## Ask Mode in Logs and Traces In addition to the dedicated AI Assistant, you can use natural language queries directly within the Logs and Traces explorers through **Ask Mode**. ### Using Ask Mode Navigate to [Logs Explorer](https://app.last9.io/logs) or [Traces Explorer](https://app.last9.io/traces) and click the **Ask** tab to access AI-powered querying. ![Ask Mode in Traces](/_astro/ai-ask-mode.B-wvu7cQ_17keF8.webp) Type your question in natural language, such as: * “investigate slow requests from last9 api” * “show me errors from the payment service” * “find database queries taking more than 1 second” The AI will generate the appropriate filters and display matching results. ![Ask Mode Results in Logs](/_astro/ai-ask-mode-output.FNG-8oju_Z1CfTaS.webp) ### Quick Start Templates Ask Mode provides pre-built templates to help you get started quickly: | Template | Description | | -------------------- | ---------------------------------------------------- | | **Error Traces** | Show all traces with error status codes | | **Slow Traces** | Find traces with duration greater than 1 second | | **Database Queries** | Filter traces containing database operations | | **HTTP Requests** | Show traces with HTTP status codes and request paths | | **Failed Requests** | View traces with 4xx and 5xx status codes | | **Service Errors** | Display traces from specific service with errors | ### Team Queries Ask Mode also displays saved queries from your team, making it easy to reuse common investigation patterns across your organization. ## Using AI Assistant in Slack Connect Last9 AI Assistant to your Slack workspace to query observability data directly from Slack channels using @mentions. This brings AI-powered insights into your team’s existing incident response workflows without leaving Slack. ### Prerequisites Before installing the Slack app, ensure you have: * **AI Assistant enabled** for your Last9 organization. Request access from the [AI Assistant page](https://app.last9.io/ai-assistant) if not already enabled * **Observability data flowing** to Last9 (metrics, logs, traces, or events) * **Slack workspace permissions** to install apps * **Last9 user account** with the same email as your Slack account (required for authorization) ### Installation 1. **Open the installation URL** Visit 2. 
**Authorize the app** * Verify the Slack workspace is correct * Review the requested permissions: * Read @mentions in channels * Send messages to channels * Read basic user information * Click **Allow** to authorize Last9 in your Slack workspace 3. **Invite the bot to channels** After installation, invite `@Last9` to the channel(s) where you want to use it: ```plaintext /invite @Last9 ``` 4. **Verify installation** Test by mentioning `@Last9` with no query text. You should receive a helper prompt describing available queries. ### Using the Slack App #### Query Format Mention `@Last9` followed by your question in any channel where the bot is present: ```plaintext @Last9 [your question] ``` #### Example Queries **Error Investigation:** ```plaintext @Last9 Show me errors for cloudflare in the last hour @Last9 Which services have 5xx errors? @Last9 What's causing the recent spike in errors? ``` **Performance Analysis:** ```plaintext @Last9 Get endpoints with 5xx responses in the last hour @Last9 What is the p95 latency for my services? @Last9 Show me slow database queries ``` **System Health:** ```plaintext @Last9 Give me a system health overview @Last9 What alerts are currently firing? @Last9 Show me active incidents ``` #### Response Behavior * **Threaded replies**: All responses appear in threaded replies to keep channels organized * **Contextual follow-ups**: Ask follow-up questions in the same thread to maintain context * **Deep links**: Responses include “View in Logs”, “View in Traces”, or “View in Exceptions” links to the Last9 dashboard for detailed investigation * **Structured output**: Results are formatted with tables, bullet points, and sections for easy scanning ![AI Assistant Slack Response](/_astro/ai-slack-response-example.MLOhQMju_ZwrUIV.webp) ### Channel Configuration **Public Channels:** Mention `@Last9` directly in any public channel where the bot is a member. **Private Channels:** 1. Invite the bot to the private channel first: ```plaintext /invite @Last9 ``` 2. Then use `@Last9` mentions as normal ### Team Collaboration The Slack app enables collaborative incident investigation: * **Shared context**: Everyone in the thread sees the same analysis * **Team visibility**: Questions and answers are visible to all channel members * **Async investigation**: Team members in different time zones can contribute to the same investigation thread * **Audit trail**: Slack’s message history preserves the investigation timeline **Example incident workflow:** 1. Alert fires and posts to `#incidents` 2. On-call engineer asks: `@Last9 Show me errors in the payment service` 3. AI provides structured error analysis with deep links 4. Team lead asks follow-up: `@Last9 When did this start?` 5. Engineer clicks “View in Traces” to investigate root cause in dashboard 6. 
Resolution documented in the same Slack thread ### Troubleshooting **AI Assistant not responding?** * Check the [AI Assistant page](https://app.last9.io/ai-assistant) to verify it’s enabled for your organization * Ensure `@Last9` is invited to the channel (use `/invite @Last9`) * Verify the bot user appears in the channel member list **“Not authorized” error?** * Your Slack email must match your Last9 account email * Request a Last9 account from your organization admin if you don’t have one * Contact for access issues **Responses seem incomplete?** * Try rephrasing your question with more specific details * Include time ranges (e.g., “in the last hour”, “since 3pm”) * Specify service or environment names if relevant * Use the deep links in responses to continue investigation in the dashboard **Multiple workspaces:** The Slack app supports multi-workspace installations. Install separately in each workspace where you need AI Assistant access. ## Best Practices * **Start broad, then narrow**: Begin with general queries like “system health overview” before diving into specific services * **Include context**: Mention time ranges or specific services when investigating issues * **Use follow-up questions**: The assistant maintains conversation context, so ask “tell me more about that service” to dive deeper * **Review recommended actions**: The AI prioritizes actions by urgency—address Immediate items first ## Privacy and Security The AI Assistant functions as a copilot for telemetry data, including logs, metrics, traces, events, dashboards, and alerts. This feature is optional and can be enabled only with explicit administrator consent within your organization’s Last9 account. ### How the AI Works The AI Assistant provides a natural language interface for querying telemetry data. When you ask questions like “Why is the 5xx error rate increasing?” or “Explain this alert,” the system uses a Large Language Model (LLM) to interpret and respond using telemetry data available within the platform. **Use of LLMs:** * The LLM converts your natural language queries into Last9’s internal query format * No customer telemetry data is transmitted to the LLM during this translation step * When summarization or interpretation is needed, only the minimal necessary data is processed ### Data Shared with LLMs 1. **Limited metadata** (e.g., tags, labels) when needed for query translation 2. **Query results** only when analysis is explicitly requested For example, if you ask “Are there any 5xx errors in the login service?”, the system: 1. Executes the internal Last9 query 2. Processes the response internally 3. Applies sanitization and PII removal 4. Uses the LLM only for high-level summarization ### Model Configuration You can choose between: * **Last9-managed models**: Frontier LLMs under strict security controls and data minimization practices * **Bring Your Own LLM (BYOL)**: Connect your own model for complete control over AI processing **Security guarantees:** * No customer data is used for model training in either scenario * AI Assistant and related models are hosted on secure infrastructure * All data is encrypted in transit and at rest ### Transparency * Each query shows execution progress (Query Logs, Query Exceptions, etc.) 
* “View in …” links let you verify the exact queries being run
* All AI-generated filters and queries are visible and editable

***

## Troubleshooting

**AI Assistant not loading?**

* Ensure you have an active Last9 account with data flowing
* Try refreshing the page or starting a new chat

**Not getting expected results?**

* Try rephrasing your question with more specific details
* Include time ranges (e.g., “in the last hour”)
* Specify service or environment names if relevant

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Alert Groups

> Overview of Alert Groups

An Alert Group is a container for Indicators (i.e., PromQL queries) and the Alert Rules that evaluate these queries. Alerts generated by Alert Rules send notifications to the Channels configured within the Alert Group.

## Creating an Alert Group

1. Navigate to **Home** → **Alert Studio** → **Alert Groups** and click on **Add New**

   ![Creating An Alert Group](/_astro/alert-group-1.Ru2KGP9C_Z23p950.webp)

   ![Creating An Alert Group](/_astro/alert-group-2.B_62WYKp_Z2vg6He.webp)

2. Assign a descriptive name to the Alert Group, select the data source from which you would like to query metrics, and click **Create**.

   ![Creating An Alert Group](/_astro/alert-group-2.B_62WYKp_Z2vg6He.webp)

   Ensure that you select the correct Data Source (Last9 Cluster) from which you would like to query metrics, or else the Alert Rules will not evaluate.

   Pro Tip: You can also use Last9's Health Cluster as a data source to set up alerting that watches your Cluster's health. After all, *quis custodiet ipsos custodes?*

3. Click on the **Alert Group** to navigate to your newly created Alert Group

   ![Creating An Alert Group](/_astro/alert-group-4.D6KHbGjC_ZQSO3I.webp)

If this is your first Alert Group, you next need to create the first Indicators, followed by an Alert Rule.

## Deleting an Alert Group

Deleting an Alert Group deletes all of its Alert Rules, Indicators, and all the generated Alerts. To delete an Alert Group:

1. Navigate to **Home** → **Alert Studio** → **Alert Groups**
2. Click the **…** button beside the Alert Group you wish to delete and select **Delete**

![Deleting An Alert Group](/_astro/alert-group-6.BiaDA1dH_Z1tRtjD.webp)

![Deleting An Alert Group](/_astro/alert-group-7.DbhZmNj7_4bPbC.webp)

## Features

### Labels

Labels are named pairs (key:value pairs) that add additional information and context to Alert Groups.

To add labels to an Alert Group:

1. Click **Edit** to update the Alert Group meta fields

   ![Alert Group Labels](/_astro/alert-group-8.Dz1qN-2f_1IJGRI.webp)

2. In the Labels card, click **Add Labels** to add a new label. Labels must have a unique key (i.e., a name). A label value can contain alphanumeric text.

   ![Alert Group Labels](/docs/gif-images/alert-group-9.gif)

3. Click **Done** to exit edit mode

To edit or delete labels from an Alert Group:

1. Click **Edit** to update the Alert Group meta fields
2. In the Labels card, hover over the label you wish to edit or delete, then click the appropriate button

   ![Alert Group Labels](/docs/gif-images/alert-group-10.gif)

3. Click **Done** to exit edit mode

### Tags

Tags help you categorize multiple Alert Groups.

To add tags to an Alert Group:

1. Click **Edit** to update the Alert Group meta fields
2. In the Details card, click on Assign Tags
3. Search from existing Tags or add a new Tag to the Alert Group

   ![Alert Group Tags](/docs/gif-images/alert-group-11.gif)

4. Click **Done** to exit edit mode

To remove tags from an Alert Group:

1. Click **Edit** to update the Alert Group meta fields
2. In the Details card, hover over the tag you wish to remove and click the X button
3. Click **Done** to exit edit mode

### Links

Alert Group links let you attach links to external resources used by your team. These can be very helpful for quickly navigating to resources like CloudWatch, runbooks, or repositories. Links are named URLs: you can pick one of the suggested names or use any custom name.

To add links to Alert Groups:

1. Click **Edit** to update the Alert Group meta fields
2. In the Links card, add a link to a suggested field or add your own custom name for the link

   ![Alert Group Links](/docs/gif-images/alert-group-12.gif)

3. Click **Done** to exit edit mode

To edit or remove links from an Alert Group:

1. Click **Edit** to update the Alert Group meta fields
2. In the Links card, hover over the link you wish to edit or delete, then click the appropriate button
3. Click **Done** to exit edit mode

## Alert Group Settings

### Channels

Notifications from Last9 are sent on [Notification Channels](/docs/notification-channels/). Ensure that you have at least one Notification Channel configured before trying to add a Channel to an Alert Group.

To add a notification channel:

1. Navigate to `Home` → `Alert Studio` → `Alert Groups` → *Select an Alert Group*, then press the ⚙️ icon on the top right to view Alert Group settings.

   ![Adding a Notification Channel](/_astro/alert-notification-1.1goAbmeG_1cFin9.webp)

2. Under the Channels tab you can assign channels per alert severity level, i.e., you can set different (or the same) channels for Threat and Breach severity alerts.

   ![Adding a Notification Channel](/_astro/alert-notification-2.DouIv_K4_ffnYc.webp)

   The Slack integration also allows you to append additional *@mentions* to tag a person or group.

   ![Adding a Notification Channel](/_astro/alert-notification-3.DtvjI8XB_1TygNH.webp)

3. The configured Alert Channel will now start receiving alerts

   ![Adding a Notification Channel](/_astro/alert-notification-4.CUQYifw__2os7Jv.webp)

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Alert Rules

> Alert Rules Overview

## Alert Rules Overview

Alert Rules evaluate Indicators using algorithms and configured thresholds/sensitivity to generate Alerts. Alerts can be visualized using Health and sent to configured Notification Channels.

## Creating an Alert Rule

Before you can configure an Alert Rule, at least one Indicator must be created in the Alert Group.

To create an Alert Rule:

1. Navigate to the Alert Group in which you would like to create the Alert Rule: **Home** → **Alert Studio** → **Alert Groups** → *Select an Alert Group* → **Alert Rules** Tab

   ![Creating An Alert Rule 1](/_astro/alert-rule-1.Bej9KOSX_Z1iGgfl.webp)

2. The following details are required for an Alert Rule:

   1. **Rule Name**: Use a descriptive Alert Rule name that can be easily identified by your team. It will also be sent as part of notifications
   2. **Indicator**: Select the indicator for which you would like to create the Alert Rule. If you have not created an Indicator, follow the steps as mentioned here
   3. **Edit Label Filter**: …
   4. **Algorithm**: Select from the available algorithms. For a detailed guide on how the various algorithms work, refer to this [guide](/docs/anomalous-pattern-detection-guide/). In this tutorial, we will set up an alert using Static Threshold
   5. **Threshold / Sensitivity**: Specify the Threshold (or Sensitivity, in the case of Anomaly Detection algorithms) and the Operator (example: alert when the Indicator value is *greater than or equal* to 10)
   6. **Alert Sensitivity**: Alert Sensitivity defines how reactive the Alert Rule is. It requires two inputs:

      * Total Minutes: The total duration of the rolling time window during which the Indicator is evaluated, with the maximum allowed duration being 60 minutes. (All Alert Rules are evaluated at one-minute intervals)
      * Bad Minutes: The number of minutes within the evaluation window that exhibit undesirable or unexpected behavior. These “bad” minutes need not be consecutive

      If the number of Bad Minutes exceeds the predefined limit within the Total Minutes rolling window, an alert is generated. A rolling evaluation window offers continuous analysis by constantly updating the period under evaluation, allowing an immediate reaction to issues as they develop rather than waiting for a static hourly evaluation to complete. This mechanism ensures that users are notified only when there is a significant deviation in expected metric performance, helping to avoid unnecessary alerts for minor or inconsequential fluctuations.

   7. **Severity Level**: Provides additional context to the Alert Rule by categorizing alerts as either Threat or Breach. Threat alerts are indicated in amber and Breach alerts in red
   8. **Notification Group**: For Indicators with multiple timeseries, you can choose to receive individual alerts for every single timeseries or to group them into a single alert. We recommend grouping alerts, as ungrouping them can generate noise
   9. **Annotations** (Optional): Annotations are optional information labels in `key:value` format that are sent with every Alert notification. You can use these to specify additional descriptions and runbooks, or to trigger complex workflows in your incident management systems
   10. When the threshold is configured, a preview is generated to help you visualize which Indicator values are considered anomalous. The number of timeseries that will be evaluated every minute by the alert rule is also indicated

Click **Save Rule** to enable alerting for this rule. To start receiving notifications for this Alert Rule, ensure that at least one Notification Channel has been configured for this Alert Group.

![Creating An Alert Rule 2](/_astro/alert-rule-2.BkYvZ-S__1INEX9.webp)

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Alert Timing & Delivery Reference

> Understand exactly when your alerts fire, repeat, and resolve across notification channels. Quick reference for alert timing behavior with threshold alerts, anomaly detection, and SLO violations.

Get alerts when you need them. Here's exactly when notifications fire, repeat, and resolve across your channels.
## Quick Reference

### Threshold & Anomaly Alerts

* **First alert:** 2 minutes after threshold breach
* **Repeat alerts:** Every 61 minutes while active
* **Resolution:** 11 minutes after condition clears

### SLO Alerts

* **First alert:** 4 minutes after SLO violation
* **Repeat alerts:** Every 16 minutes while active
* **Resolution:** 31 minutes after SLO recovers

## How Alert Timing Works

When something goes wrong at timestamp `t-1`, here's the timeline:

### Threshold & Anomaly Alerts

```plaintext
t-1:          Issue occurs (CPU spikes, error rate jumps)
t+1:          Last9 generates the alert
t+2:          Notification hits your channel
t+61:         Reminder
recovery+11:  Resolution
```

![Alert Timing — Threshold & Anomaly Alerts](/_astro/alert-timeline-anomaly.CU6ii0FJ_yhlg5.webp)

### SLO Alerts

```plaintext
t-1:          SLO violation begins (error budget burned)
t+3:          Last9 generates the alert
t+4:          Notification hits your channel
t+16:         Reminder
recovery+31:  Resolution
```

![Alert Timing — SLO Alerts](/_astro/alert-timeline-slo.ah8di9M5_28aF5H.webp)

## Notification Channels

**Threshold & Anomaly Alerts:** All channels (Slack, PagerDuty, OpsGenie, Webhook, Email)\
**SLO Alerts:** Slack, PagerDuty, OpsGenie, Webhook only (Email not supported)

All supported channels follow the same timing behavior:

| Alert Type | First Notification | Repeat Frequency | Resolution Delay |
| --------------------- | ------------------ | ---------------- | ---------------- |
| **Threshold/Anomaly** | t+2 minutes | Every 61 minutes | 11 minutes |
| **SLO Violation** | t+4 minutes | Every 16 minutes | 31 minutes |

## Why These Delays?

* **Processing time:** Last9 needs 1-3 minutes to analyze metrics and confirm alert conditions
* **Delivery buffer:** A 1-minute buffer accounts for network latency and channel processing
* **Resolution delays:** Prevent flapping alerts when conditions briefly recover and then fail again

## Examples

**Scenario:** API response time spikes at 1:59 PM

* **2:01 PM:** Last9 confirms threshold breach
* **2:02 PM:** Webhook fires, Slack notification arrives
* **3:01 PM:** Reminder if still alerting
* **When fixed + 11 min:** Resolution notification

**Scenario:** Error budget burns through at 3:14 PM

* **3:18 PM:** Last9 confirms SLO violation
* **3:19 PM:** Slack/PagerDuty/Webhook notifications sent (Email not supported for SLO)
* **3:31 PM:** First reminder if still violating
* **When recovered + 31 min:** Resolution notification

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Alerting Overview

> Overview of Last9's Alerting Capabilities

Last9 comes with complete monitoring support, including alerting and notification capabilities. Irrespective of your tool choice, a few problems plague today's alerting journey — coverage, fatigue, and cleanup. Unfortunately, there are no easy answers to these complex problems. However, with advanced features like Pattern-based Alerting and a redesigned Alert Manager designed with High Cardinality in mind, Last9 helps you stay ahead.
## Types of Alerts

Last9 supports two types of alerting based on your data source:

| Alert Type | Data Source | Where to Configure | Query Language |
| ------------------ | ------------------ | ----------------------------------------------------------------- | -------------- |
| **Metrics Alerts** | Prometheus metrics | [Alert Studio](/docs/alerting-overview/#enabling-alerting-studio) | PromQL |
| **Log Alerts** | Logs data | [Scheduled Search](/docs/scheduled-search/) | LogQL |

**Metrics vs Log Alerts:** Alert Studio is for metrics-based alerts only. If you want to alert on log data (e.g., error counts, log patterns, missing events), use [Scheduled Search](/docs/scheduled-search/) instead.

## Metrics Alerting with Alert Studio

Alert Studio provides PromQL-compatible alerting with features like a real-time alert monitor and a historical health view. You can also perform advanced tasks, such as correlating alerts with events, while focusing on the desired outcome of keeping up with constantly evolving infrastructure and services.

Alerting with Alert Studio starts by creating **Alert Groups**, which contain one or more **Alert Rules.** These Alert Rules evaluate the PromQL queries that are defined as **Indicators** in the Alert Group. Using the **Alert Monitor** you can view a live-updating stream of all your Alert Rules across all Alert Groups. In the following sections we dive deeper into each of these components.

## Enabling Alerting Studio

All new orgs need to request access to Alert Studio. This is a one-time action and takes about 30 minutes to complete (usually much faster).

To enable Alert Studio:

1. Navigate to **Home** → **Alert Studio** and click on the **Request To Enable** button

   ![Enabling Alert Studio](/_astro/alerting-overview-1.BiViaQmt_w92AN.webp)

2. Once the request has been sent, come back in some time to start using Alert Studio

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Declarative Alerting via IaC

> Last9 supports configuring alerts and notifications automatically using a Python-based SDK tool which takes care of infrastructure changes

Configurations for alerting and notifications for observability at scale are hard to start, maintain, and fix manually, just like provisioning infrastructure at scale. As infrastructure changes, it's important that the observability stack catches up with it, to avoid issues caused by a lack of observability or black swan events. Last9 introduced the `l9iac` tool to solve exactly this problem.

## Installation

Last9's IaC (Infrastructure as Code) tool is available as a Docker image, providing a consistent and isolated environment for automating entity creation and alert configuration.

1. **Pull the Docker Image**

   ```bash
   docker pull last9system/iac:latest
   ```

   The image is available on [DockerHub](https://hub.docker.com/repository/docker/last9system/iac/general).

2. **Prepare Your Working Directory**

   Create a directory containing:

   * Your IaC YAML files
   * `config.json` with your refresh tokens ([see file structure](#configuration-file-structure))
   * Space for the state lock file

3. **Run the Docker Container**

   ```bash
   docker run --name l9iac -d -v <host_directory>:<container_directory> last9system/iac:<version>
   ```

   Example:

   ```bash
   docker run -d -v /home/user/iac-files:/app/rules last9system/iac:2.4.2
   ```

   > 💡 **Note**: If using Docker Desktop, ensure file sharing is enabled for the volume mount.

4. **Execute IaC Commands**

   ```bash
   docker exec -it <container> l9iac -mf <model_file> -c <config_file> <command>
   ```

   Example:

   ```bash
   docker exec -it bcdea6660fd4 l9iac -mf /app/rules/alert-rules.yaml -c /app/rules/config.json plan
   ```

## Configuration File Structure

The IaC tool requires a `config.json` file with the following structure:

```json
{
  "api_config": {
    "read": {
      "refresh_token": "<read_refresh_token>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<org_slug>"
    },
    "write": {
      "refresh_token": "<write_refresh_token>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<org_slug>"
    },
    "delete": {
      "refresh_token": "<delete_refresh_token>",
      "api_base_url": "https://app.last9.io/api/v4",
      "org": "<org_slug>"
    }
  },
  "state_lock_file_path": "state.lock"
}
```

The state lock file should be in the same directory as the model file and the config file.

### Important Notes

* The `refresh_token` values can be obtained from the [API Access](https://app.last9.io/api-access) page in the Last9 dashboard ([know more](/docs/getting-started-with-api/))
* The `<org_slug>` can be obtained from the app's URL: `app.last9.io/v2/organizations/<org_slug>`
* For on-premise Last9 setups, contact the Last9 team ([Discord](https://discord.com/invite/Q3p2EEucx9) or [cs@last9.io](mailto:cs@last9.io)) to get the correct `api_base_url`
* The `state_lock_file_path` should be accessible from the directory where you run the IaC commands

## Quick Start

1. Create a *YAML* file as per your alert rule configuration

   **Example**: notification\_service\_am.yaml

   ```yaml
   # notification_service_am.yaml
   entities:
     - name: Notification Backend Alert Manager
       type: service_alert_manager
       data_source: prod-cluster
       entity_class: alert-manager
       external_ref: unique-slug-identifier
       indicators:
         - name: availability
           query: count(sum by (job, taskid)(up{job !~ "ome.*"}) > 0) / count(sum by (job, taskid) (up{job=~".*vmagent.*", job !~ "ome.*"})) * 100
         - name: loss_of_signal
           query: 'absent(up{job !~ "ome.*"})'
       alert_rules:
         - name: Availability of notification service should not be less than 95%
           description: The error rate (5xx / total requests) is what defines the availability, lower value means more degradation
           indicator: availability
           less_than: 99.5
           severity: breach
           bad_minutes: 3
           total_minutes: 5
           group_timeseries_notifications: false
           annotations:
             team: payments
             description: Error Rate described as number of 5xx/throughput
             runbook: https://notion.com/runbooks/payments/error_rates_fixing_strategies
   ```

2. Prepare the configuration file for running the IaC tool

   The configuration file is the same JSON `config.json` described under [Configuration File Structure](#configuration-file-structure) above.

   * The `refresh_token` can be obtained from the API Access page of the Last9 dashboard. You need `refresh_tokens` for all 3 operations (read, write, and delete), as the `l9iac` tool performs all three actions while applying the alert rules.
   * The `<org_slug>` is your organization's unique slug in Last9. It can be obtained from the API Access page of the Last9 dashboard.
   * The default `api_base_url` is `https://app.last9.io/api/v4`. If you are on an on-premise setup of Last9, contact the Last9 team to get the `api_base_url`.
   * The `state_lock_file_path` is the name of the file where `l9iac` will store the lock for the current alerting state (along the same lines as Terraform's state lock).

3. Run the following command to do a dry run of the changes

   ```shell
   l9iac -mf notification_service_am.yaml -c config.json plan
   ```

4. Run the following command to apply the changes

   ```shell
   l9iac -mf notification_service_am.yaml -c config.json apply
   ```

## Schema

Here is the complete schema for generating the above `.yaml` file:

### Entities

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| name | string | false | true | Name of the entity (alert manager) |
| type | string | false | true | Type of the entity |
| external\_ref | string | true | true | External reference for the entity; a unique slug-format identifier for each alert manager |
| [adhoc\_filter](#common-rule-filters-adhoc-filters) | object | false | optional | List of common rule filters for the entity |
| [alert\_rules](#alert-rules) | array | false | optional | List of alert rules for the entity |
| data\_source | string | false | optional | Data source |
| data\_source\_id | string | false | optional | The ID of the data source |
| description | string | false | optional | Description of the entity |
| entity\_class | string | false | optional | Denotes the class of the entity. Supported values: `alert-manager` |
| [indicators](#indicators) | array | false | optional | List of indicators for the entity |
| labels | object | false | optional | List of key-value pairs of group label names and values |
| [links](#links) | array | false | optional | List of links associated with the entity |
| namespace | string | false | optional | The namespace of the entity |
| [notification\_channels](#notification-channels) | string OR array | false | optional | List of notification channels applicable to the entity |
| tags | array | false | optional | List of tags for the entity |
| team | string | false | optional | The team that owns the entity |
| tier | string | false | optional | Tier of the entity |
| [ui\_readonly](/docs/alert-group/#creating-an-alert-group) | boolean | false | optional | Disable any edits to the alert group from the UI |
| workspace | string | false | optional | Workspace of the entity |

### Common Rule Filters (Adhoc Filters)

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| labels | object | false | required | List of key-value pairs of label names and values |
| data\_source | string | false | required | Defaults to the entity's data source |

### Alert Rules

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| name | string | true | required | Rule name that describes the alert |
| indicator | string | false | required | Name of the indicator |
| bad\_minutes | integer | false | required | Number of minutes the indicator must be in a bad state before alerting |
| total\_minutes | integer | false | required | Total number of minutes the indicator is sampled over |
| description | string | true | optional | Description for an alert rule that is included in the alert payload |
| expression | string | false | optional | Alert rule expression, to be used only for pattern-based alerts |
| greater\_than | number | false | optional | Alert triggers when the indicator value is greater than this |
| greater\_than\_eq | number | false | optional | Alert triggers when the indicator value is greater than or equal to this |
| less\_than | number | false | optional | Alert triggers when the indicator value is less than this |
| less\_than\_eq | number | false | optional | Alert triggers when the indicator value is less than or equal to this |
| equal\_to | number | false | optional | Alert triggers when the indicator value is equal to this |
| not\_equal | number | false | optional | Alert triggers when the indicator value is not equal to this |
| group\_timeseries\_notifications | boolean | false | optional | Whether multiple impacted time series in an alert are grouped into one notification |
| is\_disabled | boolean | false | optional | Whether the alert is disabled |
| label\_filter | map/object | false | optional | Mapping of the variables present in the indicator query and their pattern for the alert rule |
| mute | boolean | false | optional | Whether alert notifications are muted |
| [runbook](#runbook) | | false | optional | Runbook link to be included in the alert payload |
| severity | string | false | optional | Can be `threat` or `breach` |

#### Runbook

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| link | string | false | required | Runbook link to be included in the alert payload |

### Indicators

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| name | string | true, uniqueness enforced at entity level | required | Name of the indicator |
| query | string | false | required | PromQL query for the indicator |
| data\_source | string | false | optional | Data Source of the indicator (Last9) |
| description | string | false | optional | Description of the indicator |
| unit | string | false | optional | Unit of the indicator |

### Links

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| name | string | false | required | Display name of the link |
| url | string | false | required | URL of the link |

### Notification Channels

| Field | Type | Unique | Required | Description |
| ----- | ---- | ------ | -------- | ----------- |
| name | string | false | required | Name of the notification channel |
| type | string | false | required | Type of notification channel. Allowed values: `slack`, `pagerduty`, `opsgenie`, `generic_webhook` |
| mention | string OR list (string) | false | optional | Only applicable to Slack. The user(s) to tag in the alert message |
| severity | string | false | optional | Severity of the alerts sent through this channel. Allowed values: `threat`, `breach` |

Before a notification channel can be used in IaC, it needs to be configured. Please see [Notification Channels](/docs/notification-channels/) for more details.

## Supported Macros by IaC

The following macros are supported for pattern-based rules (a usage sketch follows at the end of this section):

* `low_spike (tolerance, metric)`
* `high_spike (tolerance, metric)`
* `decreasing_changepoint (tolerance, metric)`
* `increasing_changepoint (tolerance, metric)`
* `increasing_trend (tolerance, metric)`
* `decreasing_trend (tolerance, metric)`

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.
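As referenced in the macros list above, here is a hedged sketch of how a macro might be wired into an IaC model. The Alert Rules schema documents the `expression` field as the home for pattern-based alerts, so this sketch assumes macros are referenced there; all entity, indicator, and metric names are hypothetical, and the exact macro placement is worth confirming with the Last9 team.

```yaml
# Hedged sketch: assumes pattern-based macros go in the alert rule's
# `expression` field (per the Alert Rules schema above). Entity, indicator,
# and metric names below are hypothetical.
entities:
  - name: Checkout Alert Manager
    type: service_alert_manager
    entity_class: alert-manager
    external_ref: checkout-alert-manager
    indicators:
      - name: throughput
        query: sum(rate(http_requests_total{service="checkout"}[5m]))
    alert_rules:
      - name: Sudden spike in checkout throughput
        indicator: throughput
        expression: high_spike(3, throughput)  # macro signature: high_spike(tolerance, metric)
        severity: threat
        bad_minutes: 3
        total_minutes: 5
```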
# Anomalous Pattern Detection Guide

> An overview of Pattern Detection algorithms supported by Last9 and guidelines on when to use them.

## Supported algorithms

Last9's Alert Studio supports the following algorithms for anomalous pattern detection.

### High Spike

The high spike algorithm is designed to detect sudden increases in signal values, particularly when the increase occurs within a short time frame. Signals that can jump suddenly, such as the number of 4xx responses, throughput, and edge hits, are a good fit for high spike detection. The algorithm compares the current data point with the last 60 minutes' worth of data points to check whether a given point has a considerably large amplitude.

#### Eligible Signals for High Spike

Signals similar to the following can be used for high spike pattern detection.

![High Spike Pattern Detection Signal](/_astro/eligibile-signals-high-spike-1.Njp_wYxy_1clyj0.webp)

![High Spike Pattern Detection Signal](/_astro/eligibile-signals-low-spike-1.B8qjOS9J_2cvxOj.webp)

### Low Spike

The low spike algorithm is particularly helpful in identifying sudden drops in signal values. Signals such as CPU utilization, cache hit rate, and availability are good fits for the low spike algorithm. The algorithm compares the current data point with the previous 60 minutes of data to determine whether a given point represents a significant drop.

#### Eligible Signals for Low Spike

![Low Spike Pattern Detection Signal](/_astro/eligibile-signals-low-spike-1.B8qjOS9J_2cvxOj.webp)

![Low Spike Pattern Detection Signal](/_astro/eligibile-signals-low-spike-2.CR6OcQGH_3CnXU.webp)

![Low Spike Pattern Detection Signal](/_astro/eligibile-signals-low-spike-3.--2pIdwK_1b1C0X.webp)

### Level Change

The level change algorithm detects the point at which data begins to exhibit a new pattern that is different from the old one. The data will have different patterns before and after the level change. To determine if an incoming point is a candidate level change, the algorithm checks whether it differs (too high or too low) from the data over the last hour.

#### How is this different from a high/low spike?

If the data shows a single large jump or drop, or only a few of them, this algorithm will not detect them. A single different value, or even a few, does not necessarily indicate that the pattern has changed or that there is a new pattern.

#### Eligible Signals for the Level Change Algorithm

![Level Change Pattern Detection Signal](/_astro/eligibile-signals-level-change-1.Cdyn2nGK_1MeBgb.webp)

![Level Change Pattern Detection Signal](/_astro/eligibile-signals-level-change-2.DTSZN2M3_oNzj.webp)

![Level Change Pattern Detection Signal](/_astro/eligibile-signals-level-change-3.DIjxsJ9h_Z2hixlj.webp)

### Trend Deviation

A trend algorithm is a useful tool for detecting deviations of a signal from its expected pattern, compared to its behaviour over a certain number of previous days.

* For each incoming data point, collect relevant data from the past (this is the reference period or seasonality). It is not necessary to collect all past data
* To determine if an incoming data point is an anomaly, compare it with a reference period from the past

#### Illustrations

In the scenario below, the trend algorithm will detect an anomaly at 10 a.m. (red circle) because it is not expected when compared to previous days (reference period: the red rectangular boxes).

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-1.CQsxmWXQ_Z2dp4N6.webp)

In Figure 2, if we observe the signal pattern carefully, the trend algorithm will not detect any anomalies at 10 a.m. (red circle) because the point or peak is expected when compared to previous days.

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-2.BgHTjtJ6_1lQxLb.webp)

In Figure 3, if we observe the signal pattern carefully, the trend algorithm will detect an anomaly at 10 a.m. (red circle). Although it is a repetitive peak, the amplitude of this peak is much higher than the peaks of the previous days.

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-3.DeUxy9fZ_27Lgql.webp)

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-4.COIYkMch_Z26jiKq.webp)

#### Eligible Signals for Trend (Increasing / Decreasing)

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-5.BxdaCD87_Z1pyBuT.webp)

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-6.CxGZfV0T_Z1vwJUh.webp)

![Trend Detection Pattern Detection Signal](/_astro/eligibile-signals-trend-deviation-7.BTmBjaqW_1ACTW3.webp)

## How to select the right algorithm?

Each algorithm matches a specific pattern and raises an alert when it is encountered. To use them effectively, follow this process when choosing an algorithm.

1. **Define normal behaviour.** It is important to know what the acceptable behaviour of the signal is. One simple way of doing this is to look at the signal over the relevant span and point out the timestamps where the signal deviates from normal behaviour and where you would like to get alerted. Remember, an algorithm cannot detect a deviation from normal behaviour that a trained human cannot.

2. **Identify the anomalous pattern(s) in the signal.** Different signals exhibit different anomalous behaviour. Some might show spikes, some might show level changes. For example, for a signal like CPU usage, a sharp spike that returns to baseline may be perfectly normal behaviour, but for a business metric it may not. Knowledge of the underlying processes that generate the signal is essential to determine the correct pattern.

3. **Check if a PromQL expression captures the intended deviation better.** PromQL is a very powerful language with many functions. For detecting deviations that can be defined in terms of relative values, percentages, or some rollup formulae on historical data, prefer defining the PromQL accordingly.

   For example, if a signal's normal range is defined as:

   ```text
   it stays within the minimum and maximum of the 15-minute medians over the last 2 days, with a tolerance of 20%
   ```

   a PromQL expression to detect a breach of this range would be:

   ```text
   s < min_over_time(median_over_time(s[15m])[2d:]) * 0.8
     or s > max_over_time(median_over_time(s[15m])[2d:]) * 1.2
   ```

   where `s` is the original signal metric. (The upper bound uses `* 1.2` so the 20% tolerance applies on both sides. `median_over_time` is a MetricsQL-style function; in standard PromQL, `quantile_over_time(0.5, s[15m])` computes the same median.)

4. **Check the Algorithm.** If the pattern that you want to match cannot be expressed easily as demonstrated above, check whether any built-in algorithm can satisfactorily match the pattern. Remember that each algorithm has its own limitations, and it is important to understand them when working with signals.

Signals that don't meet the requirements of any of the algorithms should be handled differently.
By selecting the appropriate algorithm and adjusting the sensitivity to match your use case, you can improve the accuracy of these pattern detections.

## When not to choose a pattern matching algorithm?

As a rule of thumb, a pattern matching algorithm should be chosen in situations where a human looking at the plot can define, with a high level of accuracy, where an alert should be generated and where it should not. If a human cannot determine the alert points by looking at the plot, it is highly unlikely that any of the above algorithms can succeed.

Below are a few signals which are not a good fit for any of the above algorithms.

![Ineligible signals for pattern detection](/_astro/ineligible-signal-1.OeR1Ck5x_Z1aavMg.webp)

The above signal is mostly zero-valued. Applying high spike, low spike, or increasing trend detection to this type of signal will cause every peak to trigger an alert. It is better to use a static threshold instead of pattern matching functions on these types of signals.

***

![Ineligible signals for pattern detection](/_astro/ineligible-signal-2.BCfoXVjD_Z1j5S2Q.webp)

This is a discrete-time signal. At any given point in time, it can have one of three possible values (1000, 1500, 2000) or no value at all. For this type of signal, a static threshold may be a better choice.

***

![Ineligible signals for pattern detection](/_astro/ineligible-signal-4.BJQHik2Y_1wR7H1.webp)

![Ineligible signals for pattern detection](/_astro/ineligible-signal-3.CFxo9RNk_Z1kboJz.webp)

![Ineligible signals for pattern detection](/_astro/ineligible-signal-5.BvLglKVh_1dH1QQ.webp)

![Ineligible signals for pattern detection](/_astro/ineligible-signal-6.DfrIi6NR_Vlzsy.webp)

These signals should be handled differently: they do not follow a predictable pattern, which makes pattern detection difficult.

## Standard Deviation vs Pattern Detection

**[Standard Deviation Alerting](/docs/standard-deviation-alerting/)** offers a simpler alternative to pattern detection algorithms:

* Use **standard deviation** for general anomaly detection without specific pattern requirements
* Use **pattern detection algorithms** when you need to detect specific behaviors (spikes, level changes, trends)
* Standard deviation works well for most services as a starting point before implementing specialized algorithms

## Summary

When deciding on a pattern detection algorithm, it is important to understand the nature of the signal and the objective of the alert. This guide describes a few guidelines that can be used when choosing pattern algorithms with Last9 Alert Studio.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Azure AD / Microsoft Entra ID SSO

> Security permissions and authentication details for signing in to Last9 with Microsoft Azure AD / Entra ID.

Last9 supports signing in with Microsoft Entra ID (formerly Azure Active Directory) using standard OpenID Connect (OIDC) authentication with minimal, user-scoped permissions.
## Permissions Requested Last9 requests the following **delegated permissions** from Microsoft Graph API: | Permission | Type | Description | Admin Consent Required | | ----------- | --------- | ----------------------------- | ---------------------- | | `email` | Delegated | View user’s email address | No | | `openid` | Delegated | Sign users in (enables OIDC) | No | | `profile` | Delegated | View user’s basic profile | No | | `User.Read` | Delegated | Sign in and read user profile | No | All four are delegated permissions, meaning Last9 acts on behalf of the signed-in user and can only access that user’s own data. None require admin consent. For official Microsoft documentation, see the [Microsoft Graph Permissions Reference](https://learn.microsoft.com/en-us/graph/permissions-reference). ## What Last9 Cannot Access Last9 does **not** request any application-level or directory-scoped permissions. This means it cannot: * Read other users’ profiles (`User.Read.All` — not requested) * Access your organization’s directory data (`Directory.Read.All` — not requested) * Modify any user or directory data (`User.ReadWrite.All`, `Directory.ReadWrite.All` — not requested) * Read group memberships (`Group.Read.All` — not requested) ## How to Verify Permissions ### In the Microsoft Entra Admin Center 1. Sign in to [Microsoft Entra admin center](https://entra.microsoft.com) 2. Go to **Identity** → **Applications** → **Enterprise applications** 3. Search for and select **“Last9”** 4. Click **Permissions** under Security 5. Verify only delegated permissions (`email`, `openid`, `profile`, `User.Read`) are listed The Permissions page shows separate **Admin consent** and **User consent** tabs. Last9 should only appear under user consent with the four permissions listed above. ## Access Control Your organization retains full control over who can access Last9 through Entra ID SSO. * **You control access**: Only users you authorize in Entra ID can sign in to Last9 * **Revocation**: When you disable or delete a user’s Entra ID account, they cannot initiate new sign-ins to Last9. Existing sessions may remain active until the access token expires (typically \~1 hour) unless [Continuous Access Evaluation](https://learn.microsoft.com/en-us/entra/identity/conditional-access/concept-continuous-access-evaluation) is enabled * **No standalone accounts**: Users authenticate through your identity provider — Last9 does not maintain separate credentials ### Conditional Access Entra ID [Conditional Access](https://learn.microsoft.com/en-us/entra/identity/conditional-access/overview) policies apply to Last9 sign-ins. This includes MFA requirements, location-based restrictions, device compliance, and sign-in risk policies. ### Restricting Access to Specific Users To restrict Last9 to only assigned users, set **“Assignment required?”** to **Yes** on the Last9 Enterprise Application. When enabled, only users explicitly assigned to the application can sign in. See [Restrict an app to a set of users](https://learn.microsoft.com/en-us/entra/identity-platform/howto-restrict-your-app-to-a-set-of-users). *** ## Troubleshooting If you have questions about Last9’s Entra ID integration or need assistance verifying permissions, please contact us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io). # Cardinality Explorer > Identify metrics and labels impacted by cardinality. Cardinality Explorer helps you understand how the cardinality for metrics in a Cluster is trending. 
This helps you diagnose cardinality-related challenges with your metrics.

## Using Cardinality Explorer

To view an individual metric's cardinality contribution:

1. Navigate to **Control Plane** → **Cardinality Explorer** and select the Cluster you wish to explore

![Cardinality Explorer 1](/_astro/cardinality-1.DWIjC27w_ZsDM3M.webp)

A report with all your metrics for the selected date is generated. When the current day is selected, the data shown in the table will continue to update throughout the day. The report also highlights metrics that have crossed or are nearing their cardinality quota limits:

* Metrics in Red have crossed their daily cardinality quota
* Metrics in Amber have crossed 80% of their daily cardinality quota

2. To view how an individual metric is contributing towards the Cluster's cardinality, click on a metric from the table:

![Cardinality Explorer 2](/_astro/cardinality-2.CxSswZ7-_ZI9kta.webp)

You can use this detail view to diagnose issues with the selected metric using:

* **Cardinality Trend:** Using this graph, you can observe how the cardinality of the selected metric has trended over the last 7 days. A sudden spike or dip may indicate unexpected changes to the cardinality of this metric
* **Cardinality Details:** View all the metric's label names and their top 5 occurring label values for a selected date. Using this, you can find which labels contribute to the metric's cardinality growth. When the current day is selected, the reported cardinality and labels shown will continue to update throughout the day

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Change Events

> Track deployment and configuration change events in Last9 to correlate them with service performance, error rates, and reliability metrics.

## Why Change Events Matter

Observability is not just about the telemetry data a system emits. Software systems are also affected by external change events, from domains such as deployment, configuration, or external third-party systems. Last9 lets you track such change events alongside your other metrics, seamlessly adding context to your system's observability.

Every deployment, configuration tweak, and external change ripples through your system. Last9 Change Events capture these moments and overlay them directly onto your performance metrics, giving you the context to:

* Spot performance dips exactly when deployments hit
* Correlate configuration changes with error spikes
* Validate whether a new release is performing as expected
* Share complete system context with your team in one glance

This document shows how to start tracking change events with Last9.

## How It Works

Last9 offers an HTTP API that can be used to track any domain change event. Each event has two states, start and stop, and both can be tracked with Last9. Once Last9 receives an event, it converts it into a metric that can be used with other metrics for querying using PromQL in Grafana dashboards or alerting.

The data flow for change events is as follows.

![Change Events Data Flow](/_astro/change-events.DkIajm3P_sCpBY.webp)

## Last9 Change Events API

Last9 offers a REST HTTP endpoint for sending change events. The API endpoint is as follows.
```shell
curl -XPUT https://app.last9.io/api/v4/organizations/{org_slug}/change_events \
  --header 'Content-Type: application/json' \
  --header 'X-LAST9-API-TOKEN: Bearer ' \
  --data-raw '{
    "timestamp": "2024-01-15T17:57:22+05:30",
    "event_name": "new_deployment",
    "event_state": "start",
    "data_source_name": "{your_cluster_name}",
    "attributes": {
      "service_name": "frontend",
      "deployment_environment": "production",
      "version": "v2.1.4",
      "team": "platform",
      "change_type": "hotfix"
    }
  }'
```

### API Parameters

| Field | Description | Required |
| ------------------ | ------------------------------------------------------------------------------------------------- | -------- |
| `timestamp` | ISO8601 formatted timestamp of the event. Defaults to the current time if not provided | No |
| `event_name` | Custom event identifier. Added as a label to the resulting time series | Yes |
| `event_state` | `start` or `stop` — marks when deployments/changes begin and complete | Yes |
| `attributes` | Key-value pairs used as labels while converting the change event to a metric | No |
| `data_source_name` | Name of the Last9 cluster where events will be stored. See [Change Events Storage](#change-events-storage) | No |

Last9 converts the events into a metric named `last9_change_events`.

## Visualize Change Events in Service Dashboards

After pushing a change event, navigate to your service dashboard. If `service_name` and `deployment_environment` match your APM data, the change event appears as a contextual overlay on all performance charts:

![Change Events on Service Dashboard](/_astro/change-events-service-dashboard.BCNsTwjj_Z1lebsy.webp)

What you'll see:

* **Red vertical line** marking your exact deployment time
* **Rich context popup** with all your event details (version, team, change type)
* **Performance correlation** across APDEX, response time, and throughput charts
* **Error patterns** before, during, and after changes

This lets you instantly answer questions like:

* Did response time spike after deployment?
* Is the new release performing as expected?
* Which specific configuration change caused the error spike?

## Visualize Change Events in Grafana

Change events can also be visualized in Grafana just like any other metric. Query change events using PromQL:

```promql
last9_change_events{event_name="new_deployment", deployment_environment="production"}
```

![Change Events in Grafana](/_astro/change_events_in_grafana.BVfFcQfz_1nJWva.webp)

## Change Events Storage

You might be using multiple Last9 clusters. In such a scenario, you can choose to store the change events in a Last9 cluster of your choice. The optional `data_source_name` attribute specifies the cluster where the change event will be stored. If this attribute is not passed, Last9 stores the change event in a default cluster designated for change events. The default cluster for change events is set as follows.

![Default Cluster for Change Events](/_astro/default-change-events-cluster.C5yBbpNC_ZzmYqD.webp)

You can override this by specifying the `data_source_name` in the request payload. Obtain the cluster name from the Data Sources section as follows.
![Data Sources](/_astro/levitate-data-sources.CsdDzD6W_Z2dnReR.webp)

![Copy the Data Source Name](/_astro/copy-data-source-name.BYrCexzM_ZQCv1B.webp)

## Event Naming Best Practices

Use consistent, descriptive names for your events:

* `deployment_start` / `deployment_complete` — for application deployments
* `config_update_redis` — for configuration changes
* `feature_flag_toggle` — for feature flag changes
* `db_migration_start` / `db_migration_complete` — for database migrations

Add meaningful context through attributes:

```json
{
  "service_name": "frontend",
  "deployment_environment": "production",
  "version": "v2.1.4",
  "team": "platform",
  "change_type": "hotfix"
}
```

## Migrating to Last9 Change Events

If you're already tracking deployments with another observability tool, the table below maps common concepts to their Last9 equivalents. This makes it straightforward to replace your existing deployment event calls with the Last9 API.

### Concept Mapping

| Concept in other tools | Last9 equivalent |
| ------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| Events API / Custom Events / DORA Deployment API | `PUT /api/v4/organizations/{org_slug}/change_events` |
| Deployment markers / Annotations / Markers | Change Events with `event_state: start` and `event_state: stop` |
| `version` / `commit` / `build_id` tag | `attributes.version` |
| `service` / `entityGuid` / dataset slug | `attributes.service_name` |
| `env` / `environment` tag | `attributes.deployment_environment` |
| Event tags / dimensions / properties | `attributes` (any key-value pairs) |
| GraphQL mutations / typed deployment fields | Single REST `PUT` with flexible JSON body |
| Visual-only annotations on dashboards | PromQL-queryable metric (`last9_change_events`) with automatic service dashboard overlays |
| Region annotations / time-range markers | Separate `start` and `stop` events for the same `event_name` |

### What's Different in Last9

* **Events become metrics.** Unlike visual-only annotations or markers, Last9 converts every change event into a Prometheus metric (`last9_change_events`). This means you can query, alert, and build recording rules on deployment events using PromQL — not just view them on a chart.
* **No entity pre-registration required.** Some tools require the target service to already exist before you can record a deployment against it. Last9 accepts events for any `service_name` immediately.
* **No timestamp restrictions.** Some tools limit event timestamps to 18–24 hours in the past. Last9 accepts any valid ISO8601 timestamp.
* **Automatic service dashboard correlation.** When `service_name` and `deployment_environment` match your APM data, change events automatically appear as overlays on APDEX, response time, throughput, and error charts — no manual dashboard configuration needed.
* **Flexible attributes instead of rigid schemas.** Instead of fixed fields like `deploymentType` or `entityGuid`, Last9 uses open-ended `attributes`. Add any key-value pairs relevant to your workflow (`team`, `change_type`, `rollback`, `ticket_id`, etc.).

### Example: Replacing an Existing Integration

If you're currently sending deployment events via a `POST` to another provider's API, the migration is typically a one-line change in your CI/CD pipeline.
If you're using GitHub Actions, the [Last9 Deployment Marker action](https://github.com/marketplace/actions/last9-deployment-marker) handles this for you without any custom `curl` steps.

```shell
# Replace your existing deployment event call with:
curl -XPUT https://app.last9.io/api/v4/organizations/{org_slug}/change_events \
  --header 'Content-Type: application/json' \
  --header 'X-LAST9-API-TOKEN: Bearer ' \
  --data-raw '{
    "event_name": "deployment",
    "event_state": "start",
    "attributes": {
      "service_name": "'"$SERVICE_NAME"'",
      "deployment_environment": "'"$DEPLOY_ENV"'",
      "version": "'"$GIT_SHA"'",
      "team": "'"$TEAM"'"
    }
  }'
```

## Native Integrations for Change Events

* [LaunchDarkly](/docs/integrations/others/launchdarkly/)
* [GitHub Actions](/docs/integrations/ci-cd/github-actions/)

***

## Alerting on Stuck States

A common pattern is to send a `start` event when an entity enters a state and a `stop` event when it leaves. You can use change events to alert when an entity has been stuck in a state longer than expected — for example, a job that has been queued for more than 4 hours.

### How It Works

When you push a change event, Last9 stores it as a metric data point. The event's original `timestamp` (from the API payload) is preserved as the sample timestamp in the time series database. If `timestamp` is omitted, the ingestion time is used instead.

Caution: The database enforces a backfill limit. Samples with timestamps too far in the past are stored with the ingestion time instead. For stuck-state alerting, always call the API in real time; do not use a backdated `timestamp`.

* `timestamp(metric[window])` returns the actual time the event was pushed.
* `time() - timestamp(metric[window])` correctly computes the elapsed time since the event was pushed.

To also handle auto-resolution when the `stop` event arrives, combine the duration expression with an `unless on()` clause.

### The Correct Pattern

The PromQL expression computes how long each entity has been stuck, and returns no value (resolving the alert) for entities that have received a `stop` event. The angle-bracketed values below are placeholders; substitute your own event name, state, entity label, and lookback window (the concrete example further down shows filled-in values):

```promql
(
  time()
  -
  timestamp(
    last9_change_events{
      event_name="<event_name>",
      state="<state>",
      event_state="start"
    }[<lookback>]
  )
)
unless on(<entity_label>)
last_over_time(
  last9_change_events{
    event_name="<event_name>",
    state="<state>",
    event_state="stop"
  }[<lookback>]
)
```

Set the alert rule threshold to `greater than <threshold-in-seconds>` — for a 4-hour threshold, use `greater than 14400`.

### Alternative Pattern: count\_over\_time

If your system pushes a periodic heartbeat start event (rather than a single event on state entry), use the `count_over_time` subquery pattern instead. It counts how many 1-minute windows over the threshold period show the entity stuck, and is robust to multiple pushes.

```promql
count_over_time(
  (
    last_over_time(
      last9_change_events{
        event_name="<event_name>",
        state="<state>",
        event_state="start"
      }[5m]
    )
    unless on(<entity_label>)
    last_over_time(
      last9_change_events{
        event_name="<event_name>",
        state="<state>",
        event_state="stop"
      }[5m]
    )
  )[<window>:1m]
)
```

Set the alert rule threshold to `greater than or equal to <threshold-in-minutes>` — for a 4-hour threshold, use `greater than or equal to 240` (4 × 60 minutes) with `[4h:1m]` as the subquery window.

### Example: Alert When a Job Is Stuck for More Than 4 Hours

Suppose your system sends change events with `event_name="job_lifecycle"` and an `entity_id` attribute identifying each job.
**Send the start event when the job enters the queued state:**

```shell
curl -XPUT https://app.last9.io/api/v4/organizations/{org_slug}/change_events \
  --header 'Content-Type: application/json' \
  --header 'X-LAST9-API-TOKEN: Bearer ' \
  --data-raw '{
    "event_name": "job_lifecycle",
    "event_state": "start",
    "attributes": {
      "state": "queued",
      "entity_id": "job-abc123",
      "environment": "production"
    }
  }'
```

**Send the stop event when the job leaves the queued state:**

```shell
curl -XPUT https://app.last9.io/api/v4/organizations/{org_slug}/change_events \
  --header 'Content-Type: application/json' \
  --header 'X-LAST9-API-TOKEN: Bearer ' \
  --data-raw '{
    "event_name": "job_lifecycle",
    "event_state": "stop",
    "attributes": {
      "state": "queued",
      "entity_id": "job-abc123",
      "environment": "production"
    }
  }'
```

**Alert rule PromQL:**

```promql
(
  time()
  -
  timestamp(
    last9_change_events{
      event_name="job_lifecycle",
      state="queued",
      event_state="start"
    }[12h]
  )
)
unless on(entity_id)
last_over_time(
  last9_change_events{
    event_name="job_lifecycle",
    state="queued",
    event_state="stop"
  }[12h]
)
```

Set the alert rule threshold to `greater than 14400` (14400 seconds = 4 hours). The alert fires per `entity_id` once a job has been stuck in `queued` for more than 4 hours. It auto-resolves when the stop event is received.

### Lookback Window Sizing

Each change event is written as a **single sample** in the time series database — there is no automatic refresh. As time passes, that sample ages. The lookback window in `timestamp()` and `last_over_time()` controls how far back Last9 looks for that sample. If the window is smaller than the age of the sample, the series disappears from the query result entirely — the entity is no longer found, and the alert silently stops firing even though the entity is still stuck.

**Example:** A job got stuck 20 hours ago. Its start event sample is 20h old.

* With `[12h]`: the sample is outside the window → job not found → alert never fires
* With `[48h]`: the sample is within the window → job appears → alert fires correctly

The lookback window must be **larger than the maximum time an entity can be stuck before you want to detect it**.

| Alert threshold | Minimum recommended lookback |
| --------------- | ---------------------------- |
| 1h | 6h |
| 4h | 12h |
| 12h | 48h |
| 24h | 72h |

A lookback of `3–4×` the alert threshold is a safe rule of thumb.

### Auto-Resolution

When a stop event is received for a given `entity_id`, the `unless on(entity_id)` clause removes that entity from the firing set at the next evaluation. The alert resolves automatically — no manual intervention needed.

If a stop event is never sent (for example, the job crashes without a cleanup step), the alert will keep firing until a stop event is pushed manually.

### Per-Entity Resolved Notifications

When **"Group Timeseries Notifications"** is disabled (the default), each entity gets its own independent notification lifecycle:

* A **firing** notification is sent when an entity crosses the threshold.
* A **resolved** notification is sent when that entity's stop event arrives and it drops from the query result.

If **"Group Timeseries Notifications"** is enabled, all matching timeseries are batched into a single notification. The resolved notification is then only sent when **all** entities have resolved — a single long-stuck entity will suppress resolved notifications for all others that have already recovered.
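Because a missing stop event leaves the alert firing indefinitely, it can help to make the stop call crash-safe in the process that owns the state transition. Below is a minimal, hypothetical shell sketch: `ORG_SLUG`, `LAST9_API_TOKEN`, `JOB_ID`, and `run-job.sh` are placeholders, and the payload simply mirrors the examples above. A shell `trap` sends the stop event on every exit path:

```shell
#!/usr/bin/env bash
# Illustrative wrapper: emit "start" when the entity enters the state and
# guarantee a matching "stop" even if the underlying command fails.
# ORG_SLUG, LAST9_API_TOKEN, JOB_ID, and run-job.sh are placeholders.
set -euo pipefail

send_event() {
  curl -s -XPUT "https://app.last9.io/api/v4/organizations/${ORG_SLUG}/change_events" \
    --header 'Content-Type: application/json' \
    --header "X-LAST9-API-TOKEN: Bearer ${LAST9_API_TOKEN}" \
    --data-raw "{
      \"event_name\": \"job_lifecycle\",
      \"event_state\": \"$1\",
      \"attributes\": {
        \"state\": \"queued\",
        \"entity_id\": \"${JOB_ID}\",
        \"environment\": \"production\"
      }
    }"
}

send_event start

# The EXIT trap runs on success and on failure, so the alert can
# auto-resolve even when the job crashes without a cleanup step.
trap 'send_event stop' EXIT

./run-job.sh "${JOB_ID}"   # hypothetical job command
```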
***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Clusters

> Overview of Clusters

## Cluster Overview

To start using Last9 you need at least one Cluster, from which you read and write metric data. In this document, we dive deep into all things related to a cluster. To get up and running fast, see our [Quick Start Guide](/docs/onboard/).

Think of a Cluster as a logically separated, Prometheus API-compatible data source for all your metric data. You can create as many Clusters as you want; the number of clusters has no impact on your billing. It is typically recommended that you create a Cluster for each of your environments. Example: Production Cluster, Staging Cluster, etc.

## Creating a New Cluster

To create a Cluster:

1. Navigate to **Home** → **Levitate**

![Creating a Cluster 1](/_astro/cluster-1.Dd0S_MAv_Z5TCIt.webp)

2. Click the *Launch Cluster* button to launch the setup wizard

![Creating a Cluster 2](/_astro/cluster-2.qDzCVIem_1AWpI9.webp)

3. Select the AWS region you would like to deploy the cluster in. This should ideally be the same region as your application
4. Give the Cluster a descriptive name
5. Optionally, add a description which will be displayed on the Cluster Overview screen

![Creating a Cluster 3](/_astro/cluster-3.D6cLSwhK_Z1ofPg6.webp)

6. Press the **Create** button to create your new Cluster

As the Cluster gets created, you will be presented with an automatically created access token. This token is required to start writing & reading data to the Cluster. Tokens are only shown once, so please copy or download the credentials (you can always create another token from Cluster settings).

![Creating a Cluster 4](/_astro/cluster-4.BVJSDX1T_ZDHm1N.webp)

Your new Cluster is now ready to receive metrics.

7. To start writing data to this new Cluster, follow the Write Data steps to start writing data from Kubernetes, Prometheus, or AWS/CloudStream, or quickly try it out by running a local demo environment

![Creating a Cluster 5](/_astro/cluster-5.R9lM6D_P_Z1CNc13.webp)

Using the **Test Config** button you can verify whether your Last9 cluster has started receiving data. Click the **Next** button to start reading data/querying metrics from your Cluster. See our guides on how you can send data from [Prometheus](#), [OpenTelemetry](#), [VMAgent](#), or the various other [Integrations](#) supported.

8. To start reading metrics from your new Cluster you can use Managed Grafana, which comes included with every plan. Alternatively, you can use the provided **Read URL** to read data using any Prometheus HTTP API-compatible tool like AlertManager, your own Grafana, KEDA, etc. See the guide on [how to connect your own Grafana](/docs/grafana-config/) with a Last9 cluster.

![Creating a Cluster 6](/_astro/cluster-6.DHUZzhn5_2kNU8P.webp)

***

## Managing a Cluster

### Cluster Usage and Performance

Last9 provides the following tools to observe the Cluster's performance:

* [Cluster Health Dashboard](#cluster-health-dashboard) - Performance & usage metrics report
* [Query Logs](#query-logs) - Identify slow-running queries

#### Cluster Usage

![Cluster Usage](/_astro/cluster-1.Dd0S_MAv_Z5TCIt.webp)

Usage for each cluster is reported in *Samples Ingested*. A **sample** refers to a single data point in a time series. Usage for each Cluster can be viewed from the Cluster's details page. For more granular and historical usage, see the Cluster Health dashboard's Samples Ingested panel.
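To build intuition for how samples add up (the numbers below are illustrative, not a quota):

```plaintext
One target exposing 500 time series, scraped every 30 seconds:
500 series × 2 scrapes/minute × 60 minutes × 24 hours = 1,440,000 samples/day
```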
#### Cluster Quotas

There are no *per-cluster* limits in Last9. You are billed for usage across all Clusters combined. The ingestion rate, read query rate, and data retention quotas apply to all data across all clusters.

#### Default Cluster Quotas

Last9's default cluster quotas are fairly generous. In certain cases, keeping in mind performance and cost impacts, we may be able to increase a quota after a discussion with your team.

#### Write Quotas

| Type | Base Quota | Reset Period | Note |
| -------------------------------------------- | ---------- | ------------ | ------------------------ |
| Per Time Series Cardinality | 1M | Per Hour | Can be raised on request |
| Per Time Series Cardinality | 20M | Per Day | Can be raised on request |
| Streaming Aggregation Cardinality | 3M | Per Hour | Can be raised on request |
| Ingestion Concurrency | 20K | Per Second | Can be raised on request |
| Number of Metrics Aggregated in one Pipeline | 1 Metric | Per Query | Cannot be changed |

#### Read Quotas

| Type | Base Quota | Note |
| ------------------------------------------ | ---------- | ------------------------ |
| Time Series Scanned Per Query — Blaze Tier | 5M | Cannot be changed |
| Time Series Scanned Per Query — Hot Tier | 10M | Cannot be changed |
| Samples Scanned Per Query | 100M | Cannot be changed |
| Query Time Range — Blaze Tier | 2 Hours | Can be raised on request |
| Query Time Range — Hot Tier | 35 Days | Can be raised on request |

If you wish to change your quotas, please raise a request by emailing us.

### Cluster Health Dashboard

Every Last9 Cluster comes with its own Health dashboard. To view the Health dashboard, navigate to the Cluster details page and click on the **View Health** link in the performance card.

![Cluster Health - 1](/_astro/manage-cluster-2.I73qDeRU_Z1d6YC6.webp)

The following Cluster Performance Metrics are available in the health dashboard:

![Cluster Health - 2](/_astro/manage-cluster-3.CclarE7r_1ggdUs.webp)

* **Write Success** - Total successful write requests
* **Write Error** - Total failed write requests
* **Samples Ingested** - Total number of samples ingested
* **Write Availability** - Percentage of write requests that succeeded
* **Write Latency** - Write request latency
* **Lag** - Pending samples waiting to be indexed (in bytes)
* **Read Success** - Total successful read requests
* **Read Errors** - Total failed read requests
* **Cardinality Limited** - Metrics whose cardinality has been limited
* **Read Latency** - Query latency
* **Cardinality Limiter (Early Warning)** - Metrics whose cardinality is about to be limited
* **Bytes Dropped** - Samples that permanently failed to be indexed (in bytes)

***

## Query Logs

Query Logs help identify slow-running queries so that you can debug and optimize your PromQL. Query Logs display queries from the last 24 hours that executed successfully but took more than 1000ms (i.e., one second) to execute.

![Query Logs](/_astro/query-logs-1.B5JN_Icv_2b5sYf.webp)

When a slow query is identified, the following details are displayed:

* **Timestamp** - Time when the query was executed
* **Query** - PromQL along with the query's time range and query resolution step width
* **Latency** - Approximate time taken for the query to execute
* **Token Name** - The name of the token used to query
* **Tier** - Storage tier that was used for this query

***

## Cluster Settings

### Tokens

Tokens provide a mechanism for access management for your clients.
We generate a default token when the Cluster is created for the first time.

#### Creating a New Token

1. Navigate to the Cluster that you wish to create a token for: **Control Plane** → **Tokens**

![Create Token 1](/_astro/create-token-1.C_swa1Pm_ZubGi7.webp)

2. Click **New Token**

![Create Token 2](/_astro/create-token-2.CARPMHuN_Z1UY5S6.webp)

3. Provide a descriptive **Token Name**, select the access **Scope** (Write Only, Read Only, Read & Write), and click **Create**
4. Copy the generated token since it will be visible only once. This token can now be used along with the Read or Write URL (depending on the Scope selected)

![Create Token 3](/_astro/create-token-3.CmM_miQt_ZzTebF.webp)

#### Delete a Token

To delete/revoke a token:

1. Navigate to the Cluster that you wish to revoke a token from: **Control Plane** → **Tokens**
2. Click the **…** button and select Delete

![Delete Token 1](/_astro/delete-token-1._7jdZRI9_ZHHj4J.webp)

Note:

* This action cannot be undone; once deleted, tokens cannot be recovered
* Tokens can only be deleted by your organization's admin

### Write & Read Data

Refer to the list of available [Integrations](/docs/integrations/) that can be used to start writing and reading data to a Last9 Cluster.

### Access Policies

Last9 has built-in data-tiering capabilities based on retention policies. Access policies let you define which token or client can query a specified data tier. See our in-depth [Guide on Access Policies](/docs/access-policies/) on how you can leverage this powerful feature.

#### To define a new access policy:

1. Navigate to **Control Plane** → **Access Policies**

![Access Token 1](/_astro/access-tokens-1.BTtwTfQM_ZJKrYF.webp)

Every cluster comes with a default access policy pre-configured.

2. To define a new policy, click the Create button

![Access Token 2](/_astro/access-tokens-2.C57iLt4V_Obbc3.webp)

Provide the following details:

* Policy Name: Give a descriptive name for this access policy
* Token: Select a specific Token for which this access policy is applied, or select *Any*
* Query Client: Last9 can identify traffic from known clients; select *Any* for the policy to apply to any client
* Tier: Select the Tier from which the queries will be served for this policy

And click **Create**

3. Your new access policy will be applied instantly

![Access Token 3](/_astro/access-tokens-3.D4vgcaue_ZiG1SG.webp)

#### To delete an Access Policy:

1. Select the **…** button beside the access policy you wish to delete

![Access Token 4](/_astro/access-tokens-4.BPMC_CV__2qNMxk.webp)

2. Select **Delete** from the menu

Do note:

* Access policies can only be deleted by the admin user(s) of your org
* Deleting an access policy may limit or lock access for a client or token, so please be mindful before deleting

### Macros

Macros let you define PromQL queries as reusable functions and use them as abstracted metric names across Grafana, Alert Manager, or the CLI.

We cover how to define and use Macros in detail in the [guide on PromQL Macros](/docs/promql-macros/).

#### Enabling Macros:

1. Navigate to **Control Plane** → **Macros**

![Macros 1](/_astro/macros-1.MBp7cJYo_Z12BASz.webp)

2. Write/paste your Macro function and click Save

![Macros 2](/_astro/macros-2.BA9iCfc0_Z2eTENy.webp)

We perform validation once you click Save.

![Macros 3](/_astro/macros-3.BeLR0Lz5_3zrVA.webp)

Once validated, we will save your Macro function. Do note that it can take up to 5 minutes for new Macros to become available for querying.

![Macros 4](/_astro/macros-4.D2puijFr_Z1sa5xI.webp)

#### Deleting Macros:
1. Navigate to **Control Plane** → **Macros**

![Macros 5](/_astro/macros-5.DAarWIio_2lp9V8.webp)

2. Click the delete icon and click confirm

Note:

* Deleted Macros will impact any queries and dashboards where the macro functions were used
* Deleted Macros may remain available for queries up to 5 minutes after they have been deleted

### Streaming Aggregation

Streaming aggregation is a powerful cardinality-control capability built into Last9. Refer to our [Guide on Streaming Aggregation](/docs/streaming-aggregations/) for an in-depth tutorial.

# Configuring an Alert

> A step-by-step guide to configuring an alert rule in an Alert Group

![Alert Rule Configuration](/_astro/alert-rule-config-form.Ba66EPQi_Z6RYyv.webp)

## Pre-requisites

To be useful, each Alert Group needs Alert Rules that monitor its health. If you've created an Alert Group by importing from a Managed Grafana dashboard, indicators will already exist based on the PromQLs used in the dashboard. Otherwise, you'll need to create a new indicator. An indicator must be selected while creating an alert rule.

## Rule Configuration

### Rule Name

Short and simple is good for quick identification when a notification is triggered. Keep in mind that Alert Rule names are shown along with the Alert Group names in the notification. If you want to add more context, use the Rule Description field in the [Annotations](#annotations-optional) section.

### Select Indicator

Alert Rules are run against an Indicator. If you've imported a Grafana dashboard, Indicators are auto-generated based on the dashboard panel PromQLs; otherwise you'll have to first add the relevant Indicator. Indicators inherit the Alert Group's datasource, but can also have their own as an override.

### Edit Label Filter (optional)

Indicator queries support PromQL variables. If the query contains a variable, you can specify a label filter so the Alert Rule is triggered only for that labelset.

### Select Alerting Algorithm

By default, only Static Threshold is enabled. You can also use:

* **[Standard Deviation Alerting](/docs/standard-deviation-alerting/)** for adaptive alerts using statistical analysis
* **[Anomaly Detection](/docs/anomalous-pattern-detection-guide/)** algorithms for specialized pattern matching (contact [support](mailto:support@last9.io) to enable)

Standard deviation alerting works within static threshold alerts using the `adaptive_std_cmp` macro as your query.

### Set Threshold

This section is only visible if you've selected Static Threshold. You can select an operator and set the value of the threshold. For [Standard Deviation Alerting](/docs/standard-deviation-alerting/), set your threshold to `0.5` since the macro outputs boolean values (0 or 1).

### Configure Alert Sensitivity

Depending on the algorithm selected, the options may vary here.

For Static Threshold, you can specify the number of bad minutes out of the total number of minutes before the rule appears as firing in the Alert Monitor or sends a notification. For example, 3 bad minutes out of 5 means the threshold must be breached for at least 3 of the last 5 minutes.

For the Anomaly Detection algorithms, you can specify a value ranging from 0 to 10; decimal values are accepted. The lower the value, the more sensitive the algorithm.

You can click on the backtest button in the preview panel to open the indicator and algorithm calculations in Grafana and see at what values the algorithm will trigger. Play around with the query in Grafana to find a balance that you're comfortable with.
### Severity Level

Select whether this rule, when firing, should be treated as a threat or a breach. This is helpful as additional metadata for integrations like PagerDuty and OpsGenie to determine severity levels and route accordingly.

### Notification Grouping

When the alert rule is firing for multiple labelsets, it may lead to noise. For such cases, you can group notifications into a single instance. Grouped notifications call out the number of labelsets and values the alert rule is firing for.

### Annotations (optional)

Annotations are used to include additional metadata in alert notifications for an alert rule. For example, to help your team members better understand the context of an alert notification, you may want to include a brief description outlining the behavior or circumstances under which the rule triggers. Or include a runbook link so your team members can quickly reach the next to-do steps.

#### Dynamic Annotations

Annotations can be supercharged by inserting dynamic values using template variables. Currently, the following variables are supported:

* **Labels**, which resolve to the value of the respective label of the timeseries under alert, with the syntax `{{ $labels.<label_name> }}` or `{{ .Labels.<label_name> }}`
* **Value**, which resolves to the worst value of the timeseries under alert, with the syntax `{{ $value }}` or `{{ .Value }}`

Template variables can be used alongside plain text as well. For example, `Service name is {{ $labels.service }}`. Using multiple variables in a field is also supported. Spaces in the template variable syntax are optional.

Template variables can be used in any of the annotation fields — the rule description, runbook, or even custom annotations.

Considerations:

* Apart from the labels in the metric's timeseries, the labels of the Alert Group can also be referenced in template variables. If the labels match, the metric's timeseries takes precedence
* If a label value is not present, the template variable is shown as is
* If the template variable syntax is incorrect, the UI will display an error. Please note the supported variables above and their respective syntaxes
* Notifications with Dynamic Annotations display these dynamic values. In the case of [grouped notifications](#notification-grouping), *Labels* are shown as a count of all label values and *Values* are shown as a P99 of all the worst values

##### **Sample usage of Dynamic Annotations with Splunk**

A custom annotation named `splunk_debug_url` is added to an alert rule whose value is configured as `https://search.splunk.com/?service={{$labels.service}}&stack={{$labels.stack}}`. When alerts are generated for one or more timeseries, the variables in this custom annotation are interpolated using the labels of the timeseries. For example, `service=billing` and `stack=my-org` will lead to the link `https://search.splunk.com/?service=billing&stack=my-org`, and so on.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Control Plane

> Manage your data, its configurations, and its lifecycle.

## Introduction

![Control Plane](/_astro/control-plane.CoHYbImX_Z16hSy0.webp)

Last9's Control Plane offers a first-class experience for developers to manage their data, its settings, and its lifecycle. This document provides an overview of the main features and functionalities available in the Control Plane user interface.
## Tools and Configurations

## [Ingestion](/docs/control-plane-ingestion/)

[Configurations for how your data is ingested into Last9](/docs/control-plane-ingestion/)

## [Storage](/docs/control-plane-storage/)

[Defaults and controls for storing and using your telemetry data](/docs/control-plane-storage/)

## [Query](/docs/control-plane-query/)

[Configure query reusability, reads, and pattern match alerts.](/docs/control-plane-query/)

## [Analytics](/docs/control-plane-analytics/)

[Understand and debug system usage and performance](/docs/control-plane-analytics/)

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Analytics

> Control Plane tools to understand and debug system usage and performance.

## Cardinality Explorer

![Control Plane — Cardinality Explorer](/_astro/control-plane-cardinality-explorer.BI43P4d1_1BLwQu.webp)

While Last9 offers [superior defaults](/docs/managing-high-cardinality/) on per-metric per-day cardinality, you may need to identify the metrics and labels that are impacted. Cardinality Explorer helps you understand how the cardinality of metrics and their labels is trending. This enables you to diagnose cardinality-related challenges with your metrics.

[Read more](/docs/cardinality-explorer/) on how to use the Cardinality Explorer interface.

## Slow Query Logs

![Control Plane — Slow Query Logs](/_astro/control-plane-slow-query-logs.xhv7h3DP_Z1mGRyO.webp)

Quickly identify the slowest queries so you can debug and optimize them. These queries could originate from Last9's alerting, managed Grafana explore/dashboards, or your own read workloads.

You can change the latency values on the filter to see slower queries; the minimum is queries taking longer than 1 second. By default, logs are displayed for the last 1 hour, but the window can be customized to a maximum of the last 24 hours.

## Health Dashboard

![Control Plane — Health Dashboard](/_astro/control-plane-health-dashboard.B43nJB5j_Z2mRKIM.webp)

While Last9 provides an SLA of 99.9% for writes and 99.5% for reads, you can also view the health of Last9 by clicking on Health Dashboard. You are redirected to a system-generated Grafana dashboard with panels for availability, successes/errors, latencies, lags, bytes dropped, and more.

## Usage

![Control Plane — Usage](/_astro/control-plane-usage.DT30iXYA_jTFDl.webp)

View the ingestion trend and usage breakdown for your telemetry data, in total and by type (log, span, and metric events). By default, a summary of the last 30 days is displayed. You can select an area on the chart to zoom in, or click the icon in each date row of the breakdown table to view an hourly breakdown. You can also click "Download CSV" to get an hourly breakdown for the last 30 days.

### What is an Event?

Usage numbers are shown as Total Events. Each log line, trace span, and metric sample ingested by Last9 is considered an event. The number of events is calculated at the ingestion layer, before the data is used by any of the ingestion pipelines like [Streaming Aggregation](/docs/control-plane-ingestion/#streaming-aggregations), [Sensitive Data](/docs/control-plane-ingestion/#sensitive-data), [Forward](/docs/control-plane-ingestion/#forward), and [Drop](/docs/control-plane-ingestion/#drop).

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.
# Cold Storage

> Learn how to configure AWS S3 cold storage for log archival and cost optimization with Last9

Automatically archive logs older than 14 days to S3 for cost-effective storage and on-demand rehydration. You can configure cold storage for all services or only specific services.

## Setup

1. **Create IAM Role** with permissions to the S3 bucket (replace `<your-bucket-name>` with your bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3express:CreateSession"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    }
  ]
}
```

2. **Add Trust Relationship**:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com",
        "AWS": "arn:aws:iam::"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

3. Make sure that the role session expiry is set to a **minimum of 4 hours**.

4. **Enable Cold Storage**: Navigate to the **Buckets** tab in [Cold Storage](https://app.last9.io/control-plane/cold-storage) and add your bucket name and role ARN.

![Cold Storage Buckets](/_astro/cold-storage-buckets.BEdtGGMK_Z2qahIT.webp)

5. Once cold storage is enabled, you can rehydrate the logs on demand. Read the [Rehydration](/docs/control-plane-rehydration/) guide for more details.

## Service-Level Backup Configuration

Navigate to the **Backups** tab in [Cold Storage](https://app.last9.io/control-plane/cold-storage) to configure which services you want to back up. You have three options:

1. **Default (toggle off):** Data is backed up at index-level granularity only — you cannot rehydrate individual services.
2. **All Services:** Enable the **Service Level Backup** toggle and select **All Services**. All services are backed up and you can rehydrate individual services.

![Cold Storage - All Services](/_astro/cold-storage-all-services.-O13Fv_a_Z1DlBeg.webp)

3. **Only Selected Services:** Enable the **Service Level Backup** toggle, select **Only Selected Services**, and pick the specific services you want to back up.

![Cold Storage - Configure Services](/_astro/cold-storage-configure-services.BZsMSYj7_Z1bjeMv.webp)

Click **Save Configuration** after making your selection.

### Benefits of Service-Level Backup

* **Targeted cost optimization:** Save money where it makes sense without compromising on critical services
* **Service-appropriate retention:** Match data lifecycle to each service's actual needs
* **Strategic resource allocation:** Invest observability resources based on service priority
* **Simplified compliance:** Apply different retention rules only where legally necessary

***

## Troubleshooting

Need help? Join our [Discord](https://discord.com/invite/Q3p2EEucx9) or email [cs@last9.io](mailto:cs@last9.io).

# Access Cold Storage Logs via AWS Athena

> Learn how to query and analyze your cold storage logs in S3 using AWS Athena's SQL interface

Last9 automatically backs up your logs to a configured S3 bucket via [Cold Storage](/docs/control-plane-cold-storage/). This doc will show you how to access and query these archived logs using AWS Athena, allowing you to perform powerful SQL-based analysis on your historical data.
## Create a database on Athena

```sql
CREATE DATABASE last9;
```

## Create a table in the database

```sql
-- The attribute columns are modeled here as arrays of key/value structs;
-- adjust these types to match the schema of your exported parquet files.
CREATE EXTERNAL TABLE last9.logs (
  `timestamp` bigint,
  `traceid` string,
  `spanid` string,
  `traceflags` int,
  `severitytext` string,
  `severitynumber` int,
  `servicename` string,
  `body` string,
  `resourceschemaurl` string,
  `resourceattributes` array<struct<key:string,value:string>>,
  `scopeschemaurl` string,
  `scopename` string,
  `scopeversion` string,
  `scopeattributes` array<struct<key:string,value:string>>,
  `logattributes` array<struct<key:string,value:string>>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://customer_s3_bucket/snappy-files/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
```

## Export AWS Profile

Before running the script, ensure your AWS profile is properly configured with appropriate permissions to access both your source and destination S3 buckets, as well as Athena.

## Move logs to Athena from backup S3 bucket

Save the following Python script as `insert_data_into_athena.py`.

```python
import argparse
import os
import tempfile

import boto3
import lz4.frame
import pandas as pd
from botocore.exceptions import ClientError


class ParquetProcessor:
    def __init__(self):
        """Initialize the processor using AWS credentials from environment"""
        self.s3_client = boto3.client('s3')
        self.athena_client = boto3.client('athena')
        self.temp_dir = tempfile.mkdtemp()

    def download_from_s3(self, bucket_name, prefix):
        """Download all .parquet.lz4 files from the specified S3 path"""
        downloaded_files = []
        try:
            print(f"Searching in bucket: {bucket_name}")
            print(f"Using prefix: {prefix}")
            paginator = self.s3_client.get_paginator('list_objects_v2')
            for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
                if 'Contents' in page:
                    print("\nObjects found:")
                    for obj in page['Contents']:
                        print(f"Key: {obj['Key']}")
                        if obj['Key'].endswith('.parquet.lz4'):
                            print(f"Found matching file: {obj['Key']}")
                            local_file = os.path.join(self.temp_dir, os.path.basename(obj['Key']))
                            self.s3_client.download_file(bucket_name, obj['Key'], local_file)
                            downloaded_files.append(local_file)
                else:
                    print("No 'Contents' in this page")
            if not downloaded_files:
                print("No .parquet.lz4 files were found")
            return downloaded_files
        except ClientError as e:
            print(f"Error downloading files: {e}")
            return []

    def decompress_lz4(self, file_path):
        """Decompress .parquet.lz4 file to .parquet"""
        try:
            output_file = file_path.replace('.lz4', '')
            print(f"Decompressing {file_path} to {output_file}")
            with open(file_path, 'rb') as compressed:
                compressed_data = compressed.read()
            decompressed_data = lz4.frame.decompress(compressed_data)
            with open(output_file, 'wb') as decompressed:
                decompressed.write(decompressed_data)
            os.remove(file_path)
            print(f"Successfully decompressed to {output_file}")
            return output_file
        except Exception as e:
            print(f"Error decompressing file {file_path}: {e}")
            return None

    def convert_to_snappy(self, file_path):
        """Convert decompressed parquet to Snappy compression"""
        try:
            df = pd.read_parquet(file_path)
            df.to_parquet(file_path, compression='snappy')
            return file_path
        except Exception as e:
            print(f"Error converting file {file_path}: {e}")
            return None

    def upload_to_s3(self, bucket, prefix, file_path):
        """Upload a file to S3"""
        try:
            file_name = os.path.basename(file_path)
            s3_key = os.path.join(prefix.rstrip('/'), file_name)
            print(f"Uploading {file_path} to s3://{bucket}/{s3_key}")
            self.s3_client.upload_file(file_path, bucket, s3_key)
            return True
        except Exception as e:
            print(f"Error uploading file: {e}")
            return False

    def cleanup_local_files(self, snappy_files):
        """Clean up temporary local files"""
        for file in snappy_files:
            try:
                os.remove(file)
            except Exception as e:
                print(f"Error removing file {file}: {e}")
        os.rmdir(self.temp_dir)

    def process_files(self, source_bucket, source_prefix, snappy_destination, athena_results_location=None):
        """Main process to handle the complete workflow"""
        # Download LZ4 files
        lz4_files = self.download_from_s3(source_bucket, source_prefix)
        if not lz4_files:
            print("No .parquet.lz4 files found")
            return

        # Decompress LZ4 files
        decompressed_files = []
        for file in lz4_files:
            decompressed_file = self.decompress_lz4(file)
            if decompressed_file:
                decompressed_files.append(decompressed_file)
        if not decompressed_files:
            print("No files were successfully decompressed")
            return

        # Convert to Snappy
        snappy_files = []
        for file in decompressed_files:
            snappy_file = self.convert_to_snappy(file)
            if snappy_file:
                snappy_files.append(snappy_file)
        if not snappy_files:
            print("No files were successfully converted to Snappy")
            return

        # Upload to snappy destination
        dest_bucket = snappy_destination.split('//')[1].split('/')[0]
        dest_prefix = '/'.join(snappy_destination.split('//')[1].split('/')[1:])
        for file in snappy_files:
            if not self.upload_to_s3(dest_bucket, dest_prefix, file):
                print(f"Failed to upload {file}")
                continue

        # Cleanup local files
        self.cleanup_local_files(snappy_files)
        print("Processing completed successfully")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process .parquet.lz4 files and upload to S3')

    # S3 and Athena configuration
    parser.add_argument('--source-bucket', required=True,
                        help='Source S3 bucket name where parquet.lz4 (where Last9 saves backup files)')
    parser.add_argument('--source-prefix', required=True,
                        help='Source S3 prefix path where parquet.lz4 files are stored')
    parser.add_argument('--snappy-destination', required=True,
                        help='S3 path for converted snappy files')
    parser.add_argument('--athena-results', required=True,
                        help='S3 path for Athena query results')

    args = parser.parse_args()

    processor = ParquetProcessor()
    processor.process_files(
        source_bucket=args.source_bucket,
        source_prefix=args.source_prefix,
        snappy_destination=args.snappy_destination,
        athena_results_location=args.athena_results
    )
```

The script `insert_data_into_athena.py` processes `.parquet.lz4` files from the backup bucket and uploads them to a separate S3 location for querying in Athena.
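The script relies on a few third-party packages. Assuming a standard Python environment, an install along these lines should cover them (`pyarrow` is one common parquet engine for pandas; `fastparquet` also works):

```bash
pip install boto3 pandas pyarrow lz4
```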
### Help Command

Run the following command to see all available options and parameters:

```bash
python insert_data_into_athena.py --help
```

### Usage

```plaintext
usage: insert_data_into_athena.py [-h] --source-bucket SOURCE_BUCKET
                                  --source-prefix SOURCE_PREFIX
                                  --snappy-destination SNAPPY_DESTINATION
                                  --athena-results ATHENA_RESULTS

Process .parquet.lz4 files and upload to S3

options:
  -h, --help            show this help message and exit
  --source-bucket SOURCE_BUCKET
                        Source S3 bucket name where parquet.lz4 (where Last9 saves backup files)
  --source-prefix SOURCE_PREFIX
                        Source S3 prefix path where parquet.lz4 files are stored
  --snappy-destination SNAPPY_DESTINATION
                        S3 path for converted snappy files
  --athena-results ATHENA_RESULTS
                        S3 path for Athena query results
```

### Example Command

Here's a sample command that processes files from your backup bucket to prepare them for Athena queries:

```bash
python insert_data_into_athena.py \
  --source-bucket last9_backup_bucket \
  --source-prefix "path/to/file/" \
  --snappy-destination "s3://customer_s3_bucket/snappy-files" \
  --athena-results "s3://customer_s3_bucket/athena-results/"
```

In this example:

* `last9_backup_bucket` is your source bucket containing the archived logs
* `path/to/file/` is the directory path where your .parquet.lz4 files are located
* `s3://customer_s3_bucket/snappy-files` is where the converted files will be stored
* `s3://customer_s3_bucket/athena-results/` is where Athena will store query results

## Check result on Athena

After the data has been uploaded, you can query it using Athena with the following SQL:

```sql
SELECT * FROM last9.logs;
```

This retrieves all logs from the `last9.logs` table, allowing you to verify that your data has been successfully uploaded and is accessible through Athena.

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Drop

> Drop unwanted telemetry at the ingestion layer using the Control Plane

![Control Plane — Pipeline Sequence: Drop](/_astro/control-plane-pipeline-sequence-drop.CjNUosRc_ZzQXgf.webp)

Order of Last9's pipeline processing.

[Drop](https://app.last9.io/control-plane/drop) lets you discard unwanted telemetry that you don't want to store and query, at runtime — no code changes, no redeploys, no policy updates. For example, if you need debug logs during an incident, just remove the drop rule. It's a faster, simpler alternative to code-level governance.

Data Loss Warning: Dropped data is **not ingested** and **cannot be recovered**. Always use the preview button to verify your filters match the intended data before saving.

## Create a Drop Rule

Head to the [Control Plane](https://app.last9.io/control-plane/drop) and click **NEW RULE**.

![Control Plane — New Drop Rule](/_astro/control-plane-drop-rule-new.CLRLc1gX_ZXpJPx.webp)

### Step 1: Select Telemetry Type

Choose the type of telemetry you want to drop:

| Telemetry | Filter By |
| ----------- | --------------------------------------- |
| **Metrics** | Metric name only |
| **Logs** | Attributes and resource attributes |
| **Traces** | Span attributes and resource attributes |

### Step 2: Define Filters

Add one or more filter conditions. Multiple filters are combined using **AND** logic — all conditions must match for data to be dropped.
![Control Plane — Drop Rule](/_astro/control-plane-drop-rule.DeJ40Pav_27xL72.webp)

#### Filter Operators

| Operator | Symbol | Description |
| ---------- | ------ | ------------------------------------------ |
| Equals | `==` | Exact match |
| Not Equals | `!=` | Does not match |
| Regex | `=~` | Pattern matching using regular expressions |

#### Filter Keys (Logs & Traces)

For logs and traces, filter keys use the OpenTelemetry attribute format:

* `attributes["key"]` — Span or log attributes (e.g., `attributes["http.status_code"]`)
* `resource.attributes["key"]` — Resource-level attributes (e.g., `resource.attributes["service.name"]`)

### Step 3: Preview & Verify

Before saving, use the preview button to verify matching data. The button label changes based on telemetry type:

* **Metrics**: Click **VIEW IN DASHBOARD** → Opens Grafana with a matching metric query
* **Logs**: Click **VIEW LOGS** → Opens Logs Explorer with matching filters
* **Traces**: Click **VIEW TRACES** → Opens Traces Explorer with matching filters

### Step 4: Name & Save

Give your rule a unique name and click **SAVE**. Rule names must be unique within your organization.

## Examples

### Drop debug logs from development environment

| Field | Value |
| --------- | ------------------------------------------------------------------ |
| Telemetry | Logs |
| Filter 1 | `resource.attributes["deployment.environment"]` `==` `development` |
| Filter 2 | `attributes["level"]` `==` `debug` |

### Drop health check traces

| Field | Value |
| --------- | --------------------------------------------- |
| Telemetry | Traces |
| Filter | `attributes["http.route"]` `=~` `.*/health.*` |

### Drop specific metric

| Field | Value |
| --------- | ---------------------------------- |
| Telemetry | Metrics |
| Filter | Name `==` `go_gc_duration_seconds` |

## Manage Existing Rules

All your drop rules are displayed in a table below the create form. To manage a rule:

1. Click the **three-dot menu** (⋮) on the rule row
2. Select an action:
   * **Edit**: Opens the rule in the form for modification
   * **Delete**: Removes the rule (requires confirmation)

## Constraints

* **Unique names**: Each drop rule must have a unique name
* **AND logic only**: Multiple filters are combined with AND (all must match)
* **Ingestion limits**: Your organization may have limits on the total number of ingestion rules
* **Regex validation**: When using the `=~` operator, the value must be a valid regular expression

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Forward

> Forward telemetry data to object storage backends without storing it in Last9

![Control Plane — Pipeline Sequence: Forward](/_astro/control-plane-pipeline-sequence-forward.B9gohCZ-_ZdPaeB.webp)

Order of Last9's pipeline processing.

The Forward feature allows you to send telemetry data directly to external object storage like AWS S3 without storing it in Last9. This is particularly useful for compliance data that needs long-term storage but isn't frequently queried.

Access the [Forward](https://app.last9.io/control-plane/forward) feature in your Control Plane.

## Prerequisites

Before creating forward rules, you'll need:

* An AWS S3 bucket configured for Last9 access
* IAM AssumeRole permissions set up for Last9
* Understanding of which telemetry data you want to forward

## Configure AWS S3 Backend

1. Navigate to your Control Plane → [Cold Storage](https://app.last9.io/control-plane/cold-storage)
2. Configure your S3 bucket details and IAM AssumeRole ARN
3. Save the configuration

![Configure Cold Storage Backend](/_astro/control-plane-cold-storage.Dgdbekx1_FHx4t.webp)

Learn more about setting up [IAM AssumeRole for AWS S3](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the AWS documentation.

## Create Forward Rules

Forward rules determine which telemetry data gets sent to your configured storage backend.

1. In your Control Plane, navigate to **Forward** and click **New Rule**
2. Configure matching filters using `==` or `!=` operators to specify which data to forward
3. Click **View Logs** to preview which data matches your filters
4. Save your forward rule once you've verified the filter criteria and added an identifiable rule name

![Create Forward Rule](/_astro/control-plane-forward-logs.DdrtOo0v_Z21x4Kf.webp)

### Filter Configuration

Use matching filters to specify exactly which data to forward:

* **Equal (`==`)**: Forward data that matches the specified value
* **Not equal (`!=`)**: Forward data that doesn't match the specified value

**Example filters:**

```plaintext
service.name == "payment-service"
log.level != "debug"
```

Data that matches forward rules is **not stored in Last9** and cannot be recovered. Always verify your filters using "View Logs" before saving.

## Supported Data Types

Forward currently supports:

* **Logs**: Application logs, system logs, structured log data, etc.

**Supported storage backends:**

* AWS S3 (with more backends planned)

## Multiple Forward Rules

You can create multiple forward rules for different data types or services. Each rule operates independently and can forward to different storage backends.

## Best Practices

* Start with specific filters to avoid forwarding more data than intended
* Use the preview feature to verify your filter logic before enabling rules
* Monitor your S3 storage costs as forwarded data accumulates
* Consider data retention policies for your S3 bucket
* Test forward rules with non-critical data first

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Ingestion

> Control Plane tools to configure how your data is ingested into Last9.

Ingestion is the second pillar of our telemetry data platform, Last9. Once you've [instrumented](/docs/integrations/) your system, controls over how your telemetry data flows into Last9 do a fair bit of heavy lifting.

## Ingestion Tokens

![Control Plane — Ingestion Tokens](/_astro/control-plane-ingestion-tokens.QRukipwK_1I6TUf.webp)

Ingestion Tokens authenticate your applications when sending telemetry data to Last9. These tokens control what data can be sent and from which origins, ensuring secure data collection.

A system-generated ingestion token is created when you sign up — this token is used in the setup wizard to help you configure sending data to Last9. System-generated tokens cannot be deleted.

## Access Policies

![Control Plane — Access Policies](/_astro/control-plane-access-policies.BBU-b5ZY_ZBDmB8.webp)

Set up how various clients access your data — depending on the client type and token used, you can control from which tier (blaze, hot, cold) your data is queried. We recommend that alerting workloads always use the Blaze Tier and reporting workloads use the Cold Tier.

[Read more](/docs/access-policies/) on how to configure these policies.
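To sanity-check what a given token can read, you can query the cluster directly using the standard Prometheus HTTP query API, since Last9 clusters are Prometheus API-compatible. This is a sketch under assumptions: `{read_url}`, `{username}`, and `{read_token}` are placeholders, and the exact read URL and auth scheme come from your cluster's Read Data settings.

```shell
# Hypothetical check: run a trivial instant query with a specific read token.
# {read_url}, {username}, and {read_token} are placeholders; take the real
# values from your cluster's Read Data settings.
curl -G "https://{read_url}/api/v1/query" \
  --data-urlencode 'query=up' \
  -u '{username}:{read_token}'
```

The query itself is ordinary PromQL; which tier answers it is decided by the access policy that matches the token and client.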
## Streaming Aggregations ![Control Plane — Streaming Aggregation](/_astro/control-plane-streaming-aggregation.ZDAvejya_ZV5vc6.webp) Streaming Aggregations allow you to transform data in real-time at the ingestion layer before it is stored in Last9. They enable you to generate scoped metrics at runtime without any instrumentation changes and improve the performance of your read queries by controlling the cardinality of the new metrics. [Read more](/docs/streaming-aggregations/) on how to configure these aggregations. *** ## Sensitive Data ![Control Plane — New Sensitive Data Rule](/_astro/control-plane-sensitive-data-modal.Bzy1azSp_Z3aNd6.webp) Redact sensitive data from your telemetry data at the time of ingestion. Currently supported: * Telemetry Type: Logs * Actions: Redact (default), No Action Last9 provides built-in scan rules for PII like emails, phone numbers, and credit card numbers. While the default action is to redact, you can also choose to take no action. This is particularly helpful when you just want to attach additional labels to the telemetry. Configured rules are applied in sequential order. Once saved, you can drag-and-drop to reorder the rules. ## Forward ![Control Plane — New Forward Rule](/_astro/control-plane-forward-modal.p6KVChYh_2kuqd3.webp) While data after applicable retention periods can be moved to your configured S3 bucket for [Cold Storage](/docs/control-plane-storage/#cold-storage), you can also configure rules with `==` and `!=` matching filters to forward incoming data directly to your cold storage without it being ingested and stored. To verify the filters before saving the rule, you can click on “View in Dashboard”. Note that this data is not available for querying once forwarded, but it can be queried after being [rehydrated](/docs/control-plane-storage/#rehydration). Supported telemetry types: Logs and Traces. ## Drop ![Control Plane — New Drop Rule](/_astro/control-plane-drop-modal.Bl1FdNj2_ZqifBe.webp) You can configure rules with `==` and `!=` matching filters to drop incoming data. Note that this data is not ingested and cannot be recovered. To verify the filters before saving the rule, you can click on “View in Dashboard”. Supported telemetry types: Logs, Metrics, and Traces. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Query > Control Plane tools for query-level configurations. ## Macros ![Control Plane — Macros](/_astro/control-plane-macros.BvHN4N13_2qDbDQ.webp) Macros work much like stored procedures do for SQL developers: they apply the time-tested practices of functions, abstraction, and reusability to replace cumbersome, error-prone query duplication. Simplify PromQL queries that are reused often to avoid repetition and improve abstraction and readability. [Read more](/docs/promql-macros/) on how to configure these macros. ## Scrape Interval Set this to the typical scrape and evaluation interval configured in your agent’s config file. If you set this to a greater value than your agent’s config file interval, the embedded Grafana in Explore will evaluate the data according to this interval and you will see fewer data points. Notes: * Defaults to 1m. * This does not change your agent’s scrape interval.
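For reference, the matching interval in a Prometheus-style agent lives under the `global` section of its config file. A minimal sketch, with `1m` mirroring the Last9 default; adjust to your own agent's settings:

```yaml
# prometheus.yml — keep these values in sync with the Scrape Interval set in Last9
global:
  scrape_interval: 1m      # how often targets are scraped
  evaluation_interval: 1m  # how often recording/alerting rules are evaluated
```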
## Read Data ![Control Plane — Read Data](/_astro/control-plane-read-data.DmxmJe-8_Z22wKH0.webp) If you’re looking to use your stored telemetry data outside of Last9’s [Alerting](https://app.last9.io/alert-studio) or [Managed Grafana](https://app.last9.io/explore), you can refer to the Read Data settings to configure your choice of visualization tool. For additional settings on how to configure your own Grafana to use Last9 as a datasource, [read this](/docs/grafana-config/). ## Scheduled Search ![Control Plane — Scheduled Search](/_astro/control-plane-scheduled-search.BzqpKP_p_1XpAeQ.webp) Create periodic searches on telemetry data and set alerts when patterns are found or missing. [Read more](/docs/scheduled-search/) on how to configure these alerts. Supported telemetry types: Logs (Traces support coming soon). ## Query Tokens ![Control Plane — Query Tokens](/_astro/control-plane-query-tokens.DlaYeCpX_1iKmfT.webp) Query Tokens provide read-only access to your telemetry data for external visualization tools like Grafana, alerting systems, and custom applications. A system-generated query token is created when you sign up — this token is used to set up dashboard templates, including the Health Dashboard. System-generated tokens cannot be deleted. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Rehydration > Rehydrate logs from cold storage to query historical data beyond your retention period Rehydration allows you to retrieve logs from your cold storage back into Last9 for querying. This feature enables you to access historical log data that’s beyond your organization’s retention period. Access the [Rehydration](https://app.last9.io/control-plane/rehydration) feature in your Last9 Control Plane. ![Rehydration Overview](/_astro/new-rehydrated-index.DR_o-0hD_1eKGd6.webp) ## Prerequisites Before you can rehydrate logs, ensure that: * [Cold Storage](/docs/control-plane-cold-storage/) is enabled for your organization ## Creating a Rehydrated Index 1. Navigate to [Rehydration](https://app.last9.io/control-plane/rehydration) in the Control Plane 2. Click **New Rehydrated Index** to open the configuration modal 3. Configure your rehydration settings: 1. **Select Source**: Select from available indexes (typically “Default Index”) ![Rehydration Source](/_astro/new-1-select-source-index.BZ1Ie1eN_Z5qC0F.webp) 2. **Add Definition**: ![Rehydration Definition](/_astro/new-2-add-definition.Buqwm78T_Z1JjKvT.webp) * **Time Range**: Choose a time period from before your retention period. If your organization has a 14-day retention, you can only select dates older than 14 days * **Services to Rehydrate**: If service-level backup is enabled in your cold storage, you can select specific services to rehydrate instead of all logs * **Estimated Size**: Review the estimated compressed size of data to be rehydrated 3. **Set Destination Details**: ![Rehydration Destination](/_astro/new-3-add-destination-details.BaJ2ECWl_Z9NP8E.webp) * **Rehydrated Index Name**: Provide a descriptive name for easy identification * **Send Notification When Ready** (Optional): If you have email channels configured in your organization settings, you can enable notifications to receive updates when the rehydration job completes 4. Click **Rehydrate Index** to start the process The rehydration job will appear in your index list with an “Index is being rehydrated” status.
## Understanding Rehydrated Index States ![Rehydration Overview](/_astro/index-status.DC7R5_RF_Z2dnSvR.webp) Your rehydrated indexes can have several different states: * **Index is being rehydrated**: Process is currently running * **Available**: Index is ready for querying, shows availability window * **Expired**: Index has passed its retention period and is no longer queryable * **Failed**: Rehydration process encountered an error and needs to be retried ## Querying Rehydrated Data Once your rehydration is complete and shows “Available” status: 1. Click the **View in Logs** button next to your rehydrated index 2. This opens the Log Explorer with: * Your rehydrated index pre-selected * Time range set to the last 5 minutes of your rehydrated data window 3. You can now modify the time range and apply filters to explore your rehydrated data 4. Click on Run Query or use the `cmd/ctrl + enter` shortcut ## Managing Rehydrated Indexes Each rehydrated index includes management options accessible through the more (⋯) menu: * **Rehydrate**: Create a new rehydration job for the same time period * **Delete**: Remove the rehydrated index to free up storage ## Best Practices * **Selective Rehydration**: When service-level backup is available, rehydrate only the services you need to reduce processing time and storage costs * **Naming Convention**: Use descriptive names that include the date range and purpose, such as `incident_analysis_june_2025` or `compliance_audit_q1_2025` * **Time Range Planning**: Remember that you can only rehydrate data from before your retention period. Plan accordingly when investigating incidents or conducting analysis *** ## Troubleshooting * **Index Failed to Rehydrate**: If you see a “Failed — Rehydration Process Failed, Please Retry” message, select “Retry” from the more (⋯) menu or try creating a new rehydration job with the same parameters. If the issue persists, contact support * **No Data Available**: Ensure your cold storage contains data for the selected time range and that the time range is before your retention period Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Remapping > Transform and standardize your logs and traces data by extracting and mapping fields for better searchability and analysis. ![Control Plane — Pipeline Sequence: Remapping](/_astro/control-plane-pipeline-sequence-remapping.B2HhrwIf_2sCnun.webp) Order of Last9’s pipeline processing. [Remapping](https://app.last9.io/control-plane/remapping) allows you to standardize your **logs and traces** data by extracting fields and mapping them to consistent formats. This powerful feature helps you normalize data across different services and sources, making your telemetry more searchable and easier to analyze. Remapping consists of two primary functions: 1. **Extract**: Pull specific fields or patterns from your log lines 2. **Map**: Transform extracted fields into standardized formats This capability is valuable for scenarios like: * Normalizing different service names across your infrastructure * Standardizing severity levels from various sources (ERROR, err, Fatal, 500) * Creating consistent environment labels (prod, production, prd) * Extracting structured data from JSON or pattern-based logs * Maintaining consistent field naming conventions ## Attribute Priority Order When mapping multiple source attributes to a single target attribute, the **order of preference is left to right**. 
The system evaluates attributes in sequence and uses the first non-empty value found. **Example:** If you configure Service mapping with three source attributes: | Position | Source Attribute | Value | | -------------- | ---------------------------- | ---------------- | | 1st (leftmost) | `attributes["service_name"]` | `""` (empty) | | 2nd | `attributes["app_name"]` | `"checkout-api"` | | 3rd | `resource["service.name"]` | `"checkout"` | **Result:** Service is set to `"checkout-api"` because it’s the first non-empty value when evaluated left to right. ## Working with Remapping ### Extract ![Control Plane — Remapping: Extract](/_astro/control-plane-remapping-extract.C9KjV6W7_Z1DI6hJ.webp) 1. Navigate to [Control Plane > Remapping](https://app.last9.io/control-plane/remapping) 2. Select the “Extract” tab 3. View existing extraction rules in the table showing: * Name: Descriptive name of the extraction rule * Method: JSON or Pattern Match extraction method * Scope: Which lines the extraction applies to * Fields/Pattern: Which fields or patterns to extract * Action: How the extracted data is handled (Upsert/Insert) * Active Since: When the rule was activated 4. Click “+ NEW RULE” to create a new extraction rule #### Creating a New Extraction Rule 1. Select “Extraction Method”: * **JSON:** Extract fields from structured JSON logs * **Pattern Match:** Use regex patterns to extract fields from unstructured logs 2. Choose “Extraction Scope”: * **All Lines:** Apply extraction to every log line * **Lines that match:** Apply only to lines matching specific criteria 3. Field(s) to Extract: 1. For JSON method: * Select the field(s) to extract * Example fields: requestId, thread\_id, logger\_name, etc. 2. For Pattern Match method: * Enter the regex pattern in “Pattern to Extract” field * Example: `timeseries:\s*(?P<count>\d+)` 4. Set “Action” to “Upsert” (update if exists, insert if not) or “Insert” 5. Choose “Extract Into” option: * **Log Attributes:** Adds fields to the log’s searchable attributes * **Resource Attributes:** Adds fields to the resource’s metadata 6. Optionally add a “Prefix” to extracted field names * Example: “ec2\_” would transform “id” to “ec2\_id” 7. Enter a descriptive “Rule Name” 8. Click “SAVE” to activate the rule ### Map ![Control Plane — Remapping: Map](/_astro/control-plane-remapping-map.CtiCrDBK_ZlWLvl.webp) 1. Navigate to [Control Plane > Remapping](https://app.last9.io/control-plane/remapping) 2. Select the “Map” tab 3. View “Remap Fields” section with existing mappings 4. Map common fields to standardized formats: * **Service:** Map various service names to consistent values * Example: `attributes["service_name"]` * **Severity:** Map different log levels to standard severity * Example: `attributes["level"]` and `attributes["levelname"]` * **Deployment Environment:** Map environment indicators * Select from available attributes 5. Preview the mapping results in the “Preview (Last 2 mins)” section below * SERVICE: How service names appear after mapping * SEVERITY: Standardized severity levels * DEPLOYMENT ENV: Normalized environment names * LOG ATTRIBUTES: Other log details * RESOURCE ATTR: Resource-related information 6. After configuring mappings, click “SAVE” ## Example Use Cases ### Logs 1. **Standardizing Service Names**: Map various service identifiers to consistent names * Raw values: “auth-svc”, “auth\_service”, “authentication” * Mapped to: “authentication-service” 2.
**Normalizing Severity Levels**: Create consistent severity levels across sources * Raw values: “ERROR”, “err”, “Fatal”, “500” * Mapped to: “ERROR” 3. **Extracting Thread Information**: Pull thread details from logs for better filtering * Extract fields: thread\_id, thread\_name, thread\_priority * Makes thread-based troubleshooting more efficient 4. **Environment Consistency**: Standardize environment naming * Raw values: “dev”, “development”, “preprod”, “staging” * Mapped to consistent environment names ### Traces 1. **Service Name Standardization**: Ensure consistent service names across spans * Source attributes: `resource["service.name"]`, `attributes["service"]` * Map to standardized service names for cleaner service maps 2. **Deployment Environment**: Tag traces with environment information * Source: `resource["deployment.environment"]`, `attributes["env"]` * Standardize to: “production”, “staging”, “development” 3. **Span Operation Normalization**: Consistent operation naming across services * Different frameworks may use varying conventions for span names * Map to consistent operation names for easier filtering in [Traces Explorer](/docs/traces-explorer/) ## Tips for Effective Remapping * **Start Simple:** Begin with the most common fields you search by * **Use Consistent Naming:** Follow a naming convention for all mapped fields * **Check Preview Results:** Use the preview section to verify your mappings work as expected * **Mind the Order:** Attributes are evaluated left to right—place your most reliable source first * **Use JSON When Possible:** JSON extraction is more reliable for structured logs * **Test Pattern Matches:** Validate regex patterns before implementing them * **Apply to Both Signals:** The same remapping rules work for both logs and traces *** ## Troubleshooting If your remapping rules aren’t working as expected: 1. Check the extraction pattern syntax for errors 2. Verify field names match exactly what appears in your logs 3. Ensure your extraction scope is appropriate 4. Look at the preview to confirm data is flowing as expected 5. Try simplifying complex regex patterns Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Sensitive Data > Redact sensitive data from telemetry at the ingestion layer using the Control Plane ![Control Plane — Pipeline Sequence: Sensitive Data](/_astro/control-plane-pipeline-sequence-pii.CwEm9qV0_ZNU9tk.webp) Order of Last9's pipeline processing. [Sensitive Data](https://app.last9.io/control-plane/sensitive-data) automatically detects and redacts personally identifiable information (PII) and other sensitive data from your telemetry at ingestion time — no code changes, no redeploys, no policy updates. For example, if customer phone numbers start appearing in your logs, just create a redaction rule to automatically replace them with asterisks. It’s a faster, simpler alternative to code-level data sanitization. ## Create new sensitive data rule Head to the Control Plane and create a new Sensitive Data Rule. ![Control Plane — New Sensitive Data Rule](/_astro/control-plane-pii-new.BeCRzBCz_1peXRO.webp) You can configure rules to scan for different types of Personally Identifiable Information (PII) including email addresses, phone numbers, and credit card numbers. When sensitive data is detected, you can choose to redact it (replace with asterisks) or take no action. Additional labels can be attached to matching samples for filtering and alerts.
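For instance, here is the effect of a Redact rule on a log line containing a phone number and an email address (a sketch; the exact masking format is illustrative):

```plaintext
# ingested log line
user signup failed: phone=+1-555-0142 email=jane@example.com

# stored after the Redact action
user signup failed: phone=************ email=****************
```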
![Control Plane — Sensitive Data Rules List](/_astro/control-plane-pii.BOH1lvIR_Z1NCQR8.webp) ## Configuration Options ### Telemetry Data The currently supported telemetry type is **logs only**. All samples for the selected telemetry data will be scanned using the configured rules. ### Scan Rules Choose which types of sensitive data to detect: * **Email** - Detects email addresses in your log data * **Phone Number** - Identifies phone numbers across various formats * **Credit Card Number** - Finds credit card numbers and payment card data ### Actions Available actions for detected sensitive data: * **Redact** - Replace matching sensitive data with asterisks (`*`) * **No Action** - Detect and label but don’t modify the data ### Additional Labels Add custom labels (key:value pairs) to samples containing sensitive data. These labels can be used for filtering, alerting, and downstream processing. Common examples: * `sensitive_data: true` * `redacted: true` * `pii_type: phone` ## Supported Telemetry Types The currently supported telemetry type is **logs**. Support for metrics and traces will be added in future releases. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Storage > Control Plane tools for defaults and configuring how your telemetry data is stored and re-used. ## Sampling, Tiering, and Retention 1. **Sampling:** Last9 applies no sampling on your data to ensure an accurate representation of your system’s health. 2. **Data Tiering:** Last9 offers automated data tiering by default for your metrics data, including ones generated by the traces-to-metric and logs-to-metric pipelines. * *Blaze Tier:* Last 2 hours * *Hot Tier:* Last 28 days * *Cold Tier:* As per your Cold Storage 3. **Retention:** Metrics data is retained for 90 days by default with cold storage for backup. Logs and Traces data is retained for 14 days with cold storage for backup and on-demand rehydration. ## Cold Storage ![Control Plane](/_astro/control-plane-cold-storage.BBvOMXtr_Z6w7zO.webp) For your logs and traces, Last9 currently offers an integration with your AWS S3 bucket to store data older than 15 days. This data will be available for on-demand rehydration to run queries and report on. Read the [Cold Storage](/docs/control-plane-cold-storage/) guide for more details. ## Rehydration Historical logs can be rehydrated for later consumption as needed. You can rehydrate based on a time range filter. Additionally, for live debugging use cases, Last9 performs automatic rehydration of up to 10M log lines when the requested time range is beyond the retention period. Read the [Rehydration](/docs/control-plane-rehydration/) guide for more details. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Create a GCP service account with read-only access > Step-by-step guide to create a GCP service account with read-only access for monitoring ## Objective A service account is required to access GCP environment resources for monitoring. This doc provides step-by-step information on creating a GCP service account with monitoring read-only access. Once you have created the account, share the configuration with the Last9 team so that the monitoring data can be sent to [Last9](https://last9.io/).
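The sections below walk through the Google Cloud Console. If you prefer the CLI, the same account, role grant, and key can be created with `gcloud`; a sketch, assuming `MY_PROJECT_ID` stands in for your project ID:

```bash
# create the read-only service account
gcloud iam service-accounts create last9-monitor \
  --display-name="last9-monitor" \
  --description="Allows Last9 API access to read resource metadata and monitoring data"

# grant it the Monitoring Viewer role on the project
gcloud projects add-iam-policy-binding MY_PROJECT_ID \
  --member="serviceAccount:last9-monitor@MY_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/monitoring.viewer"

# generate the JSON key to share with the Last9 team
gcloud iam service-accounts keys create last9-monitor-key.json \
  --iam-account="last9-monitor@MY_PROJECT_ID.iam.gserviceaccount.com"
```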
## Prerequisites * Sign in to your Google Cloud Console account at [console.cloud.google.com](https://console.cloud.google.com/) * Select the project in which you want to create the service account * Click on the “IAM & Admin” tab in the left navigation menu * Click on the “Service Accounts” tab ![Create Service Account](/_astro/gcp-account-create-service-account-1.Cy7-X-0W_OoW2I.webp) ## Creating Service Account * Click on the “Create Service Account” button * Enter the following details 1. Service Account Name: `last9-monitor` 2. Service Account ID: `last9-monitor` 3. Service Account Description: *Allows Last9 API access to read resource metadata and monitoring data* * Click on the “Create and Continue” button ![Create Service Account Form](/_astro/gcp-service-account-create-form.BVZXqA16_1K2KPa.webp) ## Monitoring Viewer Role Grant permissions to this Service Account with the Role set to `Monitoring Viewer`. ![Monitoring Viewer Role](/_astro/gcp-service-account-monitoring-viewer-role.CX4DjOpV_1Ed4Ku.webp) Grant other users internal to your organization access to this Service Account (optional) ![Add other users optionally](/_astro/gcp-service-account-other-users.kXvAuLga_2avtxx.webp) ## Generate Credentials * Click on the newly created Service Account to view more details ![Click on the Service Account](/_astro/gcp-service-account-list.LR7c1sX2_Z2X8jC.webp) * Create a new Service Account Key ![Create a new Service Account Key](/_astro/gcp-service-account-create-access-key.Bs-ifppC_Edxj4.webp) ![Download the Service Account Key](/_astro/gcp-service-account-create-access-key.Bs-ifppC_Edxj4.webp) * Share the downloaded key with your Last9 team # Creating Dashboards > Learn how to create custom dashboards with panels, visualization types, and query configurations in Last9 ![Last9 Dashboard Add Panel](/_astro/dashboard-add-panel.XZq4R1vQ_1e9VLv.webp) You can create dashboards in Last9 either from scratch or by promoting queries from the Logs and Traces explorers. ## Creating a Dashboard from Scratch 1. Navigate to **Dashboards** in the left sidebar 2. Click **Create Dashboard** 3. Enter a name for your dashboard 4. Click **Add Panel** to add your first visualization 5. Configure the panel query and visualization type (see below) 6. Click **Save** to save the dashboard ## Creating from Service Overview You can add any performance chart from the [Service Overview](/docs/discover-services/) directly to a dashboard without writing a query. The underlying PromQL is extracted automatically. ![Add to Dashboard from Service Overview](/_astro/dashboards-service-overview-add-to-dashboard.B5CHbnIC_Z1RkrXF.webp) 1. Navigate to [Discover > Services](https://app.last9.io/service-catalog) and open a service 2. On the **Overview** tab, hover over any chart panel 3. Click the **⋮** menu in the top-right corner of the panel 4. Select **Add to Dashboard** 5. Choose to add to an existing dashboard or create a new one 6. Provide a panel name and click **Save** The following chart types are supported: | Chart | Visualization added | | ----------------------- | ------------------- | | APDEX Score | Time Series | | Response Time | Time Series | | Availability | Time Series | | Throughput & Error Rate | Time Series | | Error Distribution | Time Series | ## Creating from Logs or Traces Explorer You can promote aggregated queries from the Logs Explorer or Traces Explorer directly into dashboard panels. 1. Build an aggregation query in the [Logs Explorer](https://app.last9.io/logs) or [Traces Explorer](https://app.last9.io/traces) 2.
Click the **Add to Dashboard** button 3. Choose to create a new dashboard or add to an existing one 4. Provide a descriptive panel name 5. You will be redirected to the dashboard with your query added as a panel For detailed instructions on building aggregation queries, see: * [Creating Log Analytics Dashboards](/docs/creating-log-analytics-dashboards-from-logs-explorer/) * [Creating Trace Analytics Dashboards](/docs/creating-trace-analytics-dashboards-from-traces-explorer/) ## Panels Panels are the building blocks of a dashboard. Each panel contains a query and a visualization. ### Adding a Panel 1. Open a dashboard and click **Add Panel** 2. Choose a visualization type from the tabs at the top (Time Series, Bar Chart, Doughnut Chart, Gauge Chart, Stat, or Table) 3. Select a telemetry type (Metrics, Logs, or Traces) 4. Write your query using the appropriate query mode 5. Configure panel options (legend, units, thresholds) 6. Click **Save** to add the panel to your dashboard ### Visualization Types | Type | Description | Best For | | ------------------ | --------------------------------------------------------- | ---------------------------------------------- | | **Time Series** | Line or area charts plotted over time | Monitoring trends, comparing metrics over time | | **Bar Chart** | Vertical or horizontal bar charts, with optional stacking | Comparing categories, distribution analysis | | **Doughnut Chart** | Pie/donut chart showing proportions | Percentage breakdowns, resource allocation | | **Gauge Chart** | Dial indicator showing a single value against a range | Utilization metrics, threshold monitoring | | **Stat** | Single prominent value display | Key metrics at a glance, counters | | **Table** | Tabular data with sorting, filtering, and summaries | Detailed breakdowns, top-N analysis | ### Telemetry Types and Query Modes Each panel supports different query modes depending on the selected telemetry type: **Metrics** panels use PromQL: ```promql rate(http_requests_total{service="api"}[5m]) ``` **Logs** panels support two modes — **Builder** for visual query construction, and **LogQL** for writing queries directly: ```sql sum by (severity) (count_over_time({service="api"} [1m])) ``` **Traces** panels use the **Query Builder** with filter and aggregate stages to query span data. 
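For example, a Traces panel that charts errors per service pairs a filter stage with an aggregate stage. An illustrative sketch, using the same stage syntax as the Traces Explorer examples later in this guide:

```plaintext
Service Name exists
Trace Status = STATUS_CODE_ERROR

count as error_count groupby Service Name as service timeslice 5 minutes
```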
### Query Modes Each query can run in one of two modes: | Mode | Description | Use Case | | ----------- | ----------------------------------------------- | --------------------------------- | | **Range** | Returns data points over a time range | Timeseries, bar charts, heatmaps | | **Instant** | Returns a single data point at the current time | Stat panels, gauge panels, tables | ## Panel Configuration ### Chart Settings These options are available under **Chart Settings** when editing a panel: * **Unit**: Set the unit format for panel values (see [Units](#units) below) * **Legend Type**: **Auto** (uses the query name as the legend label) or **Custom** (lets you define your own legend label) * **Legend Placement**: **Bottom**, **Left**, or **Right** * **Display Type**: **Line** or **Area** (Time Series panels only) ### Bar Chart Options * **Orientation**: **Vertical** or **Horizontal** bars * **Stacked**: Enable stacking to show cumulative values ### Table Settings | Option | Description | | --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Transpose** | Swap rows and columns (max 20 columns when transposed) | | **Density** | **Compact** or **Comfortable** row spacing | | **Summary Footer** | Display a summary row at the bottom | | **Summary Type** | Aggregation for summary: sum, avg, min, max, first, last, count, p50, p90, p95, p99 | | **Thresholds** | Color cells based on their values. Set a value, color, and target (**Text** or **Background**). Higher thresholds take precedence. Click **+ Add** for multiple thresholds | | **Column Visibility** | Show or hide individual table columns | ### Units Apply unit formatting to panel values. Select from the built-in units or type a custom unit: | Unit | Format | | --------------- | ------------------------------------ | | Bytes (IEC) | KiB, MiB, GiB, TiB (factors of 1024) | | Bytes (SI) | KB, MB, GB, TB (factors of 1000) | | Bytes/sec (IEC) | KiB/s, MiB/s, GiB/s | | Bytes/sec (SI) | KB/s, MB/s, GB/s | | Nanoseconds | ns precision | | Milliseconds | ms precision | | Seconds | s precision | You can also enter any custom unit (e.g., `requests/s`, `%`, `ops`) by typing directly into the unit field. ## Layout Panels can be repositioned by dragging the panel header and resized using the handles at the panel edges. ### Sections Add **Section** dividers to organize panels into named groups. Sections act as visual separators with a label, making it easier to navigate dashboards with many panels. You can drag panels between sections and reposition sections themselves by dragging. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Creating Log Analytics Dashboards > Guide to creating Log Analytics Dashboards from aggregated queries in Logs Explorer ## Introduction Creating custom log analytics dashboards in Last9 allows you to visualize and monitor log data through aggregated metrics. This guide explains how to create and promote log queries into dashboard visualizations. ## Starting with Logs Explorer ### Using Editor Mode 1. Navigate to the [Logs Explorer](https://app.last9.io/logs) in Last9 2. Switch to Editor Mode — this enables writing advanced LogQL-compatible queries 3. Write a normal query to explore your data ```sql {service="adservice"} ``` 4.
Convert it into an aggregation query by adding an aggregation function ```sql sum by (severity) (count_over_time({service="adservice"} [1m])) ``` 5. Promote the query to a dashboard by clicking the **Add to Dashboard** button ![Promote to Dashboard](/_astro/logs-add-aggregated-query-to-dashboard.HOKp2FXX_1fciWa.webp) 6. Create a new dashboard or add it to an existing dashboard by adding a unique panel name ![Create Dashboard](/_astro/logs-create-dashboard._vgW065K_1gEaGe.webp) 7. You will be redirected to the new dashboard with your query added as a panel ![Dashboard with Panel](/_astro/logs-dashboard-with-panel.CRAqxXnl_B6s4B.webp) 8. You can edit the panel by clicking the **⋮** button and then the **edit** button ![Edit Panel](/_astro/logs-edit-panel.CACQUdWF_Z73HHO.webp) 9. Add multiple panels to the dashboard by following the same steps as above ### Supported Aggregation Functions Last9 supports several aggregation functions for creating meaningful visualizations: * Time-based aggregations: * **`count_over_time`**: Counts the number of logs over time * **`sum_over_time`**: Sums the values of a numeric field over time * **`avg_over_time`**: Averages the values of a numeric field over time * **`max_over_time`**: Finds the maximum value of a numeric field over time * **`min_over_time`**: Finds the minimum value of a numeric field over time * **`rate`**: Calculates the rate of change of a numeric field over time * Statistical aggregations: * **`sum`**: Sums the values of a numeric field * **`avg`**: Averages the values of a numeric field * **`count`**: Counts the number of logs * **`max`**: Finds the maximum value of a numeric field * **`min`**: Finds the minimum value of a numeric field * **`stddev`**: Calculates the standard deviation of a numeric field * **`median`**: Finds the median value of a numeric field * **`stdvar`**: Calculates the standard variance of a numeric field ## Query Construction Guidelines ### Time Windows Specify time windows using the following formats: * Minutes: **`[1m]`** * Hours: **`[1h]`** * Days: **`[1d]`** ### Query Examples Basic severity-based aggregation: ```sql sum by (severity) (count_over_time({service!="user-service"} [1m])) ``` Complex bucket-based aggregation: ```sql sum by (bucket) (count_over_time({ service="unknown", ingestor="s3", bucket=~"elb-logs", log.file.path=~".*api.*" } [3h])) ``` ## Best Practices ### Time Range Selection * Match window size to query requirements * For instant queries, set time range equal to window size * Consider data retention and query performance when selecting time ranges ### Query Performance * Leverage accelerated queries by including Service or Severity filters * Use specific filters to reduce data scanning * Test queries with smaller time ranges before expanding to larger windows ### Dashboard Organization * Group related visualizations together * Use clear, descriptive titles * Include context in dashboard descriptions * Set appropriate refresh intervals based on data update frequency *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Creating Trace Analytics Dashboards > Guide to creating Trace Analytics Dashboards from aggregated queries in Traces Explorer Creating custom trace analytics dashboards in Last9 allows you to visualize and monitor trace data through aggregated metrics. This guide explains how to create and promote trace queries into dashboard visualizations. 
## Starting with Traces Explorer ### Using Query Builder Mode 1. Navigate to the [Traces Explorer](https://app.last9.io/traces) in Last9 2. Use the **Span** or **Trace** tabs to build your query 3. Add a FILTER stage to narrow down your data ```plaintext Service Name exists Trace Status = STATUS_CODE_ERROR ``` 4. Add an AGGREGATE stage with a timeslice to create a time-series visualization ```plaintext count as _count groupby Service Name as service_name timeslice 15 minutes ``` 5. Click **Run Query** to preview your visualization 6. Promote the query to a dashboard by clicking the **Add to Dashboard** button ![Promote to TraceMetrics Dashboard](/_astro/promote-to-dashboard.Dy0f0EQt_ZAP7F0.webp) 7. Create a new dashboard or add it to an existing dashboard by providing a descriptive panel name ![Create TraceMetrics Dashboard](/_astro/create-dashboard.IAHSz3by_2wj1eb.webp) 8. You will be redirected to the dashboard with your query added as a panel ![New TraceMetrics Dashboard](/_astro/new-dashboard.CFhXChZO_Z1JndlM.webp) 9. Edit the panel by clicking the **⋮** button and selecting **edit** ![Edit TraceMetrics Panel](/_astro/edit-panel.DrvISNLe_WefoA.webp) 10. Add multiple panels to the dashboard by following the same steps ## Query Construction Guidelines ### Aggregation Requirements To create dashboard visualizations from traces: * **timeslice** parameter is required for time-series charts * Use **groupby** to split metrics across different dimensions * Choose appropriate aggregation functions based on your analysis needs ### Common Aggregation Patterns Error rate tracking by service: ```plaintext count as error_count groupby Service Name as service timeslice 5 minutes ``` P99 latency monitoring: ```plaintext quantile(0.99, duration) as p99_latency groupby Service Name as service timeslice 10 minutes ``` Request volume by endpoint: ```plaintext count as request_count groupby http.route as endpoint timeslice 15 minutes ``` ## Supported Aggregation Functions Last9 supports several aggregation functions for creating meaningful visualizations: * Count-based aggregations: * **count**: Counts the number of spans or traces * **count field**: Counts non-null values of a specific field * Statistical aggregations: * **sum**: Sums the values of a numeric field * **avg**: Averages the values of a numeric field * **min**: Finds the minimum value of a numeric field * **max**: Finds the maximum value of a numeric field * **median**: Finds the median value of a numeric field * **stddev**: Calculates the sample standard deviation * **stddev\_pop**: Calculates the population standard deviation * **variance**: Calculates the sample variance * **variance\_pop**: Calculates the population variance * Quantile functions: * **quantile**: Calculates approximate quantile (value between 0 and 1) * **quantile\_exact**: Calculates exact quantile (value between 0 and 1) ## Best Practices ### Time Range Selection * Match timeslice interval to your dashboard’s time range * For real-time monitoring, use shorter intervals (1-5 minutes) * For historical analysis, use longer intervals (15-60 minutes) * Consider query performance when selecting time ranges ### Query Performance * Filter on Service Name for significantly faster query execution * Use **exists** operator for field presence checks * Test queries with smaller time ranges before expanding * Leverage indexed fields for better performance ### Dashboard Organization * Group related trace metrics together * Use clear, descriptive panel titles * Include service context 
in dashboard descriptions * Set appropriate refresh intervals based on monitoring needs * Consider creating separate dashboards for different services or teams ### Visualization Tips * Use groupby to compare metrics across services * Combine error counts with latency metrics for comprehensive monitoring * Create separate panels for different percentiles (p50, p95, p99) * Monitor both successful and failed requests for complete visibility *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Dashboards > Create and manage custom dashboards to visualize metrics, logs, and traces in Last9 Dashboards in Last9 let you build custom visualizations that bring together metrics, logs, and traces in a single view. Use them to monitor application health, track key performance indicators, and investigate issues across your infrastructure. ![Last9 Dashboards List](/_astro/dashboards-list.DYya-tmm_Z1PwTTP.webp) ## Accessing Dashboards Navigate to [**Dashboards**](https://app.last9.io/dashboards) in the left sidebar to view all your dashboards. The dashboards list shows each dashboard’s **Name**, **Author**, and **Last updated** time. Use the search bar to find dashboards by name or author. The **⋮** menu on each row lets you duplicate or delete a dashboard. Click **+ New Dashboard** in the top-right corner to create a new dashboard. ## Telemetry Types Each dashboard panel can query one of the following telemetry types: | Telemetry Type | Query Modes | Use Cases | | -------------- | ------------------------ | --------------------------------------------------------------- | | Metrics | PromQL | Infrastructure monitoring, application metrics, custom counters | | Logs | Builder mode, LogQL mode | Log volume analysis, error tracking, pattern detection | | Traces | Query Builder | Latency monitoring, error rate tracking, request flow analysis | ## What You Can Do * **[Create dashboards](/docs/creating-dashboards/)** from scratch or by promoting queries from the Logs and Traces explorers * **[Use dashboards](/docs/using-dashboards/)** with variables, time controls, and sharing to collaborate with your team *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Delegate Subdomain between two AWS Accounts using Route 53 > Step-by-step guide on how to delegate a Subdomain between two AWS Accounts using Route 53 ## Assumptions 1. AWS Account A is the Primary Account 2. AWS Account B is the Sub Account 3. `example.com` is an arbitrary domain used purely for illustration 4. You have sufficient permissions granted by your AWS admin to add/modify Route 53 records ## Premise To set up `subdomain.example.com` as a hosted zone in **AWS Account B** and extend it for internal usage (e.g., `internal.subdomain.example.com`), you need to delegate authority for the subdomain from **AWS Account A** to **AWS Account B**. This process involves creating a new hosted zone in **Account B** for the subdomain and then updating the parent hosted zone in **Account A** to delegate DNS resolution to the nameservers for the subdomain in **Account B**. ![AWS Route 53 Subdomain Delegation](/_astro/route53-subdomain-soa.My5vooC4_Z2hquGE.webp) ## Step-by-Step Procedure ### Step 1: Create a Hosted Zone for the Subdomain in Account B 1. **Sign in to the AWS Management Console for Account B** 2.
**Create a Hosted Zone**: * Navigate to the Route 53 console * Click on **“Hosted zones”** in the left navigation pane * Click the **“Create hosted zone”** button * Enter `subdomain.example.com` as the domain name * Choose the type as **Public hosted zone** (or **Private hosted zone for Amazon VPC** if it’s for internal usage) * Click **Create hosted zone** 3. **Note the Nameservers**: * After the hosted zone is created, note the nameservers (NS records) provided by Route 53 for the new hosted zone. You will need these nameservers to delegate the subdomain from Account A ### Step 2: Delegate the Subdomain from Account A to Account B 1. **Sign in to the AWS Management Console for Account A** 2. **Navigate to the Hosted Zone for `example.com`**: * Go to the Route 53 console * Click on **Hosted zones** in the left navigation pane * Click on the hosted zone for `example.com` 3. **Create NS Record for the Subdomain**: * Click **Create record** * Choose **Simple routing** and click **Next** * For **Record name**, enter `subdomain` (to delegate `subdomain.example.com`) * Choose **Record type** as **NS - Name Server** * In the **Value** field, enter the nameservers for `subdomain.example.com` provided by Account B * Click **Create records** ### Step 3: Create a Hosted Zone for Internal Usage in Account B 1. **Sign in to the AWS Management Console for Account B** 2. **Create a Hosted Zone for `internal.subdomain.example.com`**: * Navigate to the Route 53 console * Click on **Hosted zones** in the left navigation pane * Click the **Create hosted zone** button * Enter `internal.subdomain.example.com` as the domain name * Choose the type as **Private hosted zone for Amazon VPC** * Select the appropriate VPCs * Click **Create hosted zone** ## Verification **Public Subdomain Delegation**: * You can verify that `subdomain.example.com` is correctly delegated by using the `dig` or `nslookup` commands: ```plaintext dig ns subdomain.example.com ``` **Internal Subdomain Resolution**: * For `internal.subdomain.example.com`, ensure that your VPC’s DNS settings are configured correctly and that Route 53 resolver endpoints are set up if necessary ## Summary 1. **Create a hosted zone for `subdomain.example.com` in Account B** and note the nameservers 2. **Delegate the subdomain** from Account A to Account B by creating an NS record in the `example.com` hosted zone in Account A pointing to the nameservers for `subdomain.example.com` in Account B 3. **(Optional) Create a hosted zone for `internal.subdomain.example.com`** in Account B for internal DNS resolution *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Applications > Monitor and analyze web and mobile application performance, errors, and user behavior from real user sessions across browsers, devices, and networks ![Applications Performance](/_astro/performance.Ds2pSB4Q_5jURX.webp) The Applications feature in Discover is Last9’s Real User Monitoring (RUM) solution for **web and mobile** applications. Install the appropriate SDK and get real-time insight into how your app performs from your actual users’ perspective — Core Web Vitals and JS errors on the web, app startup and screen load times on mobile, plus sessions, errors, and user journeys on both. Unlike synthetic monitoring, Discover Applications captures how your application actually performs for real users across different devices, networks, and locations.
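On the web, collection starts with a single `L9RUM.init()` call. A minimal sketch; `sampleRate` and the `errors` toggles are covered later in this guide, while the remaining ingestion settings come from the [SDK setup](/docs/real-user-monitoring/) and are left as a placeholder comment here:

```typescript
L9RUM.init({
  // ...ingestion settings from the Web SDK setup guide go here
  sampleRate: 1.0, // capture every session while validating the setup
  errors: {
    console: true, // JavaScript errors logged to the console
    global: true,  // unhandled exceptions and promise rejections
    network: true, // failed API requests and resource loads
  },
});
```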
## What you can monitor The Applications UI surfaces different metrics depending on whether the ingested data comes from a web or mobile SDK. ### Performance * **Web**: Core Web Vitals — TTFB, LCP, FCP, CLS, INP. Traffic patterns by path, browser, device, network, geography. * **Mobile**: Cold Start, Warm Start, Screen Load Time. Traffic patterns by screen, OS version, device model, app version. ### Errors * **Web**: JavaScript exceptions, promise rejections, failed network requests, CSP violations. Breakdowns by browser, screen size, network type, device type, geography. * **Mobile**: Unhandled exceptions, native crashes, Flutter framework errors, ANRs (Android). Breakdowns by OS version, device brand/model, app version, network type. ### Sessions Analyze user journeys, session duration, and engagement. Track how users navigate — page paths on web, screen transitions on mobile — and identify drop-off points. ### User interaction insights (web) Capture clicks, scroll depth, keyboard navigation, form focus/input, and touch gestures with optional element text, labels, and ARIA metadata. Correlate frustrating flows with errors and performance. ## Key Benefits * **Identify Real Issues**: See problems that actually affect your users, not just lab conditions * **Prioritize Optimizations**: Focus on pages and issues that impact the most users * **Track Improvements**: Measure the impact of performance optimizations and bug fixes over time * **Environment Comparison**: Compare performance across different deployment environments * **Version Analysis**: Understand how new releases affect user experience * **Real-Time Filtering**: Filter by pages, user attributes, environments, and versions for targeted analysis ## Setup and Usage ## [Install a RUM SDK](/docs/real-user-monitoring/) [Web (JavaScript), Android, iOS, React Native, and Flutter SDKs](/docs/real-user-monitoring/) ## [Performance](/docs/discover-applications-performance/) [Core Web Vitals on web, app startup and screen load times on mobile](/docs/discover-applications-performance/) ## [Errors](/docs/discover-applications-errors/) [JS exceptions and failed requests on web, native crashes and ANRs on mobile](/docs/discover-applications-errors/) ## [Sessions](/docs/discover-applications-sessions/) [User journeys, session duration, and engagement patterns across platforms](/docs/discover-applications-sessions/) *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Application Errors > Monitor errors, exceptions, ANRs, and failed requests. Detailed breakdowns by browser, device, OS version, app version, and network conditions. ![Applications Errors](/_astro/errors.DYeY8Fcu_ZteQA2.webp) The Errors tab in [Discover Applications](https://app.last9.io/applications) helps you identify, analyze, and prioritize errors that affect your users. Error sources and breakdown dimensions depend on whether the data is from a web or mobile SDK. ## Error Collection ### Web The JavaScript SDK automatically captures errors from multiple sources: * **Console errors**: JavaScript errors logged to the console * **Global errors**: Unhandled exceptions and promise rejections * **Reporting Observer**: Browser-reported issues like CSP violations * **Network failures**: Failed API requests and resource loading errors Each error includes HTTP details (for network failures), resource timing, session context, and any global attributes you’ve configured. 
Control which error sources are enabled: ```typescript L9RUM.init({ // ... errors: { console: true, global: true, report: true, network: true, ignorePatterns: [/ResizeObserver/i], beforeSend: (event) => { // Filter or enrich errors if (event.attributes["exception.message"]?.includes("ExpectedNoise")) { return null; // Drop this error } return event; }, }, }); ``` Explicitly track handled exceptions: ```typescript try { await searchProducts(query); } catch (error) { L9RUM.captureError(error, { handled: true, message: "search_query_timeout", query, page: 3, }); } ``` Manual captures default to `handled: true` and pass through the same `beforeSend` hook as automatic errors. ### Mobile The mobile SDKs capture: * **Unhandled exceptions** — uncaught throws on Android (Kotlin/Java), uncaught Swift errors on iOS, unhandled JS errors on React Native, uncaught Dart errors on Flutter * **Promise rejections** — unhandled `Promise` rejections on React Native * **Flutter framework errors** — wiring via `FlutterError.onError`, `PlatformDispatcher.onError`, and `runZonedGuarded` (see [Flutter unhandled error forwarding](/docs/real-user-monitoring/flutter/#unhandled-error-forwarding)) * **ANRs (Android only)** — Application Not Responding events when the main thread is blocked past the configured threshold * **Manual captures** — `L9Rum.captureError(e, context)` on every platform See each SDK’s “Capture errors” reference: * [Android — Capture errors](/docs/real-user-monitoring/android/#capture-errors) * [iOS — Capture errors](/docs/real-user-monitoring/ios/#capture-errors) * [React Native — Capture errors](/docs/real-user-monitoring/react-native/#capture-errors) * [Flutter — Capture errors](/docs/real-user-monitoring/flutter/#capture-errors) ## Accessing Error Monitoring 1. Navigate to [Discover > Applications](https://app.last9.io/applications) in Last9 2. Click the **Errors** tab 3. Choose your version and environment from the top filters 4. Set your desired time range using the date picker ## Dashboard Components ### Errors Over Time Track error frequency throughout your selected time period. The chart shows: * Error trends and patterns over time * Spikes that might correlate with deployments or traffic increases * The option to **Include all views** to show both error and success rates for context * Visual distinction between successful pageviews (green) and errors (red) ### Top Paths / Top Screens by Error Count * **Web**: top URL paths ranked by error count. * **Mobile**: top screens (`screen.name`) ranked by error count. Prioritize bug fixes on the entries with the highest error impact. ## Error Breakdown Analysis The dashboard offers multiple breakdown tables. The available breakdown dimensions depend on the platform — the app surfaces a different set for web vs mobile. ### Web breakdowns * **Errors by Browser** — Chromium, Firefox, Safari, Edge, other. Identifies compatibility issues, polyfill needs, engine-specific behaviors. * **Errors by Platform** — Windows, macOS, iOS, Android (browser platform attribute). * **Errors by Screen Size** — Large (≥ 1024px), Medium (< 1024px), Small (< 768px). Exposes responsive-layout and mobile-web-only issues. * **Errors by Network Type** — WiFi, 4G, 3G, 2G, Unknown. Timeouts and loading failures often cluster on slower connections. * **Errors by Device Type** — Mobile, Desktop, Tablet, Unknown. * **Errors by Error Type** — grouped by `exception.type`. * **Errors by City / Country** — geographic concentration. 
### Mobile breakdowns * **Errors by OS Version** — isolate OS-specific regressions (e.g. a crash that only reproduces on Android 14). * **Errors by Device Brand** — Samsung, Apple, Google, Xiaomi, etc. * **Errors by Device Model** — specific model identifier. Often the finest-grained handle on device-specific bugs. * **Errors by App Version** — cross-reference spikes with releases. * **Errors by Network Type** — WiFi, cellular, offline. * **Errors by Error Type** — grouped by `exception.type`. For Android, ANRs surface here distinctly from exceptions. * **Errors by City / Country** — geographic concentration. ## Detailed Error Investigation ![Applications Error Details](/_astro/error-details.SK7ip4pL_Z1Er0sk.webp) ### Individual Error Analysis Click on any error entry to view detailed information: Each error provides: * **Timestamp**: when the error occurred * **View ID**: unique identifier for the view (page view on web, screen view on mobile) * **Path / Screen**: URL where the error happened (web) or screen name (mobile) * **Browser / OS**: user’s browser (web) or OS name + version (mobile) * **Screen Size / Device Model**: device screen category (web) or device model (mobile) * **Network Type**: connection type when error occurred * **App Version** (mobile): release version the user was on ### Exception Details * **Exception timing**: How long after page start the error occurred * **Full stack trace**: Complete error stack with file names and line numbers * **Error context**: Surrounding code and variable states when available * **Trace link**: Click “Trace” to view the complete user session This detailed information helps developers identify the exact source of errors, understand the user context when errors occur, reproduce issues by following the user’s journey, and prioritize fixes based on error frequency and impact. ## Custom Events in Error Context When you open the error details sidepanel, Last9 also displays any custom events that occurred within the same session view — not just the error itself. This lets you reconstruct what the user was doing immediately before and after the error fired. Custom events are tracked via `L9RUM.addEvent()` on web and `L9Rum.addEvent()` on mobile, and are distinct from errors — they carry no exception or stack trace, but they do carry the attributes you attach. Examples include button clicks, checkout steps, feature flag evaluations, or any business action you want to correlate with failures. 
```typescript L9RUM.addEvent("checkout_attempted", { step: "payment", plan: "pro", amount: 99, }); ``` ### How custom events differ from errors | | Errors (`captureError`) | Custom events (`addEvent`) | | ---------------------------- | ------------------------------- | ----------------------------------------- | | **Purpose** | Capture exceptions and failures | Track user actions and business events | | **Stack trace** | Yes | No | | **Appears in error count** | Yes | No | | **Shown in error sidepanel** | As the primary event | As contextual events within the same view | | **Passes `beforeSend` hook** | Yes | No | ### Tracking custom events Call `L9RUM.addEvent()` with a name and an optional attributes object at any point in your application: ```typescript // Track a business action L9RUM.addEvent("payment_method_selected", { type: "credit_card" }); // Track a feature interaction L9RUM.addEvent("search_performed", { query_length: query.length, result_count: results.length, }); // Track a navigation step L9RUM.addEvent("onboarding_step_completed", { step: 3, step_name: "invite_team", }); ``` Events inherit session and view context automatically. Any attributes set via `L9RUM.spanAttributes()` are also attached, so you do not need to repeat user or organization context on each event call. ## Filtering and Grouping Same filtering capabilities as the Performance tab: * **Path / Screen filtering**: focus on specific pages (web) or screens (mobile) * **Attribute filtering**: filter by browser/OS, device, network, app version, or custom attributes * **Time-based filtering**: analyze errors within specific time windows * **Environment filtering**: compare error rates across deployments ## Best Practices * **Monitor Error Trends**: Watch for error spikes after deployments * **Focus on High-Impact Errors**: Prioritize errors affecting many users or critical paths * **Correlate with Performance**: Slow pages often have higher error rates * **Browser Testing**: Pay attention to browser-specific error patterns * **Mobile-First Debugging**: Mobile users often experience unique error conditions * **Set Error Budgets**: Define acceptable error rates and alert when exceeded * **Regular Review**: Check error patterns weekly to catch regressions early *** ## Troubleshooting * **No errors showing?** Confirm error toggles in the `errors` configuration are enabled and `sampleRate` is appropriate for your traffic volume. * **Manual captures missing context?** The SDK automatically includes session and user context. Additional attributes passed to `captureError()` are merged with SDK-provided fields. * **Network failures missing?** Check browser DevTools for CORS issues—failed preflights prevent the SDK from capturing response details. * **Custom events not appearing in sidepanel?** Custom events are scoped to the same view as the error. If `L9RUM.addEvent()` is called outside an active view (for example, before the first page navigation completes), the event may not be associated with a view and will not appear. Ensure SDK initialization is complete before calling `addEvent()`. Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Application Performance > Track Core Web Vitals on web and app startup / screen load times on mobile. Analyze traffic patterns and optimize performance using real user data. 
![Applications Performance](/_astro/performance.Ds2pSB4Q_5jURX.webp) The Performance tab in [Discover Applications](https://app.last9.io/applications) shows loading-speed, traffic, and responsiveness metrics for your app. The surfaced metrics depend on platform: **Core Web Vitals** for web apps, **app startup and screen-load times** for mobile apps. ## Accessing Performance Monitoring 1. Navigate to [Discover > Applications](https://app.last9.io/applications) in Last9 2. Select the Performance tab (active by default) 3. Choose your version and environment from the top filters 4. Set your desired time range using the date picker ## Dashboard Components ### Views / Screens Over Time Track traffic volume across your selected time range. The chart is labeled **Views Over Time** for web apps (each view = a page load) and **Screens Over Time** for mobile apps (each screen = a screen/activity/view-controller transition). Use it to: * Identify peak usage times and traffic spikes * Detect unusual drops that might indicate an outage or release regression * Correlate performance with traffic volume ### Top Paths / Top Screens The “Top 10” list shows the most-trafficked entities: * **Web**: top URL paths (`path` attribute) by view count * **Mobile**: top screens (`screen.name` attribute) by load count Prioritize optimization on the entries that drive the most traffic. ### Web Vitals (web only) For web apps, Core Web Vitals are surfaced as cards with P75, P90, and P99 percentile values — TTFB, LCP, FCP, CLS, INP. See [Core Web Vitals Explained](#core-web-vitals-explained) below. ### Mobile Vitals (mobile only) For mobile apps, three span-duration metrics are surfaced at P75: * **Cold Start** — time from app process launch to first usable frame. Captured as the `AppStart` span with `start.type=cold`. * **Warm Start** — time from foreground resume to usable frame (app process already running). `AppStart` span with `start.type=warm`. * **Screen Load Time** — time to render each screen after navigation. Captured from the screen lifecycle `Created` span. ### Geographic Map (web only) A map of where your traffic originates, overlaid with performance buckets by region. Currently rendered only for web apps — mobile SDKs don’t emit geo-level data yet. ## Grouping and Filtering ### Group by options Click the Group by button to segment your data. Available options depend on platform: **Web:** * **Browser** — Chrome, Firefox, Safari, etc. * **Platform** — Windows, macOS, iOS, Android (browser platform) * **Screen Size** — desktop vs mobile vs tablet * **Network Type** — 4G, WiFi, etc. * **Device Type** — mobile vs desktop * **Error Type** * **City**, **Country** **Mobile:** * **OS Version** — specific Android/iOS version * **Device Brand** — Samsung, Apple, etc. * **Device Model** * **App Version** * **Network Type** * **Error Type** * **City**, **Country** ### Path and Attribute Filtering Use the search bar to filter by specific pages (or screens) or user attributes. The operators below are available for `path` on web; on mobile, use the same operators against `screen.name` or `view.name`. 
#### Path Filtering (web) / Screen Filtering (mobile) Filter by URL paths using these operators: * `=` (equals): Exact path match * `exists`: Pages where the path field is present * `not exists`: Pages missing path information * `starts with`: Paths beginning with specific text * `ends with`: Paths ending with specific text * `contains`: Paths containing specific text * `does not contain`: Paths excluding specific text Example: `path starts with /api` to analyze API endpoint performance #### Attribute Filtering Filter by collected attributes including: **Page Attributes:** * `origin`: Page origin (protocol + hostname) * `page.hash`: Current page hash * `page.hostname`: Current page hostname * `page.route`: Current page route/path * `page.search`: Current page search/query string * `page.url`: Current page URL * `url.path`: Current page path * `path`: Folded path (for navigation) **Web Vitals Attributes:** For each web vital metric (CLS, FCP, LCP, TTFB, INP), these attributes are available: * `web_vital.<metric>.id`: Unique ID for the metric * `web_vital.<metric>.value`: Value of the metric * `web_vital.<metric>.timestamp`: When the metric was recorded * `web_vital.<metric>.rating`: Rating (‘good’, ‘needs-improvement’, ‘poor’) * `web_vital.<metric>.delta`: Delta value * `web_vital.<metric>.entries_count`: Number of entries * `web_vital.navigation_type`: Navigation type (‘reload’, ‘navigate’) ## Core Web Vitals Explained (web) ### Time To First Byte (TTFB) * **Definition**: Measures server responsiveness—the time between a user’s request and when the first byte of response arrives * **Performance Thresholds**: * 🟢 Good: ≤ 800ms * 🟡 Needs Improvement: ≤ 1.8s * 🔴 Poor: > 1.8s * **What It Means**: High TTFB indicates server-side performance issues like slow database queries, inefficient processing, or network latency * **Optimization Focus**: Server performance, database optimization, CDN usage, caching strategies ### Largest Contentful Paint (LCP) * **Definition**: Tracks when the main content becomes visible—specifically when the largest content element finishes rendering * **Performance Thresholds**: * 🟢 Good: ≤ 2.5s * 🟡 Needs Improvement: ≤ 4s * 🔴 Poor: > 4s * **What It Means**: LCP directly impacts perceived loading performance.
Users judge speed based on when they see main content * **Optimization Focus**: Image optimization, lazy loading, critical resource prioritization, server response times ### First Contentful Paint (FCP) * **Definition**: Measures when users first see any content—the time until the first text, image, or element appears * **Performance Thresholds**: * 🟢 Good: ≤ 1.8s * 🟡 Needs Improvement: ≤ 3s * 🔴 Poor: > 3s * **What It Means**: FCP indicates how quickly users perceive your page is starting to load * **Optimization Focus**: Critical CSS, font loading strategies, eliminating render-blocking resources ### Cumulative Layout Shift (CLS) * **Definition**: Quantifies visual stability by measuring unexpected layout shifts during page loading * **Performance Thresholds**: * 🟢 Good: ≤ 0.1 * 🟡 Needs Improvement: ≤ 0.25 * 🔴 Poor: > 0.25 * **What It Means**: High CLS creates frustrating experiences when elements move unexpectedly as users try to interact * **Optimization Focus**: Size attributes for images/videos, space reservation for dynamic content ### Interaction to Next Paint (INP) * **Definition**: Measures interface responsiveness by tracking time between user interactions and the next visual update * **Performance Thresholds**: * 🟢 Good: ≤ 200ms * 🟡 Needs Improvement: ≤ 500ms * 🔴 Poor: > 500ms * **What It Means**: INP affects how responsive your application feels to user interactions * **Optimization Focus**: JavaScript optimization, reducing main thread work, efficient event handlers ## Mobile Vitals Explained (mobile) ### Cold Start * **Definition**: Time from app process launch to the first usable frame. This is what users experience when they tap the app icon after it has been fully killed (or after a reboot / first install). * **Captured as**: span named `AppStart` with attribute `start.type=cold`. The span’s duration is the cold-start time. * **What it means**: High cold-start times typically point to slow app initialization — heavy work on the main thread during `Application.onCreate()` / `AppDelegate didFinishLaunching` / Flutter engine startup. * **Optimization focus**: defer non-critical SDK initialization, lazy-load heavy dependencies, audit work done during `onCreate` / app delegate launch. ### Warm Start * **Definition**: Time from app foreground resume to the first usable frame when the app process is already running in the background. * **Captured as**: `AppStart` span with `start.type=warm`. * **What it means**: Warm start should be significantly faster than cold start (typically under 1s). If warm start is slow, foregrounding logic is doing too much work. * **Optimization focus**: minimize work in `onResume` / `applicationWillEnterForeground` / `AppLifecycleState.resumed`. ### Screen Load Time * **Definition**: Time to render a screen after navigating to it. * **Captured as**: span emitted by the screen lifecycle (`Activity.onCreate` on Android, view controller lifecycle on iOS, `NavigatorObserver.didPush` on Flutter, `onStateChange` on React Navigation). Duration = time from navigation intent to first frame rendered. * **What it means**: Slow screens indicate heavy work in view construction, synchronous data fetches, or expensive layout computation. * **Optimization focus**: move blocking data fetches off the critical path, show skeleton UI, avoid layout thrashing. 
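Since each mobile vital is just the duration of a span, it can help to picture the underlying data. The sketch below shows the shape of a cold-start measurement; only the span name (`AppStart`) and the `start.type` attribute come from the definitions above, while the remaining fields are standard OTel span fields with illustrative values:

```typescript
// Illustrative shape only: the span name and start.type attribute are
// documented above; the timestamps are example values.
const coldStartSpan = {
  name: "AppStart",
  attributes: { "start.type": "cold" },
  startTime: "2024-05-01T09:00:00.000Z", // app process launch
  endTime: "2024-05-01T09:00:01.250Z", // first usable frame rendered
};
// The span's duration (1.25s here) is the cold-start time that feeds
// the P75 Cold Start card.
```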
## Best Practices * **Regular Monitoring**: Check performance metrics weekly to catch regressions early * **Focus on High-Impact Pages**: Prioritize optimization on pages with high traffic and poor performance * **Monitor All Percentiles**: P99 shows what your worst-performing users experience * **Correlate with Deployments**: Use version filtering to understand how releases affect performance * **Set Performance Budgets**: Define acceptable thresholds for each metric and monitor against them * **Use Grouping**: Segment by browser, device, or network to identify specific user experience issues *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # RUM Session Correlation > Propagate RUM session IDs end-to-end across backend services and async workers so you can filter logs, traces, and spans by browser session. RUM session correlation lets you take a session ID from the client and trace it all the way through your backend — across HTTP service calls, async queues, and workers. Once set up, you can filter logs, traces, and spans by session ID anywhere in your stack. ## How It Works L9RUM sends a W3C `baggage` header alongside `traceparent` on every outgoing request. Backend services extract this baggage, attach it to spans and logs, and forward it to any downstream services — including async message queues. ```plaintext Browser (L9RUM) ── traceparent + baggage: session.id=abc ──► API Service ├── span attribute: session.id=abc ├── log field: session.id=abc └── SQS message attribute: baggage=session.id=abc └──► Worker Service ├── span attribute: session.id=abc └── log field: session.id=abc ``` ## Prerequisites * L9RUM SDK initialized with `network.backendCorrelation.enabled: true` * Backend services instrumented with OpenTelemetry (Node.js guides: [Express](/docs/integrations/frameworks/javascript/expressjs/), [NestJS](/docs/integrations/frameworks/javascript/nestjs/), [Node.js](/docs/integrations/languages/nodejs/)) ## Setup 1. ### Configure L9RUM Enable baggage propagation and add `session.id` to the allowed keys. Set the session ID value once the SDK initializes. ```javascript L9RUM.init({ baseUrl: "YOUR_BASE_URL", headers: { clientToken: "YOUR_CLIENT_TOKEN" }, resourceAttributes: { serviceName: "your-frontend-app", deploymentEnvironment: "production", }, network: { backendCorrelation: { enabled: true, injectToAllRequests: true, baggage: { enabled: true, allowedKeys: ["session.id"], }, }, }, }); // Set the session ID — use any stable identifier for this browser session L9RUM.spanAttributes({ "session.id": getYourSessionId(), }); ``` L9RUM will include `baggage: session.id=<value>` on every `fetch` and `XHR` request from that point on. 2. ### Configure Backend Services Each backend service that receives requests from the browser (or from another service that forwarded the baggage) needs three additions to its OTel setup: 1. `W3CBaggagePropagator` — parses the `baggage` header on incoming requests and forwards it on outgoing calls 2. `BaggageSpanProcessor` — promotes baggage entries to span attributes so they appear in traces 3.
`logHook` on `WinstonInstrumentation` — injects baggage entries into every Winston log record automatically, alongside `trace_id` and `span_id` Update `instrumentation.ts` / `instrumentation.js` in each service: * TypeScript ```typescript import { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator, } from '@opentelemetry/core'; import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'; import { BatchSpanProcessor, SpanProcessor, Span } from '@opentelemetry/sdk-trace-base'; import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'; import { registerInstrumentations } from '@opentelemetry/instrumentation'; import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'; import { context, propagation } from '@opentelemetry/api'; import type { Context } from '@opentelemetry/api'; // Promotes all baggage entries to span attributes class BaggageSpanProcessor implements SpanProcessor { onStart(span: Span, parentContext: Context): void { const baggage = propagation.getBaggage(parentContext ?? context.active()); if (!baggage) return; for (const [key, entry] of baggage.getAllEntries()) { span.setAttribute(key, entry.value); } } onEnd(): void {} forceFlush(): Promise<void> { return Promise.resolve(); } shutdown(): Promise<void> { return Promise.resolve(); } } const provider = new NodeTracerProvider(); provider.addSpanProcessor(new BaggageSpanProcessor()); provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter())); provider.register({ propagator: new CompositePropagator({ propagators: [ new W3CTraceContextPropagator(), new W3CBaggagePropagator(), ], }), }); registerInstrumentations({ instrumentations: [ getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-fs': { enabled: false }, // Runs after trace_id/span_id are injected — adds baggage to every log record '@opentelemetry/instrumentation-winston': { logHook: (_span, record) => { const baggage = propagation.getBaggage(context.active()); if (!baggage) return; for (const [key, entry] of baggage.getAllEntries()) { record[key] = entry.value; } }, }, }), ], }); ``` * JavaScript ```javascript const { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator, } = require('@opentelemetry/core'); const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node'); const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base'); const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http'); const { registerInstrumentations } = require('@opentelemetry/instrumentation'); const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node'); const { context, propagation } = require('@opentelemetry/api'); // Promotes all baggage entries to span attributes class BaggageSpanProcessor { onStart(span, parentContext) { const baggage = propagation.getBaggage(parentContext ??
context.active()); if (!baggage) return; for (const [key, entry] of baggage.getAllEntries()) { span.setAttribute(key, entry.value); } } onEnd() {} forceFlush() { return Promise.resolve(); } shutdown() { return Promise.resolve(); } } const provider = new NodeTracerProvider(); provider.addSpanProcessor(new BaggageSpanProcessor()); provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter())); provider.register({ propagator: new CompositePropagator({ propagators: [ new W3CTraceContextPropagator(), new W3CBaggagePropagator(), ], }), }); registerInstrumentations({ instrumentations: [ getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-fs': { enabled: false }, // Runs after trace_id/span_id are injected — adds baggage to every log record '@opentelemetry/instrumentation-winston': { logHook: (_span, record) => { const baggage = propagation.getBaggage(context.active()); if (!baggage) return; for (const [key, entry] of baggage.getAllEntries()) { record[key] = entry.value; } }, }, }), ], }); ``` Once registered, OTel handles propagation automatically: * **Incoming requests**: the `baggage` header is parsed and stored in the active context * **Outgoing HTTP calls**: the `baggage` header is forwarded to downstream services * **Every Winston log line**: baggage entries (e.g. `session.id`) are injected alongside `trace_id` and `span_id` via `logHook` — no changes to your logger or middleware needed 3. ### Propagate Through SQS When a backend service publishes to SQS, it must inject the current context (including baggage) into the message attributes. The consumer extracts it before processing. SQS allows up to **10 MessageAttributes** per message. `traceparent`, `tracestate`, and `baggage` count as 3 toward this limit. **Producer — inject on send** ```typescript import { propagation, context } from '@opentelemetry/api'; import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs'; const sqs = new SQSClient({}); async function sendMessage(queueUrl: string, body: object) { const carrier: Record<string, string> = {}; propagation.inject(context.active(), carrier); // injects traceparent, tracestate, baggage const messageAttributes: Record<string, { DataType: string; StringValue: string }> = {}; for (const [key, value] of Object.entries(carrier)) { messageAttributes[key] = { DataType: 'String', StringValue: value }; } await sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: JSON.stringify(body), MessageAttributes: messageAttributes, })); } ``` **Consumer — extract on receive** ```typescript import { propagation, context, trace, SpanKind } from '@opentelemetry/api'; const tracer = trace.getTracer('worker'); async function processMessage(message: { MessageAttributes?: Record<string, { StringValue?: string; stringValue?: string }> }) { const carrier: Record<string, string> = {}; for (const [key, attr] of Object.entries(message.MessageAttributes ?? {})) { // Handle both Lambda ESM format (stringValue) and SDK format (StringValue) const value = attr.stringValue ?? attr.StringValue; if (value) carrier[key] = value; } const parentCtx = propagation.extract(context.active(), carrier); await context.with(parentCtx, async () => { const baggage = propagation.getBaggage(context.active()); const sessionId = baggage?.getEntry('session.id')?.value; await tracer.startActiveSpan('process_message', { kind: SpanKind.CONSUMER }, async (span) => { if (sessionId) span.setAttribute('session.id', sessionId); // ...
processing logic span.end(); }); }); } ``` **Caution**: When Lambda receives SQS messages via Event Source Mapping (ESM), message attribute keys use lowercase (`stringValue`, `dataType`) instead of the SDK’s PascalCase (`StringValue`, `DataType`). The extraction code above handles both. **SNS → SQS** When messages flow through SNS before reaching SQS, inject baggage on the SNS publish call the same way as the SQS producer above. SNS forwards `MessageAttributes` to subscribed SQS queues unchanged, so the consumer extraction code works without modification. ## Verification 1. Open the browser, perform an action that triggers a backend request 2. In Last9, open a trace for that request — the root span should have a `session.id` attribute 3. Find a downstream span (auth service, internal API) — it should also carry `session.id` 4. If using SQS, find a worker span — `session.id` should appear there too 5. Filter logs by `session.id` to see all log lines across services for a single browser session *** ## Troubleshooting * Services without `W3CBaggagePropagator` registered will silently drop the `baggage` header. Every service in the call chain needs it. * Background jobs and queue consumers that originate independently (no browser session upstream) will have no `session.id`. Always handle the `undefined` case in your logging middleware. * Keep baggage lean. Each key in `allowedKeys` is sent on every outgoing browser request. The W3C spec recommends staying well under 8 KB total. Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Application Sessions > Analyze user journeys, session duration, and engagement patterns to understand how users navigate through your application. ![Applications Sessions](/_astro/sessions.C0HillRG_vzyuB.webp) The Sessions tab in [Discover > Applications](https://app.last9.io/applications) provides detailed insights into user behavior patterns, session duration, and navigation paths. Understand how users interact with your application and identify drop-off points or problematic user flows. ## Understanding Sessions and Views ### What is a Session? A session represents a period of continuous user activity. The scoping rules differ by platform: **Web** * **Tab-based**: each browser tab creates a separate session, even for the same user * **Maximum duration**: sessions automatically end after 4 hours. If a user is actively browsing when the 4-hour limit is reached, a new session begins automatically * **Independent tracking**: multiple tabs from the same user are tracked as separate sessions **Mobile** * **App-install-based**: one session per app install at a time — not per tab. Multiple app launches within the idle window join the same session. * **Idle timeout**: sessions roll over when the app stays backgrounded past the SDK’s idle threshold * **Maximum duration**: sessions also roll over when they exceed the configured max duration, matching web behavior ### What is a View? A view represents a discrete user-facing screen within a session. **Web** * A view is a page load. Only actual page navigation creates new views — query parameter changes (e.g. `/products?category=shoes` → `/products?category=shirts`) do not create new views.
**Mobile** * A view is a screen transition: Activity lifecycle on Android, view-controller lifecycle on iOS, `NavigatorObserver` push on Flutter, React Navigation state change on React Native. Screen name comes from the SDK’s automatic tracking or an explicit `startView(name)` call. ## Accessing Session Analysis 1. Navigate to [Discover > Applications](https://app.last9.io/applications) in Last9 2. Click the **Sessions** tab 3. Choose your version and environment from the top filters 4. Set your desired time range using the date picker ## Session Overview Table The main interface displays a list of user sessions with key metrics: * **Session Start Time**: when the session began * **Session ID**: unique identifier * **User**: user associated with the session (set via `identify()`) * **Duration**: total time from session start to last activity * **Views**: pages viewed (web) or screens opened (mobile) * **Views with Errors**: count of views that experienced errors * **First / Last** — the entry and exit point of the session: * **Web**: URL path (`path` attribute) * **Mobile**: screen name (`view.name` attribute) This table helps identify long-duration (engaged) sessions, short (bounce) sessions, error-prone sessions, and common entry/exit patterns. Custom attributes added via `L9RUM.spanAttributes()` (web) or `L9Rum.spanAttributes()` (mobile) become available as filters for isolating sessions by organization, workspace, or other business dimensions. ## Individual Session Details Click on any session to view detailed information about the user’s journey: ### Session Metadata Each session provides context appropriate to its platform: **Web** * **Browser**: Chrome, Firefox, Safari, etc. * **Operating System**: macOS, Windows, Linux, iOS, Android (browser platform) * **IP Address**: user’s network location * **Geolocation**: country, region, city (when available) * **Session Duration**: total time from first to last page view **Mobile** * **OS**: Android / iOS name and version * **Device**: brand and model (e.g. Samsung Galaxy S23, iPhone 15 Pro) * **App Version**: release version the user was on * **Network Type**: WiFi, cellular, offline * **Session Duration**: total time from session start to last activity ### Page View Timeline See the complete user journey with: * **Timestamp**: Exact time of each page view * **Errors**: Number of errors encountered on each page * **Event Type**: View events and navigation patterns The session detail view provides two filtering tabs for analyzing the timeline: * **All Views**: Shows every page view in the session, including both successful and error views (indicated by the number badge showing total views) ![Applications Sessions — All Views](/_astro/sessions-all-views.DQXRzuDM_ZcfJQE.webp) * **Views with Errors**: Filters to show only the page views that experienced errors (indicated by the number badge showing error count), helping you focus on problematic interactions ![Applications Sessions — Views with Errors](/_astro/sessions-views-with-errors.Cm6rO-FY_Z1DNn70.webp) ### View Details Panel Click on any view within a session to see per-view detail.
**Web view details** * **View Information**: view ID, time spent, query parameters * **Page Attributes**: host, path * **Browser Attributes**: screen width/height, browser name and version, OS platform, touch support, network type, user agent * **Web Vitals**: TTFB, LCP, FCP, CLS (per-view values) **Mobile view (screen) details** * **View Information**: view ID, time spent on screen, screen name * **App Attributes**: app version, build ID, installation ID * **Device Attributes**: OS name and version, device brand and model, network type * **Screen Load Time**: duration from navigation intent to first frame (see [Performance — Mobile Vitals](/docs/discover-applications-performance/#mobile-vitals-mobile-only)) ### Backend Trace Correlation Every outgoing HTTP request from the SDK includes a W3C `traceparent` header so the backend span shares a traceId with the client. The session timeline surfaces these trace links, letting you: * Follow requests end-to-end from client to backend services * Identify whether delays come from client, network, or backend * Verify trace headers are being sent to the correct services Configuration per platform: * **Web** — enable `network.backendCorrelation`. See [Web RUM SDK — Backend Trace Correlation](/docs/real-user-monitoring/web/#backend-trace-correlation). * **Mobile** — `networkInstrumentation: true` is sufficient; `traceparent` is injected automatically by the OkHttp interceptor (Android), URLProtocol (iOS), fetch/XHR wrapper (React Native), or HttpOverrides (Flutter). Session ID propagation requires enabling `baggage` — see the [Session Correlation](/docs/discover-applications-session-correlation/) doc. ### Custom Events in Timeline Events tracked with `L9RUM.addEvent()` (web) or `L9Rum.addEvent()` (mobile) appear in the session timeline alongside views and errors. Events automatically include session and view context, helping you understand how user actions relate to performance and error patterns. **Searchable Event Attributes:** Custom event attributes are fully searchable, enabling you to filter sessions based on specific event data. For example: * **E-commerce**: Filter by `attributes['merchant_id']`, `attributes['cart_value']`, or `attributes['payment_method']` * **SaaS Applications**: Search by `attributes['workspace_id']`, `attributes['feature_flag']`, or `attributes['subscription_tier']` * **Content Platforms**: Query using `attributes['content_id']`, `attributes['category']`, or `attributes['user_role']` This eliminates the need to emit duplicate log entries for custom event searchability—query span events directly in the RUM interface. ## Filtering and Search Use the search functionality to filter sessions by: **Path Filtering:** * Sessions that visited specific pages * Entry or exit path patterns * Navigation through particular flows **Attribute Filtering:** * Browser or device type * Network conditions * Geographic regions (country, city, etc.) 
* Business dimensions via `L9RUM.spanAttributes()` (tenant, workspace, feature flags) * **Custom event attributes** from `L9RUM.addEvent()` (e.g., `attributes['merchant_id']`, `attributes['path']`, custom business metrics) * Session duration ranges * Error counts **Time-Based Filtering:** * Sessions within specific time windows * Peak usage periods * Post-deployment sessions * Maintenance window impacts ## Best Practices * **Track Critical User Journeys**: Monitor important conversion flows and business-critical paths * **Investigate High-Error Sessions**: Sessions with multiple errors often reveal systemic issues * **Analyze Bounce Patterns**: Short sessions with single page views might indicate problems * **Monitor Engagement Metrics**: Track average session duration and pages per session trends * **Correlate with Performance**: Compare session quality with Web Vitals performance * **Regular Pattern Review**: Check session patterns weekly for unusual behavior or trends * **User Experience Validation**: Use session data to validate UX design decisions * **Maintain Context**: Refresh `L9RUM.spanAttributes()` on route changes so filters stay accurate *** ## Troubleshooting * **Custom events not appearing?** Confirm `L9RUM.addEvent()` is called after SDK initialization and event names are non-empty. * **Trace links missing?** Verify `network.backendCorrelation.enabled` is `true` and target APIs accept trace headers (check DevTools for CORS issues). * **Missing geolocation data?** The SDK enriches data when the GeoJS service is reachable. Network restrictions may prevent enrichment. Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Exceptions > Monitor, investigate, and resolve application exceptions across all your services with detailed context, correlated traces, and AI-powered analysis The Exceptions feature in Discover gives you a unified view of all application exceptions across your services. Instead of checking each service individually, you can see every exception type, its frequency, severity, and the service it originates from in a single table. Drill into any exception to view correlated traces, request context, and logs for fast root cause analysis. ![Exceptions List View](/_astro/exceptions-list-view.D-gJi4n7_Z28ivQD.webp) ## Prerequisites To use the Exceptions feature, ensure you have the following integrations configured: **Required:** * **Traces**: Distributed tracing data is mandatory for exception detection and correlation. Configure OpenTelemetry or other tracing instrumentation for your applications. [See all traces integrations](https://app.last9.io/integrations?category=traces). **Optional:** * **Logs**: Application logs provide additional troubleshooting context when investigating exceptions. Configure log forwarding from your applications. [See all logs integrations](https://app.last9.io/integrations?category=logs). ## Understanding the Exceptions Dashboard Access the Exceptions dashboard at [Discover > Exceptions](https://app.last9.io/exceptions) in Last9. The dashboard displays all detected exceptions in a sortable table. Each row represents a unique combination of exception type, service, and operation. 
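Exceptions are detected from your trace data (see Prerequisites above); with OpenTelemetry instrumentation they typically arrive as exception events on error-status spans. If a service you expect to see here is missing, the sketch below shows one way to record an exception manually using the OpenTelemetry JS API. The service name, span name, and failure are illustrative, and auto-instrumentation normally does this for you:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// Illustrative names: swap in your own service and operation.
const tracer = trace.getTracer("checkout-service");

async function chargeCard(amount: number): Promise<void> {
  await tracer.startActiveSpan("POST /charge", async (span) => {
    try {
      if (amount <= 0) throw new Error("invalid amount"); // stand-in for real work
    } catch (err) {
      // The recorded exception's class is what surfaces as the Exception Type,
      // and the error status marks the span as exception-bearing.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```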
### Table Columns | Column | Description | | ------------------ | ------------------------------------------------------------------------------------------------- | | **Exception Type** | The exception class or error code (e.g., `HttpError`, `TypeError`, `ECONNREFUSED`, `errorString`) | | **Service** | The service where the exception originated, shown with a language/runtime icon | | **Operation** | The span name or API endpoint where the exception occurred (e.g., `POST`, `View`, `sql:query`) | | **Operation Type** | The span kind: Server, Client, Internal, Producer, or Consumer | | **Count** | Total number of occurrences in the selected time range. Default sort is by count descending | | **Severity** | Color-coded badge based on occurrence count | | **Last Seen** | How recently the exception last occurred (e.g., “3m ago”, “0s ago”) | ### Severity Levels Severity is automatically assigned based on the exception count within the selected time range: | Severity | Count Threshold | Badge Color | | ------------ | --------------------- | ----------- | | **Critical** | 1,000+ occurrences | Red | | **High** | 100 - 999 occurrences | Orange | | **Medium** | 10 - 99 occurrences | Yellow | | **Low** | 1 - 9 occurrences | Blue | Click any column header to sort the table by that column. Click an exception row to open the detail panel. ## Filtering Exceptions The left sidebar provides filters to narrow down the exceptions list. **Default filters:** * **Service**: Filter by one or more services * **Exception Type**: Filter by specific exception classes * **Severity**: Filter by severity level (Critical, High, Medium, Low) **Dynamic label filters:** Additional filter sections appear based on the labels present in your trace data. Common examples include `process_runtime_name`, `process_runtime_version`, `telemetry_sdk_language`, and `segment`. These vary by organization and environment depending on the attributes your instrumentation reports. 1. Expand any filter category in the left sidebar 2. Select one or more values 3. Click **Apply Filters** to update the table 4. Click **Clear** to reset all filters Additional controls at the top of the page: * **Environment selector**: Switch between deployment environments (e.g., production, staging) * **Date picker**: Set the time range for exception data ## Exception Details Click any exception row to open the detail panel on the right side of the page. ![Exception Details - Traces](/_astro/exception-details-traces.FMpCM2eP_2iDHp3.webp) The detail panel header shows: * **Exception type** with severity badge * **Service name** * **Operation name** * **Occurrences count** for the selected time range * **Occurrences chart**: A time series showing exception frequency over time, useful for spotting spikes and correlating with deployments Use the up/down arrows in the top-right corner to navigate between exceptions without closing the panel. ### Traces The **Traces** tab lists all correlated traces where this exception occurred. Each trace row shows: | Column | Description | | ----------------------- | ------------------------------------------------ | | **Start Time** | Timestamp of the trace | | **Trace ID** | Unique identifier for the distributed trace | | **Operation & Service** | The operation name and originating service | | **Duration** | Total trace duration | | **Kind** | Span kind (Internal, Server, Client, etc.) 
| | **Type & Status** | Shows “Error” status for exception-bearing spans | Click any trace row to navigate to the full distributed trace view for detailed span-level analysis. ### Context The **Context** tab shows trace attributes from the exception span and its root span, including HTTP request details, user information, environment data, and custom attributes. ![Exception Details - Context](/_astro/exception-details-context.eflrIeyn_Z1DdyWC.webp) Context sections include: * **HTTP Request**: Method, URL, route, status code, user agent, request/response size * **User Context**: User ID, email, session ID (when available) * **Device & Environment**: Device type, OS, browser, IP addresses * **Database**: Database system, name, query, operation * **gRPC**: RPC system, service, method, status code * **Custom Attributes**: Any additional trace attributes your application reports All values are copyable with a single click. ### Logs The **Logs** tab shows logs correlated with the exception, pre-filtered by service name and exception type. ![Exception Details - Logs](/_astro/exception-details-logs.DTDW5PxL_Z13wyq7.webp) Features include: * **Index selector**: Choose between default, physical, or rehydration log indexes * **Pre-populated filters**: Automatically set to `service = <service name>` and `body contains all <exception type>` * **Log volume chart**: Visual representation of log volume over time * **View in Logs**: Opens the full [Logs Explorer](/docs/logs-explorer/) with the same filters applied You can add or modify filters directly in the Logs tab to refine your search. ## Adaptive Alerts Adaptive alerts automatically detect unusual spikes in exception frequency using a statistical deviation model, reducing false positives compared to static threshold-based alerts. You can enable adaptive alerts for any exception directly from the detail panel. ### Enabling Adaptive Alerts 1. Click any exception row to open the detail panel 2. Click **Manage Adaptive Alerts** in the top-right corner 3. Toggle the switch to enable adaptive alerting for this exception 4. The alert rule is created automatically with an optimized configuration based on the exception’s historical patterns When you enable an adaptive alert, Last9 automatically: * Creates an alert rule named `exception_alert_for_<service>_<operation>_<exception_type>` * Monitors the `exception_count` indicator for this specific exception * Uses the exception’s trend data to establish a baseline * Detects anomalous spikes based on statistical deviation from the baseline ### Managing Alert Rules Once enabled, you can manage the alert rule like any other Last9 alert: * **View alert rules**: Navigate to [Alerting > Monitor](https://app.last9.io/alerting/monitor) to see all active alert rules * **Set notification channels**: Configure where alerts are sent at [Alerting > Notification Channels](https://app.last9.io/alerting/notification-channels) * **Inspect alerts**: When an alert fires, click the inspect link in the notification to open the exception detail panel with the relevant time range pre-selected Alert rules are grouped by environment and exception, making it easy to manage alerts across multiple services and operations. ## AI-Powered Exception Analysis If the AI Assistant is enabled for your organization, the **Auto-fix Exception** button appears in the detail panel.
![AI-Powered Exception Analysis](/_astro/exception-ai-autofix.RZlX3tLO_nzikE.webp) Clicking it opens the AI-Powered Exception Resolution modal with two options: | Option | Description | Requirement | | ---------------- | ------------------------------------------------------------------------------------- | ----------------------------------------- | | **Analyze Only** | Get AI-generated insights and recommendations without making any changes to your code | AI Assistant enabled | | **Auto-Fix** | An agent analyzes the exception, creates a fix PR, and deploys automatically | AI Assistant + [Agents](/agents/) enabled | ## Investigating Exceptions Follow this workflow to efficiently triage and resolve exceptions: 1. **Spot high-impact exceptions**: Sort by **Count** (default) or filter by **Critical** / **High** severity to focus on the most frequent exceptions 2. **Open the detail panel**: Click an exception row to view the occurrences chart and identify when spikes occurred 3. **Check correlated traces**: Use the **Traces** tab to examine individual traces and understand the execution path leading to the exception 4. **Review request context**: Switch to the **Context** tab to see HTTP request details, user information, and custom attributes for additional debugging clues 5. **Search related logs**: Use the **Logs** tab to find detailed error messages and stack traces around the time of the exception 6. **Set up alerts**: Click **Manage Adaptive Alerts** to get notified of future spikes for this exception 7. **Use AI analysis**: If available, click **Auto-fix Exception** to get AI-powered insights or an automated fix ## Best Practices **Prioritization:** * Focus on Critical and High severity exceptions first, as they represent the highest volume errors * Pay attention to exceptions with recent **Last Seen** timestamps, especially “0s ago”, indicating actively occurring issues * Monitor `ECONNREFUSED` and network-related exceptions as they often indicate infrastructure problems **Monitoring Strategy:** * Review the Exceptions dashboard after every deployment to catch newly introduced errors * Use the time range selector to compare exception counts before and after a release * Set up adaptive alerts for critical exception types to get proactive notifications **Troubleshooting:** * Start with the **Traces** tab to understand the execution flow and identify where the exception originates * Use the **Context** tab to check if exceptions are tied to specific users, endpoints, or request patterns * Pivot to the **Logs** tab for detailed stack traces and error messages * For service-specific exception analysis, navigate to the service’s Exceptions tab under [Discover > Services](/docs/discover-services/) *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Hosts > Monitor and analyze infrastructure performance across all your hosts with comprehensive system metrics The Hosts feature in Discover provides comprehensive infrastructure monitoring, delivering real-time visibility into system performance across all your hosts. Monitor CPU usage, memory consumption, storage capacity, and detailed system metrics to optimize resource allocation, identify performance bottlenecks, and maintain healthy infrastructure across your entire environment. 
![Hosts Heatmap Overview](/_astro/hosts-heatmap-overview.DN61Hksx_n2zp9.webp) This infrastructure monitoring solution helps you proactively identify resource constraints, track system health trends, and ensure optimal host performance for your applications and services. ## Prerequisites To monitor your host infrastructure with Last9, you need to configure at least one of the following data collection integrations: **Required (Choose at least one):** * **Host Metrics**: Core system metrics collection for CPU, memory, disk, and network monitoring. Configure the [Host Metrics integration](https://app.last9.io/integrations?category=all\&search_term=hostmetrics) to collect infrastructure metrics via OpenTelemetry collectors * **Kubernetes Operator** (recommended for Kubernetes deployments): Comprehensive Kubernetes monitoring including host-level metrics. Configure the [Kubernetes Operator](https://app.last9.io/integrations?integration=kubernetes-operator) for Kubernetes environments * **Kubernetes Cluster Monitoring**: Alternative Kubernetes monitoring solution that includes host metrics collection. Set up [Kubernetes Cluster Monitoring](https://app.last9.io/integrations?integration=kubernetes-monitoring) for cluster-wide infrastructure monitoring You can use any combination of these integrations based on your infrastructure setup. For Kubernetes environments, the Kubernetes Operator is the recommended choice as it provides the most comprehensive monitoring capabilities. ## Understanding the Hosts Dashboard Access the Hosts dashboard at [Discover > Hosts](https://app.last9.io/hosts) in Last9. The Hosts dashboard provides two visualization modes: **List View** and **Map View** (heatmap). Toggle between views using the view selector in the top-right corner. Your preference is automatically saved for future sessions. ### Map View (Heatmap) The Map View displays your hosts as a heatmap visualization, providing an at-a-glance overview of infrastructure health across your entire environment. 
**Key features:** * **Color-coded health**: Each cell represents a host, with colors indicating health status (green for healthy, yellow/orange for warning, red for critical) * **Health status summary**: Cards at the top show counts of Healthy, Warning, and Critical hosts with threshold definitions * **Quick identification**: Instantly spot problematic hosts in large infrastructure deployments #### Group By Use the **Group by** dropdown to organize hosts into logical groups: | Option | Description | | ----------------- | ------------------------------------------------------------------------------------ | | **Job** | Groups hosts by their collection job (e.g., prometheus-node-exporter, node-exporter) | | **Health Status** | Groups hosts by their current health state (Healthy, Warning, Critical) | | **None** | Displays all hosts in a single flat grid without grouping | | **Custom Labels** | Groups by any label attached to your hosts (e.g., environment, region) | #### View By Use the **View by** dropdown to change which metric determines each host’s color: | Option | What it shows | | ---------- | ------------------------------------------------- | | **CPU** | Colors based on CPU utilization percentage | | **Memory** | Colors based on memory usage percentage | | **Disk** | Colors based on root volume disk usage percentage | The health thresholds for coloring depend on the selected metric: **Health Thresholds:** | Metric | Healthy | Warning | Critical | | ------ | ------- | ------- | -------- | | CPU | < 70% | 70-90% | ≥ 90% | | Memory | < 70% | 70-90% | ≥ 90% | | Disk | < 80% | 80-95% | ≥ 95% | ![Hosts Heatmap Tooltip](/_astro/hosts-heatmap-tooltip.scirJ_ej_1gFWP9.webp) **Hover details**: Mouse over any cell to see detailed host information including: * Host ID and IP address * Current health status * Associated job * Resource usage (CPU, Memory, Disk) with visual progress bars ### List View ![Hosts Overview](/_astro/hosts-table.C3ljbIDf_1vUtq5.webp) The List View displays all monitored infrastructure in a unified table with key performance indicators at a glance: * **Host ID**: Unique identifier for each monitored host * **Host IP**: Network address of the host * **Job**: Associated collection job * **Uptime**: How long the host has been running * **CPU**: Current CPU utilization with visual indicators * **Memory**: RAM usage showing used/total capacity * **Root Volume**: Primary disk usage percentage Use the filtering capabilities to focus on specific hosts: 1. Click on any column header to sort hosts by that metric 2. Use the search box to filter by host ID or IP address 3. Select multiple hosts using the checkboxes for bulk analysis 4. Toggle between “ALL” and “NONE” to quickly select or deselect all hosts Color-coded metrics help identify hosts requiring attention - green indicates normal operation while red suggests potential issues that need investigation. ## Analyzing Individual Hosts Click on any host to access comprehensive performance data and system analysis. 
![Host Detail Overview](/_astro/hosts-detail-overview.DZgWoeiE_1N4Km.webp) ### Overview The Overview tab provides high-level resource utilization dashboards with essential system metrics: * **Resource Summary:** View current CPU utilization, memory consumption, root volume usage, and network throughput at a glance * **Performance Charts:** Track CPU usage, memory consumption, and storage device usage over time with detailed graphs * **Host Metadata:** Essential configuration details including Host IP, uptime duration, instance type, container information, availability zone, and system architecture ### Metrics The Metrics tab offers comprehensive system performance analytics with detailed monitoring capabilities: ![Host Detail Metrics](/_astro/hosts-detail-metrics.9l_52VS4_Z2uP8rw.webp) **Core System Metrics:** * **CPU Usage**: Processor utilization tracking over time for performance optimization * **Memory Usage**: RAM consumption patterns with available memory monitoring for capacity planning * **Storage Device Usage**: Disk utilization for mounted volumes and storage performance analysis * **Network Bandwidth Usage**: Network I/O rates and throughput monitoring for connectivity analysis **Advanced System Metrics:** * **System Load**: System load averages indicating overall system stress and resource demand * **Disk R/W Data**: Read/write operations and throughput rates for storage performance optimization * **Disk R/W Time**: I/O operation latency and timing analysis for identifying storage bottlenecks * **Disk IOps Completed**: Input/output operations per second for storage performance monitoring * **Time Spent Doing I/Os**: Time spent on disk operations for I/O efficiency analysis * **Network Sockstat**: Network socket statistics and connection monitoring for network health * **Open File Descriptor/Context Switches**: System-level resource usage for process management analysis ## Best Practices **Choosing the Right View:** * Use **Map View** for daily health checks and quick infrastructure status reviews * Use **List View** when you need to compare specific metrics across hosts or sort by performance * Start with Map View to identify problem areas, then switch to List View for detailed investigation **Infrastructure Monitoring Strategy:** * Regularly review host performance to identify trends and potential capacity issues before they impact applications * Monitor both individual host metrics and overall infrastructure health patterns * Use color-coded indicators to quickly identify hosts requiring immediate attention * Set up systematic monitoring schedules to track infrastructure health over time **Resource Optimization:** * Use historical CPU and memory data to plan for capacity expansion and optimize resource allocation * Monitor disk I/O patterns to identify storage bottlenecks and optimize disk usage * Track network bandwidth utilization to plan for network capacity and identify connectivity issues * Analyze system load trends to understand resource demand patterns and optimize workload distribution **Performance Analysis:** * Establish baseline performance ranges for your hosts to quickly identify anomalies and performance degradation * Monitor advanced metrics like file descriptor usage and context switches to identify system-level bottlenecks * Use storage device metrics to optimize disk allocation and identify potential hardware issues * Correlate network metrics with application performance to understand infrastructure impact on service delivery **Troubleshooting Workflow:** 
* Start with the Overview tab to identify resource utilization anomalies and system health issues * Use the Metrics tab for detailed performance analysis and trend identification * Monitor system load and I/O metrics to identify infrastructure bottlenecks * Analyze network statistics to understand connectivity and throughput issues *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Jobs > Monitor and analyze background jobs, scheduled tasks, and asynchronous operations across your infrastructure The Jobs feature in Discover provides comprehensive monitoring for background jobs, scheduled tasks, queue processing, and asynchronous operations across your application infrastructure. Track job execution performance, error rates, processing duration, and queue health to ensure reliable background processing and maintain optimal system performance. ![Job Overview](/_astro/jobs-operation.DT8wJ7Hr_1KE0aC.webp) This background job monitoring solution helps you identify processing bottlenecks, monitor queue backlogs, detect failed jobs, and optimize task execution across your entire job processing infrastructure. ## Prerequisites To monitor background jobs and scheduled tasks with Last9, configure the following integrations: **Required:** * **Traces**: Distributed tracing data is mandatory for job discovery, execution tracking, and operation-level analysis. Configure OpenTelemetry or other tracing instrumentation for your job processing systems. [See all traces integrations](https://app.last9.io/integrations?category=traces). **Optional but Recommended:** * **Logs**: Application and job processing logs provide detailed execution context and error information. Configure log forwarding from your job runners and processing systems. [See all logs integrations](https://app.last9.io/integrations?category=logs). * **Infrastructure Metrics**: Container and host metrics for job processing infrastructure. Set up [Docker](https://app.last9.io/integrations?category=docker), [Kubernetes](https://app.last9.io/integrations?category=kubernetes), or cloud monitoring for resource tracking during job execution. Without traces, the Discover Jobs feature will have limited functionality. Logs and infrastructure metrics enhance troubleshooting capabilities and provide deeper operational context. ## Understanding the Jobs Dashboard Access the Jobs dashboard at [Discover > Jobs](https://app.last9.io/jobs) in Last9. ![Jobs Overview](/_astro/jobs-overview.CbxwxgC8_JmcE0.webp) The Jobs dashboard displays all monitored background jobs and scheduled tasks in your environment with key performance indicators. The dashboard provides two viewing modes controlled by the **Group by Service** toggle. * **Default View (Ungrouped)**: The dashboard shows individual jobs with their performance metrics. 
* **Service**: The service or application running the job * **Job Name**: Specific job identifier or task name * **Queue Name**: Job queue or processing system (when applicable) * **Throughput**: Jobs processed per minute (RPM) * **Error Rate**: Percentage of failed job executions * **Duration (P95)**: 95th percentile job execution time * **Grouped View**: When **Group by Service** is enabled, jobs are organized hierarchically by service, allowing you to: * **Expand/Collapse Services**: Click the arrow icons to show or hide jobs within each service * **Job-Level Details**: View individual job performance within the service context Use the sidebar filters to focus on specific job types or services: 1. Select filter categories from the left sidebar (`process_runtime_name`, `process_runtime_version`, `telemetry_sdk_language`) 2. Choose specific values to filter the jobs list 3. Click **Apply Filters** to update the view 4. Use **Clear** to reset all applied filters ## Analyzing Individual Jobs Click on any job name to access detailed performance monitoring and execution analysis. ![Job Overview](/_astro/jobs-operation.DT8wJ7Hr_1KE0aC.webp) ### Overview The Overview tab provides comprehensive job performance dashboards with key execution metrics: **Performance Metrics:** * **Availability**: Job execution success rate and reliability tracking * **Response Time**: Execution duration with P50, P95, P99, and AVG percentiles * **Throughput & Error Rate**: Job processing volume and failure rates over time * **Error Distribution**: Breakdown of error types and their frequency during job execution **Key Performance Analysis:** * **Top 10 Errors**: Most frequent error types and their occurrence counts for prioritizing fixes ### Exceptions Monitor job failures and execution errors: * **Error Trend Visualization**: Track error frequency over time with trend analysis for different error types * **Exception Type Filtering**: Filter by specific exception classes and error types that occur during job execution * **Operation-Level Error Analysis**: Identify which specific job operations are generating the most errors * **Error Count Tracking**: Monitor total error occurrences for each exception type to prioritize troubleshooting efforts ### Breakdown Analyze job execution performance by individual operations. The Breakdown tab shows detailed operation-level metrics: * **Response Time Visualization**: Area chart showing P50, P95, P99, and AVG response times with color-coded percentile bands * **Operation Performance Table**: Detailed metrics for each job operation including: * **Operation Name**: Specific job operation or database query * **Operation Type**: Classification (Database, HTTP Client, Consumer, etc.) * **Avg. Calls/Transaction**: Average number of operations per job execution * **Response Time (P95)**: 95th percentile execution time * **Total Time Spent**: Cumulative time spent on the operation ### Related Logs Access job-specific logs for troubleshooting: * **Pre-filtered Logs**: Automatically filtered to the selected job with service context * **Log Volume Indicator**: Visual representation of log activity over time * **Time Range Alignment**: Logs correspond to the selected monitoring time window * **Search and Filter**: Use the search bar to find specific log entries or filter by attributes * Click on any log line to view more details ### Related Traces Examine distributed traces for job execution.
The Traces tab provides detailed execution flow analysis: * **Trace Filtering**: Filter traces by service name, span name, and span kind (Consumer, Internal, etc.) * **Execution Timeline**: View job execution traces with start times, trace IDs, and duration * **Operation Details**: Examine specific operations and services involved in job execution * Click on any trace or span to view distributed tracing visualization and more details ## Best Practices **Job Monitoring Strategy:** * Focus on jobs with high error rates or extended execution times in the main dashboard * Use the grouped view to understand service-level job health and identify problematic services * Monitor both individual job performance and overall job processing throughput **Performance Optimization:** * Use the Breakdown tab to identify slow operations within job execution * Monitor database queries and external API calls that may be bottlenecks * Track resource usage patterns to optimize job scheduling and concurrency * Analyze execution time trends to validate the impact of job optimizations **Troubleshooting Workflow:** * Start with the Overview tab to identify performance anomalies and error spikes * Use the Exceptions tab to understand specific error patterns and their frequency * Examine the Breakdown tab for operation-level performance issues * Access Logs for detailed execution context and error messages * Use Traces to understand job execution flow and identify bottlenecks in distributed processing **Queue Management:** * Monitor throughput trends to identify processing capacity issues * Track error rates to detect systemic problems with job processing * Use duration metrics to optimize job execution and resource allocation * Analyze job scheduling patterns to balance system load and processing efficiency *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Kubernetes > Monitor Kubernetes clusters, deployments, and pods with comprehensive resource utilization and performance tracking The Kubernetes feature in Discover provides comprehensive monitoring for your Kubernetes infrastructure, delivering real-time visibility into cluster health, resource utilization, and workload performance. Monitor deployments, pods, and individual containers with detailed metrics for CPU, memory, network, and storage across your entire Kubernetes environment. ![Deployment Overview](/_astro/k8s-deployment.CbN7Fohy_1I78Rw.webp) This infrastructure monitoring solution helps you optimize resource allocation, identify performance bottlenecks, and maintain healthy Kubernetes workloads across all environments. ## Prerequisites To monitor your Kubernetes infrastructure with Last9, you need to configure the appropriate data collection integrations: **Required:** * **Kubernetes Cluster Monitoring**: Core metrics collection for cluster, node, deployment, and pod monitoring. Configure the [Kubernetes Cluster Monitoring](https://app.last9.io/integrations?integration=kubernetes-monitoring) integration to collect infrastructure metrics. **Optional but Recommended:** * **Kubernetes Logs**: Application and system logs from pods and containers for troubleshooting context. Set up the [Kubernetes Logs](https://app.last9.io/integrations?integration=Last9+Otel+Collector+Setup+for+Kubernetes) integration for comprehensive log collection. * **Kubernetes Events**: Cluster events and state changes for operational insights. 
Enable the [Kubernetes Events](https://app.last9.io/integrations?integration=Kubernetes+Events) integration to track pod scheduling, resource constraints, and cluster events. Without the core Kubernetes Cluster Monitoring integration, the Discover Kubernetes feature will not function. The additional integrations enhance troubleshooting capabilities and provide deeper operational context. ## Understanding the Kubernetes Dashboard Access the Kubernetes dashboard at [Discover > Kubernetes](https://app.last9.io/kubernetes) in Last9. ![Kubernetes Overview](/_astro/k8s-overview.CXOUNZKZ_Z213CdO.webp) The main dashboard provides two primary views for monitoring your Kubernetes infrastructure: * **Deployments View:** * **Name**: Deployment name and associated container images * **Cluster**: Kubernetes cluster hosting the deployment * **Namespace**: Kubernetes namespace organization * **Replicas**: Current, ready, and desired replica counts (e.g., 1/1/1 showing current/ready/desired) * **CPU Usage**: Current CPU utilization with resource requests and limits * **Memory Usage**: Current memory usage with allocated limits * **Pods View:** * **Name**: Pod identifier with deployment association * **Deployment**: Parent deployment name * **Status**: Running, Pending, Failed, or other states * **Cluster**: Kubernetes cluster location * **Namespace**: Kubernetes namespace organization * **Ready/Restarts**: Container readiness and restart count * **CPU/Memory**: Current resource utilization Use the left sidebar filters to focus on specific clusters, deployments, or namespaces: 1. Select filter categories from the sidebar (Cluster, Deployment, Namespace) 2. Choose specific values to narrow your view 3. Click **Apply Filters** to update the display 4. Use **Clear** to reset all applied filters ## Analyzing Individual Deployments Click on any deployment name to access detailed performance monitoring and resource analysis. ![Deployment Overview](/_astro/k8s-deployment.CbN7Fohy_1I78Rw.webp) ### Overview The deployment detail page provides comprehensive performance dashboards with key operational metrics: * **Performance Summary:** View replica health, CPU and memory utilization percentages, and restart count indicators that signal deployment stability. * **Resource Performance Charts:** Track CPU and memory allocation efficiency comparing resource requests, limits, and actual usage over time. Click on `All Metrics` to view more. * **Deployment Metadata:** Essential configuration details including cluster, namespace, and pod count for operational context. ### Metrics The Metrics tab offers detailed resource utilization analytics: * **Pod Health Monitoring:** Visual timeline showing pod availability and analysis of pod lifecycle events and termination reasons. * **Resource Utilization Analysis:** Detailed CPU and memory performance tracking with requests, limits, and usage trends, plus storage performance monitoring for data-intensive applications. ### Pods The Pods tab shows all pods belonging to the specific deployment, filtered from the main dashboard view, with additional performance visualizations: * **Pod Status Timeline**: Visual representation of pod health and availability over time * **Pod Termination Breakdown**: Analysis of why pods have terminated or restarted Click on any pod name to open detailed pod analysis in a side panel with the same information available in the individual pod analysis section.
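When the dashboard surfaces a restart loop or a gap between requested and actual resource usage, it can help to cross-check the same figures directly against the cluster. A minimal sketch using standard kubectl commands; the deployment, pod, label, and namespace names below are placeholders, and `kubectl top` assumes metrics-server is installed in the cluster:

```shell
# Cross-check dashboard figures against live cluster state.
# All names below are placeholders for your own workloads.
kubectl get deployment my-app -n production        # current vs. desired replicas
kubectl get pods -n production -l app=my-app       # status, readiness, restart counts
kubectl top pods -n production                     # live CPU/memory usage (needs metrics-server)
kubectl describe pod my-app-abc123 -n production   # requests, limits, last termination reason
```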
## Analyzing Individual Pods Click on any pod name to examine detailed pod-level performance and resource consumption. ![Pod Overview](/_astro/k8s-pod._DADITys_Z116JdK.webp) ### Overview * **Pod Health Summary:** Monitor pod uptime, restart frequency, and resource allocation efficiency showing current CPU and memory usage versus requests. * **Resource Performance Charts:** Track resource requests, limits, and actual consumption patterns over time for optimization insights. Click on `All Metrics` to view more. * **Pod Metadata:** Configuration details including cluster, namespace, node assignment, pod IP, and parent deployment. ### Metrics * **Container Health Analysis:** Individual container health patterns, restart analysis, and out-of-memory event detection for resource optimization. * **Resource Performance Monitoring:** Per-container CPU and memory consumption, throttling analysis, storage performance, and disk I/O monitoring. * **Network Performance:** Pod network performance tracking and connectivity issue detection. ## Best Practices **Resource Optimization:** * Monitor the gap between resource requests/limits and actual usage to identify over-provisioned deployments for cost savings * Use CPU and memory data to right-size allocations and identify containers hitting their limits * Track CPU throttling patterns and disk I/O bottlenecks for performance tuning **Troubleshooting:** * Track restart counts and OOM events to identify unstable applications * Use deployment-level monitoring for performance analysis, pod-level for detailed troubleshooting * Review pod age and restart patterns to identify stability issues **Monitoring Strategy:** * Use namespace filtering to organize by team or application boundaries * Analyze resource usage patterns for capacity planning and autoscaler configuration * Monitor resource trends over time to identify gradual performance degradation *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Map > Visualize service dependencies and trace request flows across your distributed architecture The Map feature in Discover provides a real-time topology view of your distributed architecture, showing how services communicate with each other and where performance issues originate. Each service node displays key metrics — throughput, error rate, and P95 latency — so you can assess the health of your entire system at a glance. ![Map Overview](/_astro/map-overview.Bpng1TFq_Z27VAVH.webp) Use the Map to understand service dependencies, trace error propagation across services, and quickly identify bottlenecks during incident response. ## Prerequisites To use the Map feature, you need distributed tracing configured for your services: **Required:** * **Traces**: Distributed tracing data is mandatory for automatic service discovery and dependency mapping. Configure OpenTelemetry or other tracing instrumentation for your applications. [See all traces integrations](https://app.last9.io/integrations?category=traces). Without traces, the Map cannot discover services or their relationships. Services appear on the Map automatically once trace data is received. ## Understanding the Map View Access the Map at [Discover > Map](https://app.last9.io/map) in Last9. 
### Service Nodes Each service appears as a node on the Map displaying three key metrics: | Metric | Description | | -------------- | ------------------------------------------------ | | **Throughput** | Requests per minute (rpm) handled by the service | | **Error Rate** | Number of failed requests per minute (rpm) | | **P95** | 95th percentile response time | ### Connection Lines Directed arrows between nodes represent service-to-service communication. The arrow points from the caller to the downstream dependency, showing the direction of request flow. ### Health Indicators The Map uses color coding to surface problems quickly: | Indicator | Meaning | | -------------------------- | --------------------------------------------------------------- | | **Green border** | Service is healthy with no errors | | **Red border** | Service is experiencing errors | | **Red error rate text** | Error rate is non-zero, displayed in red for visibility | | **Red dotted connection** | Errors are occurring on requests between the connected services | | **Green solid connection** | Requests between services are healthy | ![Map with Error Indicators](/_astro/map-error-indicators.Cuy7rwfl_24S4v5.webp) In the example above, the red borders on services and the dotted red connection line between them indicate active errors flowing through that path. ## Filtering the Map Use the toolbar at the top of the Map to filter and focus the view: **Environment**: Select the environment to display (e.g., **production**, **staging**) using the **Environment** dropdown. **Service status filters**: | Filter | Description | | ---------- | ------------------------------------------- | | **All** | Show all discovered services | | **Errors** | Show only services with active errors | | **Slow** | Show only services with high response times | **Time range**: Adjust the time window using the time picker in the top-right corner. The Map displays service metrics aggregated over the selected period. ## Interacting with Services Click on any service node to open a context menu with quick actions: ![Service Context Menu](/_astro/map-node-context-menu.DgciGEsR_2jneeC.webp) | Action | Description | | ------------------------ | ------------------------------------------------------------------------------------------------------------ | | **Focus Dependencies** | Isolate the selected service and its direct upstream and downstream dependencies, dimming unrelated services | | **View Service Details** | Navigate to the full [Service Details](/docs/discover-services/) page for in-depth performance analysis | | **View Traces** | Open the Traces explorer pre-filtered to traces involving this service | | **View Logs** | Open the Logs explorer pre-filtered to logs from this service | ## Navigating the Map Use the controls in the bottom-left corner or standard input gestures to navigate: * **Zoom in / out (+/-)**: Adjust the zoom level. You can also pinch to zoom on a trackpad or use the scroll wheel with a mouse. * **Fit to screen**: Reset the view to fit all services in the viewport * **Pan**: Drag on the canvas background to move around the Map The minimap in the bottom-right corner shows your current viewport position within the full Map. 
## Best Practices **Incident Response:** * Start with the **Errors** filter to see which services are affected * Use **Focus Dependencies** on the service closest to the error source to understand the blast radius * Follow red dotted connection lines to trace error propagation across services * Drill into traces and logs from the context menu for root cause analysis **Architecture Review:** * Periodically review the full Map to understand how your architecture has evolved * Identify services with unusually high fan-out (many downstream dependencies) as potential single points of failure * Look for services with high P95 latency that may be bottlenecks for upstream callers **Performance Monitoring:** * Use the **Slow** filter to identify services with degraded response times * Compare throughput across connected services to spot capacity imbalances * Monitor error rates on critical paths between high-traffic services *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Discover Services > Monitor and analyze service performance, operations, and dependencies across your infrastructure The Services feature in Discover serves as Last9’s comprehensive Application Performance Monitoring (APM) solution, providing deep visibility into your application’s health, performance, and user experience. Monitor critical APM metrics including throughput, response times, error rates, and APDEX scores across all your services, with the ability to drill down into individual operations, traces, and dependencies. ![Services Overview](/_astro/service-overview.4iNnSrDS_13AYCh.webp) This full-featured APM platform helps you proactively identify performance bottlenecks, track application health trends, and maintain optimal user experience across your entire service architecture. ## Prerequisites To fully utilize the Services APM capabilities, ensure you have the following integrations configured: **Required:** * **Traces**: Distributed tracing data is mandatory for service discovery, dependency mapping, and operation-level analysis. Configure OpenTelemetry or other tracing instrumentation for your applications. [See all traces integrations](https://app.last9.io/integrations?category=traces). **Optional:** * **Logs**: Application and infrastructure logs provide detailed troubleshooting context. Configure log forwarding from your applications and infrastructure. [See all logs integrations](https://app.last9.io/integrations?category=logs). * **Infrastructure Metrics**: Container and host metrics enable infrastructure monitoring. Set up [Docker](https://app.last9.io/integrations?category=docker) or [Kubernetes](https://app.last9.io/integrations?category=kubernetes) for container and host-level metric collection. * **Process Metrics**: JVM and application runtimes such as Node.js provide deep process insights. Configure runtime metrics via OpenTelemetry SDK for Java and Node.js applications. [See all Java integrations](https://app.last9.io/integrations?category=java), [See all Node.js integrations](https://app.last9.io/integrations?category=nodejs). Without traces, the Discover Services feature will have limited functionality. Logs and metrics enhance the experience but are not required for basic service monitoring and APM capabilities. ## Understanding the Services Dashboard Access the Services dashboard at [Discover > Services](https://app.last9.io/service-catalog) in Last9.
![Services Overview](/_astro/service-overview.4iNnSrDS_13AYCh.webp) The Services dashboard displays all monitored services in your environment with key performance indicators at a glance. Each service shows critical metrics including throughput (requests per minute), error rate percentage, availability, and response time. The main dashboard includes: * **Service Names**: Organized by environment and runtime technology * **Throughput**: Requests per minute for each service * **Error Rate**: Percentage of failed requests * **Availability**: Service uptime percentage * **Response Time**: p95 response time in milliseconds Use the sidebar filters to narrow down your view: 1. Click on any filter category in the left sidebar 2. Select specific values to filter the services list 3. Click **Apply Filters** to update the view 4. Use **Clear** to reset all applied filters ## Analyzing Individual Services Click on any service name to access detailed performance analysis and monitoring capabilities. ### Overview The Overview tab provides a comprehensive performance dashboard with multiple visualization panels: **Performance Metrics Include:** * **APDEX Score**: Application Performance Index showing user satisfaction * **Response Time**: P50, P95, P99, AVG, and Max percentiles with alert threshold reference lines * **Availability**: Service uptime tracking * **Throughput & Error Rate**: Request volume and failure rates over time * **Error Distribution**: Breakdown of error types and their frequency ![Service Performance Overview](/_astro/service-performance.KDOkh_4__Z2divFx.webp) **Key Performance Tables:** * **Top 10 Web Operations**: Slowest operations by response time * **Top 10 Operations with Errors**: Operations with highest error counts * **Top 10 Errors**: Most frequent error types and their occurrence You can also click on each of the rows in the key performance tables to view more details. ### Setting Up Alerts Configure performance-based alerts directly from the service overview: ![Alert Configuration](/_astro/service-set-alert.DqMCGiGZ_CVij6.webp) 1. Click **Enable Alert Rule** in the Performance section 2. Configure the alert condition (e.g., APDEX Score falls below threshold, or Max Response Time exceeds threshold) 3. Set the threshold value and time window 4. Preview the alert behavior with the visual timeline 5. Click **Configure Alert Rule** to finalize The alert preview shows how the rule would have triggered based on historical data, helping you validate the threshold settings. **Available alert metrics:** * APDEX Score * Response Time (P50, P95, P99, AVG, Max) * Error Rate * Throughput #### Notification Channels Configure where alerts are delivered by setting up notification channels. Access this through the **No Notification Channels** button (when no channels are configured) or the **Settings** button. Available notification channels include Slack, PagerDuty, Opsgenie, Webhook, and Email integrations. For detailed setup instructions, see the [Alerting documentation](/docs/alerting-overview/). ## Operation Analysis ### All Operations View detailed performance metrics for all operations within a service. The operations view includes filters for: ![All Operations](/_astro/service-operations.Bgh1wFPn_Z20DYmv.webp) * **Operation Type**: Filter by Endpoints, HTTP types, Database operations * **Operation Categories**: Client-Internal, Client-External, Messaging, etc. 
Each operation shows: * **Throughput**: Requests per minute * **Error Rate**: Failure percentage * **Response Time**: P95 latency metrics by default * **Operation Type**: Classification of the operation #### Viewing Individual Operation Details Click on any operation name to open the detailed operation view. In the operation detail panel, you can: * **Select Response Time Metrics**: Choose from P50, P95, P99, AVG, or **Max** to analyze different latency percentiles * **View Correlated Traces**: See all traces where this operation executed, with filtering and sorting options * **Monitor Trends**: Visualize response time patterns over the selected time range The **Max** metric selector is particularly useful for identifying worst-case latency scenarios and performance outliers that may not be visible in percentile metrics. ### Database Monitor database-specific operations and queries across different database technologies: * **Multi-Database Support**: Automatically detects and monitors operations across MySQL, MongoDB, Redis, PostgreSQL, and other database technologies * **Query Performance Tracking**: Monitor throughput (RPM), error rates, and P95 response times for SELECT, INSERT, UPDATE, and other database operations * **Time-Series Visualization**: Identify slow queries, high-volume operations, and performance trends with detailed graphs * **Operation-Level Details**: View specific queries and statements with individual performance characteristics and metrics ### Outgoing Calls Monitor external dependencies and third-party service calls: * **External API Monitoring**: Track HTTP calls to third-party services, cloud APIs, and external endpoints with detailed performance metrics * **Internal Service Communication**: Monitor microservice-to-microservice communication and internal network calls * **Client Type Classification**: Distinguish between Client-Internal and Client-External operations for better dependency analysis * **Dependency Performance Impact**: Analyze how external service latency and availability affect your application’s overall performance * **Service Reliability Tracking**: Monitor throughput, error rates, and response times for all outgoing dependencies to identify unreliable external services ## Exception Monitoring Using the **Exceptions** tab, track and analyze application errors and exceptions: * **Error Trend Visualization**: Monitor error frequency over time with multiple trend analysis views (Error Type, Operation) * **Exception Type Filtering**: Filter by specific exception classes like ReadTimeout, UnknownHostException, SQLIntegrityConstraintViolation, and HTTP error codes (400, 404, 500, 502, 503) * **Operation-Level Error Analysis**: Identify which specific operations (Database queries, HTTP Client calls, Endpoints) are generating the most errors * **Error Count Tracking**: See total error occurrences for each exception type to prioritize troubleshooting efforts * **Multi-Dimensional Analysis**: Analyze errors by operation type (Database, HTTP Client - Internal/External, Endpoints) to understand error patterns across your application stack ## Performance History Analyze performance trends over time with historical comparisons: * **Operation Type Filtering**: Switch between Endpoints, Consumer, and Database operations to analyze specific operation categories * **Period Comparison**: Compare current performance against previous periods (Last 24 Hours, Previous 24 Hours, Last Monday, 7 Day Average) * **Color-coded Performance**: Green indicates improvements, red 
shows degradation in throughput and response times * **Trend Analysis**: Track throughput (RPM), response time (P95), error rates, and APDEX scores across different time periods * **Operation-Level Insights**: See performance changes for individual operations like API endpoints, database operations, and health checks ## Service Dependencies Using the **Dependency** tab, visualize service relationships and dependencies: ![Service Dependencies](/_astro/service-dependencies.CbJTRn-6_20pFYw.webp) The dependency map shows: * **Service Connections**: How services communicate with each other * **Infrastructure Dependencies**: Database and external service connections * **Performance Impact**: Metrics for each dependency relationship * **Dependency Health**: Red nodes and arrows indicate services/relationships with errors, green indicates healthy services Navigate the dependency map using: * **Zoom Controls**: Use + and - buttons to adjust view * **Pan**: Click and drag to move around the map ## Related Logs Access service-specific logs for troubleshooting: * **Pre-filtered Logs**: Automatically filtered to the selected service * **Time Range Alignment**: Logs correspond to the selected time window * **Volume Indicator**: Visual representation of log volume over time * Click on any log line to view more details ## Related Traces Examine distributed traces for the selected service: * **Operation Filtering**: Filter traces by specific operations * **Span Analysis**: Examine individual spans within traces * **Performance Correlation**: Connect traces to performance metrics * **Duration Tracking**: Analyze request flow and timing * Click on any trace or span to view distributed tracing visualization and more details ## Related Metrics Services may have one or both of the following, depending on their monitoring configuration and deployment type: * Infrastructure metrics monitor container-level resources (CPU, memory, network, disk) * Process metrics focus on application runtime performance (JVM, memory management, garbage collection) ### Infrastructure Metrics Monitor underlying infrastructure performance. Infrastructure monitoring covers: * **Container Overview**: High-level dashboard showing total containers, average CPU and memory usage, and network traffic (incoming and outgoing) * **CPU Monitoring**: Processor utilization per container with breakdowns by user mode and kernel mode * **Memory Monitoring**: RAM usage percentage by container, memory limits, and cache utilization patterns * **Network I/O Monitoring**: Network traffic analysis including data transfer rates and packet statistics * **Block I/O Monitoring**: Disk I/O operations for storage performance analysis ### Process Metrics Monitor JVM and application-level performance metrics.
Process monitoring provides deep insights into application runtime performance: * **Memory Health Overview**: Track memory usage after garbage collection and memory growth patterns * **Heap Pool Analysis**: Monitor different heap space regions (Eden, Survivor) for garbage collection optimization * **GC Performance Metrics**: Analyze garbage collection overhead, frequency, duration, and efficiency * **CPU and System Analysis**: Monitor JVM and system CPU utilization, load, and thread management * **Buffer Pool Intelligence**: Track I/O buffer usage, limits, and utilization for performance optimization * **Class Loading Monitoring**: Monitor dynamic class loading and unloading behavior * **Advanced Analysis**: Detect memory leaks, performance degradation, and thread issues with predictive monitoring ## Best Practices **Service Monitoring Strategy:** * Set up alerts for critical services using APDEX scores below 0.8 * Monitor both throughput trends and error rate spikes for early issue detection * Use dependency maps to understand service impact during incidents * Configure notification channels before alerts to ensure proper incident response **Performance Optimization:** * Focus on operations with high response times in the “Top 10 Web Operations” table * Monitor error distribution to identify systemic vs. isolated issues * Use the Performance History tab to validate the impact of deployments * Correlate infrastructure metrics with application performance during capacity planning **Troubleshooting Workflow:** * Start with the Overview tab to identify performance anomalies * Use the Exceptions tab to understand error patterns * Examine specific operations in the All Operations tab * Check dependencies to identify upstream or downstream impact * Access logs and traces for detailed root cause analysis *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # FAQs > Frequently Asked Questions ## What are the challenges of running your own Prometheus? The key challenges faced by an enterprise: * Modern Time Series systems don’t have to grow along a single axis of Cardinality, Coverage, or Retention alone. Instead, the rate of ingestion and exploration warrants an expansion on all three axes. There is a constant need to scale the time series database vertically * Data is abundant, but it’s not being used. **Growing costs of storing and querying this data** lead to time and effort spent auditing which metrics are used, deciding how much data retention is necessary, and trimming your database * **Enterprises must dedicate full-time engineering resources to manage their time series database**. Automation to scale with rapid changes in data shape and recovery from downtime must be implemented and maintained. To support needs across the business, teams create multiple database instances to handle concurrency and implement query sharding to improve performance. Orgs will also implement a solution like Thanos or Cortex to enable long-term storage of metric data. These implementations cost significant engineering time to create and maintain and make things operationally confusing for developers sending and querying metrics ## Is Last9 fully managed? You can set up a cluster and get going in under two minutes. How, you ask? Read our [How to onboard to Last9](/docs/onboard/) guide for more details.
## Should an enterprise run Prometheus internally, or can Last9 support all internal enterprise requirements related to Prometheus? An enterprise can run Prometheus internally or offload all metrics storage to [Last9](https://last9.io/). We provide flexible offerings that match the requirements best suited to each customer. Prometheus is open-source software that collects metrics from targets by “**scraping**” metrics HTTP endpoints and **stores** the metrics as time-series data. ## Is Last9 SOC2 compliant? Last9 cares deeply about its customers’ data and is SOC2 Type II certified. ## Can Last9 be deployed on infrastructure owned by the enterprise? Yes, Last9 can also run on any cloud provider of your choice. ## What features are available on SaaS vs. on-prem instances? All features, including retention, are available on SaaS and on-prem instances. ## How do enterprises ingest data from GCP, Azure & AWS? All cloud providers have a Prometheus-compatible exporter. These can be run to scrape metrics from all infrastructure resources, and the resulting metrics can be ingested into Last9. ## Does the data go over the open internet? Yes, by default it does, and securely. Alternatively, you can set up VPC peering and remote write data to Last9 over HTTP. ## How to retrieve data for internal consumption? This can be done with simple querying, as one would with Grafana and other visualization tools that query TSDB data. Read our [How to configure Grafana for Last9](/docs/grafana-config/) guide for more details. # Receive Alert Notifications via Flock > Set up Flock integration and receive alert notifications from Last9. ## Getting started Last9 can send alert notifications and resolutions to Flock, a team messaging platform. This document provides step-by-step instructions on how to set up Flock integration with Last9 and start receiving alert notifications. ## Setting up an Incoming Webhook in Flock 1. Navigate to the [Flock Developer Dashboard](https://dev.flock.com) and sign in with your Flock account 2. Click **Webhooks** in the left navigation menu 3. Click **Add** next to **Incoming Webhook** ![Flock webhooks page](/_astro/flock-add-webhook.Z54ZsS-w_Z25TWrg.webp) 4. Configure your webhook: * Select the **Channel** where you want alert notifications to be posted * Give the webhook a descriptive **Name** (e.g., “Last9 Alerts”) * Optionally, set an **Icon** for the webhook messages ![Flock webhook configuration](/_astro/flock-add-admin-webhook.CP2C0346_Z1brACw.webp) 5. Click **Save and Generate URL** 6. Copy the generated webhook URL from the success dialog ![Flock webhook URL generated](/_astro/flock-internal-webhook.KaqX396U_JW8W9.webp) 7. Your webhook will now appear under **Manage Webhooks** ![Flock webhook saved](/_astro/flock-webhook-saved.gYdSjZHo_ZAGomC.webp) ## Setting up a notification channel in Last9 1. In [Notification Channels](https://app.last9.io/alerting/notification-channels/), click **Add** to create a new channel ![Notification channels list](/_astro/notification-channel-1.JNuPUiXR_Z2wfuxX.webp) 2. Provide the following details: * **Channel Name**: A descriptive name to easily identify the channel (e.g., “Flock Production Alerts”) * **Channel**: Select **Webhook** from the dropdown * **Webhook URL**: Paste the Flock webhook URL copied from the previous step * **Send Resolved**: Enable this option if you want to be notified when an alert has been resolved ![Add Flock webhook channel](/_astro/flock-webhook-in-last9.BYUZVz0V_20aMdP.webp) 3.
Click **Save** to enable the channel ## Assigning a notification channel to an alert group 1. Navigate to your Alert Group in [Alert Studio](https://app.last9.io/alerting/) 2. Click on the notification channel icons at the top of the alert group to configure notifications ![Alert Studio notification icons](/_astro/last9-set-notification-channel-for-alert-group.BLTRr9mc_ZzROQf.webp) 3. Select your Flock channel from the **Webhook** dropdown under either **Channels for Threat Notification** or **Channels for Breach Notification** ![Select Flock channel for alerts](/_astro/last9-notification-set-flock-alerts.GMdE4CVw_Z17HRtL.webp) ## Flock Notification Format Last9 sends notifications to Flock using FlockML, which provides rich formatting. ### Trigger Notifications When an alert is triggered, the notification includes: | Field | Description | | -------------- | ---------------------------------------------------------------- | | Header | 🚨 TRIGGER: {severity icon} {summary} | | Severity | BREACH or THREAT (uppercase) | | Component | The affected component or service | | Class | Type of alert (e.g., Static Threshold, SLO Breach) | | Timestamp | When the alert was triggered (e.g., “Jan 15, 2024 at 10:30 UTC”) | | Custom Details | Additional context like service, environment, error rates | | Inspect | Link to “View in Last9” | ### Resolved Notifications When an alert is resolved (requires **Send Resolved** to be enabled), the notification includes: | Field | Description | | ------------------ | ----------------------------------------- | | Header | ✅ RESOLVED: {summary} | | Status | RESOLVED | | Original Severity | The severity when the alert was triggered | | Component | The affected component or service | | Original Timestamp | When the alert was originally triggered | | Custom Details | Context from the original alert | | Inspect | Link to “View in Last9” | ### Severity Indicators | Severity | Icon | Description | | -------- | ---------------- | --------------------------------------------- | | Breach | 🔴 Red circle | Critical alerts requiring immediate attention | | Threat | 🟠 Orange circle | Warning alerts indicating potential issues | *** ## Using Terraform You can also create the Flock notification channel using the [Last9 Terraform Provider](/docs/terraform-provider/): ```hcl resource "last9_notification_channel" "flock" { name = "flock-production-alerts" type = "webhook" destination = "https://api.flock.com/hooks/sendMessage/YOUR-WEBHOOK-TOKEN" send_resolved = true } ``` *** ## Troubleshooting ### Notifications not appearing in Flock 1. Verify the webhook URL is correct and starts with `https://api.flock.com/hooks/` 2. Check that the webhook is still active in the [Flock Developer Dashboard](https://dev.flock.com) 3. Ensure the channel selected for the webhook still exists ### Messages are not formatted correctly Last9 auto-detects Flock webhooks based on the URL pattern. Ensure your webhook URL contains `api.flock.com` or `flock.co`. Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Getting started with API > Step-by-step walkthrough on how to obtain the API tokens for performing various operations with Last9 The API provides a programmatic method to access and operate Last9. This exposes a subset of features and actions that can be performed on Last9 as REST APIs. 
For example, you can send [change events](/docs/change-events/) to Last9 using these APIs or you can [generate alert rules](/docs/alerting-via-iac/). ## Access Requirements Access to the API Access page and token generation is controlled by [user roles](/docs/user-roles/): * **Admins** can generate and revoke refresh tokens, and exchange them for access tokens * **Editors** can exchange existing refresh tokens for access tokens, but cannot generate new refresh tokens * **Viewers** cannot access the API Access page If you need API access as an Editor, ask your organization’s Admin to generate a refresh token for you. ## Base URL The base API URL can be obtained from the [API Access](https://app.last9.io/settings/api-access) page. It is in the following format: ```text https://{domain}/api/{version}/organizations/{org}/{endpoint} ``` The `{org}` parameter is your unique organization slug. ## Tokens Authentication is performed using Bearer access tokens. The API Access page has two tabs: ### Refresh Tokens (Admin Only) Admins can create named refresh tokens with specific scopes (read, write, or delete). Each refresh token: * Has a descriptive name for identification * Is associated with a specific scope * Can be revoked at any time by an Admin * Is used to generate short-lived access tokens To create a refresh token: 1. Navigate to the **Refresh Token** tab 2. Click **New token** 3. Enter a descriptive name and select a scope 4. Click **Create** and securely store the generated token Refresh tokens are shown only once when created. Store them securely as they cannot be retrieved later. ![API Access — Refresh Token tab](/_astro/api-access-base-url.BJLfM7wG_Zrx6oA.webp) ### Access Tokens Access tokens are short-lived tokens generated from refresh tokens. Both Admins and Editors can exchange a valid refresh token for an access token: 1. Navigate to the **Access Token** tab 2. Paste your refresh token 3. Click **Generate** to receive an access token ![API Access — Access Token tab](/_astro/api-access-access-token.fqaFTM_u_1lINJo.webp) ### Token Revocation Admins can revoke refresh tokens at any time from the Refresh Token tab. When a refresh token is revoked: * The refresh token immediately becomes invalid * Any access tokens generated from that refresh token will be rejected * This provides a security mechanism to invalidate compromised credentials ### Audit Trail for Tokens All token operations are tracked in Last9’s audit trail for security and compliance purposes (SOC 2, ISO 27001). Navigate to [Settings > Audit Trail](https://app.last9.io/settings/audit-trail) to view: ![Audit Trail — Token Operations](/_astro/audit-trail-tokens.XQTNemXd_Z26BPXw.webp) * **Access Token Creation**: Who generated access tokens and when * **Refresh Token Creation**: When named refresh tokens were created, including token names * **Token Revocation**: Which tokens were revoked and by whom * **User Attribution**: Full name and email of the user who performed each action * **Resource Details**: Token names and identifiers for tracking **Filtering Token Events:** Use the left sidebar filters to focus on token-related activities: 1. **Resource Type** filter: Select “Access Token” or “Refresh Token” to show only token operations 2. **User** filter: See token operations by specific team members 3. **Date Range**: Search token activity within custom time windows The audit trail helps you maintain security compliance, investigate unauthorized access, and track API credential usage across your organization. 
### Token Expiry Access tokens expire in 24 hours. Your application should handle token expiration by using the refresh token to generate a new access token. The following error is returned when an access token expires: ```json { "error": "Authorization token is expired" } ``` To generate a new access token, use the refresh token endpoint: ```text POST https://app.last9.io/api/v4/oauth/access_token ``` The OAuth endpoint does **not** include the organization in the URL. Use the exact URL shown above, not the organization-specific base URL. Request Body: ```json { "refresh_token": "eyJhbGciOiXXXXXXXXXXXXX.eyJleHXXXXXXXXX.XXXXXXXXXOwuvUNA" } ``` The response of this endpoint will contain a new pair of access and refresh tokens if the refresh token in the request body is valid. Response ```json { "access_token": "eyJhbGciOiXXXXXXXXXXXXXX.eyJleHXXXXXXXXX.XXXXXXXXXOwuvUNA", "expires_at": 1587412870, "issued_at": 1587240070, "refresh_token": "eyJhbGciOiXXXXXXXXXXXXX.eyJleHXXXXXXXXX.XXXXXXXXXOwuvUNA", "type": "Bearer", "scopes": ["read", "write", "delete"] } ``` ## Usage Tokens are separated by the scopes they are authorized to perform, based on the impact each scope can have on the system’s overall behavior. * **Read Tokens**: Have minimal impact on the performance of the Last9 application. These are to be specifically used for reading the current state of the data * **Write Tokens**: Use this token to create or modify data in any supported entity. This could change the behavior of your usage of Last9 * **Delete Tokens**: Use this token judiciously. Deletions can break processes and cause an irrevocable state through missing data ## Authentication & Authorization All public API endpoints require a Token to be supplied as an authorization header for all requests. The token is used to identify the user/application and authenticate requests to the API. The header name must be **X-LAST9-API-TOKEN**. Caution The header value **must** be prefixed with the literal word `Bearer ` (with a trailing space) before the access token. Sending the raw token without the `Bearer` prefix returns `HTTP 400 {"error":"invalid access token"}`, even when the token itself is valid. Also note: the token goes in the `X-LAST9-API-TOKEN` header, **not** the standard `Authorization` header. Using `Authorization: Bearer ` returns `HTTP 401`. | Example | Result | | ---------------------------------- | ----------------------------- | | `X-LAST9-API-TOKEN: Bearer eyJ...` | ✓ correct | | `X-LAST9-API-TOKEN: eyJ...` | ✗ 400 — missing Bearer prefix | | `Authorization: Bearer eyJ...` | ✗ 401 — wrong header | ## Making your first API request Please follow the steps below to create your first API request for a change event. ### Step 1: Generate Tokens 1. Navigate to the [API Access](https://app.last9.io/settings/api-access) page 2. If you’re an Admin, create a new refresh token with **write** scope from the Refresh Token tab 3. Exchange the refresh token for an access token from the Access Token tab 4. Copy the generated access token for use in your API request ### Step 2: Base URL The base URL of your instance can be obtained as specified in the Base URL section above.
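If you prefer to script the token exchange from Step 1, the OAuth endpoint documented under Token Expiry above can be called directly. A minimal sketch, assuming only the documented request body is required; the token value is a placeholder:

```shell
# Exchange a refresh token for a short-lived access token.
# YOUR_REFRESH_TOKEN is a placeholder for the refresh token created in Step 1.
curl --location --request POST 'https://app.last9.io/api/v4/oauth/access_token' \
  --header 'Content-Type: application/json' \
  --data '{ "refresh_token": "YOUR_REFRESH_TOKEN" }'
```

The `access_token` field of the response is the value to place after `Bearer ` in the `X-LAST9-API-TOKEN` header in Step 3.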
### Step 3: Making the API request The endpoint for creating change events is ```text PUT /change_events ``` ```json { "timestamp": "2024-01-15T17:57:22+05:30", "event_name": "new_deployment", "event_state": "start", "attributes": { "env": "production", "k8s_cluster": "prod-us-east-1", "app": "backend-api" } } ``` The cURL request looks as follows: ```shell curl --location --request PUT 'https://app.last9.io/api/v4/organizations/{org}/change_events' \ --header 'X-LAST9-API-TOKEN: Bearer <ACCESS_TOKEN>' \ --header 'Content-Type: application/json' \ --data '{ "timestamp": "2024-01-15T17:57:22+05:30", "event_name": "new_deployment", "event_state": "start", "attributes": { "env": "production", "k8s_cluster": "prod-us-east-1", "app": "backend-api" } }' ``` ### Step 4: Verify the response The API will return the following response in case of success with HTTP status code 200. ```json { "message": "success" } ``` *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Getting Started with Events > This document explains how to send events to Last9 and different ways to consume them such as via streaming aggregation. Last9’s time series data warehouse is a powerful tool for storing and analyzing time series data. Last9 also supports ingesting real-time events and converting them into metrics so that they can be consumed in the myriad ways engineers are already familiar with. This document outlines the steps necessary to send events to Last9 and the different ways of consuming them. ## Events Generally, there are two types of events. 1. Events whose behavior over time, such as frequency, presence, or absence, is interesting * Example: Average hourly take-offs from the San Francisco Airport in the last week. 2. Events whose individual data is of interest * Example: When was the last time Arsenal won in the EPL? The first example asks questions based on specific aggregations performed on raw events; the individual events may not be necessary, but the aggregations and insights captured from them are relevant to the business. The second example is about the event itself and gives insights based on the event data. Last9 supports extracting both kinds of information from the events. ## Structure of Events Last9 supports accepting events in the following JSON format. Every event has a unique name defined by the `event` key and a list of `properties`. Any extra keys apart from `event` and `properties` are not retained by Last9. ```json { "event": "heartbeat", "properties": { "server": "ip_address", "environment": "staging" }, "extra_fields": "will be dropped" } ``` ## Sending Events to Last9 ### Prerequisites Grab the [Prometheus Remote Write](https://last9.io/blog/what-is-prometheus-remote-write/) URL, cluster ID, and the write token of the Last9 cluster that you want to use as an Event store. Follow the instructions [here](/docs/onboard/) if you haven’t created the cluster and its write token. ### Sending data Grab the Prometheus Remote Write URL for your Last9 Cluster, and make the following changes to the URL.
If your Prometheus URL is `https://username:token@app-tsdb.last9.io/v1/metrics/{uuid}/sender/acme/write` The Event URL would be `https://username:token@app-events-tsdb.last9.io/v1/events/{uuid}/sender/acme/publish` ![Last9 Cluster Prometheus Remote Write URL details](/_astro/d8b8120-image.CMG1bCRo_Z2q6A6a.webp) | Cluster Region | Host | | :----------------- | :---------------------------- | | Virginia us-east-1 | app-events-tsdb-use1.last9.io | | India ap-south-1 | app-events-tsdb.last9.io | Events must be sent with the `Content-Type: application/json` header. ```bash curl --location --request POST 'https://username:token@app-events-tsdb.last9.io/v1/events/{uuid}/sender/acme/publish' \ --header 'Content-Type: application/json' \ --data-raw '[{ "event": "sign_up", "properties": { "currentAppVersion": "4.0.1", "deviceType": "iPhone 14", "dataConnectionType": "wifi", "osType": "iOS", "platformType": "mobile", "mobileNetworkType": "wifi", "country": "US", "state": "CA" } }]' ``` The API endpoint accepts an array of events in the payload, so one or more events can be sent in the same packet. The API is greedy and allows partial ingestion. If one or more events in the packet have a problem, they are returned in the response body; everything else is ingested into the system. ## Consuming events as metrics Events are converted to Metrics at 1 DPM (Data Point per minute) and combined to emit `gauges` per combination of `event name + properties` for the previous minute. We use **[Tumbling Windows](https://learn.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics)** to represent the Event stream’s consistent and disjoint time intervals. All published events are partitioned into buckets of 1 minute each and then grouped by event name and properties. For example, the elements with timestamp values `\[0:00:00-0:01:00)` are in the first window. Elements with timestamp values `\[0:01:00-0:02:00)` are in the second window. And so on. ![Events timing diagram](/_astro/35a7331-image.ChCMUJBT_2vi1C8.webp) Events published this way produce metrics of the following form: ```bash event_name_count{properties...1} 5 event_name_count{properties...2} 2 ... event_name_count{properties...n} 5 event_name_count{properties...1} 5 event_name_count{properties...2} 1 ... event_name_count{properties...n} 3 event_name_count{properties...1} 2 event_name_count{properties...2} 4 ... event_name_count{properties...n} 3 ``` ### Define Streaming Aggregations You need to define streaming aggregations to query the metrics converted from events. Last9 allows defining a Streaming Aggregation as a PromQL expression to emit an aggregated metric that alerts or dashboards can then consume. ```yaml - promql: "sum(device_health_total{version='1.0.1'})[5m] by (os)" as: total_devices_by_os_5m with_name: total - promql: "sum(device_health_total{os='ios'})[1m] by (version)" as: concurrency_by_ios_version with_name: concurrency ``` Please refer to the PromQL-powered [Streaming Aggregations](/docs/streaming-aggregations/) to understand the workflow of where and how to define the Streaming Aggregation Pipelines. This feature enables the folding of metrics that would otherwise explode in cardinality and allows for the emission of meaningful aggregations and views. It is also available for Last9 Metrics, not just limited to Events.
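Once a streaming aggregation is emitting, its output metric can be queried like any other time series. A minimal sketch using the standard Prometheus HTTP query API, assuming your cluster’s Read URL is Prometheus-API compatible; the host and credentials are placeholders, and `total_devices_by_os_5m` is the example aggregation defined above:

```shell
# Query the aggregated output metric through the Prometheus-compatible query API.
# Replace the Read URL and credentials with your cluster's values.
curl -G 'https://<read-url>/api/v1/query' \
  --data-urlencode 'query=total_devices_by_os_5m{os="ios"}' \
  -u '<username>:<read-token>'
```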
## Events to Gauge Metrics Consider an event named `memoryUsageSpikeAlert` with the following properties: * `increaseInBytes` indicating an increase in memory usage by 1,610,612,736 bytes * `host` represents the host’s IP address associated with the event * `osType` specifying the operating system type as “linux” ```json { "event": "memoryUsageSpikeAlert", "properties": { "increaseInBytes": "1610612736", "host": "10.1.6.14", "osType": "linux" } } ``` Define the streaming aggregation configuration for `max` of `memoryUsageSpikeAlert` as follows: ```yaml - promql: "max by (host) (memoryUsageSpikeAlert_maximum)" as: "max_memory_usage_spike" with_value: "increaseInBytes" with_name: "maximum" ``` Let’s break down the example: * `max` is the aggregation function * `maximum` is the `with_name` value appended to the intermediate metric name to maintain uniqueness. It can be any string. * `with_value` is the event’s property name on which the gauge aggregation has to be applied. In this case, it is `increaseInBytes`. * `max_memory_usage_spike` will be the final output metric exposed for you to query against. > You can also use the `min` and `sum` aggregations. ## Querying Events Once Events have been converted to Metrics, they can be queried like any other metrics, from a Grafana dashboard or any other Prometheus Query API client. ![Example of querying events](/_astro/f2b9159-image.Fpez-lNu_Z1O5uXv.webp) You may also set alerts on these events, converted to metrics, using a Prometheus-compatible Alertmanager. ## Conclusion Here’s a link to the sample repository that brings this all together. It contains some example schemas and aggregation pipelines. ## FAQs **Q: Why is time not accepted as a first-class property?** A: Accepting a user-provided timestamp is extremely risky. A timestamp may not be formatted correctly, or instead of a UnixMilli, one may send Unix alone. Or send January 1st, 1970 as *testing* data. Such precision is unfair to expect from developers who want easier integration that is not prone to fragility. Besides, since the system is optimized for time, a malformed timestamp will result in dropped packets or force the system to backfill data into frozen shards. Hence, we record the timestamp as the time the event is received. This also means that Last9 works well with real-time events! **Q: What happens to the timestamp if there are delays in arrival?** A: Last9 Gateways are designed to be extremely fast and lightweight. They write data as soon as they receive it and are highly available across multiple availability zones. No further processing is needed to ensure messages are not lost or delayed upon arrival. *** ## Troubleshooting Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Receive Alert Notifications via Google Chat > Set up Google Chat integration and receive alert notifications from Last9. ## Getting started Last9 can send alert notifications and resolutions to Google Chat spaces. This document provides step-by-step instructions on how to set up Google Chat integration with Last9 and start receiving alert notifications. ## Setting up an Incoming Webhook in Google Chat 1. Open [Google Chat](https://chat.google.com) and navigate to the space where you want to receive alerts 2. Click on the space name at the top to open the dropdown menu 3. Select **Apps & integrations** (or **Manage webhooks** in older versions) 4. Click **Add webhooks** 5.
Configure your webhook: * Enter a **Name** for the webhook (e.g., “Last9 Alerts”) * Optionally, add an **Avatar URL** for the webhook icon 6. Click **Save** 7. Copy the generated webhook URL. It will look like: ```plaintext https://chat.googleapis.com/v1/spaces/SPACE_ID/messages?key=KEY&token=TOKEN ``` ## Setting up a notification channel in Last9 1. In [Notification Channels](https://app.last9.io/alerting/notification-channels/), click **Add** to create a new channel ![Notification channels list](/_astro/notification-channel-1.JNuPUiXR_Z2wfuxX.webp) 2. Provide the following details: * **Channel Name**: A descriptive name to easily identify the channel (e.g., “Google Chat Production Alerts”) * **Channel**: Select **Webhook** from the dropdown * **Webhook URL**: Paste the Google Chat webhook URL copied from the previous step * **Send Resolved**: Enable this option if you want to be notified when an alert has been resolved ![Add webhook channel](/_astro/webhook-channel-config.CdHNOtr0_1RuJYg.webp) 3. Click **Save** to enable the channel ## Assigning a notification channel to an alert group 1. Navigate to your Alert Group in [Alert Studio](https://app.last9.io/alerting/) 2. Click on the notification channel icon to configure notifications 3. Select your Google Chat channel from the dropdown ## Google Chat Notification Format Last9 sends notifications to Google Chat using the Cards v2 API, which provides rich formatting. Each notification card includes: | Section | Description | | -------- | ------------------------------------------------------------------ | | Header | Alert summary with subtitle showing severity, class, and timestamp | | Sections | Organized alert details with decorated text widgets | | Button | “Inspect in Last9” link to view the alert directly | ### Severity Indicators | Severity | Icon | Description | | -------- | ------------- | --------------------------------------------- | | Breach | Red circle | Critical alerts requiring immediate attention | | Threat | Orange circle | Warning alerts indicating potential issues | *** ## Using Terraform You can also create the Google Chat notification channel using the [Last9 Terraform Provider](/docs/terraform-provider/): ```hcl resource "last9_notification_channel" "google_chat" { name = "google-chat-production-alerts" type = "webhook" destination = "https://chat.googleapis.com/v1/spaces/SPACE_ID/messages?key=KEY&token=TOKEN" send_resolved = true } ``` *** ## Troubleshooting ### Notifications not appearing in Google Chat 1. Verify the webhook URL is correct and starts with `https://chat.googleapis.com/` 2. Check that the webhook is still active in your Google Chat space settings 3. Ensure the space where the webhook was created still exists ### Messages are not formatted correctly Last9 auto-detects Google Chat webhooks based on the URL pattern. Ensure your webhook URL contains `chat.googleapis.com`. ### Rate limiting Google Chat has rate limits for incoming webhooks. If you’re sending a high volume of alerts, some messages may be delayed or dropped. Consider consolidating alerts or using alert grouping in Last9. Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions. # Google SSO > Security permissions and authentication details for signing in to Last9 with Google Workspace. Last9 supports signing in with Google (including Google Workspace accounts) using standard OAuth 2.0 / OpenID Connect (OIDC) authentication with minimal, user-scoped permissions.
## Permissions Requested Last9 requests the following **OAuth scopes** from Google: | Scope | Description | Sensitivity | | --------- | --------------------------------- | ------------- | | `openid` | Authenticate using OpenID Connect | Non-sensitive | | `email` | View user’s email address | Non-sensitive | | `profile` | View user’s basic profile info | Non-sensitive | All three scopes are classified as [non-sensitive by Google](https://developers.google.com/workspace/guides/configure-oauth-consent) and do not require additional verification. They provide access to the authenticated user’s basic identity only — name, email, and profile picture. ## What Last9 Cannot Access Last9 does **not** request any scopes that would grant broader access. This includes: * Google Workspace directory (viewing or managing users/groups) * Gmail, Google Drive, or Google Calendar * Domain-wide delegation * Any sensitive or restricted OAuth scopes ## How to Verify Permissions ### For Individual Users 1. Go to [Google Account — Third-party connections](https://myaccount.google.com/connections) 2. Find and click on **“Last9”** 3. Review the permissions listed You should only see access to your email address and basic profile info. ### For Google Workspace Admins 1. Sign in to [Google Admin Console](https://admin.google.com) 2. Go to **Security** → **Access and data control** → **API controls** 3. Click **Manage Third-Party App Access** 4. Search for **“Last9”** and review the scopes and access level ### Revoking Access To revoke Last9’s access to your Google account: 1. Go to [Google Account — Third-party connections](https://myaccount.google.com/connections) 2. Find **“Last9”** and select **“Remove Access”** You will need to re-authorize when signing in to Last9 again. ## Access Control Your organization retains full control over who can access Last9 through Google SSO. * **You control access**: Only users you authorize can sign in to Last9 * **Instant revocation**: When you suspend or delete a user’s Google Workspace account, they can no longer authenticate to Last9 * **No standalone accounts**: Users authenticate through Google — Last9 does not maintain separate credentials Google Workspace administrators can further restrict access by configuring [app access controls](https://support.google.com/a/answer/7281227) to block or allow Last9 for specific organizational units. *** ## Troubleshooting **“This app isn’t verified”** This may be due to your organization’s security settings. Contact your Google Workspace administrator to allow Last9. **“Access blocked: Last9 has not completed the Google verification process”** This can occur if your Workspace admin has restricted third-party app access. Ask your admin to review and allow Last9 in the Admin Console under **Security** → **Access and data control** → **API controls**. **Can’t sign in with a personal Gmail account** Last9 supports both personal Gmail accounts and Google Workspace accounts. If you’re having issues, ensure you’re using the correct account type for your organization. *** If you have questions about Last9’s Google SSO integration, please contact us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io). # Configure Grafana with Last9 as Datasource > How to configure Grafana with Last9 cluster as datasource and visualize the metrics stored in Last9 ## How to configure Grafana for Last9? Create a new data source with the appropriate URL. 
![Add New Data Source in Grafana](/_astro/ba92d7c-1.BsPoFgkL_oTee2.webp)

Each Last9 Cluster comes with a Read URL which needs to be used when creating a Data Source in Grafana.

![Last9 Read Data Settings](/_astro/levitate-cluster-read-data-settings-tab.BoWiHQqm_26tp2e.webp)

Here’s an example of creating a Data Source using the Read URL. You can grab the Read URL either:

* While creating your cluster → Read Data → Bring Your Own Visualization
* Or, by going to the Last9 Cluster → Settings → Read Data → Bring Your Own Visualization

![Add Last9 Cluster as Prometheus Compatible Data Source in Grafana](/_astro/cc33383-Screenshot_2022-09-25_at_1.23.11_PM.EBzXLx9P_ZjbI0Y.webp)

Make sure that all the data source status checks show ✅.

![Last9 Cluster as Prometheus Compatible Data Source Status](/_astro/c20f2ba-3.CEjLNv4t_1sR2f0.webp)

After this, try exploring data or creating a new dashboard in Grafana based on metrics in Last9.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Grafana Loki in Last9

> Use Last9's embedded Grafana Loki to view logs.

## Using Grafana Loki

Last9 provides a Grafana Loki interface using LogQL to explore your logs data.

![Grafana Loki in Last9](/_astro/last9-logs-loki.Co_L4Hmz_Z17G9AP.webp)

* Access the Loki UI by visiting [Grafana Explore](https://app.last9.io/explore/query) and selecting Loki as the datasource.
* You can perform [LogQL queries](https://grafana.com/docs/loki/latest/query/) to explore logs in this interface.

This is useful for structured exploration of logs data for people who are familiar with Grafana and Loki.

## LogQL Compatibility

The following LogQL functions are supported:

* **`RATE`**
* **`COUNT_OVER_TIME`**
* **`SUM_OVER_TIME`**
* **`AVG_OVER_TIME`**
* **`MAX_OVER_TIME`**
* **`MIN_OVER_TIME`**
* **`SUM`**
* **`AVG`**
* **`COUNT`**
* **`MAX`**
* **`MIN`**
* **`STDDEV`**
* **`MEDIAN`**
* **`STDVAR`**

The following LogQL parsers are supported:

* **`json`**
* **`regexp`**

Read the documentation for each function [here](https://grafana.com/docs/loki/latest/query/).

## Creating Dashboards

### Accessing Grafana

1. Navigate to the Grafana section in Last9
2. Create a new dashboard by clicking **Create Dashboard**
3. Add a new panel to begin visualizing your data

### Selecting Loki Data Source

The Loki data source comes pre-configured in Last9’s embedded Grafana, so you can start querying immediately.

### Query Construction Methods

#### Using Builder Mode

Builder mode provides a visual interface for constructing Loki queries without writing LogQL. Here’s how to use it:

1. Label Selection
   * Click **Add label** to start building your query
   * Select labels (e.g., service, severity) from the dropdown
   * Choose operators (`=`, `!=`, `=~`, `!~`)
   * Select or type values for the labels
2. Operations
   * Add operations using the **Operations** button
   * Common operations include:
     * Line contains
     * Line does not contain
     * Line contains regex
     * Line does not contain regex
     * JSON
3. Aggregations
   * Click **Add range function**
   * Select functions like:
     * Rate
     * Count over time
     * Sum over time
     * Avg over time
   * Set time windows (\[1m], \[5m], \[1h])
4. Examples Using Builder Mode:
   Basic Query:

   * Label: **`service = "auth-service"`**
   * Operation: **`Line contains "error"`**
   * Range: **`count_over_time [5m]`**

   Advanced Query:

   * Label: **`service =~ "api.*"`**
   * Label: **`severity = "error"`**
   * Operation: **`JSON`**
   * Operation: **`Line contains "timeout"`**
   * Range: **`sum by (status_code)`**
5. Builder to Code Mode
   * Switch between modes to see the LogQL equivalent
   * Learn LogQL syntax through the Builder interface
   * Fine-tune queries in Code mode

#### Writing LogQL Queries

For advanced users or complex queries, you can write LogQL directly:

Basic Query Structure:

```sql
{service="your-service"}
```

Common Aggregation Patterns:

```sql
sum by (severity) (count_over_time({service="your-service"}[5m]))
```

### Key Query Components

* Label matchers: **`{label="value"}`**
* Line filters: **`|= "error"`**
* Aggregation functions: **`sum`**, **`avg`**, **`max`**
* Time windows: **`[1m]`**, **`[1h]`**, **`[1d]`**

### Understanding Window Behavior

Remember that Last9’s window behavior differs from standard Loki:

* Last9 uses tumbling windows (window size = step size)
* Both window and step size are defined by the **`[]`** parameter
* For instant queries, match time range to window size

### Creating Visualizations

#### Panel Types

1. Time Series
   * Best for tracking metrics over time
   * Suitable for rate and count queries
2. Bar Charts
   * Good for comparing values across categories
   * Works well with **`sum by`** aggregations
3. Tables
   * Useful for detailed log analysis
   * Can show multiple columns of log data

#### Panel Configuration

1. Set appropriate panel title and description
2. Configure axes and legends
3. Set up thresholds and alerts if needed
4. Choose color scheme for better visibility

### Advanced Query Techniques

#### Using Multiple Queries

```sql
sum(rate({service="auth-service"} |= "error" [5m])) by (severity)
sum(rate({service="auth-service"} |= "warning" [5m])) by (severity)
```

#### Pattern Matching

```sql
{service=~"auth.*"} |= "error" != "timeout"
```

#### Metric Extraction

```sql
sum by (status_code) (count_over_time({service="api"} | json | status_code != "" [5m]))
```

#### Complex LogQL Examples

##### Example 1: Cart validation mismatches observed in incoming requests

Counts occurrences (per 10m tumbling window) where `service = "your-service"` logs include any of several generic validation patterns and also include the phrase “received in the request”.

```sql
count_over_time({service="your-service"} |~ "validation error|missing field|missing parameter|mismatch id|mismatch sku|sample sku" |= "received in the request" [10m])
```

##### Example 2: Cart fetch errors for userId, excluding unrelated variant

Counts occurrences (per 10m tumbling window) where `service = "your-service"` logs match either a generic fetch error pattern or an “Invalid” condition, while excluding logs that match the broader “Unexpected error while fetching the … response” variant.

```sql
count_over_time({service="your-service"} |~ "Unexpected error while fetching .* response for .*|.*exception.*Invalid.*No .* found .*" !~ "Unexpected error while fetching .* response" [10m])
```

##### Example 3: Processing errors for large item counts (50–100)

Counts occurrences (per 15m tumbling window) where `service = "your-service"` and label `item_count` is between 50 and 100 (inclusive), and the log line indicates a processing/serviceability error.
```sql
count_over_time({service="your-service", item_count=~"50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100"} |~ "Error in (default|source) .* (processing|serviceability)" [15m])
```

### LogQL boolean logic and symbols (quick primer)

* AND (combine conditions): Chain filters with `|`. All stages must match.
  * Example: `{service="your-service"} |= "error" !~ "timeout"`.
* OR (alternatives): Use regex alternation `foo|bar` inside `|~` or label regex `=~`.
  * Example: `{service=~"api|auth"} |~ "(fail|error)"`.
* NOT (exclude): Use `!=` for literal or `!~` for regex.
  * Example: `!= "error" !~ "(debug|test)"`.
* Literal vs regex line filters:
  * `|= "text"` matches a literal string.
  * `|~ "pattern"` uses regex. Group with `( )`, alternate with `|`, wildcard with `.*`.
* Label matchers:
  * `{key="value"}`, `{key!="value"}`, `{key=~"a|b"}`, `{key!~"a|b"}`.
  * Multiple label matchers inside `{...}` are ANDed.
* Escaping:
  * Escape special regex chars like `.` as `\\.`. In docs, escapes may appear doubled.
* Range functions and windows:
  * `count_over_time({...} [10m])` counts matches in a 10m window.

### Dashboard Organization

#### Best Practices

* Group related panels logically
* Use consistent time ranges across related panels
* Add descriptive titles and documentation
* Consider user permissions and sharing settings

#### Layout Tips

* Arrange panels in order of importance
* Use rows to group related visualizations
* Consider different screen sizes and resolutions

### Performance Optimization

#### Query Efficiency

1. Use label filters before line filters
2. Start with Service and Severity filters for better performance
3. Avoid processing unnecessary data

#### Time Range Considerations

* Start with smaller time ranges during development
* Consider data retention policies
* Use appropriate aggregation intervals

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Grafana Tempo in Last9

> Use Last9's embedded Grafana Tempo to view traces.

## Using Grafana Tempo

Last9 provides a Grafana Tempo interface to explore your traces data.

![Grafana Tempo in Last9](/_astro/last9-traces-tempo.BDwr2mve_294B3L.webp)

* Access the Tempo UI by visiting [Explore](https://app.last9.io/explore/query) and selecting Tempo as the datasource.
* You can perform [TraceQL queries](https://grafana.com/docs/tempo/latest/traceql/) to explore traces in this interface.

This is useful for structured exploration of traces data for people who are familiar with Grafana and Tempo.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Help & Support

> Need help with Last9? Learn how to reach our support team via email, Slack, or MS Teams — based on your plan. Not a user yet? Schedule a call.

Our engineers are always here to help you get the most from Last9. Depending on your plan, there are multiple ways to get in touch with us.

1. **Email:** `[All Plans]` You can reach our customer success team at [cs@last9.io](mailto:cs@last9.io).
2. **Slack or MS Teams:** `[Pro Plan] [Enterprise Plan]` Check the shared channel on your preferred tool to contact us. If you’re not sure about the channel, please check with your admin or reach out to us on email and we’d be happy to help.
If you’re not a Last9 user yet and have questions, apart from emailing us, you can also [schedule a call](/schedule-demo/) at the earliest slot available.

# Calculate usage patterns and data volume in New Relic

> Sample queries to understand data volume of spans, metrics and transactions in New Relic

This document lists queries that you can run in your New Relic account and share the results with Last9 as part of the migration from New Relic to Last9.

## Calculating Data in New Relic

The following NRQL queries calculate the total number of transactions, spans, and metrics over the last week.

1. Total no. of spans over a week

```sql
SELECT count(*) FROM Span SINCE 1 week ago
```

2. Total transactions

```sql
SELECT count(*) FROM Transaction SINCE 1 week ago
```

```sql
SELECT count(*) FROM Transaction FACET dateOf(timestamp) SINCE 1 week ago TIMESERIES 1 day
```

3. Total Metrics

```sql
SELECT count(*) FROM Metric SINCE 1 week ago
```

4. Total Events

```sql
SELECT count(*) FROM Event SINCE 1 week ago
```

5. Total Logs

```sql
SELECT count(*) FROM Log SINCE 1 week ago
```

Additionally, you can share how much data is getting ingested in New Relic from the data management tab.

# Calculate usage patterns and data volume in Prometheus

> Sample PromQL queries to understand ingestion rate, read query rate and total time series

## Calculating Ingestion Rate

This query will calculate the per-minute ingestion rate by averaging the per-second ingestion rate over the past minute, as measured by the `prometheus_tsdb_head_samples_appended_total` metric. The result will be a single value representing the average number of samples ingested per minute over the past minute.

```sql
rate(prometheus_tsdb_head_samples_appended_total[1m]) * 60
```

## Calculating Read Query Rate

This query will calculate the per-minute query rate by averaging the per-second query rate over the past minute, as measured by the `prometheus_http_requests_total` metric for requests to the `/api/v1/query` endpoint. This endpoint is used for executing queries against the Prometheus database, so this metric represents the number of read queries executed by the server.

```sql
sum by (handler) (rate(prometheus_http_requests_total{handler="/api/v1/query"}[1m]) * 60)
```

## Calculating Total Time Series

This query will count the number of distinct time series in the database, regardless of the metric or label values. The regular expression `.+` matches all series names, so this query effectively counts all series.

```sql
count({__name__=~".+"})
```

# Create an AWS STS Role

> This tutorial walks through setting up an AWS STS (Security Token Service) role for discovering resources via CloudWatch

## Creating trusted role without external id

1. Visit [AWS Console/Roles](https://console.aws.amazon.com/iam/home#/roles)

2. Click [Create Role](https://console.aws.amazon.com/iam/home#/roles$new?step=type)

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.22.54\_PM.png](/_astro/Screenshot_2021-02-25_at_1.22.54_PM.DoIJGDgl_Z2cvY7B.webp)

3. Select the **[Another AWS Account](https://console.aws.amazon.com/iam/home#/roles$new?step=type\&roleType=crossAccount)** tab

   * **Account ID**: `652845092827`
   * **Next Permissions**

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.24.53\_PM.png](/_astro/Screenshot_2021-02-25_at_1.24.53_PM.Cg3MpJaH_Z49vx2.webp)

4. Attach policies
   a. **SecurityAudit** Policy

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.33.46\_PM.png](/_astro/Screenshot_2021-02-25_at_1.33.46_PM.BcnkNl4a_Z1MWRvE.webp)

   b. **CloudWatchReadOnlyAccess** Policy

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_2.42.11\_PM.png](/_astro/Screenshot_2021-02-25_at_2.42.11_PM.CHmR72gn_15hJyR.webp)

   c. Proceed to Next Steps

5. Add tags if needed

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.37.36\_PM.png](/_astro/Screenshot_2021-02-25_at_1.37.36_PM.CXA09kuf_Z1cMynA.webp)

6. Review

   1. **Role name:** `${business_name}_last9_role`
   2. **Role description**: Security Audit Access to Last9
   3. **Verify Last9 AWS Account Number**
   4. **Verify Granted Policy**
   5. **Create Role**

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.44.40\_PM.png](/_astro/Screenshot_2021-02-25_at_1.44.40_PM.CMvZPgze_Zx5Se6.webp)

7. After the role is created, go to Role → Trust Relationships → Edit Trust Relationship

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/2021-06-08\_22-08.png](/_astro/2021-06-08_22-08.DXphoq4l_Z1u5bk3.webp)

8. Update the JSON to the following and click “**Update Trust Policy**”

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::652845092827:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
```

9. Edit the role and update “**Maximum session duration**” to 3 hours if your security policy permits it; otherwise, leave it at 1 hour.

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/2021-02-25\_14-14.png](/_astro/2021-02-25_14-14.sXXIC7_L_yJrWM.webp)

10. Share the created role ARN with your Last9 point of contact

***

## Creating trusted role with external id

1. Visit [AWS Console/Roles](https://console.aws.amazon.com/iam/home#/roles)

2. Click [Create Role](https://console.aws.amazon.com/iam/home#/roles$new?step=type)

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.22.54\_PM.png](/_astro/Screenshot_2021-02-25_at_1.22.54_PM.DoIJGDgl_Z2cvY7B.webp)

3. Select the “**[Another AWS Account](https://console.aws.amazon.com/iam/home#/roles$new?step=type\&roleType=crossAccount)**” tab and enter a random string as the External ID. Use a value other than the placeholder “somerandomstring”, and share it with Last9.

   * **Account ID**: `652845092827`

   ![2021-09-08\_17-24.png](/_astro/2021-09-08_17-24.Dosyx1mb_14B7GX.webp)

4. Attach policies

   a. **SecurityAudit** Policy

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.33.46\_PM.png](/_astro/Screenshot_2021-02-25_at_1.33.46_PM.BcnkNl4a_Z1MWRvE.webp)

   b. **CloudWatchReadOnlyAccess** Policy

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_2.42.11\_PM.png](/_astro/Screenshot_2021-02-25_at_2.42.11_PM.CHmR72gn_15hJyR.webp)

   c. Proceed to Next Steps

5. Add tags if needed

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.37.36\_PM.png](/_astro/Screenshot_2021-02-25_at_1.37.36_PM.CXA09kuf_Z1cMynA.webp)

6. Review

   1. **Role name:** `${business_name}_last9_role`
   2. **Role description**: Security Audit Access to Last9
   3. **Verify Last9 AWS Account Number**
   4. **Verify Granted Policy**
   5. **Create Role**

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/Screenshot\_2021-02-25\_at\_1.44.40\_PM.png](/_astro/Screenshot_2021-02-25_at_1.44.40_PM.CMvZPgze_Zx5Se6.webp)

7. After the role is created, go to Role → Trust Relationships → Edit Trust Relationship

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/2021-06-08\_22-08.png](/_astro/2021-06-08_22-08.DXphoq4l_Z1u5bk3.webp)

8. Update the JSON to the following and click “**Update Trust Policy**”. Ensure that the value for `sts:ExternalId` matches the value set earlier for the External ID.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::652845092827:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "somerandomstring"
        }
      }
    }
  ]
}
```

9. Edit the role and update “**Maximum session duration**” to 3 hours if your security policy permits it; otherwise, leave it at 1 hour.

   ![../../../../assets/docs/tutorials/how-to-create-aws-sts-role/2021-02-25\_14-14.png](/_astro/2021-02-25_14-14.sXXIC7_L_yJrWM.webp)

10. Share the created role ARN and external ID string with your Last9 point of contact

# Enable EC2 Service Discovery with vmagent

> This tutorial walks through setting up service discovery for EC2 instances with vmagent.

## Service Discovery

In the context of monitoring, Service Discovery refers to automatically detecting devices, services, or systems in a network that need to be monitored. Service discovery is especially significant in cloud environments that use auto-scaling EC2 instances. These environments often have instances that change rapidly, making manual tracking infeasible from a monitoring point of view.

This document lists steps to enable service discovery of EC2 instances so new instances can be monitored as they are created and decommissioned instances can be removed from monitoring, preventing false alerts. This document assumes that the EC2 instance service discovery will be set up for vmagent to send metrics to Last9 via Remote Write.

Given that vmagent is successfully running on an EC2 Instance, we need to make provisions for vmagent to discover other EC2 instances, that is, scrape targets based on [ec2\_sd\_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config).
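Once the roles described in the following steps are in place, you can optionally sanity-check the setup from the vmagent host with the AWS CLI before touching vmagent itself — a sketch, assuming the AWS CLI is installed; the account ID, region, and tag filter are illustrative placeholders matching the examples used later in this tutorial:

```bash
# Assume the discovery role (created below) from the vmagent EC2 host.
aws sts assume-role \
  --role-arn "arn:aws:iam::AWS_ACCOUNT_ID:role/vmagent-sd-role" \
  --role-session-name "vmagent-sd-check"

# After exporting the temporary credentials returned above, confirm that
# instances are discoverable with the same tag filter vmagent will use:
aws ec2 describe-instances --region ap-south-1 \
  --filters "Name=tag:namespace,Values=node-exporter" \
  --query "Reservations[].Instances[].PrivateIpAddress"
```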
## Create `ec2-trustee` IAM role with assume role policy

Go to AWS Console → IAM → Roles → Create Role

* Select Trusted Entity

  ![Select Trusted Entity](/_astro/select-trusted-entity.td_mZcGH_Z2n6ux3.webp)

* Do NOT add any permissions and click next

  ![Add Permissions](/_astro/add-permissions.CxGtOvgG_13dnD5.webp)

* Name, Review and Create

  ![Create IAM Role Step 1](/_astro/create-iam-role-step-1.yVMOnhyv_Z1GupeP.webp)

  ![Create IAM Role Step 2](/_astro/create-iam-role-step-2.eFuXGCwV_1f3cp5.webp)

## Attach `ec2-trustee` IAM role to vmagent EC2 Host

EC2 Instances > Select vmagent Instance > Actions > Instance Settings

* Modify IAM Role

  ![Steps to update IAM role](/_astro/select-vmagent-ec2-instance.DhZA54AY_4XLH.webp)

* Select `ec2-trustee` IAM role and Update

  ![Modify IAM Role](/_astro/modify-iam-role.2sPhH7q7_1fFDEg.webp)

## Create `vmagent-sd-role` IAM role

Go to AWS Console → IAM → Roles → Create Role

* Select Trusted Entity > Custom Trust Policy with the below trust policy

  ![Custom Trust Policy](/_astro/custom-trust-policy.CseaBt17_Z1diXsJ.webp)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWS_ACCOUNT_ID:role/ec2-trustee"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

* Add Permissions

  1. Create Custom Policy with the below policy

  ```json
  {
    "Statement": [
      {
        "Action": "ec2:Describe*",
        "Effect": "Allow",
        "Resource": "*"
      }
    ],
    "Version": "2012-10-17"
  }
  ```

  2. Select Policy and click next

     ![Select Policy Step 1](/_astro/select-policy-step-1.BbUzxJ2v_Z25QfGa.webp)

     ![Select Policy Step 2](/_astro/select-policy-step-2.Ck7r4d89_6LlOg.webp)

  3. Name, Review and Create

* Add `vmagent-sd-role` as the name of the role, review permissions and trusted entities, and create the role

  ![Add vmagent role](/_astro/add-vmagent-sd-role.DvPfQLED_1kjztm.webp)

## Use the `vmagent-sd-role` ARN in vmagent configuration

Update the `scrape_configs` stanza in your `vmagent.yaml` with the `ec2_sd_configs` stanza as follows and restart vmagent.

```yaml
# vmagent.yaml
# Check https://prometheus.io/docs/prometheus/latest/configuration/configuration for more details
scrape_configs:
  - job_name: "node-exporter-sd"
    ec2_sd_configs:
      - region: ap-south-1
        role_arn: "__role_arn_with_ec2_read_access__"
        filters:
          - name: tag:namespace
            values:
              - node-exporter
        port: 9100
```

This will discover new EC2 instances automatically using the Service Discovery mechanism, and their metrics will be sent to Last9 from vmagent.

# Scrape selective metrics in Prometheus

> Recipe to only scrape selective metrics in Prometheus to reduce cardinality

Prometheus provides the ability to filter specific metrics post-scraping, via the `metric_relabel_configs` stanza. We will use the `node_exporter` Prometheus exporter as an example, and send only metrics matching the regex `node_cpu.*`.

## Prometheus scrape config without filtering

```yaml
global:
  scrape_interval: 1m

scrape_configs:
  - job_name: 'node-exporter-01'
    static_configs:
      - targets: [ 'localhost:9100' ]
```

This scrapes and stores all node exporter metrics.

## Prometheus scrape config with filtering

```yaml
global:
  scrape_interval: 1m

scrape_configs:
  - job_name: 'node-exporter-01'
    static_configs:
      - targets: [ 'localhost:9100' ]
    metric_relabel_configs:
      - source_labels: [__name__]
        action: keep
        regex: 'node_cpu.*'
```

This will scrape all metrics, but drop anything that does not match the `regex` entry. Note that Prometheus anchors relabel regexes, so `node_cpu.*` keeps every metric whose name starts with `node_cpu`.
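To confirm the filter took effect after reloading Prometheus, you can, for example, list the metric names that remain — a quick check assuming Prometheus runs locally on the default port and `jq` is available:

```bash
# Only node_cpu* series should remain for this job after the keep rule.
curl -s 'http://localhost:9090/api/v1/label/__name__/values' \
  | jq -r '.data[]' | grep '^node_cpu'
```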
# Scrape selective kube state metrics > This document describes how to scrape selective kube state metrics ## Obtain the list of kube state metrics from your Last9 cluster ```json curl -XGET 'https://:@read-app-tsdb.last9.io/hot/v1/metrics//sender//api/v1/label/__name__/values' | jq | grep -e "kube_" "kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total", "kube_apiserver_pod_logs_pods_logs_insecure_backend_total", "kube_certificatesigningrequest_annotations", "kube_certificatesigningrequest_cert_length", "kube_certificatesigningrequest_condition", "kube_certificatesigningrequest_created", "kube_certificatesigningrequest_labels", "kube_configmap_annotations", "kube_configmap_created", "kube_configmap_info", "kube_configmap_labels", "kube_configmap_metadata_resource_version", "kube_daemonset_annotations", "kube_daemonset_created", "kube_daemonset_labels", "kube_daemonset_metadata_generation", "kube_daemonset_status_current_number_scheduled", "kube_daemonset_status_desired_number_scheduled", "kube_daemonset_status_number_available", "kube_daemonset_status_number_misscheduled", "kube_daemonset_status_number_ready", "kube_daemonset_status_number_unavailable", "kube_daemonset_status_observed_generation", "kube_daemonset_status_updated_number_scheduled", "kube_deployment_annotations", "kube_deployment_created", "kube_deployment_labels", "kube_deployment_metadata_generation", "kube_deployment_spec_paused", "kube_deployment_spec_replicas", "kube_deployment_spec_strategy_rollingupdate_max_surge", "kube_deployment_spec_strategy_rollingupdate_max_unavailable", "kube_deployment_status_condition", "kube_deployment_status_observed_generation", "kube_deployment_status_replicas", "kube_deployment_status_replicas_available", "kube_deployment_status_replicas_ready", "kube_deployment_status_replicas_unavailable", "kube_deployment_status_replicas_updated", "kube_endpoint_address", "kube_endpoint_address_available", "kube_endpoint_address_not_ready", "kube_endpoint_annotations", "kube_endpoint_created", "kube_endpoint_info", "kube_endpoint_labels", "kube_endpoint_ports", "kube_ingress_annotations", "kube_ingress_created", "kube_ingress_info", "kube_ingress_labels", "kube_ingress_metadata_resource_version", "kube_ingress_path", "kube_job_annotations", "kube_job_complete", "kube_job_created", "kube_job_info", "kube_job_labels", "kube_job_owner", "kube_job_spec_completions", "kube_job_spec_parallelism", "kube_job_status_active", "kube_job_status_completion_time", "kube_job_status_failed", "kube_job_status_start_time", "kube_job_status_succeeded", "kube_lease_owner", "kube_lease_renew_time", "kube_mutatingwebhookconfiguration_created", "kube_mutatingwebhookconfiguration_info", "kube_mutatingwebhookconfiguration_metadata_resource_version", "kube_namespace_annotations", "kube_namespace_created", "kube_namespace_labels", "kube_namespace_status_phase", "kube_node_annotations", "kube_node_created", "kube_node_deletion_timestamp", "kube_node_info", "kube_node_labels", "kube_node_spec_taint", "kube_node_spec_unschedulable", "kube_node_status_allocatable", "kube_node_status_capacity", "kube_node_status_condition", "kube_persistentvolume_annotations", "kube_persistentvolume_capacity_bytes", "kube_persistentvolume_claim_ref", "kube_persistentvolume_created", "kube_persistentvolume_info", "kube_persistentvolume_labels", "kube_persistentvolume_status_phase", "kube_persistentvolumeclaim_access_mode", "kube_persistentvolumeclaim_annotations", "kube_persistentvolumeclaim_created", "kube_persistentvolumeclaim_info", 
"kube_persistentvolumeclaim_labels", "kube_persistentvolumeclaim_resource_requests_storage_bytes", "kube_persistentvolumeclaim_status_phase", "kube_pod_annotations", "kube_pod_completion_time", "kube_pod_container_info", "kube_pod_container_resource_limits", "kube_pod_container_resource_requests", "kube_pod_container_state_started", "kube_pod_container_status_last_terminated_exitcode", "kube_pod_container_status_last_terminated_reason", "kube_pod_container_status_ready", "kube_pod_container_status_restarts_total", "kube_pod_container_status_running", "kube_pod_container_status_terminated", "kube_pod_container_status_terminated_reason", "kube_pod_container_status_waiting", "kube_pod_container_status_waiting_reason", "kube_pod_created", "kube_pod_deletion_timestamp", "kube_pod_info", "kube_pod_init_container_info", "kube_pod_init_container_status_ready", "kube_pod_init_container_status_restarts_total", "kube_pod_init_container_status_running", "kube_pod_init_container_status_terminated", "kube_pod_init_container_status_terminated_reason", "kube_pod_init_container_status_waiting", "kube_pod_init_container_status_waiting_reason", "kube_pod_ips", "kube_pod_labels", "kube_pod_owner", "kube_pod_restart_policy", "kube_pod_spec_volumes_persistentvolumeclaims_info", "kube_pod_spec_volumes_persistentvolumeclaims_readonly", "kube_pod_start_time", "kube_pod_status_container_ready_time", "kube_pod_status_phase", "kube_pod_status_qos_class", "kube_pod_status_ready", "kube_pod_status_ready_time", "kube_pod_status_reason", "kube_pod_status_scheduled", "kube_pod_status_scheduled_time", "kube_pod_status_unschedulable", "kube_pod_tolerations", "kube_poddisruptionbudget_annotations", "kube_poddisruptionbudget_created", "kube_poddisruptionbudget_labels", "kube_poddisruptionbudget_status_current_healthy", "kube_poddisruptionbudget_status_desired_healthy", "kube_poddisruptionbudget_status_expected_pods", "kube_poddisruptionbudget_status_observed_generation", "kube_poddisruptionbudget_status_pod_disruptions_allowed", "kube_replicaset_annotations", "kube_replicaset_created", "kube_replicaset_labels", "kube_replicaset_metadata_generation", "kube_replicaset_owner", "kube_replicaset_spec_replicas", "kube_replicaset_status_fully_labeled_replicas", "kube_replicaset_status_observed_generation", "kube_replicaset_status_ready_replicas", "kube_replicaset_status_replicas", "kube_secret_annotations", "kube_secret_created", "kube_secret_info", "kube_secret_labels", "kube_secret_metadata_resource_version", "kube_secret_type", "kube_service_annotations", "kube_service_created", "kube_service_info", "kube_service_labels", "kube_service_spec_type", "kube_service_status_load_balancer_ingress", "kube_storageclass_annotations", "kube_storageclass_created", "kube_storageclass_info", "kube_storageclass_labels", "kube_validatingwebhookconfiguration_created", "kube_validatingwebhookconfiguration_info", "kube_validatingwebhookconfiguration_metadata_resource_version", ``` ## Let’s decide to omit `kube_certificatesigningrequest_*` metrics 1. Prepare a list of metrics you want to omit. This needs to be a comma separated array of strings. ```json [ "kube_certificatesigningrequest_annotations", "kube_certificatesigningrequest_cert_length", "kube_certificatesigningrequest_condition", "kube_certificatesigningrequest_created", "kube_certificatesigningrequest_labels" ] ``` 2. Find the `metricDenylist` configuration in your Kube State Metrics Helm chart and append this list to that config. 
```yaml
metricDenylist: ["kube_certificatesigningrequest_annotations", "kube_certificatesigningrequest_cert_length", "kube_certificatesigningrequest_condition", "kube_certificatesigningrequest_created", "kube_certificatesigningrequest_labels"]
```

3. Now deploy your Kube State Metrics Helm chart as usual

4. Run the below command after the deployment to verify that it is in effect. You should not find any metrics with the `kube_certificatesigningrequest` prefix being emitted anymore

```bash
curl -XGET 'https://:@read-app-tsdb.last9.io/hot/v1/metrics//sender//api/v1/label/__name__/values' | jq | grep -e "kube_certificatesigningrequest_"
```

# Install VictoriaMetrics VMAgent on Ubuntu

> This tutorial walks through setting up VMAgent on Ubuntu as a standalone process that monitors itself.

[VMAgent](https://docs.victoriametrics.com/vmagent.html#vmagent) is a tiny agent which helps you collect metrics from various sources, relabel and filter the collected metrics, and store them in Prometheus-compatible remote storage such as Last9 using the [Prometheus remote write](https://last9.io/blog/what-is-prometheus-remote-write/) protocol. In this tutorial, we’ll cover how to install VMAgent on your Ubuntu server, manage the VMAgent process, and set up scrape configs to collect metrics from VMAgent itself.

## Prerequisites

Before you begin this guide, you should have a regular, non-root user with sudo privileges configured on your server, as well as basic dependencies such as `wget`. Create a Last9 cluster by following the [Quick Start Guide](/docs/onboard/). Keep the following information handy after creating the cluster:

* `$levitate_remote_write_url` - Last9 cluster’s Remote write endpoint
* `$levitate_remote_write_username` - Cluster ID
* `$levitate_remote_write_password` - Write token created for the cluster

## Download & extract VMAgent

VMAgent is available as part of the VictoriaMetrics repository’s [latest releases](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest). First obtain the required binary.

```bash
$ ARCH=$(uname -m)
case $ARCH in
  x86_64 | amd64) ARCH="amd64" ;;
  aarch64 | arm64) ARCH="arm64" ;;
  *)
    echo "Unsupported architecture: $ARCH"
    exit 1
    ;;
esac
$ wget -O /var/tmp/vmutils.tar.gz "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.96.0/vmutils-linux-${ARCH}-v1.96.0.tar.gz"
```

Extract VMAgent from the compressed file, i.e. `/var/tmp/vmutils.tar.gz`:

```bash
$ sudo mkdir -p /opt/vmagent
$ tar -xzf /var/tmp/vmutils.tar.gz -C /opt/vmagent
```

## Setup VMAgent scrape config

Create a scrape config in the same directory as the VMAgent binary. A minimal sketch of such a config, assuming VMAgent's default listen port of 8429, makes VMAgent scrape its own metrics endpoint:

```bash
$ sudo tee /opt/vmagent/vmagent.yaml > /dev/null <<EOF
# Scrape VMAgent's own /metrics endpoint (default listen port 8429)
scrape_configs:
  - job_name: "vmagent"
    static_configs:
      - targets: ["localhost:8429"]
EOF
```

## Run VMAgent as a systemd service

Create a systemd unit file at `/etc/systemd/system/vmagent.service` whose `ExecStart` runs the VMAgent binary with `-promscrape.config=/opt/vmagent/vmagent.yaml` and the `-remoteWrite.url`, `-remoteWrite.basicAuth.username`, and `-remoteWrite.basicAuth.password` flags set to `$levitate_remote_write_url`, `$levitate_remote_write_username`, and `$levitate_remote_write_password` respectively. Then reload systemd and start the service with `sudo systemctl daemon-reload && sudo systemctl enable --now vmagent`.

# Setup vmoperator with only vmagent and vmservicescrape

> Step by step guide on how to setup vmoperator with only vmagent and vmservicescrape to scrape your Kubernetes svcs and remote write metrics to Last9

The [vmoperator](https://github.com/VictoriaMetrics/operator) streamlines the deployment and management of vmagent on Kubernetes, optimizing for ease of use while retaining native configuration options inherent to Kubernetes environments. It achieves this by introducing various custom resource definitions (CRDs) into the Kubernetes ecosystem, and sends metrics seamlessly to Last9.

**Custom Resource Definitions (CRDs)**

* vmagent
* vmnodescrapes
* vmservicescrapes
* vmprobes
* vmpodscrapes
* vmstaticscrapes

These CRDs empower users to effortlessly create and manage vmagent instances along with scrape configurations like VMServiceScrape, VMPodScrape, etc. These configurations closely resemble the [Prometheus Operator’s](https://prometheus-operator.dev/) ServiceMonitor and PodMonitor. This eliminates the need for manual setup of vmagent deployment, image configuration, and other intricate details.
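For orientation, this is roughly the shape of such a CR — a minimal sketch; the label selector, namespace, and port are illustrative, and the repository’s `vmservicescrape.yaml` used later in this guide is the source of truth:

```bash
# A hypothetical VMServiceScrape that discovers Services labelled
# app: node-exporter in kube-system and scrapes their "metrics" port.
kubectl apply -n last9-monitoring -f - <<'EOF'
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: example-node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  namespaceSelector:
    matchNames: ["kube-system"]
  endpoints:
    - port: metrics
EOF
```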
## Prerequisites

Make sure you have the following prerequisites installed:

* [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl) - Kubernetes command-line tool
* Clone this repository to your local machine:

```bash
git clone https://github.com/last9/vmagent-operator-levitate.git
cd vmagent-operator-levitate
```

## Install Custom Resource Definitions (CRDs)

1. Navigate to the `crds/` directory:

```bash
cd ./crds/
```

2. Install the CRDs using `kubectl`:

```bash
$ kubectl apply -f ./crd.yaml
```

3. Verify that the CRDs are successfully installed:

```bash
$ kubectl get crd --sort-by=.metadata.creationTimestamp

# Output
NAME                                            CREATED AT
vmagents.operator.victoriametrics.com           2023-12-26T12:02:48Z
vmnodescrapes.operator.victoriametrics.com      2023-12-26T12:02:49Z
vmservicescrapes.operator.victoriametrics.com   2023-12-26T12:02:50Z
vmprobes.operator.victoriametrics.com           2023-12-26T12:02:50Z
vmpodscrapes.operator.victoriametrics.com       2023-12-26T12:02:50Z
vmstaticscrapes.operator.victoriametrics.com    2023-12-26T12:02:51Z
```

You should see the names of the installed CRDs in the output.

## Install vmoperator

1. Navigate to the `operator/` directory:

```bash
cd ./operator/
```

2. Install the operator and RBAC using `kubectl`:

```bash
$ kubectl apply -f ./manager.yaml -f rbac.yaml
```

3. Verify the status of the operator:

```bash
$ kubectl get pods -n monitoring-system

# Output
NAME                           READY   STATUS    RESTARTS   AGE
vm-operator-667dfbff55-cbvkf   1/1     Running   0          101s
```

You should see the vmoperator pod in Running status.

## Install vmagent

1. Navigate to the `vmagent/` directory:

```bash
cd ./vmagent/
```

2. You will need to obtain your Last9 cluster’s Remote Write URL and its credentials. [Here](/docs/onboard/) is a quick way to create your cluster and obtain your credentials. Run the below command to list all the placeholder values in this file [vmagent.yaml](https://github.com/last9/vmagent-operator-levitate/blob/main/vmagent/vmagent.yaml)

```bash
$ cat ./vmagent.yaml | grep -n "Todo"

# Output
11: levitate_cluster_username: "" # Todo: append levitate cluster username
12: levitate_cluster_password: "" # Todo: append levitate cluster password
31: via_cluster: # Todo: add a relevant cluster name. e.g: k8s cluster name
33: - url: # Todo: append levitate remote write URL
55: storage: 20Gi # Todo: Default is 20Gi. Scale up after you have provisioned more if you need more
58:# Todo: Below configs need to be enabled depending upon your affinity towards nodegroups.
59:# Todo: Ensure that the below selector terms and tolerations are exactly same as the metadata of the nodegroups itself.
```

3. Proceed to installation once you have replaced the placeholder values with actual values

4. Install vmagent in the `last9-monitoring` namespace using `kubectl`:

```bash
$ kubectl apply -f ./vmagent.yaml -n last9-monitoring
```
5. Verify the status of the vmagent and ensure that it’s running:

```bash
$ kubectl get pods -n last9-monitoring -l "last9_monitoring_agent=vmagent"

# Output
NAME                            READY   STATUS    RESTARTS   AGE
vmagent-demo-6785f7d7b9-zpbv6   2/2     Running   0          72s
```

## Install VMServiceScrape (i.e. ServiceMonitor and PodMonitor)

Navigate to the `vmservicescrape/` directory:

```bash
cd ./vmservicescrape/
```

**Caveats**

VMServiceScrape works similarly to Prometheus Operator’s ServiceMonitor and PodMonitor, where you can define scrape selectors to do service and pod discovery. In this file [vmservicescrape.yaml](https://github.com/last9/vmagent-operator-levitate/blob/main/vmservicescrape/vmservicescrape.yaml) you can override default scrape selectors to suit your requirements. By default, this assumes the default labels that are applied to the exporters as part of their installation.

Below is a generic command to find the labels of your K8s Services.

```bash
$ kubectl get services -n -o jsonpath='{.metadata.labels}'
```

Once you have inspected the labels for the services that you chose to scrape, you can then proceed to modify this file [vmservicescrape.yaml](https://github.com/last9/vmagent-operator-levitate/blob/main/vmservicescrape/vmservicescrape.yaml) and match the scrape selector labels with the labels of your services. Another caveat to note here is to ensure that the namespaces are also declared correctly for the scrape selectors to correctly perform service discovery.

This file also includes scrape configs for common exporters such as Kafka, Redis, Node, RabbitMQ, Prometheus Pushgateway, etc. Run this command to list all the `Todo` comments which will guide you to customize this file [vmservicescrape.yaml](https://github.com/last9/vmagent-operator-levitate/blob/main/vmservicescrape/vmservicescrape.yaml) as required.
```bash
$ cat ./vmservicescrape.yaml | grep -n "Todo"

# Output
48: matchNames: [ "kube-system" ] # Todo: append namespaces here
63: matchNames: [ "kube-system" ] # Todo: append namespaces here
86:# Todo: Uncomment this if you have custom application enabled svcs running and you want to scrape them
97:# matchNames: [ ] # Todo: Append more namespaces here
105:# app: "" # Todo: Append app name label here
107:# Todo: Uncomment this if you have node exporters svcs running and you want to scrape them
118:# matchNames: [ ] # Todo: Append more namespaces here
129:# Todo: Uncomment this if you have rabbitMQ exporters svcs running and you want to scrape them
140:# matchNames: [ ] # Todo: Append more namespaces here
151:# Todo: Uncomment this if you have kafka exporters svcs running and you want to scrape them
162:# matchNames: [ ] # Todo: Append more namespaces here
173:# Todo: Uncomment this if you have redis exporters svcs running and you want to scrape them
184:# matchNames: [ ] # Todo: Append more namespaces here
195:# Todo: Uncomment this if you have pushgateway svcs running and you want to scrape them
206:# matchNames: [ ] # Todo: Append more namespaces here
```

Install VMServiceScrape using `kubectl`:

```bash
$ kubectl apply -f ./vmservicescrape.yaml -n last9-monitoring

# Output
vmservicescrape.operator.victoriametrics.com/last9-vmservicescrape-vmagent-01 created
vmservicescrape.operator.victoriametrics.com/last9-servicescrape-k8s-01 created
vmservicescrape.operator.victoriametrics.com/last9-servicescrape-metrics-server-01 created
```

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Tutorials

> Find best practices recipes and tutorials for monitoring and observability curated by the Last9 team.

1. [Calculate usage patterns and data volume in Prometheus](/docs/how-to-calculate-usage-patterns-and-data-volume-in-prometheus/)
2. [Scrape selective metrics in Prometheus](/docs/how-to-scrape-only-selective-metrics-in-prometheus/)
3. [Using OpenTelemetry Exporter for Prometheus Remote Write](/docs/using-open-telemetry-exporter-for-prometheus-remote-write/)
4. [Scrape selective kube state metrics](/docs/how-to-scrape-selective-kube-state-metrics/)
5. [Setting up Docker and Docker Compose](/docs/setting-up-docker-and-docker-compose-on-linux/)
6. [Install VictoriaMetrics VMAgent on Ubuntu](/docs/how-to-setup-vmagent-on-ubuntu/)
7. [Create a GCP service account with read-only access for monitoring](/docs/create-gcp-service-account-with-read-only-access/)
8. [Setup Kubernetes monitoring using kube-state-metrics(KSM) and Prometheus](/docs/ingest-kubernetes-metrics-via-prometheus/)
9. [Setup vmoperator with only vmagent and vmservicescrape](/docs/how-to-setup-vmoperator-with-vmagent-in-kubernetes/)
10. [Enable EC2 Service Discovery with vmagent](/docs/how-to-enable-ec2-service-discovery-with-vmagent/)
11. [Create AWS STS (Security Token Service) Role](/docs/how-to-create-aws-sts-role/)
12. [Monitor RabbitMQ using Last9](/docs/integrations-rabbitmq/)
13. [Delegate Subdomain between two AWS Accounts using Route 53](/docs/delegate-subdomain-between-aws-accounts-using-route-53/)

# Querying Last9 using HTTP API

> How to query metrics from Last9 using HTTP API

This step-by-step guide explains how to query Last9 using the HTTP API.

## Last9 Read URL

Create a Last9 cluster by following [Getting Started](/docs/onboard/).
Each Last9 Cluster comes with a Read URL which needs to be used when querying metrics data from Last9.

![Last9 Read Data Settings](/_astro/levitate-cluster-read-data-settings-tab.BoWiHQqm_26tp2e.webp)

You can grab the Read URL by going to the Last9 Cluster → Settings → Read Data → Bring Your Own Visualization. Keep the following information handy after creating the Last9 cluster:

* `$levitate_read_url` - Last9’s Read endpoint
* `$levitate_username` - Cluster ID
* `$levitate_password` - Read token created for the cluster

## Authentication

Generate the authorization header for authenticated access to the Last9 metrics API as follows.

```bash
USERNAME="$levitate_username"
PASSWORD="$levitate_password"
BASIC_AUTH_HEADER=$(echo -n "$USERNAME:$PASSWORD" | base64)
AUTH_HEADER="Authorization: Basic $BASIC_AUTH_HEADER"
```

## Instant Query

The simplest way to query Last9 is to use `$levitate_read_url` along with `$AUTH_HEADER`, which allows you to execute an instant query.

```bash
curl -XPOST "$levitate_read_url/api/v1/query?query=" -H "$AUTH_HEADER"
```

**Example**: Query the current CPU usage:

```bash
curl -XPOST "$levitate_read_url/api/v1/query?query=node_cpu_seconds_total{}" -H "$AUTH_HEADER"
```

## Range Query

Use the `query_range` endpoint to retrieve data over a time range. You need to specify the start time, end time, and step duration.

```bash
curl -XPOST "$levitate_read_url/api/v1/query_range?query=&start=&end=&step=" -H "$AUTH_HEADER"
```

**Example**: Query CPU usage over the last hour with a 1-minute step:

```bash
curl -XPOST "$levitate_read_url/api/v1/query_range?query=node_cpu_seconds_total&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60" -H "$AUTH_HEADER"
```

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Indicators

> Overview of Indicators

## Indicator Overview

Indicators are **PromQL queries** saved as a part of an Alert Group. Alert Rules evaluate these Indicators against thresholds to generate alerts.

## Creating an Indicator

Before you configure an Alert Rule, you need at least one Indicator to be created in the Alert Group. To create an Indicator:

1. Navigate to the Alert Group in which you would like to create the Indicator: **Home** → **Alert Studio** → **Alert Groups** → *Select an Alert Group* → **Indicators** Tab and click on **Create New Indicator**

   ![Creating An Indicator 1](/_astro/indicators-1.BQPXx2Dd_ZLqRTL.webp)

   ![Creating An Indicator 2](/_astro/indicators-2.h2Hoksl__ZLyE9I.webp)

2. The following details are required for an Indicator:

   1. **Indicator Name**: Use a descriptive name so that the Indicator is easily identified
   2. **Indicator Description** *(Optional)*: Helps your team members identify the purpose of the Indicator
   3. **Query**: The PromQL query for the indicator. This can be a query that returns multiple timeseries (as seen in the example below) but cannot contain any variables (for example, `$instance`)
   4. **Unit**: The unit you want to assign to the Indicator
   5. **Data Source** *(Advanced)*: By default, Indicators inherit the data source from their Alert Group. That is, if you change the data source for the Alert Group, the same data source will be used for the Indicator. You can also override this behavior and assign a different data source for the Indicator.
      Once you override the Indicator’s data source, the configured data source will take precedence over the data source configured at the Alert Group level.

   ![Creating An Indicator 3](/_astro/indicators-3.CYTYmEGI_ZdIC1u.webp)

   After entering the query, you will need to validate it to ensure that it has no syntax errors. If the query is validated successfully, a preview will be generated.

   ![Creating An Indicator 4](/_astro/indicators-4.CZGiMkVY_Z24TTtB.webp)

   Click **Create Indicator** to save this Indicator.

   ![Creating An Indicator 5](/_astro/indicators-5.BeHzA9pr_Z1ysf1z.webp)

3. This Indicator is now ready to be used in Alert Rules. To edit, duplicate, or delete this Indicator, you can click the **…** button

   ![Creating An Indicator 6](/_astro/indicators-6.CsSFHl7o_25XiK1.webp)

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Setup Kubernetes monitoring using kube-state-metrics (KSM) and Prometheus Agent

> Step by step guide to enable ingesting Kubernetes metrics via Prometheus Agent and send to Last9 via remote write.

## Pre-requisites

1. Ensure that your [kubectl configuration](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is pointing to the right Kubernetes cluster
2. Create a Last9 cluster by following the [Quick start guide](/docs/onboard/)

## What is kube-state-metrics (KSM)

`kube-state-metrics` (KSM) is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects. It is not focused on the health of the individual Kubernetes components, but on the health of the various objects inside, such as deployments, nodes, and pods.

The metrics are exported by default on the HTTP endpoint `/metrics` on port 8080. They are served as plaintext. They are designed to be consumed either by Prometheus itself or by a scraper compatible with scraping a Prometheus client endpoint. You can also open `/metrics` in a browser to see the raw metrics. Note that the metrics exposed on the `/metrics` endpoint reflect the current state of the Kubernetes cluster. When Kubernetes objects are deleted, they are no longer visible on the `/metrics` endpoint.

## Automated installation (Preferred)

### Step 1: Copy the installation command

### Step 2: Run the installation command

Before running the command, update it to use the write token of the Last9 cluster. Running the command will download the manifest yaml in the current working directory. It is strongly recommended that you check the manifest file into git so that it can be extended later. You can also follow the video to see the end-to-end setup.

## Manual Installation

1. Clone the GitHub repo

```shell
git clone https://github.com/kubernetes/kube-state-metrics.git
```

2. Deployment steps

   To deploy this project, you can simply run `kubectl apply -f examples/standard`, and a Kubernetes service and deployment will be created.

```shell
kubectl apply -f examples/standard
```

Read more about deployment [here](https://github.com/kubernetes/kube-state-metrics?tab=readme-ov-file#kubernetes-deployment).

3. Validate the corresponding deployment

```shell
kubectl get deployments kube-state-metrics -n kube-system
```

This is the sample output that you should see.
```shell
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
kube-state-metrics   1/1     1            1           6d1h
```

## Configure remote write to Last9

If you already have a running Prometheus setup, add the following scrape configs and remote write setup to your Prometheus config file to send data to Last9.

```yaml
# prometheus.yaml
scrape_configs:
  - job_name: "node-exporter"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: "node-exporter"
        action: keep
  - job_name: "kubernetes-apiservers"
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
  - job_name: "kubernetes-nodes"
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
  - job_name: "kubernetes-cadvisor"
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  - job_name: "kubernetes-service-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

remote_write:
  - url:
    remote_timeout: 60s
    queue_config:
      capacity: 10000
      max_samples_per_send: 3000
      batch_send_deadline: 20s
      min_shards: 4
      max_shards: 200
      min_backoff: 100ms
      max_backoff: 10s
    basic_auth:
      username:
      password:
```

* Replace the `cluster` variable in `external_labels` as per the description

```yaml
external_labels:
  # TODO - replace xyz.acme.io with a logical name for the cluster being scraped.
  # by Prometheus e.g. prod1.xyz.com
  cluster: "xyz.acme.io"
```

## Steps to uninstall the KSM setup

### For automated setup

To uninstall the Kubernetes resources that were created by the automated installation, you can use the `kubectl delete` command with the `-f` flag pointing to the same YAML file. This will delete all the resources defined in the file.

```bash
kubectl delete -f kube-state-metrics.yml
```

This command will remove the namespaces, deployments, services, service accounts, and any other resources defined in the `kube-state-metrics.yml` file.

### For manual setup

Delete the created kube-state-metrics objects.

```shell
kubectl delete -f examples/standard
```

# Ingestion Tokens

> Create and manage tokens for sending telemetry data to Last9, including RUM and Prometheus metrics.

![Control Plane — Ingestion Tokens](/_astro/control-plane-ingestion-tokens.QRukipwK_1I6TUf.webp)

[Ingestion Tokens](https://app.last9.io/control-plane/ingestion-tokens) authenticate your applications and services when sending telemetry data to Last9. These tokens control what data can be sent and from which origins, ensuring secure data collection.

## Creating Ingestion Tokens

![Control Plane — New Ingestion Tokens](/_astro/control-plane-ingestion-tokens-create.Dc682ue6_Z1J1Upf.webp)

1. **Select Token Type**:
   * **Client**: For client-based data collection (e.g., RUM)
   * **Prometheus Remote-Write**: For server-side Prometheus-compatible metrics
2. **Configure Client Type** (if Client selected):
   * **Web Browser**: Default and currently available option for web applications
   * **Mobile**: Coming soon for mobile app monitoring
3. **Set Origins** (for Client tokens):
   * Add the domains from which your application will send data
   * Example: `https://www.example.com`
   * Multiple origins can be added for multi-domain applications
   * **Wildcard subdomains are supported** using the `*.` prefix (e.g., `https://*.example.com`)

Caution: Data sent from origins not listed here will be rejected.

```plaintext
✅ Correct Origins:
https://app.example.com
https://www.example.com
http://localhost:3000
https://*.example.com (matches any single subdomain)
https://*.apps.example.com (matches subdomains of apps.example.com)
https://app.*.example.com (matches app.{any}.example.com)

❌ Incorrect Origins:
example.com (missing protocol)
app.example.com (missing protocol)
https://example.* (TLD must be explicit, e.g., .com, .io)
```

### Wildcard Origins

Wildcard origins allow you to match multiple subdomains with a single entry.
This is useful when:

* Your application serves multiple customer subdomains (e.g., `customer1.example.com`, `customer2.example.com`)
* You have dynamic environments that create subdomains on the fly
* You want to simplify token management across many subdomains

**Wildcard Rules:**

| Pattern                      | Matches                                                            | Does Not Match                                       |
| ---------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------- |
| `https://*.example.com`      | `https://app.example.com`, `https://www.example.com`               | `https://example.com`, `https://sub.app.example.com` |
| `https://*.apps.example.com` | `https://prod.apps.example.com`, `https://dev.apps.example.com`    | `https://apps.example.com`                           |
| `https://app.*.example.com`  | `https://app.staging.example.com`, `https://app.prod.example.com`  | `https://app.example.com`                            |

4. **Name Your Token**:
   * Use a descriptive name to identify the token’s purpose
   * Example: “Production RUM Token” or “Staging Web Monitoring”
5. Click **CREATE TOKEN** to generate your token

***

## Client Token Security

### Is the Client Token Safe to Expose?

**Yes.** Client tokens are specifically designed to be visible in your frontend code and are safe to include in your application bundle.

#### Why Client Tokens Are Safe

* **Write-Only Permissions**: Client tokens can only send telemetry data to Last9. They cannot read, query, or access any data from your account.
* **CORS-Restricted**: Tokens are restricted to the origins (domains) you configure above, preventing unauthorized websites from using your token.
* **No Account Access**: Client tokens cannot access your Last9 account settings, billing information, or any other sensitive data.
* **Designed for Public Use**: This security model is standard across the industry for frontend monitoring.

#### Why Not Use API Keys?

API keys have full read/write access to your entire Last9 account and should **never** be exposed in frontend code. Client tokens exist specifically to provide a safe way to send telemetry from browsers.

### Security Best Practices

While client tokens are safe to expose, follow these practices to minimize abuse:

1. **Rotate Tokens Periodically**: Rotate your client tokens every 90 days as a preventive measure.
2. **Restrict CORS Origins**: Configure allowed origins (as shown above) to only accept data from your domains.
3. **Use Different Tokens Per Environment**: Use separate tokens for development, staging, and production.
4. **Monitor Usage**: Regularly review your usage dashboard to detect unusual patterns or potential abuse.

### Handling CI/CD Security Scanners

Your CI/CD pipeline security scanners may flag client tokens as secrets. This is a false positive. To resolve:

#### GitGuardian / GitHub Secret Scanning

Add an exception for client tokens:

```yaml
# .gitguardian.yml
paths-ignore:
  - "**/*.html"
  - "**/rum-config.js"
allow-patterns:
  - "clientToken.*"
  - "L9RUM\\.init"
```

#### Inline Comment Exception

Add security scanner directives:

```javascript
// gitguardian:ignore - Public client token with write-only permissions
// nosemgrep: generic.secrets.security.detected-generic-secret
const CLIENT_TOKEN = "your-token-here";
```

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.
# Integrations > Integrations Overview ## AI & LLMs [![Last9 MCP](https://cdn.simpleicons.org/modelcontextprotocol/fff)](/docs/integrations/mcp/) [Last9 MCP](/docs/integrations/mcp/) [![Claude Code](https://cdn.simpleicons.org/claude/fff)](/docs/integrations/claude-code/) [Claude Code](/docs/integrations/claude-code/) [![Claude Cowork](https://cdn.simpleicons.org/claude/fff)](/docs/integrations/cowork/) [Claude Cowork](/docs/integrations/cowork/) [![GPU Telemetry](https://cdn.simpleicons.org/nvidia/fff)![GPU Telemetry](https://cdn.simpleicons.org/amd/fff)![GPU Telemetry](https://cdn.simpleicons.org/intel/fff)](/docs/integrations/gpu-telemetry/) [GPU Telemetry](/docs/integrations/gpu-telemetry/) [![OpenCode](/docs/icons/opencode.svg)](/docs/integrations/opencode/) [OpenCode](/docs/integrations/opencode/) [![n8n (MCP)](https://cdn.simpleicons.org/n8n/fff)](/docs/integrations/n8n-mcp/) [n8n (MCP)](/docs/integrations/n8n-mcp/) [![n8n (OpenTelemetry)](https://cdn.simpleicons.org/n8n/fff)](/docs/integrations/n8n-otel/) [n8n (OpenTelemetry)](/docs/integrations/n8n-otel/) [![Python AI SDK](https://cdn.simpleicons.org/python/fff)](/docs/integrations/python-genai-sdk/) [Python AI SDK](/docs/integrations/python-genai-sdk/) [![TrueFoundry](https://cdn.simpleicons.org/python/fff)](/docs/integrations/truefoundry-ai-gateway/) [TrueFoundry](/docs/integrations/truefoundry-ai-gateway/) ## CI/CD [![ArgoCD](https://cdn.simpleicons.org/argo/fff)](/docs/integrations/ci-cd/argocd/) [ArgoCD](/docs/integrations/ci-cd/argocd/) [![Buildkite](https://cdn.simpleicons.org/buildkite/fff)](/docs/integrations/ci-cd/buildkite/) [Buildkite](/docs/integrations/ci-cd/buildkite/) [![GitHub Actions](https://cdn.simpleicons.org/githubactions/fff)](/docs/integrations/ci-cd/github-actions/) [GitHub Actions](/docs/integrations/ci-cd/github-actions/) [![Jenkins](https://cdn.simpleicons.org/jenkins/fff)](/docs/integrations/ci-cd/jenkins/) [Jenkins](/docs/integrations/ci-cd/jenkins/) ## Cloud Providers [![AWS Chalice](https://cdn.simpleicons.org/python/fff)](/docs/integrations/cloud-providers/aws-chalice/) [AWS Chalice](/docs/integrations/cloud-providers/aws-chalice/) [![AWS EC2](https://last9.github.io/images/aws-icons/Arch_Amazon-EC2_64.svg)](/docs/integrations/cloud-providers/aws-ec2/) [AWS EC2](/docs/integrations/cloud-providers/aws-ec2/) [![AWS Lambda](https://last9.github.io/images/aws-icons/Arch_AWS-Lambda_64.svg)](/docs/integrations/cloud-providers/aws-lambda/) [AWS Lambda](/docs/integrations/cloud-providers/aws-lambda/) [![Azure Monitor](/docs/icons/azure.svg)](/docs/integrations/cloud-providers/azure-monitor-metrics/) [Azure Monitor](/docs/integrations/cloud-providers/azure-monitor-metrics/) [![Azure Service Apps](/docs/icons/azure.svg)](/docs/integrations/cloud-providers/azure-service-apps/) [Azure Service Apps](/docs/integrations/cloud-providers/azure-service-apps/) [![Google Cloud Functions](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/cloud-providers/google-cloud-functions/) [Google Cloud Functions](/docs/integrations/cloud-providers/google-cloud-functions/) [![Google Compute Engine](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/cloud-providers/google-compute-engine/) [Google Compute Engine](/docs/integrations/cloud-providers/google-compute-engine/) ## Containers & Kubernetes [![AWS ECS](https://last9.github.io/images/aws-icons/Arch_Amazon-Elastic-Container-Service_64.svg)](/docs/integrations/containers-and-k8s/aws-ecs/) [AWS ECS](/docs/integrations/containers-and-k8s/aws-ecs/) 
[![Azure Container Apps](/docs/icons/azure.svg)](/docs/integrations/containers-and-k8s/azure-container-apps/) [Azure Container Apps](/docs/integrations/containers-and-k8s/azure-container-apps/) [![Docker](https://cdn.simpleicons.org/docker/fff)](/docs/integrations/containers-and-k8s/docker/) [Docker](/docs/integrations/containers-and-k8s/docker/) [![Kubernetes Cluster Monitoring](https://cdn.simpleicons.org/kubernetes/fff)](/docs/integrations/containers-and-k8s/kubernetes-cluster-monitoring/) [Kubernetes Cluster Monitoring](/docs/integrations/containers-and-k8s/kubernetes-cluster-monitoring/) [![Kubernetes Events](https://cdn.simpleicons.org/kubernetes/fff)](/docs/integrations/containers-and-k8s/kubernetes-events/) [Kubernetes Events](/docs/integrations/containers-and-k8s/kubernetes-events/) [![Kubernetes Logs](https://cdn.simpleicons.org/kubernetes/fff)](/docs/integrations/containers-and-k8s/kubernetes-logs/) [Kubernetes Logs](/docs/integrations/containers-and-k8s/kubernetes-logs/) [![Kubernetes Operator](https://cdn.simpleicons.org/kubernetes/fff)](/docs/integrations/containers-and-k8s/kubernetes-operator/) [Kubernetes Operator](/docs/integrations/containers-and-k8s/kubernetes-operator/) [![Kubernetes Sidecar for Logs](https://cdn.simpleicons.org/kubernetes/fff)](/docs/integrations/containers-and-k8s/kubernetes-sidecar-for-logs/) [Kubernetes Sidecar for Logs](/docs/integrations/containers-and-k8s/kubernetes-sidecar-for-logs/) [![Argo Rollouts](https://cdn.simpleicons.org/argo/fff)](/docs/integrations/containers-and-k8s/kubernetes/argo-rollouts/) [Argo Rollouts](/docs/integrations/containers-and-k8s/kubernetes/argo-rollouts/) ## Databases [![Aerospike](/docs/icons/database.svg)](/docs/integrations/aerospike/) [Aerospike](/docs/integrations/aerospike/) [![AWS DynamoDB](https://last9.github.io/images/aws-icons/Arch_Amazon-DynamoDB_64.svg)](/docs/integrations/databases/aws-dynamodb/) [AWS DynamoDB](/docs/integrations/databases/aws-dynamodb/) [![AWS RDS](https://last9.github.io/images/aws-icons/Arch_Amazon-RDS_64.svg)](/docs/integrations/databases/aws-rds/) [AWS RDS](/docs/integrations/databases/aws-rds/) [![AWS S3](https://last9.github.io/images/aws-icons/Arch_Amazon-Simple-Storage-Service_64.svg)](/docs/integrations/databases/aws-s3/) [AWS S3](/docs/integrations/databases/aws-s3/) [![Google Cloud Memorystore Memcached](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/databases/google-cloud-memorystore-memcached/) [Google Cloud Memorystore Memcached](/docs/integrations/databases/google-cloud-memorystore-memcached/) [![Google Cloud Memorystore Redis](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/databases/google-cloud-memorystore-redis/) [Google Cloud Memorystore Redis](/docs/integrations/databases/google-cloud-memorystore-redis/) [![Google Cloud SQL](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/databases/google-cloud-sql/) [Google Cloud SQL](/docs/integrations/databases/google-cloud-sql/) [![MariaDB](https://cdn.simpleicons.org/mariadb/fff)](/docs/integrations/databases/mariadb/) [MariaDB](/docs/integrations/databases/mariadb/) [![Microsoft SQL Server](/docs/icons/database.svg)](/docs/integrations/databases/microsoft-sql-server/) [Microsoft SQL Server](/docs/integrations/databases/microsoft-sql-server/) [![MongoDB](https://cdn.simpleicons.org/mongodb/fff)](/docs/integrations/databases/mongodb/) [MongoDB](/docs/integrations/databases/mongodb/) [![MongoDB 
Atlas](https://cdn.simpleicons.org/mongodb/fff)](/docs/integrations/databases/mongodb-atlas/) [MongoDB Atlas](/docs/integrations/databases/mongodb-atlas/) [![MySQL](https://cdn.simpleicons.org/mysql/fff)](/docs/integrations/databases/mysql/) [MySQL](/docs/integrations/databases/mysql/) [![Oracle](/docs/icons/database.svg)](/docs/integrations/databases/oracle/) [Oracle](/docs/integrations/databases/oracle/) [![PostgreSQL](https://cdn.simpleicons.org/postgresql/fff)](/docs/integrations/databases/postgresql/) [PostgreSQL](/docs/integrations/databases/postgresql/) [![Redis](https://cdn.simpleicons.org/redis/fff)](/docs/integrations/databases/redis/) [Redis](/docs/integrations/databases/redis/) [![Redis Cloud](https://cdn.simpleicons.org/redis/fff)](/docs/integrations/databases/redis-cloud/) [Redis Cloud](/docs/integrations/databases/redis-cloud/) ## Edge & Gateway [![Kong](https://cdn.simpleicons.org/kong/fff)](/docs/integrations/edge-and-gateway/kong/) [Kong](/docs/integrations/edge-and-gateway/kong/) ## Frameworks ### .NET [![ASP.NET Core](https://cdn.simpleicons.org/dotnet/fff)](/docs/integrations/frameworks/dotnet/aspnet-core/) [ASP.NET Core](/docs/integrations/frameworks/dotnet/aspnet-core/) [![ASP.NET on IIS](https://cdn.simpleicons.org/dotnet/fff)](/docs/integrations/frameworks/dotnet/aspnet-iis/) [ASP.NET on IIS](/docs/integrations/frameworks/dotnet/aspnet-iis/) ### Dart [![Flutter](https://cdn.simpleicons.org/flutter/fff)](/docs/integrations/frameworks/dart/flutter/) [Flutter](/docs/integrations/frameworks/dart/flutter/) ### Elixir [![Phoenix](https://cdn.simpleicons.org/phoenixframework/fff)](/docs/integrations/frameworks/elixir/phoenix/) [Phoenix](/docs/integrations/frameworks/elixir/phoenix/) ### Go [![AWS SDK v2](https://last9.github.io/images/aws-icons/AWS-Cloud-logo_32.svg)](/docs/integrations/frameworks/go/aws-sdk/) [AWS SDK v2](/docs/integrations/frameworks/go/aws-sdk/) [![Beego](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/beego/) [Beego](/docs/integrations/frameworks/go/beego/) [![Chi](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/chi/) [Chi](/docs/integrations/frameworks/go/chi/) [![Echo](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/echo/) [Echo](/docs/integrations/frameworks/go/echo/) [![FastHttp](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/fasthttp/) [FastHttp](/docs/integrations/frameworks/go/fasthttp/) [![Gin](https://cdn.simpleicons.org/gin/fff)](/docs/integrations/frameworks/go/gin/) [Gin](/docs/integrations/frameworks/go/gin/) [![Gorilla Mux](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/gorilla-mux/) [Gorilla Mux](/docs/integrations/frameworks/go/gorilla-mux/) [![gRPC](https://last9.github.io/assets-docs/integration-grpc.svg)](/docs/integrations/frameworks/go/grpc/) [gRPC](/docs/integrations/frameworks/go/grpc/) [![gRPC Gateway](https://last9.github.io/assets-docs/integration-grpc.svg)](/docs/integrations/frameworks/go/grpc-gateway/) [gRPC Gateway](/docs/integrations/frameworks/go/grpc-gateway/) [![Iris](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/iris/) [Iris](/docs/integrations/frameworks/go/iris/) [![Kafka](https://cdn.simpleicons.org/apachekafka/fff)](/docs/integrations/frameworks/go/kafka/) [Kafka](/docs/integrations/frameworks/go/kafka/) [![Log-Trace Correlation](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/log-trace-correlation/) [Log-Trace 
Correlation](/docs/integrations/frameworks/go/log-trace-correlation/) [![MongoDB](https://cdn.simpleicons.org/mongodb/fff)](/docs/integrations/frameworks/go/mongodb/) [MongoDB](/docs/integrations/frameworks/go/mongodb/) [![net/http](https://cdn.simpleicons.org/go/fff)](/docs/integrations/frameworks/go/net-http/) [net/http](/docs/integrations/frameworks/go/net-http/) ### Java [![Spring Boot](https://cdn.simpleicons.org/spring/fff)](/docs/integrations/frameworks/java/spring-boot/) [Spring Boot](/docs/integrations/frameworks/java/spring-boot/) [![Vert.x](https://cdn.simpleicons.org/eclipsevertdotx/fff)](/docs/integrations/frameworks/java/vertx/) [Vert.x](/docs/integrations/frameworks/java/vertx/) ### JavaScript [![Express.js](https://cdn.simpleicons.org/express/fff)](/docs/integrations/frameworks/javascript/expressjs/) [Express.js](/docs/integrations/frameworks/javascript/expressjs/) [![Fastify](https://cdn.simpleicons.org/fastify/fff)](/docs/integrations/frameworks/javascript/fastify/) [Fastify](/docs/integrations/frameworks/javascript/fastify/) [![Koa](https://cdn.simpleicons.org/koa/fff)](/docs/integrations/frameworks/javascript/koa/) [Koa](/docs/integrations/frameworks/javascript/koa/) [![Nest.js](https://cdn.simpleicons.org/nestjs/fff)](/docs/integrations/frameworks/javascript/nestjs/) [Nest.js](/docs/integrations/frameworks/javascript/nestjs/) [![Next.js](https://cdn.simpleicons.org/nextdotjs/fff)](/docs/integrations/frameworks/javascript/nextjs/) [Next.js](/docs/integrations/frameworks/javascript/nextjs/) [![React](https://cdn.simpleicons.org/react/fff)](/docs/integrations/frameworks/javascript/react/) [React](/docs/integrations/frameworks/javascript/react/) [![Sails.js](https://cdn.simpleicons.org/sailsdotjs/fff)](/docs/integrations/frameworks/javascript/sailsjs/) [Sails.js](/docs/integrations/frameworks/javascript/sailsjs/) ### PHP [![Laravel](https://cdn.simpleicons.org/laravel/fff)](/docs/integrations/frameworks/php/laravel/) [Laravel](/docs/integrations/frameworks/php/laravel/) [![Lumen](https://cdn.simpleicons.org/lumen/fff)](/docs/integrations/frameworks/php/lumen/) [Lumen](/docs/integrations/frameworks/php/lumen/) ### Python [![Celery](https://cdn.simpleicons.org/celery/fff)](/docs/integrations/frameworks/python/celery/) [Celery](/docs/integrations/frameworks/python/celery/) [![Django](https://cdn.simpleicons.org/django/fff)](/docs/integrations/frameworks/python/django/) [Django](/docs/integrations/frameworks/python/django/) [![Falcon](https://cdn.simpleicons.org/falcon/fff)](/docs/integrations/frameworks/python/falcon/) [Falcon](/docs/integrations/frameworks/python/falcon/) [![FastAPI](https://cdn.simpleicons.org/fastapi/fff)](/docs/integrations/frameworks/python/fastapi/) [FastAPI](/docs/integrations/frameworks/python/fastapi/) [![FastMCP](/docs/icons/fastmcp.png)](/docs/integrations/frameworks/python/fastmcp/) [FastMCP](/docs/integrations/frameworks/python/fastmcp/) [![Flask](https://cdn.simpleicons.org/flask/fff)](/docs/integrations/frameworks/python/flask/) [Flask](/docs/integrations/frameworks/python/flask/) [![Sanic](https://cdn.simpleicons.org/sanic/fff)](/docs/integrations/frameworks/python/sanic/) [Sanic](/docs/integrations/frameworks/python/sanic/) ### Ruby [![Roda](https://last9.github.io/assets-docs/integration-roda.svg)](/docs/integrations/frameworks/ruby/roda/) [Roda](/docs/integrations/frameworks/ruby/roda/) [![Ruby on Rails](https://cdn.simpleicons.org/rubyonrails/fff)](/docs/integrations/frameworks/ruby/ruby-on-rails/) [Ruby on 
Rails](/docs/integrations/frameworks/ruby/ruby-on-rails/) [![Sinatra](https://cdn.simpleicons.org/rubysinatra/fff)](/docs/integrations/frameworks/ruby/sinatra/) [Sinatra](/docs/integrations/frameworks/ruby/sinatra/) ### Scala [![Akka HTTP](https://cdn.simpleicons.org/scala/fff)](/docs/integrations/frameworks/scala/akka-http/) [Akka HTTP](/docs/integrations/frameworks/scala/akka-http/) ### Swift [![iOS](https://cdn.simpleicons.org/swift/fff)](/docs/integrations/frameworks/swift/ios/) [iOS](/docs/integrations/frameworks/swift/ios/) ## Languages [![.NET](https://cdn.simpleicons.org/dotnet/fff)](/docs/integrations/languages/dotnet/) [.NET](/docs/integrations/languages/dotnet/) [![Java](https://cdn.simpleicons.org/openjdk/fff)](/docs/integrations/languages/java/) [Java](/docs/integrations/languages/java/) [![Node.js](https://cdn.simpleicons.org/nodedotjs/fff)](/docs/integrations/languages/nodejs/) [Node.js](/docs/integrations/languages/nodejs/) [![PHP](https://cdn.simpleicons.org/php/fff)](/docs/integrations/languages/php/) [PHP](/docs/integrations/languages/php/) [![Scala](https://cdn.simpleicons.org/scala/fff)](/docs/integrations/languages/scala/) [Scala](/docs/integrations/languages/scala/) ## Messaging [![AWS MSK](https://last9.github.io/images/aws-icons/Arch_Amazon-Managed-Streaming-for-Apache-Kafka_64.svg)](/docs/integrations/messaging/aws-msk/) [AWS MSK](/docs/integrations/messaging/aws-msk/) [![AWS SQS](https://last9.github.io/images/aws-icons/Arch_Amazon-Simple-Queue-Service_64.svg)](/docs/integrations/messaging/aws-sqs/) [AWS SQS](/docs/integrations/messaging/aws-sqs/) [![Azure Event Hubs](/docs/icons/azure.svg)](/docs/integrations/messaging/azure-event-hubs/) [Azure Event Hubs](/docs/integrations/messaging/azure-event-hubs/) [![Google Managed Kafka Service](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/messaging/google-managed-kafka-service/) [Google Managed Kafka Service](/docs/integrations/messaging/google-managed-kafka-service/) [![Kafka](https://cdn.simpleicons.org/apachekafka/fff)](/docs/integrations/messaging/kafka/) [Kafka](/docs/integrations/messaging/kafka/) [![Karafka](https://cdn.simpleicons.org/ruby/fff)](/docs/integrations/messaging/karafka/) [Karafka](/docs/integrations/messaging/karafka/) [![NATS](https://cdn.simpleicons.org/natsdotio/fff)](/docs/integrations/messaging/nats/) [NATS](/docs/integrations/messaging/nats/) ## Observability [![AWS CloudWatch Logs](https://last9.github.io/images/aws-icons/Arch_Amazon-CloudWatch_64.svg)](/docs/integrations/observability/aws-cloudwatch-logs/) [AWS CloudWatch Logs](/docs/integrations/observability/aws-cloudwatch-logs/) [![AWS Cloudwatch Metrics](https://last9.github.io/images/aws-icons/Arch_Amazon-CloudWatch_64.svg)](/docs/integrations/observability/aws-cloudwatch-metrics/) [AWS Cloudwatch Metrics](/docs/integrations/observability/aws-cloudwatch-metrics/) [![Cloudflare Logs](https://cdn.simpleicons.org/cloudflare/fff)](/docs/integrations/observability/cloudflare-logs/) [Cloudflare Logs](/docs/integrations/observability/cloudflare-logs/) [![Cloudflare Workers Logs](https://cdn.simpleicons.org/cloudflare/fff)](/docs/integrations/observability/cloudflare-workers-logs/) [Cloudflare Workers Logs](/docs/integrations/observability/cloudflare-workers-logs/) [![Datadog Agent](https://cdn.simpleicons.org/datadog/fff)](/docs/integrations/observability/datadog-agent/) [Datadog Agent](/docs/integrations/observability/datadog-agent/) [![Elastic 
Logstash](https://cdn.simpleicons.org/elastic/fff)](/docs/integrations/observability/elastic-logstash/) [Elastic Logstash](/docs/integrations/observability/elastic-logstash/) [![Fluent Bit](https://cdn.simpleicons.org/fluentbit/fff)](/docs/integrations/observability/fluent-bit/) [Fluent Bit](/docs/integrations/observability/fluent-bit/) [![Host Metrics on Raspberry Pi](https://cdn.simpleicons.org/raspberrypi/fff)](/docs/integrations/observability/host-metrics-on-raspberry-pi/) [Host Metrics on Raspberry Pi](/docs/integrations/observability/host-metrics-on-raspberry-pi/) [![Host Metrics on Ubuntu](https://cdn.simpleicons.org/ubuntu/fff)](/docs/integrations/observability/host-metrics-on-ubuntu/) [Host Metrics on Ubuntu](/docs/integrations/observability/host-metrics-on-ubuntu/) [![Host Metrics on Windows](/docs/icons/microsoft-windows.svg)](/docs/integrations/observability/host-metrics-on-windows/) [Host Metrics on Windows](/docs/integrations/observability/host-metrics-on-windows/) [![Loki Logs](https://cdn.simpleicons.org/grafana/fff)](/docs/integrations/observability/loki-logs/) [Loki Logs](/docs/integrations/observability/loki-logs/) [![OpenTelemetry Collector](https://cdn.simpleicons.org/opentelemetry/fff)](/docs/integrations/observability/opentelemetry-collector/) [OpenTelemetry Collector](/docs/integrations/observability/opentelemetry-collector/) [![Prometheus](https://cdn.simpleicons.org/prometheus/fff)](/docs/integrations/observability/prometheus/) [Prometheus](/docs/integrations/observability/prometheus/) [![Sending OpenTelemetry Demo data to Last9](https://cdn.simpleicons.org/opentelemetry/fff)](/docs/integrations/observability/sending-opentelemetry-demo-data-to-last9/) [Sending OpenTelemetry Demo data to Last9](/docs/integrations/observability/sending-opentelemetry-demo-data-to-last9/) [![Ubuntu](https://cdn.simpleicons.org/ubuntu/fff)](/docs/integrations/observability/ubuntu/) [Ubuntu](/docs/integrations/observability/ubuntu/) [![Winston Logger](https://cdn.simpleicons.org/node.js/fff)](/docs/integrations/observability/winston-logger/) [Winston Logger](/docs/integrations/observability/winston-logger/) ## Real User Monitoring [![Web](https://cdn.simpleicons.org/javascript/fff)](/docs/real-user-monitoring/web/) [Web](/docs/real-user-monitoring/web/) [![Android](https://cdn.simpleicons.org/android/fff)](/docs/real-user-monitoring/android/) [Android](/docs/real-user-monitoring/android/) [![iOS](https://cdn.simpleicons.org/apple/fff)](/docs/real-user-monitoring/ios/) [iOS](/docs/real-user-monitoring/ios/) [![React Native](https://cdn.simpleicons.org/react/fff)](/docs/real-user-monitoring/react-native/) [React Native](/docs/real-user-monitoring/react-native/) [![Flutter](https://cdn.simpleicons.org/flutter/fff)](/docs/real-user-monitoring/flutter/) [Flutter](/docs/real-user-monitoring/flutter/) ## Web & CDN [![Akamai](https://cdn.simpleicons.org/akamai/fff)](/docs/integrations/web-and-cdn/akamai/) [Akamai](/docs/integrations/web-and-cdn/akamai/) [![Apache](https://cdn.simpleicons.org/apache/fff)](/docs/integrations/web-and-cdn/apache/) [Apache](/docs/integrations/web-and-cdn/apache/) [![AWS CloudFront](https://last9.github.io/images/aws-icons/Arch_Amazon-CloudFront_64.svg)](/docs/integrations/web-and-cdn/aws-cloudfront/) [AWS CloudFront](/docs/integrations/web-and-cdn/aws-cloudfront/) [![AWS Elastic Load Balancer](https://last9.github.io/images/aws-icons/Arch_Elastic-Load-Balancing_64.svg)](/docs/integrations/web-and-cdn/aws-elastic-load-balancer/) [AWS Elastic Load 
Balancer](/docs/integrations/web-and-cdn/aws-elastic-load-balancer/) [![Fastly](https://cdn.simpleicons.org/fastly/fff)](/docs/integrations/web-and-cdn/fastly/) [Fastly](/docs/integrations/web-and-cdn/fastly/) [![Google Cloud Load Balancer](https://cdn.simpleicons.org/googlecloud/fff)](/docs/integrations/web-and-cdn/google-cloud-load-balancer/) [Google Cloud Load Balancer](/docs/integrations/web-and-cdn/google-cloud-load-balancer/) [![Nginx](https://cdn.simpleicons.org/nginx/fff)](/docs/integrations/web-and-cdn/nginx/) [Nginx](/docs/integrations/web-and-cdn/nginx/) [![OpenResty](/docs/icons/openresty.png)](/docs/integrations/web-and-cdn/openresty/) [OpenResty](/docs/integrations/web-and-cdn/openresty/) [![Tomcat](https://cdn.simpleicons.org/apachetomcat/fff)](/docs/integrations/web-and-cdn/tomcat/) [Tomcat](/docs/integrations/web-and-cdn/tomcat/)

## Others

[![Apache APISIX](https://cdn.simpleicons.org/apache/fff)](/docs/integrations/others/apache-apisix/) [Apache APISIX](/docs/integrations/others/apache-apisix/) [![Confluent Cloud](https://last9.github.io/assets-docs/integration-confluent.svg)](/docs/integrations/others/confluent-cloud/) [Confluent Cloud](/docs/integrations/others/confluent-cloud/) [![Google Play Reporting](https://cdn.simpleicons.org/googleplay/fff)](/docs/integrations/others/google-play-reporting/) [Google Play Reporting](/docs/integrations/others/google-play-reporting/) [![jmxtrans](https://last9.github.io/assets-docs/integration-jmxtrans.png)](/docs/integrations/others/jmxtrans/) [jmxtrans](/docs/integrations/others/jmxtrans/) [![Keda](https://last9.github.io/assets-docs/integration-keda.png)](/docs/integrations/others/keda/) [Keda](/docs/integrations/others/keda/) [![LaunchDarkly](https://last9.github.io/assets-docs/integration-launchdarkly.svg)](/docs/integrations/others/launchdarkly/) [LaunchDarkly](/docs/integrations/others/launchdarkly/) [![StatsD](https://last9.github.io/assets-docs/integration-statsd.png)](/docs/integrations/others/statsd/) [StatsD](/docs/integrations/others/statsd/) [![Telegraf](https://cdn.simpleicons.org/influxdb/fff)](/docs/integrations/others/telegraf/) [Telegraf](/docs/integrations/others/telegraf/) [![Vmagent](https://cdn.simpleicons.org/victoriametrics/fff)](/docs/integrations/others/vmagent/) [Vmagent](/docs/integrations/others/vmagent/)

# Monitor RabbitMQ using Last9

> Send RabbitMQ metrics to Last9

This document lists step-by-step instructions for setting up monitoring for RabbitMQ using Last9.

## Prerequisites

Create a Last9 cluster by following [Getting Started](/docs/onboard/). Keep the following information handy after creating the cluster:

* `$remote_write_url` - Last9's Remote Write endpoint
* `$remote_write_username` - Cluster ID
* `$remote_write_password` - Write token created for the cluster

Ensure RabbitMQ is installed and running on each VM, and that you have root or administrative access. Ensure that the [rabbitmq\_prometheus plugin](https://www.rabbitmq.com/docs/prometheus#enable-rabbitmq_prometheus) is configured as per its documentation.

## Configure Prometheus Agent to Scrape Metrics

### Architecture

![prometheus-rabbitmq](/_astro/rabbitmq-levitate.DKStmc_T_QNhuf.webp)

### Setting up Prometheus Agent

Install Prometheus Agent on a central server that can access all VMs running RabbitMQ.
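Before installing the agent, it helps to confirm that each RabbitMQ node is actually serving Prometheus metrics. A quick check from the central server (`rbmq-vm-1` stands in for one of your RabbitMQ hosts; 15692 is the plugin's default port):

```bash
# A non-empty response confirms the rabbitmq_prometheus endpoint
# is reachable from this host
curl -s http://rbmq-vm-1:15692/metrics | head -n 5
```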
The following script installs and starts the agent; it can be run on the server:

```bash
#!/bin/bash

# Determine the OS architecture so the right Prometheus build is downloaded
get_architecture() {
  architecture=$(uname -m)
  case $architecture in
    x86_64) arch="amd64" ;;
    aarch64) arch="arm64" ;;
    arm*) arch="armv7" ;;
    *)
      echo "Architecture $architecture is not supported by this script."
      exit 1
      ;;
  esac
  echo $arch
}

# Define version and architecture
VERSION="2.37.0"
ARCH=$(get_architecture)

# Download and install Prometheus Agent
URL="https://github.com/prometheus/prometheus/releases/download/v$VERSION/prometheus-$VERSION.linux-$ARCH.tar.gz"
wget $URL -O prometheus.tar.gz
tar -xzf prometheus.tar.gz
cd prometheus-$VERSION.linux-$ARCH

# Move executables to your PATH
sudo mv prometheus promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mv consoles/ console_libraries/ prometheus.yml /etc/prometheus/

# Create a service file for Prometheus (agent mode)
echo "[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=$USER
Restart=on-failure
ExecStart=/usr/local/bin/prometheus \\
  --config.file=/etc/prometheus/prometheus.yml \\
  --enable-feature=agent

[Install]
WantedBy=multi-user.target" | sudo tee /etc/systemd/system/prometheus.service

# Reload systemd to pick up the new service and start Prometheus Agent
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
```

### Configure Scrape Targets

Use either `ec2_sd_config` or `file_sd_config` to configure scrape targets. The example uses `file_sd_config` for simplicity. Create a file `targets.json` and add the following; ensure that the targets are the private IPs (or resolvable hostnames) of the RabbitMQ instances:

```json
[
  {
    "targets": ["rbmq-vm-1:15692", "rbmq-vm-2:15692"],
    "labels": {
      "job": "rabbitmq"
    }
  }
]
```

### Configure Last9 Remote Write Credentials into Prometheus Agent

Add the following `remote_write` configuration to the `prometheus.yml` file:

```shell
sudo tee -a /etc/prometheus/prometheus.yml > /dev/null <<EOF
remote_write:
  - url: "$remote_write_url"
    basic_auth:
      username: "$remote_write_username"
      password: "$remote_write_password"
    write_relabel_configs:
      # Forward only RabbitMQ metrics
      - source_labels: [__name__]
        regex: 'rabbitmq_(.*)'
        action: keep
EOF
```

Replace the `remote_write_*` template variables with your Last9 credentials, which you can obtain by following [Getting Started](/docs/onboard/).

### Restart Prometheus Agent

Restart the Prometheus Agent to apply the changes:

```bash
sudo systemctl restart prometheus
```

### Verification

* Check the Prometheus Agent's `/targets` web interface to ensure it is scraping the RabbitMQ metrics
* Verify that metrics are being received in the remote storage

This script-driven setup automates the deployment and configuration of RabbitMQ monitoring using Prometheus Agent.

### Dashboard

Download the latest dashboard from [here](https://grafana.com/grafana/dashboards/10991-rabbitmq-overview/) to visualize the metrics.

![rabbitmq-levitate-1](/_astro/rabbitmq-dashboard-1.CZzLrEg6_1QGOIK.webp) ![rabbitmq-levitate-2](/_astro/rabbitmq-dashboard-2.CUp-rZgR_4Itso.webp) ![rabbitmq-levitate-3](/_astro/rabbitmq-dashboard-3.C6ZdGi3I_ZsD5Qp.webp)

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Aerospike

> Monitor Aerospike node and namespace metrics with OpenTelemetry and Last9

Use OpenTelemetry to monitor Aerospike infrastructure and send telemetry data to Last9.
This integration collects node-level and namespace-level metrics from your Aerospike cluster using the OpenTelemetry Collector's Aerospike receiver.

## Prerequisites

Before setting up Aerospike monitoring, ensure you have:

* Aerospike server installed and running
* Administrative access to your Aerospike server
* OpenTelemetry Collector installation access
* Last9 account with integration credentials

1. **Install OpenTelemetry Collector**

Choose the appropriate package for your operating system.

* **DEB Package** (Debian/Ubuntu):

```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
```

* **RPM Package** (Red Hat/CentOS):

```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
```

More installation options can be found in the [official OpenTelemetry documentation](https://opentelemetry.io/docs/collector/installation/#linux).

2. **Configure OpenTelemetry Collector**

Create the collector configuration file:

```bash
sudo nano /etc/otelcol-contrib/config.yaml
```

Add the following configuration to monitor Aerospike metrics and system resources:

```yaml
receivers:
  aerospike:
    endpoint: "localhost:3000"
    collection_interval: 60s
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
        metrics:
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
          system.memory.limit:
            enabled: true
      load:
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      network:
      paging:

processors:
  batch:
    timeout: 20s
    send_batch_size: 10000
    send_batch_max_size: 10000
  resourcedetection/cloud:
    detectors: ["aws", "gcp", "azure"]
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]

exporters:
  otlp/last9:
    endpoint: "{{ .Logs.WriteURL }}"
    headers:
      "Authorization": "{{ .Logs.AuthValue }}"
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [aerospike, hostmetrics]
      processors: [batch, resourcedetection/system, resourcedetection/cloud]
      exporters: [otlp/last9]
```

Replace the placeholder values in the `exporters` section with your actual Last9 credentials from the Last9 Integrations page. Update the Aerospike endpoint if your server runs on a different host or port (default: `localhost:3000`).

3. **Configure OpenTelemetry Collector Service**

Create a systemd service file for the collector:

```bash
sudo nano /etc/systemd/system/otelcol-contrib.service
```

Add the following service configuration:

```bash
[Unit]
Description=OpenTelemetry Collector Contrib
After=network.target

[Service]
ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
Restart=always
User=root
Group=root

[Install]
WantedBy=multi-user.target
```
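Optionally, you can lint the unit file before loading it. This is a quick sanity check, not a required step:

```bash
# Reports typos or invalid directives in the unit file, if any
systemd-analyze verify /etc/systemd/system/otelcol-contrib.service
```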
4. **Start and Enable the Service**

Reload systemd configuration and start the collector service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable otelcol-contrib
sudo systemctl start otelcol-contrib
```

## Understanding the Setup

### Node Metrics

The Aerospike receiver collects node-level metrics reflecting the health of each server in your cluster:

* **Connections**: Current open connections and cumulative connection counts, broken down by type (client, fabric, heartbeat)
* **Memory**: Percentage of node memory still free
* **Query Tracking**: Number of long-running queries tracked by the system (those that exceeded the untracked-time threshold)

### Namespace Metrics

Namespace metrics provide per-namespace insight into storage and workload:

* **Storage**: Minimum contiguous disk free percentage across all devices, and memory usage broken down by component (data, index, set index, secondary index)
* **Transactions**: Read, write, delete, and UDF operations with result breakdowns (success, error, timeout, filtered out, not found)
* **Scans and Queries**: Aggregation, basic, and background scan and query counts with result breakdowns

### Host Metrics

The `hostmetrics` receiver gathers system-level metrics to provide context for Aerospike performance:

* CPU utilization and load averages
* Memory usage and availability
* Disk I/O and filesystem utilization
* Network traffic statistics

## Key Metrics to Monitor

### Node-Level

| Metric | Description | Unit |
| --- | --- | --- |
| `aerospike.node.connection.open` | Current open connections (by type) | connections |
| `aerospike.node.connection.count` | Connections opened and closed | connections |
| `aerospike.node.memory.free` | Percentage of node memory free | % |
| `aerospike.node.query.tracked` | Long-running queries tracked | queries |

### Namespace-Level

| Metric | Description | Unit |
| --- | --- | --- |
| `aerospike.namespace.disk.available` | Minimum contiguous disk free across all devices | % |
| `aerospike.namespace.memory.free` | Percentage of namespace memory free | % |
| `aerospike.namespace.memory.usage` | Memory used by component (data, index, set\_index, secondary\_index) | bytes |
| `aerospike.namespace.transaction.count` | Transactions by type and result | transactions |
| `aerospike.namespace.scan.count` | Scan operations by type and result | scans |
| `aerospike.namespace.query.count` | Query operations by type and result | queries |

## Configuration Tips

### Authentication

If your Aerospike cluster requires authentication:

```yaml
aerospike:
  endpoint: "localhost:3000"
  username: "your_username"
  password: "your_password"
  collection_interval: 60s
```

### TLS

For TLS-secured connections:

```yaml
aerospike:
  endpoint: "localhost:3000"
  collection_interval: 60s
  tls:
    insecure: false
    ca_file: "/etc/ssl/certs/aerospike-ca.crt"
    cert_file: "/etc/ssl/certs/aerospike-client.crt"
    key_file: "/etc/ssl/certs/aerospike-client.key"
```

### Multi-Node Clusters

The Aerospike receiver connects to a single node endpoint.
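If you are unsure which peer nodes exist in the cluster (and therefore which endpoints to add), you can ask any node for its peers with the standard Aerospike `asinfo` tool; the `services` command returns the other nodes' `host:port` pairs:

```bash
# Lists the peer nodes known to node1; each entry is a candidate
# endpoint for an additional aerospike receiver
asinfo -h node1.example.com -v "services"
```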
To monitor multiple nodes, define multiple receivers with distinct names:

```yaml
receivers:
  aerospike/node1:
    endpoint: "node1.example.com:3000"
    collection_interval: 60s
  aerospike/node2:
    endpoint: "node2.example.com:3000"
    collection_interval: 60s

service:
  pipelines:
    metrics:
      receivers: [aerospike/node1, aerospike/node2, hostmetrics]
      processors: [batch, resourcedetection/system, resourcedetection/cloud]
      exporters: [otlp/last9]
```

## Verification and Monitoring

1. **Check Service Status**

Verify the OpenTelemetry Collector service is running:

```bash
sudo systemctl status otelcol-contrib
```

2. **Monitor Service Logs**

Check for any configuration errors or connection issues:

```bash
sudo journalctl -u otelcol-contrib -f
```

3. **Verify Aerospike Connectivity**

Test that Aerospike is accessible on the expected port:

```bash
# Check node info
asinfo -v "node"

# List namespaces
asinfo -v "namespaces"
```

4. **Verify Data in Last9**

Log into your Last9 account and check that Aerospike metrics are being received in [Grafana](https://app.last9.io/grafana).

## Troubleshooting

### Service Fails to Start

```bash
# Check service status
sudo systemctl status otelcol-contrib

# View detailed logs
sudo journalctl -u otelcol-contrib --no-pager

# Validate configuration syntax
sudo /usr/bin/otelcol-contrib validate --config /etc/otelcol-contrib/config.yaml
```

### Aerospike Connection Issues

```bash
# Test port accessibility
nc -zv localhost 3000

# Check Aerospike service status
sudo systemctl status aerospike

# View Aerospike logs
sudo tail -f /var/log/aerospike/aerospike.log
```

### No Metrics Appearing

* Confirm the Aerospike endpoint is reachable from the collector host
* Verify port 3000 is not blocked by a firewall
* For authenticated clusters, confirm the credentials have read access to node and namespace statistics
* The Aerospike receiver requires `otelcol-contrib`; the core distribution does not include it

## Need Help?

If you encounter any issues or have questions:

* Join our [Discord community](https://discord.com/channels/652153247672729619/652153247672729621) for real-time support
* Contact our support team at [cs@last9.io](mailto:cs@last9.io)

# ArgoCD

> Send deployment markers to Last9 when ArgoCD syncs an application, so you can correlate Kubernetes deployments with service health, error rates, and APDEX shifts.

The [Last9 ArgoCD Integration](https://github.com/last9/last9-argocd-integration) sends deployment markers to Last9's [Change Events API](/docs/change-events/) whenever ArgoCD syncs an application. Every deployment appears as a vertical overlay on your service dashboards, correlated with latency, error rates, and APDEX. No operator. No sidecar. No Lambda. ArgoCD's built-in notification controller handles everything.

## Prerequisites

* ArgoCD with the [notifications controller](https://argo-cd.readthedocs.io/en/stable/operator-manual/notifications/) installed
* A Last9 refresh token (write scope). See [Getting Started with API](/docs/getting-started-with-api/) to generate one.
* Your Last9 organization slug (the `{org}` segment in your Last9 dashboard URL).

## Install

1. **Add your credentials**

```yaml
# install/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-notifications-secret
  namespace: argocd
stringData:
  last9-refresh-token:
  last9-org-slug:
  last9-token: ""
```

```sh
kubectl apply -f install/secret.yaml
```
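You can optionally confirm the secret landed in the right namespace before moving on:

```sh
kubectl get secret argocd-notifications-secret -n argocd
```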
2. **Apply the integration**

```sh
kubectl apply -f https://raw.githubusercontent.com/last9/last9-argocd-integration/master/install/token-refresher.yaml
kubectl apply -f https://raw.githubusercontent.com/last9/last9-argocd-integration/master/install/argocd-notifications-cm-patch.yaml
```

The first command installs a CronJob that exchanges your refresh token for an access token immediately and every 48 hours, so you never need to rotate credentials manually. The second patches `argocd-notifications-cm` with the Last9 webhook service, notification templates, and triggers. Nothing else in your ArgoCD installation changes.

3. **Subscribe your applications**

Add two annotations to each ArgoCD Application you want to track:

```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.last9-on-sync-running.last9: ""
    notifications.argoproj.io/subscribe.last9-on-sync-finished.last9: ""
```

Or enable Last9 markers for every application in the cluster at once by adding `defaultTriggers` to `argocd-notifications-cm`:

```yaml
data:
  defaultTriggers: |
    - last9-on-sync-running
    - last9-on-sync-finished
```

## What gets captured

Two markers per deployment: one when the sync starts, one when it finishes.

| Attribute | Value |
| --- | --- |
| `service` | ArgoCD application name |
| `revision` | Git commit SHA |
| `namespace` | Destination namespace |
| `cluster` | Destination cluster name |
| `project` | ArgoCD project |
| `repo` | Git repository URL |
| `sync_status` | `Succeeded`, `Failed`, or `Error` (stop marker only) |
| `health` | `Healthy` or `Degraded` (stop marker only) |
| `initiated_by` | Username who triggered the sync |
| `automated` | `true` if triggered by auto-sync |

## Dashboard correlation

Filter your Last9 service dashboards by `service=` to see deployment markers alongside metrics. The `revision` attribute links each marker to the exact commit that changed behavior.

***

## Troubleshooting

* **No markers appearing**

  Check the notifications controller logs:

  ```sh
  kubectl logs -n argocd deployment/argocd-notifications-controller
  ```

  Look for `last9` in the output. If the webhook is firing but events aren't appearing, check that the token-refresher ran successfully:

  ```sh
  kubectl logs -n argocd -l job-name=last9-token-refresher-init
  ```

* **Duplicate markers**

  The triggers use `oncePer: app.status.operationState?.syncResult?.revision` to deduplicate, so each commit SHA fires at most once per trigger. If you see duplicates, check that the patch applied cleanly:

  ```sh
  kubectl get cm argocd-notifications-cm -n argocd -o yaml | grep last9
  ```

* **Helm users**

  If you manage ArgoCD with Helm, add the contents of [`install/argocd-notifications-cm-patch.yaml`](https://github.com/last9/last9-argocd-integration/blob/master/install/argocd-notifications-cm-patch.yaml) under `notifications.cm` in your values file instead of applying it directly.

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Buildkite

> Send deployment markers to Last9 from Buildkite pipelines to correlate deployments with service performance and error rates.

The [Last9 Deployment Marker Buildkite Plugin](https://github.com/last9/deployment-marker-buildkite-plugin) sends deployment markers to Last9's [Change Events API](/docs/change-events/) directly from your Buildkite pipelines.
Deployment events appear as overlays on your service dashboards, letting you correlate releases with performance changes, error spikes, and APDEX shifts.

## Prerequisites

* A Last9 account with an API refresh token (write scope). See [Getting Started with API](/docs/getting-started-with-api/) to generate one.
* Your Last9 organization slug (the `{org}` segment in your Last9 dashboard URL).
* `curl` and `jq` installed on your Buildkite agent (available by default on most agent environments).
* The `env` value must match the `deployment_environment` label on your APM services for automatic dashboard correlation.

Store your refresh token as a [Buildkite secret](https://buildkite.com/docs/pipelines/security/secrets/managing) or in your agent's environment hooks; never hard-code it in `pipeline.yml`.

## Setup

Add the plugin to your deploy step in `pipeline.yml`:

```yaml
steps:
  - label: ":rocket: Deploy to production"
    command: ./deploy.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "production"
```

The plugin runs `pre-command` and `post-command` hooks around your deploy step. It automatically sets `result` to:

* `success` - when your command exits with code `0`
* `failure` - when your command exits with a non-zero code

Deployment duration (`duration_ms`) is captured automatically between the two hooks and attached to the event.

## Configuration Reference

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| `refresh_token` | **Yes** | — | Last9 API refresh token with write scope |
| `org_slug` | **Yes** | — | Your Last9 organization slug |
| `env` | **Yes** | — | Deployment environment. Must match your APM `deployment_environment` label for dashboard correlation |
| `service` | No | Pipeline slug | Service identifier. Must match your APM `service_name` label for dashboard correlation |
| `event_name` | No | `deployment` | Custom label for the event |
| `data_source_name` | No | — | Last9 cluster or data source name to associate the event with |
| `api_base_url` | No | `https://app.last9.io` | Override the Last9 API base URL |
| `max_retry_attempts` | No | `3` | Number of retry attempts with exponential backoff on failure |
| `retry_backoff_ms` | No | `1000` | Initial backoff in milliseconds between retries |
| `max_retry_backoff_ms` | No | `30000` | Maximum backoff cap in milliseconds |

## Pipeline Patterns

### Mark a single deploy step

The simplest setup: add the plugin to your deploy step and it handles start, end, and result automatically:

```yaml
steps:
  - label: ":rocket: Deploy"
    command: ./deploy.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "production"
          service: "payments-api"
```

### Multiple services in one pipeline

Add the plugin to each relevant step with a distinct `service`:

```yaml
steps:
  - label: "Deploy API"
    command: ./deploy-api.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "production"
          service: "payments-api"
  - label: "Deploy Worker"
    command: ./deploy-worker.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "production"
          service: "payments-worker"
```

### Multi-environment pipelines

Pass the environment dynamically using a Buildkite [meta-data](https://buildkite.com/docs/pipelines/build-meta-data) value or pipeline variable:

```yaml
steps:
  - label: "Deploy to ${DEPLOY_ENV}"
    command: ./deploy.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "${DEPLOY_ENV}"
```

### Track rollbacks

Use a distinct `event_name` to distinguish rollbacks from regular deployments:

```yaml
steps:
  - label: ":rewind: Rollback"
    command: ./rollback.sh
    plugins:
      - last9/deployment-marker#v1.0.0:
          refresh_token: "${LAST9_REFRESH_TOKEN}"
          org_slug: "your-org-slug"
          env: "production"
          event_name: "rollback"
```

## Auto-captured Buildkite Context

The plugin automatically captures and attaches the following Buildkite context to every event:

| Attribute | Source |
| --- | --- |
| `service` | `service` plugin config (or pipeline slug) |
| `env` | Plugin config |
| `result` | `success` / `failure` (stop event only) |
| `pipeline_slug` | `BUILDKITE_PIPELINE_SLUG` |
| `pipeline_name` | `BUILDKITE_PIPELINE_NAME` |
| `build_id` | `BUILDKITE_BUILD_ID` |
| `build_number` | `BUILDKITE_BUILD_NUMBER` |
| `build_url` | `BUILDKITE_BUILD_URL` |
| `commit_sha` | `BUILDKITE_COMMIT` |
| `branch` | `BUILDKITE_BRANCH` |
| `tag` | `BUILDKITE_TAG` |
| `commit_message` | `BUILDKITE_MESSAGE` |
| `actor` | `BUILDKITE_BUILD_CREATOR` (falls back to `BUILDKITE_BUILD_AUTHOR`) |
| `step_label` | `BUILDKITE_LABEL` |
| `step_key` | `BUILDKITE_STEP_KEY` |
| `job_id` | `BUILDKITE_JOB_ID` |
| `source` | `BUILDKITE_SOURCE` |
| `retry_count` | `BUILDKITE_RETRY_COUNT` |
| `triggered_from_build_id` | `BUILDKITE_TRIGGERED_FROM_BUILD_ID` |
| `rebuilt_from_build_id` | `BUILDKITE_REBUILT_FROM_BUILD_ID` |
| `duration_ms` | Computed: time from `pre-command` to `post-command` (stop event only) |

## Service Dashboard Correlation

For deployment markers to appear as overlays on your Last9 service dashboards, two values must match your APM data exactly:

* `env` must equal the `deployment_environment` label on your APM services
* `service` must equal the `service_name` label on your APM services (defaults to the Buildkite pipeline slug if not set)

When both match, every deployment event appears as a vertical marker on your APDEX, response time, throughput, and error rate charts. See [Change Events](/docs/change-events/) for details on storage, PromQL queries, and visualisation options.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# GitHub Actions

> Send deployment markers to Last9 from GitHub Actions workflows to correlate deployments with service performance and error rates.

The [Last9 Deployment Marker action](https://github.com/marketplace/actions/last9-deployment-marker) sends deployment markers to Last9's [Change Events API](/docs/change-events/) directly from your GitHub Actions workflows. Deployment events appear as overlays on your service dashboards, letting you correlate releases with performance changes, error spikes, and APDEX shifts.

## Prerequisites

* A Last9 account with an API refresh token (write scope). See [Getting Started with API](/docs/getting-started-with-api/) to generate one.
* Your Last9 organization slug (the `{org}` segment in your Last9 dashboard URL).
* The `env` value must match the `deployment_environment` label on your APM services for automatic dashboard correlation.

Store your refresh token as a [GitHub Actions secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets) named `LAST9_REFRESH_TOKEN`.

## Setup

Add the action to any job in your workflow:

```yaml
- name: Mark deployment in Last9
  uses: last9/deployment-marker-action@v1
  with:
    refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
    org_slug: your-org-slug
    env: production
```

By default, this sends a `stop` event with `event_name: deployment`. The `service_name` defaults to your repository name.

## Configuration Reference

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| `refresh_token` | Yes | — | Last9 API refresh token with write scope |
| `org_slug` | Yes | — | Your Last9 organization slug |
| `env` | Yes | — | Deployment environment. Must match your APM `deployment_environment` label exactly for dashboard correlation |
| `service_name` | No | Repository name | Service identifier. Must match your APM `service_name` label for dashboard correlation |
| `event_state` | No | `stop` | `start`, `stop`, or `both`. Use `both` to fire start and stop atomically from a single step |
| `event_name` | No | `deployment` | Custom label for the event. Appears as a label on the resulting `last9_change_events` metric |
| `custom_attributes` | No | — | Additional metadata as a JSON string (e.g. `{"team":"platform","ticket":"ENG-123"}`) |
| `include_github_attributes` | No | `true` | Automatically attach workflow context: commit SHA, actor, branch, run ID, and repository |
| `api_base_url` | No | `https://app.last9.io` | Override the Last9 API base URL |
| `max_retry_attempts` | No | `3` | Number of retry attempts with exponential backoff on failure |

### Outputs

| Output | Description |
| --- | --- |
| `success` | `true` if the event was delivered successfully |
| `start_timestamp` | ISO8601 timestamp of the start event (when `event_state` is `start` or `both`) |
| `stop_timestamp` | ISO8601 timestamp of the stop event (when `event_state` is `stop` or `both`) |

## CI/CD Workflows

### Mark deployment complete

The simplest pattern: a single step that fires both start and stop atomically after the deployment finishes:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: ./deploy.sh
      - name: Mark deployment in Last9
        if: always()
        uses: last9/deployment-marker-action@v1
        with:
          refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
          org_slug: your-org-slug
          env: production
          event_state: both
```

### Track deployment duration

Use `start` and `stop` events to bracket the actual deploy step and capture deployment duration on your dashboards:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Mark deployment start
        uses: last9/deployment-marker-action@v1
        with:
          refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
          org_slug: your-org-slug
          env: production
          event_state: start
      - name: Deploy
        run: ./deploy.sh
      - name: Mark deployment complete
        if: always()
        uses: last9/deployment-marker-action@v1
        with:
          refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
          org_slug: your-org-slug
          env: production
          event_state: stop
```

### Track rollbacks

Use a distinct `event_name` to distinguish rollbacks from regular deployments:

```yaml
- name: Mark rollback in Last9
  if: always()
  uses: last9/deployment-marker-action@v1
  with:
    refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
    org_slug: your-org-slug
    env: production
    event_state: both
    event_name: rollback
    custom_attributes: '{"rolled_back_to": "${{ env.PREVIOUS_VERSION }}"}'
```

### Multi-environment deployments

Use `github.event.inputs` or a matrix to pass the environment dynamically:

```yaml
on:
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        required: true
        default: staging

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: ./deploy.sh ${{ github.event.inputs.environment }}
      - name: Mark deployment in Last9
        if: always()
        uses: last9/deployment-marker-action@v1
        with:
          refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
          org_slug: your-org-slug
          env: ${{ github.event.inputs.environment }}
          event_state: both
```

### Custom attributes

Attach additional context (version, team, ticket ID) as queryable labels on the `last9_change_events` metric:

```yaml
- name: Mark deployment in Last9
  if: always()
  uses: last9/deployment-marker-action@v1
  with:
    refresh_token: ${{ secrets.LAST9_REFRESH_TOKEN }}
    org_slug: your-org-slug
    env: production
    service_name: payments-api
    event_state: both
    custom_attributes: >
      {
        "version": "${{ github.sha }}",
        "team": "platform",
        "ticket_id": "${{ github.event.head_commit.message }}"
      }
```

## Auto-captured GitHub Context

When `include_github_attributes` is `true` (the default), the action automatically attaches the following to every event:

| Attribute | Source |
| --- | --- |
| `github_sha` | `github.sha`: the commit being deployed |
| `github_actor` | `github.actor`: the user who triggered the workflow |
| `github_ref` | `github.ref`: the branch or tag ref |
| `github_run_id` | `github.run_id`: links back to the workflow run |
| `github_repository` | `github.repository`: the repo name |

These labels are queryable in PromQL alongside your custom attributes:

```promql
last9_change_events{event_name="deployment", github_actor="alice"}
```

## Service Dashboard Correlation

For deployment markers to appear as overlays on your Last9 service dashboards, two values must match your APM data exactly:

* `env` must equal the `deployment_environment` label on your APM services
* `service_name` must equal the `service_name` label on your APM services (defaults to the repository name if not set)

When both match, every deployment event appears as a vertical marker on your APDEX, response time, throughput, and error rate charts. See [Change Events](/docs/change-events/) for details on storage, PromQL queries, and visualisation options.

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Jenkins

> Send deployment markers to Last9 from Jenkins pipelines and Freestyle jobs to correlate deployments with service performance and error rates.

The [Last9 Jenkins Plugin](https://github.com/last9/last9-jenkins-plugin) sends deployment markers to Last9's [Change Events API](/docs/change-events/) from Jenkins Pipeline and Freestyle jobs. Deployment events appear as overlays on your service dashboards, letting you correlate releases with performance changes, error spikes, and APDEX shifts.

API failures never fail your build. The plugin logs a warning and moves on: deployments ship; observability is best-effort.

## Prerequisites

* A Last9 account with an API refresh token (write scope). See [Getting Started with API](/docs/getting-started-with-api/) to generate one.
* Your Last9 organization slug (the `{org}` segment in your Last9 dashboard URL).
* Jenkins 2.462.3 or newer, Java 17 or newer.
* The `environment` value must match the `deployment_environment` label on your APM services for automatic dashboard correlation.

## Setup

**1. Create a credential**

In **Manage Jenkins → Credentials**, add a **Secret text** credential. The secret is your Last9 refresh token.

**2. Configure the plugin**

Go to **Manage Jenkins → System → Last9**:

* **Organization Slug**: your org identifier from the Last9 URL (e.g. `acme`)
* **API Credential**: the credential you just created
* **Default Data Source Name**: optional

Hit **Test Connection** to verify before saving.

## Pipeline

### Deployment window (start + stop)

Send `start` before your deploy, `stop` after. Last9 shows the full window so you can see how performance changed during the rollout.

```groovy
pipeline {
  agent any
  stages {
    stage('Deploy') {
      steps {
        last9DeploymentMarker(
          serviceName: 'payments-api',
          environment: 'production',
          eventState: 'start'
        )
        sh './deploy.sh'
        last9DeploymentMarker(
          serviceName: 'payments-api',
          environment: 'production',
          eventState: 'stop'
        )
      }
    }
  }
}
```

### Single marker

If you only need a point-in-time annotation after the deploy finishes:

```groovy
post {
  success {
    last9DeploymentMarker serviceName: 'payments-api', environment: 'production'
  }
}
```

`eventState` defaults to `stop`.
### All options

```groovy
last9DeploymentMarker(
  serviceName: 'payments-api',    // required
  environment: 'production',      // recommended
  eventState: 'start',            // 'start' or 'stop' (default: 'stop')
  eventName: 'deployment',        // default: 'deployment'
  dataSourceName: 'payments-ds',  // overrides global default
  customAttributes: [
    'deploy.version': '1.4.2',
    'deploy.triggered_by': 'release-bot'
  ],
  // Override global config per-step (useful for multi-team Jenkins)
  orgSlug: 'acme',
  credentialId: 'last9-token-prod'
)
```

## Freestyle Jobs

### Deployment window (start + stop)

Add **Track Last9 Deployment Window (start + stop)** in the **Build Environment** section. Sends `start` before the first build step and `stop` after the last, including on failure.

### Single marker

Add **Send Last9 Deployment Marker** as a post-build action. Configure when to send:

* **Send on Success** (default: on)
* **Send on Failure** (default: off)
* **Send on Unstable** (default: off)
* **Send on Aborted** (default: off)

## Auto-Captured Attributes

These are attached to every event automatically:

| Attribute | Source |
| --- | --- |
| `scm.commit_sha` | `$GIT_COMMIT` |
| `scm.branch` | `$GIT_BRANCH` |
| `scm.url` | `$GIT_URL` |
| `scm.author` | `$GIT_AUTHOR_NAME` |
| `jenkins.job_name` | build metadata |
| `jenkins.build_number` | build metadata |
| `jenkins.build_url` | build metadata |
| `jenkins.build_result` | build metadata |
| `jenkins.build_duration_ms` | build metadata |
| `jenkins.build_user` | triggered-by user |
| `jenkins.node_name` | executor node |

## Service Dashboard Correlation

For deployment markers to appear as overlays on your Last9 service dashboards, two values must match your APM data exactly:

* `environment` must equal the `deployment_environment` label on your APM services
* `serviceName` must equal the `service` label on your APM services

When both match, every deployment event appears as a vertical marker on your APDEX, response time, throughput, and error rate charts. See [Change Events](/docs/change-events/) for details.

## Multi-Service Pipelines

Each `last9DeploymentMarker` step is independent. Run as many as you need:

```groovy
stage('Deploy Services') {
  parallel {
    stage('API') {
      steps {
        last9DeploymentMarker serviceName: 'api', environment: 'production', eventState: 'start'
        sh './deploy-api.sh'
        last9DeploymentMarker serviceName: 'api', environment: 'production', eventState: 'stop'
      }
    }
    stage('Worker') {
      steps {
        last9DeploymentMarker serviceName: 'worker', environment: 'production', eventState: 'start'
        sh './deploy-worker.sh'
        last9DeploymentMarker serviceName: 'worker', environment: 'production', eventState: 'stop'
      }
    }
  }
}
```

***

## Troubleshooting

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# Claude Code

> Send Claude Code session telemetry (prompts, tool calls, API costs, and errors) to Last9 via OpenTelemetry.

Claude Code emits structured telemetry for every developer session: prompts submitted, tools invoked, API calls made, and costs incurred. Routing this data to Last9 lets you analyze AI usage patterns, track per-user spend, audit tool decisions, and alert on error rates, all within your existing observability stack.
Claude Code exports three OpenTelemetry signal types:

* **Logs** — individual events per prompt, API call, tool execution, and error
* **Metrics** — aggregated counters for cost, tokens, sessions, and code edits
* **Traces** (beta) — span-based correlation linking a prompt to all its API calls and tool executions

## What gets exported

### Logs (events)

Claude Code exports five event types as OpenTelemetry log records. All share a `prompt.id` UUID that lets you reconstruct the full sequence for a single user interaction.

| Event                       | Emitted when               | Key attributes                                                      |
| --------------------------- | -------------------------- | ------------------------------------------------------------------- |
| `claude_code.user_prompt`   | User submits a prompt      | `prompt.length`, `session.id`, `user.email`                         |
| `claude_code.api_request`   | Claude API responds        | `llm.usage.total_tokens`, `cost_usd`, `model`, `duration_ms`        |
| `claude_code.api_error`     | API call fails             | `error.message`, `http.status_code`, `retry_attempt`                |
| `claude_code.tool_result`   | Tool execution completes   | `tool.name`, `tool.success`, `duration_ms`, `bash.command`          |
| `claude_code.tool_decision` | Permission prompt resolved | `tool.name`, `tool.decision` (`accept`/`reject`), `decision.source` |

### Metrics

Claude Code exports eight counters as OpenTelemetry metrics. These are aggregated — no per-prompt IDs — making them low-cardinality and suitable for dashboards and alerts.

| Metric                                | Unit    | Key attributes                                                |
| ------------------------------------- | ------- | ------------------------------------------------------------- |
| `claude_code.session.count`           | count   | `session.id`, `user.email`, `app.version`                     |
| `claude_code.cost.usage`              | USD     | `model`, `user.email`                                         |
| `claude_code.token.usage`             | tokens  | `type` (input/output/cacheRead/cacheCreation), `model`        |
| `claude_code.lines_of_code.count`     | count   | `type` (added/removed)                                        |
| `claude_code.pull_request.count`      | count   | standard attributes                                           |
| `claude_code.commit.count`            | count   | standard attributes                                           |
| `claude_code.code_edit_tool.decision` | count   | `tool_name`, `decision` (accept/reject), `source`, `language` |
| `claude_code.active_time.total`       | seconds | `type` (user/cli)                                             |

## Prerequisites

1. **Last9 account** — Sign up at [app.last9.io](https://app.last9.io)
2. **Claude Code** — Installed and authenticated (`claude --version` should work)
3. **OTLP credentials** — Get your endpoint and auth header from [**Integrations → OpenTelemetry**](https://app.last9.io/integrations?integration=OpenTelemetry)

## Setup

1. **Get your Last9 OTLP credentials**

   Navigate to [**Integrations → OpenTelemetry**](https://app.last9.io/integrations?integration=OpenTelemetry) in your Last9 dashboard. Copy:

   * **OTLP Endpoint** (e.g., `https://otlp-aps1.last9.io:443`)
   * **Authorization header** (e.g., `Basic <token>`)
2. **Set environment variables**

   Add the following to your shell profile (`~/.zshrc`, `~/.bashrc`, or equivalent):

   **Logs + Metrics (recommended)**

   ```bash
   # Required: enable Claude Code telemetry
   export CLAUDE_CODE_ENABLE_TELEMETRY=1

   # Export both logs and metrics to Last9
   export OTEL_LOGS_EXPORTER=otlp
   export OTEL_METRICS_EXPORTER=otlp

   # Last9 OTLP destination
   export OTEL_EXPORTER_OTLP_ENDPOINT="https://<last9-otlp-endpoint>"
   export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <token>"
   export OTEL_EXPORTER_OTLP_PROTOCOL=http/json

   # Required for metrics: Last9 expects cumulative counters
   export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative

   # Required for traces: currently in beta
   export CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1

   # Identify your sessions
   export OTEL_SERVICE_NAME="claude-code"
   export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=local,team=<team-name>"
   ```

   **Logs only**

   ```bash
   export CLAUDE_CODE_ENABLE_TELEMETRY=1
   export OTEL_LOGS_EXPORTER=otlp
   export OTEL_EXPORTER_OTLP_ENDPOINT="https://<last9-otlp-endpoint>"
   export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <token>"
   export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
   export OTEL_SERVICE_NAME="claude-code"
   ```

   **Metrics only**

   ```bash
   export CLAUDE_CODE_ENABLE_TELEMETRY=1
   export OTEL_METRICS_EXPORTER=otlp
   export OTEL_EXPORTER_OTLP_ENDPOINT="https://<last9-otlp-endpoint>"
   export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <token>"
   export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
   export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative
   export OTEL_SERVICE_NAME="claude-code"
   ```

   Then reload your shell:

   ```bash
   source ~/.zshrc
   ```

3. **Start a Claude Code session**

   ```bash
   claude "summarize what this repo does"
   ```

   * Logs flush within **5 seconds** of each event
   * Metrics flush every **60 seconds** by default

4. **Verify data is arriving**

   * **Logs** — navigate to **Logs** in Last9, filter by `service.name = claude-code`
   * **Metrics** — navigate to **Metrics**, search for `claude_code_cost_usage_total`

## Configuration reference

### Core

| Variable                              | Default       | Description                               |
| ------------------------------------- | ------------- | ----------------------------------------- |
| `CLAUDE_CODE_ENABLE_TELEMETRY`        | `0`           | Set to `1` to enable all telemetry        |
| `OTEL_LOGS_EXPORTER`                  | `none`        | `otlp` to export logs to Last9            |
| `OTEL_METRICS_EXPORTER`               | `none`        | `otlp` to export metrics to Last9         |
| `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA` | `0`           | Set to `1` to enable trace export (beta)  |
| `OTEL_EXPORTER_OTLP_ENDPOINT`         | —             | Last9 OTLP endpoint URL                   |
| `OTEL_EXPORTER_OTLP_HEADERS`          | —             | `Authorization=Basic <token>`             |
| `OTEL_SERVICE_NAME`                   | `claude-code` | Service name tag on all signals           |
| `OTEL_RESOURCE_ATTRIBUTES`            | —             | Comma-separated `key=value` resource tags |

### Logs

| Variable                    | Default | Description                                 |
| --------------------------- | ------- | ------------------------------------------- |
| `OTEL_LOGS_EXPORT_INTERVAL` | `5000`  | Flush interval in milliseconds              |
| `OTEL_LOG_USER_PROMPTS`     | `0`     | Set to `1` to include full prompt text      |
| `OTEL_LOG_TOOL_DETAILS`     | `0`     | Set to `1` to include tool input parameters |

### Metrics

| Variable                                            | Default | Description                                        |
| --------------------------------------------------- | ------- | -------------------------------------------------- |
| `OTEL_METRIC_EXPORT_INTERVAL`                       | `60000` | Flush interval in milliseconds                     |
| `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE` | `delta` | `delta` or `cumulative`                            |
| `OTEL_METRICS_INCLUDE_SESSION_ID`                   | `true`  | Include `session.id` label (increases cardinality) |
| `OTEL_METRICS_INCLUDE_VERSION`                      | `false` | Include `app.version` label                        |
| `OTEL_METRICS_INCLUDE_ACCOUNT_UUID`                 | `true`  | Include `user.account_uuid` label                  |

> **Caution:** `OTEL_LOG_USER_PROMPTS=1` and `OTEL_LOG_TOOL_DETAILS=1` export raw prompt content and tool arguments. Enable only in environments where this data is safe to log. Tool input values over 512 characters are truncated automatically.

## What you can do in Last9

### Cost and spend tracking (metrics)

`claude_code.cost.usage` is a counter broken down by `model` and `user.email`. In Last9 Metrics, query the rate to get spend per minute, or sum it over a time window for daily/weekly totals:

* Dashboard total team spend across all models
* Break down by `user.email` to see per-developer cost
* Alert when cumulative cost crosses a budget threshold
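As a sketch, using the flattened metric name noted under Troubleshooting below (`claude_code_cost_usage_total`), and assuming dotted attribute names surface as underscored labels such as `user_email`:

```promql
# Per-developer spend over the last 24 hours
# (the label name user_email is an assumption)
sum by (user_email) (increase(claude_code_cost_usage_total[24h]))
```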
### Token efficiency (metrics)

`claude_code.token.usage` tracks input, output, cache read, and cache creation tokens separately. Compare the `cacheRead / input` ratio to measure prompt cache efficiency — a high ratio means less spend per interaction.

### Session replay via `prompt.id` (logs)

Every log event shares a `prompt.id` UUID. Filter Last9 Logs by a specific `prompt.id` to reconstruct the full sequence for one user interaction:

```plaintext
user_prompt → api_request → tool_decision → tool_result → api_request
```

Useful for debugging slow sessions or unexpected tool rejections.

### Tool usage audit (logs)

`claude_code.tool_decision` events record every permission decision with `tool.decision = accept | reject` and `decision.source`. Surface which tools are most frequently rejected and whether your `--allowedTools` policies are working.

### Code output tracking (metrics)

`claude_code.lines_of_code.count` and `claude_code.commit.count` let you measure developer output attributable to Claude Code sessions — useful for adoption reporting.

### Error rate monitoring (logs + alerts)

`claude_code.api_error` events include `http.status_code` and `retry_attempt`. Create a Last9 alert on error rate spikes to catch Claude API degradation before users report it.

## Team-level tagging

For organizations with multiple teams, use `OTEL_RESOURCE_ATTRIBUTES` to tag sessions by team or project:

```bash
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform,project=infra-agent"
```

All signals from that session carry `team` and `project` labels, enabling per-team cost breakdowns in Last9.
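With those resource attributes in place, a per-team rollup is one grouping away. Another sketch against the same flattened counter, assuming the `team` resource attribute surfaces as a plain label:

```promql
# Weekly spend per team
sum by (team) (increase(claude_code_cost_usage_total[7d]))
```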
***

## Troubleshooting

**No logs in Last9 after running Claude Code**

* Confirm `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set in the **same shell session** where you run `claude` — `export` does not persist across tabs or new sessions
* Set `OTEL_EXPORTER_OTLP_PROTOCOL=http/json` — Last9’s endpoint requires HTTP, not gRPC (the OTel SDK default)
* Check that `OTEL_LOGS_EXPORTER=otlp` (not `none` or `console`)
* Wait at least 10 seconds — the default export interval is 5s and there may be one flush delay

**No metrics appearing**

* Set `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative` — Claude Code defaults to delta temporality but Last9 requires cumulative counters
* Set `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/json` if the global protocol var is not being picked up
* Metrics flush every 60 seconds by default — wait at least 90 seconds before checking
* Metric names in Last9 use underscores: `claude_code.cost.usage` becomes `claude_code_cost_usage_total`

**401 / authentication errors**

* Verify the header format: `Authorization=Basic <token>` (no extra quotes, no `Bearer` prefix)
* Regenerate the token from [**Integrations → OpenTelemetry**](https://app.last9.io/integrations?integration=OpenTelemetry) if it has expired

**Events appear but `user.email` is missing**

* `user.email` is only populated when authenticated with a user account (not API key only)
* Run `claude --version` to confirm you are logged in

Please get in touch with us on [Discord](https://discord.com/invite/Q3p2EEucx9) or [Email](mailto:cs@last9.io) if you have any questions.

# AWS Chalice

> Instrument AWS Chalice Lambda functions with OpenTelemetry using ADOT layers for automatic tracing and observability

Use OpenTelemetry to instrument your [AWS Chalice](https://aws.github.io/chalice/) Lambda functions and send telemetry data to Last9. Chalice is AWS’s Python serverless framework that manages Lambda deployment, API Gateway, and IAM policies through a single config file. This integration uses the AWS Distro for OpenTelemetry (ADOT) Layer for automatic instrumentation with **no code changes required**.

## Prerequisites

Before setting up AWS Chalice monitoring, ensure you have:

* **AWS Account**: With access to Lambda service
* **Python 3.8+**: With Chalice installed (`pip install chalice`)
* **Chalice Project**: An existing or new Chalice application
* **Last9 Account**: With OpenTelemetry integration credentials

## Supported Runtimes

Chalice deploys Python Lambda functions. ADOT Python layers support:

* **Python**: 3.8, 3.9, 3.10, 3.11, 3.12

## Setup

1. **Configure `.chalice/config.json`**

   Add the ADOT Lambda layer and environment variables to your Chalice configuration. The layer provides auto-instrumentation — no application code changes needed.
   **Dev stage**

   ```json
   {
     "version": "2.0",
     "app_name": "your-app-name",
     "stages": {
       "dev": {
         "api_gateway_stage": "dev",
         "lambda_timeout": 30,
         "lambda_memory_size": 256,
         "xray": true,
         "layers": [
           "arn:aws:lambda:ap-southeast-1:901920570463:layer:aws-otel-python-amd64-ver-1-25-0:1"
         ],
         "environment_variables": {
           "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
           "OPENTELEMETRY_COLLECTOR_CONFIG_FILE": "/var/task/.chalice/collector-config.yaml",
           "OTEL_SERVICE_NAME": "your-service-name",
           "OTEL_PROPAGATORS": "tracecontext,xray",
           "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
           "OTEL_TRACES_EXPORTER": "otlp",
           "OTEL_TRACES_SAMPLER": "always_on",
           "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=dev"
         }
       }
     }
   }
   ```

   **Production stage**

   ```json
   {
     "version": "2.0",
     "app_name": "your-app-name",
     "stages": {
       "prod": {
         "api_gateway_stage": "prod",
         "lambda_timeout": 30,
         "lambda_memory_size": 512,
         "xray": true,
         "layers": [
           "arn:aws:lambda:ap-southeast-1:901920570463:layer:aws-otel-python-amd64-ver-1-25-0:1"
         ],
         "environment_variables": {
           "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
           "OPENTELEMETRY_COLLECTOR_CONFIG_FILE": "/var/task/.chalice/collector-config.yaml",
           "OTEL_SERVICE_NAME": "your-service-name",
           "OTEL_PROPAGATORS": "tracecontext,xray",
           "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
           "OTEL_TRACES_EXPORTER": "otlp",
           "OTEL_TRACES_SAMPLER": "traceidratio",
           "OTEL_TRACES_SAMPLER_ARG": "0.1",
           "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod"
         }
       }
     }
   }
   ```

   Production uses the `traceidratio` sampler at 10% to control costs. Adjust `OTEL_TRACES_SAMPLER_ARG` as needed.

   **Important Configuration Notes:**

   * Replace `your-app-name` and `your-service-name` with descriptive names
   * Replace the layer ARN with the correct one for your AWS region (see [ADOT Lambda docs](https://aws-otel.github.io/docs/getting-started/lambda))
   * `"xray": true` is optional — it enables X-Ray alongside ADOT for co-existence

2. **Create Collector Configuration**

   Create `collector-config.yaml` in your project root, then copy it to `.chalice/` so it gets packaged with your Lambda:

   ```yaml
   receivers:
     otlp:
       protocols:
         grpc:
           endpoint: localhost:4317

   exporters:
     otlp:
       endpoint: $last9_otlp_endpoint
       headers:
         authorization: $last9_otlp_auth_header
       tls:
         insecure: false

   service:
     pipelines:
       traces:
         receivers: [otlp]
         exporters: [otlp]
       metrics:
         receivers: [otlp]
         exporters: [otlp]
   ```

   Copy to the `.chalice/` directory:

   ```bash
   cp collector-config.yaml .chalice/collector-config.yaml
   ```

   > **Caution:** The Lambda ADOT Collector does **not** support the `batch` processor. Do not add it to your collector config.

3. **Deploy**

   **Dev**

   ```bash
   chalice deploy --stage dev
   ```

   **Production**

   ```bash
   chalice deploy --stage prod
   ```

   Chalice packages your app code, `.chalice/` directory (including `collector-config.yaml`), and requirements into a Lambda deployment.

4. **Test and Verify**

   **Chalice CLI**

   ```bash
   # Get your API URL
   chalice url --stage dev

   # Test
   curl $(chalice url --stage dev)/
   ```

   **AWS CLI**

   ```bash
   aws lambda invoke \
     --function-name your-app-name-dev \
     --payload '{"test": "event"}' \
     response.json
   cat response.json
   ```

## Understanding the Setup

### How Chalice + ADOT Works

1. Chalice deploys your Python function to Lambda with the ADOT layer attached
2. `AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-instrument` wraps the Python process at startup
3. The ADOT layer injects OpenTelemetry auto-instrumentation before your Chalice app loads
4. All HTTP handlers, scheduled tasks, and AWS SDK calls are traced automatically
5. The in-Lambda ADOT Collector sends traces to Last9 via the `collector-config.yaml`
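To sanity-check this wiring after a deploy, one option (assuming Chalice's default `<app>-<stage>` function naming) is to inspect the deployed function with the AWS CLI:

```bash
# Confirm the ADOT layer is attached and the exec wrapper is set
aws lambda get-function-configuration \
  --function-name your-app-name-dev \
  --query '{Layers: Layers, Wrapper: Environment.Variables.AWS_LAMBDA_EXEC_WRAPPER}'
```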
### Environment Variables Explained

| Variable                              | Purpose                            | Example                                    |
| ------------------------------------- | ---------------------------------- | ------------------------------------------ |
| `AWS_LAMBDA_EXEC_WRAPPER`             | Enables ADOT instrumentation       | `/opt/otel-instrument`                     |
| `OPENTELEMETRY_COLLECTOR_CONFIG_FILE` | Path to collector config in Lambda | `/var/task/.chalice/collector-config.yaml` |
| `OTEL_SERVICE_NAME`                   | Service identifier in traces       | `payment-service`                          |
| `OTEL_EXPORTER_OTLP_PROTOCOL`         | Export protocol                    | `http/protobuf`                            |
| `OTEL_TRACES_SAMPLER`                 | Sampling strategy                  | `always_on` or `traceidratio`              |
| `OTEL_TRACES_SAMPLER_ARG`             | Sampling rate (if traceidratio)    | `0.1` (10%)                                |
| `OTEL_PROPAGATORS`                    | Trace context formats              | `tracecontext,xray`                        |
| `OTEL_RESOURCE_ATTRIBUTES`            | Additional metadata                | `deployment.environment=prod`              |

### What Gets Traced

The ADOT layer automatically traces:

* **Chalice Route Handlers**: `@app.route()` decorated functions
* **Scheduled Tasks**: `@app.schedule()` decorated functions
* **AWS SDK Calls**: DynamoDB, S3, SQS, SNS, etc.
* **HTTP Requests**: Outbound API calls via `urllib`, `requests`, `boto3`
* **Database Calls**: RDS, DynamoDB operations

## X-Ray Co-existence

If your Chalice app already has `"xray": true` in config.json, you can keep it alongside ADOT:

* **X-Ray traces** continue going to the AWS X-Ray service (existing dashboards keep working)
* **ADOT/OTLP traces** go to Last9

Setting `OTEL_PROPAGATORS=tracecontext,xray` ensures the ADOT layer reads and writes both W3C `traceparent` and AWS `X-Amzn-Trace-Id` headers. Trace context propagates correctly regardless of which format upstream services use.

To use ADOT only (no X-Ray), remove `"xray": true` from config.json and set propagators to `tracecontext` only.

## Advanced Configuration

### Custom Spans via Chalice Middleware

Auto-instrumentation captures handlers and SDK calls. For custom business logic spans, use the OTel API with Chalice middleware:

```python
from chalice import Chalice
from opentelemetry import trace

app = Chalice(app_name="your-app")
tracer = trace.get_tracer(__name__)

@app.middleware("all")
def add_custom_attributes(event, get_response):
    span = trace.get_current_span()
    if span.is_recording():
        span.set_attribute("app.framework", "chalice")
    return get_response(event)

@app.route("/process/{order_id}")
def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # your business logic
        return {"status": "processed"}
```

Only `opentelemetry-api` is needed in `requirements.txt`. The ADOT layer provides the full SDK.
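A minimal `requirements.txt` under that assumption might look like this; the ADOT layer supplies `opentelemetry-sdk` and the instrumentation packages at runtime:

```text
chalice
opentelemetry-api
```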
### Per-Function Configuration

Chalice supports per-function overrides for layer, memory, and timeout:

```json
{
  "stages": {
    "dev": {
      "lambda_functions": {
        "periodic_check": {
          "lambda_timeout": 60,
          "lambda_memory_size": 128
        }
      }
    }
  }
}
```

### Sampling Configuration

Control trace sampling to manage costs:

```bash
# Development: Sample all traces
OTEL_TRACES_SAMPLER=always_on

# Production: Sample 10% of traces
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```

## Troubleshooting

### No Traces Appearing

**Check CloudWatch Logs:**

```bash
aws logs tail /aws/lambda/your-app-name-dev --follow
```

**Common Issues:**

* Verify `collector-config.yaml` is at `/var/task/.chalice/collector-config.yaml` inside the Lambda
* Confirm the ADOT layer ARN is correct for your region
* Check that the Last9 credentials in the collector config are valid
* Ensure `AWS_LAMBDA_EXEC_WRAPPER` is set to `/opt/otel-instrument`

### Module Errors

Do NOT add `opentelemetry-sdk` or `opentelemetry-instrumentation-*` to `requirements.txt`. The ADOT layer provides them. Only `opentelemetry-api` is needed (for custom spans).

### Cold Start Latency

ADOT adds roughly 500ms–1s to cold starts. Mitigate with:

* Provisioned concurrency for latency-sensitive functions
* Adequate memory allocation (256MB+ recommended)

### Error Messages

| Error                       | Solution                                                        |
| --------------------------- | --------------------------------------------------------------- |
| "batch processor not found" | Remove the `batch` processor from collector-config.yaml         |
| "parse headers"             | Use `authorization=Basic ...` format (lowercase key, key=value) |
| "Layer not found"           | Use the correct layer ARN for your region                       |
| "Recording is off"          | Set `OTEL_TRACES_SAMPLER=always_on`                             |

## Best Practices

* **Service Naming**: Use descriptive, consistent names across Chalice stages
* **Sampling**: Start with `always_on` in dev, use `traceidratio` in production
* **X-Ray Transition**: Keep X-Ray enabled initially, disable once Last9 dashboards are confirmed
* **Memory**: Allocate 256MB+ to account for ADOT layer overhead
* **Collector Config**: Always copy collector-config.yaml to `.chalice/` before deploying

## Need Help?

If you encounter any issues or have questions:

* Join our [Discord community](https://discord.com/channels/652153247672729619/652153247672729621) for real-time support
* Contact our support team at [cs@last9.io](mailto:cs@last9.io)

# AWS EC2

> Send logs and hostmetrics from AWS EC2 instance using OpenTelemetry

This guide will help you instrument your AWS EC2 instance with OpenTelemetry and send its logs and host metrics to Last9.

## Pre-requisites

1. You have an AWS EC2 instance with a workload running on it.
2. You have signed up for [Last9](https://app.last9.io), created a cluster, and obtained the following OTLP credentials from the [Integrations](https://app.last9.io/integrations?integration=OpenTelemetry) page:
   * `endpoint`
   * `auth_header`
3. Optional: Attach an IAM policy to the EC2 instance with the `ec2:DescribeTags` permission. The resource detection processor needs it to fetch the tags associated with the EC2 instance, which can then be attached as additional resource attributes.
4. Install the OTel Collector. There are multiple ways to install it; every Collector release includes APK, DEB, and RPM packaging for Linux amd64/arm64/i386 systems. One possible way, using RPM, is shown below.

> Note: systemd is required for automatic service configuration.
```sh
sudo rpm -ivh otelcol-contrib_0.103.0_linux_amd64.rpm
```

More installation options can be found [here](https://opentelemetry.io/docs/collector/installation/#linux).

> Note: We recommend installing `otel-collector-contrib` version `0.103.0`.

## Sample Otel Collector Configuration

The default path for the OTel Collector config is `/etc/otelcol-contrib/config.yaml`. Edit it and update it with the configuration below. The configuration is annotated with comments that should be addressed before applying it.

The operator configuration is especially important for extracting the `timestamp` and `severity`. For JSON logs, use the `json_parser` and use its keys for log attributes. For non-structured logs, use the `regex_parser`. The configuration provides sample examples of both the JSON and regex parsers.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
        metrics:
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
          system.memory.limit:
            enabled: true
      load:
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      network:
      paging:
      processes:
      process:
        mute_process_user_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
          process.threads:
            enabled: true
          process.paging.faults:
            enabled: true
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Detailed configuration options can be found at
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver
  filelog:
    # File path pattern to read logs from. Update this to the destination from where you want to read logs.
    include: [/tmp/*.log]
    exclude: [/home/ubuntu/exclude/*.log]
    include_file_path: true
    # attributes: A map of key: value pairs to add to the entry's attributes.
    # resource: A map of key: value pairs to add to the entry's resource.
    retry_on_failure:
      enabled: true
    operators:
      # For logs in JSON format
      - type: json_parser
        severity:
          parse_from: attributes.level
        timestamp:
          parse_from: attributes.time
          layout: "%Y-%m-%d %H:%M:%S"
      # For plain text logs
      - type: regex_parser
        regex: '(?P^[A-Za-z]+) (?P