Mantra Networking

Prometheus: Deep Dive
Created By: Lauren R. Garcia

Table of Contents

  • Overview
  • Core Components
  • Prerequisites
  • Configuration
  • Validation
  • Troubleshooting
  • Conclusion

Overview

Prometheus is an open-source monitoring and alerting solution built for reliability and flexibility in both cloud-native and traditional environments. Originally created at SoundCloud and now part of the Cloud Native Computing Foundation (CNCF), Prometheus helps organizations collect, store, and analyze highly granular time-series metrics from their infrastructure and applications.

What Is Prometheus?

At its core, Prometheus is a time-series database and monitoring system. It continuously collects numerical data points identified by metric names and labels (key-value pairs), allowing users to track everything from hardware resource usage to application-specific events. Prometheus stores this data efficiently for querying and alerting, enabling users to visualize trends, detect anomalies, and respond swiftly to operational issues.

Key features include:

  • Powerful Query Language: PromQL lets users select, filter, and aggregate collected metrics, and its results drive graphs, dashboards, and alerts.
  • Pull-Based Data Collection: Prometheus periodically scrapes metrics from configured endpoints, making it well-suited for dynamic cloud environments.
  • Self-Contained Architecture: No external database is required; Prometheus handles storage and retrieval independently.
  • Rich Ecosystem: Numerous exporters, integrations, and visualization tools (like Grafana) are available, extending Prometheus’s reach and capabilities.

Why You Need to Know About Prometheus

  • Modern Monitoring for Modern Infrastructure: As applications and systems grow in complexity, having granular, real-time observability is crucial. Prometheus is designed for the dynamic and scalable nature of cloud-native workloads, especially in environments like Kubernetes.
  • Actionable Insights: With built-in alerting, Prometheus helps teams detect and address problems before they impact users or critical operations.
  • Wide Adoption and Community Support: Prometheus is backed by a large open-source community and broad industry support, so best practices, exporters, and integrations are readily available.

How Prometheus Works

Prometheus operates via a pull-based model, regularly scraping metrics endpoints that expose data in a simple text format. If an application or system does not natively export Prometheus-formatted metrics, exporters can be deployed to convert and expose the required information.

The basic workflow:

  1. Scrape: Prometheus server discovers and pulls metrics from endpoints at configurable intervals.
  2. Store: Data is stored locally as time-series, indexed by metric name and label set.
  3. Query: Users analyze data using PromQL, exploring trends, performing aggregations, and visualizing results.
  4. Alert: Prometheus evaluates alerting rules against the stored data and sends firing alerts to Alertmanager, which delivers notifications via channels like email or messaging apps.

Prometheus’s design prioritizes operational simplicity, reliability, and scalability, making it a foundational tool for anyone aiming to build robust observability and monitoring into their systems.

Core Components

These are the essential building blocks that enable Prometheus to deliver robust monitoring and alerting in cloud-native and traditional environments:

  • Prometheus Server: The central component that scrapes and stores time-series data. It handles all data collection, answers PromQL queries, and evaluates alerting rules.
  • Data Exporters: Lightweight agents or programs that expose metrics in a format Prometheus understands. Exporters bridge the gap for systems that do not natively expose Prometheus metrics (for example, node_exporter for host-level system metrics).
  • Alertmanager: Handles all alert notifications generated by Prometheus rules. Alertmanager groups, de-duplicates, and routes alerts to email, Slack, or other channels based on user-defined policies.
  • Service Discovery: Automated discovery of monitoring targets in dynamic environments. Prometheus can watch platforms such as Kubernetes or Consul to identify endpoints for scraping without manual updates.
  • Web UI and Visualization: A built-in, browser-accessible interface for querying and viewing metrics, troubleshooting data, and exploring time-series trends. For advanced dashboards, Prometheus integrates with tools like Grafana.
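To make the Alertmanager bullet above more concrete, here is a minimal sketch of an alertmanager.yml that groups alerts and routes them by severity. It assumes a reasonably recent Alertmanager release (one that supports the matchers syntax). Every receiver name, email address, SMTP host, and webhook URL below is a placeholder, not a value from this guide.

```yaml
# alertmanager.yml (illustrative sketch; all names and addresses are placeholders)
global:
  smtp_smarthost: 'smtp.example.com:587'     # hypothetical mail relay
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'team-email'                     # default receiver for all alerts
  group_by: ['alertname', 'job']             # alerts sharing these labels are batched together
  group_wait: 30s                            # wait before sending the first notification for a group
  group_interval: 5m                         # wait before notifying about new alerts in a group
  repeat_interval: 4h                        # re-send interval while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"                # critical alerts go to a different channel
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'oncall@example.com'
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'
        channel: '#alerts'
```

Grouping by alertname and job means that many instances failing for the same reason produce one notification rather than one per instance.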

Prerequisites

Before deploying Prometheus, ensure the following prerequisites are met to enable a smooth and efficient installation:

  • Supported Operating System: Prometheus can run on most Unix-based systems (Linux is preferred) and also supports Windows and macOS.
  • System Requirements: For typical setups and small environments, allocate at least 2 CPU cores, 4 GB of RAM, and 20 GB of free disk space. Larger production deployments may need additional resources.
  • Administrator Privileges: Ensure you have admin or root privileges to install software and configure services on the server where Prometheus will be installed.
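  • Network Access: Allow inbound TCP connections on port 9090, which serves Prometheus’s web interface and APIs. Ensure outbound internet access for downloads and updates, and make sure the server can reach the metrics port of every scrape target (for example, 9100 for Node Exporter).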
  • Create Dedicated User: For enhanced security, create a non-privileged system user and group (commonly named prometheus) to run the Prometheus process.
  • Set Up Folders: Prepare directories for configuration files (such as /etc/prometheus) and data storage (such as /var/lib/prometheus).
  • Basic Linux Knowledge: Familiarity with the terminal and basic Linux commands will be helpful during installation and configuration.

With those basics covered, complete the following preparation steps:

  1. Update System Packages: Ensure your OS package repositories are up to date.
  2. Download Prometheus Binaries: Get the latest release from the official Prometheus website or GitHub repository, matching your operating system and processor architecture.
  3. Extract and Place Binaries: Unpack the archive and move the prometheus and promtool binaries to a location in your system’s PATH (such as /usr/local/bin).
  4. Copy Console Libraries: If your release archive includes the consoles and console_libraries directories, move them to the configuration directory.
  5. Initialize Configuration: Create an initial prometheus.yml file and tailor it to your desired monitoring targets.
  6. Open Firewall Port: Adjust firewall rules to permit incoming connections on port 9090.

Once these prerequisites are in place, you are ready to proceed with the Prometheus installation and configuration steps.

Configuration

After installing Prometheus, the next crucial step is configuring it to begin collecting and storing metrics. Configuration is managed via a YAML file typically named prometheus.yml. Below is a step-by-step guide to help you set up Prometheus for effective monitoring:

  1. Locate the Configuration File: This guide keeps the Prometheus configuration file at /etc/prometheus/prometheus.yml. Prometheus itself looks for prometheus.yml in its working directory by default, so pass the path with the --config.file flag when starting the binary.
  2. Define Global Settings: Start by setting global configuration parameters such as the scraping interval and evaluation interval:
    
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
        
    These intervals determine how often Prometheus collects data and evaluates alerting rules.
  3. Configure Scrape Targets: Scrape targets define which endpoints Prometheus should collect metrics from. Here’s an example configuration for a basic node exporter:
    
    scrape_configs:
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['localhost:9100']
        
    This tells Prometheus to scrape metrics from the Node Exporter running on the local machine.
  4. Add More Job Definitions: You can add more jobs under the same scrape_configs key to scrape different sets of services or exporters. Each job can list static targets or use service discovery:
    
      - job_name: 'app_metrics'
        static_configs:
          - targets: ['app-server-1:8080', 'app-server-2:8080']
        
  5. Use Service Discovery (Optional): For dynamic environments like Kubernetes, use service discovery to automatically discover scrape targets:
    
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
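            regex: true
        
    The relabel rule keeps only pods that carry the annotation prometheus.io/scrape: "true"; all other discovered pods are dropped before scraping.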
  6. Configure Alerting Rules: Prometheus supports custom alerts using the rule_files directive. Add rules to a separate YAML file and reference them:
    
    rule_files:
      - 'alert_rules.yml'
        
    These rules define conditions under which alerts should fire based on metrics measurements.
  7. Validate and Restart: Use the promtool check config prometheus.yml command to validate your config file. Then restart the Prometheus service or rerun the binary to apply the new configuration.
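Putting steps 2 through 6 together, a complete prometheus.yml might look like the sketch below. The global settings, job names, targets, and rule file name come from the steps above. The alerting block is an addition: it assumes an Alertmanager instance is running on the same host on its default port, 9093, which this guide does not set up.

```yaml
# /etc/prometheus/prometheus.yml (consolidated sketch of steps 2-6)
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated

rule_files:
  - 'alert_rules.yml'         # relative paths resolve against this file's directory

# Assumption: Alertmanager listens on localhost:9093. Without this block,
# alerts still show up in the Prometheus UI but no notifications are sent.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app_metrics'
    static_configs:
      - targets: ['app-server-1:8080', 'app-server-2:8080']
```

Keeping every job under a single scrape_configs key matters: a second scrape_configs key in the same file is a duplicate key, not an extension of the first.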

Proper configuration ensures Prometheus scrapes the right targets, stores accurate metric data, and triggers alerts on performance or availability issues as needed.
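Step 6 references alert_rules.yml without showing what goes inside it. The following is a minimal sketch of such a rule file. The group name, alert name, threshold duration, and severity label are illustrative assumptions, so adapt them to your environment.

```yaml
# /etc/prometheus/alert_rules.yml (illustrative; names and durations are assumptions)
groups:
  - name: availability
    rules:
      - alert: TargetDown
        # "up" is 0 when the most recent scrape of a target failed.
        expr: up == 0
        # The condition must hold for 5 minutes before the alert fires,
        # which filters out brief blips such as a restarting exporter.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Target {{ $labels.instance }} is down'
          description: 'Job {{ $labels.job }} has been failing scrapes for more than 5 minutes.'
```

A rule file like this can be checked on its own with promtool check rules alert_rules.yml before Prometheus is restarted.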

Validation

Once Prometheus is installed and configured, it's important to validate that everything is working correctly. This step ensures your configuration file is error-free, Prometheus is running, and metrics are being scraped as expected. Follow the steps below to complete validation:

  1. Check Prometheus Status
    Run the following command to ensure Prometheus is running:
    systemctl status prometheus
    If you installed Prometheus as a systemd service, this command confirms that it started successfully. You should see 'active (running)' in the output.
  2. Validate Configuration File
    Use promtool, which ships with Prometheus, to validate your configuration file:
    promtool check config /etc/prometheus/prometheus.yml
    This checks the YAML syntax, the Prometheus-specific settings, and any rule files the configuration references. If there are no errors, the output confirms the file is valid.
  3. Open the Web UI
    Navigate to the Prometheus web interface by visiting:
    http://localhost:9090
    From here, you can access the built-in expression browser, the list of targets, and the alerting rules.
  4. Check Scrape Targets
    In the web UI, click on Status > Targets to view all configured scrape targets. Each healthy target shows a green "UP" state. If a target is marked "DOWN", read the error shown next to it, then confirm that the exporter is running and reachable and that the target address in the configuration file is correct.
  5. Run a PromQL Query
    In the Prometheus web UI, try running a basic query in the "Expression" bar to confirm metrics are being collected:
    up
    This query returns one series per target, labeled with its job and instance. A value of 1 means the last scrape succeeded and 0 means it failed.
  6. Monitor Logs for Errors
    If something seems off, review the Prometheus logs with the following command:
    journalctl -u prometheus -f
    Logs can help pinpoint configuration errors, failed scrapes, or permission issues.

Completing these validation steps ensures your Prometheus instance is operational and ready to collect meaningful metrics from your systems and applications.

Troubleshooting

Troubleshooting Prometheus involves systematically identifying and resolving issues to ensure continuous and reliable monitoring. Follow these step-by-step procedures if you encounter errors or system instability:

  1. Check Prometheus Service Status
    Confirm whether Prometheus is running as expected:
    systemctl status prometheus
    If it shows as “failed” or “inactive,” continue to the next steps.
  2. Review Logs for Error Details
    Examine real-time logs to find specific error messages or warnings:
    journalctl -u prometheus -f
    Look for issues like file permission errors, missing directories, or configuration problems.
  3. Validate Configuration Files
    Configuration errors are a common cause of failure. Check your YAML files for mistakes:
    promtool check config /etc/prometheus/prometheus.yml
    Correct any highlighted syntax problems and restart the service if required.
  4. Check File and Directory Permissions
    Prometheus needs proper access to its config files and data directories:
    ls -l /etc/prometheus/prometheus.yml
    ls -l /var/lib/prometheus/
    If needed, set the correct ownership:
    sudo chown -R prometheus:prometheus /etc/prometheus
    sudo chown -R prometheus:prometheus /var/lib/prometheus
  5. Verify Storage and Disk Space
    Prometheus depends on adequate disk space for data retention. Running out of space can halt ingestion:
    df -h /var/lib/prometheus
    Free up space or expand storage if necessary.
  6. Investigate High Resource Usage
    High memory or CPU usage usually points to too many active series or expensive queries.
    • Lengthen scrape intervals (scrape less often) or reduce the number of metrics collected.
    • Review PromQL queries and avoid unfiltered selectors or very long range windows that load far more series than needed.
  7. Address High Cardinality Issues
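    Too many unique label combinations (high cardinality) can overload Prometheus. To find the worst offenders, run the following query, which lists the ten metric names with the most series:
    topk(10, count by(__name__)({__name__=~".+"}))
    Then remove or relabel high-cardinality labels on those metrics where possible. This query touches every series, so it can be slow on a large server.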
  8. Resolve Service Start Failures
    If the underlying problem is fixed but systemd still refuses to start the service after repeated failed attempts, clear the failed state and start it again:
    sudo systemctl reset-failed prometheus
    sudo systemctl start prometheus
  9. Fix Alerting and Scraping Issues
    If alerts don’t fire or expected data is missing:
    • Check the Status > Targets section in the web UI to verify scrape targets are up.
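    • Review the "for" duration in each alert rule. An alert stays pending until its condition has held for that long, so an overly long "for" delays notifications, while omitting it lets brief spikes fire immediately and adds to alert fatigue.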
  10. Handle Data Corruption
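    If Prometheus reports storage corruption, you might need to remove the damaged data blocks. List the blocks in the data directory to identify them:
    promtool tsdb list /var/lib/prometheus
    Stop Prometheus and back up the data directory first, then delete only the problematic block directories and restart the service.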
  11. Monitor and Maintain Regular Backups
    Frequent backups of your data directory are essential for recovery in case of major failures.

Follow this methodical troubleshooting process to swiftly resolve most Prometheus issues and keep your monitoring system healthy and dependable.
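As a follow-up to step 7, one common way to curb cardinality is to drop labels or whole metrics at scrape time with metric_relabel_configs. The sketch below reuses the app_metrics job from the Configuration section. The label name request_id and the metric prefix app_debug_ are hypothetical examples, not names your application necessarily exposes.

```yaml
scrape_configs:
  - job_name: 'app_metrics'
    static_configs:
      - targets: ['app-server-1:8080', 'app-server-2:8080']
    # metric_relabel_configs run after the scrape and before samples are stored.
    metric_relabel_configs:
      # Remove a hypothetical per-request label that creates a new series per request.
      - action: labeldrop
        regex: 'request_id'
      # Discard a hypothetical family of debug metrics entirely.
      - source_labels: [__name__]
        action: drop
        regex: 'app_debug_.*'
```

When dropping a label, make sure the remaining labels still identify each series uniquely; otherwise samples from different series collide and the scrape reports errors. Where possible, fixing the instrumentation in the application itself is the cleaner long-term solution.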

Conclusion

Throughout our deep dive into Prometheus, we’ve uncovered the power and flexibility that make it a go-to solution for modern monitoring and alerting. Let’s recap what we’ve learned:

  • Overview: We explored what Prometheus is — an open-source, pull-based monitoring tool designed for time-series data collection. It’s flexible, reliable, and purpose-built for cloud-native environments.
  • Core Components: We broke down Prometheus into its essential parts: the Prometheus server, exporters, Alertmanager, service discovery mechanisms, and visualization tools like Grafana.
  • Prerequisites: Before beginning installation, we discussed system requirements, directory setup, user permissions, and other necessary pre-installation steps.
  • Configuration: We walked through a step-by-step explanation of the prometheus.yml file, how to define scrape targets, enable service discovery, and configure alerting rules.
  • Validation: We validated the installation with built-in tools like promtool, checked scrape targets, queried metrics through the web UI, and confirmed that Prometheus is running properly.
  • Troubleshooting: We covered how to diagnose common issues — from configuration errors and permission problems to high cardinality and performance tuning.

By now, you should feel confident in setting up and managing Prometheus in your system. Whether you're monitoring a couple of servers or a dynamic Kubernetes cluster, Prometheus provides the visibility and insights necessary to keep your operations running smoothly.

Thanks for joining us on this journey! If you’re ready to level up your observability game, keep tinkering, keep questioning, and—of course—keep monitoring.

Happy shipping and smooth monitoring! 🚀📊