Sunday, 6 January 2019

Automagically Discovering and Scraping Google Compute Nodes in Prometheus

Prometheus can scrape metrics from either a static list of machines or discover machines dynamically using a service discovery plugin. Service discovery plugins exists for the major cloud providers, which includes Google Cloud Platform (GCP).

A simple configuration for GCP’s service discovery in the Prometheus config (usually prometheus.yml) looks like this
      - job_name: node
        honor_labels: true
        gce_sd_configs:
          - project: ml-platform-a
            zone: us-eastl1-a
            port: 9100
        relabel_configs:
          - source_labels: [__meta_gce_label_cloud_provider]
            target_label: cloud_provider
          - source_labels: [__meta_gce_label_cloud_zone]
            target_label: cloud_zone
          - source_labels: [__meta_gce_label_cloud_tier]
            target_label: cloud_tier
          - source_labels: [__meta_gce_label_cloud_service]
            target_label: cloud_service
          - source_labels: [__meta_gce_instance_name]
            target_label: instance
Let’s dissect this. Running Prometheus with this configuration will fetch all the instances in the GCP project ml-platform-a in the zone us-east1-a, and scrape their "/metrics" endpoints at port 9100. The relabel config lets you convert GCE (Google Compute Engine) labels (source) into Prometheus labels (target).

However, this config will attempt to pull data from all instances whether they are running or not, and end up marking the stopped ones as "DOWN". To get around this, you need to filter out the stopped instances. Add a filter after the port directive, like this
        port: 9100
        filter: '(status="RUNNING")'
The equivalent gcloud command to list all running instances looks like
gcloud compute instances list --filter='status:(RUNNING)
Note the difference in syntax. The keywords, however, are identical.

What if you have multiple exporters running on a specific set of instances? You can select them by their label(s) and add a different gce_sd_config section for them. For instances which have exporters running on say, port 3000, and have a label called “cloud_service:dashboard”, the config would look like
  - job_name: dashboard
    honor_labels: true
    gce_sd_configs:
      - project: ml-plaform-a
        zone: us-central1-c
    port: 3000
        filter: '(status="RUNNING") AND (labels.cloud_service="dashboard")'
    relabel_configs:
      - source_labels: [__meta_gce_label_cloud_provider]
        target_label: cloud_provider
      - source_labels: [__meta_gce_label_cloud_zone]
        target_label: cloud_zone
      - source_labels: [__meta_gce_label_cloud_tier]
        target_label: cloud_tier
      - source_labels: [__meta_gce_label_cloud_service]
        target_label: cloud_service
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
Just for reference, the analogous gcloud command is
gcloud compute instances list --filter='status:(RUNNING) AND labels.cloud_service:dashboard'
The relabel_configs is identical to that of the 9100 scraper. It would have been nice if Prometheus had allowed for a common relabel config section that could be reused for such cases.

The GCE service discovery plugin needs read permission on the GCE Compute API to be able to pull the list of instances. There are several ways to do this, depending on how you are running Prometheus

  • Prometheus on a GCE instance in the same project : You can assign the correct IAM permissions to your GCE instance, and nothing more needs to be done.
  • Prometheus on a GCE instance in a different project, or a non-GCE machine : You can create a service account in your GCP project, download the key as a JSON and start Prometheus with the JSON set in an environment variable, like this
GOOGLE_APPLICATION_CREDENTIALS=...path..to..json..credentials…  ./prometheus -- (other options)