Sunday, 6 January 2019

Automagically Discovering and Scraping Google Compute Nodes in Prometheus

Prometheus can scrape metrics either from a static list of machines or from machines it discovers dynamically through a service discovery plugin. Service discovery plugins exist for the major cloud providers, including Google Cloud Platform (GCP).

A simple configuration for GCP’s service discovery in the Prometheus config (usually prometheus.yml) looks like this:
      - job_name: node
        honor_labels: true
        gce_sd_configs:
          - project: ml-platform-a
            zone: us-east1-a
            port: 9100
        relabel_configs:
          - source_labels: [__meta_gce_label_cloud_provider]
            target_label: cloud_provider
          - source_labels: [__meta_gce_label_cloud_zone]
            target_label: cloud_zone
          - source_labels: [__meta_gce_label_cloud_tier]
            target_label: cloud_tier
          - source_labels: [__meta_gce_label_cloud_service]
            target_label: cloud_service
          - source_labels: [__meta_gce_instance_name]
            target_label: instance
Let’s dissect this. Running Prometheus with this configuration will fetch all the instances in the GCP project ml-platform-a in the zone us-east1-a, and scrape their "/metrics" endpoints at port 9100. The relabel config lets you convert GCE (Google Compute Engine) labels (source) into Prometheus labels (target).
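
To make the relabel step concrete, here is a minimal Go sketch of what those rules do to one discovered instance. It is a simplified model, not Prometheus's actual implementation: the relabelRule type, the applyRelabel function and the instance name node-1 are made up for illustration. It copies each source label to its target and then drops labels starting with a double underscore, which Prometheus also removes once relabeling is done.

```go
package main

import (
	"fmt"
	"strings"
)

// relabelRule mirrors one entry of the relabel config: copy the value of a
// discovered source label into a target label. Hypothetical type, used only
// for this illustration.
type relabelRule struct {
	Source string
	Target string
}

// applyRelabel copies the discovered labels, applies each rule in order,
// and finally drops every label whose name starts with "__".
// Rules whose source label is absent or empty are skipped.
func applyRelabel(discovered map[string]string, rules []relabelRule) map[string]string {
	labels := make(map[string]string, len(discovered))
	for k, v := range discovered {
		labels[k] = v
	}
	for _, r := range rules {
		if v, ok := labels[r.Source]; ok && v != "" {
			labels[r.Target] = v
		}
	}
	for k := range labels {
		if strings.HasPrefix(k, "__") {
			delete(labels, k)
		}
	}
	return labels
}

func main() {
	// Hypothetical metadata for a single discovered instance.
	discovered := map[string]string{
		"__meta_gce_instance_name":       "node-1",
		"__meta_gce_label_cloud_service": "dashboard",
	}
	rules := []relabelRule{
		{"__meta_gce_label_cloud_service", "cloud_service"},
		{"__meta_gce_instance_name", "instance"},
	}
	fmt.Println(applyRelabel(discovered, rules))
}
```

Only cloud_service and instance survive in the result; the __meta_gce_* labels are gone, which is why they have to be copied explicitly if you want them on your time series.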

However, this config will attempt to pull data from all instances whether they are running or not, and end up marking the stopped ones as "DOWN". To get around this, you need to filter out the stopped instances. Add a filter after the port directive, like this
        port: 9100
        filter: '(status="RUNNING")'
The equivalent gcloud command to list all running instances looks like
gcloud compute instances list --filter='status:(RUNNING)'
Note the difference in syntax. The keywords, however, are identical.

What if you have multiple exporters running on a specific set of instances? You can select them by their label(s) and add a separate job with its own gce_sd_configs section for them. For instances which have exporters running on, say, port 3000, and have a label called “cloud_service:dashboard”, the config would look like this:
  - job_name: dashboard
    honor_labels: true
    gce_sd_configs:
      - project: ml-platform-a
        zone: us-central1-c
        port: 3000
        filter: '(status="RUNNING") AND (labels.cloud_service="dashboard")'
    relabel_configs:
      - source_labels: [__meta_gce_label_cloud_provider]
        target_label: cloud_provider
      - source_labels: [__meta_gce_label_cloud_zone]
        target_label: cloud_zone
      - source_labels: [__meta_gce_label_cloud_tier]
        target_label: cloud_tier
      - source_labels: [__meta_gce_label_cloud_service]
        target_label: cloud_service
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
Just for reference, the analogous gcloud command is
gcloud compute instances list --filter='status:(RUNNING) AND labels.cloud_service:dashboard'
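
If you generate such configs for many services, the filter string in the Prometheus config follows a simple pattern: parenthesised field="value" terms joined with AND. Here is a small Go sketch that builds it. The condition type and the gceFilter function are hypothetical helpers, and no escaping of values is attempted.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is one equality test on an instance field, e.g. status = RUNNING.
// Hypothetical helper type for this sketch.
type condition struct {
	Field string
	Value string
}

// gceFilter renders the conditions in the form used by the filter option
// in the configs above: (field="value") AND (field="value").
// Values are inserted as-is; quotes inside a value are not escaped.
func gceFilter(conds ...condition) string {
	parts := make([]string, 0, len(conds))
	for _, c := range conds {
		parts = append(parts, "("+c.Field+"=\""+c.Value+"\")")
	}
	return strings.Join(parts, " AND ")
}

func main() {
	fmt.Println(gceFilter(
		condition{"status", "RUNNING"},
		condition{"labels.cloud_service", "dashboard"},
	))
	// prints: (status="RUNNING") AND (labels.cloud_service="dashboard")
}
```
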
The relabel_configs section is identical to that of the port 9100 scraper. It would have been nice if Prometheus allowed a common relabel config section that could be reused in such cases.

The GCE service discovery plugin needs read permission on the GCE Compute API to be able to pull the list of instances. There are several ways to do this, depending on how you are running Prometheus

  • Prometheus on a GCE instance in the same project : You can assign the correct IAM permissions to your GCE instance, and nothing more needs to be done.
  • Prometheus on a GCE instance in a different project, or a non-GCE machine : You can create a service account in your GCP project, download the key as a JSON and start Prometheus with the JSON set in an environment variable, like this…  ./prometheus -- (other options)

Saturday, 14 May 2016

Executing External Commands in Go

Sometimes we need to invoke operating system commands from our code. Most languages have APIs for this - Java has Runtime.exec(), Python has subprocess and Go has the os/exec package. This post briefly explores the Go API.

The APIs are part of the os/exec package. The Cmd abstraction encapsulates a command object, where various tweaks can be done including setting the standard output and error streams.

Simple execution of a command is very easy. However, if one wants finer control over the execution, including control over streams and the correct exit code, maybe when it's to be used in a framework or a library, the code becomes slightly more involved. 

Creating the Cmd object is straightforward

    cmd := exec.Command(binaryName, args...) 

The output and error streams can be redirected as follows

    stdout := &bytes.Buffer{}
    stderr := &bytes.Buffer{}
    cmd.Stdout = stdout
    cmd.Stderr = stderr

Run executes the command and waits for it to finish. It returns a non-nil error if the execution failed.

    err := cmd.Run()

The command execution can fail for various reasons - it might not have been a valid command, it might have exited with an error code or there might have been IO errors. We need to detect these cases so that the caller of the API gets the correct response.

The Go source file exec.go documents the error types that can occur.

exec.ExitError

An unsuccessful exit by a command. The ExitError object also has a "subset of the standard error output from the Cmd.Output method if standard error was not otherwise being collected."

exec.Error

One of the cases where this Error can be returned is when the command could not be located. When the Cmd is created with exec.Command, it calls LookPath to locate the binary if the binaryName argument does not contain path separators, and LookPath can return one of these Error instances when the executable cannot be found. The actual implementation depends on the OS.

We can switch on the Error type

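        switch e := err.(type) {
        case *exec.ExitError:
            // The command ran but exited unsuccessfully.
            if status, ok := e.Sys().(syscall.WaitStatus); ok {
                exitcode = status.ExitStatus()
            }
        case *exec.Error:
            // The command could not be started, e.g. the binary was not found.
            // e.Err holds the underlying cause; handle it as your API requires.
        default:
            panic("Unknown err type: " + reflect.TypeOf(err).String())
        }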


If it's ExitError, we need to query the OS specific implementations using the Sys interface. The Unix implementation is syscall.WaitStatus. 
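
As an aside, newer Go releases offer a portable alternative to the syscall.WaitStatus assertion: ProcessState has an ExitCode method (Go 1.12+), and errors.As (Go 1.13+) can replace the type switch. This is a different technique from the one used in this post, shown as a minimal sketch. The exitCodeOf and runExitCode names are made up, and the example in main assumes a Unix-like system with sh available.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCodeOf classifies the error returned by (*exec.Cmd).Run.
// It returns the exit code and true when an exit status is available,
// or -1 and false for any other error (binary not found, I/O errors).
// Note that ExitCode itself returns -1 if the process was killed by a signal.
func exitCodeOf(err error) (code int, ok bool) {
	if err == nil {
		return 0, true
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), true
	}
	return -1, false
}

// runExitCode runs the command and reports its exit code via exitCodeOf.
func runExitCode(name string, args ...string) (int, bool) {
	return exitCodeOf(exec.Command(name, args...).Run())
}

func main() {
	code, ok := runExitCode("sh", "-c", "exit 3")
	fmt.Println(code, ok)
}
```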

If the err instance is nil, the command execution succeeded and we can get the exit code from the Cmd itself.

        if status, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok {
            exitcode = status.ExitStatus()
        }


The complete source code is here
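
For readers without the linked source, here is a minimal self-contained sketch that puts the steps above together. It is not the original source: the result type and the run function are names invented for this example, and main assumes a Unix-like system with sh available.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"syscall"
)

// result holds what a caller typically wants back from a command.
type result struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// run executes binaryName with args and captures both output streams.
// A non-zero exit is reported through ExitCode, not as an error; the
// returned error is non-nil only when no exit status is available
// (for example the binary was not found, or an I/O error occurred).
func run(binaryName string, args ...string) (result, error) {
	stdout := &bytes.Buffer{}
	stderr := &bytes.Buffer{}
	cmd := exec.Command(binaryName, args...)
	cmd.Stdout = stdout
	cmd.Stderr = stderr

	err := cmd.Run()
	res := result{Stdout: stdout.String(), Stderr: stderr.String()}

	switch e := err.(type) {
	case nil:
		// Success: read the status from the Cmd itself.
		if status, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok {
			res.ExitCode = status.ExitStatus()
		}
	case *exec.ExitError:
		// The command ran but exited unsuccessfully.
		if status, ok := e.Sys().(syscall.WaitStatus); ok {
			res.ExitCode = status.ExitStatus()
		}
	default:
		// *exec.Error or an I/O error: hand it back to the caller.
		return res, err
	}
	return res, nil
}

func main() {
	res, err := run("sh", "-c", "echo out; echo err >&2; exit 3")
	if err != nil {
		fmt.Println("could not run command:", err)
		return
	}
	fmt.Printf("exit=%d stdout=%q stderr=%q\n", res.ExitCode, res.Stdout, res.Stderr)
}
```

Returning an error instead of panicking on unexpected error types is a deliberate choice for library use: the caller decides what to do when the command cannot be run at all.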

Wednesday, 3 June 2015

Principles of Reactive Programming - Coursera MOOC - Review

The recently concluded Principles of Reactive Programming on Coursera was a good introduction to the paradigm of Reactive Programming in the Scala programming language. It was a kind of sequel to "Functional Programming Principles in Scala" from last year. I say "kind of" because you can still take this course without taking the first one, provided you are familiar with Scala and functional programming ideas.

In a nutshell, here is what I think about the course.

It's an introduction to a different mode of concurrent programming, to reactive principles, all using Scala libraries. It does not go into much depth (which is probably a drawback of most MOOCs) but provides a foundation on which one can build. For example, I can dive deeper into Actor programming now that I know the fundamentals.

Pros
- Great introduction to Reactive Programming
- Instructors are experts in their fields (Martin Odersky, Erik Meijer, Roland Kuhn)
- Assignments corresponding to every week's topic

Cons
- Differences in teaching styles and video content among the three instructors make the ride bumpy. Or maybe I am just spoilt after taking Martin Odersky's Functional Programming course - which was superb. For the lectures on Actors, the book Learning Concurrent Programming in Scala has a chapter on Actors which I would recommend reading before viewing the lectures. The same is true for Futures.
- Assignments are completely test driven. That is good for grading, but passing the tests is just the first step. Ensuring that your code applies the finer points of the principles taught is up to you. You might get 10/10 from the automated grader and your code might still not be "correct". I had this experience in the final assignment. This has been pointed out by many in the forums too.

Overall, it's a must-take course if you plan to learn about Reactive Programming.

Tuesday, 2 December 2014

Effective email communication

Communication and its various nuances always fascinate me. There are times when I realize, not always too late, that I have failed in communicating what I wanted to convey. It always ends up being a learning experience for me.

For most people, the word "communication" seems to remain confined to what one says or writes. But it's far, far more than that.

I wanted to share a few tips I have learned about effective email communication over the years. I've picked these up from observation as well as from friends and colleagues. I still commit some of these mistakes when I'm in a hurry but I hope I am getting better.
  •  Know your recipients. Tailor your email accordingly. Put yourself in their situation
    • Their awareness of what you're talking about. Do they have prior context and how much? 
    • Their environment e.g. Sharing a URL in your email that works only on Chrome (and they use Firefox), or sending URLs that don't work outside your office network.
    • Their focus e.g. Are they likely to single out one out of multiple points in the email and downplay the rest? How do you address any concerns that the recipient might have? Thinking about these beforehand might save you an email iteration or more.
  • Make your intentions clear. If there are actionables, point them out. If you know the owner of the action, point him/her out. If you don't, ask. If it's not an actionable email, mention it (FYI, JFYI) and explain why you are sending the email. 
  • Use a meaningful subject line
  • Use To, Cc and Bcc carefully
    • If you're addressing one or more people in the email body, you can put them in the To field
    • Be careful while Bcc'ing. If the Bcc'ed person does not realize she is Bcc'ed, she might respond to all and then everybody will know, which you might not have intended. If you're the Bcc'ed person, it's up to you to check the email headers and be cognizant of this.
    • Be careful while clicking Reply. You might have meant Reply-All. Gmail/Google Apps Mail have a setting where you can set Reply All as the default.
  • If the email thread has been going on for some time, it's helpful to summarize everything, including repeating what has already been said, when a conclusion has been reached.
  • Don't clear the previous content when you respond. People often have to look at the whole thread to regain context.
  • If the thread has forked off to another topic, or you want to do the forking, change the subject to something appropriate that suits the new topic.

Somebody said "Communication is about the receiver". If my recipient does not get what I'm trying to convey, I have failed, and not the recipient. This might sound extreme but it's an effective ideal to work towards.

Saturday, 23 November 2013

Graphite Tip: Disabling data averaging while viewing graphs

Graphite, the superb graphing tool, has gained a lot of popularity lately and with good reason. It's flexible, fairly easy to set up, very easy to use and has a thriving community with plugins for many monitoring systems. It can store any kind of numeric data over time.

By default, Graphite stores data in Whisper, a fixed-size database with configurable retention periods for various resolutions. This means you can store higher-resolution data (say, a point every 5 seconds) for a shorter period of time (e.g. 1 month) and then keep the same data at a lower resolution (say, a point every hour) beyond that period. The data is consolidated using the method you configure (e.g. sum or average). This behaviour of Graphite is well known.

What is not so well known is that Graphite also consolidates data when you view the graphs. This happens when the number of data points is greater than the number of pixels available to draw them. In such cases, the Graphite graph renderer consolidates the data points that share a pixel into one point using an aggregation function. The default aggregation function is average, so you might end up seeing smaller peak values than you expect.
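
The effect is easy to reproduce with a few numbers. Here is a minimal Go sketch of the idea, not Graphite's actual renderer: the series and the function names are made up. It squeezes eight points into two "pixels", once by averaging and once by taking the maximum.

```go
package main

import "fmt"

// consolidate groups points into buckets of perPixel values and reduces
// each bucket to a single value with fn. perPixel must be greater than 0.
func consolidate(points []float64, perPixel int, fn func([]float64) float64) []float64 {
	var out []float64
	for i := 0; i < len(points); i += perPixel {
		end := i + perPixel
		if end > len(points) {
			end = len(points)
		}
		out = append(out, fn(points[i:end]))
	}
	return out
}

// average returns the arithmetic mean of a non-empty slice.
func average(vs []float64) float64 {
	sum := 0.0
	for _, v := range vs {
		sum += v
	}
	return sum / float64(len(vs))
}

// maximum returns the largest value of a non-empty slice.
func maximum(vs []float64) float64 {
	m := vs[0]
	for _, v := range vs[1:] {
		if v > m {
			m = v
		}
	}
	return m
}

func main() {
	// Hypothetical series: one spike of 200 among values of 10.
	points := []float64{10, 10, 200, 10, 10, 10, 10, 10}
	fmt.Println(consolidate(points, 4, average)) // [57.5 10]
	fmt.Println(consolidate(points, 4, maximum)) // [200 10]
}
```

With averaging, the spike of 200 shows up as 57.5. Taking the maximum keeps the peak, which is what Graphite's consolidateBy function with 'max' is meant for if you want to keep consolidation but not lose peaks.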

Here's an example of a graph where there are more data points than pixels. The actual peak value was a little over 200, but you cannot see it here due to averaging.

Here is the same graph (same data for the time span) where the image width has been increased* (== more pixels). You can see the peak is almost 200.


Sometimes this behaviour may not be what you want. To see the "actual" data points irrespective of the size of your image, Graphite's URL API provides a parameter called minXStep. To use it, simply add it as a request parameter (with value 0) to the graph URL. From the documentation:
To disable render-time point consolidation entirely, set this to 0 though note that series with more points than there are pixels in the graph area (e.g. a few month’s worth of per-minute data) will look very ‘smooshed’ as there will be a good deal of line overlap.

The same graph with minXStep=0 now looks like this:

A bit "smooshed" but with the exact data that was collected.
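
If you build graph URLs from code, the parameters can be added like this. This is a minimal Go sketch: the host graphite.example.com and the metric name servers.web1.requests are placeholders, and renderURL is a hypothetical helper that assumes Graphite is served from the root of the host.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// renderURL builds a Graphite render URL for target with the given image
// width and with render-time consolidation disabled (minXStep=0).
// Any path already present in base is replaced by /render.
func renderURL(base, target string, width int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/render"
	q := url.Values{}
	q.Set("target", target)
	q.Set("width", strconv.Itoa(width))
	q.Set("minXStep", "0")
	u.RawQuery = q.Encode() // Encode sorts the parameters by key
	return u.String(), nil
}

func main() {
	s, err := renderURL("http://graphite.example.com", "servers.web1.requests", 800)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(s)
	// prints: http://graphite.example.com/render?minXStep=0&target=servers.web1.requests&width=800
}
```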

* Pass width=x as a request parameter to the graph URL, x in pixels