Saturday, 1 February 2020

Automating Heap Dumps For Java Containers in Google Kubernetes Engine

Heap dumps are an indispensable tool for debugging memory issues in Java processes. The typical way of taking a memory dump is using the jmap command

jmap -dump:format=b;file=/tmp/heap.dump 2592

This will trigger a heap dump for the process with id 2592 (assuming it's a Java process) and store it in a file /tmp/heap.dump. This can be analyzed later with a heap dump analyzer like jhat, MAT, or VisualVM - which are free tools.

Triggering a heap dump for a Java process that is running in a container, inside a pod, in a Google Kubernetes Engine (GKE) cluster node is not so straightforward. There are many layers of infrastructure that you have to cross to get at the Java process. Your process would usually run as part of a managed abstraction like a Deployment or a StatefulSet in your Kubernetes cluster. Your starting point would be just the pod name.


- Knowing the pod name is not enough - you also have to locate the cluster node where it's running and ssh into it.
- A GKE node might be running many pods, and many Java processes - you have to identify the correct one once once you have ssh'ed into it. "docker ps" can help here.
- The GKE node might not have jmap. It's not straightforward to install the JDK there  because it would typically be running COS. So you have to get inside the container and trigger the dump.
- You have to copy the dump to an accessible location, maybe a GCS bucket, from where you can download it to analyze. Uploading to a GCS bucket requires gsutil, which is not present by default in a COS node.

I have automated this entire process using just shell scripts and gcloud commands. The source code is on GitHub. They also use the toolbox utility that Google provides as a container for running debug tools. Invoking "toolbox" inside your GKE node will launch this container.

These scripts have an assumption which might not be valid for your cluster - I'll point it out at the relevant point in the code.

Here's a step by step explanation of the flow.

There are 3 shell scripts - being the one to run from your dev box or bastion host. This one invokes (inside the GKE cluster node, i.e. the VM) which in turn invokes the

First we find out the node on which the pod is running

node_name=`kubectl get pod ${pod_name} -o json | jq '. | .spec.nodeName'`

and get its public IP

public_ip=`gcloud compute instances list --filter="name=(${node_name})" --format="value(networkInterfaces[].accessConfigs[0].natIP)"`

copy the other two scripts to it

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} ${user}@${public_ip}:

and trigger them

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${keyfile} ${user}@${public_ip} sh ${pod_name} ${action} ${bucket}

I turn off the ssh warnings as I trigger this from a CI system on demand. If you run them manually, you can remove the -o options.

Inside the GKE node, figures out the correct container id and uses docker exec to trigger a heap dump inside it. 

container_id=`docker ps | grep ${pod_name} | grep -v "POD" | awk '{print $1}'`
docker exec ${container_id} sh -c "jmap -dump:format=b,file=heap.dump 1"

Note that the heap dump is inside the container, and not in the VM. You may not be able to push it to a GCS bucket from the container as there is no gsutil and no permissions. So we need to copy the dump to the VM. We copy it to a uniquely named file.

dttime=`echo $(date '+%d-%b-%Y-%H-%M-%S')`
docker cp ${container_id}:heap.dump ${filename}

Now you have the dump file in the VM but there is no gsutil. So you need to invoke toolbox, which has gsutil inside it. Does that mean we need to copy the dump inside toolbox now? No, because toolbox mounts several useful directories by default from the VM it's running on. 

So we just invoke toolbox and pass it the path to the (which is in the home directory of the user you are logged in as in the VM) as it would appear from inside toolbox (since the home directory is also mounted inside toolbox).

Inside toolbox, we can use gsutil to upload the dump file (which is also available inside toolbox because it's in the home directory of the user you are logged in as in the VM). But here's a catch. gsutil requires permissions to upload to a GCS bucket. One way to provide this permission is with an IAM permissions JSON file. But how does it get to the VM? 

This is the caveat I mentioned above. In the infrastructure I manage, almost every Java pod has a config map with a permissions file that the pod uses to access Google Cloud services. This file is accessible as a mounted directory inside toolbox, so, voila!

/google-cloud-sdk/bin/gcloud auth activate-service-account --key-file=${dir}/key.json
/google-cloud-sdk/bin/gsutil cp /media/root/home/${user}/${filename} gs://${bucket}/kdev-debug/${filename}

If you don't have this shortcut, you will a need to set the permissions somehow. There are multiple ways of doing it - one being to run a custom container instead of toolbox that has gsutil installed and can mount a config map which has the permissions. Another is to upload the permissions file to the VM when you run the command, use it from inside toolbox, and then delete it. The second one is a tad risky.

These scripts can be modified to be usable for any Kubernetes cluster and not just GKE. Most of the changes will be in the commands that fetch the list of running nodes. If you are using another OS for your K8S VMs, you can install Java directly on the VM and trigger the dump, after you find out the mapping between the container ids and the process ids as visible from the VM.

Thursday, 28 November 2019

The K8S Networking Implementation in Google Kubernetes Engine

I was recently digging into some finer points of exposing Kubernetes pods as services and came across this fantastic talk from Google Cloud Next '17. It's about how Kubernetes networking works on the Google Cloud Platform.

The Kubernetes specification dictates, among other things, the networking requirements for deployment. On the Google Cloud Platform (GCP), K8S is available as the Google Kubernetes Engine (GKE) product. GKE is part of GCP, and uses Google compute instances as the K8S hosts and its virtual network for network traffic. 

Why is the networking a big deal? Is the specification not already implemented by the K8S project? It is - however, it needs an underlying set of compute, storage and network resources to function. These resources are usually provided by a cloud provider, or bare metal machines + cloud management software, if you are hosting your own cloud. A cloud provider has to go some extra distance to ensure it meets the K8S spec requirements - because it's providing a virtualized environment, and not all things might work as it does in a non-virtualized one. 

The talk is about how Google does it for GKE. I've summarized some of the interesting points, leaving out the vanilla Kubernetes details which are easily found in the documentation.

Internal Traffic

Linux network namespaces and virtual interfaces are used as the foundation. 

For two pods to talk to each other

  • Each VM (K8S cluster host) has a root network namespace (usually eth0)
  • Each pod in that host has its own network namespace, separate from the root
  • For these to talk to each other, we use a pipe between two virtual interfaces, one end of which shows up as (again, usually) vethxx in the VM, and the other end as eth0 in the pod.
  • For two pods to talk to each other, we need a bridge between the vethxxs in the VM, which is (usually) named cbr0. This uses ARP to determine where to route packets.

For two pods to talk to each other across VMs

  • The network between VMs has to know how to route packets whose src and dest are both pods.
  • Each VM has an IP block from which it allocates IPs to pods inside it.
  • Once the packet leaves a pod and reaches the bridge, it gets sent out the default route as there is entry on that VM's ARP table for that dest pod IP.
  •  At this point, the packet will be dropped by GCP's network as the source IP does not match the VM's IP ("anti-spoof"). To get around this, each VM is setup to be able to forward packets, and disable the anti-spoof mechanism. One static route for each VM is setup on the network to route packets for that VM's pod IP range.

To route pod packets to a pod behind a Service

  • Once the packet hits the bridge, it's processed by an iptables rule.
  • iptables first chooses a pod for the Service, load balancing between different pods. In iptables proxy mode, it chooses backends randomly.
  • iptables then performs a DNAT, changing the destination IP in the packet to that of the dest pod. There is a tool called conntrack that keeps track of the fact that a connection was made to the pod's IP for a packet meant for the Service IP. 
  • The packet is routed as usual from src pod to dest pod
  • iptables rewrites the src IP to the Service IP in the response packet before sending it to the pod which made the request 
  • iptables is, in general, routing traffic to pods behind a Service. 
  • kube-proxy just configures and syncs iptables rules based on changes fetched from the K8S API - the name does not reflect anything about its function. It's a legacy name. 
  • DNS runs as a Service, in a pod, in K8S. 
    • Special needs - particular Service IP, autoscaled to the cluster size.


External Traffic


From a pod to the internet

  • A packet's internal address is rewritten to the external IP of the VM on which the pod is running, so that the internet knows where it came from. The reverse rewrite happens on the way back.
  • Before the traffic goes out of the VM, iptables rewrites the pod's src IP to the VM's internal IP. After this, the same thing happens as in the previous point.

From the internet to a pod using Service type: LoadBalancer

  • Service type: LoadBalancer creates a network LB in GCP, pointing the GCP forwarding rule to all the VMs in the K8S cluster
  • Google's NLB is a packet forwarder, not a proxy, making it possible to read the original client's IP address from the packet directly. In the L7 ingress LB, this is achieved by the X-Forwarded-For header.
  • LB chooses a VM, which may or may not have the pod (or any pods for that matter) the packet is meant for.
  • iptables on the VM chooses a pod. If it's on a different VM, a DNAT happens like before changing the dest to the pod's IP, instead of the LB's IP.
  • There is a second NAT happening here, changing the src from the client, to this VM's IP. This ensures the original VM on which the packet lands stays in the flow. If this does not happen, and the packet is sent to a different VM from this one, and the response goes back to the NAT layer just before the LB, it will be dropped since the packet was sent to the first VM, and not this one, or the pod where it ended up. This loses the original client IP information though.
  • Once it lands on the other VM, it gets routed to the pod, and the response goes back, with all the reverse NAT happening on the way back.
  • The "imbalance" here that can be caused by the LB knowing only about VMs, and not about pods, is mitigated by re-balancing inside K8S between pods. This balancing is random and apparently is "well-balanced" in practice, but can cause an extra network hop, and the client IP is hidden from the pod.
  • There is an annotation to tune this part.

Note that this annotation has been superseded by another property since this talk.
Setting this will lead to iptables always choosing a pod on the same node, which also preserves the client IP, but risks imbalance.

From the internet to a pod using an Ingress LoadBalancer

  • The NodePort service port forwards to the pod(s) using iptables, like before
  • Source IP of a packet is the internal address of the LB, not the external one. This one is a proxy.
  • The SNAT/DNAT works as in the previous case
  • To avoid the extra network hop, the same OnlyLocal annotation works.
This talk is more than 2 years old. Since then, there have been newer developments in GKE, including "container-native load balancing" and in K8S itself, e.g., IPVS based load balancing.

Sunday, 6 January 2019

Automagically Discovering and Scraping Google Compute Nodes in Prometheus

Prometheus can scrape metrics from either a static list of machines or discover machines dynamically using a service discovery plugin. Service discovery plugins exists for the major cloud providers, which includes Google Cloud Platform (GCP).

A simple configuration for GCP’s service discovery in the Prometheus config (usually prometheus.yml) looks like this
      - job_name: node
        honor_labels: true
          - project: ml-platform-a
            zone: us-eastl1-a
            port: 9100
          - source_labels: [__meta_gce_label_cloud_provider]
            target_label: cloud_provider
          - source_labels: [__meta_gce_label_cloud_zone]
            target_label: cloud_zone
          - source_labels: [__meta_gce_label_cloud_tier]
            target_label: cloud_tier
          - source_labels: [__meta_gce_label_cloud_service]
            target_label: cloud_service
          - source_labels: [__meta_gce_instance_name]
            target_label: instance
Let’s dissect this. Running Prometheus with this configuration will fetch all the instances in the GCP project ml-platform-a in the zone us-east1-a, and scrape their "/metrics" endpoints at port 9100. The relabel config lets you convert GCE (Google Compute Engine) labels (source) into Prometheus labels (target).

However, this config will attempt to pull data from all instances whether they are running or not, and end up marking the stopped ones as "DOWN". To get around this, you need to filter out the stopped instances. Add a filter after the port directive, like this
        port: 9100
        filter: '(status="RUNNING")'
The equivalent gcloud command to list all running instances looks like
gcloud compute instances list --filter='status:(RUNNING)
Note the difference in syntax. The keywords, however, are identical.

What if you have multiple exporters running on a specific set of instances? You can select them by their label(s) and add a different gce_sd_config section for them. For instances which have exporters running on say, port 3000, and have a label called “cloud_service:dashboard”, the config would look like
  - job_name: dashboard
    honor_labels: true
      - project: ml-plaform-a
        zone: us-central1-c
    port: 3000
        filter: '(status="RUNNING") AND (labels.cloud_service="dashboard")'
      - source_labels: [__meta_gce_label_cloud_provider]
        target_label: cloud_provider
      - source_labels: [__meta_gce_label_cloud_zone]
        target_label: cloud_zone
      - source_labels: [__meta_gce_label_cloud_tier]
        target_label: cloud_tier
      - source_labels: [__meta_gce_label_cloud_service]
        target_label: cloud_service
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
Just for reference, the analogous gcloud command is
gcloud compute instances list --filter='status:(RUNNING) AND labels.cloud_service:dashboard'
The relabel_configs is identical to that of the 9100 scraper. It would have been nice if Prometheus had allowed for a common relabel config section that could be reused for such cases.

The GCE service discovery plugin needs read permission on the GCE Compute API to be able to pull the list of instances. There are several ways to do this, depending on how you are running Prometheus

  • Prometheus on a GCE instance in the same project : You can assign the correct IAM permissions to your GCE instance, and nothing more needs to be done.
  • Prometheus on a GCE instance in a different project, or a non-GCE machine : You can create a service account in your GCP project, download the key as a JSON and start Prometheus with the JSON set in an environment variable, like this…  ./prometheus -- (other options)

Saturday, 14 May 2016

Executing External Commands in Go

Sometimes we need to invoke operating system commands from our code. Most languages have APIs for this - Java has Runtime.exec(), Python has subprocess and Go has the os/exec package. This post briefly explores the Go API.

The APIs are part of the exec/os package. The Cmd abstraction encapsulates a command object, where various tweaks can be done including setting the standard output and error streams.

Simple execution of a command is very easy. However, if one wants finer control over the execution, including control over streams and the correct exit code, maybe when it's to be used in a framework or a library, the code becomes slightly more involved. 

Creating the Cmd object is straighforward

    cmd := exec.Command(binaryName, args...) 

The output and error streams can be redirected as follows

    stdout := &bytes.Buffer {}
    stderr := &bytes.Buffer {}
    cmd.Stdout = stdout
    cmd.Stderr = stderr

Once the command has been executed, it returns an Error object if the execution failed.

    err := cmd.Run()

The command execution can fail for various reasons - it might not have been a valid command, it might have exited with an error code or their might have been IO errors. We need to detect these cases so that the caller of the API gets the correct response.

The Go source file exec.go documents the error types that can occur.


An unsuccessful exit by a command. The ExitError object also has a "subset of the standard error output from the Cmd.Output method if standard error was not otherwise being collected." <quote docs>.


One of the cases where this Error can be returned is when the command could not be located. When the Command struct instance is created, it calls the LookPath method to locate the binary if the binaryName argument does not have path separators, which can return one of these Error instances when the executable could not be located. The actual implementation depends on the OS.

We can switch on the Error type

        switch err.(type) {
            case *exec.ExitError:
                e := err.(*exec.ExitError)
                if status, ok := e.Sys().(syscall.WaitStatus); ok {
                    exitcode = status.ExitStatus()
            case *exec.Error:
                e := err.(*exec.Error)
                panic("Unknown err type: " + reflect.TypeOf(err).String())


If it's ExitError, we need to query the OS specific implementations using the Sys interface. The Unix implementation is syscall.WaitStatus. 

if the err instance is nil, the command execution succeeded and we can get the exit code from the Cmd itself.

        if status, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok {
            exitcode = status.ExitStatus()


The complete source code is here

Wednesday, 3 June 2015

Principles of Reactive Programming - Coursera MOOC - Review

The recently concluded Principles of Reactive Programming on Coursera was a good introduction to the paradigm of Reactive Programming in the Scala programming language. It was a kind of sequel to "Functional Programming Principles in Scala" from last year.  I say kind of as you can still take this course without taking the first one provided you have familiarity with Scala and functional programming ideas.

In a nutshell, here is what I think about the course.

It's an introduction to a different mode of concurrent programming, to reactive principles, all using Scala libraries. It does not go into much depth (which is probably a drawback of most MOOCs) but provides a foundation on which one can build. For example, I can dive deeper into Actor programming now that I know the fundamentals.

- Great introduction to Reactive Programming
- Instructors are experts in their fields (Martin Odersky, Eric Meijer, Roland Kuhn)
- Assignments corresponding to every week's topic

- Differences in teaching styles and video content among the three instructors make the ride jumpy. Or maybe I am just spoilt after taking Martin Odersky's Functional Programming course - which was superb. For the lectures on Actors, Learning Concurrent Programming in Scala has a chapter on Actors which I would recommend to be read first before viewing the lectures. The same is true for Futures.
- Assignments are completely test driven. That is good for grading, but passing the test is just the first step. Ensuring that your code is written using the finer points of the principles taught is up to you. You might get 10/10 using the automated test grader but your code might not be "correct". I had this experience in the final assignment. This has been pointed out by many in the forums too.

Overall, it's a must-take course if you plan to learn about Reactive Programming.