I was recently digging into some finer points of exposing Kubernetes pods as services and came across this fantastic talk from Google Cloud Next '17. It's about how Kubernetes networking works on the Google Cloud Platform.

The Kubernetes specification dictates, among other things, the networking requirements for a deployment. On the Google Cloud Platform (GCP), K8S is available as the Google Kubernetes Engine (GKE) product, which uses Compute Engine instances as the K8S hosts and GCP's virtual network for the traffic between them.

Why is the networking a big deal? Is the specification not already implemented by the K8S project? It is - however, it needs an underlying set of compute, storage and network resources to function. These resources are usually provided by a cloud provider, or by bare metal machines plus cloud management software if you are hosting your own cloud. A cloud provider has to go some extra distance to meet the K8S spec's requirements, because it is providing a virtualized environment, and not everything works there the same way it does in a non-virtualized one.

The talk is about how Google does it for GKE. I've summarized some of the interesting points, leaving out the vanilla Kubernetes details which are easily found in the documentation.

Internal Traffic

Linux network namespaces and virtual interfaces are used as the foundation. 

For two pods on the same VM to talk to each other

  • Each VM (a K8S cluster host) has a root network namespace, whose main interface is (usually) eth0
  • Each pod in that host has its own network namespace, separate from the root
  • For these to talk to each other, a veth (virtual ethernet) pair acts as a pipe between the two namespaces: one end shows up as (again, usually) vethxx in the VM's root namespace, and the other end as eth0 in the pod.
  • For two pods to talk to each other, we need a bridge connecting the vethxx ends in the VM, which is (usually) named cbr0. The bridge uses ARP to figure out which interface to deliver a packet to (see the sketch below).
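To make this concrete, here is a rough sketch of that plumbing for a single pod, expressed as plain iproute2 commands. This is not the exact sequence GKE or its CNI plugin runs; the namespace name, interface names and IP (pod1, veth1, 10.4.1.2, cbr0) are all made up for illustration.

```
# Illustrative sketch only - names, interfaces and IPs are made up.
ip netns add pod1                                  # the pod's own network namespace
ip link add veth1 type veth peer name pod1-eth0    # a veth pair: a two-ended "pipe"
ip link set pod1-eth0 netns pod1                   # move one end into the pod's namespace
ip netns exec pod1 ip link set pod1-eth0 name eth0 # inside the pod it appears as eth0
ip netns exec pod1 ip addr add 10.4.1.2/24 dev eth0
ip netns exec pod1 ip link set eth0 up

ip link add cbr0 type bridge      # the custom bridge connecting all pods on this node
ip link set veth1 master cbr0     # attach the node-side veth end to the bridge
ip link set veth1 up
ip link set cbr0 up
# (the bridge's own address and the pod's default route are omitted here)
```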

For two pods to talk to each other across VMs

  • The network between VMs has to know how to route packets whose src and dest are both pods.
  • Each VM has an IP block from which it allocates IPs to pods inside it.
  • Once the packet leaves a pod and reaches the bridge, it gets sent out the default route, as there is no entry in that VM's ARP table for the destination pod IP.
  • At this point, the packet would be dropped by GCP's network, because its source IP does not match the VM's IP ("anti-spoofing"). To get around this, each VM is set up to forward packets and the anti-spoofing check is disabled for it. In addition, one static route per VM is set up on the network, routing packets for that VM's pod IP range to it (see the sketch below).
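As a rough sketch of what that configuration amounts to (not the exact commands GKE runs; the node name and pod CIDR here are made up), it boils down to enabling forwarding on the node and telling GCP's network about the node's pod range:

```
# Illustrative sketch only - node name and pod CIDR are made up.
# On the node: allow it to forward packets that are not addressed to it.
sysctl -w net.ipv4.ip_forward=1

# On the GCP side: create the instance with IP forwarding allowed,
# which is what relaxes the anti-spoofing check for it.
gcloud compute instances create node-1 --can-ip-forward

# And add one static route per node for that node's pod IP range.
gcloud compute routes create node-1-pod-range \
    --destination-range=10.4.1.0/24 \
    --next-hop-instance=node-1
```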

To route pod packets to a pod behind a Service

  • Once the packet hits the bridge, it's processed by an iptables rule.
  • iptables first chooses a pod for the Service, load balancing between different pods. In iptables proxy mode, it chooses backends randomly.
  • iptables then performs a DNAT, changing the destination IP in the packet to that of the chosen pod. The kernel's connection-tracking machinery (conntrack) keeps track of the fact that this connection to the pod's IP was made for a packet meant for the Service IP, so the translation can be reversed for the replies.
  • The packet is routed as usual from src pod to dest pod
  • iptables rewrites the src IP to the Service IP in the response packet before sending it to the pod which made the request 
  • In general, it is iptables that routes traffic to the pods behind a Service (a simplified sketch of these rules follows this list).
  • kube-proxy just configures and syncs iptables rules based on changes fetched from the K8S API - the name does not reflect anything about its function. It's a legacy name. 
  • DNS runs as a Service, in a pod, in K8S. 
    • Special needs - particular Service IP, autoscaled to the cluster size.
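As a sketch of what those rules look like, heavily simplified: the random choice is done with the iptables statistic module, and the DNAT happens in a per-endpoint chain. The chain names, Service IP (10.0.0.10) and pod IPs below are made up, and real kube-proxy chains have hashed names and extra hooks.

```
# Illustrative sketch only - simplified, kube-proxy-style rules.
iptables -t nat -N KUBE-SVC-EXAMPLE
iptables -t nat -N KUBE-SEP-POD1
iptables -t nat -N KUBE-SEP-POD2

# Packets for the Service's cluster IP jump to the Service's chain.
iptables -t nat -A PREROUTING -d 10.0.0.10/32 -p tcp --dport 80 \
    -j KUBE-SVC-EXAMPLE

# Pick a backend pod at random: ~50% to the first endpoint,
# the rest fall through to the second.
iptables -t nat -A KUBE-SVC-EXAMPLE \
    -m statistic --mode random --probability 0.5 -j KUBE-SEP-POD1
iptables -t nat -A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD2

# DNAT to the chosen pod; conntrack reverses this on the reply path.
iptables -t nat -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.4.1.2:8080
iptables -t nat -A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.4.2.2:8080
```

The real rules also handle node ports, masquerading and session affinity, but the random pick plus DNAT above is the essence of what the talk describes.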

External Traffic

From a pod to the internet

  • The packet's source address ends up rewritten to the external IP of the VM on which the pod is running, so that the internet knows where it came from. The reverse rewrite happens on the way back.
  • This happens in two steps: before the traffic leaves the VM, iptables rewrites the pod's source IP to the VM's internal IP (a masquerade, sketched below); GCP's NAT then maps the VM's internal IP to its external IP, as in the previous point.
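The iptables half of that is essentially a single masquerade rule: traffic whose source is a pod IP and whose destination is outside the cluster gets its source rewritten to the node's own IP on the way out. A sketch, with a made-up pod CIDR and interface name:

```
# Illustrative sketch only - the pod CIDR and interface name are made up.
# Rewrite the source IP of pod traffic leaving the node for non-cluster
# destinations to the node's own (internal) IP.
iptables -t nat -A POSTROUTING -s 10.4.0.0/14 ! -d 10.4.0.0/14 \
    -o eth0 -j MASQUERADE
```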

From the internet to a pod using Service type: LoadBalancer

  • A Service of type: LoadBalancer creates a network LB in GCP, with a forwarding rule pointing at all the VMs in the K8S cluster
  • Google's network LB is a packet forwarder, not a proxy, which makes it possible to read the original client's IP address directly from the packet. With the L7 (Ingress) LB, which is a proxy, this is achieved via the X-Forwarded-For header instead.
  • LB chooses a VM, which may or may not have the pod (or any pods for that matter) the packet is meant for.
  • iptables on that VM chooses a pod. If the pod is on a different VM, a DNAT happens as before, changing the destination from the LB's IP to the pod's IP.
  • There is also a second NAT here, changing the source IP from the client's to this VM's. This keeps the VM the packet first landed on in the flow: without it, if the packet is sent on to a different VM, the response arriving back at the NAT layer just before the LB would be dropped, because the original packet was sent to the first VM and not to this one (or to the pod where it ended up). The downside is that the original client IP information is lost.
  • Once it lands on the other VM, it gets routed to the pod, and the response goes back, with all the reverse NAT happening on the way back.
  • The "imbalance" here that can be caused by the LB knowing only about VMs, and not about pods, is mitigated by re-balancing inside K8S between pods. This balancing is random and apparently is "well-balanced" in practice, but can cause an extra network hop, and the client IP is hidden from the pod.
  • There is an annotation to tune this part.

Note that this annotation (service.beta.kubernetes.io/external-traffic: OnlyLocal) has been superseded by the externalTrafficPolicy field on the Service spec since this talk.
Setting it to Local makes iptables always choose a pod on the same node, which also preserves the client IP, but risks imbalance. Both forms are sketched below.
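For reference, this is roughly how the two forms look, applied to a hypothetical Service called my-service (the annotation form dates from around the time of the talk; the field is the current form):

```
# Illustrative sketch only - "my-service" is a hypothetical Service.
# Current form: a field on the Service spec.
kubectl patch service my-service \
    -p '{"spec": {"externalTrafficPolicy": "Local"}}'

# Older form, from around the time of the talk: an annotation.
kubectl annotate service my-service \
    service.beta.kubernetes.io/external-traffic=OnlyLocal
```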

From the internet to a pod using an Ingress LoadBalancer

  • The L7 LB sends traffic to a NodePort of the Service on the VMs; from the node port, iptables forwards it to the pod(s), like before
  • The source IP of the packet arriving at the VM is an internal address of the LB, not the client's external one, since this LB is a proxy (the client IP travels in X-Forwarded-For, as mentioned earlier).
  • The SNAT/DNAT works as in the previous case
  • To avoid the extra network hop, the same OnlyLocal annotation (now externalTrafficPolicy: Local) works here too. A minimal manifest for this setup is sketched below.
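Here is a minimal sketch of that setup, using current API versions rather than the ones from the time of the talk; the names, selector and ports are made up.

```
# Illustrative sketch only - names, labels and ports are made up.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort            # the L7 LB reaches pods through a node port
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  defaultBackend:           # on GKE this provisions an external HTTP(S) LB
    service:
      name: web
      port:
        number: 80
EOF
```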
  
This talk is more than 2 years old. Since then, there have been newer developments in GKE, such as "container-native load balancing", and in K8S itself, e.g., IPVS-based load balancing in kube-proxy.
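As a brief, hedged illustration of those two: container-native load balancing is requested on GKE via a network endpoint group (NEG) annotation on the Service, so the LB targets pod IPs directly instead of node ports, while IPVS is a kube-proxy mode. The Service name "web" below is hypothetical.

```
# Illustrative sketch only - "web" is a hypothetical Service.
# Container-native load balancing on GKE: ask for network endpoint groups.
kubectl annotate service web cloud.google.com/neg='{"ingress": true}'

# IPVS-based load balancing in K8S is a kube-proxy mode,
# e.g. kube-proxy --proxy-mode=ipvs
```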