This is an extensive tutorial on how to set up a Kubernetes cluster that supports pod migration.
Why
Statelessness is the basic foundation for microservices running inside Kubernetes. Outside its main application domain, the platform also appeals to the High Performance Computing (HPC) community because infrastructure management can be delegated to cloud providers and because of its on-demand scaling. The challenge is that HPC jobs are usually long-running and stateful. Jobs such as simulations or optimization problems usually keep their state in memory, and checkpointing that state to disk is not always available.
This is undesirable because failures are expected to occur.
Matters become even worse for jobs with unpredictable resource requirements. Unexpected spikes in memory usage can lead to out-of-memory situations on a node, which result in pods being killed. The catastrophic consequence is the complete loss of job progress from many hours or even days of compute time.
To avoid this, a migration of stateful pods to another node would be desirable.
Status quo
Currently, Kubernetes does not support pod migration.
However, a PoC of pod migration in prior work by Jakob Schrettenbrunner showed its feasibility. A proposal to support very basic checkpointing functionality (forensic checkpointing without restore) has recently been accepted by the Kubernetes community as well and is expected to be available in future releases.
Goal
Building on the prior PoC by Jakob Schrettenbrunner, I want to show you step by step how to set up a Kubernetes cluster with pod migration functionality. Bootstrapping a Kubernetes cluster from scratch is not a trivial task, but kubeadm will help us. Jakob also provided some documentation on his setup, and while it is very helpful, it is far from complete and does not mention all potential gotchas. You might suspect already that this won’t be a quick and easy process, but I hope to make it a lot easier for you with this extensive tutorial.
Demo
To see what to expect, here is a quick demo of the steps to migrate a pod:
1. Cluster setup
The cluster consists of 1 master node and 2 worker nodes. The VMs are provisioned in Microsoft Azure. For migrating the pod from one worker node to another, an Azure SMB file share is used. You might also use an NFS server (which might even make things easier, as mentioned later), but this was not possible in my case for company policy reasons.
Kubernetes is bootstrapped using kubeadm. This setup was tested with version v1.19.0-beta.0.1015+b521fb5114995f-dirty (binaries are available here, but I recommend building from source).
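For reference, the bootstrap later boils down to an invocation roughly like this (a sketch; the pod network CIDR is an assumption here and must match your CNI configuration):

./kubeadm init --cri-socket "/run/containerd/containerd.sock" --pod-network-cidr=10.244.0.0/16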
Network setup
To set up the cluster network, I followed this tutorial. You can use the web shell on Azure for this:
After connecting to the VM, go into root mode:
sudo -s
All following steps assume this!
2. Kernel downgrade
Worker
As mentioned here, recent Ubuntu kernels broke compatibility with CRIU.
Hence we downgrade the kernel:
apt install -y linux-image-unsigned-5.4.0-1068-azure
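Then reboot into the downgraded kernel and verify it is active (a quick sanity check; depending on your image you may need to adjust the GRUB default entry first):

reboot
# after reconnecting (and sudo -s again):
uname -r    # should report 5.4.0-1068-azure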
But for safety, you should replace the binary at the exact path defined in the ExecStart of the systemd service. You can find that path like this:
cat /lib/systemd/system/containerd.service
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity

# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
Build
I recommend building from source, but you may also use the binaries inside bin.
Clone my fork and check out the checkpoint branch. If you want to use the version that only uploads a zip archive to the file server (please read [6. Set up file server](#6. Set up file server) first), use the checkpoint-zip branch instead.
In the generated CNI config, the version is set to 1.0.0, which is not supported here. So we change it to a supported version such as 0.3.0:
vim /etc/cni/net.d/10-containerd-net.conflist
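After the edit, the file should look roughly like this (a sketch based on containerd's default sample conflist; the bridge name and subnet are assumptions from that sample and may differ on your node):

{
  "cniVersion": "0.3.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "promiscMode": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.88.0.0/16",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}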
To be safe, restart the containerd service after configuring the CNI plugins:
systemctl restart containerd
Kubelet
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo"deb https://apt.kubernetes.io/ kubernetes-xenial main"| sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt -y install kubelet
Then replace the installed binary with your custom build and extend the systemd drop-in with the required flags:
vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --container-runtime-endpoint=/run/containerd/containerd.sock --v=9 --read-only-port=0 --anonymous-auth=true --authorization-mode=AlwaysAllow --container-runtime=remote
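Afterwards, reload systemd and restart the kubelet so the new flags take effect:

systemctl daemon-reload
systemctl restart kubelet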
kubectl get node -w
Should show a ready node after a few seconds.
kubectl get po -A
Should show all pods running (including coredns!).
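For orientation, the output on a kubeadm cluster looks roughly like this (a sketch, assuming the master node is named master; pod name suffixes and ages will differ):

NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   coredns-xxxxxxxxxx-xxxxx         1/1     Running   0          2m
kube-system   etcd-master                      1/1     Running   0          3m
kube-system   kube-apiserver-master            1/1     Running   0          3m
kube-system   kube-controller-manager-master   1/1     Running   0          3m
kube-system   kube-proxy-xxxxx                 1/1     Running   0          2m
kube-system   kube-scheduler-master            1/1     Running   0          3m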
(Debugging)
If the node does not get ready, check the logs of the kubelet:
journalctl -u kubelet [-f]
Or if you don’t find any hints from there, look here:
journalctl -u containerd [-f]
It happened to me that I got the error cni plugin not initialized.
If this is the case, be sure to repeat the CNI plugin configuration step from above.
If a pod is not running, use kubectl describe to debug.
Otherwise, a cluster reset might also help:
./kubeadm reset --cri-socket "/run/containerd/containerd.sock"
Now it’s time to test the migration with a simple memory-allocating pod (here 50 MB):
kubectl run counter1 --restart=Never --image "ghcr.io/schrej/podmigration-testapp:latest" -- -m 50

It’s important to set restartPolicy: Never (which --restart=Never does) to prevent the original container from restarting during migration (relevant for large migrations)!
Through kubectl get po -owide, you can get the pod IP and then increment the stateful counter.
Be sure to do this on the worker node:
curl $POD_IP:8080
Repeat the counter increment a few times to validate the successful migration later.
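The session should look roughly like this (a sketch; I am assuming the test app simply returns the current counter value on each request):

curl $POD_IP:8080   # -> 1
curl $POD_IP:8080   # -> 2
curl $POD_IP:8080   # -> 3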
Clone pod
The pod spec is identical, except that it has an additional field spec.clonePod:
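A minimal sketch of such a manifest (spec.clonePod comes from the fork; its exact value format and the remaining fields here are assumptions mirroring the test pod above):

apiVersion: v1
kind: Pod
metadata:
  name: counter2
spec:
  clonePod: counter1   # name of the pod whose state is to be cloned (assumed format)
  restartPolicy: Never
  containers:
  - name: counter
    image: ghcr.io/schrej/podmigration-testapp:latest
    args: ["-m", "50"]

Apply it with kubectl apply -f and watch the new pod come up with kubectl get po -w.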
The migration should be very fast.
Currently, the old pod breaks during the migration, but the cloned pod should be running.
Requesting its endpoint with curl should return a number bigger than 1. Voilà - you have successfully cloned a stateful pod in Kubernetes!
6. Set up file server
Important warning
I had consistency problems with SMB for bigger file uploads. The container restore command is issued 1 second after the disk checkpoint has been saved completely.
However, at this point not all files of the checkpoint directory had been uploaded successfully.
I circumvented this problem by storing the checkpoint on local disk and only storing a zipped archive on the server.
The temporary local-disk location is /var/lib/kubelet/check. Since the OS disk is usually only 30 GB, you will need to create a symbolic link to a bigger disk. In my case, a temporary disk with 500 GB was mounted at /mnt.
To solve this, create the checkpoint directory on the bigger disk and symlink it (a sketch, assuming the temporary disk is mounted at /mnt):
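mkdir -p /mnt/check                       # checkpoint directory on the big temporary disk
rm -rf /var/lib/kubelet/check             # drop the old location if it already exists
ln -s /mnt/check /var/lib/kubelet/check   # kubelet now writes checkpoints to /mnt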
Interestingly, the compression immensely reduced the checkpoint size for the simple example app. For 50 GB of allocated memory, the compressed zip was only around 20 MB!
This modification was done inside containerd in the branch checkpoint-zip.
Steps
The procedure is specific to Azure and is well documented here. The server should be mounted inside /var/lib/kubelet/migration. I used the static mount, and my /etc/fstab entry looks like this:
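A sketch of such an entry, following Azure's documented SMB mount options (storage account, share name, and credentials file are placeholders):

//<storage-account>.file.core.windows.net/<share> /var/lib/kubelet/migration cifs nofail,credentials=/etc/smbcredentials/<storage-account>.cred,dir_mode=0777,file_mode=0777,serverino 0 0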
To read the logs for the kubelet, you can use: tail -f /tmp/kubelet.log.
End
I admit that this was a long tutorial, and it’s likely that not everything went smoothly while following along. If you are stuck at some step, you can contact me and I will try to help :)