Running Kubernetes in Kubernetes

KinD is a very lightweight Kubernetes installation tool that treats Docker containers as Kubernetes nodes, making it very easy to use. Since you can run a Kubernetes cluster in a Docker container, it’s natural to wonder if you can run it in a Pod. What are the problems with running in a Pod?

Installing Docker Daemon in a Pod

KinD is now dependent on Docker, so first we need to create an image that allows us to run Docker Deamon in the Pod, so that we can execute commands like docker run inside the Pod, but this is not the same as the DIND mode of mounting docker.sock on the host as we mentioned before This is different from the DIND mode of mounting the host docker.sock. There are still a lot of problems with running Docker Deamon in a Pod.

The PID 1 problem

For example, we need to run Docker Daemon and some Kubernetes cluster tests in a container, and these tests depend on KinD and Docker Damon, and we may use systemd to run multiple services in a container, but there are some problems with using systemd.

For example, we need to keep the exit status of the test, and the container used in Kubernetes can watch the exit status of the first process (PID 1) in the container when it is running. If we use systemd, then the exit status of our test process will not be forwarded to Kubernetes. 2. It is also important to get the logs of the tests. In Kubernetes, the container logs written to stdout and stderr are automatically fetched, but if we use systemd, it is more difficult to get the application logs.

To solve the above problem, we can use the following startup script in the container image.

dockerd &
# Wait until dockerd is ready.
until docker ps >/dev/null 2>&1
do
  echo "Waiting for dockerd..."
  sleep 1
done
exec "$@"

However, it is important to note that we cannot use the above script as an entrypoint for the container; the entrypoint defined in the image will run in the container as PID 1 in a separate pid namespace; PID 1 is a special process in the kernel that behaves differently from other processes.

Essentially, the process that receives the signal is PID 1: it will be handled specially by the kernel; if it does not register a processor for the signal, the kernel will not fall back to the default behavior (i.e., kill the process). Since the kernel kills the process by default when the SIGTERM signal is received, some processes may not register a signal handler for the SIGTERM signal. If this happens, when Kubernetes tries to terminate the Pod, SIGTERM will be swallowed and you will notice that the Pod will be stuck in the Terminating state.

This is not really a new problem, but not many people are aware of it and keep building containers with this problem. We can solve this problem using the application tini as the entry point for the image, as follows.

`1`	`ENTRYPOINT ["/usr/bin/tini", "--", "/entrypoint.sh"]`

This program will properly register the signal handler and forward the signal. It will also perform some other PID 1 things, such as reclaiming zombie processes in the container.

Mount cgroups

Since Docker Daemon needs to control cgroups, it needs to mount the cgroup filesystem to the container. But since cgroups are shared with the host, we need to make sure that the cgroups controlled by Docker Daemon do not affect other cgroups used by other containers or host processes, and that the cgroups created by Docker Daemon in the container are not leaked after the container exits.

There is a -cgroup-parent parameter in Docker Daemon to tell Daemon to nest all container cgroups under the specified cgroup. When a container is running under a Kubernetes cluster, we set the -cgroup-parent parameter in Docker Daemon in the container so that all of its cgroups are nested under the cgroup created by Kubernetes for the container.

In the past, to make the cgroup file system available in the container, some users would mount /sys/fs/cgroup from the host to this location in the container. If we use it this way, we need to set -cgroup-parent to the following in the container startup script so that the cgroups created by Docker Daemon will be nested correctly.

`1`	`CGROUP_PARENT="$(grep systemd /proc/self/cgroup \| cut -d: -f3)/docker"`

Note: /proc/self/cgroup shows the cgorup path of the calling process.

But we should know that mounting the host’s /sys/fs/cgroup file is a very dangerous thing to do, because it exposes the entire host’s cgroup hierarchy to the container. To solve this problem, Docker used a trick to hide unrelated cgroups from the containers. docker mounts the root of each cgroup system’s cgroup hierarchy from the container’s cgroups.

$ docker run --rm debian findmnt -lo source,target -t cgroup       
SOURCE                                                                               TARGET
cpuset[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]     /sys/fs/cgroup/cpuset
cpu[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]        /sys/fs/cgroup/cpu
cpuacct[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]    /sys/fs/cgroup/cpuacct
blkio[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]     /sys/fs/cgroup/blkio
memory[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]     /sys/fs/cgroup/memory
 
cgroup[/docker/451b803b3cd7cd2b69dde64cd833fdd799ae16f9d2d942386ec382f6d55bffac]     /sys/fs/cgroup/systemd

From the above, we can see that cgroups control the files at the root of the cgroup hierarchy in the container by mapping the /sys/fs/cgroup/memory/memory.limit_in_bytes file on the host cgroup file system to the /sys/fs/cgroup/memory/docker/<CONTAINER_ID>/ memory.limit_in_bytes to control the file at the root of the cgroup hierarchy within the container, which prevents container processes from accidentally modifying the host’s cgroup.

However, this approach can sometimes be confusing for applications like cadvisor and kubelet, because binding mounts do not change the contents of /proc/<PID>/cgroup.

$ docker run --rm debian cat /proc/1/cgroup                                
14:name=systemd:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
 
5:memory:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
4:blkio:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
3:cpuacct:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
2:cpu:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
1:cpuset:/docker/512f6b62e3963f85f5abc09b69c370d27ab1dc56549fa8afcbb86eec8663a141
0::/

cadvisor will look at /proc/<PID>/cgroup to get the cgroup of a given process and try to get CPU or memory statistics from the corresponding cgroup. To solve this problem, we did another mount inside the container, from /sys/fs/cgroup/memory to /sys/fs/cgroup/memory/ docker/<CONTAINER_ID>/ (for all cgroup subsystems), which works well for this problem.

The new workaround is now to use cgroup namespace if you are running under a Linux system with kernel version 4.6+, and both runc and docker both add support for cgroup namespaces. However, Kubernetes does not currently support the cgroup namespace, but will soon do so as cgroups v2 support cgroups-v2.md).

IPtables

While using it we found that when running a Kubernetes cluster online, sometimes the nested containers started by the Docker Daemon inside the container could not access the extranet, but worked fine on the local development computer, which most developers should encounter often.

Finally, it was found that when this problem occurred, the packets from the nested Docker container did not hit the POSTROUTING chain of iptables, so it was not masqueraded.

This problem is because the image containing the Docker Daemon is based on Debian buster, which by default uses nftables as the default backend for iptables. .debian.org/nftables) as the default backend for iptables, but Docker itself does not support nftables yet. To solve this problem, just switch to the iptables command in the container image.

1
2
3

RUN update-alternatives --set iptables  /usr/sbin/iptables-legacy || true && \
    update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy || true && \
    update-alternatives --set arptables /usr/sbin/arptables-legacy || true

The full Dockerfile file and startup script are available on GitHub (https://github.com/jieyu/docker-images/tree/master/dind), or you can use the jieyu/dind-buster:v0.1.8 image directly to test.

`1`	`$ docker run --rm --privileged jieyu/dind-buster:v0.1.8 docker run alpine wget baidu.com`

Simply deploy under a Kubernetes cluster using the Pod resource list shown below.

apiVersion: v1
kind: Pod
metadata:
  name: dind
spec:
  containers:
  - image: jieyu/dind-buster:v0.1.8
    name: dind
    stdin: true
    tty: true
    args:
    - /bin/bash
    volumeMounts:
    - mountPath: /var/lib/docker
      name: varlibdocker
    securityContext:
      privileged: true
  volumes:
  - name: varlibdocker
    emptyDir: {}

Running KinD in a Pod

We have successfully configured Docker-in-Docker (DinD) above, so let’s start a Kubernetes cluster in that container using KinD.

$ docker run -ti --rm --privileged jieyu/dind-buster:v0.1.8 /bin/bash
Waiting for dockerd...
[root@257b543a91a5 /]# curl -Lso ./kind https://kind.sigs.k8s.io/dl/v0.8.1/kind-$(uname)-amd64
[root@257b543a91a5 /]# chmod +x ./kind
[root@257b543a91a5 /]# mv ./kind /usr/bin/ 
[root@257b543a91a5 /]# kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.18.2) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋
[root@257b543a91a5 /]# kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
kind-control-plane   Ready    master   11m   v1.18.2

For some reason you may not be able to download kind with the above command, we can find a way to download it to the host in advance, and then directly mount it to the container can also be, I will kind and kubectl commands here are mounted to the container, use the following command to start the container can be.

`1`	`$ docker run -it --rm --privileged -v /usr/local/bin/kind:/usr/bin/kind -v /usr/local/bin/kubectl:/usr/bin/kubectl jieyu/dind-buster:v0.1.8 /bin/bash`

You can see that KinD works well in containers to create Kubernetes clusters. Next, let’s test this directly in Kubernetes.

$ kubectl apply -f dind.yaml
$ kubectl exec -ti dind /bin/bash
root@dind:/# curl -Lso ./kind https://kind.sigs.k8s.io/dl/v0.7.0/kind-$(uname)-amd64
root@dind:/# chmod +x ./kind
root@dind:/# mv ./kind /usr/bin/
root@dind:/# kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.17.0) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✗ Starting control-plane 🕹️ 
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 137

We can see that using KinD in Pod to create a cluster fails because the kubelet running in the KinD nested container randomly kills the processes in the top-level container, which is actually still related to the cgroups mount discussed above.

However, when I was using KinD in v0.8.1, I was able to create clusters in the above Pods normally, so I don’t know if there is any special treatment for KinD-built clusters.

If you are experiencing the same problem as above, then you can continue to see the solution below.

When the top-level container (DIND) is running in a Kubernetes Pod, for each cgroup subsystem (e.g. memory), its cgroup path from the host’s perspective is /kubepods/burstable/<POD_ID>/<DIND_CID>.

When KinD starts a kubelet within a nested node container within a DIND container, the kubelet will operate the cgroup under /kubepods/burstable/ relative to the root cgroup of the nested KIND node container for its Pods. from the host’s perspective, the cgroup path is / kubepods/burstable/<POD_ID>/<DIND_CID>/docker/<KIND_CID>/kubepods/burstable/.

These are correct, but in the nested KinD node container, there is another cgroup that exists under /kubepods/burstable/<POD_ID>/<DIND_CID>/docker/<DIND_CID>, as opposed to the root cgroup of the nested KinD node container, which existed before the kubelet existed before it was started, as a result of the cgroups mount we discussed above, set via the KinD entrypoint script. And if you do a cat /kubepods/burstable/<POD_ID>/docker/<DIND_CID>/tasks inside the KinD node container, you will see the processes of the DinD container.

This is the root cause: the kubelet inside the KinD node container sees the cgroup and thinks it should manage it, but cannot find a Pod associated with the cgroup, so it tries to delete the cgroup by killing the processes belonging to the cgroup, which results in random processes being killed. The solution to this problem is to set the -cgroup-root parameter of the kubelet to instruct the kubelet in the KinD node container to use a root path (e.g. /kubelet) for its Pods that does not have a cgroup. This allows you to start a KinD cluster in a Kubernetes cluster, which we can fix with the YAML resource manifest file below.

apiVersion: v1
kind: Pod
metadata:
  name: kind-cluster
spec:
  containers:
  - image: jieyu/kind-cluster-buster:v0.1.0
    name: kind-cluster
    stdin: true
    tty: true
    args:
    - /bin/bash
    env:
    - name: API_SERVER_ADDRESS
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    volumeMounts:
    - mountPath: /var/lib/docker
      name: varlibdocker
    - mountPath: /lib/modules
      name: libmodules
      readOnly: true
    securityContext:
      privileged: true
    ports:
    - containerPort: 30001
      name: api-server-port
      protocol: TCP
    readinessProbe:
      failureThreshold: 15
      httpGet:
        path: /healthz
        port: api-server-port
        scheme: HTTPS
      initialDelaySeconds: 120
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
  volumes:
  - name: varlibdocker
    emptyDir: {}
  - name: libmodules
    hostPath:
      path: /lib/modules

Once created using the resource list file above, we can go into the Pod and verify it in a moment.

$ kubectl exec -ti kind-cluster /bin/bash
root@kind-cluster:/# kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION                                                                                                                   
kind-control-plane   Ready    master   72s   v1.17.0

It is also possible to use the Docker CLI directly to test.

$ docker run -ti --rm --privileged jieyu/kind-cluster-buster:v0.1.0 /bin/bash
Waiting for dockerd...
Setting up KIND cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (jieyu/kind-node:v1.17.0) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Waiting ≤ 15m0s for control-plane = Ready ⏳ 
 • Ready after 31s 💚
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋
root@d95fa1302557:/# kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
kind-control-plane   Ready    master   71s   v1.17.0
root@d95fa1302557:/#

The above image corresponds to the Dockerfile and startup script address: https://github.com/jieyu/docker-images/tree/master/kind-cluster

The following image shows the final result of a Pod I created in a Kubernetes cluster built by KinD, and then a standalone Kubernetes cluster created in the Pod.

Summary

When implementing the above features, we encountered a lot of obstacles in the process, most of which are caused by the fact that Docker containers don’t provide complete isolation from the host, and some kernel resources like cgroups are shared in the kernel, which may also cause potential conflicts if many containers operate them at the same time. But once these problems are solved, we can easily run a standalone Kubernetes cluster in a Kubernetes cluster Pod, which should be considered a real Kubernetes IN Kubernetes, right?

Table of Contents