Kubernetes has a problem with containers that mount a volume through subpath: if the contents of the ConfigMap (or other volume) change and the container then exits unexpectedly, the container cannot be restarted and keeps crashing.
Kubernetes has now released version 1.18, and the issue still exists.
Related community issue: #68211
Reproduction steps
Apply a Pod configuration that mounts a file from a ConfigMap through subpath. After the Pod has started, modify the contents of the ConfigMap and wait 30 seconds for the container to exit on its own. When the kubelet restarts the container, the container can be observed to keep failing on the mount.
Cause Analysis
Update of Configmap Volume
Before the container starts for the first time, the kubelet first downloads the contents of the ConfigMap to the volume directory corresponding to the Pod, for example /var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg.
Also, to ensure that updates to the contents of this volume are atomic (the directory is updated as a whole), the files in the directory are exposed through soft links: extra.ini is a soft link to ..data/extra.ini, ..data is a soft link to ..2020_03_29_03_12_44.788930127, and the directory named after the timestamp holds the real content.
When the ConfigMap is updated, a new timestamp directory is created to hold the updated content.
A new soft link ..data_tmp is created that points to the new timestamp directory and is then renamed to ..data; the rename is an atomic operation.
Finally, the old timestamp directory is deleted.
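The update sequence above can be sketched in a few lines of Python. This is only an illustration of the soft link swap, not the kubelet's actual code; the function name publish and the use of tempfile.mkdtemp to generate a unique timestamp directory name are assumptions made for the example.

```python
import os
import shutil
import tempfile
import time


def publish(volume_dir: str, files: dict) -> str:
    """Publish files into volume_dir through a timestamp directory and a ..data swap.

    Hypothetical helper for illustration; returns the new timestamp directory name.
    """
    # 1. Write the new content into a fresh timestamp-named directory.
    prefix = time.strftime("..%Y_%m_%d_%H_%M_%S.")
    ts_dir = tempfile.mkdtemp(prefix=prefix, dir=volume_dir)
    ts_name = os.path.basename(ts_dir)
    for name, content in files.items():
        with open(os.path.join(ts_dir, name), "w") as f:
            f.write(content)

    # Remember where ..data pointed before the update, if it existed.
    data_link = os.path.join(volume_dir, "..data")
    old_target = os.readlink(data_link) if os.path.islink(data_link) else None

    # 2. Create ..data_tmp pointing at the new directory, then rename it
    #    over ..data. The rename replaces the old link in one atomic step.
    tmp_link = os.path.join(volume_dir, "..data_tmp")
    os.symlink(ts_name, tmp_link)
    os.rename(tmp_link, data_link)

    # 3. The user-visible names are soft links that go through ..data.
    for name in files:
        visible = os.path.join(volume_dir, name)
        if not os.path.lexists(visible):
            os.symlink(os.path.join("..data", name), visible)

    # 4. Delete the previous timestamp directory.
    if old_target is not None:
        shutil.rmtree(os.path.join(volume_dir, old_target))
    return ts_name
```

Readers that open extra.ini always go through ..data, so they see either the old content or the new content, never a mixture. Anything that still holds the resolved path inside the old timestamp directory, however, is left pointing at a deleted file.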
Preparing the container to mount the subpath Volume
When the ConfigMap volume is ready, the kubelet bind mounts the file specified by subpath in the ConfigMap to a special directory: /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/{container name}/0.
The source of this bind mount is not the soft link but the real file inside the timestamp directory.
When the ConfigMap is updated, this timestamp directory is removed, and the suffix //deleted is appended to the source path of the mount as shown in the mount table (for example in /proc/self/mountinfo).
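As a rough way to spot such stale mounts, the mount table can be scanned for the //deleted suffix. The sketch below assumes the usual /proc/self/mountinfo layout, in which the fourth field is the source path within the filesystem and the fifth field is the mount point; the function name stale_bind_mounts is made up for this example.

```python
def stale_bind_mounts(mountinfo_text: str) -> list:
    """Return the mount points whose bind mount source has been deleted."""
    stale = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a well-formed mountinfo line
        root, mount_point = fields[3], fields[4]
        # The kernel appends "//deleted" once the source file is unlinked.
        if root.endswith("//deleted"):
            stale.append(mount_point)
    return stale


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        for mount_point in stale_bind_mounts(f.read()):
            print(mount_point)
```

Run on a node after a ConfigMap update, a subpath mount point such as /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 would be expected to appear in the output.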
Bind Mount
When the container is started, /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 needs to be mounted into the container.
If the original timestamp directory has been deleted, the mount fails with the error: mount: mount(2) failed: No such file or directory.
The problem can be simulated by hand: bind mount a file a onto b, delete a, and then try to mount b somewhere else.
Once a is deleted, the b mount point can no longer be mounted. So when the container exits abnormally and needs to be restarted, if the ConfigMap has been updated and the original file in the timestamp directory has been deleted, the subpath can no longer be mounted into the container.
Solution
Unmount after the ConfigMap changes
Community Related PR: https://github.com/kubernetes/kubernetes/pull/82784
Before a container restart, check whether the source file of the existing subpath mount point and the file that subpath currently resolves to are the same.
When the ConfigMap has been updated and the timestamp directory has changed, the check detects the inconsistency. In that case, unmount /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 and bind mount the corresponding file in the latest timestamp directory again.
According to the comments on the community PR, the risks of this approach are unclear (one comment notes that it is unsafe on kernels older than 4.18), so the PR made no progress for a long time.
Extended testing has not yet revealed any obvious problems.
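The consistency check in this approach can be illustrated roughly as follows. This is a sketch under assumptions, not the code from the PR: it relies on the fact that a bind-mounted file reports the same device and inode numbers as its source, and the function name subpath_mount_is_stale is hypothetical.

```python
import os


def subpath_mount_is_stale(bind_target: str, volume_dir: str, subpath: str) -> bool:
    """Return True if bind_target no longer refers to the current subpath file."""
    # Follow the soft links (extra.ini -> ..data/extra.ini -> timestamp dir).
    current = os.path.realpath(os.path.join(volume_dir, subpath))
    try:
        mounted = os.stat(bind_target)
        wanted = os.stat(current)
    except FileNotFoundError:
        return True  # nothing to compare against, treat as stale
    # Same device and inode means both paths refer to the same file.
    return (mounted.st_dev, mounted.st_ino) != (wanted.st_dev, wanted.st_ino)
```

If the function returns True, the caller would unmount the intermediate mount point and bind mount the file from the latest timestamp directory again before starting the container.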
Do not use subpath
Work around the problem by other means.
For example, mount the whole ConfigMap to another directory in the container and, when the container starts, create a soft link from the expected path to the file in that directory.
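A minimal sketch of such a startup step is shown below. The function name link_config and the paths in the usage note are hypothetical; the real entrypoint would run this before starting the application process.

```python
import os


def link_config(mounted_file: str, expected_path: str) -> None:
    """Make expected_path a soft link to mounted_file, replacing any old file or link."""
    parent = os.path.dirname(expected_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Build the link under a temporary name, then move it into place so that
    # expected_path is never missing while the link is being replaced.
    tmp = expected_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(mounted_file, tmp)
    os.replace(tmp, expected_path)
```

For example, with the ConfigMap mounted as a whole directory at /etc/extra-cfg (an assumed path), calling link_config("/etc/extra-cfg/extra.ini", "/app/conf/extra.ini") makes the application's expected path follow the ConfigMap's own soft links, so updates stay visible and no subpath bind mount is involved.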
Why use an indirect bind mount instead of mounting the soft link directly
Refer to the article at https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/.
As the article explains, the soft link was originally mounted directly, but this had a security vulnerability: a symlink race. A malicious program can construct a soft link that tricks a privileged program (the kubelet) into mounting a file the user has no permission to access into the user's container.
In such an attack, an initContainer uses an emptyDir volume to create a soft link to the root directory inside the mounted volume directory.
Afterwards, the main container starts normally but specifies that soft link as its subpath. If the kubelet mounted the soft link directly, it would mount the root directory of the host into the user container.
To solve this problem, the kubelet needs to resolve the real file path behind the soft link, check that the path lies inside the volume directory, and mount it into the container only after the check passes. However, because there is a time gap between the check and the mount, the file may still be tampered with in between.
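The resolve-and-verify step can be sketched as follows. This is a simplified illustration rather than the kubelet's implementation, and the function name resolve_inside_volume is hypothetical.

```python
import os


def resolve_inside_volume(volume_dir: str, subpath: str) -> str:
    """Resolve subpath under volume_dir and reject results that escape the volume."""
    volume_real = os.path.realpath(volume_dir)
    # realpath follows every soft link along the way.
    resolved = os.path.realpath(os.path.join(volume_real, subpath))
    if os.path.commonpath([volume_real, resolved]) != volume_real:
        raise ValueError(f"subpath {subpath!r} resolves outside the volume")
    return resolved
```

Note that the returned path is only known to be safe at the moment of the check. If the caller mounts it in a separate step, an attacker can swap the soft link in between, which is exactly the time gap described above.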
After discussion in the community, an intermediate bind mount mechanism was introduced. It is equivalent to putting a lock on the file and pinning the path of the original file, so that the later mount into the container can only use the source file that existed when the intermediate mount point was created.
Update
The fix submitted to the community has been merged (PR #89629).