Principles of container technology (5): file system isolation and sharing

Background Knowledge

All accessible files on a Unix system are organized in one large tree-like hierarchy rooted at the / directory. The files may be spread across different devices, as long as the file systems on those devices are attached to the tree with the mount system call.

Directories vs. file systems

It is important to understand the difference between a file system and a directory. A file system is a portion of a storage device, such as a hard disk partition, that is set aside to hold file data. The files stored there become accessible only once the file system is mounted on a directory, after which the end user works with it just like a normal directory.

We often refer to ext, xfs and zfs as types of file systems. Different types of file systems access and manage data in different ways, and Linux supports many types of file systems.
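To see the distinction on a live system, the sketch below reports which file system a given path lives on and where that file system is mounted. It assumes GNU coreutils (`stat -f` and `df --output`); the function names `fs_type` and `fs_mountpoint` are made up for this illustration.

```shell
# Print the type of the file system that holds PATH (for example xfs or tmpfs).
fs_type() {
    stat -f -c %T "$1"
}

# Print the mount point of the file system that holds PATH.
fs_mountpoint() {
    df --output=target "$1" | tail -n 1
}

fs_type /
fs_mountpoint /
```

Calling both functions on a subdirectory shows that a directory is only a path in the tree, while the file system behind it is whatever was mounted at or above that path.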

What is `rootfs`?

rootfs (Root Filesystem) is the top of the hierarchical file tree. It contains files and directories that are critical to the operation of the system, including device directories and programs used to boot the system. rootfs also contains a number of mount points through which other file systems can connect to the rootfs file tree. rootfs is usually provided by Linux distributions, and the contents of a typical rootfs are as follows.

$ ls /
boot  etc   home  lib64  mnt  proc  run   srv  tmp  var
bin   dev   lib   media  opt  root  sbin  sys  usr

When the system boots, the init process mounts rootfs on / and then mounts the other file systems on its subdirectories. Every mount made during this time is recorded in the init process's mount table. Each process has its own view of the mount table, which can be read from /proc/{PID}/mounts, but in general all processes on the system share the init process's mount table.
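As a rough illustration, a mount table can be read with a few lines of shell. Each line of /proc/{PID}/mounts follows the fstab layout (device, mount point, type, options, and two numeric fields), so the second and third fields are enough for a quick overview. The function name `list_mounts` is hypothetical.

```shell
# Print "mount point<TAB>file system type" for every entry in a mount table.
# With no argument, the calling process's own table is read.
list_mounts() {
    awk '{ printf "%s\t%s\n", $2, $3 }' "${1:-/proc/self/mounts}"
}

# Show the first few entries of our own mount table.
list_mounts | head -n 5
```

Passing another process's table, for example `list_mounts /proc/1/mounts`, makes it easy to compare what two processes see.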

The nature of mount namespace and how it works

A process can get its own mount table, but the new table always starts as a copy of the parent process's table. Subsequent calls to mount then affect only the process's own table. This is how mount namespaces work. If several processes are in the same mount namespace, a change made to the mount table by one of them is visible to all the others.

Now let’s actually see all the mount namespaces that currently exist on the system, with the help of the cinf tool described in the previous article.

$ cinf | grep mnt

 NAMESPACE   TYPE  NPROCS  USERS                      CMD
 
 4026531840  mnt   130     0,32,70,81,89,998,999      /sbin/init splash
 4026531860  mnt   1       0
 4026532320  mnt   1       997                        /usr/sbin/chronyd
 4026532389  mnt   1       0                          sh

The output has been trimmed for clarity. The process count (NPROCS) shows that most processes live in mount namespace 4026531840, which was created by the init process /sbin/init. New processes generally do not create a new mount namespace, so if you print the mount tables of these processes from /proc/{PID}/mounts, you will see identical output.
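Without cinf, the same information is available from /proc: the symbolic link /proc/{PID}/ns/mnt names the mount namespace a process belongs to. The following is a minimal sketch; the helper names `mntns_of` and `same_mntns` are assumptions of this example, and reading the link of a process owned by another user usually requires root.

```shell
# Print the mount namespace identifier of PID, e.g. "mnt:[4026531840]".
mntns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Succeed if both processes are in the same mount namespace.
same_mntns() {
    a=$(mntns_of "$1") && b=$(mntns_of "$2") && [ "$a" = "$b" ]
}

# Compare the current shell with the init process.
if same_mntns $$ 1; then
    echo "same mount namespace as PID 1"
else
    echo "different mount namespace (or PID 1 is not readable)"
fi
```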

When a new mount namespace is created, it receives a copy of the mount points of the parent namespace. We will verify this by creating a new mount namespace with unshare, reusing the filesystem bundle built from the Docker image in the previous article. The -m flag creates the mount namespace, while -U and -r additionally create a user namespace and map the current user to root inside it, so the command works without root privileges.

cd /mycontainer/rootfs
PS1='\u@new-mnt$ ' unshare -Umr

After executing this command we are in a new mount namespace, but we still see all the mount points of the host.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb3       119G   28G   86G   25% /
tmpfs           7.8G     0  7.8G    0% /dev/shm
tmpfs           1.6G  1.9M  1.6G    1% /run
tmpfs           5.0M  4.0K  5.0M    1% /run/lock
tmpfs           1.6G   84K  1.6G    1% /run/user/121
tmpfs           1.6G   68K  1.6G    1% /run/user/0
tmpfs           4.0M     0  4.0M    0% /sys/fs/cgroup
/dev/sdb2        94G  5.8G   83G    7% /home
/dev/sda2        96M   30M   67M   31% /boot/efi

Conversely, when we perform a new mount inside the new mount namespace, the corresponding record is not visible from the host.

# in new mount namespace
$ mount -t tmpfs tmpfs /mnt
$ findmnt | grep mnt
└─/mnt                  tmpfs                  tmpfs           rw,relatime,inode64
# in host
$ findmnt | grep mnt

This is how mount namespace works.

Creating a new mount namespace in a container

Now create a new container to see the changes to mount namespace.

$ cd /mycontainer
$ runc run mybox

In a new window, we query the namespaces from the host using cinf.

$ cinf | grep mnt
 4026531840  mnt   134     0,32,70,81,89,998,999      /usr/lib/systemd/systemd --swi
 4026531860  mnt   1       0
 4026532320  mnt   1       997                        /usr/sbin/chronyd
 4026532325  mnt   1       0                          sh
 4026532389  mnt   1       0                          sh

You will see that a new mount namespace 4026532325 has been added, and the process that created it is the container’s init process.

$ cinf --namespace 4026532325

 PID   PPID  NAME  CMD  NTHREADS  CGROUPS                                                           STATE

 7656  7646  sh    sh   1         11:blkio:/mybox 10:devices:/mybox 9:pids:/mybox                   S (sleeping)
                                  8:cpu,cpuacct:/mybox 7:net_cls,net_prio:/mybox
                                  6:freezer:/mybox 5:memory:/mybox 4:cpuset:/mybox
                                  3:hugetlb:/mybox 2:perf_event:/mybox
                                  1:name=systemd:/user.slice/user-0.slice/session-1231.scope/mybox

$ runc ps mybox
UID        PID  PPID  C STIME TTY          TIME CMD
root      7656  7646  0 14:39 pts/0    00:00:00 sh

Note that creating a new mount namespace does not produce an empty mount table; the new namespace starts from a copy of the parent's mount table. A freshly created container therefore sees every mount point of the host, which is obviously not what we want when the goal is an isolated file system built from the rootfs in the filesystem bundle. To get there, the runtime has to unmount the current rootfs inside the new mount namespace and mount the bundle's rootfs on /. We can confirm that runc has done so by printing the inode number of / with ls -id in the container and comparing it with that of the bundle's rootfs directory on the host.

# in container: print the / directory
$ ls -id /
7077890 /
# in host: print the directory that holds rootfs in the filesystem bundle
$ ls -id /mycontainer/rootfs
7077890 /mycontainer/rootfs

Plain mount and umount cannot do this: the current rootfs is always in use by running processes, so it cannot be unmounted. There is an alternative route: the pivot_root system call moves the current rootfs to a location other than /, makes a new directory the root mount, and switches the root of the processes to the new directory. After that, the original rootfs can be unmounted without any problems.

Switching root directories with pivot_root or chroot

pivot_root is a Linux system call that changes the root mount of the calling process's mount namespace. It also switches the root and current working directory of every process in that namespace that was using the old root to the new directory. Its original purpose is to let the system mount a temporary rootfs at boot time for a specific function and then switch to the real rootfs.

During container creation, after creating a new mount namespace, we can use pivot_root() to switch the root directory of the processes in the container to the directory where rootfs is located in the filesystem bundle.
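The sequence can be sketched with the pivot_root command-line wrapper. This is an illustration of the idea, not what runc literally executes: the helper names `run` and `switch_root` and the directory name .old_root are hypothetical, and the rootfs is assumed to contain its own umount and rmdir (a BusyBox image does). With DRY_RUN=1 the steps are only printed; to apply them for real, call the function as root inside a new mount namespace, for example in a shell started by `unshare -m`.

```shell
# Run a command, or only print it when DRY_RUN=1 so the steps can be reviewed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Make ROOTFS the root of the current mount namespace.
switch_root() {
    rootfs=$1
    run mount --make-rprivate / &&           # keep later changes out of the parent namespace
    run mount --bind "$rootfs" "$rootfs" &&  # pivot_root requires new_root to be a mount point
    run cd "$rootfs" &&
    run mkdir -p .old_root &&                # directory that will receive the old root
    run pivot_root . .old_root &&            # the old root is now reachable at /.old_root
    run umount -l /.old_root &&              # detach the host's file systems
    run rmdir /.old_root
}

# Print the steps without touching the system.
DRY_RUN=1 switch_root /mycontainer/rootfs
```

The bind mount of the rootfs onto itself is there only because pivot_root refuses a new root that is not a mount point.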

Example of chroot command usage

In addition to pivot_root, Linux also provides the chroot system call to change the root directory of the current process to a new directory, which will be inherited by all children of the current process.

Both pivot_root and chroot mentioned above are system calls provided by the Linux kernel, and each of them has a command line program that implements a simple wrapper around the system call.

# switch to a valid filesystem bundle directory
$ cd /mycontainer
$ chroot rootfs /bin/sh
# we are now in a new shell
$ ls /
bin   dev   etc   home  old   proc  root  sys   tmp   usr   var

In the new shell program, executing ls returns the contents of the /mycontainer/rootfs directory instead of the contents of the root directory of the host. With the chroot command, we can also switch the root directory of the processes in the container without the need for mount namespace.

Combined with a mount namespace, pivot_root is safer than chroot and is the preferred choice for running containers, but chroot remains available as a fallback. runC's implementation contains the following Go snippet.

if config.NoPivotRoot {
    err = msMoveRoot(config.Rootfs)
} else if config.Namespaces.Contains(configs.NEWNS) {
    err = pivotRoot(config.Rootfs)
} else {
    err = chroot()
}

Setting aside the first branch, which handles the user's explicit request not to use pivot_root, the remaining branches are the two ways to switch the root directory.

If a new mount namespace is created, the pivot_root system call is used.
If no new mount namespace is created, chroot is used directly.

After establishing an isolated filesystem, we also need a mechanism to access part of the host’s filesystem from the container, or to persist data generated by the container’s operation to the host.

A bind mount is a type of mount provided by Linux that remounts a file or directory at a new target path. After the mount, the original data is accessible from both the old and the new path, changes made through either path take effect on the same data, and the original content of the target path is hidden. Take the following directory structure as an example.

$ tree .
.
├── A
│   ├── a
│   └── a.conf
└── B
    ├── b
    └── b.conf

Bind-mount the A directory onto the B directory with the following command.

$ mount --bind A B

Afterwards, the contents of the original A directory will be accessible from both A and B directories, while the original contents of the B directory will be hidden.

$ tree .
.
├── A
│   ├── a
│   └── a.conf
└── B
    ├── a
    └── a.conf

We can use bind mount to share files between the host and the container, mount directories or even block devices from the host to the container.
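One way to check that two paths really are the same directory, as they would be after a bind mount, is to compare their device and inode numbers. The sketch below assumes GNU stat; the function name `same_file` is made up for this example.

```shell
# Succeed if two paths resolve to the same file or directory (same device and inode).
same_file() {
    a=$(stat -L -c '%d:%i' "$1") && b=$(stat -L -c '%d:%i' "$2") && [ "$a" = "$b" ]
}

workdir=$(mktemp -d)
mkdir "$workdir/A" "$workdir/B"

# Two freshly created directories are distinct.
if same_file "$workdir/A" "$workdir/B"; then
    echo "same directory"
else
    echo "different directories"
fi

rm -r "$workdir"
```

Running `mount --bind "$workdir/A" "$workdir/B"` as root before the check would make it report "same directory", because B then resolves to A's inode.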

Using bind mount in a container

Here we create a bind mount that connects the container to the host. To do this with the container runtime, edit config.json in the filesystem bundle and add the following object to the mounts list of the JSON document.

{
        "destination": "/host_dir",
        "type": "bind",
        "source": "/mycontainer/host",
        "options": ["bind"]
}

This tells the container runtime to mount the host directory /mycontainer/host on /host_dir in the container's rootfs. The source may also be given as the relative path host, which is resolved relative to the filesystem bundle directory. The source directory on the host must exist in advance, whereas /host_dir in the container is created automatically by the container runtime if it does not exist. Now we create the host directory and populate it with a file.

$ cd /mycontainer
$ mkdir host
$ touch host/hello

Then run the container. The contents of the host's /mycontainer/host directory appear under /host_dir in the container, and changes made to that directory inside the container are visible on the host as well.

# in container
$ ls /host_dir
hello
$ touch /host_dir/world
# in host
$ ls /mycontainer/host
hello  world

Also, because the bind mount is performed inside the container's mount namespace, the host is unaware of the mount point.

# in container
$ cat /proc/self/mounts | grep host_dir
/dev/sdb3 /host_dir ext4 rw,relatime,errors=remount-ro 0 0
# in host
$ cat /proc/self/mounts | grep host_dir

In the first article of the series, we mentioned that the mount point of a bind mount is located in the writable layer of the container. Although the entire writable layer is deleted together with the container, data written during the container's lifetime remains in the source path on the host, so container data can be persisted this way.

What are Docker volumes

Docker provides Volumes (a term that does not exist in the OCI specification) to persist data in containers. The underlying implementation uses bind mounts to mount host paths into the container. Compared with a plain bind mount, Docker offers a friendlier command-line interface and also provides named volumes: instead of specifying an existing directory on the host, Docker creates and manages a separate directory under its own data directory on the host.

Summary

In this article we have discussed several aspects related to container filesystems.

Use mount namespace to create a separate file system environment in the container, although at this point it is not yet fully isolated from the host.
Use pivot_root to switch the root directory in the container to the rootfs provided by the image, so that it cannot access other paths on the host and is isolated.
Use bind mount to share data between the host and the container or to persist the data generated while the container is running.
