Background Knowledge
All accessible files on a Unix system are organized in a single tree-like hierarchy whose root node is the / directory. These files can be spread across different devices, provided that the file systems on those devices are mounted into the file tree with the mount system call.
Directories vs. file systems
It is important to understand the difference between a file system and a directory. A file system is a part of a storage device, such as a hard disk, that is allocated to hold file data. Once a file system is mounted on a directory, the files stored in it become accessible, and to the end user the mounted file system looks just like a normal directory.
We often refer to ext, xfs and zfs as types of file systems. Different types of file systems access and manage data in different ways, and Linux supports many types of file systems.
What is rootfs?
rootfs (Root Filesystem) is the top of the hierarchical file tree. It contains files and directories that are critical to the operation of the system, including device directories and programs used to boot the system. rootfs also contains a number of mount points through which other file systems can connect to the rootfs file tree. rootfs is usually provided by Linux distributions, and the contents of a typical rootfs are as follows.
When the system boots, the initialization process mounts rootfs on the / directory and then mounts other file systems on its subdirectories. All mount system calls made during this time are recorded in the initialization process’s mount table. Every process has its own mount table, exposed in /proc/{PID}/mounts. In general, however, all processes on the system use the initialization process’s mount table directly.
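To make the mount table concrete, here is a minimal sketch that parses the text format of /proc/{PID}/mounts. The names MountEntry, parse_mounts and read_mounts are hypothetical helpers introduced only for illustration, and the sketch assumes the usual whitespace-separated layout (source, mount point, file system type, options).

```python
from typing import List, NamedTuple, Tuple


class MountEntry(NamedTuple):
    """One line of a mount table (hypothetical helper type)."""
    source: str               # device or pseudo file system name
    target: str               # mount point
    fstype: str               # file system type, e.g. ext4 or proc
    options: Tuple[str, ...]  # mount options, e.g. ("rw", "relatime")


def parse_mounts(text: str) -> List[MountEntry]:
    """Parse text in the format of /proc/{PID}/mounts.

    Fields are separated by whitespace. Spaces inside paths appear
    escaped (as \\040) in the file and are left untouched here.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        source, target, fstype, options = fields[:4]
        entries.append(
            MountEntry(source, target, fstype, tuple(options.split(",")))
        )
    return entries


def read_mounts(pid: str = "self") -> List[MountEntry]:
    """Read the mount table of a process (Linux only)."""
    with open(f"/proc/{pid}/mounts") as f:
        return parse_mounts(f.read())
```

On a Linux host, calling read_mounts("1") and read_mounts() and comparing the results is one way to check that an ordinary process sees the same table as the initialization process.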
The nature of mount namespace and how it works
A process can get its own mount table, but that table always starts as a copy of the mount table of its parent process. Any changes made by later mount calls then affect only the mount table of the current process, which is how a mount namespace works. If multiple processes are in the same mount namespace, any changes made to the mount table by one process are visible to the other processes as well.
Now let’s look at all the mount namespaces that currently exist on the system, with the help of the cinf tool described in the previous article.
The output of the command has been lightly processed to give a clearer picture. The number of processes (NPROCS) shows that most processes are in the 4026531840 mount namespace created by the initialization process /sbin/init. New processes generally do not create a new mount namespace, and if you print the mount table of these processes, /proc/{PID}/mounts, you will see that the output is the same.
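The per-namespace process count can also be approximated without cinf. The sketch below groups PIDs by the mount namespace that /proc/{PID}/ns/mnt points to. It assumes the link text has the form mnt:[INODE]; the function names are hypothetical, and this is not how cinf itself is implemented.

```python
import os
import re
from collections import defaultdict
from typing import Dict, List, Optional

# The ns/mnt symlink is assumed to read like "mnt:[4026531840]".
_NS_LINK = re.compile(r"^mnt:\[(\d+)\]$")


def parse_ns_link(link: str) -> Optional[int]:
    """Extract the namespace inode number from a ns/mnt link target."""
    match = _NS_LINK.match(link)
    return int(match.group(1)) if match else None


def group_by_mount_namespace(proc_root: str = "/proc") -> Dict[int, List[int]]:
    """Map each mount namespace inode to the PIDs that live in it."""
    groups = defaultdict(list)
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # only numeric entries are processes
        try:
            link = os.readlink(os.path.join(proc_root, name, "ns", "mnt"))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        inode = parse_ns_link(link)
        if inode is not None:
            groups[inode].append(int(name))
    return dict(groups)
```

Running group_by_mount_namespace() as root and printing the length of each list gives a rough equivalent of the NPROCS column; without root, processes owned by other users are silently skipped.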
When you create a new mount namespace, it receives a copy of the mount points of the parent namespace. We will verify this by creating a new mount namespace with unshare -m, and we will also use the filesystem bundle we created from the Docker image in the previous article.
After executing this command we are in a new mount namespace, but still see all the mount points in the host.
Conversely, when we perform a new mount inside the new mount namespace, the corresponding records are not visible from the host.
This is how mount namespace works.
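One way to check this behaviour is to save /proc/self/mounts once on the host and once inside the new namespace, then compare the two. The following is a small sketch of that comparison; the helper names are hypothetical and the sample paths in the usage note are made up.

```python
from typing import List, Set


def mount_points(mounts_text: str) -> Set[str]:
    """Collect the mount points (second field) from mount table text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def only_in_namespace(namespace_mounts: str, host_mounts: str) -> List[str]:
    """Return mount points present in the namespace but not on the host."""
    return sorted(mount_points(namespace_mounts) - mount_points(host_mounts))
```

For example, if a tmpfs is mounted on /mnt/scratch after unshare -m, then only_in_namespace(inside_text, host_text) would be expected to return ["/mnt/scratch"], while the reverse comparison stays empty because the namespace started from a copy of the host's mounts.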
Creating a new mount namespace in a container
Now create a new container to see the changes to mount namespace.
In a new window, we query the namespaces from the host environment using cinf.
You will see that a new mount namespace 4026532325 has been added, and the process that created it is the container’s init process.
Note that creating a new mount namespace does not create a completely new mount table; instead, changes are made on a copy of the parent process’s mount table. So when the container is created, it has all the mount points of the host, which is obviously not what we want when we are trying to give the container an isolated filesystem based on the rootfs in the filesystem bundle. To get there, we need to unmount the current rootfs in the new mount namespace and mount the rootfs from the filesystem bundle on the / directory. We can verify this by printing the inode number of the / directory using ls -id in the container and on the host respectively.
The root cannot be swapped with umount and mount alone, because we cannot make all processes stop using the current rootfs: it is always in use and therefore cannot be unmounted. But there is an alternative route: the pivot_root system call, which moves the current rootfs to a location other than /, mounts a new directory on the / directory, and switches the root of all current processes to the new directory. After that we can unmount the original rootfs without any problems.
Switching root directories with pivot_root or chroot
pivot_root is a system call provided by Linux that switches the root and current working directory of all processes in a mount namespace to a new directory. Its main purpose is the boot sequence: the system first mounts a temporary rootfs that serves a specific function during boot, and then switches to the real rootfs.
During container creation, after creating a new mount namespace, we can use pivot_root() to switch the root directory of the processes in the container to the directory where rootfs is located in the filesystem bundle.
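The usual sequence around pivot_root can be sketched as a list of commands. This is an illustrative plan, not the code of any particular runtime: the function name and the .old_root directory are assumptions, and the sketch only builds the command lines rather than executing them, since they require root privileges inside a new mount namespace.

```python
import os
from typing import List


def pivot_root_steps(rootfs: str, put_old_name: str = ".old_root") -> List[List[str]]:
    """Build the commands commonly used to pivot into a new rootfs.

    The steps are meant to run inside a fresh mount namespace.
    put_old_name is a hypothetical directory that temporarily holds
    the old root after the pivot.
    """
    rootfs = os.path.abspath(rootfs)
    put_old = os.path.join(rootfs, put_old_name)
    return [
        # Keep mount events from propagating back to the host namespace;
        # pivot_root also refuses to work with shared propagation on the root.
        ["mount", "--make-rprivate", "/"],
        # pivot_root requires new_root to be a mount point, so bind-mount
        # the rootfs directory onto itself.
        ["mount", "--bind", rootfs, rootfs],
        # Directory under the new root that will receive the old root.
        ["mkdir", "-p", put_old],
        # Swap roots: new root becomes /, old root moves to put_old.
        ["pivot_root", rootfs, put_old],
        # From here on, paths are resolved inside the new root.
        # Lazily detach the old root so the host tree disappears.
        ["umount", "-l", "/" + put_old_name],
    ]


if __name__ == "__main__":
    for step in pivot_root_steps("/mycontainer/rootfs"):
        print(" ".join(step))
```

A process that performs the pivot itself should also change its working directory to / afterwards, and the final umount only works if the new rootfs ships an umount binary; both details are left out of the plan above.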
Example of chroot command usage
In addition to pivot_root, Linux also provides the chroot system call to change the root directory of the current process to a new directory, which will be inherited by all children of the current process. Both pivot_root and chroot are system calls provided by the Linux kernel, and each of them has a command line program that implements a simple wrapper around the system call.
In the new shell program, executing ls returns the contents of the /mycontainer/rootfs directory instead of the contents of the root directory of the host. With the chroot command, we can also switch the root directory of the processes in the container without the need for a mount namespace.
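The same effect can be reached from a program through the chroot system call. Below is a minimal sketch using Python's os.chroot; the function name run_in_chroot is hypothetical, and a successful call requires root privileges (CAP_SYS_CHROOT) plus a rootfs that actually contains the program to execute.

```python
import os
import sys
from typing import List


def run_in_chroot(new_root: str, argv: List[str]) -> int:
    """Run argv in a child process whose root directory is new_root.

    Returns the child's exit code, or 127 if chroot or exec failed.
    """
    pid = os.fork()
    if pid == 0:
        # Child: change the root, then move into it so that the working
        # directory does not keep pointing outside the new root.
        try:
            os.chroot(new_root)
            os.chdir("/")
            os.execvp(argv[0], argv)  # resolved inside the new root
        except OSError as exc:
            print(f"chroot/exec failed: {exc}", file=sys.stderr)
        finally:
            os._exit(127)  # reached only if chroot or exec failed
    # Parent: its own root directory is unchanged.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Assuming the bundle from this article, run_in_chroot("/mycontainer/rootfs", ["/bin/ls", "/"]) run as root should list the contents of the bundle's rootfs rather than the host's root directory.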
Combined with a mount namespace, the pivot_root system call is safer than chroot and is preferred when running containers, but chroot remains an option. The implementation of runC contains the following (Golang) code snippet.
Ignoring the first branch, which handles the user-specified "no root switch" option, the remaining branches represent the two ways to switch root directories.
- If a new mount namespace is created, the pivot_root system call is used.
- If no new mount namespace is created, chroot is used directly.
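The branching described above can be restated as a small decision function. This is a Python paraphrase of the logic as the text describes it, not the runC source: the type RootSwitchConfig, its field names and the returned labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RootSwitchConfig:
    """Hypothetical stand-in for the relevant container settings."""
    no_pivot_root: bool        # user explicitly asked not to use pivot_root
    new_mount_namespace: bool  # container gets its own mount namespace


def choose_root_switch(cfg: RootSwitchConfig) -> str:
    """Pick the root-switching strategy described in the text."""
    if cfg.no_pivot_root:
        return "no_pivot"    # first branch, ignored in the discussion
    if cfg.new_mount_namespace:
        return "pivot_root"  # isolated mount table: safe to pivot
    return "chroot"          # shared mount table: fall back to chroot
```

For example, choose_root_switch(RootSwitchConfig(no_pivot_root=False, new_mount_namespace=True)) selects "pivot_root", which is the common case for a container that has its own mount namespace.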
bind mount: share files between host and container
After establishing an isolated filesystem, we also need a mechanism to access part of the host’s filesystem from the container, or to persist data generated by the container’s operation to the host.
A bind mount is a type of mount provided by Linux that remounts a file or directory at a new target path. After the mount, the original data is accessible from both the old and the new path, changes made through either path take effect on the same data, and the original content of the target path is hidden. Take the following directory structure as an example.
Bind-mount directory A onto directory B with a command such as mount --bind A B.
Afterwards, the contents of the original A directory will be accessible from both A and B directories, while the original contents of the B directory will be hidden.
We can use bind mounts to share files between the host and the container by mounting directories, or even block devices, from the host into the container.
Using bind mount in a container
Here we will create a bind mount in the container that connects to the host. With the container runtime, this is done by modifying config.json in the filesystem bundle and adding the following object to the mounts list of the JSON object.
This tells the container runtime to mount the /mycontainer/host directory on the host (or host if using a relative path, which is resolved relative to the filesystem bundle directory) on the /host_dir directory in the container’s rootfs. The host directory on the host must exist in advance, while host_dir in the container is created automatically at runtime if it does not exist. Now we create the host directory on the host and populate it with a file.
Then run the container and we will see the contents of the host’s /mycontainer/host directory in the container’s /host_dir directory. Files in that directory that are modified within the container are also visible to the host.
Also, since the bind mount is performed in the mount namespace of the container, the host is not aware of the existence of the mount point.
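The config.json change described earlier can also be scripted. The sketch below appends a bind mount entry to the mounts list of an already loaded config. The function name is hypothetical, the field names follow the OCI runtime specification's mount object, and the rbind and rw options are an assumption rather than values taken from this article.

```python
import json
from typing import Any, Dict, Sequence


def add_bind_mount(
    config: Dict[str, Any],
    source: str,
    destination: str,
    options: Sequence[str] = ("rbind", "rw"),  # assumed options
) -> Dict[str, Any]:
    """Append a bind mount entry to the 'mounts' list of an OCI config."""
    entry = {
        "destination": destination,  # path inside the container
        "type": "bind",
        "source": source,            # path on the host
        "options": list(options),
    }
    config.setdefault("mounts", []).append(entry)
    return config


if __name__ == "__main__":
    # Small in-memory stand-in for a bundle's config.json.
    config = {"ociVersion": "1.0.0", "mounts": []}
    add_bind_mount(config, "/mycontainer/host", "/host_dir")
    print(json.dumps(config["mounts"], indent=2))
```

In practice the config would be read from and written back to config.json in the filesystem bundle; that file handling is omitted to keep the sketch short.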
In the first article of the series, we mentioned that the mount point of a bind mount is located in the writable layer of the container. Although the whole writable layer is deleted when the container is deleted, the data written while the container was running remains in the mounted path on the host, so data from the container can be persisted this way.
What are Docker volumes
Docker provides Volumes (a term that does not exist in the OCI specification) to persist data in containers. The underlying implementation uses a bind mount to mount a path on the host into the container; on top of that, Docker provides a friendlier command line interface. It also provides named volumes: instead of specifying an existing directory on the host, you let Docker manage the volume by creating a separate directory for it in its data directory on the host.
Summary
In this article we have discussed several aspects related to container filesystems.
- Use a mount namespace to create a separate file system environment in the container, which is not yet completely isolated from the host at this point.
- Use pivot_root to switch the root directory in the container to the rootfs provided by the image, so that it cannot access other paths on the host and is isolated.
- Use bind mount to share data between the host and the container or to persist the data generated while the container is running.