If you run a container with runc and try to change the hostname inside it, you get an interesting result: even as the root user, with a UID of 0, we do not have the privilege to change the hostname.
The idea that the root user holds the highest privileges is a thing of the past: in version 2.2, the Linux kernel introduced a new privilege-checking mechanism, capabilities.
Finer-grained permission control than superuser
The traditional Linux privilege-checking model is simple: when checking privileges, the kernel distinguishes only two types of processes.
- Privileged processes, whose effective user ID is 0. This user is commonly called the superuser or root.
- Unprivileged processes, whose effective user ID is nonzero.
Privileged processes bypass all kernel permission checks, while unprivileged processes are checked against their credentials, such as the process’s effective user ID and effective group ID.
To support finer-grained privileges, Linux kernels from version 2.2 onwards split superuser privileges into distinct units called capabilities. For example, the capability CAP_CHOWN allows a process to make arbitrary changes to the UID and GID of a file, as the chown command does. Almost all superuser privileges have been broken down into separate capabilities.
The introduction of capabilities has the following benefits.
- Some capabilities can be removed from the superuser to weaken its privileges and improve system security.
- Specific privileges can be granted to ordinary users precisely and on demand.
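Because each capability is an independent unit, a set of capabilities can be represented as a bitmask, which is how the kernel reports them in /proc/<pid>/status. The following is a minimal Python sketch of decoding such a mask. The bit numbers are taken from <linux/capability.h>, only a handful are listed, and the helper names decode_caps and has_cap, as well as the sample mask, are illustrative assumptions rather than anything from this article.

```python
from typing import List

# Bit positions as defined in <linux/capability.h>; this table is
# deliberately incomplete and lists only a few capabilities.
CAP_BITS = {
    0: "CAP_CHOWN",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    21: "CAP_SYS_ADMIN",
    29: "CAP_AUDIT_WRITE",
}
CAP_NUMBERS = {name: bit for bit, name in CAP_BITS.items()}


def decode_caps(mask: int) -> List[str]:
    """Return the names of all capabilities set in a bitmask, lowest bit first."""
    if mask < 0:
        raise ValueError("capability mask must be non-negative")
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits missing from the table are reported by number.
            names.append(CAP_BITS.get(bit, "cap_%d" % bit))
        bit += 1
    return names


def has_cap(mask: int, name: str) -> bool:
    """Check whether the named capability is present in the mask."""
    return bool((mask >> CAP_NUMBERS[name]) & 1)


if __name__ == "__main__":
    sample_mask = 0x20000420  # illustrative value, not taken from the article
    print(decode_caps(sample_mask))
    print(has_cap(sample_mask, "CAP_SYS_ADMIN"))
```

A process whose effective mask lacks the CAP_SYS_ADMIN bit fails the kernel's check for that capability even when its UID is 0.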
Security risks of privileged containers
Containers isolate processes and resources with namespaces, but not every resource can be namespaced, so containers are not completely isolated from the host. For example, the system time is shared between containers and the host. If a process in a container held all privileges, it could run (possibly malicious) programs that access hardware directly, or even modify the host’s file system. Operations inside a container must therefore be restricted; otherwise they can affect the stability of the host and even pose serious security risks.
For these reasons, a container by default runs with only a whitelist of capabilities, added when the container is created, so that even the superuser in the container lacks permission to perform certain operations.
Let’s deepen our understanding of capabilities in containers with an example.
Preparation
We will use an additional tool library, libcap, to interact with capabilities inside the container. It needs to be installed into the filesystem bundle that was described in the previous article.
Then you can use runc run from the /mycontainer2 directory to start a base container with the library installed.
Add capabilities when creating containers
In the opening example, the root user could not set the hostname in the container because the capability CAP_SYS_ADMIN was missing: it is not in the whitelist of capabilities added to the container by default.
In a previous article, we described how the container runtime sets up the runtime parameters and execution environment of the container it creates based on the config.json in the bundle. This process also includes setting the capabilities of the processes in the container.
By modifying config.json and adding "CAP_SYS_ADMIN" to the bounding, permitted and effective lists of the process.capabilities object, we add this capability to the corresponding capability sets of the container’s init process.
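The config.json edit described above can also be scripted. The following is a minimal Python sketch, assuming the file has the layout named in the text: a process.capabilities object holding bounding, permitted and effective lists. The helper name add_capability is hypothetical, and the sample lists in the demo are illustrative values, not the contents of a real bundle.

```python
import copy
import json


def add_capability(config, cap, sets=("bounding", "permitted", "effective")):
    """Return a copy of a parsed config.json with `cap` added to the given sets.

    The input dictionary is left untouched, and a capability that is
    already present in a list is not added twice.
    """
    updated = copy.deepcopy(config)
    caps = updated.setdefault("process", {}).setdefault("capabilities", {})
    for set_name in sets:
        entries = caps.setdefault(set_name, [])
        if cap not in entries:
            entries.append(cap)
    return updated


if __name__ == "__main__":
    # Illustrative stand-in for the relevant part of a bundle's config.json.
    sample = {
        "process": {
            "capabilities": {
                "bounding": ["CAP_KILL"],
                "permitted": ["CAP_KILL"],
                "effective": ["CAP_KILL"],
            }
        }
    }
    patched = add_capability(sample, "CAP_SYS_ADMIN")
    print(json.dumps(patched["process"]["capabilities"], indent=2))
```

To apply it to a real bundle, one would load config.json with json.load, pass the result through add_capability, and write it back with json.dump before starting the container.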
Technical details of capabilities
Capabilities can be applied to both files and processes (or threads; the Linux kernel does not distinguish between the two). The capabilities of a file are stored in the file’s extended attributes, which are cleaned up when the image is built, so we basically do not need to consider file capabilities in a container.
The capabilities of a process are controlled by five capability sets maintained separately for each process, each of which contains zero or more capabilities.
- Permitted: a limiting superset of the capabilities that the process can make effective
- Inheritable: capabilities that can be inherited by the new program when the process executes the exec() system call
- Effective: the set used by the kernel to perform permission checks on the process
- Bounding: a limiting superset for the Inheritable set; a capability must be in the Bounding set to be added to Inheritable
- Ambient: capabilities that are retained when the process runs an unprivileged program with the exec() system call
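How the five sets interact is easiest to see at exec() time. The following is a simplified Python model of the transformation rules documented in capabilities(7). It is a sketch under stated assumptions: set-user-ID binaries and securebits are ignored, and the names ProcessCaps, FileCaps and execve_caps are made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

Caps = FrozenSet[str]
NONE: Caps = frozenset()


@dataclass(frozen=True)
class ProcessCaps:
    permitted: Caps = NONE
    inheritable: Caps = NONE
    effective: Caps = NONE
    bounding: Caps = NONE
    ambient: Caps = NONE


@dataclass(frozen=True)
class FileCaps:
    permitted: Caps = NONE
    inheritable: Caps = NONE
    effective: bool = False  # for files this is a single bit, not a set

    @property
    def privileged(self) -> bool:
        # Simplification: set-user-ID and set-group-ID bits are not modelled.
        return bool(self.permitted or self.inheritable or self.effective)


def execve_caps(proc: ProcessCaps, file: FileCaps = FileCaps(),
                as_root: bool = False) -> ProcessCaps:
    """Compute the capability sets of a process after exec()."""
    # Ambient capabilities are dropped when the file itself is privileged.
    ambient = NONE if file.privileged else proc.ambient
    if as_root:
        # For UID 0 the file permitted and inheritable sets count as
        # "all ones" and the file effective bit counts as set.
        permitted = proc.inheritable | proc.bounding | ambient
        file_effective = True
    else:
        permitted = ((proc.inheritable & file.inheritable)
                     | (file.permitted & proc.bounding)
                     | ambient)
        file_effective = file.effective
    effective = permitted if file_effective else ambient
    # Inheritable and Bounding pass through exec() unchanged.
    return ProcessCaps(permitted=permitted, inheritable=proc.inheritable,
                       effective=effective, bounding=proc.bounding,
                       ambient=ambient)


if __name__ == "__main__":
    caps = frozenset({"CAP_KILL", "CAP_SYS_ADMIN"})
    init = ProcessCaps(permitted=caps, effective=caps, bounding=caps)
    print(execve_caps(init, as_root=True))   # root keeps the bounding set
    print(execve_caps(init, as_root=False))  # a non-root user keeps nothing
```

In this model, a root process that executes an ordinary binary ends up with everything in its Bounding set as Permitted and Effective, whereas a non-root process running the same binary keeps only its Ambient capabilities.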
As described above, we have added CAP_SYS_ADMIN to the Permitted, Bounding and Effective sets of the init process, so the init process will pass the kernel’s check for CAP_SYS_ADMIN.
Next we run a new container based on the new config.json, and now we can change the hostname.
We did the above in the sh process, which is the container’s init process. If we go on to create new processes in the container, will they also have the newly added capability? Let’s try this by changing the hostname again from a new process, started from a new window.
The hostname change succeeds because the newly created process gets the same capabilities as the init process.
Adding capabilities at container runtime
In addition to modifying config.json to add capabilities, we can also add capabilities while the container is running.
First restore config.json, then run a new container, mybox3, and make sure that its sh process no longer has CAP_SYS_ADMIN.
Then create a new process in that container with runc exec and add CAP_SYS_ADMIN to that process with the --cap option.
The idea is that since runc can set the capability sets of the init process based on config.json, it can do the same for other processes running in the container.
Check the capabilities of the process
capsh
Execute capsh --print from within the container to get more information about capabilities.
This command prints the capabilities of the current process.
The cap_sys_admin we added via config.json is included in the Current and Bounding sets. The +eip at the end of the capability list means that the capabilities exist in the Effective, Inheritable, and Permitted sets.
pscap
First, on the host, get the PID of the process running in the container.
Then install the pscap program on the host.
Finally, use the PID you obtained to view the capabilities of the processes in the container.
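The same information can be read without extra tools, because the kernel exposes each process’s capability sets as hexadecimal masks in /proc/<pid>/status. The following is a minimal Python sketch of reading them. It is not how pscap itself works, the helper names parse_cap_fields and read_process_caps are hypothetical, and the sample text in the demo is made up for illustration.

```python
import re
from typing import Dict

# Matches lines such as "CapEff:\t0000000020200420" in /proc/<pid>/status.
CAP_LINE = re.compile(
    r"^(Cap(?:Inh|Prm|Eff|Bnd|Amb)):\s*([0-9a-fA-F]+)$", re.MULTILINE
)

CAP_SYS_ADMIN_BIT = 21  # bit number from <linux/capability.h>


def parse_cap_fields(status_text: str) -> Dict[str, int]:
    """Extract the capability masks from the text of a status file."""
    return {name: int(value, 16) for name, value in CAP_LINE.findall(status_text)}


def read_process_caps(pid: int) -> Dict[str, int]:
    """Read the capability masks of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as status_file:
        return parse_cap_fields(status_file.read())


if __name__ == "__main__":
    # Illustrative status excerpt, not captured from a real container.
    sample = (
        "Name:\tsh\n"
        "CapInh:\t0000000000000000\n"
        "CapPrm:\t0000000020200420\n"
        "CapEff:\t0000000020200420\n"
        "CapBnd:\t0000000020200420\n"
        "CapAmb:\t0000000000000000\n"
    )
    fields = parse_cap_fields(sample)
    print({name: hex(mask) for name, mask in fields.items()})
    print(bool((fields["CapEff"] >> CAP_SYS_ADMIN_BIT) & 1))
```

On the host, calling read_process_caps with the container process’s PID returns the five masks for that process. A set bit 21 in CapEff means the process currently holds CAP_SYS_ADMIN.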