We all know that the smallest unit of scheduling in k8s is the Pod, and that each Pod has a so-called infra container, Pause. So what exactly is a Pause container? What does it look like? What does it do?
Analyze the source code
From the official pause.c.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STRINGIFY(x) #x
#define VERSION_STRING(x) STRINGIFY(x)

#ifndef VERSION
#define VERSION HEAD
#endif

static void sigdown(int signo) {
  psignal(signo, "Shutting down, got signal");
  exit(0);
}

static void sigreap(int signo) {
  while (waitpid(-1, NULL, WNOHANG) > 0)
    ;
}

int main(int argc, char **argv) {
  int i;
  for (i = 1; i < argc; ++i) {
    if (!strcasecmp(argv[i], "-v")) {
      printf("pause.c %s\n", VERSION_STRING(VERSION));
      return 0;
    }
  }

  if (getpid() != 1)
    /* Not an error because pause sees use outside of infra containers. */
    fprintf(stderr, "Warning: pause should be the first process\n");

  if (sigaction(SIGINT, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 1;
  if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 2;
  if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
                                             .sa_flags = SA_NOCLDSTOP},
                NULL) < 0)
    return 3;

  for (;;)
    pause();
  fprintf(stderr, "Error: infinite loop terminated\n");
  return 42;
}
You can see that the Pause container does the following two things.
- It registers signal handlers for two kinds of signals: exit signals and child signals. When it receives SIGINT or SIGTERM, it exits immediately. When it receives SIGCHLD, it calls waitpid to reap the child processes that have exited.
- The main process calls the pause() function in an infinite for loop, which puts the process to sleep until a signal either terminates it or is caught by a handler.
The suspicious waitpid
My C foundation is still not solid. I always thought waitpid was only for a parent process waiting to reap an exited child process, but is that really so?
zerun.dong$ man waitpid
WAIT(2) BSD System Calls Manual WAIT(2)
NAME
wait, wait3, wait4, waitpid -- wait for process termination
SYNOPSIS
#include <sys/wait.h>
The man page on the Mac describes it as wait for process termination, which matches what I thought. Now log on to Ubuntu 18.04 and check there.
:~# man waitpid
WAIT(2) Linux Programmer's Manual WAIT(2)
NAME
wait, waitpid, waitid - wait for process to change state
In the Linux man page, the description becomes wait for process to change state. It waits for the state of the process to change!
All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose
state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by
a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a
wait is not performed, then the terminated child remains in a "zombie" state (see NOTES below).
The man page also thoughtfully provides test code.
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    pid_t cpid, w;
    int wstatus;

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {            /* Code executed by child */
        printf("Child PID is %ld\n", (long) getpid());
        if (argc == 1)
            pause();            /* Wait for signals */
        _exit(atoi(argv[1]));
    } else {                    /* Code executed by parent */
        do {
            w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
            if (w == -1) {
                perror("waitpid");
                exit(EXIT_FAILURE);
            }

            if (WIFEXITED(wstatus)) {
                printf("exited, status=%d\n", WEXITSTATUS(wstatus));
            } else if (WIFSIGNALED(wstatus)) {
                printf("killed by signal %d\n", WTERMSIG(wstatus));
            } else if (WIFSTOPPED(wstatus)) {
                printf("stopped by signal %d\n", WSTOPSIG(wstatus));
            } else if (WIFCONTINUED(wstatus)) {
                printf("continued\n");
            }
        } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
        exit(EXIT_SUCCESS);
    }
}
The child process stays in the pause state, while the parent process waits for the child process to change its state. Let’s open one session to run the code and another session to send signals.
~$ ./a.out
Child PID is 70718
stopped by signal 19
continued
stopped by signal 19
continued
^C
~# ps aux | grep a.out
zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
zerun.d+ 70718 0.0 0.0 4512 72 pts/0 S+ 06:48 0:00 ./a.out
root 71155 0.0 0.0 16152 1060 pts/1 S+ 06:49 0:00 grep --color=auto a.out
~#
~# kill -STOP 70718
~#
~# ps aux | grep a.out
zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
zerun.d+ 70718 0.0 0.0 4512 72 pts/0 T+ 06:48 0:00 ./a.out
root 71173 0.0 0.0 16152 1060 pts/1 S+ 06:49 0:00 grep --color=auto a.out
~#
~# kill -CONT 70718
~#
~# ps aux | grep a.out
zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
zerun.d+ 70718 0.0 0.0 4512 72 pts/0 S+ 06:48 0:00 ./a.out
root 71296 0.0 0.0 16152 1056 pts/1 R+ 06:49 0:00 grep --color=auto a.out
The child process is stopped and resumed by sending it the STOP and CONT signals, and the parent's waitpid reports each of these state changes.
It seems that the same C function is described somewhat differently on different systems. I was making a fuss over nothing.
Which namespaces are shared
Generally speaking, containers within the same Pod can reach each other simply through localhost. If you think of a k8s cluster as a distributed operating system, then a Pod corresponds to a process group, whose members must share certain things. So which namespaces are shared by default?
To find out, build an environment with minikube. First, look at the Pod definition file.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  shareProcessNamespace: true
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
    stdin: true
    tty: true
The shareProcessNamespace field, which has been stable since Kubernetes 1.17, controls whether the PID namespace is shared within the Pod. It defaults to false, so set it explicitly when you need sharing.
~$ kubectl attach -it nginx -c shell
If you don't see a command prompt, try pressing enter.
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /pause
8 root 0:00 nginx: master process nginx -g daemon off;
41 101 0:00 nginx: worker process
42 root 0:00 sh
49 root 0:00 ps aux
Attaching to the shell container shows all processes in the Pod, and only the pause process is PID 1, the init process.
/ # kill -HUP 8
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /pause
8 root 0:00 nginx: master process nginx -g daemon off;
42 root 0:00 sh
50 101 0:00 nginx: worker process
51 root 0:00 ps aux
Test sending a HUP signal to the nginx master process from the shell container: the worker process is restarted, as its new PID shows.
If the PID namespace is not shared, the main process of each container is PID 1 in its own namespace. What difference does sharing the PID namespace make? Refer to this article.
- Container processes no longer have PID 1. Without PID 1, some container images refuse to start (for example, containers using systemd) or run commands like kill -HUP 1 to signal the container process. In Pods with a shared process namespace, kill -HUP 1 signals the Pod sandbox (/pause in the above example).
- Processes are visible to other containers in the Pod. This includes all information visible in /proc, such as passwords passed as arguments or environment variables. These are protected only by regular Unix permissions.
- Container filesystems are visible to other containers in the Pod through the /proc/$pid/root link. This makes debugging easier, but it also means that filesystem contents are protected only by filesystem permissions.
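To make the second point concrete, here is a small sketch of how any process that can read /proc sees another process's arguments. The helper name read_cmdline is hypothetical, and the code assumes a Linux /proc filesystem and sufficient permissions on the target process.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copies the command line of pid into buf, with the
 * NUL separators between arguments replaced by spaces.
 * Returns the number of bytes stored, or -1 on error. */
long read_cmdline(pid_t pid, char *buf, size_t size) {
    char path[64];
    FILE *f;
    size_t n;

    assert(buf != NULL);
    if (size == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%ld/cmdline", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1; /* no such process, no /proc, or no permission */
    n = fread(buf, 1, size - 1, f);
    fclose(f);

    /* Arguments are NUL-separated; drop trailing NULs, join the rest. */
    while (n > 0 && buf[n - 1] == '\0')
        n--;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == '\0')
            buf[i] = ' ';
    buf[n] = '\0';
    return (long)n;
}
```

In the shared Pod above, calling read_cmdline(8, buf, sizeof buf) from the shell container would be expected to return the nginx master's command line, which is why secrets passed as arguments are exposed to sibling containers.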
On the host, find the process IDs of nginx and sh, then inspect their namespace IDs via /proc/<pid>/ns.
~# ls -l /proc/140756/ns
total 0
lrwxrwxrwx 1 root root 0 May 6 09:08 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 6 09:08 ipc -> 'ipc:[4026532497]'
lrwxrwxrwx 1 root root 0 May 6 09:08 mnt -> 'mnt:[4026532561]'
lrwxrwxrwx 1 root root 0 May 6 09:08 net -> 'net:[4026532500]'
lrwxrwxrwx 1 root root 0 May 6 09:08 pid -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May 6 09:08 pid_for_children -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May 6 09:08 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 6 09:08 uts -> 'uts:[4026532562]'
~# ls -l /proc/140879/ns
total 0
lrwxrwxrwx 1 root root 0 May 6 09:08 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 6 09:08 ipc -> 'ipc:[4026532497]'
lrwxrwxrwx 1 root root 0 May 6 09:08 mnt -> 'mnt:[4026532563]'
lrwxrwxrwx 1 root root 0 May 6 09:08 net -> 'net:[4026532500]'
lrwxrwxrwx 1 root root 0 May 6 09:08 pid -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May 6 09:08 pid_for_children -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May 6 09:08 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 6 09:08 uts -> 'uts:[4026532564]'
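You can see that the cgroup, ipc, net, pid and user namespaces have the same IDs for both processes, so they are shared here, while mnt and uts differ. This result applies only to this test case.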
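The comparison done by eye above can be automated. The following sketch assumes Linux and permission to read the target processes' /proc entries; the helper names same_namespace and shares_namespace_with_self are hypothetical. Two processes are in the same namespace when the files under /proc/<pid>/ns resolve to the same device and inode, which is the number shown in brackets by ls -l.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if processes a and b are in the same
 * namespace of kind ns ("pid", "net", "mnt", ...), 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *ns) {
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    assert(ns != NULL);
    snprintf(path_a, sizeof path_a, "/proc/%ld/ns/%s", (long)a, ns);
    snprintf(path_b, sizeof path_b, "/proc/%ld/ns/%s", (long)b, ns);

    /* stat() follows the symlink to the namespace object itself. */
    if (stat(path_a, &st_a) < 0 || stat(path_b, &st_b) < 0)
        return -1;
    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}

/* Convenience wrapper: compares another process with the caller. */
int shares_namespace_with_self(pid_t other, const char *ns) {
    return same_namespace(getpid(), other, ns);
}
```

Run as root on the host, same_namespace(140756, 140879, "net") would be expected to return 1 and same_namespace(140756, 140879, "mnt") to return 0, matching the listing above.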
Kill the Pause container
Test how k8s handles the Pod if you kill the Pause container. Again build the environment with minikube, and look at the Pod definition file first.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  shareProcessNamespace: false
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
    stdin: true
    tty: true
After the Pod starts, find the pause process ID, kill it, and then check the Pod events.
~$ kubectl describe pod nginx
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 3m1s (x2 over 155m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3m1s (x2 over 155m) kubelet Stopping container nginx
Normal Killing 3m1s (x2 over 155m) kubelet Stopping container shell
Normal Pulling 2m31s (x3 over 156m) kubelet Pulling image "nginx"
Normal Pulling 2m28s (x3 over 156m) kubelet Pulling image "busybox"
Normal Created 2m28s (x3 over 156m) kubelet Created container nginx
Normal Started 2m28s (x3 over 156m) kubelet Started container nginx
Normal Pulled 2m28s kubelet Successfully pulled image "nginx" in 2.796081224s
Normal Created 2m25s (x3 over 156m) kubelet Created container shell
Normal Started 2m25s (x3 over 156m) kubelet Started container shell
Normal Pulled 2m25s kubelet Successfully pulled image "busybox" in 2.856292466s
k8s recreates the Pod sandbox and restarts the Pod's containers when it detects that the pause container is in an abnormal state. This is not hard to understand: whether or not the PID namespace is shared, once the infra container exits the Pod must be restarted, since the Pod's life cycle is tied to that of the infra container.