Problem description

Users can communicate between processes through shared memory (the opposite of Go's "share memory by communicating" philosophy). On a KVM or physical machine, the shared memory available to users is roughly half of the total memory. Below is the shared memory on my PVE machine; /dev/shm is the shared memory mount.

root@pve:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           47Gi        33Gi       3.4Gi       1.7Gi       9.7Gi        10Gi
Swap:            0B          0B          0B
root@pve:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   24G     0   24G   0% /dev
tmpfs                  24G   54M   24G   1% /dev/shm

Here is a pod on the cluster. We can see that /dev/shm is only 64MB, and when writing data to shared memory with dd, the write fails with "No space left on device" as soon as it reaches 64MB.

$ kubectl run centos --rm -it --image centos:7 bash
[root@centos-7df8445565-dvpnt /]# df -h|grep shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M     0   64M   0% /dev/shm
[root@centos-7df8445565-dvpnt /]# dd if=/dev/zero of=/dev/shm/test      bs=1024 count=102400
dd: error writing '/dev/shm/test': No space left on device
65537+0 records in
65536+0 records out
67108864 bytes (67 MB) copied, 0.161447 s, 416 MB/s
[root@centos-7df8445565-dvpnt /]# ls -l -h
total 64M
-rw-r--r-- 1 root root 64M Nov 14 14:33 test
[root@centos-7df8445565-dvpnt /]# df -h|grep shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   64M     0 100% /dev/shm

Background knowledge 1: shared memory

Shared memory is a mechanism used for inter-process communication (IPC) on Linux.

We know that inter-process communication can also use pipes, sockets, signals, semaphores, message queues, and so on, but these methods require copying data between user space and kernel space (commonly counted as four copies for one exchange). Shared memory, in contrast, maps the same memory directly into the address space of multiple processes, so they access the same block of memory without going through the kernel, which is theoretically faster.

Shared memory is usually used in combination with semaphores to achieve synchronization and mutual exclusion between processes.
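
As a rough illustration (a sketch of the usual pattern, not code from a specific project), a POSIX named semaphore can act as a mutex around the shared region; the semaphore name "/shm_mutex" below is an arbitrary assumption, and both the writer and the reader would open the same name.

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main()
{
    /* example name "/shm_mutex"; initial value 1 means "unlocked" */
    sem_t *mutex = sem_open("/shm_mutex", O_CREAT, 0666, 1);
    if (mutex == SEM_FAILED) {
        perror("sem_open failed");
        return -1;
    }

    sem_wait(mutex);    /* lock before touching the shared memory region */
    /* ... read or write the shared memory here ... */
    sem_post(mutex);    /* unlock so that other processes can proceed */

    sem_close(mutex);
    return 0;
}

(Compile with -pthread; on older glibc, POSIX semaphores also need -lrt when linking.)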

There are two APIs for using shared memory, System V and POSIX:

  • shmget/shmat/shmdt for the System V API
  • shm_open/mmap/shm_unlink for the POSIX API

System V

The System V mechanism has a long history, so let's look at a System V sample first.

Here is the process that performs the write operation:

#include <iostream> 
#include <sys/ipc.h> 
#include <sys/shm.h> 
#include <stdio.h> 
using namespace std; 
  
int main() 
{ 
    // ftok to generate unique key 
    key_t key = ftok("shmfile",65); 
  
    // shmget returns an identifier in shmid 
    int shmid = shmget(key,1024,0666|IPC_CREAT); 
  
    // shmat to attach to shared memory 
    char *str = (char*) shmat(shmid,(void*)0,0); 
  
    cout << "Write Data : "; 
    // gets() is unsafe (and removed in C11); read at most the segment size instead
    fgets(str, 1024, stdin); 
  
    printf("Data written in memory: %s\n",str); 
      
    //detach from shared memory  
    shmdt(str); 
  
    return 0; 
} 

Here is the code that performs the read operation. You can see that the reading process locates the same shared memory segment via the same key and reads its contents.

#include <iostream> 
#include <sys/ipc.h> 
#include <sys/shm.h> 
#include <stdio.h> 
using namespace std; 
  
int main() 
{ 
    // ftok to generate unique key 
    key_t key = ftok("shmfile",65); 
  
    // shmget returns an identifier in shmid 
    int shmid = shmget(key,1024,0666|IPC_CREAT); 
  
    // shmat to attach to shared memory 
    char *str = (char*) shmat(shmid,(void*)0,0); 
  
    printf("Data read from memory: %s\n",str); 
      
    //detach from shared memory  
    shmdt(str); 
    
    // destroy the shared memory 
    shmctl(shmid,IPC_RMID,NULL); 
     
    return 0; 
} 

POSIX

The POSIX interface is more convenient and easier to use: shm_open("abc", ...) is effectively equivalent to open("/dev/shm/abc", ...), so shared memory is handled through file operations; the file is then mapped into the process with mmap, and the shared memory is accessed through the returned pointer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    /* the size (in bytes) of shared memory object */
    long size;
    printf("Size of shm(in bytes): ");
    scanf("%ld", &size);
    printf("size %ld\r\n", size);

    /* name of the shared memory object */
    const char* name = "test";

    /* shared memory file descriptor */
    int shm_fd;

    /* pointer to shared memory obect */
    void* ptr;

    /* create the shared memory object */
    shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (shm_fd == -1) {
        perror("shm_open failed");
        return -1;
    }

    /* configure the size of the shared memory object */
    if (ftruncate(shm_fd, size) == -1) {
        perror("ftruncate failed");
        return -1;
    }

    /* memory map the shared memory object */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap failed");
        return -1;
    }

    memset(ptr, 0, size);
    pause();
    close(shm_fd);
    shm_unlink(name);
    return 0;
}
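
The block above only shows the creating/writing side. For completeness, here is a minimal reader sketch under the same assumptions (same object name "test"); it queries the object's size with fstat, maps it read-only, and prints the beginning of the mapping. As with the writer, link with -lrt on older glibc.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    /* must match the name used by the writer above */
    const char* name = "test";

    /* open the existing shared memory object read-only */
    int shm_fd = shm_open(name, O_RDONLY, 0666);
    if (shm_fd == -1) {
        perror("shm_open failed");
        return -1;
    }

    /* query the size that the writer configured with ftruncate */
    struct stat st;
    if (fstat(shm_fd, &st) == -1) {
        perror("fstat failed");
        return -1;
    }

    /* map it read-only and print the start of its contents */
    void* ptr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap failed");
        return -1;
    }

    printf("Data read from memory: %s\n", (char*)ptr);

    munmap(ptr, st.st_size);
    close(shm_fd);
    return 0;
}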

Background knowledge 2: kubernetes shared memory

Pods created by Kubernetes get 64MB of shared memory by default, and there is no Pod-level field to change it.

Why this value? Kubernetes itself does not set the shared memory size at all; 64MB is simply Docker's default.

When running Docker directly, you can set the shared memory size with --shm-size.

$ docker run  --rm centos:7 df -h |grep shm
shm              64M     0   64M   0% /dev/shm
$ docker run  --rm --shm-size 128M centos:7 df -h |grep shm
shm             128M     0  128M   0% /dev/shm

However, Kubernetes does not expose a way to set the shm size. In this issue the community debated for a long time whether to add a parameter for shm, but no conclusion was reached; the only workaround is to mount a memory-backed emptyDir at /dev/shm.

kubernetes empty dir

The emptyDir volume is useful in scenarios such as sharing data between containers in a Pod: the same emptyDir can be mounted into multiple containers. In that case the emptyDir is simply a directory on the host, which is still essentially disk (e.g. collecting logs through a sidecar).

Kubernetes also provides a special kind of emptyDir whose medium is memory. Users can mount a memory-backed emptyDir at any path and then use that directory as high-performance, in-memory temporary storage.

Naturally, we can mount such a memory-backed emptyDir at /dev/shm, which solves the problem of /dev/shm being too small.

Here is an example.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: xxx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: xxx
  template:
    metadata:
      labels:
        app: xxx
    spec:
      containers:
      - args:
        - infinity
        command:
        - sleep
        image: centos:7
        name: centos
        volumeMounts:
        - mountPath: /dev/shm
          name: cache-volume
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 128Mi
        name: cache-volume

After applying this on Kubernetes, we enter the container and run a few commands to check whether /dev/shm has been enlarged, whether usage stays within the sizeLimit, and what happens when the sizeLimit is exceeded.

$ kubectl exec -it xxx-f6f865646-l54tq bash
[root@xxx-f6f865646-l54tq /]# df -h |grep shm
tmpfs           3.9G     0  3.9G   0% /dev/shm
[root@xxx-f6f865646-l54tq /]# dd if=/dev/zero of=/dev/shm/test bs=1024 count=204800
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 0.366529 s, 572 MB/s
[root@xxx-f6f865646-l54tq /]# ls -l -h /dev/shm/test
-rw-r--r-- 1 root root 200M Dec  4 14:52 /dev/shm/test
[root@xxx-f6f865646-l54tq /]# command terminated with exit code 137
$ kubectl get pods -o wide |grep xxx
xxx-f6f865646-8ll9p                       1/1     Running   0          10s     10.244.7.110    optiplex-1   <none>           <none>
xxx-f6f865646-l54tq                       0/1     Evicted   0          4m59s   <none>          optiplex-3   <none>           <none>

As the log above shows, files larger than 64MB can now be written to /dev/shm; however, if the data written exceeds the sizeLimit, the Pod is evicted by the kubelet after a short delay (1-2 minutes). The eviction is not "immediate" because the kubelet checks usage periodically, so there is a time lag.

Disadvantages of using empty dir

Memory-backed temporary storage as configured above removes the 64MB cap on shared memory, but it has drawbacks:

  • It cannot stop a user from over-consuming memory in time. Although the kubelet evicts the Pod after 1-2 minutes, the node is at risk during that window.
  • It distorts Kubernetes scheduling, because the emptyDir does not count toward the Pod's resources; Pods can "secretly" use node memory without the scheduler knowing about it.
  • Users do not find out in time that memory has run out.

Regarding the second point: as shown below, since no request or limit is set in the Deployment, the node shows the Pod's memory request and limit as 0, and the memory-backed temporary storage configured earlier is not reflected here either. This misleads the scheduler, which ends up overestimating the memory actually available on the node.

$ kubectl describe nodes optiplex-1
...
Non-terminated Pods:         (11 in total)
  Namespace                  Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                      ------------  ----------  ---------------  -------------  ---
  default                    xxx-f6f865646-8ll9p                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m3s

To address these shortcomings, we first need to go back to cgroups.

cgroup limits

cgroup limits for memory

The following is a small C program. It allocates a number of memory blocks of a given size and memsets each block to 0. Note that the memset is required: otherwise malloc only "claims" to have allocated the memory, and the kernel does not yet map the virtual addresses to physical pages.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

int main () {
   int i, n, size;

   printf("Number of calloc: ");
   scanf("%d", &n);
   printf("Size of calloc(in bytes): ");
   scanf("%d", &size);

   int** x = calloc(n, sizeof(int*));
   for (i=0; i< n; i++) {
     x[i] = (int*)malloc(size);
     if (x[i] == NULL) {
       perror("malloc failed");
       return -1;
     }
     memset(x[i], 0, size);
     printf("calloc %d bytes memory.\r\n", size);
   }
   printf("done\r\n");

   printf("pause\r\n");
   pause();
}

We set a memory limit on the container and, after startup, copy the calloc binary compiled above into the container.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: xxx-cgroup-calloc
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: xxx
  template:
    metadata:
      labels:
        app: xxx
    spec:
      containers:
      - args:
        - infinity
        command:
        - sleep
        image: centos:7
        name: centos
        resources:
          limits:
            memory: 256Mi

After the container starts, we enter it and look at the current cgroup configuration.

[root@xxx-cgroup-calloc-5b998f5897-b7xn6 memory]# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
268435456
[root@xxx-cgroup-calloc-5b998f5897-b7xn6 memory]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
2056192

As you can see, the memory limit configured in the cgroup (memory.limit_in_bytes) is 256MB, and the currently used memory (memory.usage_in_bytes) is about 2MB, because the container runs a sleep process, bash processes and so on, which consume some memory.

Let's start by allocating 64MB of memory with the calloc program above.

In console 1, allocate 64MB and leave the calloc process running.

[root@xxx-cgroup-calloc-5b998f5897-b7xn6 memory]# /bin/calloc
Number of calloc: 1
Size of calloc(in bytes): 67108864
calloc 67108864 bytes memory.
done
pause

Check the current used memory in console 2.

[root@xxx-cgroup-calloc-5b998f5897-b7xn6 memory]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
69464064

As you can see, the used memory increases by about 67108864 bytes.

So when does the cgroup limit actually kick in?

We stop the current calloc on console 1 and request 256MB of memory again.

[root@xxx-cgroup-calloc-5b998f5897-b7xn6 memory]# /bin/calloc
Number of calloc: 1
Size of calloc(in bytes): 268435456
Killed

As you can see, the calloc process is killed directly. Who killed it?

Let's go to the node where the Pod is running and look at the system logs; there we can find traces of the OOM killer at work.

Dec  4 15:24:33 optiplex-2 kernel: [6686782.276093] calloc invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=968
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276094] calloc cpuset=beead52e53d4918e7b2820c2712151f2833e4055bc72abc9ea287a02459fc55a mems_allowed=0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276097] CPU: 0 PID: 28285 Comm: calloc Not tainted 4.15.0-63-generic #72-Ubuntu
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276097] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276098] Call Trace:
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276103]  dump_stack+0x63/0x8e
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276104]  dump_header+0x71/0x285
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276106]  oom_kill_process+0x21f/0x420
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276107]  out_of_memory+0x2b6/0x4d0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276108]  mem_cgroup_out_of_memory+0xbb/0xd0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276109]  mem_cgroup_oom_synchronize+0x2e8/0x320
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276111]  ? mem_cgroup_css_online+0x40/0x40
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276112]  pagefault_out_of_memory+0x36/0x7b
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276114]  mm_fault_error+0x90/0x180
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276115]  __do_page_fault+0x46b/0x4b0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276116]  ? __schedule+0x256/0x880
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276117]  do_page_fault+0x2e/0xe0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276118]  ? async_page_fault+0x2f/0x50
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276120]  do_async_page_fault+0x51/0x80
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276121]  async_page_fault+0x45/0x50
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276122] RIP: 0033:0x7f129f50dfe0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276123] RSP: 002b:00007ffcbafce968 EFLAGS: 00010206
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276124] RAX: 00007f128f47e010 RBX: 0000563b4b7df010 RCX: 00007f129f1cb000
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276124] RDX: 00007f129f47e000 RSI: 0000000000000000 RDI: 00007f128f47e010
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276125] RBP: 00007ffcbafce9a0 R08: ffffffffffffffff R09: 0000000010000000
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276125] R10: 0000000000000022 R11: 0000000000001000 R12: 0000563b49a997c0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276125] R13: 00007ffcbafcea80 R14: 0000000000000000 R15: 0000000000000000
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276126] Task in /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8/beead52e53d4918e7b2820c2712151f2833e4055bc72abc9ea287a02459fc55a killed as a result of limit of /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276129] memory: usage 262144kB, limit 262144kB, failcnt 131
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276130] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276130] kmem: usage 1704kB, limit 9007199254740988kB, failcnt 0
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276131] Memory cgroup stats for /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276136] Memory cgroup stats for /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8/d081810d182504b5c3c38e0a28e16467385abda8553e286948f6cc5b3849ca0b: cache:0KB rss:40KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276141] Memory cgroup stats for /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8/beead52e53d4918e7b2820c2712151f2833e4055bc72abc9ea287a02459fc55a: cache:0KB rss:260400KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:260396KB inactive_file:0KB active_file:0KB unevictable:0KB
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276145] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276203] [ 8111]     0  8111      256        1    32768        0          -998 pause
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276204] [ 8255]     0  8255     1093      146    57344        0           968 sleep
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276205] [14119]     0 14119     2957      710    65536        0           968 bash
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276206] [21462]     0 21462     2958      690    69632        0           968 bash
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276207] [28285]     0 28285    66627    65119   577536        0           968 calloc
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276208] Memory cgroup out of memory: Kill process 28285 (calloc) score 1955 or sacrifice child
Dec  4 15:24:33 optiplex-2 kernel: [6686782.276513] Killed process 28285 (calloc) total-vm:266508kB, anon-rss:259280kB, file-rss:1196kB, shmem-rss:0kB
Dec  4 15:24:33 optiplex-2 kernel: [6686782.289037] oom_reaper: reaped process 28285 (calloc), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We usually see the OOM killer kill processes on memory-strapped servers, but the OOM killer also works per cgroup; in the example above the cgroup is /kubepods/burstable/pod465a9fc1-0de9-44cd-b22c-7e6b23fd75f8. When a page fault (do_page_fault) cannot be satisfied within this cgroup's limit, the OOM killer walks all processes in the cgroup and scores them. The scoring takes oom_score_adj into account: it is a priority adjustment, and the smaller the value, the less likely the process is to be killed.

From the logs we can see that, except for the pause process (the sandbox container that holds the Pod's cgroups and namespaces, with oom_score_adj of -998, the highest priority), all other processes have 968, so the score is determined mostly by how much memory each process uses.

Since the calloc process uses the most memory, it gets the highest score (1955), so the OOM killer kills it; that is the "Killed" we saw when running calloc above.
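
If you want to inspect these values yourself, the kernel exposes them per process under /proc. Below is a minimal sketch (not from the original article) that prints a process's oom_score and oom_score_adj; the PID used is just an example and can be any process visible inside the container.

#include <stdio.h>

/* Print the OOM score and its adjustment for a given PID by reading /proc. */
static void print_oom_info(int pid)
{
    char path[64];
    char buf[32];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/oom_score", pid);
    f = fopen(path, "r");
    if (f != NULL && fgets(buf, sizeof(buf), f) != NULL)
        printf("pid %d oom_score:     %s", pid, buf);
    if (f != NULL) fclose(f);

    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", pid);
    f = fopen(path, "r");
    if (f != NULL && fgets(buf, sizeof(buf), f) != NULL)
        printf("pid %d oom_score_adj: %s", pid, buf);
    if (f != NULL) fclose(f);
}

int main(void)
{
    /* PID 1 inside the container is its init process (the sleep process in this example) */
    print_oom_info(1);
    return 0;
}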

This is how kubernetes’ memory limit works.

Is shared memory limited by cgroup?

Back to our question.

Since ordinary memory is limited by the cgroup, a natural question arises: is shared memory limited by the cgroup as well? If it is, then we only need to set a memory limit on the Pod, and Kubernetes does not need to care whether the user spends that memory on shared memory or on ordinary allocations: shared memory would not be a "standalone" resource, it would simply be counted inside the memory limit used for scheduling, and so it would not distort scheduling.

Let's add a memory limit to the xxx Deployment and continue testing. Since POSIX shared memory is just files under /dev/shm, we can test it directly with dd.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: xxx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: xxx
  template:
    metadata:
      labels:
        app: xxx
    spec:
      containers:
      - args:
        - infinity
        command:
        - sleep
        image: centos:7
        name: centos
        resources:
          limits:
            memory: 256Mi
        volumeMounts:
        - mountPath: /dev/shm
          name: cache-volume
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 128Mi
        name: cache-volume

We enter the container, create and then delete a 100MB file in /dev/shm with dd, and watch memory.usage_in_bytes throughout to see how shared memory usage is reflected in the cgroup.

$ kubectl exec -it xxx-789569d8c8-8pls5 bash
[root@xxx-789569d8c8-8pls5 /]# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
268435456
[root@xxx-789569d8c8-8pls5 /]# df -h |grep shm
tmpfs           3.9G     0  3.9G   0% /dev/shm
[root@xxx-789569d8c8-8pls5 /]# dd if=/dev/zero of=/dev/shm/test      bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.156172 s, 671 MB/s
[root@xxx-789569d8c8-8pls5 /]# ls -l -h /dev/shm/test
-rw-r--r-- 1 root root 100M Dec  4 15:43 /dev/shm/test
[root@xxx-789569d8c8-8pls5 /]# df -h |grep sh
tmpfs           3.9G  100M  3.8G   3% /dev/shm
[root@xxx-789569d8c8-8pls5 /]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
106643456
[root@xxx-789569d8c8-8pls5 /]# rm /dev/shm/test
rm: remove regular file '/dev/shm/test'? y
[root@xxx-789569d8c8-8pls5 /]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1654784
[root@xxx-789569d8c8-8pls5 /]# dd if=/dev/zero of=/dev/shm/test      bs=1024 count=409600
Killed
[root@xxx-789569d8c8-8pls5 /]# ls
bash: /usr/bin/ls: Argument list too long

Note that when we finally try to create a file larger than the memory limit, the dd process is killed, which is again consistent with the cgroup limit; but this leads to a new problem: with the memory exhausted by shared memory, no further commands can be executed in the container.

Don't worry, though: since the shared memory usage exceeds the configured sizeLimit, the kubelet evicts the Pod after 1-2 minutes and a new Pod is created, which runs normally.

So, following the convention on KVM and physical machines, we can always set sizeLimit to 50% of the memory limit; even if users abuse shared memory and run into the memory limit, the kubelet will detect it and evict the Pod.

Final design

Based on the above analysis, the "shared memory" feature of a Kubernetes-based platform can be designed as follows (this is exactly what the last Deployment manifest above does):

  1. Mount a memory-backed emptyDir at /dev/shm.
  2. Configure the container's memory limit.
  3. Set the emptyDir's sizeLimit to 50% of the memory limit.

Further, this configuration can be injected automatically by an admission controller, so users never need to explicitly "request shared memory".