This article introduces a set of Kubernetes configuration practices that I have summarized as my personal “best practices” while using Kubernetes. Most of the content has been validated in a production environment, but a few parts have only been simulated in my head, so please treat them with caution.
A few notes before reading.
- This document is quite long and covers a lot of content, so I recommend using it as a reference manual: skim it first with the help of the table of contents, then read the relevant parts in detail as needed.
- This document requires some Kubernetes fundamentals to understand; if you have no hands-on experience, it may read as rather dry.
- People with hands-on experience may have different opinions than I do, so feel free to comment.
I will update this document from time to time as appropriate.
0. Examples
First of all, here are the premises this article is based on. They only fit the scenarios I have encountered, so apply them flexibly:
- Only stateless services are discussed here, stateful services are out of scope
- Instead of using the Deployment’s rolling-update capability, we create a separate Deployment + HPA + PodDisruptionBudget for each version of each service; this makes canary/grayscale releases easier
- Our services may use an IngressController / Service Mesh for load balancing and traffic splitting
Here’s a demo of Deployment + HPA + PodDisruptionBudget, which will be broken down later for more details.
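Roughly, such a set of resources might look like the following sketch (names, labels, image and numbers are illustrative placeholders; apiVersions may differ depending on your cluster version):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3            # one Deployment per service version
  labels:
    app: my-app
    version: v3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: v3
  template:
    metadata:
      labels:
        app: my-app
        version: v3
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v3
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 1Gi
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # target: average CPU at 60% of requests
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-v3
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: my-app
      version: v3
```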
1. Graceful Shutdown and 502/504 errors
If a Pod is serving a large number of requests (e.g. 1000+ QPS) and gets rescheduled because of a node failure or because a spot (“bid”) node is reclaimed, you may observe a small number of 502/504 errors during the period when its containers are being terminated.
To understand this, you need to first understand the process of terminating a Pod.
1. The Pod’s status is set to `Terminating`, and (almost) simultaneously the Pod is removed from all associated Service Endpoints.
2. The `preStop` hook is executed, either as a command or as an HTTP call to a container in the Pod.
   - Consider using `preStop` if your application cannot exit gracefully on its own when it receives a SIGTERM signal.
   - If it is too much trouble to make the program itself support graceful exit, `preStop` is a good way to achieve it.
3. SIGTERM is sent to all containers in the Pod.
4. Kubernetes keeps waiting until the time set by `spec.terminationGracePeriodSeconds` is exceeded; the default value is 30s.
   - Note that this graceful-exit countdown runs concurrently with `preStop`, and it does not wait for `preStop` to finish either!
5. If the containers have not stopped after `spec.terminationGracePeriodSeconds`, k8s sends a SIGKILL signal to them.
6. After all processes are terminated, the entire Pod is completely cleaned up.

Note: steps 1 and 2 happen asynchronously, so a situation where “the Pod is still in the Service Endpoints, but `preStop` has already been executed” is possible, and we need to account for it.
After understanding the above flow, we can analyze the reasons behind the two error codes:
- 502: the application exits immediately after receiving SIGTERM, so requests that were still in flight are cut off, and the proxy layer returns 502 to indicate this
- 504: removal from the Service Endpoints is not timely enough; after the Pod has been terminated, a few requests are still routed to it and never get a response, resulting in 504
The usual solution is to add a 15s wait to the Pod’s `preStop` step. The rationale: once the Pod enters the Terminating state, it is removed from the Service Endpoints and stops receiving new requests. Waiting 15s in `preStop` essentially guarantees that all in-flight requests are processed before the container exits (in general, most requests complete within 300ms).
A simple example, which makes the Pod always wait 15s before SIGTERM is sent to its container when it is terminated:
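A minimal sketch of such a hook (container name and image are placeholders; the 15s matches the reasoning above):

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:v3
    lifecycle:
      preStop:
        exec:
          # keep serving for 15s after the Pod leaves the Service Endpoints,
          # so that in-flight requests can complete before SIGTERM arrives
          command: ["/bin/sleep", "15"]
```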
A better solution is to simply wait until all TCP connections are closed (this requires netstat in the image):
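A rough sketch of that idea, assuming a shell and `netstat` are available in the image (the port 8080 and the 1s polling interval are placeholders):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        # wait until no ESTABLISHED connections remain on the service port,
        # checking once per second
        - "while [ $(netstat -nt | grep ESTABLISHED | grep -c ':8080') -gt 0 ]; do sleep 1; done"
```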
What if my service is also using Sidecar to proxy network requests?
Using the Istio service mesh as an example, the 502/504 issue gets a little more complicated when Envoy proxies the Pod’s traffic: you also have to consider the order in which the sidecar and the main container shut down.
- If a new request comes in after Envoy has been shut down, it results in a 504 (nothing is left to respond to the request)
  - So Envoy should be shut down at least 3s after entering Terminating, to make sure the Istio mesh configuration is fully updated
- If the main container is shut down before Envoy stops and a new request comes in, Envoy cannot connect to the upstream, resulting in a 503
  - So the main container should also be shut down at least 3s after entering Terminating
- If Envoy or the main container stops while the main container is still processing in-flight requests, those TCP connections are cut off directly, resulting in 502
  - Therefore Envoy must only be closed after the main container has finished processing its in-flight requests (that is, when there are no TCP connections left)

So to summarize: both Envoy’s and the main container’s `preStop` must wait at least 3s and exit only when there are no remaining TCP connections, in order to avoid 502/503/504.
The modification for the main container has already been described above; the following describes how to modify Envoy.
Like the main container, Envoy can also be given a `preStop` directly: modify the `istio-sidecar-injector` ConfigMap and add a `preStop` sleep command to the sidecar template:
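A hedged sketch of that fragment; the exact structure of the `istio-sidecar-injector` ConfigMap template varies between Istio versions, so treat this as illustrative only:

```yaml
# inside the sidecar template of the istio-sidecar-injector ConfigMap,
# add a lifecycle hook to the injected istio-proxy container:
containers:
  - name: istio-proxy
    # ... image, args and other injected fields omitted ...
    lifecycle:
      preStop:
        exec:
          # wait a few seconds so the mesh configuration is updated and the
          # main container has drained its connections before Envoy exits
          command: ["/bin/sh", "-c", "sleep 5"]
```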
Reference
- Kubernetes best practices: terminating with grace
- Graceful shutdown in Kubernetes is not always trivial
2. Service Scaling Configuration - HPA
Kubernetes natively supports scaling based on Pod CPU usage, which is the most widely used scaling metric; it requires metrics-server to be deployed in the cluster.
To start with, review the example of scaling based on Pod CPU usage given earlier.
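A minimal sketch of such an HPA (the apiVersion, names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # keep average CPU at ~60% of requests
```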
1. How the current metric value is calculated
To summarize in advance: each Pod’s metric is the sum of the metrics of all its containers, and when a percentage is calculated, it is divided by the sum of the Pod’s requests.
By default HPA uses the Pod’s current metrics for the calculation. Taking CPU utilization as an example, the formula is:
Pod CPU utilization = sum(CPU usage of all containers in the Pod) / sum(CPU requests of all containers in the Pod) * 100%
Note that the denominator is the sum of requests, not limits.
1.1 Problems and solutions
This is not a problem when the Pod has only one container, but it becomes one when a sidecar such as Envoy is injected into the Pod.
Istio’s sidecar requests `100m` (0.1 core) of CPU by default. Without tuning, the sidecar’s actual usage can easily reach 0.2 - 0.4 cores when the service load is high. Substituting these two values into the formula above, we find that for a high-QPS service with a sidecar, the “CPU utilization of the Pod” may end up much higher than the “CPU utilization of the application container”, triggering unnecessary scale-ups.
Even if we scale on the Pod’s absolute CPU usage instead of the percentage, the problem remains.
Solutions:
- Method 1: set different requests/limits for each service’s sidecar according to that service’s CPU usage.
  - I think this solution is too troublesome.
- Method 2: use a third-party component like KEDA to get the CPU utilization of the application container alone (excluding the sidecar) and scale on that.
- Method 3: use the alpha feature introduced in k8s 1.20: Container Resource Metrics.
2. HPA’s scaling algorithm
It is easy to understand when HPA scales up. Its scale-down strategy, however, causes some confusion, so here is a brief analysis.
- HPA’s “target metric” can be specified in two forms: an absolute value and a resource utilization percentage.
  - Absolute value: for CPU, this means the CPU usage itself (e.g. 500m)
  - Resource utilization (resource usage / resource request * 100%): when the Pods set resource requests, you can scale on resource utilization
- HPA’s “current metric” is the average over all Pods within a time window, not the peak.
The scaling algorithm of HPA is:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
From this formula we can see that:
- Whenever the “current metric” exceeds the target metric, a scale-up will definitely happen.
- To trigger a scale-down, `current metric / target metric` has to be small enough. For example, with two replicas the ratio must be less than or equal to 1/2 before scaling down to a single replica; with three replicas the threshold is 2/3; with five replicas it is 4/5, with 100 replicas it is 99/100, and so on.
- If `current metric / target metric` drops from 1 to 0.5, the number of replicas is halved (although the more replicas there are, the less likely such a large swing is).
- The larger the value of `current replica count / target metric`, the greater the impact a fluctuation in the “current metric” has on the “desired replica count”.
In order to prevent overly sensitive scaling, HPA also has several delay-related parameters:
- HPA loop delay: 15 seconds by default, i.e. an HPA evaluation is performed every 15 seconds (`--horizontal-pod-autoscaler-sync-period`)
- `--horizontal-pod-autoscaler-cpu-initialization-period`: the window after a Pod starts during which CPU samples may be skipped; 5 minutes by default
- Scale-down cooldown (`--horizontal-pod-autoscaler-downscale-stabilization`): 5 minutes by default
3. What is an appropriate HPA target value
This needs to be analyzed case by case for each service.
Taking the most common case, scaling by CPU, as an example:
- Core services
  - requests/limits values: it is recommended to set them equal, so the Pod’s QoS class is Guaranteed
    - Note that CPU and Memory limits behave differently: CPU is genuinely throttled at the limit, while exceeding the Memory limit gets the container OOMKilled
    - k8s has been using cgroups v1 (`cpu_shares` / `memory.limit_in_bytes`) to limit cpu/memory, but even for `Guaranteed` Pods memory is not fully reserved, so resource contention is always possible. 1.22 has an alpha feature that uses cgroups v2, so keep an eye on that.
  - HPA: generally speaking, a target of 60% to 70% may be appropriate, and a minimum replica count of 2 to 5 is recommended (FYI)
  - PodDisruptionBudget: it is recommended to set the PDB according to the service’s robustness and the HPA target; this is covered in detail later, so we skip it here
- Non-core services
  - requests/limits values: it is recommended to set requests to 0.6 - 0.9 times the limits (for reference only), which corresponds to the Burstable QoS class
    - The main consideration is that many non-core services have very low load and never run anywhere near their limits, so lowering requests improves cluster resource utilization without compromising service stability
  - HPA: because requests are lowered and HPA treats requests as 100% utilization, we can raise the HPA target (if a percentage target is used), e.g. 80% to 90%, and a minimum replica count of 1 to 3 is recommended (for reference only)
  - PodDisruptionBudget: for non-core services it is enough to guarantee a minimum of 1 replica
4. Frequently Asked Questions about HPA
4.1. Pod scaling - warm-up pitfalls
Warm-up: for languages like Java/C# that run on a virtual machine, the first time certain code paths are used, some resources need to be initialized, for example through JIT compilation on the fly. If the code also uses dynamic class loading or similar features, a microservice is likely to respond particularly slowly the first time certain APIs are called (because classes are compiled on demand). Therefore, a Pod needs to warm up these interfaces and initialize the resources they need in advance, before it starts serving traffic.
Under high load, HPA automatically scales up. But if the newly created Pod needs warming up, it may run into the “warm-up trap”.
When a large number of users are accessing the service, any request forwarded to the newly created Pod gets “stuck” there, regardless of the load-balancing policy. If requests arrive fast enough, more and more of them pile up on the new Pod at the moment it starts, which can overwhelm and crash it. The Pod is then crushed again as soon as it restarts, entering a CrashLoopBackOff loop.
The effect is even more pronounced when load-testing with multiple threads: with 50 threads requesting non-stop, the response time of the other Pods is in milliseconds, while the first responses of the new Pod take seconds. Almost instantly, all 50 threads are trapped in the new Pod, which is at its most vulnerable right after startup, and the 50 concurrent requests can overwhelm it immediately.
Solutions:
This can be solved at the “application level”:
- In the controller behind the startup probe API, call all interfaces that need warming up (or initialize all required resources by other means) before reporting ready. The controller can call the service’s own interfaces through the `localhost` loopback address.
- Use AOT (ahead-of-time) compilation: warm-up problems are usually caused by JIT (just-in-time) compilation, which compiles code when it is first needed. AOT compiles ahead of use, so it avoids the warm-up problem.
It can also be solved at the “infrastructure level”:
- Cloud load balancers such as AWS ALB TargetGroup usually let you set a `slow_start` duration: for new instances, traffic is shifted over gradually during that window until the expected load-balancing state is reached. This solves the service warm-up problem.
- Envoy also supports a `slow_start` mode, which slowly ramps traffic up to a new instance within a configured time window to achieve the same warm-up effect.
4.2. HPA scaling is too sensitive and leads to Pod oscillation
In general, most workloads on EKS should be scaled on CPU, because CPU usually reflects the service’s load well.
However, some services have other factors that affect CPU usage that make scaling with CPU less reliable, such as
- Some Java services have large heap memory settings and long GC pause settings, so memory GC can cause intermittent CPU spikes and CPU monitoring can have a lot of spikes.
- Some services have timed tasks, and the CPU will go up when the timed tasks run, but this is not related to the QPS of the service.
- Some services may immediately be in a high state as soon as the CPU is running, and it may want to use other business-side metrics for scaling instead of CPU.
Because of the problems above, scaling on CPU may cause such a service to scale up and down frequently, or to keep scaling out indefinitely. Some services (such as our “recommendation service”) are sensitive to both scale-ups and scale-downs, and every scaling event causes availability jitter.
For such services, the following HPA tuning strategies are available:
- Scale on a relatively smooth metric such as QPS, which is free of GC interference; this requires a community component like KEDA.
- On Kubernetes 1.18+, you can directly use HPA’s `behavior.scaleDown` and `behavior.scaleUp` parameters to control the maximum number of Pods (or ratio) added or removed in each scaling step. An example is shown below.
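A hedged sketch of such a `behavior` section (the policy numbers and windows are illustrative; adjust them to your service):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  # ... scaleTargetRef / minReplicas / maxReplicas / metrics omitted ...
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent        # add at most 50% more Pods per minute
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods           # remove at most 1 Pod per minute
          value: 1
          periodSeconds: 60
```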
For the problem of scaling not being smooth enough, consider providing a feature similar to AWS ALB TargetGroup’s `slow_start`: when scaling up, shift traffic to the new Pod gradually so the service can warm up (JVM warm-up and local cache warm-up), which yields a much smoother scale-up.
5. HPA Cautions
Note that kubectl versions below 1.23 query the HPA configuration through `hpa.v1.autoscaling` by default, and the `v2beta2`-specific parameters are encoded in `metadata.annotations`.
For example, `behavior` is encoded as the value of the key `autoscaling.alpha.kubernetes.io/behavior`.
So if you are using a v2beta2 HPA, be sure to explicitly specify the `v2beta2` version of the HPA when viewing or editing it:
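For example, something along these lines (the `resource.version.group` form of `kubectl get`/`edit` is standard; the HPA name is a placeholder):

```bash
# explicitly ask for the v2beta2 representation instead of the v1 default
kubectl get hpa.v2beta2.autoscaling my-app-v3 -o yaml
kubectl edit hpa.v2beta2.autoscaling my-app-v3
```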
Otherwise, accidentally touching the parameters encoded in those `annotations` may have unexpected effects, or even crash the control plane, as in this issue: Nil pointer dereference in KCM after v1 HPA patch request
3. Node Maintenance and PodDisruptionBudget
When we evict Pods from a node via `kubectl drain`, kubernetes performs the eviction based on each Pod’s `PodDisruptionBudget`.
If you don’t set an explicit PodDisruptionBudget, Pods are simply killed and rescheduled on other nodes, which may cause service interruption!
PDB is a separate custom resource; an example is shown below.
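A minimal sketch matching the podinfo scenario discussed below (labels and apiVersion are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: podinfo
spec:
  minAvailable: 1        # at least one podinfo Pod must stay available
  selector:
    matchLabels:
      app: podinfo
```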
If evicting a Pod would violate its PDB during node maintenance (`kubectl drain`), the eviction of that Pod fails and the drain keeps retrying. For example:
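The failed eviction looks roughly like this (output abridged and paraphrased; the exact flags and wording depend on the kubectl version):

```bash
kubectl drain node-205 --ignore-daemonsets --delete-emptydir-data
# evicting pod default/podinfo-xxxxxxxxxx-xxxxx
# error when evicting pod "podinfo-xxxxxxxxxx-xxxxx" (will retry after 5s):
#   Cannot evict pod as it would violate the pod's disruption budget.
```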
In the example above, there are two replicas of podinfo, both running on node-205, and I set a disruption budget of `minAvailable: 1` for them.
Then, when I use `kubectl drain` to evict the Pods, one of them is evicted immediately, while eviction of the other keeps failing for about 15 seconds: the first Pod has not yet finished starting on its new node, so evicting the second would violate `minAvailable: 1`.
After about 15 seconds, the first evicted Pod is up and running on the new node; the remaining Pod now satisfies the PDB, so it is finally evicted as well. This completes the drain of the node.
PodDisruptionBudget is also respected by cluster node-scaling components such as ClusterAutoscaler when scaling nodes down. If your cluster uses a component like ClusterAutoscaler that dynamically adds and removes nodes, it is highly recommended to set a PodDisruptionBudget for every service.
Considerations for using percentages in PDB
When using percentages, the computed number of instances is rounded up, which leads to two phenomena:
- If you use `minAvailable`, a small number of instances may result in ALLOWED DISRUPTIONS being 0
- If you use `maxUnavailable`, since the value is rounded up, ALLOWED DISRUPTIONS will never be lower than 1
So for eviction purposes, if your service has at least 2-3 instances, it is recommended to use a percentage-based `maxUnavailable` in the PDB rather than `minAvailable`.
Best Practices Deployment + HPA + PodDisruptionBudget
In general, each version of a service should include the following three resources:
- Deployment: manages the Pods of the service itself
- HPA: responsible for scaling the Pods up and down, usually based on CPU
- PodDisruptionBudget (PDB): it is recommended to derive the PDB from the HPA target value.
  - For example, if the HPA CPU target is 60%, consider setting the PDB to `minAvailable: 65%` so that at least 65% of the Pods stay available. That way, in theory, even in the extreme case the QPS spread evenly over the remaining 65% of Pods will not cause an avalanche (assuming a perfectly linear relationship between QPS and CPU here).
4. Node Affinity and Node Groups
Within a cluster, we usually use different labels to classify node groups; for example, kubernetes automatically adds some node labels:
- `kubernetes.io/os`: usually `linux`
- `kubernetes.io/arch`: `amd64`, `arm64`
- `topology.kubernetes.io/region` and `topology.kubernetes.io/zone`: the cloud provider’s region and availability zone
We use `node affinity` and `Pod anti-affinity` most often; the other two policies are used as appropriate.
1. Node Affinity
If you are using aws, then aws has some custom node labels, for example:
- `eks.amazonaws.com/nodegroup`: the name of the AWS EKS node group; nodes in the same node group use the same EC2 instance template
  - e.g. an arm64 node group, an amd64/x64 node group
  - node groups with a high memory-to-CPU ratio such as m-series instances, node groups with high compute performance such as c-series
  - spot (bid) instance node group: saves money, but instances are volatile and can be reclaimed at any time
  - on-demand (pay-as-you-go) node group: expensive but stable instances
Suppose you want to prefer spot instances for your Pods, and fall back to on-demand instances when spot capacity is temporarily exhausted. Then `nodeSelector` is not enough and you need `nodeAffinity`, as in the following example:
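A hedged sketch of such a pod template fragment, assuming the node groups carry the `eks.amazonaws.com/capacityType` label used by EKS managed node groups (adjust the label key/values to whatever your node groups actually use):

```yaml
affinity:
  nodeAffinity:
    # prefer spot nodes, but fall back to any other node if none are available
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
```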
2. Pod Anti-Affinity
It is generally recommended to configure Pod anti-affinity in each Deployment’s template, to spread the Pods across all nodes.
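A minimal sketch (the label values are placeholders for your own app/version labels):

```yaml
affinity:
  podAntiAffinity:
    # soft anti-affinity: try not to schedule two Pods of the same version
    # of the app onto the same node
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: my-app
              version: v3
```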
5. Pod’s Readiness Probe, Liveness Probe and Startup Probe
Pods provide the following three probes, all of which support probing via Command, HTTP API, or TCP Socket.
- `startupProbe` startup probe (Kubernetes v1.18 [beta]): only after this probe passes do the readiness probe and liveness probe begin their checks
  - Used to protect slow-starting containers, so they are not killed by the liveness probe before they finish starting up
  - startupProbe is obviously more flexible than livenessProbe’s `initialDelaySeconds` parameter
  - It also delays the readinessProbe from taking effect, mainly to avoid pointless probes: the container obviously cannot be ready before it has even started up
  - The program has at most `failureThreshold * periodSeconds` to start; for example, with `failureThreshold=20` and `periodSeconds=5`, the maximum start-up time is 100s; if the startup probe still has not passed after 100s, the container is killed
- `readinessProbe` readiness probe:
  - If the readiness probe fails more than `failureThreshold` times in a row (three by default), the Pod is temporarily removed from the Service’s Endpoints until it passes the probe `successThreshold` times in a row again
- `livenessProbe` liveness probe: detects whether the service is still alive; it can catch deadlocks and similar conditions and kill such containers in time
  - Possible reasons for a liveness probe failure:
    - The service is deadlocked and not responding to any requests
    - All of the service’s threads are stuck waiting on external dependencies such as redis/mysql, so requests go unanswered
  - If the liveness probe fails more than `failureThreshold` times in a row (three by default), the container is killed and then restarted according to its restart policy
    - `kubectl describe pod` shows the restart reason as `State.Last State.Reason = Error, Exit Code=137`, and Events will contain `Liveness probe failed: ...`
The parameters of the three probe types are shared; the five most important ones are:
- `initialDelaySeconds`: how long to wait after the container starts before probing begins (default 0)
- `periodSeconds`: the interval between probes (default 10)
- `timeoutSeconds`: how long to wait for a probe response before counting it as failed (default 1)
- `successThreshold`: how many consecutive successes are needed to be considered healthy again (default 1)
- `failureThreshold`: how many consecutive failures are needed to be considered failed (default 3)
Example.
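A hedged sketch combining the three probes (paths, port and thresholds are placeholders; the startup numbers match the 20 * 5s = 100s example above):

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:v3
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 20     # allow up to 20 * 5s = 100s for startup
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```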
Prior to Kubernetes 1.18, a common approach was to give the liveness probe a long `initialDelaySeconds` to achieve something similar to the startup probe, preventing containers from being restarted because a slow start made the liveness probe fail. An example is shown below.
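A sketch of that pre-1.18 approach (path, port and delay are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # give the slow-starting container plenty of time before the first probe,
  # playing the role that startupProbe plays on 1.18+
  initialDelaySeconds: 120
  periodSeconds: 10
  failureThreshold: 3
```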
6. Pod Security
Only the security-related parameters of Pods are described here; other policies, such as cluster-wide security policies, are not covered.
1. Pod SecurityContext
By setting the Pod’s SecurityContext, you can apply a specific security policy to each Pod.
There are two types of SecurityContext:
- `spec.securityContext`: a PodSecurityContext object
  - As the name implies, it applies to all containers in the Pod
- `spec.containers[*].securityContext`: a SecurityContext object
  - The container’s own SecurityContext
The parameters of these two SecurityContexts only partially overlap; where they overlap, `spec.containers[*].securityContext` takes precedence.
Some of the more common privilege-elevation policies we encounter are:
- Privileged container: `spec.containers[*].securityContext.privileged`
- Add optional system-level capabilities: `spec.containers[*].securityContext.capabilities.add`
  - Only a handful of containers, such as an ntp sync service, should ever enable this. Please note that this is very dangerous.
- Sysctls (system parameters): `spec.securityContext.sysctls`
The permission-restriction security policies are as follows (it is highly recommended to configure these on all Pods, on demand!):
- `spec.volumes`: read/write permissions can be set on all data volumes
- `spec.securityContext.runAsNonRoot: true`: the Pod must run as a non-root user
- `spec.containers[*].securityContext.readOnlyRootFilesystem: true`: make the container’s root filesystem read-only, to prevent container files from being tampered with
  - If the microservice needs to read and write files, it is recommended to mount an additional data volume of type `emptyDir`
- `spec.containers[*].securityContext.allowPrivilegeEscalation: false`: do not allow the Pod to perform any privilege escalation!
- `spec.containers[*].securityContext.capabilities.drop`: drop optional system-level capabilities
There are other features, such as specifying the user/group the containers run as, that are not listed here; please refer to the Kubernetes documentation.
An example Pod configuration for a stateless microservice:
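A hedged sketch applying the policies above to a stateless service (name, image and the writable volume path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  securityContext:
    runAsNonRoot: true                    # the whole Pod runs as a non-root user
  containers:
    - name: my-app
      image: registry.example.com/my-app:v3
      securityContext:
        readOnlyRootFilesystem: true      # root filesystem is read-only
        allowPrivilegeEscalation: false   # never allow privilege escalation
        capabilities:
          drop: ["ALL"]                   # drop all optional capabilities
      volumeMounts:
        - name: tmp
          mountPath: /tmp                 # writable scratch space via emptyDir
  volumes:
    - name: tmp
      emptyDir: {}
```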
2. seccomp: security compute mode
seccomp and seccomp-bpf allow filtering of system calls, preventing user binaries from performing dangerous operations on host operating system components that are not normally needed. It is somewhat similar to Falco, but Seccomp does not provide special support for containers.
Other issues
- Performance disparity across node types, resulting in unbalanced CPU load even when QPS is balanced
  - Solution (not verified):
    - Try to schedule each service onto instance types with the same performance: add node-type affinity via `podAffinity` and `nodeAffinity`