This article introduces a set of Kubernetes configuration practices that I have summarized as my personal “best practices” while using Kubernetes. Most of the content has been validated in a production environment, but a few parts have only been simulated in my head, so please treat them with caution.
A few notes before reading.
- This document is quite long and covers a lot of content, so I recommend using it as a reference manual: skim it first with the help of the table of contents, then read the relevant parts in detail as needed.
- This document requires some Kubernetes fundamentals to understand; if you have no hands-on experience, it may read as rather dry.
- People with hands-on experience may have different opinions than I do, so feel free to comment.
I will update this document from time to time as appropriate.
0. Examples
First of all, here are the premises this article is based on. They only fit the scenarios I have encountered, so apply them flexibly:
- Only stateless services are discussed here, stateful services are out of scope
- Instead of using the Deployment’s rolling-update capability, we create a separate Deployment + HPA + PodDisruptionBudget for each version of each service; this makes canary/grayscale releases easier
- Our services may use an IngressController / Service Mesh for load balancing and traffic splitting
Here’s a demo of Deployment + HPA + PodDisruptionBudget, which will be broken down later for more details.
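Roughly, such a set of resources might look like the following sketch (names, labels, image and numbers are illustrative placeholders; apiVersions may differ depending on your cluster version):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3            # one Deployment per service version
  labels:
    app: my-app
    version: v3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: v3
  template:
    metadata:
      labels:
        app: my-app
        version: v3
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v3
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 1Gi
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # target: average CPU at 60% of requests
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-v3
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: my-app
      version: v3
```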
1. Graceful Shutdown and 502/504 errors
If a Pod is serving a large number of requests (e.g. 1000+ QPS) and gets rescheduled because of a node failure or because a spot (“bid”) node is reclaimed, you may observe a small number of 502/504 errors during the period when its containers are being terminated.
To understand this, you need to first understand the process of terminating a Pod.
1. The Pod’s status is set to `Terminating`, and (almost) simultaneously the Pod is removed from all associated Service Endpoints.
2. The `preStop` hook is executed, either as a command or as an HTTP call to a container in the Pod.
   - Consider using `preStop` if your application cannot exit gracefully on its own when it receives a SIGTERM signal.
   - If it is too much trouble to make the program itself support graceful exit, `preStop` is a good way to achieve it.
3. SIGTERM is sent to all containers in the Pod.
4. Kubernetes keeps waiting until the time set by `spec.terminationGracePeriodSeconds` is exceeded; the default value is 30s.
   - Note that this graceful-exit countdown runs concurrently with `preStop`, and it does not wait for `preStop` to finish either!
5. If the containers have not stopped after `spec.terminationGracePeriodSeconds`, k8s sends a SIGKILL signal to them.
6. After all processes are terminated, the entire Pod is completely cleaned up.

Note: steps 1 and 2 happen asynchronously, so a situation where “the Pod is still in the Service Endpoints, but `preStop` has already been executed” is possible, and we need to account for it.
After understanding the above flow, we can analyze the reasons behind the two error codes:
- 502: the application exits immediately after receiving SIGTERM, so requests that were still in flight are cut off, and the proxy layer returns 502 to indicate this
- 504: removal from the Service Endpoints is not timely enough; after the Pod has been terminated, a few requests are still routed to it and never get a response, resulting in 504
The usual solution is to add a 15s wait to the Pod’s `preStop` step. The rationale: once the Pod enters the Terminating state, it is removed from the Service Endpoints and stops receiving new requests. Waiting 15s in `preStop` essentially guarantees that all in-flight requests are processed before the container exits (in general, most requests complete within 300ms).
A simple example, which makes the Pod always wait 15s before SIGTERM is sent to its container when it is terminated:
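A minimal sketch of such a hook (container name and image are placeholders; the 15s matches the reasoning above):

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:v3
    lifecycle:
      preStop:
        exec:
          # keep serving for 15s after the Pod leaves the Service Endpoints,
          # so that in-flight requests can complete before SIGTERM arrives
          command: ["/bin/sleep", "15"]
```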
A better solution is to simply wait until all TCP connections are closed (this requires netstat in the image):
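A rough sketch of that idea, assuming a shell and `netstat` are available in the image (the port 8080 and the 1s polling interval are placeholders):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        # wait until no ESTABLISHED connections remain on the service port,
        # checking once per second
        - "while [ $(netstat -nt | grep ESTABLISHED | grep -c ':8080') -gt 0 ]; do sleep 1; done"
```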
What if my service is also using Sidecar to proxy network requests?
Using the Istio service mesh as an example, the 502/504 issue gets a little more complicated when Envoy proxies the Pod’s traffic: you also have to consider the order in which the sidecar and the main container shut down.
- If a new request comes in after Envoy has been shut down, it results in a 504 (nothing is left to respond to the request)
  - So Envoy should be shut down at least 3s after entering Terminating, to make sure the Istio mesh configuration is fully updated
- If the main container is shut down before Envoy stops and a new request comes in, Envoy cannot connect to the upstream, resulting in a 503
  - So the main container should also be shut down at least 3s after entering Terminating
- If Envoy or the main container stops while the main container is still processing in-flight requests, those TCP connections are cut off directly, resulting in 502
  - Therefore Envoy must only be closed after the main container has finished processing its in-flight requests (that is, when there are no TCP connections left)

So to summarize: both Envoy’s and the main container’s `preStop` must wait at least 3s and exit only when there are no remaining TCP connections, in order to avoid 502/503/504.
The modification for the main container has already been described above; the following describes how to modify Envoy.
Like the main container, Envoy can also be given a `preStop` directly: modify the `istio-sidecar-injector` ConfigMap and add a `preStop` sleep command to the sidecar template:
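A hedged sketch of that fragment; the exact structure of the `istio-sidecar-injector` ConfigMap template varies between Istio versions, so treat this as illustrative only:

```yaml
# inside the sidecar template of the istio-sidecar-injector ConfigMap,
# add a lifecycle hook to the injected istio-proxy container:
containers:
  - name: istio-proxy
    # ... image, args and other injected fields omitted ...
    lifecycle:
      preStop:
        exec:
          # wait a few seconds so the mesh configuration is updated and the
          # main container has drained its connections before Envoy exits
          command: ["/bin/sh", "-c", "sleep 5"]
```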
Reference
- Kubernetes best practices: terminating with grace
- Graceful shutdown in Kubernetes is not always trivial
2. Service Scaling Configuration - HPA
Kubernetes natively supports scaling based on Pod CPU usage, which is the most widely used scaling metric; it requires metrics-server to be deployed in the cluster.
To start with, review the example of scaling based on Pod CPU usage given earlier.
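A minimal sketch of such an HPA (the apiVersion, names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # keep average CPU at ~60% of requests
```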
1. How the current metric value is calculated
To summarize in advance: each Pod’s metric is the sum of the metrics of all its containers, and when a percentage is calculated, it is divided by the sum of the Pod’s requests.
By default HPA uses the Pod’s current metrics for the calculation. Taking CPU utilization as an example, the formula is:
Pod CPU utilization = sum(CPU usage of all containers in the Pod) / sum(CPU requests of all containers in the Pod) * 100%
Note that the denominator is the sum of requests, not limits.
1.1 Problems and solutions
This is not a problem when the Pod has only one container, but it becomes one when a sidecar such as Envoy is injected into the Pod.
Istio’s sidecar requests `100m` (0.1 core) of CPU by default. Without tuning, the sidecar’s actual usage can easily reach 0.2 - 0.4 cores when the service load is high. Substituting these two values into the formula above, we find that for a high-QPS service with a sidecar, the “CPU utilization of the Pod” may end up much higher than the “CPU utilization of the application container”, triggering unnecessary scale-ups.
Even if we scale on the Pod’s absolute CPU usage instead of the percentage, the problem remains.
Solutions:
- Method 1: set different requests/limits for each service’s sidecar according to that service’s CPU usage.
  - I think this solution is too troublesome.
- Method 2: use a third-party component like KEDA to get the CPU utilization of the application container alone (excluding the sidecar) and scale on that.
- Method 3: use the alpha feature introduced in k8s 1.20: Container Resource Metrics.
2. HPA’s scaling algorithm
It is easy to understand when HPA scales up. Its scale-down strategy, however, causes some confusion, so here is a brief analysis.
- HPA’s “target metric” can be specified in two forms: an absolute value and a resource utilization percentage.
  - Absolute value: for CPU, this means the CPU usage itself (e.g. 500m)
  - Resource utilization (resource usage / resource request * 100%): when the Pods set resource requests, you can scale on resource utilization
- HPA’s “current metric” is the average over all Pods within a time window, not the peak.
The scaling algorithm of HPA is:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
From this formula we can see that:
- Whenever the “current metric” exceeds the target metric, a scale-up will definitely happen.
- To trigger a scale-down, `current metric / target metric` has to be small enough. For example, with two replicas the ratio must be less than or equal to 1/2 before scaling down to a single replica; with three replicas the threshold is 2/3; with five replicas it is 4/5, with 100 replicas it is 99/100, and so on.
- If `current metric / target metric` drops from 1 to 0.5, the number of replicas is halved (although the more replicas there are, the less likely such a large swing is).
- The larger the value of `current replica count / target metric`, the greater the impact a fluctuation in the “current metric” has on the “desired replica count”.
In order to prevent overly sensitive scaling, HPA also has several delay-related parameters:
- HPA loop delay: 15 seconds by default, i.e. an HPA evaluation is performed every 15 seconds (`--horizontal-pod-autoscaler-sync-period`)
- `--horizontal-pod-autoscaler-cpu-initialization-period`: the window after a Pod starts during which CPU samples may be skipped; 5 minutes by default
- Scale-down cooldown (`--horizontal-pod-autoscaler-downscale-stabilization`): 5 minutes by default
3. What is an appropriate HPA target value
This needs to be analyzed case by case for each service.
Taking the most common case, scaling by CPU, as an example:
- Core services
  - requests/limits values: it is recommended to set them equal, so the Pod’s QoS class is Guaranteed
    - Note that CPU and Memory limits behave differently: CPU is genuinely throttled at the limit, while exceeding the Memory limit gets the container OOMKilled
    - k8s has been using cgroups v1 (`cpu_shares` / `memory.limit_in_bytes`) to limit cpu/memory, but even for `Guaranteed` Pods memory is not fully reserved, so resource contention is always possible. 1.22 has an alpha feature that uses cgroups v2, so keep an eye on that.
  - HPA: generally speaking, a target of 60% to 70% may be appropriate, and a minimum replica count of 2 to 5 is recommended (FYI)
  - PodDisruptionBudget: it is recommended to set the PDB according to the service’s robustness and the HPA target; this is covered in detail later, so we skip it here
- Non-core services
  - requests/limits values: it is recommended to set requests to 0.6 - 0.9 times the limits (for reference only), which corresponds to the Burstable QoS class
    - The main consideration is that many non-core services have very low load and never run anywhere near their limits, so lowering requests improves cluster resource utilization without compromising service stability
  - HPA: because requests are lowered and HPA treats requests as 100% utilization, we can raise the HPA target (if a percentage target is used), e.g. 80% to 90%, and a minimum replica count of 1 to 3 is recommended (for reference only)
  - PodDisruptionBudget: for non-core services it is enough to guarantee a minimum of 1 replica
4. Frequently Asked Questions about HPA
4.1. Pod scaling - warm-up pitfalls
Warm-up: for languages like Java/C# that run on a virtual machine, the first time certain code paths are used, some resources need to be initialized, for example through JIT compilation on the fly. If the code also uses dynamic class loading or similar features, a microservice is likely to respond particularly slowly the first time certain APIs are called (because classes are compiled on demand). Therefore, a Pod needs to warm up these interfaces and initialize the resources they need in advance, before it starts serving traffic.
Under high load, HPA automatically scales up. But if the newly created Pod needs warming up, it may run into the “warm-up trap”.
When a large number of users are accessing the service, any request forwarded to the newly created Pod gets “stuck” there, regardless of the load-balancing policy. If requests arrive fast enough, more and more of them pile up on the new Pod at the moment it starts, which can overwhelm and crash it. The Pod is then crushed again as soon as it restarts, entering a CrashLoopBackOff loop.
The effect is even more pronounced when load-testing with multiple threads: with 50 threads requesting non-stop, the response time of the other Pods is in milliseconds, while the first responses of the new Pod take seconds. Almost instantly, all 50 threads are trapped in the new Pod, which is at its most vulnerable right after startup, and the 50 concurrent requests can overwhelm it immediately.
Solutions:
This can be solved at the “application level”:
- In the controller behind the startup probe API, call all interfaces that need warming up (or initialize all required resources by other means) before reporting ready. The controller can call the service’s own interfaces through the `localhost` loopback address.
- Use AOT (ahead-of-time) compilation: warm-up problems are usually caused by JIT (just-in-time) compilation, which compiles code when it is first needed. AOT compiles ahead of use, so it avoids the warm-up problem.
It can also be solved at the “infrastructure level”:
- Cloud load balancers such as AWS ALB TargetGroup usually let you set a `slow_start` duration: for new instances, traffic is shifted over gradually during that window until the expected load-balancing state is reached. This solves the service warm-up problem.
- Envoy also supports a `slow_start` mode, which slowly ramps traffic up to a new instance within a configured time window to achieve the same warm-up effect.
4.2. HPA scaling is too sensitive and leads to Pod oscillation
In general, most workloads on EKS should be scaled on CPU, because CPU usually reflects the service’s load well.
However, some services have other factors that affect CPU usage that make scaling with CPU less reliable, such as
- Some Java services have large heap memory settings and long GC pause settings, so memory GC can cause intermittent CPU spikes and CPU monitoring can have a lot of spikes.
- Some services have timed tasks, and the CPU will go up when the timed tasks run, but this is not related to the QPS of the service.
- Some services may immediately be in a high state as soon as the CPU is running, and it may want to use other business-side metrics for scaling instead of CPU.
Because of the problems above, scaling on CPU may cause such a service to scale up and down frequently, or to keep scaling out indefinitely. Some services (such as our “recommendation service”) are sensitive to both scale-ups and scale-downs, and every scaling event causes availability jitter.
For such services, the following HPA tuning strategies are available:
- Scale on a relatively smooth metric such as QPS, which is free of GC interference; this requires a community component like KEDA.
- On Kubernetes 1.18+, you can directly use HPA’s `behavior.scaleDown` and `behavior.scaleUp` parameters to control the maximum number of Pods (or ratio) added or removed in each scaling step. An example is shown below.
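A hedged sketch of such a `behavior` section (the policy numbers and windows are illustrative; adjust them to your service):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-v3
spec:
  # ... scaleTargetRef / minReplicas / maxReplicas / metrics omitted ...
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent        # add at most 50% more Pods per minute
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods           # remove at most 1 Pod per minute
          value: 1
          periodSeconds: 60
```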
For the problem of scaling not being smooth enough, consider providing a feature similar to AWS ALB TargetGroup’s `slow_start`: when scaling up, shift traffic to the new Pod gradually so the service can warm up (JVM warm-up and local cache warm-up), which yields a much smoother scale-up.
5. HPA Cautions
Note that kubectl versions below 1.23 query the HPA configuration through `hpa.v1.autoscaling` by default, and the `v2beta2`-specific parameters are encoded in `metadata.annotations`.
For example, `behavior` is encoded as the value of the key `autoscaling.alpha.kubernetes.io/behavior`.
So if you are using a v2beta2 HPA, be sure to explicitly specify the `v2beta2` version of the HPA when viewing or editing it:
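For example, something along these lines (the `resource.version.group` form of `kubectl get`/`edit` is standard; the HPA name is a placeholder):

```bash
# explicitly ask for the v2beta2 representation instead of the v1 default
kubectl get hpa.v2beta2.autoscaling my-app-v3 -o yaml
kubectl edit hpa.v2beta2.autoscaling my-app-v3
```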
Otherwise, accidentally touching the parameters encoded in those `annotations` may have unexpected effects, or even crash the control plane, as in this issue: Nil pointer dereference in KCM after v1 HPA patch request
3. Node Maintenance and PodDisruptionBudget
When we evict Pods from a node via `kubectl drain`, kubernetes performs the eviction based on each Pod’s `PodDisruptionBudget`.
If you don’t set an explicit PodDisruptionBudget, Pods are simply killed and rescheduled on other nodes, which may cause service interruption!
PDB is a separate custom resource; an example is shown below.
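A minimal sketch matching the podinfo scenario discussed below (labels and apiVersion are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: podinfo
spec:
  minAvailable: 1        # at least one podinfo Pod must stay available
  selector:
    matchLabels:
      app: podinfo
```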
If evicting a Pod would violate its PDB during node maintenance (`kubectl drain`), the eviction of that Pod fails and the drain keeps retrying. For example:
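The failed eviction looks roughly like this (output abridged and paraphrased; the exact flags and wording depend on the kubectl version):

```bash
kubectl drain node-205 --ignore-daemonsets --delete-emptydir-data
# evicting pod default/podinfo-xxxxxxxxxx-xxxxx
# error when evicting pod "podinfo-xxxxxxxxxx-xxxxx" (will retry after 5s):
#   Cannot evict pod as it would violate the pod's disruption budget.
```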
In the example above, there are two replicas of podinfo, both running on node-205, and I set a disruption budget of `minAvailable: 1` for them.
Then, when I use `kubectl drain` to evict the Pods, one of them is evicted immediately, while eviction of the other keeps failing for about 15 seconds: the first Pod has not yet finished starting on its new node, so evicting the second would violate `minAvailable: 1`.
After about 15 seconds, the first evicted Pod is up and running on the new node; the remaining Pod now satisfies the PDB, so it is finally evicted as well. This completes the drain of the node.
PodDisruptionBudget is also respected by cluster node-scaling components such as ClusterAutoscaler when scaling nodes down. If your cluster uses a component like ClusterAutoscaler that dynamically adds and removes nodes, it is highly recommended to set a PodDisruptionBudget for every service.
Considerations for using percentages in PDB
When using percentages, the computed number of instances is rounded up, which leads to two phenomena:
- If you use `minAvailable`, a small number of instances may result in ALLOWED DISRUPTIONS being 0
- If you use `maxUnavailable`, since the value is rounded up, ALLOWED DISRUPTIONS will never be lower than 1
So for eviction purposes, if your service has at least 2-3 instances, it is recommended to use a percentage-based `maxUnavailable` in the PDB rather than `minAvailable`.
Best Practices Deployment + HPA + PodDisruptionBudget
In general, each version of a service should include the following three resources:
- Deployment: manages the Pods of the service itself
- HPA: responsible for scaling the Pods up and down, usually based on CPU
- PodDisruptionBudget (PDB): it is recommended to derive the PDB from the HPA target value.
  - For example, if the HPA CPU target is 60%, consider setting the PDB to `minAvailable: 65%` so that at least 65% of the Pods stay available. That way, in theory, even in the extreme case the QPS spread evenly over the remaining 65% of Pods will not cause an avalanche (assuming a perfectly linear relationship between QPS and CPU here).
4. Node Affinity and Node Groups
Within a cluster, we usually use different labels to classify node groups; for example, kubernetes automatically adds some node labels:
- `kubernetes.io/os`: usually `linux`
- `kubernetes.io/arch`: `amd64`, `arm64`
- `topology.kubernetes.io/region` and `topology.kubernetes.io/zone`: the cloud provider’s region and availability zone
We use `node affinity` and `Pod anti-affinity` most often; the other two policies are used as appropriate.
1. Node Affinity
If you are using aws, then aws has some custom node labels, for example:
- `eks.amazonaws.com/nodegroup`: the name of the AWS EKS node group; nodes in the same node group use the same EC2 instance template
  - e.g. an arm64 node group, an amd64/x64 node group
  - node groups with a high memory-to-CPU ratio such as m-series instances, node groups with high compute performance such as c-series
  - spot (bid) instance node group: saves money, but instances are volatile and can be reclaimed at any time
  - on-demand (pay-as-you-go) node group: expensive but stable instances
Suppose you want to prefer spot instances for your Pods, and fall back to on-demand instances when spot capacity is temporarily exhausted. Then `nodeSelector` is not enough and you need `nodeAffinity`, as in the following example:
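A hedged sketch of such a pod template fragment, assuming the node groups carry the `eks.amazonaws.com/capacityType` label used by EKS managed node groups (adjust the label key/values to whatever your node groups actually use):

```yaml
affinity:
  nodeAffinity:
    # prefer spot nodes, but fall back to any other node if none are available
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
```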
2. Pod Anti-Affinity
It is generally recommended to configure Pod anti-affinity in each Deployment’s template, to spread the Pods across all nodes.
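A minimal sketch (the label values are placeholders for your own app/version labels):

```yaml
affinity:
  podAntiAffinity:
    # soft anti-affinity: try not to schedule two Pods of the same version
    # of the app onto the same node
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: my-app
              version: v3
```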
5. Pod’s Readiness Probe, Liveness Probe and Startup Probe
Pods provide the following three probes, all of which support probing via Command, HTTP API, or TCP Socket.
- `startupProbe` startup probe (Kubernetes v1.18 [beta]): only after this probe passes do the readiness probe and liveness probe begin their checks
  - Used to protect slow-starting containers, so they are not killed by the liveness probe before they finish starting up
  - startupProbe is obviously more flexible than livenessProbe’s `initialDelaySeconds` parameter
  - It also delays the readinessProbe from taking effect, mainly to avoid pointless probes: the container obviously cannot be ready before it has even started up
  - The program has at most `failureThreshold * periodSeconds` to start; for example, with `failureThreshold=20` and `periodSeconds=5`, the maximum start-up time is 100s; if the startup probe still has not passed after 100s, the container is killed
- `readinessProbe` readiness probe:
  - If the readiness probe fails more than `failureThreshold` times in a row (three by default), the Pod is temporarily removed from the Service’s Endpoints until it passes the probe `successThreshold` times in a row again
- `livenessProbe` liveness probe: detects whether the service is still alive; it can catch deadlocks and similar conditions and kill such containers in time
  - Possible reasons for a liveness probe failure:
    - The service is deadlocked and not responding to any requests
    - All of the service’s threads are stuck waiting on external dependencies such as redis/mysql, so requests go unanswered
  - If the liveness probe fails more than `failureThreshold` times in a row (three by default), the container is killed and then restarted according to its restart policy
    - `kubectl describe pod` shows the restart reason as `State.Last State.Reason = Error, Exit Code=137`, and Events will contain `Liveness probe failed: ...`
The parameters of the three probe types are shared; the five most important ones are:
- `initialDelaySeconds`: how long to wait after the container starts before probing begins (default 0)
- `periodSeconds`: the interval between probes (default 10)
- `timeoutSeconds`: how long to wait for a probe response before counting it as failed (default 1)
- `successThreshold`: how many consecutive successes are needed to be considered healthy again (default 1)
- `failureThreshold`: how many consecutive failures are needed to be considered failed (default 3)
Example.
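A hedged sketch combining the three probes (paths, port and thresholds are placeholders; the startup numbers match the 20 * 5s = 100s example above):

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:v3
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 20     # allow up to 20 * 5s = 100s for startup
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```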
Prior to Kubernetes 1.18, a common approach was to give the liveness probe a long `initialDelaySeconds` to achieve something similar to the startup probe, preventing containers from being restarted because a slow start made the liveness probe fail. An example is shown below.
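A sketch of that pre-1.18 approach (path, port and delay are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # give the slow-starting container plenty of time before the first probe,
  # playing the role that startupProbe plays on 1.18+
  initialDelaySeconds: 120
  periodSeconds: 10
  failureThreshold: 3
```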
6. Pod Security
Only the security-related parameters of Pods are described here; other policies, such as cluster-wide security policies, are not covered.
1. Pod SecurityContext
By setting the Pod’s SecurityContext, you can apply a specific security policy to each Pod.
There are two types of SecurityContext:
- `spec.securityContext`: a PodSecurityContext object
  - As the name implies, it applies to all containers in the Pod
- `spec.containers[*].securityContext`: a SecurityContext object
  - The container’s own SecurityContext
The parameters of these two SecurityContexts only partially overlap; where they overlap, `spec.containers[*].securityContext` takes precedence.
Some of the more common privilege-elevation policies we encounter are:
- Privileged container: `spec.containers[*].securityContext.privileged`
- Add optional system-level capabilities: `spec.containers[*].securityContext.capabilities.add`
  - Only a handful of containers, such as an ntp sync service, should ever enable this. Please note that this is very dangerous.
- Sysctls (system parameters): `spec.securityContext.sysctls`
The permission-restriction security policies are as follows (it is highly recommended to configure these on all Pods, on demand!):
- `spec.volumes`: read/write permissions can be set on all data volumes
- `spec.securityContext.runAsNonRoot: true`: the Pod must run as a non-root user
- `spec.containers[*].securityContext.readOnlyRootFilesystem: true`: make the container’s root filesystem read-only, to prevent container files from being tampered with
  - If the microservice needs to read and write files, it is recommended to mount an additional data volume of type `emptyDir`
- `spec.containers[*].securityContext.allowPrivilegeEscalation: false`: do not allow the Pod to perform any privilege escalation!
- `spec.containers[*].securityContext.capabilities.drop`: drop optional system-level capabilities
There are other features, such as specifying the user/group the containers run as, that are not listed here; please refer to the Kubernetes documentation.
An example Pod configuration for a stateless microservice:
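A hedged sketch applying the policies above to a stateless service (name, image and the writable volume path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  securityContext:
    runAsNonRoot: true                    # the whole Pod runs as a non-root user
  containers:
    - name: my-app
      image: registry.example.com/my-app:v3
      securityContext:
        readOnlyRootFilesystem: true      # root filesystem is read-only
        allowPrivilegeEscalation: false   # never allow privilege escalation
        capabilities:
          drop: ["ALL"]                   # drop all optional capabilities
      volumeMounts:
        - name: tmp
          mountPath: /tmp                 # writable scratch space via emptyDir
  volumes:
    - name: tmp
      emptyDir: {}
```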
2. seccomp: security compute mode
seccomp and seccomp-bpf allow filtering of system calls, preventing user binaries from performing dangerous operations on host operating system components that are not normally needed. It is somewhat similar to Falco, but Seccomp does not provide special support for containers.
Other issues
- Performance disparity across node types, resulting in unbalanced CPU load even when QPS is balanced
  - Solution (not verified):
    - Try to schedule each service onto instance types with the same performance: add node-type affinity via `podAffinity` and `nodeAffinity`