Taints and Tolerations are an important mechanism of the Kubernetes scheduling system; they ensure that Pods are not scheduled onto inappropriate nodes.
In this article, we will briefly introduce the Kubernetes Taints and Tolerations mechanism.
Taints
Taints are attributes defined on Node objects in Kubernetes. Unlike Labels and Annotations, which record information as plain key=value pairs, a taint adds an effect attribute and is written as key=value:effect. The key and value are user-defined strings, while the effect describes how the taint influences Pod scheduling. The following three effects are currently supported.
- NoExecute: Pods already running on the node that do not tolerate the taint are evicted, and new Pods are not scheduled to the node.
- NoSchedule: Pods already running on the node are not affected, and new Pods are not scheduled to the node.
- PreferNoSchedule: Pods already running on the node are not affected; the scheduler tries to avoid placing new Pods on the node, but may still do so if no other node is suitable.
With the kubectl taint command, we can quickly add a taint to a Node.
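For example, the following commands add a taint and then remove it again (the node name node1 and the key key1 are placeholders):

```bash
# Add the taint key1=value1 with effect NoSchedule to node1
kubectl taint nodes node1 key1=value1:NoSchedule

# Remove the same taint again (note the trailing "-")
kubectl taint nodes node1 key1=value1:NoSchedule-
```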
Tolerations
Tolerations are defined on a Pod and tell the Kubernetes control plane whether the specified taints affect the scheduling and eviction of that Pod.
Users can define one or more tolerations on a Pod, as follows.
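A minimal sketch of a toleration in a Pod spec (the key key1 and value value1 are purely illustrative):

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
```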
The above definition says that the taint key1=value1:NoSchedule has no effect on the scheduling of this Pod; the scheduler is allowed to place the Pod on a Node that carries this taint.
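A toleration for a NoExecute taint can additionally carry a tolerationSeconds field, for example (again a sketch using the same illustrative key):

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
```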
The above definition means that the taint key1=value1:NoExecute does not affect the Pod for the first 3600 seconds; if the taint is still present on the Node after 3600 seconds, the Pod is evicted and rescheduled onto another node.
When defining a toleration, we can omit the value, or omit both the key and the value; in that case the operator must be set to Exists, as follows.
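For example (a sketch; the first entry matches any taint with key key1 regardless of its value, while the second tolerates every taint):

```yaml
tolerations:
# Tolerate any value of key1 for the NoSchedule effect
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
# An empty key with operator Exists tolerates every taint
- operator: "Exists"
```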
Taint based Evictions
Kubernetes’ Node Controller implements Pod eviction and migration on node failure based on the Taints and Tolerations mechanism described above.
When a node enters an abnormal state, the Node Controller automatically adds the corresponding taints to the Node.
| Key | Effect | Remarks |
| --- | --- | --- |
| node.kubernetes.io/not-ready | NoExecute | NodeCondition Ready == False |
| node.kubernetes.io/unreachable | NoExecute | NodeCondition Ready == Unknown |
| node.kubernetes.io/memory-pressure | NoSchedule | The node is under memory pressure |
| node.kubernetes.io/disk-pressure | NoSchedule | The node is under disk pressure |
| node.kubernetes.io/pid-pressure | NoSchedule | The node is running low on PIDs |
| node.kubernetes.io/unschedulable | NoSchedule | The node is unschedulable, mainly set when a node is actively cordoned |
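As a quick illustration of the last row, cordoning a node makes the control plane add the node.kubernetes.io/unschedulable taint, which can be inspected directly (the node name node1 is a placeholder):

```bash
# Mark the node as unschedulable
kubectl cordon node1

# The node's taints should now include node.kubernetes.io/unschedulable:NoSchedule
kubectl get node node1 -o jsonpath='{.spec.taints}'

# Undo the cordon; the taint is removed again
kubectl uncordon node1
```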
The Node Controller adds these taints based on the node's NodeCondition information, which is kept up to date by each node's kubelet. When scheduling a Pod, the scheduler acts directly on the node's taints and does not re-examine the underlying NodeCondition values.
Among the taints managed by the Node Controller, only node.kubernetes.io/not-ready and node.kubernetes.io/unreachable have the NoExecute effect; all the others use NoSchedule.
This is because a NoExecute taint immediately evicts all Pods on the node that do not tolerate it. To keep the system robust, Kubernetes instead marks a node under memory pressure with a NoSchedule taint, while the kubelet reclaims resources in the background, typically evicting lower-priority Pods first so that they are rescheduled onto other nodes.
If memory usage then drops back below the eviction threshold, the kubelet updates the NodeCondition and the taint is removed. In other words, a taint such as memory-pressure does not indicate that the node is completely unavailable, so setting its effect to NoExecute would be inappropriate.
By default, Kubernetes automatically adds a number of tolerations to newly created Pods so that Pods can be migrated automatically in the various scenarios above.
Kubernetes automatically adds the following tolerations to all Pods created by controllers other than the DaemonSet controller:
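Concretely, the injected tolerations look as follows; the 300-second value is the default applied by the DefaultTolerationSeconds admission plugin and may be configured differently in a given cluster:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```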
For DaemonSet-managed Pods, Kubernetes injects the following tolerations:
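For DaemonSet Pods the NoExecute tolerations are added without tolerationSeconds, so the Pods are never evicted for these conditions, and NoSchedule tolerations are added for the condition-based taints as well (the exact set may vary slightly between Kubernetes versions):

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
- key: "node.kubernetes.io/memory-pressure"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/disk-pressure"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/pid-pressure"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/unschedulable"
  operator: "Exists"
  effect: "NoSchedule"
# Pods that use hostNetwork additionally tolerate
# node.kubernetes.io/network-unavailable:NoSchedule
```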
The official documentation mentions that for Guaranteed and Burstable Pods, Kubernetes automatically adds a node.kubernetes.io/memory-pressure toleration (even for Burstable Pods that have no memory request set). However, in my own testing I found that this toleration was not added.
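To check which tolerations were actually injected, the Pod spec can be inspected directly (the Pod name my-pod is a placeholder):

```bash
# List the tolerations present in the Pod's spec
kubectl get pod my-pod -o jsonpath='{.spec.tolerations}'
```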