Graceful exit mechanism for Pods in Kubernetes

Kubernetes provides a graceful exit mechanism that lets a Pod finish some cleanup work before it exits. But if something goes wrong during cleanup, will the Pod still exit properly? How long does the exit take? Can the exit time be specified? Does the system have default values? This article works through these details and sorts out how the Kubernetes components and their parameters behave in each case.

Pod normal exit

A normal Pod exit is any exit that is not an eviction, including manual deletion and deletion after an execution error.

When a pod exits, the kubelet runs the pod’s preStop hook before deleting the container, so the pod can run a script to release the necessary resources before it exits. However, preStop itself can fail or hang. In that case it does not prevent the pod from exiting and the kubelet does not retry it. Instead the kubelet waits for a bounded period of time, after which the container is deleted to keep the whole system stable.

The whole process lives in the function killContainer. To understand graceful exit, we need to clarify which factors determine how long the kubelet waits, and how the fields a user can set work together with the parameters of the system components.

gracePeriod

The kubelet calculates gracePeriod as follows:

if the pod’s DeletionGracePeriodSeconds is not nil, meaning the pod was deleted through the ApiServer, gracePeriod takes that value directly.
otherwise, if the pod’s Spec.TerminationGracePeriodSeconds is not nil, gracePeriod starts from that value and is then adjusted according to the reason for the deletion:
1. if the reason is a startupProbe failure and the startupProbe sets TerminationGracePeriodSeconds, gracePeriod takes the value set in startupProbe.
2. if the reason is a livenessProbe failure and the livenessProbe sets TerminationGracePeriodSeconds, gracePeriod takes the value set in livenessProbe.

Once gracePeriod is obtained, the kubelet runs the pod’s preStop hook: the function executePreStopHook starts a goroutine and measures how long the hook takes. That time is subtracted from gracePeriod, and the result is the final timeout passed to the runtime for stopping the container. So if a pod defines preStop, both the execution time of preStop and the time the container needs to exit must be taken into account: set TerminationGracePeriodSeconds to a value greater than the preStop time plus the container exit time.

func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, reason containerKillReason, gracePeriodOverride *int64) error {
    ...
    // From this point, pod and container must be non-nil.
    gracePeriod := int64(minimumGracePeriodInSeconds)
    switch {
    case pod.DeletionGracePeriodSeconds != nil:
        gracePeriod = *pod.DeletionGracePeriodSeconds
    case pod.Spec.TerminationGracePeriodSeconds != nil:
        gracePeriod = *pod.Spec.TerminationGracePeriodSeconds

        switch reason {
        case reasonStartupProbe:
            if containerSpec.StartupProbe != nil && containerSpec.StartupProbe.TerminationGracePeriodSeconds != nil {
                gracePeriod = *containerSpec.StartupProbe.TerminationGracePeriodSeconds
            }
        case reasonLivenessProbe:
            if containerSpec.LivenessProbe != nil && containerSpec.LivenessProbe.TerminationGracePeriodSeconds != nil {
                gracePeriod = *containerSpec.LivenessProbe.TerminationGracePeriodSeconds
            }
        }
    }

    // Run internal pre-stop lifecycle hook
    if err := m.internalLifecycle.PreStopContainer(containerID.ID); err != nil {
        return err
    }

    // Run the pre-stop lifecycle hooks if applicable and if there is enough time to run it
    if containerSpec.Lifecycle != nil && containerSpec.Lifecycle.PreStop != nil && gracePeriod > 0 {
        gracePeriod = gracePeriod - m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)
    }
    // always give containers a minimal shutdown window to avoid unnecessary SIGKILLs
    if gracePeriod < minimumGracePeriodInSeconds {
        gracePeriod = minimumGracePeriodInSeconds
    }
    if gracePeriodOverride != nil {
        gracePeriod = *gracePeriodOverride
    }

    err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
    ...
    return nil
}
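
To make the arithmetic concrete, here is a minimal sketch of how the remaining budget could be computed. It is not the kubelet’s code: the helper name remainingGracePeriod is hypothetical, and the floor of 2 seconds is an assumption standing in for minimumGracePeriodInSeconds.

```go
package main

// minimumGracePeriod stands in for the kubelet's minimumGracePeriodInSeconds.
// The value 2 is an assumption made for this sketch.
const minimumGracePeriod int64 = 2

// remainingGracePeriod (hypothetical helper) returns how many seconds are
// left for the container to stop after the preStop hook has used
// preStopSeconds out of a total budget of gracePeriod seconds.
func remainingGracePeriod(gracePeriod, preStopSeconds int64) int64 {
	remaining := gracePeriod - preStopSeconds
	if remaining < minimumGracePeriod {
		// The container always gets a minimal shutdown window.
		remaining = minimumGracePeriod
	}
	return remaining
}
```

Under these assumptions, a 30-second budget with a preStop hook that runs for 20 seconds leaves 10 seconds for the container to stop. If the hook uses the whole budget, only the 2-second floor remains, which is why TerminationGracePeriodSeconds should cover both phases.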

gracePeriodOverride

In the analysis above, the kubelet performs one more check before calling the runtime interface: if the gracePeriodOverride passed in is not nil, its value directly replaces the gracePeriod computed so far.

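The kubelet calculates gracePeriodOverride roughly as follows:

start from the pod’s DeletionGracePeriodSeconds, if it is set.
if the kubelet is evicting the pod and the eviction setting is shorter, override the pod’s exit time with the eviction setting.
if neither yields a value, fall back to the pod’s Spec.TerminationGracePeriodSeconds; the result is never less than 1 second.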

func calculateEffectiveGracePeriod(status *podSyncStatus, pod *v1.Pod, options *KillPodOptions) (int64, bool) {
    gracePeriod := status.gracePeriod
    // this value is bedrock truth - the apiserver owns telling us this value calculated by apiserver
    if override := pod.DeletionGracePeriodSeconds; override != nil {
        if gracePeriod == 0 || *override < gracePeriod {
            gracePeriod = *override
        }
    }
    // we allow other parts of the kubelet (namely eviction) to request this pod be terminated faster
    if options != nil {
        if override := options.PodTerminationGracePeriodSecondsOverride; override != nil {
            if gracePeriod == 0 || *override < gracePeriod {
                gracePeriod = *override
            }
        }
    }
    // make a best effort to default this value to the pod's desired intent, in the event
    // the kubelet provided no requested value (graceful termination?)
    if gracePeriod == 0 && pod.Spec.TerminationGracePeriodSeconds != nil {
        gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
    }
    // no matter what, we always supply a grace period of 1
    if gracePeriod < 1 {
        gracePeriod = 1
    }
    return gracePeriod, status.gracePeriod != 0 && status.gracePeriod != gracePeriod
}
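
The selection logic above can be modelled with plain values. The following is a simplified sketch, not the kubelet’s implementation: the function effectiveGracePeriod, its parameters, and the helper seconds are hypothetical names, and a nil pointer stands for "not set".

```go
package main

// effectiveGracePeriod is a simplified, hypothetical model of the logic in
// calculateEffectiveGracePeriod. current is the grace period already
// recorded for the pod (0 means none); a nil pointer means "not set".
func effectiveGracePeriod(current int64, deletion, evictionOverride, specDefault *int64) int64 {
	gracePeriod := current
	// Each override may set an unset grace period or shorten an existing one.
	for _, override := range []*int64{deletion, evictionOverride} {
		if override != nil && (gracePeriod == 0 || *override < gracePeriod) {
			gracePeriod = *override
		}
	}
	// Fall back to the pod spec when nothing was requested.
	if gracePeriod == 0 && specDefault != nil {
		gracePeriod = *specDefault
	}
	// Never go below one second.
	if gracePeriod < 1 {
		gracePeriod = 1
	}
	return gracePeriod
}

// seconds returns a pointer to its argument, for building optional values.
func seconds(v int64) *int64 { return &v }
```

In this model, effectiveGracePeriod(0, seconds(30), seconds(10), seconds(60)) yields 10 because the eviction override is shorter than the deletion grace period, while effectiveGracePeriod(0, nil, nil, seconds(60)) falls back to the spec value of 60.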

ApiServer’s behavior

When analyzing how the kubelet handles a pod’s exit time above, we saw that the kubelet first uses the pod’s DeletionGracePeriodSeconds, which is the value the ApiServer writes when it deletes a pod. This section analyzes the behavior of the ApiServer when it deletes a pod.

DeletionGracePeriodSeconds

The ApiServer calculates the pod’s GracePeriodSeconds as follows:

set to options.GracePeriodSeconds if it is not nil, otherwise set to the Spec.TerminationGracePeriodSeconds specified by the user in the spec (default is 30s).
set to 0 if the pod has not been scheduled or has already exited, i.e. the pod is deleted immediately.

Here options.GracePeriodSeconds is the value of the --grace-period flag that can be specified when deleting a pod with kubectl, or the value passed when a program calls the ApiServer interface, such as DeleteOptions.GracePeriodSeconds in client-go.

func (podStrategy) CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool {
    if options == nil {
        return false
    }
    pod := obj.(*api.Pod)
    period := int64(0)
    // user has specified a value
    if options.GracePeriodSeconds != nil {
        period = *options.GracePeriodSeconds
    } else {
        // use the default value if set, or deletes the pod immediately (0)
        if pod.Spec.TerminationGracePeriodSeconds != nil {
            period = *pod.Spec.TerminationGracePeriodSeconds
        }
    }
    // if the pod is not scheduled, delete immediately
    if len(pod.Spec.NodeName) == 0 {
        period = 0
    }
    // if the pod is already terminated, delete immediately
    if pod.Status.Phase == api.PodFailed || pod.Status.Phase == api.PodSucceeded {
        period = 0
    }

    if period < 0 {
        period = 1
    }

    // ensure the options and the pod are in sync
    options.GracePeriodSeconds = &period
    return true
}
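
As a rough illustration of these rules, the sketch below models the calculation with plain values instead of the real Pod and DeleteOptions types. The function deletionGracePeriod, its parameters, and the helper int64Ptr are hypothetical and exist only for this example.

```go
package main

// deletionGracePeriod is a hypothetical, simplified model of the period
// computed in CheckGracefulDelete. requested plays the role of
// options.GracePeriodSeconds and specDefault the role of
// Spec.TerminationGracePeriodSeconds; nil means "not set".
func deletionGracePeriod(requested, specDefault *int64, scheduled, terminated bool) int64 {
	period := int64(0)
	switch {
	case requested != nil:
		// A value given by the caller wins over the spec.
		period = *requested
	case specDefault != nil:
		period = *specDefault
	}
	// Unscheduled or already terminated pods are deleted immediately.
	if !scheduled || terminated {
		period = 0
	}
	// Negative values are normalised to one second.
	if period < 0 {
		period = 1
	}
	return period
}

// int64Ptr returns a pointer to its argument, for building optional values.
func int64Ptr(v int64) *int64 { return &v }
```

For a running pod with a 30-second spec value, deletionGracePeriod(int64Ptr(5), int64Ptr(30), true, false) gives 5, and omitting the requested value gives 30. For a pod that was never scheduled the result is 0 regardless of the other inputs.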

kubelet eviction of pods

In addition, the pod’s graceful exit time is overridden when the pod is evicted by the kubelet.

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ...
    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        gracePeriodOverride := int64(0)
        if !isHardEvictionThreshold(thresholdToReclaim) {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
        if m.evictPod(pod, gracePeriodOverride, message, annotations) {
            metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
            return []*v1.Pod{pod}
        }
    }
    ...
}

The override value is EvictionMaxPodGracePeriod, one of the kubelet’s eviction-related configuration parameters, and it only applies to soft eviction.

// Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}.
EvictionHard map[string]string
// Map of signal names to quantities that defines soft eviction thresholds.  For example: {"memory.available": "300Mi"}.
EvictionSoft map[string]string
// Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {"memory.available": "30s"}.
EvictionSoftGracePeriod map[string]string
// Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
EvictionPressureTransitionPeriod metav1.Duration
// Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
EvictionMaxPodGracePeriod int32

The function the kubelet uses to evict a pod is injected at startup. It is the following function.

func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
    return func(pod *v1.Pod, isEvicted bool, gracePeriodOverride *int64, statusFn func(*v1.PodStatus)) error {
        // determine the grace period to use when killing the pod
        gracePeriod := int64(0)
        if gracePeriodOverride != nil {
            gracePeriod = *gracePeriodOverride
        } else if pod.Spec.TerminationGracePeriodSeconds != nil {
            gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
        }

        // we timeout and return an error if we don't get a callback within a reasonable time.
        // the default timeout is relative to the grace period (we settle on 10s to wait for kubelet->runtime traffic to complete in sigkill)
        timeout := int64(gracePeriod + (gracePeriod / 2))
        minTimeout := int64(10)
        if timeout < minTimeout {
            timeout = minTimeout
        }
        timeoutDuration := time.Duration(timeout) * time.Second

        // open a channel we block against until we get a result
        ch := make(chan struct{}, 1)
        podWorkers.UpdatePod(UpdatePodOptions{
            Pod:        pod,
            UpdateType: kubetypes.SyncPodKill,
            KillPodOptions: &KillPodOptions{
                CompletedCh:                              ch,
                Evict:                                    isEvicted,
                PodStatusFunc:                            statusFn,
                PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
            },
        })

        // wait for either a response, or a timeout
        select {
        case <-ch:
            return nil
        case <-time.After(timeoutDuration):
            recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
            return fmt.Errorf("timeout waiting to kill pod")
        }
    }
}

killPodNow is the function the kubelet calls when evicting a pod. gracePeriodOverride is the value set during soft eviction; when it is not set, gracePeriod still takes the value of pod.Spec.TerminationGracePeriodSeconds. The function then calls podWorkers.UpdatePod with the corresponding parameters, sets a timeout derived from gracePeriod, and waits for the result.
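
The timeout arithmetic in killPodNow is easy to check in isolation. The sketch below extracts it into a standalone helper; the name killTimeout is hypothetical and the function is not part of the kubelet.

```go
package main

import "time"

// killTimeout (hypothetical helper) reproduces the timeout arithmetic from
// killPodNow: one and a half times the grace period, using integer
// division, with a floor of 10 seconds.
func killTimeout(gracePeriod int64) time.Duration {
	timeout := gracePeriod + gracePeriod/2
	const minTimeout = int64(10)
	if timeout < minTimeout {
		timeout = minTimeout
	}
	return time.Duration(timeout) * time.Second
}
```

With a grace period of 30 seconds the caller waits up to 45 seconds. With a grace period of 0, as in hard eviction, the 10-second floor applies.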

Summary

The graceful exit of a Pod is achieved by preStop. This article briefly analyzed which factors affect a Pod’s exit time when it exits normally and when it is evicted, and how the parameters interact with each other. With these details in hand, we have a more complete picture of the Pod exit process.
