Kubelet Graceful Shutdown

UPDATE: The poseidon/scuttle container image incorporates the ideas in this post.

In systemd shutdown units, we discussed how systemd services should be designed to terminate processes and wait for them to exit when stopped (including during shutdown). For less graceful services, we covered strategies for performing cleanup actions on their behalf on shutdown.

Let’s apply this to the Kubernetes Kubelet.

Kubelet

The Kubelet service on a machine registers itself with a Kubernetes cluster and starts containers via a container runtime (like containerd). From systemd’s perspective, the Kubelet and container runtime are responsible for (optionally) deregistering the node from the cluster and stopping Pods before shutdown - while respecting the terminationGracePeriodSeconds and preStop hooks that Kubernetes offers.

Until Kubernetes v1.21, the Kubelet didn’t handle these responsibilities. No ExecStop or signal handler neatly tears down everything the Kubelet sets up: stopping the Kubelet won’t stop Pods, and it won’t deregister the node. systemd is unaware of Kubernetes details like preStop hooks or terminationGracePeriodSeconds, or of cluster-level concepts like PodDisruptionBudgets. On shutdown, the systemd scopes underlying containers are terminated like any other processes.
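
You can see this from the host: with the systemd cgroup driver, each container is an ordinary transient scope under kubepods.slice (scope names vary by runtime; cri-containerd-*.scope is containerd’s convention):

# Inspect the cgroup tree Kubelet creates (systemd cgroup driver assumed)
systemd-cgls /kubepods.slice

# Scopes stop like any other unit - no preStop hook, no grace period
systemctl status 'cri-containerd-*.scope'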

Kubelet’s GracefulNodeShutdown (KEP) beta feature takes a stab at part of the problem.

GracefulNodeShutdown

Kubelet’s GracefulNodeShutdown uses systemd inhibitor locks to delay shutdown for a period. The Kubelet watches for shutdown actions over D-Bus, then begins terminating Pods: it overrides their grace periods, runs preStop hooks, sends SIGTERM to containers, and waits. Once the delay lock elapses (or is released early), system shutdown continues as usual.

# kubelet config (KubeletConfiguration)
kind: KubeletConfiguration
...
shutdownGracePeriod: 45s
shutdownGracePeriodCriticalPods: 30s

# /etc/systemd/logind.conf.d/inhibitors.conf
[Login]
# Must be >= shutdownGracePeriod for the full delay to be honored
InhibitDelayMaxSec=45s

systemd-inhibit
WHO            UID USER PID  COMM           WHAT     WHY                                        MODE
NetworkManager 0   root 1219 NetworkManager sleep    NetworkManager needs to turn off networks  delay
kubelet        0   root 1677 kubelet        shutdown Kubelet needs time to handle node shutdown delay

GracefulNodeShutdown ignores PodDisruptionBudgets, since a shutdown is an involuntary disruption from Kubernetes’s perspective.

Limitations

Unlike Pods evicted by a drain, Pods terminated by GracefulNodeShutdown are left behind, marked with a Failed status. These Pods persist until they’re manually deleted or the pod garbage collection threshold (default 12500 terminated Pods) is reached. That’s a lot of clutter.

Status:       Failed
Reason:       Terminated
Message:      Pod was terminated in response to imminent node shutdown.
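
Until that improves upstream, you can list (and optionally remove) these leftovers with a field selector:

# List Pods left in the Failed phase (includes shutdown-terminated Pods)
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

# Review the list, then delete them (this removes ALL Failed Pods)
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed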

Kubelet has no support for optionally deleting its node from the cluster when stopped (unlike, say, Serf or Consul), though it’s just one API call. When using spot instances, preempted nodes become clutter. Some folks run cloud-specific projects for this, but at Poseidon Labs our Typhoon clusters with spot instances span AWS, Azure, and GCP, so we favor a simpler, portable approach (below).

Finally, Kubelet’s GracefulNodeShutdown feature is specific to shutdown. It’s not a general ExecStop kind of functionality, as a (hypothetical) kubelet stop or kubelet drain command would be. And at this stage, many users even prefer that the Kubelet can restart without affecting Pods, even though this breaks from the systemd model somewhat.

Kubelet is a service that only partially handles teardown. If you read systemd shutdown units, you know what’s next. Let’s address it!

Shutdown Unit

You may have written a systemd unit to handle draining and/or deleting a node before shutdown (e.g. before GracefulNodeShutdown existed). Even in Kubernetes v1.25, that approach is still valuable for filling the gaps in Kubelet’s GracefulNodeShutdown. Let’s write a complementary systemd unit.

Poseidon Labs’ cluster nodes use container-optimized OSes like Fedora CoreOS or Flatcar Linux, so cleanup actions like kubectl drain run in a container. We’ll use the pattern of a systemd unit that ExecStop’s a running podman container.

UPDATE: There’s now a Go app and container image that handles this at poseidon/scuttle.

[Unit]
Description=Clean on shutdown
After=multi-user.target

[Service]
Type=simple
ExecStartPre=-/usr/bin/podman rm clean
ExecStart=/usr/bin/podman run \
  --name clean \
  --network host \
  --log-driver=k8s-file \
  -v /var/lib/kubelet:/var/lib/kubelet:ro,z \
  -v /usr/local/bin:/scripts \
  --stop-timeout=60 \
  --entrypoint /scripts/await \
  quay.io/poseidon/kubelet:v1.25.3
ExecStop=/usr/bin/podman stop clean
TimeoutStopSec=180
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target

This Type=simple service unit starts a podman container that waits for shutdown (avoiding the pitfalls of starting a container from ExecStop during shutdown). The unit is pulled in by multi-user.target and ordered After it, so it starts fairly late in boot. Ordered units are stopped in reverse start order, so it begins stopping before other units. Podman proxies signals (like SIGTERM) to the container’s first process, or sends SIGKILL after --stop-timeout.

On shutdown, the container process receives a SIGTERM from systemd, and podman stop sends another (we keep ExecStop to aid validation; see the commands after the script), but the await script’s trap handling ensures cleanup runs only once.

#!/bin/bash

export KUBECONFIG=/var/lib/kubelet/kubeconfig

cleanup() {
  # Clear traps so cleanup runs at most once
  trap - SIGINT SIGTERM
  echo "cleaning..."

  # Sanity check that the node is still registered
  kubectl get node $HOSTNAME

  echo "draining..."
  kubectl drain $HOSTNAME --ignore-daemonsets --delete-emptydir-data

  echo "deleting node..."
  kubectl delete node $HOSTNAME

  echo "done"
}

trap cleanup SIGINT SIGTERM

# Clear any cordon left over if a prior drain ran but the delete failed
echo "uncordon node..."
kubectl uncordon $HOSTNAME || true

echo "Awaiting signals"
# Sleep in the background so the shell can handle signals promptly
sleep infinity & wait $!
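
Before trusting it at shutdown, exercise the stop path manually; the logs should show exactly one cleanup run:

# Validate without rebooting; expect a single "cleaning..." run
sudo systemctl start clean.service
sudo systemctl stop clean.service
journalctl -u clean.service --no-pager -n 20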

The exact set of cleanup actions performed by the await script is flexible. Here I’ve shown drain, delete, and uncordon.

See the previous post for a deep dive on how systemd units like this are developed.

Drain

Using the kubectl drain step will cordon the node, then evict and remove its Pods. When used in conjunction with Kubelet’s GracefulNodeShutdown, it also cleans up the Failed Pods left behind and catches anything GracefulNodeShutdown missed.

Draining on shutdown does require that the Kubelet’s kubeconfig (or whichever credential is used) have RBAC permissions to perform a drain.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-drain
rules:
  - apiGroups: ["apps"]
    resources:
      - deployments
      - daemonsets
      - statefulsets
    verbs:
      - get
      - list
  - apiGroups: [""]
    resources:
      - pods/eviction
    verbs:
      - create
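
The role must also be bound to the identity the kubeconfig presents. As a minimal sketch, assuming Kubelets authenticate as members of the built-in system:nodes group:

# Bind the drain role to all Kubelets (assumes Kubelet credentials map
# to the built-in system:nodes group)
kubectl create clusterrolebinding example-drain \
  --clusterrole=example-drain \
  --group=system:nodes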

If you’re using the Typhoon open-source Kubernetes distro, Kubelet kubeconfigs are allowed to perform a drain, starting in v1.25.4.

Delete

Using the delete node step will automatically deregister the node from the cluster on shutdown. If the node is just rebooting (e.g. for an OS upgrade), it’ll re-join on the next startup. If the node shuts down due to instance termination or spot preemption, deleting it avoids the clutter of NotReady nodes that will never return.

Deleting on shutdown does require that the Kubelet’s kubeconfig (or whichever credential is used) have RBAC permissions to delete nodes.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-delete
rules:
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - delete
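
As with the drain role, bind it to the Kubelet identity (same system:nodes assumption):

kubectl create clusterrolebinding example-delete \
  --clusterrole=example-delete \
  --group=system:nodes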

If you’re using the Typhoon open-source Kubernetes distro, Kubelet kubeconfigs are allowed to perform a delete, starting in v1.25.4.

TLS Bootstrap

If the Kubelet’s kubeconfig is used to perform a drain and/or delete on shutdown, and the Kubelet uses TLS bootstrapping, the kubeconfig may not be present on first boot. A systemd path unit can delay starting the clean service until the kubeconfig exists:

# /etc/systemd/system/clean.path
[Unit]
Description=Watch for Kubelet kubeconfig
[Path]
PathExists=/var/lib/kubelet/kubeconfig
[Install]
WantedBy=multi-user.target
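
A path unit activates the unit sharing its name stem (here, clean.service) when its condition is met, so only clean.path needs to be enabled:

# Enable just the path unit; it starts clean.service once the kubeconfig exists
sudo systemctl enable --now clean.path
systemctl list-units 'clean*'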

Spot Tuning

Cloud platforms provide different amounts of warning between preempting a spot instance and terminating it (e.g. AWS gives a 2 minute interruption notice, while Azure and GCP give closer to 30 seconds). Tune GracefulNodeShutdown grace periods and shutdown unit timeouts to fit within that window.

Some clouds (AWS, Azure) also provide advanced warning of spot terminations via metadata. With further work, shutdown units could poll these endpoints as an alternative trigger, but that’s beyond the scope of this post.
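
For reference, AWS exposes the interruption notice via instance metadata; a sketch using IMDSv2 (the endpoint returns 404 until an interruption is scheduled):

# Query the AWS spot interruption notice (IMDSv2)
TOKEN=$(curl -sX PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -sH "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action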

Source

Assemble the systemd unit, path unit, and script into a Butane config for easy use with Fedora CoreOS or Flatcar Linux’s Ignition system (change the variant and version accordingly).

# Copyright (C) 2022 Poseidon Labs
# Copyright (C) 2022 Dalton Hubble
#
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
variant: fcos
version: 1.4.0
systemd:
  units:
    - name: clean.service
      contents: |
        [Unit]
        Description=Clean on shutdown
        After=multi-user.target
        [Service]
        Type=simple
        ExecStartPre=-/usr/bin/podman rm clean
        ExecStart=/usr/bin/podman run \
          --name clean \
          --network host \
          --log-driver=k8s-file \
          -v /var/lib/kubelet:/var/lib/kubelet:ro,z \
          -v /usr/local/bin:/scripts \
          --stop-timeout=60 \
          --entrypoint /scripts/await \
          quay.io/poseidon/kubelet:v1.25.3
        ExecStop=/usr/bin/podman stop clean
        TimeoutStopSec=180
        SuccessExitStatus=143
        [Install]
        WantedBy=multi-user.target        
    - name: clean.path
      enabled: true
      contents: |
        [Unit]
        Description=Watch for Kubelet kubeconfig
        [Path]
        PathExists=/var/lib/kubelet/kubeconfig
        [Install]
        WantedBy=multi-user.target        
storage:
  files:
    - path: /usr/local/bin/await
      mode: 0544
      contents:
        inline: |
          #!/bin/bash
          export KUBECONFIG=/var/lib/kubelet/kubeconfig

          cleanup() {
            # Clear traps so cleanup runs at most once
            trap - SIGINT SIGTERM
            echo "cleaning..."

            # Sanity check that the node is still registered
            kubectl get node $HOSTNAME

            echo "draining..."
            kubectl drain $HOSTNAME --ignore-daemonsets --delete-emptydir-data

            echo "deleting node..."
            kubectl delete node $HOSTNAME

            echo "done"
          }

          trap cleanup SIGINT SIGTERM

          # Clear any cordon left over if a prior drain ran but the delete failed
          echo "uncordon node..."
          kubectl uncordon $HOSTNAME || true

          echo "Awaiting signals"
          # Sleep in the background so the shell can handle signals promptly
          sleep infinity & wait $!

With Typhoon, the Butane config above can be added to worker node definitions as a snippet.

module "mycluster" {
  ...
  worker_snippets = [
    file("./butane/await-shutdown.yaml"),
  ]
}

Examples from this post are available in blog-bits under the MPL 2.0 license. They were tested on Fedora CoreOS 36.20221001.3.0 (systemd 250.8).

If you find a bug, please send a fix and I’ll try to get it updated.

Motivation

At Poseidon Labs, we use Kubernetes clusters on cloud platforms to develop open-source software (thanks to our sponsors). Adapting Kubelets to gracefully shut down, drain, and delete themselves from clusters has been very valuable in switching some nodes to spot instances and significantly reducing our costs.

Going Further

Further work could be done to perform cleanup actions when the Kubelet is stopped (not just on shutdown), such as adding BindsTo to the clean unit or improving the Kubelet itself.
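
As a rough sketch (untested), a systemd drop-in could bind the clean service to the Kubelet’s lifecycle, so stopping kubelet.service also stops clean.service first and triggers cleanup. The kubelet.service unit name is an assumption:

# Hypothetical drop-in tying clean.service to kubelet.service's lifetime;
# BindsTo propagates stops, After= makes clean.service stop first
sudo mkdir -p /etc/systemd/system/clean.service.d
sudo tee /etc/systemd/system/clean.service.d/10-bindsto.conf <<'EOF' >/dev/null
[Unit]
BindsTo=kubelet.service
After=kubelet.service
EOF
sudo systemctl daemon-reload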

Appreciate content like this?

  • Follow @poseidonlabs for future blog posts
  • Consider supporting Poseidon’s open-source work by joining our amazing sponsors
  • Get help with spot instances or graceful shutdown in your infrastructure. Email tech@psdn.io about a consultation
