Cluster Node Autoscaling in Tanzu Kubernetes Grid (TKG)

Tue 19 October 2021

Problem

We need to reliably implement cluster autoscaling for our Kubernetes platform. How can we do that? And how can we know it's working?

Solution

TKG uses Cluster Autoscaler with its Cluster API provider and makes it easy to configure. We'll ask TKG to enable Cluster Autoscaler for us, dial in a few options, and then test it out.

Note: As of this writing, the latest version of TKG is 1.4. This blog post is based on the docs for that version as well as a helpful blog post by Chris Little.

Create a Kubernetes Cluster with Autoscaling Enabled

Basically, we just need to set 'ENABLE_AUTOSCALER: true' when deploying a TKG cluster. The simplest way to do this is to include it in your cluster-config.yml, or whichever config file you're maintaining for your cluster. There are a few other options as well, which I'll lay out below; most already have sane defaults.

For a complete description of all the values, see the docs.

#! ---------------------------------------------------------------------
#! Autoscaler related configuration
#! ---------------------------------------------------------------------
ENABLE_AUTOSCALER: true
AUTOSCALER_MAX_NODES_TOTAL: "0"
AUTOSCALER_SCALE_DOWN_DELAY_AFTER_ADD: "10m"
AUTOSCALER_SCALE_DOWN_DELAY_AFTER_DELETE: "10s"
AUTOSCALER_SCALE_DOWN_DELAY_AFTER_FAILURE: "3m"
AUTOSCALER_SCALE_DOWN_UNNEEDED_TIME: "10m"
AUTOSCALER_MAX_NODE_PROVISION_TIME: "15m"

# Each min/max pair (0,1,2) corresponds to an availability zone. 
# If you have a 'dev' cluster, you only need to fill in the min/max size for '0'.
# If you have a 'prod' cluster, you need to fill in all three pairs.
AUTOSCALER_MIN_SIZE_0: 3
AUTOSCALER_MAX_SIZE_0: 10
AUTOSCALER_MIN_SIZE_1:
AUTOSCALER_MAX_SIZE_1:
AUTOSCALER_MIN_SIZE_2:
AUTOSCALER_MAX_SIZE_2:

With that in place, creating the TKG cluster is a single command.

# NOTE: we created the file 'tkg-cluster-with-autoscaling.yml'
tanzu cluster create -f ~/.config/tanzu/tkg/cluster-configs/tkg-cluster-with-autoscaling.yml --plan dev aws-autoscale-01 -w 3

Once our cluster is alive, how do we know it worked? Two indicators.

Examine the Management Cluster

The management cluster, in the default namespace, will have a cluster-autoscaler deployment created for your new workload cluster. It should be healthy.

# Switch to our management cluster
$ kubectx us-east-2-mc-admin@us-east-2-mc
✔ Switched to context "us-east-2-mc-admin@us-east-2-mc".

$ kubectl -n default get deployments
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
aws-autoscale-01-cluster-autoscaler   1/1     1            1           5d

Examine the Workload Cluster

On the workload cluster (the TKG cluster you created), Cluster Autoscaler creates a configmap in the 'kube-system' namespace to track its status. You can view that configmap.

$ kubectx aws-autoscale-01-admin@aws-autoscale-01
✔ Switched to context "aws-autoscale-01-admin@aws-autoscale-01".

$ kubectl -n kube-system get configmap
NAME                                 DATA   AGE
cluster-autoscaler-status            1      5d


$ kubectl -n kube-system describe configmap cluster-autoscaler-status
Name:         cluster-autoscaler-status
Namespace:    kube-system
Labels:       <none>
Annotations:  cluster-autoscaler.kubernetes.io/last-updated: 2021-10-19 19:06:41.612291737 +0000 UTC

Data
====
status:
----
Cluster-autoscaler status at 2021-10-19 19:06:41.612291737 +0000 UTC:
Cluster-wide:
  Health:      Healthy (ready=4 unready=0 notStarted=0 longNotStarted=0 registered=4 longUnregistered=0)
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-14 18:49:01.674656828 +0000 UTC m=+10.178995230
  ScaleUp:     NoActivity (ready=4 registered=4)
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-18 15:11:12.255597844 +0000 UTC m=+332540.759936180
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-18 15:27:44.207345103 +0000 UTC m=+333532.711683437

NodeGroups:
  Name:        MachineDeployment/default/aws-autoscale-01-md-0
  Health:      Healthy (ready=3 unready=0 notStarted=0 longNotStarted=0 registered=3 longUnregistered=0 cloudProviderTarget=3 (minSize=3, maxSize=10))
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-14 18:50:53.782671384 +0000 UTC m=+122.287009725
  ScaleUp:     NoActivity (ready=3 cloudProviderTarget=3)
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-18 15:11:12.255597844 +0000 UTC m=+332540.759936180
  ScaleDown:   NoCandidates (candidates=0)
               LastProbeTime:      2021-10-19 19:06:40.008835646 +0000 UTC m=+433068.513173981
               LastTransitionTime: 2021-10-18 15:27:44.207345103 +0000 UTC m=+333532.711683437



BinaryData
====

Events:  <none>

Great! So we now know that Cluster Autoscaler is installed, and it's configured our k8s cluster to have three worker nodes (see the 'Healthy (ready=3' part in the output above). Let's test it.

Testing Cluster Autoscaler

We have a 'dev' cluster, so that means one control plane node, and three worker nodes (we asked for three with the -w 3 flag in the tanzu cluster create command).

$ kubectl get nodes
NAME                                       STATUS   ROLES                  AGE     VERSION
ip-10-0-0-148.us-east-2.compute.internal   Ready    <none>                 5d      v1.21.2+vmware.1
ip-10-0-0-154.us-east-2.compute.internal   Ready    <none>                 27h     v1.21.2+vmware.1
ip-10-0-0-218.us-east-2.compute.internal   Ready    control-plane,master   5d      v1.21.2+vmware.1
ip-10-0-0-250.us-east-2.compute.internal   Ready    <none>                 4d22h   v1.21.2+vmware.1

We set a max of 10 nodes, so we should be able to test this out. Let's look at a super simple toy app, say php-apache.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 3
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache

Let's deploy it.

$ kapp deploy -a load-test -f php-apache.yml 
Target cluster 'https://aws-autoscale-01-apiserver-753085411.us-east-2.elb.amazonaws.com:6443' (nodes: ip-10-0-0-218.us-east-2.compute.internal, 3+)

Changes

Namespace  Name        Kind        Conds.  Age  Op      Op st.  Wait to    Rs  Ri  
default    php-apache  Deployment  -       -    create  -       reconcile  -   -  
^          php-apache  Service     -       -    create  -       reconcile  -   -  

Op:      2 create, 0 delete, 0 update, 0 noop
Wait to: 2 reconcile, 0 delete, 0 noop

Continue? [yN]: y

3:16:46PM: ---- applying 2 changes [0/2 done] ----
3:16:46PM: create service/php-apache (v1) namespace: default
3:16:47PM: create deployment/php-apache (apps/v1) namespace: default
3:16:47PM: ---- waiting on 2 changes [0/2 done] ----
3:16:47PM: ok: reconcile service/php-apache (v1) namespace: default
3:16:47PM: ongoing: reconcile deployment/php-apache (apps/v1) namespace: default
3:16:47PM:  ^ Waiting for 3 unavailable replicas
3:16:47PM:  L ok: waiting on replicaset/php-apache-78cc655ff (apps/v1) namespace: default
3:16:47PM:  L ongoing: waiting on pod/php-apache-78cc655ff-w4zk7 (v1) namespace: default
3:16:47PM:     ^ Pending: ContainerCreating
3:16:47PM:  L ongoing: waiting on pod/php-apache-78cc655ff-qmsdd (v1) namespace: default
3:16:47PM:     ^ Pending: ContainerCreating
3:16:47PM:  L ongoing: waiting on pod/php-apache-78cc655ff-fp2fp (v1) namespace: default
3:16:47PM:     ^ Pending: ContainerCreating
3:16:47PM: ---- waiting on 1 changes [1/2 done] ----
3:16:49PM: ok: reconcile deployment/php-apache (apps/v1) namespace: default
3:16:49PM: ---- applying complete [2/2 done] ----
3:16:49PM: ---- waiting complete [2/2 done] ----

Succeeded

And check for success:

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
php-apache-78cc655ff-fp2fp   1/1     Running   0          39s
php-apache-78cc655ff-qmsdd   1/1     Running   0          39s
php-apache-78cc655ff-w4zk7   1/1     Running   0          39s

Okay, great. But what if we make a ludicrous request? Something beyond what three nodes can handle?

$ sed -i 's/replicas: 3/replicas: 300/g' php-apache.yml
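
As an aside: if you weren't managing the app through the yaml, you could bump the replica count directly, at the cost of drifting away from the file (so we'll stick with sed and kapp here):

$ kubectl scale deployment php-apache --replicas=300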

And then....

$ kapp deploy -a load-test -f php-apache.yml 
Target cluster 'https://aws-autoscale-01-apiserver-753085411.us-east-2.elb.amazonaws.com:6443' (nodes: ip-10-0-0-218.us-east-2.compute.internal, 3+)

Changes

Namespace  Name        Kind        Conds.  Age  Op      Op st.  Wait to    Rs  Ri  
default    php-apache  Deployment  2/2 t   3m   update  -       reconcile  ok  -  

Op:      0 create, 0 delete, 1 update, 0 noop
Wait to: 1 reconcile, 0 delete, 0 noop

Continue? [yN]: y

.
.
.
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-5xq9v (v1) namespace: default
3:21:16PM:     ^ Pending: Unschedulable (message: 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient cpu.)
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-5wzns (v1) namespace: default
3:21:16PM:     ^ Pending: Unschedulable (message: 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient cpu.)
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-5mv4j (v1) namespace: default
3:21:16PM:     ^ Pending
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-5mkvj (v1) namespace: default
3:21:16PM:     ^ Pending: ContainerCreating
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-5c5qk (v1) namespace: default
3:21:16PM:     ^ Pending: Unschedulable (message: 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient cpu.)
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-57q2m (v1) namespace: default
3:21:16PM:     ^ Pending
3:21:16PM:  L ongoing: waiting on pod/php-apache-78cc655ff-4g4lm (v1) namespace: default
.
.
.

Ah! We're all out of space on our Kubernetes cluster!

Unschedulable (message: 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient cpu.)

This is telling us that there are 4 nodes, but none of them can take the pod: one is the control plane, which the pod doesn't tolerate, and the other three are out of CPU.

Let's see how many pods are failing to deploy.
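
A watch in another terminal does the trick (the exact invocation is visible in the header of the output below):

$ watch "kubectl get pods | grep '0/1' | wc -l"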

Every 2.0s: kubectl get pods | grep '0/1' | wc -l                              samus: Tue Oct 19 15:23:01 2021

152

Okay, now let's check that configmap on our workload cluster.
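
It's the same kubectl describe as before, trimmed here to the NodeGroups section:

$ kubectl -n kube-system describe configmap cluster-autoscaler-status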

 NodeGroups:                                                                                   
   Name:        MachineDeployment/default/aws-autoscale-01-md-0                                
   Health:      Healthy (ready=8 unready=0 notStarted=0 longNotStarted=0 registered=8 longUnreg
                LastProbeTime:      2021-10-19 19:25:42.319883574 +0000 UTC m=+434210.824221910
                LastTransitionTime: 2021-10-14 18:50:53.782671384 +0000 UTC m=+122.287009725   
   ScaleUp:     NoActivity (ready=8 cloudProviderTarget=8)                                     
                LastProbeTime:      2021-10-19 19:25:42.319883574 +0000 UTC m=+434210.824221910
                LastTransitionTime: 2021-10-19 19:22:31.744698667 +0000 UTC m=+434020.249037002

We can see that it has scaled up to 8 worker nodes. In fact, by the time I got around to checking the configmap, it had already scaled up to 8 and joined the new machines to the cluster. Let's run kubectl get nodes again and count the workers.

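$ kubectl get nodes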
NAME                                       STATUS   ROLES                  AGE     VERSION
ip-10-0-0-130.us-east-2.compute.internal   Ready    <none>                 4m58s   v1.21.2+vmware.1
ip-10-0-0-148.us-east-2.compute.internal   Ready    <none>                 5d      v1.21.2+vmware.1
ip-10-0-0-154.us-east-2.compute.internal   Ready    <none>                 28h     v1.21.2+vmware.1
ip-10-0-0-203.us-east-2.compute.internal   Ready    <none>                 5m3s    v1.21.2+vmware.1
ip-10-0-0-218.us-east-2.compute.internal   Ready    control-plane,master   5d      v1.21.2+vmware.1
ip-10-0-0-226.us-east-2.compute.internal   Ready    <none>                 5m      v1.21.2+vmware.1
ip-10-0-0-250.us-east-2.compute.internal   Ready    <none>                 4d23h   v1.21.2+vmware.1
ip-10-0-0-33.us-east-2.compute.internal    Ready    <none>                 5m      v1.21.2+vmware.1
ip-10-0-0-72.us-east-2.compute.internal    Ready    <none>                 4m59s   v1.21.2+vmware.1

That's 8 worker nodes (plus the control plane). Sweet! How about our kapp command? Did it ever reconcile?

3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-2d8ts (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-29xkp (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-27f62 (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-26vq5 (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-268m5 (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-252qt (v1) namespace: default
3:23:17PM:  L ok: waiting on pod/php-apache-78cc655ff-22dcd (v1) namespace: default
3:23:18PM: ok: reconcile deployment/php-apache (apps/v1) namespace: default
3:23:18PM: ---- applying complete [1/1 done] ----
3:23:18PM: ---- waiting complete [1/1 done] ----

Succeeded

Yes. kapp waited for the deployment to succeed before returning control.

And all the while, our 'watch' command was ticking down to zero:

Every 2.0s: kubectl get pods | grep '0/1' | wc -l                              samus: Tue Oct 19 15:28:28 2021

0

And the pods are all deployed.

$ kubectl get pods | wc -l
301

Success!

What about scaling down?

Scale Down the Kubernetes Cluster

TL;DR - Do nothing. Cluster Autoscaler does this for you. Read on for more.

By default, Cluster Autoscaler removes a node only after it has been unneeded for ten minutes (the AUTOSCALER_SCALE_DOWN_UNNEEDED_TIME value we set above). That means if we scale our requested number of pods back down and wait a little while, the Kubernetes cluster will scale down too. Let's do that.

$ sed -i 's/replicas: 300/replicas: 3/g' php-apache.yml

$ kapp deploy -a load-test -f php-apache.yml

Now we wait, and check back in. After a while, the unneeded nodes become unschedulable, showing that they're being decommissioned.

$ kubectl get nodes
NAME                                       STATUS                        ROLES                  AGE     VERSION
ip-10-0-0-130.us-east-2.compute.internal   NotReady,SchedulingDisabled   <none>                 20m     v1.21.2+vmware.1
ip-10-0-0-148.us-east-2.compute.internal   Ready                         <none>                 5d      v1.21.2+vmware.1
ip-10-0-0-154.us-east-2.compute.internal   Ready                         <none>                 28h     v1.21.2+vmware.1
ip-10-0-0-203.us-east-2.compute.internal   NotReady,SchedulingDisabled   <none>                 20m     v1.21.2+vmware.1
ip-10-0-0-218.us-east-2.compute.internal   Ready                         control-plane,master   5d      v1.21.2+vmware.1
ip-10-0-0-226.us-east-2.compute.internal   Ready                         <none>                 20m     v1.21.2+vmware.1
ip-10-0-0-250.us-east-2.compute.internal   NotReady,SchedulingDisabled   <none>                 4d23h   v1.21.2+vmware.1
ip-10-0-0-33.us-east-2.compute.internal    NotReady,SchedulingDisabled   <none>                 20m     v1.21.2+vmware.1
ip-10-0-0-72.us-east-2.compute.internal    NotReady,SchedulingDisabled   <none>                 20m     v1.21.2+vmware.1

And after another period of time, they'll be gone entirely.

$ kubectl get nodes
NAME                                       STATUS                        ROLES                  AGE   VERSION
ip-10-0-0-148.us-east-2.compute.internal   Ready                         <none>                 5d    v1.21.2+vmware.1
ip-10-0-0-154.us-east-2.compute.internal   Ready                         <none>                 28h   v1.21.2+vmware.1
ip-10-0-0-218.us-east-2.compute.internal   Ready                         control-plane,master   5d    v1.21.2+vmware.1
ip-10-0-0-226.us-east-2.compute.internal   Ready                         <none>                 22m   v1.21.2+vmware.1

And we're back down to three worker nodes.

What about changing these settings later? There's no direct knob in TKG to dial, but since this is all Cluster API under the hood, there's still a way to do it.

Edit Min/Max Node Configurations

Since this solution is built on the Cluster API provider, we can jump into the management cluster and edit the annotations on the MachineDeployment that backs our workload cluster. Need a refresher on MachineDeployments? See the relevant section in the Cluster API docs.

Let's find, and edit, our Machine Deployment.

$ kubectx us-east-2-mc-admin@us-east-2-mc
✔ Switched to context "us-east-2-mc-admin@us-east-2-mc".

$ kubectl get machinedeployments
NAME                    PHASE     REPLICAS   READY   UPDATED   UNAVAILABLE
aws-autoscale-01-md-0   Running   3          3       3         

$ kubectl edit machinedeployments/aws-autoscale-01-md-0

Here's the relevant section:

apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  annotations:
    cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
    cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3"

We're currently at a max size of 10. If we need to change that, we can change it right here, just like editing any other Kubernetes object on the fly. Then remember to update your cluster config file as well; we don't want to introduce config drift.
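
If you'd rather not open an editor, kubectl annotate should be able to set the same annotation from the command line; the max of 15 below is just an illustration, not a recommendation:

$ kubectl annotate machinedeployment aws-autoscale-01-md-0 cluster.k8s.io/cluster-api-autoscaler-node-group-max-size="15" --overwrite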

Conclusion

And that's it! We can now turn on Cluster Autoscaler for our Kubernetes nodes, see it take action, and change the configs if necessary. Now we can get back to the important work, helping devs onboard their apps to our Kubernetes platform.
