GPU-Enabled Kubernetes Clusters with Tanzu Kubernetes Grid

Mon 24 May 2021

Stated Goal

We want easy GPU access for Kubernetes workloads in our TKG clusters. This is done by:

  • Installing GPU device drivers on our Kubernetes worker nodes
  • Installing the device plugins on our Kubernetes worker nodes
  • Applying the appropriate labels to our Kubernetes worker nodes so that GPU workloads can find them.

The Nvidia GPU Operator does all three for us.

I'm Impatient. Just Tell Me What to Type

If you just want to get up and running and move on with life, here you go.

# Create a Kubernetes cluster, specifying Ubuntu for our base image and the g3s instance type for worker nodes
OS_NAME=ubuntu NODE_MACHINE_TYPE=g3s.xlarge tanzu cluster create aws-ubuntu-gpu --plan=dev 

# Add the Nvidia gpu-operator Helm repo
helm repo add nvidia https://nvidia.github.io/gpu-operator 
helm repo update 

# Deploy the gpu operator, this time into its own namespace
kubectl create ns test-operator 
helm install -n test-operator --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd 

Solution, Explained

When you run a gpu-enabled workload on any machine, you have to install device drivers for that device. This is no different when you move to running that workload on a Kubernetes cluster. Sure, you can manually install the drivers on your Kubernetes workloads (or automate it the manual installation) but that's not how Kubernetes works. You don't want to have to manage the machines in your cluster - you want to manage your cluster. Hence, the Nvidia GPU Operator.

For a nice, detailed writeup of the Nvidia GPU Operator, see this writeup on their site.

With that, let's walk through those commands.

Deploy a GPU-enabled Kubernetes Cluster

We're using Tanzu to create our Kubernetes cluster. There are multiple ways to achieve this, this command is but one of those ways.

OS_NAME=ubuntu NODE_MACHINE_TYPE=g3s.xlarge tanzu cluster create aws-ubuntu-gpu --plan=dev 

With the tanzu cli, you can configure various profiles via config files or, as we did above, you can use environment variables. When deploying to AWS, tanzu cluster create will default to using Amazon Linux 2 as the base OS. We override this with the OS_NAME environment variable. Likewise, we use NODE_MACHINE_TYPE to specify an instance type that has a GPU attached to it.

See the docs on Tanzu for more details on how to configure TKG.

See the AWS docs for other GPU-enabled instance types.

Once complete, you can fetch the credentials to your cluster with:

tanzu cluster kubeconfig get aws-ubuntu-gpu --admin

Next, is putting the GPU operator on there.

Deploy the Nvidia GPU Operator

Super simple, just add the Helm repo and deploy, specifying the containerd run time since TKG clusters use ContainerD instead of Docker (the default runtime for the operator is Docker).

helm repo add nvidia https://nvidia.github.io/gpu-operator 
helm repo update 

# Put the operator in its own namespace for easy management
kubectl create ns test-operator 
helm install \
  -n test-operator \
  --wait \
  --generate-name \
  nvidia/gpu-operator \
  --set operator.defaultRuntime=containerd 

This will take some time, as the operator is going to detect the kernel version on your Kubernetes worker node, fetch & install the driver, and install the plugin, and label the appropriate Kubernetes worker nodes.

A good indicator that the install process is complete (or nearly so), is to glance into the gpu-operator-resources namespace:

$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
nvidia-container-toolkit-daemonset-wwzfn   1/1     Running     0          3m36s
nvidia-device-plugin-daemonset-pwfq7       1/1     Running     0          101s
nvidia-device-plugin-validation            0/1     Completed   0          92s
nvidia-driver-daemonset-skpn7              1/1     Running     0          3m27s
nvidia-driver-validation                   0/1     Completed   0          3m

Test, and profit.

I just used a simple workload from the Kubernetes docs, copy/pasted here for ease.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

The logs from that pod should look something like this on success:

kubectl logs cuda-vector-add
nvidia 34037760 47 nvidia_modeset,nvidia_uvm, Live 0xffffffffc05da000 (POE)
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Congrats! You now have a GPU-enabled Kubernetes cluster. This showed AWS, but for another cloud, all you'd need to do is specify the appropriate GPU-enabled instance. Everything else is the same.

Now What?

Let's kick the tires! There are many things you can do, but if you're just getting started, try out a Jupyter notebook example, found here, copy/pasted below for completeness:

apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
      name: notebook

Once you've deployed it, you can fetch the token with:

kubectl logs -f tf-notebook

Conclusion

If you're looking for an easy way to get GPU-enabled Kubernetes clusters, you're just about out of excuses now. Between VMware and Nvidia, you're about 30 minutes away from your first cluster where you can start experimenting with the new platform. Enjoy!

social