TKG 1.1 Upgrade and Auto-Remediation

Tanzu Kubernetes Grid (TKG) is VMware’s Kubernetes distribution that provides a consistent Kubernetes offering across cloud environments. TKG allows for the declarative deployment of Kubernetes clusters based on the upstream Cluster API project. If you want to learn more about TKG and how to install it, have a look at this excellent blog post by William Lam. In addition, there is a great blog article from Tom Schwaller on how to use TKG with multiple vSphere clusters, here.

TKG 1.1 just got released, and it comes with Kubernetes 1.18.2 and a few other useful enhancements such as:

  • TKG CLI for Windows
  • Cluster API MachineHealthCheck is deployed by default (Auto-Remediation)
  • Configure different sizes for master and worker nodes on the management cluster
  • Bring your own VPC for AWS deployments
  • Upgrade functionality from TKG 1.0 to 1.1 and Kubernetes 1.18.2
  • … Have a look at the TKG 1.1 release notes for more

If you want to know what’s new in Kubernetes 1.18, have a look at the following CNCF webinar.

In this blog post, I want to describe and demonstrate how to upgrade to TKG 1.1 on vSphere. Please note, I am not going to cover the TKG upgrade on AWS even though some of the steps might be identical.

Preparation

First of all, we need to download the new TKG 1.1 binaries here. Make sure to download the TKG CLI binary for your OS, the Photon Kubernetes 1.18.2 OVA, and the Photon haproxy 1.2.4 OVA.

Import the two TKG OVA images into your vSphere environment.

After going through the Import process and specifying the necessary values, we need to convert both VMs to Templates.
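
If you prefer the command line over the vSphere Client, the same two steps can also be done with govc. This is just an optional sketch for my environment; govc is not part of the TKG tooling, and the OVA file names, folder, and datastore are placeholders that you will need to adjust.

# govc reads the vCenter connection from GOVC_URL, GOVC_USERNAME and GOVC_PASSWORD
govc import.ova -name photon-3-kube-v1.18.2+vmware.1 ./photon-3-kube-v1.18.2+vmware.1.ova
govc import.ova -name photon-3-haproxy-v1.2.4+vmware.1 ./photon-3-haproxy-v1.2.4+vmware.1.ova
govc vm.markastemplate photon-3-kube-v1.18.2+vmware.1
govc vm.markastemplate photon-3-haproxy-v1.2.4+vmware.1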

The last step of the preparation is to unzip and exchange the TKG binaries on your client. I am using macOS; if you are using a different OS, have a look at the documentation here. Please note that the TKG CLI for Windows was not available with version 1.0, and therefore upgrading existing TKG clusters is not supported with the TKG CLI for Windows.

➜  ~ cd Downloads
➜  Downloads gunzip tkg-darwin-amd64-v1.1.0-vmware.1.gz
➜  Downloads mv tkg-darwin-amd64-v1.1.0-vmware.1 tkg
➜  Downloads chmod +x tkg
➜  Downloads mv tkg /usr/local/bin/tkg
➜  Downloads tkg version
Client:
    Version: v1.1.0
    Git commit: 1faa437638dc81ed234721b2dbc2ad51ca9ec928

Upgrade Management Cluster

We can now start with the management cluster upgrade. First of all, we need to see if we have multiple management clusters. Execute the following command with the new TKG CLI.

➜  ~ tkg get management-cluster
It seems that you have an outdated providers on your filesystem, proceeding on this command will cause your tkgconfig file to be overridden. You may backup your tkgconfig file before moving on.
? Do you want to continue?? [y/N]

As you can see, TKG discovered outdated providers on our client. The TKG config file will be overridden if we proceed. So let’s create a backup of our config file before we continue. The TKG config file can be found under the hidden folder “.tkg” in your home directory. I simply copied the file to a backup folder, as described here.

➜  ~ cd .tkg
➜  .tkg ls
config.yaml     providers
➜  .tkg mkdir backup
➜  .tkg cp config.yaml backup/
➜  .tkg cd backup
➜  backup ls
config.yaml

Let’s continue by pressing “y“. This will trigger an update of the providers and the config file.

➜  ~ tkg get management-cluster
It seems that you have an outdated providers on your filesystem, proceeding on this command will cause your tkgconfig file to be overridden. You may backup your tkgconfig file before moving on.
? Do you want to continue?? [y/N] y
the old providers folder /Users/aullah/.tkg/providers is backed up to /Users/aullah/.tkg/providers-20200522132438
 MANAGEMENT-CLUSTER-NAME  CONTEXT-NAME
 tkgmgmtcl *              tkgmgmtcl-admin@tkgmgmtcl

Let’s run diff on the config files to see what has changed. Lines starting with “<” come from the new config file, and lines starting with “>” come from the backup of the old one.

➜  .tkg diff config.yaml backup/config.yaml
1d0
< # Obsolete. Please use '-k' or '--kubernetes-version' to override the default kubernetes version
6c5
<     url: /Users/aullah/.tkg/providers/cluster-api/v0.3.5/core-components.yaml
---
>     url: /Users/aullah/.tkg/providers/cluster-api/v0.3.3/core-components.yaml
9c8
<     url: /Users/aullah/.tkg/providers/infrastructure-aws/v0.5.3/infrastructure-components.yaml
---
>     url: /Users/aullah/.tkg/providers/infrastructure-aws/v0.5.2/infrastructure-components.yaml
12c11
<     url: /Users/aullah/.tkg/providers/infrastructure-vsphere/v0.6.4/infrastructure-components.yaml
---
>     url: /Users/aullah/.tkg/providers/infrastructure-vsphere/v0.6.3/infrastructure-components.yaml
18c17
<     url: /Users/aullah/.tkg/providers/bootstrap-kubeadm/v0.3.5/bootstrap-components.yaml
---
>     url: /Users/aullah/.tkg/providers/bootstrap-kubeadm/v0.3.3/bootstrap-components.yaml
21c20
<     url: /Users/aullah/.tkg/providers/control-plane-kubeadm/v0.3.5/control-plane-components.yaml
---
>     url: /Users/aullah/.tkg/providers/control-plane-kubeadm/v0.3.3/control-plane-components.yaml
24,25d22
<     all:
<         repository: registry.tkg.vmware.run/cluster-api
52,55d48
< cert-manager-timeout: 30m0s
< NODE_STARTUP_TIMEOUT: 20m
< release:
<     version: v1.1.0

We can see that the providers have been updated to new versions, and a few other parameters were added to the config file. The first line is particularly interesting: a setting from the old config file has become obsolete, and we should use the “-k” or “--kubernetes-version” flag instead. This refers to the “KUBERNETES_VERSION:” parameter that was specified in the old config file.
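
In practice, this means the Kubernetes version is no longer pinned in the config file but passed on the command line instead. For example (the cluster name and plan below are just placeholders, and with TKG 1.1 the flag defaults to v1.18.2+vmware.1 anyway), the Kubernetes version for a new workload cluster can be overridden like this:

➜  ~ tkg create cluster my-cluster --plan=dev --kubernetes-version=v1.18.2+vmware.1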

If you have multiple management clusters, make sure to select the one you intend to upgrade with the tkg set management-cluster command.

➜  ~ tkg set management-cluster tkgmgmtcl
The current management cluster context is switched to tkgmgmtcl

Before we start the upgrade, we can check the Kubernetes version of our clusters with the following command.

➜ ~  tkg get cluster --include-management-cluster
 NAME       NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES
 tkgcl1     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgcl2     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgmgmtcl  tkg-system  running  1/1           1/1      v1.17.3+vmware.2

Now let’s upgrade our management cluster by executing the tkg upgrade management-cluster command and specifying the name of the cluster.

➜  ~ tkg upgrade management-cluster tkgmgmtcl
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200522T133546949883492.log
Upgrading management cluster 'tkgmgmtcl' to TKG version 'v1.1.0' with Kubernetes version 'v1.18.2+vmware.1'. Are you sure?: y
Upgrading management cluster providers...
Performing upgrade...
Performing upgrade...
Upgrading Provider="capi-system/cluster-api" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="cluster-api" Version="" TargetNamespace="capi-system"
Installing Provider="cluster-api" Version="v0.3.5" TargetNamespace="capi-system"
Upgrading Provider="capi-kubeadm-bootstrap-system/bootstrap-kubeadm" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="bootstrap-kubeadm" Version="" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-bootstrap-system"
Upgrading Provider="capi-kubeadm-control-plane-system/control-plane-kubeadm" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="control-plane-kubeadm" Version="" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-control-plane-system"
Upgrading Provider="capv-system/infrastructure-vsphere" CurrentVersion="" TargetVersion="v0.6.4"
Deleting Provider="infrastructure-vsphere" Version="" TargetNamespace="capv-system"
Installing Provider="infrastructure-vsphere" Version="v0.6.4" TargetNamespace="capv-system"
Management cluster providers upgraded successfully...
Upgrading management cluster kubernetes version...
Creating management cluster client...
Verifying kubernetes version...
Retrieving configuration for upgrade cluster...
Create InfrastructureTemplate for upgrade...
Upgrading control plane nodes...
Patching KubeadmControlPlane with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes...
Patching MachineDeployment with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for worker nodes...
Management cluster 'tkgmgmtcl' is being upgraded to TKG version 'v1.1.0' with kubernetes version 'v1.18.2+vmware.1'

In the vSphere Client, we should soon see a clone operation that will create a new control plane node for our management cluster.

The upgrade process first replaces the control plane nodes and then continues with the worker nodes. The upgrade is done in a rolling fashion, node by node.

It will also clean up the old control plane and worker node VMs.
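
If you want to follow the rolling replacement from the Kubernetes side as well, you can watch the Cluster API Machine objects on the management cluster while the upgrade is running. The namespace comes from the tkg get cluster output above; I am only showing the command here, not its output:

(⎈ |tkgmgmtcl-admin@tkgmgmtcl:default)➜  ~ kubectl get machines -n tkg-system -w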

The management cluster upgrade finished successfully, and we can now check the Kubernetes version of our clusters again.

➜  ~ tkg get cluster --include-management-cluster
 NAME       NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES
 tkgcl1     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgcl2     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgmgmtcl  tkg-system  running  1/1           1/1      v1.18.2+vmware.1.2

As we can see, it reports version 1.18.2 for the management cluster. After switching to the kubectl context for the management cluster, we can also check the Kubernetes version of the nodes.

(⎈ |tkgmgmtcl-admin@tkgmgmtcl:default)➜  ~ kubectl get nodes -o wide
NAME                              STATUS   ROLES    AGE   VERSION            INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                 KERNEL-VERSION   CONTAINER-RUNTIME
tkgmgmtcl-control-plane-dzpcq     Ready    master   22m   v1.18.2+vmware.1   172.16.10.169   172.16.10.169   VMware Photon OS/Linux   4.19.115-3.ph3   containerd://1.3.4
tkgmgmtcl-md-0-79d849c4d8-6mkw4   Ready    <none>   11m   v1.18.2+vmware.1   172.16.10.107   172.16.10.107   VMware Photon OS/Linux   4.19.115-3.ph3   containerd://1.3.4

Please note that the haproxy VMs are currently not replaced by the upgrade process, as this could result in new IPs for the control planes. This is a known limitation that is actively being worked on. I would assume that this will change as soon as we offer static IP allocation with TKG.

Upgrade Workload Clusters

To start the upgrade of the individual workload clusters, simply execute the tkg upgrade cluster command while specifying the cluster name.

➜  ~ tkg upgrade cluster tkgcl2
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200522T141816680388983.log
Upgrading workload cluster 'tkgcl2' to kubernetes version 'v1.18.2+vmware.1'. Are you sure?: y
Creating management cluster client...
Validating configuration...
Creating workload cluster client...
Verifying kubernetes version...
Retrieving configuration for upgrade cluster...
Create InfrastructureTemplate for upgrade...
Upgrading control plane nodes...
Patching KubeadmControlPlane with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes...
Patching MachineDeployment with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for worker nodes...
Cluster 'tkgcl2' successfully upgraded to kubernetes version 'v1.18.2+vmware.1'

Done, we have successfully upgraded TKG to version 1.1 and the Kubernetes clusters to version 1.18.2.

Post Upgrade Tasks

After the upgrade, I realized that there are still old parameters in the tkg config file. The VSPHERE_TEMPLATE and VSPHERE_HAPROXY_TEMPLATE parameters are pointing to the old templates from TKG 1.0. That’s why a tkg create cluster command will fail with the following error.

➜  ~ tkg create cluster tkgcl3 --plan=dev
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200526T125424630689398.log
Creating workload cluster 'tkgcl3'...

Validating configuration...

Error: : workload cluster configuration validation failed: vSphere config validation failed: vSphere template kubernetes version validation failed: Kubernetes version (v1.18.2+vmware.1) does not match that of the VM template (v1.17.3+vmware.2)

Detailed log about the failure can be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200526T125424630689398.log

Therefore, we need to edit the tkg config file under the hidden “.tkg” folder. Simply change the two parameters to the location and name of the new TKG 1.1 templates.

VSPHERE_TEMPLATE: /ED/vm/Templates/photon-3-kube-v1.18.2+vmware.1
VSPHERE_HAPROXY_TEMPLATE: /ED/vm/Templates/photon-3-haproxy-v1.2.4+vmware.1

Now we can create TKG clusters from the new templates with the tkg create cluster command.
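
For example, re-running the command from above should now pass the template validation (I am only showing the command here):

➜  ~ tkg create cluster tkgcl3 --plan=dev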

Auto-Remediation with MachineHealthCheck

A new feature that I want to highlight is the MachineHealthCheck that is created for every newly deployed TKG cluster. The Cluster API MachineHealthCheck checks and remediates unhealthy machines (worker nodes) in a TKG Kubernetes cluster. We can inspect the MachineHealthCheck object with the following command on the TKG management cluster.

(⎈ |tkgmgmtcl-admin@tkgmgmtcl:default)➜  ~ kubectl get MachineHealthCheck -A
NAMESPACE   NAME     MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default     tkgcl2   100%           1                  1

I was confused about the MAXUNHEALTHY value of 100% that was returned by the command. What does it mean? After reading through the Cluster API documentation here, I learned that this value is a circuit breaker that prevents further remediation once the cluster has reached a certain percentage of unhealthy nodes. If the value is set to 100%, remediation will always be attempted, no matter how many nodes are in an unhealthy state.
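
To inspect the full spec, you can dump the object with kubectl get machinehealthcheck tkgcl2 -o yaml. As a rough illustration of what such an object looks like with the Cluster API v1alpha3 providers, here is a trimmed sketch; the selector labels and condition timeouts are placeholders and will differ from what TKG actually generates:

apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: tkgcl2
  namespace: default
spec:
  clusterName: tkgcl2
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m
  selector:
    matchLabels:
      # placeholder label; check the real object for the labels TKG sets on its worker machines
      node-pool: tkgcl2-worker-pool
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 5m
  - type: Ready
    status: "False"
    timeout: 5m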

The Auto-Remediation can be tested easily by shutting down a worker node VM from the vSphere Client.

Shortly after the VM has been powered off, the MachineHealthCheck remediation will power it on again.

Another test we can easily execute is deleting a worker node VM.

After a few minutes, a clone operation will be triggered to remediate the state of the cluster.

Please note that upgraded TKG clusters will not be equipped with the MachineHealthCheck functionality. Additionally, only worker nodes are remediated; control plane nodes are currently not included. Further limitations and caveats can be found here.

Conclusion

Even though TKG version 1.1 is just a minor release, it comes with some nice features. Kubernetes 1.18.2, CLI-driven cluster upgrades, and auto-remediation for worker nodes are just a few of them. TKG itself is very lightweight, and so is the upgrade process.

Auto-Remediation is a very useful feature to ensure worker node availability. Unfortunately, control plane nodes are not yet supported, but if you combine it with a multi-control-plane TKG cluster, it should give you adequate availability. From a DR perspective, it is always recommended to back up your Kubernetes clusters with tools such as Velero. Have a look at this blog post about Velero backup and migration for TKGI and TKG clusters if you want to know more.
