TKG 1.1 Upgrade and Auto-Remediation
Tanzu Kubernetes Grid (TKG) is VMware’s Kubernetes distribution that provides a consistent Kubernetes offering across cloud environments. TKG allows for the declarative deployment of Kubernetes clusters based on the upstream Cluster API project. If you want to learn more about TKG and how to install it, have a look at this excellent blog post by William Lam. In addition, there is a great blog article from Tom Schwaller on how to use TKG with multiple vSphere clusters, here.
TKG 1.1 just got released, and it comes with Kubernetes 1.18.2 and a few other useful enhancements such as:
- TKG CLI for Windows
- Cluster API MachineHealthCheck is deployed by default (Auto-Remediation)
- Configure different sizes for master and worker nodes on the management cluster
- Bring your own VPC for AWS deployments
- Upgrade functionality from TKG 1.0 to 1.1 and Kubernetes 1.18.2
- … Have a look at the TKG 1.1 release notes for more
If you want to know what’s new in Kubernetes 1.18, have a look at the following CNCF webinar.
In this blog post, I want to describe and demonstrate how to upgrade to TKG 1.1 on vSphere. Please note, I am not going to cover the TKG upgrade on AWS even though some of the steps might be identical.
Preparation
First of all, we need to download the new TKG 1.1 binaries here. Make sure to download the TKG CLI binary for your OS, the Photon Kubernetes 1.18.2 OVA, and the Photon haproxy 1.2.4 OVA.
Import the two TKG OVA images into your vSphere environment.
After going through the Import process and specifying the necessary values, we need to convert both VMs to Templates.
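As an alternative to the vSphere Client wizard, the import and template conversion can also be scripted, for example with govc. This is only a rough sketch for my environment: the OVA file names are assumptions derived from the template names used later, and govc needs its usual connection settings (GOVC_URL, GOVC_USERNAME, GOVC_PASSWORD) exported beforehand.

# Import the two OVAs (file names assumed) and mark the resulting VMs as templates
govc import.ova -name photon-3-kube-v1.18.2+vmware.1 ./photon-3-kube-v1.18.2+vmware.1.ova
govc import.ova -name photon-3-haproxy-v1.2.4+vmware.1 ./photon-3-haproxy-v1.2.4+vmware.1.ova
govc vm.markastemplate photon-3-kube-v1.18.2+vmware.1
govc vm.markastemplate photon-3-haproxy-v1.2.4+vmware.1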
The last step of the preparation is to unzip the new TKG binary and exchange it on your client. I am using macOS, but if you are using a different OS, have a look at the documentation here. Please note, the TKG CLI for Windows was not available with version 1.0, and therefore upgrading existing TKG clusters with the Windows CLI is not supported.
➜ ~ cd Downloads
➜ Downloads gunzip tkg-darwin-amd64-v1.1.0-vmware.1.gz
➜ Downloads mv tkg-darwin-amd64-v1.1.0-vmware.1 tkg
➜ Downloads chmod +x tkg
➜ Downloads mv tkg /usr/local/bin/tkg
➜ Downloads tkg version
Client:
  Version: v1.1.0
  Git commit: 1faa437638dc81ed234721b2dbc2ad51ca9ec928
Upgrade Management Cluster
We can now start with the management cluster upgrade. First of all, we need to see if we have multiple management clusters. Execute the following command with the new TKG CLI.
➜ ~ tkg get management-cluster
It seems that you have an outdated providers on your filesystem, proceeding on this command will cause your tkgconfig file to be overridden. You may backup your tkgconfig file before moving on.
? Do you want to continue?? [y/N]
As you can see, TKG discovered outdated providers on our client. The TKG config file will be overridden if we proceed. So let’s create a backup of our config file before we continue. The TKG config file can be found under the hidden folder “.tkg” in your home directory. I simply copied the file to a backup folder, as described here.
➜ ~ cd .tkg
➜ .tkg ls
config.yaml providers
➜ .tkg mkdir backup
➜ .tkg cp config.yaml backup/
➜ .tkg cd backup
➜ backup ls
config.yaml
Let’s continue by pressing “y“. This will trigger an update of the providers and the config file.
➜ ~ tkg get management-cluster
It seems that you have an outdated providers on your filesystem, proceeding on this command will cause your tkgconfig file to be overridden. You may backup your tkgconfig file before moving on.
? Do you want to continue?? [y/N] y
the old providers folder /Users/aullah/.tkg/providers is backed up to /Users/aullah/.tkg/providers-20200522132438
 MANAGEMENT-CLUSTER-NAME  CONTEXT-NAME
 tkgmgmtcl *              tkgmgmtcl-admin@tkgmgmtcl
Let’s execute the diff command on the config files to see what has changed.
➜ .tkg diff config.yaml backup/config.yaml
1d0
< # Obsolete. Please use '-k' or '--kubernetes-version' to override the default kubernetes version
6c5
< url: /Users/aullah/.tkg/providers/cluster-api/v0.3.5/core-components.yaml
---
> url: /Users/aullah/.tkg/providers/cluster-api/v0.3.3/core-components.yaml
9c8
< url: /Users/aullah/.tkg/providers/infrastructure-aws/v0.5.3/infrastructure-components.yaml
---
> url: /Users/aullah/.tkg/providers/infrastructure-aws/v0.5.2/infrastructure-components.yaml
12c11
< url: /Users/aullah/.tkg/providers/infrastructure-vsphere/v0.6.4/infrastructure-components.yaml
---
> url: /Users/aullah/.tkg/providers/infrastructure-vsphere/v0.6.3/infrastructure-components.yaml
18c17
< url: /Users/aullah/.tkg/providers/bootstrap-kubeadm/v0.3.5/bootstrap-components.yaml
---
> url: /Users/aullah/.tkg/providers/bootstrap-kubeadm/v0.3.3/bootstrap-components.yaml
21c20
< url: /Users/aullah/.tkg/providers/control-plane-kubeadm/v0.3.5/control-plane-components.yaml
---
> url: /Users/aullah/.tkg/providers/control-plane-kubeadm/v0.3.3/control-plane-components.yaml
24,25d22
< all:
< repository: registry.tkg.vmware.run/cluster-api
52,55d48
< cert-manager-timeout: 30m0s
< NODE_STARTUP_TIMEOUT: 20m
< release:
< version: v1.1.0
We can see that the providers have been updated to new versions and a few other parameters were added to the config file. The first line is particularly interesting: a setting from the old config file has become obsolete, and we should use "-k" or "--kubernetes-version" instead. This refers to the "KUBERNETES_VERSION:" parameter that was specified in the old config file.
If you have multiple management clusters, make sure to select the one you intend to upgrade with the tkg set management-cluster command.
➜ ~ tkg set management-cluster tkgmgmtcl
The current management cluster context is switched to tkgmgmtcl
Before we start the upgrade, we can check the Kubernetes version of our clusters with the following command.
➜ ~ tkg get cluster --include-management-cluster
 NAME       NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES
 tkgcl1     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgcl2     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgmgmtcl  tkg-system  running  1/1           1/1      v1.17.3+vmware.2
Now let’s do the upgrade of our management cluster by simply executing the tkg upgrade management-cluster command while specifying the name of the cluster.
➜ ~ tkg upgrade management-cluster tkgmgmtcl
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200522T133546949883492.log
Upgrading management cluster 'tkgmgmtcl' to TKG version 'v1.1.0' with Kubernetes version 'v1.18.2+vmware.1'. Are you sure?: y
Upgrading management cluster providers...
Performing upgrade...
Performing upgrade...
Upgrading Provider="capi-system/cluster-api" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="cluster-api" Version="" TargetNamespace="capi-system"
Installing Provider="cluster-api" Version="v0.3.5" TargetNamespace="capi-system"
Upgrading Provider="capi-kubeadm-bootstrap-system/bootstrap-kubeadm" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="bootstrap-kubeadm" Version="" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-bootstrap-system"
Upgrading Provider="capi-kubeadm-control-plane-system/control-plane-kubeadm" CurrentVersion="" TargetVersion="v0.3.5"
Deleting Provider="control-plane-kubeadm" Version="" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-control-plane-system"
Upgrading Provider="capv-system/infrastructure-vsphere" CurrentVersion="" TargetVersion="v0.6.4"
Deleting Provider="infrastructure-vsphere" Version="" TargetNamespace="capv-system"
Installing Provider="infrastructure-vsphere" Version="v0.6.4" TargetNamespace="capv-system"
Management cluster providers upgraded successfully...
Upgrading management cluster kubernetes version...
Creating management cluster client...
Verifying kubernetes version...
Retrieving configuration for upgrade cluster...
Create InfrastructureTemplate for upgrade...
Upgrading control plane nodes...
Patching KubeadmControlPlane with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes...
Patching MachineDeployment with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for worker nodes...
Management cluster 'tkgmgmtcl' is being upgraded to TKG version 'v1.1.0' with kubernetes version 'v1.18.2+vmware.1'
In the vSphere Client, we should soon see a clone operation that will create a new control plane node for our management cluster.
The upgrade process first replaces the control plane nodes and then continues with the worker nodes. The upgrade is done in a rolling fashion, node by node.
It will also clean up the old control plane and worker node VMs.
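If you want to follow the rolling replacement from the Kubernetes side as well, you can watch the Cluster API Machine objects while the upgrade is running. This assumes your kubectl context already points to the management cluster:

kubectl get machines -A -w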
The management cluster upgrade finished successfully, and we can now check the Kubernetes version of our clusters again.
➜ ~ tkg get cluster --include-management-cluster
 NAME       NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES
 tkgcl1     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgcl2     default     running  1/1           1/1      v1.17.3+vmware.2
 tkgmgmtcl  tkg-system  running  1/1           1/1      v1.18.2+vmware.1.2
As we can see, it reports version 1.18.2 for the management cluster. After switching to the kubectl context for the management cluster, we can also check the Kubernetes version of the nodes.
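In case you are not already on that context, switching is a standard kubectl operation using the context name that the tkg CLI reported earlier:

kubectl config use-context tkgmgmtcl-admin@tkgmgmtcl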
(⎈ |tkgmgmtcl-admin@tkgmgmtcl:default)➜ ~ kubectl get nodes -o wide
NAME                              STATUS   ROLES    AGE   VERSION            INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                 KERNEL-VERSION   CONTAINER-RUNTIME
tkgmgmtcl-control-plane-dzpcq     Ready    master   22m   v1.18.2+vmware.1   172.16.10.169   172.16.10.169   VMware Photon OS/Linux   4.19.115-3.ph3   containerd://1.3.4
tkgmgmtcl-md-0-79d849c4d8-6mkw4   Ready    <none>   11m   v1.18.2+vmware.1   172.16.10.107   172.16.10.107   VMware Photon OS/Linux   4.19.115-3.ph3   containerd://1.3.4
Please note, the haproxy VMs are currently not exchanged by the upgrade process, as this could result in new IPs for the control plane. This is a known limitation that is actively being worked on. I would assume that this will change as soon as static IP allocation is offered with TKG.
Upgrade Workload Clusters
To start the upgrade of the individual workload clusters, simply execute the tkg upgrade cluster command while specifying the cluster name.
➜ ~ tkg upgrade cluster tkgcl2
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200522T141816680388983.log
Upgrading workload cluster 'tkgcl2' to kubernetes version 'v1.18.2+vmware.1'. Are you sure?: y
Creating management cluster client...
Validating configuration...
Creating workload cluster client...
Verifying kubernetes version...
Retrieving configuration for upgrade cluster...
Create InfrastructureTemplate for upgrade...
Upgrading control plane nodes...
Patching KubeadmControlPlane with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes...
Patching MachineDeployment with the kubernetes version v1.18.2+vmware.1...
Waiting for kubernetes version to be updated for worker nodes...
Cluster 'tkgcl2' successfully upgraded to kubernetes version 'v1.18.2+vmware.1'
Done! We have successfully upgraded TKG to version 1.1 and the Kubernetes clusters to version 1.18.2.
Post Upgrade Tasks
After the upgrade, I noticed that there are still old parameters in the tkg config file. The VSPHERE_TEMPLATE and VSPHERE_HAPROXY_TEMPLATE parameters still point to the old templates from TKG 1.0. That’s why a tkg create cluster command will fail with the following error.
➜ ~ tkg create cluster tkgcl3 --plan=dev
Logs of the command execution can also be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200526T125424630689398.log
Creating workload cluster 'tkgcl3'...
Validating configuration...
Error: : workload cluster configuration validation failed: vSphere config validation failed: vSphere template kubernetes version validation failed: Kubernetes version (v1.18.2+vmware.1) does not match that of the VM template (v1.17.3+vmware.2)
Detailed log about the failure can be found at: /var/folders/45/z71yc10j7cg35cwcvg3q8nym0000gq/T/tkg-20200526T125424630689398.log
Therefore, we need to edit the tkg config file under the hidden “.tkg” folder. Simply change the two parameters to the location and name of the new TKG 1.1 templates.
VSPHERE_TEMPLATE: /ED/vm/Templates/photon-3-kube-v1.18.2+vmware.1
VSPHERE_HAPROXY_TEMPLATE: /ED/vm/Templates/photon-3-haproxy-v1.2.4+vmware.1
Now we can create TKG clusters from the new templates with the tkg create cluster command.
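Since the KUBERNETES_VERSION parameter is now obsolete, the target version can also be passed on the command line via the flag the CLI hinted at earlier. This is just a sketch based on that hint; verify the exact flag with tkg create cluster --help:

tkg create cluster tkgcl3 --plan=dev --kubernetes-version=v1.18.2+vmware.1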
Auto-Remediation with MachineHealthCheck
A new feature that I want to highlight is the MachineHealthCheck that is created for every newly deployed TKG cluster. The Cluster API MachineHealthCheck checks and remediates unhealthy machines (worker nodes) in a TKG Kubernetes cluster. We can inspect the MachineHealthCheck object with the following command on the TKG management cluster.
(⎈ |tkgmgmtcl-admin@tkgmgmtcl:default)➜ ~ kubectl get MachineHealthCheck -A
NAMESPACE   NAME     MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default     tkgcl2   100%           1                  1
I was confused about the MAXUNHEALTHY value of 100% that was returned by the command. What does it mean? After reading through the Cluster API documentation here, I learned that this value is a circuit breaker that prevents further remediation once the cluster has reached a certain percentage of unhealthy nodes. If the value is set to 100%, remediation will always be attempted, no matter how many nodes are in an unhealthy state.
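To see how maxUnhealthy and the node startup timeout are actually configured, you can dump the object with kubectl get machinehealthcheck tkgcl2 -o yaml. For illustration, a trimmed-down Cluster API MachineHealthCheck could look roughly like the sketch below. This is not the exact manifest TKG generates: the v1alpha3 API version is assumed from the Cluster API v0.3.x providers, and the selector label and condition timeouts are placeholders.

apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: tkgcl2
  namespace: default
spec:
  clusterName: tkgcl2
  maxUnhealthy: 100%         # circuit breaker; 100% means remediation always runs
  nodeStartupTimeout: 20m    # matches the NODE_STARTUP_TIMEOUT seen in the tkg config diff
  selector:
    matchLabels:
      node-pool: tkgcl2-worker-pool   # assumed label; check the real object with -o yaml
  unhealthyConditions:                # example conditions; timeouts are assumptions
  - type: Ready
    status: Unknown
    timeout: 5m
  - type: Ready
    status: "False"
    timeout: 5m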
The Auto-Remediation can be tested easily by shutting down a worker node VM from the vSphere Client.
Shortly after the VM has been powered off, the MachineHealthCheck remediation will power it on again.
Another test we can easily execute is deleting a worker node VM.
After a few minutes, a clone operation will be triggered to remediate the state of the cluster.
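While testing, it is also interesting to watch the MachineHealthCheck and the machine objects from the management cluster context; the CURRENTHEALTHY count should drop and a replacement machine should appear once remediation kicks in:

kubectl get machinehealthcheck,machines -A -w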
Please note, upgraded TKG clusters will not be equipped with the MachineHealthCheck functionality. Additionally, only worker nodes are remediated; control plane nodes are currently not included. Further limitations and caveats can be found here.
Conclusion
Even though TKG version 1.1 is just a minor release, it comes with some nice features. Kubernetes 1.18.2, CLI-driven cluster upgrades, and auto-remediation for worker nodes are just a few of them. TKG itself is very lightweight, and so is the upgrade process.
Auto-Remediation is a very useful feature to ensure worker node availability. Unfortunately, control plane nodes are not yet supported, but if you combine it with a multi-control-plane TKG cluster, it should give you adequate availability. From a DR perspective, it is always recommended to back up your Kubernetes clusters with tools such as Velero. Have a look at this blog post about Velero backup and migration for TKGI and TKG clusters if you want to know more.
Sources
- TKG 1.1 release notes
- TKG 1.1 download
- TKG 1.1 documentation
- TKG 1.1 management cluster upgrade
- TKG 1.1 workload cluster upgrade
- Cluster API MachineHealthCheck
- Cluster API MachineHealthCheck configuration
- Cluster API MachineHealthCheck controller
- TKG overview and install by William Lam
- TKG multi-tenant setup by Tom Schwaller
- Velero home page
- Backup and Migration of TKGI to TKG with Velero