Backup and Migrate TKGI (PKS) to TKG with Velero

Very often, existing disaster recovery solutions and processes cannot be used to back up and restore persistent applications on Kubernetes. vSphere admins, as well as backup operators, are challenged with multiple layers of complexity and the need to ensure business continuity. So what can we do to ensure disaster recovery for Kubernetes on vSphere? Is it possible to migrate workloads between clusters, and what tools can we use? In this blog post, I want to explain and demonstrate how to back up and restore persistent applications on vSphere-based Kubernetes clusters with Velero. Additionally, I want to show how we can migrate between a VMware Enterprise PKS 1.7 cluster and a Tanzu Kubernetes Grid (TKG) cluster on vSphere. By the way, VMware Enterprise PKS got a new name and is now called “Tanzu Kubernetes Grid Integrated Edition”, which fits perfectly in the Tanzu Modern Apps portfolio. From here on, I will refer to it as TKGI.

Velero is an open-source project that can be used to back up, restore, or migrate your applications on Kubernetes. Until recently, the only way for Velero to back up Persistent Volumes of Kubernetes clusters on vSphere was to use a third-party tool called restic. If you want to learn more about this option, have a look at this excellent blog post from Cormac Hogan. The Velero vSphere Plugin, which backs up Persistent Volumes on vSphere, was released only recently. It allows for snapshot backups of Persistent Volumes that are backed by virtual disk files (VMDKs). The backup data is transferred to and stored in S3-compatible object storage. Let’s see what it is capable of and how we can use it.

Pre-Requisites

We need to ensure the following pre-requisites are in place to make use of the Velero vSphere Plugin.

CSI (1.0.2 and above) enabled Kubernetes clusters on vSphere (6.7 U3 or above)
S3 compatible object storage
Velero 1.3.2

In this scenario, I will use a TKGI 1.7 Kubernetes cluster. Since version 1.7, TKGI supports the CSI driver (out-of-tree driver) as an alternative storage integration for Kubernetes. By default, TKGI still uses the “vSphere Cloud Provider” (in-tree driver). To enable the CSI driver for TKGI clusters, follow the instructions here.

Please note, the scope of the current Velero vSphere Plugin version is vanilla Kubernetes only. It has not been tested or developed for TKGI or TKG specifically at this time. However, since TKGI and TKG ship native Kubernetes bits, I decided to give it a try.

I have decided to use MinIO as the required S3 object storage solution. MinIO runs containerized and can be easily installed on Kubernetes with a Helm chart deployment. A straightforward way to make use of a whole Helm chart catalog is the Kubeapps project from Bitnami. Since the acquisition in 2019, Bitnami has been part of the VMware family. Follow the instructions here if you want to use Kubeapps as well. If you wish to set up MinIO without Kubeapps, have a look at the following link.

Last but not least, we need to install the Velero CLI. As a Mac user, I used brew to get the latest version of Velero. If you are using a different platform, just follow the install guide here.

Assemble Backup

As a first step, we will create an S3 bucket on our MinIO server. I have exposed my MinIO Deployment via service type LoadBalancer on IP 192.168.24.10 and port 9000. To create a bucket, we log in to the MinIO user interface via http://<your_MinIO_IP>:9000/minio/login. The login details can be found on the Kubeapps dashboard for the MinIO deployment.
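If you prefer the command line, the LoadBalancer address can also be read from the Service object. The following is only a minimal sketch: the function name minio_s3_url is made up for illustration, and the service name and namespace you pass in are assumptions that depend on how MinIO was deployed in your cluster.

```shell
# minio_s3_url SERVICE NAMESPACE [PORT]
# Prints the S3 endpoint URL of a MinIO Service of type LoadBalancer.
# SERVICE and NAMESPACE are placeholders for your own deployment.
minio_s3_url() {
  svc=$1
  ns=$2
  port=${3:-9000}
  # Read the first external IP assigned by the load balancer.
  ip=$(kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || return 1
  if [ -z "$ip" ]; then
    echo "service $svc in namespace $ns has no LoadBalancer IP yet" >&2
    return 1
  fi
  printf 'http://%s:%s\n' "$ip" "$port"
}

# Example call (service name and namespace are hypothetical):
#   minio_s3_url minio default
```

The printed URL is the value that goes into the s3Url parameter later on.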

After logging in to the MinIO user interface, create a bucket with the name “velero” by clicking on the “Create bucket” button.

Now we have to create a credentials file for the S3 bucket, which will be used for the configuration of Velero. Copy the following content to a file named credentials (this is the file passed to Velero via --secret-file during the installation) and add your access_key_id as well as the secret_access_key for MinIO.

[default]
aws_access_key_id = ...
aws_secret_access_key = ...
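If you would rather not type the keys into an editor, the file can be generated from environment variables. This is a small sketch under stated assumptions: MINIO_ACCESS_KEY and MINIO_SECRET_KEY are hypothetical variable names chosen for this example, and write_credentials is a helper defined here, not a Velero command.

```shell
# write_credentials [FILE]
# Writes an AWS-style credentials file (default: ./credentials).
# MINIO_ACCESS_KEY and MINIO_SECRET_KEY are hypothetical variable names.
write_credentials() {
  file=${1:-./credentials}
  if [ -z "${MINIO_ACCESS_KEY:-}" ] || [ -z "${MINIO_SECRET_KEY:-}" ]; then
    echo "MINIO_ACCESS_KEY and MINIO_SECRET_KEY must be set" >&2
    return 1
  fi
  # Use a restrictive umask so a newly created file is readable by the owner only.
  (
    umask 077
    cat > "$file" <<EOF
[default]
aws_access_key_id = ${MINIO_ACCESS_KEY}
aws_secret_access_key = ${MINIO_SECRET_KEY}
EOF
  ) || return 1
  # Also tighten permissions in case the file already existed.
  chmod 600 "$file"
}

# Example call:
#   MINIO_ACCESS_KEY=... MINIO_SECRET_KEY=... write_credentials ./credentials
```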

Additionally, we use environment variables for REGION and BUCKET. Make sure to use the bucket name created in the previous step.

export BUCKET=velero
export REGION=minio

Finally, we can install and configure Velero on our Kubernetes cluster. Make sure your kubectl can access the cluster and that the right context is selected before you execute the following command. Also, use your own MinIO IP address for the s3Url parameter.

(⎈ |k8scl1:default)➜  ~ velero install --provider aws \
 --bucket $BUCKET \
 --secret-file ./credentials \
 --plugins velero/velero-plugin-for-aws:v1.0.0 \
 --snapshot-location-config region=$REGION \
 --backup-location-config region=$REGION,s3ForcePathStyle="true",s3Url=http://192.168.24.10:9000

Don’t get confused by the AWS provider and plugin parameters; they are correct. The Velero vSphere Plugin uses parts of the AWS plugin for the storage configuration. Now we have to add the Velero vSphere Plugin.

(⎈ |k8scl1:default)➜  ~ velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0

After a successful installation of Velero, you should see a Deployment with one pod and a DaemonSet with one pod per Kubernetes node, running within the velero namespace of your Kubernetes cluster.

(⎈ |k8scl1:default)➜  ~ kubectl get pods -n velero
NAME                               READY   STATUS    RESTARTS   AGE
datamgr-for-vsphere-plugin-2m7fn   1/1     Running   0          2d5h
datamgr-for-vsphere-plugin-tprfj   1/1     Running   0          2d5h
velero-69f48bdfdf-ntw82            1/1     Running   0          2d5h

To finalize the configuration, we need to create a snapshot-location by executing the following command.

(⎈ |k8scl1:default)➜  ~ velero snapshot-location create vsl-vsphere --provider velero.io/vsphere

Before we continue, let’s have a look at the test application we are going to use.

Test Application

I am going to use the Ghost application for my test case. This is a simple blog application that can be run as a single pod on Kubernetes. The application needs a Persistent Volume to store its data (e.g., blog posts). You can find the application YAML files in my GitHub repo for this blog post. As you can see, the application is already running on my TKGI cluster.

(⎈ |k8scl1:default)➜  ~ kubectl get sc
NAME                    PROVISIONER              AGE
demo-sts-sc (default)   csi.vsphere.vmware.com   2d18h
(⎈ |k8scl1:default)➜  ~ kubectl get pvc -n ghost
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
blog-content   Bound    pvc-e71abb22-3212-42ab-a1f9-7ff0e07beee0   2Gi        RWO            demo-sts-sc    2d17h
(⎈ |k8scl1:default)➜  ~ kubectl get pods -n ghost
NAME                    READY   STATUS    RESTARTS   AGE
blog-5dc5f75fd7-9p8xh   1/1     Running   0          2d17h

Thanks to the Cloud Native Storage integration of vSphere 6.7U3, which uses CSI, we can also see the Persistent Volumes within the vSphere Client.

To demonstrate that we will actually restore data, I have created a blog post that is stored on the Persistent Volume.

Execute Backup

Now that we have everything prepared and our test application is running, we can finally create a backup with the Velero vSphere Plugin. For this test, I will only back up the content of the ghost namespace by specifying the --include-namespaces parameter. Additionally, we need to reuse the snapshot-location that we created earlier.

(⎈ |k8scl1:default)➜  ~ velero backup create my-ghost-backup --include-namespaces=ghost --snapshot-volumes --volume-snapshot-locations vsl-vsphere
Backup request "my-ghost-backup" submitted successfully.
Run `velero backup describe my-ghost-backup` or `velero backup logs my-ghost-backup` for more details.

Let’s check if data is coming in and if the backup will finish successfully. We can check the logs or do a “velero backup describe” to see if everything is working ok.

(⎈ |k8scl1:default)➜  ~ velero backup describe my-ghost-backup
Name:         my-ghost-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>

Phase:  Completed

Namespaces:
  Included:  ghost
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-04-24 16:55:05 +0200 CEST
Completed:  2020-04-24 16:55:28 +0200 CEST

Expiration:  2020-05-24 16:55:05 +0200 CEST

Persistent Volumes:  1 of 1 snapshots completed successfully (specify --details for more information)

The upload to the remote storage is an asynchronous job. It will create a custom resource “uploads.veleroplugin.io” for every volume snapshot. Depending on the amount of data, this can take a while. Use the following kubectl command to check the status of the upload process.

(⎈ |k8scl1:default)➜  ~ kubectl get -n velero uploads.veleroplugin.io -o yaml
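The full YAML output can get long when several volumes are involved. As a rough sketch, the phases can be condensed into one line per upload. Note the assumptions: I assume here that each upload resource reports its state in status.phase and that a finished upload shows “Completed”; the two function names are made up for this example.

```shell
# list_upload_phases: print "<name><TAB><phase>" for every upload resource.
# Assumes the phase is exposed under .status.phase (not verified for every plugin version).
list_upload_phases() {
  kubectl get uploads.veleroplugin.io -n velero \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# unfinished_uploads: read "<name><TAB><phase>" lines on stdin and
# keep only those whose phase is not "Completed".
unfinished_uploads() {
  awk -F '\t' 'NF && $2 != "Completed"'
}

# Example call:
#   list_upload_phases | unfinished_uploads
```

An empty result would mean that all uploads have reached the assumed “Completed” phase.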

The backup completed very fast as we don’t have much data to store. We should also see snapshot creation and deletion tasks in the vSphere Client as well as some data on the S3 bucket.

Great! That worked like a charm. We have successfully created a backup of the Ghost application and its data. Here comes the interesting part: let’s try to restore it on a different Kubernetes cluster.

Assemble Restore / Migration

I have created a separate Tanzu Kubernetes Grid Cluster on vSphere for the restore test. TKG allows for declarative deployment of Kubernetes clusters and builds a consistent Kubernetes layer with life-cycle capabilities based on ClusterAPI. If you want to learn more about TKG, there is a great blog article from William Lam here.

As we are using a different cluster, this is not only a simple restore. It is a restore plus migration scenario between a TKGI and TKG Kubernetes cluster.

For the restore, we need to install Velero on the target cluster. I simply used the same command as for the source cluster. Make sure you have the environment variables for REGION and BUCKET still set.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero install --provider aws --bucket $BUCKET --secret-file ./credentials --plugins velero/velero-plugin-for-aws:v1.0.0 --snapshot-location-config region=$REGION --backup-location-config region=$REGION,s3ForcePathStyle="true",s3Url=http://192.168.24.10:9000
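Because the target cluster is often set up in a new terminal session, it is easy to run the install with empty variables. A small guard can catch that before Velero is installed. This is just a sketch; require_env is a helper defined here for illustration, not part of Velero.

```shell
# require_env NAME...
# Returns non-zero if any of the named environment variables is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "environment variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, placed in front of the install command:
#   require_env BUCKET REGION && velero install --provider aws --bucket "$BUCKET" ...
```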

This will create a default backup-location that we are going to change a bit.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   ACCESS MODE
default   aws        velero          ReadWrite

We need to add two parameters. First, as this is not only a restore but a migration scenario, we want this backup-location to be ReadOnly. Second, we need to specify a backup sync period so that the existing backups are synced from the location. I am sure there are smarter ways to do it, such as creating another backup-location, but I just took the easy path and added the two parameters by editing the object via kubectl.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl edit backupstoragelocation default -n velero

As you can see here, I just added the parameters and saved the changes.

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: "2020-04-29T09:32:56Z"
  generation: 31
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "405473"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 9174ea53-5410-456f-8c8b-e27c38a99036
spec:
  accessMode: ReadOnly
  backupSyncPeriod: 60s
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://192.168.24.10:9000
  objectStorage:
    bucket: velero
  provider: aws
status:
  lastSyncedTime: "2020-04-29T10:00:04.147686049Z"
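The same change can probably be applied without opening an editor, which is handy for scripting. The following sketch uses a JSON merge patch with the two fields from the object above; the function name set_backup_location_readonly is made up for this example, and I did not use this variant in my own test.

```shell
# set_backup_location_readonly [LOCATION] [SYNC_PERIOD]
# Sets accessMode to ReadOnly and backupSyncPeriod on a BackupStorageLocation
# in the velero namespace, using a JSON merge patch.
set_backup_location_readonly() {
  location=${1:-default}
  period=${2:-60s}
  patch=$(printf '{"spec":{"accessMode":"ReadOnly","backupSyncPeriod":"%s"}}' "$period")
  kubectl patch backupstoragelocation "$location" -n velero --type merge -p "$patch"
}

# Example call:
#   set_backup_location_readonly default 60s
```

A merge patch only touches the two listed fields, so the existing config and objectStorage sections stay as they are.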

After adding these two parameters and waiting 60 seconds for the backup sync to happen, you should see the backup object created in the previous step.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero backup get
NAME              STATUS      CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
my-ghost-backup   Completed   2020-04-24 16:55:05 +0200 CEST   25d       default            <none>
(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero backup describe my-ghost-backup
Name:         my-ghost-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>
...

Now we need to add the Velero vSphere Plugin to our target Kubernetes cluster.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0

We should see the datamgr-for-vsphere-plugin pods starting on the cluster. However, this was not the case; the pods ran into an error state.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl get pods -n velero
NAME                               READY   STATUS     RESTARTS   AGE
datamgr-for-vsphere-plugin-bqvs5   0/1     Error      3          55s
velero-69f48bdfdf-k5lbf            0/1     Init:1/2   0          58s
velero-795c8d58cd-s7zhr            1/1     Running    0          105m
(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl logs datamgr-for-vsphere-plugin-bqvs5 -n velero
time="2020-04-29T11:18:27Z" level=info msg="Starting data manager server  (-)" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/cmd/server/server.go:111"
time="2020-04-29T11:18:27Z" level=info msg="data manager server is started" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/cmd/server/server.go:174"
time="2020-04-29T11:18:27Z" level=error msg="Failed to get k8s secret, vsphere-config-secret" error="secrets \"vsphere-config-secret\" not found" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/utils/utils.go:59"
time="2020-04-29T11:18:27Z" level=error msg="Could not retrieve vsphere credential from k8s secret." error="secrets \"vsphere-config-secret\" not found" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/dataMover/data_mover.go:39"
An error occurred: secrets "vsphere-config-secret" not found

UPDATE 07.07.2020: This issue is fixed with version v1.0.1 of the Velero plugin for vSphere! See the changelog here.

As you can see from the logs, the vSphere Plugin is looking for a secret with the name “vsphere-config-secret”. This secret contains the vCenter Server credentials and configuration. In TKG, the secret has a different name and therefore cannot be found. I am currently in contact with the vSphere Plugin team to see if we can address this issue. Nevertheless, there is an easy workaround: simply create another secret with the name “vsphere-config-secret” that contains the required information. Here is an example config file:

[Global]
cluster-id = "tkgcl1"        # cluster-id must be unique.

[VirtualCenter "192.168.96.215"]
insecure-flag = "true"
user = "administrator@vsphere.local"
password = "your_password"
port = "443"
datacenters = "ED"

Replace the credentials and config parameters with your own values, save the file as csi-vsphere.conf, and create the secret.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf --namespace=kube-system
secret/vsphere-config-secret created

The datamgr-for-vsphere-plugin pod should come up now.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl get pods -n velero
NAME                               READY   STATUS    RESTARTS   AGE
datamgr-for-vsphere-plugin-vrnqj   1/1     Running   3          2m11s
velero-69f48bdfdf-mgjb5            1/1     Running   0          2m14s

For the next step, we have to create the same snapshot-location that we have used for the backup.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero snapshot-location create vsl-vsphere --provider velero.io/vsphere
Snapshot volume location "vsl-vsphere" configured successfully.

Before we can start the restore, there are some limitations with the current version of the Velero vSphere plugin that you need to be aware of.

UPDATE 07.07.2020: This issue is fixed with version v1.0.1 of the Velero plugin for vSphere! See the changelog here.

The plugin assumes that your Kubernetes Node VMs are located directly under the datacenter object within the “VMs and Templates” view. If the VMs are located in a subfolder such as “Discovered virtual machine”, the restore will fail. The issue has already been raised and will hopefully be fixed soon.

Another implementation detail you should be aware of is that storage classes are currently not used for the persistent volume placement during restore. This is a known issue, and the team is working on it. The current behavior places the persistent volumes on datastores that are accessible by all ESXi hosts on which Node VMs are running.

Execute Restore / Migration

Finally, we have everything prepared to start the Restore.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ velero restore create --from-backup my-ghost-backup
Restore request "my-ghost-backup-20200429084302" submitted successfully.
Run `velero restore describe my-ghost-backup-20200429084302` or `velero restore logs my-ghost-backup-20200429084302` for more details.

We can check the status of the restore with the following command.

(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero restore describe my-ghost-backup-20200429084302
Name: my-ghost-backup-20200429084302
Namespace: velero
Labels: <none>
Annotations: <none>

Phase: InProgress
...

Additionally, we can check the status of the data movement for the restore operation. This is similar to what we did for the backup with the uploads custom resource, but this time we are using the downloads.veleroplugin.io custom resource.

(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get -n velero downloads.veleroplugin.io -o yaml

We should see vCenter starting to create the VMDK and attach it to the Worker Node VM. Additionally, you should see the Persistent Volume appearing under the Cloud Native Storage integration of vCenter.

Wait until the restore phase completes.

(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero restore describe my-ghost-backup-20200429084302
Name: my-ghost-backup-20200429084302
Namespace: velero
Labels: <none>
Annotations: <none>

Phase: Completed
...
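Instead of repeating the describe command by hand, the wait can be scripted. The sketch below polls the phase of the Velero Restore object until it reaches the expected value. The helper names wait_for_phase and restore_phase and the POLL_INTERVAL variable are my own choices for this example, not Velero features.

```shell
# wait_for_phase EXPECTED TIMEOUT_SECONDS CMD [ARGS...]
# Runs CMD every POLL_INTERVAL seconds (default 5, should be at least 1)
# until it prints EXPECTED. Fails on a failure phase or when the timeout is reached.
wait_for_phase() {
  expected=$1
  timeout=$2
  shift 2
  interval=${POLL_INTERVAL:-5}
  elapsed=0
  while :; do
    phase=$("$@" 2>/dev/null) || phase=""
    case $phase in
      "$expected")
        echo "phase: $phase"
        return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "phase: $phase" >&2
        return 1 ;;
    esac
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out (last phase: ${phase:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# restore_phase NAME: print the phase of a Velero Restore object.
restore_phase() {
  kubectl get restores.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# Example call with a 10 minute timeout:
#   wait_for_phase Completed 600 restore_phase my-ghost-backup-20200429084302
```

Passing the status command as an argument keeps the loop reusable, for example for the backup object on the source cluster.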

Let’s double-check if all the objects got restored successfully.

(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl get pvc -n ghost
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
blog-content   Bound    pvc-e71abb22-3212-42ab-a1f9-7ff0e07beee0   2Gi        RWO            demo-sts-sc    15m
(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl get pods -n ghost
NAME                    READY   STATUS    RESTARTS   AGE
blog-7657d68b65-rlw2h   1/1     Running   0          4m17s
(⎈ |tkgcl1-admin@tkgcl1:default)➜  ~ kubectl get svc -n ghost
NAME   TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
blog   LoadBalancer   100.65.17.38   172.16.10.10   80:32700/TCP   58m

It looks good. As a final check, we should verify that the Ghost application is accessible, and the blog entry we have created is still there.

Done! Backup and Migration of a persistent application from TKGI to TKG on vSphere with Velero.

Conclusion

Even though it is still early days for the Velero vSphere Plugin, it is a good option to back up your applications on Kubernetes. The known limitations of the restore process will hopefully be addressed soon, which will make it an even better solution.

As demonstrated, it also works for migration scenarios between different Kubernetes clusters on vSphere (TKGI 1.7 to TKG), which is a topic that comes up quite often. Even though I haven’t tried it yet, I believe it could be combined with a Kubernetes cluster backup done by BOSH Backup and Restore (BBR) for TKGI clusters.

Currently, the plugin does not support “vSphere 7 with Kubernetes” Supervisor clusters and Guest Clusters. Stay tuned for more information about the Velero vSphere Plugin.

April 30, 2020
