Very often, existing Disaster Recovery solutions and processes cannot be used to back up and restore persistent applications on Kubernetes. vSphere Admins, as well as Backup Operators, are challenged with multiple layers of complexity and the need to ensure business continuity. So what can we do to ensure Disaster Recovery for Kubernetes on vSphere? Is it possible to migrate workloads between clusters, and what tools can we use? In this blog post, I want to explain and demonstrate how to back up and restore persistent applications on vSphere-based Kubernetes clusters with Velero. Additionally, I want to show how we can migrate between a VMware Enterprise PKS 1.7 cluster and a Tanzu Kubernetes Grid (TKG) cluster on vSphere. By the way, VMware Enterprise PKS got a new name and is now called “Tanzu Kubernetes Grid Integrated Edition”, which fits perfectly into the Tanzu Modern Apps portfolio. From here on, I will refer to it as TKGI.
Velero is an open-source project that can be used to back up, restore, or migrate your applications on Kubernetes. Until recently, the only way Velero could back up persistent volumes on vSphere was to use a third-party solution called restic. If you want to learn more about this option, have a look at this excellent blog post from Cormac Hogan. The Velero vSphere Plugin for backing up Persistent Volumes on vSphere was released only recently. It allows for snapshot backups of Persistent Volumes that are backed by virtual disk files (VMDKs). The backup data is transferred to and stored in S3-compatible object storage. Let’s see what it is capable of and how we can use it.
We need to ensure the following pre-requisites are in place to make use of the Velero vSphere Plugin.
- CSI (1.0.2 and above) enabled Kubernetes clusters on vSphere (6.7 U3 or above)
- S3 compatible object storage
- Velero 1.3.2
In this scenario, I will use a TKGI 1.7 Kubernetes cluster. Since version 1.7, TKGI supports the CSI driver (out-of-tree driver) as an alternative storage integration for Kubernetes. By default, TKGI still uses the “vSphere Cloud Provider” (in-tree driver). To enable the CSI driver for TKGI clusters, follow the instructions here.
Please note that the scope of the current Velero vSphere Plugin version is vanilla Kubernetes only. It has not been tested or developed for TKGI or TKG specifically at this time. However, since TKGI and TKG ship native Kubernetes bits, I decided to give it a try.
I have decided to use MinIO as the required S3 object storage solution. MinIO runs containerized and can be easily installed on Kubernetes with a Helm chart deployment. A straightforward way to make use of a whole Helm chart catalog is the Kubeapps project from Bitnami. Since its acquisition in 2019, Bitnami has been part of the VMware family. Follow the instructions here if you want to use Kubeapps as well. If you wish to set up MinIO without Kubeapps, have a look at the following link.
As a first step, we will create an S3 bucket on our MinIO server. I have exposed my MinIO Deployment via service type LoadBalancer on IP 192.168.24.10 and port 9000. To create a bucket, we log in to the MinIO user interface via http://<your_MinIO_IP>:9000/minio/login. The login details can be found on the Kubeapps dashboard for the MinIO deployment.
After logging in to the MinIO user interface, create a bucket with the name “velero” by clicking on the “Create bucket” button.
Now we have to create a credentials file for the S3 bucket, which will be used for the configuration of Velero. Copy the following content into a file named credentials (the file that the --secret-file parameter of the install command points to) and add your access key ID as well as the secret access key for MinIO.
[default]
aws_access_key_id = ...
aws_secret_access_key = ...
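If you prefer not to edit the file by hand, it can also be generated from the shell. The following is only a minimal sketch: the variable names MINIO_ACCESS_KEY and MINIO_SECRET_KEY and the REPLACE_ME placeholders are my own assumptions for illustration and are not defined by Velero or MinIO.

```shell
# Hypothetical variable names; set them to the keys of your MinIO deployment.
MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:-REPLACE_ME}"
MINIO_SECRET_KEY="${MINIO_SECRET_KEY:-REPLACE_ME}"

# Write the file in a subshell with a restrictive umask, because it holds a secret.
(
  umask 077
  cat > ./credentials <<EOF
[default]
aws_access_key_id = ${MINIO_ACCESS_KEY}
aws_secret_access_key = ${MINIO_SECRET_KEY}
EOF
)
```

Check the resulting file before installing Velero; a leftover REPLACE_ME value means one of the variables was not set.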
Additionally, we use environment variables for REGION and BUCKET. Make sure to use the bucket name created in the previous step.
export BUCKET=velero
export REGION=minio
Finally, we can install and configure Velero on our Kubernetes cluster. Make sure your kubectl can access the cluster and that the right context is selected before you execute the following command. Also, use your own MinIO IP address for the s3Url parameter.
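(⎈ |k8scl1:default)➜ ~ velero install --provider aws --bucket $BUCKET --secret-file ./credentials --plugins velero/velero-plugin-for-aws:v1.0.0 --snapshot-location-config region=$REGION --backup-location-config region=$REGION,s3ForcePathStyle="true",s3Url=http://192.168.24.10:9000
Don’t get confused by the AWS provider and plugin parameters; they are correct. The Velero vSphere Plugin uses parts of the AWS plugin for the storage configuration. Now we have to add the Velero vSphere Plugin.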
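Because the install command relies on the BUCKET and REGION variables and on the active kubectl context, a small pre-flight check can catch an empty variable before Velero is installed with a broken configuration. This is a sketch only; the helper name require_vars is hypothetical.

```shell
# Hypothetical helper: fail early if any of the named variables is unset or empty.
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Example (requires cluster access):
#   require_vars BUCKET REGION && kubectl config current-context
```

The example prints the name of the currently selected context only if both variables are set, so you can confirm you are pointing at the intended cluster.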
(⎈ |k8scl1:default)➜ ~ velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0
After a successful installation of Velero, you should see a Deployment with one pod and a DaemonSet with one pod per Kubernetes node, running within the velero namespace of your Kubernetes cluster.
(⎈ |k8scl1:default)➜ ~ kubectl get pods -n velero
NAME                               READY   STATUS    RESTARTS   AGE
datamgr-for-vsphere-plugin-2m7fn   1/1     Running   0          2d5h
datamgr-for-vsphere-plugin-tprfj   1/1     Running   0          2d5h
velero-69f48bdfdf-ntw82            1/1     Running   0          2d5h
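On clusters with many nodes, it can be handy to list only the pods that are not fully ready. The filter below is a minimal sketch; the function name not_ready_pods is made up for illustration, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# names of pods that are not Running or whose containers are not all ready.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # "1/1" -> ready[1] = "1", ready[2] = "1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example (requires cluster access):
#   kubectl get pods -n velero | not_ready_pods
```

An empty result means that the Velero Deployment pod and all data manager pods are up.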
To finalize the configuration, we need to create a snapshot-location by executing the following command.
(⎈ |k8scl1:default)➜ ~ velero snapshot-location create vsl-vsphere --provider velero.io/vsphere
Before we continue, let’s have a look at the test application we are going to use.
I am going to use the Ghost application for my test case. This is a simple blog application that can be executed as a single pod on Kubernetes. The Application needs a Persistent Volume to store the data (e.g., blog posts). You can find the application yaml files on my Github repo for this blog post. As you can see, the application is already running on my TKGI cluster.
(⎈ |k8scl1:default)➜ ~ kubectl get sc
NAME                    PROVISIONER              AGE
demo-sts-sc (default)   csi.vsphere.vmware.com   2d18h
(⎈ |k8scl1:default)➜ ~ kubectl get pvc -n ghost
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
blog-content   Bound    pvc-e71abb22-3212-42ab-a1f9-7ff0e07beee0   2Gi        RWO            demo-sts-sc    2d17h
(⎈ |k8scl1:default)➜ ~ kubectl get pods -n ghost
NAME                    READY   STATUS    RESTARTS   AGE
blog-5dc5f75fd7-9p8xh   1/1     Running   0          2d17h
Thanks to the Cloud Native Storage integration of vSphere 6.7U3, which uses CSI, we can also see the Persistent Volumes within the vSphere Client.
To demonstrate that we will actually restore data, I have created a blog post that is stored on the Persistent Volume.
Now that we have everything prepared and our test application is running, we can finally create a backup with the Velero vSphere Plugin. For this test, I will only back up the content of the ghost namespace by specifying the --include-namespaces parameter. Additionally, we need to reuse the snapshot-location that we created earlier.
(⎈ |k8scl1:default)➜ ~ velero backup create my-ghost-backup --include-namespaces=ghost --snapshot-volumes --volume-snapshot-locations vsl-vsphere
Backup request "my-ghost-backup" submitted successfully.
Run `velero backup describe my-ghost-backup` or `velero backup logs my-ghost-backup` for more details.
Let’s check if data is coming in and if the backup will finish successfully. We can check the logs or do a “velero backup describe” to see if everything is working ok.
(⎈ |k8scl1:default)➜ ~ velero backup describe my-ghost-backup
Name:         my-ghost-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>

Phase:  Completed

Namespaces:
  Included:  ghost
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Snapshot PVs:  true

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-04-24 16:55:05 +0200 CEST
Completed:  2020-04-24 16:55:28 +0200 CEST

Expiration:  2020-05-24 16:55:05 +0200 CEST

Persistent Volumes:  1 of 1 snapshots completed successfully (specify --details for more information)
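For larger volumes, the backup will not finish within seconds, and repeatedly running the describe command gets tedious. The following sketch polls the phase until it reaches a terminal state. The function names wait_for_phase and backup_phase are hypothetical, and the sketch assumes that Velero runs in the velero namespace and that the phase is read from the Backup custom resource via kubectl.

```shell
# Hypothetical helper: run a command every 5 seconds until it prints a terminal phase.
# Usage: wait_for_phase <timeout_seconds> <command...>
wait_for_phase() {
  timeout="$1"; shift
  elapsed=0
  phase=""
  while [ "$elapsed" -lt "$timeout" ]; do
    phase="$("$@" 2>/dev/null)" || phase=""
    case "$phase" in
      Completed)
        echo "phase: $phase"; return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "phase: $phase" >&2; return 1 ;;
    esac
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out after ${timeout}s (last phase: ${phase:-unknown})" >&2
  return 2
}

# Prints the phase of a Velero Backup object (assumes the velero namespace).
backup_phase() {
  kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# Example (requires cluster access):
#   wait_for_phase 600 backup_phase my-ghost-backup
```

The polling function does not care where the phase comes from, so the same approach should work for other Velero objects that expose a phase, such as restores.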
The backup completed very fast as we don’t have much data to store. We should also see snapshot creation and deletion tasks in the vSphere Client as well as some data on the S3 bucket.
Great! That worked like a charm. We have successfully created a backup of the Ghost application and its data. Now comes the interesting part: let’s try to restore it on a different Kubernetes cluster.
Assemble Restore / Migration
I have created a separate Tanzu Kubernetes Grid Cluster on vSphere for the restore test. TKG allows for declarative deployment of Kubernetes clusters and builds a consistent Kubernetes layer with life-cycle capabilities based on ClusterAPI. If you want to learn more about TKG, there is a great blog article from William Lam here.
As we are using a different cluster, this is not only a simple restore. It is a restore plus migration scenario between a TKGI and TKG Kubernetes cluster.
For the restore, we need to install Velero on the target cluster. I simply used the same command as for the source cluster. Make sure you have the environment variables for REGION and BUCKET still set.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero install --provider aws --bucket $BUCKET --secret-file ./credentials --plugins velero/velero-plugin-for-aws:v1.0.0 --snapshot-location-config region=$REGION --backup-location-config region=$REGION,s3ForcePathStyle="true",s3Url=http://192.168.24.10:9000
This will create a default backup-location that we are going to change a bit.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   ACCESS MODE
default   aws        velero          ReadWrite
We need to add two parameters. First, as this is not only a restore but a migration scenario, we want this backup-location to be ReadOnly. Second, we need to specify a backup sync period so that the existing backups are synced from the location. I am sure there are smarter ways to do it, such as creating another backup-location, but I took the easy path and added the two parameters by editing the object via kubectl.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl edit backupstoragelocation default -n velero
As you can see here, I just added the parameters and saved the changes.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: "2020-04-29T09:32:56Z"
  generation: 31
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "405473"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 9174ea53-5410-456f-8c8b-e27c38a99036
spec:
  accessMode: ReadOnly
  backupSyncPeriod: 60s
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://192.168.24.10:9000
  objectStorage:
    bucket: velero
  provider: aws
status:
  lastSyncedTime: "2020-04-29T10:00:04.147686049Z"
After adding these two parameters and waiting 60 seconds for the backup sync to happen, you should see the backup object created in the previous step.
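If you would rather script this step than edit the object interactively, the same two fields can be set with a merge patch. This is a sketch under the assumption that the field names accessMode and backupSyncPeriod match the object shown above; the function names bsl_patch and apply_bsl_patch are hypothetical.

```shell
# Hypothetical helper: builds the JSON merge patch for the two spec fields.
# Usage: bsl_patch <accessMode> <backupSyncPeriod>
bsl_patch() {
  printf '{"spec":{"accessMode":"%s","backupSyncPeriod":"%s"}}' "$1" "$2"
}

# Applies the patch to the default backup storage location in the velero namespace.
apply_bsl_patch() {
  kubectl patch backupstoragelocation default -n velero \
    --type merge -p "$(bsl_patch "$1" "$2")"
}

# Example (requires cluster access):
#   apply_bsl_patch ReadOnly 60s
```

A merge patch only touches the fields it names, so the existing config and objectStorage settings stay as they are.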
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero backup get
NAME              STATUS      CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
my-ghost-backup   Completed   2020-04-24 16:55:05 +0200 CEST   25d       default            <none>
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero backup describe my-ghost-backup
Name:         my-ghost-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  <none>
...
Now we need to add the Velero vSphere Plugin to our target Kubernetes cluster.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.0.0
We should see the datamgr-for-vsphere-plugin pods starting on the cluster. However, this was not the case: the pods ran into an error state.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get pods -n velero
NAME                               READY   STATUS     RESTARTS   AGE
datamgr-for-vsphere-plugin-bqvs5   0/1     Error      3          55s
velero-69f48bdfdf-k5lbf            0/1     Init:1/2   0          58s
velero-795c8d58cd-s7zhr            1/1     Running    0          105m
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl logs datamgr-for-vsphere-plugin-bqvs5 -n velero
time="2020-04-29T11:18:27Z" level=info msg="Starting data manager server (-)" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/cmd/server/server.go:111"
time="2020-04-29T11:18:27Z" level=info msg="data manager server is started" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/cmd/server/server.go:174"
time="2020-04-29T11:18:27Z" level=error msg="Failed to get k8s secret, vsphere-config-secret" error="secrets \"vsphere-config-secret\" not found" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/utils/utils.go:59"
time="2020-04-29T11:18:27Z" level=error msg="Could not retrieve vsphere credential from k8s secret." error="secrets \"vsphere-config-secret\" not found" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/dataMover/data_mover.go:39"
An error occurred: secrets "vsphere-config-secret" not found
As you can see from the logs, the vSphere Plugin is looking for a secret with the name “vsphere-config-secret”. This secret contains the vCenter Server credentials and config. In TKG, the secret used has a different name and therefore cannot be found. I am currently in contact with the vSphere Plugin team to see if we can address this issue. Nevertheless, there is an easy workaround: simply create another secret with the name “vsphere-config-secret” and the required information. Here is an example config file:
[Global]
cluster-id = "tkgcl1" # Cluster-uuid must be unique.

[VirtualCenter "192.168.96.215"]
insecure-flag = "true"
user = "email@example.com"
password = "your_password"
port = "443"
datacenters = "ED"
Replace the credentials and config parameters with your own values, save the file as csi-vsphere.conf, and create the secret in the kube-system namespace.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf --namespace=kube-system
secret/vsphere-config-secret created
The datamgr-for-vsphere-plugin pod should come up now.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get pods -n velero
NAME                               READY   STATUS    RESTARTS   AGE
datamgr-for-vsphere-plugin-vrnqj   1/1     Running   3          2m11s
velero-69f48bdfdf-mgjb5            1/1     Running   0          2m14s
For the next step, we have to create the same snapshot-location that we have used for the backup.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero snapshot-location create vsl-vsphere --provider velero.io/vsphere
Snapshot volume location "vsl-vsphere" configured successfully.
Before we can start the restore, there are some limitations in the current version of the Velero vSphere Plugin that you need to be aware of. The plugin assumes that your Kubernetes Node VMs are located directly under the datacenter object within the “VMs and Templates” view. If the VMs are located in a subfolder such as “Discovered virtual machine”, the restore will fail. This issue has already been raised and will hopefully be fixed soon.
Another implementation detail you should be aware of is that storage classes are currently not used for the persistent volume placement during restore. This is a known issue, and the team is working on it. The current behavior places the persistent volumes on datastores that are accessible by all ESXi hosts on which Node VMs are running.
Execute Restore / Migration
Finally, we have everything prepared to start the Restore.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero restore create --from-backup my-ghost-backup
Restore request "my-ghost-backup-20200429084302" submitted successfully.
Run `velero restore describe my-ghost-backup-20200429084302` or `velero restore logs my-ghost-backup-20200429084302` for more details.
We can check the status of the restore with the following command.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero restore describe my-ghost-backup-20200429084302
Name:         my-ghost-backup-20200429084302
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  InProgress
...
We should see vCenter starting to create the VMDK and attach it to the Worker Node VM. Additionally, you should see the Persistent Volume appearing under the Cloud Native Storage integration of vCenter.
Wait until the restore phase completes.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ velero restore describe my-ghost-backup-20200429084302
Name:         my-ghost-backup-20200429084302
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  Completed
...
Let’s double-check if all the objects got restored successfully.
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get pvc -n ghost
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
blog-content   Bound    pvc-e71abb22-3212-42ab-a1f9-7ff0e07beee0   2Gi        RWO            demo-sts-sc    15m
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get pods -n ghost
NAME                    READY   STATUS    RESTARTS   AGE
blog-7657d68b65-rlw2h   1/1     Running   0          4m17s
(⎈ |tkgcl1-admin@tkgcl1:default)➜ ~ kubectl get svc -n ghost
NAME   TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
blog   LoadBalancer   100.65.17.38   172.16.10.10   80:32700/TCP   58m
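To probe the restored application from the command line as well, the external IP and port of the blog service can be pulled out of the service listing and passed to curl. This is a minimal sketch: the helper name svc_endpoint is hypothetical, and it assumes the default kubectl get svc column layout, with the external IP in the fourth column and the ports in the fifth.

```shell
# Hypothetical helper: reads `kubectl get svc` output on stdin and prints
# "<external-ip>:<service-port>" for the named service.
svc_endpoint() {
  awk -v name="$1" 'NR > 1 && $1 == name {
    split($5, port, ":")   # "80:32700/TCP" -> port[1] = "80"
    print $4 ":" port[1]
  }'
}

# Example (requires cluster access):
#   endpoint="$(kubectl get svc -n ghost | svc_endpoint blog)"
#   curl -fsS -o /dev/null -w '%{http_code}\n' "http://${endpoint}/"
```

The curl call only tells you that the application answers; whether the blog entry survived the migration still needs a look at the page itself.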
It looks good. As a final check, we should verify that the Ghost application is accessible, and the blog entry we have created is still there.
Done! Backup and Migration of a persistent application from TKGI to TKG on vSphere with Velero.
Even though it is still early days for the Velero vSphere Plugin, it is a good option to back up your applications on Kubernetes. The known limitations of the restore process will hopefully be addressed soon, which will make it an even better solution.
As demonstrated, it also works for migration scenarios between different Kubernetes clusters on vSphere (TKGI 1.7 to TKG), a topic that comes up quite often. Even though I haven’t tried it yet, I believe it could be combined with a Kubernetes cluster backup done by BOSH Backup and Restore (BBR) for TKGI clusters.
Currently, the plugin does not support “vSphere 7 with Kubernetes” Supervisor clusters and Guest Clusters. Stay tuned for more information about the Velero vSphere Plugin.
- Velero home page
- Velero plugin for vSphere on GitHub
- Velero install documentation
- Velero snapshot location
- Velero backup location
- Velero cluster migration
- Cormac’s blog post about backing up persistent applications with restic
- Container Storage Interface (CSI) for Kubernetes
- Cloud-Native Storage for vSphere (CNS / CSI)
- Tanzu Kubernetes Grid solution overview
- William’s blog post about Tanzu Kubernetes Grid
- Cluster API on GitHub
- Tanzu Kubernetes Grid Integrated Edition (TKGI) 1.7
- Kubernetes Persistent Volumes
- Kubernetes Out-of-Tree Volume Plugins
- Kubernetes In-Tree Cloud Providers
- Kubernetes service type LoadBalancer
- Enable CSI / CNS for TKGI 1.7
- MinIO home page
- MinIO on GitHub
- Helm home page
- Kubeapps home page
- Getting started with Kubeapps on GitHub
- Homebrew home page
- My test yaml and config files on GitHub
- Bosh Backup and Restore for TKGI
- vSphere 7 with Kubernetes Supervisor and Guest Cluster