VMware PKS 1.3 Backup and Recovery

The new Backup and Recovery capabilities of VMware PKS 1.3 immediately caught my attention. Disaster recovery is a very important topic for all customers running Kubernetes clusters in production or similar environments. Besides backing up the PKS control plane, PKS 1.3 can back up and restore single master node Kubernetes clusters with stateless workloads. In this blog post, I want to test the new backup and restore feature and demonstrate how to use it. Before you start reading this, please be aware of the following prerequisites as well as limitations of the Bosh Backup and Restore functionality.

First of all, we need to download the BBR (Bosh Backup and Restore) tool here. The BBR command should be executed on a secure Jumpbox server that has network access to the VMs you want to back up. Make sure you can reach the VMs on port 22, as BBR uses SSH to orchestrate the backups. By the way, you can also use the Ops Manager as a Jumpbox if you want to. The documentation and prerequisites for BBR can be found here. I am going to use my MacBook as the Jumpbox because I have limited resources in my homelab environment.

➜  mv bbr-1.3.2-darwin-amd64 bbr
➜  chmod a+x bbr
➜  mv bbr /usr/local/bin
➜  bbr -version
bbr version 1.3.2

In this blog post, I am focusing on the new capability to backup and restore single master node Kubernetes clusters. I am not going to cover the PKS control plane backup and restore. However, if you want to learn more about the disaster recovery of the control plane, have a look at the documentation here.

Before we can start, we need to copy the root_ca_cert from the Ops Manager and set some Bosh variables. Follow the next steps to prepare your environment.

As a first step, copy the root_ca_certificate to your Jumpbox. You can download the certificate from the Ops Manager UI under Settings/Advanced. If you are using the Ops Manager as the Jumpbox, you can find the certificate at the following path: “/var/tempest/workspaces/default/root_ca_certificate”.
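After downloading, a quick sanity check with openssl confirms that the file is a valid PEM certificate. This is a hypothetical check, assuming the certificate was saved as root_ca_certificate in the current directory:

```shell
# Sanity-check the downloaded root CA certificate (root_ca_certificate is
# the hypothetical local filename). Prints the subject and expiry date
# if the file is a well-formed PEM certificate.
if [ -f root_ca_certificate ]; then
  openssl x509 -in root_ca_certificate -noout -subject -enddate
else
  echo "root_ca_certificate not found - download it from Ops Manager first"
fi
```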

[Screenshot: Ops Manager UI, Settings/Advanced, where the root CA certificate can be downloaded]

Now we need to get the necessary bosh command line credentials via the Ops Manager UI, see screenshot.

[Screenshot: Ops Manager UI showing the Bosh command line credentials]

You should see an output like this after clicking on “Link to Credential”:

{"credential":"BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=ZYxWvutsRqPoNmlKjIhgFeDcBA BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.96.1 bosh "}

Create a file and reformat the collected content in the following way. If you are executing BBR from a Jumpbox and not from the Ops Manager itself, make sure that you change the BOSH_CA_CERT path to the location of the downloaded certificate.

export BOSH_CLIENT_SECRET=ZYxWvutsRqPoNmlKjIhgFeDcBA
export BOSH_CLIENT=ops_manager
export BOSH_ENVIRONMENT=192.168.96.1
export BOSH_CA_CERT=/home/aullah/root_ca_certificate

Save the file and “source” it whenever you need to execute bosh-cli or bbr commands against the environment. Alternatively, you can add the content to your bash profile so it is available in every new shell.
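As a sketch, assuming the variables were saved to a file called bosh-env.sh (a hypothetical name; the values are the ones collected above), preparing and sourcing it looks like this:

```shell
# Hypothetical helper file holding the Bosh connection variables
# collected from the Ops Manager (values are the ones from this post).
cat > bosh-env.sh <<'EOF'
export BOSH_CLIENT=ops_manager
export BOSH_CLIENT_SECRET=ZYxWvutsRqPoNmlKjIhgFeDcBA
export BOSH_ENVIRONMENT=192.168.96.1
export BOSH_CA_CERT=/home/aullah/root_ca_certificate
EOF

# Source it before running bosh or bbr commands in a new shell.
source ./bosh-env.sh
echo "Targeting Bosh director ${BOSH_ENVIRONMENT} as ${BOSH_CLIENT}"
```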

If you don’t have bosh-cli installed on your Jumpbox, simply download the right version for your OS on GitHub here and follow the instructions here.

Execute the “bosh vms” command to see if it’s working. If everything is configured correctly, you should see an output like this. Find your Kubernetes cluster deployment and make a note of its deployment ID (PKS names these deployments service-instance_&lt;UUID&gt;, where the UUID matches the cluster UUID reported by “pks cluster”), see screenshot. We will need the deployment ID for the backup process later.

[Screenshot: bosh vms output showing the service-instance deployment of the Kubernetes cluster]

Backup

Now that we have everything prepared on our Jumpbox, let’s execute the pre-backup-check for our Kubernetes cluster deployment. Simply execute “bbr deployment -d &lt;deployment_ID&gt; pre-backup-check”. BBR picks up the remaining Bosh connection details from the environment variables we exported earlier.

➜  ~ bbr deployment -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 pre-backup-check
[bbr] 2019/01/30 11:50:26 INFO - Looking for scripts
[bbr] 2019/01/30 11:50:32 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/backup
[bbr] 2019/01/30 11:50:32 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/metadata
[bbr] 2019/01/30 11:50:33 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/post-restore-unlock
[bbr] 2019/01/30 11:50:33 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/pre-restore-lock
[bbr] 2019/01/30 11:50:33 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/restore
[bbr] 2019/01/30 11:50:33 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/post-restore-unlock
[bbr] 2019/01/30 11:50:33 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/pre-restore-lock
[bbr] 2019/01/30 11:50:36 INFO - Running pre-checks for backup of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90...
[10:50:39] Deployment 'service-instance_ab101b9e-b437-46a3-8000-0a1734087f90' can be backed up.

We can see that my Kubernetes deployment is good to be backed up. So let’s do it: we execute the same command, replacing pre-backup-check with backup and adding --with-manifest at the end. The --with-manifest parameter includes the bosh deployment manifest in the backup artifact. Be aware that the manifest contains credentials, so keep the artifact somewhere safe. The bosh deployment manifest will be used later to recreate the VMs.

➜  ~ bbr deployment -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 backup --with-manifest
[bbr] 2019/01/30 11:56:59 INFO - Looking for scripts
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/backup
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/metadata
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/post-restore-unlock
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/pre-restore-lock
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/restore
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/post-restore-unlock
[bbr] 2019/01/30 11:57:12 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/pre-restore-lock
[bbr] 2019/01/30 11:57:12 INFO - Running pre-checks for backup of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90...
[bbr] 2019/01/30 11:57:13 INFO - Starting backup of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90...
[bbr] 2019/01/30 11:57:14 INFO - Running pre-backup-lock scripts...
[bbr] 2019/01/30 11:57:14 INFO - Finished running pre-backup-lock scripts.
[bbr] 2019/01/30 11:57:14 INFO - Running backup scripts...
[bbr] 2019/01/30 11:57:14 INFO - Backing up bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 11:57:15 INFO - Finished backing up bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed.
[bbr] 2019/01/30 11:57:15 INFO - Finished running backup scripts.
[bbr] 2019/01/30 11:57:15 INFO - Running post-backup-unlock scripts...
[bbr] 2019/01/30 11:57:15 INFO - Finished running post-backup-unlock scripts.
[bbr] 2019/01/30 11:57:16 INFO - Copying backup -- 2.9M uncompressed -- for job cfcr-etcd-snapshot on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 11:57:29 INFO - Finished copying backup -- for job cfcr-etcd-snapshot on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 11:57:29 INFO - Starting validity checks -- for job cfcr-etcd-snapshot on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 11:57:30 INFO - Finished validity checks -- for job cfcr-etcd-snapshot on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 11:57:30 INFO - Backup created of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 on 2019-01-30 11:57:15.475991 +0100 CET m=+16.515467496

The backup finished successfully and I have the backup artifact on my Jumpbox. In general, it is advisable to store backup artifacts on secure storage and to keep multiple copies of them.

➜  ~ ls -lh service-instance_ab101b9e-b437-46a3-8000-0a1734087f90_20190130T105658Z
total 6256
-rw-r--r--  1 aullah  staff   2.8M Jan 30 11:57 cfcr-etcd-snapshot.tar
-rw-r--r--  1 aullah  staff    43K Jan 30 11:57 manifest.yml
-rw-r--r--  1 aullah  staff   235B Jan 30 11:57 metadata
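As a hypothetical example of that, the artifact directory can be bundled into a single archive before copying it off the Jumpbox (the scp destination below is a placeholder):

```shell
# Compress the BBR backup artifact directory (name as created by the
# backup above) into a single archive for off-box storage.
ARTIFACT=service-instance_ab101b9e-b437-46a3-8000-0a1734087f90_20190130T105658Z
mkdir -p "${ARTIFACT}"            # no-op when the artifact already exists
tar -czf "${ARTIFACT}.tgz" "${ARTIFACT}"
ls -lh "${ARTIFACT}.tgz"
# Copy the archive to secure storage, e.g. (hypothetical destination):
# scp "${ARTIFACT}.tgz" backup-user@backup-host:/backups/pks/
```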

Disaster

In the disaster recovery scenario that we are going to demonstrate, all Kubernetes cluster VMs will be gone. Here are some details about the Kubernetes cluster and its Pods before we delete everything.

➜  ~ pks cluster k8s-cluster-01
Name:                     k8s-cluster-01
Plan Name:                small
UUID:                     ab101b9e-b437-46a3-8000-0a1734087f90
Last Action:              CREATE
Last Action State:        succeeded
Last Action Description:  Instance provisioning completed
Kubernetes Master Host:   pks-cluster-01
Kubernetes Master Port:   8443
Worker Nodes:             1
Kubernetes Master IP(s):  172.16.10.1
Network Profile Name:

The following Pods are running on the Kubernetes cluster:

➜  ~ kubectl get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                                   NOMINATED NODE
nginx-cdb6b5b95-jt8tz             1/1     Running   0          18h   10.200.52.17   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
nginx-cdb6b5b95-wzkhn             1/1     Running   0          18h   10.200.52.19   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-86666d647f-4k6vv            1/1     Running   0          18h   10.200.52.18   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-86666d647f-cjpj5            1/1     Running   0          18h   10.200.52.16   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-server-77b4d88467-j5vmh     1/1     Running   0          18h   10.200.52.15   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-appserver-58db84c875-xktlf   1/1     Running   0          18h   10.200.52.13   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-db-69b5c4dc8b-mpv6h          1/1     Running   0          18h   10.200.52.12   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-ui-6b5d855894-4wmrl          1/1     Running   0          18h   10.200.52.14   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>

These are the VMs that are forming my Kubernetes cluster:

[Screenshots: vSphere Client showing the master and worker VMs that form the Kubernetes cluster]

Now we are going to create a disaster recovery scenario by powering off and deleting all the VMs that make up the Kubernetes cluster. We will see a warning that the VMs are managed by Bosh and should not be modified from within the vSphere Client. Nevertheless, we ignore this warning for now, as we deliberately want to create a disaster.

[Screenshots: powering off and deleting the cluster VMs in the vSphere Client, including the warning that the VMs are Bosh-managed]

Bosh will recognize that the VMs are gone and will show an “unresponsive” state.

[Screenshot: bosh instances showing the VMs in an “unresponsive” state]

As a note, I disabled Bosh’s self-healing (resurrection) functionality up front, e.g. via “bosh update-resurrection off”. Otherwise, Bosh would immediately start recreating the VMs.

Restore

Ok, we have deleted everything; now let’s start with the restore. First of all, we need to recreate the VMs from the bosh manifest that is part of our backup artifact. But before we can do that, we need to clean up the bosh deployment. Run “bosh cck -d &lt;deployment_ID&gt;” to remove the missing VM and disk references from the bosh deployment. Be aware that you may have to follow a different restore procedure depending on your disaster scenario. In this scenario, we have already deleted the VMs and disks, so we can safely remove the references from the bosh deployment by selecting “4: Delete VM reference” and “2: Delete disk reference”.

➜ ~ bosh cck -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90
Using environment '192.168.96.1' as client 'ops_manager'

Using deployment 'service-instance_ab101b9e-b437-46a3-8000-0a1734087f90'

Task 1515

Task 1515 | 11:26:39 | Scanning 2 VMs: Checking VM states (00:00:20)
Task 1515 | 11:26:59 | Scanning 2 VMs: 0 OK, 0 unresponsive, 2 missing, 0 unbound (00:00:00)
Task 1515 | 11:26:59 | Scanning 2 persistent disks: Looking for inactive disks (00:00:09)
Task 1515 | 11:27:08 | Scanning 2 persistent disks: 0 OK, 2 missing, 0 inactive, 0 mount-info mismatch (00:00:00)

Task 1515 Started  Wed Jan 30 11:26:39 UTC 2019
Task 1515 Finished Wed Jan 30 11:27:08 UTC 2019
Task 1515 Duration 00:00:29
Task 1515 done

#  Type          Description
5  missing_vm    VM for 'worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0)' with cloud ID 'vm-2ad56750-dbaa-428c-a90a-53db69cdf546' missing.
6  missing_vm    VM for 'master/b41b12b9-231f-483b-b174-0537c80df2ed (0)' with cloud ID 'vm-9f07d153-029b-4da1-92fb-91da8ab09394' missing.
7  missing_disk  Disk 'disk-91550b85-de87-43eb-8424-15ad69c80616' (master/b41b12b9-231f-483b-b174-0537c80df2ed, 10240M) is missing
8  missing_disk  Disk 'disk-52456894-cf35-46cc-9dd2-ab0e262d2790' (worker/3db3f417-e64a-49d9-925e-2d99f7fc3578, 51200M) is missing

4 problems

1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0)' with cloud ID 'vm-2ad56750-dbaa-428c-a90a-53db69cdf546' missing. (1): 4

1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'master/b41b12b9-231f-483b-b174-0537c80df2ed (0)' with cloud ID 'vm-9f07d153-029b-4da1-92fb-91da8ab09394' missing. (1): 4

1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-91550b85-de87-43eb-8424-15ad69c80616' (master/b41b12b9-231f-483b-b174-0537c80df2ed, 10240M) is missing (1): 2

1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-52456894-cf35-46cc-9dd2-ab0e262d2790' (worker/3db3f417-e64a-49d9-925e-2d99f7fc3578, 51200M) is missing (1): 2

Continue? [yN]: y

Task 1516

Task 1516 | 11:28:21 | Applying problem resolutions: VM for 'worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0)' with cloud ID 'vm-2ad56750-dbaa-428c-a90a-53db69cdf546' missing. (missing_vm 49): Delete VM reference (00:00:00)
Task 1516 | 11:28:21 | Applying problem resolutions: VM for 'master/b41b12b9-231f-483b-b174-0537c80df2ed (0)' with cloud ID 'vm-9f07d153-029b-4da1-92fb-91da8ab09394' missing. (missing_vm 48): Delete VM reference (00:00:00)
Task 1516 | 11:28:21 | Applying problem resolutions: Disk 'disk-91550b85-de87-43eb-8424-15ad69c80616' (master/b41b12b9-231f-483b-b174-0537c80df2ed, 10240M) is missing (missing_disk 16): Delete disk reference (DANGEROUS!) (00:00:05)
Task 1516 | 11:28:26 | Applying problem resolutions: Disk 'disk-52456894-cf35-46cc-9dd2-ab0e262d2790' (worker/3db3f417-e64a-49d9-925e-2d99f7fc3578, 51200M) is missing (missing_disk 17): Delete disk reference (DANGEROUS!) (00:00:04)

Task 1516 Started  Wed Jan 30 11:28:21 UTC 2019
Task 1516 Finished Wed Jan 30 11:28:30 UTC 2019
Task 1516 Duration 00:00:09
Task 1516 done

Succeeded

After the command completed successfully, we can see that the bosh deployment is still there but has no VMs referenced.

[Screenshot: bosh instances for the deployment with no VMs listed]

Restore VMs

As a next step, we need to execute “bosh deploy -d &lt;deployment_ID&gt; manifest.yml --recreate” to recreate the VMs. Make sure that you use the still-existing deployment ID and specify the correct path to the manifest.yml file, which can be found inside the backup artifact.

➜ ~ bosh deploy -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 manifest.yml --recreate
Using environment '192.168.96.1' as client 'ops_manager'

Using deployment 'service-instance_ab101b9e-b437-46a3-8000-0a1734087f90'

Continue? [yN]: y

Task 1522

Task 1522 | 12:29:39 | Preparing deployment: Preparing deployment
Task 1522 | 12:29:43 | Warning: DNS address not available for the link provider instance: pivotal-container-service/8f7f1aa0-4551-4670-b863-b7ddec4e5da4
Task 1522 | 12:29:43 | Warning: DNS address not available for the link provider instance: pivotal-container-service/8f7f1aa0-4551-4670-b863-b7ddec4e5da4
Task 1522 | 12:29:44 | Warning: DNS address not available for the link provider instance: pivotal-container-service/8f7f1aa0-4551-4670-b863-b7ddec4e5da4
Task 1522 | 12:29:55 | Preparing deployment: Preparing deployment (00:00:16)
Task 1522 | 12:30:00 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 1522 | 12:30:00 | Creating missing vms: master/b41b12b9-231f-483b-b174-0537c80df2ed (0)
Task 1522 | 12:30:00 | Creating missing vms: worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0)
Task 1522 | 12:31:04 | Creating missing vms: master/b41b12b9-231f-483b-b174-0537c80df2ed (0) (00:01:04)
Task 1522 | 12:31:04 | Creating missing vms: worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0) (00:01:04)
Task 1522 | 12:31:04 | Updating instance master: master/b41b12b9-231f-483b-b174-0537c80df2ed (0) (canary) (00:03:45)
Task 1522 | 12:34:49 | Updating instance worker: worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0) (canary) (00:07:01)

Task 1522 Started  Wed Jan 30 12:29:39 UTC 2019
Task 1522 Finished Wed Jan 30 12:41:50 UTC 2019
Task 1522 Duration 00:12:11
Task 1522 done

Succeeded

Now we should see that the VMs have been recreated in the vSphere Client and also Bosh should show the VM instances in a “running” state.

[Screenshots: vSphere Client showing the recreated master and worker VMs]

[Screenshot: bosh instances showing both VM instances in a “running” state]

Restore Kubernetes Cluster

The VMs are back online, but we still need to recover the Kubernetes cluster and its workloads from the backup artifact. Before we start with the restore, please note that the restore can take a long time to complete. To run the command independently of your SSH session, you can use nohup, screen or tmux. In my case, this is not necessary, as the Jumpbox is my MacBook and I am working directly from the local shell. To start the restore process, execute “bbr deployment -d &lt;deployment_id&gt; restore --artifact-path &lt;backup_artifact&gt;”.
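For reference, detaching the restore with nohup could look like the following sketch, reusing the deployment ID and artifact directory from this post (bbr-restore.log is an arbitrary log file name):

```shell
# Kick off the restore detached from the SSH session; all output goes
# to a log file that survives a dropped connection.
nohup bbr deployment -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 \
  restore --artifact-path service-instance_ab101b9e-b437-46a3-8000-0a1734087f90_20190130T105658Z \
  > bbr-restore.log 2>&1 &
echo "restore started with PID $!"
# Follow the progress later with: tail -f bbr-restore.log
```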

➜  ~ bbr deployment -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 restore --artifact-path service-instance_ab101b9e-b437-46a3-8000-0a1734087f90_20190130T105658Z
[bbr] 2019/01/30 14:20:37 INFO - Starting restore of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90...
[bbr] 2019/01/30 14:20:37 INFO - Validating backup artifact for service-instance_ab101b9e-b437-46a3-8000-0a1734087f90...
[bbr] 2019/01/30 14:20:37 INFO - Looking for scripts
[bbr] 2019/01/30 14:20:48 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/backup
[bbr] 2019/01/30 14:20:48 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/metadata
[bbr] 2019/01/30 14:20:49 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/post-restore-unlock
[bbr] 2019/01/30 14:20:49 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/pre-restore-lock
[bbr] 2019/01/30 14:20:49 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-etcd/restore
[bbr] 2019/01/30 14:20:49 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/post-restore-unlock
[bbr] 2019/01/30 14:20:49 INFO - master/b41b12b9-231f-483b-b174-0537c80df2ed/bbr-kube-apiserver/pre-restore-lock
[bbr] 2019/01/30 14:20:51 INFO - Copying backup for job cfcr-etcd-snapshot on master/0...
[bbr] 2019/01/30 14:20:58 INFO - Finished copying backup for job cfcr-etcd-snapshot on master/0.
[bbr] 2019/01/30 14:20:58 INFO - Running pre-restore-lock scripts...
[bbr] 2019/01/30 14:20:58 INFO - Locking bbr-kube-apiserver on master/b41b12b9-231f-483b-b174-0537c80df2ed for restore...
[bbr] 2019/01/30 14:21:00 INFO - Finished locking bbr-kube-apiserver on master/b41b12b9-231f-483b-b174-0537c80df2ed for restore.
[bbr] 2019/01/30 14:21:00 INFO - Locking bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed for restore...
[bbr] 2019/01/30 14:21:00 INFO - Finished locking bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed for restore.
[bbr] 2019/01/30 14:21:00 INFO - Finished running pre-restore-lock scripts.
[bbr] 2019/01/30 14:21:00 INFO - Running restore scripts...
[bbr] 2019/01/30 14:21:00 INFO - Restoring bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 14:21:01 INFO - Finished restoring bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed.
[bbr] 2019/01/30 14:21:01 INFO - Finished running restore scripts.
[bbr] 2019/01/30 14:21:01 INFO - Completed restore of service-instance_ab101b9e-b437-46a3-8000-0a1734087f90
[bbr] 2019/01/30 14:21:01 INFO - Running post-restore-unlock scripts...
[bbr] 2019/01/30 14:21:01 INFO - Unlocking bbr-kube-apiserver on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 14:21:02 INFO - Finished unlocking bbr-kube-apiserver on master/b41b12b9-231f-483b-b174-0537c80df2ed.
[bbr] 2019/01/30 14:21:02 INFO - Unlocking bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed...
[bbr] 2019/01/30 14:21:23 INFO - Finished unlocking bbr-etcd on master/b41b12b9-231f-483b-b174-0537c80df2ed.
[bbr] 2019/01/30 14:21:23 INFO - Finished running post-restore-unlock scripts.

The restore procedure completed successfully. Let’s verify that everything is running as expected. Run “pks get-credentials <clustername>” to see if we can successfully fetch the credentials from our recovered cluster.

➜  ~ pks get-credentials k8s-cluster-01

Fetching credentials for cluster k8s-cluster-01.
Context set for cluster k8s-cluster-01.

You can now switch between clusters by using:
$kubectl config use-context <cluster-name>

Ok, the connection between PKS and the Kubernetes cluster is working and we could successfully fetch the credentials. But are the Pods recovered as well? Enter “kubectl get pods -o wide” to see if the Pods are running and on which worker node.

➜  ~ kubectl get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                                   NOMINATED NODE
nginx-cdb6b5b95-jt8tz             1/1     Running   0          21h   10.200.52.17   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
nginx-cdb6b5b95-wzkhn             1/1     Running   0          21h   10.200.52.19   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-86666d647f-4k6vv            1/1     Running   0          21h   10.200.52.18   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-86666d647f-cjpj5            1/1     Running   0          21h   10.200.52.16   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
redis-server-77b4d88467-j5vmh     1/1     Running   0          21h   10.200.52.15   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-appserver-58db84c875-xktlf   1/1     Running   0          21h   10.200.52.13   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-db-69b5c4dc8b-mpv6h          1/1     Running   0          21h   10.200.52.12   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>
yelb-ui-6b5d855894-4wmrl          1/1     Running   0          21h   10.200.52.14   94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   <none>

It seems as if the Pods are running fine, but in reality, the cluster is not ready and I cannot access my applications. I can see that the node is in a “NotReady” state after the restore.

➜  ~ kubectl get nodes
NAME                                   STATUS     ROLES    AGE   VERSION
94cc8df2-c20b-41ae-a1e2-afe32ea5db1e   NotReady   <none>   41h   v1.12.4

Additionally, I get the following error messages.

➜  ~ kubectl top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
➜  ~ kubectl logs yelb-ui-6b5d855894-4wmrl
Error from server (NotFound): the server could not find the requested resource ( pods/log yelb-ui-6b5d855894-4wmrl)
➜  ~ kubectl exec -ti redis-86666d647f-4k6vv bash
error: unable to upgrade connection: pod does not exist

The Kubernetes cluster is not ready and we need to restart (stop & start) the bosh deployment to get everything operational again.

➜  ~ bosh -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 stop --force
Using environment '192.168.96.1' as client 'ops_manager'

Using deployment 'service-instance_ab101b9e-b437-46a3-8000-0a1734087f90'

Continue? [yN]: y

Task 1611

Task 1611 | 09:34:51 | Preparing deployment: Preparing deployment (00:00:01)
Task 1611 | 09:34:57 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 1611 | 09:34:57 | Updating instance worker: worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0) (canary) (00:00:13)
                    L Error: Action Failed get_task: Task ec1c10e5-ba2e-4249-784f-2e6a48112896 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.
Task 1611 | 09:35:10 | Error: Action Failed get_task: Task ec1c10e5-ba2e-4249-784f-2e6a48112896 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.

Task 1611 Started  Thu Jan 31 09:34:51 UTC 2019
Task 1611 Finished Thu Jan 31 09:35:10 UTC 2019
Task 1611 Duration 00:00:19
Task 1611 error

Changing state:
  Expected task '1611' to succeed but state is 'error'

Exit code 1

We get an error during the stop operation because the cluster cannot drain the node, but that is expected, as the worker node was not in a ready or healthy state. We could have skipped the drain process by appending --skip-drain to the command.

➜  ~ bosh -d service-instance_ab101b9e-b437-46a3-8000-0a1734087f90 start
Using environment '192.168.96.1' as client 'ops_manager'

Using deployment 'service-instance_ab101b9e-b437-46a3-8000-0a1734087f90'

Continue? [yN]: y

Task 1612

Task 1612 | 09:37:35 | Preparing deployment: Preparing deployment (00:00:02)
Task 1612 | 09:37:41 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 1612 | 09:37:42 | Updating instance master: master/b41b12b9-231f-483b-b174-0537c80df2ed (0) (canary) (00:01:18)
Task 1612 | 09:39:00 | Updating instance worker: worker/3db3f417-e64a-49d9-925e-2d99f7fc3578 (0) (canary) (00:03:29)

Task 1612 Started Thu Jan 31 09:37:35 UTC 2019
Task 1612 Finished Thu Jan 31 09:42:29 UTC 2019
Task 1612 Duration 00:04:54
Task 1612 done

Succeeded

After stopping and starting the bosh deployment, everything is working as expected and the workloads are running.

➜  ~ kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
15c799f0-a68e-415b-a4d4-42eee0d24630   Ready    <none>   12m   v1.12.4
➜  ~ kubectl top nodes
NAME                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
15c799f0-a68e-415b-a4d4-42eee0d24630   67m          6%     876Mi           46%
➜  ~ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
nginx-cdb6b5b95-6z4gz             1/1     Running   0          16m
nginx-cdb6b5b95-8ctxk             1/1     Running   0          16m
redis-86666d647f-2qhmh            1/1     Running   0          16m
redis-86666d647f-b6xfd            1/1     Running   0          16m
redis-server-77b4d88467-fg5z8     1/1     Running   0          16m
yelb-appserver-58db84c875-svhw8   1/1     Running   0          16m
yelb-db-69b5c4dc8b-ffttc          1/1     Running   0          16m
yelb-ui-6b5d855894-xdw7m          1/1     Running   0          16m

Here is a screenshot from my yelb app to prove that my workloads are up and running again.

[Screenshot: the yelb application UI, reachable again]

We have successfully backed up, destroyed and restored our VMware PKS 1.3 managed Kubernetes cluster with the BBR (Bosh Backup and Restore) tool.

Conclusion

Bosh Backup and Restore is a good way to back up the PKS control plane and, since PKS 1.3, also single master node Kubernetes clusters with stateless workloads. Once you have understood the process and the tools, it is very easy to execute. Unfortunately, it still has some limitations, such as missing NSX-T and persistent volume support. But let’s stay tuned and see which features will be added in future versions.

If you want to learn more about PKS 1.3, please have a look at my VMware PKS 1.3 What’s New blog post.
