I screwed up big time!!!
How I Accidentally Nuked My Blog (And What I Learned About Backups)
A tale of overconfidence, tiredness, Kubernetes PVs, and a backup strategy that wasn't quite what I thought it was. Spoiler: my blog didn't survive.
The Setup:
My blog is hosted on a self-managed Ghost instance running in a K3s Kubernetes cluster on a VPS. To keep things safe, I had set up Velero for daily, weekly, and monthly backups, and I even published two blog posts about it, confident that it would handle both metadata and persistent volumes (PVs) seamlessly.
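For context, those were plain Velero cron schedules, roughly like this (the names and cron expressions here are illustrative, not my exact setup):

velero schedule create daily-ghost --schedule="0 3 * * *" --include-namespaces ghost
velero schedule create weekly-ghost --schedule="0 4 * * 0" --include-namespaces ghost
velero schedule create monthly-ghost --schedule="0 5 1 * *" --include-namespaces ghost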
The Incident:
After publishing my latest two blog posts about how to leverage NVIDIA GPUs in OpenStack and Kubernetes, I wanted to do a little cleanup on my server and, while at it, update to the latest version of Ghost. To be honest, it was already quite late, and I wish I had gone straight to bed instead.
As usual before making changes, I took a fresh backup of the blog with Velero. The backup finished with status Completed, but I wanted to be 100% sure, so I also checked the logs to make sure there were no anomalies like failed backups of the PVs. Confident that my blog could easily be restored if something went wrong during the update, I ran the usual helm upgrade procedure.
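The pre-upgrade backup was nothing fancy (the backup name is the one from my logs; the exact flags are a sketch):

# Take a one-off backup of the ghost namespace and wait for it to finish
velero backup create manual-latest-080125 --include-namespaces ghost --wait

# Double-check the result
velero backup describe manual-latest-080125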
To make it short: the update failed. After checking the Ghost Helm chart release notes, it turned out that the latest version had some breaking changes that required some work on the config before updating. OK, fine. Let's just uninstall the Helm chart and restore from backup.
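In practice, that teardown plus restore looks roughly like this (release and namespace names inferred from the PVC names shown later in this post):

# Remove the broken release, then clean out the namespace leftovers
helm uninstall roksblog -n ghost
kubectl delete namespace ghost

# Bring everything back from the pre-upgrade backup
velero restore create --from-backup manual-latest-080125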
Usually when restoring backups through Velero, I also clean up the namespace, because leftovers from the previous Helm chart install, like the PVs, might cause issues. I started the restore, and after a few moments it finished successfully. I quickly opened a fresh browser window to check if the blog was reachable again, and it was. But wait... something is not right here. Where is my data!? No blog posts, just the default look, as if it was freshly installed. I thought, OK, let's check the Velero restore logs. Looking through them, I saw these three lines:
v1/PersistentVolume:
- pvc-7d4d2a56-e6b2-4fb4-a373-8116de67cae3(skipped)
- pvc-95eb084e-6ddf-43d5-a22b-4ade78e0fd88(skipped)
Alright, no worries, let's just retry the restore. Unfortunately: same result. Panic started rising... So I went ahead and checked the logs of the backup I had taken before the update again, and yes, it listed the PVs and PVCs as backed up:
velero describe backup manual-latest-080125 --details
....
v1/PersistentVolume:
- pvc-7d4d2a56-e6b2-4fb4-a373-8116de67cae3
- pvc-95eb084e-6ddf-43d5-a22b-4ade78e0fd88
v1/PersistentVolumeClaim:
- ghost/data-roksblog-mysql-0
- ghost/roksblog-ghost
So I thought...
But hey, as I mentioned, it was already late, and I had actually looked at the wrong section of the output!!! The summary of the backed-up volumes is at the very end:
Backup Volumes:
Velero-Native Snapshots: <none included>
CSI Snapshots: <none included or not detectable>
Pod Volume Backups: <none included>
And in fact, no PVs had been backed up. Now I was really in panic mode. What went wrong here? I checked the Velero deployment, and everything looked fine. I'm using File System Backup (FSB) with Restic, and the node agent was running:
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
node-agent-f496t 1/1 Running 0 8h
velero-67fff6fb55-86ls2 1/1 Running 0 8h
Let's check the Velero Helm chart values:
helm get values -n velero velero
USER-SUPPLIED VALUES:
configuration:
  backupStorageLocation:
  - bucket: velero
    config:
      region: None
      s3ForcePathStyle: true
      s3Url: ***************
    name: default
    provider: aws
    defaultVolumesToFsBackup: true
credentials:
  secretContents:
    cloud: |
      [default]
      aws_access_key_id = **************
      aws_secret_access_key = *********************************
deployNodeAgent: true
initContainers:
- image: velero/velero-plugin-for-aws:latest
  imagePullPolicy: IfNotPresent
  name: velero-plugin-for-aws
  volumeMounts:
  - mountPath: /target
    name: plugins
snapshotsEnabled: false
I knew I was using the opt-out pod volume backup method, where basically all pod volumes are backed up by default, and the following parameter enables this:
configuration.defaultVolumesToFsBackup
However, it was misindented, making Helm unable to associate the parameter with the proper context!!! 😦
Since Helm doesn't validate the structure of a values file or check the placement of keys like defaultVolumesToFsBackup (unless the chart ships a values schema), it accepted the file as-is and rendered the templates accordingly. The misplaced parameter was simply ignored.
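To make that concrete, here is a reconstruction (not my exact values file) of the difference, assuming the current chart layout where defaultVolumesToFsBackup belongs directly under configuration:

# WRONG: nested inside the backupStorageLocation list item, where no template looks for it
configuration:
  backupStorageLocation:
  - bucket: velero
    provider: aws
    defaultVolumesToFsBackup: true

# RIGHT: a direct child of configuration
configuration:
  backupStorageLocation:
  - bucket: velero
    provider: aws
  defaultVolumesToFsBackup: true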
At some stage, while messing with the chart values, I must have accidentally indented it incorrectly.
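A quick sanity check after every helm upgrade would have caught this immediately (a sketch, assuming the chart translates the value into the --default-volumes-to-fs-backup server flag):

# The server args should contain --default-volumes-to-fs-backup=true
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.containers[0].args}'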
The Aftermath:
- Hours spent combing through logs.
- Hours spent looking at old backups on my external disks.
- Hours spent trying to restore files with xfs_undelete and TestDisk.
- Panic.
- Finally, the grim realization: my blog was gone.
Luckily, what they say is true: the Internet never forgets. Thanks to the Wayback Machine, I was able to recover most of my content. While a few of my latest posts are gone for good, the majority of my blog lives on, ready to be restored.
The Lessons:
- Test Your Backups Regularly: A backup isn't truly a backup until you've successfully restored it (see the sketch after this list).
- Understand Your Tools: Velero only backs up PV data when file system backup or volume snapshots are configured correctly; a Completed backup status alone says nothing about your volume data.
- Layered Redundancy: Always have more than one backup strategy.
- Avoid Risky Tasks Late at Night: Performing critical tasks when you're tired increases the chances of mistakes. Save these for when you're alert and focused.
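A cheap way to practice the first lesson is to restore into a throwaway namespace and actually look at the data (names here are illustrative):

# Restore the ghost namespace into ghost-restore-test without touching production
velero restore create test-restore --from-backup manual-latest-080125 \
    --namespace-mappings ghost:ghost-restore-test

# Then verify the volumes really contain your data before deleting the test namespace
kubectl -n ghost-restore-test get pvc,pods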
The Bright Side:
This experience pushed me to rethink my entire approach to backup and recovery. I've since reconfigured Velero to make sure PV data is actually backed up and set up regular tests to verify the integrity of my backups. Sharing these hard-earned lessons here is my way of helping you avoid the same mistake.
If you’re running a blog, a cluster, or any service you care about, take this as a gentle nudge: double-check your backup strategy today!