Managing a Ceph cluster can be a complex yet rewarding experience. Recently, I encountered an issue with an OSD (Object Storage Daemon) going backfillfull, which required immediate attention to keep my storage infrastructure running smoothly. In this blog post, I'll share the steps I took to resolve the OSD backfillfull issue and keep my Ceph cluster performing well. Here's what ceph -s reported at the time:

[root@bootstrap]# sudo ceph -s
  cluster:
    id:     5b818fec-0b08-11ec-9007-005056b783e1
    health: HEALTH_WARN
            1 backfillfull osd(s)
            15 pool(s) backfillfull

  services:
    mon: 5 daemons, quorum node02,bootstrap,node01,node03,node05 (age 17h)
    mgr: node03.pvrgzt(active, since 3w), standbys: node05.bgupnv, bootstrap.bexjaj
    osd: 12 osds: 12 up (since 8w), 12 in (since 2M)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   15 pools, 593 pgs
    objects: 637.67k objects, 2.8 TiB
    usage:   8.3 TiB used, 2.2 TiB / 10 TiB avail
    pgs:     592 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   252 KiB/s rd, 4.6 MiB/s wr, 209 op/s rd, 576 op/s wr

Understanding the Issue

Ceph clusters rely on OSDs to store data. Each OSD is responsible for storing objects and handling data replication, recovery, and rebalancing. Whenever data has to be moved between OSDs, for example after a failure or a change to the CRUSH map, Ceph uses backfill operations to copy placement groups to their new locations. Once an OSD's usage crosses the backfillfull threshold (90% by default), Ceph stops backfilling data onto it, and the resulting imbalance can degrade performance if not managed properly.

In my case, I received a warning indicating that one of my OSDs had crossed the backfillfull threshold. This meant the OSD was nearly full and Ceph would no longer backfill data onto it, so the cluster could not rebalance properly. It was crucial to address this promptly, before the OSD reached the full threshold and started blocking writes altogether.

Step 1: Identifying the Full OSD

The first step was to identify which OSD had reached the backfillfull state. Using the Ceph command-line interface (CLI), I ran the following command:

sudo ceph osd df

This command provided a detailed report on the disk usage of each OSD in my cluster. I was able to identify the specific OSD that was full and needed immediate attention.
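
The health warning itself also points at the culprit. Running the detailed health report prints each warning with the affected OSD and pools spelled out, which saves scanning the whole ceph osd df table:

sudo ceph health detail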

Step 2: Increasing the Backfillfull Ratio

As a temporary measure, I decided to raise the backfillfull ratio so the cluster could continue rebalancing. This would help alleviate the immediate pressure on the full OSD. I executed the following command:

sudo ceph osd set-backfillfull-ratio 0.95

This command raised the backfillfull ratio from the default of 90% to 95%, giving the cluster some breathing room to continue operations while I worked on a more permanent solution. Keep in mind that 95% is also the default full ratio, so this is only safe as a short-lived stopgap.
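
To verify the change (and to see the nearfull and full ratios alongside it), the current values can be read back from the OSD map:

sudo ceph osd dump | grep -i ratio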

Step 3: Rebalancing the Cluster

Next, I needed to ensure that the cluster was rebalancing correctly. I checked the status of the cluster using:

sudo ceph status

This command provided an overview of the cluster's health and any ongoing operations. I monitored the rebalancing process to ensure it was progressing smoothly.
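
Rather than re-running the command by hand, the progress can also be followed continuously; both of these are standard options, pick whichever you prefer:

watch -n 10 sudo ceph -s
sudo ceph -w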

Step 4: Cleaning Up Orphaned RBD Images

I also discovered that the pools backing my OpenStack images and volumes had accumulated numerous old RBD images, and the images pool in particular was contributing to the high disk usage. To clean it up, I first created a list of all OpenStack image IDs:

openstack image list -f value -c ID > openstack_image_ids.txt

Next, I created a list of all RBD images in the images pool:

rbd ls images > openstack_rbds.txt

Now we combine the two lists, sort them, and keep only the entries that appear exactly once. IDs present both in OpenStack and in the pool show up twice and get dropped; whatever remains has no matching Glance image:

cat openstack_image_ids.txt openstack_rbds.txt > all.txt
sort all.txt | uniq -u > unique.txt
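
An equivalent way to get the same result is to compute the set difference directly; this one-liner assumes bash, since it relies on process substitution:

comm -23 <(sort openstack_rbds.txt) <(sort openstack_image_ids.txt) > unique.txt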

The file unique.txt should now only contain RBD images that no longer exist in my OpenStack cluster (strictly speaking it could also pick up image IDs that have no RBD at all, but those would simply fail the export and removal below). Now we can loop through that file and remove all orphaned images. But before doing that, I exported them just to be on the safe side. For this I quickly mounted an NFS share on my machine to store the exports.

for rbd in $(cat unique.txt); do echo "rbd export images/$rbd - | gzip > $rbd.img.gz"; rbd export images/$rbd - | gzip > $rbd.img.gz; done

After some time the RBD images were exported and it was time to get rid of them:

for rbd in $(cat unique.txt); do echo "rbd rm images/$rbd"; rbd rm images/$rbd; done

This loop removed all orphaned RBD images, freeing up valuable disk space.
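
One caveat worth noting: rbd rm refuses to delete an image that still has snapshots. If any of the removals fail with that error, purging the snapshots first should let the delete go through (here $rbd stands for the affected image; protected snapshots would additionally need to be unprotected):

rbd snap purge images/$rbd
rbd rm images/$rbd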

Final Step: Adding More Capacity

To prevent the issue from recurring, I decided to add more OSDs to increase the overall capacity of the cluster. Adding new OSDs would help distribute the data more evenly and reduce the likelihood of hitting the backfillfull threshold in the future.

I'm using cephadm, so all I needed to do was add the new disks to my servers; cephadm automatically initialized them and brought them into the cluster as OSDs. To speed up device discovery, I ran:

sudo ceph orch device ls --refresh

This automatic behaviour comes from the all-available-devices OSD service that is applied in my cluster: as long as it is managed, cephadm creates an OSD on every new, unused disk it finds. It can be disabled by marking the service unmanaged:

sudo ceph orch apply osd --all-available-devices --unmanaged=true
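
With the service unmanaged, new disks have to be added explicitly. As a rough sketch (the hostname node06 and device /dev/sdb below are placeholders, not from my cluster):

sudo ceph orch daemon add osd node06:/dev/sdb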

After adding all the disks, I checked their status:

sudo ceph osd status

Finally, the cluster status was healthy again.
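
With the extra capacity in place and the data rebalanced, don't forget to undo the stopgap from Step 2 and drop the backfillfull ratio back to its default of 90%:

sudo ceph osd set-backfillfull-ratio 0.90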