OpenStack 2024.1 → 2025.1 (SLURP) Upgrade with Kolla-Ansible — Lessons from the Trenches

created with ChatGPT

Yes, it's kind of late, but to be honest a lot of people run even older OpenStack releases today. For me it was finally time to upgrade: 2024.1 has been unmaintained for quite some time now. And as you can imagine, the upgrade didn't go smoothly on the first try, but I would not have expected otherwise. Note that this is a SLURP upgrade, so you can skip 2024.2. My first recommendation before you upgrade: first update to the latest kolla-ansible 2024.1 release.

Environment

  • OpenStack: 2024.1 → 2025.1 (SLURP)
  • Deployment: Kolla-Ansible
  • OS: Rocky Linux 9
  • Container Runtime: Podman
  • Storage: Ceph and NFS (no Cinder HA)
  • Network: OVS
  • Magnum: VEXXHOST CAPI driver

Upgrade Flow (TL;DR)

  1. Update Kolla-Ansible 2024.1
  2. Update containers (2024.1)
  3. Verify platform health
  4. Backup MariaDB (test restore!)
  5. RabbitMQ migration (durable queues)
  6. Prepare 2025.1 environment (Python 3.12)
  7. Merge configs (inventory + passwords)
  8. Run prechecks
  9. Upgrade to 2025.1
  10. Patch Magnum VEXXHOST CAPI Bug
  11. Validate

Before You Do Anything

My very first recommendation (and this one is critical):

👉 Update to the latest kolla-ansible version within 2024.1 first

Do NOT jump straight into the next release with outdated deployment tooling.
This alone can save you from a lot of unnecessary pain.

BACKUP. YOUR. DATABASE.

I can’t stress this enough.

Before even thinking about upgrading:

👉 Backup your MariaDB

If something goes wrong—and at some point, something will—you need a way back. Otherwise, you're not upgrading… you're gambling.

If you have not enabled MariaDB backups yet, it is very simple:

https://docs.openstack.org/kolla-ansible/2024.2/admin/mariadb-backup-and-restore.html

Run:

kolla-ansible -i multinode mariadb_backup --full

And seriously: verify that the backup actually worked. Ideally, do a test restore to be 100% certain.
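
A quick sanity check is to confirm that the newest backup file actually exists and is non-empty. This is a sketch; the default path below assumes Podman's volume storage and Kolla's mariadb_backup volume name, so adjust it for your environment:

```shell
# Print the newest file in a backup directory and fail if it is missing
# or empty. The default path is an assumption (Podman volume storage with
# a volume named "mariadb_backup") -- adjust for your setup.
check_latest_backup() {
    dir="${1:-/var/lib/containers/storage/volumes/mariadb_backup/_data}"
    latest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "ERROR: no backup files found in $dir" >&2
        return 1
    fi
    if [ ! -s "$dir/$latest" ]; then
        echo "ERROR: latest backup $latest is empty" >&2
        return 1
    fi
    echo "OK: latest backup is $latest"
}

# Example (run on a control node):
# check_latest_backup
```

This only proves a file was written; it is no substitute for the test restore.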

Update to latest Kolla-Ansible 2024.1 release

# switch to your venv
source 2024.1/bin/activate

# Check current version
$ kolla-ansible --version
18.7.1

# Ensure the latest version of pip is installed
pip install -U pip

# Update Kolla-Ansible
pip install --upgrade git+https://opendev.org/openstack/kolla-ansible@unmaintained/2024.1

# Verify
$ kolla-ansible --version
18.8.1

Update 2024.1 containers

Next we need to update the containers on all OpenStack nodes. For this we first pull the latest images and then deploy. It is recommended not to perform any other tasks on the control plane during the deploy phase. This ensures all database migrations and container images are in a consistent state before performing the SLURP upgrade.

kolla-ansible -i multinode pull
kolla-ansible -i multinode deploy-containers

If you get errors during the deploy phase, don't worry: your running workloads will very likely not be affected. In my case I was connected to an instance via SSH and had no disconnects. I had many errors with Prometheus containers, but they were most likely due to issues with container storage on the control nodes. I force-removed the containers and re-ran deploy-containers.

Verify functionality after 2024.1 update

Once the deploy phase was completed successfully I quickly ran some tests and checks to ensure that everything was working as expected.

openstack volume service list
openstack server list
openstack coe cluster list
openstack loadbalancer list
openstack endpoint list
openstack hypervisor list
openstack network agent list
openstack compute service list

# create a small cirros vm and delete it
openstack server create \
    --image cirros \
    --flavor m1.tiny \
    --key-name mykey \
    --network demo-net \
    demo2025.1
openstack server show demo2025.1
openstack server delete demo2025.1

# create a small volume and delete it
openstack volume create --size 10 test2025.1
openstack volume show test2025.1
openstack volume delete test2025.1

Upgrade to 2025.1

After confirming that everything was still functioning as expected, I proceeded with the upgrade to 2025.1. It is crucial to carefully review the following sections and complete all prerequisites before starting the actual upgrade.

Important changes in 2025.1

🔴 Critical / Action Required

1. Prometheus v2 → v3

  • Must be on Prometheus v2.55 (the last v2 series) or later before upgrading — verify your current version first:
ansible -i multinode control -m shell -a 'podman exec prometheus_server /opt/prometheus/prometheus --version' -b
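
To gate on that automatically, a small helper can parse the reported version and compare it against 2.55. This is a sketch; it assumes the usual "prometheus, version X.Y.Z (...)" shape of the --version output:

```shell
# Check whether a "prometheus --version" output line reports at least
# the given minimum version (default 2.55, the last v2 series).
prometheus_version_ok() {
    line="$1"
    min_major="${2:-2}"
    min_minor="${3:-55}"
    # Expected shape: "prometheus, version 2.55.1 (branch: ...)"
    ver=$(echo "$line" | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p')
    [ -n "$ver" ] || return 1
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    [ "$major" -gt "$min_major" ] && return 0
    [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]
}

# Example:
# out=$(podman exec prometheus_server /opt/prometheus/prometheus --version | head -n 1)
# prometheus_version_ok "$out" && echo "safe to upgrade" || echo "update Prometheus v2 first"
```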

2. RabbitMQ 4.0 Major Upgrade — Durable Type Migration

In the Epoxy release of Kolla, the version of RabbitMQ will be updated to 4.0. As a result, all queues must be migrated to a durable type prior to upgrading to Epoxy. This can be done by setting the following options and then following the migration procedure outlined below. The prechecks (kolla-ansible prechecks) will catch this and fail if any non-quorum queues are detected.

om_enable_queue_manager: true
om_enable_rabbitmq_quorum_queues: true
om_enable_rabbitmq_transient_quorum_queue: true
om_enable_rabbitmq_stream_fanout: true
  • RabbitMQ is jumping to v4.0 — this is a significant major version bump
  • Review the RabbitMQ 4.0 changelog for breaking changes before deploying

3. neutron-linuxbridge-agent Dropped

  • If you're using Linux bridge (not OVS), this is relevant
  • You need to migrate to OVS or OVN
  • This is a breaking change!

4. Disable ProxySQL and Cinder cluster precheck

Since I’m not using ProxySQL or a traditional HA storage setup, but instead rely on Ceph, I had to disable the following options. Otherwise, the upgrade process would fail with errors.

echo 'cinder_cluster_skip_precheck: true' >> /etc/kolla/globals.yml
echo "enable_proxysql: \"no\"" >> /etc/kolla/globals.yml

5. Magnum Cluster API issue

After the migration I was no longer able to create K8s clusters. There is currently a bug in the container packaging. It can be fixed by running a patch. See my steps below for more info.

🟡 Important Notices

1. Logging Format Changed

  • Container logs now include timestamps, log levels, and custom date format
  • If you have any log parsers, alerts, or monitoring rules scraping container logs, they will need to be updated

2. MariaDB Backups

  • Incremental backup behavior has changed (directory-based)
  • Take a fresh full backup after the upgrade before relying on incrementals

3. ansible-core 2.18 in Kolla Toolbox

  • If you have any custom Ansible playbooks or roles, test them for compatibility

🟢 Deprecations to Plan For (2025.2)

  • swift images are deprecated and will be removed in 2025.2 — plan migration if you use Swift
  • bifrost-deploy is deprecated in favor of ironic standalone deployment

RabbitMQ Durable Type Migration

Maybe your queues are already durable; if not, follow these steps. Running workloads (VMs) will survive, since it is only the message queue being reset, but any API calls during that window will fail or hang, so it is good practice to warn your users beforehand. By mistake I ran these steps after I had already upgraded to kolla-ansible 2025.1, which of course threw a couple of errors. In case you get into the same situation, see the Troubleshooting section below.

⚠️ Impact:

  • Running VMs are NOT affected
  • API calls WILL fail during this step
  • Expect temporary control-plane downtime

Before anything else, add these to globals.yml:

echo "om_enable_queue_manager: true" >> /etc/kolla/globals.yml
echo "om_enable_rabbitmq_quorum_queues: true" >> /etc/kolla/globals.yml
echo "om_enable_rabbitmq_transient_quorum_queue: true" >> /etc/kolla/globals.yml
echo "om_enable_rabbitmq_stream_fanout: true" >> /etc/kolla/globals.yml

Step 1 — Generate new config

kolla-ansible genconfig -i multinode

Step 2 — Stop all RabbitMQ-using services - now it’s getting serious…

kolla-ansible stop -i multinode --tags \
  nova,neutron,heat,magnum,manila,octavia,cinder,glance \
  --yes-i-really-really-mean-it

Step 3 — Reconfigure RabbitMQ

kolla-ansible reconfigure -i multinode --tags rabbitmq

Step 4 — Reset RabbitMQ state ⚠️ point of no return

kolla-ansible rabbitmq-reset-state -i multinode

Step 5 — Bring services back up

kolla-ansible deploy -i multinode --tags \
  nova,neutron,heat,magnum,manila,octavia,cinder,glance
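
To verify the migration actually took, you can list queue types on a RabbitMQ node and flag anything that is neither a quorum nor a stream queue. The filter below is a sketch that parses `rabbitmqctl list_queues name type` style output:

```shell
# Read "name type" lines (as produced by "rabbitmqctl list_queues name
# type") on stdin and print any queue whose type is neither quorum nor
# stream. The header line, if present, is skipped.
non_quorum_queues() {
    awk '$1 != "name" && $2 != "quorum" && $2 != "stream" { print $1 " (" $2 ")" }'
}

# Example (run against a RabbitMQ node; no output means all good):
# podman exec rabbitmq rabbitmqctl list_queues name type --quiet | non_quorum_queues
```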

Perform the 2025.1 Upgrade

I run on Rocky 9 (Podman), and 2025.1 requires a newer Python version, so I first installed Python 3.12 and a couple of dependencies.

sudo dnf install -y python3.12-devel python3.12-pip python3-dbus dbus-devel python3-devel

Next I created a new venv for 2025.1 and installed kolla-ansible 2025.1.

deactivate
python3.12 -m venv ~/2025.1
source ~/2025.1/bin/activate
pip install -U pip
pip install git+https://opendev.org/openstack/kolla-ansible@stable/2025.1
kolla-ansible install-deps
# check kolla-ansible version
$ kolla-ansible --version
kolla-ansible 20.3.1.dev31
# upgrade ansible
pip3 install --upgrade "ansible-core>=2.17,<2.19"
# I also had to install these
pip install dbus-python
pip install podman
pip install 'bcrypt<5.0.0'

Next we change the release in globals.yml.

sed -i 's/openstack_release:.*/openstack_release: "2025.1"/' /etc/kolla/globals.yml
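
A quick check that the sed actually took effect. This is a sketch that parses the flat YAML key with shell, which is fine for a simple key like this one:

```shell
# Extract the value of openstack_release from a globals.yml-style file,
# with or without surrounding quotes.
get_release() {
    sed -n 's/^openstack_release:[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}.*/\1/p' "$1"
}

# Example:
# get_release /etc/kolla/globals.yml   # should print 2025.1
```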

Merging the Kolla-Ansible (multinode) Inventory File

The multinode inventory file contains two logical sections: the host definitions (your custom control, compute, and storage node assignments) and the role group mappings (which nodes run which OpenStack services). When upgrading to a new Kolla-Ansible version, the role group mappings may change — new groups can be added, deprecated ones removed — while your host definitions remain the same.

The goal here is to preserve your custom host assignments from the old inventory while adopting the updated role group structure from the 2025.1 template. The boundary between these two sections is the [deployment] group, which serves as a natural split point.

Rather than manually copying and pasting between files in an editor, we can automate this cleanly with sed:

# Backup originals
cp multinode multinode.2024.1
cp ./2025.1/share/kolla-ansible/ansible/inventory/multinode multinode.2025.1

# Get everything UP TO AND INCLUDING [deployment] + 1 line from the old file
sed -n '1,/^\[deployment\]/{p; /^\[deployment\]/{n;p}}' multinode.2024.1 > multinode.new

# Append everything AFTER [deployment] from the new 2025.1 file
sed -n '/^\[deployment\]/,$ p' multinode.2025.1 | tail -n +3 >> multinode.new

# Replace the multinode file
mv multinode.new multinode

Verify the result:

grep -n "\[deployment\]" multinode          # should appear only once
diff multinode.2024.1 multinode | head -40  # sanity check
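
If you want to convince yourself the split logic behaves before touching the real files, the same pipeline can be exercised on a miniature inventory. This is a self-contained toy example; the group names are made up:

```shell
# Exercise the merge pipeline on toy inventories: hosts come from the
# "old" file, group mappings from the "new" one, joined at [deployment].
tmp=$(mktemp -d)
cat > "$tmp/old" <<'EOF'
[control]
control01
[deployment]
localhost ansible_connection=local
[old-group:children]
control
EOF
cat > "$tmp/new" <<'EOF'
[control]
# fill in your hosts
[deployment]
localhost ansible_connection=local
[new-group:children]
control
EOF
sed -n '1,/^\[deployment\]/{p; /^\[deployment\]/{n;p}}' "$tmp/old" > "$tmp/merged"
sed -n '/^\[deployment\]/,$ p' "$tmp/new" | tail -n +3 >> "$tmp/merged"
cat "$tmp/merged"
```

The merged file should contain control01 from the old inventory, a single [deployment] section, and the group mappings from the new template.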

Update and merge passwords.yml

cp /etc/kolla/passwords.yml passwords.yml.old
cp ./2025.1/share/kolla-ansible/etc_examples/kolla/passwords.yml passwords.yml.new
kolla-genpwd -p passwords.yml.new
kolla-mergepwd --old passwords.yml.old --new passwords.yml.new --final /etc/kolla/passwords.yml

Make sure you have enough disk space available before proceeding.

ansible -i multinode all -m shell -a 'df -h .' -b
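
If you prefer a hard threshold over eyeballing the output, a small filter can flag any filesystem above a given use percentage. A sketch that parses `df -h` style output:

```shell
# Read "df -h" style output on stdin and print any line whose Use%
# column exceeds the given threshold (default 80).
df_over_threshold() {
    limit="${1:-80}"
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 > limit) print $6 " is " $5 " full"
    }'
}

# Example:
# df -h / | df_over_threshold 80
```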

Now we first pull the new containers. Note that the kolla-ansible CLI syntax has changed: the subcommand now comes before the inventory option (kolla-ansible pull -i multinode instead of kolla-ansible -i multinode pull).

kolla-ansible pull -i multinode

Run Prechecks

kolla-ansible prechecks -i multinode

Upgrade

kolla-ansible upgrade -i multinode

After the upgrade has completed successfully, I recommend validating your setup again by re-running the tests.

Fix Magnum 2025.1 CAPI Bug

There is a bug in the 2025.1 Magnum packaging. You only need this fix if you are using the VEXXHOST CAPI driver.

pkg_resources.DistributionNotFound: The 'oslo.cache>=1.26.0' distribution was not found and is required by keystonemiddleware

Run the following patch; it will survive a container restart, but not an image update or a fresh deploy.

ansible control -i multinode -b -m shell -a '
CONTAINERS="magnum_api magnum_conductor"
FILE="/var/lib/kolla/venv/lib64/python3.9/site-packages/magnum_cluster_api/resources.py"

for CONTAINER in $CONTAINERS; do
  echo "=== Fixing $CONTAINER ==="
  podman exec --user root $CONTAINER python3 -c "
import re, importlib.metadata

filepath = \"$FILE\"
with open(filepath, \"r\") as f:
    content = f.read()
original = content

content = content.replace(
    \"import pkg_resources\",
    \"import importlib.metadata\nimport pkg_resources\"
)
content = re.sub(
    r\"CLUSTER_CLASS_VERSION = pkg_resources\.require\(\\\"magnum_cluster_api\\\"\)\[0\]\.version\",
    \"CLUSTER_CLASS_VERSION = importlib.metadata.version(\\\"magnum_cluster_api\\\")\",
    content
)

if content != original:
    with open(filepath, \"w\") as f:
        f.write(content)
    print(\"SUCCESS: patch applied\")
else:
    print(\"WARNING: nothing changed - already patched?\")

with open(filepath, \"r\") as f:
    for i, line in enumerate(f, 1):
        if any(x in line for x in [\"importlib\", \"pkg_resources\", \"CLUSTER_CLASS_VERSION\"]):
            print(f\"  Line {i}: {line.rstrip()}\")
"
  podman restart $CONTAINER
  echo "=== $CONTAINER done ==="
done
'
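
After the restart, you may want to confirm the patched call is actually gone from resources.py. A small grep helper (a sketch) makes that checkable per container:

```shell
# Return success if the given resources.py no longer calls
# pkg_resources.require for the cluster class version.
is_patched() {
    ! grep -q 'pkg_resources\.require("magnum_cluster_api")' "$1"
}

# Example: copy the file out of the container and check it, e.g.
# podman cp magnum_api:/var/lib/kolla/venv/lib64/python3.9/site-packages/magnum_cluster_api/resources.py /tmp/resources.py
# is_patched /tmp/resources.py && echo patched || echo NOT patched
```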

Lessons Learned

  • Always update within the same release first
  • Never trust backups without restore tests
  • RabbitMQ changes are more disruptive than they look
  • Test Magnum after upgrade if you rely on CAPI

Troubleshooting

In case you also ran the RabbitMQ migration while already upgraded to kolla-ansible 2025.1, you might run into the following errors:

msg": "No such container: nova_metadata to stop"}

fatal: [control01]: FAILED! => {"action": "mysql_db", "changed": false, "msg": "unable to find /root/.my.cnf. Exception message: (1045, "Access denied for user 'root_shard_0'@'control03.a.space.corp' (using password: YES)")"}

-> Either try running it again with kolla-ansible 2024.1, or simply create a dummy nova_metadata container that you delete afterwards, and set enable_proxysql: "no" in /etc/kolla/globals.yml. In my case this worked fine on my test cluster.

Sources

https://docs.openstack.org/kolla-ansible/2024.1/admin/mariadb-backup-and-restore.html
