Kolla-Ansible Cinder Containers unhealthy
I recently replaced all three control nodes in my OpenStack cluster, and once I replaced the last control node, three of the four Cinder containers suddenly became unhealthy. Looking at the logs of the cinder_api container revealed the following error:
2023-05-30 21:31:36.350636 Timeout when reading response headers from daemon process 'cinder-api': /var/www/cgi-bin/cinder/cinder-wsgi
2023-05-30 21:31:37.827101 mod_wsgi (pid=22): Failed to exec Python script file '/var/www/cgi-bin/cinder/cinder-wsgi'.
2023-05-30 21:31:37.827168 mod_wsgi (pid=22): Exception occurred processing WSGI script '/var/www/cgi-bin/cinder/cinder-wsgi'.
2023-05-30 21:31:37.828005 Traceback (most recent call last):
2023-05-30 21:31:37.828046   File "/var/www/cgi-bin/cinder/cinder-wsgi", line 52, in <module>
2023-05-30 21:31:37.828053     application = initialize_application()
2023-05-30 21:31:37.828058   File "/var/lib/kolla/venv/lib/python3.6/site-packages/cinder/wsgi/wsgi.py", line 44, in initialize_application
2023-05-30 21:31:37.828063     coordination.COORDINATOR.start()
2023-05-30 21:31:37.828068   File "/var/lib/kolla/venv/lib/python3.6/site-packages/cinder/coordination.py", line 86, in start
2023-05-30 21:31:37.828071     self.coordinator.start(start_heart=True)
2023-05-30 21:31:37.828075   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/coordination.py", line 689, in start
2023-05-30 21:31:37.828078     super(CoordinationDriverWithExecutor, self).start(start_heart)
2023-05-30 21:31:37.828083   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/coordination.py", line 426, in start
2023-05-30 21:31:37.828086     self._start()
2023-05-30 21:31:37.828090   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/drivers/etcd3gw.py", line 224, in _start
2023-05-30 21:31:37.828093     self._membership_lease = self.client.lease(self.membership_timeout)
2023-05-30 21:31:37.828098   File "/var/lib/kolla/venv/lib/python3.6/site-packages/etcd3gw/client.py", line 122, in lease
2023-05-30 21:31:37.828111     json={"TTL": ttl, "ID": 0})
2023-05-30 21:31:37.828116   File "/var/lib/kolla/venv/lib/python3.6/site-packages/etcd3gw/client.py", line 88, in post
2023-05-30 21:31:37.828123     resp.reason
2023-05-30 21:31:37.828154 etcd3gw.exceptions.ConnectionTimeoutError: Gateway Time-out
The error seemed very cryptic to me, and I tried many different things to figure out what was going on, without success. As a last resort I added one of the old control nodes back, and the containers started working again. Since I really had no clue what was causing this, I sent an email to the OpenStack mailing list and immediately got a reply. It is really a great community. :) The last line of the log actually gives a hint at what is going wrong:
2023-05-30 21:31:37.828154 etcd3gw.exceptions.ConnectionTimeoutError: Gateway Time-out
Cinder tries to connect to etcd and this fails. In my other OpenStack cluster Cinder works just fine without etcd, and I thought I had no etcd enabled here either. But then I remembered that I had enabled Zun, which also required enabling etcd. And then I got another hint from the mailing list: by default, Cinder will use etcd as its coordination backend if etcd is enabled in globals.yml:
# Valid options are [ '', redis, etcd ]
#cinder_coordination_backend: "{{ 'redis' if enable_redis|bool else 'etcd' if enable_etcd|bool else '' }}"
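In my deployment the relevant flags probably looked roughly like this (a sketch of my globals.yml, not the exact file), which is exactly the situation where the default expression above resolves to 'etcd':

# Zun needs etcd, so etcd was enabled; redis was not.
# With these settings, cinder_coordination_backend defaults to 'etcd'.
enable_zun: "yes"
enable_etcd: "yes"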
Then I also remembered that I had recently created a bug report, because I noticed that if you add control nodes after etcd has been enabled, the new nodes are not added to the etcd cluster. Since I had replaced all three control nodes, the etcd cluster was no longer working, and since Cinder uses etcd by default when it is enabled, Cinder failed too. The advice from the mailing list was to change cinder_coordination_backend to redis. I decided to disable etcd for now and investigate further at a later stage.
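For completeness, this is roughly what the two possible fixes look like in globals.yml; a sketch only, the values depend on your deployment:

# Option 1 (the mailing list advice): use redis for coordination instead of etcd.
enable_redis: "yes"
cinder_coordination_backend: "redis"

# Option 2 (what I did for now): disable etcd entirely.
# Keep in mind that Zun relies on etcd, so this has side effects.
enable_etcd: "no"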