Kolla-Ansible Cinder Containers unhealthy
I recently replaced all three control nodes in my OpenStack cluster, and once I replaced the last control node, three of the four Cinder containers suddenly became unhealthy. Looking at the logs of the cinder_api container revealed the following error:
2023-05-30 21:31:36.350636 Timeout when reading response headers from daemon process 'cinder-api': /var/www/cgi-bin/cinder/cinder-wsgi
2023-05-30 21:31:37.827101 mod_wsgi (pid=22): Failed to exec Python script file '/var/www/cgi-bin/cinder/cinder-wsgi'.
2023-05-30 21:31:37.827168 mod_wsgi (pid=22): Exception occurred processing WSGI script '/var/www/cgi-bin/cinder/cinder-wsgi'.
2023-05-30 21:31:37.828005 Traceback (most recent call last):
2023-05-30 21:31:37.828046   File "/var/www/cgi-bin/cinder/cinder-wsgi", line 52, in <module>
2023-05-30 21:31:37.828053     application = initialize_application()
2023-05-30 21:31:37.828058   File "/var/lib/kolla/venv/lib/python3.6/site-packages/cinder/wsgi/wsgi.py", line 44, in initialize_application
2023-05-30 21:31:37.828063     coordination.COORDINATOR.start()
2023-05-30 21:31:37.828068   File "/var/lib/kolla/venv/lib/python3.6/site-packages/cinder/coordination.py", line 86, in start
2023-05-30 21:31:37.828071     self.coordinator.start(start_heart=True)
2023-05-30 21:31:37.828075   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/coordination.py", line 689, in start
2023-05-30 21:31:37.828078     super(CoordinationDriverWithExecutor, self).start(start_heart)
2023-05-30 21:31:37.828083   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/coordination.py", line 426, in start
2023-05-30 21:31:37.828086     self._start()
2023-05-30 21:31:37.828090   File "/var/lib/kolla/venv/lib/python3.6/site-packages/tooz/drivers/etcd3gw.py", line 224, in _start
2023-05-30 21:31:37.828093     self._membership_lease = self.client.lease(self.membership_timeout)
2023-05-30 21:31:37.828098   File "/var/lib/kolla/venv/lib/python3.6/site-packages/etcd3gw/client.py", line 122, in lease
2023-05-30 21:31:37.828111     json={"TTL": ttl, "ID": 0})
2023-05-30 21:31:37.828116   File "/var/lib/kolla/venv/lib/python3.6/site-packages/etcd3gw/client.py", line 88, in post
2023-05-30 21:31:37.828123     resp.reason
2023-05-30 21:31:37.828154 etcd3gw.exceptions.ConnectionTimeoutError: Gateway Time-out
The error seemed very cryptic to me, and I tried many different things to figure out what was going on, without success. As a last resort I added one of the old control nodes back, and the containers started working again. Since I really had no clue what was causing this, I sent an email to the OpenStack mailing list and immediately got a reply. It is really a great community. :) The last line of the log actually gives a hint at what is going wrong:
2023-05-30 21:31:37.828154 etcd3gw.exceptions.ConnectionTimeoutError: Gateway Time-out
Cinder tries to connect to etcd and this fails. In my other OpenStack cluster Cinder works just fine without etcd, and I thought I had no etcd enabled here either. But then I remembered that I had enabled Zun, which also required enabling etcd. And then I got another hint from the mailing list: by default, Cinder will use etcd as its coordination backend if etcd is enabled in globals.yml:
# Valid options are [ '', redis, etcd ]
#cinder_coordination_backend: "{{ 'redis' if enable_redis|bool else 'etcd' if enable_etcd|bool else '' }}"
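In my deployment the relevant flags probably looked roughly like this (a sketch of my globals.yml, not the exact file), which is exactly the situation where the default expression above resolves to 'etcd':

# Zun needs etcd, so etcd was enabled; redis was not.
# With these settings, cinder_coordination_backend defaults to 'etcd'.
enable_zun: "yes"
enable_etcd: "yes"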
Then I also remembered that I had recently created a bug report, because I noticed that if you add control nodes after etcd has been enabled, the new nodes are not added to the etcd cluster. Since I had replaced all three control nodes, the etcd cluster was no longer working, and since Cinder uses etcd by default when it is enabled, Cinder failed too. The advice from the mailing list was to change cinder_coordination_backend to redis. I decided to disable etcd for now and investigate further at a later stage.
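For completeness, this is roughly what the two possible fixes look like in globals.yml; a sketch only, the values depend on your deployment:

# Option 1 (the mailing list advice): use redis for coordination instead of etcd.
enable_redis: "yes"
cinder_coordination_backend: "redis"

# Option 2 (what I did for now): disable etcd entirely.
# Keep in mind that Zun relies on etcd, so this has side effects.
enable_etcd: "no"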