
VCD Upgrade failure via VCPLCM


One of the nicest tools available for vCD lifecycle management is VMware Cloud Provider Lifecycle Manager (VCPLCM). We successfully upgraded our POC environment from 10.3.3 to 10.4.1.1, which gave us confidence to upgrade the production environment using VCPLCM 1.5. Unfortunately, the upgrade failed in production, and I want to share the experience and the solution.

The tool makes life easier: it runs all the prerequisite checks, takes snapshots, and backs up the database before starting the upgrade. In our case, the upgrade failed at the "Upgrade_DB" stage with the task output below in VCPLCM. In the vCD logs we could see it timing out with a Postgres IO timeout error. The logs in VCPLCM and vCD are not very informative, and a failed upgrade results in an automatic rollback, which makes it hard to dig further.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,811 - DEBUG - Starting with VCD upgrade.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,812 - DEBUG - Apply VCD database upgrade on primary node neo0020001.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,812 - DEBUG - Giving some time to DB service to be fully started on primary node neo0020001.
["2023-05-19 13:14:05,973Z"] 2023-05-19 13:14:05,973 - DEBUG - Service vpostgres.service is running on node neo0020001.
["2023-05-19 13:16:43,771Z"] 2023-05-19 13:16:43,770 - WARNING - ERROR: Program 31524 completed with Failure
["2023-05-19 13:16:43,771Z"] 2023-05-19 13:16:43,770 - DEBUG - guest_processes: (vim.vm.guest.ProcessManager.ProcessInfo) [
["2023-05-19 13:16:43,771Z"] (vim.vm.guest.ProcessManager.ProcessInfo) {
["2023-05-19 13:16:43,771Z"] dynamicType = <unset>,
["2023-05-19 13:16:43,771Z"] dynamicProperty = (vmodl.DynamicProperty) [],
["2023-05-19 13:16:43,771Z"] name = 'bash',
["2023-05-19 13:16:43,771Z"] pid = 31524,
["2023-05-19 13:16:43,771Z"] owner = 'root',
["2023-05-19 13:16:43,771Z"] cmdLine = '"/bin/bash" -c "/usr/bin/yes | /opt/vmware/vcloud-director/bin/upgrade" > /tmp/vcplcm_command_result_0844932a-f647-11ed-9a88-0050569ea70c.txt',
["2023-05-19 13:16:43,771Z"] startTime = 2023-05-19T13:14:06Z,
["2023-05-19 13:16:43,771Z"] endTime = 2023-05-19T13:16:39Z,
["2023-05-19 13:16:43,771Z"] exitCode = 106
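
When a VCPLCM log is long, it can help to pull out just the failing guest command and its exit code. The following is a minimal sketch that assumes the log lines look like the excerpt above; the function name `find_failed_processes` and the regular expressions are my own and are not part of VCPLCM.

```python
import re

# Patterns derived from the log excerpt above. Other VCPLCM versions
# may format these lines differently (assumption).
CMDLINE_RE = re.compile(r"cmdLine = '(?P<cmd>.*)',\s*$")
EXITCODE_RE = re.compile(r"exitCode = (?P<code>-?\d+)")


def find_failed_processes(lines):
    """Return (cmdLine, exitCode) pairs for guest processes with a non-zero exit code."""
    failures = []
    current_cmd = None
    for line in lines:
        cmd_match = CMDLINE_RE.search(line)
        if cmd_match:
            # Remember the command until we see the matching exitCode line.
            current_cmd = cmd_match.group("cmd")
            continue
        code_match = EXITCODE_RE.search(line)
        if code_match:
            code = int(code_match.group("code"))
            if code != 0:
                failures.append((current_cmd, code))
            current_cmd = None
    return failures


# Two lines copied from the excerpt above, used as sample input.
sample = [
    '["2023-05-19 13:16:43,771Z"] cmdLine = \'"/bin/bash" -c "/usr/bin/yes | /opt/vmware/vcloud-director/bin/upgrade" > /tmp/vcplcm_command_result_0844932a-f647-11ed-9a88-0050569ea70c.txt\',',
    '["2023-05-19 13:16:43,771Z"] exitCode = 106',
]

for cmd, code in find_failed_processes(sample):
    print(f"exit code {code}: {cmd}")
```

In our case this points straight at the `/opt/vmware/vcloud-director/bin/upgrade` call, which is the step to investigate on the primary node.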

The beauty of VCPLCM is that we can disable automatic rollback during the upgrade. This lets us troubleshoot, or restore to the last checkpoint/snapshot and retrigger the upgrade from there. If the upgrade still fails, we can perform a manual rollback.

The problem we saw was that the upgrade to VCD 10.4.1.1 moved on to the other DB nodes too quickly, before the first node had finished its database updates. VCPLCM has a procedure that waits for this to complete, but it did not detect that the update was still running. As a workaround, we added a 5-minute sleep (300 seconds) after the upgrade is installed on one node. With this delay, the upgrade waited long enough and completed successfully.
The file we need to update in VCPLCM is:
 /opt/vmware/cplcm/scripts/python/vcplcm/plugin/vcd/upgrade/vcd_10_1_0_upgrade.py
The following line was added to it:
time.sleep(300)
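
The fixed sleep is the workaround we actually used. Purely as an illustration, here is the same delay as a helper, next to a polling variant that stops waiting as soon as a check passes. This is a sketch under assumptions: `wait_fixed`, `wait_until`, and the `is_done` callback are hypothetical names and are not part of the VCPLCM code.

```python
import time


def wait_fixed(seconds=300):
    """Fixed delay, equivalent to the time.sleep(300) workaround."""
    time.sleep(seconds)


def wait_until(is_done, timeout=300, interval=10):
    """Poll is_done() until it returns True or `timeout` seconds have passed.

    Returns True if the check passed, False if the timeout was reached.
    `is_done` is a hypothetical callback, e.g. a check that the database
    upgrade on the first node has finished.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A fixed sleep is simpler to patch into an existing script, which is why we chose it. A polling wait only pays off when a reliable completion check is available, and that was exactly what did not work for us here.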


 
