One of the nicest tools available for vCD lifecycle management is VMware Cloud Provider Lifecycle Manager (VCPLCM). We successfully upgraded our POC environment from 10.3.3 to 10.4.1.1 with it, which gave us confidence to upgrade the production environment using VCPLCM 1.5. Unfortunately, the upgrade failed in production, and I want to share the experience and the solution.
The tool makes life easier by running all the prerequisite checks, taking snapshots, and backing up the DB before the upgrade. In production it failed at the "Upgrade_DB" stage with the task output below in VCPLCM. In the vCD logs we could see it timing out with a Postgres IO timeout error. The logs in VCPLCM and vCD are not very informative, and the upgrade failure triggers an automatic rollback, which makes it hard to dig further.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,811 - DEBUG - Starting with VCD upgrade.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,812 - DEBUG - Apply VCD database upgrade on primary node neo0020001.
["2023-05-19 13:13:43,812Z"] 2023-05-19 13:13:43,812 - DEBUG - Giving some time to DB service to be fully started on primary node neo0020001.
["2023-05-19 13:14:05,973Z"] 2023-05-19 13:14:05,973 - DEBUG - Service vpostgres.service is running on node neo0020001.
["2023-05-19 13:16:43,771Z"] 2023-05-19 13:16:43,770 - WARNING - ERROR: Program 31524 completed with Failure
["2023-05-19 13:16:43,771Z"] 2023-05-19 13:16:43,770 - DEBUG - guest_processes: (vim.vm.guest.ProcessManager.ProcessInfo) [
["2023-05-19 13:16:43,771Z"] (vim.vm.guest.ProcessManager.ProcessInfo) {
["2023-05-19 13:16:43,771Z"] dynamicType = <unset>,
["2023-05-19 13:16:43,771Z"] dynamicProperty = (vmodl.DynamicProperty) [],
["2023-05-19 13:16:43,771Z"] name = 'bash',
["2023-05-19 13:16:43,771Z"] pid = 31524,
["2023-05-19 13:16:43,771Z"] owner = 'root',
["2023-05-19 13:16:43,771Z"] cmdLine = '"/bin/bash" -c "/usr/bin/yes | /opt/vmware/vcloud-director/bin/upgrade" > /tmp/vcplcm_command_result_0844932a-f647-11ed-9a88-0050569ea70c.txt',
["2023-05-19 13:16:43,771Z"] startTime = 2023-05-19T13:14:06Z,
["2023-05-19 13:16:43,771Z"] endTime = 2023-05-19T13:16:39Z,
["2023-05-19 13:16:43,771Z"] exitCode = 106
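When the log is long, it can help to pull out just the guest commands that exited with a non-zero code. The following is a minimal sketch that assumes the `cmdLine = '...'` and `exitCode = N` layout shown in the excerpt above; `failed_commands` is a hypothetical helper, not part of VCPLCM.

```python
import re

# Patterns based on the ProcessInfo dump shown in the log excerpt (assumed layout).
CMD_LINE = re.compile(r"cmdLine = '(.*)',\s*$")
EXIT_CODE = re.compile(r"exitCode = (-?\d+)")


def failed_commands(lines):
    """Return (cmdLine, exitCode) pairs for every non-zero exit code found."""
    results = []
    cmd = None
    for line in lines:
        match = CMD_LINE.search(line)
        if match:
            # Remember the most recent command until its exit code appears.
            cmd = match.group(1)
            continue
        match = EXIT_CODE.search(line)
        if match:
            code = int(match.group(1))
            if code != 0:
                results.append((cmd, code))
            cmd = None
    return results
```

Feeding it the lines of the VCPLCM log (for example, from `open(path)`) would list the failing command next to its exit code, such as the `upgrade` call that returned 106 here.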
The beauty of VCPLCM is that we can disable automatic rollback during the upgrade. This lets us troubleshoot the failed state, or restore to the last checkpoint/snapshot and retrigger the upgrade from there. If the upgrade still fails, we can perform a manual rollback.
The problem we saw was that the upgrade to VCD 10.4.1.1 moved on to the other DB nodes too quickly, before the first node had finished its database updates. VCPLCM has a procedure that waits for this to complete, but it did not detect that the update was still running. As a workaround, we added a 5-minute sleep (300 seconds) after the upgrade is installed on a node. With that, the upgrade waited long enough and completed successfully.
The file we need to update in VCPLCM is:
/opt/vmware/cplcm/scripts/python/vcplcm/plugin/vcd/upgrade/vcd_10_1_0_upgrade.py
The following line was added to it:
time.sleep(300)
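The fixed sleep was enough for us, but a bounded polling loop is a common alternative when you can tell whether the work is finished. The sketch below is not VCPLCM code: `wait_until` and the `check` callable are hypothetical names, and what `check` should test (for example, whether the database upgrade is still running on the node) is left as an assumption for the reader to fill in.

```python
import time


def wait_until(check, timeout=600.0, interval=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds have passed.

    Returns True if check() succeeded, False if the deadline was reached.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Compared with `time.sleep(300)`, this returns as soon as the condition holds and still gives up after a known upper bound, so a short database update does not cost the full five minutes.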