OACI VE backups fail for a particular node: timed backup_ve_cb out after 60000 MILLISECONDS
Modified on: Sat, 18 Nov 2023, 5:35 AM
Symptoms
One or several VE backups fail for a particular hardware node, and the following pattern appears in the /var/log/pa/vps.log
file on the OACI IM node:
ERROR ScheduledBackupTask [Shared executor thread #9 @1 @BACKGROUND] - Backup operation failed
java.lang.reflect.UndeclaredThrowableException: null
...
Caused by: com.parallels.c2u.jms.rpc.footprint.TimeoutException: timed backup_ve_cb out after 60000 MILLISECONDS
at com.parallels.c2u.vm2vf.corba.Vm2VfApiOperationsClient.backup_ve_cb(Vm2VfApiOperationsClient.java:1418) ~[im.jar:na]
... 35 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[na:1.8.0_112]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[na:1.8.0_112]
at com.parallels.c2u.vm2vf.corba.Vm2VfApiOperationsClient.backup_ve_cb(Vm2VfApiOperationsClient.java:1410) ~[im.jar:na]
... 35 common frames omitted
ERROR GenericBackupTask [Shared executor thread #9 @1 @BACKGROUND] - Backup of the VE [VeId[1012069.server-1031410-1]] aborted
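A quick way to confirm that a node is affected is to search the IM log for the timeout and abort messages shown above (a minimal sketch; the log path and message text come from the excerpt, and the tail length is arbitrary):
# grep -E 'timed backup_ve_cb out after|Backup of the VE .* aborted' /var/log/pa/vps.log | tail -n 20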
The /var/log/pa/vps.vm2vf.log file on the corresponding hardware node contains this pattern for the backup call:
INFO BackupCallbackProxy [Thread-1129375] - INF calling: cbp 0x7f38e8002f68 job 0xcb416 [collect_markers_pp] index [3951](backup_action, backup.c, 492) @[common/generic_sdk_cb.c][706][reg_next_generic_sdk_cb][137750])
INFO BackupCallbackProxy [Thread-1129376] - INF creating: cbp 0x7f38e8002f68 job (nil) [collect_markers_pp] index [3951] (backup_action, backup.c, 492) @[common/generic_sdk_cb.c][714][reg_next_generic_sdk_cb][137750])
INFO BackupCallbackProxy [Thread-1129377] - INF pthread_cond_timedwait [110]: cbp 0x7f38e8002f68 job (nil) [__PrlSrv_GetCommonPrefs_cbp] index [3951] (backup_action, backup.c, 492) @[common/generic_sdk_cb_tools.c][151][cbps_timeout_sleep][755003])
ERROR BackupCallbackProxy [Thread-1129378] - ERR PrlJob_GetStatus: job (nil) failure -2147483645 @[common/generic_sdk_cb_tools.c][105][cbps_job_check_status_ex][755003])
This part is important:
ERR PrlJob_GetStatus: job (nil) failure -2147483645
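On the hardware node, the same condition can be confirmed by searching the vm2vf log for failed PrlJob_GetStatus calls (a sketch based on the excerpt above; adjust the path if the log has been rotated):
# grep 'ERR PrlJob_GetStatus' /var/log/pa/vps.vm2vf.log | tail -n 20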
Several consecutive backups may fail in the same way, after which the vm2vf service stops responding. All further calls to the node then end with a fast-failure error message:
fast failure for backup_ve_cb after 240000 MILLISECONDS
Studying vps.vm2vf.log carefully for the actions preceding the error, one may find activity messages from two different backup calls running under the same thread UUID, for example:
(f33d81ee-4f06-4ae9-b5ca-639f6b9a650e backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59526] - XML: <Envelope xmlns=""><correlationId>CALLBACK_3834f137-8299-4307-a2fa-af15aafc9a6b</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 0
}</payload></Envelope> to destination queue://client
(f33d81ee-4f06-4ae9-b5ca-639f6b9a650e backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59520] - XML: <Envelope xmlns=""><correlationId>CALLBACK_3834f137-8299-4307-a2fa-af15aafc9a6b</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 42
}</payload></Envelope> to destination queue://client
(c0f169ae-cc55-459b-992a-0fd9828dc58d backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59523] - XML: <Envelope xmlns=""><correlationId>CALLBACK_e831f0cc-d5a4-4cf5-9daa-14a8fd87a3b9</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 42
}</payload></Envelope> to destination queue://client
(c0f169ae-cc55-459b-992a-0fd9828dc58d backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59529] - XML: <Envelope xmlns=""><correlationId>CALLBACK_e831f0cc-d5a4-4cf5-9daa-14a8fd87a3b9</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 0
}</payload></Envelope> to destination queue://client
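To spot such interleaving, one can count how many distinct callback UUIDs appear among the recent backup_ve_cb callback messages; more than one UUID active at the same time matches the pattern above (a sketch; the awk field assumes the UUID is the first token of the line, as in the excerpt):
# grep 'backup_ve_cb callback' /var/log/pa/vps.vm2vf.log | tail -n 200 | awk '{print $1}' | sort | uniq -c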
Under heavy load, the vm2vf service may become completely unresponsive and stop accepting new requests.
Cause
The issue is addressed in the scope of CCU-18107: Deadlock on vm2vf under a memory-intensive operation (PRL_ERR_INVALID_ARG)
Resolution
To release the vm2vf service from a hung state, restart it:
# service PACI-vm2vf restart
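Once restarted, check that the service reports a running state before re-running the failed backups (assuming the init script supports the standard status action):
# service PACI-vm2vf status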
It is highly advised to restart the Virtuozzo services as well, since the issue originates from the Virtuozzo backend:
For VZ 6.0:
# service parallels-server restart
For VZ 7.0:
# systemctl restart prl-disp
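After the restart, the dispatcher state can be verified in the same way (assuming the standard status actions are available):
For VZ 6.0:
# service parallels-server status
For VZ 7.0:
# systemctl status prl-disp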