OACI VE backups fail for a particular node: timed backup_ve_cb out after 60000 MILLISECONDS
Modified on: Sat, 18 Nov 2023, 5:35 AM
Symptoms
One or several VE backups fail for a particular hardware node, and the following pattern appears in the /var/log/pa/vps.log
file on the OACI IM node:
ERROR ScheduledBackupTask [Shared executor thread #9 @1 @BACKGROUND] - Backup operation failed
java.lang.reflect.UndeclaredThrowableException: null
...
Caused by: com.parallels.c2u.jms.rpc.footprint.TimeoutException: timed backup_ve_cb out after 60000 MILLISECONDS
at com.parallels.c2u.vm2vf.corba.Vm2VfApiOperationsClient.backup_ve_cb(Vm2VfApiOperationsClient.java:1418) ~[im.jar:na]
... 35 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[na:1.8.0_112]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[na:1.8.0_112]
at com.parallels.c2u.vm2vf.corba.Vm2VfApiOperationsClient.backup_ve_cb(Vm2VfApiOperationsClient.java:1410) ~[im.jar:na]
... 35 common frames omitted
ERROR GenericBackupTask [Shared executor thread #9 @1 @BACKGROUND] - Backup of the VE [VeId[1012069.server-1031410-1]] aborted
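A quick way to confirm that a node is affected is to search the IM log for the timeout and abort messages shown above (a minimal sketch; the log path and message text come from the excerpt, and the tail length is arbitrary):
# grep -E 'timed backup_ve_cb out after|Backup of the VE .* aborted' /var/log/pa/vps.log | tail -n 20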
The /var/log/pa/vps.vm2vf.log file on the corresponding hardware node contains this pattern for the backup call:
INFO BackupCallbackProxy [Thread-1129375] - INF calling: cbp 0x7f38e8002f68 job 0xcb416 [collect_markers_pp] index [3951](backup_action, backup.c, 492) @[common/generic_sdk_cb.c][706][reg_next_generic_sdk_cb][137750])
INFO BackupCallbackProxy [Thread-1129376] - INF creating: cbp 0x7f38e8002f68 job (nil) [collect_markers_pp] index [3951] (backup_action, backup.c, 492) @[common/generic_sdk_cb.c][714][reg_next_generic_sdk_cb][137750])
INFO BackupCallbackProxy [Thread-1129377] - INF pthread_cond_timedwait [110]: cbp 0x7f38e8002f68 job (nil) [__PrlSrv_GetCommonPrefs_cbp] index [3951] (backup_action, backup.c, 492) @[common/generic_sdk_cb_tools.c][151][cbps_timeout_sleep][755003])
ERROR BackupCallbackProxy [Thread-1129378] - ERR PrlJob_GetStatus: job (nil) failure -2147483645 @[common/generic_sdk_cb_tools.c][105][cbps_job_check_status_ex][755003])
This part is important:
ERR PrlJob_GetStatus: job (nil) failure -2147483645
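On the hardware node, the same condition can be confirmed by searching the vm2vf log for failed PrlJob_GetStatus calls (a sketch based on the excerpt above; adjust the path if the log has been rotated):
# grep 'ERR PrlJob_GetStatus' /var/log/pa/vps.vm2vf.log | tail -n 20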
Several consecutive backups may fail in the same way, after which the vm2vf service stops responding. All further calls to the node then end with a fast-failure error message:
fast failure for backup_ve_cb after 240000 MILLISECONDS
Studying vps.vm2vf.log carefully for the actions preceding the error, one may find activity messages from two different backup calls running under the same thread UUID, for example:
(f33d81ee-4f06-4ae9-b5ca-639f6b9a650e backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59526] - XML: <Envelope xmlns=""><correlationId>CALLBACK_3834f137-8299-4307-a2fa-af15aafc9a6b</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 0
}</payload></Envelope> to destination queue://client
(f33d81ee-4f06-4ae9-b5ca-639f6b9a650e backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59520] - XML: <Envelope xmlns=""><correlationId>CALLBACK_3834f137-8299-4307-a2fa-af15aafc9a6b</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 42
}</payload></Envelope> to destination queue://client
(c0f169ae-cc55-459b-992a-0fd9828dc58d backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59523] - XML: <Envelope xmlns=""><correlationId>CALLBACK_e831f0cc-d5a4-4cf5-9daa-14a8fd87a3b9</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 42
}</payload></Envelope> to destination queue://client
(c0f169ae-cc55-459b-992a-0fd9828dc58d backup_ve_cb callback) INFO BackupCallbackMediator637 [Thread-59529] - XML: <Envelope xmlns=""><correlationId>CALLBACK_e831f0cc-d5a4-4cf5-9daa-14a8fd87a3b9</correlationId><kind>659</kind><payload>// callback backup_ve_cb in_progress
{
"msg" : "PET_DSP_EVT_BACKUP_PROGRESS_CHANGED",
"percent" : 0
}</payload></Envelope> to destination queue://client
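To spot such interleaving, one can count how many distinct callback UUIDs appear among the recent backup_ve_cb callback messages; more than one UUID active at the same time matches the pattern above (a sketch; the awk field assumes the UUID is the first token of the line, as in the excerpt):
# grep 'backup_ve_cb callback' /var/log/pa/vps.vm2vf.log | tail -n 200 | awk '{print $1}' | sort | uniq -c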
Under heavy load, the vm2vf service may become completely unresponsive and stop accepting new requests.
Cause
The issue is addressed in the scope of CCU-18107: Deadlock on vm2vf under a memory-intensive operation (PRL_ERR_INVALID_ARG)
Resolution
To release the vm2vf service from a hung state, restart it:
# service PACI-vm2vf restart
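Once restarted, check that the service reports a running state before re-running the failed backups (assuming the init script supports the standard status action):
# service PACI-vm2vf status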
It is highly advised to restart the Virtuozzo services as well, since the issue originates from the Virtuozzo backend:
For VZ 6.0:
# service parallels-server restart
For VZ 7.0:
# systemctl restart prl-disp
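After the restart, the dispatcher state can be verified in the same way (assuming the standard status actions are available):
For VZ 6.0:
# service parallels-server status
For VZ 7.0:
# systemctl status prl-disp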