Causes and Solutions for Backup Failures
This page explains the causes of and solutions for backup (replication) failures.
When a virtual machine backup fails or is skipped, you will receive a “Backup report” email from the Xen Orchestra server. The virtual machine itself keeps running when this happens. However, in the unlikely event that the Xen host fails and its storage cannot be recovered, there is a risk that the restore point will be lost.
Backups fail mainly for three reasons:
- The local storage does not have enough free space left.
- The virtual machine has too many snapshots.
- The virtual machine's virtual disk has grown too large, or is updated too heavily, for backups at RPO = 1 hour (two runs per hour) to keep up.
Below are the causes and solutions for each error message.
If you encounter an error that is not listed here, or if the problem is not resolved after you have taken the measures described (for example, the error returns soon afterwards), please contact our support or sales representative. For the third cause above, the fundamental solution is to reduce the number of backup targets or to lengthen the RPO; please contact us to arrange this.
About Backup Report
The report is sent from “sysadmin@justplayer.com” with the subject “[Xen Orchestra] failure(or skipped) − Backup report for backup job name”, where the status is either “failure” or “skipped” and the job name is that of your backup job. Please check the body of the email for information about the backup job and the error message.
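To filter these reports automatically, the subject line can be parsed. The following is a minimal sketch in Python, assuming the subject follows the layout quoted above; the function name `parse_report_subject` is hypothetical, and the separator is matched loosely because it may arrive as a hyphen, a minus sign, or an en dash.

```python
import re
from typing import Optional, Tuple

# Assumed layout, based on the subject quoted above:
#   [Xen Orchestra] <failure|skipped> - Backup report for <job name>
# The separator is accepted as "-", U+2212 (minus sign) or U+2013 (en dash).
SUBJECT_RE = re.compile(
    r"^\[Xen Orchestra\]\s+(?P<status>failure|skipped)\s*[-\u2212\u2013]\s*"
    r"Backup report for\s+(?P<job>.+)$",
    re.IGNORECASE,
)


def parse_report_subject(subject: str) -> Optional[Tuple[str, str]]:
    """Return (status, job_name), or None if this is not a failure/skipped report."""
    match = SUBJECT_RE.match(subject.strip())
    if match is None:
        return None
    return match.group("status").lower(), match.group("job").strip()
```

A mail filter could call this on each incoming subject and raise an alert only when the result is not `None`.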
How Much Storage Space Do I Need?
The storage capacity required by Xen depends not only on the storage capacity currently in use, but also on the data update differentials within the virtual machine. Therefore, it is not possible to determine in general terms how many gigabytes are needed.
This is because Xen snapshots are a type of snapshot system that requires a backing store.
If a large amount of data is written to storage between snapshots (as with a database or a system that produces many logs), the differential disks grow accordingly. When a snapshot is deleted, its differential disks must be merged. To protect against an unexpected system shutdown, the original data is left intact during the merge, which temporarily requires a large amount of free storage space. In Xen, the merge does not run immediately when the snapshot is deleted; the system runs it automatically later, so there is a delay before it completes.
Most issues arise due to two characteristics of the snapshot backing store merge process: it requires free space and it is performed lazily.
SR_BACKEND_FAILURE_44 is usually caused purely by insufficient storage capacity. Always keep some free space available in the host’s local storage.
SR_BACKEND_FAILURE_109 is also mostly related to the remaining disk space; for example, the delete operation that follows the merge process can fail when storage is insufficient.
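As an illustration of keeping free space in reserve, the check below flags a storage repository whose free ratio falls under a chosen threshold. This is only a sketch: the 20% default is a hypothetical value chosen for the example (as noted above, no general figure can be given), and the function names are not part of Xen or Xen Orchestra. The total and used sizes are assumed to come from whatever storage view you already use.

```python
def free_space_ratio(total_bytes: int, used_bytes: int) -> float:
    """Fraction of the storage that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - used_bytes) / total_bytes


def needs_attention(total_bytes: int, used_bytes: int,
                    min_free_ratio: float = 0.2) -> bool:
    """True when free space is below the threshold.

    min_free_ratio is a hypothetical example value, not a recommendation:
    the space a merge needs depends on how much data changed between snapshots.
    """
    return free_space_ratio(total_bytes, used_bytes) < min_free_ratio
```

For example, `needs_attention(1000, 900)` is true under the default threshold, while `needs_attention(1000, 500)` is not.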
You can save storage by deleting unnecessary virtual machines, stopping unnecessary backups, or evening out the overall load on the host. Before stopping backups, however, keep in mind the benefit this system provides: it shortens recovery work and recovery time (RTO) in the event of an incident. Depending on your contract, you may be able to expand your storage capacity. If you need to expand your storage (SSD), please contact our support or our sales representative.
Causes of Errors and How To Deal With Them
Error Statements |
---|
Error: the job (XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ) is already running *The string within () will be different for each user. |
Cause |
This occurs when a new backup job is skipped because a backup job is already running. Backups are delayed due to various factors, such as frequent disk updates. |
First Responders |
The user decides on a change to the RPO operating policy and contacts us |
Solution |
This means that a backup was skipped because the previous run was still in progress, typically because the amount of data to back up was large. If later backups finish normally, there is no need to worry. If this occurs frequently, the backup interval may be too short for the amount of disk updates, and it needs to be lengthened step by step to 1 hour, 2 hours, 3 hours, and so on. Note that this also lengthens the RPO, which means that recovery from a failure may rewind further back in time. This requires a setting change, so please contact our support or sales representative. |
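
The step-by-step lengthening described above can be sketched as follows. This is an illustrative calculation only, assuming you have noted how long recent backup runs took; `suggest_interval_hours` is a hypothetical helper, and the actual setting change is made through our support.

```python
import math
from typing import Sequence


def suggest_interval_hours(run_minutes: Sequence[float],
                           current_interval_hours: int = 1) -> int:
    """Smallest whole-hour interval that is strictly longer than the longest run.

    Never returns less than the current interval, so the suggestion only
    moves in the 1 hour -> 2 hours -> 3 hours direction.
    """
    if not run_minutes:
        return current_interval_hours
    longest = max(run_minutes)
    needed = math.floor(longest / 60) + 1  # strictly longer than the longest run
    return max(current_interval_hours, needed)
```

With runs of 45, 70 and 130 minutes the longest run is 130 minutes, so the suggestion is 3 hours; a longer interval also means a longer RPO, as noted above.
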
Error Statements |
---|
Failure Error: SR_BACKEND_FAILURE_44(, There is insufficient space, ) |
Cause |
The storage capacity is full. |
First Responders |
user |
Solution |
Basically, you need to delete some data to free up space. Delete unnecessary VMs. Delete unnecessary snapshots; for information on deleting unnecessary snapshots, please see here. Avoid backing up virtual machines that do not need it; to exclude a virtual machine from the backup, please check here. |
Error Statements |
---|
Failure Error: SR_BACKEND_FAILURE_109(, The snapshot chain is too long, ) |
Cause |
This occurs when the target virtual machine has many snapshots. A virtual machine can have at most 30 snapshots, including hidden ones. In addition, snapshot deletion takes time because it is processed with a delay, and the same error may be reported until it finishes. Xen Orchestra replication works by taking snapshots and transferring the differences, so this error may occur if the RPO is too short for the amount of updates to the virtual machine. If it occurs for several virtual machines, the RPO needs to be lengthened. |
First Responders |
user |
Solution |
Delete unnecessary snapshots from the snapshot list. Unnamed snapshots that Xen Orchestra created automatically during backup may still remain. For information on deleting unnecessary snapshots, please refer to here. When this error occurs, the RPO may be too short for the amount of updates to the virtual machine. If it occurs frequently, please contact our support or sales representative. |
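
To see how close a virtual machine is to the 30-snapshot limit mentioned above, a simple count is enough. The sketch below assumes you can obtain the number of visible and hidden snapshots from your own tooling; the function names and the `reserve` margin are hypothetical.

```python
MAX_SNAPSHOTS_PER_VM = 30  # limit stated above; hidden snapshots count too


def snapshot_headroom(visible: int, hidden: int = 0) -> int:
    """How many more snapshots can be created before the limit is reached."""
    return MAX_SNAPSHOTS_PER_VM - (visible + hidden)


def chain_too_long_risk(visible: int, hidden: int = 0, reserve: int = 2) -> bool:
    """True when fewer than `reserve` snapshots of headroom remain.

    `reserve` is a hypothetical safety margin: each backup run takes a
    snapshot, and deleted snapshots are only merged away after a delay.
    """
    return snapshot_headroom(visible, hidden) < reserve
```

For example, 25 visible and 4 hidden snapshots leave a headroom of 1, which is below the example margin.
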
Error Statements |
---|
Skipped Reason: (unhealthy VDI chain) Job canceled to protect the VDI chain |
Cause |
This occurs when the snapshots of the target virtual machine still need to be consolidated (merged). Consolidation is performed automatically, so you need to wait for a while. This may occur when a job runs immediately after a snapshot has been deleted. |
First Responders |
user |
Solution |
As described in the section on required storage capacity, Xen performs a delayed disk merging process after a snapshot is deleted. This error can also occur if the disk merging state is abnormal because a snapshot deletion went wrong or was stopped midway. Basically, if you wait a while, Xen performs the disk merge automatically and the problem resolves itself. If the error persists after a few days: on rare occasions, an internal error occurs and the merge cannot complete. If you end up in this state, the easiest way to recover is to clone the virtual machine and then delete the original. See here for information on cloning a virtual machine. |
Error Statements |
---|
could not find the base VM |
Cause |
This occurs when the previous backup point of the virtual machine cannot be found or is in an abnormal state. Each backup transfers only the differences from the previous backup, so if the previous backup cannot be found, the backup fails. |
First Responders |
user |
Solution |
From the snapshot list, delete the past backup point [XO Backup dp4-xenpool…]. The next backup of that virtual machine will then run as a full synchronization instead of a differential synchronization. As a result, the message “Error: the job (XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ) is already running” may appear for a while. This error may also occur when there are many snapshots. In that case, unnecessary snapshots need to be deleted. Please refer to here for information on deletion. |
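
The tables above can be condensed into a small lookup that maps the error text in a backup report to its cause and first action. This is a sketch for illustration: the `Guidance` type, the `RULES` list and the `classify` function are hypothetical names, the matching is a plain substring search, and the summaries are shortened from the tables on this page.

```python
from typing import List, NamedTuple, Optional, Tuple


class Guidance(NamedTuple):
    cause: str
    first_responder: str
    action: str


# (substring to look for, guidance), summarized from the tables above.
RULES: List[Tuple[str, Guidance]] = [
    ("is already running", Guidance(
        "A previous run of the job had not finished",
        "user, then support",
        "Ignore if later runs succeed; if frequent, ask to lengthen the RPO")),
    ("SR_BACKEND_FAILURE_44", Guidance(
        "Storage is full",
        "user",
        "Delete unnecessary VMs and snapshots; exclude VMs from backup")),
    ("SR_BACKEND_FAILURE_109", Guidance(
        "Snapshot chain is too long",
        "user",
        "Delete unnecessary snapshots; if frequent, ask to lengthen the RPO")),
    ("unhealthy VDI chain", Guidance(
        "Snapshot merge is still pending",
        "user",
        "Wait for the automatic merge; if it persists for days, clone the VM")),
    ("could not find the base VM", Guidance(
        "Previous backup point is missing or abnormal",
        "user",
        "Delete the past backup point to force a full synchronization")),
]


def classify(message: str) -> Optional[Guidance]:
    """Return guidance for the first rule whose text appears in the message."""
    lowered = message.lower()
    for needle, guidance in RULES:
        if needle.lower() in lowered:
            return guidance
    return None
```

A message that matches none of the rules returns `None`, which corresponds to the case where you should contact our support or sales representative.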