Cancel Remove Snapshot Task
* Disclaimer: I'm not sure this is supported by VMware. It requires a deep knowledge of how VMware snapshots work, and the behavior can vary between versions.
I had a customer contact me with a problem. They had shut down a virtual machine because it had a huge snapshot (~1.7 TB), and afterwards they had initiated the remove snapshot task. But after 2 hours it was only at 10%. A quick calculation says this would take 20+ hours, and they could not wait that long to power the machine on again, so we had to do something. But you can't cancel the task from the vSphere Client.
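As a quick sanity check of that estimate (assuming the progress is roughly linear), you can extrapolate directly in the ESXi shell:
~ # echo $(( 120 * 100 / 10 ))   # 120 minutes elapsed at 10% done = estimated total minutes
1200
1200 minutes is 20 hours, which matches the rough calculation above.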
The first thing was to find the host where the machine was located and which was doing the snapshot removal. I logged in to that ESXi host over SSH and ran the following command:
~ # vim-cmd vimsvc/task_list
(ManagedObjectReference) [
'vim.Task:haTask-311-vim.vm.Snapshot.remove-268537832',
'vim.Task:haTask-ha-host-vim.HostSystem.acquireCimServicesTicket-268543629'
]
The task vim.Task:haTask-311-vim.vm.Snapshot.remove-268537832 indicated that this host was removing a snapshot (if a Delete All is running, the task name indicates that instead). Afterwards I wanted to be sure it was the correct machine, by running this:
~ # vim-cmd vimsvc/task_info haTask-311-vim.vm.Snapshot.remove-268537832
(vim.TaskInfo) {
dynamicType = <unset>,
key = "haTask-311-vim.vm.Snapshot.remove-268537832",
task = 'vim.Task:haTask-311-vim.vm.Snapshot.remove-268537832',
description = (vmodl.LocalizableMessage) null,
name = "vim.vm.Snapshot.remove",
descriptionId = "vm.Snapshot.remove",
entity = 'vim.VirtualMachine:311',
entityName = "SERVER01",
state = "running",
cancelled = false,
cancelable = true,
error = (vmodl.MethodFault) null,
result = <unset>,
progress = 10,
reason = (vim.TaskReasonUser) {
dynamicType = <unset>,
userName = "vpxuser",
},
queueTime = "2014-12-17T18:10:17.285736Z",
startTime = "2014-12-17T18:10:17.286068Z",
completeTime = <unset>,
eventChainId = 268537832,
changeTag = <unset>,
parentTaskKey = <unset>,
rootTaskKey = <unset>,
}
Checking the information in the above output, I noted some of the important details: the machine name, the progress, the start time, and that the task is cancelable.
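If you want to cross-check that the entity ID in the task key really belongs to the machine you think it does, the VM inventory on the host can be listed as well:
~ # vim-cmd vmsvc/getallvms | grep SERVER01   # the first column (Vmid) should match the entity ID, here 311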
Then I issued a Cancel Task with this command:
~ # vim-cmd vimsvc/task_cancel haTask-311-vim.vm.Snapshot.remove-268537832
After a few seconds the task was canceled.
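To verify the cancellation before moving on, the same task_info command can be run again (the exact output can vary between ESXi versions):
~ # vim-cmd vimsvc/task_info haTask-311-vim.vm.Snapshot.remove-268537832
In the output, cancelled should now be true and the task should no longer be in the running state; it will also disappear from vimsvc/task_list once it is finished.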
Afterwards we tried to power on the machine, but it returned the following error in the vSphere Client:
An error was received from the ESX host while powering on VM SERVER01.
Failed to start the virtual machine.
Module DiskEarly power on failed.
Cannot open the disk 'SERVER01-000001.vmdk' or one of the snapshot disks it depends on.
The system cannot find the file specified
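If you are not sure which datastore and folder the VM lives in, the VM summary will tell you (vmPathName points at the VMX file, and the disks normally sit in the same folder):
~ # vim-cmd vmsvc/get.summary 311 | grep vmPathName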
To investigate this, I looked at the VMDK files for the VM by running the following command:
~ # ls -lh /vmfs/volumes/Datastore1/SERVER01/*.vmdk
-rw------- 1 root root 40.0G Dec 17 18:12 /vmfs/volumes/Datastore1/SERVER01/SERVER01-flat.vmdk
-rw------- 1 root root 501 Dec 17 2013 /vmfs/volumes/Datastore1/SERVER01/SERVER01.vmdk
-rw------- 1 root root 1000.0G Dec 17 20:28 /vmfs/volumes/Datastore1/SERVER01/SERVER01_1-flat.vmdk
-rw------- 1 root root 494 Dec 17 2013 /vmfs/volumes/Datastore1/SERVER01/SERVER01_1.vmdk
-rw------- 1 root root 1000.0G Dec 17 20:28 /vmfs/volumes/Datastore1/SERVER01/SERVER01_1-000001-delta.vmdk
-rw------- 1 root root 494 Nov 22 11:54 /vmfs/volumes/Datastore1/SERVER01/SERVER01_1-000001.vmdk
-rw------- 1 root root 522.1G Dec 17 18:01 /vmfs/volumes/Datastore1/SERVER01/SERVER01_2-flat.vmdk
-rw------- 1 root root 494 Dec 17 2013 /vmfs/volumes/Datastore1/SERVER01/SERVER01_2.vmdk
-rw------- 1 root root 617.8G Dec 17 18:01 /vmfs/volumes/Datastore1/SERVER01/SERVER01_2-000001-delta.vmdk
-rw------- 1 root root 494 Nov 22 11:54 /vmfs/volumes/Datastore1/SERVER01/SERVER01_2-000001.vmdk
-rw------- 1 root root 1000.0G Dec 17 18:01 /vmfs/volumes/Datastore1/SERVER01/SERVER01_3-flat.vmdk
-rw------- 1 root root 494 Dec 17 2013 /vmfs/volumes/Datastore1/SERVER01/SERVER01_3.vmdk
-rw------- 1 root root 567.3G Dec 17 18:01 /vmfs/volumes/Datastore1/SERVER01/SERVER01_3-000001-delta.vmdk
-rw------- 1 root root 494 Nov 22 11:54 /vmfs/volumes/Datastore1/SERVER01/SERVER01_3-000001.vmdk
The file that the error says is missing, SERVER01-000001.vmdk, is indeed not there. But I could see that SERVER01-flat.vmdk was last updated at Dec 17 18:12, just after the snapshot removal was started at 18:10, while the larger disks were still being consolidated (SERVER01_1-flat.vmdk was last updated at Dec 17 20:28). This means the first disk was already fully consolidated and its delta removed, but the VMX file did not reflect this, so I just edited one line in the VMX file:
Before (part of the VMX file):
scsi0:0.fileName = "SERVER01-000001.vmdk" scsi0:0.mode = "persistent"
After (part of the VMX file):
scsi0:0.fileName = "SERVER01.vmdk" scsi0:0.mode = "persistent"
Afterwards I could power on the machine without any problems, and then initiate a Consolidate Snapshots to remove the remaining snapshot data while the machine was running.
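The snapshot state can also be checked from the shell before or after the consolidation:
~ # vim-cmd vmsvc/snapshot.get 311
In a situation like this it may already report no snapshots even though delta files are still on the datastore, which is exactly the state the Consolidate action cleans up.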
This procedure can be different if there are multiple snapshots on the virtual machine.
* I would strongly advise against keeping a snapshot for more than 1-3 days, or letting it grow to this size.
Dear Allan,
Thank You for the useful article.
One thing I didn’t quite get is: What is happening behind the scenes:
A snapshot is created, all the changes are written in the delta, and You start a consolidation job:
Delta -> Base disk merge. When You cancel the task, You basically leave the new disk in “two” pieces: One for the new data, and one for the unchanged data.
With this scenario, how does VMware know which data must be read/written to which VMDK? (As far as I understand the snapshot is already gone at this point, so no real metadata is present for the hypervisor to know the “mapping”).
Thank You for the Answer.
This should not be a problem, since the delta disks do not get deleted before the consolidation of that disk finishes, and the consolidate helper snapshot and the data (delta) snapshots contain the metadata for where to read the correct blocks. The VMX file is also not changed before the end of the job.
But you need to know what you are doing before trying this, or you can end up with corrupt disks/VMDK files.
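If you want to see that metadata yourself, the small descriptor file of one of the remaining deltas shows the link to its parent disk (using the first data disk from the listing above as an example):
~ # grep -E 'CID|parentFileNameHint' /vmfs/volumes/Datastore1/SERVER01/SERVER01_1-000001.vmdk
The parentCID/parentFileNameHint entries tie the delta to its base disk, and the sparse delta itself records which blocks it contains, so the hypervisor always knows which VMDK to read a given block from.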