Hey Checkyourlogs fans,
Today I want to talk about a case I am currently working on for a customer running Veeam Backup & Replication v9.5 Update 3. Veeam is protecting a Storage Spaces Direct (S2D) Hyper-V hyper-converged infrastructure running on Windows Server 2016.
We recently upgraded all of the servers from Windows Server 2012 R2 and started to see a weird error with Veeam over the weekend. Some of my backup jobs were failing with WMI error 32775, as you can see in the screenshot below.
When I checked the Hyper-V cluster using Failover Cluster Manager, everything looked fine.
The cluster wasn't showing any errors either, so I decided to have a look at the Hyper-V nodes. This is a two-node Storage Spaces Direct configuration.
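If you prefer PowerShell over Failover Cluster Manager for this kind of spot check, a quick sketch like the following (assuming the FailoverClusters module that ships with the feature) tells the same story:

```powershell
# Spot-check cluster health from any node
Get-ClusterNode | Format-Table Name, State

# Anything not Online is worth a closer look
Get-ClusterResource | Where-Object State -ne 'Online'
Get-ClusterSharedVolume | Format-Table Name, State
```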
What I found was very interesting: the failed SQL job was sitting on Node 2. Look at what Hyper-V Manager was showing.
It appeared that the jobs had stalled: either the Hyper-V Virtual Machine Management Service (VMMS) or the VMWP.exe worker processes were hung for these VMs. I had seen this before with VSS snapshots. The most interesting part was that the VM WAC-ADM was going to be my new Windows Admin Center virtual machine; it had failed on creation late last week, before I had a chance to dig into the issue.
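As a side note, each running VM gets its own VMWP.exe worker process, and the VM's GUID appears in that process's command line. A rough sketch like this (run elevated on the affected node) can map VMs to their worker PIDs without the GUI:

```powershell
# Map each VM to its vmwp.exe worker process via the VM GUID in the command line
$workers = Get-CimInstance Win32_Process -Filter "Name='vmwp.exe'"
foreach ($vm in Get-VM) {
    $wp = $workers | Where-Object { $_.CommandLine -match $vm.Id }
    [pscustomobject]@{ VM = $vm.Name; State = $vm.State; WorkerPID = $wp.ProcessId }
}
```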
Something was definitely up with this node.
Event Viewer showed error 19060 in the Hyper-V-VMMS log, stating that the VM was creating a checkpoint that never finished. That was making my new backups and replicas fail.
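If you want to pull those events from PowerShell instead of clicking through Event Viewer, something like this should do it (assuming the default VMMS admin log name on Server 2016):

```powershell
# Grab the most recent stuck-checkpoint events from the VMMS admin log
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
    Id      = 19060
} -MaxEvents 20 | Format-List TimeCreated, Message
```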
I have found node reboots very problematic in this particular situation because the Cluster Service hangs and the Hyper-V host sits there saying "Shutting down Cluster Service." I am then forced to do a hard reboot of the server, which is never a good idea, especially with any type of hyper-converged solution.
NOTE: Storage Spaces Direct seems to be fine with hard power outages. I just don't like to push my luck, because this has caused me a lot of grief with other hyper-converged platforms in the past.
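For reference, on a healthy node I would drain the roles before rebooting, roughly like this (the node name is just an example); in this hung state, though, even the drain can stall:

```powershell
# The graceful path: drain roles off the node, then reboot it
Suspend-ClusterNode -Name 'Node2' -Drain -Wait
Restart-Computer -ComputerName 'Node2' -Force
# Afterwards: Resume-ClusterNode -Name 'Node2' -Failback Immediate
```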
There is a way to kill the worker processes, and I normally use Sysinternals Process Explorer to get the job done.
NOTE: You need to launch Process Explorer from an administrative command prompt so it runs with admin rights. If you don't, you won't be able to see the VMWP.exe processes that are controlling these virtual machines from the parent partition.
I killed each of the processes one by one, watching Hyper-V Manager to see if it cleared up my issue.
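If you'd rather script the kill than click through Process Explorer, a sketch like this does the same thing. Use it with care: killing a worker process is a hard power-off for that VM.

```powershell
# Kill the worker process for one stuck VM (hard power-off for that VM)
$vm = Get-VM -Name 'WAC-ADM'
Get-CimInstance Win32_Process -Filter "Name='vmwp.exe'" |
    Where-Object { $_.CommandLine -match $vm.Id } |
    ForEach-Object { Stop-Process -Id $_.ProcessId -Force }
```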
In the end, I wasn't able to kill the vmwp.exe processes above. I got a generic access denied error because they were tied to an orphaned vmcompute.exe process. I ended up rebooting the node after disabling the Cluster Service and manually killing it.
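Sketched out in PowerShell, that sequence looks roughly like this (run elevated; the process behind the Cluster Service is clussvc.exe):

```powershell
# Keep the Cluster Service from restarting, stop it, and kill it if it hangs
Set-Service ClusSvc -StartupType Disabled
Stop-Service ClusSvc -Force -ErrorAction SilentlyContinue
Get-Process clussvc -ErrorAction SilentlyContinue | Stop-Process -Force
Restart-Computer -Force
```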
Upon reboot of the node, I could see that the orphaned VMWP.exe processes were gone.
I started the Cluster Service back up and checked the storage jobs.
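That amounts to re-enabling the service and then watching the S2D repair jobs drain, along these lines:

```powershell
# Bring the Cluster Service back and watch the S2D repair jobs
Set-Service ClusSvc -StartupType Automatic
Start-Service ClusSvc
Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesTotal
```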
I also checked Hyper-V Manager to see if it was responsive again.
Instead of being stuck on "Loading…" in the MMC, it was now back to normal.
I live migrated the SQL Servers back to Node 2, and as you can see, the locks from the checkpoint creation were cleared.
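The live migration itself is a one-liner if you want to script it (the role and node names here are examples):

```powershell
# Live migrate a clustered VM role back to the recovered node
Move-ClusterVirtualMachineRole -Name 'SQL01' -Node 'Node2' -MigrationType Live
```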
At this point, I wanted to test the backups again to see if this had fixed the problem. It looked good: the checkpoint for the Veeam backup was created properly this time.
From what I can tell from troubleshooting this case, the Hyper-V server locked up while creating checkpoints on the virtual machines, making it impossible to proceed with backups or perform any other operations on the server. I hate rebooting a node, but that is what ended up solving my issue.
Backups are now working again, and the customer is once again happy.
I hope this helps you if you run into this issue.
Thanks,
Dave
Did you ever find out what caused the problem? We have the same thing happening. A Veeam backup starts and hangs at creating a snapshot at 9% on all VMs on the Hyper-V server. You cannot stop it, you cannot replicate the server to another host, and you cannot start a new backup. The only thing you can do, it seems, is reboot the Hyper-V host. After that, we had three corrupted VMs: a domain controller, a database server, and an Exchange server. We had to revert to a backup two days old because this happened on a Friday night. We lost three days of work because of this. Once it's stuck at 9%, the VM is not consistent if you hard-reboot. This is a major problem.
Try looking for the following event in the System log, Event ID 10400: "The network interface 'Intel(R) Ethernet Converged Network Adapter X710' has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets.
Reason: The network driver detected that its hardware has stopped responding to commands.
This network interface has reset 1 time(s) since it was last initialized."
It seems this causes the VMMS service to hang completely. We seem to have solved this by installing a new Intel driver for the network card instead of the in-box driver that you get from Windows Update.
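For anyone who wants to check quickly, a sketch like this should surface both the resets and which driver is currently loaded (filter for your adapter as needed):

```powershell
# Look for NIC reset events (ID 10400) and check the loaded driver
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 10400 } -MaxEvents 10 |
    Format-List TimeCreated, Message
Get-NetAdapter | Format-Table Name, InterfaceDescription, DriverProvider, DriverVersion
```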
Great to know, Caspar. I will check this out.
Great article, great comments.