Hey Checkyourlogs Fans,
I am writing to you tonight after some not-so-fun nights this month dealing with persistent issues with Windows Server 2019 and Storage Spaces Direct. Let me preface this with the fact that we are early adopters, and all of my clients so far understand this and are willing to work with Microsoft and the vendors to improve the experience.
WSSD is slated to come out with full certifications starting in March of 2019.
My customer has purchased a brand new hyper-converged cluster running all NVMe flash drives. We have seen the following issues so far in our deployment, some resolved and some not:
#1 – Mellanox Firmware and Driver Issues – We saw a ton of Paused Packets on the switches and in the Mellanox performance counters. This was breaking our lossless configuration of RDMA (RoCE) for the Storage Spaces Direct nodes. This has been resolved with the help of Mellanox.
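A quick host-side sanity check we leaned on while chasing the pause frames is sketched below. The counter set and counter names vary by WinOF driver version, so treat the Mellanox wildcards as assumptions and confirm with Get-Counter -ListSet on your own nodes:

```powershell
# Confirm RDMA is enabled on the relevant adapters
Get-NetAdapterRdma | Format-Table Name, Enabled

# See which RDMA / Mellanox counter sets this driver exposes (names vary by version)
Get-Counter -ListSet '*RDMA*', '*Mellanox*' | Select-Object CounterSetName

# Sample overall RDMA traffic for a few intervals
Get-Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec' -SampleInterval 2 -MaxSamples 5
```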
#2 – The Mellanox S2700 switches have a new RoCE configuration for lossless networks. Specifically, they have some special settings for Advanced Buffer Management. Without these settings configured properly, you will see abnormally high Paused Packets, which also break the lossless configuration required for RDMA (RoCE). Here is a great link to some configurations authored by a Microsoft Premier Field Engineer (Jan Mortensen) – https://www.s2d.dk/2019/01/monitor-roce-mellanox_5.html . For the record, we have been configuring with Layer 3 DSCP. You should check out his blog; he has some great stuff up there.
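For completeness, the host-side half of that lossless setup is the usual DCB/QoS configuration for SMB Direct. Here is a rough sketch; the adapter names and the bandwidth percentage are placeholders, and the switch-side buffer settings from Jan's post still have to match:

```powershell
# Host-side DCB/QoS sketch for RoCE -- adapter names and percentages are placeholders
Install-WindowsFeature -Name Data-Center-Bridging

# Tag SMB Direct (port 445) traffic with priority 3
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# Priority Flow Control on the SMB priority only
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# Reserve bandwidth for SMB with ETS
New-NetQosTrafficClass -Name "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

# Apply QoS to the RDMA adapters and don't accept settings pushed from the switch
Enable-NetAdapterQos -Name "SLOT 1 Port 1", "SLOT 1 Port 2"
Set-NetQosDcbxSetting -Willing $false
```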
#3 – The SDDC Management resource, which Windows Admin Center uses as a polling mechanism to return results for the HTML5 UI, was crashing cluster roles (VMs). There is a confirmed bug, and it is supposed to be fixed in the 1D cumulative update for January. For now, we have been disabling this resource in Failover Cluster Manager until we can confirm what is happening. Things have been pretty stable since we stopped using Windows Admin Center. We don't anticipate this being a prolonged issue, but it is one that we can't live with, and hopefully it is indeed fixed in 1D.
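If you need to do the same while waiting on 1D, the resource can also be stopped from PowerShell; it normally shows up under the core cluster group as "SDDC Management" (name assumed to match your cluster):

```powershell
# Stop the SDDC Management resource that Windows Admin Center polls
Get-ClusterResource -Name "SDDC Management" | Stop-ClusterResource

# Bring it back once the fix is confirmed
Get-ClusterResource -Name "SDDC Management" | Start-ClusterResource
```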
#4 – We have had reports of customers getting files locked up in their Cluster Shared Volumes (CSVs). In some cases, this has caused production data loss. For our customer, we had good backups and replicas and were able to avoid prolonged outages. It is still unclear at this point what was causing this problem. The Microsoft product teams are investigating.
#5 – NVMe Performance Issues – I took an identical cluster, working with one of the vendors I'm close with, and we ran identical VMFleet tests on the same hardware. The results are pretty shocking. I discovered this issue in the customer's production cluster when my all-NVMe cluster started showing 15+ ms latency. The same cluster reformatted with Windows Server 2016 showed <1 ms (microsecond-range) latency. Digging in, we have been working with Mellanox and have validated that their updated drivers and firmware look good. RoCE at the hardware level appears to be configured correctly with the Mellanox S2700 switches. The issue has now been escalated to Microsoft, and there is still no clear path as to what the cause might be at this time. Below are some screenshot examples of VMFleet running on both platforms. Despite the different host names, it is the same hardware.
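For reference, inside the guest VMs VMFleet is just driving DISKSPD, so the three profiles in the screenshots roughly correspond to the standalone commands sketched here. This is an approximation, not the exact VMFleet sweep parameters, and the target file path is a placeholder:

```powershell
# Rough standalone DISKSPD equivalents of the VMFleet profiles shown below
# (D:\io.dat is a placeholder; -c10G creates the test file if it is missing)

# 4K random, 8 threads, 8 outstanding I/Os, 100% read
.\diskspd.exe -b4K -t8 -o8 -w0 -r -d60 -Sh -L -c10G D:\io.dat

# 4K random, 8 threads, 8 outstanding I/Os, 100% write
.\diskspd.exe -b4K -t8 -o8 -w100 -r -d60 -Sh -L -c10G D:\io.dat

# 4K random, 8 threads, 8 outstanding I/Os, 70% read / 30% write
.\diskspd.exe -b4K -t8 -o8 -w30 -r -d60 -Sh -L -c10G D:\io.dat
```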
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Read
Windows Server 2016
Pretty good numbers, right? 3+ million IOPS and <1 ms latency.
Look at the Bandwidth – 12 GB/sec
These are running on 40 GbE Mellanox CX4 adapters.
Looks great to me.
Now let's try the same thing on Windows Server 2019.
Umm, 800K IOPS with 3.5 GB/sec.
WHAT? 62 ms latency at the top end.
Yeah, something is not right here.
Testing continues:
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Write
Windows Server 2016 – Not bad: 800K IOPS, 3.25 GB/sec, and about 3.8 ms latency.
Windows Server 2019 – 500K IOPS, 2.1 GB/sec, and <20 ms latency.
And for the final test:
Good old 70/30 Read Write
Random 4K, 8 Threads, 8 Outstanding I/O, 70% Read / 30% Write
Windows Server 2016 performs quite well, with over 1.7 million IOPS, 7+ GB/sec bandwidth, and <1 ms latency.
Windows Server 2019:
We hit 700K IOPS, 3 GB/sec, and 35+ ms latency.
Now, I know what you are thinking: aren't you supposed to be a huge fanboy of Storage Spaces Direct, Dave? The answer is absolutely. However, my customers always come first, and this one in particular is running into some really weird issues right now. At this point, Microsoft has taken our issue internally and is working through it to see exactly what is up. At the end of the day, all I know is that if the same hardware runs Windows Server 2016 with 4x the performance or more, something is wrong.
Also, please remember that we are early adopters, and as such we expect to hit roadblocks. My goal with this post is to see if anyone else is experiencing something similar, so we can work toward the common goal of getting things resolved. As always, I would highly recommend waiting until your vendors of choice have certified their hardware with the Windows Server Software Defined (WSSD) program. That program's single goal in life is to prevent these types of issues from occurring in the field. It has vendors certify their hardware and ensure that it can pass stress tests.
We feel this is not a hardware problem, because the same configuration works so well on Windows Server 2016. As I said, Microsoft is evaluating the problem, and as soon as I hear back, I'll let you know.
So for right now, continue your testing of Windows Server 2019, and if you ask me, I would wait about another 45 days until the certified builds are ready from both Microsoft and the vendors.
I really hope you enjoyed this post and it saves you a ton of time,
Thanks,
Dave
Thanks, we have a nearly identical setup (S2700, all NVMe, etc.) and are seeing nearly identical problems throughout the stack. Hope MS can pull this together.
Dave,
Do you have any updates on this issue? We are using an all-NVMe solution with the March 2019 build of Server 2019 and are still seeing the 45% performance hit.
Is there any update on this? About to deploy a 2 node SSD cluster and wanted to use 2019 in my production environment.
Same problem here with Intel NVMe 4xxx series. Win2016 is performing; Win2019 is not!
Here is the update from Microsoft –> You need to turn off CSV Caching for VMFleet testing.
Certain microbenchmarking tools like DISKSPD and VM Fleet may produce worse results with the CSV in-memory read cache enabled than without it. By default VM Fleet creates one 10 GiB VHDX per virtual machine – approximately 1 TiB total for 100 VMs – and then performs uniformly random reads and writes to them. Unlike real workloads, the reads don’t follow any predictable or repetitive pattern, so the in-memory cache is not effective and just incurs overhead.
https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/csv-cache
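In practice, the CSV in-memory read cache is controlled by the cluster's BlockCacheSize property (in MB). Here is a quick sketch of turning it off for the benchmark runs and putting it back afterwards; the 1024 MB value at the end is just an example, so note your own setting before changing it:

```powershell
# Check the current CSV in-memory read cache size (in MB)
(Get-Cluster).BlockCacheSize

# Disable the cache for the VMFleet/DISKSPD runs
(Get-Cluster).BlockCacheSize = 0

# Restore it afterwards (1024 MB is only an example; use your original value)
(Get-Cluster).BlockCacheSize = 1024
```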
Dear Dave,
I would like to ask the same as Steven: do you have any updates you would be able to share?
thx!
Here is the update from Microsoft –> You need to turn off CSV Caching for VMFleet testing.
1. What is the script you are running to get those statistics?
2. Any update on the performance issue? 4-node cluster, all SSD/NVMe, and getting horrid results so far.
1. Any update on the performance issue? We have two 4-node clusters with SSD/NVMe and horrid performance.
2. What script are you using to get the statistics in your screenshots above?
Hey Brian,
Thanks for the comment –> The tool used for the screenshots is watch-cluster.ps1; it comes with VMFleet from Microsoft.
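If you are hunting for it, VMFleet (watch-cluster.ps1 included) lives in Microsoft's DiskSpd repo. Here is a rough sketch of pulling it down and starting the live view from one of the cluster nodes; the path and default parameters are assumptions:

```powershell
# VMFleet ships in the DiskSpd repo under Frameworks\VMFleet (location assumed)
git clone https://github.com/microsoft/diskspd.git C:\Source\diskspd
Set-Location C:\Source\diskspd\Frameworks\VMFleet

# Live per-CSV IOPS / bandwidth / latency view across the cluster (default parameters assumed)
.\watch-cluster.ps1
```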
Did you test it again with the cache disabled? Because I've still got performance issues on 2019.
Yeah, it was better with the CSV cache disabled.
Hi Dave, thank you for sharing your findings about the WS2019 CSV cache. We also have a 2019 S2D solution almost ready for production, but we are not sure about performance. Would you please take a quick look at https://social.technet.microsoft.com/Forums/en-US/8e61341a-a5e4-434e-92d8-5381f50962ed/is-my-s2d-solution-performing?forum=winserverfiles
If you could just tell me whether we're on the right track or not. Thank you!
I have a better solution: join the 800+ S2D experts and users in our free S2D Slack community. Ask your question there; that is where we all hang out.
http://slack.storagespacesdirect.com/
Thanks,
Dave
Have you gotten any updates from Microsoft in late 2019 regarding the performance being optimized to match WS16? Or did they confirm the slower performance is by design in S2D on 2019? Really curious to know, sir!
Hey Oleg, the performance issue was with how VMFleet runs. Just disable the CSV cache during the tests; you can re-enable it afterwards.
Thought I’d chime in here:
I believe we narrowed our 2019 performance issue down to the version of Windows the actual virtual disk files (VHDX) were created on.
On our cluster, virtual disks created natively on 2019 and running on 2019 do not perform well at all. With VHDXs created on earlier versions of Windows/Hyper-V that we copied over and mounted on 2019, performance goes through the roof.
In contrast, we copied a 2019-created VHDX to a 2012 VM, and it performed well.
So it seems to be narrowed down to the abstraction layer or a driver on 2019 VMs.
Reported this to MS as I have an open case. Will post back if we get a resolution.
I guess for now (for future VMs), just create a blank VHDX using 2016 or 2012, keep copying it, mount it, then extend it, format it, etc.
(FYI, this is on a Win2019 S2D 6-node HCI cluster, all NVMe.)
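For anyone who wants to try the same thing, a rough sketch of that template-and-copy approach might look like this (paths, sizes, and VM names are all placeholders):

```powershell
# On the older (2012 R2 / 2016) Hyper-V host: create the blank template once
New-VHD -Path C:\Templates\blank.vhdx -SizeBytes 10GB -Dynamic

# On the 2019 S2D cluster: copy and grow the template for each new VM
Copy-Item C:\Templates\blank.vhdx C:\ClusterStorage\Volume1\VMs\vm01.vhdx
Resize-VHD -Path C:\ClusterStorage\Volume1\VMs\vm01.vhdx -SizeBytes 100GB
Add-VMHardDiskDrive -VMName vm01 -Path C:\ClusterStorage\Volume1\VMs\vm01.vhdx
# Then extend and format the new space inside the guest OS
```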