Hey Checkyourlogs Fans,
I am writing to you tonight after some not-so-fun nights this month dealing with persistent issues with Windows Server 2019 and Storage Spaces Direct. Let me preface this with the fact that we are early adopters, and all of my clients so far understand this and are willing to work with Microsoft and the vendors to improve the experience.
WSSD is slated to come out with full certifications starting in March of 2019.
My customer has purchased a brand new hyper-converged cluster running all NVMe flash drives. We have seen the following issues so far in our deployment, some resolved and some not:
#1 – Mellanox Firmware and Driver Issues – We saw a ton of Paused Packets on the switches and in the Mellanox performance counters. This was breaking our lossless configuration of RDMA (RoCE) for the Storage Spaces Direct nodes. This has been resolved with the help of Mellanox.
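A quick host-side sanity check we leaned on while chasing the pause frames is sketched below. The counter set and counter names vary by WinOF driver version, so treat the Mellanox wildcards as assumptions and confirm with Get-Counter -ListSet on your own nodes:

```powershell
# Confirm RDMA is enabled on the relevant adapters
Get-NetAdapterRdma | Format-Table Name, Enabled

# See which RDMA / Mellanox counter sets this driver exposes (names vary by version)
Get-Counter -ListSet '*RDMA*', '*Mellanox*' | Select-Object CounterSetName

# Sample overall RDMA traffic for a few intervals
Get-Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec' -SampleInterval 2 -MaxSamples 5
```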
#2 – The Mellanox S2700 switches have a new RoCE configuration for lossless networks. Specifically, they have some special settings for Advanced Buffer Management. Without these settings configured properly, you will see abnormally high Paused Packets, which also break the lossless configuration required for RDMA (RoCE). Here is a great link to some configurations authored by a Microsoft Premier Field Engineer (Jan Mortensen) – https://www.s2d.dk/2019/01/monitor-roce-mellanox_5.html . For the record, we have been configuring with Layer 3 DSCP. You should check out his blog; he has some great stuff up there.
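For completeness, the host-side half of that lossless setup is the usual DCB/QoS configuration for SMB Direct. Here is a rough sketch; the adapter names and the bandwidth percentage are placeholders, and the switch-side buffer settings from Jan's post still have to match:

```powershell
# Host-side DCB/QoS sketch for RoCE -- adapter names and percentages are placeholders
Install-WindowsFeature -Name Data-Center-Bridging

# Tag SMB Direct (port 445) traffic with priority 3
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# Priority Flow Control on the SMB priority only
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# Reserve bandwidth for SMB with ETS
New-NetQosTrafficClass -Name "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

# Apply QoS to the RDMA adapters and don't accept settings pushed from the switch
Enable-NetAdapterQos -Name "SLOT 1 Port 1", "SLOT 1 Port 2"
Set-NetQosDcbxSetting -Willing $false
```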
#3 – The SDDC Management resource, which Windows Admin Center uses as a polling mechanism to return results for the HTML5 UI, was crashing cluster roles (VMs). There is a confirmed bug, and it is supposed to be fixed in the 1D cumulative update for January. For now, we have been disabling this resource in Failover Cluster Manager until we can confirm what is happening. Things have been pretty stable since we stopped using Windows Admin Center. We don't anticipate this being a prolonged issue, but it is one that we can't live with, and hopefully it is indeed fixed in 1D.
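If you need to do the same while waiting on 1D, the resource can also be stopped from PowerShell; it normally shows up under the core cluster group as "SDDC Management" (name assumed to match your cluster):

```powershell
# Stop the SDDC Management resource that Windows Admin Center polls
Get-ClusterResource -Name "SDDC Management" | Stop-ClusterResource

# Bring it back once the fix is confirmed
Get-ClusterResource -Name "SDDC Management" | Start-ClusterResource
```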
#4 – We have had reports of customers getting files locked up in their Cluster Shared Volumes (CSVs). In some cases, this has caused production data loss. For our customer, we had good backups and replicas and were able to avoid prolonged outages. It is still unclear at this point what was causing this problem. The Microsoft product teams are investigating.
#5 – NVMe Performance Issues – I took an identical cluster, working with one of the vendors I'm close with, and we ran identical VMFleet tests on the same hardware. The results are pretty shocking. I discovered this issue in the customer's production cluster when my all-NVMe cluster started showing 15+ ms latency. The same cluster reformatted with Windows Server 2016 showed <1 ms (microsecond-range) latency. Digging in, we have been working with Mellanox and have validated that their updated drivers and firmware look good. RoCE at the hardware level appears to be configured correctly with the Mellanox S2700 switches. The issue has now been escalated to Microsoft, and there is still no clear path as to what the cause might be at this time. Below are some screenshot examples of VMFleet running on both platforms. Despite the different host names, it is the same hardware.
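For reference, inside the guest VMs VMFleet is just driving DISKSPD, so the three profiles in the screenshots roughly correspond to the standalone commands sketched here. This is an approximation, not the exact VMFleet sweep parameters, and the target file path is a placeholder:

```powershell
# Rough standalone DISKSPD equivalents of the VMFleet profiles shown below
# (D:\io.dat is a placeholder; -c10G creates the test file if it is missing)

# 4K random, 8 threads, 8 outstanding I/Os, 100% read
.\diskspd.exe -b4K -t8 -o8 -w0 -r -d60 -Sh -L -c10G D:\io.dat

# 4K random, 8 threads, 8 outstanding I/Os, 100% write
.\diskspd.exe -b4K -t8 -o8 -w100 -r -d60 -Sh -L -c10G D:\io.dat

# 4K random, 8 threads, 8 outstanding I/Os, 70% read / 30% write
.\diskspd.exe -b4K -t8 -o8 -w30 -r -d60 -Sh -L -c10G D:\io.dat
```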
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Read
Windows Server 2016
Pretty good numbers, right? 3+ million IOPS and <1 ms latency.
Look at the Bandwidth – 12 GB/sec
These are running on 40 GbE Mellanox CX4 adapters.
Looks great to me.
Now let's try the same thing on Windows Server 2019.
Umm, 800K IOPS with 3.5 GB/sec.
WHAT? 62 ms latency at the top end.
Yeah, something is not right here.
Testing continues:
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Write
Windows Server 2016 – Not bad: 800K IOPS, 3.25 GB/sec, and about 3.8 ms latency.
Windows Server 2019 – 500K IOPS, 2.1 GB/sec, and <20 ms latency.
And for the final test:
Good old 70/30 Read Write
Random 4K, 8 Threads, 8 Outstanding I/O, 70% Read / 30% Write
Windows Server 2016 performs quite well, with over 1.7 million IOPS, 7+ GB/sec bandwidth, and <1 ms latency.
Windows Server 2019:
We hit 700K IOPS, 3 GB/sec, and 35+ ms latency.
Now, I know what you are thinking: aren't you supposed to be a huge fanboy of Storage Spaces Direct, Dave? The answer is absolutely. However, my customers always come first, and this one in particular is running into some really weird issues right now. At this point, Microsoft has taken our issue internally and is working through it to see exactly what is up. At the end of the day, all I know is that if the same hardware runs Windows Server 2016 with 4x the performance or more, something is wrong.
Also, please remember that we are early adopters, and as such we expect to hit roadblocks. My goal with this post is to see if anyone else is experiencing something similar, so we can work toward the common goal of getting things resolved. As always, I would highly recommend waiting until your vendors of choice have certified their hardware with the Windows Server Software Defined (WSSD) program. That program's single goal in life is to prevent these types of issues from occurring in the field. It has vendors certify their hardware and ensure that it can pass stress tests.
We feel this is not a hardware problem, because the same configuration works so well on Windows Server 2016. As I said, Microsoft is evaluating the problem, and as soon as I hear back, I'll let you know.
So for right now, continue your testing of Windows Server 2019, and if you ask me, I would wait about another 45 days until the certified builds are ready from both Microsoft and the vendors.
I really hope you enjoyed this post and it saves you a ton of time,
Thanks,
Dave
Thanks, we have a nearly identical setup (S2700, all NVMe, etc.) and are seeing nearly identical problems throughout the stack. Hope MS can pull this together.
Dave,
Do you have any updates on this issue? We are using an all-NVMe solution with the March 2019 build of Server 2019 and are still seeing the 45% performance hit.
Is there any update on this? About to deploy a 2 node SSD cluster and wanted to use 2019 in my production environment.
Same problem here with Intel NVMe 4xxx series. Win2016 is performing; Win2019 is not!
Here is the update from Microsoft –> You need to turn off CSV Caching for VMFleet testing.
Certain microbenchmarking tools like DISKSPD and VM Fleet may produce worse results with the CSV in-memory read cache enabled than without it. By default VM Fleet creates one 10 GiB VHDX per virtual machine – approximately 1 TiB total for 100 VMs – and then performs uniformly random reads and writes to them. Unlike real workloads, the reads don’t follow any predictable or repetitive pattern, so the in-memory cache is not effective and just incurs overhead.
https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/csv-cache
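In practice, the CSV in-memory read cache is controlled by the cluster's BlockCacheSize property (in MB). Here is a quick sketch of turning it off for the benchmark runs and putting it back afterwards; the 1024 MB value at the end is just an example, so note your own setting before changing it:

```powershell
# Check the current CSV in-memory read cache size (in MB)
(Get-Cluster).BlockCacheSize

# Disable the cache for the VMFleet/DISKSPD runs
(Get-Cluster).BlockCacheSize = 0

# Restore it afterwards (1024 MB is only an example; use your original value)
(Get-Cluster).BlockCacheSize = 1024
```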
Dear Dave,
I would like to ask the same as Steven: do you have any updates you would be able to share?
thx!
Here is the update from Microsoft –> You need to turn off CSV Caching for VMFleet testing.
1. What is the script you are running to get those statistics?
2. Any update on the performance issue? 4-node cluster, all SSD/NVMe, and getting horrid results so far.
1. Any update on the performance issue? We have two 4-node clusters with SSD/NVMe and horrid performance.
2. What script are you using to get the statistics in your screenshots above?
Hey Brian,
Thanks for the comment –> The tool used for the screenshots is watch-cluster.ps1; it comes with VMFleet from Microsoft.
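If you are hunting for it, VMFleet (watch-cluster.ps1 included) lives in Microsoft's DiskSpd repo. Here is a rough sketch of pulling it down and starting the live view from one of the cluster nodes; the path and default parameters are assumptions:

```powershell
# VMFleet ships in the DiskSpd repo under Frameworks\VMFleet (location assumed)
git clone https://github.com/microsoft/diskspd.git C:\Source\diskspd
Set-Location C:\Source\diskspd\Frameworks\VMFleet

# Live per-CSV IOPS / bandwidth / latency view across the cluster (default parameters assumed)
.\watch-cluster.ps1
```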
Did you test it again with the cache disabled? Because I've still got performance issues on 2019.
Yeah, it was better with the CSV cache disabled.
Hi Dave, thank you for sharing your findings about the WS2019 CSV cache. We also have a 2019 S2D solution almost ready for production, but we are not sure about performance. Would you please take a quick look at https://social.technet.microsoft.com/Forums/en-US/8e61341a-a5e4-434e-92d8-5381f50962ed/is-my-s2d-solution-performing?forum=winserverfiles
If you could just tell me whether we're on the right track or not. Thank you!
I have a better solution: join the 800+ S2D experts and users in our free S2D Slack community. Ask your question there; that is where we all hang out.
http://slack.storagespacesdirect.com/
Thanks,
Dave
Have you gotten any updates from Microsoft in late 2019 regarding the performance being optimized to match WS16? Or did they confirm the slower performance is by design in S2D on 2019? Really curious to know, sir!
Hey Oleg, the performance issue was with how VMFleet runs. Just disable the CSV cache during the tests; you can re-enable it afterwards.
Thought I’d chime in here:
I believe we narrowed our 2019 performance issue down to the version of Windows the actual virtual disk files (VHDX) were created on.
On our cluster, virtual disks created natively on 2019 and running on 2019 do not perform well at all. With VHDXs created on earlier versions of Windows/Hyper-V that we copied over and mounted on 2019, performance goes through the roof.
In contrast, we copied a 2019-created VHDX to a 2012 VM, and it performed well.
So it seems to be narrowed down to the abstraction layer or a driver on 2019 VMs.
Reported this to MS as I have an open case. Will post back if we get a resolution.
I guess for now (for future VMs), just create a blank VHDX using 2016 or 2012, keep copying it, mount it, then extend it, format it, etc.
(FYI, this is on a Win2019 S2D 6-node HCI cluster, all NVMe.)
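For anyone who wants to try the same thing, a rough sketch of that template-and-copy approach might look like this (paths, sizes, and VM names are all placeholders):

```powershell
# On the older (2012 R2 / 2016) Hyper-V host: create the blank template once
New-VHD -Path C:\Templates\blank.vhdx -SizeBytes 10GB -Dynamic

# On the 2019 S2D cluster: copy and grow the template for each new VM
Copy-Item C:\Templates\blank.vhdx C:\ClusterStorage\Volume1\VMs\vm01.vhdx
Resize-VHD -Path C:\ClusterStorage\Volume1\VMs\vm01.vhdx -SizeBytes 100GB
Add-VMHardDiskDrive -VMName vm01 -Path C:\ClusterStorage\Volume1\VMs\vm01.vhdx
# Then extend and format the new space inside the guest OS
```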