Hey Checkyourlogs fans,
I was recently reading an update from a friend of mine at Veeam, MVP Anton Gostev, in his Veeam Forums Digest. In the September 3rd–9th, 2018 digest email he stated that Veeam has been doing some extensive testing of hotfix KB4343884. The ReFS.sys driver doesn’t get updated with every Windows Update or rollup, and this August update is supposed to backport performance and stability improvements from Windows Server 2019 to Windows Server 2016.
Oddly, the hotfix article itself says nothing about fixes for ReFS, so I figured I would check for myself. My test system is a Veeam Backup Repository running Storage Spaces with ReFS volumes as the backup targets. It is currently running the August 2018 Cumulative Update from Microsoft.
As you can see in the screenshot below, it is running ReFS version 10.0.14393.2395, and if I apply the hotfix Anton was referring to, it should update to ReFS version 2457, which matches the OS build listed.
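If you want to check the driver version on your own repository server without digging through Explorer, a quick bit of PowerShell against the driver file shows the same details. This is just a sketch; the path below is simply the standard Windows drivers folder.

# Check which ReFS.sys driver the repository server is currently running
$refs = Get-Item "$env:SystemRoot\System32\drivers\ReFS.sys"
$refs.VersionInfo.FileVersion   # e.g. 10.0.14393.2395 before the update
$refs.LastWriteTime             # the "Date modified" value shown in Explorer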
Now, at the time of writing, Microsoft has released the September 11, 2018 Cumulative Update (KB4457131), so I figured I would try just running Windows Update to see if the driver updated itself.
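If you would rather confirm it from a prompt than from the Windows Update history screen, Get-HotFix will tell you whether the rollup is already on the box. The KB number below is just the September 2018 rollup mentioned above.

# Check whether the September 2018 cumulative update is already installed
Get-HotFix -Id KB4457131 -ErrorAction SilentlyContinue

# Or simply list the most recently installed updates on the server
Get-HotFix | Sort-Object InstalledOn -Descending |
    Select-Object -First 5 HotFixID, Description, InstalledOn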
A couple of posts in the Veeam Forums had mentioned that with the update, ReFS.sys was consuming an excessive amount of RAM and/or there was an issue with slow block cloning performance during Synthetic Fulls. For your reference, here is what Anton Gostev from Veeam wrote in the digest sent out earlier this week.
A short update on our testing of KB4343884 that brought the updated ReFS driver (version 2457), which supposedly has significant under the hood changes due to stability and performance fixes backported from Windows Server 2019. So far in our testing, we have not seen issues such as excessive RAM consumption or slow block cloning performance during active or synthetic fulls as reported by one forum member, while backing up 10x more data than he did. So, probably it was the usual case of 3rd party software like antivirus getting in the way. Anyhow, for now we just keep creating synthetic full backup files for the “main” test – which is the deletion of a large number of backup files created with the use of block cloning – as lockups during this process are the core issue the new driver was supposed to resolve. I will keep you updated on the results of this test.
By the way, as we’re talking about ReFS, I think it’s a good idea to address some invalid perceptions about using one for your backup repositories. Especially since I do know where they are coming from. Indeed – if I was a new Veeam user doing my research on ReFS by strolling the Veeam forums, from reading the corresponding threads my perception would be that there are just a handful of the same Veeam users who keep trying to use ReFS, but having terrible issues due to its reliability and so are largely unsuccessful. So I wanted to provide you with some factual ReFS usage statistics coming from our debug logs data mining system. The numbers below will be based off 13500 unique Veeam installations – those of you who studied statistics know that this is a very healthy sample size, providing for a very small margin of error.
So, let’s roll! Among all these installations, 65% have at least one Windows-based backup repository (others use Linux-based servers, shared folders or deduplicating storage devices), and 17% of the total number of repositories in the data set are ReFS-based repositories. So, if we approximate this to the entire Veeam customer base, we can say that about 50000 Veeam customers are using at least one ReFS backup repository today… most just don’t come to the forums to tell us about that 😉
Now, of course, there are users who keep a single ReFS repository around solely for proof of concept (POC) purposes. So next, we sliced the data set by the number of ReFS repositories per installation. We saw that 39% have only 1 ReFS repository, 34% have 2 or 3, and the remaining 27% have 4 or more. So even if some of those 39% are POC only, it still does not affect the bigger picture that much. Especially considering that for historical reasons, Veeam has a very large number of SMB customers, and for many a single backup repository is all they need. We also looked at the other end here – those installations where ReFS is definitely not POC – and saw the top 10 installations with most ReFS repositories per backup server having anywhere between 28 and 58 ReFS repositories. Wow – I guess it is definitely working for them!
As the next step, we sliced the data set by ReFS repository sizes, in an attempt to find the most popular one. This question we actually failed to answer, because apparently I’m awesome at designing data buckets for statistical purposes! Basically, we saw almost perfectly equal distribution between the following buckets: < 10 TB, 10-25TB, 25-50TB, 50-100TB, 100-200TB, >200TB. If this does not prove that Veeam works well for customers of any size, then I don’t know what else can! The only other interesting data point here is that among those 5200 ReFS repositories in the data set, as many as 100 were over 400TB in size.
Then, we looked at the most common settings our customers are using for those ReFS repositories. As expected, 92% have the recommended 64KB ReFS cluster size vs. the default 4KB. Slightly more than one third of ReFS repositories are using per-VM backup file chains, which is usually an indication that it is an extent of a scale-out backup repository.
Finally, we looked at the backup repository server hardware and concurrent tasks configuration. Instead of calculating averages, which can be skewed by the extremes of non-typical hardware configurations, we used the mode operator, which returns the value that appears most often in the data set. This showed that for smaller repository servers (under 50TB), most customers are heavily undersizing physical RAM. And out of those typical 8-16 GB of RAM on the server, potentially up to 4GB RAM can be consumed by the data mover alone for each concurrent task! This leaves close to zero RAM for ReFS to operate with its metadata, because these smaller repositories are also shown to have a significant number of concurrent tasks set (typically 1 task per core, with 4 and 8 cores being the most common CPU configs). That’s a lot of concurrent tasks, which means these servers are well below our minimum system requirements for RAM regardless of the file system in use. So clearly, these are the installations where most performance and reliability complaints are coming from – and we actually saw plenty of such customers in our technical support even before ReFS was a thing, so this is not something new. The only difference is that before, NTFS would usually “win” the battle for RAM, and it would be our data movers having issues due to lack of memory.
The rest of the ReFS repositories spectrum is in a way better shape though – typical RAM allocation for most 50-200 TB repositories quite strictly follows our 1GB RAM per 1TB recommendation, while above 200TB most repository servers have 192 GB RAM installed (which is also acceptable, because our recommendation stops being linear after about 128GB RAM). These larger repositories are typically set up for 2 cores per concurrent task, allowing for good performance for each individual stream. Although this is still a lot of concurrent tasks, as typical server configurations here have between 40 and 80 cores. All in all, looks like most of these larger backup repositories are just carefully designed with Veeam requirements in mind, which is why they are following all of our recommendations so closely. While smaller customers, understandably, are simply using what they already have…
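If you want to sanity-check your own repository against the cluster size and RAM guidance in Anton’s digest, here is a rough PowerShell sketch. The drive letter is just an example, and the 1GB-per-TB figure is only the rule of thumb quoted above, not an official requirement.

# Rough sanity check of a ReFS repository volume against the recommendations above
$vol   = Get-CimInstance Win32_Volume -Filter "DriveLetter = 'D:'"   # example drive letter
$ramGB = [math]::Round((Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB)
$volTB = [math]::Round($vol.Capacity / 1TB, 1)

[pscustomobject]@{
    FileSystem     = $vol.FileSystem                    # should read ReFS
    ClusterSizeKB  = $vol.BlockSize / 1KB               # 64 is the recommended value
    VolumeSizeTB   = $volTB
    InstalledRamGB = $ramGB
    GuidelineRamGB = [math]::Min([math]::Ceiling($volTB), 128)   # ~1GB per TB, flattening out around 128GB
}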
Here is a screenshot of the ReFS.sys version number after the update was installed. It does look like the driver update that first appeared in the August hotfix has indeed been carried forward into the September rollup KB4457131; the new Date Modified timestamp is August 22, 2018.
I wanted to check for myself whether we would see any performance improvements. Here is a replica job that ran the night before on the 2395 ReFS.sys driver, and then the same job afterwards on the 2457 ReFS.sys driver.
Now let’s have a look at a Backup Job to the same Repository that has been upgraded.
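If you would rather pull the numbers than eyeball console screenshots, the Veeam PowerShell snap-in can list the recent sessions for a job so you can compare run times before and after the driver update. This is only a sketch; the job name below is a placeholder for one of my own jobs.

# Compare recent session durations for a job (run on the backup server; the job name is a placeholder)
Add-PSSnapin VeeamPSSnapin -ErrorAction SilentlyContinue

Get-VBRBackupSession |
    Where-Object { $_.JobName -eq "Backup Job - ReFS Repo" } |
    Sort-Object CreationTime -Descending |
    Select-Object -First 5 JobName, CreationTime, EndTime, Result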
At the time of writing this blog, I didn’t have any Synthetic Full Backups to compare, but based on my initial analysis, performance is pretty much the same and appears to be stable. All of my backup jobs ran successfully last night, without issues, after the post-update reboot once the September 2018 Cumulative Update was installed and the ReFS.sys driver was updated to version 2457. As per Anton’s digest above, I agree with the findings Veeam has reported so far and would love to dig deeper to find out exactly what has changed in this new version of the driver.
For previous ReFS fixes, Microsoft has published good documentation describing what the problem was and how it was fixed: https://support.microsoft.com/en-us/help/4016173/fix-heavy-memory-usage-in-refs-on-windows-server-2016-and-windows-10
For example, the above link states that the fix corrects heavy memory usage in ReFS on Windows Server 2016 and Windows 10. No documentation of that kind exists at this time for the new ReFS 2457 driver. So my advice is to watch your repositories closely, and if you notice anything, please feel free to reach out and contact me or leave a comment on the blog.
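As a concrete starting point for that watching, even a simple perfmon-style sample of the standard memory counters taken while the jobs run will show whether available RAM starts sliding after the new driver. This is only a minimal sketch, and the sampling interval is arbitrary.

# Sample repository server memory every 5 minutes for an hour while backups run
Get-Counter '\Memory\Available MBytes', '\Memory\Pool Nonpaged Bytes', '\Memory\Pool Paged Bytes' `
    -SampleInterval 300 -MaxSamples 12 |
    Export-Counter -Path "$env:TEMP\refs-repo-memory.blg" -Force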
Thanks for reading, and I hope you enjoyed the post,
Dave