Hey Checkyourlogs Fans,


I know a lot of you have been reaching out to me asking why you are getting 5120 errors with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted.


It appears that in the May cumulative update Microsoft introduced a new feature, SMB Resilient Handles, for the Storage Spaces Direct intra-cluster network to improve resiliency to transient network failures. A side effect is that timeouts increase when a node is rebooted, which can affect a system under stress.


Until Microsoft releases a fix, here is a workaround that addresses the issue: put the node into Storage Maintenance Mode before rebooting it on a Storage Spaces Direct cluster.


Here is an example:

First drain the node, then invoke Storage Maintenance Mode, then reboot. Replace <NodeName> with the name of the node you are rebooting.

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode 

Once the node is back online, disable Storage Maintenance Mode:

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode 
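
Putting the steps together, here is a rough sketch of what the whole sequence could look like when run from another node in the cluster. The drain, reboot, and resume commands (Suspend-ClusterNode, Restart-Computer, Resume-ClusterNode) are my assumption of how you would do those steps; only the two Storage Maintenance Mode commands come from the workaround above. The $NodeName variable is just a placeholder.

```powershell
# Sketch only: run from a different cluster node than the one being rebooted.
# $NodeName is a placeholder; set it to the node you want to reboot.
$NodeName = "<NodeName>"

# 1. Drain the roles off the node and wait for the drain to complete.
Suspend-ClusterNode -Name $NodeName -Drain -Wait

# 2. Put the node's storage into Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Enable-StorageMaintenanceMode

# 3. Reboot the node and wait until PowerShell remoting responds again.
Restart-Computer -ComputerName $NodeName -Wait -For PowerShell -Force

# 4. Take the node's storage out of Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Disable-StorageMaintenanceMode

# 5. Bring the node back into active cluster membership.
Resume-ClusterNode -Name $NodeName
```

Before moving on to the next node, it is probably worth running Get-StorageJob and waiting for any repair jobs to finish.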



I really hope this helps you resolve some of your issues.



