Hey Checkyourlogs Fans,
I know a lot of you have been reaching out to me asking why you are getting Event ID 5120 errors with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted.
It appears that in the May Cumulative Update, Microsoft introduced a new feature, SMB Resilient Handles, for the Storage Spaces Direct intra-cluster network to improve resiliency to transient network failures. A side effect is increased timeouts when a node is rebooted, which can affect a system under stress.
Until Microsoft releases a fix, here is a workaround that addresses the issue: invoke Storage Maintenance Mode before rebooting a node in a Storage Spaces Direct cluster.
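If you want to confirm that you are hitting the same errors, a quick check of the System event log might look like the sketch below. It assumes the 5120 events are logged by the Microsoft-Windows-FailoverClustering provider, which is where I would expect to see them; adjust the filter if your events show up elsewhere.

```powershell
# Pull the most recent Event ID 5120 entries from the System log on this node.
# Assumption: the events come from the FailoverClustering provider.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5120
} -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, MachineName, Message |
    Format-List
```

Look for STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED in the message text, and compare the timestamps with your node reboots.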
Here is an example:
First drain the node, then invoke Storage Maintenance Mode, then reboot:
Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode
Once the node is back online, disable Storage Maintenance Mode:
Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode
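Putting the whole sequence together, a minimal sketch could look like this. The node name "Node01" is a placeholder, and the script assumes you run it from another cluster node or a management machine with the FailoverClusters and Storage modules available, since Restart-Computer cannot wait on a reboot of the local machine. Treat it as a starting point rather than a tested runbook.

```powershell
# Placeholder node name; replace with the node you plan to reboot.
$NodeName = "Node01"

# 1. Drain the roles off the node and wait for the drain to complete.
Suspend-ClusterNode -Name $NodeName -Drain -Wait

# 2. Put the node's storage scale unit into Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Enable-StorageMaintenanceMode

# 3. Reboot the node and block until PowerShell remoting responds again.
Restart-Computer -ComputerName $NodeName -Force -Wait -For PowerShell

# 4. Take the storage out of maintenance mode, then resume the node.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Disable-StorageMaintenanceMode
Resume-ClusterNode -Name $NodeName
```

Running Get-StorageJob afterwards lets you watch the repair jobs, and it is worth letting those finish before you move on to the next node.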
I really hope this helps you resolve some of your issues.
Thanks,
Dave