I recently came across a thread on the Dell Support forum about a Blue Screen on Dell R730XD servers with Mellanox ConnectX-3 Pro adapters. It is a really interesting read, and here are some of the highlights of the case. You can check it out yourself here: http://en.community.dell.com/support-forums/servers/f/956/t/20010905
Basically, user M.Olsson was deploying a small 3-node Storage Spaces Direct cluster with brand new Dell PowerEdge R730xd servers. He ordered the Mellanox ConnectX-3 Pro NICs directly from Dell. This caught my eye because I have the exact same configuration coming up in a few weeks.
Here is where the train went off the tracks: he built his cluster, enabled Storage Spaces Direct, and started stress testing with VMFleet. Seems straightforward, right? Pretty much a standard Storage Spaces Direct deployment in my opinion. As soon as one of the nodes was rebooted, he got the following Blue Screen:
The MEMORY_MANAGEMENT Blue Screen would just keep looping on that node. Not a good scenario, especially on a brand new build on hardware that was purchased in April of 2017.
Then about 3 days later, on April 24, 2017, a moderator on the forum posted that the issue could be related to the IO Non Posted Prefetching setting in the UEFI BIOS. He also posted a link to the Mellanox Technical Support Forum that was much more helpful than the post on the Dell Support site. You can check that one out here: https://community.mellanox.com/thread/3593
Apparently, this issue goes all the way back to January of 2017 or earlier. This is when user t3chyphil posted the following to the Mellanox Support Forum:
I have copied some of the post here for completeness of this blog post.
Storage Spaces Direct Windows Server 2016 (1607) BSOD – Mellanox ConnectX-3 Pro (Dell)
Good afternoon,
There is very little documentation specific to Windows Server 2016, much of the RDMA/RoCE documentation referrers to Windows Server 2012(r2) Storage Spaces. So I figured I’d start a conversation in here to help others also looking at Microsoft Storage Spaces Direct (S2D) in Windows Server 2016.
I currently have an open case with Dell ProSupport regarding a BSOD my 2 Node cluster encounters. Either node will just halt and restart after 60 seconds when stress testing the environment. Each server is configured as follows…
- Dell 13th Gen R730XD
- 2x 120GB Intel SSDs SSDSC2BB120G6R (OS Mirror)
- 6x 1.6TB SSDs SSDSC2BX016T4R
- 6x 8TB HDDs ST8000NM0055-1RM112
- 2x Intel DC P3700 800GB (Journal / Cache)
- 256GB 2400Mhz Memory
- HBA330 Mini Controller
- 1x Mellanox ConnectX-3 Pro (MT04103) Dual Port SFP+ 10GbE (Firmware Version: 2.26.50.80 / Driver Version: 2.25.12665.0)
- Running Windows Server 2016 DataCenter 1607 Build 14393.693
Each server has two links to a Dell N4032F Switch.
To rule out a possible fault with my switch config, Dell advised I directly connect the two nodes together. RDMA is engaged because I can see the traffic using performance monitor.
Here’s the order in which I’ve setup my environment…
- Install the OS and fully update/patch
- Set Windows Power Mode to Performance
- Install Windows Features – Hyper-V / File-Services / Failover-Clustering / Data-Center-Bridging
- Install Dell drivers for all hardware including the Mellanox nics. (I’ve tried both the Mellanox drivers and Dell’s. They appear to be the same. MLNX_VPI_WinOF-5_25_All_Win2016_x64 / Driver Version: 2.25.12665.0)
- I perform the network configuration. Essentially create a Hyper-V SET Switch joined to both ports of the Mellanox nic. I then create two vNics connected to the new Switch with a VLAN tag. (See attached file)
- I then create the Failover-Cluster and enable Storage Spaces Direct (See attached file)
Everything appears to be okay then it’ll randomly crash. Below is a memory dump. This is what I receive on either host. I want to upgrade the firmware but it’s a Dell product code so I’m stuck. It’s been three weeks and we still don’t have a working environment. I also have another debug output further below…
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time.
Arguments:
Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
Arg2: ffffa48778febe20, Physical Device Object of the stack
Arg3: ffffc080258f4960, nt!TRIAGE_9F_POWER on Win7 and higher, otherwise the Functional Device Object of the stack
Arg4: ffff9c8fe2328010, The blocked IRP
Debugging Details:
——————
Implicit thread is now ffff9c8f`e23a8080
DUMP_CLASS: 1
DUMP_QUALIFIER: 401
BUILD_VERSION_STRING: 14393.693.amd64fre.rs1_release.161220-1747
SYSTEM_MANUFACTURER: Dell Inc.
SYSTEM_PRODUCT_NAME: PowerEdge R730xd
SYSTEM_SKU: SKU=NotProvided;ModelName=PowerEdge R730xd
BIOS_VENDOR: Dell Inc.
BIOS_VERSION: 2.3.4
BIOS_DATE: 11/08/2016
BASEBOARD_MANUFACTURER: Dell Inc.
BASEBOARD_PRODUCT: 0WCJNT
BASEBOARD_VERSION: A04
DUMP_TYPE: 1
BUGCHECK_P1: 3
BUGCHECK_P2: ffffa48778febe20
BUGCHECK_P3: ffffc080258f4960
BUGCHECK_P4: ffff9c8fe2328010
DRVPOWERSTATE_SUBCODE: 3
FAULTING_THREAD: e23a8080
CPU_COUNT: 38
CPU_MHZ: 960
CPU_VENDOR: GenuineIntel
CPU_FAMILY: 6
CPU_MODEL: 4f
CPU_STEPPING: 1
CPU_MICROCODE: 6,4f,1,0 (F,M,S,R) SIG: B00001E’00000000 (cache) B00001E’00000000 (init)
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
BUGCHECK_STR: 0x9F
PROCESS_NAME: System
CURRENT_IRQL: 2
ANALYSIS_SESSION_HOST: PHALFORDPC
ANALYSIS_SESSION_TIME: 01-26-2017 10:07:27.0372
ANALYSIS_VERSION: 10.0.14321.1024 amd64fre
LAST_CONTROL_TRANSFER: from fffff800d1ce5f5c to fffff800d1dcf506
STACK_TEXT:
ffffc080`2afcd6a0 fffff800`d1ce5f5c : 00000000`00000000 00000000`00000001 ffffa487`79d23801 fffff800`d1d47359 : nt!KiSwapContext+0x76
ffffc080`2afcd7e0 fffff800`d1ce59ff : ffffa487`70040100 00000000`00000000 00000000`00000000 fffff800`00000000 : nt!KiSwapThread+0x17c
ffffc080`2afcd890 fffff800`d1ce77c7 : ffffc080`00000000 fffff80d`41a33a01 ffffa487`70040130 00000000`00000000 : nt!KiCommitThreadWait+0x14f
ffffc080`2afcd930 fffff80d`41a0aaba : ffffa487`790a6c90 ffffa487`00000000 fffff80d`41a44000 ffffa487`00000000 : nt!KeWaitForSingleObject+0x377
ffffc080`2afcd9e0 fffff80d`3b05debf : 00000000`00000000 00000000`00000006 ffffa487`78fd3980 fffff80d`3b428bf9 : mlx4eth63+0x4aaba
ffffc080`2afcda30 fffff80d`3b0f6f80 : ffffa487`71c971a0 00000000`00000000 ffff9c8f`e2328010 00000000`00000000 : NDIS!ndisMInvokeShutdown+0x53
ffffc080`2afcda60 fffff80d`3b0b910a : ffffa487`71c971a0 00000000`00000000 0000007f`fffffff8 ffff9c8e`c5249bb0 : NDIS!ndisMShutdownMiniport+0xb4
ffffc080`2afcda90 fffff80d`3b09d342 : 00000000`00000000 00000000`00000000 ffff9c8f`e2328010 ffffa487`71c971a0 : NDIS!ndisSetSystemPower+0x1bdc6
ffffc080`2afcdb10 fffff80d`3b01fc28 : ffff9c8f`e2328010 ffffa487`78febe20 ffff9c8f`e2328200 ffffa487`71c97050 : NDIS!ndisSetPower+0x96
ffffc080`2afcdb40 fffff800`d1d9a1c2 : ffff9c8f`e23a8080 ffffc080`2afcdbf0 fffff800`d1f80600 ffffa487`71c97050 : NDIS!ndisPowerDispatch+0xa8
ffffc080`2afcdb70 fffff800`d1c82729 : ffffffff`fa0a1f00 fffff800`d1d99fe4 ffff9c8e`c9cb8120 00000000`000001d1 : nt!PopIrpWorker+0x1de
ffffc080`2afcdc10 fffff800`d1dcfbb6 : ffffc080`25955180 ffff9c8f`e23a8080 fffff800`d1c826e8 00000000`00000000 : nt!PspSystemThreadStartup+0x41
ffffc080`2afcdc60 00000000`00000000 : ffffc080`2afce000 ffffc080`2afc8000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16
STACK_COMMAND: .thread 0xffff9c8fe23a8080 ; kb
THREAD_SHA1_HASH_MOD_FUNC: b7cf6cc0234897f6fd93ad4ead1f75c9e7fd9df1
THREAD_SHA1_HASH_MOD_FUNC_OFFSET: 263f1d39481efd9f34c4df5786cc37534825cc6e
THREAD_SHA1_HASH_MOD: 1de60aba82b9f9b6af56a445a099815cd801e5d9
FOLLOWUP_IP:
mlx4eth63+4aaba
fffff80d`41a0aaba 488d152f050300 lea rdx,[mlx4eth63+0x7aff0 (fffff80d`41a3aff0)]
FAULT_INSTR_CODE: 2f158d48
SYMBOL_STACK_INDEX: 4
SYMBOL_NAME: mlx4eth63+4aaba
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: mlx4eth63
IMAGE_NAME: mlx4eth63.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 57c2dc3b
BUCKET_ID_FUNC_OFFSET: 4aaba
FAILURE_BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
PRIMARY_PROBLEM_CLASS: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
TARGET_TIME: 2017-01-26T09:54:25.000Z
OSBUILD: 14393
OSSERVICEPACK: 0
SERVICEPACK_NUMBER: 0
OS_REVISION: 0
SUITE_MASK: 400
PRODUCT_TYPE: 3
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
OSEDITION: Windows 10 Server TerminalServer DataCenter SingleUserTS
OS_LOCALE:
USER_LCID: 0
OSBUILD_TIMESTAMP: 2016-12-21 06:50:57
BUILDDATESTAMP_STR: 161220-1747
BUILDLAB_STR: rs1_release
BUILDOSVER_STR: 10.0.14393.693.amd64fre.rs1_release.161220-1747
ANALYSIS_SESSION_ELAPSED_TIME: 6ba
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:0x9f_3_power_down_mlx4eth63!unknown_function
FAILURE_ID_HASH: {476104f0-13a3-bd96-8e08-ff1f10ccd888}
Followup: MachineOwner
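A dump like the one above is long, and only a handful of its fields say which driver was blamed. The sketch below is my own illustration, not something from the forum thread: it pulls the `KEY: value` lines out of a short excerpt of the analysis text and decodes the driver image timestamp. The names `SAMPLE`, `parse_fields` and `image_build_time` are hypothetical, and it assumes that `DEBUG_FLR_IMAGE_TIMESTAMP` is the usual hex-encoded count of seconds since 1970 (UTC).

```python
import re
from datetime import datetime, timezone

# A few lines copied from the bugcheck analysis quoted above.
SAMPLE = """\
DRIVER_POWER_STATE_FAILURE (9f)
BUGCHECK_STR: 0x9F
DRVPOWERSTATE_SUBCODE: 3
MODULE_NAME: mlx4eth63
IMAGE_NAME: mlx4eth63.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 57c2dc3b
FAILURE_BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
"""

# Upper-case key, a colon, then the value (lines like "Arg1: ..." are skipped).
FIELD = re.compile(r"^([A-Z0-9_]+):\s*(.*)$")


def parse_fields(text):
    """Return a dict of KEY -> value, keeping the first occurrence of each key."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


def image_build_time(fields):
    """Decode the image timestamp, assuming hex seconds since the Unix epoch."""
    raw = fields.get("DEBUG_FLR_IMAGE_TIMESTAMP")
    if not raw:
        return None
    return datetime.fromtimestamp(int(raw, 16), tz=timezone.utc)


fields = parse_fields(SAMPLE)
built = image_build_time(fields)

print("Bugcheck:      ", fields["BUGCHECK_STR"])
print("Blamed module: ", fields["MODULE_NAME"], "(" + fields["IMAGE_NAME"] + ")")
print("Failure bucket:", fields["FAILURE_BUCKET_ID"])
print("Driver built:  ", built.date().isoformat())
```

If that assumption about the timestamp holds, the decoded date gives a rough idea of how old the Mellanox driver build was when the crash happened in January 2017, which is handy context when talking to support about driver and firmware versions.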
In the forum, users have been fighting with Dell ProSupport and Mellanox Technical Support to figure out the issue. Remember that a vendor like Dell can OEM cards from a vendor like Mellanox. The issue you run into is that Dell will typically only support the cards it sells in an OEM fashion if you use Dell’s drivers. I have run into this exact issue with Dell in the past with Intel NICs, where Intel will release the latest driver with lots of fixes and Dell won’t support it until they QA it. I understand why Dell has to do this from their perspective, and I feel it is important for my readers to understand it too.
Dell can’t support the latest revision of every driver for every manufacturer instantly. Each driver needs to go through a formal QA process and be certified to work on the Dell platform. This is not unique to Dell, as pretty much every manufacturer that OEMs hardware is in the same boat.
Folks, this happens, and it is what it is. So, you have two choices as a client of Dell:
- If you really want the latest Mellanox drivers and are happy to take support from Mellanox rather than rely on Dell ProSupport, purchase the cards directly from Mellanox. In this scenario Mellanox will be more than happy to help you.
- If you choose to purchase the cards from Dell, you will have one uniform support provider, but without the latest and greatest versions of the drivers. These drivers will be fully supported and certified, though.
Now my dilemma is that my customer is in the support scenario where they have purchased all of their equipment from Dell. This means that I will have to live with whatever the Driver Versions are that Dell has certified (Option 2 from above). Thankfully the same poster that had the issue in the first place found a solution. I have listed his fix below:
I managed to fix the issue. I had a support case open with Dell ProSupport for about 3 weeks. They too had issues trying to replicate the fault. I suggested the firmware was out of sync with the drivers they’d released. Anyway, they said try BIOS settings. I then spent the next 3 weeks reinstalling windows over and over because it would corrupt the install of Windows on occasion because of the BSOD’s.
In the end I was able to resolve the issue. There’s a BIOS setting IO Non Posted Prefetching. This was enabled by default on delivery of the servers. I disabled this setting and was able to run VMFleet for a few days hammering the system with no crashes. I fed this info back to Dell who then closed the case. They did acknowledge the firmware is a problem but said they can’t do anything about it other than raise a case for it to be updated. We just have to wait.
I think I’d buy Mellanox cards directly from Mellanox in future. I can’t see a way of upgrading the firmware as the firmware tools don’t recognise the cards at all. There’s no way to discover them because Dell have changed the identifiers the MFT’s look for. Mellanox was very unhelpful as I tried to raise a case with them, only to be told I don’t have support. Pretty annoyed at the time. Dell won’t give me a time or date for firmware or even if it’s on the cards. Mellanox did not want to know unless I paid more. Anyway, I hope this helps others.
May I add. The servers have been running fine for about a month and now we’re experiencing similar crashes again (not as often). This time Microsoft have a case open as I believe the mellanox side of things are sorted. Who knows, Microsoft might turn around and say there’s a firmware + driver mismatch on the Mellanox cards. It’s been a nightmare.
Anyway, I hope that BIOS setting helps others.
It looks like the initial issue was resolved, but they were still seeing what appeared to be driver issues. For my own deployment, I will make sure to fully patch my Windows hosts (Dell R730XDs), apply the latest firmware, and find out whether Dell has released newer drivers. I have also opened a support case with Mellanox to see if there has been any movement on this issue, since the advice in the forum was to contact them directly. I will update this blog post once I hear back from Mellanox Technical Support.
Happy Learning and have a nice weekend.
Dave