While presenting on S2D (Storage Spaces Direct) at MVPDays, I was asked whether the benefits of RDMA over Converged Ethernet were worth replacing existing 10GbE infrastructure for a cluster.
To answer this, we first need to understand how a shared-nothing Storage Pool works.
Consider a traditional converged datacenter, with storage separated from compute and network. The SAN is responsible for disk/controller availability, presenting shared volumes over Fibre Channel or iSCSI connections identically to all hosts in the cluster. Simultaneous disk reads and writes occur within the SAN's backplane itself and do not traverse the network to the cluster nodes.
In a Storage Spaces Direct cluster, disks are installed directly on the compute nodes, which is commonly known as locally attached storage. We then add all the disks from every node into a single storage pool. From this pool, we provision Cluster Shared Volumes (CSVs), specifying the level of redundancy required for each volume. The level of redundancy determines the number of drives in each host that share the load, and ultimately the amount of raw storage that the CSV consumes.
For example, if I have a 2-node cluster where each node is populated with 4x 2TB SSD drives, my storage pool will have a total capacity of 16TB raw. In this pool, I create a 1TB volume with a redundancy level of 1. This creates a two-way mirror, allowing me to lose one disk in the pool. To accomplish this, S2D places copies of the volume across multiple physical disks, and hosts, in the pool, occupying 2TB of raw space. If we are using a 3-node cluster, 3-way mirrors (disk redundancy 2) are recommended. In a 3-way mirror, if you haven't put it together yet, the 1TB CSV will occupy 3TB of raw storage across the three nodes.
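To make that mirror math concrete, here's a quick back-of-the-envelope sketch in Python. The function names and figures are purely illustrative (this isn't an S2D tool, just the arithmetic from the example above):

```python
# Rough S2D mirror footprint math; names and numbers are illustrative only.

def pool_capacity_tb(nodes: int, drives_per_node: int, drive_size_tb: float) -> float:
    """Raw capacity of the storage pool across all nodes."""
    return nodes * drives_per_node * drive_size_tb

def mirror_footprint_tb(volume_size_tb: float, copies: int) -> float:
    """Raw storage consumed by a mirrored volume (2 copies = two-way, 3 = three-way)."""
    return volume_size_tb * copies

# The 2-node example: 2 nodes x 4 x 2TB SSDs = 16TB raw in the pool.
print(pool_capacity_tb(nodes=2, drives_per_node=4, drive_size_tb=2))   # 16.0

# A 1TB CSV as a two-way mirror consumes 2TB of raw storage...
print(mirror_footprint_tb(volume_size_tb=1, copies=2))                 # 2.0
# ...and as a three-way mirror on a 3-node cluster, 3TB.
print(mirror_footprint_tb(volume_size_tb=1, copies=3))                 # 3.0
```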
Now this is where the networking component comes into play. Because the volume's data is not local to a single node, every write to a mirrored CSV must also be replicated to the other nodes, creating significant network traffic between them. We segment this traffic on the nodes by using dedicated virtual adapters for storage and cluster data, and implement QoS to ensure the storage traffic has the highest priority. Using multiple virtual adapters on a teamed vSwitch is also known as Converged Ethernet. As you can imagine, this replication traffic is the reason why 10GbE is recommended for hyper-converged clusters.
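To get a feel for how quickly that replication traffic adds up, here's a small, purely hypothetical estimate in Python. The workload numbers are invented for illustration; the only assumption carried over from above is that a two-way mirror on a 2-node cluster sends one copy of every write to the partner node:

```python
# Hypothetical inter-node replication traffic for a two-way mirrored volume.
# Workload numbers below are made up for illustration.

write_iops = 50_000        # assumed sustained write IOPS
block_size_kb = 64         # assumed average write size
remote_copies = 1          # copies that must land on the partner node (two-way mirror, 2 nodes)

write_throughput_gbps = write_iops * block_size_kb * 1024 * 8 / 1e9
replication_gbps = write_throughput_gbps * remote_copies

print(f"Local write throughput: {write_throughput_gbps:.1f} Gbps")
print(f"Replication traffic:    {replication_gbps:.1f} Gbps")   # ~26 Gbps
# That is enough to saturate a couple of 10GbE links before you even count
# VM, live migration, and cluster heartbeat traffic on the same converged fabric.
```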
RDMA, or Remote Direct Memory Access, enables the network adapters to transfer data directly to and from main memory on separate hosts, eliminating the need for the OS to process that data first. The transfer consumes virtually no CPU and avoids intermediate buffer copies, providing a high-throughput, low-latency connection between hosts. This is also known as zero-copy networking.
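RDMA itself needs capable NICs and switches, but the zero-copy idea can be illustrated locally. The Python fragment below is only a loose analogy and has nothing to do with actual RDMA hardware; it just contrasts duplicating a buffer with exposing the same memory without a copy:

```python
# Loose local analogy for zero-copy; this is NOT RDMA, just buffer handling.

data = bytearray(100 * 1024 * 1024)   # a 100MB buffer standing in for a block of storage data

# The "copy" path: bytes(data) materialises a second 100MB buffer, the way data
# bounces through OS buffers on a conventional TCP path.
copied = bytes(data)

# The "zero-copy" path: a memoryview exposes the same memory without duplicating it,
# loosely like an RDMA NIC reading and writing application memory directly.
view = memoryview(data)

print(len(copied), len(view))   # same logical size, but only one real copy was made
```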
I heard a great analogy last week from MVP Dave Kawula: Think about Star Trek. The shuttlecraft is your 10GbE network. It’s faster than pre-warp spacecraft, but still has to undergo loading and docking procedures before moving the subject. The transporter on the other hand, your RDMA network, analyzes the subject’s DNA and sends it as code to be recompiled on the other end, making the transfer much faster.
Thankfully, RDMA-capable Mellanox cards are not expensive, running around $300 each, and a 10Gb RDMA-capable switch can be purchased for less than $5k. After spending the cost of a new SUV on your 10Gb infrastructure (which still wasn't wasted effort), the small added cost of RDMA is worthwhile…not just for the S2D performance benefit, but also for freeing up your 10Gb bandwidth for other tasks.
Hope this helps!