End-to-end NVMe arrays poised to resurrect external storage

NVMe-over-Fabrics arrays are performing as fast as servers fitted with the same storage media – Optane or Z-SSD drives, for example. Because NVMe-oF uses RDMA (Remote Direct Memory Access), the network latency involved in accessing external storage arrays effectively goes away. This can be seen with performance results in …

  1. Anonymous Coward

    Who needs DAS with RDMA?

    Anyone who needs throughput on parallel workloads, that's who. Even if you've dropped rather a lot of money on gucci 100Gbit networking, that's still only comparable to 4-5 commodity machines stuffed with spinning rust in throughput terms (rough sums below).

    Remote storage is still a bottleneck for "big data" workloads.
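    A rough back-of-envelope check of that comparison, with assumed rather than measured figures: a 100GbE link, ~200MB/s sustained per spinning disk, and 12 drives per commodity box.

```python
# Back-of-envelope throughput comparison; all constants are illustrative guesses.
GBE100_BYTES_PER_SEC = 100e9 / 8           # 100 Gbit/s link ~ 12.5 GB/s
HDD_BYTES_PER_SEC = 200e6                  # assumed sequential rate per 7,200rpm disk
DRIVES_PER_BOX = 12                        # assumed drives in a commodity chassis

box_throughput = DRIVES_PER_BOX * HDD_BYTES_PER_SEC        # bytes/s per server
boxes_to_match_link = GBE100_BYTES_PER_SEC / box_throughput

print(f"100GbE link   ~ {GBE100_BYTES_PER_SEC / 1e9:.1f} GB/s")
print(f"one HDD box   ~ {box_throughput / 1e9:.1f} GB/s")
print(f"boxes to match the link ~ {boxes_to_match_link:.1f}")   # roughly 5
```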

    1. WYSIWYG650

      Re: Who needs DAS with RDMA?

      Is 25GB/s slow? Modern NVMe over Fibre Channel arrays like NetApp's A800 can do 25GB/s; last I checked, that was fast. DAS vendors should be worried: this technology will replace DAS in HPC environments, and I realize that is just a niche. It is a very large and profitable niche!

  2. Anonymous Coward

    "Because NVMe-oF uses RDMA"

    RDMA is NOT a requirement of NVMe-oF. The "-oF" stands for "over Fabrics", of which there are many. Most implementations of NVMe-oF use RDMA, but there are alternatives (and maybe better solutions?) - Fibre Channel, for instance.

    1. Anonymous Coward

      Re: "Because NVMe-oF uses RDMA"

      NVMe is the answer to Fibre Channel, which has too much latency.

      NVMe provides lower latency and parallelism for throughput. Together with flash, this can deliver significant advantages over traditional array designs.

      1. Anonymous Coward

        Re: "Because NVMe-oF uses RDMA"

        The only reasons to have external NVMe arrays will be:

        - sheer capacity (TB to PB scale)

        - advanced data services

        With modern server designs and software parallelism, the vast majority of NVMe drives will be better served server-side. Furthermore, an NVMe drive in the server has a lower cost/TB than the same drive in an external array. Also, the types of applications driving this growth are designed around server modularity and linear scalability - servers become the modern building block.

        Who needs an external array when you're building Hadoop, VSAN or Splunk? The clever parts are already programmed in at the application level.

    2. muliby

      Re: "Because NVMe-oF uses RDMA"

      Or NVMe/TCP, which gives you all of the benefits of NVMe, without requiring *any* changes on the network side. Just run TCP/IP and you're good.
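      For the curious, a minimal sketch of what "just run TCP/IP" can look like on a Linux initiator – assuming nvme-cli and the nvme-tcp kernel module are available, and run with root privileges; the address and NQN below are placeholders, not a real target.

```python
# Minimal sketch: attach an NVMe/TCP namespace on a Linux host.
# Assumes nvme-cli is installed and the kernel ships the nvme-tcp module;
# the target address and NQN are placeholders for illustration only.
import subprocess

TARGET_ADDR = "192.0.2.10"                      # placeholder target IP
TARGET_NQN = "nqn.2018-01.example:subsys1"      # placeholder subsystem NQN

subprocess.run(["modprobe", "nvme-tcp"], check=True)       # load the TCP transport (needs root)
subprocess.run([
    "nvme", "connect",
    "-t", "tcp",          # transport: plain TCP/IP, no RDMA NICs needed
    "-a", TARGET_ADDR,    # target address
    "-s", "4420",         # standard NVMe-oF port
    "-n", TARGET_NQN,     # subsystem to attach
], check=True)
subprocess.run(["nvme", "list"], check=True)    # the namespace shows up as /dev/nvmeXnY
```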

      1. Anonymous Coward
        Anonymous Coward

        Re: "Because NVMe-oF uses RDMA"

        Congratulations! You just invented iSCSI!

  3. bsdnazz

    If you're not operating at scale (and let's face it, most of us are not) then fault-tolerant external storage can be a useful part of a resilient system. To date, the downside has been that external storage is slower than internal storage, but that's all part of the speed/reliability/cost trade-off - pick any two.

    NVMe-oF greatly improves the speed (latency/bandwidth) and will be more expensive simply because people will pay for the speed.

    It will probably be normal in five years' time, and I'd be happy for NVMe-oF to replace Fibre Channel.

    1. Anonymous Coward

      FC-NVMe will coexist to start with

      Agree with everything you say - but I'd add that FC-NVMe is, in my opinion, the easiest NVMe-oF to start using, and it doesn't need to replace FC/FCP in the strictest sense.

      If you have Gen 5 or later switches (i.e. the ones announced since 2011), you can run FCP and FC-NVMe over the same physical Fibre Channel links at the same time. Your call as to when you switch off FCP... and in both cases you're still using Fibre Channel (just running Fibre Channel Protocol / FCP, or FC-NVMe), so you're not necessarily replacing FC!

      Compare and contrast with RoCE and iWARP - the game there is to guess which one will win (and it really is a guess at this point) and hope, when you've pushed the button, that you didn't buy Betamax...

  4. Anonymous Coward

    NVMe vs. Hyperconverged

    Unless the HCI vendors leverage an RDMA network for replicating storage, and a very fast persistent write cache, it is unlikely HCI solutions will achieve the latencies of dedicated NVMe arrays connected via NVMe over Fabrics.

    Optane as a write cache is much faster than NAND, but still much slower than an NVDIMM. Optane DIMMs (Apache Pass), along with RDMA cluster networks, may be the technologies that allow HCI to get close to end-to-end NVMe arrays.

    1. JRW

      Re: NVMe vs. Hyperconverged

      JohnW from Nutanix here. We introduced support for RDMA in AOS 5.5, which went GA in December 2017. Most people won’t need it for some time to come, but for those who do, it is available now. It’s not really NVMe vs. Hyperconverged, as we’re ready for tomorrow today.

  5. JohnMartin

    A few points

    1. https://www.systor.org/2017/slides/NVMe-over-Fabrics_Performance_Characterization.pdf is probably a better resource if you're really interested in the speeds and feeds of local vs iSCSI vs RDMA.

    2. It's not just RDMA that makes things fast with NVMe-oF; it's the "zero copy" aspect of RDMA that gives most of the performance benefits. NVMe-FC uses zero-copy techniques but not RDMA, and some micro-benchmarks show there's only a tiny difference (a few microseconds) between the two approaches.

    3. NVMe-oF consumes MUCH less CPU than any SCSI-based storage protocol (FCP, iSCSI or even iSER, which is also RDMA-based), and other efficiencies in the software stack shave off at least 20 microseconds of latency when comparing SAS vs NVMe on a local system. That protocol efficiency is enough to make accessing flash via NVMe over fabrics faster than local SAS (the network overhead of fabric NVMe vs local NVMe is much less than 20 microseconds), and based on the benchmarks done by Samsung, you're looking at about a 10% difference in latency for local vs remote NVMe (see the rough arithmetic after this list).

    4. From my reading of the E8 architecture, it does a lot of caching at the host layer in the E8 Agents; the actual array itself isn't that special (about the same as a NetApp EF570 / EF580). If I've read the marketing material correctly, by absorbing a lot of the read I/O at the host layer you're not really seeing as much benefit vs DAS from NVMe-oF as the article implies, which probably explains why the results don't show the same 10% difference in local vs remote performance seen in the testing done by Samsung, though a bunch of them were probably throughput tests rather than random I/O tests, and in throughput there's pretty much zero difference until you saturate the network.

    5. You really have to look at the end-to-end architecture. HDFS, for example, does a horrendous job of aggregating the performance of multiple devices on the same host, and distributed shared-nothing infrastructures simply don't get anywhere near the same level of performance as a highly engineered HA pair, especially once the write workload becomes non-trivial. That affects pretty much every hyperconverged solution out there, and adding in NVMe over fabrics isn't going to change that by much, because the bottlenecks are in the other parts of the stack.

    6. Attaching an RDMA-capable external block-level device to a DGX-1, you're going to have to use something that can attach via InfiniBand (like, say, an EF580), and as I don't think you can load external software like the E8 agent onto a DGX-1, you're going to be limited to the performance of the actual array. If you want Ethernet, then low-latency scale-out NFS is still pretty much your only option, and there's a surprising amount of ML training data that turns out to be remarkably compressible, which makes the A800 (which supports end-to-end NVMe today) the biggest, fastest storage you can easily attach to a DGX-1 today (e.g. 300 gigabytes/second of throughput is quite achievable in a single cluster).
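    To make the numbers in point 3 concrete, here's a rough model using the figures quoted above (~20µs of protocol overhead saved by NVMe vs SAS, ~10% remote penalty per the Samsung benchmarks). The flash read latency and SAS stack overhead are assumed illustrative values, not measurements.

```python
# Rough latency model for point 3 above. The ~20us protocol saving and ~10%
# remote penalty are the figures quoted in the comment; the flash read latency
# and SAS stack overhead are assumed illustrative values, not measurements.

FLASH_READ_US = 80.0                    # assumed raw NAND read latency
SAS_STACK_US = 25.0                     # assumed SAS/SCSI software stack overhead
NVME_STACK_US = SAS_STACK_US - 20.0     # NVMe stack shaves off ~20us (per the comment)

local_sas = FLASH_READ_US + SAS_STACK_US
local_nvme = FLASH_READ_US + NVME_STACK_US
remote_nvme = local_nvme * 1.10         # ~10% NVMe-oF penalty (Samsung numbers)

print(f"local SAS  ~ {local_sas:.0f} us")
print(f"local NVMe ~ {local_nvme:.0f} us")
print(f"NVMe-oF    ~ {remote_nvme:.1f} us  (still lower than local SAS)")
```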

    1. RonH

      Re: A few points

      Dear NetApp employee, from an E8 employee:

      1. The E8 storage host layer (agent) never caches. It would be great if you could show where you read that it does. All the E8 performance benchmarks and publications are done with 100% random access, always served from NVMe storage and going NVMe end-to-end.

      2. Contrary to what you wrote, DGX-1 and indeed any server with a GPU is just a Linux server and can run any software, including but not limited to the E8 agent.

      3. The AI training data, even that referred to in NetApp's own publication, is *images*, which are not compressible, and are unique, hence not dedupable either. Not sure where you came up with the statement "remarkably compressible".

      4. "300GB/s out of a single NetApp cluster": it's interesting that NetApp's own SPC-2 submissions go up to about 11GB/s. Maybe you can share with the world how big a 300GB/s NetApp cluster would be (two racks of equipment by my estimate) and how much it would cost ($4M by my estimate). In the case of E8, a 2U shelf does 42GB/s, so with 7 shelves (14U) you can get roughly 300GB/s; excluding the SSDs, the cost of that is <$200K (rough sums below).
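      Spelling out the sums in point 4, using only the per-shelf figure quoted above (the rack-count and dollar estimates are the commenter's own and not re-derived here):

```python
# The arithmetic behind point 4, using the per-shelf figure quoted in the thread.
E8_SHELF_GBPS = 42      # claimed throughput of one 2U E8 shelf
E8_SHELVES = 7

total_gbps = E8_SHELVES * E8_SHELF_GBPS
rack_units = E8_SHELVES * 2

print(f"{E8_SHELVES} shelves x {E8_SHELF_GBPS} GB/s = {total_gbps} GB/s in {rack_units}U")
# 7 x 42 = 294 GB/s, i.e. roughly the 300 GB/s figure, in 14U of rack space.
```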

  6. Korev

    Genomics 100X?

    A UK customer found E8’s storage, used with Spectrum Scale (GPFS), enabled it to accelerate its genomics processing 100X, from 10 hours per genome to 10 genomes per hour. The E8 array was used as a fast storage tier and scratch space.

    A 100X increase in throughput sounds too good to be true. Are there any details on what they actually did? I checked E8's website, but nothing leapt out.
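    For what it's worth, the headline factor is at least internally consistent with the quoted numbers, whatever the pipeline changes behind it were:

```python
# Sanity check of the quoted 100X: 10 hours/genome before vs 10 genomes/hour after.
before_genomes_per_hour = 1 / 10    # 10 hours per genome
after_genomes_per_hour = 10         # 10 genomes per hour

speedup = after_genomes_per_hour / before_genomes_per_hour
print(f"speedup = {speedup:.0f}x")  # 100x, so the arithmetic itself holds up
```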

    1. Korev

      Re: Genomics 100X?

      I tried sending E8 a message to ask about the above, but their webform just shows a spinny icon once I'd successfully got past the reCaptcha...

  7. eXtremeDB

    Hi, McObject rep here. McObject’s eXtremeDB is the DBMS used in the STAC-M3 benchmark with E8’s storage array. One detail overlooked in this article is that the other STAC-M3 reports (Dell with Samsung Z-SSDs, and Lenovo with direct-access Optane drives) were conducted with a competing DBMS, so a direct comparison of E8, Z-SSD and Lenovo with direct-access Optane is not possible from these reports. Some of the record-setting performance that was achieved is attributable to eXtremeDB vs kdb+. How much? Impossible to say, but eXtremeDB has bested kdb+ in every STAC-M3 published with eXtremeDB since 2012, usually with hardware that costs a fraction of that used in kdb+-published STAC-M3 results.
