Pavilion compares RoCE and TCP NVMe over Fabrics performance

Analysis Pavilion Data says NVMe over Fabrics using TCP adds less than 100µs of latency compared with RDMA RoCE and is usable at data centre scale. The biz is an NVMe-over-Fabrics (NVMe-oF) flash array pioneer and already supports simultaneous RoCE and TCP NVMe-oF transports. Head of Products Jeff Sosa told El Reg: “We are … …

  1. eldakka

    It was focused on random write-latency serving 1,000s of 10GbitE non-RDMA (NVMe-oF over TCP) clients and a few dozen 25GbitE RDMA (NVMe-oF with RoCE) clients.

    Can you fit any more acronyms in a single sentence? Bit of an acronym-fest this article.

    I know or could puzzle out most of them, but what's RoCE?

    1. Anonymous Coward

      RoCE

      > I know or could puzzle out most of them, but what's RoCE?

      https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

  2. Anonymous Coward

    "Inbox" driver

    The term "Inbox" driver needs to die.

    Whoever at Mellanox started using the term "Inbox" driver to mean the driver provided by the OS - known throughout the industry for years as the "OS" driver - needs to cut it out and start using the industry standard term.

    The Mellanox forum often has people asking effectively "WTF?" due to this. Ugh.

  3. JohnMartin

    100+ microseconds seems way too slow

    100 microseconds is a ridiculously large overhead in the world of solid state media .. the protocol overheads of NVMe are about 5 microseconds, the wire latency of electricity is about a nanosecond per 30cm, and the switch latency in Ethernet is about 200 nanoseconds ...

    If you look at old benchmarking from Chelsio in the SNIA presentation here https://www.snia.org/sites/default/files/SDC15_presentations/networking/WaelNoureddine_Implementing_%20NVMe_revision.pdf you'd see that there should only be about 8 microseconds of difference between NVMe inside a server on PCIe and the same I/O over a network .. end-to-end latency for a 4K I/O should be in the vicinity of 20 microseconds.

    If you really want to geek out on this stuff, check out http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post149s2-file3.pdf, which shows that the actual latency difference between running RDMA traffic over Layer 2 vs TCP for a 4K I/O size should be about 5 to 10 microseconds if you're just measuring protocol-level differences.

    As a benchmark of NVMe over Fabrics using RoCEv1 or v2 vs TCP, it's kind of uninspiring on all levels. A rough latency budget built from the figures above is sketched below.
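
    For context, here is a back-of-envelope latency budget in Python using the figures quoted above, plus an assumed NAND read latency (the media number is an assumption, not from the post), so treat it as a sketch rather than a measurement:

    ```python
    # Back-of-envelope 4K read latency budget, using the rough figures quoted
    # above; the NAND media latency is an assumed value for a fast NVMe SSD.
    US = 1.0      # one microsecond
    NS = 0.001    # one nanosecond, expressed in microseconds

    budget = {
        "NVMe protocol overhead":       5 * US,
        "NAND media read (assumed)":   10 * US,   # assumption, not from the post
        "Ethernet switch hop":        200 * NS,
        "wire, ~30 m of cable":       100 * NS,   # ~1 ns per 30 cm
        "NVMe-oF network transport":    8 * US,   # Chelsio figure cited above
    }

    total = sum(budget.values())
    for part, us in budget.items():
        print(f"{part:30s} {us:8.2f} µs")
    print(f"{'total':30s} {total:8.2f} µs")   # lands in the low 20s of µs
    ```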

  4. CheesyTheClown

    Digging for use cases?

    Ok, let’s kill the use case already.

    MongoDB... you scale this out, not up; MongoDB’s performance will always be better when run on local disk instead of centralized storage.

    Then, let’s talk about how MongoDB is deployed.

    It’s done through Kubernetes... not as a VM, but as a container. If you need more storage per node, you probably need a new DB admin who actually has a clue.

    Then there’s the development environment. When you deploy a development environment, you run minikube and deploy. Done. No point in spinning up a whole VM. It’s just wasteful and locks the developer into a desktop.

    Of course, there are also cloud instances of MongoDB if you really need something online to be shared.

    And for tests... you would never use a production database cluster for tests. You wouldn’t spin up a new database cluster on a SAN or central storage. You’d run it on minikube or in the cloud on AppVeyor or something similar.

    If latency is really an issue for your storage, then instead of a few narrow 25GbE pipes to an oversubscribed PCIe ASIC for switching and an FPGA for block lookups, you would use more small-scale nodes, map/reduce, and spread the workload with tiered storage.

    A 25GbE or RoCE network in general would cost a massive fortune to compensate for a poorly designed database. Instead, it’s better to use 1GbE or even 100MbE and scale the compute workload out onto more small nodes. 99% of the time, 100 $500 nodes connected by $30-a-port networking will use less power, cost considerably less to operate, and perform substantially better than nine $25,000 nodes (rough arithmetic sketched below).
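
    A quick sanity check of that claim in Python, using only the commenter's own illustrative price figures (none of these are vendor quotes, and the scale-up total ignores the cost of the 25GbE/RoCE fabric itself):

    ```python
    # Rough hardware-cost comparison using the figures quoted above; the
    # prices are illustrative forum numbers, not vendor quotes.
    scale_out_nodes = 100
    scale_out_node_cost = 500      # $ per small node
    port_cost = 30                 # $ per commodity switch port

    scale_up_nodes = 9
    scale_up_node_cost = 25_000    # $ per big node (fabric cost not included)

    scale_out_total = scale_out_nodes * (scale_out_node_cost + port_cost)
    scale_up_total = scale_up_nodes * scale_up_node_cost

    print(f"100 small nodes: ${scale_out_total:,}")   # $53,000
    print(f"  9 large nodes: ${scale_up_total:,}")    # $225,000
    ```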

    Also, with a proper map/reduce design, the vast majority of operations become RAM-based, which will drastically reduce latency compared to even the most impressive NVMe architectures based on obsessive scrubbing. Go the extra mile and build indexes that are actually well formed, and use views and/or eventing to mutate records, and NVMe becomes a really useless idea.

    Now, a common problem I’ve encountered is in HPC... this is an area where propagating data sets for map/reduce can consume hours given the right data set. There are times when processes don’t justify two extra months of optimization. In this case, NVMe is still a bad idea, because RAM caching in an RDMA environment is much smarter.

    I just don’t see a market for all flash NVMe except in legacy networks.

    That said, I just designed a data center network for a legacy VMware installation earlier today. I threw about $120,000 of switches at the problem. Of course, if we had worked on downscaling the data center and moving to K8s, we probably could have saved the company $2 million over the next 3 years.

  5. Anonymous Coward

    Underestimating the impact of data management

    Testing at QD=1 is like shooting on a range, and from the looks of it they are only looking at up to 20 clients going to the array. There is a big difference between shooting on a range and shooting in a real combat situation.

    What kills predictability of application performance is data management needs and locality. As an example, when you take a backup in a DAS environment, you pretty much kill the application. It becomes even more pronounced as the data volume increases; let's not even get to all the scans that many enterprises run day to day. (Let us not go into the debate as to who needs backup since there are copies.)

    With respect to test and dev, what if I want a copy of the production "data" for testing? Can we do a writeable snapshot, mount it on a test/dev cluster and carry on, or do I have to copy the data out?

    The key to bringing in shared accelerated storage is minimizing data movement, and data movement happens all the time; every data movement job creates a choke on application performance.
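
    On the writeable-snapshot question, that is exactly what copy-on-write snapshots and clones are for: the clone is writeable and shares unchanged blocks with production, so nothing is bulk-copied. A minimal sketch of the idea using ZFS (the pool/dataset names are made up, and a real array such as Pavilion's would expose its own snapshot/clone mechanism rather than ZFS):

    ```python
    # Illustrative only: carve a writeable test/dev copy out of a production
    # dataset with ZFS snapshot + clone. Dataset names here are hypothetical;
    # an array would offer its own snapshot/clone API instead.
    import subprocess

    def run(cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["zfs", "snapshot", "tank/prod@for-testing"])                # point-in-time, read-only
    run(["zfs", "clone", "tank/prod@for-testing", "tank/testdev"])   # writeable, shares blocks
    # Mount tank/testdev on the test/dev cluster; only changed blocks consume new space.
    ```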
