I'm not having a VMware moment – there's just something in my eye

SAP VP Renu Raman thinks 2U, 24-drive, NVMe storage boxes could provoke a storage VMware moment. Raman looks after HANA cloud computing at SAP and his interests include high-performance persistence architecture for in-memory databases. What he's clocked is that there is storage hardware now in the same position as the 2U 2- …

  1. Jemma

    Not a technical question...

    But doesn't heat rise under the normal scheme of things?

    So why on earth put your storage and router, which are probably the most heat-sensitive kit, at the top, directly above the biggest heat generators in the room?

    Wouldn't it be better to have them at the bottom (if above flood levels) or in the middle (where there's a risk of flooding) so they don't get kfc'd in short order?

    1. Crazy Operations Guy

      Re: Not a technical question...

      In a proper datacenter the fans in the machines push the hot air out the back, where it gets sucked up by the cooling system, which deposits cold air in front of the machines. If you are worrying about heat rising in a server room / datacenter, you have much bigger issues to deal with.

      As for flooding, all the data-centers I've been to are raised up a few feet from ground level to match the height of the loading dock, and have a basement underneath to store fuel for the generators, run cables, or hold items that aren't damaged by flood water.

  2. abb2u

    Not a mirage

    The software needs to achieve disaggregation for a grid of HCI. Yes, software is the bottleneck, and most of the parallel block services suffer from being implemented when the governor was HDD speed, then SATA/SAS. NVMe is the curveball. Besides adding 100Gbps bandwidth, as we saw at SC2016, we need to see network software enhancements to increase the required parallelism and disaggregation with QoS. As another article today puts it, in 2017 we are at a tipping point. Who can code the quickest? (Get those students working on dcache.org and SC2016 doing it! -- Whoa, what if it comes from Open Source instead of a vendor? Truly Open COTS!)

  3. Anonymous Coward
    Anonymous Coward

    OMG

    I think he invented the SAN!

    1. Anonymous Coward
      Anonymous Coward

      Re: OMG

      Yep, physical machines booting from SAN is hardly a new idea. Sure, NVMe is faster than when they were connected via FC to LUNs carved from 15K rpm mirrored drives, but the boot disk of a server is rarely critical to performance.

  4. Crazy Operations Guy

    Those look awfully familiar....

    Yet another company re-selling year-old SuperMicro hardware and throwing a massive markup on the thing:

    https://www.supermicro.com/products/system/2u/2028/SYS-2028U-TN24R4T_.cfm

    Or go with 48 NVMe in 2U:

    https://www.supermicro.com/products/system/2U/2028/SSG-2028R-NR48N.cfm

    Wouldn't be surprised if they were also just using SuperMicro for the networking, what with these adapters:

    https://www.supermicro.com/products/system/2U/2028/SSG-2028R-NR48N.cfm

    and a couple of these switches to stitch it all together:

    https://www.supermicro.com/products/accessories/Networking/SSE-C3632S.cfm

  5. Crazy Operations Guy

    Hooray for single points of failure!

    So they want us to move storage for mission-critical data from very reliable local drives to a SAN that can fail for many reasons, taking the whole rack with it? Yeah, that's a pass for me...

    1. Nate Amsden

      Re: Hooray for single points of failure!

      Mission-critical (which is really in the eye of the beholder) data already sits on centralized storage for probably 98% of organizations out there. You simply cannot get the reliability and stability (and data services) with internal storage systems on any platform. Even the biggest names in cloud and social media make very large-scale use (relative to your typical customer, anyway) of enterprise-class storage systems internally.

      Certainly you can put "critical" data on internal drives, though it's highly unlikely that truly mission-critical stuff (typically databases that may be responsible for millions or more in revenue) would sit on anything other than an external storage array (likely fibre channel). Ten or so years ago VMware breathed new life into centralized storage simply because of vMotion.

      If you don't understand that then I don't have time to discuss it further.

      Though the idea the person in the article is touting sounds neat, getting that kind of thing done right is far easier said than done, and I'm not sure when it may happen (certainly none of the solutions on the market are even close). Some solutions do file well, others do block well, others do object well. Nobody comes close to being able to do it all well on a single platform. Maybe it will be another decade or so before we get to that point, if we ever do.

      I think at this point the speed of flash is really not important anymore (outside of edge cases). What is far more important is simply cost. Cost is improving but obviously has quite a ways to go still. Many data sets do not dedupe, and lots of datasets come compressed already (e.g. media files), so we need the cost of the raw bits to keep coming down.

      SAS-based SSD systems will be plenty fast for a long time to come for most workloads.

      I have some mission-critical systems that do not use our SANs, though they are generally stateless (web or app servers); there is no mission-critical data on them.

      1. Crazy Operations Guy

        Re: Hooray for single points of failure!

        Well, in my environment, the critical data is -also- stored on large central arrays. The model I've gone for is to break tasks up into pieces as small as possible and then distribute them across as many boxes as needed. We prefer scaling out rather than up.

        So with our mission-critical system, rather than using a high-end 8-way Xeon E7 box with 1+ TB of RAM and a 24-disk bay filled with high-end SSDs, we use 32x Xeon E3 machines with 64 GB of RAM and some mid-range SSDs. For databases, we split the tables so that very active, CPU-intensive tables end up on many of the systems, and even mostly inactive tables end up on several machines.

        We found that it's very cost-effective to just distribute the workload across many mid-range servers. When pushing a single system further and further, we found that past a certain point each additional FLOP/IOP we squeeze out of it gets more expensive, so we build systems up only until the point where an additional 10% of performance would push costs up by more than 20%.
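        As a rough sketch of that rule of thumb (the performance/cost points below are made up for illustration, not our actual numbers):

        ```python
        # A sketch of the sizing rule of thumb above: keep stepping up the build
        # until the next ~10% of performance would cost more than ~20% extra.
        # The performance/cost points below are made-up illustrations.
        builds = [            # (relative performance, relative cost) of successive configs
            (1.00, 1.00),
            (1.10, 1.15),
            (1.21, 1.35),
            (1.33, 1.70),     # this step adds ~10% performance for ~26% more cost
        ]

        chosen = builds[0]
        for prev, nxt in zip(builds, builds[1:]):
            perf_gain = nxt[0] / prev[0] - 1
            cost_gain = nxt[1] / prev[1] - 1
            if cost_gain > 2 * perf_gain:     # +10% performance costing more than +20%
                break
            chosen = nxt

        print(f"stop at the {chosen[0]:.2f}x performance / {chosen[1]:.2f}x cost build")
        ```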

        We save a lot of money on staffing as well. Rather than having staff working around the clock, we just have a skeleton crew outside of business hours whose task is to diagnose why something failed rather than attempt a repair: when you have a pair of large systems in an HA configuration, you need to repair them quickly to ensure things run smoothly; but when 1 out of 32 servers goes down, capacity is only reduced by 2-3% and it can be handled on Monday when normal staff come in. We also saw a lot of gains in routine maintenance: we don't need to wait until after hours to patch a system, we can do it at any time and no one outside of IT even notices that something was down.

        We even apply that model to our file servers: 1U boxes with 12x 3.5" bays filled with 6-8 TB drives. With everything replicated and advanced load-balancing, we can squeeze many more IOPs out of a rack of boxes with inexpensive spinning rust and 1-gig interfaces than out of the 1.5-rack monstrosity EMC sent us to try out.

        The developers we use for in-house development are highly disciplined, and code is profiled quite regularly so that we can identify inefficient databases, overloaded functions, or just inefficient methods. Our head of development is a firm believer in the Unix model of building things so they do only one thing, but do it well and do it quickly. Our code may be a bit large, but the systems zip right through it without breaking a sweat and it's very, very easy to understand and debug. We managed to reduce development staff from 150+ devs in India and China to a dozen folk in Austin and another dozen in Frankfurt.

        1. DeCoder

          Re: Hooray for single points of failure!

          Wow. That sounds rather complex. May I assume that you did not arrive at these conclusions - let alone this implementation - overnight? Your setup doesn't sound like a typical M$ shop of admins or devs.

          Did you start out this way, or did you start out with a more traditional, monolithic approach and 'climb the mountain' after years of thought and design? Was it a gradual implementation or more of a 'moon shot'?

          I'd be interested to know more detail about how you are 'splitting tables' in your DB's. Oracle? RAC?

          Thanks for sharing. You've reignited my confidence that there is some wisdom left in modern corporate IT.

          Wow...

          1. Crazy Operations Guy

            Re: Hooray for single points of failure!

            We actually started out as a Microsoft shop on big, monolithic boxes. We shifted over to micro-size machines after hiring a developer who became impatient with how long it was taking to build test environments, so they built their own out of a pallet of Pentium 4 desktops that were destined for the scrap heap. They became frustrated with the performance of running a full deployment on a single box, so they started to split the application out across multiple machines; the application then evolved from there.

            Management was impressed by their work (the first time development was both well under budget and delivered early), so they set them loose on other projects. They ended up getting some staff of their own and pulled in some open-source *BSD devs they had worked with before at local hackathons. They eventually re-coded everything to run on OpenBSD/FreeBSD, nginx, PostgreSQL, and CGIs built in Python and C (depending on performance requirements). After some time, they had the time and energy to start tuning the OS, eventually bringing the OS overhead down to 4 GB of disk space with base daemons and libraries, less than 1 GB of RAM, and less than 10% of a single CPU core on an E3-1230v2 chip while saturating the machine during load testing (the rest of the machine is spent on the application load).

            As for DB performance tuning, all queries go through a set of stateless "director" systems that hold in-memory DB tables containing performance data, a list of which tables are present on each DB server, and the last update time for each DB, and use these to direct each query to the most optimal server. The directors sit behind a shared / LACP IP address on the front-end and use a multicast address on the back-end to receive performance figures from the "performance manager" system. Once a minute, performance figures from all the DB servers are sent to the performance manager, which compiles a performance score and sends those figures out to the directors. The performance manager also tracks long-term performance data to determine which tables should be distributed to additional database servers.

            Additional tables are distributed by the performance manager itself: it sends each database server a list of the tables it should hold, and the servers pull those tables from the master DB server that stores all tables (a low-performance box loaded with a bunch of several-TB 7200 RPM SATA disks). This master server doesn't handle queries from the applications, only replication requests from the actual DB servers, and changes moving up from the DB servers to the master account for less than 1% of query traffic, so not much of its performance is taken up with that (most of it is spent sending tables to DB servers or dumping its tables out to a backup server). Once a DB server receives a new table, it informs the performance manager, which then updates the list it sends out to the directors.
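            To give a feel for what a director actually does per query, here's a very rough sketch (the structures, names and scoring are made up for illustration; lower score = less loaded):

            ```python
            # Minimal sketch of a "director" routing a query: pick the least-loaded,
            # recently-updated server that holds the requested table. Structures and
            # scoring here are made up for illustration, not our actual code.
            import time

            STALE_AFTER = 180            # ignore servers we haven't heard about recently
            servers = {}                 # name -> {"score": float, "tables": set, "updated": float}

            def apply_update(name, score, tables):
                """Apply a periodic figure pushed out by the performance manager."""
                servers[name] = {"score": score, "tables": set(tables), "updated": time.time()}

            def route(table):
                """Return the most optimal live server for a query touching `table`."""
                now = time.time()
                candidates = [
                    (info["score"], name) for name, info in servers.items()
                    if table in info["tables"] and now - info["updated"] < STALE_AFTER
                ]
                if not candidates:
                    raise LookupError(f"no live server holds {table!r}")
                return min(candidates)[1]    # lowest performance score = least loaded

            # e.g. apply_update("db07", 0.3, ["orders", "users"]); route("orders") -> "db07"
            ```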

            The whole thing is a bit complex, but it's a lot of the same pieces, so the documentation is relatively simple (at least no more so than any company's documentation and architecture diagrams). A lot of the complexity is just appearances: sure, it's a lot of pieces, but the pieces are simple and straightforward, so teams can operate without needing to know things end-to-end, only what is coming into their piece and what is supposed to come out the other end.

            1. DeCoder

              Re: Hooray for single points of failure!

              Wow... I won't ask more, as I'm sure you've shared more of the guts than most people would be willing to. Still, the piece my pea brain is struggling with is the tables...

              Anyway, based on my decades of experience, it seems pretty brave for your company to trust devs/techies so much that you would actually be allowed to work out such a complex process without big brand names stamped all over everything.

              Haven't worked with PostgreSQL, so I'm not familiar with how much of this arrangement leverages PGSQL's unique capabilities, and how much is just an incredible design from extremely smart techies.

              Believe it or not, I once designed a similar system - on virtual paper only - using Delphi and multi-tier tech, as I've worked in a couple of shops that leveraged multi-tier tech extensively, in order to solve similar problems.

              In both cases we went from managing large numbers of users on very little hardware efficiently, to having management cram crude M$ tech down our throats. In all instances all timelines for conversion projects were blown by 200-300%, as were the project budgets. AND stability, reliability, performance, flexibility etc. were all completely lost. Had to hire M$ developers for the new tech, who were in no way competent on some of the most fundamental concepts.

              Really depressing to watch companies implode from extreme stupidity.

              Thanks a lot! (grumble, grumble). Thanks to you I've now moved from impressed to downright jealous...

              All the best, thanks for sharing.

              Let me see, where did I put that resume again?...

  6. theblackhand

    Not so sure this is a revolution...yet

    The question for NVMe storage is how do you provide all of this potential IO to the processor?

    For general-purpose servers (and databases), this storage will always be further away from the CPU than cache/RAM, so it is slower and introduces latency challenges. These bottlenecks will continue to be addressed, but they are likely to remain the bottlenecks for NVMe storage until the next leap in communications buses (total IO to the CPU - memory, CPU interconnects, and system buses - peaks at around 200GB/s with the current generation) and likely CPU evolution allow the NVMe bandwidth to be fully used. By which time it will be the next CPU revolution...
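    Rough numbers behind that (the per-drive figure is an assumption for a typical PCIe 3.0 x4 NVMe SSD; the ~200GB/s budget is the figure above):

    ```python
    # Back-of-the-envelope: aggregate bandwidth of a 24-drive NVMe box versus
    # a ~200 GB/s total-IO-to-CPU budget (the figure quoted above). The
    # per-drive number is an assumption for a typical PCIe 3.0 x4 SSD.
    drives = 24
    per_drive_gb_s = 3.0                       # GB/s sequential read per drive (assumed)
    aggregate_gb_s = drives * per_drive_gb_s   # 72 GB/s from the drives alone
    cpu_io_budget_gb_s = 200.0

    share = aggregate_gb_s / cpu_io_budget_gb_s
    print(f"{aggregate_gb_s:.0f} GB/s of flash against a ~{cpu_io_budget_gb_s:.0f} GB/s "
          f"IO budget ({share:.0%}) that also has to carry memory traffic")
    ```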

  7. Anonymous Coward
    Anonymous Coward

    > The question for NVMe storage is how do you provide all of this potential IO to the processor?

    Thunderbolt is basically PCIe on an external cable, isn't it? A single Thunderbolt 3 port on a USB-C connector gives you 40Gbps. And if that's not enough, you can always have multiple connections to each server.

    Remember that if this is a single TOR storage box then its available bandwidth will be shared between all the servers in the rack.

    One presumes, however, that servers in one rack will need a way to access storage in another rack - for providing data replication if nothing else.

    1. theblackhand

      OK - to put it a slightly different way.

      One CPU can service one Thunderbolt port at full speed at present, and you might be able to scale that to a 4-socket box serving four full-speed Thunderbolt ports.

      How do you serve 8/16/32 ports from your TOR? Or will your NVMe storage only be shared by one or two compute nodes?
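      Rough totals, just to put numbers on that (the 40Gb/s per-port figure is from the post above; the port counts are the hypothetical TOR fan-out):

      ```python
      # Rough totals behind the 8/16/32 port question. The 40 Gb/s per
      # Thunderbolt 3 port figure is from the thread; the port counts are the
      # hypothetical TOR fan-out.
      per_port_gbps = 40
      for ports in (8, 16, 32):
          total_gbps = ports * per_port_gbps
          print(f"{ports:>2} ports -> {total_gbps:>5} Gb/s ({total_gbps // 8} GB/s) aggregate")
      # 32 ports is 1,280 Gb/s (160 GB/s), which is why a single CPU - or even a
      # 4-socket box - looks unlikely to keep them all fed.
      ```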

      I know AFAs can already saturate their IO links (assuming enough money is spent), and while there are some nice applications for NVMe, I don't see it in the same light that Oracle does. Much like 1Gbps Ethernet in 1999, it has removed a system bottleneck, but it needs the CPU to catch up before it proves its worth.

  8. Anonymous Coward
    Anonymous Coward

    Probably will be bigger than 2RU ... and someday smaller.

    Given the potential of 30TB and 60TB SSDs, there is a real prospect of over 1PB of post-RAID usable storage capacity using RAID-6. Add to that the possibility of storage efficiency technologies (deduplication, compression, and thin clones), and you have the prospect of 3PB-4PB of effective storage capacity. But to deduplicate and compress 1PB of usable storage takes a lot of CPU cycles, and a lot of memory for caching metadata. Then you will need to drive multiple 100Gb RDMA Ethernet or other I/O channels. That also requires CPU cycles. And if you want high availability, you need two controllers.
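    Worked numbers for that claim (the 24-bay count, a single RAID-6 group, and the 3:1 efficiency ratio are assumptions):

    ```python
    # Worked numbers for the capacity claim above. The 24-bay count, a single
    # RAID-6 group, and the 3:1 efficiency ratio are assumptions.
    drives = 24
    raid6_parity = 2                    # RAID-6 gives up two drives' worth of capacity
    for drive_tb in (30, 60):
        usable_pb = (drives - raid6_parity) * drive_tb / 1000
        effective_pb = usable_pb * 3    # assumed 3:1 dedupe/compression/thin-clone savings
        print(f"{drive_tb}TB drives: {usable_pb:.2f} PB usable, ~{effective_pb:.1f} PB effective at 3:1")
    ```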

    The second issue is that, while the U.2 (formerly SFF-8639) NVMe SSD connector will likely be widely used for disk shelves and servers, there is the possibility the 2.5" SSD form factor will be replaced by some other form factor, perhaps an enterprise version of M.2. Part of the problem with a 60TB SSD is that it becomes a very large and expensive failure domain and FRU.

    I think the more likely scenario is at least 3RU for the storage array with 24 NVMe U.2 drives, and at some point perhaps a smaller array: a 1RU device with two controllers in an HA configuration and about 100 M.2 22110 SSDs.

  9. Anonymous Coward
    Anonymous Coward

    Old news .. kind of.

    A lot of this is old news, and it misses one of the biggest points about NVMe .. the M in NVMe stands for Memory, not Storage. Using NVMe devices for storage is like using disk as tape .. useful, but not exactly transformative; all it really does is let you do the same thing you're doing now, but faster. If someone figures out how to do inline de-duplication at sub-microsecond latencies then maybe it will also be cheaper. (That's the key for most fast-growing innovations .. let the customer do the same thing they're already doing, faster and cheaper.)

    For a box with 24 flash drives using POSIX/SCSI-style device commands, NVMe doesn't actually help that much (though if you're limited to queue depths of less than 5, the lower latency can make a big difference). It's not until you start to use media that looks more like memory (byte-addressable as opposed to block-addressable) and start using completely different programming models that you'll see the kind of revolution that is being talked about here. Even then, if you want to centralise it and add some value-add data services (security, copy services, etc.) you'll probably find the CPU will be the bottleneck. Hell, even with a couple of next-generation devices the CPU will probably be the bottleneck: even the fastest of the current crop of all-flash array controllers limit performance after 4 or 5 SSDs, let alone with memory-bus-connected ReRAM or PCM or whatever media wins the persistent memory war.
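    To put a number on the queue-depth point, IOPS is roughly queue depth divided by latency (Little's law); the latency figures below are assumed ballparks, not measurements:

    ```python
    # Little's law: IOPS is roughly queue depth / latency. At the low queue
    # depths mentioned above, cutting latency is the only way to get more IOPS
    # out of a device. The latency figures below are assumed ballparks.
    queue_depth = 4
    for name, latency_us in (("SAS/SATA-era SSD", 200), ("NVMe SSD", 80)):
        iops = queue_depth / (latency_us / 1_000_000)
        print(f"{name}: QD{queue_depth} at {latency_us}us -> {iops:,.0f} IOPS")
    ```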

    Most of the big vendors already have the kinds of products described in this article either announced, or just one "sustaining innovation" away from release. Like I said .. it's mostly old news.

  10. Anonymous Coward
    Anonymous Coward

    Interesting design, moving Tier 0 to the top of rack. Glad to see that NVMe is being discussed here and recognized as an important next step in storage. Sure, theblackhand's comments are fair and NVMe will help push the next generation of communication buses. Moving the bottleneck always has an impact on innovation, so this is a good thing.

    NVMe is something that should be on every SAN admin's radar. Today the only storage vendor I am aware of that has NVMe support built into their existing SAN is Pure Storage (see link below). They have built out some pretty nifty marchitecture around its NVMe guarantee. Basically, the FlashArray //M chassis is wired for both SAS and PCIe/NVMe in every flash module slot. So nothing really new here, as the FlashArray //M always had NVMe support, but now it is being called out as we get closer to NVMe primetime.

    1. Anonymous Coward
      Anonymous Coward

      Adding NVMe drives won't really help Pure that much unless they can find a way of packing way more CPU into those //m chassis. Right now the //m models are at about 2-3 IOPS/GB (300,000 IOPS / 150TB), where the performance density of the drives is about 10X that. Making the drives respond faster might help their performance when they're doing garbage collection and give them some more predictability, but it probably won't help top-line performance out of the array.
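      The maths, for anyone checking (the 10X drive-level figure is the claim above, not a measurement of mine):

      ```python
      # The performance-density maths from the post above; the 10X drive figure
      # is the poster's claim, not a measurement.
      array_iops = 300_000
      array_capacity_gb = 150 * 1000                    # 150 TB
      array_density = array_iops / array_capacity_gb    # 2.0 IOPS/GB at the array level
      drive_density = 10 * array_density                # claimed drive-level density
      print(f"array: {array_density:.1f} IOPS/GB, drives: ~{drive_density:.0f} IOPS/GB (claimed)")
      ```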

      I think you also forgot about DSSD (which is probably the best example out there today for this kind of top-of-rack NVMe, but it's not just a drop-in replacement for a SAN), and NetApp's latest crop of controllers all have NVMe built into them as a caching tier between RAM and flash. There are other examples, but I'd probably have to bend NDA more than I should to talk about them in detail.

  11. andy_tech_uk

    VMware are the answer

    "Many storage suppliers have proposed/speculated that they are the VMware of storage and, so far, none has emerged to sit on that virtual storage throne." - Except maybe VMware? Consider VMware vSAN with an AllFlash 16TB NVMe config!
