WekaIO pulls some Matrix kung fu on SPEC file system benchmark

Startup WekaIO has apparently walked all over IBM's Spectrum Scale parallel file system with a doubled SPEC SFS2014 benchmark score for its Matrix software running on Supermicro servers. The benchmark tests the performance of filers, and we're looking at the number of software builds completed in a run and the overall …

  1. Anonymous Coward

    Apples and cheese?

    So WekaIO needed 64 drives to Spectrum Scale/E8's 24 to get double the number of builds? We're not exactly comparing like with like here, are we? Frankly, I think we're going to see these records tumble for some time as vendors throw more NVMe hardware at the spec.

    The old TPC specs for databases used to rate submissions on TPS/$ -- maybe SpecFS needs to go down the same road.

    Also interesting to note that Spectrum Scale "walks all over" WekaIO on Overall Response Time, if that matters to anybody's workload...

    1. Liran Zvibel

      Re: Apples and cheese?

      The Spectrum Scale results used client RAM caching (with write back) of 30 seconds, so they did not test the same small-IO performance as the rest of the submissions.

      Actually, talking about apples and cheese, this is a much bigger difference.

      You could argue that this is a sensible user setting (that WekaIO also supports), but when comparing IO benchmarks you would expect to compare IO and not client caching.

      If you inspect their graph (and compare it to any other SpecSFS submission) you will see that the initial 230 concurrent builds have latencies that are physically not possible with NAND flash, so they obviously did not hit flash. Since the client memory is not protected in any form (unlike the RAM in the NetApp, for example), this may not even count as primary storage per the SpecSFS definition.

      I suspect the main reason they picked that setting was to improve average response time.

      1. Anonymous Coward

        Re: Apples and cheese?

        Liran,

        As the CEO of WekaIO, the company in this case trying to twist results to compete with Spectrum Scale, you do a lot of guessing and spread false information about how Spectrum Scale works. Please stop doing this. You should have disclosed in your post that you work for the company, indeed that you are its CEO, as that is considered professional behaviour when commenting on a product your company is trying to compete with.

        I myself work on Spectrum Scale and actually know how it works. I don't post here as an IBM employee but simply as a person with the knowledge and interest to correct some of your wrong claims, and to educate you on the entry bar for claiming bragging rights.

        First, there is no write cache in Spectrum Scale on the filesystem clients, for the exact reason you mentioned: committed data would be lost in the case of a power loss. All writes that the application requests to be stable are on stable storage before Scale ever acknowledges them back to the application layer, just like every other filesystem solution that has published SPEC SFS results, including yours.

        What the parameter you refer to actually does is make sure that any data the application has NOT explicitly committed is FORCED to stable storage within that time; it does not delay anything. The default, if not set, is whatever the Linux distro's sync interval is, which can be anything between 5 seconds and 120 seconds. Explicitly setting it to 30 seconds simply means that even if the OS didn't sync the data itself, it is committed to stable storage within 30 seconds.
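        For anyone curious which OS-level defaults that setting interacts with, here is a minimal sketch (it assumes a Linux host with the standard procfs writeback tunables and only illustrates the kernel's dirty-page flush behaviour, not any Spectrum Scale internals):

```python
# Minimal sketch: inspect the Linux dirty-page writeback defaults that an
# explicit 30-second sync interval sits alongside. Assumes the standard
# procfs paths on a Linux host; purely illustrative, not Spectrum Scale code.
def read_centisecs(path):
    with open(path) as f:
        return int(f.read()) / 100.0  # the kernel stores these in centiseconds

expire = read_centisecs("/proc/sys/vm/dirty_expire_centisecs")        # age at which dirty data must be written back
writeback = read_centisecs("/proc/sys/vm/dirty_writeback_centisecs")  # how often the flusher thread wakes up

print(f"dirty data older than {expire:.0f}s is flushed; the flusher runs every {writeback:.0f}s")
```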

        I am glad you agree that the latencies on Spectrum Scale are exceptionally low and almost unbelievably fast. Unfortunately, the reason for that is not what you claim it is. The latency numbers on Scale are so exceptionally low because of multiple factors, just to mention a few:

        1) Scale has a very advanced distributed coherent read cache, so some of the stats/reads SPEC SFS does can be served from cache. The benchmark's data set is far too big to cache a performance-relevant amount of it, but for metadata operations Scale does get some benefit from this.

        2) Scale's latency, despite the FUD you spread that it is optimised for HDDs, is exceptionally low, lower than almost any other filesystem on the market, even compared to some local filesystems. The Scale filesystem overhead, including network, is in the double-digit microsecond range, lower than most NAND devices themselves.

        To lend some credibility to my claims, this presentation --> http://files.gpfsug.org/presentations/2017/SC17/SC17-UG-CORAL_V3.pdf shows end-to-end latency with Spectrum Scale against remote, not local, NVMe devices of ~70 usec on read and ~80 usec on write, which includes media, network, filesystem and the path to/from the application buffer. So the SPEC SFS response time numbers of ~100 usec are absolutely possible.

        That raises a couple of interesting questions. How can a filesystem whose vendor claims a "flash native file system such as WekaIO Matrix™, out-performs Spectrum Scale by a wide margin regardless of the workload" (a pretty bold claim, btw) have a minimum latency of ~500 usec for a mix of requests when the media used (link to the spec --> https://www.micron.com/~/media/documents/products/data-sheet/ssd/9100_hhhl_u_2_pcie_ssd.pdf) has a rated latency of 120 usec for read and 30 usec for write? Where is the remaining time spent? Filesystem inefficiency? A bad RAID implementation?
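        To make the arithmetic behind that question explicit, here is a rough back-of-the-envelope sketch (the ~500 usec figure and the rated media latencies are the numbers quoted above; the 70/30 read/write mix is only an illustrative assumption):

```python
# Rough latency-budget arithmetic for the question above. The media latencies
# are the rated figures quoted in the post; the 70% read / 30% write mix is an
# illustrative assumption, not anything specified by SPEC SFS.
media_read_us = 120
media_write_us = 30
observed_min_us = 500

mix_media_us = 0.7 * media_read_us + 0.3 * media_write_us   # weighted media latency = 93 usec
overhead_us = observed_min_us - mix_media_us

print(f"media accounts for ~{mix_media_us:.0f} usec; "
      f"~{overhead_us:.0f} usec is left for network, filesystem and client stack")
```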

        For your own education on what you need to achieve to have bragging rights in the distributed filesystem space, you should take a closer look at some of the other charts in the presentation shared above. You claim to outperform Spectrum Scale "by a wide margin regardless of the workload"; can you share an example of how you deliver 23 GB/sec from a single client (page 13)? Btw, this is limited by the 2 x 100Gbit network links used, not the Scale SW stack.

        Or how you do 2.5 TB/sec in a single filesystem? Or a workload doing 2.6 million file creates per second, writing 32k into each file and flushing it after the write, sustained for 20 minutes? That's roughly 3.1 billion files within 20 minutes, to help with the math.
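        A quick sanity check on those numbers (figures as quoted above; 1 Gbit taken as 0.125 GB/s):

```python
# Quick arithmetic check on the figures quoted above.
link_gbit = 2 * 100                       # two 100 Gbit links per client
line_rate_gbs = link_gbit / 8             # = 25 GB/s theoretical wire ceiling
print(f"23 GB/s from one client vs a {line_rate_gbs:.0f} GB/s wire ceiling")

creates_per_s = 2.6e6                     # file creates per second
duration_s = 20 * 60                      # 20 minutes
total_files = creates_per_s * duration_s  # ~3.1 billion files
print(f"total files created: {total_files:.2e}")
```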

        How about an easy target of 14 million stat operations/sec on an 11-node cluster (page 24)?

        The publication by E8 uses 1/1000th of the in-production supported scalability of Spectrum Scale (which is limited by a #define and test resources), and the system is proven to scale linearly for independent work entities, which is essentially what SWBUILD does. Scale has multiple customers with thousands of nodes; it's simply a matter of doubling the HW and the SPEC SFS build number will double, multiply it by 4 and you get 4x the result, and we can continue this for a while, as long as somebody is willing to pay the HW costs for such an excessive benchmark setup.

        Sven

  2. Arthur A.

    Not so impressive if you know that NetApp used NL-SAS HDDs in their test configuration.

    1. Liran Zvibel

      The NetApp system used 9.2TB of memory plus NVMe flash cache to achieve 520 builds, at much higher latency than WekaIO's 2.9TB of memory for 1200 clients.

      Between the huge amount of RAM and the NVMe FlashCache (which together were bigger than the aggregate capacity of the benchmark), I suspect that almost no IO actually hit the HDDs, and they were only used to increase system capacity.

      Also, in "practical" terms -- the NetApp street price for that box is about 3x that of the WekaIO solution on Supermicro, so with or without HDDs a WekaIO client costs less than 15% of a NetApp client, and that is what really matters.

      WekaIO also supports tiering, so if larger capacity is needed, it can be achieved with WekaIO as well.

      1. Arthur A.

        The ORT of NetApp is 1.04 ms and WekaIO's is 1.02 ms. Do you call that much higher latency?

        NVMe FlashCache is a read-only cache. I expect SFS2014_swbuild has a fair number of write operations, and I suppose there is a huge amount of file metadata change.

        NetApp has 8 times more usable space. I suppose that doesn't make a lot of sense for storing source code, but clients do not buy storage for only one type of workload.

        Anyway, my first point was that, for me, just 2 times more builds on a system with NVMe SSDs versus NL-SAS drives, at the same ORT, is not very impressive.

  3. This post has been deleted by its author

  4. Bad Hombre

    Marketing Bull

    It looks like IBM Spectrum Scale is 33% faster per NVMe drive:

    Weka: 1200 Builds/64 SSDs = 18.75 Builds/SSD

    IBM: 600 Builds/24 SSDs = 25.00 Builds/SSD

    1. Liran Zvibel

      Re: Marketing Bull

      As mentioned earlier -- Spectrum Scale was configured with 30-second write back, so IOs were terminated in the clients' unprotected RAM, then aggregated and sent on to storage.

      This is the only SpecSFS submission that enabled RAM caching by clients; SpecSFS is a storage benchmark, not a memory benchmark.

      If you look at the beginning of the graph, the first 230 builds (out of 600, so a hefty proportion!) have latencies that are not achievable on NVMe if you consider NAND flash physics, which is also the reason the average is lower.

      I wonder why they chose that configuration option, and how the results would have looked if they had hit NVMe from the first IO...

      Take a look at the Spectrum Scale graph and see how quickly latency rises once it starts hitting the NVMe devices...

      1. Bad Hombre

        Re: Marketing Bull

        Good point, did not see that. You know your stuff!

        1. Throatwarbler Mangrove

          Re: Marketing Bull

          Yes, well . . .

        2. Anonymous Coward

          Doubling down Re: Marketing Bull

          He does not know his stuff nearly as well as he thinks. His claim about Scale using an unprotected client cache to outperform Weka was debunked above, in detail, the first time he made it.

      2. CheesyTheClown

        Re: Marketing Bull

        Hi Liran,

        Nice to see someone in your position actually commenting on the article.

        I'm a long-time file system and storage protocol developer. I spent many years trying to solve storage problems at the file system level, and I've now moved further up the stack, as I believe there are rarely cases where a high-performance distributed file system is really the answer, as opposed to a better design further up the stack.

        For example, the SpecSFS test is building code, which is obviously quite a heavy task. I spend most of my life waiting for compiles and would always welcome better solutions. But I have already seen huge improvements by moving away from poor languages like C and C++ towards managed languages, which have endless performance and security benefits over compiled languages.

        Now, compiling code has always been a heavy process. Consider that most development houses have a complete rat's nest of header-file dependencies in their code; simply using a library like Boost or the standard C++ library can cause decades of programmers' lives to be lost. Of course, the local operating system will generally RAM-cache most files once they've been read once... making the file system irrelevant. But compiling something that produces a large number of object files (such as the Linux kernel) on a system with anti-malware protection will kill performance in general.

        There are many solutions for distributing compilation across multiple systems, and tools like Incredibuild handle it in a far more intelligent manner than placing a large burden on the file system. Testing file access in that context is therefore rather meaningless, because it presents a higher-performance file system, rather than a distributed compilation environment, as the solution. Simply precompiling the headers and distributing them, along with the code to be built, to the other systems is far more intelligent.

        Then there's the case of data storage and manipulation. Your product makes a big point of running side by side with compute on large nodes that also hold the storage. In terms of making file I/O perform better, building a better distributed file system that implements the POSIX APIs makes a lot of sense... if you're interested in treating the symptoms rather than the underlying problem.

        When working with huge numbers of nodes and huge data sets, the data in question is generally structured in at least some way that can be considered object oriented. It may not be relational, but it is generally something that can be broken down into smaller computing segments.

        Consider mapping a DNA strand. We could have hundreds of terabytes of data if we store more than simple ribosome classification; if we stored the molecular composition of individual ribosomes, the data set would be massive. In that case, each ribosome can be structured as an object which can be distributed and scheduled most intelligently in a database that handles hot and cold data distribution across the cluster through either sharding or shared-nothing record replication.

        Consider the storage from a collision within an LHC experiment. The data is a highly structured representation of energy readings which themselves are not structured... or at least not until we've identified their patterns. As such, the same general principle of shared-nothing database technologies makes sense.

        To have a single distributed file system store this data would be quite silly, as the data itself is better represented as a massive number of database records or objects than as files.

        The only place I know of anymore where large-scale file systems make sense is virtual machine image storage. And in that case, since VMware has one of the most impressively stupid API licensing policies EVER... you can't generally depend on supporting them in a meaningful way. They actually wanted to charge me $5000 and make me sign NDAs blocking me from open-sourcing a VAAI NAS driver for VMware. I simply moved my customers away from VMware instead... that was about $5,000,000 lost for them. In addition, if I had to install a vib to support a new file system, I'd be nervous, since VMware famously crashes in flames constantly due to either storage API or networking API vibs.

        That said, VM storage for Hyper-V, KVM and Xen is a great place to be. If I'm using Hyper-V, I'll use Storage Spaces Direct; for KVM or Xen, I can see room for a good replacement for Gluster and the others.

        So, now that I've hit you with a book... I'm interested in hearing where your product fits.

        I read your entire web page because you sounded interesting, and I found your technology to be genuinely intriguing. Under different circumstances, I'd probably even ask for a job as a programmer to have some fun (it's sad, but I find writing distributed file systems fun). But I simply don't see the market segment this technology targets. Is it meant as file storage for containers? Is there something that makes it suitable for map/reduce environments beyond better database-tier distribution?

        I look forward to hearing back. I get the feeling you and I could have some absolutely crazy (and generally incomprehensible) conversations at a pub.

        P.S. - I'm working on a system now that would probably benefit from technologies like yours if I wasn't trying to solve the problem higher up in the stack. I may still need something like this later on if you start looking towards FaaS in the future.

        1. Anonymous Coward

          And yet... Re: Marketing Bull

          So wrong:

          >Consider mapping a DNA strand. We could have hundreds of terabytes of data if we store more than

          >simple ribosome classification.

          ...

          >Consider the storage from a collision within an LHC experiment.

          ...

          >To have a single distributed file system to store this data would be quite silly

          ...

          And yet very smart organisations doing DNA mapping, physics experiments (CERN and others), etc. are doing precisely the thing you consider "quite silly". Look at CORAL at the National Labs for one well-publicised example. Is it possible that they understand the requirements better than you do?

          >The only system I know of anymore where large scale file systems makes sense is

          >virtual machine image storage.

          Then I guess you don't know of many systems. For example, these file systems are widely used in the financial industry, and many big-data analytics workloads also run on them. They are even used for such prosaic tasks as providing file storage to multiple tenants in cloud services, allowing the service provider to use shared physical infrastructure for a very large population of relatively small users rather than having separate systems for each one, which would be grossly inefficient.

  5. RollTide14

    FileSystem for AI

    Those are some super impressive results!

    I see this is billed as an FS for DL/AI workloads, correct? I was under the impression that parallel file systems were terrible at handling small random access patterns. How does WekaIO address this?

    1. Shimon

      Re: FileSystem for AI

      WekaIO is the only distributed file system that was architected from day 1 for NVMe and flash devices, so it is able to handle small files as well as big/huge/etc... files with the same high efficiency. Since there is no data locality and all of the data is spread evenly across all of the components, it is highly efficient at random access, utilising many components in parallel (instead of a small number of bottlenecked components or a RAID group, as in legacy environments).
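      To illustrate the general idea of locality-free placement, here is a generic sketch of hash-based striping (this is not WekaIO's actual placement algorithm; the device count and chunk size are arbitrary assumptions):

```python
# Generic sketch of locality-free data placement: each fixed-size chunk of a
# file is hashed to a backend device, so a burst of small random IOs fans out
# across many devices in parallel. Illustrative only -- not WekaIO's algorithm.
import hashlib

NUM_DEVICES = 16          # assumed backend device count
CHUNK_SIZE = 4096         # assumed placement granularity in bytes

def device_for(file_id: str, offset: int) -> int:
    """Map a (file, chunk) pair deterministically to a device."""
    chunk = offset // CHUNK_SIZE
    digest = hashlib.sha256(f"{file_id}:{chunk}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_DEVICES

# A burst of small random reads against one file lands on many devices at once.
offsets = [i * 13 * CHUNK_SIZE for i in range(32)]
devices = {device_for("file-42", off) for off in offsets}
print(f"32 random chunks hit {len(devices)} of {NUM_DEVICES} devices")
```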

      1. Axel Koester

        Re: FileSystem for AI

        Disclaimer: IBMer here.

        Interesting discussion... but I don't agree with this statement:

        > ...the only distributed file system that was architected from day 1 for NVMEs and flash devices, therefore It is able to handle small files ...

        The ability to handle small writes does not solely (and should not) depend on the availability of flash devices that can do the job for you. Techniques like distributed log-structured small-write buffering give decent performance gains on "old" flash technology and HDDs alike, and the same is true for NVMe devices - which basically offer faster access and less queueing, but are still EEPROMs at the core.
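        As a rough illustration of the log-structured small-write buffering idea, here is a minimal generic sketch (the buffer threshold and flush behaviour are assumptions for illustration, not Spectrum Scale internals):

```python
# Minimal, generic sketch of log-structured small-write buffering: small
# random writes are appended to an in-memory log and flushed to the media as
# one large sequential write. Illustrative only -- not Spectrum Scale code.
class WriteLog:
    def __init__(self, flush_threshold=1 << 20):        # assumed 1 MiB threshold
        self.entries = []                                # (offset, payload) records
        self.bytes_buffered = 0
        self.flush_threshold = flush_threshold

    def append(self, offset: int, payload: bytes) -> None:
        """Accept a small random write by logging it sequentially."""
        self.entries.append((offset, payload))
        self.bytes_buffered += len(payload)
        if self.bytes_buffered >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write out the accumulated log as one large sequential segment."""
        segment = b"".join(payload for _, payload in self.entries)
        print(f"flushing {len(segment)} bytes covering {len(self.entries)} small writes")
        self.entries.clear()
        self.bytes_buffered = 0

log = WriteLog()
for i in range(300):
    log.append(offset=i * 8192, payload=b"x" * 4096)     # many 4 KiB random writes
log.flush()                                              # drain whatever is left
```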

        So if "architected for NVMe" - i.e. not using the system block IO driver - means "we didn't bother implementing media accelerators because the media is quick enough", then that is not the right way forward IMHO, because others will be able to copy it easily once NVMe is widespread. Intelligent distributed metadata management is a better investment.

        Btw, "IOs were terminated by clients unprotected RAM"... huh? I don't think that exists in *any* of the IBM Spectrum storage products. The secret of the exceptionally low latency in Spectrum Scale is non-blocking metadata management and parallelism, plus very shallow software stacking. This worked for terabytes, works for petabytes, and will work for yottabytes. Page 15 ('client - server - device roundtrip') in the CORAL presentation shows what to expect: 0.074 ms average latency. http://files.gpfsug.org/presentations/2017/SC17/SC17-UG-CORAL_V3.pdf

        Thank you for the discussion!

        Axel
