The TPC-C/SPC-1 storage benchmarks are screwed. You know what we need?

Comment The storage benchmarking world is broken because there are no realistic and practical storage benchmarks with realistic workloads customers can apply to systems. So says storage analyst Howard Marks, and he aims to fix this mess with the help of a consortium of industry players. He says the world of storage …

  1. Anonymous Coward

    Is this similar to...

    mitrend.com? Just expanding to a broader audience?

    1. Nuff Said

      Re: Is this similar to...

      Err... unless Mitrend have significantly changed their business since I was unfortunate enough to work at EMC, no. They analyse specific workloads; this is about creating a homogeneous view over many workloads in order to benchmark, which is very different.

  2. PaulHavs
    Stop

    The age-old problem with POCs is the focus on perfect-world scenarios.

    Whereas, through life, what customers will remember, and what will bite them, are the imperfect scenarios.

    How does the system perform under all the failure situations likely to be encountered over the 3 to 5 year lifespan? How will operating in a degraded state affect performance? How much failure can be sustained before the system stops servicing IO requests? And so on.

    Good luck with another new synthetic workload... we don't need it. We just need customers to run their own workload in a POC and do some realistic failure testing.

    /Paul H - HPE Storage.

    1. DeepStorage

      Nothing beats real applications - BUT

      Testing with your real applications is harder to do than it is to say. Sure, the F500 folks have HPE LoadRunner scripts that pretend to be their users and customers accessing the application via their web interfaces, BUT in 30 years of consulting to mid-market companies (my clients were typically $500 million to $1.5 billion in revenue) I've never seen a client that could generate 125% of their peak day's load against a dev/test copy of their application, let alone an application they were only planning to install.

      If you're a mid-market customer a vendor will be happy to lend you kit for 30-60 days, but since you're busy you can only spend 10 person-days on the POC. Even though you're planning on spending $100,000 or more on a storage system you just don't have the time, or skills, to make your production applications (all of them if this is your storage for VMs) generate the load they do when flesh and blood users are running them.

      So we're aiming at giving those people a way to test storage. Our synthetic workloads will be a decent first approximation of real applications with mixed I/O sizes, realistic hotspot sizes and realistic data reducibility.
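A minimal sketch of what "mixed I/O sizes" and a "realistic hotspot" could mean in a generator, offered only as an illustration and not as the project's actual code. The function name `synthetic_ios`, the size mix, the 10% hot region receiving 80% of accesses and the 60/40 read/write split are all assumptions made for the example; data reducibility is not modelled here.

```python
import random

ALIGN = 4096  # assumed alignment for every offset


def synthetic_ios(n, capacity_bytes, rng,
                  sizes=(4096, 8192, 65536), size_weights=(0.6, 0.3, 0.1),
                  hotspot_fraction=0.1, hotspot_hit_rate=0.8, read_ratio=0.6):
    """Yield n (op, offset, size) tuples with mixed sizes and a hot region."""
    # The hot region is the first hotspot_fraction of the device,
    # rounded down to a multiple of ALIGN.
    hot_limit = int(capacity_bytes * hotspot_fraction) // ALIGN * ALIGN
    for _ in range(n):
        # Pick an I/O size from the weighted mix instead of a single fixed size.
        size = rng.choices(sizes, weights=size_weights, k=1)[0]
        # Send most requests to the hot region, the rest to the cold remainder.
        if rng.random() < hotspot_hit_rate:
            lo, hi = 0, hot_limit
        else:
            lo, hi = hot_limit, capacity_bytes
        # Choose an aligned offset so the whole request stays inside the region.
        slots = (hi - lo - size) // ALIGN
        offset = lo + rng.randrange(slots + 1) * ALIGN
        op = "R" if rng.random() < read_ratio else "W"
        yield op, offset, size


if __name__ == "__main__":
    # Print a handful of requests against a hypothetical 1 GiB device.
    for io in synthetic_ios(5, 1 << 30, random.Random(42)):
        print(io)
```

Passing in a seeded `random.Random` keeps a run repeatable, which matters if two systems are to be compared against the same request stream.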

      To address your other concerns, the cookbook will include measuring how performance is affected when faults are introduced. We're even planning to force a drive/node rebuild.

      This is very much NOT an SPC-like org where the goal of the whole project is to create a hero number.

      - Howard

      PS: If you want to be kept up to date on our progress please leave your contact info at: http://www.theotherotherop.org/wordpress/contact-us/

      PPS: HPE is a charter member of The Other Other Operation so someone there sees some value.

      1. Nate Amsden Silver badge

        Re: Nothing beats real applications - BUT

        Oh, don't even get me started on load testing :)

        In my 13 or so years working at various SaaS-type companies I have never, EVER seen a valid load test against an application (across many companies). Every test has always fallen far short of real production load.

        My favorite test was at an advertising company I was at. The front end app was basically pixel tracking, so very lightweight, with no external dependencies beyond the single server that the app lived on. The "performance" testing we did there was basically to shift enough production load onto a single server until it could no longer keep up; the load balancer automatically routed requests to other systems when the system under test went down. At the peak we managed to get about 3,200 HTTP transactions a second from a server (disk I/O bound). I assume by now (that was many years ago) they probably use SSDs or something to help with that.

        Worked really well, but obviously very dangerous to do in a more sophisticated environment with shared components like databases or caching servers.

        At my first SaaS job, over a decade ago, it was semi-routine to have to double production application capacity after a major release (and of course we didn't plan for it, so we would order HP servers and have them shipped overnight). At some point the need to double stopped but it was there for quite a while.

    2. Nate Amsden Silver badge

      As a 3PAR customer for the past 10 years I say I do want another synthetic test, the more data the better, especially if it has full disclosure requirements like the SPC seems to.

      POCs are very time consuming to do, and many vendors won't do them at all. NetApp refusing a POC 10 years ago is why I became a 3PAR customer, and NetApp again said 5 years ago it would be very difficult to do a POC (not that we could do one anyway, but I was curious what the situation was with them and POCs at that time). I'm sure it's difficult for other vendors too (though the startups generally don't have an issue with POCs since they want customers so badly).

      If you got your list down to 1 or 2 systems to test then maybe it is realistic to try to do real workloads. But I go back again to my 3PAR experience, with 3PAR saying several customers wound up feeling they were "forced" to buy the system because it was in production (and working well). For me, when I did my first 3PAR evaluation (a small E200), I made sure it wasn't even in the same data center as our production stuff; I didn't want to get stuck in that situation.

      When the company I am at now did our initial data center specs (5 years ago), we literally had zero equipment. No racks, no servers, no storage. Nothing to do a POC on (everything was "cloud" based at that time). So it was impossible to do a POC (in any reasonable time frame anyway).

      Synthetic tests are far from perfect, but even though some people claim SPC-1 is "broken", there are no other alternatives at this point.

      If there's a dozen systems or more to choose from, POCs from everyone is just crazy.

      Fortunately for me my choices these days are pretty simple: 3PAR is good enough, fast enough, cost effective enough, and most importantly mature enough for everything I need. Well, except file storage: I'm looking to do a small real Isilon POC soon, after testing their Isilon SD Edge and having it implode in my tests due to the software architecture they have. (Their hardware works around some of that with metadata acceleration not available in the software product, something EMC staff seemed to know nothing about when they pitched SD Edge as being basically identical to what runs on bare metal.) Wish HP had good file services... and data compression for SSDs (I sort of feel like I should give up asking what the status of that is at this point, after 4+ years of asking for that feature).

  3. TheVogon Silver badge

    https://xkcd.com/927/

    1. DeepStorage

      Sadly true

      But if there are 14 bad attempts we still have to create number 15

  4. Anonymous Coward

    but

    ...wait...I thought Jetstress was the sh*T!?

    Simply, nothing simulates workload like the real workload! No marker/magic_number/unicorn_score is EVER going to answer everyone's varying requirements for storage.

    Yes, a Ferrari is faster than a campervan, but you can't get 4 kids, your wardrobe and the kitchen sink in a 488 Spider.

  5. DougS Silver badge

    The problem is that there's a moving target

    Imagine you did this 10 years ago. You would not take deduplication or SSDs into account in your testing scenarios. So let's say someone creates something that meets these specs. Now XPoint comes out, or using cloud for tier 3 becomes a popular built in feature....suddenly your test scenarios have changed.

    I agree that current storage benchmarks suck, but they have ALWAYS sucked. Even when TPC-C was new everyone realized the problems with it. If you wanted to get better numbers, you'd fill the entire array with short stroked 15K drives - yeah, that would up your $/transaction by using a configuration almost no one would ever actually use but the top line figure looked great!

    No matter what benchmarks we get, there will be a way to game them, and it will take some fairly deep knowledge to read the submission and figure out how they gamed them and whether that's applicable to your situation or not.

    1. DeepStorage

      Re: The problem is that there's a moving target

      Agreed and the reason I decided to take on this project. We have to build benchmarks for current systems and we have to continue to advance the state of the art in storage testing to keep up with the state of the art in storage.

      The problem is that we created incredibly simplified workloads (4K, 60/40, 100% random) which worked well enough (±25-50%) in the day of disk arrays with small RAM caches but really broke down as we added flash caches and data reduction.

      It's also important to note that The Other Other Operation isn't about vendor reports with hero numbers, although I'm sure we'll provide a way to do that and then vendors will find a way to game it, but about the cookbook for users to run their own POCs.

      The cookbook not only includes the code, and instructions on how to run the code, but also instructions on how to set things up realistically and how to interpret the results. All designed to make gaming the system harder and less valuable.

      - Howard

  6. Trevor_Pott Gold badge

    Every criticism and complaint you could level, I promise you Howard has heard and considered. A dozen times over. This isn't some nobody, or some partisan vendor shill. It's Howard Marks. And he's not alone; he's put together a team of the best to build this thing.

    Nothing's ever perfect, but this benchmark will be as close as one can get for storage. Howard knows no other way.

  7. Anonymous Coward

    More nonsense trash-talking on the SPC/TPC benchmarks...

    Howard's building a straw-man argument that is purely imaginary when he says "Even worse vendors continue to publish test reports that game the system by using a data set smaller than the system under test’s cache..."

    Howard -- FYI there has only ever been one (1) SPC-1 that ran entirely in DRAM -- this came from Texas Memory Systems before IBM bought them. Stop propagating falsehoods.

    http://www.storageperformance.org/results/a00063_TMS_RamSan400_full-disclosure.pdf

    All storage systems use (power-protected) DRAM for cache and more is always better, until the higher costs of DRAM outweigh the performance gains needed for the workload. Hint -- this is why "In-Memory Databases" are all the rage now...

    1. DeepStorage

      Re: More nonsense trash-talking on the SPC/TPC benchmarks...

      When I said "Vendors publish test reports using data sets smaller than the cache" I was neither referring specifically to SPC nor to a RAM cache.

      It's common for vendors to publish reports, or even "How to run a POC" manuals for hybrid flash/HDD systems that test with workloads smaller than the flash cache in the system. Those are the shenanigans I'm calling out.

      - Howard

      1. Anonymous Coward

        Re: More nonsense trash-talking on the SPC/TPC benchmarks...

        "It's common for vendors to publish reports, or even "How to run a POC" manuals for hybrid flash/HDD systems that test with workloads smaller than the flash cache in the system. Those are the shenanigans I'm calling out."

        There is a very simple enhancement to the SPC-1 benchmark that I think would go a long way to address this shenanigan: To the existing metrics add a ($/workload size) metric.

        The reason it works is simple: if you pick a workload size that is so small that most of it fits in cache, then the $/workload size metric would be very high. The metric has an additional advantage in that you don't need to re-run benchmarks, with existing data you can calculate it.
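To make the arithmetic concrete, here is a small sketch of the proposed metric. It is not part of the SPC-1 specification; the function name `dollars_per_workload_gb` and the two submissions with their prices and data set sizes are made up purely for illustration.

```python
def dollars_per_workload_gb(total_price_usd: float, workload_size_gb: float) -> float:
    """Price of the tested configuration divided by the size of the data set it was tested with."""
    if workload_size_gb <= 0:
        raise ValueError("workload size must be positive")
    return total_price_usd / workload_size_gb


# Hypothetical submissions: the figures are invented, not taken from any report.
submissions = {
    "array_a": {"price_usd": 500_000.0, "workload_gb": 2_000.0},   # small, cache-friendly data set
    "array_b": {"price_usd": 400_000.0, "workload_gb": 50_000.0},  # much larger data set
}

# Rank from cheapest to most expensive per GB of tested workload.
ranked = sorted(
    submissions,
    key=lambda name: dollars_per_workload_gb(
        submissions[name]["price_usd"], submissions[name]["workload_gb"]
    ),
)

if __name__ == "__main__":
    for name in ranked:
        s = submissions[name]
        print(name, dollars_per_workload_gb(s["price_usd"], s["workload_gb"]))
```

With these invented numbers the small-data-set submission works out to $250 per GB against $8 per GB for the larger one, which is the kind of gap the metric is meant to expose.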

        I have always calculated this metric for myself for all the SPC-1 submissions, but few others seem to, and I don't recall any Register article listing that.

        1. Trevor_Pott Gold badge

          Re: More nonsense trash-talking on the SPC/TPC benchmarks...

          So you're upset because you're considered one of the top independent storage industry analysts in the world? Perhaps you might consider actually doing something worthy of note.

          Oh, right, it's far easier to snipe anonymously in a forum. Here's an idea: you can start being of note by using your real name, coward. Then we can start to compare your achievements to Howard's, and see whose advice about the necessity of a proper storage benchmark we should be trusting.

  8. Anonymous Coward

    Better Vendors?

    "Even worse vendors continue to publish test reports..."

    But what do the better vendors do?

    1. DeepStorage

      Re: Better Vendors?

      The better vendors hire DeepStorage to test their gear and write reports.

      I couldn't resist; it was a great straight line.

      - Howard

  9. bkrosnov

    IMHO, with very few exceptions, no customer knows the size of their active set or performance requirements of their application. Almost all ignore latency, and only look at high queue depth IOPS and MB/s numbers (storage vendors' fault). Testing with "dd if=/dev/zero" is common, and everyone has their favourite ad-hoc test tool and favourite FIO parameters.

    Another aspect which is often ignored is the effect of the complete storage stack. The same workload behaves very differently if run on files in ext4 through the Linux buffer cache, vs directly on a block device. Applications live on top of actual real-world (guest) OS, so ignoring it in testing is silly.

    Characterizing the workload can help a long way in designing synthetic tests to emulate it. Example things to measure, which would be very useful for storage system characterisation:

    - size of active set over different time-spans, separate for reads and writes -- 1-hour active set is different from 7-day active set.

    - sustained and peak random operations per second per size and per direction (read/write)

    - io depth

    These are all accessible through good quality storage traces. Traces are not ideal, but much better than the current status quo. They are used too little.
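As a rough illustration of the first measurement in the list above, the sketch below computes the active set per time window and per direction from a trace. The trace format (timestamp in seconds, "R" or "W", byte offset, byte length), the 4 KiB block granularity and the name `active_set_bytes` are assumptions for the example, not the output format of any particular tracing tool.

```python
from collections import defaultdict

BLOCK = 4096  # assumed granularity for counting distinct blocks


def active_set_bytes(trace, window_s, block=BLOCK):
    """Return {(window_index, op): bytes of distinct blocks touched in that window}."""
    touched = defaultdict(set)
    for ts, op, offset, length in trace:
        if length <= 0:
            continue
        first = offset // block
        last = (offset + length - 1) // block
        # Every block the request overlaps counts once per window and direction.
        touched[(int(ts // window_s), op)].update(range(first, last + 1))
    return {key: len(blocks) * block for key, blocks in touched.items()}


# Tiny hand-written trace, purely illustrative.
trace = [
    (0.0, "R", 0, 8192),        # reads blocks 0 and 1
    (10.0, "R", 4096, 4096),    # re-reads block 1
    (20.0, "W", 40960, 4096),   # writes block 10
    (3700.0, "R", 0, 4096),     # reads block 0 in the second hour
]

hourly = active_set_bytes(trace, 3600)
weekly = active_set_bytes(trace, 7 * 24 * 3600)
```

Running the same trace through different window lengths gives the 1-hour versus 7-day comparison; sets of block numbers are fine for a sketch but would need a more compact structure for a real multi-day trace.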

    We need unification - one test which we can all trust represents actual applications. It will definitely need different profiles for different applications and use-cases. Still, as a storage vendor, being able to say "our $50k system can do 100 kilo-MySQL-stones and 50 PublicCloudVM-stones" would be extremely helpful. Getting the help of diverse users in defining meaningful units of merit for each profile sounds like a good idea.

    A test harness blessed by, or even better designed by, Howard will go a long way in being widely trusted.

    Just my 2c.

    Cheers,

    BK

  10. storageer

    The VI Load Dynamix approach offers true real-world application workload I/O profiles

    What Howard is creating can have real value for smaller to mid-size shops. Sizing storage systems from a performance perspective has always been a black art for this class of users. Benchmarks like SPC/TPC simply don’t reflect YOUR applications or YOUR use of storage. Workload I/O profiles vary immensely, even for the same core applications (think Oracle, SQL Server). To get a free workload analysis of your current production I/O profiles, you can simply visit WorkloadCentral.com, which offers free workload analysis and sample workload models. These are based on the Virtual Instruments Load DynamiX approach, where I work. As Howard implied, the Load DynamiX workload analysis, modeling and load generation products are the Gold Standard for the industry. They are used by companies such as AT&T, PayPal, T-Mobile, New York Life, NTT, BNY Mellon, Cisco, Boeing, United Healthcare, Cerner, Softlayer and LinkedIn, to name a few.

    Although not inexpensive and mostly designed for F1000 companies and service providers, the VI LDX offering includes complete, automated production workload acquisition. It captures your existing production storage workload profile data with ~99% accuracy and repeatability for any simple or complex workload mix. It accomplishes this using a real-time, network attached sensor or via an offline “Workload Data Importer” software tool. It is the ONLY way to truly capture YOUR workload profiles. That’s why so many storage engineers/architects have turned to these products. HP, EMC and HDS (and their VARs) are all resellers of the VI LDX products, so they are widely available.

    If your storage infrastructure is relatively small, perhaps you spend around $100,000 a year on storage as Howard mentioned, then Howard’s tool should be a viable option once it is completed. If you are spending $500K or more per year, then investing in a professional solution like the VI Load DynamiX platform will have immediate ROI as most LDX POCs are conducted in a few days, not weeks or months and they include full reporting and visual analytics capabilities in addition to the automated workload capture, modeling and testing platform.

  11. Anonymous Coward

    Doesn't SPEC's SFS2014 do almost everything on this shopping list ?

    Doesn't SPEC's SFS2014 benchmark do almost everything on this shopping list ?

    Multiple, market-specific workloads, user-definable workloads, compression, dedup, business-oriented metrics, portable, backed by a standards group, workloads based on actual workload samples from real customers and their software stacks... and on and on.

    See: https://www.spec.org/sfs2014

    Take a look at the User's guide, or the online video tutorials, under "presentations"


Biting the hand that feeds IT © 1998–2019