Overcoming objections: Objects in storage the object of the exercise

My friend Enrico Signoretti is a massive fan of object storage, whereas for a long time I’ve had the reputation of being somewhat sceptical. I've felt for a long time that the whole thing has been more than a little overhyped, with the hype beginning with EMC’s Atmos launch, and continuing from there. The problem with object …

  1. Platypus

    Not so fast

    The elephant in the room is performance. Object storage pushers try very hard to avoid even measuring it (as shown by the near-total lack of benchmarks). If you layer a file system on top, it gets even worse. Part of that's due to the overhead of pushing your bits through an HTTP-based protocol, losing and having to recreate half of your state at every request. Even more comes from the extra work you have to do to implement stronger file system durability/consistency semantics on top of weaker object store semantics. Of course, you can always cheat by not actually meeting all standards or expectations applicable to file systems, and most object store pushers do, but it still puts them in a poor position relative to systems that implement those semantics and protocols natively. Object stores aren't going to displace NAS until they can at least get into the same ballpark on performance, and I'm not sure that will *ever* happen.
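
    To make the semantics gap concrete, here's a minimal sketch (Python with boto3; the bucket, key, and path names are hypothetical) of what a single POSIX rename turns into on an S3-style store: a copy plus a delete, with nothing atomic in between, so any file layer on top has to add its own coordination.

    import os
    import boto3  # AWS SDK; bucket/key/path names below are illustrative only

    # Native file system: one call, atomic by POSIX rules.
    os.rename("/mnt/data/report.tmp", "/mnt/data/report.csv")

    # S3-style object store: no rename primitive, so a file layer must
    # emulate it with copy + delete -- two requests, not atomic, and a
    # crash in between can leave both names (or a dangling copy) visible.
    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket="example-bucket",
        CopySource={"Bucket": "example-bucket", "Key": "data/report.tmp"},
        Key="data/report.csv",
    )
    s3.delete_object(Bucket="example-bucket", Key="data/report.tmp")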

    Disclaimer: I'm a Gluster developer. We took the saner approach of implementing files natively and objects on top of that.

    1. Yaron Haviv

      Re: Not so fast

      Platypus,

      Never say Never :)

      Yes, object today is mostly slow, but that is due to vendor implementations focused on archive and HDDs, not an architectural limitation.

      Object is not tied to HTTP. Scality, Ceph, and Coho Data have native TCP APIs, and from benchmarks I did, Ceph was actually faster than Gluster. In my previous job my team did an RDMA transport for Ceph (now upstream), which has better bandwidth and lower CPU usage compared to NFS.

      If you add object's concurrent APIs (one GET vs. NFS open, lookup, getattr, read, close), object can be way faster for small files and challenge the slow, non-scalable metadata access of file systems and NFS.
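
      A rough sketch of that difference for a small-file read (Python, with hypothetical paths and bucket names via boto3): the file/NFS path issues several separate operations before any data arrives, while the object path is a single request.

      import os
      import boto3  # bucket/key names below are illustrative only

      # File/NFS path: lookup+open, getattr, read, close -- each a
      # separate operation (and, over NFS, a separate round trip).
      fd = os.open("/mnt/nfs/images/thumb_0001.jpg", os.O_RDONLY)
      st = os.fstat(fd)               # getattr
      data = os.read(fd, st.st_size)  # read
      os.close(fd)                    # close

      # Object path: one GET returns metadata and data together.
      s3 = boto3.client("s3")
      resp = s3.get_object(Bucket="example-bucket", Key="images/thumb_0001.jpg")
      data = resp["Body"].read()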

      I expect emerging object solutions will put more emphasis on performance and consistency

      Yaron

      1. Platypus

        Re: Not so fast

        The issue of implementing file semantics on top of weak (S3-style) object semantics is not just an implementation choice. It introduces an architectural need for an extra level of coordination, which any implementation will have to address. There are richer object APIs that offer better performance (Ceph's RADOS is one), but that's very much not what most object-store advocates (like Enrico and now Trevor) are peddling.

        As for Ceph being faster than Gluster, I'll take that with a *big* grain of salt. I've seen many such comparisons, and even made a few myself. Anyone can cherry-pick a configuration or workload that favors one over the other. That's why disclosing such things is important. I have literally *never* seen such a comparison that made such disclosures and didn't contain blatant methodological flaws, and which favored Ceph. Not even from the Ceph folks themselves. Maybe if someone who worked for one of the RDMA-hardware companies (and who should have disclosed that fact before making claims) had done special tuning, and was comparing RADOS to Gluster+loopback, they could come up with such a result, but it wouldn't mean anything. Without details, I'm inclined to call BS on that one.

        Lastly, yes, one get can be more efficient than lookup, open (note the order), read, etc. That's great for file-at-a-time access patterns. A few people care about those. On the other hand, that difference pales in comparison to the difference between writing a single byte in the middle of a multi-gigabyte file vs. having to do a get/modify/put on the whole thing. Chunk up the files into multiple objects and you're back at multiple requests for the whole-file case, plus a metadata-maintenance problem that starts to look like the one file systems already solve.
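
        To illustrate that asymmetry (again Python, with hypothetical paths and bucket names): a small in-place update is a single call against a file, but a whole-object round trip against a plain S3-style store.

        import os
        import boto3  # bucket/key names below are illustrative only

        # File system: overwrite one byte in the middle of a huge file.
        fd = os.open("/mnt/data/huge.bin", os.O_WRONLY)
        os.pwrite(fd, b"\x00", 5_000_000_000)  # 1 byte at a ~5 GB offset
        os.close(fd)

        # S3-style object store: no partial write, so the whole object
        # must be fetched, patched in memory, and re-uploaded.
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket="example-bucket", Key="huge.bin")
        body = bytearray(obj["Body"].read())   # pulls the entire object
        body[5_000_000_000] = 0x00
        s3.put_object(Bucket="example-bucket", Key="huge.bin", Body=bytes(body))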

        Layering files on top of semantically-poor objects always leads to problems. Solving those problems either destroys any potential performance or scalability advantages you might have started with. That's why most such systems have gateway SPOFs and bottlenecks. In fact they look a lot like distributed file systems fifteen years ago, before we figured out how to solve exactly those problems in a reasonably elegant and efficient way. Those who do not know the lessons of history, etc.

        1. Yaron Haviv

          Re: Not so fast

          I agree with all your comments as they relate to legacy object (S3, Swift, Amplidata, Cleversafe, ...).

          The other ones I mentioned do support mutable objects (i.e. no need for get/modify/put or to chunk files), and some have consistency and other features to simplify NAS gateways.

          CephFS is a poor way to implement a file system over object; knowing the details, I wouldn't use that as an example. But Coho Data seems to have nice NFS performance and scale. Even if Ceph/Gluster are more or less on par performance-wise, it somewhat contradicts your theory that file should be way faster.

          I assume you know the stats on the percentage of (slow and blocking) metadata ops in NFS (NFSv4 added compound operations to help, but unfortunately no one uses them). In the real world it is pretty hard to get NFS to perform (and I personally did many of those benchmarks); if you disable the client cache or sync() on every IO to be on par with object atomicity/durability (required for micro-services), it's even worse. The exponential growth in data now adds more dependency on object metadata indexing and operations, which are not possible in NFS.

          We know web-scale moved from NFS to object (e.g. Facebook Haystack), but now even heavy users/proponents of POSIX like the DoE/DoD are working to relax the POSIX dependency and have government-funded object projects.

          Anyway, you are right about the limitations some of those products have, which limit object usability; it's time for new vendors to come up with better solutions.

          Yaron

          SDSBlog.com

          1. Platypus

            Re: Not so fast

            "if you disable the client cache or sync() on every IO to be on par with object atomicity/durability (required for micro-services)"

            S3-style object stores make *no* guarantee about consistency or durability. There's a word for the kind of tuning you speak of, hamstringing one side to meet a requirement from which the other is held exempt. It's called cheating. It's a way of *massively* skewing the results to favor one side, and it's why methodological disclosure is so important. Please compare apples to apples, then get back to us.

            1. Yaron Haviv

              Re: Not so fast

              Seems like a religious discussion, so I will stop with this post.

              Let's stick to facts: ALL object stores, even S3, provide atomicity and durability as base attributes (with better guarantees than NAS). Most, like S3, are eventually consistent, and some are fully consistent (implementation dependent).

              I'd be happy if you can point me to a benchmark to back your thesis which shows Gluster significantly knocking out Ceph or Coho Data on IOPS; all the ones I saw show they are on par or better. It's not fair to pick on a cloud archiving product like S3 to make perf claims.

              Let's see what the future holds. It seems like object is penetrating more use cases in which NAS was king, and is getting better, a trend backed by IDC figures. You may propose it's a temporary trend; that seems less likely to me.

              Yaron

              SDSBlog.com

              1. Platypus

                Re: Not so fast

                "Object even S3 provides Atomicity & Durability as base attributes"

                Simply untrue. You were talking about making the file store sync *on every write*. Object stores provide no guarantees on every write, because they don't even have a concept of individual writes. That's the flip side of any API based on PUT instead of OPEN+WRITE. At the very worst, an apples-to-apples comparison would require only an fsync *per file*, and even that would be requiring more of the file store than the object store. Can you actually cite the API description or SLA for any S3-like object store that makes *any claims at all* about immediate durability at the end of a PUT? Amazon's certainly doesn't, and that's the API that most others in this category implement.
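
                To make the apples-to-apples point concrete, here's a small sketch (Python, hypothetical paths) of the difference between syncing on every write and flushing once per file, which is all a whole-object PUT comparison would call for.

                import os

                chunks = [b"x" * 4096 for _ in range(1000)]

                # What you described: flush to stable storage after *every*
                # write -- the worst case you can inflict on a file store.
                fd = os.open("/mnt/nfs/per_write.dat", os.O_WRONLY | os.O_CREAT, 0o644)
                for chunk in chunks:
                    os.write(fd, chunk)
                    os.fsync(fd)      # one flush per write
                os.close(fd)

                # Apples to apples with a whole-object PUT: write everything,
                # then flush once when the file is complete.
                fd = os.open("/mnt/nfs/per_file.dat", os.O_WRONLY | os.O_CREAT, 0o644)
                for chunk in chunks:
                    os.write(fd, chunk)
                os.fsync(fd)          # one flush per file
                os.close(fd)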

                "Would be happy if you can point me to a benchmark to back your thesis which can shows Gluster significantly knocks out Ceph"

                http://www.principledtechnologies.com/Red%20Hat/RedHatStorage_Ceph_1113.pdf

                https://indico.cern.ch/event/214784/session/6/contribution/332/attachments/340854/475673/storage_donvito_chep_2013.pdf

                "not fare to pick on a cloud archiving product like S3 to make perf claims."

                Except that such "archiving products" are the subject of the article we're discussing. What's unfair is comparing a file system to an object store alone, on a clearly object-favoring workload, when the subject is file systems *layered on top of* object stores. All of those protocol-level pathologies you mention for NFS will still exist for an NFS server layered on top of an object store, *plus* all of the inefficiencies resulting from the impedance mismatch between the two APIs. If the client does an OPEN + STAT + many small WRITEs, the server has to do an OPEN + STAT + many small WRITEs. The question is not how a file system implemented on top of an object store performs when it has freedom to collapse those, because it doesn't. The question is how it performs when it executes each of those individual operations according to applicable standards and user expectations, which set definite requirements for things like durability.

                The only "religion" here is faith in the assumptions that support your startup's business model. It's not my fault if those assumptions run contrary to fact. I'm just pointing out that they do.

  2. Yaron Haviv

    From https://aws.amazon.com/s3/details/

    "Amazon S3 is designed for 99.999999999% durability" (i.e. every put has 11 9s durability)

    Again, in implementations which support offsets (Scality, Ceph, ...), your partial-write example is incorrect.

    Few-year-old, beta-level Ceph benchmarks are not a good measure; see more recent ones by SanDisk:

    http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Roy.pdf

    Even the one you sent doesn't show a knockout on IOPS, just slightly better numbers.

    Jeff, please don't take it to a personal level; as the Gluster architect, you are not free from bias.

    I suggest we continue this discussion offline.

    Yaron

    1. Platypus

      "Amazon S3 is designed for 99.999999999% durability" (i.e. every put has 11 9s durability)

      That's really about availability. It says nothing at all about when data is guaranteed to hit stable storage. You do know what "durability" means in data storage, don't you?

      "few year old Beta level Ceph benchmarks are not a good measure,"

      Ah yes, that's no true Scotsman all right. You asked for citations, I provided them, now you demand different ones. At least those actually compared Ceph to Gluster, on the same hardware. The document you cite only compares Ceph to itself. Why would you assume Gluster has been standing still, and wouldn't also perform better? That's convenient, I suppose, but hardly realistic. Making comparisons across disparate versions and disparate hardware tells us absolutely nothing.

      "as the Gluster architect you are not clean from bias"

      And I disclosed that association right at the beginning, because I believe in being honest with people. You're still moving the goalposts, citing "evidence" that's unrelated to the actual topic at hand, ducking the issue of how NFS overhead *plus* impedance-mismatch overhead can be less than NFS overhead alone. You haven't even begun to address the problems inherent in trying to provide true file system semantics on top of a system that has only GET and PUT, different metadata and permissions models, etc. This isn't personal, but misleading claims often lead to wasting a lot of people's time if they're not challenged. If you think object-store-based file systems are such a great idea, then you need to grapple with the issues and provide some facts instead of just slinging mud.

  3. Jeffrey Tabor

    Scalable, high-performance file system for object storage

    Avere Systems provides a scalable, high-performance file system for object storage, accessible with standard NFS and SMB protocols. Evidence can be found on the spec.org website, where we demonstrated high-performance NFS file serving with AWS S3, Cleversafe (now IBM), and Amplidata (now HGST) object storage. Also see below for Robin Harris' (aka StorageMojo) take on this.

    http://spec.org/sfs2008/results/sfs2008.html

    http://storagemojo.com/2014/04/08/avere-makes-cloud-nfs-fast-safe-for-the-enterprise/

    Disclaimer: I work for Avere Systems. We are helping customers with this challenge every day. See below for more info.

    http://www.averesystems.com/

  4. kdkeyser

    Amplidata / HGST consistency

    The AmpliStor (Amplidata) / Active Archive (HGST) object storage systems offer strong consistency, i.e. when a PUT is confirmed (200 OK), the data/metadata is guaranteed to be written with full durability.

    There is nothing intrinsic about the S3 API that prevents a strongly consistent implementation; consistency is simply not part of the API and is implementation-specific.

    The scope of the consistency is indeed a single object.

    BTW, DeepStorage has a whitepaper about AmpliStor performance, so that gives you at least one data point.
