High performance object storage: Not just about reskinning Amazon's S3

There are three tiers of storage: Primary storage, or block; secondary storage, or file; and object storage comes third. Object storage is immensely scalable, cheaper, durable... and slooooow. Could that be changing soon? Let me talk about on-premises object storage here. Many end users start with an application (sync and …

  1. Peter Gathercole Silver badge
    Joke

    Nothing new

    From Wikipedia (I know, but it's a useful first description).

    "Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier"

    Hey, I've got an object storage system, and didn't know it! The "globally unique identifier" starts with "/home/peter/Media..." or some such, and each object has some metadata that can be seen using examination tools like "ls -l", "istat" and "file"

    Wow. Whoda thuk it!

    1. Peter Gathercole Silver badge

      Re: Nothing new

      I know that responding to my own comment is a bit... well, poor form but -

      How global is global? If it's really global, what is the arbitration system to make sure that there are no collisions with other systems and organizations? And are objects immutable so that you have to version them as part of their globally unique identifier? I cannot really believe that there are people who believe that a non-hierarchical unique identifier is really possible at any scale.

      Is there any structure at all imposed upon the identifier and format of the metadata? If there is a structure, then it's just another type of file system with a different indexing system. Tree based filesystems are not the only type that have been used, they've just become almost standard because they mostly fit the requirements of most users.

      I know that, in theory, if you can segregate the object from the path to access the actual storage of the object, you become storage agnostic, such that objects can be moved to different stores and still be found, but under the covers, there will still be something that resembles a filesystem.

      This whole concept still sounds a bit like buzz words, even though CAFS have been around for more than 30 years.

      1. pPPPP

        Re: Nothing new

        You're right: all storage is object storage. Even raw block storage is object storage: the only metadata is its address. Files within file systems have more metadata: directory trees, file names, ACLs, as well as inodes where appropriate.

        Object storage is just an abstraction of that, but you access it directly using the metadata and you don't need to use find / to look for what you're after.

        MP3 players are well-known object storage. Underneath the interface there'll be a file system though, and under that, block storage. And accessing by metadata isn't always the best; if you sort by artist you end up splitting up those Various Artists albums.

        1. Peter Gathercole Silver badge

          Re: Nothing new @pPPPP re MP3 players.

          Yes, but all you're doing is storing an index, in the same way that the permuted index for old-style UNIX man pages from 40 years ago allows you to identify pages that mentioned particular key words.

          And if you break it down, in a UNIX-like system, the objects are actually tracked by inode which links blocks to objects (files), and the file tree structure is just a way of indexing the inodes.

          It could be perfectly possible, if a little unwieldy, to have an index of inodes other than a hierarchy of directory indexes, but you would have to do something about permissions, as although the inode itself includes permissions that can be checked, UNIX also requires a permissions check on the path to a file, not just the file itself.

          In fact, I understand that a number of POSIX compliant filesystem implementations do allow this type of access. GPFS (sorry, IBM Spectrum Storage, or whatever it's called this week) for example, has a policy engine that allows files to be accessed outside of the traditional file tree.
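The point above, that the directory tree is only one index over the inodes, can be seen with hard links: two names, one object. This is a minimal sketch using only Python's standard `os` and `tempfile` modules; the file names and the helper names `same_object` and `demo` are made up for illustration, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def same_object(path_a: str, path_b: str) -> bool:
    """True when both names resolve to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo() -> dict:
    """Create one file, give it a second name, and report what stat() sees."""
    with tempfile.TemporaryDirectory() as root:
        first = os.path.join(root, "by-artist.mp3")   # hypothetical names
        second = os.path.join(root, "by-album.mp3")
        with open(first, "wb") as fh:
            fh.write(b"audio bytes")
        os.link(first, second)  # a second directory entry for the same inode
        info = os.stat(first)
        return {
            "same": same_object(first, second),
            "links": info.st_nlink,
            "size": info.st_size,
        }


if __name__ == "__main__":
    print(demo())
```

Neither name is "the" file; both are index entries pointing at the same inode, which is where the size, ownership and permission bits live.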

          1. Platypus

            Re: Nothing new @pPPPP re MP3 players.

            Once you have files, objects are easy. The difficulty lies in going the other way.

          2. Reg Whitepaper

            Re: Nothing new @pPPPP re MP3 players.

            Peter.

            Once you drop the POSIX requirements of a traditional filesystem many things become a lot easier.

            Globally unique identifiers can be randomly generated with a (very) high probability that they will not clash. Sure, there may be some arbitration involved, but simply making part of the key unique to the generating geo-datacentre would be a big help.

            If an object is write-once rather than updatable then replication is a breeze (have the object stored locally? yes - serve it, no - contact the datacentre identified in the geo part of the key to fetch it on the fly and cache it). A worldwide replicated POSIX filesystem would be a weeee bit trickier to achieve with the possibility of multiple writers.

            You don't have to "mount" an object store - it's just "there" via an API.

            Unshackle yourself!
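The scheme Reg describes (origin in the key, write-once objects, fetch from the origin on a miss and cache) could be sketched roughly as below. Everything here is an assumption for illustration: the `Site` class, the `"site/uuid"` key layout and the shared `directory` dict stand in for whatever a real system would use, and no actual product is being described.

```python
import uuid


class Site:
    """One datacentre holding write-once objects (hypothetical model)."""

    def __init__(self, name: str, directory: dict):
        self.name = name
        self.objects = {}           # key -> bytes, never overwritten
        self.directory = directory  # site name -> Site, shared by all sites
        directory[name] = self

    def put(self, data: bytes) -> str:
        # Origin prefix plus a random UUID: no cross-site coordination needed.
        key = f"{self.name}/{uuid.uuid4().hex}"
        self.objects[key] = bytes(data)
        return key

    def get(self, key: str) -> bytes:
        if key in self.objects:
            return self.objects[key]          # already stored or cached locally
        origin = key.split("/", 1)[0]         # the "geo" part of the key
        data = self.directory[origin].objects[key]
        self.objects[key] = data              # safe to cache: objects are immutable
        return data


if __name__ == "__main__":
    directory = {}
    london, tokyo = Site("lon", directory), Site("tyo", directory)
    key = london.put(b"holiday photo")
    print(key, tokyo.get(key))
```

Because objects are never updated, the cached copy can never go stale, which is what makes the replication side simple. The cost is that the key now encodes where the object was born.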

            1. Peter Gathercole Silver badge
              FAIL

              Re: Nothing new @Reg

              Your comment contains an oxymoron. A "globally unique identifier" cannot clash, by definition.

              Adding the geo-datacentre makes it hierarchical, and actually means that it becomes difficult to address an object if it moves to another datacentre.

              In case you had not realized, there are many ways to get files from filesystems that do not require mounting (if you know a file handle, some past implementations of NFS allowed you to access a file without mounting the filesystem, but that was a bug!). You're just applying current thinking to make an artificial distinction to try and preserve the definition of an object file store.

              Despite your completely valid points, I still maintain that an object filestore is just a filesystem by another name.

              My use of the POSIX example was just to illustrate the use of inodes, and that things can be familiar and different at the same time. I was not saying that all filesystems need to be POSIX compliant, and the use of things like SSHfs, which is in essence stateless but runs on top of existing filesystems, indicates that the APIs you suggest can be (and probably are, in most instances) just a layer on top of existing filesystems.

              1. Reg Whitepaper

                Re: Nothing new @Reg

                Hey man, I'm just using your terminology:

                " Hey, I've got an object storage system, and didn't know it! The "globally unique identifier" starts with "/home/peter/Media..." or some such, and each object has some metadata that can be seen using examination tools like "ls -l", "istat" and "file""

                My point was that an object can be instantiated in one location with certainty that it will not clash with an object in another location. That is all.

                Once it is instantiated it can be replicated to the other locations. If it needs to be accessed from another location BEFORE it has been replicated then it can be requested from the originating location (and now it has been replicated).

                Quoth you:

                "How global is global? If it's really global, what is the arbitration system to make sure that there are no collisions with other systems and organizations? And are objects immutable so that you have to version them as part of their globally unique identifier? I cannot really believe that there are people who believe that a non-hierarchical unique identifier is really possible at any scale."

                I am just answering your questions. A local originator being part of the key prevents collision between locations (and a local arbitrator prevents local collisions). If they are immutable then replication is very easy.

                I'm not detailing the working of any actual system. Just suggestions as to how they *could* work.

                I'm not suggesting that you should use them for any particular workload, just that their properties may be useful for some workloads.

                Take it or leave it. Meh.

  2. Platypus

    Not so fast ;)

    Enrico, the problem with the idea of high-performance object storage is that the S3-style APIs are not well suited to it. Whole-object GET and PUT are insufficient. Most have added reading from the middle of an object; writing likewise has been claimed/promised for a long time, but is still not something developers can count on being able to do. The stateless HTTP protocol is also inherently less efficient than what you get with file descriptors and a better pipelining model. Frankly, a lot of the object-store implementations aren't up for a performance game either. The most charitable way to put it is that the developers were prioritizing other features such as storage efficiency. I'll be a bit less charitable and say the whole reason most of them got into object stores was because they're easy, so they wrote their code with inefficient algorithms and languages/frameworks. That lets them get to market earlier, but the downside is darn-near-unfixable performance issues. The main exception is Ceph's RADOS, which has an API more like NASD/T10 than S3 and which was designed from day one to support upper-layer protocols that demand higher performance.
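To make the whole-object limitation concrete, here is a small in-memory sketch. `TinyObjectStore` is a hypothetical stand-in, not any vendor's API; the only real-world detail it leans on is the HTTP `Range: bytes=first-last` header, whose bounds are both inclusive. The helper `overwrite_middle` shows what a partial write costs when only whole-object GET and PUT are available.

```python
class TinyObjectStore:
    """In-memory stand-in for an S3-style store (hypothetical, not a real API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # Whole-object PUT: the only write this sketch offers.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        # Whole-object GET.
        return self._objects[key]

    def get_range(self, key: str, first: int, last: int) -> bytes:
        # Mirrors HTTP "Range: bytes=first-last", where both ends are inclusive.
        return self._objects[key][first:last + 1]


def range_header(first: int, last: int) -> dict:
    """Header a client would send to read bytes first..last of an object."""
    return {"Range": f"bytes={first}-{last}"}


def overwrite_middle(store: TinyObjectStore, key: str, offset: int, patch: bytes) -> int:
    """Emulate a partial write with GET + modify + PUT; returns bytes transferred."""
    old = store.get(key)
    new = old[:offset] + patch + old[offset + len(patch):]
    store.put(key, new)
    return len(old) + len(new)


if __name__ == "__main__":
    store = TinyObjectStore()
    store.put("video", b"0123456789")
    print(store.get_range("video", 2, 4), range_header(2, 4))
    print(overwrite_middle(store, "video", 3, b"XY"), store.get("video"))
```

Changing two bytes in the middle moves the whole object twice, which is the kind of overhead a file descriptor with seek and write avoids.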

    Throwing flash at an object store won't let it catch up with block or file storage that's also flash based. It might be higher performance than it is now, but it will still be slower than contemporaries. It's going to be really hard for anyone in that mire to get beyond the tertiary role.

    1. Yaron Haviv

      Re: Not so fast ;)

      Objects can be accessed through different APIs, not just S3/HTTP.

      E.g. Ceph has the librados API, as you noted, and one can access or modify data at variable offsets. To some extent, key/value systems like Redis or Aerospike are a kind of object storage (use DHT) with more structured data and a focus on performance (of 1M IOPS) vs capacity.

      The key benefit of object is avoiding the metadata scaling challenges, and the overhead of maintaining directories when the key is random by nature (e.g. picture ID in a web page, user record, ...).

      In systems with millions and billions of files you want to use atomic get/put and read/write the entire file or record, rather than nested directory lookup -> open -> read/write/lock -> close.

      Yaron
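The contrast drawn above, one keyed get versus a lookup per directory level, can be sketched in a few lines. This is only an illustration under stated assumptions: nested dicts stand in for directories, and `node_for` uses plain hash-modulo placement, which is a simplification of a real DHT (those generally use consistent hashing or similar so that adding a node does not reshuffle every key). All function names are made up.

```python
import hashlib


def node_for(key: str, nodes: list) -> str:
    """Pick a node from the key's hash alone (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def flat_get(store: dict, key: str):
    """Object-style access: one lookup, no directories to maintain."""
    return store[key], 1


def tree_get(root: dict, path: str):
    """File-style access: one lookup per path component, counted as we go."""
    node, steps = root, 0
    for part in path.strip("/").split("/"):
        node = node[part]
        steps += 1
    return node, steps


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    print(node_for("picture-8841", nodes))
    print(flat_get({"picture-8841": b"jpeg"}, "picture-8841"))
    tree = {"users": {"88": {"picture-8841": b"jpeg"}}}
    print(tree_get(tree, "/users/88/picture-8841"))
```

With a random key there is nothing useful to put in the intermediate directories, so every level of the tree is pure overhead.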

  3. pyite

    Scale-out NFS? Tell me more

    Is this something we can do with GPL software?

    1. schafdog

      Re: Scale-out NFS? Tell me more

      GlusterFS supports NFS, but it's not an object store.


Biting the hand that feeds IT © 1998–2020