Is Object storage really appropriate for 100+ PB stores?

This topic was created by Chris Mellor 1.

  1. Chris Mellor 1

    Is Object storage really appropriate for 100+ PB stores?

    Object storage vendors say their technology is ideal for storing billions of files and hundreds of petabytes of data in a single namespace across hundreds of sites. Is it?

    CERN's LHC experiment stores its great mass of data on tape. Why not in an object store? Too expensive maybe? What is the real message about object storage if the billions of files/hundreds of petabytes use case doesn't actually exist?

    1. TDuskin

      Re: Is Object storage really appropriate for 100+ PB stores?

      I think the fact that Amazon S3, an object store, has scaled to over a trillion objects (http://aws.typepad.com/aws/2012/06/amazon-s3-the-first-trillion-objects.html) is evidence that the technology can and has scaled to those sizes. Other service providers, such as Rackspace and HP, offer object storage as "cloud storage", which enables them to sell space on their clusters and scale horizontally to meet demand not only from many objects but from many users as well.

      Also, the claim that the LHC data is being recorded to tape may be true, but that doesn't mean CERN has given up on object storage altogether. In fact, the CERN folks are taking a hard look at object storage and other cloud computing technologies because they want to process more data AND let more researchers get access to said data for processing (http://www.techrepublic.com/blog/european-technology/cern-where-the-big-bang-meets-big-data/636), and with everything stored on tape, there is an obvious bottleneck when many people want data from many different experiments.

      Data generation is not going down, so while 100PB seems like overkill, it will be used. Remember, 1GB was at one time considered "too much space."

  2. Dominic Connor, Quant Headhunter

    Depends on the type of data

    As I understand it, CERN data is lots of numbers with relatively few fields per record.

    It will also often be accessed sequentially: the vast majority of numerical algorithms are implemented that way, so the loading is sequential, and even when parallel algos are used you usually still aren't doing random access.

    A business DB will be queried as in "list the expiry dates of our cat food inventory, grouped by date and flavour"

    This may be a multi-table query, something experimental data does a lot less of.

    With high-volume numerical data you may not even bother with an index, since it may be the same size as the data itself; instead you use the sequence number or a datestamp.
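
    A rough illustration of that point (the record layout and field sizes here are made up for the example): with fixed-size records, the byte offset is computable from the sequence number, so no separate index is needed.

    ```python
    import struct

    # Hypothetical fixed-size record: an int64 timestamp plus four float64 readings.
    RECORD_FORMAT = "<qdddd"
    RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 40 bytes

    def read_record(path, seq_no):
        """Fetch a record by sequence number: the offset itself acts as the index."""
        with open(path, "rb") as f:
            f.seek(seq_no * RECORD_SIZE)
            return struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
    ```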

  3. MDebelic

    right questions

    Chris,

    you're asking the right questions here. "Big Data" seems to be the buzzword of the moment ("the Cloud" appears to have found its place and form). Lots of Big Data talk, announcements, ideas, shapes, forms and a number of (potential) solutions - but few real things to show.

  4. TomLeyden

    time will tell but

    Chris,

    Of course there is a lot of buzz and marketing around Object Storage and Big Data, but isn't that normal? Back in 2009, more than one person claimed Cloud Computing would never break through. One of them runs a database company that claims to be "the" Cloud company today...

    The market drivers are there, the technology follows. And there are a good number of use cases out there as well. Amazon S3 has been mentioned, but there are also Facebook, Google and others.

    The CERN reference, which I also saw in your NetApp article, is not the greatest for a number of reasons:

    * Their choice of tape is two years old

    * TDuskin's comment is correct

    * It's not the easiest use case; a better one would be Facebook: are they storing their massive numbers of pictures on tape? Or YouTube its videos?

    So, Chris, object storage may need some time to become a standard, and there might be something else coming up in a couple of decades, but for now it is a much more scalable alternative than file-based storage when storing large volumes of unstructured data (the better use case). Also, when designed with the proper Erasure Coding, it's a lot more efficient and reliable than traditional technologies.
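
    For context on the efficiency claim, a back-of-the-envelope comparison of raw-capacity overhead (the 10+6 layout is purely illustrative, not any particular vendor's scheme):

    ```python
    def storage_overhead(data_fragments, parity_fragments):
        """Raw bytes stored per byte of usable data."""
        return (data_fragments + parity_fragments) / data_fragments

    # Triple replication: 3.0x raw capacity, survives the loss of 2 copies.
    print(storage_overhead(1, 2))   # 3.0

    # Example 10+6 erasure code: 1.6x raw capacity, survives any 6 lost fragments.
    print(storage_overhead(10, 6))  # 1.6
    ```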

    Disclaimer: I work for Amplidata, an erasure coding based object storage vendor

  5. marc.villemade

    Re: Is Object storage really appropriate for 100+ PB stores?

    Hey Chris,

    [Disclaimer for others than Chris: I work for Scality, Object Storage vendor]

    I think there is a need out there for object storage. The problem is that I think it's hard to contradict that argument if you're focused on that technology (i.e. with one of the vendors), but if you're not, it's much easier to dismiss it completely. My point is that there ARE projects out there in the hundreds of PBs. For those going to tape (like CERN or NCSA), I'm not entirely sure if it's because of a lack of good marketing/sales from the object storage vendors or because tape has a real competitive advantage (technically, financially, or both).

    I am almost convinced that it's the former. And I think I'm not the only one in the object storage world, as I pointed out in our blog: http://www.scality.com/the-object-storage-summit/ - most of us object storage vendors see the way people perceive our technology as something that could be improved. A lot.

    I am convinced that object-based storage is the future for unstructured data. One of the points you bring up in your article is that "The El Reg storage desk is cynical because no company is actually storing hundreds of petabytes of data in billions of files inside a single namespace across hundreds of data centres."

    It is true that there is no publicly known storage infrastructure spread across hundreds of sites. But the reason for this is that I think it will never really happen. "Data Gravity", from Dave McCrory, is completely at play here. Applications can be moved more easily than data. But we can also replicate just subsets of data.

    My vision of this is that we will see a lot of large data stores replicated across a few sites for redundancy, failover and whatnot. For processing the data, workloads will be moved to those sites; for data access, smaller sets of data will be replicated to other sites closer to the end-user to improve latency. This will involve a lot of new technology in terms of usage patterns and analysis to make sure that the data is there when the user wants it. I hate to bring it up here because it's really over-hyped these days, but some aspects of Big Data will be involved in this to move the right data in advance of users actually requesting it.

    We don't have this technology yet. However, there is no doubt that we have the technology to "store hundreds of petabytes of data in billions of files inside a single namespace" with object storage today. These stores will grow to hundreds of exabytes in the next 10 years, and technology will be built on top of them to make the data available to the application and/or end-users in the best way possible (moving workloads or data subsets to optimize latency, security, bandwidth, processing power...).

    It is really going to be an incredible journey to get there from here, and such an interesting time to be in the industry.

    I'd be happy to hear what you think about my rant ;)

    -marc

    @mastachand

    http://www.scality.com/blog

  6. Chris Mellor 1

    Glacier re-writes the rules

    Amazon's Glacier (http://www.theregister.co.uk/2012/08/21/amazon_glacier_objjects/) seems to be a combination of object storage and spun-down disks. It uses five data centres and will probably quite rapidly become the largest object storage implementation in the world. That will be a significant vote of confidence in object storage, I think.

    Chris.

  7. Chris Mellor 1

    Cleversafe

    Cleversafe (http://www.theregister.co.uk/2012/09/03/cleversafe_1tb_sec/) has a 10EB object storage concept with a 1TB/sec ingest speed. It suggests to me that object ingest and placement in an object store is slower than file ingest and placement in a file store. Obviously an object has to be hashed, its location in the object namespace calculated, and the data then written - which takes longer than tacking a file on to the end of a file store and updating a folder listing.
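
    A toy sketch of the hash-then-place step described above (the node count, hash choice and layout are arbitrary assumptions for the example, not how any particular product does it):

    ```python
    import hashlib

    NODES = [f"node{i:02d}" for i in range(16)]   # hypothetical storage nodes

    def place_object(key: str, data: bytes) -> str:
        """Hash the object key and map the digest onto a node in the namespace."""
        digest = hashlib.sha256(key.encode()).digest()
        node = NODES[int.from_bytes(digest[:8], "big") % len(NODES)]
        # ...the object data would then be written to `node`; by contrast,
        # appending a file to a file store and updating a folder listing
        # skips the hashing/placement work entirely.
        return node
    ```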

  8. SPGoetze

    800 PB...

    The biggest StorageGrid installation I've heard about is 800 PB on 3 continents. I forgot how many sites, but it was quite a few... so apparently some people DO put more than 100 PB in one namespace.
