IBM multi-petabyte cloud defies XIV storage

With its Smart Business Storage Cloud, IBM says we have its GPFS and XIV being used to build a system capable of multiple petabytes of capacity and supporting billions of files with high-performance computing-like I/O performance. But XIV has a maximum usable capacity of 79TB. We can envisage an IBM BladeCenter server set …


This topic is closed for new posts.

GPFS: heavy

Some of the UK academic supercomputers run GPFS. It costs a lot for 50TB of storage, whereas with Apache Hadoop you can have the same capacity, without needing full-time suits on site, from 200TB of HDD (assuming 4TB per server node, with 1TB kept for local storage of the OS and MR job temp files). GPFS relies on high-end RAID storage and an InfiniBand backbone, and costs a lot to expand.
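The 200TB-to-50TB figure works out if you assume HDFS's default replication factor of 3, which the post does not state explicitly. A rough sketch of that arithmetic (the function name and parameters are illustrative, not from any Hadoop API):

```python
def hdfs_usable_tb(raw_tb, tb_per_node=4, local_tb_per_node=1, replication=3):
    """Estimate usable HDFS capacity from raw disk, under the post's assumptions.

    replication=3 is the HDFS default and is an assumption here;
    the post only gives the 4TB/node and 1TB local figures.
    """
    nodes = raw_tb // tb_per_node                        # servers needed
    hdfs_raw_tb = nodes * (tb_per_node - local_tb_per_node)  # disk left for HDFS
    return nodes, hdfs_raw_tb / replication              # each block stored N times

nodes, usable_tb = hdfs_usable_tb(200)
print(nodes, usable_tb)  # 50 50.0
```

So 200TB of raw disk is about 50 nodes, 150TB of HDFS space, and roughly 50TB of unique data.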

But with InfiniBand it delivers fantastic data rates to any node in the cluster, unlike Google GFS and Hadoop HDFS, which only deliver local disk rates to code running on a single node. With GPFS, your disk bandwidth to a single node scales up the more disks you put on.

So, tradeoff. Hadoop: free, needs storage near your CPUs, "commodity" x86 and GbE, proven to scale to PB. GPFS: top-of-the-line hardware; I doubt anyone could afford to scale it up to many PB today. Then XIV, which sounds like something in between: more commodity hardware, less scalability. Given the engineering effort being put into Hadoop (no involvement from IBM in the filesystem, incidentally), it and the layers of code near it will have the mass petabyte datastore market. Someone had better be writing a bridge so that Hadoop can run MapReduce code on XIV storage, the way they have done for GPFS.


"XIV is an NSD storage node in this scenario."

I doubt that, sorry. XIV is a storage box. It only stores and retrieves things; it doesn't run regular application code (like GPFS).

Here's some lowdown on GPFS:

An NSD is a regular node of a GPFS cluster, one which some of the GPFS disk is connected to.

GPFS supports AIX, Linux or Windows nodes, of which only the first two types can also be NSDs.

In your example above, an XIV would most likely be connected to by multiple NSDs. GPFS is parallel, and so are all the components.

The NSDs will connect to one (or typically, more) storage boxes (XIV or anything else industrial), and all nodes in the GPFS cluster will request data from the NSD nodes.

Usually, NSDs communicate with backend storage via FC, and the cluster as a whole communicates via IP. No one NSD is used for access to any one item, in normal designs.

An example setup for GPFS could be something like: 128 nodes in the GPFS cluster. All nodes have the same filesystem mounted. GPFS does byte-range locking of files and has some pretty cool distributed lock semantics as well as the usual POSIX stuff.

In the example setup 8 of the 128 nodes have connection to 4 FC storage boxes eg. DS5000.

This could be 64 FC storage boxes or any other number pretty much. It just depends on the performance requirements.

It's not unusual for a cluster to be plugged together with something fast or exotic (InfiniBand/Myrinet/10GigE).

Incidentally, calling GPFS a filesystem sort of undersells it, compared with the vast majority of filesystems out there, since it has features nothing else has.

The closest thing to it currently is Lustre (which Sun is currently bolting onto the front of ZFS).


Most filesystems have no concept of having more than one speed/reliability of access of disk in one filesystem: disk is usually presented via some sort of LVM mechanism.

GPFS doesn't need to work like that. You can do things like mirroring individual files by policy* or by changing the attributes of the file (mmchattr). It's amazingly powerful and not really shouted about enough.

*(GPFS has a policy engine in the filesystem, not an external tool, so you can do things like set policies where your mp3s end up on SATA, profiles on FC, DB on SSD. And the DB has to be mirrored, which GPFS does too.)
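To make the idea concrete, here is a toy model of rule-based placement in Python. It is not GPFS policy syntax, and the pool names, paths, and the `place` function are all made up for illustration; it only shows the "first matching rule picks the pool and the replica count" logic described above.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    pool: str      # hypothetical storage pool name
    replicas: int  # 2 means the file is mirrored

# Ordered rules: the first predicate that matches decides the placement.
RULES = [
    (lambda path: path.endswith(".mp3"), Placement("sata", 1)),
    (lambda path: path.startswith("/profiles/"), Placement("fc", 1)),
    (lambda path: path.startswith("/db/"), Placement("ssd", 2)),  # DB is mirrored
]
DEFAULT = Placement("fc", 1)  # assumed fallback, not from the post

def place(path):
    """Return the placement for a new file according to the rules above."""
    for matches, placement in RULES:
        if matches(path):
            return placement
    return DEFAULT

print(place("/home/ac/song.mp3"))
print(place("/db/orders.dat"))
```

The point of having this inside the filesystem is that applications just write files; nothing outside needs to know which tier the data lands on.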

Megaphone because this ended up being a much longer post than I expected...


Waste of the world's resources

XIV uses 180 1TB SATA drives to give you that 79TB usable. That is 101 drives of inefficiency that suck power and require cooling. Multiply that out to reach petabytes and the waste of power is staggering. Building a cloud service based on that architecture is going to be expensive. The data center costs are just going to increase as the cost of power increases, and not only that, Greenpeace should be knocking on the door for engineering an architecture that consumes so much of the world's resources for no reason whatsoever.
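A quick sketch of those numbers, using only the 180-drive and 79TB figures quoted above. Treating 1PB as 1000TB and sizing in whole XIV systems are assumptions made for the illustration.

```python
import math

DRIVES_PER_XIV = 180   # 1TB SATA drives per system, per the post
DRIVE_TB = 1
USABLE_TB = 79         # usable capacity per system, per the article

raw_tb = DRIVES_PER_XIV * DRIVE_TB
overhead_tb = raw_tb - USABLE_TB   # capacity that does not hold unique data
efficiency = USABLE_TB / raw_tb    # fraction of raw disk that is usable

def drives_for(usable_tb_needed):
    """Whole XIV systems, and total spinning drives, for a usable-capacity target."""
    systems = math.ceil(usable_tb_needed / USABLE_TB)
    return systems, systems * DRIVES_PER_XIV

print(overhead_tb)            # 101
print(round(efficiency, 3))   # 0.439
print(drives_for(1000))       # (13, 2340)
```

In other words, about 44% of the raw disk is usable, and one usable petabyte means powering 2,340 spindles.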

Complete fail in my mind. There are much more efficient ways of storing large amounts of data without having to power that number of spindles, especially when most data stored in the cloud is not referenced after the first month of storage. Tape combined with a hierarchical file system would be a better solution.


Pretty simple, really

Having worked with several parallel filesystems in the past, it never even occurred to me that there would be only one XIV. Shared-storage filesystems just aren't very common nowadays, and GPFS was never such. There will be many XIVs, connected to many servers, perhaps in slightly-interesting ways to facilitate failover and such but generally not much different than if the storage were entirely private to each server - the base model for most of the parallel filesystems in common use. Scaling XIV up or out was never necessary to support this announcement.

Now, for anonymous in #2: thanks for the IBM ad copy, but your claim of uniqueness for GPFS wrt knowing about multiple kinds of storage is simply not true. I'm no fan of Lustre generally, but it has long given users the ability to stripe files within a directory tree across a particular subset of OSTs. As of 1.8, they also added OST pools which give users even more control in this regard. PVFS2 and Gluster also offer some control in this area. Ceph is conceptually ahead of the whole pack (including GPFS), though it's still in development so maybe it doesn't count. In a slightly different but related space, EMC's Atmos offers even more policy-based control over placement. It's an area where GPFS does well, and it's a legitimate selling point - not that this is the place for "selling points" - but it's far from *unique*.


Thanks, Platypus

I take your points, and I stand corrected...

AC#2 (and I suppose 5 when this is posted)


Too bad IBM doesn't use QFS and won't participate on Lustre on ZFS

Chris Mellor writes, "IBM says we have its GPFS and XIV being used to build a system capable of multiple petabytes of capacity and supporting billions of files with high-performance computing-like I/O performance. But XIV has a maximum usable capacity of 79TB."


Sun's "QFS volumes can scale up to 4 PB" - quite a difference. This would hold IBM off until the next real clustered filesystem is completely integrated.

Once Lustre is completely integrated to use ZFS as the back-end file system - clustered file system volume size limitations will be a thing of the past for the next couple decades, under Solaris.


"According to Jeff Bonwick, the ZFS chief architect, in ZFS: the last word in file systems, 'Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans.' Jeff also discussed the mathematics behind this statement in his blog entry on 128-bit storage. Since we don't yet have the technology to produce that kind of energy for the mass market, we might be safe for a while."

It is nice to see that the Lustre folks have been adding more features & hooks into ZFS; some of the clustering infrastructure made it into this past October 2009 release of Solaris 10!


"Zero-Copy I/O... heart-beat protocol on the disk to allow for multi-home-protection..."

Clustering is a pretty exciting area to be participating in today - I look forward to seeing what IBM comes up with!


@David Halko

Please don't confuse the size of an XIV with the limits of a GPFS filesystem: you can have as many XIVs underpinning a GPFS as you like, so it becomes 79TB x N where N is the number of XIVs. (Joke: 14? pah.)

Coincidentally, the largest tested size so far for a single GPFS filesystem is the same as your QFS example, 4 Petabytes. This isn't an architectural limit though.

The (current) architectural limit of GPFS for a filesystem is 2^99 bytes (OK, this isn't 2^128, but both are impractically large to implement, so it's a moot point).
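For a sense of scale, a small sketch of the 79TB x N arithmetic against the figures above. It assumes decimal units (1TB = 10^12 bytes, 1PB = 10^15 bytes); the function name is just for illustration.

```python
XIV_USABLE_BYTES = 79 * 10**12   # 79TB usable per XIV, decimal units assumed

def xivs_needed(target_bytes):
    """Smallest N such that 79TB x N covers the target (integer ceiling division)."""
    return -(-target_bytes // XIV_USABLE_BYTES)

tested_limit = 4 * 10**15   # 4PB, the largest tested single filesystem
gpfs_limit = 2**99          # stated architectural limit, in bytes
zfs_limit = 2**128          # the 128-bit figure quoted for ZFS

print(xivs_needed(tested_limit))   # 51
print(xivs_needed(gpfs_limit))     # N for the architectural limit
print(zfs_limit // gpfs_limit == 2**29)
```

So 4PB takes about 51 XIVs, while the architectural limit is so far beyond that that the gap between 2^99 and 2^128 (a factor of 2^29) really is academic.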

You can read more about limits and so on over here if you fancy it:



ac#2,5,7 :o)

Anonymous Coward

1,000-node XIV Infiniband cluster...easy

XIV storage nodes are nothing but generic Intel-based server boards running Linux. Slap in an Infiniband card and cluster away...thousands of nodes just like in the Top-500...

Anonymous Coward

Much ado over nothing...

What a load of nonsense in this thread...

XIV has no intrinsic scalability issues because (as seems lost on everyone here) XIV is just a cluster of off-the-shelf intel-based mobos running Linux. Each mobo is in a chassis that also houses a bunch of HDDs, just like a Sun "thumper".

IBM can cluster them by the hundreds, or (by slipping in an InfiniBand HCA) by the thousands, just like any other Linux-on-generic-Intel-mobo platform. Plus, IBM does huge clusters running Linux all the time...look at the Top 500 supercomputers list.

Scaling XIV is no big deal...


The real nonsense is...

...the idea that because something is a cluster it doesn't have any intrinsic scalability issues. What bollocks. Lots of clusters have serious scaling issues because their communication protocols are poorly designed, leading either to a bottleneck at one "master" node for some critical facility or to N^2 (or worse) communication complexity among peers. It's not at all uncommon to find clusters that fall apart at a mere 32 nodes. Yes, it's also possible to design a cluster that scales better, but the difficulty is domain-specific. In a storage cluster, consistency/durability requirements are higher than in a compute cluster supporting applications already based on explicit message passing with application-level recovery, and the coupling associated with those requirements makes the problem harder. It's *possible* that XIV has solved these problems well enough to scale higher, but only an idiot would *assume* that they can or have.
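To put rough numbers on the two failure modes mentioned above, here is a minimal sketch that just counts communication paths. It is a simplified model under my own assumptions (one link per node to a single master, versus every peer talking to every other peer) and says nothing about how XIV actually communicates.

```python
def master_links(n):
    """Star topology: every other node talks to one master, which becomes the hot spot."""
    return n - 1

def peer_links(n):
    """Full mesh: every pair of nodes has a path, which grows as N^2."""
    return n * (n - 1) // 2

for n in (8, 32, 128, 1000):
    print(n, master_links(n), peer_links(n))
# 8 7 28
# 32 31 496
# 128 127 8128
# 1000 999 499500
```

At 32 nodes the mesh already has 496 pairwise paths, and at 1,000 nodes it has nearly half a million, which is why "just add more nodes" does not automatically work.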

As I already pointed out, it's a moot point in this particular case because they don't need to, but in other situations the gulf between theoretical possibility and practical reality can loom quite large.
