Object storage specialist Cleversafe is after a piece of the Big Data analytics action, and has wheeled in MapReduce to make it happen. In the same stroke it rejected HDFS as vulnerable and wasteful of storage capacity. MapReduce is the analysis part of Hadoop, the flavour-of-the-month open-source data hoarding software. It …
must have missed something
"stores only one copy of the MapReduce data instead of three, which is cheaper as capacities grow from terabytes to exabytes and on to petabytes. It isn't vulnerable to a catastrophic loss of data"
so how is it protected if the data store does not make use of redundancy?
Re: must have missed something
Cleversafe effectively provides a distributed network block device with parity RAID style redundancy.
"Cleversafe says it stores only one copy of the MapReduce data instead of three, which is cheaper as capacities grow from terabytes to exabytes and on to petabytes."
The order of the units is wrong. A terabyte is 1000GB, a petabyte is 1000TB, an exabyte is 1000PB.
But typos aside, the performance of Cleversafe is likely to suck, because their storage technology effectively applies network RAID N+M (N data disks + M redundancy disks; M=1 is RAID5, M=2 is RAID6, etc.). The point being that instead of data being local to some number of nodes as in HDFS, the data is remote from all nodes, so there will be a performance hit. Parity RAID (5, 6) also has massive performance issues with small-ish writes. (One exception to this is the ZFS implementation, which works around the read-modify-write overhead by making the stripe size dynamic, but that's getting a bit off topic.)
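To make the small-write penalty concrete, here is a minimal sketch using the classic textbook I/O counts for a sub-stripe update (my own illustration, not specific to Cleversafe or any particular array):

```python
# Why parity RAID hurts on small writes: updating less than a full stripe
# forces a read-modify-write cycle, while replication just writes each copy.
# These are the classic per-operation I/O counts, not a benchmark.

def raid5_small_write_ios():
    # read old data + read old parity, then write new data + new parity
    return 2 + 2

def raid6_small_write_ios():
    # same, but with two parity blocks (P and Q) to read and rewrite
    return 3 + 3

def replicated_write_ios(copies):
    # one write per replica, no reads needed
    return copies

print(raid5_small_write_ios())   # 4 I/Os per small write
print(raid6_small_write_ios())   # 6 I/Os per small write
print(replicated_write_ios(3))   # 3 I/Os for triple replication
```

A full-stripe write avoids the reads entirely, which is exactly what ZFS's dynamic stripe width exploits.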
But to summarize - the reason HDFS keeps multiple copies of the data is not just redundancy, it is performance (data is local to some nodes, which typically determines which node a task gets "mapped" to, as in map-reduce). If Cleversafe are saying that their solution to multiple copies of data is to keep the data remote from all nodes (with extra overheads to boot), then that's not very clever at all (even if no less safe).
As I understand it
It's not actually a RAID system as you're used to, so talking of "RAID M + N" is technically wrong and misleading. They use an Information Dispersal Algorithm. This means that instead of M master copies(*) + N parity calculations, the data stream is split into M shares, with any N of them sufficient to recreate the original data stream (M >= N, obviously). Michael O. Rabin's paper "Efficient dispersal of information for security, load balancing, and fault tolerance" is the seminal paper describing not only how these schemes work, but also (as the title suggests) how they can actually be used to improve performance in large (distributed) storage arrays.
To be honest, I'm not surprised that they're switching to MapReduce instead of Hadoop. MapReduce is simply closer to the metal, and they don't need any of the extra redundancy that Hadoop can provide simply because they're building that redundancy in themselves with their own IDA. The IDA calculations themselves are also ideally suited to the MapReduce computation pattern.
(*) I know that many of the distributed filesystems that use an IDA as their underlying storage mechanism do also keep some 100% copies as well as just shares, but that's nothing to do with IDA as such, merely a feature of the system that uses the IDA storage layer.
Re: As I understand it
You have missed the most important point regarding the performance, and that is locality. The idea is that you "map" the task to the node that has the data, rather than move the data since moving large amounts of data is prohibitively expensive compared to processing it locally. By having multiple copies of the data, you can load balance similar tasks that need mapping to the same data efficiently. Multiple copies of the data in Hadoop aren't just about redundancy, they are even more about performance. Any reasonable systems architect would steer well clear of a system that compromises that by making the data remote from every node.
Re: As I understand it
Original AC here. First, let me say that I'm not 100% sure what this "metadata" that they're migrating from HDFS to MapReduce is. I can take a pretty good stab at guessing, though: it's the shares (slices) of the original data stream. At least that's how I read the original article.
The idea of what Cleversafe is doing is pretty simple to understand at a high level, and it's covered in the wikipedia page for their product. There are basically three operations: an IDA split function that takes a data stream and splits it into shares (or slices, as they call them), a set of per-share transforms (encryption and decryption), and an IDA combine operation that takes some number of decrypted shares and combines them to get back the data stream.
I'm assuming that what the announcement means is that instead of using HDFS (which has triple redundancy) for storing each share, they're now just using MapReduce-based storage (which has no redundancy). They don't lose out on any kind of fault tolerance that matters (answering the first question posed here) because they already get the desired level of redundancy by having more shares (M) than the quorum (N) required to reconstruct the data (again, M>N if they want redundancy, M=N if they only want distribution).
As to locality of data processing, all the per-share processing (ie, encryption, decryption and the associated checksumming and validation) is done locally. This could be done in hardware, but I suspect it's done in software on the local store's controller (so it's a software equivalent of a regular disk controller).
You're right to say that data is always going to be remote when it comes to actually splitting and combining files. But I think you're wrong in your conclusions about performance. Rabin's paper (and several others referencing it) goes through some pretty rigorous analysis of performance and concludes that it gets better performance overall relative to various RAID-like systems. I'll try to be brief and boil it down to three essential points.
1. Size of data. If the data stream is of length L, then each share is of size L/N(*), and the total space required for all shares is LM/N (recalling that M is the number of shares and N is the quorum). You can verify for yourself that as the redundancy level increases, IDA becomes substantially more space efficient than the equivalent RAID M+N system (it also has a single orthogonal failure case--loss of one or more shares--compared to multiple RAID failure modes, making it easier to understand and implement, but that's by the by). This space efficiency is the major gain of IDA, and it carries over to data transfer throughput.
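As a rough illustration of that arithmetic (my own numbers, not from Rabin's paper): to tolerate f failures, replication stores f + 1 full copies, while an IDA with M = N + f shares stores only LM/N.

```python
# Storage needed to survive f failures: replication vs an (M, N) IDA
# with M = N + f shares and quorum N. Illustrative figures only.

L_bytes = 1_000_000_000  # 1 GB of original data (assumed for the example)

def replication_bytes(f):
    # f + 1 full copies survive any f losses
    return L_bytes * (f + 1)

def ida_bytes(N, f):
    # M = N + f shares, each of size L/N, survive any f lost shares
    return L_bytes * (N + f) // N

for f in (1, 2, 3):
    print(f, replication_bytes(f), ida_bytes(10, f))
# With N = 10, tolerating 2 failures costs 1.2 GB under IDA
# versus 3 GB under triple replication.
```

The gap widens as the required fault tolerance grows, since replication pays a whole extra L per failure tolerated while IDA pays only L/N.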
2. Splitting. The splitter performs the IDA transform on the data of length L and produces M shares each of length L/N, as above. All of these shares are transferred to different controllers for fault tolerance. Contrast with a RAID-like scheme with multiple 100% copies + some number of parity streams and again you should see increased write performance for increasing redundancy levels.
3. Combining. In what I'll call non-fault-tolerant mode, we request just the bare minimum of shares (N) at random from the M controllers (and request more shares afterwards if there's an error with one or more of them). Recall that the total length of all this data is just L (again, excluding any checksum and bookkeeping information). In fully fault-tolerant mode, we request all M shares, use the first N to actually arrive and discard the rest (unless one of the first N is faulty, in which case we just use the next one, with no retransmission needed). The latter gives the best access times, as we effectively keep the best seek times from among all the controllers, but at the cost of needing increased bandwidth to transfer all M shares instead of N (for a total of LM/N). We can also have cases in between, such as requesting N + 1, N + 2 or whatever N + x < M we want. This figure can be tuned based on how common we expect transmission errors, corruption of share data or transient failures of share storage controllers to be, and also on how much we're willing to trade off improved latency against worsening bandwidth usage. In the first mode (x = 0), we get the worst latency, as we're stuck with the worst seek time from among the controllers, with the latter (N + x = M) giving the best seek times but taking the most bandwidth.
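The N + x tradeoff is easy to see with a toy Monte Carlo: requesting N + x shares and using the first N to arrive makes the combine latency the N-th order statistic of the response times. I'm assuming exponentially distributed response times here purely for illustration.

```python
# Toy simulation of the "request N + x, keep the first N" combine strategy.
# Assumes exponential per-controller response times (illustrative only).
import random

def combine_latency(N, x, trials=20_000):
    """Average time until the N-th fastest of N + x responses arrives."""
    total = 0.0
    for _ in range(trials):
        times = sorted(random.expovariate(1.0) for _ in range(N + x))
        total += times[N - 1]   # the N-th arrival completes the combine
    return total / trials

for x in (0, 1, 2, 5):
    print(x, round(combine_latency(10, x), 3))
```

Even a small x shaves the tail latency considerably, at the bandwidth cost of the x extra shares, which matches the tuning argument above.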
I hope I haven't gone on too long with this, but I think I did need to clarify exactly what the IDA is, how I think MapReduce is being deployed here, and how the "remote" aspect of IDA doesn't necessarily mean reduced performance. I strongly recommend looking up Rabin's paper if you're interested in finding out more. Also, as a shameless plug, you could try checking out my Crypt::IDA module for Perl. I think it has pretty decent documentation, although it's at a fairly low level and not specifically geared towards (solely) being used in RAID-like disk controller systems.
(*) Subject to rounding. Data must be padded to be a multiple of N in length. Some bookkeeping information also needs to be stored with each share, and this scales linearly with N (and is =N in many implementations).
Maybe it would help
The redundancy is achieved through a standard IDA. It guarantees high reliability with a small overhead. Theoretically you could achieve any level of reliability with the expansion factor converging to one (almost no overhead); the price in this theoretical case is unlimited computational resources. For practical cases, however, a level of reliability comparable with 3 copies can be achieved with a configuration (in your notation) of M=6 and N=20, which gives you 30% overhead instead of 200%.
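Spelling out the overhead comparison for those numbers (using the earlier poster's N data + M redundancy notation, counting overhead as storage beyond the original data):

```python
# Overhead of reliability-comparable configurations: 3 full copies
# versus an IDA with 20 data slices plus 6 redundancy slices (26/20).

def overhead_replication(copies):
    # percent of extra storage beyond the single original copy
    return (copies - 1) * 100

def overhead_ida(data_slices, extra_slices):
    # extra slices as a percentage of the data slices
    return extra_slices * 100 // data_slices

print(overhead_replication(3))   # 200% extra storage for triple copies
print(overhead_ida(20, 6))       # 30% extra storage for the 26/20 IDA
```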
The Cleversafe solution doesn't sacrifice locality in the best case, namely when a desired portion is stored as a single slice on a 'native' node. The difference compared with the copy-based solution is that if a slice is missing on a 'native' server (or, in the worst case, the whole server is unavailable), the computation takes place on a 'foreign' node. This node has to reconstruct the original portion of data using the IDA, which is obviously not cheap, both in network use to read the slices and computationally to perform the reconstruction. It is done completely transparently to the client software.
What Cleversafe offers is a tradeoff: you cut required storage by a factor of roughly 2.3 (3/1.3) by paying the price of non-locality only when you lose your primary 'native' server. In real enterprise environments this should happen relatively infrequently (for sure less than 1% of the time, due to server or disk loss). Would you agree it is a bargain price to pay when you are talking about saving many petabytes of storage?