Re: As I understand it
Original AC here. First, let me say that I'm not 100% sure what this "metadata" that they're migrating from HDFS to MapReduce is. I can take a pretty good stab at guessing, though: it's the shares (slices) of the original data stream. At least that's how I read the original article.
The idea of what Cleversafe is doing is pretty simple to understand at a high level, and it's covered in the Wikipedia page for their product. There are basically three operations: an IDA split function that takes a data stream and splits it into shares (or slices, as they call them), a set of per-share transforms (encryption and decryption), and an IDA combine operation that takes some number of decrypted shares and combines them to get back the original data stream.
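To make those three operations concrete, here's a minimal toy sketch in Python of a prime-field variant of Rabin's IDA. To be clear, this is neither Cleversafe's code nor the Crypt::IDA API; all the names and parameters are made up, real implementations generally work over GF(2^8) on raw bytes rather than a prime field, and I've left out the per-share encryption/decryption transforms since they're independent of the split/combine maths.

    P = 65537                      # prime field modulus; "words" are ints in 0..P-1

    def ida_split(words, m, n):
        """Split into m shares, any n of which reconstruct the stream."""
        words = words + [0] * (-len(words) % n)     # pad to a multiple of n (see footnote)
        shares = {x: [] for x in range(1, m + 1)}   # share id -> list of words
        for i in range(0, len(words), n):
            col = words[i:i + n]                    # one column of n input words
            for x in shares:                        # evaluate the column at m points
                shares[x].append(sum(c * pow(x, k, P) for k, c in enumerate(col)) % P)
        return shares

    def ida_combine(shares, n):
        """Recombine the stream from any n shares (dict: share id -> words)."""
        xs = sorted(shares)[:n]
        # Invert the n x n Vandermonde matrix for these ids (Gauss-Jordan mod P)
        a = [[pow(x, k, P) for k in range(n)] + [int(i == j) for j in range(n)]
             for i, x in enumerate(xs)]
        for c in range(n):
            piv = next(r for r in range(c, n) if a[r][c])   # find a pivot row
            a[c], a[piv] = a[piv], a[c]
            inv = pow(a[c][c], P - 2, P)                    # modular inverse (Fermat)
            a[c] = [v * inv % P for v in a[c]]
            for r in range(n):
                if r != c and a[r][c]:
                    f = a[r][c]
                    a[r] = [(v - f * w) % P for v, w in zip(a[r], a[c])]
        vinv = [row[n:] for row in a]                       # right half is V^-1
        out = []
        for i in range(len(shares[xs[0]])):
            evs = [shares[x][i] for x in xs]                # n received words
            out += [sum(vinv[k][j] * evs[j] for j in range(n)) % P
                    for k in range(n)]
        return out

    data = [104, 101, 108, 108, 111]              # five input words
    shares = ida_split(data, m=5, n=3)            # M=5 shares, quorum N=3
    got = ida_combine({x: shares[x] for x in (1, 3, 5)}, n=3)
    assert got[:len(data)] == data                # any 3 of the 5 shares suffice

The key property is visible in the final assert: any N of the M shares recover the stream exactly, and each share is only 1/N the size of the input.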
I'm assuming that what the announcement means is that instead of using HDFS (which has triple redundancy) for storing each share, they're now just using MapReduce-based storage (which has no redundancy). They don't lose out on any kind of fault tolerance that matters (answering the first question posed here) because they already get the desired level of redundancy by having more shares (M) than the quorum (N) required to reconstruct the data (again, M>N if they want redundancy, M=N if they only want distribution).
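To put rough numbers on that (purely illustrative; the article doesn't say what M and N Cleversafe actually use):

    # Hypothetical parameters, just to illustrate the M > N redundancy claim.
    M, N = 16, 10                 # store 16 shares, any 10 reconstruct the data
    losses_tolerated = M - N      # survives any 6 failed shares/controllers
    ida_overhead = M / N          # at 1.6x raw storage...
    hdfs_overhead = 3.0           # ...versus 3x for HDFS triple replication
    print(losses_tolerated, ida_overhead, hdfs_overhead)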
As to locality of data processing, all the per-share processing (i.e., encryption, decryption and the associated checksumming and validation) is done locally. This could be done in hardware, but I suspect it's probably done in software on the local store's controller (making it a software equivalent of a regular disk controller).
You're right to say that data is always going to be remote when it comes to actually splitting and combining files. But I think you're wrong in your conclusions about performance. Rabin's paper (and several others referencing it) goes through some pretty rigorous analysis and concludes that IDA performs better overall than various RAID-like systems. I'll try to be brief and boil it down to three essential points.
1. Size of data. If the data stream is of length L, then each share is of size L/N(*), and the total space required for all shares is LM/N (recalling that M is the number of shares and N is the quorum). You can verify for yourself that the gap widens as the redundancy level increases: to tolerate the loss of k stores, a replication-based scheme needs (k+1)L of space, while IDA needs only (N+k)L/N, i.e. L(1 + k/N). (It also has a single, orthogonal failure case, loss of one or more shares, compared to multiple RAID failure modes, which makes it easier to understand and implement, but that's by the by.) This space efficiency is the major gain of IDA, and it carries over directly to data transfer bandwidth. There's a quick numeric sketch after point 3 below.
2. Splitting. The splitter performs the IDA transform on the data of length L and produces M shares, each of length L/N, as above. Each of these shares is transferred to a different controller for fault tolerance, so no single controller absorbs more than L/N of the write. Contrast this with a RAID-like scheme using multiple 100% copies plus some number of parity streams, where each copy is a full L, and again you should see write performance improving as the redundancy level increases (the sketch after point 3 puts numbers on this too).
3. Combining. In what I'll call non-fault-tolerant mode, we request just the bare minimum of shares (N), chosen at random from the M controllers, and only request more shares afterwards if there's an error with one or more of them. Recall that the total length of all this data is just L (again, excluding any checksum and bookkeeping information). In fully fault-tolerant mode, we request all M shares and use the first N to arrive, discarding the rest (unless one of the first N is faulty, in which case we just use the next to arrive, with no retransmission needed). The latter gives the best access times, as we effectively keep the best seek times from among all the controllers, but at the cost of the increased bandwidth needed to transfer all M shares instead of N (a total of LM/N). We can also have cases in between, requesting N + 1, N + 2 or whatever N + x < M we want. This figure can be tuned based on how common we expect transmission errors, corrupted share data or transient failures of share storage controllers to be, and on how much bandwidth we're willing to trade for improved latency. At one extreme (x = 0) we get the worst latency, since we're stuck with the worst seek time among the N controllers we picked; at the other (N + x = M) we get the best seek times but use the most bandwidth. The simulation sketch below illustrates the trade-off.
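To put some made-up but representative numbers on points 1 and 2 (N = 10 is an arbitrary quorum; nothing here comes from the article):

    # Space and write volume needed to tolerate k lost stores, for one unit
    # of data (L = 1). Replication keeps k+1 full copies; IDA stores
    # M = N + k shares of size 1/N each (a total of LM/N, as above).
    N = 10
    for k in (1, 2, 4, 6):
        repl_total = k + 1               # total stored by replication
        ida_total = (N + k) / N          # total stored by IDA
        print(f"k={k}: replication {repl_total:.1f}x, IDA {ida_total:.1f}x, "
              f"IDA per-controller write {1 / N:.2f}x")

Tolerating six lost stores costs replication 7x the data size; IDA pays 1.6x, and each controller only ever writes a tenth of the stream.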
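And a quick simulation of the point-3 latency/bandwidth trade-off. The seek-time distribution is invented (exponential with mean 1), so only the shape of the result matters: raising x improves latency, because we keep the N best response times out of N + x, while bandwidth grows as (N + x)/N.

    import random

    M, N, TRIALS = 14, 10, 10_000    # made-up share count and quorum

    def mean_latency(x):
        """Request N + x shares; the read completes on the Nth arrival."""
        total = 0.0
        for _ in range(TRIALS):
            seeks = sorted(random.expovariate(1.0) for _ in range(N + x))
            total += seeks[N - 1]    # wait for the Nth-fastest controller
        return total / TRIALS

    for x in (0, 2, M - N):          # from x = 0 up to fully fault tolerant
        print(f"x={x}: mean latency ~{mean_latency(x):.2f}, "
              f"bandwidth {(N + x) / N:.1f}x")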
I hope I haven't gone on too long with this, but I think I did need to clarify exactly what the IDA is, how I think MapReduce is being deployed here, and why the "remote" aspect of IDA doesn't necessarily mean reduced performance. I strongly recommend looking up Rabin's paper if you're interested in finding out more. Also, as a shameless plug, you could try checking out my Crypt::IDA module for Perl. I think it has pretty decent documentation, although it works at a fairly low level and isn't geared solely towards use in RAID-like disk controller systems.
(*) Subject to rounding. Data must be padded to a multiple of N in length. Some bookkeeping information also needs to be stored with each share; this scales linearly with N (and equals N in many implementations).