Back to the drawing board
for a new name - or are they planning to introduce this technology on hard discs with Heat Assisted Magnetic Recording?
Like the idea of chewing on terabytes of data using Google’s MapReduce but think it's too slow, too hardware-hungry and too complicated? A fledgling big-data analytics venture reckons it’s got the answer - a Hadoop programming framework built in Java that it claims is 20 times faster than ordinary Hadoop and that it claims …
Great. More "innovative" APIs and algorithms looking to get patented.
The longer I work in the software industry, the more I see that patented algorithms and APIs are all just really obvious engineering constructs that, much of the time, had already been done by someone somewhere way ahead of the so-called "inventor".
If you're looking to make the most out of software, it is obvious what you need to do. It is always the same thing: minimising locks and contention, and optimising for the cache and the architecture, with the aim of reducing the number of machine cycles and instructions needed to perform a task. An exercise that can be performed by ANY competent software engineer who has the time to spend and knows how the hardware interacts with the software.
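To illustrate how mechanical this kind of optimisation is, here's a minimal, self-contained Java sketch (the class and sizes are made up for illustration): the same summation over the same array, once in a cache-friendly order and once in a cache-hostile one. No cleverness, just knowing how the hardware interacts with the software.

```java
import java.util.concurrent.TimeUnit;

public class Locality {
    static final int N = 2048;

    // Row-major traversal: walks memory sequentially, so the CPU's
    // hardware prefetcher keeps the caches warm.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }

    // Column-major traversal: same arithmetic, but each access jumps
    // to a different row's backing array, defeating cache locality.
    static long sumColMajor(int[][] m) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = i + j;

        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColMajor(m);
        long t2 = System.nanoTime();

        // Identical results; the row-major pass is typically several
        // times faster on commodity hardware.
        System.out.println("sums equal: " + (a == b));
        System.out.println("row-major ms: " + TimeUnit.NANOSECONDS.toMillis(t1 - t0));
        System.out.println("col-major ms: " + TimeUnit.NANOSECONDS.toMillis(t2 - t1));
    }
}
```

Nothing here is patentable subject matter in any meaningful sense - it's textbook mechanical sympathy.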
Patents do nothing but restrict real commercial innovation in software, and only benefit the "first to file". How many times do we have to say it before governments eliminate software patents?
It's fairly clear that we can see beyond Map/Reduce to more sophisticated distributed processing. There are quite a few contenders for the next generation platform and it looks like this is yet another.
However, I don't think we're there yet. Most options are about reducing the pain of M/R when you've got iterative jobs and more complex workflows, but a lot of arguably unnecessary pain remains. At some point I'd expect a generic way to describe such workflows to emerge and become the de facto standard. For now, none of the proposals is so compelling that developers have stopped coming up with proposal n+1.
The standard already exists. It's Crunch. It will happily target MR2, Tez or Spark as its execution engine with a trivial change in code. No one in the Hadoop world writes plain old MapReduce any more unless they absolutely have to, and the reasons to do that are disappearing. Honestly very few developers in the serious end of the industry even use MapReduce at all now. Spark is the new standard (it's like HAMR but not patent encumbered and, you know, already in use) for everything. Pig announced a few days ago that all of its integration tests now pass for the Pig-On-Spark codebase, and Hive won't be too far behind.
Crunch - why not Cascading or Pig? The point is, there are a lot of options in this space right now, most of which are moving/already run on the new execution engines. I'm glad you feel you can call the 'winner' on this, but from where I'm sitting we're back in the era of fighting over which text editor to use. It all generates plenty of work for the 'serious' end of the industry (yeah, and our production cluster is bigger than yours), but jumping from framework to framework to keep up with the latest trends is not ultimately productive.
Why not Pig? Because no one knows Pig Latin. Pretty simple. It's a lovely language, nearly ideal for writing complex data pipelines, *but* it's not a language many people know. Meanwhile everyone knows Java, and it's a lot easier to plug custom code into Crunch (or, yes, Cascading) than it is Pig, because with Pig you've got to fall back to UDFs. Different tools for different jobs, really; Pig is inherently oriented towards structured data. Which is nice, but in that case it falls foul of the fact that it's a hundred times easier to find a good SQL developer than it is to find someone who knows Pig.
Why not Cascading? This is more nuanced, but it comes down to two main points. I love Cascading, so this isn't a criticism of the product, but these are the facts. The first is sheer penetration. Most Hadoop users are CDH customers, and Crunch comes packaged with CDH, so it's easier to get going with it. Coupled with that is tight integration with the rest of the stack, including the Kite SDK. Together they're a joy to work with. The second is a matter of data model. Cascading, like Pig, operates on Tuples. Crunch operates on higher-level PTypes or even POJOs. This makes it much more flexible, and in practice it is easier to encapsulate the functions within your data pipeline for unit testing, verification and reuse.
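To make the testability point concrete, here's a minimal plain-Java sketch (no Crunch dependency; the Order type and the labelling rule are invented for illustration). When a pipeline step operates on a POJO and is factored out as an ordinary function - the shape of logic you'd put inside a Crunch DoFn - you can unit test it against a local collection without a cluster, which is much harder when the step manipulates positional Tuple fields.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PojoPipeline {
    // A plain POJO of the kind a POJO-friendly data model lets you
    // process directly; a Tuple-based model would use positional fields.
    public static class Order {
        final String customer;
        final double amount;
        Order(String customer, double amount) {
            this.customer = customer;
            this.amount = amount;
        }
    }

    // The pipeline step is just a named, reusable function over the
    // POJO. Keeping the logic here - rather than buried in framework
    // plumbing - is what makes it easy to test, verify and reuse.
    public static final Function<Order, String> BIG_SPENDER_LABEL =
        o -> (o.amount >= 100.0 ? "big:" : "small:") + o.customer;

    // Applying the step to a local List stands in for running it over
    // a distributed collection in the real pipeline.
    public static List<String> label(List<Order> orders) {
        return orders.stream().map(BIG_SPENDER_LABEL).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order("alice", 250.0), new Order("bob", 12.5));
        System.out.println(label(orders)); // prints [big:alice, small:bob]
    }
}
```

The same function can then be handed to whatever execution engine you like; the business logic never needs to know.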