EMC's Pivotal Initiative division made a big splash last week with the launch of its Pivotal HD distribution of Hadoop. This is not a normal Hadoop distribution, but one that takes the parallel guts of the Greenplum database and reworks them to transform the Hadoop Distributed File System (HDFS) into something that speaks …
What's not to like?
“This cost model-based optimization is really something rare, and in fact, no one else in the industry working with Hadoop, working with query engines on top of Hadoop, has anything like this."
A very important point, me thinks.
To quote Steven Brobst, Teradata’s CTO, “you are the optimiser” when you write a Hadoop application. Business users are (usually) capable of specifying the ‘what’ via SQL but can’t be expected to figure out the ‘how’.
Most IT folks would struggle to figure out the optimal execution plan when developing an app to run in parallel on a large cluster. Even understanding the explain output from a parallel query optimiser can be tricky…especially for those pesky 20 way joins users like to concoct.
1 billion rows on a 60 node cluster is a measly ~17m rows/node. Assuming multi-socket and multi-core nodes, with each core running a separate database segment, that’s not a lot of rows per segment - maybe ~1.5m each for a 2 x 6 core node. There was probably nothing else running on there, so why did it take 13 seconds?!?!?
It’s no surprise that the Greenplum DBMS outperforms Impala, given that Greenplum was a standalone parallel database company before being acquired by EMC. Greenplum is also relatively mature compared to Impala. They also made some very smart hires early in the development process about 7-8 years ago to make sure they got things right.
“if you double the nodes you should be able to do the job twice as quickly”
While this should be true for any MPP architecture, whether it is always observed or not depends on the data demographics and query in question i.e. “your mileage may vary”.