Techies at Microsoft Research, the big brain arm of the software goliath, have taken the crown in the sorting benchmark world. The researchers are thinking about how to implement new sorting algorithms in the Bing search engine to give Microsoft a leg up on the MapReduce algorithms that underpin Google's search engine and other …
Say what you will about MS, but it's nice to see new stuff coming from Microsoft Research. The guys there do (and have always done) real computer science. Not touchy-feely UI design and the stuff power user zealots flame each other over, just real science.
google abandoned map reduce
news at 11
20 times as much bandwidth?
Full duplex 2Gbps per server? Doesn't seem like much. Is MS implying that the average data center server sits on a 100Mbps half duplex connection? That doesn't sound right. Maybe they are referring to a massively oversubscribed network, like 48x1GbE uplinked over a single 10GbE cable or some strange setup like that.
Even going back a couple of years, it did not seem outside the realm of possibility that people would have 10GbE all the way down to the server in some Hadoop clusters.
Given a small cluster size of 250 nodes it's trivial to provide non blocking bandwidth to that number of systems. Non blocking bandwidth to a few thousand nodes though is a bit more complicated, though not much with today's switching technology.
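To put rough numbers on the oversubscription guess above, here is a minimal Python sketch. The function names are made up for illustration, and the 48x1GbE-into-one-10GbE layout is only the hypothetical setup floated in this comment, not anything Microsoft has described.

```python
def oversubscription_ratio(ports, port_gbps, uplinks, uplink_gbps):
    """Total server-facing bandwidth divided by total uplink bandwidth."""
    return (ports * port_gbps) / (uplinks * uplink_gbps)


def worst_case_per_server_gbps(ports, port_gbps, uplinks, uplink_gbps):
    """Off-switch bandwidth per server if every port sends at the same time."""
    fair_share = (uplinks * uplink_gbps) / ports
    # A server can never exceed its own port speed, even on an idle uplink.
    return min(port_gbps, fair_share)


# Hypothetical top-of-rack switch: 48 x 1GbE server ports, 1 x 10GbE uplink.
ratio = oversubscription_ratio(48, 1.0, 1, 10.0)
share = worst_case_per_server_gbps(48, 1.0, 1, 10.0)
print(f"oversubscription {ratio:.1f}:1, worst case {share * 1000:.0f} Mbps per server")
```

Under those assumptions each server is left with roughly a fifth of a gigabit when the whole rack talks at once, which is at least in the neighbourhood of the 100Mbps figure that "20 times as much bandwidth" would imply for a 2Gbps link.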
So that is the problem with Win 2008 R2 file and print servers and why they are so slow at CIFS. I need to trade in the gigabit routers and get 2Gb ones.
Jim didn't work on DEC's servers; he was a research scientist known for his expertise in database systems. One of the best technical writers ever, he wrote a book on Transaction Processing that is one of the bibles of database design and theory. Jim worked at Tandem, IBM, DEC and Microsoft.
One of Jim's projects was TerraServer, the forerunner of Google Earth. To host complete satellite image coverage, his group had to buy military spy satellite images from Russia. He also worked with astronomers to develop a unifying database of sky survey images, which was only completed after his death (the WorldWide Telescope).
Jim was a great guy, and a great scientist, and his disappearance at sea caused a lot of sadness. Ironically, there was a large effort to search for his boat using online satellite images.
Why does this article read like a virtually unmodified Microsoft press release?
it's not. Engadget have the press release. It's way, way more boring.
Lies, damned lies, and statistics
As that old saying goes, "There are lies, damned lies, and then there are statistics" - sort of a redux of the old saw about MIPS == "Meaningless Index of Performance". There are Hadoop/MapReduce implementations that can blow MS figures out of the water. They tested their system against an out-of-date version of Hadoop/MapReduce. What does that prove? That they are faster than software that was out-of-date 2 years ago? In internet time, that is about 1000 years...
I will wait to see the scientific paper on this. A problem with these tests is that they mix hardware and software performance measurement. Gaining speed by increasing communication bandwidth (and decreasing latency, for preference) just gets the "duh" response it deserves.
The only way to see if two algorithms differ is to (i) do a proper complexity analysis (computing time and memory/bandwidth use) to see how each should scale theoretically (both in terms of data size and number of processors), and (ii) time optimized versions on the same hardware (or on several different sets of hardware), using a variable number of processors or nodes.
this is why Bingbot notoriously steals all the bandwidth from any site it's crawling?
FDS sounds like a re-hash of Dryad’s Distributed Filesystem (cosmos)
Doing a sort on any other key than the Hadoop distribution key is always going to be slower because the data must be re-distributed in one step then aggregated/merged in another and the second step must wait for the first to finish. Dryad on the other hand is pipelined and can work on all steps concurrently.. in much the same way as Teradata did twenty years ago.
In the big-data space you can’t slosh the data into multiple stores, you have to pick one and that means Microsoft’s FDS must work with Microsoft’s version of Hadoop (Daytona) and preferably full Apache Hadoop.. just like Teradata does.
The question then is whether FDS is faster than HDFS and whether Daytona is faster than Apache Hadoop..
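For anyone who wants to see the two-step sort described above in miniature, here is a rough single-process Python sketch. It is a toy illustration only: the function names and the range boundaries are invented for the example, and it is not how Hadoop, Dryad or FDS are actually implemented.

```python
import bisect


def partition_by_range(records, boundaries):
    """Split records into len(boundaries) + 1 buckets by key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key in records:
        buckets[bisect.bisect_right(boundaries, key)].append(key)
    return buckets


def two_phase_sort(node_data, boundaries):
    """Step 1: redistribute by sort key. Step 2: sort each destination."""
    received = [[] for _ in range(len(boundaries) + 1)]
    # Step 1: every node slices its local data and "sends" each slice
    # to the node that owns that key range.
    for local in node_data:
        for dest, chunk in enumerate(partition_by_range(local, boundaries)):
            received[dest].extend(chunk)
    # Barrier: step 2 only starts once every slice has arrived.
    sorted_parts = [sorted(part) for part in received]
    # Key ranges are disjoint and ordered, so concatenation is globally sorted.
    return [key for part in sorted_parts for key in part]


# Three pretend nodes holding unsorted keys; boundaries are hypothetical.
result = two_phase_sort([[5, 1, 9], [3, 8, 2], [7, 4, 6]], boundaries=[3, 6])
print(result)
```

The barrier between the two loops is the wait being complained about; a pipelined design would start sorting or merging slices as they arrive rather than after the whole shuffle has finished.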