120 posts • joined 3 Jul 2008
Re: Hive and Spark for Microsoft Hadoop?
/ * hadoop committer stevel; employee at cloudera competitor; speaking for self only; interpret/ignore comments as you will */
"Since Microsoft has adopted Hadoop as "their" standard Big Data Processing framework, will the company be updating to use Hive and Spark enabled Hadoop, and can these new Hadoop add-ons even run in a Microsoft environment?"
I don't know about this new work and don't intend to comment on it directly; no point in kicking the impala while it's down.
What I can say is that Microsoft have done a lot of work on Hive, using the skills of their SQL team to work on the query planner and execution, as well as their Dryad work which is reflected in Tez. All works on Windows Server and on Azure.
There's one other thing MS have done that's interesting: Excel integration with Hive and the HCat schema service --you can point Excel at any Hadoop cluster and issue queries with it. With the speedups of Hive 13 you can get fast results on datasets way bigger than Excel has ever supported before. Given that Excel is probably the most widely used end-user data analysis tool on the planet that's pretty sweet.
Interestingly IBM has been a lot less forthcoming on contributing code; I'm only aware of a few bug reports and patches related to IBM JVM compatibility, and some (immature) code to talk to the softlayer openstack storage layer. The usual "supports OSS/resists OSS" rules have changed at this layer of the stack -which is clearly a sign of cultural shift for Microsoft.
Re: I want one
Good point; can only be better than a smart TV "send all your data for adverts", and with a roku box at 99 pounds, a 21" tablet isn't too bad. iplayer, youtube, netflix, google hangouts,
In fact, a 37" version could be really impressive. While HP's record in hardware is a bit patchy at times, at least they are more experienced than TV vendors at building things with ethernet ports
Re: For shame
Even stranger: why does the First Great Western train app want to view my call history?
Permissions Manager does a good job of cranking these rights back -because Android doesn't
Re: For shame
Actually it's hidden in the policy that is only available on the TV (search terms don't find it online), viewable on 50 pages if you scroll down that "opt out settings" menu to find a menu option that is off the window, then select "legal". Everything bar the "beware of the leopard" sign
You could look at a big chunk of the grid schedulers: condor, platform, mesos and say "quelle difference?", but there are some
* designed to place work close to the data: your code can ask for specific machines & racks, with the scheduler trying to place it there, but if you say "best effort" then it will do it as close as it can network wise. This lets us run Hadoop without the high-cost SAN networks and so make storing petabytes of data affordable.
* designed for algorithms that have to handle failure. MapReduce does this by splitting up the work, retrying failed jobs, recognising slow machines and re-issuing the work -and even blacklisting the slow boxes. Those slow ones are the enemy as these stragglers slow everything down. Apache Tez can do checkpoints, then roll back to them. The Streaming algorithms need to replay the streams, which is a different problem.
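The "close to the data" placement described in the first bullet can be sketched roughly as below. This is an illustration only, not Hadoop's scheduler code: the `Node` class, the `place_task` function and the `topology` mapping (host name to rack name) are all hypothetical names invented for the example. It tries a machine that holds the data first, then any machine in the same rack, then anywhere with free capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A worker machine; both fields are illustrative, not a Hadoop type."""
    name: str
    rack: str


def place_task(preferred_hosts, free_nodes, topology):
    """Pick a free node for a task, as close to its data as possible.

    preferred_hosts: names of machines that hold the task's input data
    free_nodes:      list of Node objects with spare capacity
    topology:        dict mapping host name -> rack name (an assumption
                     of this sketch, standing in for real rack awareness)
    Returns (node, locality) where locality says how close we got.
    """
    preferred = set(preferred_hosts)

    # 1. Best case: a free node that already stores the data.
    for node in free_nodes:
        if node.name in preferred:
            return node, "node-local"

    # 2. Next best: a free node in the same rack as the data.
    preferred_racks = {topology[h] for h in preferred if h in topology}
    for node in free_nodes:
        if node.rack in preferred_racks:
            return node, "rack-local"

    # 3. "Best effort": anywhere with capacity, paying the network cost.
    if free_nodes:
        return free_nodes[0], "off-rack"

    return None, "no-capacity"
```

A caller that insisted on specific machines would simply reject anything other than "node-local"; the "best effort" case accepts whatever comes back.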
If you do go back to the 1980s era massively parallel designs, some of the architectures do look familiar. It's the scale that's different -a scale that makes failures a fact of life that everything has to handle, rather than a disaster that needs someone to be paged and your on-site HDD replacements (for which you pay a lot) wheeled out. Even so -there are lessons there that we should learn from. After all, aren't VMs and their hypervisors just descendants of VM/360 -which had billing in from the outset too?
Two corrections to this article
1. Google do not run Hadoop internally. They have Google FS, BigTable, Pregel and other things. The Apache Hadoop stack is evolving to be equivalent, but Google have their own stack, which predates much of Hadoop. The paper gets this right; it's the El Reg journalists who appear confused.
2. Bandwidth after the "MapReduce" stage is normally much less than ingress bandwidth. Hint, the word "reduce". This usually means squeezing down log data and the like to a smaller summary.
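To make point 2 concrete, here is a toy, single-process sketch of why the output of the reduce step is usually far smaller than the input. It is not Hadoop code; the log format ("client path status") and the function names `map_phase` and `reduce_phase` are assumptions made up for the illustration.

```python
from collections import Counter


def map_phase(log_lines):
    """Emit a (path, 1) pair for every well-formed log line."""
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1


def reduce_phase(pairs):
    """Sum the counts per key, collapsing many records into one per path."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical click log: one line per request.
log = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /index.html 200",
    "10.0.0.3 /about.html 200",
    "10.0.0.1 /index.html 304",
]

summary = reduce_phase(map_phase(log))
print(summary)
```

Four input records become two summary records here; on real click logs the ratio is normally much larger, which is why the egress side needs less bandwidth than the ingress side.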
Regarding the ingress/egress bandwidth, if all you are collecting is internal log data, you can predict the data rate (your daily click count, compressed), and its origin (your servers). Click log bandwidth will always be much less than site bandwidth, unless your site is something like bit.ly that just bounces 302 redirects back to the caller, in which case it's probably equal. Provided you keep the web servers near the Hadoop cluster, the cluster ingress bandwidth will be straightforward to handle.
The paper looks specifically at the problem of "classic" enterprises (i.e. pre-web), where systems are widely distributed for historical reasons; intra-enterprise traffic becomes a problem. This is probably the case when the application is itself distributed (telcos, banks). If your servers are scattered across 20 datacentres for historical reasons, you should consolidate down for cost reasons.
Despite these critiques of the article, the paper itself is pretty good.
not available in UK App Store
UK readers will be disappointed to know that this isn't in the UK App store "yet"
London & Hadoop
As currently the UK's sole Hadoop committer, I tend to get all the LinkedIn job invites related to Hadoop.
Every so often something hits me related to things in London, but if these are the ones Matt is talking about, the recruiters suck. Things like "exposure to some of the following, Linux, Java, Hadoop, Ocaml, Python, Haskell. ". Or some idiot phoning me up at work -interrupting my coding- to discuss something about spotify. No way to win friends there.
Ignore the technologies: get the statisticians.
last one sucked
Someone got my child the last version, it was dire. I wouldn't imagine any xbox owners feeling sorry about missing out on this, given how many good games you can get for your money on that platform
Bob the Dinosaur from Dilbert
They should call this specimen "bob" : "Dinosaurs aren't extinct, they're just hiding behind the furniture."
Turn off java in your browsers
The issue with Java is that client side java has a limited set of roles
-Java tooling for server-side development
-Sandboxed runescape gameplay
-malware breaking out of browser sandboxes
Java is just as bad as flash here, but unlike flash, even easier to live without in your browser.
Disable it in the browser; if you don't need real java apps, remove the JDK
If you look at the graph, Windows & .NET dominate; sharepoint is growing about as fast as the OSS technologies, and it's bigger. Makes it hard to conclude that OSS-skills are what you need. A breakdown of skill by region/company type would be interesting: is it the enterprises that want (windows, sharepoint, oracle) while it's the startups that want OSS? I wouldn't conclude that immediately, as if you do search indeed.com for Hadoop, you get a mix of web companies, telcos, media companies and others, implying it's a tool that fits a role in organisations. It's just not an end user tool the way sharepoint is
Yes, the JVM is popular, but it's not clear that the Java roadmap advocated by oracle is the future.
-the whole TCK debacle has shown up the Java Community Process to be as democratic as the Peoples Congress of the Union of Soviet Socialist Republics. Follow the leader or you are in trouble.
-The sluggishness of the Java7 project has given other languages: Scala, Clojure, JRuby and Groovy an advantage. These languages are better than Java7, work on Java6, and are developed in an open source process. To follow Java7 is to hand control back to Oracle.
Some of the really interesting stuff in Java -the Hadoop stack in particular- isn't being done with any participation from Oracle, let alone under the auspices of the JCP. There's no reason for the ASF to follow the Oracle strategies there.
Sponsored by General Motors?
This sounds lovely but appears to miss a few points
-a lot of that stress is caused by commuting, which is an artifact of congestion. You could add more roads, but then "Induced Demand" kicks in: people do more journeys or move further out into the suburbs
-there isn't enough space in an EU city for parking either
-misses out on other pollution artifacts such as CO2 or NOx, the latter being tangibly harmful.
CPU yes, but what about storage
Although the author asserts that running on existing hardware will lower the cost of big-data analytics, a key point of "Big Data" is not CPU-load, it is "you have lots of low value data to work with". Platform doesn't address that story; they may have better scheduling than Apache Hadoop's out the box schedulers, but their storage story is the same: run HDFS for location-aware storage.
No doubt IBM's story will become that of IBM's grid story: use GPFS, but that increases the cost of storage in exchange for location-independence, which limits the amount of data you can retain.
Hadoop != Linux
Maybe it's just the title but likening Hadoop to Linux is daft. Linux: OS. Hadoop: Java based data mining platform.
MS adopting is one of recognising that it is the de-facto standard for datamining outside of Google, and if they didn't want to lose the server sales *and all the developers* they'd better support it.
Confuses JVM releases with Java EE versions
Oracle may be pushing Java7 and talking about Java8, but Java EE 6 is the latest version of the Java Enterprise Edition specification. The author of the article has misunderstood things.
[This is not to be taken as an endorsement of Java7+, merely an observation on the article]
Why is Flash so vulnerable?
Adobe may be proud of the turnaround time on their 0-day exploits, but there's still a 72 h lag from every discovery to a fix -and there is at least one official patch a month, plus often an emergency patch.
Why are acroread and flash so vulnerable? They are attacked more often than the entire MS office suite?
Adobe need to get flash patches out because they fear that all OS vendors -not just Apple- will stop bundling flash, that all Browser vendors will disable it by default. I don't think the latter is a bad thing at all
DRAM failure rates
This is interesting, especially since the MS paper "Cycles, Cells and Platters" ( http://research.microsoft.com/apps/pubs/default.aspx?id=144888 ) provides evidence that overclocked machines are significantly more likely to show memory or HDD failures. Yes, you will get performance, but you'd better use higher end ECC memory (chip-kill/chip-spare) and plan for failures, as well as having an OS that is ready to handle the memory check reporting that comes with the Nehalem-EX architecture -the one that lets the OS blacklist memory pages that are playing up.
GPS is too embedded
Even if filtering can fix this, consider that GPS is now built in to phones, cars, watches. My latest compact camera has one in. It is also becoming an SPOF for the US transport industry. Lightsquared may point the blame at the GPS receiver vendors, but it seems to me that if they want to change the use of the adjacent band, they get to pay for all the upgrades and replacements of the existing devices.
Idle during the day?
Well, the 32kbps uplink for Bristol overloads from about 5pm during university term time. It doesn't matter how idle the cable network is at that point.
This only measures people switching through this company's service; when I switched I didn't use them. Furthermore, it doesn't measure people who don't switch.
Better to display the (weak) data as the #of people switching from and to a particular supplier, not this market share thing, which may be untrue. People may not be switching to O2, but they may not be switching from it either.
What it does seem to show is that people who switch from this app switch to 3. That's all
moving to a post-Oracle world
I thought the article was a bit bleak at first "Java is left behind", but the closing point is key: the Java world is moving beyond Oracle, beyond the enterprise. Big HDFS filestores running Hadoop and HBase: Java based, hosted in Apache. Spring? At SpringSource, and happily staying ahead of the EJB attempts to catch up. OSGi? Have Oracle stopped pretending it doesn't exist yet?
Oracle aren't playing in these worlds, and some of their key concepts "NoSQL, no app server, commodity servers" are the kind of thing that Larry must wake up screaming about
Maybe they could do a version of Flash for Linux that works
Look at the Mozilla crash stats for Firefox on Linux
Look at how often libflashplayer pops up there.
Adobe can't even write a flash plugin that works reliably on the main Linux web browser! What makes them think "Air" will be any better. And yes, while it works more reliably on Windows, how many times in the past two weeks have I had to update both my windows browsers with new flash versions? And new versions of Acroread.
If HTML5 kills Flash and delivers security and stability, as well as cross-platform operation, I'll be happy.
I'm not going to get into an argument of C++ vs Java, but note that if you have 1000 hadoop nodes, that gives you 12-24PB of storage. Regardless of performance, it's the storage that has an edge there, along with the layers on top.
that said, lots of room to improve hadoop performance and job startup time, contributions are welcome.
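The 12-24PB figure for 1000 nodes can be reproduced with some back-of-envelope arithmetic. The inputs here are assumptions, not numbers from the comment itself: 12 disks per node and 1-2 TB per disk are guesses that happen to match the total, and the replication factor of 3 is HDFS's default. The function name `raw_capacity_pb` is made up for the sketch.

```python
def raw_capacity_pb(nodes, disks_per_node, tb_per_disk):
    """Raw cluster capacity in decimal petabytes (1 PB = 1000 TB)."""
    return nodes * disks_per_node * tb_per_disk / 1000


# Assumed hardware: 12 disks per node, 1 TB or 2 TB each.
low = raw_capacity_pb(nodes=1000, disks_per_node=12, tb_per_disk=1)
high = raw_capacity_pb(nodes=1000, disks_per_node=12, tb_per_disk=2)

# With three copies of every block, the unique data you can
# keep is roughly a third of the raw figure.
replication = 3
usable_low = low / replication
usable_high = high / replication

print(f"raw: {low}-{high} PB, usable at 3x replication: {usable_low}-{usable_high} PB")
```

Even after replication that is several petabytes of unique data on commodity disks, which is the edge the comment is pointing at.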
If you look at the price difference, the key one is in OS licenses. But RHEL provide support. Oracle have their own downstream version of RHEL, which, if I wanted, I could get by running CentOS on anything, which is what people in the big data centres tend to do.
I wonder how much OS support oracle actually provide for their stack.
If it makes you feel better, the ASA told citroen they can't show a car advert that includes adults cycling without helmets in the UK before 9pm in case it made kids want to cycle without a helmet. That's despite the fact that it is not a legal requirement to wear a helmet
They are in a word: daft.
How secure is the router login?
One interesting question here: do all the routers have the same username & login, or different ones. If so, what is the password? That could be quite serious. If not the same, how is it generated after a firmware update and hard reset? It would have to be something predictable.
-I'm not sure HDD sizes will shrink as cost of storage server-side is still higher (power, capital, etc). SSD disks on laptops do save power, so there may be a trend there.
-I do agree with the closed world comments, as the desktop is going that way. Look at how macs don't have blu-ray, because to watch blu-ray disks would reduce the demand for i-tunes videos. Look how their thin laptops don't have flash (admittedly, there are security benefits). Look how they are backing off Java support, and adding an app store for the mac.
I can imagine MS selling locked down machines citing end user experience and security as the reason. Microsoft never attempted that.
Although I wouldn't code for it myself, Mono has a big place in the gnome toolchain. It is better integrated with the OS than Java (which pretends all platforms look the same), and can be used to produce high quality code.
-python and ruby and the like are also fast enough on modern machines, and even easier to deal with
-Google's Dalvik runtime has given the Java language a boost on phones, and removes the need for Mono there.
-Apple are very fussy about what they let run on their phones -witness their treatment of adobe's Flash runtime/cross compiler.
I wish them well, but fear that between android and apple, it's going to be hard
Oracle and ASF lawsuit
The ASF doesn't have the money for a lawsuit:
"Through the JSPA, the agreement under which both Oracle and the ASF participate in the JCP, the ASF has been entitled to a license for the test kit for Java SE (the "TCK") that will allow the ASF to test and distribute a release of the Apache Harmony project under the Apache License. Oracle is violating their contractual obligation as set forth under the rules of the JCP by only offering a TCK license that imposes additional terms and conditions that are not compatible with open source or Free software licenses"
I don't think the author of this article should be allowed to write about Apache Hadoop -it's painful to read. I hope nobody actually believes a word this person says.
1. The only official release of Apache Hadoop comes from the Apache Software Foundation, the last of the 0.20 releases, 0.20.203 came out yesterday with lots of bug fixes from Yahoo! and Cloudera in it.
2. Any other so called "distribution" of Hadoop is not "a distribution" unless it is just the Apache release packaged for easy installation (as Thomas Koch does for debian) -it is a derivative work, containing code that is not in the Apache release.
3. Such derivative works can be open source (Cloudera) or closed source (EMC, IBM).
4. Any closed source derivative work forces the distributor to maintain their branch indefinitely.
5. Any derivative work forces the developer to test at the same scale as Y! and Facebook (thousands of machines, tens of PB of storage), or they cannot claim that it scales up.
6. Any closed source derivative work will only support bug fixes and patches at a rate determined by the closed source developer team, and provided at a cost determined by the price of that developer team.
7. Apache only provide support for the official apache release. If you use Cloudera or EMC: go talk to them about problems.
8. People who are not part of the Apache developer and user community do not get their needs addressed in the Apache releases, because we are unaware of them.
9. We, the apache developer team, have no need to take on random patches from developers of closed source derivative works unless we can see tangible benefits.
10. Finally, any derivative work that pulls out large amounts of the Hadoop codebase (e.g Brisk, EMC Enterprise HD) cannot call themselves a version of Hadoop. They are not. We, the apache community define the interfaces and what "100% compatible" means. When someone like EMC declare their derivative work is "certified 100% compatible", that is a meaningless statement. Only the official Apache Hadoop release is, implicitly, 100% compatible with Apache Hadoop.
11. We reserve the right to change the semantics and interfaces to meet the community needs, on the schedule that suits the development community.
12. The rules of using the term "Hadoop" are defined in the Apache license, and it is not legal to say "a distribution of Hadoop" if it is in fact a derivative work. This is why Cloudera call their software "Cloudera’s Distribution including Apache Hadoop". EMC, Brisk and others are sailing close to the wind here.
13. The fact that Oracle are now subpoenaing Apache in the Oracle/Google lawsuit means that the relationship between Oracle and Apache has reached a low point -even after Apache left the Java Community Process due to Oracle's unwillingness to meet its legal requirements to provide the Technology Compatibility Kit without imposing Field of Use restrictions.
14. Because of (#13), it's hard to see a team of Oracle developers being trusted or welcome in the Hadoop community. You can't serve subpoenas on the ASF and then say "we'd like to help develop a technology of yours that threatens our entire business model and margins". They won't be trusted.
I have a term for the EMC-style not-quite-Hadoop products that use the same interfaces but offer unknown semantics and a cost model on a par with the vendor's existing enterprise product line. It is "Enterprisey Hadoop". This is not Apache Hadoop supported in the Enterprise, it is some derivative work that pretends to be Hadoop but misses the point about affordable scalability through commodity hardware and an open source codebase.
SteveL, Apache Hadoop Committer. All comments are personal opinions only, etc.
KK says that Oracle don't have the right to donate his six months worth of post-oracle code to Eclipse, because that would require a change in the code license to the Eclipse Public License.
If true, that will make things harder for Oracle.
I'll miss the trowbridge team
While it'd be nice to have integrated billing, I have spoken to the trowbridge people on the phone and they were always helpful and competent. Something to treasure.
Of course, now I've moved to 3 for their data, it's less of an issue
Badly informed article
This article is painfully bad. Unlike JBoss (open source, most contributors work for JBoss), Spring, MySQL, Apache Hadoop is managed by apache, and that organisation's structure is designed (somewhat) to prevent a single vendor dominating. Yes, Cloudera can do its fork, so can IBM, but then they both take on the problem of testing at the scale of 1000+ servers, servers with 12TB of storage each.
And do you know who has that kind of storage to play with? Yahoo! and Facebook. Nobody else can test at that scale -even though others (Apple?) may want to play at that scale. What the Y! team can do is focus on the large scale datacentre problems, the ones where the cloudera licensing fees are too much (hey, these datacentres run CentOS to avoid paying for 1000 RHEL licenses). With the current Cloudera support license, it's cheaper to hire an ex-yahoo! person.
The other thing I'd like to point out -as a Hadoop Committer- is that while Cloudera has some excellent Hadoop developers -Doug, Tom White, Todd, Aaron, Konstantin, to name some key ones- they don't own the Hadoop developer world. There's the LinkedIn people, the Facebook people, lots of little startups who are busy filing bugs against it. There are the people adding layers on Hadoop, things that aren't yet mainstream (Hama for Graphs, Mahout for machine learning), there are the people working on Hadoop-compatible filesystems. It's open source, anyone can play, we welcome the users, we welcome the bug reports, and we welcome patches especially if they come with Junit tests. The whole MR2.0 engine is coming out of Yahoo! and it looks a great place to play. Come join us!
SteveL at apache dot org
Cost of death
you forget that the cost of KSI is usually measured in millions of pounds. It's cheaper for people not to die.
Y! run different clusters for scale
Whoever said that Y! run >1 cluster so they can submit multiple jobs is ill informed. Yahoo! have multiple clusters because the current scale of the HDFS filestore tops out at 25-30 PB, and putting Platform's code on top of that will not remove that limitation.
The MR engine can schedule multiple jobs, and can even prioritise work from different people. It too has a scale limit of about 4K servers, and even that requires tuned jobs to avoid overloading the central Job Tracker.
If you do want to know more about what Hadoop's limits are, you are welcome to get in touch with me, a committer on the Apache Hadoop project, as otherwise you will end up repeating marketing blurb from people who have a vested interest in discrediting a project that is tested at Y! and Facebook scale, is free, and which has shown up fundamental flaws in the "classic" Grid frameworks, namely their reliance on high cost SAN storage limits their storage capacity, and hence their ability to work with Big Data problems. It is good that the Platform people now have a story to work with lower cost storage than GPFS -by using Hadoop's own filesystem- but I'm not sure then why you need to pay the premium for Platform over the free version of Hadoop. That of course is the other flaw in the classic frameworks...
Spindle:CPU ratio bad for Hadoop
I'm putting on my Hadoop committer hat and noting some things about it on this box -independent of any other HPC uses-
1. Ignoring point (3) below, you don't need to "port" Apache Hadoop to the system provided you can bring up RHEL and Java on it, ideally 64-bit JVM from Sun, that being the only one that the Hadoop team opt to care about.
2. There's not enough storage. 24 HDDs for that many CPUs? The current generation of Hadoop servers put 12x 3.5" HDDs in a 1U rack with 6-12 x86-64 cores, giving a ratio of 1 CPU to 1 or 2 HDDs. That's massive storage capacity and good IO bandwidth, with good CPU. Why? Storage capacity with some local datamining is the driving need. It's why HDD and not SSD is the storage, it's why 3.5" disks are chosen over 2.5". It brings your cost/petabyte down.
3. The use of independent servers gives you better failure modes. If you built a rack out of these systems, you would need to somehow change Hadoop's topology logic to know that a set of servers are inter-dependent, so that copies of blocks of the files (usually 128+ MB blocks) are not stored on server instances in the same physical server. There's been discussion of making the placement policy pluggable, so Quanta could write a new Java class to implement placement differently, but as the plugin interface isn't there yet, they can't have done so.
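The constraint in point 3 (never put two copies of a block on server instances that share one physical box) can be illustrated with a small sketch. It is written in Python purely for illustration and is not Hadoop's placement code or its plugin interface, which the comment says did not exist yet; `choose_replica_targets` and the `chassis_of` mapping are hypothetical names.

```python
def choose_replica_targets(candidates, chassis_of, replicas=3):
    """Greedily pick servers so that no two replicas share a chassis.

    candidates: server names in order of preference
    chassis_of: dict mapping server name -> physical chassis name
                (an assumption of this sketch; real topology data
                would have to come from the cluster configuration)
    replicas:   number of copies of the block to place
    """
    chosen = []
    used_chassis = set()
    for server in candidates:
        chassis = chassis_of[server]
        if chassis in used_chassis:
            # A copy already lives in this physical box; losing the
            # box would lose both copies, so skip this server.
            continue
        chosen.append(server)
        used_chassis.add(chassis)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough independent chassis for the requested replicas")
```

With servers s1 and s2 in the same box, a three-copy placement over s1..s4 skips s2, and asking for more copies than there are independent boxes fails rather than quietly doubling up.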
Which part of the UK are they talking about
We'd like to see some definitive evidence that IBM either reduced journey times, made it safer for people walking and cycling, improved public transport or reduced city centre pollution before taking any claim from IBM that they solved the UK's traffic problems.
Maybe the harsh truth they dare not say is that all they can do is damage limitation. In which case, come out and say it.
Not in BS6 during term time
Go search for "virgin media BS6" and see that VM quality depends purely on load, and in some parts of the country they are overloaded.
Why no WinXP support
I can see a few reasons for MS abandoning WinXP support
* Technical : directX integration with the windows rendering system is a vista onwards feature
* Technical: older hardware doesn't have the graphics hardware to justify the feature
* tactical: WinXP goes unsupported soon, which is another word for "less secure than ever before". It's not worth the expense
* Strategic: MS want you to upgrade, that's how they make money.
Of course, the end of WinXP is also an opportunity to move to Linux on the existing hardware. It's a shame that Ubuntu 10.10 and RHEL are getting as overweight as Win7 -they're trying to keep their UIs as cool, and that means the memory footprint of the base image is up in the 800MB range. Not for old machines either.
Returning to IE9, inconsistency with the other browsers will always be a problem -unless they fix this, IE will remain something for people who don't know any better, and for interaction with corporate sites stuck in IE6 hell.
the site pushes IE even to mac and linux clients
The ie countdown site encourages you to go and get a new IE but if you follow the links it tells OS/X and Linux users that first they need to get Windows 7. Missing the point, which is that we are already ex-IE6 users.
worse than that
the primary cause for falling pedestrian deaths in the UK is the fall in the #of people walking, especially on the school run. The fewer pedestrians you have, the fewer to get run over. This isn't an improvement in safety, it's a failure of the country's transport system in favour of one transport option (motor vehicles) over all others.
IF you go that way, you have to look at total input
If you start saying some road users have more rights to the road based on how much tax they paid "into the pot", then anyone on 40% tax should have right of way over anyone in the normal tax band, and anyone earning in the 50% band should have rights over everyone else. And any of us walking or cycling who has left their car at home should have some hi-viz top to say "don't run me over, I paid for these roads"
If someone in the 50% tax band was, say, walking, they should have right of way over anyone driving who is on benefits. And only smokers should be allowed into NHS facilities.
VED, fuel and VAT duties on cars aren't hypothecated because we all share everything else the government does.
Can I just say I'm appalled by how ill informed this article is. Either the author or the RAC haven't noticed the near doubling in car volumes in past 15 years, and attempted to correlate that with congestion. Instead they blame traffic lights and a lack of new roads. But where do new roads go in our inner cities? They don't unless you knock them down and move the people who lived there to tower blocks. We have a name for that -Glasgow-and it still has its traffic jams.
And rule 170?
Highway code Rule 170 says turning cars are meant to give way to pedestrians who are already out there: "watch out for pedestrians crossing a road into which you are turning. If they have started to cross they have priority, so give way"
Anyone being strict about zebra crossing rules must also be expected to stop for pedestrians already crossing. Try that in London and the taxi behind will go into the back of you, then get irate.
Virgin Cable sucks in Bristol on 10 MBps
Everyone here is blaming the router, but if you type in Virgin+Broadband+BS6 you can see that in some parts of the country capacity is overloaded on the cable network, and there is no point whatsoever upgrading to anything other than 10 Mbps, and if you want to use your network in the evenings you ought to consider an alternative provider.
Even with a direct laptop connection to the cable modem I was getting serious DNS packet loss and timeouts on post. Once you add in wifi, you are doomed.
kindle may still work
-amazon will have the billing and auth services, and even if apple want 30% of all in-app purchases, there's nothing to stop you browsing over to amazon.com in safari, logging in and buying some kindle books there, books that could trickle over.
That said, Mr J probably views all customers who buy stuff online on one of his browsers as something he deserves a commission on. How very AOLy
ANPR and insurance
When they do ANPR checks on the severn Bridge, they have enough time to detect uninsured cars and notify the police at the far end of the bridge, who can then pull the car over. This can reduce the #of uninsured cars and unlicensed drivers, which is no bad thing.
That said, the DVLA vans that drive round Bristol and check for untaxed cars seem to do a good job of finding untaxed vehicles; if they also checked for whether a vehicle was currently insured (trickier than you think as askmid can lag), they could also clamp them.
I agree with the others: it's the update process
MS has no way to let you subscribe to updates from third parties. Every updater runs on startup, takes up CPU/memory, has its own problems (security, etc). Many of the updaters push out features for strategic reasons, not because you need them. Example: google updater pushing out the new web video codec for IE and firefox, iTunes updater adding MS Outlook plugins for contacts, etc. Nobody trusts them
OS/X isn't that much better, believe me.
That's why I like Ubuntu Linux. Not for its inconsistent usability, but because I know that when I do a weekly update and reboot, it is up to date.
Hadoop cluster requirements
MacMinis are way behind on CPU and storage. You can get 12 HDDs in a 1U, with 8-12 cores, from a couple of x86 vendors; these are cutting-edge in Hadoop hardware, with gigabit ethernet to the Top of Rack switch and then 10 Gbits from there. Having bigger worker nodes increases the likelihood of finding a slot for work by the data, and with multiple HDDs work can use one for input, one for output and one for intermediate (overspill) data.
Pretty much every production cluster claims to use Linux -usually RHEL or CentOS 5.x-, and the Sun JVM, and rarely the latest edition. Filesystem: ext3 with the noatime option. These big clusters try and stay in sync in OS/JVM versions as everyone wants to avoid finding bugs first. Some people (linkedin) use Hadoop on Solaris.
Hadoop is set up to build on the mac, and it's easier than on windows, where you need cygwin installed. Nobody admits to running Hadoop in production on windows or MacOS, because of cost and because you get to find the bugs yourself -and fix them. And of course, even if Apple are secretly doing their own high end server motherboards with the disks and CPU to compete with the datacentre-specific kit, apple would have to port their OS to their own or purchased hardware, with even more debugging fun.
Assume, therefore: Linux on hardware from somebody who can do proper datacentre kit. That is, unless the apple hardware team have just told the ops team that they need to come up with a plan to mount 1000 mac pro boxes on their side in an earthquake-safe form. Oh, and they need to get 5-11 extra disks into each box.