* Posts by Steve Loughran

130 publicly visible posts • joined 3 Jul 2008


Virgin posts increase in profits and sales amid 900 jobs chop

Steve Loughran

it started years ago

Those "congratulations, the speed you won't ever see has been increased (oh, and we are increasing your monthly bill even though you won't see the speedup)" missives had been coming for years. Switching back to BT Infinity has worked well, even though their hubs are stunningly awful.

The yummy mummies' likeable wagon: Nissan Qashqai Tekna

Steve Loughran

Re: Kumquat

This is actually a documented phenomenon of the raised seating position of this and other vehicles: you get a bigger blindspot to the sides, and that changes your expectations.

Here in inner Bristol anyone with functional wing mirrors is viewed as being in a position of weakness when it comes to negotiating priorities in narrow roads; the trend towards fatter cars -both oncoming and parked- makes things worse. Qashqai drivers don't appear any worse than any other "urban SUV" driver, whatever that means.

Roku 3: Probably the best streaming player on the market ... for now, at least

Steve Loughran

Re: Roku Media Player

Works well with MiniDLNA; it will downmix 5.1 audio to stereo for a stereo TV too.

Steve Loughran

Re: US or world wide?

Works nicely in the UK, has a great iPlayer client and can play YouTube videos too.

In-house it'll find DLNA servers and stream down the content there, with the Linux ones working perfectly with it. There's a USB port to play content from too, or just use it to charge things.

I have the older model, with the older remote. It's got the headphone socket, just no netflix/amazon buttons. What's not covered is how intuitive this remote is, even in the dark you can use it nicely. Way easier to use than any smart tv I've come across.

Flaws:

1. It doesn't always boot when powered off -spend time chasing this and you end up concluding it's best left on. It drains very little power, but it's still annoying.

2. The remote is wifi. That avoids having to point the remote at the telly, but it does mean more risk of clashing with your house wifi channel(s).

Reader suggestion: Using HDFS as generic iSCSI storage

Steve Loughran

HDFS: the subset of posix it implements

HDFS lacks a couple of features which people expect from a "full" posix filesystem

1. The ability to overwrite blocks within an existing file.

2. The ability to seek() past the end of a file and add data there.

That is: you can only write to the end of a file, be it new or existing.

Ignoring details like low-level OS integration (there's always the NFS gateway for that), without the ability to write to anywhere in a file, things are going to break.
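To make that contract concrete, here's a toy model in plain Python (not real HDFS client code -the class and method names are made up for illustration): the only legal write position is the current end of the file.

```python
class AppendOnlyFile:
    """Toy model of HDFS's write contract: new data may only be
    added at the current end of the file -- no overwriting existing
    blocks, no seeking past EOF to create sparse regions."""

    def __init__(self):
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        # The one write operation HDFS offers: extend the file.
        self._data.extend(chunk)

    def write_at(self, offset: int, chunk: bytes) -> None:
        # What a full posix filesystem allows and HDFS does not:
        # any offset other than the current EOF is refused.
        if offset != len(self._data):
            raise OSError("append-only: writes must start at the current EOF")
        self._data.extend(chunk)


f = AppendOnlyFile()
f.append(b"log line 1\n")
try:
    f.write_at(0, b"XXX")  # overwrite within the file: refused
except OSError as e:
    print("rejected:", e)
f.write_at(len(b"log line 1\n"), b"log line 2\n")  # append at EOF: fine
```

Anything layered on top -iSCSI included- has to be rewritten in terms of that append-only operation, which is exactly the log-structured trick mentioned below.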

There's also a side issue: HDFS hates small files. They're expensive in terms of metadata stored in the namenode, and you don't get much in return. That's why filesystem quotas include the number of entries (I believe; not my code).

What then? Well, I'd be interested in seeing Greg's prototype. Being OSS, Hadoop is open to extension, and there's no fundamental reason why you couldn't implement sparse files (feature #2) and writing within existing blocks via some log-structured writes-are-always-appends scheme: it just needs someone to do it. Be aware that the HDFS project is one of the most cautious groups when it comes to changes; protection against data loss comes way, way ahead of adding cutting-edge features.

Without that you can certainly write code to snapshot anything to HDFS; in a large cluster they won't even notice you backing up a laptop regularly.

One other thing Hadoop can do is support an object store API (Ozone is the project there), so anything which uses object store APIs can work with data stored in HDFS datanodes -bypassing the namenode (not for failure resilience, but to keep the cost of billions of blob entries down). Anything written for compatible object storage APIs (I don't know what is proposed there) could back up to HDFS, without any expectation that this is a posix FS -an expectation that would be untrue: I regularly have to explain to people why Amazon S3 isn't a replacement for HDFS.

To close then: if this is a proof of concept, stick the code on github and everyone can take a look.

stevel (hadoop committer)

Bristol’s ‘Smart City’ reserved for boffins. Sorry bumpkins

Steve Loughran

Re: more useful things than networking

I do now have a use of the shopping trolley data.

If combined with a matching dataset of homeless people you could use it as a new navigation layer in google maps: "turn left after the homeless person and continue until you reach the shopping trolley. Your destination is on the right".

But without those live feeds of shopping trolleys or homeless people, it's just not useful enough.

Steve Loughran

more useful things than networking

We aren't actually bumpkins here, and there are lots of interesting things going on alongside Just Eat. As well as the people working out how to blow things up from the air (MoD, BAe, etc), we've got a nice little set of big data companies, Cray setting up shop and -for better or worse- Oracle. All R&D stuff.

Come over and eat at some of our famous eateries like Slix of Stokes Croft (check the Yelp reviews), then enjoy some of our fine beverages, which cover more than just cider. Oddly enough, Bristol's most profitable local "craft product" is actually homegrown ganja, but the "green city" event doesn't seem to highlight that.

Regarding the Smart City, there's already enough bandwidth for recreational needs, provided you aren't on Virgin Media in the parts of the city where students live. If their bittorrent feeds can be moved onto this new network, all will be well.

Where it does fall down is that it has no interesting data sources. London, especially TfL, has some fantastic historical CSV files as well as live feeds.

Bristol? There's a downloadable spreadsheet of shopping trolleys found in rivers:

http://data.gov.uk/dataset/abandoned-shopping-trolleys-bristol-rivers

What are you meant to do with that? Write a "trolley-watch" app for your iwatch that pops up to tell you whenever you are within 300 metres of an abandoned Shopping trolley? Integrate with OpenStreetMap for a live trolley-viewer web site?

Without data, the open city is useless, irrespective of bandwidth. And whatever is being collected, there's no sign of it being made publicly downloadable.

Trouble comes in threes: Yet ANOTHER Flash 0-day vuln patch looming

Steve Loughran

uninstall it -and hope chrome keeps up to date

As Google Chrome builds Flash in, if you have Chrome, you have to rely on Google to keep it up to date.

And if you do have Chrome installed, then every other browser you have is just going to have to learn that Flash is uninstalled. Just do it! One walk round the house cutting it from 3 laptops and 2 desktops and my life is better. I don't have to worry about these 0-days, just despair at Adobe's eternal insecurity.

Jammin', we know you hate jammin' too: Marriott U-turns on guest Wi-Fi ban

Steve Loughran

Re: They have a point IF and only IF…

You'd know something was spoofing a Marriott hotspot if

-it gave some decent bandwidth

-you didn't have to go through the "no, I only want the free rate" dialog box every 24h

-a dialog box designed to be unreadable on a phone.

More succinctly "any time a Marriott wifi gives you a good user experience -it's a malicious base station"

Incidentally, they disable the HDMI ports on their TVs, putting them into "hotel mode", just in case you want to plug in a laptop or Chrome dongle and watch content of your own, rather than pay for some on-demand cruft coming from a Betamax player behind the front desk.

TalkTalk customers demand opt-out fix for telco's DNS ad-jacking tactics

Steve Loughran

Focus on HTTP/Web breaks everything else

One issue with all these "helpful DNS" services is that they break applications other than browsers -applications that expect unresolvable domains and hosts to fail. It also breaks applications that expect to get XML or JSON back, rather than some HTML crud.

This surfaced when Verisign tried to roll out a similar service on the root domains: every SOAP stack failed in different ways when they tried to handle the output.

http://www.xml.com/pub/a/ws/2003/10/28/sitefinder.html

There's also the fact that example.com, example.net and example.org are reserved by the IETF, which is something I've relied on in tests in OSS projects. Tests that turned out to fail on Verizon fibre connections, because ISPs getting search revenue is more important than working applications.
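A quick way to see whether your resolver honours lookup failures is to resolve a name under `.invalid`, which RFC 2606 guarantees will never exist. A minimal sketch (the hostname is made up; the result obviously depends on whichever network you run it on):

```python
import socket

def resolves(hostname: str) -> bool:
    """True if the local resolver returns an address for hostname."""
    try:
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        return False

# RFC 2606 guarantees names under .invalid never resolve. A "helpful
# DNS" service that returns an ad server's address here is rewriting
# NXDOMAIN responses -- the behaviour that broke those SOAP stacks
# in the SiteFinder incident.
if resolves("does-not-exist.invalid"):
    print("resolver is rewriting NXDOMAIN responses")
else:
    print("NXDOMAIN comes back cleanly")
```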

Cloudera, MapR, IBM, and Intel bet on Spark as the new heart of Hadoop

Steve Loughran

Re: Hive and Spark for Microsoft Hadoop?

/* hadoop committer stevel; employee at cloudera competitor; speaking for self only; interpret/ignore comments as you will */

"Since Microsoft has adopted Hadoop as "their" standard Big Data Processing framework, will the company be updating to use Hive and Spark enabled Hadoop, and can these new Hadoop add-ons even run in a Microsoft environment?"

I don't know about this new work and don't intend to comment on it directly; no point in kicking the impala while it's down.

What I can say is that Microsoft have done a lot of work on Hive, using the skills of their SQL team to work on the query planner and execution, as well as their Dryad work, which is reflected in Tez. All of it works on Windows Server and on Azure.

There's one other thing MS have done that's interesting: Excel integration with Hive and the HCat schema service --you can point Excel at any Hadoop cluster and issue queries with it. With the speedups of Hive 13 you can get fast results on datasets way bigger than Excel has ever supported before. Given that Excel is probably the most widely used end-user data analysis tool on the planet that's pretty sweet.

Interestingly, IBM has been a lot less forthcoming on contributing code: I'm only aware of a few bug reports and patches related to IBM JVM compatibility, and some (immature) code to talk to the SoftLayer OpenStack storage layer. The usual "supports OSS/resists OSS" rules have changed at this layer of the stack -which is clearly a sign of a cultural shift for Microsoft.

SteveL

HP’s ENORMO-SLAB: The Slate 21 MONSTER tablet

Steve Loughran

Re: I want one

Good point; it can only be better than a smart TV's "send all your data for adverts", and with a Roku box at 99 pounds, a 21" tablet isn't too bad. iPlayer, YouTube, Netflix, Google Hangouts...

In fact, a 37" version could be really impressive. While HP's record in hardware is a bit patchy at times, at least they are more experienced than TV vendors at building things with ethernet ports

You THINK you're watching your LG smart TV - but IT's WATCHING YOU, baby

Steve Loughran

Re: For shame

Even stranger: why does the First Great Western train app want to view my call history?

Permissions Manager does a good job of cranking these rights back -because Android doesn't

Steve Loughran

Re: For shame

Close

Actually it's hidden in the policy that is only available on the TV (search terms don't find it online), viewable over 50 pages if you scroll down that "opt out settings" menu to find a menu option that is off the window, then select "legal". Everything bar the "beware of the leopard" sign.

http://steveloughran.blogspot.co.uk/2013/11/television-viewing-privacy-policies-and.html

http://www.flickr.com/photos/steve_l/sets/72157637867348596

Hadoop 2 stampedes onto world's mega compute clusters

Steve Loughran

sort of

You could look at a big chunk of the grid schedulers -Condor, Platform, Mesos- and ask "quelle différence?", but there are some. Hadoop is:

* designed to place work close to the data: your code can ask for specific machines & racks, with the scheduler trying to place it there, but if you say "best effort" then it will do it as close as it can network-wise. This lets us run Hadoop without the high-cost SAN networks and so makes storing petabytes of data affordable.

* designed for algorithms that have to handle failure. MapReduce does this by splitting up the work, retrying failed jobs, recognising slow machines and re-issuing the work -and even blacklisting the slow boxes. Those slow ones are the enemy as these stragglers slow everything down. Apache Tez can do checkpoints, then roll back to them. The Streaming algorithms need to replay the streams, which is a different problem.

If you do go back to the 1980s-era massively parallel designs, some of the architectures do look familiar. It's the scale that's different -a scale that makes failures a fact of life that everything has to handle, rather than a disaster that needs someone to be paged and your on-site HDD replacements (which you pay a lot for) wheeled out. Even so, there are lessons there that we should learn from. After all, aren't VMs and their hypervisors just descendants of VM/370, which had billing in from the outset too?
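The retry-and-blacklist idea above can be sketched in a few lines. This is a toy scheduler, nothing like the real MapReduce/YARN code -all the names and the round-robin worker choice are my own invention for illustration:

```python
def run_with_retries(task, workers, max_attempts=3, blacklist_after=2):
    """Toy MapReduce-style scheduling: retry a failed task on another
    worker, and blacklist workers that keep failing -- the slow and
    flaky boxes are the enemy, so stop giving them work."""
    failures = {w: 0 for w in workers}
    for attempt in range(max_attempts):
        # only consider workers that haven't failed too often
        live = [w for w in workers if failures[w] < blacklist_after]
        if not live:
            break
        worker = live[attempt % len(live)]
        try:
            return worker(task)
        except Exception:
            failures[worker] += 1
    raise RuntimeError("task failed on all attempts")


attempts = {"count": 0}

def sometimes_slow(task):
    # fails on its first run, succeeds on the retry
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TimeoutError("straggler")
    return task.upper()

print(run_with_retries("wordcount", [sometimes_slow]))  # prints WORDCOUNT
```

The real systems add speculative execution on top: launching a duplicate of a slow task elsewhere and taking whichever copy finishes first.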

Mighty WAN pumper offered in the struggle to cope with Big Data

Steve Loughran

Sanity Checks

Two corrections to this article

1. Google do not run Hadoop internally. They have Google FS, BigTable, Pregel and other things. The Apache Hadoop stack is evolving to be equivalent, but Google have their own stack, which predates much of Hadoop. The paper gets this right; it's the El Reg journalists who appear confused.

2. Bandwidth after the "MapReduce" stage is normally much less than ingress bandwidth. Hint: the word "reduce". This usually means squeezing down log data and the like to a smaller summary.
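A minimal illustration of why the output is so much smaller than the input: collapsing raw click-log lines into per-URL counts (the log format here is made up, purely for illustration):

```python
from collections import Counter

def reduce_clicks(log_lines):
    """The 'reduce' in MapReduce: collapse a raw click log into a
    small per-URL summary. The egress data (the summary) is far
    smaller than the ingress data (the raw log)."""
    counts = Counter()
    for line in log_lines:
        # assume one request per line, e.g. "GET /index.html"
        counts[line.split()[-1]] += 1
    return dict(counts)


raw = ["GET /index.html", "GET /about.html", "GET /index.html"] * 1000
summary = reduce_clicks(raw)
print(len(raw), "log lines ->", len(summary), "summary rows")  # 3000 -> 2
```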

Regarding the ingress/egress bandwidth, if all you are collecting is internal log data, you can predict the data rate (your daily click count, compressed), and its origin (your servers). Click log bandwidth will always be much less than site bandwidth, unless your site is something like bit.ly that just bounces 302 redirects back to the caller, in which case it's probably equal. Provided you keep the web servers near the Hadoop cluster, the cluster ingress bandwidth will be straightforward to handle.

The paper looks specifically at the problem of "classic" enterprises (i.e. pre-web), where systems are widely distributed for historical reasons; intra-enterprise traffic becomes a problem. This is probably the case when the application is itself distributed (telcos, banks). If your servers are scattered across 20 datacentres for historical reasons, you should consolidate down for cost reasons.

Despite these critiques of the article, the paper itself is pretty good.

Arcade emulator MAME slips under Apple radar

Steve Loughran

not available in UK App Store

UK readers will be disappointed to know that this isn't in the UK App store "yet"

Tempt tech talent without Googlesque mega perks

Steve Loughran
FAIL

London & Hadoop

As the UK's sole Hadoop committer (currently), I tend to get all the LinkedIn job invites related to Hadoop.

Every so often something hits me related to things in London, but if these are the ones Matt is talking about, the recruiters suck. Things like "exposure to some of the following: Linux, Java, Hadoop, Ocaml, Python, Haskell". Or some idiot phoning me up at work -interrupting my coding- to discuss something about Spotify. No way to win friends there.

Ignore the technologies: get the statisticians.

Next Dr Who game to leave Xboxers in the cold

Steve Loughran
Thumb Down

last one sucked

Someone got my child the last version, it was dire. I wouldn't imagine any xbox owners feeling sorry about missing out on this, given how many good games you can get for your money on that platform

New species of dinosaur discovered... in museum

Steve Loughran

Bob the Dinosaur from Dilbert

They should call this specimen "bob" : "Dinosaurs aren't extinct, they're just hiding behind the furniture."

Java tops for hackers, warns Microsoft

Steve Loughran

Turn off java in your browsers

The issue with Java is that client side java has a limited set of roles

-Java tooling for server-side development

-Sandboxed runescape gameplay

-malware breaking out of browser sandboxes

Java is just as bad as flash here, but unlike flash, even easier to live without in your browser.

Disable it in the browser; if you don't need real java apps, remove the JDK

Open-source skills best hope for landing a good job

Steve Loughran

Sharepoint?

If you look at the graph, Windows & .NET dominate; Sharepoint is growing about as fast as the OSS technologies, and it's bigger. That makes it hard to conclude that OSS skills are what you need. A breakdown of skills by region/company type would be interesting: is it the enterprises that want (Windows, Sharepoint, Oracle) while it's the startups that want OSS? I wouldn't conclude that immediately: if you search indeed.com for Hadoop, you get a mix of web companies, telcos, media companies and others, implying it's a tool that fits a role in organisations. It's just not an end-user tool the way Sharepoint is.

Five years of open-source Java: Freedom isn't (quite) free

Steve Loughran

Java strength

Yes, the JVM is popular, but it's not clear that the Java roadmap advocated by Oracle is the future.

-the whole TCK debacle has shown the Java Community Process to be as democratic as the People's Congress of the Union of Soviet Socialist Republics. Follow the leader or you are in trouble.

-The sluggishness of the Java7 project has given other languages: Scala, Clojure, JRuby and Groovy an advantage. These languages are better than Java7, work on Java6, and are developed in an open source process. To follow Java7 is to hand control back to Oracle.

Some of the really interesting stuff in Java -the Hadoop stack in particular- isn't being done with any participation from Oracle, let alone under the auspices of the JCP. There's no reason for the ASF to follow the Oracle strategies there.

Pollution from car exhausts 'helps city dwellers fight stress'

Steve Loughran
FAIL

Sponsored by General Motors?

This sounds lovely but appears to miss a few points

-a lot of that stress is caused by commuting, which is an artifact of congestion. You could add more roads, but then "Induced Demand" kicks in: people do more journeys or move further out into the suburbs

-there isn't enough space in an EU city for parking either

-misses out on other pollution artifacts such as CO2 or NOx, the latter being tangibly harmful.

Another piece of analytics puzzle snaps home

Steve Loughran

CPU yes, but what about storage

The author asserts that running on existing hardware will lower the cost of big-data analytics, but a key point of "Big Data" is not CPU load, it is "you have lots of low-value data to work with". Platform doesn't address that story; they may have better scheduling than Apache Hadoop's out-of-the-box schedulers, but their storage story is the same: run HDFS for location-aware storage.

No doubt IBM's story will become the same as their grid story: use GPFS. But that increases the cost of storage in exchange for location-independence, which limits the amount of data you can retain.

Hadoop: A Linux even Microsoft likes

Steve Loughran
WTF?

Hadoop != Linux

Maybe it's just the title but likening Hadoop to Linux is daft. Linux: OS. Hadoop: Java based data mining platform.

MS adopting it is a matter of recognising that it is the de-facto standard for datamining outside of Google, and that if they didn't want to lose the server sales *and all the developers* they'd better support it.

'Silent majority' is content with elderly Java, says startup

Steve Loughran
WTF?

Confuses JVM releases with Java EE versions

Oracle may be pushing Java7 and talking about Java8, but Java EE 6 is the latest version of the Java Enterprise Edition specification. The author of the article has misunderstood things.

[This is not to be taken as an endorsement of Java7+, merely an observation on the article]

Adobe: crashing 100 million machines not an option

Steve Loughran
FAIL

Why is Flash so vulnerable?

Adobe may be proud of the turnaround time on their 0-day exploits, but there's still a 72 h lag from every discovery to a fix -and there is at least one official patch a month, plus often an emergency patch.

Why are acroread and flash so vulnerable? Are they attacked more often than the entire MS Office suite?

Adobe need to get flash patches out because they fear that all OS vendors -not just Apple- will stop bundling flash, and that all browser vendors will disable it by default. I don't think the latter would be a bad thing at all.

IBM pitches overclocked Xeons to Wall Street

Steve Loughran

DRAM failure rates

This is interesting, especially since the MS paper "Cycles, Cells and Platters" ( http://research.microsoft.com/apps/pubs/default.aspx?id=144888 ) provides evidence that overclocked machines are significantly more likely to show memory or HDD failures. Yes, you will get performance, but you'd better use higher-end ECC memory (chip-kill/chip-spare) and plan for failures, as well as having an OS that is ready to handle the memory check reporting that comes with the Nehalem-EX architecture -the one that lets the OS blacklist memory pages that are playing up.

LightSquared blasts GPS naysayers in FCC letter

Steve Loughran

GPS is too embedded

Even if filtering can fix this, consider that GPS is now built in to phones, cars, watches; my latest compact camera has one in. It is also becoming an SPOF for the US transport industry. LightSquared may point the blame at the GPS receiver vendors, but it seems to me that if they want to change the use of the adjacent band, they get to pay for all the upgrades and replacements of the existing devices.

Virgin mulls handing out free Wi-Fi

Steve Loughran

Idle during the day?

Well, the 32kbps uplink for Bristol overloads from about 5pm during university term time. It doesn't matter how idle the cable network is at that point.

Telcos: up your prices, lose customers

Steve Loughran

near-meaningless

This only measures people switching through this company's service; when I switched I didn't use them. Furthermore, it doesn't measure people who don't switch.

Better to display the (weak) data as the number of people switching from and to a particular supplier, not this market-share thing, which may be misleading. People may not be switching to O2, but they may not be switching from it either.

What it does seem to show is that people who switch from this app switch to 3. That's all

Oracle's Java plan trapped in last century

Steve Loughran
Thumb Up

moving to a post-Oracle world

I thought the article was a bit bleak at first -"Java is left behind"- but the closing point is key: the Java world is moving beyond Oracle, beyond the enterprise. Big HDFS filestores running Hadoop and HBase: Java-based, hosted at Apache. Spring? At SpringSource, and happily staying ahead of the EJB attempts to catch up. OSGi? Have Oracle stopped pretending it doesn't exist yet?

Oracle aren't playing in these worlds, and some of their key concepts -"NoSQL, no app server, commodity servers"- are the kind of thing that Larry must wake up screaming about.

Adobe offloads unwanted Linux AIR onto OEMs

Steve Loughran

Maybe they could do a version of Flash for Linux that works

Look at the Mozilla crash stats for Firefox on Linux

https://crash-stats.mozilla.com/query/query?product=Firefox&version=ALL%3AALL&platform=linux&range_value=1&range_unit=weeks&date=06%2F18%2F2011+02%3A13%3A19&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1

Look at how often libflashplayer pops up there.

Adobe can't even write a flash plugin that works reliably on the main Linux web browser! What makes them think "Air" will be any better? And yes, while it works more reliably on Windows, how many times in the past two weeks have I had to update both my Windows browsers with new flash versions? And new versions of Acroread.

If HTML5 kills Flash and delivers security and stability, as well as cross-platform operation, I'll be happy.

LexisNexis open sources Hadoop challenger

Steve Loughran

1000 nodes

I'm not going to get into an argument of C++ vs Java, but note that if you have 1000 Hadoop nodes, that gives you 12-24PB of storage. Regardless of performance, it's the storage that has the edge there, along with the layers on top.

That said, there's lots of room to improve Hadoop performance and job startup time; contributions are welcome.

Oracle whips out private cloud with blades

Steve Loughran

RHEL

If you look at the price difference, the key one is in OS licenses. But RHEL provides support. Oracle have their own downstream version of RHEL which, if I wanted it, I could effectively get by running CentOS on anything -which is what people in the big datacentres tend to do.

I wonder how much OS support oracle actually provide for their stack.

ASA smackdown for Yahoo! Thelma & Louise

Steve Loughran
FAIL

irresponsible driving

If it makes you feel better, the ASA told Citroen they can't show a car advert that includes adults cycling without helmets in the UK before 9pm, in case it made kids want to cycle without a helmet. That's despite the fact that it is not a legal requirement to wear one.

https://lofidelitybicycleclub.wordpress.com/2011/04/27/trading-standards/

They are, in a word, daft.

BT cheerfully admits snooping on customer LANs

Steve Loughran

How secure is the router login?

One interesting question here: do all the routers have the same username & password, or different ones? If the same, that could be quite serious. If not, how is the password regenerated after a firmware update and hard reset? It would have to be something predictable.

Apple leases space in new Silicon Valley data center

Steve Loughran

Closed world

-I'm not sure HDD sizes will shrink, as the cost of storage server-side is still higher (power, capital, etc). SSDs on laptops do save power, so there may be a trend there.

-I do agree with the closed-world comments, as the desktop is going that way. Look at how Macs don't have Blu-ray, because watching Blu-ray disks would reduce the demand for iTunes videos. Look how their thin laptops don't have Flash (admittedly, there are security benefits). Look how they are backing off Java support, and adding an app store for the Mac.

I can imagine MS selling locked-down machines too, citing end-user experience and security as the reason -and that's something Microsoft never attempted before.

Open source .NET mimic rises from Novell ashes

Steve Loughran

Defending Mono

Although I wouldn't code for it myself, Mono has a big place in the GNOME toolchain. It is better integrated with the OS than Java (which pretends all platforms look the same), and can be used to produce high-quality code.

However

-python and ruby and the like are also fast enough on modern machines, and even easier to deal with

-Google's Dalvik runtime has given the Java language a boost on phones, and removes the need for Mono there.

-Apple are very fussy about what they let run on their phones -witness their treatment of adobe's Flash runtime/cross compiler.

I wish them well, but fear that between android and apple, it's going to be hard

Oracle: Quit messin' and marry Hadoop!

Steve Loughran

Oracle and ASF lawsuit

The ASF doesn't have the money for a lawsuit:

https://blogs.apache.org/foundation/entry/statement_by_the_asf_board1

"Through the JSPA, the agreement under which both Oracle and the ASF participate in the JCP, the ASF has been entitled to a license for the test kit for Java SE (the "TCK") that will allow the ASF to test and distribute a release of the Apache Harmony project under the Apache License. Oracle is violating their contractual obligation as set forth under the rules of the JCP by only offering a TCK license that imposes additional terms and conditions that are not compatible with open source or Free software licenses"

Steve Loughran
FAIL

Painful reporting

I don't think the author of this article should be allowed to write about Apache Hadoop -it's painful to read. I hope nobody actually believes a word this person says.

1. The only official release of Apache Hadoop comes from the Apache Software Foundation; the latest of the 0.20 releases, 0.20.203, came out yesterday with lots of bug fixes from Yahoo! and Cloudera in it.

2. Any other so called "distribution" of Hadoop is not "a distribution" unless it is just the Apache release packaged for easy installation (as Thomas Koch does for debian) -it is a derivative work, containing code that is not in the Apache release.

3. Such derivative works can be open source (Cloudera) or closed source (EMC, IBM).

4. Any closed source derivative work forces the distributor to maintain their branch indefinitely.

5. Any derivative work forces the developer to test at the same scale as Y! and Facebook (thousands of machines, tens of PB of storage), or they cannot claim that it scales up.

6. Any closed source derivative work will only support bug fixes and patches at a rate determined by the closed source developer team, and provided at a cost determined by the price of that developer team.

7. Apache only provide support for the official apache release. If you use Cloudera or EMC: go talk to them about problems.

8. People who are not part of the Apache developer and user community do not get their needs addressed in the Apache releases, because we are unaware of them.

9. We, the apache developer team, have no need to take on random patches from developers of closed source derivative works unless we can see tangible benefits.

10. Finally, any derivative work that pulls out large amounts of the Hadoop codebase (e.g. Brisk, EMC Enterprise HD) cannot call itself a version of Hadoop. They are not. We, the Apache community, define the interfaces and what "100% compatible" means. When someone like EMC declares their derivative work "certified 100% compatible", that is a meaningless statement. Only the official Apache Hadoop release is, implicitly, 100% compatible with Apache Hadoop.

11. We reserve the right to change the semantics and interfaces to meet the community needs, on the schedule that suits the development community.

12. The rules of using the term "Hadoop" are defined in the Apache license, and it is not legal to say "a distribution of Hadoop" if it is in fact a derivative work. This is why Cloudera call their software "Cloudera’s Distribution including Apache Hadoop". EMC, Brisk and others are sailing close to the wind here.

13. The fact that Oracle are now subpoenaing Apache in the Oracle/Google lawsuit means that the relationship between Oracle and Apache has reached a low point -even after Apache left the Java Community Process due to Oracle's unwillingness to meet its legal requirements to provide the Technology Compatibility Kit without imposing Field of Use restrictions.

14. Because of (#13), it's hard to see a team of Oracle developers being trusted or welcome in the Hadoop community. You can't serve subpoenas on the ASF and then say "we'd like to help develop a technology of yours that threatens our entire business model and margins". They won't be trusted.

I have a term for the EMC-style not-quite-Hadoop products that use the same interfaces but offer unknown semantics and a cost model on a par with the vendor's existing enterprise product line. It is "Enterprisey Hadoop". This is not Apache Hadoop supported in the Enterprise, it is some derivative work that pretends to be Hadoop but misses the point about affordable scalability through commodity hardware and an open source codebase.

SteveL, Apache Hadoop Committer. All comments are personal opinions only, etc.

Oracle U-turns on Hudson open source control

Steve Loughran
Thumb Down

licensing issues

KK says that Oracle don't have the right to donate his six months' worth of post-Oracle code to Eclipse, because that would require a change in the code license to the Eclipse Public License.

If true, that will make things harder for Oracle.

Virgin outsources techies, pulls plug on Trowbridge call centre

Steve Loughran

I'll miss the Trowbridge team

While it'd be nice to have integrated billing, I have spoken to the Trowbridge people on the phone and they were always helpful and competent. Something to treasure.

Of course, now I've moved to 3 for their data, it's less of an issue.

Yahoo! Hadoop! brain! spin-off! doomed! to! fail!

Steve Loughran
WTF?

Badly informed article

This article is painfully bad. Unlike JBoss (open source, but most contributors work for JBoss), Spring or MySQL, Apache Hadoop is managed by Apache, and that organisation's structure is designed (somewhat) to prevent a single vendor dominating. Yes, Cloudera can do its fork, and so can IBM, but then they both take on the problem of testing at the scale of 1000+ servers, each with 12TB of storage.

And you know who has that kind of storage to play with? Yahoo! and Facebook. Nobody else can test at that scale -even though others (Apple?) may want to play at that scale. What the Y! team can do is focus on the large-scale datacentre problems, the ones where the Cloudera licensing fees are too much (hey, these datacentres run CentOS to avoid paying for 1000 RHEL licenses). With the current Cloudera support license, it's cheaper to hire an ex-Yahoo! person.

The other thing I'd like to point out -as a Hadoop Committer- is that while Cloudera has some excellent Hadoop developers -Doug, Tom White, Todd, Aaron, Konstantin, to name some key ones- they don't own the Hadoop developer world. There's the LinkedIn people, the Facebook people, lots of little startups who are busy filing bugs against it. There are the people adding layers on top of Hadoop, things that aren't yet mainstream (Hama for graphs, Mahout for machine learning), and there are the people working on Hadoop-compatible filesystems. It's open source, anyone can play: we welcome the users, we welcome the bug reports, and we welcome patches, especially if they come with JUnit tests. The whole MR2.0 engine is coming out of Yahoo! and it looks a great place to play. Come join us!

SteveL at apache dot org

Oxfordshire cops switch speed cameras back on

Steve Loughran

Cost of death

You forget that the cost of a KSI (killed or seriously injured) casualty is usually measured in millions of pounds. It's cheaper for people not to die.

Platform wants to out-map, out-reduce Hadoop

Steve Loughran
WTF?

Y! run different clusters for scale

Whoever said that Y! run >1 cluster so they can submit multiple jobs is ill-informed. Yahoo! have multiple clusters because the current scale of the HDFS filestore tops out at 25-30 PB, and putting Platform's code on top of that will not remove that limitation.

The MR engine can schedule multiple jobs, and can even prioritise work from different people. It too has a scale limit of about 4K servers, and even that requires tuned jobs to avoid overloading the central JobTracker.
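That scheduling point can be sketched in plain Java. This is a toy illustration, not the JobTracker's actual code (the class and method names here are invented): jobs carry a priority -classic MapReduce exposes this via the mapred.job.priority setting- and the scheduler hands out the highest-priority job first, whoever submitted it.

```java
import java.util.*;

// Toy sketch of priority-based job scheduling, in the spirit of the
// classic MapReduce JobTracker. NOT Hadoop's real API: all names here
// are invented for illustration.
public class PriorityJobQueue {
    // Same five levels as classic Hadoop's mapred.job.priority setting.
    enum Priority { VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH }

    static class Job {
        final String name;
        final Priority priority;
        Job(String name, Priority priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Highest priority first; jobs from different users share one queue.
    private final PriorityQueue<Job> queue = new PriorityQueue<>(
            Comparator.comparingInt((Job j) -> j.priority.ordinal()).reversed());

    void submit(Job j) { queue.add(j); }

    Job next() { return queue.poll(); }

    public static void main(String[] args) {
        PriorityJobQueue q = new PriorityJobQueue();
        q.submit(new Job("nightly-etl", Priority.NORMAL));
        q.submit(new Job("ad-hoc-report", Priority.HIGH));
        q.submit(new Job("log-archive", Priority.VERY_LOW));
        System.out.println(q.next().name); // prints "ad-hoc-report"
    }
}
```

The point is that one cluster's MR engine already multiplexes jobs; multiple clusters exist for storage scale, not job concurrency.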

If you do want to know more about what Hadoop's limits are, you are welcome to get in touch with me, a committer on the Apache Hadoop project; otherwise you will end up repeating marketing blurb from people who have a vested interest in discrediting a project that is tested at Y! and Facebook scale, is free, and has shown up a fundamental flaw in the "classic" Grid frameworks: their reliance on high-cost SAN storage limits their storage capacity, and hence their ability to work with Big Data problems. It is good that the Platform people now have a story for lower-cost storage than GPFS -by using Hadoop's own filesystem- but I'm not sure then why you need to pay the premium for Platform over the free version of Hadoop. That, of course, is the other flaw in the classic frameworks...

Quanta crams 512 cores into pizza box server

Steve Loughran

Spindle:CPU ratio bad for Hadoop

I'm putting on my Hadoop committer hat and noting some things about it on this box -independent of any other HPC uses-

1. Ignoring point (3) below, you don't need to "port" Apache Hadoop to the system provided you can bring up RHEL and Java on it, ideally the 64-bit JVM from Sun, that being the only one the Hadoop team opt to care about.

2. There's not enough storage. 24 HDDs for that many CPUs? The current generation of Hadoop servers put 12x 3.5" HDDs in a 1U server with 6-12 x86-64 cores, giving a ratio of one core to one or two HDDs. That's massive storage capacity and good IO bandwidth, with good CPU. Why? Storage capacity with some local datamining is the driving need. It's why HDD and not SSD is the storage, and why 3.5" disks are chosen over 2.5". It brings your cost per petabyte down.

3. The use of independent servers gives you better failure modes. If you built a rack out of these systems, you would need to somehow change Hadoop's topology logic to know that a set of servers are inter-dependent, so that copies of blocks of the files (usually 128+ MB blocks) are not stored on server instances in the same physical server. There's been discussion of making the placement policy pluggable, so Quanta could write a new Java class to implement placement differently, but as the plugin interface isn't there yet, they can't have done so.
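To make the topology problem concrete, here is a hypothetical sketch of what such a placement policy might do. This is not Hadoop's real placement API (which, as noted above, isn't pluggable yet); the class and method names are invented. The idea is simply that no two replicas of a block should land on server instances sharing the same physical chassis, since a chassis failure would then take out multiple copies at once.

```java
import java.util.*;

// Hypothetical chassis-aware replica placement. NOT Hadoop's real
// BlockPlacementPolicy interface -- all names here are invented to
// illustrate the failure-domain problem described above.
public class ChassisAwarePlacement {
    /**
     * Pick up to {@code replicas} target nodes such that no two chosen
     * nodes share a chassis. The map goes from node name to chassis id.
     */
    static List<String> chooseTargets(Map<String, String> nodeToChassis,
                                      int replicas) {
        List<String> chosen = new ArrayList<>();
        Set<String> usedChassis = new HashSet<>();
        for (Map.Entry<String, String> e : nodeToChassis.entrySet()) {
            if (chosen.size() == replicas) break;
            // add() returns false if this chassis already holds a replica
            if (usedChassis.add(e.getValue())) {
                chosen.add(e.getKey());
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("node1", "chassisA");
        cluster.put("node2", "chassisA"); // shares a box with node1
        cluster.put("node3", "chassisB");
        cluster.put("node4", "chassisC");
        // node2 is skipped: its chassis already holds a replica
        System.out.println(chooseTargets(cluster, 3)); // [node1, node3, node4]
    }
}
```

With independent 1U servers every node is its own failure domain and the stock rack-aware placement suffices; pack several nodes into one chassis and you need logic like this.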

IBM wants to relieve Aussie traffic pain

Steve Loughran

Which part of the UK are they talking about

We'd like to see some definitive evidence that IBM either reduced journey times, made it safer for people walking and cycling, improved public transport or reduced city-centre pollution before accepting any claim from IBM that they have solved the UK's traffic problems.

Maybe the harsh truth they dare not say is that all they can do is damage limitation. In which case, come out and say it.

Virgin Media, Readers - an apology

Steve Loughran

Not in BS6 during term time

Go search for "virgin media BS6" and see that VM quality depends purely on load, and in some parts of the country they are overloaded.
