
Hadoop's little buddy Nutch 2.0 gulps down web's big data

Hadoop daddy Doug Cutting's Nutch, the open-source web-search engine written in Java, has been updated to crawl through piles of big data on the web. The Apache Software Foundation (ASF) has released Nutch 2.0, featuring a data abstraction technique that plugs into big-data stores and frameworks Apache Accumulo, Avro, Cassandra, …

COMMENTS

This topic is closed for new posts.
Meh

Does this article mean anything?

These all seem like randomly generated words, sprinkled with a few that I recognise, like "the" and "big".

Coat

Re: Does this article mean anything?

If I Hadoop, I would Nutch up an Accumulo by this Avro. So now that Cassandra has moved to HBase, Gora won't let Tika go.

Makes perfect sense.


Re: Does this article mean anything?

Jesus, don't you get out, man!? What are you doing? Sitting around smoking the chronic and keeping it real?

Nutch can be configured for a targeted crawl (e.g. in the enterprise) to generate Lucene indexes. These indexes can then be inspected using Luke to tune your Nutch configuration file. If a single Nutch server is not sufficient (and often it is), then you can put the whole thing on a Hadoop cluster. After the indexes are created, you could create a simple search results webpage running on Apache using Solr.
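
For the last step, a rough sketch of what the search page's back end might do, assuming the crawl has been indexed into a Solr core. The base URL, the core name "nutch", and the "url" and "title" field names are assumptions for illustration, so check them against your own Solr schema. It only builds a request for Solr's standard select handler and pulls the hits out of a JSON response; fetching the URL is left out.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler, asking for a JSON response.

    base_url is the core's address, e.g. http://localhost:8983/solr/nutch
    (hypothetical host and core name).
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def extract_hits(response_text):
    """Return (url, title) pairs from a Solr JSON response body.

    Assumes the indexed documents carry "url" and "title" fields;
    missing fields come back as None.
    """
    body = json.loads(response_text)
    docs = body.get("response", {}).get("docs", [])
    return [(doc.get("url"), doc.get("title")) for doc in docs]
```

Your results page would fetch the URL from build_select_url, pass the body to extract_hits, and render the pairs as links.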

A lot of products are built on combined Apache projects. Try to keep up.

Anonymous Coward

GORA

Yet another persistence API, and in this case with next to no documentation. Will people never learn that to get people to use your tool you need to explain why you reinvented that wheel, and indeed what the controls are to make the wheel turn? Without those they will look at it for two minutes, grunt, and then go off back to their cave.


