Nutch is an open source web crawler and search engine based on Lucene. These are a few things that make it great:
Suppose we want to search for the author of the website by his email id.
Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code:
package com.swayam.nutch.plugins.indexfilter;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

/**
 * @author paawak
 */
public class EmailIndexingFilter implements IndexingFilter {

    private static final Log LOG = LogFactory.getLog(EmailIndexingFilter.class);

    private static final String KEY_CREATOR_EMAIL = "email";

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // look up email of the author based on the url of the site
        String creatorEmail = EmailLookup.getCreatorEmail(url.toString());

        LOG.info("######## creatorEmail = " + creatorEmail);

        if (creatorEmail != null) {
            doc.add(KEY_CREATOR_EMAIL, creatorEmail);
        }

        return doc;
    }

    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(KEY_CREATOR_EMAIL, LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

}
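The filter above calls EmailLookup.getCreatorEmail(), which is not shown in this post. Here is a minimal sketch of what such a helper could look like. Everything in it is an assumption made for illustration: the hard-coded host-to-email table, the lookup by host name, and the sample entry. A real lookup would probably read from a database or a properties file.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical stand-in for the EmailLookup helper used by the indexing filter.
 * To be called from the filter it would live in the same package as the filter.
 */
final class EmailLookup {

    // Assumption: a fixed host-to-email table, for illustration only.
    private static final Map<String, String> EMAIL_BY_HOST = new HashMap<String, String>();

    static {
        EMAIL_BY_HOST.put("www.mydomain.com", "jsmith@mydomain.com");
    }

    private EmailLookup() {
    }

    /** Returns the author's email for the host of the given url, or null if unknown. */
    static String getCreatorEmail(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            // Host names are case-insensitive, so normalise before the lookup.
            return EMAIL_BY_HOST.get(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Malformed url: treat it as "no author known".
            return null;
        }
    }
}
```

Returning null for unknown sites fits the filter, which only adds the email field when the lookup result is not null.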
Also, you need to create a plugin.xml:
Once this is done, create a new folder under $NUTCH_HOME/plugins and put the jar and the plugin.xml there.
Now we have to activate this plugin. To do this, we have to edit conf/nutch-site.xml.
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|parse-(text|html)|index-(basic|email)|query-(basic|site|url)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
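To see which plugin ids this expression lets through, here is a small stand-alone check using plain java.util.regex. It is only a sketch: it assumes that Nutch matches the whole plugin id against the expression, and that the new indexing plugin declares the id index-email in its plugin.xml. The class name PluginIncludesCheck is made up for this example.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks plugin ids against the plugin.includes value. */
final class PluginIncludesCheck {

    // The value of plugin.includes from conf/nutch-site.xml.
    static final String PLUGIN_INCLUDES =
            "nutch-extensionpoints|protocol-http|parse-(text|html)"
            + "|index-(basic|email)|query-(basic|site|url)";

    private static final Pattern INCLUDES = Pattern.compile(PLUGIN_INCLUDES);

    private PluginIncludesCheck() {
    }

    /** True if the whole plugin id matches the expression (assumed Nutch behaviour). */
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = { "index-email", "index-basic", "index-more", "query-basic" };
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under these assumptions index-email and index-basic pass, while index-more does not. A custom query plugin would likewise need its own id added to the query-(...) group before it is loaded.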
Add my own query plugin:
package com.swayam.nutch.plugins.queryfilter;

import org.apache.nutch.searcher.FieldQueryFilter;

/**
 * @author paawak
 */
public class MyEmailQueryFilter extends FieldQueryFilter {

    public MyEmailQueryFilter() {
        super("email");
    }

}
Do not forget to edit the plugin.xml.
This line is particularly important:
If you skip this line, you will never be able to see this in search results.
The only catch here is you have to prepend the keyword email: to the search key. For example, if you want to search for jsmith@mydomain.com, you have to search for email:jsmith@mydomain.com or email:jsmith.
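If the search box is meant only for email lookups, the prefix can be added on behalf of the user before the query string reaches Nutch. This is just a sketch: EmailQueryHelper and toEmailQuery are hypothetical names, not part of Nutch, and the helper only builds the query string.

```java
/** Hypothetical helper that turns raw user input into an email: field query. */
final class EmailQueryHelper {

    private static final String FIELD_PREFIX = "email:";

    private EmailQueryHelper() {
    }

    /** Prepends "email:" unless the input is blank or already carries the prefix. */
    static String toEmailQuery(String userInput) {
        String term = userInput.trim();
        if (term.isEmpty() || term.startsWith(FIELD_PREFIX)) {
            return term;
        }
        return FIELD_PREFIX + term;
    }
}
```

For example, toEmailQuery("jsmith@mydomain.com") gives email:jsmith@mydomain.com, and an input that already starts with email: is passed through unchanged.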
There is an easier and more elegant way :), read on...
Use the existing query-basic plugin.
This involves editing just one file: conf/nutch-default.xml.
In the default distribution, you can see some commented lines like this:
All you have to do is uncomment them and put your custom field (email, in our case) in place of description. The resulting fragment will look like:
<property>
  <name>query.basic.email.boost</name>
  <value>1.0</value>
  <description>Queries the author of the site by his email-id</description>
</property>
With this, when looking for jsmith@mydomain.com, you can simply enter jsmith@mydomain.com or a part of the name, like jsmit.
The preferred way to build the plugin is with Ant, but I have used Maven with the following dependencies:
...
...
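<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>2.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-misc</artifactId>
  <version>2.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.nutch</groupId>
  <artifactId>nutch</artifactId>
  <version>1.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.taglibs</groupId>
  <artifactId>taglibs-i18n</artifactId>
  <version>1.0.N20030822</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika</artifactId>
  <version>0.1-incubating</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>xerces</groupId>
  <artifactId>xerces</artifactId>
  <version>2.6.2</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>xerces</groupId>
  <artifactId>xerces-apis</artifactId>
  <version>2.6.2</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.jets3t.service</groupId>
  <artifactId>jets3t</artifactId>
  <version>0.6.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>oro</groupId>
  <artifactId>oro</artifactId>
  <version>2.0.8</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>com.ibm.icu</groupId>
  <artifactId>icu4j</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.19.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-common</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solrj</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>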
...
...
Be warned that these are a bit outdated, so they may not work verbatim.