Nutch is an open-source web crawler and search engine based on Lucene. These are a few things that make it great:
Suppose we want to search for the author of a website by his or her email ID.
Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code:
package com.swayam.nutch.plugins.indexfilter;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;
/**
 *@author paawak
 */
public class EmailIndexingFilter implements IndexingFilter {
    private static final Log LOG = LogFactory.getLog(EmailIndexingFilter.class);
    private static final String KEY_CREATOR_EMAIL = "email";
    private Configuration conf;
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // look up email of the author based on the url of the site
        String creatorEmail = EmailLookup.getCreatorEmail(url.toString());
        LOG.info("######## creatorEmail = " + creatorEmail);
        if (creatorEmail != null) {
            doc.add(KEY_CREATOR_EMAIL, creatorEmail);
        }
        return doc;
    }
    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(KEY_CREATOR_EMAIL, LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }
    public Configuration getConf() {
        return conf;
    }
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
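The filter above calls EmailLookup.getCreatorEmail(), which is not shown in this post. As a minimal sketch, assuming the lookup is a simple in-memory map from host name to email address, it could look like the class below. The class body, the sample host and the sample address are hypothetical; a real lookup would probably read from a file or a database. To be usable from the filter it would live in the same package, com.swayam.nutch.plugins.indexfilter.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps the host of a page URL to the site author's email. */
final class EmailLookup {

    // Assumed sample data, for illustration only.
    private static final Map<String, String> EMAIL_BY_HOST = new HashMap<String, String>();

    static {
        EMAIL_BY_HOST.put("www.mydomain.com", "jsmith@mydomain.com");
    }

    private EmailLookup() {
    }

    /** Returns the author's email for the URL's host, or null if the host is unknown or the URL is malformed. */
    static String getCreatorEmail(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            return EMAIL_BY_HOST.get(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Returning null for unknown sites fits the null check in the filter, which simply skips the email field for those documents.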
Also, you need to create a plugin.xml:
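A minimal sketch of such a descriptor, following the usual layout of Nutch 1.0 plugin descriptors. The plugin id index-email is inferred from the plugin.includes value used further down (index-(basic|email)), and the extension point and class name come from the source code above. The name, version, provider-name, jar name and the extension and implementation ids are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: name, version, provider-name and the jar name are hypothetical. -->
<plugin id="index-email" name="Email Indexing Filter"
        version="1.0.0" provider-name="com.swayam">
   <runtime>
      <!-- hypothetical jar name: use the name of the jar you actually built -->
      <library name="index-email.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <!-- the extension and implementation ids below are illustrative -->
   <extension id="com.swayam.nutch.plugins.indexfilter"
              name="Email Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="EmailIndexingFilter"
                      class="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter"/>
   </extension>
</plugin>
```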
  
    
       
     
   
  
     
   
  
     
   
 
Once this is done, create a new folder under $NUTCH_HOME/plugins and put the jar and the plugin.xml there.
Now we have to activate this plugin. To do this, we have to edit conf/nutch-site.xml.
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|parse-(text|html)|index-(basic|email)|query-(basic|site|url)</value>
  <description>Regular expression naming plugin id names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
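Since plugin.includes is a regular expression over plugin ids, it is easy to check which ids a given value lets through. The sketch below does that with java.util.regex, assuming that a plugin id has to match the whole expression. The class name and the ids index-more and query-email are hypothetical and only serve as test inputs.

```java
import java.util.regex.Pattern;

/** Checks plugin ids against the plugin.includes value used in this post. */
final class PluginIncludesCheck {

    static final Pattern INCLUDES = Pattern.compile(
            "nutch-extensionpoints|protocol-http|parse-(text|html)"
            + "|index-(basic|email)|query-(basic|site|url)");

    private PluginIncludesCheck() {
    }

    /** True if the id matches the whole expression (assumed matching semantics). */
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // index-more and query-email are made-up ids that the value does not cover.
        String[] ids = {"index-basic", "index-email", "index-more", "query-email"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

With this value, index-email is included, while an id such as query-email would not be. So if a plugin is deployed under an id the expression does not cover, the expression needs to be extended accordingly.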
Add my own query plugin:
package com.swayam.nutch.plugins.queryfilter;
import org.apache.nutch.searcher.FieldQueryFilter;
/**
 *@author paawak
 */
public class MyEmailQueryFilter extends FieldQueryFilter {
    public MyEmailQueryFilter() {
        super("email");
    }
}
Do not forget to edit the plugin.xml.
   
      
          
       
    
   
       
    
   
      
         
       
    
 
This line is particularly important:
If you skip this line, you will never be able to see this in search results.
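A minimal sketch of the extension element for the query filter, following the layout of the query plugins that ship with Nutch 1.0. The extension and implementation ids and the name are hypothetical. The parameter element is most likely the line meant above: as far as I can tell, Nutch reads the fields parameter to learn which field names a query filter handles, and a filter that names no fields is not used.

```xml
<!-- Sketch only: the ids and the name are illustrative. -->
<extension id="com.swayam.nutch.plugins.queryfilter"
           name="Email Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
   <implementation id="MyEmailQueryFilter"
                   class="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter">
      <!-- declares that this filter handles the email field -->
      <parameter name="fields" value="email"/>
   </implementation>
</extension>
```

If the query filter is packaged as a separate plugin rather than added to the indexing plugin, its plugin id also has to match plugin.includes.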
The only catch here is you have to prepend the keyword email: to the search key. For example, if you want to search for jsmith@mydomain.com, you have to search for email:jsmith@mydomain.com or email:jsmith.
There is an easier and more elegant way :), read on...
Use the existing query-basic plugin.
This involves editing just one file: conf/nutch-default.xml.
In the default distribution, you can see some commented lines like this:
All you have to do is uncomment them and put your custom field (email, in our case) in place of description. The resulting fragment will look like:
<property>
  <name>query.basic.email.boost</name>
  <value>1.0</value>
  <description>Queries the author of the site by his email-id
  </description>
</property>
With this in place, to look for jsmith@mydomain.com you can simply enter jsmith@mydomain.com or a part of the name, like jsmit.
The preferred way to build the plugin is with Ant, but I have used Maven with the following dependencies:
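...
    <dependencies>
        ...
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-misc</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.nutch</groupId>
            <artifactId>nutch</artifactId>
            <version>1.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.taglibs</groupId>
            <artifactId>taglibs-i18n</artifactId>
            <version>1.0.N20030822</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika</artifactId>
            <version>0.1-incubating</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>xerces</groupId>
            <artifactId>xerces</artifactId>
            <version>2.6.2</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>xerces</groupId>
            <artifactId>xerces-apis</artifactId>
            <version>2.6.2</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.jets3t.service</groupId>
            <artifactId>jets3t</artifactId>
            <version>0.6.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>oro</groupId>
            <artifactId>oro</artifactId>
            <version>2.0.8</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.ibm.icu</groupId>
            <artifactId>icu4j</artifactId>
            <version>4.0.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>0.19.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-common</artifactId>
            <version>1.3.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solrj</artifactId>
            <version>1.3.0</version>
            <scope>provided</scope>
        </dependency>
        ...
    </dependencies>
...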
Be warned that these are a bit outdated, so they may not be correct verbatim.