Question: Solr + Heritrix


How can I integrate Solr with Heritrix?

I want to archive a site using Heritrix and then index and search the archived content locally using Solr.

Thanks

asked Sep 13, 2013 in Crawl by rajesh
edited Sep 12, 2013




2 Answers

The problem with using Solr on its own is that it builds a plain full-text index with no link-based ranking (which may be fine if you are only crawling an internal website and don't care about 'pagerank').

Nutch, however, will give you a better-ranked index, because it does score pages using link analysis in the style of pagerank.

NutchWAX

If, however, you are dead set on using Heritrix and would like pagerank-based search results, you could use NutchWAX (Nutch Web Archive eXtensions) to index Heritrix's output (that's what the makers of Heritrix are doing).

NutchWAX is intended for web archives but can also be used to create a search engine of the live web (in fact that is easier, as you aren't dragging years' worth of data along during each rebuild of the index).

Solr

If you do want to use Heritrix+Solr to create a search website, you should probably replace the "ARCWriter" processor in Heritrix with a custom processor that submits the contents of the page to Solr.

The Solr end is dead simple: you just POST an XML document to Solr's update handler over HTTP.
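To give a feel for what that XML looks like, here is a minimal sketch in Python. It is only an illustration of the wire format, not the processor itself (a real Heritrix processor would be written in Java). The field names `id`, `title` and `content`, the helper names, and the update URL are all assumptions; they have to match your own Solr schema and installation.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(doc):
    """Render one document (a dict of field name -> value) as a Solr <add> message."""
    fields = "".join(
        "<field name={}>{}</field>".format(quoteattr(name), escape(str(value)))
        for name, value in doc.items()
    )
    return "<add><doc>{}</doc></add>".format(fields)


def post_to_solr(xml, update_url="http://localhost:8983/solr/update?commit=true"):
    """POST an XML update message to Solr and return the HTTP status code.

    The default URL is an assumption (single-core install on the default port);
    multi-core setups include the core name in the path.
    """
    request = urllib.request.Request(
        update_url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical page data; in the crawler this would come from the fetched URI.
    page = {
        "id": "http://example.com/",
        "title": "Example & Co",
        "content": "Body text extracted from the page",
    }
    print(build_add_xml(page))
```

Escaping matters here because crawled page text routinely contains `&` and `<`. Committing on every request, as the assumed URL does, is convenient for testing but probably too slow for a real crawl, where you would batch documents and commit less often.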

The Heritrix end is a little bit more complicated, but the Developer's Manual will get you started on writing a Processor for Heritrix 1.x. If you are using the as-yet unstable 3.x or the discontinued 2.x, you'll need to do a little more legwork, as the documentation isn't there yet.
answered Sep 13, 2013 by rajesh
edited Sep 12, 2013

...