Organizational Research By

Surprising Research Topic

Question: How can I make Nutch crawl the local file system?

I don't want to crawl over HTTP (for example, http://localhost:81 and so on), but to crawl a specific directory on the local file system directly. Is there a way to do this?

asked Sep 13, 2013 in Crawl by rajesh
edited Sep 12, 2013
0 votes
22 views




2 Answers

0 votes
From the Nutch Wiki:

How do I index my local file system?

1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites. Change this line:

  -^(file|ftp|mailto|https):

  to this:

  -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:

  # accept anything else
  +.*

3) I changed my nutch.xml to include the following:

answered Sep 13, 2013 by rajesh
edited Sep 12, 2013 by rajesh
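
To see why the change in step 1 matters, here is a rough Python sketch of how a filter file like crawl-urlfilter.txt is evaluated. It is not Nutch code. It assumes the usual behavior of Nutch's regex URL filter: rules are checked top to bottom, the first pattern found in the URL decides ("+" accepts, "-" rejects), and a URL that matches no rule is rejected. The function names and the sample URLs are made up for illustration, and Python's `re` stands in for Java's regex engine, which behaves the same way for these simple patterns.

```python
import re


def parse_rules(text):
    """Parse filter lines: '+regex' accepts, '-regex' rejects, '#' is a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in the URL is a '+' rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is ignored


# The filter before the change from step 1 (file: URLs are rejected).
ORIGINAL = """
-^(file|ftp|mailto|https):
# accept anything else
+.*
"""

# The filter after the change from step 1 (http: URLs are rejected instead).
MODIFIED = """
-^(http|ftp|mailto|https):
# accept anything else
+.*
"""

if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the question.
    samples = ["file:///data/docs/", "http://localhost:81/"]
    for name, text in (("original", ORIGINAL), ("modified", MODIFIED)):
        rules = parse_rules(text)
        for url in samples:
            print(name, url, accepts(rules, url))
```

With the original line, the file: URL is rejected before the "+.*" rule is ever reached, so nothing on disk gets indexed. With the modified line, the file: URL falls through to "+.*" and is accepted, while http: and https: links are rejected, which keeps the crawl from leaving the local disk.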

...