nutch

 created by sjw/mfgis 2feb07

Installing and Running Nutch Under Debian 'Etch'

 

Install Sun's Java

Sun Java is available as a set of Debian packages and can be installed with apt. To obtain Sun's Java, first ensure that 'non-free' is included in /etc/apt/sources.list.

  • # apt-get install sun-java5-bin sun-java5-demo sun-java5-jdk sun-java5-jre

Since there may be more than one flavor of Java on the system (e.g. kaffe), ensure that Sun Java is the chosen alternative:

  • # update-alternatives --config java   # then select Sun Java from the menu

If necessary edit /etc/profile to include the following lines:

  • JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.10
    export JAVA_HOME
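
After editing /etc/profile, a small shell function along the following lines can confirm that JAVA_HOME points at a usable Java installation. This is only a sketch: the function name check_java_home is made up for illustration, and it assumes that a java executable lives under bin/ of the Java home.

```sh
# check_java_home [DIR]
# Hypothetical helper: succeeds if DIR (default: $JAVA_HOME) contains an
# executable bin/java, and reports what it found either way.
check_java_home() {
    dir="${1:-${JAVA_HOME:-}}"
    if [ -n "$dir" ] && [ -x "$dir/bin/java" ]; then
        echo "ok: $dir"
    else
        echo "missing: ${dir:-<unset>}"
        return 1
    fi
}
```

Calling check_java_home with no argument tests the JAVA_HOME of the current shell, so it is best run from a fresh login shell that has read /etc/profile.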

 

Install Tomcat5.5 and Verify that it is functioning

  • # apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps

Verify Tomcat is running:

  • # /etc/init.d/tomcat5.5 status
    #Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid

Tomcat may be started and/or stopped using the following:

  • # /etc/init.d/tomcat5.5 start
    # /etc/init.d/tomcat5.5 stop

It is NOT necessary to run '~/local/tomcat/bin/catalina.sh start' as noted elsewhere in the wiki, nor is it necessary to start Tomcat/Catalina from any particular location.
Tomcat5.5 under Debian Etch listens on port 8180, not 8080, so pointing your browser to http://blahblah:8180 (where blahblah stands for your server's hostname) will bring up the Tomcat home page if everything is functioning properly.

Grant Yourself Tomcat Manager Permissions

Edit /usr/share/tomcat5.5/conf/tomcat-users.xml and include the following:

 

Enter the Tomcat Manager

Tomcat5.5 under Debian Etch comes pre-installed with a handful of simple webapps. Clicking on the Tomcat Manager link from the Tomcat home page will show you a list of these applications and their execution status. Later we will return to this page to verify that our nutch applications are running.

 

Acquire, install and configure Nutch

Acquire a copy of nutch and unpack it in a new directory. I suggest using /usr/local/nutch as the top-level directory, but any location will do.

 

Configure for multiple, independent site crawls and searches

Follow the section Intranet:Configuration from the Nutch tutorial at http://lucene.apache.org/nutch/tutorial8.html. However, plan in advance for crawling and searching sites independently from one another:
Given two sites, site1 and site2 which you wish to crawl/index (and later search) independently from each other, you may make multiple copies of the conf directory:

  • #cd /usr/local/nutch
    #cp -rp conf conf.site1
    #cp -rp conf conf.site2

And then work through steps one through four of the above mentioned section for each site.

Create simple shell scripts which allow for the independent crawling of each site, such as /usr/local/nutch/crawl_site1.sh

  • NUTCH_CONF_DIR=conf.site1
    export NUTCH_CONF_DIR
    bin/nutch crawl urls/site1 -dir crawls/site1 -depth 10 -topN 100000

and the same for site2.
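
Since the per-site scripts differ only in the site name, they can be folded into one function that takes the name as an argument. The following is a sketch under stated assumptions: the function name crawl_site and the NUTCH_HOME and NUTCH variables are not part of the original scripts and are introduced here only for illustration.

```sh
# crawl_site SITE
# Generalises crawl_site1.sh / crawl_site2.sh to any site name.
# Assumptions of this sketch (not from the original scripts):
#   NUTCH_HOME  top-level Nutch directory, default /usr/local/nutch
#   NUTCH       command to run, default bin/nutch (NUTCH=echo gives a dry run)
crawl_site() {
    site="$1"
    if [ -z "$site" ]; then
        echo "usage: crawl_site SITE" >&2
        return 2
    fi
    (
        # Run in a subshell so the caller's working directory is untouched
        cd "${NUTCH_HOME:-/usr/local/nutch}" || exit 1
        # Each site gets its own configuration directory, e.g. conf.site1
        NUTCH_CONF_DIR="conf.$site"
        export NUTCH_CONF_DIR
        ${NUTCH:-bin/nutch} crawl "urls/$site" -dir "crawls/$site" -depth 10 -topN 100000
    )
}
```

With this in place, crawl_site site1 and crawl_site site2 would do the same work as the two separate scripts.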

 

Then proceed to crawl each site:

  • #sh crawl_site1.sh
    #sh crawl_site2.sh

 

Configure Tomcat's File and Webapp Paths

Under Debian Etch, the Catalina policy files are located under /etc/tomcat5.5/policy.d. At runtime they are combined into a single file, /usr/share/tomcat5.5/conf/catalina.policy. Do not edit the latter, as it will be overwritten.
At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:

{{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-\" {

Warning: The last line here was necessary in order to make things work for me. If anybody can supply a more restrictive permission set, please do so!!! The effects of this are unknown

 

Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching

Under Debian Etch & Tomcat5.5 the webapps path is located at

  • /usr/share/tomcat5.5-webapps

Contrary to the Nutch tutorial(s) it is NOT NECESSARY to remove the ROOT context nor is it desirable. It was noted above that the Tomcat Manager allows us to view and control our multiple applications. Removing ROOT would break this functionality.
Create two new folders under /usr/share/tomcat5.5-webapps, and explode the nutch war file into each:

  • # cd /usr/share/tomcat5.5-webapps
    # mkdir site1
    # mkdir site2
    # cp /usr/local/nutch/nutch-0.8.1.war site1
    # cp /usr/local/nutch/nutch-0.8.1.war site2
    # cd site1; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..
    # cd site2; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..

Configure the site1,site2 webapps

Edit the site1/WEB-INF/classes/nutch-default.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch and save it as nutch-site.xml after making the following changes:
  • searcher.dir /usr/local/nutch/crawls/site1

And repeat for site2.
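
Nutch reads nutch-site.xml as a set of overrides on top of nutch-default.xml, so one alternative to editing a full copy is to generate a minimal file that sets only searcher.dir. The sketch below assumes that approach; the function name write_searcher_conf is hypothetical, and the property layout is the standard name/value form used by the Nutch configuration files.

```sh
# write_searcher_conf CRAWL_DIR OUT_FILE
# Writes a minimal nutch-site.xml that overrides only searcher.dir.
# Writing a fresh file (rather than editing a copy of nutch-default.xml)
# is an assumption of this sketch.
write_searcher_conf() {
    crawl_dir="$1"
    out_file="$2"
    cat > "$out_file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$crawl_dir</value>
  </property>
</configuration>
EOF
}
```

For example, write_searcher_conf /usr/local/nutch/crawls/site1 site1/WEB-INF/classes/nutch-site.xml, run from the webapps directory, would cover site1; the same call with site2 in both arguments covers the second site.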
Create site1.xml and site2.xml under /usr/share/tomcat5.5-webapps by modifying the distribution nutch-site.xml


And repeat for site2.
Create symbolic links to these files under /usr/share/tomcat5.5/conf/Catalina/localhost

ln -s /usr/share/tomcat5.5-webapps/site1.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site1.xml
ln -s /usr/share/tomcat5.5-webapps/site2.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site2.xml
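
With more than a couple of sites, the same links can be created in a loop. This is a sketch only: link_contexts is a made-up name, and both directories are passed as arguments rather than hard-coded.

```sh
# link_contexts WEBAPPS_DIR CONTEXT_DIR SITE...
# Symlinks WEBAPPS_DIR/SITE.xml into CONTEXT_DIR for every SITE given.
# The function name is hypothetical; all paths come from the arguments.
link_contexts() {
    webapps="$1"
    contexts="$2"
    shift 2
    for site in "$@"; do
        if [ ! -f "$webapps/$site.xml" ]; then
            echo "skipping $site: $webapps/$site.xml not found" >&2
            continue
        fi
        # -f replaces a stale link left over from an earlier attempt
        ln -sf "$webapps/$site.xml" "$contexts/$site.xml"
    done
}
```

The two ln commands above would then correspond to link_contexts /usr/share/tomcat5.5-webapps /usr/share/tomcat5.5/conf/Catalina/localhost site1 site2.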

 

Restart Tomcat

  • # /etc/init.d/tomcat5.5 restart

Revisit the Tomcat Manager. You should see new entries for site1 and site2, and with luck their Running status should show as True.

 

Search Your Sites!

Point your browser to http://blahblah:8180/site1 and conduct a search. 
Point your browser to http://blahblah:8180/site2 and conduct another search. 
If everything was configured properly you should see independent results representing independent searches on independent crawls.

FIN.

 

 Today I started to work on a little project that required a crawler, and Nutch seemed to do most of what I needed. The Nutch team conveniently released Nutch 1.0 late in March 2009, so I had a brand new release to test out. Installing Nutch 1.0 on a Mac is not as straightforward as I thought. I ran into a lot of unexpected issues, so here is my cookbook description of how to successfully install Nutch 1.0 on your Mac.

  1. Download the latest source code from the Apache SVN repository: http://svn.apache.org/repos/asf/lucene/nutch/. I tried running it from the tarball without success. I also tried to compile the source from the tarball, but a post on the Nutch forum clearly states that this will not work.
  2. Set your JAVA_HOME and NUTCH_JAVA_HOME variables. Again, this is not straightforward: they both need to point to your real installation of Java 1.6 (earlier versions of Java will fail). I set these variables to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home; I could not get the /Library/Java/Home symbolic link to work properly.
  3. Compile the source code using Ant (I built it in Eclipse).
  4. Set up your Nutch configuration by following the tutorial by Peter P. Wang.
  5. Run your first crawl with: ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Most of the issues I encountered were related to the Java version and the fact that the /Application/Utilities/Java/Java preferences application does not really change the JAVA_HOME directory /Library/Java/Home properly. So make sure you have set both JAVA_HOME and NUTCH_JAVA_HOME, and that OS X does not fool you when it pretends to be symbolically linking to the 1.6 installation.
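
To see where a symbolic link such as /Library/Java/Home really ends up, the chain of links can be followed by hand. The function below is a minimal sketch of that idea; resolve_link is a name invented here, and it relies only on the readlink command.

```sh
# resolve_link PATH
# Follows PATH through any chain of symbolic links and prints the final
# target. Relative link targets are resolved against the link's directory.
resolve_link() {
    target="$1"
    while [ -L "$target" ]; do
        link=$(readlink "$target")
        case "$link" in
            /*) target="$link" ;;
            *)  target="$(dirname "$target")/$link" ;;
        esac
    done
    echo "$target"
}
```

Running resolve_link /Library/Java/Home and comparing the result with the Versions/1.6/Home path shows whether the link points at the Java 1.6 installation.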

 

 This howto explains how to get Nutch, Nutch-Gui, Sun JDK & Tomcat 6.0.16 working on Centos 5.x while maintaining a normally functioning Centos system. Centos 5.x currently ships with Tomcat 5.5. While that version does run, its default install has problems that result in persistent errors which are undocumented at this time. If you have information or believe that these errors have been addressed and can point to a fix, please use the contact form on this website to let us know. Everything installed by following this howto can be removed again with either "rpm -e foo.rpm" or "rm -rf /opt/foo", returning your system to its original state.

Applicable to Centos Versions:

  • Centos 5.x

Requirements

  1. Root or sudo access with appropriate privileges to the system you intend to install on.
  2. A server preferably on a high-speed network.
  3. Sun JDK rpm.bin.
  4. Tomcat 6 rpm.
  5. Nutch 1.0 tar.gz.
  6. Nutch-Gui 0.2 tar.gz.

Doing the Work


  1. Install a few dependencies:
    sudo yum install ant xml-commons-apis ant-trax
  2. Download & install the latest Sun JDK rpm.bin:
    Go here:
    http://java.sun.com/javase/downloads/index.jsp

    Get the following:
    Java SE Development Kit (JDK) 32bit (approx. 73.98MB)
    JDK 6 Update 16 (or the latest update; the version is important in setting your JAVA_HOME path variable)

    Once downloaded install using the following:
    chmod +x jdk-6u16-linux-i586-rpm.bin; ./jdk-6u16-linux-i586-rpm.bin
    answer "yes" to the EULA
    sudo rpm -ivh jdk* sun*
  3. Download & install Tomcat 6:
    http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.noarch.rpm
    http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.src.rpm (provided for reference)

    Once downloaded install the rpm with the following command:

    sudo rpm -ivh tomcat-6.0.16-0.noarch.rpm (this installs entirely into /opt/tomcat and can be removed with: rpm -e tomcat)
    sudo vi /opt/tomcat/conf/tomcat-env.sh (set: JAVA_HOME="/usr/java/jre1.6.0_16")
  4. Download & install Nutch 1.0:
    Download Nutch 1.0 from a mirror here: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
    sudo cp nutch-1.0.tar.gz /opt; cd /opt && tar xvfz nutch-1.0.tar.gz; cd nutch-1.0
    sudo ant
    sudo ant war (this creates the "build" directory)

    sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
    (modify the property "searcher.dir" to: /opt/nutch-1.0/crawl/ & the docBase attribute
    to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/"
    )

    sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
    (a .war file is a zip/jar file known as a "web archive"; it is uncompressed when tomcat is started)
  5. Edit /etc/profile:
    Add these lines just above: # ksh workaround

    sudo vi /etc/profile

    ##Tomcat 6 / Java##
    JAVA_HOME="/usr/java/jdk1.6.0_16"
    export JAVA_HOME
    CATALINA_HOME="/opt/tomcat"
    export CATALINA_HOME
    NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
    export NUTCH_JAVA_HOME
    ##End Tomcat 6 / Java##
  6. Configure Nutch to fetch URLs:
    cd /opt/nutch-1.0; sudo mkdir urls
    (make a flat text file in here called "seed" and create a list of urls to be crawled, with each url on a separate line: http://www.example.com)

    sudo vi conf/nutch-default.xml

    edit the following:

    http.agent.name My Spider
    http.robots.agents My Spider
    http.agent.description My Bot
    http.agent.url http://www.example.com
    http.agent.email admin@example.com
    all other values remain as default; do not attempt to alter them unless you have a backup and/or you know what you're doing.
  7. Nutch "deepcrawler" script:
    Put this script in /opt/nutch-1.0/bin
    chmod +x deepcrawler
    Note: this script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new dir in:
    /opt/nutch-1.0/crawl1 to store the new crawl.

  8. Fetch URLs with Nutch via command line:
    If you do not alter the deepcrawler script it will most likely run for many days or weeks depending on the number of urls you inject,
    so you'll want to run it in screen.

    screen -S nutch
    sudo service tomcat start
    cd /opt/nutch-1.0; su -c "bin/deepcrawler"
  9. Download & install Nutch-Gui 0.2:
    Note: if you use the script provided above, you can skip the GUI altogether.

    Download Nutch-Gui 0.2 from:
    http://github.com/101tec/nutch/downloads
    sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
    sudo ant clean package
    cd build/nutch-gui-0.2
    sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war

    unsecured quick test method, to make sure it's working:
    su -c "bin/nutch admin /opt/nutch-1.0 50060"
    http://example.com:50060/general

    more secure password protection:
    sudo vi conf/nutchguiUsers.properties
    (edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role)
    screen -S nutch-gui (since we'll probably run it for a while)
    su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
    http://example.com:50060/general
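
The seed file described in the step that creates the urls directory is just a flat list with one URL per line, so it can also be generated from the command line. The helper below is a hedged sketch; the name make_seed is invented for this example.

```sh
# make_seed DIR URL...
# Creates DIR/seed containing the given URLs, one per line.
# Hypothetical helper; an existing seed file is overwritten.
make_seed() {
    dir="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: make_seed DIR URL..." >&2
        return 2
    fi
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed"
}
```

For the layout used in this howto the call would be make_seed /opt/nutch-1.0/urls http://www.example.com, run with sufficient privileges to write under /opt.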

Troubleshooting

How to test

  1. Make sure the required packages are installed and the JAVA_HOME path variable is set in /etc/profile:
    rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
    tomcat-6.0.16-0
    jdk-1.6.0_16-fcs
    ant-1.6.5-2jpp.2
    xml-commons-apis-1.3.02-0.b2.7jpp.10
    ant-trax-1.6.5-2jpp.2
    /usr/java/jdk1.6.0_16

    Replace "localhost" with your machine's IP
    Try accessing Tomcat here: http://localhost:8080/
    Try accessing Nutch here: http://localhost:8080/nutch/
    Try accessing Nutch-Gui here: http://localhost:50060/general
  2. Set Tomcat to start on boot:
    sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
    tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off
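
The environment check can be extended to all three variables that were added to /etc/profile. The following is a rough sketch rather than part of the original howto; check_env_vars is a hypothetical name.

```sh
# check_env_vars NAME...
# Prints one line per variable and returns non-zero if any of them is
# unset or empty. Only pass trusted variable names: eval is used for the
# indirect lookup because plain sh has no ${!name}.
check_env_vars() {
    rc=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
            rc=1
        fi
    done
    return $rc
}
```

In a fresh login shell, check_env_vars JAVA_HOME CATALINA_HOME NUTCH_JAVA_HOME should list all three paths if /etc/profile was edited as described.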


Disclaimer

We test this stuff on our own machines, really we do. But you may run into problems; if you do, come to #centos on irc.freenode.net


 

 

Requirements

  1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation.
  2. Apache's Tomcat 4.x.
  3. On Win32, cygwin, for shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)
  4. Up to a gigabyte of free disk space, a high-speed connection, and an hour or so.

Getting Started

First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and connect to its top-level directory. Or, check out the latest source code from subversion and build it with Ant.

Try the following command:

bin/nutch

This will display the documentation for the Nutch command script.

Now we're ready to crawl. There are two approaches to crawling:

  1. Intranet crawling, with the crawl command.
  2. Whole-web crawling, with much greater control, using the lower level inject, generate, fetch and updatedb commands.

Intranet Crawling

Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.

Intranet: Configuration

To configure things for intranet crawling you must:

  1. Create a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:
    http://lucene.apache.org/nutch/
    
  2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
    +^http://([a-z0-9]*\.)*apache.org/
    
    This will include any url in the domain apache.org.
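
The include rule for a given domain can also be produced mechanically. The sketch below is an illustration only: urlfilter_line is an invented name, and unlike the example above it escapes the dots in the domain so that they match a literal dot, which is a tightening chosen for this sketch.

```sh
# urlfilter_line DOMAIN
# Prints an include rule for conf/crawl-urlfilter.txt that covers DOMAIN
# and its subdomains. Dots in DOMAIN are escaped (this sketch's choice).
urlfilter_line() {
    escaped=$(printf '%s\n' "$1" | sed 's/\./\\./g')
    printf '+^http://([a-z0-9]*\\.)*%s/\n' "$escaped"
}
```

The printed line can then be pasted over the MY.DOMAIN.NAME line in conf/crawl-urlfilter.txt.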

Intranet: Running the Crawl

Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  • -dir dir names the directory to put the crawl in.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -delay delay determines the number of seconds between accesses to each host.
  • -threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10.

Once crawling has completed, one can skip to the Searching section below.

Whole-web Crawling

Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.

Whole-web: Concepts

Nutch data is of two types:

  1. The web database. This contains information about every page known to Nutch, and about links between those pages.
  2. A set of segments. Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
    • fetchlist is a file that names a set of pages to be fetched
    • the fetcher output is a set of files containing the fetched pages
    • the index is a Lucene-format index of the fetcher output.

In the following examples we will keep our web database in a directory named db and our segments in a directory named segments:

mkdir db
mkdir segments

Whole-web: Bootstrapping the Web Database

The admin tool is used to create a new, empty database:

bin/nutch admin db -create

The injector adds urls into the database. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Next we inject a random subset of these pages into the web database. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We inject one out of every 3000, so that we end up with around 1000 URLs:

bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000

This also takes a few minutes, as it must parse the full file.

Now we have a web database with around 1000 as-yet unfetched URLs in it.

Whole-web: Fetching

To fetch, we first generate a fetchlist from the database:

bin/nutch generate db segments

This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:

s1=`ls -d segments/2* | tail -1`
echo $s1

Now we run the fetcher on this segment with:

bin/nutch fetch $s1

When this is complete, we update the database with the results of the fetch:

bin/nutch updatedb db $s1

Now the database has entries for all of the pages referenced by the initial set.

Now we fetch a new segment with the top-scoring 1000 pages:

bin/nutch generate db segments -topN 1000
s2=`ls -d segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch updatedb db $s2

Let's fetch one more round:

bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb db $s3
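
Each round above repeats the same generate, fetch and updatedb sequence, so it can be wrapped in a pair of small functions. This is a sketch under stated assumptions: the function names and the NUTCH override variable are not part of Nutch and are introduced only for illustration.

```sh
# latest_segment SEGMENTS_DIR
# Prints the newest segment directory. Segment names are timestamps, so
# the last name in sorted order is the most recent one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# fetch_round DB SEGMENTS_DIR TOPN
# One generate / fetch / updatedb cycle, stopping at the first failure.
# NUTCH is an assumed override for the command (default: bin/nutch).
fetch_round() {
    ${NUTCH:-bin/nutch} generate "$1" "$2" -topN "$3" || return 1
    seg=$(latest_segment "$2")
    [ -n "$seg" ] || return 1
    ${NUTCH:-bin/nutch} fetch "$seg" || return 1
    ${NUTCH:-bin/nutch} updatedb "$1" "$seg"
}
```

Two further rounds like the ones above would then be: for i in 1 2; do fetch_round db segments 1000 || break; done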

By this point we've fetched a few thousand pages. Let's index them!

Whole-web: Indexing

To index each segment we use the index command, as follows:

bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3

Then, before we can search a set of segments, we need to delete duplicate pages. This is done with:

bin/nutch dedup segments dedup.tmp
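
When there are more than a few segments, indexing them one by one gets tedious. A loop such as the following could index every segment and then run the duplicate removal; it is a sketch only, with index_all and the NUTCH override variable being assumptions made for this example.

```sh
# index_all SEGMENTS_DIR
# Indexes every segment under SEGMENTS_DIR, then removes duplicates.
# NUTCH is an assumed override for the command (default: bin/nutch).
index_all() {
    segdir="$1"
    for seg in "$segdir"/2*; do
        # Skip the unexpanded pattern when no segment exists yet
        [ -d "$seg" ] || continue
        ${NUTCH:-bin/nutch} index "$seg" || return 1
    done
    ${NUTCH:-bin/nutch} dedup "$segdir" dedup.tmp
}
```

Here index_all segments would replace the three index commands and the dedup command shown above.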

Now we're ready to search!

Searching

To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)

Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:

rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war

The webapp finds its indexes in ./segments, relative to where you start Tomcat, so, if you've done intranet crawling, connect to your crawl directory, or, if you've done whole-web crawling, don't change directories, and give the command:

~/local/tomcat/bin/catalina.sh start

Then visit http://localhost:8080/ and have fun!

More detailed tutorials are available on the Nutch Wiki.

 

 This document contains instructions for downloading and installing Nutch and Lucene. Please be aware that you must be logged into the csci571 computer, and not aludra or nunki, to run Apache Tomcat.

  1. Downloading and Installing Nutch
  2. Downloading and Installing Lucene

Downloading and Installing Nutch
Chris A. Mattmann
mattmann@apache.org

Pre-requisites

  1. Installation of Java 1.4 or above. You can download java from http://java.sun.com

  2. Installation of Apache ANT 1.6 or above. You can download ANT from http://ant.apache.org 
  3. Installation of Apache Tomcat 5.5.19 or above. You can download Tomcat from http://tomcat.apache.org
  4. If you are using Windows OS, please install Cygwin: you can find Cygwin here: http://www.cygwin.com/ 
  5. Install the Subversion client. You can find Subversion at: http://subversion.tigris.org

 

Installation Instructions

  1. Download Nutch from SVN, using the Subversion command line client:

    # svn co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-0.8.1/ ./nutch

    1. This will install nutch into a directory called “nutch” local to wherever you ran this command. So, if you ran this command from /home/bogus, then you would have a directory called /home/bogus/nutch 
    2. We’ll call the directory where you unpacked Nutch to your $NUTCH_HOME 
  2. Cd into the Nutch directory, and compile Nutch:

    # cd nutch
    # ant 
    1. You should see a message such as the following if all is well and the build ran successfully

      compile:

      job:
      [jar] Building jar: /Users/mattmann/tmp/nutch/build/nutch-0.8.1.job

      BUILD SUCCESSFUL

      Total time: 27 seconds
  3. Okay, now that Nutch is built, you can fetch some content. There is a detailed, step-by-step set of instructions on the wiki for how to fetch content. This page provides all the details: http://wiki.apache.org/nutch/NutchTutorial

  4. Once you’ve fetched some content, you’ll probably want to browse it. To get Nutch set up on Tomcat, first build the Nutch webapp (run the below command from your nutch directory):

    # ant war 
  5. The above command will construct a nutch-0.8.1.war file within $NUTCH_HOME/build. It will also construct a nutch.xml file within $NUTCH_HOME/build. The nutch.xml is a Tomcat context.xml file that you can use to configure a WAR file for deployment within Tomcat. 
  6. First, make a directory for your nutch war file and your nutch context.xml file to live in. /usr/local/nutch is a good place.

    # mkdir /usr/local/nutch
    # cp -R $NUTCH_HOME/build/nutch-0.8.1.war /usr/local/nutch
    # cp -R $NUTCH_HOME/build/nutch.xml /usr/local/nutch 
  7. Next, edit your /usr/local/nutch/nutch.xml file

    Inside the file, set the property searcher.dir to the path of the Nutch index that you created separately (in step 3 above). If that directory is /home/bogus/nutch/my.crawl, then you would set searcher.dir to /home/bogus/nutch/my.crawl. 
  8. Edit your /usr/local/nutch/nutch.xml file again

    Edit the docBase attribute on the Context tag to be the FULL path to your Nutch 0.8.1 WAR file, e.g., /usr/local/nutch/nutch-0.8.1.war 
  9. Now, assuming that you have installed Tomcat according to the pre-requisites, and assuming that you have set the $TOMCAT_HOME environment variable (that points to your Tomcat installation directory), first shut down Tomcat:

    # cd $TOMCAT_HOME/bin
    # ./shutdown.sh

    Now, symlink your context.xml file for Nutch to the Tomcat conf directory

    # ln -s /usr/local/nutch/nutch.xml $TOMCAT_HOME/conf/Catalina/localhost/nutch.xml

    Now, restart your Tomcat server:

    # cd $TOMCAT_HOME/bin
    # ./startup.sh 
  10. If everything went right in step 9, then you should open up your browser, point it at your Tomcat installation (e.g., http://localhost:8080), and append the path “/nutch” at the end of it. So, if you installed Tomcat to run on port 8080, then you would visit: http://localhost:8080/nutch/

That’s it!

If you have any further questions, please feel free to contact me at the email address provided above.

 
