
Installing & Configuring Nutch, Nutch-Gui, Sun JDK & Tomcat 6 on Centos 5.x

This howto explains how to get Nutch, Nutch-Gui, the Sun JDK and Tomcat 6.0.16 working on Centos 5.x while keeping the rest of the system functioning normally. Centos 5.x currently ships with Tomcat 5.5. It does run, but the default install of that version produces persistent errors that are undocumented at this time. If you believe these errors have been addressed and can point to a fix, please use the contact form on this website to let us know. Anything installed by following this howto can be removed easily with either "rpm -e foo" or "rm -rf /opt/foo", returning your system to its original state.

Applicable to Centos Versions:

  • Centos 5.x

Requirements

  1. Root or sudo access with appropriate privileges to the system you intend to install on.
  2. A server preferably on a high-speed network.
  3. Sun JDK rpm.bin.
  4. Tomcat 6 rpm.
  5. Nutch 1.0 tar.gz.
  6. Nutch-Gui 0.2 tar.gz.

Doing the Work


  1. Install a few dependencies:
  2. sudo yum install ant xml-commons-apis ant-trax
  3. Download & install the latest Sun JDK rpm.bin:
  4. Go here:
    http://java.sun.com/javase/downloads/index.jsp

    Get the following:
    Java SE Development Kit (JDK) 32bit (approx. 73.98MB)
    JDK 6 Update 16 (or the latest update; the version number determines the path you set for JAVA_HOME later)


    Once downloaded install using the following:
    chmod +x jdk-6u16-linux-i586-rpm.bin; ./jdk-6u16-linux-i586-rpm.bin
    answer "yes" to the EULA
    sudo rpm -ivh jdk* sun*
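Since the update number ends up in the JAVA_HOME path, here is a minimal sketch of how the two relate. It assumes installer names of the form jdk-6uNN-linux-ARCH-rpm.bin and an install root of /usr/java, as used elsewhere in this howto; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a Sun JDK 6 installer filename to the directory
# the rpm is expected to install into. Assumes the jdk-6uNN-linux-ARCH-rpm.bin
# naming scheme and the /usr/java install root used in this howto.
jdk_home_for_installer() {
    update=$(printf '%s\n' "$1" |
        sed -n 's/^jdk-6u\([1-9][0-9]*\)-linux-.*-rpm\.bin$/\1/p')
    if [ -z "$update" ]; then
        echo "unrecognised installer name: $1" >&2
        return 1
    fi
    # The update number is zero-padded to two digits in the directory name.
    printf '/usr/java/jdk1.6.0_%02d\n' "$update"
}

jdk_home_for_installer jdk-6u16-linux-i586-rpm.bin
```

If you downloaded a newer update, pass its filename instead and use the printed path wherever this howto says /usr/java/jdk1.6.0_16. Confirm the result with "ls /usr/java" after installing.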
  5. Download & install Tomcat 6:
  6. http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.noarch.rpm
    http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.src.rpm (provided for reference)

    Once downloaded install the rpm with the following command:

    sudo rpm -ivh tomcat-6.0.16-0.noarch.rpm (this installs entirely into /opt/tomcat and can be removed with: rpm -e tomcat)
    sudo vi /opt/tomcat/conf/tomcat-env.sh (set: JAVA_HOME="/usr/java/jdk1.6.0_16")
  7. Download & install Nutch 1.0:
  8. Download Nutch 1.0 from a mirror here: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
    sudo cp nutch-1.0.tar.gz /opt; cd /opt && sudo tar xvfz nutch-1.0.tar.gz; cd nutch-1.0
    sudo ant
    sudo ant war (this creates the "build" directory)

    sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
    (in nutch.xml, set the "searcher.dir" property to /opt/nutch-1.0/crawl/ and the docBase attribute
    to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/")

    sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
    (a .war file, or "web archive", is a zip/jar file; Tomcat unpacks it when it starts)
  9. Edit /etc/profile:
  10. Add these lines just above: # ksh workaround

    sudo vi /etc/profile

    ##Tomcat 6 / Java##
    JAVA_HOME="/usr/java/jdk1.6.0_16"
    export JAVA_HOME
    CATALINA_HOME="/opt/tomcat"
    export CATALINA_HOME
    NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
    export NUTCH_JAVA_HOME
    ##End Tomcat 6 / Java##
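After editing /etc/profile, the new variables only show up in a fresh login shell (or after running ". /etc/profile"). The following is a minimal sketch for confirming that each variable is set and points at a real directory; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether the named variable is set and
# refers to an existing directory.
check_dir_var() {
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name: not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name: $value is not a directory"
        return 1
    fi
    echo "$name: $value ok"
}

# Check the three variables added to /etc/profile above.
for v in JAVA_HOME CATALINA_HOME NUTCH_JAVA_HOME; do
    check_dir_var "$v" || true
done
```

A "not a directory" result usually means the JDK version in the path does not match the update you actually installed.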
  11. Configure Nutch to fetch URLs:
  12. cd /opt/nutch-1.0; sudo mkdir urls
    (in this directory, create a flat text file called "seed" listing the urls to be crawled, one url per line, e.g. http://www.example.com)
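The seed file can also be written from the shell rather than an editor. This is a minimal sketch: the function name and the second url are illustrative assumptions, and the example writes to a scratch directory rather than /opt/nutch-1.0/urls.

```shell
#!/bin/sh
# Hypothetical helper: write one url per line to <dir>/seed,
# creating the directory first if it does not exist.
write_seed() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed"
}

# Demonstration in a scratch directory. On the real system the target
# would be /opt/nutch-1.0/urls, written as root.
demo_dir=$(mktemp -d)
write_seed "$demo_dir/urls" http://www.example.com http://www.example.org
cat "$demo_dir/urls/seed"
```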

    sudo vi conf/nutch-default.xml

    edit the following:

    http.agent.name My Spider
    http.robots.agents My Spider
    http.agent.description My Bot
    http.agent.url http://www.example.com
    http.agent.email (your contact e-mail address)
    Leave all other values at their defaults; do not alter them unless you have a backup and/or know what you're doing.
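To double-check the edits without reopening the file, the values can be read back from the shell. The sketch below assumes each <name> and <value> element sits on a single line, which is how the properties listed above are normally laid out; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows <name>NAME</name>
# in a Nutch/Hadoop-style XML config file.
# Assumes <name>...</name> and <value>...</value> are each on one line.
get_property() {
    awk -v want="$2" '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name)
            found = (name == want)
        }
        found && /<value>/ {
            value = $0
            sub(/.*<value>/, "", value); sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}

# Print the five agent properties if the config file is present.
conf=/opt/nutch-1.0/conf/nutch-default.xml
if [ -r "$conf" ]; then
    for p in http.agent.name http.robots.agents http.agent.description \
             http.agent.url http.agent.email; do
        printf '%s = %s\n' "$p" "$(get_property "$conf" "$p")"
    done
fi
```

An empty value after the "=" means the property was not filled in.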
  13. Nutch "deepcrawler" script:
  14. Put this script in /opt/nutch-1.0/bin
    chmod +x deepcrawler
    Note: this script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new dir in:
    /opt/nutch-1.0/crawl1 to store the new crawl.
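The script itself is not reproduced on this page. As a rough idea of what such a script might contain, here is a hypothetical stand-in built from the standard Nutch 1.0 command-line steps (inject, generate, fetch, updatedb, invertlinks, index). It is not the original deepcrawler: the DEPTH and DRY_RUN variables and the function names are assumptions made for this sketch, and by default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical stand-in for "deepcrawler"; NOT the original script.
# DRY_RUN=1 (the default here) only prints commands; DRY_RUN=0 runs them.
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch-1.0}
CRAWL_DIR=${CRAWL_DIR:-$NUTCH_HOME/crawl1}
SEED_DIR=${SEED_DIR:-$NUTCH_HOME/urls}
DEPTH=${DEPTH:-3}        # assumed number of generate/fetch/update rounds
DRY_RUN=${DRY_RUN:-1}

# Echo a command, and execute it unless we are in dry-run mode.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

deepcrawl() {
    nutch="$NUTCH_HOME/bin/nutch"
    run "$nutch" inject "$CRAWL_DIR/crawldb" "$SEED_DIR" || return 1
    round=1
    while [ "$round" -le "$DEPTH" ]; do
        run "$nutch" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1
        # generate names each segment by timestamp, so the newest sorts last
        segment=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -n 1)
        if [ -z "$segment" ]; then
            [ "$DRY_RUN" = "1" ] || break                # nothing generated
            segment="$CRAWL_DIR/segments/NEWEST"         # dry-run placeholder
        fi
        run "$nutch" fetch "$segment" || return 1
        run "$nutch" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
        round=$((round + 1))
    done
    run "$nutch" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
    run "$nutch" index "$CRAWL_DIR/indexes" "$CRAWL_DIR/crawldb" \
        "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}

deepcrawl
```

Review the printed commands first, then rerun with DRY_RUN=0 to crawl for real. A larger DEPTH follows links further from the seed urls and lengthens the run accordingly.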

  15. Fetch URLs with Nutch via command line:
  16. Unless you alter the deepcrawler script, it will most likely run for many days or weeks, depending on the number of urls you inject,
    so you'll want to run it in screen.

    screen -S nutch
    sudo service tomcat start
    cd /opt/nutch-1.0; su -c "bin/deepcrawler"
  17. Download & install Nutch-Gui 0.2:
  18. Note: if you use the script provided above, you can skip the GUI altogether.

    Download Nutch-Gui 0.2 from:
    http://github.com/101tec/nutch/downloads
    sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && sudo tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
    sudo ant clean package
    cd build/nutch-gui-0.2
    sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war

    Unsecured quick test, to make sure it's working:
    su -c "bin/nutch admin /opt/nutch-1.0 50060"
    http://example.com:50060/general

    More secure, with password protection:
    sudo vi conf/nutchguiUsers.properties
    (edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role)
    screen -S nutch-gui (since we'll probably run it for a while)
    su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
    http://example.com:50060/general

Troubleshooting

How to test

  1. Make sure the required packages are installed and the JAVA_HOME variable is set in /etc/profile:
  2. rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
    tomcat-6.0.16-0
    jdk-1.6.0_16-fcs
    ant-1.6.5-2jpp.2
    xml-commons-apis-1.3.02-0.b2.7jpp.10
    ant-trax-1.6.5-2jpp.2
    /usr/java/jdk1.6.0_16

    Replace "localhost" with your machine's IP address
    Try accessing Tomcat here: http://localhost:8080/
    Try accessing Nutch here: http://localhost:8080/nutch/
    Try accessing Nutch-Gui here: http://localhost:50060/general
  3. Set Tomcat to start on boot:
  4. sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
    tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off
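The chkconfig output can be checked by eye, but if you want to script it, here is a minimal sketch that pulls out the runlevels marked "on". The function name is made up for illustration, and the sample line is the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: read "chkconfig --list" lines on stdin and print,
# for each line, the runlevels whose state is "on".
runlevels_on() {
    awk '{
        out = ""
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[2] == "on") out = out (out == "" ? "" : " ") kv[1]
        }
        print out
    }'
}

# Sample line taken from the expected output above; prints: 2 3 4 5
echo "tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off" | runlevels_on
```

On the real system you would feed it live output instead: chkconfig --list tomcat | runlevels_on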

Common problems and fixes

Describe common problems here, include links to known common problems if on another site

More Information

Any additional information or notes.

Disclaimer

We test this stuff on our own machines, really we do. But you may still run into problems. If you do, come to #centos on irc.freenode.net.

Added Reading

 
