This howto explains how to get Nutch, Nutch-Gui, the Sun JDK and Tomcat 6.0.16 working on CentOS 5.x while keeping the rest of the CentOS system functioning normally. CentOS 5.x currently ships with Tomcat 5.5. It does run, but the default install of that version produces persistent errors that are undocumented at this time. If you believe these errors have been addressed and can point to a fix, please use the contact form on this website to let us know. Any software installed by following this howto can be removed easily with either "rpm -e foo" or "rm -rf /opt/foo", returning your system to its original state.
We test this stuff on our own machines, really we do. But you may run into problems; if you do, come to #centos on irc.freenode.net.
Applicable to CentOS Versions:
Requirements
Doing the Work
sudo yum install ant xml-commons-apis ant-trax
Go here:
http://java.sun.com/javase/downloads/index.jsp
Get the following:
Java SE Development Kit (JDK) 32bit (approx. 73.98MB)
JDK 6 Update 16 (or the latest update, the version is important in setting your JAVA_HOME path variable)
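Because the update number ends up in the install path, it can help to work out the expected JAVA_HOME from the installer file name before editing any configuration. The following is only a sketch: the function name jdk_home_from_installer is hypothetical, and it assumes the Sun RPM installs into /usr/java/jdk1.<major>.0_<update> with a two-digit update number, which matches the /usr/java/jdk1.6.0_16 path used later in this howto.

```shell
#!/bin/sh
# Hypothetical helper (not part of the JDK or CentOS): guess the JDK install
# directory from a Sun installer file name such as jdk-6u16-linux-i586-rpm.bin.
# Assumption: the RPM installs into /usr/java/jdk1.<major>.0_<update>,
# with the update number padded to two digits.
jdk_home_from_installer() {
    name=$(basename "$1")
    # Pull "<major> <update>" out of names shaped like jdk-<major>u<update>-...
    parts=$(printf '%s\n' "$name" |
        sed -n 's/^jdk-\([0-9][0-9]*\)u\([0-9][0-9]*\)-.*$/\1 \2/p')
    # Anything that does not look like a JDK installer is rejected.
    [ -n "$parts" ] || return 1
    set -- $parts
    printf '/usr/java/jdk1.%s.0_%02d\n' "$1" "$2"
}

# Example:
#   jdk_home_from_installer jdk-6u16-linux-i586-rpm.bin
```

If you downloaded a later update, pass that file name instead and use the printed path wherever this howto shows jdk1.6.0_16, after confirming the directory really exists under /usr/java.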
Once downloaded install using the following:
chmod +x jdk-6u16-linux-i586-rpm.bin; ./jdk-6u16-linux-i586-rpm.bin
answer "yes" to the EULA
sudo rpm -ivh jdk* sun*
Download the Tomcat 6.0.16 rpm:
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.noarch.rpm
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.src.rpm (provided for reference)
Once downloaded install the rpm with the following command:
sudo rpm -ivh tomcat-6.0.16-0.noarch.rpm (this installs entirely into /opt/tomcat and can be removed with: rpm -e tomcat)
sudo vi /opt/tomcat/conf/tomcat-env.sh (set: JAVA_HOME="/usr/java/jre1.6.0_16")
Download Nutch 1.0 from a mirror here: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
sudo cp nutch-1.0.tar.gz /opt; cd /opt && sudo tar xvfz nutch-1.0.tar.gz; cd nutch-1.0
sudo ant
sudo ant war (this creates the "build" directory)
sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
(set the property "searcher.dir" to /opt/nutch-1.0/crawl/ and set the docBase attribute
to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/")
sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
(a .war file is a zip/jar file known as a "web archive"; Tomcat unpacks it when it is started)
Add these lines to /etc/profile, just above: # ksh workaround
sudo vi /etc/profile
##Tomcat 6 / Java##
JAVA_HOME="/usr/java/jdk1.6.0_16"
export JAVA_HOME
CATALINA_HOME="/opt/tomcat"
export CATALINA_HOME
NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
export NUTCH_JAVA_HOME
##End Tomcat 6 / Java##
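After editing /etc/profile it is worth confirming that each variable is set and points at a real directory. The sketch below is one way to do that; check_env_dir is a hypothetical helper name, not something shipped with Tomcat or Nutch.

```shell
#!/bin/sh
# Hypothetical helper: report whether a variable names an existing directory.
# Usage: check_env_dir VARIABLE_NAME
check_env_dir() {
    # Indirect lookup of the variable whose name was passed in.
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$1 points to a missing directory: $value"
        return 1
    fi
    echo "$1=$value"
}

# Example (run in a new login shell so /etc/profile has been re-read):
#   for v in JAVA_HOME CATALINA_HOME NUTCH_JAVA_HOME; do check_env_dir "$v"; done
```

A "not set" result usually means the current shell was started before the edit; log out and back in, or run ". /etc/profile", and try again.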
cd /opt/nutch-1.0; sudo mkdir urls
(in this directory, create a flat text file called "seed" listing the URLs to be crawled, one URL per line, for example: http://www.example.com)
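As an alternative to typing the seed file by hand, a small function can write it. This is a sketch only: write_seed_file is a hypothetical name, and the directory is passed in rather than hard-coded so it can be tried somewhere harmless first.

```shell
#!/bin/sh
# Hypothetical helper: write one URL per line into <dir>/seed.
# Usage: write_seed_file DIR URL [URL...]
write_seed_file() {
    dir=$1
    shift
    # Refuse to create an empty seed file.
    [ "$#" -ge 1 ] || return 1
    mkdir -p "$dir" || return 1
    # printf repeats its format for every remaining argument,
    # which yields exactly one URL per line.
    printf '%s\n' "$@" > "$dir/seed"
}

# Example (the urls directory was created with sudo, so run this as root):
#   write_seed_file /opt/nutch-1.0/urls http://www.example.com
```

Note that the function overwrites any existing seed file in that directory.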
sudo vi conf/nutch-default.xml
edit the following:
http.agent.name
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
All other values remain at their defaults. Do not attempt to alter them unless you have a backup and/or you know what you're doing.
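To double-check the edit, the following sketch lists which of the five properties above still have an empty value. The function name empty_agent_props is hypothetical, and the check assumes each <name> line is followed directly by its <value> line, so treat the result as a hint rather than a validation of the file.

```shell
#!/bin/sh
# Hypothetical helper: list the agent properties whose <value> is still empty.
# Assumption: each <name>...</name> line is followed directly by its
# <value>...</value> line. Properties that are absent are not reported.
empty_agent_props() {
    conf=$1
    for prop in http.agent.name http.robots.agents http.agent.description \
                http.agent.url http.agent.email; do
        # Take the <name> line plus the line after it, then look for an empty value.
        if grep -F -A1 "<name>$prop</name>" "$conf" | grep -q '<value></value>'; then
            echo "$prop"
        fi
    done
}

# Example:
#   empty_agent_props /opt/nutch-1.0/conf/nutch-default.xml
```

No output means none of the five properties was found with an empty value.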
Put this script in /opt/nutch-1.0/bin
chmod +x deepcrawler
Note: this script assumes the URLs you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new directory, /opt/nutch-1.0/crawl1, to store the new crawl.
If you do not alter the deepcrawler script, it will most likely run for many days or weeks depending on the number of URLs you inject, so you'll want to run it in screen.
screen -S nutch
sudo service tomcat start
cd /opt/nutch-1.0; su -c "bin/deepcrawler"
Note: if you use the script provided above, you can skip the GUI altogether.
Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && sudo tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
sudo ant clean package
cd build/nutch-gui-0.2
sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
unsecured quick test, to make sure it's working:
su -c "bin/nutch admin /opt/nutch-1.0 50060"
http://example.com:50060/general
more secure password protection:
sudo vi conf/nutchguiUsers.properties
(edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role)
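For illustration, an entry in the format described above can be generated like this. It is a minimal sketch: gui_user_line is a hypothetical helper, and the user name and password in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print one entry in the "user=password, role" format
# described in this howto.
# Usage: gui_user_line USER PASSWORD ROLE
gui_user_line() {
    printf '%s=%s, %s\n' "$1" "$2" "$3"
}

# Example (placeholder credentials; appends to the file as root via tee):
#   gui_user_line alice s3cret admin | sudo tee -a conf/nutchguiUsers.properties
```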
screen -S nutch-gui (since we'll probably run it for a while)
su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
http://example.com:50060/general
Troubleshooting
How to test
rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
tomcat-6.0.16-0
jdk-1.6.0_16-fcs
ant-1.6.5-2jpp.2
xml-commons-apis-1.3.02-0.b2.7jpp.10
ant-trax-1.6.5-2jpp.2
/usr/java/jdk1.6.0_16
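If the output differs from the listing above, the quickest thing to look for is a package that rpm reports as missing. The sketch below filters rpm's "package <name> is not installed" lines; missing_packages is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `rpm -q` output on stdin and print the names of
# packages that rpm reported as "package <name> is not installed".
missing_packages() {
    sed -n 's/^package \(.*\) is not installed$/\1/p'
}

# Example:
#   rpm -q tomcat jdk ant xml-commons-apis ant-trax | missing_packages
```

No output means all five packages are installed. The version numbers may still differ from the ones shown above if you installed a later update.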
Replace "localhost" with your machine's IP address
Try accessing Tomcat here: http://localhost:8080/
Try accessing Nutch here: http://localhost:8080/nutch/
Try accessing Nutch-Gui here: http://localhost:50060/general
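The same three checks can be made from the command line. The sketch below only builds the URL list for a given host; nutch_test_urls is a hypothetical helper, the IP address in the example is a placeholder, and the commented loop assumes curl is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the three URLs from this howto for a given host.
# Usage: nutch_test_urls [HOST]   (HOST defaults to localhost)
nutch_test_urls() {
    host=${1:-localhost}
    printf 'http://%s:8080/\n' "$host"
    printf 'http://%s:8080/nutch/\n' "$host"
    printf 'http://%s:50060/general\n' "$host"
}

# Example: print the HTTP status code returned by each URL (requires curl).
#   nutch_test_urls 192.0.2.10 | while read -r url; do
#       printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$url")" "$url"
#   done
```

A status of 000 from curl means the connection itself failed, which usually points at a service that is not running or a firewall rule.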
sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off
Common problems and fixes
More Information
Disclaimer
Added Reading


