Tuesday, 02 February 2010
linux
created by sjw/mfgis 2feb07
Installing and Running Nutch Under Debian 'Etch'
Install Sun's Java
Sun Java is available as a set of Debian packages and may be easily installed using apt. To obtain Sun's Java, ensure that 'non-free' is included in /etc/apt/sources.list
Since there may be more than one flavor of Java on the system (e.g. kaffe) ensure that Sun Java is the chosen alternative
If necessary edit /etc/profile to include the following lines:
Install Tomcat5.5 and Verify that it is functioning
Verify Tomcat is running:
Tomcat may be started and/or stopped using the following:
It is NOT necessary to run '~/local/tomcat/bin/catalina.sh start' as noted elsewhere in the WIKI, nor is it necessary to start tomcat/catalina from any particular location
Tomcat5.5 under Debian Etch listens to port 8180, not 8080, so pointing your browser to http://blahblah:8180 will bring up the Tomcat home page, if everything is functioning properly.
Grant Yourself Tomcat Manager Permissions
Edit /usr/share/tomcat5.5/conf/tomcat-users.xml and include the following:
Enter the Tomcat Manager
Tomcat5.5 under Debian Etch comes pre-installed with a handful of simple webapps. Clicking on the Tomcat Manager link from the Tomcat home page will show you a list of these applications and their execution status. Later we will return to this page to verify that our nutch applications are running.
Acquire a copy of nutch and unpack it in a new directory location. I suggest using /usr/local/nutch as the top-level directory, but this is of course optional
Follow the section Intranet:Configuration from the Nutch tutorial at http://lucene.apache.org/nutch/tutorial8.html. However, plan in advance for crawling and searching sites independently from one another:
Given two sites, site1 and site2 which you wish to crawl/index (and later search) independently from each other, you may make multiple copies of the conf directory:
And then work through steps one through four of the above mentioned section for each site.
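The copies described above might be made along these lines. This is only a sketch: the helper name copy_conf_per_site and the conf-site1 / conf-site2 directory names are assumptions for illustration, not a Nutch convention.

```shell
#!/bin/sh
# Sketch: give each site its own copy of the Nutch conf directory.
# conf-<site> is a hypothetical naming scheme chosen for this example.
copy_conf_per_site() {
  nutch_home="$1"; shift
  for site in "$@"; do
    # Leave an existing copy alone so earlier edits are not overwritten.
    [ -d "$nutch_home/conf-$site" ] || cp -R "$nutch_home/conf" "$nutch_home/conf-$site"
  done
}
```

For the layout suggested earlier this would be called as `copy_conf_per_site /usr/local/nutch site1 site2`.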
Create simple shell scripts which allow for the independent crawling of each site, such as /usr/local/nutch/crawl_site1.sh
and the same for site2.
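A per-site crawl script could look roughly like the following. It is a sketch under assumptions: the conf-<site> directories, the urls-<site> seed directories and the DRY_RUN switch are hypothetical names introduced here. It relies on NUTCH_CONF_DIR, which bin/nutch reads to pick an alternate configuration directory.

```shell
#!/bin/sh
# Sketch of a per-site crawl wrapper.
# conf-<site>, urls-<site> and DRY_RUN are assumptions made for this example.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/nutch}"

crawl_site() {
  site="$1"
  depth="${2:-3}"
  conf="$NUTCH_HOME/conf-$site"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command that would be run.
    echo "NUTCH_CONF_DIR=$conf bin/nutch crawl urls-$site -dir crawls/$site -depth $depth"
  else
    # Run the crawl from the Nutch top-level directory with the site's config.
    (cd "$NUTCH_HOME" && NUTCH_CONF_DIR="$conf" bin/nutch crawl "urls-$site" -dir "crawls/$site" -depth "$depth")
  fi
}
```

Running `DRY_RUN=0 crawl_site site1` would then crawl site1 into crawls/site1 under /usr/local/nutch.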
Then proceed to crawl each site:
Under Debian Etch, the Catalina configuration files are located under /etc/tomcat5.5/policy.d. At runtime they are combined into a single file, /usr/share/tomcat5.5/conf/catalina.policy. Do not edit the latter, as it will be overwritten.
At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:
{{{
grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
    permission java.util.PropertyPermission "user.dir", "read";
    permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
    permission java.util.PropertyPermission "org.apache.*", "read,execute";
    permission java.io.FilePermission "/usr/local/nutch/crawls/-" , "read";
    permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
    permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";
    permission java.lang.RuntimePermission "createClassLoader", "";
    permission java.security.AllPermission;
};
}}}
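Warning: The last line (java.security.AllPermission) was necessary in order to make things work for me. It grants the webapps every permission, so the entries before it no longer restrict anything. If anybody can supply a more restrictive permission set, please do so! The full effects of this are unknown.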
Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching
Under Debian Etch & Tomcat5.5 the webapps path is located at
Contrary to the Nutch tutorial(s) it is NOT NECESSARY to remove the ROOT context nor is it desirable. It was noted above that the Tomcat Manager allows us to view and control our multiple applications. Removing ROOT would break this functionality.
Create two new folders under /usr/share/tomcat5.5-webapps, and explode the nutch war file into each:
{{{
# cd /usr/share/tomcat5.5-webapps
# mkdir site1
# mkdir site2
# cp /usr/local/nutch/nutch-0.8.1.war site1
# cp /usr/local/nutch/nutch-0.8.1.war site2
# cd site1; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..
# cd site2; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..
}}}
Edit the site1/WEB-INF/classes/nutch-default.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch and save it as nutch-site.xml after making the following changes:
{{{
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/crawls/site1</value>
</property>
}}}
And repeat for site2.
Create site1.xml and site2.xml under /usr/share/tomcat5.5-webapps by modifying the distribution nutch-site.xml
And repeat for site2.
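The contents of the context files are not shown here, so the following is only a guess at a minimal version. It writes a Tomcat Context element whose docBase points at the exploded webapp directory; the helper name write_context and the exact attribute values are assumptions.

```shell
#!/bin/sh
# Sketch: write a minimal Tomcat context file for one site.
# write_context is a hypothetical helper; path and docBase mirror the
# site1/site2 directories created under the webapps directory.
write_context() {
  webapps="$1"
  site="$2"
  cat > "$webapps/$site.xml" <<EOF
<Context path="/$site" docBase="$webapps/$site" />
EOF
}
```

For example, `write_context /usr/share/tomcat5.5-webapps site1` would produce site1.xml next to the site1 directory.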
Create symbolic links to these files under /usr/share/tomcat5.5/conf/Catalina/localhost
ln -s /usr/share/tomcat5.5-webapps/site1.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site1.xml
ln -s /usr/share/tomcat5.5-webapps/site2.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site2.xml
Restart Tomcat
/etc/init.d/tomcat5.5 restart
Revisit the Tomcat Manager. You should see new entries for site1 and site2, and with luck their Running status should show as True.
Search Your Sites!
Point your browser to http://blahblah:8180/site1 and conduct a search.
Point your browser to http://blahblah:8180/site2 and conduct another search.
If everything was configured properly you should see independent results representing independent searches on independent crawls.
FIN.
Sunday, 24 January 2010
linux
Today I started to work on a little project that required a crawler, and Nutch seemed to do most of what I needed. The Nutch team conveniently released Nutch 1.0 late in March 2009, so I had a brand new release to test out. Installing Nutch 1.0 on a Mac is not as straightforward as I thought. I ran into a lot of unexpected issues, so here is my cookbook description of how to successfully install Nutch 1.0 on your Mac.
- Download the latest source code from the Apache SVN repository: http://svn.apache.org/repos/asf/lucene/nutch/. I tried running it from the tarball without success. I also tried to compile the source from the tarball, but a post on the Nutch forum clearly states that this will not work.
- Set your JAVA_HOME and NUTCH_JAVA_HOME variables. Again, this is not straightforward: they both need to point to your real installation of Java 1.6 (earlier versions of Java will fail). I set these variables to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home; I could not get the /Library/Java/Home symbolic link to work properly.
- Compile the source code using Ant (I built it in Eclipse).
- Setup your nutch configuration, by following the tutorial by Peter P. Wang
- Run your first crawl with: ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Most of the issues I encountered were related to the Java version and the fact that the /Applications/Utilities/Java/Java Preferences application does not really change the JAVA_HOME directory /Library/Java/Home properly. So make sure you have set both JAVA_HOME and NUTCH_JAVA_HOME, and that OS X does not fool you when it pretends to be symbolically linking to the 1.6 installation.
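A quick sanity check for the two variables might look like this. It is a sketch: check_java_home is a hypothetical helper, and it only confirms that each variable is set and points at a directory containing an executable bin/java, not that the version is 1.6.

```shell
#!/bin/sh
# Sketch: verify that a Java home variable is set and contains bin/java.
# check_java_home is a hypothetical helper name; it does not check the version.
check_java_home() {
  # $1 is the NAME of the variable to inspect, e.g. JAVA_HOME.
  eval "dir=\${$1:-}"
  if [ -z "$dir" ]; then
    echo "$1 is not set"
    return 1
  fi
  if [ ! -x "$dir/bin/java" ]; then
    echo "$1=$dir has no executable bin/java"
    return 1
  fi
  echo "$1=$dir looks usable"
}
```

Calling `check_java_home JAVA_HOME && check_java_home NUTCH_JAVA_HOME` before building would catch the case where one of the two is missing or points at a broken link.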
Sunday, 24 January 2010
linux
This howto explains how to get Nutch, Nutch-Gui, Sun JDK & Tomcat 6.0.16 working on CentOS 5.x while maintaining a normally functioning CentOS system. Currently, CentOS 5.x ships with Tomcat 5.5. While it does run, there are problems with the default install of this version that result in errors which are undocumented and persistent at this time. If you have information or believe that these errors have been addressed and can point to a fix, please use the contact form on this website to let us know. The following instructions allow for easy removal of any software installed through this howto by using either "rpm -e foo.rpm" or "rm -rf /opt/foo", returning your system to its original state.
Applicable to Centos Versions:
Requirements
- Root or sudo access with appropriate privileges to the system you intend to install on.
- A server preferably on a high-speed network.
- Sun JDK rpm.bin.
- Tomcat 6 rpm.
- Nutch 1.0 tar.gz.
- Nutch-Gui 0.2 tar.gz.
Doing the Work
- Install a few dependencies:
sudo yum install ant xml-commons-apis ant-trax
- Download & install the latest Sun JDK rpm.bin:
Go here: http://java.sun.com/javase/downloads/index.jsp
Get the following: Java SE Development Kit (JDK) 32bit (approx. 73.98MB) JDK 6 Update 16 (or the latest update, the version is important in setting your JAVA_HOME path variable)
Once downloaded install using the following:
chmod +x jdk-6u16-linux-i586-rpm.bin; ./jdk-6u16-linux-i586-rpm.bin
(answer "yes" to the EULA)
sudo rpm -ivh jdk* sun*
- Download & install Tomcat 6:
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.noarch.rpm
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.src.rpm (provided for reference)
Once downloaded install the rpm with the following command:
sudo rpm -ivh tomcat-6.0.16-0.noarch.rpm
(this installs entirely into /opt/tomcat and can be removed with: rpm -e tomcat)
sudo vi /opt/tomcat/conf/tomcat-env.sh
(set: JAVA_HOME="/usr/java/jre1.6.0_16")
- Download & install Nutch 1.0:
Download Nutch 1.0 from a mirror here: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
sudo cp nutch-1.0.tar.gz /opt; cd /opt && tar xvfz nutch-1.0.tar.gz; cd nutch-1.0
sudo ant
sudo ant war
(this creates the "build" directory)
sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
(modify the property "searcher.dir" to: /opt/nutch-1.0/crawl/ & the docBase attribute to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/")
sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
(a .war file is a zip/jar file known as a "web archive"; it is uncompressed when Tomcat is started)
- Edit /etc/profile:
Add these lines just above: # ksh workaround
sudo vi /etc/profile
##Tomcat 6 / Java##
JAVA_HOME="/usr/java/jdk1.6.0_16"
export JAVA_HOME
CATALINA_HOME="/opt/tomcat"
export CATALINA_HOME
NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
export NUTCH_JAVA_HOME
##End Tomcat 6 / Java##
- Configure Nutch to fetch URLs:
cd /opt/nutch-1.0; sudo mkdir urls
(make a flat text file in here called "seed" and list the urls to be crawled, one url per line, e.g. http://www.example.com)
sudo vi conf/nutch-default.xml
edit the following:
- http.agent.name: My Spider
- http.robots.agents: My Spider
- http.agent.description: My Bot
- http.agent.url: http://www.example.com
- http.agent.email: admin@example.com
All other values remain as default; do not attempt to alter them unless you have a backup and/or you know what you're doing.
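The seed file from this step can also be created from the command line. A minimal sketch, assuming a hypothetical helper named make_seed that takes the Nutch directory followed by the URLs to crawl:

```shell
#!/bin/sh
# Sketch: create urls/seed with one URL per line.
# make_seed is a hypothetical helper name used for illustration.
make_seed() {
  nutch_dir="$1"; shift
  mkdir -p "$nutch_dir/urls"
  # printf repeats the format for every remaining argument, one URL per line.
  printf '%s\n' "$@" > "$nutch_dir/urls/seed"
}
```

For the layout in this howto that would be `make_seed /opt/nutch-1.0 http://www.example.com`, run with enough privileges to write under /opt.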
- Nutch "deepcrawler" script:
Put this script in /opt/nutch-1.0/bin
chmod +x deepcrawler
Note: this script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new dir in /opt/nutch-1.0/crawl1 to store the new crawl.
- Fetch URLs with Nutch via command line:
If you do not alter the deepcrawler script it will most likely run for many days or weeks depending on the amount of urls you inject, so you'll want to run it in screen.
screen -S nutch
sudo service tomcat start
cd /opt/nutch-1.0; su -c "bin/deepcrawler"
- Download & install Nutch-Gui 0.2:
Note: if you use the script provided above, you can skip the GUI altogether.
Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
sudo ant clean package
cd build/nutch-gui-0.2
sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
Unsecured quick test method, to make sure it's working:
su -c "bin/nutch admin /opt/nutch-1.0 50060"
http://example.com:50060/general
More secure password protection:
sudo vi conf/nutchguiUsers.properties
(edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role)
screen -S nutch-gui
(since we'll probably run it for a while)
su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
http://example.com:50060/general
Troubleshooting
How to test
- Make sure the required packages are installed and JAVA_HOME path variable is set in /etc/profile:
rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
tomcat-6.0.16-0
jdk-1.6.0_16-fcs
ant-1.6.5-2jpp.2
xml-commons-apis-1.3.02-0.b2.7jpp.10
ant-trax-1.6.5-2jpp.2
/usr/java/jdk1.6.0_16
Replace "localhost" with your machine's IP:
Try accessing Tomcat here: http://localhost:8080/
Try accessing Nutch here: http://localhost:8080/nutch/
Try accessing Nutch-Gui here: http://localhost:50060/general
- Set Tomcat to start on boot:
sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off
Disclaimer
We test this stuff on our own machines, really we do. But you may run into problems, if you do, come to #centos on irc.freenode.net
Added Reading
Sunday, 24 January 2010
linux
Requirements
- Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation.
- Apache's Tomcat 4.x.
- On Win32, cygwin, for shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)
- Up to a gigabyte of free disk space, a high-speed connection, and an hour or so.
Getting Started
First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and connect to its top-level directory. Or, check out the latest source code from subversion and build it with Ant.
Try the following command:
bin/nutch
This will display the documentation for the Nutch command script.
Now we're ready to crawl. There are two approaches to crawling:
- Intranet crawling, with the crawl command.
- Whole-web crawling, with much greater control, using the lower level inject, generate, fetch and updatedb commands.
Intranet Crawling
Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.
Intranet: Configuration
To configure things for intranet crawling you must:
- Create a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:
http://lucene.apache.org/nutch/
- Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org.
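To see which URLs a filter line like the one above would accept, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the rule with Java regular expressions, and url_allowed is a hypothetical helper that drops the leading "+" and uses grep's extended syntax instead.

```shell
#!/bin/sh
# Sketch: test URLs against the apache.org include pattern with grep -E.
# url_allowed is a hypothetical helper; Nutch itself uses Java regexes.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*apache.org/'
}
```

`url_allowed http://lucene.apache.org/nutch/` succeeds, while `url_allowed http://www.example.com/` fails. Note that the unescaped dot in apache.org matches any character; writing apache\.org is stricter.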
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
- -dir dir names the directory to put the crawl in.
- -depth depth indicates the link depth from the root page that should be crawled.
- -delay delay determines the number of seconds between accesses to each host.
- -threads threads determines the number of threads that will fetch in parallel.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10.
Once crawling has completed, one can skip to the Searching section below.
Whole-web Crawling
Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.
Whole-web: Concepts
Nutch data is of two types:
- The web database. This contains information about every page known to Nutch, and about links between those pages.
- A set of segments. Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
- a fetchlist is a file that names a set of pages to be fetched
- the fetcher output is a set of files containing the fetched pages
- the index is a Lucene-format index of the fetcher output.
In the following examples we will keep our web database in a directory named db and our segments in a directory named segments:
mkdir db
mkdir segments
Whole-web: Bootstrapping the Web Database
The admin tool is used to create a new, empty database:
bin/nutch admin db -create
The injector adds urls into the database. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Next we inject a random subset of these pages into the web database. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We inject one out of every 3000, so that we end up with around 1000 URLs:
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000
This also takes a few minutes, as it must parse the full file.
Now we have a web database with around 1000 as-yet unfetched URLs in it.
Whole-web: Fetching
To fetch, we first generate a fetchlist from the database:
bin/nutch generate db segments
This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
s1=`ls -d segments/2* | tail -1`
echo $s1
Now we run the fetcher on this segment with:
bin/nutch fetch $s1
When this is complete, we update the database with the results of the fetch:
bin/nutch updatedb db $s1
Now the database has entries for all of the pages referenced by the initial set.
Now we fetch a new segment with the top-scoring 1000 pages:
bin/nutch generate db segments -topN 1000
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
Let's fetch one more round:
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
By this point we've fetched a few thousand pages. Let's index them!
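The rounds above repeat the same generate, fetch and updatedb cycle, so they could be wrapped in a small function. This is a sketch built from the commands already shown; the names NUTCH, DB, SEGMENTS, latest_segment and crawl_round are assumptions introduced for illustration.

```shell
#!/bin/sh
# Sketch of one generate / fetch / updatedb round, using the same commands
# as the steps above. NUTCH, DB and SEGMENTS are illustrative variable names.
NUTCH="${NUTCH:-bin/nutch}"
DB="${DB:-db}"
SEGMENTS="${SEGMENTS:-segments}"

latest_segment() {
  # Segment directories are named by creation time, so the newest sorts last.
  ls -d "$1"/2* | tail -1
}

crawl_round() {
  topn="$1"
  "$NUTCH" generate "$DB" "$SEGMENTS" -topN "$topn" || return 1
  seg=$(latest_segment "$SEGMENTS")
  "$NUTCH" fetch "$seg" || return 1
  "$NUTCH" updatedb "$DB" "$seg"
}
```

Calling `crawl_round 1000` twice after the initial fetch would reproduce the second and third rounds.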
Whole-web: Indexing
To index each segment we use the index command, as follows:
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
Then, before we can search a set of segments, we need to delete duplicate pages. This is done with:
bin/nutch dedup segments dedup.tmp
Now we're ready to search!
Searching
To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)
Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:
rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
The webapp finds its indexes in ./segments, relative to where you start Tomcat, so, if you've done intranet crawling, connect to your crawl directory, or, if you've done whole-web crawling, don't change directories, and give the command:
~/local/tomcat/bin/catalina.sh start
Then visit http://localhost:8080/ and have fun!
More detailed tutorials are available on the Nutch Wiki.
Sunday, 24 January 2010
linux
This document contains instructions for downloading and installing Nutch and Lucene. Please be aware that you must be logged into the csci571 computer to run Apache Tomcat, not aludra or nunki.
- Downloading and Installing Nutch
- Downloading and Installing Lucene
Downloading and Installing Nutch
Chris A. Mattmann
mattmann@apache.org
Pre-requisites
- Installation of Java 1.4 or above. You can download Java from http://java.sun.com
- Installation of Apache ANT 1.6 or above. You can download ANT from http://ant.apache.org
- Installation of Apache Tomcat 5.5.19 or above. You can download Tomcat from http://tomcat.apache.org
- If you are using Windows OS, please install Cygwin: you can find Cygwin here: http://www.cygwin.com/
- Install the subversion client, You can find Subversion at: http://subversion.tigris.org
Installation Instructions
- Download Nutch from SVN, using the Subversion command line client:
# svn co http://svn.apache.org/repos/asf/lucene/nutch/tags/release-0.8.1/ ./nutch
- This will install nutch into a directory called “nutch” local to wherever you ran this command. So, if you ran this command from /home/bogus, then you would have a directory called /home/bogus/nutch
- We’ll call the directory where you unpacked Nutch to your $NUTCH_HOME
- Cd into the Nutch directory, and compile Nutch:
# cd nutch
# ant
- You should see a message such as the following if all is well and the build ran successfully
compile:
job:
[jar] Building jar: /Users/mattmann/tmp/nutch/build/nutch-0.8.1.job
BUILD SUCCESSFUL
Total time: 27 second
- Okay, now that Nutch is built, you can fetch some content. There is a detailed, step-by-step set of instructions on the wiki, for how to fetch content. This page provides all the details: http://wiki.apache.org/nutch/NutchTutorial
- Once you’ve fetched some content, you’ll probably want to browse it. To get Nutch set up on Tomcat, first build the Nutch webapp (run the below command from your nutch directory):
# ant war
- The above command will construct a nutch-0.8.1.war file within $NUTCH_HOME/build. It will also construct a nutch.xml file within $NUTCH_HOME/build. The nutch.xml is a Tomcat context.xml file, that you can use to configure a WAR file for deployment within Tomcat.
- First, make a directory for your nutch war file, and your nutch context.xml file to live in. /usr/local/nutch is a good place.
# mkdir /usr/local/nutch
# cp -R $NUTCH_HOME/build/nutch-0.8.1.war /usr/local/nutch
# cp -R $NUTCH_HOME/build/nutch.xml /usr/local/nutch
- Next, edit your /usr/local/nutch/nutch.xml file
Inside the file, set the property searcher.dir to the path of the Nutch index that you created separately (in step 3 above). If that directory is /home/bogus/nutch/my.crawl, then you would set searcher.dir to /home/bogus/nutch/my.crawl.
- Edit your /usr/local/nutch/nutch.xml file again
Edit the docBase attribute on the Context tag to be the FULL path to your Nutch 0.8.1 WAR file, e.g., /usr/local/nutch/nutch-0.8.1.war
- Now, assuming that you have installed Tomcat according to the pre-requisites, and assuming that you have set the $TOMCAT_HOME environment variable (that points to your Tomcat installation directory), first shutdown tomcat:
# cd $TOMCAT_HOME/bin
# ./shutdown.sh
Now, symlink your context.xml file for Nutch to the Tomcat conf directory
# ln -s /usr/local/nutch/nutch.xml $TOMCAT_HOME/conf/Catalina/localhost/nutch.xml
Now, restart your Tomcat server:
# cd $TOMCAT_HOME/bin
# ./startup.sh
- If everything went right in step 9, then you should open up your browser, and point it at your tomcat installation (e.g., http://localhost:8080), and then append the path “/nutch” at the end of it. So, if you installed tomcat to run on port 8080, then you would visit: http://localhost:8080/nutch/
That’s it!
If you have any further questions, please feel free to contact me at the email address provided above.