Today I started work on a little project that required a crawler, and Nutch seemed to do most of what I needed. The Nutch team conveniently released Nutch 1.0 late in March 2009, so I had a brand new release to test out. Installing Nutch 1.0 on a Mac is not as straightforward as I thought. I ran into a lot of unexpected issues, so here is my cookbook description of how to successfully install Nutch 1.0 on your Mac.
- Download the latest source code from the Apache SVN repository: http://svn.apache.org/repos/asf/lucene/nutch/. I tried running it from the tarball without success. I also tried to compile the source from the tarball, but a post on the Nutch forum clearly states that this will not work.
- Set your JAVA_HOME and NUTCH_JAVA_HOME variables. Again, this is not straightforward: they both need to point to your real installation of Java 1.6 (earlier versions of Java will fail). I set these variables to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home; I could not get the /Library/Java/Home symbolic link to work properly.
- Compile the source code using Ant (I built it in Eclipse).
- Set up your Nutch configuration by following the tutorial by Peter P. Wang.
- Run your first crawl with: ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
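The variable setup from the list above can be sketched in shell, for example in ~/.bash_profile. This is a minimal sketch, not the only way to do it: the path is the one I used on my machine, and `JAVA16_HOME` and `has_java_binary` are names made up for this illustration.

```shell
# Path taken from the list above; adjust it if Java 1.6 lives elsewhere on your machine.
JAVA16_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"

# Point both variables at the real installation, not at a symbolic link.
export JAVA_HOME="$JAVA16_HOME"
export NUTCH_JAVA_HOME="$JAVA16_HOME"

# Hypothetical helper: succeeds only if the directory holds an executable java binary.
has_java_binary() {
    [ -x "$1/bin/java" ]
}

if has_java_binary "$JAVA_HOME"; then
    echo "java found under $JAVA_HOME"
else
    echo "no java binary under $JAVA_HOME" >&2
fi
```

If the check passes, running "$JAVA_HOME/bin/java" -version should report a 1.6 release before you start the Ant build.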
Most of the issues I encountered were related to the Java version and the fact that the Java Preferences application (/Applications/Utilities/Java/Java Preferences) does not really change the JAVA_HOME directory /Library/Java/Home properly. So make sure you have set both JAVA_HOME and NUTCH_JAVA_HOME, and that OS X does not fool you when it pretends to be symbolically linking to the 1.6 installation.
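One way to see whether the symbolic link really leads to the 1.6 installation is to compare the physical directories that both paths resolve to. The following is a rough sketch under that assumption; `resolve_dir` and `same_real_dir` are hypothetical helper names, and the two paths are the ones mentioned in this post.

```shell
# Hypothetical helper: print the physical directory a path resolves to, following symlinks.
resolve_dir() {
    (cd "$1" 2>/dev/null && pwd -P)
}

# Hypothetical helper: succeeds if both paths resolve to the same real directory.
same_real_dir() {
    a=$(resolve_dir "$1") || return 1
    b=$(resolve_dir "$2") || return 1
    [ "$a" = "$b" ]
}

link="/Library/Java/Home"
real="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"

if same_real_dir "$link" "$real"; then
    echo "$link resolves to the 1.6 installation"
else
    echo "$link does not resolve to $real; set JAVA_HOME explicitly" >&2
fi
```

Comparing resolved directories rather than the link text means the check still works if either path passes through further symbolic links.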


