Nutch
Author : Jbuenol
From TechnologicalWiki
Overview
Nutch (v1.0) is open source web-search software. It builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats. Nutch works as a vertical search engine: it searches particular sites, delving into their different levels, so it focuses on specific slices of content. This is especially useful when searches target one specific type of user. For example, suppose there is a website focused on technology. If a user wants to find articles about Java, results about the island of Java, or about a library that happens to be named Java, are of no interest to that user.
Example using Nutch :
Input :
http://freshmeat.net
http://sourceforge.net
Output :
Nutch provides data from the previous domains : http://freshmeat.net ( customizable ) and http://sourceforge.net ( customizable )
Through Nutch, we can access a great deal of information about the content of these websites ( link to the URL, summary, content, ... ).
This article shows how to install and run Nutch on Ubuntu.
Crawling
The automated, systematic retrieval of Web pages is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
Nutch is mainly a Web crawler. It can be configured through a number of parameters to index a set of pages and store the data in a database. Crawling starts with a list of URLs to visit, called the seeds, which are provided in a config file. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies.
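The seed and frontier mechanics described above can be illustrated with a toy sketch. It is not how Nutch is implemented (Nutch works in batches and scores URLs); it only shows a breadth-first walk over a made-up link graph. The LINKS table and the example.* URLs are assumptions invented for the illustration, and bash 4 or later is assumed for associative arrays.

```shell
#!/usr/bin/env bash
# Toy crawl-frontier sketch (needs bash 4+ for associative arrays).
# LINKS is a made-up link graph standing in for real HTTP fetches.
declare -A LINKS=(
  ["http://a.example/"]="http://a.example/1 http://b.example/"
  ["http://a.example/1"]="http://a.example/ http://a.example/2"
  ["http://b.example/"]="http://a.example/2"
  ["http://a.example/2"]=""
)

# crawl DEPTH SEED...: prints each URL once, in the order it is visited.
crawl() {
  local depth=$1; shift
  local -A seen=()
  local -a frontier=("$@")
  local -a next=()
  local url link round
  for url in "${frontier[@]}"; do seen[$url]=1; done
  for ((round = 0; round < depth && ${#frontier[@]} > 0; round++)); do
    next=()
    for url in "${frontier[@]}"; do
      echo "$url"                        # "fetch" the page
      for link in ${LINKS[$url]}; do     # walk its outlinks
        if [[ -z ${seen[$link]} ]]; then
          seen[$link]=1
          next+=("$link")                # unseen URL joins the frontier
        fi
      done
    done
    frontier=("${next[@]}")              # next round visits the new URLs
  done
}

crawl 2 "http://a.example/"
```

With a depth of 2, only the seed and the pages it links to directly are visited; URLs discovered in the second round stay in the frontier, much as they would wait for a later crawl iteration.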
The data crawled by Nutch is composed of:
The crawl database, or crawldb
This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
The link database, or linkdb
This contains the list of known links to each URL, including both the source URL and anchor text of the link.
A set of segments
Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
- crawl_generate names a set of URLs to be fetched.
- crawl_fetch contains the status of fetching each URL.
- content contains the raw content retrieved from each URL.
- parse_text contains the parsed text of each URL.
- parse_data contains outlinks and metadata parsed from each URL.
- crawl_parse contains the outlink URLs, used to update the crawldb.
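As a rough way to see how far a segment has progressed, one can check which of these subdirectories exist. The sketch below assumes the segment sits on the local filesystem (not on HDFS); missing_parts is a hypothetical helper name, and the demo directory is a throwaway created only for illustration.

```shell
#!/usr/bin/env bash
# Subdirectories a fully processed segment is expected to contain
# (taken from the list above).
SEGMENT_PARTS="crawl_generate crawl_fetch content parse_text parse_data crawl_parse"

# missing_parts SEGMENT_DIR: prints the expected subdirectories that are absent.
# Assumption: SEGMENT_DIR is a path on the local filesystem.
missing_parts() {
  local segment=$1 part
  for part in $SEGMENT_PARTS; do
    [ -d "$segment/$part" ] || echo "$part"
  done
}

# Demo on a throwaway directory that mimics a segment right after "generate".
demo=$(mktemp -d)
mkdir -p "$demo/20091210113212/crawl_generate"
missing_parts "$demo/20091210113212"
rm -rf "$demo"
```

A segment that lists only crawl_generate as present has been generated but not yet fetched or parsed.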
Indexes
The index is a Lucene-format index of the fetcher output. Once a set of segments has been obtained, the next step is to index them. These indexes are then queried during the search process.
How to configure Nutch for Crawling (single-node cluster)
First, we install Nutch for crawling. Later sections show how to set it up for searching.
Pre-requisites
Create the nutch user and directories for Nutch
We will use a dedicated Nutch user account for running Nutch. While that is not required, it is recommended because it helps to separate the Nutch installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).
- This will add the user nutch and the group users to the local machine:
sudo addgroup users
sudo adduser --ingroup users nutch
- And this command creates the necessary directories for nutch :
sudo mkdir /nutch
sudo mkdir /nutch/search
sudo mkdir /nutch/filesystem
sudo mkdir /nutch/local
sudo mkdir /nutch/home
sudo mkdir /nutch/src
sudo mkdir /nutch/build
- Modify home folder for the nutch user :
sudo usermod -d /nutch/home nutch
- Copy the files .bashrc, .profile, .bash_logout to the new home folder :
sudo cp /home/nutch/.profile /nutch/home
sudo cp /home/nutch/.bashrc /nutch/home
sudo cp /home/nutch/.bash_logout /nutch/home
- Set nutch as the owner of /nutch:
sudo chown -R nutch:users /nutch
Note : Be careful with the permissions; they are a frequent source of errors.
Installing and configuring Java
Install Sun's Java Development Kit (JDK) 1.5.x ( or later ) via Synaptic (System > Administration > Synaptic Package Manager) or via apt-get. This how-to was written using v1.6.0. Install the package :
sudo apt-get install sun-java6-jdk
The full JDK will be placed in /usr/lib/jvm/java-6-sun.
If you want to use Sun's Java instead of the open source GIJ (GNU Java bytecode interpreter) you need to set it as default. To list installed JVMs:
sudo update-java-alternatives -l
To select Sun's JVM run:
sudo update-java-alternatives -s java-6-sun
You should also edit /etc/jvm and move /usr/lib/jvm/java-6-sun to the top of JVMs offered.
# This file defines the default system JVM search order. Each
# JVM should list their JAVA_HOME compatible directory in this file.
# The default system JVM is the first one available from top to
# bottom.
/usr/lib/jvm/java-6-sun
/usr/lib/jvm/java-gcj
/usr/lib/jvm/ia32-java-1.5.0-sun
/usr/lib/jvm/java-1.5.0-sun
/usr
Let's put JAVA_HOME in our ~/.bash_profile and ~/.bashrc for root and nutch :
# from terminal
nutch@???:~$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bash_profile
nutch@???:~$ . ~/.bash_profile
nutch@???:~$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
nutch@???:~$ . ~/.bashrc
Install ssh
The ssh package allows connecting via SSH to the local machine or to other machines.
sudo apt-get install ssh
Install subversion
The subversion package is used to get the Nutch release from the Apache repository.
sudo apt-get install subversion
Install ant
The ant packages are used to compile and build the Nutch source files.
sudo apt-get install ant
sudo apt-get install ant-optional
sudo apt-get install ant-optional-gcc
What is Hadoop?
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop is used in Nutch to manage data obtained from the crawling process.
Configure SSH for Hadoop
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the user we created in the previous section. First, we have to generate an SSH key for the nutch user :
nutch@nutch-laptop:~$ su nutch
nutch@nutch-laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/nutch/home/.ssh/id_rsa):
Created directory '/nutch/home/.ssh'.
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 nutch@nutch-laptop
nutch@nutch-laptop:~$
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
- Second, you have to enable SSH access to your local machine with this newly created key.
nutch@nutch-laptop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- The final step is to test the SSH setup by connecting to your local machine with the nutch user. The step is also needed to save your local machine's host key fingerprint to the nutch user's known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
nutch@nutch-laptop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
...
nutch@nutch-laptop:~$
Installation
Getting Nutch from Subversion :
cd /nutch/src/
svn co http://svn.apache.org/repos/asf/nutch/trunk/
cd trunk
echo dist.dir=/nutch/build > build.properties
ant package
Config Files
- schema.txt : Schema definition to be used with the Solr integration.
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or
more contributor license agreements. See the NOTICE file
distributed with this work for additional information regarding
copyright ownership. The ASF licenses this file to You under the
Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain
a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0 Unless required by
applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions
and limitations under the License.
-->
<!--
Description: This document contains solr schema definition to be
used with solr integration currently build into Nutch. See
https://issues.apache.org/jira/browse/NUTCH-442
https://issues.apache.org/jira/browse/NUTCH-699 for more info.
-->
<schema name="nutch" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.LongField"
omitNorms="true"/>
<fieldType name="float" class="solr.FloatField"
omitNorms="true"/>
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<!-- fields for index-basic plugin -->
<field name="host" type="url" stored="true" indexed="true"/>
<field name="site" type="string" stored="true" indexed="false"/>
<field name="url" type="url" stored="true" indexed="true"
required="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="title" type="text" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="long" stored="true" indexed="false"/>
<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
multiValued="true"/>
<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
multiValued="true"/>
<field name="contentLength" type="long" stored="true"
indexed="false"/>
<field name="lastModified" type="long" stored="true"
indexed="false"/>
<field name="date" type="string" stored="true" indexed="true"/>
<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>
<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
indexed="true"/>
<!-- fields for feed plugin -->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="string" stored="true"
indexed="true"/>
<field name="updatedDate" type="string" stored="true"
indexed="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="url" dest="id"/>
</schema>
- regex-urlfilter.txt : In this file you can configure which protocols, file formats, and URL patterns are skipped or accepted.
# skip these protocols
-^(https|telnet|file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept any other http URL
+^http://.*
# deny anything else
-.
Note : Nutch does not fetch content linked through skipped protocols ( in this example : https, telnet, file, ftp and mailto ).
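To get a feel for how such a rule file is evaluated, the sketch below applies the rules top to bottom and lets the first matching pattern decide, rejecting a URL when no rule matches. It is only an approximation: bash uses POSIX extended regular expressions while Nutch uses Java regular expressions, url_allowed is a hypothetical helper name, and the rule file is a shortened copy written to a temporary file.

```shell
#!/usr/bin/env bash
# url_allowed RULES_FILE URL: succeeds if the first matching rule starts with "+".
# Approximation only: bash matches with POSIX extended regexes, not Java regexes.
url_allowed() {
  local rules=$1 url=$2 line sign pattern
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in ''|'#'*) continue ;; esac   # skip blank lines and comments
    sign=${line:0:1}                          # "+" accepts, "-" rejects
    pattern=${line:1}
    if [[ $url =~ $pattern ]]; then
      [ "$sign" = "+" ]; return               # first matching rule decides
    fi
  done < "$rules"
  return 1                                    # no rule matched: reject
}

# Shortened rule set for the demo, written to a temporary file.
rules=$(mktemp)
cat > "$rules" <<'EOF'
# skip these protocols
-^(https|telnet|file|ftp|mailto):
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3)$
# skip URLs containing certain characters
-[?*!@=]
# accept any other http URL
+^http://.*
# deny anything else
-.
EOF

for u in "http://sourceforge.net/" "https://sourceforge.net/" \
         "http://freshmeat.net/search?q=java" "http://freshmeat.net/a.mp3"; do
  if url_allowed "$rules" "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
rm -f "$rules"
```

Because the first match wins, the order of the rules matters: placing the +^http://.* line first would accept every http URL before the suffix and character rules were ever consulted.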
- <NUTCH-PATH>/search/urls/<NAME_OF_THE_FILE> : In this file, the user lists the seed URLs of the domains from which data will be crawled.
http://cesla.info/
http://sourceforge.net/
http://freshmeat.net/
How to configure Hadoop
We are going to configure a single-node setup of Hadoop. First, copy the build into the search directory.
cp -R /nutch/build/* /nutch/search/
Modify your /nutch/search/conf/core-site.xml to override the core configurations for Hadoop:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>
</configuration>
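Next, modify your /nutch/search/conf/hdfs-site.xml to set the HDFS paths and options. The directories named in this file must be writable by the nutch user. Note that this example uses paths under /opt/nutch, whereas the directories created earlier live under /nutch, so adapt the values to your layout: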
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/opt/nutch/filesystem/name</value>
<description>
Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/nutch/filesystem/data</value>
<description>
Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Finally, modify your /nutch/search/conf/mapred-site.xml to configure MapReduce:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1000m</value>
<description>
You can specify other Java options for each map or reduce task here,
but most likely you will want to adjust the heap size.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>
This should be a prime number larger than multiple number of slave hosts,
e.g. for 3 nodes set this to 17
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>
This should be a prime number close to a low multiple of slave hosts,
e.g. for 3 nodes set this to 7
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/nutch/filesystem/mapreduce/local</value>
</property>
</configuration>
Environment Settings
To configure the HDFS environment:
1. Formatting the NameNode : The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem: this will erase all your data.
command : bin/hadoop namenode -format
2. Starting your single-node cluster : This will start up a NameNode, DataNode, JobTracker and TaskTracker on your machine.
command : bin/start-all.sh
3. Copying the URL data to HDFS : The urls folder is put into HDFS :
command : bin/hadoop dfs -put urls urls
... to check that HDFS has stored the directory, use the dfs -ls option of hadoop :
command : bin/hadoop dfs -ls urls
... to remove the directory from HDFS :
command : bin/hadoop dfs -rmr urls
Solr
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
In this article, we are going to configure Solr so that Nutch can use it for indexing and searching.
The first step is to download the required software components, namely Apache Solr and Nutch.
1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page
2. Extract Solr package
3. Configure Solr
For the sake of simplicity we are going to use the example configuration of Solr as a base.
a. Copy the provided Nutch schema from directory <NUTCHSEARCHER-PATH>/conf to directory <SOLR-PATH>/example/solr/conf (overwrite the existing file).
We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:
b. Change schema.xml so that the stored attribute of field "content" is true.
<field name="content" type="text" stored="true" indexed="true"/>
We want to be able to tweak the relevancy of queries easily so we’ll create new dismax request handler configuration for our use case.
c. Start Solr
cd <SOLR-PATH>/example
java -jar start.jar
How to Crawl
Crawling consists of a sequence of smaller steps:
1. Injector
- Convert injected URLs to crawl db entries.
- Merge injected URLs into the crawl db.
command : bin/nutch inject <BASEDIR>/crawldb urls
example usage : bin/nutch inject crawl/crawldb urls
2. Generator
- Select the best-scoring URLs due for fetch.
- Create segments.
- Partition the selected URLs by host.
command : bin/nutch generate <BASEDIR>/crawldb <BASEDIR>/segments -topN <NUMDOC>
example usage : bin/nutch generate crawl/crawldb crawl/segments -topN 1000
3. Fetcher
- Fetch remote pages.
command : bin/nutch fetch <BASEDIR>/segments/<SEGMENT> -threads <THREADS>
example usage : bin/nutch fetch crawl/segments/20091210113212 -threads 10
4. CrawlDb update
- Merge segment data into the crawl db.
command : bin/nutch updatedb <BASEDIR>/crawldb <BASEDIR>/segments/<SEGMENT> -filter
example usage : bin/nutch updatedb crawl/crawldb crawl/segments/20091210113212 -filter
5. LinkDb
- Add segments to the link database.
(Steps 2 to 5 are repeated in iterations.)
command : bin/nutch invertlinks <BASEDIR>/linkdb <BASEDIR>/segments/*
example usage : bin/nutch invertlinks crawl/linkdb crawl/segments/*
6. Index process
- Index the content so that it can be searched. This article describes the process using Solr as the indexer.
- Delete duplicates.
command : bin/nutch solrindex http://<DOMAIN>:<PORT>/<PATH>/ <BASEDIR>/crawldb <BASEDIR>/linkdb <BASEDIR>/segments/*
example usage : bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
7. Delete duplicates
command : bin/nutch solrdedup http://<DOMAIN>:<PORT>/<PATH>/
example usage : bin/nutch solrdedup http://127.0.0.1:8983/solr/
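The steps above can be strung together in a small script. The following is a minimal sketch, not an official Nutch script: it injects once, repeats steps 2 to 5 a given number of times, then indexes and deduplicates. By default it only prints the commands (dry run). The variable names, the run wrapper and the latest_segment helper are assumptions made for the illustration; in particular, the way the newest segment is read from the bin/hadoop dfs -ls output is a guess about its layout and should be checked against your installation.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the crawl cycle: prints the commands unless DRY_RUN=0.
BASEDIR=${BASEDIR:-crawl}
TOPN=${TOPN:-1000}
THREADS=${THREADS:-10}
SOLR_URL=${SOLR_URL:-http://127.0.0.1:8983/solr/}
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper: path of the newest segment. Assumes the listing is in
# name order and that the path is the last field of the last line.
latest_segment() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$BASEDIR/segments/<SEGMENT>"
  else
    bin/hadoop dfs -ls "$BASEDIR/segments" | tail -n 1 | awk '{ print $NF }'
  fi
}

# crawl_cycle DEPTH: inject once, repeat steps 2 to 5 DEPTH times, then index.
crawl_cycle() {
  local depth=$1 i=1 segment
  run bin/nutch inject "$BASEDIR/crawldb" urls
  while [ "$i" -le "$depth" ]; do
    run bin/nutch generate "$BASEDIR/crawldb" "$BASEDIR/segments" -topN "$TOPN"
    segment=$(latest_segment)
    run bin/nutch fetch "$segment" -threads "$THREADS"
    run bin/nutch updatedb "$BASEDIR/crawldb" "$segment" -filter
    run bin/nutch invertlinks "$BASEDIR/linkdb" "$segment"
    i=$((i + 1))
  done
  # The segments/* argument is passed on exactly as written in the steps above.
  run bin/nutch solrindex "$SOLR_URL" "$BASEDIR/crawldb" "$BASEDIR/linkdb" "$BASEDIR/segments/*"
  run bin/nutch solrdedup "$SOLR_URL"
}

crawl_cycle 2
```

Reviewing the printed commands before setting DRY_RUN=0 is a cheap way to catch a wrong BASEDIR or Solr URL before a long crawl starts.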
Searching
Once data has been crawled, it can be retrieved through the indexes.
How to configure Nutch for searching
To search the collected web pages, the data that is now on HDFS is best copied to the local filesystem for better performance. Because searching needs different Nutch settings than crawling, the easiest approach is to make a separate folder for the Nutch search part.
mkdir /nutchsearch
mkdir /nutchsearch/search
chown nutch:users /nutchsearch
cp -R /nutch/build/* /nutchsearch/search
mkdir /nutchsearch/local
- nutch-site.txt : Configure where the data is stored. The searcher.dir value must point to the local copy of the crawl data; this example uses /vol/nutchsearch/local/crawl, so adapt it if your copy lives elsewhere (for example under /nutchsearch/local).
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>local</value>
</property>
<property>
<name>searcher.dir</name>
<value>/vol/nutchsearch/local/crawl</value>
</property>
</configuration>
Edit the hadoop-site.xml file and delete all the properties:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Make a local index
Copy the data from dfs to the local filesystem.
bin/hadoop dfs -copyToLocal crawl /nutchsearch/local/
Indicating the Solr server
The Solr server is responsible for serving the indexes. For that reason, a file called solr-servers.txt containing the Solr server location must be created:
/nutchsearch/local/crawl/solr-servers.txt
/* solr-servers.txt file content */
http://localhost:8983/solr/
How to search
There are two ways to search the Nutch data:
- Console Commands
- Web Service
Console Command
With this option, the user searches from the operating system console. Just type :
command : bin/nutch org.apache.nutch.searcher.NutchBean <TERM>
example usage : bin/nutch org.apache.nutch.searcher.NutchBean Java
Indexing was done through Solr, so Nutch should search through the Solr indexes. To do so, Nutch uses the solr-servers.txt file located in the folder to which the data from the crawling process was copied.
Typically :
nutchsearch/local/crawl/solr-servers.txt
Web Tools
With this option, the user searches through a set of tools provided by Nutch. The user can see the results on a web page or capture the output via a client application:
1. JSP page : http://<DOMAIN>:<SERVER-PORT>/<PREFIX>/search.jsp
2. Structured format (xml, json) : http://<DOMAIN>:<SERVER-PORT>/<PREFIX>/search?query=<TERM>
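A client application has to percent-encode the search term before placing it in the query string. The sketch below shows one way to build such a URL for the structured-format endpoint listed above. The urlencode and search_url helpers are hypothetical names, the host, port and prefix in the demo call are placeholders, and the encoder is only meant for ASCII input.

```shell
#!/usr/bin/env bash
# urlencode STRING: percent-encodes everything except unreserved characters.
# Intended for ASCII input only.
urlencode() {
  local LC_ALL=C s=$1 out= c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case $c in
      [A-Za-z0-9._~-]) out+=$c ;;
      *) out+=$(printf '%%%02X' "'$c") ;;   # character code as two hex digits
    esac
  done
  printf '%s\n' "$out"
}

# search_url BASE TERM: BASE stands for http://<DOMAIN>:<SERVER-PORT>/<PREFIX>
search_url() {
  printf '%s/search?query=%s\n' "$1" "$(urlencode "$2")"
}

# Hypothetical host, port and prefix; substitute those of your installation.
search_url "http://localhost:8080/nutch" "open source java"
```

The resulting URL can then be requested with any HTTP client, for example curl.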
See also
Using the Nutch Web Service >> ToDo <<
Creating a Custom Web Service for Nutch searches >> ToDo <<




