Nutch

Author : Jbuenol

From TechnologicalWiki

Overview

Nutch (v1.0) is open source web-search software. It builds on Lucene Java and adds web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Nutch works as a vertical search engine: it searches only particular sites, following their links through the different site levels, so it focuses on specific slices of content. This is useful when searches target one specific type of user. For example, suppose there is a website focused on technology. A user looking for articles about Java, the programming language, is not interested in results about the island of Java or about a library that happens to be named Java.

Example using Nutch :

Input :

http://freshmeat.net     
           
http://sourceforge.net

Output :

Nutch provides data from the previous domains :

http://freshmeat.net ( customizable ) & http://sourceforge.net ( customizable )

Through Nutch, we can access a lot of information about the content of these websites (link to the URL, summary, content, ...).

This article shows how to install and run Nutch on Ubuntu.

Crawling

The process of automatically visiting web pages and following their links is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

Nutch is mainly a Web crawler that can be configured through a set of parameters to index a set of pages and store the data in a database. Crawling starts with a list of URLs to visit, called the seeds, which are provided in a configuration file. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies.
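The seed and frontier loop can be illustrated with a toy shell sketch. It is not how Nutch is implemented (Nutch works in batches through the crawldb and segments); the link graph in links_of is hypothetical and stands in for fetching and parsing a page, and the URLs are assumed to contain no spaces or glob characters.

```shell
# links_of prints the outlinks of a page. Hypothetical, hard-coded data:
# a real crawler would download the page and parse its hyperlinks.
links_of() {
  case "$1" in
    http://a.example/)  echo "http://a.example/1 http://a.example/2" ;;
    http://a.example/1) echo "http://a.example/2 http://a.example/3" ;;
    http://a.example/2) echo "http://a.example/" ;;
  esac
}

# crawl SEED... visits every URL reachable from the seeds exactly once
# and prints each URL when it is visited.
crawl() {
  frontier="$*"   # the crawl frontier starts as the list of seeds
  seen=" "        # space-delimited set of URLs already visited
  while [ -n "$frontier" ]; do
    # take the first URL off the frontier
    set -- $frontier
    url=$1
    shift
    frontier="$*"
    # skip URLs that were already visited
    case "$seen" in *" $url "*) continue ;; esac
    seen="$seen$url "
    echo "$url"
    # add every unseen outlink to the frontier
    for link in $(links_of "$url"); do
      case "$seen" in
        *" $link "*) ;;
        *) frontier="$frontier $link" ;;
      esac
    done
  done
}
```

Calling crawl http://a.example/ walks the four pages of the toy graph breadth-first. A real crawler would also apply policies here, such as URL filters, a depth limit and politeness delays.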

The data crawled by Nutch is composed of:

The crawl database, or crawldb

This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.

The link database, or linkdb

This contains the list of known links to each URL, including both the source URL and anchor text of the link.

A set of segments

Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:

  • crawl_generate names a set of URLs to be fetched.
  • crawl_fetch contains the status of fetching each URL.
  • content contains the raw content retrieved from each URL.
  • parse_text contains the parsed text of each URL.
  • parse_data contains outlinks and metadata parsed from each URL.
  • crawl_parse contains the outlink URLs, used to update the crawldb.
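As a quick sanity check, the layout above can be verified with a small shell function. This is only a sketch: check_segment is a hypothetical helper, and it assumes the segment has been copied to the local filesystem (a segment that still lives in HDFS is not visible to these tests).

```shell
# check_segment DIR reports which of the six expected subdirectories
# are missing from a segment directory on the local filesystem.
check_segment() {
  segment=$1
  missing=""
  for sub in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
    [ -d "$segment/$sub" ] || missing="$missing $sub"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "complete"
}
```

A segment that has been generated but not yet fetched and parsed would be reported with most of its subdirectories missing, which is expected at that stage.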

Indexes

The index is a Lucene-format index of the fetcher output. Once a set of segments has been obtained, the next step is to index them. These indexes are then queried during the search process.

How to configure Nutch for Crawling (single-node cluster)

First, we install Nutch for crawling. Later, we configure it for searching.

Pre-requisites

Create the nutch user and directories for Nutch

We will use a dedicated Nutch user account for running Nutch. While that is not required, it is recommended because it helps to separate the Nutch installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

  • This will add the user nutch and the group users to the local machine:
sudo addgroup users
sudo adduser --ingroup users nutch
  • These commands create the necessary directories for Nutch:
sudo mkdir /nutch
sudo mkdir /nutch/search
sudo mkdir /nutch/filesystem
sudo mkdir /nutch/local
sudo mkdir /nutch/home
sudo mkdir /nutch/src
sudo mkdir /nutch/build
  • Modify home folder for the nutch user :
sudo usermod -d /nutch/home nutch
  • Copy the files .bashrc, .profile, .bash_logout to the new home folder :
sudo cp /home/nutch/.profile /nutch/home
sudo cp /home/nutch/.bashrc /nutch/home
sudo cp /home/nutch/.bash_logout /nutch/home
  • Set nutch as the owner of /nutch:
sudo chown -R nutch:users /nutch

Note: Be careful with the permissions; they are a frequent source of errors.

Installing and configuring Java

Install Sun's Java Development Kit (JDK) 1.5.x (or later) via Synaptic (System > Administration > Synaptic Package Manager) or via apt-get. This how-to was written using v1.6.0. Install the package:

sudo apt-get install sun-java6-jdk

The full JDK will be placed in /usr/lib/jvm/java-6-sun.

If you want to use Sun's Java instead of the open source GIJ (GNU Java bytecode interpreter) you need to set it as default. To list installed JVMs:

sudo update-java-alternatives -l

To select Sun's JVM run:

sudo update-java-alternatives -s java-6-sun

You should also edit /etc/jvm and move /usr/lib/jvm/java-6-sun to the top of the list of JVMs offered:

# This file defines the default system JVM search order. Each
# JVM should list their JAVA_HOME compatible directory in this file.
# The default system JVM is the first one available from top to
# bottom.

/usr/lib/jvm/java-6-sun
/usr/lib/jvm/java-gcj
/usr/lib/jvm/ia32-java-1.5.0-sun
/usr/lib/jvm/java-1.5.0-sun
/usr

Set JAVA_HOME in ~/.bash_profile and ~/.bashrc for both root and nutch:

# from terminal
nutch@???:~$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bash_profile
nutch@???:~$ . ~/.bash_profile
nutch@???:~$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
nutch@???:~$ . ~/.bashrc

Install ssh

The ssh package allows connecting via SSH to the local machine or to other machines.

sudo apt-get install ssh

Install subversion

The subversion package is used to get the Nutch sources from the Apache repository.

sudo apt-get install subversion

Install ant

The ant packages are used to compile and build the Nutch source files.

sudo apt-get install ant
sudo apt-get install ant-optional
sudo apt-get install ant-optional-gcj

What is Hadoop ?

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Hadoop is used in Nutch to manage data obtained from the crawling process.

Configure SSH for Hadoop

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the nutch user created in the previous section. First, we have to generate an SSH key for the nutch user:

nutch@nutch-laptop:~$ su nutch
nutch@nutch-laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/nutch/home/.ssh/id_rsa):
Created directory '/nutch/home/.ssh'.
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 nutch@nutch-laptop
nutch@nutch-laptop:~$

The second command creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

  • Second, you have to enable SSH access to your local machine with this newly created key.
nutch@nutch-laptop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • The final step is to test the SSH setup by connecting to your local machine with the nutch user. The step is also needed to save your local machine's host key fingerprint to the nutch user's known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
nutch@nutch-laptop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
...
nutch@nutch-laptop:~$
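Once the host key has been accepted, the key-based login can be re-checked without any prompt. The helper below is a sketch under assumptions: check_passwordless_ssh and the SSH variable are illustrative names, not part of Hadoop or Nutch. It relies on the standard OpenSSH options BatchMode (never ask for a password) and ConnectTimeout.

```shell
# SSH can be overridden, e.g. to point at a different ssh binary.
SSH=${SSH:-ssh}

# check_passwordless_ssh [HOST] succeeds only if HOST accepts the key
# without asking for a password or passphrase.
check_passwordless_ssh() {
  host=${1:-localhost}
  if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "passwordless SSH to $host works"
  else
    echo "passwordless SSH to $host failed" >&2
    return 1
  fi
}
```

If the check fails, the usual suspects are the contents or permissions of $HOME/.ssh/authorized_keys.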

Installation

Getting Nutch from Subversion:

cd /nutch/src/
svn co http://svn.apache.org/repos/asf/nutch/trunk/
cd trunk
echo dist.dir=/nutch/build > build.properties
ant package

Config Files

  • schema.xml : Schema definition to be used with the Solr integration.
<?xml version="1.0" encoding="UTF-8" ?>
    <!--
        Licensed to the Apache Software Foundation (ASF) under one or
        more contributor license agreements. See the NOTICE file
        distributed with this work for additional information regarding
        copyright ownership. The ASF licenses this file to You under the
        Apache License, Version 2.0 (the "License"); you may not use
        this file except in compliance with the License. You may obtain
        a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0 Unless required by
        applicable law or agreed to in writing, software distributed
        under the License is distributed on an "AS IS" BASIS, WITHOUT
        WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions
        and limitations under the License.
    -->
    <!--
        Description: This document contains solr schema definition to be
        used with solr integration currently build into Nutch. See
        https://issues.apache.org/jira/browse/NUTCH-442
        https://issues.apache.org/jira/browse/NUTCH-699 for more info.
    -->
<schema name="nutch" version="1.1">
    <types>
        <fieldType name="string" class="solr.StrField"
            sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.LongField"
            omitNorms="true"/>
        <fieldType name="float" class="solr.FloatField"
            omitNorms="true"/>
        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="id" type="string" stored="true" indexed="true"/>

        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="url" stored="true" indexed="true"/>
        <field name="site" type="string" stored="true" indexed="false"/>
        <field name="url" type="url" stored="true" indexed="true"
            required="true"/>
        <field name="content" type="text" stored="true" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="long" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="long" stored="true"
            indexed="false"/>
        <field name="date" type="string" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="true"/>

        <!-- fields for feed plugin -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="string" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="string" stored="true"
            indexed="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
    <copyField source="url" dest="id"/>
</schema>
  • regex-urlfilter.txt : In this file you can configure which protocols, file formats and URL patterns are skipped or accepted.
# skip these protocols
-^(https|telnet|file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept any other http URL
+^http://.*

# deny anything else
-.

Note: Nutch doesn't fetch content linked through skipped protocols (in this example: https, telnet, file, ftp and mailto).
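The rules in regex-urlfilter.txt are evaluated from top to bottom, and the first pattern that matches a URL decides: + accepts it, - rejects it, and a URL that matches no rule is rejected. The function below imitates that behaviour as a rough sketch. url_filter is a hypothetical helper, and it uses grep -E, whose regular expressions are close to but not identical to the Java regular expressions Nutch uses.

```shell
# url_filter URL RULES_FILE returns 0 if the URL would be accepted,
# 1 if it would be rejected. The first matching rule wins.
url_filter() {
  url=$1
  rules=$2
  while IFS= read -r line; do
    # ignore blank lines and comments
    case "$line" in
      ''|'#'*) continue ;;
    esac
    sign=${line%"${line#?}"}   # first character: + or -
    pattern=${line#?}          # the rest of the line is the regex
    if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
      [ "$sign" = "+" ] && return 0
      return 1
    fi
  done < "$rules"
  return 1   # no rule matched
}
```

With the rules shown above, url_filter http://sourceforge.net/ conf/regex-urlfilter.txt succeeds, while an https URL or a URL containing ? is rejected by an earlier rule before the +^http://.* line is reached.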

  • <NUTCH-PATH>/search/urls/<NAME_OF_THE_FILE> : In this file, the user lists the seed URLs from which data will be crawled.
http://cesla.info/
http://sourceforge.net/
http://freshmeat.net/

How to configure Hadoop

We are going to configure a single-node setup of Hadoop. First, copy the build into the search directory.

cp -R /nutch/build/*  /nutch/search/ 

Modify your /nutch/search/conf/core-site.xml to override the core configurations for Hadoop:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for NDFS.
    </description>
  </property>

</configuration>

Next, modify your /nutch/search/conf/hdfs-site.xml to set the paths and other settings for HDFS:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/nutch/filesystem/name</value>
     <description>
         Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. 
     </description>
  </property>

  <property>
     <name>dfs.data.dir</name>
     <value>/nutch/filesystem/data</value>
     <description>
         Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. 
     </description>
  </property>

  <property>
     <name>dfs.replication</name>
     <value>1</value>
     <description>
         Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. 
     </description>
  </property>
</configuration>

Finally, modify your /nutch/search/conf/mapred-site.xml to configure MapReduce:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If
      "local", then jobs are run in-process as a single map and
      reduce task.
    </description>
  </property>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>
      The maximum number of tasks that will be run simultaneously by
      a task tracker. This should be adjusted according to the heap size
      per task, the amount of RAM available, and CPU consumption of each task.
  </description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
    <description>
      You can specify other Java options for each map or reduce task here,
      but most likely you will want to adjust the heap size.
    </description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>
      This should be a prime number larger than multiple number of slave hosts,
      e.g. for 3 nodes set this to 17
   </description>
  </property>
  <property>
     <name>mapred.reduce.tasks</name>
     <value>1</value>
     <description>
      This should be a prime number close to a low multiple of slave hosts,
      e.g. for 3 nodes set this to 7
     </description>
  </property>
  <property>
     <name>mapred.system.dir</name>
     <value>/nutch/filesystem/mapreduce/system</value>
  </property>
  <property>
     <name>mapred.local.dir</name>
     <value>/nutch/filesystem/mapreduce/local</value>
  </property>
</configuration>

Environment Settings

To configure the HDFS environment:

1. Formatting the NameNode : The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You only need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem: this will erase all your data.

command : bin/hadoop namenode -format

2. Starting your single-node cluster : This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine.

command : bin/start-all.sh


3. Copy the URL data to HDFS : The urls folder is put into HDFS:

command : bin/hadoop dfs -put urls urls

... to check if HDFS has stored the directory use the dfs -ls option of hadoop :

command : bin/hadoop dfs -ls urls

... to remove the file from HDFS :

command : bin/hadoop dfs -rmr urls
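When the seed list changes, the three dfs commands above are typically run together. The function below chains them as a sketch: refresh_seeds and the HADOOP variable are hypothetical names, and the function assumes it is run from the Nutch directory with the cluster already started.

```shell
# HADOOP can be overridden, e.g. HADOOP="echo bin/hadoop" for a dry run.
HADOOP=${HADOOP:-bin/hadoop}

# refresh_seeds [DIR] replaces the copy of DIR in HDFS with the local one
# and lists the result.
refresh_seeds() {
  dir=${1:-urls}
  if [ ! -d "$dir" ]; then
    echo "no local directory: $dir" >&2
    return 1
  fi
  # -rmr fails if the directory is not in HDFS yet; that is harmless here
  $HADOOP dfs -rmr "$dir" >/dev/null 2>&1
  $HADOOP dfs -put "$dir" "$dir" || return 1
  $HADOOP dfs -ls "$dir"
}
```

Removing the old copy first matters because dfs -put does not overwrite an existing target.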

4. Crawling process : The crawl itself is described in the section How to Crawl.

Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

In this article, we configure Solr so that Nutch can use it to build and query the indexes.

The first step to get started is to download the required software components, namely Apache Solr and Nutch.

1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page

2. Extract Solr package

3. Configure Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.

a. Copy the provided Nutch schema from directory <NUTCHSEARCHER-PATH>/conf to directory <SOLR-PATH>/example/solr/conf (overwrite the existing file).

We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:

b. Change schema.xml so that the stored attribute of the field "content" is true.

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we'll create a new dismax request handler configuration for our use case.

c. Start Solr

cd <SOLR-PATH>/example
java -jar start.jar

How to Crawl

Crawling consists of a sequence of smaller steps:

1. Injector

  • Convert injected urls to crawl db entries.
  • Merge injected urls into crawl db.

command : bin/nutch inject <BASEDIR>/crawldb urls

example usage : bin/nutch inject crawl/crawldb urls

2. Generator

  • Select best-scoring urls due for fetch.
  • Create segments.
  • Partition selected urls by host.

command : bin/nutch generate <BASEDIR>/crawldb <BASEDIR>/segments -topN <NUMDOC>

example usage : bin/nutch generate crawl/crawldb crawl/segments -topN 1000

3. Fetcher

  • Fetch remote pages.

command : bin/nutch fetch <BASEDIR>/segments/<SEGMENT> -threads <THREADS>

example usage : bin/nutch fetch crawl/segments/20091210113212 -threads 10

4. CrawlDb update

  • Merge the segment data into the crawl db.

command : bin/nutch updatedb <BASEDIR>/crawldb <BASEDIR>/segments/<SEGMENT> -filter

example usage : bin/nutch updatedb crawl/crawldb crawl/segments/20091210113212 -filter

5. LinkDb

  • Add the links found in the segments to the link database.

Steps 2 to 5 are repeated in iterations.

command : bin/nutch invertlinks <BASEDIR>/linkdb <BASEDIR>/segments/*

example usage : bin/nutch invertlinks crawl/linkdb crawl/segments/*

6. Index process

  • Index the content so that it can be searched. This article describes the process using Solr as the indexer.

command : bin/nutch solrindex http://<DOMAIN>:<PORT>/<PATH>/ <BASEDIR>/crawldb <BASEDIR>/linkdb <BASEDIR>/segments/*

example usage : bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

7. Delete duplicates

command : bin/nutch solrdedup http://<DOMAIN>:<PORT>/<PATH>/

example usage : bin/nutch solrdedup http://127.0.0.1:8983/solr/
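The seven steps can be chained in a small script. The following is a minimal sketch, not an official Nutch script: crawl_rounds, latest_segment and the NUTCH, BASEDIR and SOLR_URL variables are illustrative names, the thread count is copied from the example above, and the newest segment is found with ls, which assumes the segments are on the local filesystem (on HDFS the listing would have to come from bin/hadoop dfs -ls). In this sketch invertlinks runs once after the last round, because the command already takes every segment; running it inside the loop is the other option.

```shell
# Illustrative settings; override them in the environment if needed.
NUTCH=${NUTCH:-bin/nutch}
BASEDIR=${BASEDIR:-crawl}
SOLR_URL=${SOLR_URL:-http://127.0.0.1:8983/solr/}

# Print the newest segment. Segment names are timestamps, so the last
# name in sorted order is the most recent one.
latest_segment() {
  ls -d "$BASEDIR"/segments/* | sort | tail -n 1
}

# crawl_rounds DEPTH TOPN: inject the seeds, run DEPTH rounds of
# generate/fetch/updatedb, then build the linkdb and index into Solr.
crawl_rounds() {
  depth=$1
  topn=$2
  $NUTCH inject "$BASEDIR/crawldb" urls || return 1
  i=1
  while [ "$i" -le "$depth" ]; do
    $NUTCH generate "$BASEDIR/crawldb" "$BASEDIR/segments" -topN "$topn" || return 1
    segment=$(latest_segment)
    $NUTCH fetch "$segment" -threads 10 || return 1
    $NUTCH updatedb "$BASEDIR/crawldb" "$segment" -filter || return 1
    i=$((i + 1))
  done
  $NUTCH invertlinks "$BASEDIR/linkdb" "$BASEDIR"/segments/* || return 1
  $NUTCH solrindex "$SOLR_URL" "$BASEDIR/crawldb" "$BASEDIR/linkdb" "$BASEDIR"/segments/* || return 1
  $NUTCH solrdedup "$SOLR_URL"
}
```

For example, crawl_rounds 3 1000 would run three rounds with up to 1000 URLs each. Setting NUTCH="echo bin/nutch" first prints the commands instead of running them, which is a cheap way to review the sequence.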

Searching

Once the data has been crawled, it can be retrieved through the indexes.

How to configure Nutch for searching

To search the collected web pages, the data that is now on HDFS is best copied to the local filesystem for better performance. Because searching needs different Nutch settings than crawling, the easiest approach is to create a separate folder for the Nutch search part.

sudo mkdir /nutchsearch
sudo mkdir /nutchsearch/search
sudo mkdir /nutchsearch/local
sudo cp -R /nutch/build/* /nutchsearch/search
sudo chown -R nutch:users /nutchsearch
  • nutch-site.xml : You must configure where data is stored.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/nutchsearch/local/crawl</value>
  </property>
</configuration>

Edit the hadoop-site.xml file and delete all the properties:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Make a local index

Copy the data from dfs to the local filesystem.

bin/hadoop dfs -copyToLocal crawl /nutchsearch/local/

Indicating the Solr server

The Solr server is responsible for serving the indexes. That is why it is necessary to create a file called solr-servers.txt containing the Solr server location.

/nutchsearch/local/crawl/solr-servers.txt

/* solr-servers.txt file content */

http://localhost:8983/solr/
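The file can be created from the command line. A minimal sketch follows; write_solr_servers is a hypothetical helper, and the directory and URL passed to it are the ones used in this article.

```shell
# write_solr_servers CRAWL_DIR SOLR_URL writes a one-line solr-servers.txt
# into the local crawl directory.
write_solr_servers() {
  crawl_dir=$1
  solr_url=$2
  if [ ! -d "$crawl_dir" ]; then
    echo "no such directory: $crawl_dir" >&2
    return 1
  fi
  printf '%s\n' "$solr_url" > "$crawl_dir/solr-servers.txt"
}
```

For this setup the call would be write_solr_servers /nutchsearch/local/crawl http://localhost:8983/solr/, run as a user with write access to that folder.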

How to search

There are two ways to search the Nutch data:

  • Console Command
  • Web Tools

Console Command

With this option, the user searches from the operating system's command line. Just type:

command : bin/nutch org.apache.nutch.searcher.NutchBean <TERM>
example usage : bin/nutch org.apache.nutch.searcher.NutchBean Java

Indexing was done through Solr, so Nutch has to search through the Solr indexes. To do so, Nutch uses the solr-servers.txt file located in the folder to which the data from the crawling process was copied.

Typically :

nutchsearch/local/crawl/solr-servers.txt

Web Tools

With this option, the user searches through a set of tools provided by Nutch. The user can see the results on a web page or capture the output via a client application:

1. JSP page : http://<DOMAIN>:<SERVER-PORT>/<PREFIX>/search.jsp
2. Structured format (xml, json) : http://<DOMAIN>:<SERVER-PORT>/<PREFIX>/search?query=<TERM>
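A client application has to percent-encode the search term before placing it in the query parameter. The helpers below sketch this in plain shell; urlencode and search_url are hypothetical names, the base URL in the usage note is a placeholder for your own <DOMAIN>:<SERVER-PORT>/<PREFIX>, and the encoder is only meant for ASCII terms.

```shell
# urlencode STRING percent-encodes every character except the
# unreserved ones (letters, digits, '.', '_', '~', '-'). ASCII only.
urlencode() {
  s=$1
  out=""
  while [ -n "$s" ]; do
    rest=${s#?}       # everything after the first character
    c=${s%"$rest"}    # the first character
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# search_url BASE TERM prints the URL of the structured search endpoint.
search_url() {
  printf '%s/search?query=%s\n' "$1" "$(urlencode "$2")"
}
```

For instance, search_url "http://localhost:8080/nutch" "open source" yields a URL ending in search?query=open%20source, which can then be fetched with any HTTP client.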

See also

Using the Nutch Web Service >> ToDo <<

Automating Nutch

Creating a Custom Web Service for Nutch searches >> ToDo <<

References

http://wiki.apache.org/nutch/RunningNutchAndSolr

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
