Scraping the Web with Nutch for Elasticsearch
Posted by Roland Kofler, December 2, 2015

When building vertical search engines, for example for collecting recipes, prices, or addresses, the first step is to crawl the web for information.
In this tutorial you will learn how to configure the Nutch web crawler to feed data into Elasticsearch.
Alternative web crawlers or why pick Nutch?
The most prominent web scrapers to consider are Scrapy, Storm Crawler, River Web, and Nutch.
Scrapy is an easily configurable Python scraper targeted at medium-sized scraping jobs. With the recent “distributed-frontera” framework, scaling Scrapy has become possible.
Storm Crawler, based on the Apache Storm project, is a collection of resources for building your own highly scalable crawler infrastructure.
River Web, originally an Elasticsearch plugin, is now a simple standalone web scraper designed with Elasticsearch in mind.
Nutch stands at the origin of the Hadoop stack and is today often called “the gold standard of web scraping”. Its wide adoption is the main reason we chose Nutch for this tutorial.
The first task is to decide between two main versions of the crawler:
Option 1: The MapReduce framework Hadoop was originally created as part of the Nutch project and is still integrated in the 1.x versions.
Option 2: The 2.x branch introduces an abstraction layer (Apache Gora) to work with any modern data store, e.g. HBase, MongoDB, or CouchDB. There is even a gora-elasticsearch module planned.
For this tutorial we chose the current 2.x branch, but 1.x would work similarly; in fact, it is easier to configure.
Web crawling components and workflow
Before we dive in to the configuration files, here's a small introduction to the workflow of scraping with Nutch.
Figure: Main components of Nutch and its relation to Elasticsearch.
- First, we tell Nutch where to start. The Injector takes all the URLs of a seed file and adds them to the CrawlDB. The CrawlDB maintains information on all known URLs: when they were fetched and what the resulting status was.
- Based on the data in the CrawlDB, the Generator creates a list of URLs to fetch. They are placed into a newly created segment directory.
- The Fetcher gets the content of the URLs on the fetch list and saves it in the segment directory.
- The Parser hands the content of each website over to the configured processors. For example, the HTML processor can strip HTML markup.
- Finally, Elasticsearch takes over the content and indexes it.
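These stages map directly onto the Nutch command line. As a preview, here is a minimal sketch of one crawl-and-index cycle, using exactly the commands explained step by step later in this tutorial (run from $NUTCH_ROOT/runtime/local/):

bin/nutch inject seed/urls.txt        # feed seed URLs into the CrawlDB
bin/nutch generate -topN 40           # select URLs to fetch next
bin/nutch fetch -all                  # download the pages
bin/nutch parse -all                  # extract text and outlinks
bin/nutch updatedb -all               # write the results back to the CrawlDB
bin/nutch index elasticsearch -all    # push parsed documents to Elasticsearch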
How to configure Nutch for Elasticsearch
Requirements
Tools | Notes
Apache Ant | Needed to compile Nutch; currently only the 1.x branch releases binaries.
Apache HBase | The 0.98.x stream is not working at the time of writing due to different release cycles of Apache Gora and HBase. Alternatives: it has been shown that MongoDB works.
Elasticsearch | For this post we use a hosted cluster on Qbox.io (see below).
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."
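Before going further, it is worth checking that your Elasticsearch cluster is reachable. A quick sanity check, as a sketch assuming a local, unsecured node on the default HTTP port (substitute your Qbox endpoint and credentials for a hosted cluster):

curl -X GET "http://localhost:9200/"
# should return a JSON document with the cluster name and version,
# e.g. "cluster_name" : "elasticsearch" -- the same name used below in nutch-site.xml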
Setting up HBase
Edit conf/hbase-site.xml in your HBase installation and add:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///path/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
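HBase must be running before Nutch can use it as its storage backend. A minimal sketch for a local standalone setup (paths assume you are inside the HBase installation directory):

./bin/start-hbase.sh     # start the local, standalone HBase instance
./bin/hbase shell        # open the HBase shell to verify it is up
# inside the shell, `list` should return without errors (no tables yet);
# after the first crawl you should see a table, typically named webpage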
Setting up Nutch
Nutch must be compiled with the Ant builder.
Configure the HBase adapter by editing conf/gora.properties:

-#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
+gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
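Depending on the Nutch 2.x release you downloaded, the gora-hbase dependency may still be commented out in ivy/ivy.xml; if so, uncomment it before building so that the HBaseStore class is actually available. The revision number below is only illustrative; keep whatever value your ivy.xml ships with:

<!-- ivy/ivy.xml: make sure this dependency is active (uncommented);
     rev shown here is an example, use the value already in your file -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />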
In conf/nutch-site.xml you need to name your spider and configure both HBase as the storage backend and Elasticsearch as the indexer:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Qbox Spider</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutchindex</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
  <property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>
  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>
</configuration>
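The values above assume Elasticsearch runs locally and that the indexer-elastic plugin reaches it over the transport protocol on port 9300 (not the HTTP port 9200). If you point Nutch at a remote or hosted cluster instead, only the connection properties need to change; the host and cluster name below are placeholders, not real endpoints:

<!-- hypothetical overrides for a remote cluster; replace with your own values -->
<property>
  <name>elastic.host</name>
  <value>my-cluster.example.com</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>my-cluster-name</value>
</property>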
Build Nutch
Execute:

ant runtime

The expected result is: BUILD SUCCESSFUL.
Crawling the web
You will find the compiled binaries in $NUTCH_ROOT/runtime/local/; the following commands are run from that directory.
1. Nutch expects some seed URLs to start crawling from.

For example, on a Linux system:

mkdir seed
echo "https://en.wikipedia.org" > seed/urls.txt
You can also limit crawling to certain hostnames by adding regular expressions to runtime/local/conf/regex-urlfilter.txt, for example as sketched below.
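A minimal filter that restricts the crawl to the English Wikipedia host could look like this. Treat it as a sketch to adapt rather than a drop-in file, since the shipped regex-urlfilter.txt already contains other default rules (including a catch-all accept rule you would replace):

# accept anything on en.wikipedia.org
+^https?://en\.wikipedia\.org/
# reject everything else
-.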
2. Inject the URLs into the CrawlDB

bin/nutch inject seed/urls.txt
3. Generate URLs to fetch
bin/nutch generate -topN 40
4. Fetch the pages
bin/nutch fetch -all
Example output snippet:
fetching https://en.wikipedia.org/wiki/Free_content (queue crawl delay=5000ms)
10/10 spinwaiting/active, 77 pages, 0 errors, 0.2 0 pages/s, 261 182 kb/s, 2 URLs in 1 queues
* queue: https://en.wikipedia.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1448521467348
  now           = 1448521466676
  0. https://en.wikipedia.org/wiki/Century_(cricket)
  1. https://en.wikipedia.org/wiki/Mal_Whitfield
5. Parse the pages
bin/nutch parse -all

Example output snippet:

ParserJob: starting at 2015-11-26 08:09:35
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing https://en.wikipedia.org/w/opensearch_desc.php
Parsing https://en.wikipedia.org/wiki/1863
Parsing https://en.wikipedia.org/wiki/1915
Parsing https://en.wikipedia.org/wiki/1975
Parsing https://en.wikipedia.org/wiki/1984
Parsing https://en.wikipedia.org/wiki/2015_Bamako_hotel_at...
...
6. Updating should be done regularly so that the CrawlDB always reflects the current content of the fetched segments.

bin/nutch updatedb -all

DbUpdaterJob: starting at 2015-11-26 08:10:35
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2015-11-26 08:10:51, time elapsed: 00:00:15
7. Finally, we index the content with Elasticsearch

bin/nutch index elasticsearch -all

If the job only prints "No IndexWriters activated - check your configuration", the indexer-elastic plugin is not active; verify the plugin.includes setting in nutch-site.xml.

You can now query your data with Elasticsearch:

curl -X GET "http://localhost:9200/_search?q=myterm"
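Beyond the URI search above, you can query the individual fields that Nutch's indexing plugins write (title, url, and content are typical for index-basic; the exact field set depends on your plugin configuration). A sketch against the nutchindex index configured earlier:

curl -X GET "http://localhost:9200/nutchindex/_search?pretty" -d '
{
  "query": {
    "match": { "title": "wikipedia" }
  }
}'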
Bonus
Install Kibana, select the “nutch*” index pattern, and deselect “Index contains time-based events”.
Figure: how to configure Kibana.
After all this work, you can start to visualize some interesting features of your search index.

Figure: Exploring the most frequent terms in the fetched Wikipedia pages. An exercise left to the reader would be to filter stopwords in Elasticsearch.
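As a starting point for that exercise: Elasticsearch ships a built-in stop token filter that can be combined with a custom analyzer. The index name and analyzer name below are made up for illustration; you would then map the content field to this analyzer and reindex:

curl -X PUT "http://localhost:9200/nutchindex_clean" -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "content_no_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'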
Conclusions
In this tutorial you have learned how to configure Nutch as a data source for Elasticsearch. Nutch is powerful, yet not very easy to handle for beginners. Together with Kibana, we now have a solid pipeline for analysing large amounts of text on specific topics.
About the author
Roland Kofler has worked in software development for 16 years. A Lucene user since the early days, he enjoys the modernity of Elasticsearch and the ELK Stack. He is co-founder and CTO of Alimentaris Ltd., a European startup in the field of food regulations.