Scraping the Web with Lassie and Elasticsearch
Posted by Jacqueline Outka August 24, 2017
In this tutorial, we’ll use Lassie, a Python library for retrieving content from websites, to fetch information regarding a Qbox YouTube video as JSON. We’ll then store that data in our Qbox Elasticsearch cluster using elasticsearch-py, Elasticsearch’s official low-level Python client. We’ll also use elasticsearch-py to query and return the record we indexed.
Although this example is minimal and the choice of a YouTube video to index is somewhat arbitrary, the concept it demonstrates has larger practical applications. For example, a company could build a vertical search engine that collects all the information about it found online. The user-friendliness of Lassie and Python lets a task like this be done in relatively few lines of code, with syntax that is easy to understand even for those new to programming.
Setup
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”
We also assume Python 3 is installed and configured on your system. If Python 3 is not installed, check out the downloads for further instruction.
We also assume your Qbox base URL, username, and password are all saved on your system as environment variables called QBOX_BASE_URL, QBOX_USERNAME, and QBOX_PASSWORD.
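Before going further, it can help to confirm that these variables are actually set. The following is a minimal sketch rather than part of the tutorial program; the helper name missing_env_vars is our own and does not come from any library.

```python
import os

# The three variables this tutorial expects to find in the environment.
REQUIRED_VARS = ("QBOX_BASE_URL", "QBOX_USERNAME", "QBOX_PASSWORD")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given mapping.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Qbox environment variables are set.")
```

Running this once up front avoids a less obvious KeyError later, when the program reads the variables with os.environ.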
Imports
First, we need to install our project’s dependencies. You can do this using pip or your system package manager, depending on your preference. In addition to lassie and elasticsearch-py, we need to install certifi, which provides Mozilla’s root certificate bundle for SSL certificate validation and TLS host identity verification. This is necessary since we connect to our Qbox cluster over https. We also install urllib3, an HTTP client that we’ll use to create a PoolManager to manage the certificate bundle.
Once these packages are installed, we can create our program, lassie_qbox.py.
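If you are unsure whether the installation succeeded, a quick check like the one below can tell you which packages Python cannot find. It is only a sketch: the helper name missing_modules is hypothetical, and the check relies on the standard library’s importlib rather than on anything specific to these packages.

```python
import importlib.util

# Import names for the packages installed above. Note that elasticsearch-py
# is imported under the name "elasticsearch".
REQUIRED_MODULES = ("lassie", "elasticsearch", "certifi", "urllib3")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed: " + ", ".join(missing))
    else:
        print("All dependencies are importable.")
```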
Here are our imports:
import lassie
from elasticsearch import Elasticsearch
import os
import certifi
import urllib3
In addition to the previously mentioned packages, we import os from the Python standard library to extract information about our system environment variables, $QBOX_BASE_URL, $QBOX_USERNAME, and $QBOX_PASSWORD.
Creating a Client
First, we create a PoolManager using urllib3, which will verify the certificates for each request using the data stored in certifi:
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())
Next, we pull and save the relevant environment variables with os. os.environ is a mapping object that stores all environment variables as keys and their contents as the corresponding values.
username = os.environ['QBOX_USERNAME']
password = os.environ['QBOX_PASSWORD']
base_url = os.environ['QBOX_BASE_URL']
We are now ready to create our Elasticsearch client. We specify the username, password, and base_url for our Qbox cluster using the RFC-1738 format. Note that verify_certs is set to True; without this, our HTTPS connection would be insecure.
es = Elasticsearch([
    'https://{}:{}@{}/'.format(username, password, base_url),
], verify_certs=True)
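One caveat with the URL format above: a password containing characters such as @, : or / would produce an ambiguous URL. A common precaution is to percent-encode the credentials first. The sketch below assumes that approach; build_cluster_url is a hypothetical helper, and the host used in the usage line is a placeholder, not a real cluster.

```python
from urllib.parse import quote


def build_cluster_url(username, password, base_url):
    """Build an https URL whose credentials are percent-encoded.

    safe="" makes quote() encode every reserved character, including "/",
    so the credentials cannot be mistaken for part of the URL structure.
    """
    return "https://{}:{}@{}/".format(
        quote(username, safe=""),
        quote(password, safe=""),
        base_url,
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_cluster_url("user", "p@ss:w/rd", "cluster.example.com:30000"))
```

Alternatively, elasticsearch-py accepts the credentials separately through its http_auth argument, which keeps them out of the URL altogether.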
Fetching Data with Lassie
Now, we use the fetch function from lassie to retrieve information about our Qbox YouTube video. Lassie prioritizes beautifully formatted information retrieval, so we don’t have to worry about manipulating the format of our fetch results.
doc = lassie.fetch("https://www.youtube.com/watch?v=xB0E9X6Nmxk")
Indexing and Searching Our Data
Next, we can index the document we just fetched with elasticsearch-py, setting the document type to string:
res = es.index(index="test-index", doc_type='string', id=1, body=doc)
We fetch and print the document we just created. Here are the relevant lines of code:
res = es.get(index="test-index", doc_type='string', id=1)
print(res['_source'])
And here is the printed output:
{'images': [{'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'og:image'}, {'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'twitter:image'}, {'src': 'https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_48-vfl1s0rGh.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_96-vfldSA3ca.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_144-vflWmzoXw.png', 'type': 'favicon'}], 'videos': [{'src': 'http://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'secure_src': 'https://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'type': 'application/x-shockwave-flash', 'width': 1280, 'height': 720}, {'src': 'https://www.youtube.com/embed/xB0E9X6Nmxk', 'width': 1280, 'height': 720}], 'title': 'Qbox is Hosted Elasticsearch', 'url': 'https://www.youtube.com/watch?v=xB0E9X6Nmxk', 'description': 'Qbox is the dedicated Elasticsearch hosting solution. Our purpose is to help you be successful in your Elasticsearch environment and take away the stress of ...', 'site_name': 'YouTube', 'keywords': ['Elasticsearch', 'Cloud', 'Hosted Elasticsearch'], 'locale': 'en_US', 'status_code': 200}
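Because the stored document is an ordinary Python dictionary, individual fields can be picked out with normal dictionary and list operations. As a small illustration, the sketch below pulls image URLs of a given type from a trimmed-down copy of the output above; image_urls is our own helper name, not a Lassie function.

```python
def image_urls(doc, image_type):
    """Return the 'src' of every image entry whose 'type' matches."""
    return [
        img["src"]
        for img in doc.get("images", [])
        if img.get("type") == image_type
    ]


# A shortened copy of the document printed above.
sample = {
    "title": "Qbox is Hosted Elasticsearch",
    "images": [
        {"src": "https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg",
         "type": "og:image"},
        {"src": "https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico",
         "type": "favicon"},
    ],
}

if __name__ == "__main__":
    print(sample["title"])
    print(image_urls(sample, "og:image"))
```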
We refresh the index we created, then query our cluster for all documents in test-index and print the results found.
es.indices.refresh(index="test-index")
res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit["_source"])
Our output is the same as before:
Got 1 Hit: {'images': [{'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'og:image'}, {'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'twitter:image'}, {'src': 'https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_48-vfl1s0rGh.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_96-vfldSA3ca.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_144-vflWmzoXw.png', 'type': 'favicon'}], 'videos': [{'src': 'http://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'secure_src': 'https://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'type': 'application/x-shockwave-flash', 'width': 1280, 'height': 720}, {'src': 'https://www.youtube.com/embed/xB0E9X6Nmxk', 'width': 1280, 'height': 720}], 'title': 'Qbox is Hosted Elasticsearch', 'url': 'https://www.youtube.com/watch?v=xB0E9X6Nmxk', 'description': 'Qbox is the dedicated Elasticsearch hosting solution. Our purpose is to help you be successful in your Elasticsearch environment and take away the stress of ...', 'site_name': 'YouTube', 'keywords': ['Elasticsearch', 'Cloud', 'Hosted Elasticsearch'], 'locale': 'en_US', 'status_code': 200}
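A note on the hit count: the code above treats res['hits']['total'] as an integer, which matches the Elasticsearch versions current when this post was written. From Elasticsearch 7 onward, the same field is an object with a value key instead. If you need to handle both shapes, a small helper along these lines may be useful; hit_count is a hypothetical name of our own.

```python
def hit_count(response):
    """Return the total number of hits in a search response as an int."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # Elasticsearch 7 and later: {"value": 1, "relation": "eq"}
        return total["value"]
    # Earlier versions return a plain integer.
    return total


if __name__ == "__main__":
    old_style = {"hits": {"total": 1, "hits": []}}
    new_style = {"hits": {"total": {"value": 1, "relation": "eq"}, "hits": []}}
    print(hit_count(old_style), hit_count(new_style))
```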
Finally, let’s query an index that doesn’t exist, fake-index, and verify that we receive no results.
es.indices.refresh(index="fake-index")
res = es.search(index="fake-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit["_source"])
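The output confirms that our program is working as expected: fake-index does not exist, so the refresh call returns a 404 and elasticsearch-py raises a NotFoundError instead of returning results.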
POST https://$BASE_URL/fake-index/_refresh [status:404 request:0.036s]
Traceback (most recent call last):
  File "lassie_qbox.py", line 26, in <module>
    es.indices.refresh(index="fake-index")
  File "/usr/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3.6/site-packages/elasticsearch/client/indices.py", line 56, in refresh
    '_refresh'), params=params)
  File "/usr/lib/python3.6/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')
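In a longer-running program, you would probably not want a missing index to end in a traceback. One option is to ask the cluster whether the index exists before refreshing and searching it. The sketch below assumes an elasticsearch-py client like the es object created earlier; search_all is a hypothetical helper name, and the body argument follows the client style used throughout this post.

```python
def search_all(es, index):
    """Return every hit from a match_all query, or [] if the index is missing.

    `es` is assumed to be an elasticsearch-py client such as the one
    created earlier in this post.
    """
    if not es.indices.exists(index=index):
        # Nothing to refresh or search, so report zero hits instead of raising.
        return []
    es.indices.refresh(index=index)
    res = es.search(index=index, body={"query": {"match_all": {}}})
    return res["hits"]["hits"]
```

With this in place, search_all(es, "fake-index") would return an empty list. Catching the NotFoundError shown in the traceback is another reasonable way to get the same result.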
Conclusion
Lassie and elasticsearch-py make a powerful pair. We can quickly extract meaningful information from web pages and just as readily index and search the data we’ve retrieved.