In this tutorial, we'll use Lassie, a Python library for retrieving content from websites, to fetch information regarding a Qbox YouTube video as JSON. We'll then store that data in our Qbox Elasticsearch cluster using elasticsearch-py, Elasticsearch's official low-level Python client. We'll also use elasticsearch-py to query and return the record we indexed.

Although this example is minimal and the choice of a YouTube video to index is somewhat arbitrary, the concept it demonstrates has larger practical applications. For example, a company could build a vertical search engine that collects all the information about it found online. Lassie and Python make a task like this achievable in relatively few lines of code, with syntax that is easy to follow even for those new to programming.

Setup

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

We also assume Python 3 is installed and configured on your system. If Python 3 is not installed, check out the Python downloads page for further instructions.

We also assume your Qbox base URL, username, and password are all saved on your system as environment variables called QBOX_BASE_URL, QBOX_USERNAME, and QBOX_PASSWORD.

Imports

First, we need to install our project's dependencies. You can do this using pip or your system package manager, depending on your preference. In addition to lassie and elasticsearch-py, we need to install certifi, which provides Mozilla's root certificate bundle for SSL certificate validation and TLS host identity verification. This is necessary since we connect to our Qbox cluster over HTTPS. We also install urllib3, an HTTP client that we'll use to create a PoolManager to manage the certificate bundle.

Once these packages are installed, we can create our program, lassie_qbox.py.

Here are our imports:

import lassie
from elasticsearch import Elasticsearch
import os
import certifi
import urllib3

In addition to the previously mentioned packages, we import os from the Python standard library to read our system environment variables: $QBOX_BASE_URL, $QBOX_USERNAME, and $QBOX_PASSWORD.

Creating a Client

First, we create a PoolManager using urllib3 which will verify the certificates for each request, using the data stored in certifi:

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

Next, we pull and save the relevant environment variables with os. os.environ is a mapping object that stores all environment variables as keys and their content as corresponding values.

username = os.environ['QBOX_USERNAME']
password = os.environ['QBOX_PASSWORD']
base_url = os.environ['QBOX_BASE_URL']
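Indexing os.environ directly raises a KeyError for the first variable that is not set. As a minimal sketch (the helper name load_qbox_settings and the REQUIRED_VARS tuple are hypothetical, not part of the original program), the lookups could be wrapped so that every missing variable is reported in one message:

```python
import os

# Variable names taken from the setup section of this post.
REQUIRED_VARS = ("QBOX_USERNAME", "QBOX_PASSWORD", "QBOX_BASE_URL")


def load_qbox_settings(env=None):
    """Return the Qbox settings as a dict, or raise one error naming every missing variable.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    if env is None:
        env = os.environ
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

With this in place, username, password, and base_url could be read from the returned dictionary instead of from three separate lookups.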

We are now ready to create our Elasticsearch client. We specify the username, password, and base_url for our Qbox cluster using the RFC-1738 format. Note that verify_certs is set to True; without this, our HTTPS connection would be insecure.

es = Elasticsearch(
    ['https://{}:{}@{}/'.format(username, password, base_url)],
    verify_certs=True)
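One caveat with building the URL by plain string formatting: a password containing characters such as @, :, or / would make the URL ambiguous. The sketch below percent-encodes the credentials with the standard library before formatting. The helper name build_es_url is hypothetical, and it assumes the client decodes percent-encoded credentials when it parses the URL, which is worth confirming for your elasticsearch-py version.

```python
from urllib.parse import quote


def build_es_url(username, password, host):
    """Build an RFC-1738 style https URL with percent-encoded credentials.

    `host` is the bare host (and optional port), without a scheme,
    matching how QBOX_BASE_URL is used in this post.
    """
    # safe="" also encodes "/" so nothing in the credentials is read as URL syntax.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return "https://{}:{}@{}/".format(user, secret, host)
```

The result could then be passed to Elasticsearch in place of the formatted string used above.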

Fetching Data with Lassie

Now, we use the fetch function from lassie to retrieve information about our Qbox YouTube video. Lassie prioritizes beautifully formatted information retrieval, so we don't have to worry about manipulating the format of our fetch results.

doc = lassie.fetch("https://www.youtube.com/watch?v=xB0E9X6Nmxk")

Indexing and Searching Our Data

Next, we can index the document we just fetched with elasticsearch-py, setting the document type to string:

res = es.index(index="test-index", doc_type='string', id=1, body=doc)

We fetch and print the document we just created. Here are the relevant lines of code:

res = es.get(index="test-index", doc_type='string', id=1)
print(res['_source'])

And here is the printed output:

{'images': [{'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'og:image'}, {'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'twitter:image'}, {'src': 'https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_48-vfl1s0rGh.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_96-vfldSA3ca.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_144-vflWmzoXw.png', 'type': 'favicon'}], 'videos': [{'src': 'http://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'secure_src': 'https://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'type': 'application/x-shockwave-flash', 'width': 1280, 'height': 720}, {'src': 'https://www.youtube.com/embed/xB0E9X6Nmxk', 'width': 1280, 'height': 720}], 'title': 'Qbox is Hosted Elasticsearch', 'url': 'https://www.youtube.com/watch?v=xB0E9X6Nmxk', 'description': 'Qbox is the dedicated Elasticsearch hosting solution. Our purpose is to help you be successful in your Elasticsearch environment and take away the stress of ...', 'site_name': 'YouTube', 'keywords': ['Elasticsearch', 'Cloud', 'Hosted Elasticsearch'], 'locale': 'en_US', 'status_code': 200}
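The printed result is an ordinary Python dictionary, so it can be trimmed before indexing if only a few fields matter. The following is a rough sketch rather than part of the original program: the function name summarize and the choice of fields are assumptions, while the keys (title, url, keywords, images, type, src) are the ones visible in the output above.

```python
def summarize(doc):
    """Pick a few fields from a Lassie-style result dictionary.

    Missing keys fall back to None or an empty list instead of raising.
    """
    images = doc.get("images", [])
    # Keep only Open Graph images; favicons and Twitter images are skipped.
    og_images = [img["src"] for img in images if img.get("type") == "og:image"]
    return {
        "title": doc.get("title"),
        "url": doc.get("url"),
        "keywords": doc.get("keywords", []),
        "thumbnail": og_images[0] if og_images else None,
    }
```

Passing summarize(doc) as the body instead of doc would store a smaller document with the same title, URL, and keywords.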

We refresh the index we created so that the new document is visible to search, then query our cluster for all documents in test-index and print the results found.

es.indices.refresh(index="test-index")
res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit["_source"])

The output contains the same document as before, preceded by the hit count:

Got 1 Hits:
{'images': [{'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'og:image'}, {'src': 'https://i.ytimg.com/vi/xB0E9X6Nmxk/maxresdefault.jpg', 'type': 'twitter:image'}, {'src': 'https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_48-vfl1s0rGh.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_96-vfldSA3ca.png', 'type': 'favicon'}, {'src': 'https://www.youtube.com/yts/img/favicon_144-vflWmzoXw.png', 'type': 'favicon'}], 'videos': [{'src': 'http://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'secure_src': 'https://www.youtube.com/v/xB0E9X6Nmxk?version=3&autohide=1', 'type': 'application/x-shockwave-flash', 'width': 1280, 'height': 720}, {'src': 'https://www.youtube.com/embed/xB0E9X6Nmxk', 'width': 1280, 'height': 720}], 'title': 'Qbox is Hosted Elasticsearch', 'url': 'https://www.youtube.com/watch?v=xB0E9X6Nmxk', 'description': 'Qbox is the dedicated Elasticsearch hosting solution. Our purpose is to help you be successful in your Elasticsearch environment and take away the stress of ...', 'site_name': 'YouTube', 'keywords': ['Elasticsearch', 'Cloud', 'Hosted Elasticsearch'], 'locale': 'en_US', 'status_code': 200}
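The search code above treats res['hits']['total'] as an integer, which matches the Elasticsearch version used in this post. In Elasticsearch 7 and later, that field is returned as an object with a value key instead. A small sketch that tolerates both shapes (the helper name total_hits is hypothetical) might look like this:

```python
def total_hits(res):
    """Return the total hit count from a search response as an int.

    Handles both the older integer form and the newer
    {"value": N, "relation": "..."} object form of hits.total.
    """
    total = res["hits"]["total"]
    if isinstance(total, dict):
        return total["value"]
    return total
```

The print call could then use total_hits(res) in place of res['hits']['total'].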

Finally, let's query an index that does not exist and verify that Elasticsearch reports an error instead of returning results.

es.indices.refresh(index="fake-index")
res = es.search(index="fake-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit["_source"])

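The output confirms that our program is working as expected: because fake-index does not exist, the refresh call raises a NotFoundError (HTTP 404) before the search is ever run.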

POST https://$BASE_URL/fake-index/_refresh [status:404 request:0.036s]
Traceback (most recent call last):
  File "lassie_qbox.py", line 26, in <module>
    es.indices.refresh(index="fake-index")
  File "/usr/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3.6/site-packages/elasticsearch/client/indices.py", line 56, in refresh
    '_refresh'), params=params)
  File "/usr/lib/python3.6/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')
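If a traceback is not the behavior you want for a missing index, one option is to check for the index first. The sketch below is an assumption about how that could be arranged, not part of the original program: the function name search_all is hypothetical, and it takes the client as a parameter and relies only on the indices.exists, indices.refresh, and search calls of elasticsearch-py, using the same body argument as the code in this post.

```python
def search_all(es, index):
    """Return the _source of every document in `index`, or [] if the index is missing.

    `es` is expected to be an elasticsearch-py client like the one created earlier.
    """
    # Checking first avoids the NotFoundError raised by refresh on a missing index.
    if not es.indices.exists(index=index):
        return []
    es.indices.refresh(index=index)
    res = es.search(index=index, body={"query": {"match_all": {}}})
    return [hit["_source"] for hit in res["hits"]["hits"]]
```

Calling search_all(es, "fake-index") would then give an empty list, while search_all(es, "test-index") would give the document indexed earlier.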

Conclusion

In conclusion, Lassie and elasticsearch-py combined are a powerful pair. We can quickly extract meaningful information from web pages and just as readily index and search the data we've retrieved.
