Scraping the Web with Lassie and Elasticsearch
Posted by Jacqueline Outka August 24, 2017
In this tutorial, we’ll use Lassie, a Python library for retrieving content from websites, to fetch information regarding a Qbox YouTube video as JSON. We’ll then store that data in our Qbox Elasticsearch cluster using elasticsearch-py, Elasticsearch’s official low-level Python client. We’ll also use elasticsearch-py to query and return the record we indexed.
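The fetch, index, and query workflow described above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not the exact code used later in the tutorial: the helper names (fetch_metadata, build_document, index_and_fetch), the index name "videos", and the document ID are hypothetical, and the keyword arguments accepted by elasticsearch-py's index() and get() differ between client versions.

```python
def fetch_metadata(url):
    """Fetch page metadata (title, description, videos, ...) as a dict."""
    # Third-party dependency (pip install lassie); imported lazily so the
    # other helpers can be used without it.
    import lassie
    return lassie.fetch(url)


def build_document(metadata):
    """Keep a few fields from the fetched metadata (hypothetical selection)."""
    return {
        "title": metadata.get("title"),
        "description": metadata.get("description"),
        "url": metadata.get("url"),
        "videos": metadata.get("videos", []),
    }


def index_and_fetch(es, doc, index="videos", doc_id=1):
    """Store doc in Elasticsearch, then read the same record back.

    `es` is an elasticsearch-py client (or any object with compatible
    index/get methods). Assumption: a client version that accepts `body=`
    and needs no `doc_type`; older clients require `doc_type=`, and newer
    ones prefer `document=` over `body=`.
    """
    es.index(index=index, id=doc_id, body=doc)
    return es.get(index=index, id=doc_id)["_source"]
```

Against a real cluster, the pieces would be chained along the lines of index_and_fetch(Elasticsearch([endpoint]), build_document(fetch_metadata(video_url))), where endpoint and video_url are placeholders for your own cluster address and the page you want to scrape.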
Although this example is minimal and the choice of a YouTube video to index is somewhat arbitrary, the concept it demonstrates has larger practical applications. For example, a company could build a vertical search engine that collects the information about the company available online. The user-friendliness of Lassie and Python lets a task like this be done in relatively few lines of code, with syntax that is easy to follow even for those new to programming.