How To Use Elasticsearch With Python and Django ( Part 4 )
Posted by Alex Alex April 5, 2016
In the previous posts in this series we created a basic Django app and populated a database with automatically generated data. We also added data to the Elasticsearch index in bulk, wrote a basic command, and added a mapping to the Elasticsearch index. In this final article we will add functional frontend items, write queries, allow the index to update, and discuss a bonus tip.
Add frontend and write some queries
Most sites offer more than just search; they also include functionality such as:
- authentication/authorization
- permission system
- notifications
- lots of other functionality
Of course, we could have dropped in a few lines of JavaScript to query Elasticsearch directly, but that would not be a real integration with the rest of the site. So, we'll use Elasticsearch in Django views. Below is a simplified representation (middlewares and context processors are omitted) of the request-response cycle Django uses to handle HTTP requests:

                         template
                            |
                            \/
request -> url router -> view -> response
Based on this picture, we have to add code in three places:
urls.py
views.py
templates/{template name}.html
In the end, we will have a main page with autocomplete search of students, facets with filtering, and a bit of aggregation. Also, there will be an additional page for student details.
First, let’s start a development server, in case you’ve turned it off after we finished stage one.
python project/manage.py runserver

Next, let's add logic to the URL router:

from core.views import autocomplete_view, student_detail, HomePageView

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^autocomplete/', autocomplete_view, name='autocomplete-view'),
    url(r'^student', student_detail, name='student-detail'),
    url(r'^$', HomePageView.as_view(), name='index-view'),
]

As you can see, we've added three new URLs: one to perform autocomplete queries, one to get student details, and one as the main page. You will now probably see an ImportError in your shell. That's okay, because we haven't added these views to core.views yet. Let's do that.
Let's drop all the imports we need there:

import json
from urllib import urlencode
from copy import deepcopy

from django.http import HttpResponse
from django.conf import settings
from django.shortcuts import render
from django.views.generic.base import TemplateView

from core.models import Student

and add two auxiliary views:

client = settings.ES_CLIENT


def autocomplete_view(request):
    query = request.GET.get('term', '')
    resp = client.suggest(
        index='django',
        body={
            'name_complete': {
                "text": query,
                "completion": {
                    "field": 'name_complete',
                }
            }
        }
    )
    options = resp['name_complete'][0]['options']
    data = json.dumps(
        [{'id': i['payload']['pk'], 'value': i['text']} for i in options]
    )
    mimetype = 'application/json'
    return HttpResponse(data, mimetype)


def student_detail(request):
    student_id = request.GET.get('student_id')
    student = Student.objects.get(pk=student_id)
    return render(request, 'student-details.html', context={'student': student})
The first one, autocomplete_view, returns the JSON response that makes autocomplete work. The second one renders student data from the database into the student-details.html template. Since the HTML/CSS code is straightforward (Bootstrap and jQuery), I will not post it here; visit the repository of this tutorial to view it. To help you find the files, all templates are in the project/templates folder, and all the CSS/JavaScript files are in the project/static folder.
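To make the shape of the autocomplete response concrete, here is a minimal sketch of the conversion step on its own, without Django or a running Elasticsearch. The sample response and the helper name suggestions_to_json are assumptions made for illustration; the response is only modelled on the keys that autocomplete_view reads.

```python
import json

# Hypothetical response, shaped after the keys autocomplete_view reads:
# resp['name_complete'][0]['options'], each option with 'text' and 'payload'.
sample_resp = {
    'name_complete': [
        {
            'text': 'jo',
            'options': [
                {'text': 'John Smith', 'payload': {'pk': 17}},
                {'text': 'Joanna Doe', 'payload': {'pk': 42}},
            ],
        }
    ]
}


def suggestions_to_json(resp, suggester='name_complete'):
    """Turn suggester options into the JSON list sent back to the frontend."""
    options = resp[suggester][0]['options']
    return json.dumps(
        [{'id': o['payload']['pk'], 'value': o['text']} for o in options]
    )


print(suggestions_to_json(sample_resp))
```

Keeping the conversion in a small function like this makes it easy to test without a running cluster.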
Here’s the main view:
class HomePageView(TemplateView):
    template_name = "index.html"

    def get_context_data(self, **kwargs):
        body = {
            'aggs': {
                'course_names': {
                    'terms': {
                        'field': 'course_names',
                        'size': 0
                    }
                },
                'university__name': {
                    'terms': {
                        'field': 'university.name'
                    }
                },
                'year_in_school': {
                    'terms': {
                        'field': 'year_in_school'
                    }
                },
                'age': {
                    'histogram': {
                        'field': 'age',
                        'interval': 2
                    }
                }
            },
            # 'query': {'match_all': {}}
        }
        es_query = self.gen_es_query(self.request)
        body.update({'query': es_query})
        search_result = client.search(index='django', doc_type='student', body=body)
        context = super(HomePageView, self).get_context_data(**kwargs)
        context['hits'] = [
            self.convert_hit_to_template(c) for c in search_result['hits']['hits']
        ]
        context['aggregations'] = self.prepare_facet_data(
            search_result['aggregations'],
            self.request.GET
        )
        return context

    def convert_hit_to_template(self, hit1):
        hit = deepcopy(hit1)
        almost_ready = hit['_source']
        almost_ready['pk'] = hit['_id']
        return almost_ready

    def facet_url_args(self, url_args, field_name, field_value):
        is_active = False
        if url_args.get(field_name):
            base_list = url_args[field_name].split(',')
            if field_value in base_list:
                del base_list[base_list.index(field_value)]
                is_active = True
            else:
                base_list.append(field_value)
            url_args[field_name] = ','.join(base_list)
        else:
            url_args[field_name] = field_value
        return url_args, is_active

    def prepare_facet_data(self, aggregations_dict, get_args):
        resp = {}
        for area in aggregations_dict.keys():
            resp[area] = []
            if area == 'age':
                resp[area] = aggregations_dict[area]['buckets']
                continue
            for item in aggregations_dict[area]['buckets']:
                url_args, is_active = self.facet_url_args(
                    url_args=deepcopy(get_args.dict()),
                    field_name=area,
                    field_value=item['key']
                )
                resp[area].append({
                    'url_args': urlencode(url_args),
                    'name': item['key'],
                    'count': item['doc_count'],
                    'is_active': is_active
                })
        return resp

    def gen_es_query(self, request):
        req_dict = deepcopy(request.GET.dict())
        if not req_dict:
            return {'match_all': {}}
        filters = []
        for field_name in req_dict.keys():
            if '__' in field_name:
                filter_field_name = field_name.replace('__', '.')
            else:
                filter_field_name = field_name
            for field_value in req_dict[field_name].split(','):
                if not field_value:
                    continue
                filters.append(
                    {
                        'term': {filter_field_name: field_value},
                    }
                )
        return {
            'filtered': {
                'query': {'match_all': {}},
                'filter': {
                    'bool': {
                        'must': filters
                    }
                }
            }
        }
This is a subclass of Django's TemplateView, which makes template rendering easier. The template_name attribute defines the template to be used. You can find it in the project/templates folder.
Let's describe it method by method:
- get_context_data: we generate (gen_es_query) and perform the Elasticsearch query in this method. We also prepare the data so that it is easy to render in the template (prepare_facet_data and convert_hit_to_template).
- gen_es_query generates Elasticsearch filters from the GET parameters.
- prepare_facet_data converts the Elasticsearch aggregation dict so that it's easy to present in the template. We generate URL args so that the user is able to click on a link to add or delete a filter. We also mark filters as active, and we keep the counts.
- convert_hit_to_template extracts the _source of each Elasticsearch document and adds the primary key to it. It's a way to write less in the templates.
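As a rough illustration of what gen_es_query produces, the sketch below reimplements the same logic as a plain function that takes a dict instead of a Django request, so it can be run on its own. The function name build_es_query and the sample parameters are assumptions for this example. Note that the filtered query belongs to the Elasticsearch versions of this article's era; later major versions replaced it with a bool query that has a filter clause.

```python
def build_es_query(params):
    """Build the query dict from a plain dict of GET parameters."""
    if not params:
        # No filters selected: return every document.
        return {'match_all': {}}
    filters = []
    for field_name, raw_value in params.items():
        # 'university__name' in the URL maps to 'university.name' in the index.
        es_field = field_name.replace('__', '.')
        for value in raw_value.split(','):
            if value:  # skip empty fragments such as 'FR,,SO'
                filters.append({'term': {es_field: value}})
    return {
        'filtered': {
            'query': {'match_all': {}},
            'filter': {'bool': {'must': filters}},
        }
    }


# Hypothetical parameters, as if taken from ?university__name=mit&year_in_school=FR
query = build_es_query({'university__name': 'mit', 'year_in_school': 'FR'})
print(query['filtered']['filter']['bool']['must'])
```

Because every term lands in the must list, all selected values have to match at once, which is worth keeping in mind when several values of the same single-valued field are selected.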
The general workflow is as follows:
- User goes to the index page.
- User clicks on some of the filters.
- User is redirected to the new url with new GET-parameters added.
- View generates a new query from GET-parameters and outputs query results to the template.
- User sees new data on the webpage.
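The "click a filter, get a new URL" step can be sketched in isolation. The following is a slightly simplified variant of facet_url_args, not the code from the repository: the name toggle_facet is hypothetical, it works on a copy of the arguments, and it drops a parameter entirely once its last value is removed. It uses the Python 3 location of urlencode.

```python
from urllib.parse import urlencode  # Python 3 location of urlencode


def toggle_facet(url_args, field_name, field_value):
    """Return (new_args, is_active) for one facet link.

    If the value is already selected, the link removes it and
    is_active is True; otherwise the link adds it.
    """
    args = dict(url_args)  # work on a copy, leave the caller's dict alone
    selected = [v for v in args.get(field_name, '').split(',') if v]
    is_active = field_value in selected
    if is_active:
        selected.remove(field_value)
    else:
        selected.append(field_value)
    if selected:
        args[field_name] = ','.join(selected)
    else:
        args.pop(field_name, None)  # nothing selected: drop the parameter
    return args, is_active


# Hypothetical current state: the user has already filtered by 'FR'.
current = {'year_in_school': 'FR'}
new_args, active = toggle_facet(current, 'year_in_school', 'SO')
print(urlencode(new_args), active)
```

The encoded string is what ends up in the href of each facet link, so clicking the same link twice returns the user to the previous state.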
As you've probably noticed, the view never queries the underlying database. All the data in the template comes from Elasticsearch. Of course, Django middlewares (sessions, for instance) may still hit the database.
Also, if a user types into the autocomplete field in the upper-right corner of the page and clicks on one of the suggested students, they are redirected to the student's detail page. We render database data there. In my experience, this is a handy way to deal with the time lag between the moment data is updated in the database and the moment it is updated in the search index. We have to remember that Elasticsearch is near-realtime, not actually realtime. This way you can guarantee that the user sees correct data on the detail page even when the version in the search index is outdated.
To look at the template html/javascript code and to play with it you can visit <6796013>.
Make sure the index is up to date when data is added, updated, or deleted
In the previous part of this tutorial we pushed all the data to the Elasticsearch index in bulk. However, data changes regularly: students can update their properties, courses can be added, and so on. We need to handle updates. We have three options:
- Live updates: All the data is pushed to the index as soon as it's updated in the database. The index is refreshed right after an insert/update/delete is performed.
- Semi-live updates: Same as #1, but the index is refreshed periodically.
- Periodic bulk updates: We mark some documents as dirty until their database-based state is synchronized with Elasticsearch, which occurs periodically (for example, in a Celery or cron job).
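The third option can be sketched without any external services. In the example below, the class DirtyTracker and the callable load_document are hypothetical names; a real project would more likely store the dirty flag in the database. The resulting dicts are shaped like the actions accepted by elasticsearch.helpers.bulk, but nothing is sent anywhere, and deletions are not handled.

```python
class DirtyTracker:
    """In-memory stand-in for a per-document 'dirty' flag."""

    def __init__(self):
        self._dirty = set()

    def mark(self, pk):
        """Remember that the document with this pk needs re-indexing."""
        self._dirty.add(pk)

    def flush(self, load_document, index='django', doc_type='student'):
        """Build one bulk 'index' action per dirty pk, then clear the set.

        load_document is a callable returning the document body for a pk.
        """
        actions = [
            {
                '_op_type': 'index',
                '_index': index,
                '_type': doc_type,
                '_id': pk,
                '_source': load_document(pk),
            }
            for pk in sorted(self._dirty)
        ]
        self._dirty.clear()
        return actions


tracker = DirtyTracker()
tracker.mark(3)
tracker.mark(1)
tracker.mark(3)  # marking twice still yields a single action
pending = tracker.flush(lambda pk: {'name': 'student %d' % pk})
print(len(pending))
```

A periodic job would call flush and hand the actions to the bulk helper, so each changed document is written once per period no matter how many times it was modified.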
As you can see, the Elasticsearch load per document decreases from #1 to #3. So, as your project develops and your user base grows, you can move to an option that is less realtime and less resource-hungry. For a relatively small project or an MVP, we can use the first option. Let's update the save and delete methods of the Student model so that data is updated in the index:
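def save(self, *args, **kwargs):
    # Remember the primary key before saving: it is None for a new object.
    pk_before_save = self.pk
    super(Student, self).save(*args, **kwargs)
    payload = self.es_repr()
    # The document id is passed separately, so drop it from the body.
    payload.pop('_id', None)
    if pk_before_save is not None:
        # Existing object: partial update, the fields go under 'doc'.
        es_client.update(
            index=self._meta.es_index_name,
            doc_type=self._meta.es_type_name,
            id=self.pk,
            refresh=True,
            body={
                'doc': payload
            }
        )
    else:
        # New object: the body of create is the document itself.
        es_client.create(
            index=self._meta.es_index_name,
            doc_type=self._meta.es_type_name,
            id=self.pk,
            refresh=True,
            body=payload
        )

def delete(self, *args, **kwargs):
    # Django clears self.pk on delete, so keep a copy for the index call.
    prev_pk = self.pk
    super(Student, self).delete(*args, **kwargs)
    es_client.delete(
        index=self._meta.es_index_name,
        doc_type=self._meta.es_type_name,
        id=prev_pk,
        refresh=True,
    )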
As you can see, after the database save (the super call) we detect whether a new object was created or an existing one was updated, and then call the appropriate method of the Elasticsearch client. The same goes for delete: we perform the database operation first and update the search index next. We also read self.pk before calling the original save or delete method, because its state changes. For a new object, self.pk is None before the save and is set afterwards. When an object is deleted, self.pk becomes None. Therefore, to use it we have to capture it before the database operations.
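Why the primary key has to be read before the save can be shown with two small stand-ins. Both classes below are hypothetical and exist only for this sketch: RecordingClient records calls instead of talking to Elasticsearch, and FakeStudent imitates a model whose pk is None until the first save.

```python
class RecordingClient:
    """Stand-in for the Elasticsearch client that only records calls."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(('create', kwargs))

    def update(self, **kwargs):
        self.calls.append(('update', kwargs))


class FakeStudent:
    """Stand-in for the Django model: pk is None until the first save."""

    _next_pk = 1

    def __init__(self, name):
        self.pk = None
        self.name = name

    def _db_save(self):
        # Imitates the database assigning a primary key on first insert.
        if self.pk is None:
            self.pk = FakeStudent._next_pk
            FakeStudent._next_pk += 1

    def save(self, client):
        existed = self.pk is not None  # must be read BEFORE the database save
        self._db_save()
        if existed:
            # update takes a partial document under 'doc'
            client.update(id=self.pk, body={'doc': {'name': self.name}})
        else:
            # create takes the document itself as the body
            client.create(id=self.pk, body={'name': self.name})


client = RecordingClient()
student = FakeStudent('Ann')
student.save(client)   # first save: pk was None, so create is called
student.name = 'Anna'
student.save(client)   # second save: pk already set, so update is called
print([op for op, _ in client.calls])
```

If existed were computed after _db_save, it would always be True and new documents would never be created in the index.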
To make it easier to switch from realtime refresh to periodic refresh, and vice versa, let's move the refresh argument into the settings:
- Add ES_AUTOREFRESH = True somewhere in project/conf/base.py.
- Use it wherever you call Elasticsearch, switching the refresh argument to the settings-based value:

-refresh=True
+refresh=settings.ES_AUTOREFRESH
I have to note that the save and delete methods do not cover every way of changing data in Django. If you perform mass updates, creates, or deletes (for example with QuerySet.update(), bulk_create(), or QuerySet.delete()), the model methods are not called and the index is not updated. In such cases, I'd suggest the bulk API. Likewise, if related models are updated, no index update will occur, because from the relational point of view the student itself did not change. To fix this, let's update University's save method:
def save(self, *args, **kwargs):
    super(University, self).save(*args, **kwargs)
    for student in self.student_set.all():
        data = student.field_es_repr('university')
        es_client.update(
            index=student._meta.es_index_name,
            doc_type=student._meta.es_type_name,
            id=student.pk,
            body={
                'doc': {
                    'university': data
                }
            }
        )
As a naive approach, we update the university field for every student. When I ran this for a university with 1,600 associated students, the update took 100 seconds. There's also a chance of version conflicts. Since we're mostly interested in search performance, and university names presumably do not change often, this approach is acceptable when the number of related objects is low or when the functionality is not exposed to the general public. Otherwise, we should use a bulk update.
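A bulk variant could start from something like the sketch below, which only builds the update actions and sends nothing. The function name university_update_actions is hypothetical, and the index and type names are taken from the views earlier in this article. The dicts follow the action format of elasticsearch.helpers.bulk, which is where they would be passed together with the client.

```python
def university_update_actions(student_ids, university_doc,
                              index='django', doc_type='student'):
    """Yield one partial-update action per student of the changed university."""
    for pk in student_ids:
        yield {
            '_op_type': 'update',
            '_index': index,
            '_type': doc_type,
            '_id': pk,
            # Partial document: only the 'university' field is rewritten.
            'doc': {'university': university_doc},
        }


# Hypothetical ids and university data, just to show the resulting shape.
actions = list(university_update_actions([1, 2, 3], {'name': 'New Name'}))
print(len(actions), actions[0]['_op_type'])
```

Sending the actions in one bulk request replaces one HTTP round trip per student with a handful of larger requests, which is where most of the time goes in the naive loop.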
If you’d like to investigate the state after this section – please checkout the dessert.
Dessert
For a dessert, I have these two suggestions:
- A correct way to share the modules used is requirements.txt. I've added it to the repository. To install all requirements, type pip install -r requirements.txt with virtualenv activated.
- I've implemented the facets functionality in SQL (it may be a bit buggy, but it uses group by). Comparing different queries on 10,000 students, I can state that there's a speed increase of up to 10 times. You can check this yourself: start the development server, go to the main page, click on the filters a bit, and then add &sql=true to the address bar. You will see the timing in the time panel of the Django debug toolbar, which is already installed in the last commit. To see the last commit, use the head of the master branch.
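To give an idea of what facets built on group by look like, here is a minimal sketch using the sqlite3 module from the standard library. It is not the implementation from the repository: the table, its columns, and the helper facet_counts are assumptions made for this example.

```python
import sqlite3

# Hypothetical, heavily reduced student table in an in-memory database.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE student ('
    'id INTEGER PRIMARY KEY, year_in_school TEXT, age INTEGER)'
)
conn.executemany(
    'INSERT INTO student (year_in_school, age) VALUES (?, ?)',
    [('FR', 18), ('FR', 19), ('SO', 20), ('JR', 21), ('SO', 20)],
)

ALLOWED_COLUMNS = {'year_in_school', 'age'}


def facet_counts(connection, column):
    """Count students per distinct value of a column, most frequent first."""
    # Column names cannot be bound as parameters, so check a whitelist.
    if column not in ALLOWED_COLUMNS:
        raise ValueError('unsupported facet column: %s' % column)
    rows = connection.execute(
        'SELECT {0}, COUNT(*) FROM student '
        'GROUP BY {0} ORDER BY COUNT(*) DESC, {0}'.format(column)
    ).fetchall()
    return [{'name': name, 'count': count} for name, count in rows]


print(facet_counts(conn, 'year_in_school'))
```

Each facet needs its own group by query here, whereas the Elasticsearch request in the main view returns all aggregations alongside the hits in a single round trip.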
Conclusion
In this series, we created a basic Django app and populated a database with automatically generated data. We also added data to the elasticsearch index in bulk, wrote a basic command, and added a mapping to the elasticsearch index. In this final post, we added functional frontend items, wrote queries, allowed the index to update, and discussed a bonus suggestion. Thanks for taking the time to read and participate. Let us know if you have any questions in the comment section below.