In the previous posts in this series, we created a basic Django app and populated its database with automatically generated data. We also added data to the Elasticsearch index in bulk, wrote a basic command, and added a mapping to the index. In this final article we will add functional frontend items, write queries, allow the index to update, and discuss a bonus tip.

Add frontend and write some queries

Most sites consist of more than just search; they also include functional items such as:

  • authentication/authorization
  • permission system
  • notifications
  • lots of other functionality

Of course, we could drop in a few lines of JavaScript to query Elasticsearch directly, but that wouldn't be a real integration. Instead, we'll use Elasticsearch in Django views. Below is a simplified representation (middleware and context processors omitted) of the request-response cycle Django uses to handle HTTP requests:

                         template
                            |
                            v
request -> url router -> view -> response

As the diagram shows, we'll have to add code in three places:

  1. urls.py
  2. views.py
  3. templates/{template name}.html

In the end, we will have a main page with autocomplete search for students, facets with filtering, and a bit of aggregation. There will also be an additional page for student details.

First, let's start a development server, in case you've turned it off after we finished stage one.

python project/manage.py runserver

Next, let's add logic to the url router:

from django.conf.urls import url
from django.contrib import admin

from core.views import autocomplete_view, student_detail, HomePageView

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^autocomplete/', autocomplete_view, name='autocomplete-view'),
    url(r'^student', student_detail, name='student-detail'),
    url(r'^$', HomePageView.as_view(), name='index-view'),
]

As you can see, we've added three new URLs: one to perform autocomplete queries, one to show student details, and one for the main page. You'll probably see an ImportError in your shell now. That's okay, because we haven't added the views to core.views yet. Let's do that.

Let's drop in all the imports we'll need:

import json
from urllib import urlencode
from copy import deepcopy
from django.http import HttpResponse
from django.conf import settings
from django.shortcuts import render
from django.views.generic.base import TemplateView
from core.models import Student

and add two auxiliary views:

client = settings.ES_CLIENT


def autocomplete_view(request):
    # jQuery UI autocomplete sends the typed text as the "term" GET parameter.
    query = request.GET.get('term', '')
    resp = client.suggest(
        index='django',
        body={
            'name_complete': {
                'text': query,
                'completion': {
                    'field': 'name_complete',
                }
            }
        }
    )
    options = resp['name_complete'][0]['options']
    # Each suggestion carries the student's primary key in its payload,
    # so the frontend can link straight to the detail page.
    data = json.dumps(
        [{'id': i['payload']['pk'], 'value': i['text']} for i in options]
    )
    return HttpResponse(data, content_type='application/json')


def student_detail(request):
    # The detail page is rendered from the database, not from the index.
    student_id = request.GET.get('student_id')
    student = Student.objects.get(pk=student_id)
    return render(request, 'student-details.html', context={'student': student})

The first view, autocomplete_view, returns the JSON response that makes autocomplete work. The second renders student data from the database into the student-details.html template. As the HTML/CSS here is straightforward (Bootstrap and jQuery), I won't post it; visit the repository for this tutorial to view it. All templates are in the project/templates folder, and all the CSS/JavaScript files are in the project/static folder.
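If you want to sanity-check the autocomplete endpoint without touching the frontend, you can hit it with Django's test client from python project/manage.py shell. Here's a quick sketch (the term and the suggestions shown are made up; yours will depend on the generated data):

from django.test import Client

c = Client()
resp = c.get('/autocomplete/', {'term': 'jo'})
print(resp.status_code)  # 200
print(resp.content)      # e.g. [{"id": 42, "value": "John Smith"}]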

Here's the main view:

class HomePageView(TemplateView):
    template_name = "index.html"

    def get_context_data(self, **kwargs):
        body = {
            'aggs': {
                'course_names': {
                    'terms': {
                        'field': 'course_names', 'size': 0
                    }
                },
                'university__name': {
                    'terms': {
                        'field': 'university.name'
                    }
                },
                'year_in_school': {
                    'terms': {
                        'field': 'year_in_school'
                    }
                },
                'age': {
                    'histogram': {
                        'field': 'age',
                        'interval': 2
                    }
                }
            },
            # 'query': {'match_all': {}}
        }
        es_query = self.gen_es_query(self.request)
        body.update({'query': es_query})
        search_result = client.search(index='django', doc_type='student', body=body)
        context = super(HomePageView, self).get_context_data(**kwargs)
        context['hits'] = [
            self.convert_hit_to_template(c) for c in search_result['hits']['hits']
        ]
        context['aggregations'] = self.prepare_facet_data(
            search_result['aggregations'],
            self.request.GET
        )
        return context

    def convert_hit_to_template(self, hit1):
        hit = deepcopy(hit1)
        almost_ready = hit['_source']
        almost_ready['pk'] = hit['_id']
        return almost_ready

    def facet_url_args(self, url_args, field_name, field_value):
        is_active = False
        if url_args.get(field_name):
            base_list = url_args[field_name].split(',')
            if field_value in base_list:
                del base_list[base_list.index(field_value)]
                is_active = True
            else:
                base_list.append(field_value)
            url_args[field_name] = ','.join(base_list)
        else:
            url_args[field_name] = field_value
        return url_args, is_active

    def prepare_facet_data(self, aggregations_dict, get_args):
        resp = {}
        for area in aggregations_dict.keys():
            resp[area] = []
            if area == 'age':
                resp[area] = aggregations_dict[area]['buckets']
                continue
            for item in aggregations_dict[area]['buckets']:
                url_args, is_active = self.facet_url_args(
                    url_args=deepcopy(get_args.dict()),
                    field_name=area,
                    field_value=item['key']
                )
                resp[area].append({
                    'url_args': urlencode(url_args),
                    'name': item['key'],
                    'count': item['doc_count'],
                    'is_active': is_active
                })
        return resp

    def gen_es_query(self, request):
        req_dict = deepcopy(request.GET.dict())
        if not req_dict:
            return {'match_all': {}}
        filters = []
        for field_name in req_dict.keys():
            if '__' in field_name:
                filter_field_name = field_name.replace('__', '.')
            else:
                filter_field_name = field_name
            for field_value in req_dict[field_name].split(','):
                if not field_value:
                    continue
                filters.append(
                    {
                        'term': {filter_field_name: field_value},
                    }
                )
        return {
            'filtered': {
                'query': {'match_all': {}},
                'filter': {
                    'bool': {
                        'must': filters
                    }
                }
            }
        }

This is a subclass of TemplateView, which makes template rendering easier. The template_name attribute defines the template to be used; you can find it in the project/templates folder.

Let's describe it method-by-method:

  • get_context_data - here we generate (gen_es_query) and perform the Elasticsearch query. We also prepare the data so that it's easy to render in the template (prepare_facet_data and convert_hit_to_template).
  • gen_es_query - generates Elasticsearch filters from the GET parameters (see the example right after this list).
  • prepare_facet_data - converts the Elasticsearch aggregations dict into something that's easy to present in the template. We generate URL arguments so that the user can click a link to add or remove a filter; we also mark active filters and keep the bucket counts.
  • convert_hit_to_template - extracts the _source of each Elasticsearch document and adds the primary key to it. It's a way to write less in the templates.
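To make this concrete, here's what gen_es_query builds for a hypothetical request with two active filters (the parameter values below are invented for illustration):

# GET /?university__name=Foo+University&course_names=Algebra
# request.GET.dict() == {'university__name': 'Foo University',
#                        'course_names': 'Algebra'}
# gen_es_query(request) then returns this pre-2.x style "filtered" query:
{
    'filtered': {
        'query': {'match_all': {}},
        'filter': {
            'bool': {
                'must': [
                    {'term': {'university.name': 'Foo University'}},
                    {'term': {'course_names': 'Algebra'}},
                ]
            }
        }
    }
}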

The general workflow is as follows:

  1. The user goes to the index page.
  2. The user clicks some of the filters.
  3. The user is redirected to a new URL with the new GET parameters added.
  4. The view generates a new query from the GET parameters and renders the results in the template.
  5. The user sees the new data on the page.

As you've probably noticed, not a single query hit the underlying database: all the data in the template comes from Elasticsearch. Of course, various Django middlewares (sessions, for instance) may still perform some database queries.

Also, when a user types into the autocomplete field in the upper-right corner of the page and clicks one of the suggested students, they are redirected to that student's detail page, where we render data from the database. In my experience, this can be quite a handy way to handle the lag between data being updated in the database and data being updated in the search index: remember that Elasticsearch is near-realtime, not realtime. This way you can guarantee the user sees correct data even if the version in the search index is outdated.
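If you'd like to see the near-realtime behavior for yourself, here's a small experiment for the shell. It's a sketch: the document id and field values are made up, and it writes a throwaway document into the real index (which is why it cleans up after itself):

from django.conf import settings

es = settings.ES_CLIENT

# Index a throwaway document without an explicit refresh.
es.index(index='django', doc_type='student', id=999999,
         body={'first_name': 'Near', 'last_name': 'Realtime'})

# A search right away may not find it: by default the index is only
# refreshed about once per second.
result = es.search(index='django', doc_type='student',
                   body={'query': {'ids': {'values': [999999]}}})
print(result['hits']['total'])  # may still be 0

# After an explicit refresh, the document is guaranteed to be searchable.
es.indices.refresh(index='django')

# Clean up the throwaway document.
es.delete(index='django', doc_type='student', id=999999, refresh=True)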

To look at the template HTML/JavaScript code and play with it, you can visit <6796013>.

Keep the index up to date when data is added, updated, or deleted

In the previous part of this tutorial we pushed all the data to the Elasticsearch index in bulk. But data changes all the time: students update their properties, courses get added, and so on. We need to handle updates, and we have three options:

  1. Live updates: all the data is pushed to the index as soon as it's updated in the database, and the index is refreshed right after every insert/update/delete.
  2. Semi-live updates: same as #1, but the index is refreshed periodically.
  3. Periodic bulk updates: we mark documents as dirty until their database state is synchronized with Elasticsearch, which happens periodically (via a Celery or cron job, for instance).

As you can see, the Elasticsearch load per document decreases from #1 to #3, so as your project develops and your user base grows, you can move to a less realtime, less resource-hungry option. Since we have a relatively small project/MVP, we'll use the first option. Let's update the models.Student save and delete methods to keep the data in the index updated:

    def save(self, *args, **kwargs):
        # self.pk is None for objects that haven't been saved yet, so
        # capture it before the database write to tell creates from updates.
        is_update = self.pk is not None
        super(Student, self).save(*args, **kwargs)
        payload = self.es_repr()
        # The _id is passed to the client separately, not inside the document.
        del payload['_id']
        if is_update:
            es_client.update(
                index=self._meta.es_index_name,
                doc_type=self._meta.es_type_name,
                id=self.pk,
                refresh=True,
                body={
                    'doc': payload
                }
            )
        else:
            es_client.create(
                index=self._meta.es_index_name,
                doc_type=self._meta.es_type_name,
                id=self.pk,
                refresh=True,
                body=payload
            )

    def delete(self, *args, **kwargs):
        # delete() resets self.pk, so remember it for the index call.
        prev_pk = self.pk
        super(Student, self).delete(*args, **kwargs)
        es_client.delete(
            index=self._meta.es_index_name,
            doc_type=self._meta.es_type_name,
            id=prev_pk,
            refresh=True,
        )

As you can see, after the database save (the super call) we detect whether the object was created or updated and call the appropriate Elasticsearch client method; delete works the same way. We perform the standard database operation first and update the search index next. Note that we capture the state of self.pk before calling the original save or delete, because it changes: before a brand-new object is saved, self.pk is None, and after an object is deleted, self.pk is reset. To use the value, we have to read it before the database operation.
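You can verify the synchronization from python project/manage.py shell. A quick sketch (I change the age field here, since we know it from part one; any field will do):

from django.conf import settings
from core.models import Student

es = settings.ES_CLIENT

student = Student.objects.first()
student.age = 25  # any change will do
student.save()    # triggers the es_client.update() call above

doc = es.get(index='django', doc_type='student', id=student.pk)
print(doc['_source']['age'])  # 25 -- the index already reflects the change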

To make it easier to switch between realtime and periodic refreshes, let's move the refresh argument into the settings:

  1. Drop ES_AUTOREFRESH = True somewhere in project/conf/base.py.
  2. Everywhere you call Elasticsearch, switch the refresh argument to the settings-based value:
-refresh=True
+refresh=settings.ES_AUTOREFRESH

I have to note that not every API path that changes data in Django goes through the save and delete methods. If you perform mass updates, creates, or deletes, the index will not be updated; for those cases, I'd suggest the bulk API (see the sketch at the end of this section). Also, if a related model is updated, no index update will occur, because from the relational point of view the student rows haven't changed. To fix this, let's update University's save method:

    def save(self, *args, **kwargs):
        super(University, self).save(*args, **kwargs)
        # Propagate the change to every student document that embeds
        # this university, one update request per student.
        for student in self.student_set.all():
            data = student.field_es_repr('university')
            es_client.update(
                index=student._meta.es_index_name,
                doc_type=student._meta.es_type_name,
                id=student.pk,
                body={
                    'doc': {
                        'university': data
                    }
                }
            )

This is a naive approach: we update the university field for every related student, one request at a time. When I ran it for a University with 1,600 associated students, the update took 100 seconds, and there's also a chance of version conflicts. Since we're mainly interested in search performance, and university names presumably don't change that often, a method like this is acceptable when the number of related objects is low or when the functionality isn't exposed to the general public. Otherwise, we should use a bulk update.
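For reference, a bulk version might look something like the sketch below. It uses the bulk helper from the elasticsearch-py package; field_es_repr is the helper we defined in the previous part, and the function name itself is made up:

from elasticsearch import helpers

def update_students_in_bulk(university):
    # The embedded university data is identical for every student,
    # so compute it once from any related student.
    sample = university.student_set.first()
    if sample is None:
        return
    data = sample.field_es_repr('university')
    actions = (
        {
            '_op_type': 'update',  # partial update, like es_client.update
            '_index': student._meta.es_index_name,
            '_type': student._meta.es_type_name,
            '_id': student.pk,
            'doc': {'university': data},
        }
        for student in university.student_set.all()
    )
    helpers.bulk(es_client, actions)

This turns one request per student into one request per chunk of actions (500 by default), which is where the savings come from.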

If you'd like to investigate the state of the code after this section, please check out the dessert below.

Dessert

For dessert, I have two suggestions:

  1. The correct way to share the modules a project depends on is a requirements.txt file. I've added one to the repository. To install all the requirements, run pip install -r requirements.txt with your virtualenv activated.
  2. I've implemented the facets functionality in SQL as well (it may be a bit buggy, but it uses GROUP BY). Comparing the two kinds of queries on 10,000 students, I saw up to a tenfold speed increase. You can check this yourself: start the development server, go to the main page, click the filters a bit, and then add &sql=true to the URL in the address bar. You'll see the timings in the time panel of the Django Debug Toolbar, which is already installed as of the last commit (the head of the master branch).

Conclusion

In this series, we created a basic Django app and populated a database with automatically generated data. We also added data to the Elasticsearch index in bulk, wrote a basic command, and added a mapping to the index. In this final post, we added functional frontend items, wrote queries, enabled index updates, and discussed a bonus suggestion. Thanks for taking the time to read and participate. Let us know if you have any questions in the comments section below.
