In the previous posts, we created a basic Django app, and populated a database with automatically generated data. In this post, we will add data to the elasticsearch index in bulk, write a basic command, and add a mapping to the elasticsearch index.

Plan:

  1. Write a basic command.
  2. Add mapping to the elasticsearch index.
  3. Push all data to the index in bulk.
  4. Check results.

Write a basic command

First, let's add the push-to-index command:

cp project/apps/core/management/commands/dummy-data.py \
   project/apps/core/management/commands/push-to-index.py

Next, delete everything we wrote in the previous part of this tutorial from push-to-index.py. All we need for now is a dummy "handle" method.
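A minimal sketch of what push-to-index.py might look like at this point. The help text is a placeholder of mine, and the ImportError fallback is there only so the snippet can be executed outside a Django environment; inside the project, the plain import is enough.

```python
try:
    from django.core.management.base import BaseCommand
except ImportError:
    # Fallback so this sketch runs without Django installed;
    # inside the project, keep only the import above.
    class BaseCommand(object):
        pass


class Command(BaseCommand):
    help = "Push data to the Elasticsearch index."  # placeholder text

    def handle(self, *args, **options):
        # Dummy body for now; the real work is added in the next sections.
        pass
```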

Add mapping to the elasticsearch index

If you don't know what a mapping is, please read about it here. Elasticsearch has a dynamic mapping feature, so we could have omitted the explicit mapping definition. However, since "Explicit is better than implicit", we won't. An explicit mapping also gives you more control over the index structure, because you know the types of all fields. As Student is the main entity of this system, we'll store the mapping in the model, in core/models.py:

import django.db.models.options as options
from django.db import models

# Allow our custom Elasticsearch attributes in Meta.
options.DEFAULT_NAMES = options.DEFAULT_NAMES + (
    'es_index_name', 'es_type_name', 'es_mapping'
)


class Student(models.Model):
...
    class Meta:
        es_index_name = 'django'
        es_type_name = 'student'
        es_mapping = {
            'properties': {
                'university': {
                    'type': 'object',
                    'properties': {
                        'name': {'type': 'string', 'index': 'not_analyzed'},
                    }
                },
                'first_name': {'type': 'string', 'index': 'not_analyzed'},
                'last_name': {'type': 'string', 'index': 'not_analyzed'},
                'age': {'type': 'short'},
                'year_in_school': {'type': 'string'},
                'name_complete': {
                    'type': 'completion',
                    'analyzer': 'simple',
                    'payloads': True,
                    'preserve_separators': True,
                    'preserve_position_increments': True,
                    'max_input_length': 50,
                },
                "course_names": {
                    "type": "string", "store": "yes", "index": "not_analyzed",
                },
            }
        }

Here, we've added es_index_name, es_type_name, and es_mapping to the list of names allowed in a model's Meta class. By default, Django rejects unknown Meta options, which keeps anyone from accidentally shadowing a real option name. However, I don't think anyone will want to use es_index_name as a name for something internal to Django models.

Most of the index field names match the field names in the models, so they should look straightforward. I will describe only a few of them:

  • university: We plan to use this field in facets and filtering. We could've used university_name only, but we made it an object for two reasons: to mirror the fact that it's a distinct entity in the database, and to show how to put an object into the mapping.
  • course_names: Elasticsearch doesn't require arrays to be declared in the mapping, so we just declare a string there. The field will hold an array of strings.
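To make the shape of the mapping more concrete, here is a sketch of a document that would fit it. All values are made up for illustration; only the field names come from the mapping above.

```python
import json

# Hypothetical student document; the values are invented for illustration.
student_doc = {
    "university": {"name": "Example University"},
    "first_name": "Jane",
    "last_name": "Doe",
    "age": 21,
    "year_in_school": "SR",
    "name_complete": {
        "input": ["Jane", "Doe"],
        "output": "Jane Doe",
        "payload": {"pk": 1},
    },
    # Mapped as a plain string, but a list of strings is accepted as well.
    "course_names": ["Algebra", "Databases"],
}

# The document travels to Elasticsearch as JSON.
body = json.dumps(student_doc, sort_keys=True)
```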

Now, let's look at our command push-to-index.py:

from django.conf import settings
from django.core.management.base import BaseCommand
from elasticsearch.client import IndicesClient

from core.models import Student


class Command(BaseCommand):

    def handle(self, *args, **options):
        self.recreate_index()

    def recreate_index(self):
        indices_client = IndicesClient(client=settings.ES_CLIENT)
        index_name = Student._meta.es_index_name
        if indices_client.exists(index_name):
            indices_client.delete(index=index_name)
        indices_client.create(index=index_name)
        indices_client.put_mapping(
            doc_type=Student._meta.es_type_name,
            body=Student._meta.es_mapping,
            index=index_name
        )

In general, the method recreate_index does the following:

  1. Creates an instance of IndicesClient.
  2. Checks whether the index named index_name exists. If it does, deletes it.
  3. Creates a new index.
  4. Puts the mapping into that index.
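One way to see that this sequence is safe to repeat, without a running server, is to run it against an in-memory stand-in. FakeIndicesClient below is a hypothetical test double written for this sketch, not part of the elasticsearch package; it only mimics the four calls used above.

```python
class FakeIndicesClient(object):
    """Hypothetical in-memory stand-in for IndicesClient (test double)."""

    def __init__(self):
        self.indices = {}  # index name -> {doc type -> mapping}

    def exists(self, index):
        return index in self.indices

    def delete(self, index):
        del self.indices[index]

    def create(self, index):
        self.indices[index] = {}

    def put_mapping(self, doc_type, body, index):
        self.indices[index][doc_type] = body


def recreate_index(indices_client, index_name, type_name, mapping):
    # Same sequence as the command: drop if present, create, put mapping.
    if indices_client.exists(index_name):
        indices_client.delete(index=index_name)
    indices_client.create(index=index_name)
    indices_client.put_mapping(
        doc_type=type_name, body=mapping, index=index_name
    )


client = FakeIndicesClient()
mapping = {"properties": {"age": {"type": "short"}}}
recreate_index(client, "django", "student", mapping)
recreate_index(client, "django", "student", mapping)  # safe to repeat
```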

Now, let's add some details. First, on the first line we import IndicesClient. To be able to do this, you have to install the official Elasticsearch Python client. Also, as we're going to use requests as the transport layer to connect to the Elasticsearch API, we install it at the same time:

pip install elasticsearch requests

Second, in the recreate_index method, you can see that we access the ES_CLIENT setting. As it's not a standard Django setting, we have to add it to project/conf/base.py:

from elasticsearch import Elasticsearch, RequestsHttpConnection
ES_CLIENT = Elasticsearch(
    ['http://127.0.0.1:9200/'],
    connection_class=RequestsHttpConnection
)

If you'd like to connect to a different Elasticsearch server or tune the connection parameters for some other reason, take a look at the Elasticsearch class API documentation.
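For instance, the server address could be read from the environment instead of being hard-coded. This is only a sketch: ES_HOSTS is a hypothetical, comma-separated environment variable and es_hosts_from_env is a helper name invented here; neither belongs to Django or to the elasticsearch package.

```python
import os


def es_hosts_from_env(default="http://127.0.0.1:9200/"):
    """Return a list of Elasticsearch URLs from the ES_HOSTS variable.

    ES_HOSTS is a hypothetical, comma-separated environment variable.
    """
    raw = os.environ.get("ES_HOSTS", default)
    return [host.strip() for host in raw.split(",") if host.strip()]


# In project/conf/base.py the result could replace the hard-coded list:
# ES_CLIENT = Elasticsearch(
#     es_hosts_from_env(),
#     connection_class=RequestsHttpConnection
# )
```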

We can now launch the command with python project/manage.py push-to-index, then go to http://localhost:9200/django/ to look at the new index, or to http://localhost:9200/_mapping/ to review all mappings that exist on the server.

Push all data to the index in bulk

As I want you to be comfortable running this tutorial, I'm going to show you how to load data into the Elasticsearch index in bulk. This way you will spend much less time waiting for the data to appear in the index. General documentation for the low-level bulk helpers is here. The core idea of this section is to convert the data stored in the database to JSON and then send that JSON to the Elasticsearch server.

First, let's update the command:

from elasticsearch.helpers import bulk
class Command(BaseCommand):
    help = "My shiny new management command."
    def handle(self, *args, **options):
        self.recreate_index()
        self.push_db_to_index()
....
    def push_db_to_index(self):
        data = [
            self.convert_for_bulk(s, 'create') for s in Student.objects.all()
        ]
        bulk(client=settings.ES_CLIENT, actions=data, stats_only=True)
    def convert_for_bulk(self, django_object, action=None):
        data = django_object.es_repr()
        metadata = {
            '_op_type': action,
            "_index": django_object._meta.es_index_name,
            "_type": django_object._meta.es_type_name,
        }
        data.update(**metadata)
        return data

As you can see, I've added two methods here: push_db_to_index, which acts as a router (it calls the converter and uploads the data to the index), and convert_for_bulk, which adds some metadata (index name, action type, Elasticsearch type name) to each student (serialized to a Python dict by the Student.es_repr method).
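The list comprehension builds every action in memory before anything is sent. As far as I know, the bulk helper accepts any iterable of actions, so a generator is a possible alternative for larger tables. The sketch below uses plain dicts in place of Student.es_repr() output, and the name iter_bulk_actions is made up for this example.

```python
def iter_bulk_actions(documents, index_name, type_name, action="create"):
    """Yield one bulk action per document, adding the metadata keys."""
    for doc in documents:
        bulk_action = dict(doc)  # copy, so the input is left untouched
        bulk_action.update({
            "_op_type": action,
            "_index": index_name,
            "_type": type_name,
        })
        yield bulk_action


# Stand-ins for the dicts returned by Student.es_repr().
docs = [
    {"_id": 1, "first_name": "Jane", "last_name": "Doe"},
    {"_id": 2, "first_name": "John", "last_name": "Roe"},
]
actions = list(iter_bulk_actions(docs, "django", "student"))
```

In the command, the generator itself (without list()) could be passed as the actions argument of bulk.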

You've probably noticed already that there's no es_repr method in Student to do the serialization work. Of course, I could've used Django REST framework model-based serializers, but in that case I would've had the same model defined three times: as a Django model, as an Elasticsearch mapping (in the Meta class), and as a serializer. From my point of view, this duplication is unnecessary, as the only purpose of that serializer would be to work with Elasticsearch. But if you've already got an API, you can reuse its existing serializers for this. So, let's add some methods to the model:

    def es_repr(self):
        data = {}
        mapping = self._meta.es_mapping
        data['_id'] = self.pk
        for field_name in mapping['properties'].keys():
            data[field_name] = self.field_es_repr(field_name)
        return data
    def field_es_repr(self, field_name):
        config = self._meta.es_mapping['properties'][field_name]
        if hasattr(self, 'get_es_%s' % field_name):
            field_es_value = getattr(self, 'get_es_%s' % field_name)()
        else:
            if config['type'] == 'object':
                related_object = getattr(self, field_name)
                field_es_value = {}
                field_es_value['_id'] = related_object.pk
                for prop in config['properties'].keys():
                    field_es_value[prop] = getattr(related_object, prop)
            else:
                field_es_value = getattr(self, field_name)
        return field_es_value
    def get_es_name_complete(self):
        return {
            "input": [self.first_name, self.last_name],
            "output": "%s %s" % (self.first_name, self.last_name),
            "payload": {"pk": self.pk},
        }
    def get_es_course_names(self):
        if not self.courses.exists():
            return []
        return [c.name for c in self.courses.all()]

The main method here is es_repr. It reads the mapping from Meta and generates a representation of each field with field_es_repr. The other two methods (get_es_name_complete and get_es_course_names) generate representations of the fields I'm unable to serialize automatically. The field serialization process in field_es_repr works as follows:

  1. Get the field description from the mapping.
  2. If there's a method named get_es_{field name}, use it to get the field's value.
  3. If the field is an object, populate a dictionary directly from the attributes of the related object. We don't call es_repr on the related object, so that we avoid infinite recursion if the related object has a foreign key back to our model. I believe one should write a dedicated method to serialize the field in such a case.
  4. If it's not an object and there's no specially named method, just get the attribute from the model.
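The four steps can be tried out without a database. In the sketch below, plain Python classes stand in for the Django models, the mapping is trimmed to three fields, and all values are invented; the logic follows field_es_repr from the model.

```python
class University(object):
    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


class Student(object):
    # Trimmed copy of the mapping; plain classes stand in for Django models.
    es_mapping = {
        "properties": {
            "university": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
            },
            "first_name": {"type": "string"},
            "course_names": {"type": "string"},
        }
    }

    def __init__(self, pk, first_name, university, courses):
        self.pk = pk
        self.first_name = first_name
        self.university = university
        self.courses = courses

    def get_es_course_names(self):
        # A get_es_<field> hook takes precedence over automatic lookup.
        return list(self.courses)

    def field_es_repr(self, field_name):
        config = self.es_mapping["properties"][field_name]  # step 1
        hook = getattr(self, "get_es_%s" % field_name, None)
        if hook is not None:
            return hook()  # step 2
        if config["type"] == "object":  # step 3
            related = getattr(self, field_name)
            value = {"_id": related.pk}
            for prop in config["properties"]:
                value[prop] = getattr(related, prop)
            return value
        return getattr(self, field_name)  # step 4


student = Student(1, "Jane", University(7, "Example University"), ["Algebra"])
result = {
    name: student.field_es_repr(name)
    for name in Student.es_mapping["properties"]
}
```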

Now, let's launch our command:

python project/manage.py push-to-index

To check whether the data was pushed, go to http://localhost:9200/django/_stats and look at _all.primaries.docs.count.
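The same check can be scripted. This is a sketch using only the standard library (Python 3); the function names are mine, and the sample response is heavily trimmed and made up, just enough to show where the counter lives.

```python
import json
from urllib.request import urlopen


def docs_count(stats):
    """Return _all.primaries.docs.count from a parsed _stats response."""
    return stats["_all"]["primaries"]["docs"]["count"]


def fetch_docs_count(url="http://localhost:9200/django/_stats"):
    # Needs a running Elasticsearch server, so it is not called here.
    response = urlopen(url)
    return docs_count(json.loads(response.read().decode("utf-8")))


# Heavily trimmed, made-up response used only to exercise docs_count.
sample_stats = {"_all": {"primaries": {"docs": {"count": 1000, "deleted": 0}}}}
count = docs_count(sample_stats)
```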


[Image: django-stats-example.png]

So, as a result, we have data in our index. Let's use it!

If you're too lazy to type, you can check out the code state at this commit <7089ce7>.

Conclusion

In the previous posts, we created a basic Django app and populated a database with automatically generated data. In this post, we wrote a basic command, added a mapping to the Elasticsearch index, and pushed data to the index in bulk. In the next post, we will add some functional frontend items, write queries, allow the index to update, and discuss a bonus suggestion.

