How To Use Elasticsearch With Python and Django ( Part 3 )
Posted by Alex Alex March 31, 2016
In the previous posts, we created a basic Django app and populated a database with automatically generated data. In this post, we will write a basic command, add a mapping to the Elasticsearch index, and add data to the index in bulk.
Plan:
- Write a basic command.
- Add mapping to the elasticsearch index.
- Push all data to the index in bulk.
- Check results.
Write a basic command
First, let’s add the push-to-index command:
cp project/apps/core/management/commands/dummy-data.py project/apps/core/management/commands/push-to-index.py
Next, delete everything we wrote in the previous part of this tutorial from push-to-index.py. We just need a dummy “handle” method.
Add mapping to the elasticsearch index
If you don’t know what a mapping is, please read about it here. Elasticsearch has a dynamic mapping feature, so we could have omitted the explicit mapping definition. However, since “Explicit is better than implicit”, we won’t. An explicit mapping also gives you more control over the index structure, because you know the types of all fields.
As Student is the main entity of this system, we’ll store the mapping in the model, in core/models.py:
import django.db.models.options as options

options.DEFAULT_NAMES = options.DEFAULT_NAMES + (
    'es_index_name', 'es_type_name', 'es_mapping'
)


class Student(models.Model):
    ...

    class Meta:
        es_index_name = 'django'
        es_type_name = 'student'
        es_mapping = {
            'properties': {
                'university': {
                    'type': 'object',
                    'properties': {
                        'name': {'type': 'string', 'index': 'not_analyzed'},
                    }
                },
                'first_name': {'type': 'string', 'index': 'not_analyzed'},
                'last_name': {'type': 'string', 'index': 'not_analyzed'},
                'age': {'type': 'short'},
                'year_in_school': {'type': 'string'},
                'name_complete': {
                    'type': 'completion',
                    'analyzer': 'simple',
                    'payloads': True,
                    'preserve_separators': True,
                    'preserve_position_increments': True,
                    'max_input_length': 50,
                },
                "course_names": {
                    "type": "string",
                    "store": "yes",
                    "index": "not_analyzed",
                },
            }
        }
Here, we’ve added es_index_name, es_type_name, and es_mapping to the list of names allowed in a model’s Meta class. Django rejects any Meta attribute that is not in this list, which keeps custom attributes from shadowing a real option name. However, I don’t think anyone will want to use es_index_name as a name for something internal to Django models.
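To make the effect of extending DEFAULT_NAMES more concrete, here is a minimal sketch of the kind of check involved. It is a simplified stand-in written for this post, not Django’s actual code: the check_meta function and the abbreviated name list are assumptions made for illustration.

```python
# Simplified stand-in for the Meta validation; NOT Django's real implementation.
DEFAULT_NAMES = ("db_table", "ordering", "verbose_name")  # abbreviated on purpose


def check_meta(meta_attrs, allowed=DEFAULT_NAMES):
    """Raise TypeError if the Meta attributes contain a name that is not allowed."""
    unknown = sorted(set(meta_attrs) - set(allowed))
    if unknown:
        raise TypeError(
            "'class Meta' got invalid attribute(s): %s" % ",".join(unknown)
        )


meta = {"ordering": ["last_name"], "es_index_name": "django"}

# With the stock list, the custom attribute is rejected.
try:
    check_meta(meta)
    rejected = False
except TypeError:
    rejected = True

# After extending the tuple, as the tutorial does, the same Meta is accepted.
EXTENDED_NAMES = DEFAULT_NAMES + ("es_index_name", "es_type_name", "es_mapping")
check_meta(meta, allowed=EXTENDED_NAMES)
```

The point is only that the tuple acts as a whitelist, so adding our three names to it is what lets the Meta class carry them.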
As most of the index field names match the field names in the model, they should look straightforward. I will describe only a few of them:
- university: We plan to use this field in facets and filtering. We could’ve used university_name only, but we made it an object for two reasons: to mirror the fact that it’s really a distinct entity in the database, and to show how to put an object into the mapping.
- course_names: As Elasticsearch doesn’t require arrays to be declared, we just declare a string there. Any field can hold multiple values, so this will be an array of strings.
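For orientation, a document shaped by this mapping might look roughly like the sketch below. Every value is made up for illustration (including the "JR" code for year_in_school), so treat it as an assumption about the data rather than real output.

```python
import json

# Hypothetical student document; all values are invented for illustration.
student_doc = {
    "university": {"name": "Example University"},  # 'object' field
    "first_name": "Jane",
    "last_name": "Doe",
    "age": 21,
    "year_in_school": "JR",  # assumed code, not taken from the tutorial data
    "name_complete": {
        "input": ["Jane", "Doe"],
        "output": "Jane Doe",
        "payload": {"pk": 1},
    },
    # Mapped as a plain string, yet it holds a list of strings.
    "course_names": ["Algebra", "Physics"],
}

body = json.dumps(student_doc, indent=2, sort_keys=True)
print(body)
```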
Now, let’s look at our command, push-to-index.py:
from elasticsearch.client import IndicesClient
from django.conf import settings
from django.core.management.base import BaseCommand
from core.models import Student


class Command(BaseCommand):

    def handle(self, *args, **options):
        self.recreate_index()

    def recreate_index(self):
        indices_client = IndicesClient(client=settings.ES_CLIENT)
        index_name = Student._meta.es_index_name
        if indices_client.exists(index_name):
            indices_client.delete(index=index_name)
        indices_client.create(index=index_name)
        indices_client.put_mapping(
            doc_type=Student._meta.es_type_name,
            body=Student._meta.es_mapping,
            index=index_name
        )
In general, the method recreate_index does the following:
- Creates an instance of IndicesClient.
- Checks whether the index named index_name exists. If it does, it is deleted.
- Creates a new index.
- Puts the mapping to that index.
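If you want to see the order of these steps without a running server, one option is to run the same logic against a fake client. FakeIndicesClient below is a hypothetical in-memory stand-in written for this sketch; it only records calls and is not part of the elasticsearch package.

```python
class FakeIndicesClient:
    """Hypothetical stand-in for IndicesClient; it only records the calls made."""

    def __init__(self, existing=()):
        self.indices = set(existing)
        self.calls = []

    def exists(self, index):
        self.calls.append("exists")
        return index in self.indices

    def delete(self, index):
        self.calls.append("delete")
        self.indices.discard(index)

    def create(self, index):
        self.calls.append("create")
        self.indices.add(index)

    def put_mapping(self, doc_type, body, index):
        self.calls.append("put_mapping")


def recreate_index(client, index_name, doc_type, mapping):
    # Same sequence as the command: drop the index if present, create, map.
    if client.exists(index_name):
        client.delete(index=index_name)
    client.create(index=index_name)
    client.put_mapping(doc_type=doc_type, body=mapping, index=index_name)


fresh = FakeIndicesClient()
recreate_index(fresh, "django", "student", {"properties": {}})

stale = FakeIndicesClient(existing=["django"])
recreate_index(stale, "django", "student", {"properties": {}})
```

Here fresh.calls skips the delete step, while stale.calls includes it, which is the whole difference between a first run and a re-run of the command.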
Now, let’s add some details. First, on the first line we import IndicesClient. To be able to do this, you have to install the official Elasticsearch Python client. Also, as we’re going to use requests as the transport layer to connect to the Elasticsearch API, we install it at the same time:
pip install elasticsearch requests
Second, in the recreate_index method, you can see we’ve accessed the ES_CLIENT setting. As it’s not a standard Django setting, we have to add it to project/conf/base.py:
from elasticsearch import Elasticsearch, RequestsHttpConnection

ES_CLIENT = Elasticsearch(
    ['http://127.0.0.1:9200/'],
    connection_class=RequestsHttpConnection
)
In case you’d like to connect to a different Elasticsearch server, or play with the connection parameters for some other reason, take a look at the Elasticsearch class API documentation.
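One simple way to point the project at a different server is to read the host list from the environment. The sketch below assumes a hypothetical ES_HOSTS variable holding comma-separated URLs; neither the variable nor the es_hosts helper is part of the tutorial project.

```python
import os


def es_hosts(environ=os.environ, default="http://127.0.0.1:9200/"):
    """Return a list of host URLs from the hypothetical ES_HOSTS variable."""
    raw = environ.get("ES_HOSTS", default)
    # Split on commas and drop empty entries and surrounding whitespace.
    return [host.strip() for host in raw.split(",") if host.strip()]


# With no variable set, we fall back to the local server used in this tutorial.
hosts = es_hosts({})
```

The resulting list could then be passed as the first argument to Elasticsearch(...) in place of the hard-coded one.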
We can launch the command with python project/manage.py push-to-index and then go to http://localhost:9200/django/ to look at the new index, or to http://localhost:9200/_mapping/ to review all mappings that exist on the server.
Push all data to the index in bulk
As I want you to be comfortable running this tutorial, I’m going to show you how to load data into the Elasticsearch index in bulk. This way you will spend much less time waiting for the data to reach the index. General documentation for the low-level bulk helpers is here. The core idea of this section is to convert database-stored data to JSON and then send that JSON to the Elasticsearch server.
First, let’s update the command:
from elasticsearch.helpers import bulk


class Command(BaseCommand):
    help = "My shiny new management command."

    def handle(self, *args, **options):
        self.recreate_index()
        self.push_db_to_index()

    ....

    def push_db_to_index(self):
        data = [
            self.convert_for_bulk(s, 'create') for s in Student.objects.all()
        ]
        bulk(client=settings.ES_CLIENT, actions=data, stats_only=True)

    def convert_for_bulk(self, django_object, action=None):
        data = django_object.es_repr()
        metadata = {
            '_op_type': action,
            "_index": django_object._meta.es_index_name,
            "_type": django_object._meta.es_type_name,
        }
        data.update(**metadata)
        return data
As you can see, I’ve added two methods here: push_db_to_index, which acts as a router (it calls the converter and uploads the data to the index), and convert_for_bulk, which adds some metadata (index name, action type, Elasticsearch type name) to each student (serialized to a Python dict in the Student.es_repr method).
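To show what the bulk helper actually receives, here is a small sketch that builds the action dicts from plain dictionaries. The iter_actions function and the sample rows are made up for this example. It also uses a generator instead of a list, on the assumption that you may not want every action in memory at once; the bulk helper accepts any iterable of actions.

```python
def iter_actions(documents, index, doc_type, op_type="create"):
    """Yield one bulk action per document, leaving the input dicts untouched."""
    for doc in documents:
        action = dict(doc)  # copy, so the metadata does not leak into the input
        action.update({"_op_type": op_type, "_index": index, "_type": doc_type})
        yield action


# Hypothetical serialized students, standing in for Student.es_repr() results.
rows = [
    {"_id": 1, "first_name": "Jane", "last_name": "Doe"},
    {"_id": 2, "first_name": "John", "last_name": "Roe"},
]

actions = list(iter_actions(rows, index="django", doc_type="student"))
```

Each action is just the document plus a few underscore-prefixed metadata keys, which is what convert_for_bulk produces as well.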
You’ve probably noticed already that there’s no es_repr method in Student to do the serialization work. Of course, I could’ve used Django REST framework model-based serializers, but in that case I would’ve had the same model defined three times: as a Django model, as an Elasticsearch mapping (in the Meta class), and as a serializer. From my point of view, this duplication is unnecessary, as the only purpose of such a serializer would be to work with Elasticsearch. But if you’ve got an API already, you can reuse existing serializers for this. So, let’s add some methods to the model:
def es_repr(self):
    data = {}
    mapping = self._meta.es_mapping
    data['_id'] = self.pk
    for field_name in mapping['properties'].keys():
        data[field_name] = self.field_es_repr(field_name)
    return data

def field_es_repr(self, field_name):
    config = self._meta.es_mapping['properties'][field_name]
    if hasattr(self, 'get_es_%s' % field_name):
        field_es_value = getattr(self, 'get_es_%s' % field_name)()
    else:
        if config['type'] == 'object':
            related_object = getattr(self, field_name)
            field_es_value = {}
            field_es_value['_id'] = related_object.pk
            for prop in config['properties'].keys():
                field_es_value[prop] = getattr(related_object, prop)
        else:
            field_es_value = getattr(self, field_name)
    return field_es_value

def get_es_name_complete(self):
    return {
        "input": [self.first_name, self.last_name],
        "output": "%s %s" % (self.first_name, self.last_name),
        "payload": {"pk": self.pk},
    }

def get_es_course_names(self):
    if not self.courses.exists():
        return []
    return [c.name for c in self.courses.all()]
The main method here is es_repr. It gets the mapping from Meta and generates a representation of each field with field_es_repr. The other two methods (get_es_name_complete and get_es_course_names) generate representations of the fields I’m unable to serialize automatically. The field serialization process in field_es_repr is:
- Get the field description from the mapping.
- In case there’s a method named get_es_{field name}, use it to get the field’s value.
- In case it’s an object, we just populate a dictionary directly from attributes of the related object. We don’t call es_repr of the related object, so that we don’t get infinite recursion if there’s a foreign key from the related object back to our model. I believe one should write a dedicated method to serialize a field in such a case.
- In case it’s not an object, and there’s no specially named method, we just get the attribute from the model.
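The dispatch described above can be tried without Django at all. The sketch below uses plain Python classes and a reduced mapping, both invented for this example, so it only illustrates the three branches and is not the tutorial’s model code.

```python
class University:
    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


class Student:
    # Reduced, hypothetical mapping; the real one lives in Meta.es_mapping.
    es_mapping = {
        "properties": {
            "university": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
            },
            "first_name": {"type": "string"},
            "course_names": {"type": "string"},
        }
    }

    def __init__(self, pk, first_name, university, courses):
        self.pk = pk
        self.first_name = first_name
        self.university = university
        self.courses = courses

    def get_es_course_names(self):
        # Custom getter: takes priority over the generic branches below.
        return list(self.courses)

    def field_es_repr(self, field_name):
        config = self.es_mapping["properties"][field_name]
        getter = getattr(self, "get_es_%s" % field_name, None)
        if getter is not None:
            return getter()
        if config["type"] == "object":
            # Copy attributes directly instead of calling es_repr recursively.
            related = getattr(self, field_name)
            value = {"_id": related.pk}
            for prop in config["properties"]:
                value[prop] = getattr(related, prop)
            return value
        return getattr(self, field_name)

    def es_repr(self):
        data = {"_id": self.pk}
        for field_name in self.es_mapping["properties"]:
            data[field_name] = self.field_es_repr(field_name)
        return data


student = Student(1, "Jane", University(7, "Example University"), ["Algebra"])
result = student.es_repr()
```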
Now, let’s launch our command:
python project/manage.py push-to-index
To check whether the data was updated, go to http://localhost:9200/django/_stats and look at _all.primaries.docs.count.
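If you’d rather check the number from code, the value can be picked out of the parsed JSON with a few dictionary lookups. The sample payload below is a trimmed, made-up response (the real one has many more keys), and docs_count is a helper invented for this sketch.

```python
import json

# Trimmed, illustrative _stats response body; the count is invented.
sample_body = '{"_all": {"primaries": {"docs": {"count": 1000, "deleted": 0}}}}'


def docs_count(stats):
    """Return _all.primaries.docs.count from a parsed _stats response."""
    return stats["_all"]["primaries"]["docs"]["count"]


count = docs_count(json.loads(sample_body))
```

With a running server you would feed this function the JSON returned by the _stats URL above instead of the sample string.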
So, as a result, we have data in our index. Let’s use it!
If you’re too lazy to type, you can check out the state of the code at this commit <7089ce7>.
Conclusion
In the previous posts, we created a basic Django app and populated a database with automatically generated data. In this post, we wrote a basic command, added a mapping to the Elasticsearch index, and added data to the index in bulk. In the next post, we will add some functional frontend items, write queries, allow the index to update, and discuss a bonus suggestion.