In this series, we are creating a Django app with integrated Elasticsearch-based search. The previous article focused on creating a basic Django app. Since we cannot show off Elasticsearch features without data to search, in this post we will populate the database with automatically generated data.

Plan

  1. Create a command.
  2. Populate Universities.
  3. Populate Students.
  4. Populate Courses.
  5. Make a command reusable.

First, let's create a command via django-extensions:

python project/manage.py create_command core

Now there's a management folder in the core app with the following structure:

project/apps/core/management/
├── commands
│   ├── __init__.py
│   └── sample.py
└── __init__.py

sample.py is a sample command. Since "sample" is not a meaningful name for our command, let's rename it:

mv project/apps/core/management/commands/sample.py \
   project/apps/core/management/commands/dummy-data.py

Next, edit dummy-data.py: add a print statement to the handle method. This way we can check that the command launches:

python project/manage.py dummy-data

You should see whatever you put in the print statement. Now, let's generate universities (listing of project/apps/core/management/commands/dummy-data.py):

from django.core.management.base import BaseCommand
from model_mommy import mommy

from core.models import University, Course, Student


class Command(BaseCommand):
    help = "My shiny new management command."

    def handle(self, *args, **options):
        # Temporary output to confirm that the command launches.
        print('lala')
        self.make_universities()

    def make_universities(self):
        university_names = (
            'MIT', 'MGU', 'CalTech', 'KPI', 'DPI', 'PSTU'
        )
        # Keep the created objects in memory for later use.
        self.universities = []
        for name in university_names:
            uni = mommy.make(University, name=name)
            self.universities.append(uni)

Here we use model_mommy, which is a great library for generating test data. It lets you specify only the fields you are interested in rather than every field in the model. All other fields are filled with dummy data. This way you don't have to worry that a command or test will stop working after you change your schema a bit. Install it with pip install model_mommy.
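
To make the idea concrete, here is a toy, Django-free sketch of "specify only the fields you care about". It is not how model_mommy is implemented; the make helper, the dataclass, and the city and founded fields are all hypothetical and exist only for illustration:

```python
import random
import string
from dataclasses import dataclass, fields


@dataclass
class University:
    # Hypothetical fields; the real model lives in core.models.
    name: str
    city: str
    founded: int


def make(cls, **overrides):
    """Toy stand-in for the idea behind mommy.make (not its real code)."""
    values = {}
    for field in fields(cls):
        if field.name in overrides:
            # The caller specified this field explicitly.
            values[field.name] = overrides[field.name]
        elif field.type in (int, 'int'):
            # Unspecified integer field: random number.
            values[field.name] = random.randint(0, 9999)
        else:
            # Any other unspecified field: random 10-letter string.
            values[field.name] = ''.join(
                random.choice(string.ascii_letters) for _ in range(10)
            )
    return cls(**values)


uni = make(University, name='MIT')
print(uni.name)  # MIT; city and founded hold random dummy values
```

If a field is later added to the dataclass, the make(University, name='MIT') call keeps working unchanged, which is the property the paragraph above describes.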

Now, let's add courses:

class Command(BaseCommand):
    def handle(self, *args, **options):
        self.make_universities()
        self.make_courses()
...
    def make_courses(self):
        template_options = ['CS%s0%s', 'MATH%s0%s', 'CHEM%s0%s', 'PHYS%s0%s']
        self.courses = []
        for num in range(1, 4):
            for course_num in range(1, 4):
                for template in template_options:
                    name = template % (course_num, num)
                    course = mommy.make(Course, name=name)
                    self.courses.append(course)

As you can see, we store courses and universities in the Command's attributes. We've done this to decrease the number of SQL queries: it's much cheaper to get an object from memory than from the database when we create Students and populate their relations.
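
To see which names the nested loops in make_courses produce, the name generation can be run on its own, without Django. This is just the loop from the listing above, extracted into a standalone function (the function name course_names is an assumption made for this sketch):

```python
template_options = ['CS%s0%s', 'MATH%s0%s', 'CHEM%s0%s', 'PHYS%s0%s']


def course_names():
    """Build course names the same way make_courses does."""
    result = []
    for num in range(1, 4):
        for course_num in range(1, 4):
            for template in template_options:
                result.append(template % (course_num, num))
    return result


generated = course_names()
print(len(generated))  # 36 (4 templates x 3 x 3)
print(generated[:4])   # ['CS101', 'MATH101', 'CHEM101', 'PHYS101']
```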

Now, let's add students. We will do that in bulk so that we're able to make a single query. This will decrease command running time. Also it's worth mentioning that:

  1. We pick the university for each student at random.
  2. We use the names library to generate human-like names. Don't forget to install it with pip.
  3. We've added a command-line argument (positional and required) to specify the number of students to create (look at the add_arguments method).

import random
import names
...
class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument('count', nargs=1, type=int)
    def handle(self, *args, **options):
        self.make_universities()
        self.make_courses()
        self.make_students(options)
...
    def make_students(self, options):
        self.students = []
        for _ in range(options.get('count')[0]):
            stud = mommy.prepare(
                Student,
                university=random.choice(self.universities),
                first_name=names.get_first_name(),
                last_name=names.get_last_name(),
                age=random.randint(17, 25)
            )
            self.students.append(stud)
        Student.objects.bulk_create(self.students)
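
The parser passed to add_arguments is based on Python's argparse, so the effect of nargs=1 can be checked with the standard library alone. The sketch below (parser names are made up for the example) shows why make_students reads options.get('count')[0]: with nargs=1 the value arrives as a one-item list.

```python
import argparse

# Same declaration as in add_arguments above.
parser = argparse.ArgumentParser(prog='dummy-data')
parser.add_argument('count', nargs=1, type=int)
options = vars(parser.parse_args(['10000']))
print(options['count'])     # [10000]
print(options['count'][0])  # 10000

# Without nargs, argparse stores a plain int instead of a list.
plain_parser = argparse.ArgumentParser(prog='dummy-data')
plain_parser.add_argument('count', type=int)
plain_options = vars(plain_parser.parse_args(['10000']))
print(plain_options['count'])  # 10000
```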

As a last step, we need to link students with courses. We couldn't do this while we were creating the Students: the ManyToMany course <--> student relation maps to a database table that consists of two ForeignKeys. One points to a row in the courses table, and the other to a row in the students table. Until the students were saved, the value of one of these ForeignKeys was not known, so we couldn't insert rows into that table. Anyway, here's the change:

    def handle(self, *args, **options):
        self.clear()
        self.make_universities()
        self.make_courses()
        self.make_students(options)
        self.connect_courses()
...
    def connect_courses(self):
        ThroughModel = Student.courses.through
        stud_courses = []
        for student_id in Student.objects.values_list('pk', flat=True):
            courses_already_linked = []
            for _ in range(random.randint(1, 10)):
                index = random.randint(0, len(self.courses) - 1)
                if index not in courses_already_linked:
                    courses_already_linked.append(index)
                else:
                    continue
                stud_courses.append(
                    ThroughModel(
                        student_id=student_id,
                        course_id=self.courses[index].pk
                    )
                )
        ThroughModel.objects.bulk_create(stud_courses)

As you can see here, we get the intermediate table as ThroughModel and then add courses for each student. It's a bulk insert, just like we had before. Because courses are selected randomly, the same course could end up twice in a student's list of courses to add, and that raises a database-related error on insert. To avoid it, we check whether the index of the new course is already in courses_already_linked. If it is, we don't add it again and skip to the next loop iteration.
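
As an alternative to tracking duplicates by hand, the standard library's random.sample picks distinct items in one call. This is a different technique from the one in the listing above, shown only as a sketch; the pick_course_indexes helper and its parameters are hypothetical:

```python
import random


def pick_course_indexes(course_total, max_courses=10):
    """Return between 1 and max_courses distinct course indexes."""
    how_many = random.randint(1, min(max_courses, course_total))
    # random.sample never returns the same element twice.
    return random.sample(range(course_total), how_many)


indexes = pick_course_indexes(36)
print(len(indexes) == len(set(indexes)))  # True: no duplicates
```

Note the slight difference in behavior: the loop in connect_courses makes up to ten attempts and drops repeats, so a student may end up with fewer courses than attempts, while random.sample returns exactly the requested number.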

Finally, I'd like to clear the tables at the beginning of the command's execution flow. That way we have exactly the specified number of students after the command runs:

    def handle(self, *args, **options):
        self.clear()
        self.make_universities()
...
    def clear(self):
        Student.objects.all().delete()
        University.objects.all().delete()
        Course.objects.all().delete()

As you can see, we delete all existing objects before creating new ones. This way we don't need to check whether a University or Course already exists, which means fewer queries to the database and a shorter overall execution time for the command. As a last step, let's run the command:

python project/manage.py dummy-data 10000

This will create 10,000 Students. You can check that they were created in the Django shell. I prefer the shell_plus command provided by django-extensions:

python project/manage.py shell_plus

Here you can count the number of Students:

In [1]: Student.objects.count()
Out[1]: 10000

Conclusion

In this post, we populated a database with automatically generated data. In the next one, we will add data to the Elasticsearch index in bulk, write a basic command, and add a mapping to the Elasticsearch index.