BirdWatch is an open-source application (available on GitHub) created by Matthias Nehlsen that shows off some of the cool things developers can do with Elasticsearch. Tweets matching a selection of terms are consumed via the Twitter Streaming API and stored in Elasticsearch.

Tweets already stored can be searched by users, and search queries are logged. As new tweets are received, they are checked against active users’ search queries via Elasticsearch’s Percolate API, and matching tweets are sent to users in real time via Server-Sent Events. Paginated, sortable results are shown client-side, along with some slick SVG data visualizations of some properties of the result set’s text.

The server-side part of the application is written in Scala using the Play Framework. The client side is built with AngularJS, with D3.js rendering the text-analysis visualizations, and makes for a very impressive UI. Data storage and querying are handled by Elasticsearch.

You can read more about the application details in Matthias’ blog post about it (which I highly recommend), but here I’m just going to give a quick overview of the Elasticsearch queries involved.


Search Queries

If you type “elasticsearch” into the search box and hit enter, the following query is submitted (I “translated” it to a cURL command for concreteness):

curl -XPOST "http://birdwatch.qbox.io/birdwatch_tech/tweets/_search" -d '{
   "size":250,
   "from":0,
   "query":{
      "query_string":{
         "default_field":"text",
         "query":"(elasticsearch) AND lang:en",
         "default_operator":"AND"
      }
   },
   "sort":{
      "id":"desc"
   }
}'

This uses a query string query, which packs a lot of power into a small package. Because the default field is “text,” whatever we type in the search box is matched against that field. Because “lang:en” is appended, only results whose “lang” field equals “en” are returned; in other words, we only get tweets in English. Setting “default_operator” to “AND” means that terms separated only by spaces must all match.

Thanks to the composable nature of the query string query, we can type “elasticsearch AND (java OR scala)” in the search box, yielding a search command that looks like:

curl -XPOST "http://birdwatch.qbox.io/birdwatch_tech/tweets/_search" -d '{
   "size":250,
   "from":0,
   "query":{
      "query_string":{
         "default_field":"text",
         "query":"(elasticsearch AND (java OR scala)) AND lang:en",
         "default_operator":"AND"
      }
   },
   "sort":{
      "id":"desc"
   }
}'

This only begins to illustrate the power of the query string query syntax, but it’s enough to give you a taste of what’s possible.
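
To make the pattern in the two requests above explicit, here is a minimal Python sketch of how the search-box text could be wrapped into the request body. The function name build_search_body and its parameters are made up for illustration; BirdWatch itself is written in Scala, and its actual code may differ.

```python
import json


def build_search_body(user_query, size=250, offset=0, lang="en"):
    """Wrap raw search-box text in a query_string query limited to one language.

    Hypothetical helper: it mirrors the shape of the curl requests shown
    above, not BirdWatch's real implementation.
    """
    return {
        "size": size,
        "from": offset,
        "query": {
            "query_string": {
                "default_field": "text",
                # The parentheses keep the user's own AND/OR grouping intact
                # when the language filter is appended.
                "query": "({}) AND lang:{}".format(user_query, lang),
                "default_operator": "AND",
            }
        },
        # Newest tweets first, as in the requests above.
        "sort": {"id": "desc"},
    }


body = build_search_body("elasticsearch AND (java OR scala)")
print(json.dumps(body, indent=3))
```

Printing the body gives the same JSON that is posted to the _search endpoint in the second curl command.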


Percolation

One of the more interesting features of BirdWatch is its ability to show you new tweets matching a query you have already submitted, in real time. So if I search for “elasticsearch,” I will get all of the results that have already been saved, but if a new tweet comes in from the streaming API that matches my query, it will get pushed down to my browser without me having to submit a new search query!

This is accomplished using the Percolate API in Elasticsearch. Percolation is basically index querying in reverse. You register a query via the percolator, then later you post documents to the percolator, and it will tell you which queries match the documents. In other words, it will tell you the questions for which you have the answer (well, an answer).
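
If the “querying in reverse” idea feels abstract, a toy in-memory version may help. The sketch below is not Elasticsearch and does not use its API; ToyPercolator and the query IDs are invented for illustration, and a “query” here is nothing more than a set of words that must all appear in the document’s text.

```python
class ToyPercolator:
    """In-memory stand-in for the idea behind percolation (not Elasticsearch)."""

    def __init__(self):
        # query id -> set of lowercase terms that must all be present
        self.queries = {}

    def register(self, query_id, terms):
        """Store a query ahead of time, before any document arrives."""
        self.queries[query_id] = {t.lower() for t in terms}

    def percolate(self, doc):
        """Return the IDs of every stored query that the document satisfies."""
        words = set(doc["text"].lower().split())
        return sorted(
            qid for qid, terms in self.queries.items() if terms <= words
        )


p = ToyPercolator()
p.register("q-elasticsearch", ["elasticsearch"])
p.register("q-es-and-scala", ["elasticsearch", "scala"])
p.register("q-java", ["java"])

print(p.percolate({"text": "Elasticsearch is cool"}))
# prints ['q-elasticsearch']
```

The queries are stored first and the document arrives later, which is the reverse of a normal search, where the documents are stored and the query arrives later.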

When a search query is executed by BirdWatch, the query is also registered in Elasticsearch with the “_percolator” endpoint. You can name the query anything you want, but BirdWatch uses the SHA-256 hash of the query as its ID, to be sure that each unique query is only saved once. So to register the text query “elasticsearch,” the equivalent curl command is:

curl -XPUT "http://birdwatch.qbox.io/_percolator/queries/ae1b8c8635429669ca1987498e06e01008a886fdf4439c79a33bdcd4aca6f3b2" -d'
{
   "query": {
      "query_string": {
         "default_field": "text",
         "default_operator": "AND",
         "query": "(elasticsearch) AND lang:en"
      }
   }
}'
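
As a rough sketch of the ID scheme described above, the snippet below derives a SHA-256 ID from a query body in Python. Exactly which string BirdWatch hashes is not stated here, so hashing a canonical JSON serialization is an assumption, and the result should not be expected to reproduce the ID in the curl command. The name percolator_query_id is hypothetical.

```python
import hashlib
import json


def percolator_query_id(query_body):
    """Derive a stable hex ID for a query so identical queries share one ID.

    Assumption: we hash a canonical JSON form (sorted keys, no extra
    whitespace). BirdWatch may hash a different representation.
    """
    canonical = json.dumps(query_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


query = {
    "query": {
        "query_string": {
            "default_field": "text",
            "default_operator": "AND",
            "query": "(elasticsearch) AND lang:en",
        }
    }
}

query_id = percolator_query_id(query)
# The ID can then be used as the last path segment when registering the query.
print("/_percolator/queries/" + query_id)
```

Because the ID depends only on the query’s content, registering the same query twice overwrites the existing entry instead of creating a duplicate.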

Now that we’ve registered that query, let’s try percolating against it. Below we post a simple document (the actual tweet records are much more complex, but this is adequate for our purposes) against the appropriate index:

curl -XPOST "http://birdwatch.qbox.io/queries/tweets/_percolate" -d'
{
    "doc": {
        "text":"elasticsearch is cool",
        "lang":"en"
    }
}'

yielding the following result:

{
   "ok": true,
   "matches": [
      "684888c0ebb17f374298b65ee2807526c066094c701bcc7ebbe1c1095f494fc1",
      "ae1b8c8635429669ca1987498e06e01008a886fdf4439c79a33bdcd4aca6f3b2"
   ]
}

There were two matches, meaning that two registered queries would return the document we posted. The percolator returns only query IDs, but we can retrieve the details of a query with the following command:

curl -XGET "http://birdwatch.qbox.io/_percolator/queries/ae1b8c8635429669ca1987498e06e01008a886fdf4439c79a33bdcd4aca6f3b2"

which gives us our query back:

{
   "_index": "_percolator",
   "_type": "queries",
   "_id": "ae1b8c8635429669ca1987498e06e01008a886fdf4439c79a33bdcd4aca6f3b2",
   "_version": 312,
   "exists": true,
   "_source": {
      "query": {
         "query_string": {
            "default_field": "text",
            "default_operator": "AND",
            "query": "(elasticsearch) AND lang:en"
         }
      }
   }
}
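
To connect the percolator response back to the real-time feature: once the matching query IDs are known, the server has to work out which connected users are waiting on those queries. The Python sketch below shows one way that lookup could be done. The subscribers mapping, the client names, and route_matches are all hypothetical and are not taken from the BirdWatch code.

```python
def route_matches(percolate_response, subscribers):
    """Return {client_id: [matched query ids]} for one percolated tweet.

    subscribers maps a query ID to the clients currently watching it.
    Both the mapping and the client IDs are illustrative assumptions.
    """
    deliveries = {}
    for query_id in percolate_response.get("matches", []):
        # Matched queries with no active watcher are simply skipped.
        for client_id in subscribers.get(query_id, ()):
            deliveries.setdefault(client_id, []).append(query_id)
    return deliveries


ES_QUERY_ID = "ae1b8c8635429669ca1987498e06e01008a886fdf4439c79a33bdcd4aca6f3b2"
OTHER_QUERY_ID = "684888c0ebb17f374298b65ee2807526c066094c701bcc7ebbe1c1095f494fc1"

response = {"ok": True, "matches": [OTHER_QUERY_ID, ES_QUERY_ID]}
subscribers = {ES_QUERY_ID: ["client-a", "client-b"]}

deliveries = route_matches(response, subscribers)
for client_id, query_ids in deliveries.items():
    print(client_id, "should receive the tweet for", len(query_ids), "query")
```

In this example only the “elasticsearch” query has active watchers, so the tweet goes to client-a and client-b, and the other matching query is ignored.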

BirdWatch is open source and available on GitHub. The instance I’ve used for this blog post is running on a dedicated cluster at https://qbox.io.