While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database. Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Our Goal

Qbox provides a turnkey solution for Elasticsearch, Kibana, and many of the Elasticsearch analysis and monitoring plugins. The goal of this tutorial is to use Qbox to demonstrate fetching large chunks of data using Scan and Scroll requests. We set up Logstash on a separate node/machine to gather the Twitter stream and use the Qbox-provisioned Elasticsearch to explore the powerful Scan and Scroll API.

Our ELK stack setup has three main components: 

  • Elasticsearch: Stores all of the application and monitoring logs (provisioned by Qbox).

  • Logstash: The server component that processes incoming logs and feeds them to Elasticsearch.

  • Kibana (optional): A web interface for searching and visualizing logs (provisioned by Qbox).

Prerequisites

The amount of CPU, RAM, and storage that your Elasticsearch Server will require depends on the volume of logs that you intend to gather. For this tutorial, we will be using a Qbox provisioned Elasticsearch with the following minimum specs:

  • Provider: AWS

  • Version: 5.1.1

  • RAM: 1 GB

  • CPU: 1 vCPU

  • Replicas: 0

The above specs can be changed per your requirements. Please select the appropriate names, versions, and regions for your needs. For this example, we used Elasticsearch version 5.1.1; the most current version is 5.3. We support all versions of Elasticsearch on Qbox. (To learn more about the major differences between 2.x and 5.x, click here.)

In addition to our Elasticsearch server, we will need a separate Logstash server to process the incoming Twitter stream from the Twitter API and ship it to Elasticsearch. For simplicity and testing purposes, the Logstash server can also act as the client server itself. The endpoint and transport addresses for our Qbox-provisioned Elasticsearch cluster are as follows:


Endpoint: REST API

https://ec18487808b6908009d3:efcec6a1e0@eb843037.qb0x.com:32563

Authentication

  • Username = ec18487808b6908009d3

  • Password = efcec6a1e0

TRANSPORT (NATIVE JAVA)

eb843037.qb0x.com:30543

Note: Please make sure to whitelist the Logstash server IP in your Qbox Elasticsearch cluster.
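Before moving on, it's worth verifying that the cluster is reachable from the client/Logstash machine. A quick sanity check against the REST endpoint (substitute your own credentials and endpoint) is to request the cluster health:

curl -u ec18487808b6908009d3:efcec6a1e0 'https://eb843037.qb0x.com:32563/_cluster/health?pretty'

A green or yellow status in the response means the cluster is up and accepting requests.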

Install Logstash

Download and install the Public Signing Key:

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

We will use Logstash version 2.4.x, which is compatible with our Elasticsearch version 5.1.x. Refer to the Elastic Community Product Support Matrix to clear up any version compatibility questions.

Add the repository definition to your /etc/apt/sources.list file:

echo "deb https://packages.elastic.co/logstash/2.4/debian stable main" | sudo tee -a /etc/apt/sources.list

Refresh the package index so the repository is ready for use, then install Logstash:

sudo apt-get update && sudo apt-get install logstash

Alternatively, the Logstash tarball can be downloaded from the Elastic Product Releases site. The steps for setting up and running Logstash are then quite simple (a download sketch follows the list below):

  • Download and unzip Logstash

  • Prepare a logstash.conf config file

  • Run bin/logstash -f logstash.conf -t to check the config

  • Run bin/logstash -f logstash.conf
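As a rough sketch of the first step (the exact version and download URL may differ; check the Elastic Product Releases site for the archive that matches your setup):

wget https://download.elastic.co/logstash/logstash/logstash-2.4.1.tar.gz
tar -xzf logstash-2.4.1.tar.gz
cd logstash-2.4.1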

Configure Logstash (Twitter Stream)

Logstash configuration files use a JSON-like format and reside in /etc/logstash/conf.d. The configuration consists of three sections: inputs, filters, and outputs.

We need to be authorized to take data from Twitter via its API. This part is easy:

  1. Login to your Twitter account

  2. Go to https://dev.twitter.com/apps/

  3. Create a new Twitter application (here we name the app Twitter-Qbox-Stream).


After you successfully create the Twitter application, you get the following parameters in "Keys and Access Tokens":

  1. Consumer Key (API Key)

  2. Consumer Secret (API Secret)

  3. Access Token

  4. Access Token Secret


We are now ready to create the Twitter data path (stream) from Twitter servers to our machine. We will use the above four parameters (consumer key, consumer secret, access token, access token secret) to configure the twitter input for Logstash.

Let's create a configuration file called 02-twitter-input.conf and set up our "twitter" input:

sudo vi /etc/logstash/conf.d/02-twitter-input.conf

Insert the following input configuration:

input {
 twitter {
   consumer_key => "BCgpJwYPDjXXXXXX80JpU0"
   consumer_secret => "Eufyx0RxslO81jpRuXXXXXXXMlL8ysLpuHQRTb0Fvh2"
   keywords => ["mobile", "java", "android", "elasticsearch", "search"]
   oauth_token => "193562229-o0CgXXXXXXXX0e9OQOob3Ubo0lDj2v7g1ZR"
   oauth_token_secret => "xkb6I4JJmnvaKv4WXXXXXXXXS342TGO6y0bQE7U"
 }
}

Save and quit the file 02-twitter-input.conf.

This specifies a twitter input that will filter tweets containing the keywords "mobile", "java", "android", "elasticsearch", or "search" and pass them to the Logstash output. Lastly, we will create a configuration file called 30-elasticsearch-output.conf:

sudo vi /etc/logstash/conf.d/30-elasticsearch-output.conf

Insert the following output configuration:

output {
 elasticsearch {
   hosts => ["https://eb843037.qb0x.com:32563/"]
   user => "ec18487808b6908009d3"
   password => "efcec6a1e0"
   index => "twitter-%{+YYYY.MM.dd}"
   document_type => "tweet"
 }
 stdout { codec => rubydebug }
}

Save and exit. This output configures Logstash to store the tweet data in Elasticsearch, which is running at https://eb843037.qb0x.com:32563/, in a daily index named twitter-YYYY.MM.dd.

If you have downloaded the Logstash tar or zip archive, you can instead create a single logstash.conf file containing the input, filter, and output sections in one place.

sudo vi LOGSTASH_HOME/logstash.conf

Insert the following input and output configuration in logstash.conf:

input {
 twitter {
   consumer_key => "BCgpJwYPDjXXXXXX80JpU0"
   consumer_secret => "Eufyx0RxslO81jpRuXXXXXXXMlL8ysLpuHQRTb0Fvh2"
   keywords => ["mobile", "java", "android", "elasticsearch", "search"]
   oauth_token => "193562229-o0CgXXXXXXXX0e9OQOob3Ubo0lDj2v7g1ZR"
   oauth_token_secret => "xkb6I4JJmnvaKv4WXXXXXXXXS342TGO6y0bQE7U"
 }
}
output {
 elasticsearch {
   hosts => ["https://eb843037.qb0x.com:32563/"]
   user => "ec18487808b6908009d3"
   password => "efcec6a1e0"
   index => "twitter-%{+YYYY.MM.dd}"
   document_type => "tweet"
 }
 stdout { codec => rubydebug }
}

Test your Logstash configuration with this command:

sudo service logstash configtest

It should display Configuration OK if there are no syntax errors. Otherwise, try and read the error output to see what's wrong with your Logstash configuration.

Restart Logstash and enable it on boot to put our configuration changes into effect:

sudo service logstash restart
sudo update-rc.d logstash defaults 96 9

If you have downloaded the Logstash tar or zip archive, it can be run using the following command:

bin/logstash -f logstash.conf

Numerous documents are received from the Twitter stream. The structure of an indexed document is as follows:

{
 "text": "Learn how to automate anomaly detection on your #Elasticsearch #timeseries data with #MachineLearning:",
 "created_at": "2017-05-07T07:54:47.000Z",
 "source": "<a href="%5C">Twitter for iPhone</a>",
 "truncated": false,
 "language": "en",
 "mention": [],
 "retweet_count": 0,
 "hashtag": [
   {
     "text": "Elasticsearch",
     "start": 49,
     "end": 62
   },
   {
     "text": "timeseries",
     "start": 65,
     "end": 75
   },
   {
     "text": "MachineLearning",
     "start": 88,
     "end": 102
   }
 ],
 "location": {
   "lat": 33.686657,
   "lon": -117.674558
 },
 "place": {
   "id": "74a60733a8b5f7f9",
   "name": "elastic",
   "type": "city",
   "full_name": "San Francisco, CA",
   "street_address": null,
   "country": "United States",
   "country_code": "US",
   "url": "https://api.twitter.com/1.1/geo/id/74a60733a8b5f7f9.json"
 },
 "link": [],
 "user": {
   "id": 2873953509,
   "name": "Elastic",
   "screen_name": "elastic",
   "location": "SF, CA",
   "description": "The company behind the Elastic Stack (#elasticsearch, #kibana, Beats, #logstash), X-Pack, and Elastic Cloud"
 }
}
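To confirm that tweets are actually reaching Elasticsearch, a quick document count against the twitter-* index pattern created by the output configuration is enough (same endpoint and credentials as before):

curl -u ec18487808b6908009d3:efcec6a1e0 'https://eb843037.qb0x.com:32563/twitter-*/_count?pretty'

The count should keep growing as long as Logstash is running and tweets matching the keywords are coming in.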

Elasticsearch enables pagination by adding size and from parameters. For example, if you wanted to retrieve results in batches of 5 starting from the 3rd page (i.e. results 11-15), you would do:

curl -XGET 'ES_HOST:ES_PORT/twitter-*/tweet/_search?size=5&from=10'

However, paging becomes more expensive as we move further and further into the list of results. Each time we make one of these calls, we re-run the search operation, forcing Lucene to re-score all the results, rank them, and then discard the first 10 (or 10,000 if we get that far). The easier option is the scan and scroll API. The idea is to run the actual query once; Elasticsearch then keeps the search context alive and gives us an "access token" (the scroll ID) to come back and fetch more results. We then call the scroll API endpoint with that token to get the next batch of results.

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive, e.g. ?scroll=1m.

curl -XPOST 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
   "size": 100,
   "query": {
       "match" : {
           "text" : "elasticsearch"
       }
   }
}'

The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

curl -XPOST 'ES_HOST:ES_PORT/_search/scroll?pretty' -H 'Content-Type: application/json' -d '{
   "scroll" : "1m",
   "scroll_id" : "DXF1ZXJ5DKJ56hghFHFDJgBAAAAAAAArknJBJsbjYtZndUQlNsdDcwakFSDSSXSQ=="
}'

The size parameter allows you to configure the maximum number of hits returned with each batch of results. Each call to the scroll API returns the next batch until there are no more results left to return, i.e. the hits array is empty. A few important points to consider regarding the Scroll and Scan API:

  • The initial search request and each subsequent scroll request return a new _scroll_id; only the most recent _scroll_id should be used.

  • If the request specifies aggregations, only the initial search response will contain the aggregations results.

  • Scroll requests have optimisations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

curl -XGET 'ES_HOST:ES_PORT/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
 "sort": [
   "_doc"
 ]
}'

Search Context

The scroll parameter tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m) does not need to be long enough to process all of the data; it just needs to be long enough to process the previous batch of results. Each scroll request sets a new expiry time.
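Putting this together, a full scroll pass over the index can be driven from the shell. The following is a minimal sketch, assuming the jq utility is installed and ES_HOST:ES_PORT is replaced with your endpoint; it processes one batch per iteration, renews the 1m context on every call, and stops when a batch comes back empty:

RESPONSE=$(curl -s -XPOST 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m' \
  -H 'Content-Type: application/json' \
  -d '{"size": 100, "query": {"match": {"text": "elasticsearch"}}}')

while [ "$(echo "$RESPONSE" | jq '.hits.hits | length')" -gt 0 ]; do
  # Process the current batch, e.g. append each hit's _source to a local file.
  echo "$RESPONSE" | jq -c '.hits.hits[]._source' >> tweets.jsonl
  # Always carry forward the most recent _scroll_id and renew the 1m context.
  SCROLL_ID=$(echo "$RESPONSE" | jq -r '._scroll_id')
  RESPONSE=$(curl -s -XPOST 'ES_HOST:ES_PORT/_search/scroll' \
    -H 'Content-Type: application/json' \
    -d "{\"scroll\": \"1m\", \"scroll_id\": \"$SCROLL_ID\"}")
done

Once the loop finishes, the scroll should be cleared explicitly, as described in the next section.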

Clear Scroll API

Search contexts are automatically removed when the scroll timeout has been exceeded. However, keeping scrolls open has a cost (discussed later in the performance section), so scrolls should be explicitly cleared as soon as they are no longer needed, using the clear-scroll API:

curl -XDELETE 'ES_HOST:ES_PORT/_search/scroll?pretty' -H 'Content-Type: application/json' -d '{
   "scroll_id" : ["DXF1ZXJ5DKJ56hghFHFDJgBAAAAAAAArknJBJsbjYtZndUQlNsdDcwakFSDSSXSQ=="]
}'

Multiple scroll IDs can be passed as an array:

curl -XDELETE 'ES_HOST:ES_PORT/_search/scroll?pretty' -H 'Content-Type: application/json' -d '{
   "scroll_id" : [
      "DXF1ZXJ5DKJ56hghFHFDJgBAAAAAAAArknJBJsbjYtZndUQlNsdDcwakFSDSSXSQ==",
      "DnF1ZXJ5VNJNSJDNJ68hhdsAAABFmtSWWRRWUJrNNDSvbv767DHhdsdQxYMWFB"
   ]
}'

 All search contexts can be cleared with the _all parameter: 

curl -XDELETE 'ES_HOST:ES_PORT/_search/scroll/_all?pretty'

Sliced Scroll

Scroll queries which return a lot of documents can be split into multiple slices which can be consumed independently:

 curl -XGET 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
   "slice": {
       "id": 0,
       "max": 2
   },
   "query": {
       "match" : {
           "text" : "elasticsearch"
       }
   }
}'
curl -XGET 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
   "slice": {
       "id": 1,
       "max": 2
   },
   "query": {
       "match" : {
           "text" : "elasticsearch"
       }
   }
}' 

The result from the first request returns documents that belong to the first slice (id: 0) and the result from the second request returns documents that belong to the second slice. Since the maximum number of slices is set to 2, the union of the results of the two requests is equivalent to the results of a scroll query without slicing.
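Because the slices are independent, the requests can also be issued in parallel, one worker per slice. A minimal sketch, assuming a shell with curl available, launches both requests as background jobs and writes each first batch to its own file:

for SLICE_ID in 0 1; do
  curl -s -XGET 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m' \
    -H 'Content-Type: application/json' \
    -d "{\"slice\": {\"id\": $SLICE_ID, \"max\": 2}, \"query\": {\"match\": {\"text\": \"elasticsearch\"}}}" \
    > slice_$SLICE_ID.json &
done
wait   # block until both slice requests have returned

Each response carries its own _scroll_id, which that worker then follows independently through the scroll API.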

By default the splitting is done on the shards first and then locally on each shard using the _uid field with the following formula:

slice(doc) = floorMod(hashCode(doc._uid), max)

Performance Considerations

Scroll API: The background merge process optimizes the index by merging smaller segments into new, bigger segments, at which point the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.

Keeping older segments alive means that more file handles are needed. Ensure that your nodes are configured with ample free file handles and that the scroll context is cleared soon after the data has been fetched.

We can check how many search contexts are open with the nodes stats API: 

curl -XGET 'ES_HOST:ES_PORT/_nodes/stats/indices/search?pretty'

It is therefore important to clear the scroll context as described earlier in the Clear Scroll API section.

Sliced Scroll API: If the number of slices is bigger than the number of shards, the slice filter is very slow on the first calls: it has a complexity of O(N) and a memory cost equal to N bits per slice, where N is the total number of documents in the shard. After a few calls the filter should be cached and subsequent calls should be faster, but you should limit the number of sliced queries you perform in parallel to avoid a memory explosion.

Note: the maximum number of slices allowed per scroll is limited to 1024 by default; the index.max_slices_per_scroll index setting can be updated to bypass this limit.

To avoid this cost entirely, it is possible to use the doc_values of another field to do the slicing but the user must ensure that the field has the following properties:

  • The field is numeric.

  • doc_values are enabled on that field.

  • Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.

  • The value for each document should be set once when the document is created and never updated. This ensures that each slice gets deterministic results.

  • The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents. 

The “date” field naturally satisfies the above properties and can thus be used for slicing:

curl -XGET 'ES_HOST:ES_PORT/twitter-*/tweet/_search?scroll=1m&pretty' -H 'Content-Type: application/json' -d '{
   "slice": {
       "field": "date",
       "id": 0,
       "max": 10
   },
   "query": {
       "match" : {
           "text" : "elasticsearch"
       }
   }
}'

Try Qbox Hosted Elasticsearch

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
