In this article, we continue our series on Getting to Know Elasticsearch. In this series, Technical writer John Vanderzyden, a technologist for over two decades but with no previous experience in Elasticsearch, is writing a series of guest posts documenting how easy it is to ramp up as a newbie to Qbox. He’s especially keen to help individuals who are new to these technologies. The first post in the series, Welcome to the ELK Stack, introduced the series. — Mark Brandon

Note: Qbox is planning to disable scripting by default by mid-August 2018. New clusters won’t have it, and existing clusters will lose it upon a restart or rebuild. All Qbox customers will be notified once this is done.

The Elasticsearch scripting module gives you the ability to use scripts for evaluating custom expressions. For example, you can specify that the script return script fields along with the search request or evaluate a custom query score.

Up through version 1.3, the scripting module uses MVEL as the default scripting language, along with some extensions. MVEL is quite fast and simple to use, and in most cases only simple expressions are necessary. However, the default scripting language changes in version 1.4 of Elasticsearch, as we explain below.

Elasticsearch provides additional language plugins for executing scripts in different languages, including lang-javascript for JavaScript and lang-python for Python. For all cases in which you can use a script parameter, you can specify a lang parameter on the same level to define the script language. There are several lang options, including mvel, js, groovy, python, expression, and native.

Being Careful with Dynamic Scripting

Elasticsearch has disabled dynamic scripting by default since version 1.2.0, and that default applies if you are running an on-premise instance. Qbox, by contrast, enables dynamic scripting because we provide a number of solid security features to prevent and block unauthorized access. As a Qbox user, you can disable dynamic scripting, but it will then be necessary to manually upload script files to the cluster if a particular user needs to use scripting.

To ensure robust security when dynamic scripting is disabled, Elasticsearch does not allow you to specify scripts in the body of a request. Instead, you’ll need to place scripts in the scripts directory within the configuration directory (the same directory containing elasticsearch.yml). Elasticsearch will automatically recognize any scripts in this directory and put them into service. You can reference any script in this directory by name. For example, consider a script with the file name calculate-score.mvel. You can reference this script in a request like this:

$ tree config
config
├── elasticsearch.yml
├── logging.yml
└── scripts
    └── calculate-score.mvel
$ cat config/scripts/calculate-score.mvel
Math.log(_score * 2) + my_modifier

$ curl -XPOST localhost:9200/_search -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "foo"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "calculate-score",
            "params": {
              "my_modifier": 8
            }
          }
        }
      ]
    }
  }
}'
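
To get a feel for what this file script contributes to each hit, the arithmetic can be reproduced outside Elasticsearch. The Python sketch below mirrors the expression Math.log(_score * 2) + my_modifier. The function name and the sample scores are made up for illustration; the only fact relied on is that Math.log is the natural logarithm.

```python
import math


def calculate_score(score, my_modifier):
    """Mirror the file script: Math.log(_score * 2) + my_modifier."""
    # Java's Math.log is the natural logarithm, and so is Python's math.log.
    return math.log(score * 2) + my_modifier


# Hypothetical relevance scores, with my_modifier = 8 as in the request above.
for score in (0.5, 1.0, 2.5):
    print(score, round(calculate_score(score, 8), 3))
```

Because the modifier is simply added after the logarithm, it shifts every score by the same amount and does not change the ranking on its own.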

The name of the script derives from the directory structure that defines its location and also the file name (without the language extension). For example, a script in the location config/scripts/group1/group2/test.py will get the name group1_group2_test.
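
The naming rule is easy to check with a few lines of Python. This is only a sketch of the rule as described above, not Elasticsearch’s own code; the function name and the default scripts directory are assumptions made for the example.

```python
from pathlib import PurePosixPath


def script_name(script_path, scripts_dir="config/scripts"):
    """Return the name under which a file script would be referenced."""
    relative = PurePosixPath(script_path).relative_to(scripts_dir)
    # Drop the language extension, then join the directories and the
    # file stem with underscores.
    return "_".join(relative.with_suffix("").parts)


print(script_name("config/scripts/group1/group2/test.py"))
print(script_name("config/scripts/calculate-score.mvel"))
```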

Groovy Instead of MVEL

Beginning with version 1.4, Elasticsearch changes the default scripting language to Groovy. This is really good for us here at Qbox, since Groovy, unlike MVEL, can be “sandboxed.” This alleviates most of our common worries about remote code execution.

All scripts will eventually transition from MVEL to Groovy, which has these advantages:

  1. Groovy performs better than MVEL, especially in loops.
  2. Groovy is developing at a faster pace; it now has full support for Java 8 and exploits many of the newer JVM features.
  3. With Groovy, it’s easier to add sandboxing.

Keep in mind that sandboxing does not specifically identify or prevent DoS (denial-of-service) attack scripts: it only prevents scripts from gaining unauthorized access to the core operating system or Elasticsearch internals. A malicious, infinite-loop script can still consume system resources. If you need to disable the sandbox, simply add this to your configuration (in Elasticsearch 1.3 or later):

script.groovy.sandbox.enabled: false

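After applying this change, Groovy no longer counts as a sandboxed language, so under the default sandbox value of script.disable_dynamic Elasticsearch will deny any Groovy scripts that are sent as strings in requests.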

In most cases, we find that MVEL scripts need very little adjustment to work with Groovy. It’s easy to try any MVEL script as Groovy by specifying a lang parameter with the value groovy alongside the script. Or, simply change the default scripting language for all scripts to Groovy by adding script.default_lang: groovy to elasticsearch.yml. Then you can transition each MVEL script to Groovy. After upgrading to Elasticsearch 1.4, MVEL will no longer be available as a built-in scripting language. If you still require the use of MVEL, then you’ll need to install the elasticsearch-lang-mvel plugin.
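
As a rough illustration of the per-request approach, the Python sketch below rebuilds the earlier function_score body and places a lang parameter on the same level as script. It only constructs and prints the JSON; the helper name is hypothetical, and whether a given MVEL expression runs unchanged under Groovy still has to be verified against your own cluster.

```python
import json


def script_score_query(script, params, lang=None):
    """Build a function_score body, optionally naming the script language."""
    script_score = {"script": script, "params": params}
    if lang is not None:
        # "lang" sits on the same level as "script".
        script_score["lang"] = lang
    return {
        "query": {
            "function_score": {
                "query": {"match": {"body": "foo"}},
                "functions": [{"script_score": script_score}],
            }
        }
    }


body = script_score_query(
    "Math.log(_score * 2) + my_modifier", {"my_modifier": 8}, lang="groovy"
)
print(json.dumps(body, indent=2))
```

Leaving lang out falls back to whatever script.default_lang is set to, which makes the per-request form handy for testing one script at a time before switching the default.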

Why Groovy?

JavaScript is a very popular language, so why not use it for the Elasticsearch scripting language? There are two primary reasons that Elasticsearch uses Groovy instead of JavaScript:

  • Groovy is faster than JavaScript (when using Rhino), and Nashorn has poor support for concurrent script execution.
  • There is very little difference in syntax between Groovy and JavaScript for simple scripts.

If you don’t want to use Groovy, there is an alternative scripting language available. We cover Lucene Expressions in the next section.

Using Lucene Expressions

Lucene Expressions give you the ability to dynamically evaluate a single JavaScript numeric expression for a specific document. You can easily make scoring adjustments without any custom Java code, and each expression compiles to Java bytecode to achieve performance that is similar to native code.

The new expression lang for scripts can be used for virtually all query scripts in Elasticsearch, including script_score, script_fields, sort scripts, and numeric aggregation scripts. Typically, the performance is much faster than Groovy scripts and slightly faster than native scripts. High performance often comes with tradeoffs, and you’ll need to be mindful of the following restrictions if you decide to employ Lucene expressions:

  • No loops; these are only single statements in JavaScript (an “expression”).
  • No local variables; an expression is only what would appear on the right-hand side of an assignment.
  • Only single-value, numeric fields are accessible.
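
To make these restrictions concrete, here is a hedged Python sketch that builds a script-based sort using the expression language. The field name popularity, the parameter boost, and the helper name are all hypothetical, and the sketch assumes that numeric parameters can be referenced by name inside an expression. Note that the script is a single numeric expression over a single-valued numeric field, with no loops or local variables.

```python
import json


def expression_sort(expression, params, order="desc"):
    """Build a search body that sorts hits by a Lucene expression."""
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {
                "script": expression,
                "lang": "expression",  # select Lucene Expressions
                "type": "number",      # expressions always yield a number
                "params": params,
                "order": order,
            }
        },
    }


# 'popularity' is a made-up single-valued numeric field.
body = expression_sort("doc['popularity'].value * boost", {"boost": 1.5})
print(json.dumps(body, indent=2))
```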

To learn more about the available expressions, functions, and operators, we recommend that you consult the Lucene documentation. Also, see the scripting documentation to learn how to use Lucene expressions within Elasticsearch.

Indexing your Scripts

If dynamic scripting is enabled (as it is by default in Qbox), then Elasticsearch will permit you to store scripts in an internal index (.scripts) and reference them by ID. There are various REST endpoints to manage such scripts, as we explain below.

Here is the form of a request to the scripts endpoint, in which lang is the language of the script and id is the script identifier:

/_scripts/{lang}/{id}

The example below will create a document in the .scripts index having an id that equals the value of indexedCalculateScore and having the type mvel:

curl -XPOST localhost:9200/_scripts/mvel/indexedCalculateScore -d '{
    "script": "log(_score * 2) + my_modifier"
}'

You can use this script at query time by appending _id to the name of the script parameter (so that script becomes script_id) and passing the script ID as its value:

curl -XPOST localhost:9200/_search -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "foo"
        }
      },
      "functions": [
        {
          "script_score": {
            "script_id": "indexedCalculateScore",
            "params": {
              "my_modifier": 8
            }
          }
        }
      ]
    }
  }
}'

NOTE: Dynamic scripting must be enabled to use indexed scripts at query time.

View an indexed script with this call:

curl -XGET localhost:9200/_scripts/mvel/indexedCalculateScore

Delete an indexed script with this:

curl -XDELETE localhost:9200/_scripts/mvel/indexedCalculateScore
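
All three operations target the same /_scripts/{lang}/{id} endpoint and differ only in the HTTP method. The Python sketch below makes that mapping explicit. It only assembles the method, URL, and body without sending anything; the helper name and the default host are assumptions for the example.

```python
import json


def indexed_script_request(action, lang, script_id, source=None,
                           host="localhost:9200"):
    """Return (method, url, body) for managing an indexed script."""
    methods = {"create": "POST", "view": "GET", "delete": "DELETE"}
    if action not in methods:
        raise ValueError("unknown action: " + action)
    url = "http://{}/_scripts/{}/{}".format(host, lang, script_id)
    # Only creation carries a body, which holds the script source.
    body = json.dumps({"script": source}) if action == "create" else None
    return methods[action], url, body


for action in ("create", "view", "delete"):
    print(indexed_script_request(
        action, "mvel", "indexedCalculateScore",
        source="log(_score * 2) + my_modifier",
    ))
```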

Configuring Dynamic Scripting

We recommend the following best practices as you configure Elasticsearch for dynamic scripting. At Qbox, we build all of our Elasticsearch environments according to the following dynamic scripting guidelines.

We recommend running Elasticsearch behind an application or proxy, to protect it from the outside world. If you do intend to give your users direct access to your Elasticsearch environment, then you have to decide whether you trust them enough to let them run scripts.

We do not recommend that you give direct access. If you find it necessary, then you can enable dynamic scripting by adding the following setting to the config/elasticsearch.yml file on every node:

script.disable_dynamic: false

Keep in mind that this will permit execution of any named scripts in the config directory and any native Java scripts that register through plugins. It also allows any user to run any script through the API.

There are three possible configuration values for the script.disable_dynamic setting. The default value is sandbox:

Value     Description
true      All dynamic scripting is disabled; scripts must be placed in the config/scripts directory.
false     All dynamic scripting is enabled; scripts may be sent as strings in requests.
sandbox   Scripts may be sent as strings for languages that are sandboxed.
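
The three values boil down to a small decision rule, which can be modeled in Python as follows. This is a simplified illustration, not Elasticsearch’s implementation: the function name is hypothetical, and the default set of sandboxed languages contains only Groovy because that is the one language this article confirms is sandboxed.

```python
def dynamic_script_allowed(disable_dynamic, lang, sandboxed_langs=("groovy",)):
    """Decide whether a script sent as a string in a request would be accepted."""
    setting = str(disable_dynamic).lower()
    if setting == "true":
        # Only file scripts in config/scripts may run.
        return False
    if setting == "false":
        # Any language may be sent as a string.
        return True
    if setting == "sandbox":
        # Only languages that are sandboxed may be sent as strings.
        return lang in sandboxed_langs
    raise ValueError("unexpected script.disable_dynamic value: " + setting)


for setting in ("true", "false", "sandbox"):
    for lang in ("groovy", "mvel"):
        print(setting, lang, dynamic_script_allowed(setting, lang))
```

Passing an empty tuple for sandboxed_langs models the effect of turning the Groovy sandbox off while leaving script.disable_dynamic at sandbox.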

Groovy Sandboxing

Elasticsearch will sandbox all Groovy scripts to ensure that they do not perform any malicious or potentially damaging actions. Below we present the options for configuring the sandbox.

List of string classes for objects that have invokable methods:

script.groovy.sandbox.receiver_whitelist

List of packages in which a user can construct new objects:

script.groovy.sandbox.package_whitelist

List of classes that users can construct:

script.groovy.sandbox.class_whitelist

List of methods that are not invokable, irrespective of the target object:

script.groovy.sandbox.method_blacklist

Flag that enables or disables the sandbox (defaults to true, meaning the sandbox is enabled):

script.groovy.sandbox.enabled

When you specify whitelist or blacklist settings for the sandbox, the values you provide overwrite the current list; they are not added to it.

Automatic Script Reloading

Elasticsearch periodically scans the config/scripts directory for any changes, and all new or modified scripts will reload. Deleted scripts are removed from the scripts cache. You can change the reload frequency with the watcher.interval setting, which has a default of 60 seconds. If you want to entirely disable script reloading, then set script.auto_reload_enabled to false.

Document Score

As soon as a document matches a query, Elasticsearch calculates a score for that query and then combines the scores of each matching term. For any script that is used in a facet or aggregation, the current document score is accessible in doc.score. If you’re using script_score, the current score is available in _score.

You can access text features (such as term or document frequency for a specific term) in a script with the _index variable. For example, this can be useful if you want to implement your own scoring model using a script inside of a function score query. Remember that statistics for the document collection are computed per shard, not per index.

Document Fields

For most scripting, the focus is on the use of data that corresponds to document fields. The doc['field_name'] notation can be used to access specific field data within a document, and this is usually a very quick operation since all of the field values and tokens load directly into memory. Keep in mind, however, that the doc[…] notation is only applicable to non-analyzed or single-term fields, and the notation only accommodates simple-value fields (it cannot return a JSON object).

Stored Fields

Stored fields are also accessible during script execution, although the performance is much slower in comparison with document fields, since stored fields do not load into memory. Simply use _fields['my_field_name'].value or _fields['my_field_name'].values to access the value for a stored field.

Accessing the Score of a Document within a Script

When calculating the score of a document (as with the function_score query), you can access the score value with the _score variable inside of a script.

Source Field

The source field can also be accessed when executing a script. The source field loads for each doc, and will be parsed and made available to the script for evaluation. The _source forms the context under which you can access the source field, such as _source.obj2.obj1.field3.

Accessing _source is much slower than using doc, but the data doesn’t load into memory. To access a single field, _fields may be faster than _source when we consider the potentially extra overhead of parsing large documents. However, _source may be faster when accessing multiple fields or if the source is already loaded for another purpose.
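
The three access paths can be compared side by side in a single script_fields request. The Python sketch below builds such a body; the field name price and the output field names are hypothetical, and the _fields variant assumes that price is mapped as a stored field.

```python
import json

# One script field per access path, all reading a made-up numeric field.
body = {
    "query": {"match_all": {}},
    "script_fields": {
        # Field data held in memory: usually the fastest option.
        "price_from_doc": {"script": "doc['price'].value * 2"},
        # Stored field: slower, and requires the field to be stored.
        "price_from_stored": {"script": "_fields['price'].value * 2"},
        # Parsed _source: slowest for one field, but can reach nested objects.
        "price_from_source": {"script": "_source.price * 2"},
    },
}

print(json.dumps(body, indent=2))
```

Running the same calculation through each path on a test index is a simple way to see the performance differences described above for your own documents.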

Lock It Down

In the Summer of 2014, the Elasticsearch project got hit with some bad press after it became known that the scripting defaults left early versions of Elasticsearch vulnerable to remote code execution. At the time, the guidance from the ES development team was three-fold:

  1. Don’t run Elasticsearch as root. This is not a problem for Qbox users, since no user is ever given root access.
  2. Don’t run Elasticsearch open to the public. For Qbox users, it is important that only your app servers communicate with your Qbox endpoint. You should block both ports 9200 and 9300 from all machines that are not part of your development environment.
  3. Disable dynamic scripting. Whoa… hold on there! This would squelch a whole lot of great functionality inherent in the platform. At Qbox, we’ve chosen to leave dynamic scripting enabled and rely instead on our safeguards for #1 and #2 above.

That said, the typical usage of Elasticsearch is through HTTP and binding to localhost. Intuitively, external hosts cannot connect to anything that has firewall protection or listens only on localhost. However, your web browser can reach your localhost, and might be able to reach servers on your company’s internal network.

It’s possible for any website that you visit to send requests to your local Elasticsearch node. Your browser will happily make an HTTP request to 127.0.0.1/_search. Consequently, that website can go spelunking in whatever data is in Elasticsearch. It could then POST its findings somewhere. Adjusting the settings for cross-origin resource sharing can help, but remember that it would still be possible to search using JSONP requests.

Wrapping Up

Elasticsearch is very flexible when it comes to scripting, allowing users to use the scripting language of their choice to accomplish a wide range of tasks. Groovy is a good choice right out of the box, and the default configuration is to sandbox all of your scripts. You also get a number of options for configuring dynamic scripting. However, it’s critically important to prevent exposure of your data assets. In John’s next article, he’ll run through an overview of Elasticsearch security.