Ask most folks to describe Elasticsearch, and you’ll get a variety of answers. Many senior full-stack developers will struggle to provide a helpful answer. They might know how to use Elasticsearch — but it’s hard to get them to provide clear, concise, and accurate answers. This can of course create no small amount of frustration in others who need to know: What is it? What does it do? How might I benefit?
Well, we’ve got answers for you. Right here in this article. Comprehensive, yet easy for almost anyone to read. Enjoy. And... you’re welcome!
“Elementary,” said the fictional detective Mr. Holmes to his dear friend Dr. Watson.
Some concepts are indeed quite easy to grasp — but if he were a real person, Dr. Watson might not categorize Elasticsearch as such.
How would you, dear blog reader, describe Elasticsearch?
An interesting video from Elastic shows how difficult it can be to find any consistency — even within the core of the Elasticsearch community.
Our team here at Qbox offers this description:
Elasticsearch (ES) is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications.
If you are hungry for straight answers and more details, here is a broader overview.
Elasticsearch is a distributed analytics and search engine built over Apache Lucene, a Java-based search and indexing library. Elasticsearch provides distributed cluster features and sharding for Lucene indexes, advanced search and aggregation functionality, as well as high availability, security, a snapshot/restore module, and other data management features. The platform also provides such features as thread-pooling, node/cluster monitoring API, queues, data monitoring API, cluster management, etc.
In brief, Elasticsearch allows managing Lucene indexes at scale, providing storage and search functionality for large data clusters distributed across data centers.
Elasticsearch is a perfect choice for e-commerce applications, recommendation engines, and analysis of time-series data (logs, metrics, etc.) and geospatial information. Also, you can use Elasticsearch to create autocomplete functionality and contextual suggesters, to analyze linguistic content, and to build anomaly detection features. Elasticsearch is also widely used for IoT.
It’s easy to get going with Elasticsearch. It ships with sensible defaults, and it hides complex search and distribution mechanics from beginners. It works quite well right out of the box. With a short learning curve for grasping the basics, you can become very productive very quickly.
In what follows, we describe the key features of ES that enable the variety of use cases described above.
Fast and Incisive Search against Large Volumes of Data
Conventional SQL database management systems aren’t really designed for full-text searches, and they certainly don’t perform well against loosely structured raw data that resides outside the database. Elasticsearch directly addresses the need for fast access to and processing of semi-structured and unstructured data in a distributed environment. Full-text queries that take more than 10 seconds in a SQL database can return results in under 10 milliseconds in Elasticsearch on the same hardware.
Where does such a dramatic boost in performance come from? To provide high read and write performance, Elasticsearch uses optimized data structures for various data types. For example, fast full-text search in Elasticsearch is enabled via so-called inverted indexes. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. Elasticsearch is also well-optimized for numerical and geospatial data, which it stores in BKD trees, a data structure built for fast range and spatial lookups.
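To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is only an illustration under simplifying assumptions (lowercasing and whitespace splitting stand in for real text analysis), and the function names `build_inverted_index` and `search` are ours, not part of Elasticsearch or Lucene.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive "analysis": lowercase the text and split it on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking",
}
index = build_inverted_index(docs)
print(index["brown"])                # [1, 2]
print(search(index, "quick brown"))  # [1]
```

A lookup touches only the posting lists of the query terms instead of scanning every document, which is where much of the speed of full-text search comes from.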
This fast search functionality is accessible via a simple JSON-based REST API and query language known as Query DSL. A query examines one or many target values and scores each of the elements in the results according to how closely they match the focus of the query. The query operators enable you to optimize simple or complex queries that often return results from large datasets in just a few milliseconds.
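As a rough illustration of what a Query DSL request body looks like, the sketch below builds one as a plain Python dictionary and prints it as JSON. The field names `title` and `published` and the helper `build_search_body` are hypothetical. The `bool`, `match`, and `range` clauses are standard Query DSL constructs.

```python
import json


def build_search_body(text, published_since, size=10):
    """Build a bool query: a scored full-text match plus a non-scoring date filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"range": {"published": {"gte": published_since}}}],
            }
        },
    }


body = build_search_body("distributed search", "2020-01-01")
print(json.dumps(body, indent=2))
```

The resulting JSON is what would be sent as the body of a search request.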
Also, you can use Painless (Elasticsearch’s built-in scripting language) to write custom expressions that are evaluated for each matching document and returned as “script fields” in the search response. This makes querying more flexible, because derived values can be computed at search time rather than stored in the index.
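A search body that asks for a script field might look like the following sketch. The field name `price`, the result name `price_with_tax`, and the tax rate are made up for illustration. The `script_fields` section with a Painless `source` and `params` follows the standard request format.

```python
import json

# Hypothetical example: compute a price including tax for every hit at search time.
body = {
    "query": {"match_all": {}},
    "script_fields": {
        "price_with_tax": {
            "script": {
                "lang": "painless",
                # doc['price'].value reads the indexed value of the assumed "price" field.
                "source": "doc['price'].value * params.rate",
                "params": {"rate": 1.2},
            }
        }
    },
}

print(json.dumps(body, indent=2))
```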
Overall, the Elasticsearch design is much simpler and much leaner than a database constrained by schemas, tables, fields, rows, and columns.
Indexing Documents to the Repository
During an indexing operation, Elasticsearch converts raw data such as log files or message files into internal documents and stores them in a basic data structure similar to a JSON object. Each document is a simple set of keys and corresponding values. The keys are strings, and the values can be any of numerous data types, such as strings, numbers, dates, or lists.
Adding documents to Elasticsearch is easy, and it’s easy to automate. Simply send an HTTP POST that transmits your document as a simple JSON object. Searches are also done with JSON: send your query in an HTTP GET with a JSON body (the search endpoint also accepts POST, for clients that cannot send a body with GET). The RESTful API makes it easy to retrieve, submit, and verify data directly from a command line. Even if they are developing with a client library for a language such as Python or Ruby, many developers use the cURL tool for debugging and developing with Elasticsearch.
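The two requests described above can be sketched with nothing but the Python standard library. The example only constructs the requests and does not send them. The base URL `http://localhost:9200`, the index name `articles`, and the helper names are assumptions for illustration.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed address of a local cluster
JSON_HEADERS = {"Content-Type": "application/json"}


def index_request(index_name, document):
    """Build a POST that would add `document` to `index_name`."""
    return Request(
        url=f"{BASE_URL}/{index_name}/_doc",
        data=json.dumps(document).encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


def search_request(index_name, query):
    """Build a GET with a JSON body that would search `index_name`."""
    return Request(
        url=f"{BASE_URL}/{index_name}/_search",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers=JSON_HEADERS,
        method="GET",
    )


index_req = index_request("articles", {"title": "What is Elasticsearch?", "views": 42})
search_req = search_request("articles", {"match": {"title": "elasticsearch"}})
print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
```

Passing either object to `urllib.request.urlopen` would send it, provided a cluster is actually listening at the assumed address.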
Denormalized Document Storage: Fast and Direct Access to your Data
It’s important to remember that Elasticsearch isn’t a relational database, so RDBMS concepts usually won’t apply. The most important concept that you must set aside when coming over from conventional databases is normalization. Elasticsearch doesn’t support SQL-style joins or subqueries across indexes (it offers only limited nested and parent/child relationships), so denormalizing your data is essential.
Elasticsearch will typically store a document once for each repository in which it resides. Although this is counterintuitive from the perspective of a conventional RDBMS, it is optimal for Elasticsearch. Full text searches will be extremely fast because the documents are stored in close proximity to the corresponding metadata in the index. This design greatly reduces the number of data reads, and Elasticsearch limits the index growth rate by keeping it compressed.
Enabling Advanced Search and Suggestion Functionality
Elasticsearch exposes the powerful Apache Lucene library and Elasticsearch-native functionality to enable auto-completion, a “did-you-mean” feature, highlighters, and percolators. Let’s briefly discuss these features.
Elasticsearch supports auto-completion functionality that helps point users to relevant documents as they type. Completion suggesters that enable this feature can be used with the fuzziness parameter that returns the probable result even if there is a typo in the query. There are multiple ways you can adjust this feature to your specific
The Elasticsearch API makes it easy to implement “did-you-mean” functionality that corrects queries containing typos and suggests the most relevant search request to users. This feature is implemented with an n-gram language model that breaks text into short overlapping tokens that can be matched against user queries.
Highlighters are great for queries against “full-text” documents. For such documents, they can return snippets in which every occurrence of the queried word or phrase is marked up. This can be used to enable advanced search functionality in customer-facing applications.
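The effect of a highlighter can be imitated in a few lines of Python. This is a toy sketch, not how Elasticsearch implements highlighting: it wraps whole-word, case-insensitive matches in `<em>` tags, and the function name `highlight` is our own.

```python
import re


def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap every whole-word, case-insensitive occurrence of `term` in tags."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda match: f"{pre}{match.group(0)}{post}", text)


print(highlight("Search engines make search fast.", "search"))
# <em>Search</em> engines make <em>search</em> fast.
```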
The percolator is an implementation of a “reverse search” model. In this model, user queries are stored as documents in an Elasticsearch index. Each new document is then matched against these stored queries to find the queries it satisfies. As a result, users can get relevant documents without sending actual queries. Percolators can be used in recommendation engines to store user interests as “queries” and match them against newly added documents (songs, movies, etc.).
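The reverse-search idea can be sketched in plain Python. This toy model assumes that each stored query is just a set of required terms. Real percolator queries are full Query DSL queries, and the names `stored_queries` and `matching_queries` are illustrative only.

```python
def matching_queries(stored_queries, document_text):
    """Return the names of stored queries whose required terms all occur in the document."""
    terms = set(document_text.lower().split())
    return sorted(
        name for name, required in stored_queries.items() if required <= terms
    )


# Each user's interests are registered ahead of time as a "query".
stored_queries = {
    "alice-jazz": {"jazz"},
    "bob-live-rock": {"live", "rock"},
}

# A newly added document is matched against all stored queries.
print(matching_queries(stored_queries, "Live jazz recording from 1959"))
# ['alice-jazz']
```

Here the document is the input and the matching queries are the result, which is the inverse of an ordinary search.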
Broadly Distributable and Highly Scalable
Elasticsearch can scale up to thousands of servers and accommodate petabytes of data. Its enormous capacity results directly from its elaborate, distributed architecture. And yet the Elasticsearch user can be happily unaware of nearly all of the automation and complexity that supports this distributed design.
If you were to run most of the examples in any of our tutorials or those found in the Elastic documentation on either a single node or on a 50-node cluster, everything would function exactly the same.
In Elasticsearch, these delicate and often intensive operations occur automatically and imperceptibly:
- Partitioning your documents across an arrangement of distinct shards (containers)
- In a multi-node cluster, distributing the documents to shards that reside across all of the nodes
- Balancing shards across all nodes in a cluster to evenly manage the indexing and search load
- With replication, duplicating each shard to provide data redundancy and failover
- Routing requests from any node in the cluster to specific nodes containing the specific data that you need
- Seamlessly adding and integrating new nodes as you find the need to increase the size of your cluster
- Redistributing shards to automatically recover from the loss of a node
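The partitioning step in the list above can be approximated as a hash of a routing key modulo the number of primary shards. The sketch below is a simplification under stated assumptions: it uses CRC32 from the Python standard library as a stand-in hash (Elasticsearch uses a different hash function internally), and the function name `shard_for` is hypothetical.

```python
import zlib
from collections import Counter


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard for a document by hashing its routing key (by default, its ID)."""
    # CRC32 is only a stand-in; the real hash function differs.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5
doc_ids = [f"doc-{i}" for i in range(1000)]

# The same ID always lands on the same shard, so any node can route a request.
assignments = {doc_id: shard_for(doc_id, NUM_SHARDS) for doc_id in doc_ids}
distribution = Counter(assignments.values())

for shard in sorted(distribution):
    print(f"shard {shard}: {distribution[shard]} documents")
```

Because the shard is a pure function of the routing key, any node can compute where a document lives without consulting a central lookup table.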
Leveraging Built-in Data Analytics
Elasticsearch is not just a search engine but a powerful tool that can be used for data analytics. It supports the following features:
- Metrics aggregations. With Elasticsearch, you can retrieve various statistics such as weighted average, percentiles, min/max, etc. from your indexes. Also, you can enrich these aggregations with scripting.
- Bucket aggregations. Want to analyze distinct categories in your data separately or in comparison to each other? Elasticsearch bucket aggregations, such as histogram and range aggregations, make it easy to group your index data into buckets and see how different buckets compare. You can also apply metrics aggregations to each bucket to get some statistics.
- Pipeline aggregations. Pipeline aggregations work on the output produced by other aggregations, adding further metrics and statistics to them.
All these aggregations can be applied at search time in a single request, which makes it really fast to analyze data stored in the ES indexes. This makes ES a good fit for data analytics, monitoring, data mining, text analysis, and even complex machine learning tasks.
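The three kinds of aggregations can be combined in one request body, as in the sketch below. The field names `category` and `price` and the aggregation names are hypothetical. The `terms`, `avg`, and `max_bucket` aggregation types and the `buckets_path` syntax are standard.

```python
import json

body = {
    "size": 0,  # return only aggregation results, no individual hits
    "aggs": {
        # Bucket aggregation: one bucket per distinct category value.
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {
                # Metrics aggregation computed inside each bucket.
                "avg_price": {"avg": {"field": "price"}}
            },
        },
        # Pipeline aggregation: finds the bucket with the highest average price.
        "priciest_category": {
            "max_bucket": {"buckets_path": "by_category>avg_price"}
        },
    },
}

print(json.dumps(body, indent=2))
```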
Enabling Advanced Text Analysis and Processing
Elasticsearch is a full-text search engine that naturally requires advanced text analytics features to process text stored in Elasticsearch documents. Elasticsearch supports the following features for many languages using built-in tools and language-specific plugins:
- Analyzers. You can use ES analyzers to split text on whitespace and to find keywords, patterns, and specific characters. Elasticsearch ships with built-in analyzers for many languages, and additional language-specific analyzer plugins can help identify stems and morphological structures in linguistic data.
- Tokenizers. Tokenizers can split text into tokens such as words, characters, or stems. Elasticsearch also supports n-gram and edge n-gram tokenizers that can create a sliding window of continuous letters in words. These tokenizers can be used for “did-you-mean” suggesters, natural language processing models, and statistical language analysis.
- Token and character filters. These filters can be used to replace specific tokens and characters according to some rules. For example, the HTML strip character filter removes HTML elements from text and decodes HTML entities.
All these features can be used to analyze unstructured text (e.g., reviews, comments) and turn it into structured data that can be used in data analytics and machine learning pipelines.
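The sliding-window behavior of the n-gram and edge n-gram tokenizers mentioned above can be sketched in pure Python. The functions `ngrams` and `edge_ngrams` are illustrative stand-ins, not the Elasticsearch implementation, and the default gram lengths are chosen only for the example.

```python
def ngrams(word, min_gram=2, max_gram=3):
    """All substrings of length min_gram..max_gram, sliding across the word."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]


def edge_ngrams(word, min_gram=1, max_gram=5):
    """Prefixes of the word only, which is useful for search-as-you-type."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(ngrams("fox"))          # ['fo', 'fox', 'ox']
print(edge_ngrams("search"))  # ['s', 'se', 'sea', 'sear', 'searc']
```

Indexing these fragments lets a partially typed or slightly misspelled word still share tokens with the intended term.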
Index Lifecycle Management
Data-rich applications and event-based pipelines have the challenge of storing and managing huge volumes of data such as logs, metrics, website user actions, etc. Some of this data needs to be stored permanently and some just temporarily. To avoid over-utilization of storage and growing cloud costs, companies should have a flexible way of managing their indexes. Elasticsearch Index Lifecycle Management (ILM) was developed to address this concern. ILM allows creating automated policies to manage indices based on their performance and retention requirements. Using ILM, you can spin up new indices and delete stale indices to meet data retention standards. This feature is especially useful for managing time-series data such as metrics and logs.
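An ILM policy is itself a JSON document. The sketch below builds a small one as a Python dictionary. The specific thresholds (7 days, 50 GB, 30 days) are arbitrary values chosen for illustration. The `hot` and `delete` phases and the `rollover` and `delete` actions are standard ILM building blocks.

```python
import json

# Hypothetical retention policy for a log index:
# roll over to a new index weekly or at 50 GB per primary shard,
# then delete indices 30 days after rollover.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "7d",
                        "max_primary_shard_size": "50gb",
                    }
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(json.dumps(policy, indent=2))
```

A body like this would be registered under a policy name through the ILM API and then attached to indices, typically through an index template.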
Logs and Processing
Fast and granular text search, advanced aggregation features, and mature cluster distribution and sharding capabilities make Elasticsearch a great solution for storing and processing logs and metrics. Elasticsearch can be easily integrated with log aggregators and shippers such as Logstash to ingest logs from different applications and sources. Preprocessed logs can then be aggregated and analyzed using Elasticsearch DSL queries, filters, and other features discussed above. Also, Elasticsearch can be integrated with Kibana to visualize metrics and data, providing cluster administrators and data analysts with an excellent representation of cluster and application state.
Sharing our Battle-Hardened Expertise
We invite you to learn more in our extensive help library, which you can find in our Support Center and throughout our blog. Here is the short list that we recommend to those who are relatively new to Elasticsearch. You’ll find a number of examples throughout.
We’re always ready to help you achieve maximum success with your Elasticsearch environment, and we hope that you find this article helpful. We welcome your comments below.
If you like this article, consider using the Qbox hosted Elasticsearch service. It’s stable and more affordable, and we offer top-notch free 24/7 support. Sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”