Before setting up Elasticsearch to perform entity extraction, it is worth looking at how this became such an easy task. There is a lot of buzz around the new Ingest API shipped with Elasticsearch 5.x.

The Ingest API allows data manipulation and enrichment by defining a pipeline through which every document passes. The pipeline is built from a set of processors, each of which performs a specific task that enriches the data. A typical example is the grok processor, which uses pattern matching to turn unstructured log lines into structured fields. Elasticsearch 5 ships with many built-in processors, which you can read about here.
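As a rough illustration of what such a pipeline looks like, the sketch below builds the JSON body for a pipeline with a single grok processor. The helper name `build_grok_pipeline`, the `message` field, and the sample pattern are assumptions made for this example; only the overall shape (a `processors` list containing a `grok` entry with `field` and `patterns`) follows the Ingest API.

```python
import json


def build_grok_pipeline(field, pattern, description=""):
    """Return the JSON body for an ingest pipeline with one grok processor.

    `field` is the document field holding the raw text and `pattern` is a
    grok expression; both are supplied by the caller (hypothetical values
    are used below).
    """
    return {
        "description": description,
        "processors": [
            {"grok": {"field": field, "patterns": [pattern]}}
        ],
    }


# Hypothetical example: split an access-log style line into named fields.
pipeline = build_grok_pipeline(
    field="message",
    pattern="%{IP:client} %{WORD:method} %{URIPATHPARAM:request}",
    description="Structure a raw access-log line",
)

print(json.dumps(pipeline, indent=2))
```

The printed body is what would be sent in a `PUT _ingest/pipeline/<pipeline-id>` request; the pipeline id itself is up to you.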

Keep reading

We recently announced Qbox-hosted ElastAlert, the superb open-source alerting tool built by the team at Yelp Engineering, now available on all new Elasticsearch clusters on AWS.

Most organizations use the ELK Stack to manage their ever-increasing volume of data and logs. Kibana is great for visualizing and querying data, but it needs a companion tool like ElastAlert for alerting on inconsistencies, anomalies, spikes, or other patterns of interest in Elasticsearch data.

Keep reading

Stemming, in linguistic morphology and information retrieval science, is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.

Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
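To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the algorithm used by any Elasticsearch stemmer; the suffix list and the minimum stem length are assumptions chosen only so that the three inflected forms mentioned above collapse to one stem.

```python
# Toy suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "es", "s", "e")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {naive_stem(w) for w in ("organize", "organizes", "organizing")}
print(stems)
```

All three forms reduce to the same stem, so a search for one would match documents containing the others. Derivational families such as democracy, democratic, and democratization need more elaborate rules than this toy handles.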

Keep reading

In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc. 

Stop words are a fundamental part of core Information Retrieval. Common wisdom dictates that we should identify and remove stop words from our index. This improves both performance (fewer terms in your dictionary) and the relevance of search results.

Stop words are the most frequent words in the English language. Thus, most sentences share a similar percentage of stop words. Stop words bloat the index without providing any extra value.

If they are both common and lacking in much useful information, why not remove them?

Removing stop words helps decrease the size of the index as well as the size of the query. Fewer terms are always a win with regard to performance, and since stop words are semantically empty, relevance scores are unaffected.
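A minimal sketch of stop word removal, assuming whitespace tokenization and a tiny hand-picked stop list taken from the examples above (real analyzers ship much longer, language-specific lists):

```python
# Illustrative stop list only; built from the example words in the text.
STOP_WORDS = {"to", "i", "has", "the", "be", "or"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


remaining = remove_stop_words("The fox has to be quick")
print(remaining)  # ['fox', 'quick']
```

Six input tokens shrink to two, which is the index and query size reduction described above.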

Keep reading

Redis, the popular open-source in-memory data store, can also persist data to disk and supports a variety of data structures such as lists, sets, sorted sets (with range queries), strings, geospatial indexes (with radius queries), bitmaps, hashes, and HyperLogLogs. The in-memory store is used to solve various problems in areas such as real-time messaging, caching, and statistics calculation.

Provisioning an Elasticsearch cluster in Qbox is easy. In this article, we walk you through the initial steps to start and configure your cluster. We then set up and configure Logstash to ship the logs to Elasticsearch in order to monitor Redis performance. Redis performance logs shipped to Elasticsearch can then be visualized and analyzed via Kibana dashboards.

Keep reading

A comprehensive log management and analysis strategy is vital, enabling organizations to understand the relationship between operational, security, and change management events and maintain a comprehensive understanding of their infrastructure. Log files from web servers, applications, and operating systems also provide valuable data, though in different formats, and in a random and distributed fashion.

No real-world web application can exist without a data storage backend, and most applications today use relational database management systems (RDBMS) for storing and managing data. One of the most commonly used databases is MySQL, an open-source RDBMS that is the ‘M’ in the open-source enterprise LAMP Stack (Linux, Apache, MySQL and PHP).

Medium and large-sized applications send multiple database queries per second, and slow queries are often the cause of slow page loading and even crashes. Analyzing query performance is critical to determining the root cause of these bottlenecks, and most databases come with built-in profiling tools to help us.
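As one hedged example of this kind of analysis, the sketch below pulls timing statistics out of the `# Query_time: ...` header line that MySQL writes to its slow query log. The sample lines are made up for illustration, the helper names are hypothetical, and the exact header layout can vary between MySQL and MariaDB versions, so treat the regular expression as an assumption to check against your own logs.

```python
import re

# Matches the statistics header of a slow query log entry (assumed layout).
QUERY_STATS = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+"
    r"Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+"
    r"Rows_examined: (?P<rows_examined>\d+)"
)


def parse_stats_line(line):
    """Return the statistics as a dict, or None if the line does not match."""
    match = QUERY_STATS.search(line)
    if match is None:
        return None
    return {
        "query_time": float(match.group("query_time")),
        "lock_time": float(match.group("lock_time")),
        "rows_sent": int(match.group("rows_sent")),
        "rows_examined": int(match.group("rows_examined")),
    }


def slow_entries(lines, threshold_seconds=1.0):
    """Keep only the parsed entries whose query time exceeds the threshold."""
    parsed = (parse_stats_line(line) for line in lines)
    return [p for p in parsed if p and p["query_time"] > threshold_seconds]


# Made-up sample lines, not real log output.
sample = [
    "# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 50000",
    "# Query_time: 0.000352  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 1",
    "SELECT * FROM orders WHERE status = 'open';",
]

slow = slow_entries(sample)
print(slow)
```

In a Logstash setup the same extraction would typically be done with a grok filter; the Python version is only meant to show which fields are worth capturing.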

Provisioning an Elasticsearch cluster in Qbox is easy. In this article, we walk you through the initial steps and show you how simple it is to start and configure your cluster. We then install and configure Logstash to ship our MySQL or MariaDB/Galera logs to Elasticsearch. MySQL logs shipped to Elasticsearch can then be visualized and analyzed via Kibana dashboards.

Keep reading