In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc. 

Stop words are a fundamental part of Information Retrieval. Common wisdom dictates that we should identify and remove stop words from our index. This improves both performance (fewer terms in your dictionary) and the relevance of search results.

Stop words are the most frequent words in the English language, so most sentences contain a similar proportion of them. Because they appear in nearly every document, they do little to distinguish one document from another; they bloat the index without providing any extra value.

If they are both common and lacking in much useful information, why not remove them?

Removing stop words decreases the size of both the index and the query. Fewer terms are always a win for performance, and since stop words are semantically empty, relevance scores are unaffected.
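
To make that concrete, here is a minimal sketch in plain Python (not tied to any particular search engine; the stop-word list and the tokenizer are simplified stand-ins for what a real analyzer would use):

```python
# A naive illustration of stop-word removal before indexing.
STOP_WORDS = {"to", "i", "has", "the", "be", "or", "a", "of", "and", "in"}

def tokenize(text):
    # Toy whitespace tokenizer; real analyzers also handle punctuation, case, etc.
    return [token.strip(".,!?").lower() for token in text.split()]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The index has to be small, or queries slow down.")
print(tokens)                     # every token, stop words included
print(remove_stop_words(tokens))  # only the terms worth indexing
```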

Keep reading

Computers, fundamentally, just deal with numbers. They store letters and other characters by assigning a number for each one. 

Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers, and no single encoding could contain enough characters. For example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding is adequate for all the letters, punctuation, and technical symbols in common use.

Unicode provides a unique number for every character, no matter the platform, the program, or the language. It is the single universal character set for text, enabling the interchange, processing, storage, and display of text in many languages, and it serves as a foundation for the globalization of modern software. Developing or running business applications on software that supports Unicode reduces development and deployment time and cost, making it possible to expand into new markets more quickly.
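
As a quick illustration of the "unique number for every character" idea (plain Python, not part of the standard itself):

```python
# Each character has exactly one Unicode code point; an encoding such as
# UTF-8 then decides how that number is stored as bytes.
for ch in ["A", "é", "€", "日"]:
    print(f"U+{ord(ch):04X}  {ch!r} -> {ch.encode('utf-8')!r}")
```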

Keep reading

Analyzers are made up of two main components: a tokenizer and a set of token filters. The tokenizer splits text into tokens according to some set of rules, and the token filters each perform operations on those tokens. The result is a stream of processed tokens, which are either stored in the index or used at query time.
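
A toy version of that pipeline in plain Python might look like the following (it mirrors the structure described above, not any engine's actual implementation; the tokenizer and filter names are placeholders):

```python
def whitespace_tokenizer(text):
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a", "of"})):
    return [t for t in tokens if t not in stop_words]

def analyze(text, tokenizer, token_filters):
    # One tokenizer, then each token filter in turn.
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The Quick Fox of Norway",
              whitespace_tokenizer,
              [lowercase_filter, stop_filter]))
# ['quick', 'fox', 'norway']
```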

Keep reading

A fuzzy search is a process that locates web pages or documents that are likely to be relevant to a search argument even when the argument does not exactly correspond to the desired information. 

A fuzzy search is done by means of a fuzzy matching query, which returns a list of results based on likely relevance even though search argument words and spellings may not exactly match. Exact and highly relevant matches appear near the top of the list.
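
As a rough sketch of the idea, Python's standard library can already do approximate string matching (an illustration of fuzzy matching in general, not of how any particular search engine implements its fuzzy queries):

```python
from difflib import get_close_matches

# Candidate terms, e.g. words drawn from an index.
index_terms = ["search", "sparse", "speech", "merge", "research"]

# Misspelled queries still return likely matches, best matches first.
for query in ["serch", "saerch", "merg"]:
    print(query, "->", get_close_matches(query, index_terms, n=3, cutoff=0.6))
```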

Keep reading

In this post, we discuss Elasticsearch analyzers. Creating and configuring analyzers is a key step in improving search efficiency. They are used both when adding documents to the index and when searching.

The main goal of any analyzer is to take a stream of characters, often cluttered with unnecessary detail, squeeze out the needed information, and produce a list of tokens that reflects it. Let's look at an analyzer's structure.
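
As a sketch of that structure, a custom analyzer in Elasticsearch is declared in the index settings as optional character filters, one tokenizer, and a chain of token filters (shown here as a Python dict; the analyzer name is a placeholder):

```python
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {                      # placeholder name
                    "type": "custom",
                    "char_filter": ["html_strip"],    # clean the raw characters
                    "tokenizer": "standard",          # split into tokens
                    "filter": ["lowercase", "stop"],  # transform or drop tokens
                }
            }
        }
    }
}
```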

Keep reading

The last two blogs in the analyzer series covered a lot of ground, from the basics of analyzers to building a custom analyzer for our purpose out of multiple elements. In this blog, we are going to look at a few special tokenizers, like the email-link tokenizer, and token filters, like the edge-n-gram and phonetic token filters.

These tokenizers and filters provide functionality that is immensely beneficial in making our search more precise.
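
For a feel of what an edge-n-gram filter produces, here is a tiny plain-Python stand-in (the real filter runs inside the analysis chain; the min/max gram values here are arbitrary):

```python
def edge_ngrams(token, min_gram=2, max_gram=5):
    # All prefixes of the token between min_gram and max_gram characters long,
    # which is what makes prefix / search-as-you-type matching possible.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("quick"))    # ['qu', 'qui', 'quic', 'quick']
print(edge_ngrams("precise"))  # ['pr', 'pre', 'prec', 'preci']
```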

Keep reading