Stopwords Filtering and Precision in Elasticsearch
Posted by Adam Vanderbush May 10, 2017
In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc.
Stop words are a fundamental part of core Information Retrieval. Common wisdom dictates that we should identify and remove stop words from our index. This both improves performance (fewer terms in your dictionary) and yields more relevant search results.
Stop words are among the most frequent words in the English language, so most sentences contain a similar proportion of them. As a result, they bloat the index without providing any extra value.
If they are both common and lacking in much useful information, why not remove them?
Removing stop words helps decrease the size of the index as well as the size of the query. Fewer terms are always a win with regard to performance, and since stop words are semantically empty, relevance scores are unaffected.
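To make the size argument concrete, here is a minimal sketch in plain Python of a toy inverted index built with and without stop word filtering. It is not how Elasticsearch is implemented. The stop list, the sample documents, and the helper names (tokenize, build_index, posting_count, filter_query) are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop list taken from the examples in this post;
# this is an assumption, not Elasticsearch's built-in English list.
STOP_WORDS = frozenset({"to", "i", "has", "the", "be", "or"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            if term not in stop_words:
                index[term].add(doc_id)
    return index


def posting_count(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


def filter_query(query, stop_words=STOP_WORDS):
    """Drop stop words from a query, keeping the remaining terms in order."""
    return [t for t in tokenize(query) if t not in stop_words]


# Hypothetical sample documents.
docs = [
    "The cat has to be fed",
    "I feed the cat or the dog",
    "The dog has to be walked",
]

full = build_index(docs)
filtered = build_index(docs, STOP_WORDS)

print("unfiltered:", len(full), "terms,", posting_count(full), "postings")
print("filtered:  ", len(filtered), "terms,", posting_count(filtered), "postings")
print("query terms:", filter_query("the cat or the dog"))
```

On this toy corpus the dictionary shrinks from 11 terms to 5 and the postings from 18 to 7, while the query "the cat or the dog" is reduced to the two terms "cat" and "dog". Real corpora will show different ratios, but the direction of the effect is the same.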