Tutorial Series: Anatomy of Elasticsearch Analysers
Stemming, in linguistic morphology and information retrieval science, is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.
Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
Algorithmic stemmers apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don’t have to know about individual words in order to stem them. Dictionary stemmers work differently.
Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer.
A dictionary stemmer should be able to return the correct root word for irregular forms such as feet and mice. Additionally, it must be able to recognize the distinction between words that are similar but have different word senses, for example, organ and organization.
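To make the contrast concrete, here is a minimal sketch of both approaches. The suffix rules and the tiny lookup table are illustrative assumptions made up for this example; they are not the Porter or Snowball rules, and not a real Hunspell dictionary.

```python
# Toy suffix rules, tried in order. These are illustrative only and are
# NOT the rule set of any real stemmer such as Porter or Snowball.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]

# Hypothetical hand-written lookup table standing in for a real dictionary.
LEMMA_DICTIONARY = {"feet": "foot", "mice": "mouse", "cats": "cat"}


def algorithmic_stem(word: str) -> str:
    """Apply the first matching suffix rule; knows nothing about the word itself."""
    for suffix, replacement in SUFFIX_RULES:
        # The length guard keeps very short words such as "is" intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def dictionary_stem(word: str) -> str:
    """Look the word up; return it unchanged if the dictionary does not know it."""
    return LEMMA_DICTIONARY.get(word, word)
```

With these definitions the rule-based stemmer handles regular plurals such as cats and boxes but leaves feet and mice untouched, while the lookup returns foot and mouse. The lookup is only as good as its dictionary: any word missing from the table comes back unchanged.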
Elasticsearch provides dictionary-based stemming via the Hunspell token filter. Hunspell is the spell checker used by OpenOffice, LibreOffice, Chrome, Firefox, Thunderbird, and many other open and closed source projects.
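As a rough sketch, index settings that wire a Hunspell filter into a custom analyzer might look like the following, built here as a Python dictionary and serialized to JSON. The filter name `en_US_hunspell`, the analyzer name `en_hunspell`, and the `en_US` locale are assumptions chosen for illustration, and the sketch also assumes the matching Hunspell dictionary files have already been installed on each node.

```python
import json

# Hypothetical names: "en_US_hunspell" and "en_hunspell" are made up for this
# example. The "hunspell" filter type and its "locale" option are part of
# Elasticsearch; the en_US dictionary files are assumed to be installed.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "en_US_hunspell": {"type": "hunspell", "locale": "en_US"}
            },
            "analyzer": {
                "en_hunspell": {
                    "tokenizer": "standard",
                    # Lowercase first so lookups match dictionary entries.
                    "filter": ["lowercase", "en_US_hunspell"],
                }
            },
        }
    }
}

# JSON body that could be sent when creating an index.
settings_json = json.dumps(index_settings, indent=2)
```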
In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc.
Stop words are a fundamental part of core Information Retrieval. Common wisdom dictates that we should identify and remove stop words from our index. This is meant to improve both performance (fewer terms in your dictionary) and the relevance of search results.
Stop words are among the most frequent words in the English language. Thus, most sentences share a similar percentage of stop words. Stop words bloat the index without providing much extra value.
If they are both common and lacking in much useful information, why not remove them?
Removing stop words helps decrease the size of the index as well as the size of the query. Fewer terms generally help performance, and because stop words carry so little meaning, relevance scores are largely unaffected.
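Stop word removal itself is simple to sketch. The example below assumes a tiny stop list taken from the words mentioned earlier and plain whitespace tokenization; real analyzers use longer, language-specific lists and a proper tokenizer.

```python
from typing import List

# Small illustrative stop list built from the examples in the text.
# Real analyzers ship longer, language-specific lists.
STOP_WORDS = {"to", "i", "has", "the", "be", "or"}


def remove_stop_words(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop any token in the stop list."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]
```

For instance, `remove_stop_words("The cat has to be fed")` keeps only `cat` and `fed`, so two terms reach the index instead of six.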
Computers, fundamentally, just deal with numbers. They store letters and other characters by assigning a number for each one.
Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters. For example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding is adequate for all the letters, punctuation, and technical symbols in common use.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is the single universal character set for text that enables the interchange, processing, storage and display of text in many languages. The Unicode standard serves as a foundation for the globalization of modern software. Using software that supports Unicode to develop or run business applications can reduce development and deployment time and costs, making it possible to expand into new markets more quickly.
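A short sketch using Python's built-in string handling illustrates both points: each character has one code point, and a legacy single-byte encoding cannot hold every character. The helper names `code_points` and `fits_in` are made up for this example.

```python
from typing import List


def code_points(text: str) -> List[str]:
    """Return the Unicode code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def fits_in(text: str, encoding: str) -> bool:
    """Report whether every character of text can be stored in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The euro sign has a single code point, stored as three bytes in UTF-8.
euro_utf8 = "€".encode("utf-8")
```

Here `code_points("A€")` gives `U+0041` and `U+20AC`. The call `fits_in("café", "latin-1")` succeeds, while `fits_in("€", "latin-1")` fails because Latin-1 has no slot for the euro sign; UTF-8 can encode both.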
A fuzzy search is a process that locates web pages or documents that are likely to be relevant to a search argument even when the argument does not exactly correspond to the desired information.
A fuzzy search is done by means of a fuzzy matching query, which returns a list of results based on likely relevance even though search argument words and spellings may not exactly match. Exact and highly relevant matches appear near the top of the list.
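One common way to implement this kind of matching is to rank candidates by edit distance. The sketch below uses plain Levenshtein distance, which is an assumption made for illustration; the text above does not say which measure a particular engine uses, and the function names and the `max_distance` cutoff are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def fuzzy_search(query: str, candidates: List[str], max_distance: int = 2) -> List[str]:
    """Return candidates within max_distance edits, closest matches first."""
    scored = [(edit_distance(query, c), c) for c in candidates]
    hits = [(dist, c) for dist, c in scored if dist <= max_distance]
    return [c for dist, c in sorted(hits)]
```

Searching for the misspelling `elastik` among `elastic`, `plastic`, and `search` returns `elastic` first (one edit away), then `plastic` (two edits), and drops `search`, mirroring the idea that the closest matches appear near the top of the list.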
Text segmentation is critical for search. It is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
Word segmentation is the task of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). Cases where the space character alone may not be sufficient include contractions like won’t for will not.
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
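The limits of space-based segmentation are easy to demonstrate. The sketch below splits on whitespace and expands contractions from a tiny, hypothetical lookup table; the function names and the table are assumptions for illustration, and the Chinese sample sentence is simply an arbitrary string without spaces.

```python
from typing import List

# Hypothetical, deliberately tiny contraction table (both apostrophe styles).
CONTRACTIONS = {
    "won't": ["will", "not"],
    "won’t": ["will", "not"],
}


def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace, the simplest possible word segmenter."""
    return text.split()


def expand_contractions(tokens: List[str]) -> List[str]:
    """Replace known contractions with their component words."""
    expanded: List[str] = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


english = whitespace_tokens("I won’t go")
chinese = whitespace_tokens("我喜欢搜索引擎")
```

The English sentence splits into three tokens, and `expand_contractions(english)` turns them into four words. The Chinese sentence comes back as a single token because it contains no spaces, which is why such languages need a dedicated segmenter rather than a whitespace split.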