Text segmentation has always been critical from a search perspective. It is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to the mental processes humans use when reading text and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial: while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial, and final letter shapes of Arabic, such signals are sometimes ambiguous and are not present in all written languages.

Word segmentation is the task of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). Some examples where the space character alone may not be sufficient include contractions like won’t for will not.

However, an equivalent character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

Delimiters represent one of various means to specify boundaries in a data stream. Declarative notation, for example, is an alternate method that uses a length field at the start of a data stream to specify the number of characters that the data stream contains.

The Word Delimiter Token Filter in Elasticsearch is a big step toward segmenting text into meaningful tokens. It splits words into subwords and performs optional transformations on subword groups. One use of the Word Delimiter Filter is to help match words written with different delimiters. One way of doing so is to specify generate_word_parts="1" and catenate_words="1" in the analyzer used for indexing, and generate_word_parts="1" in the analyzer used for querying. Given that the current Standard Tokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as the Whitespace Tokenizer).
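As a rough sketch of that recommendation (the index, analyzer, and filter names below are placeholders, not something defined elsewhere in this post), you could pair the whitespace tokenizer with two word_delimiter filters: one for indexing with generate_word_parts and catenate_words enabled, and one for querying with only generate_word_parts:

# Hypothetical index illustrating separate index-time and search-time word_delimiter filters
curl -XPOST 'localhost:9200/wdf_demo_index/' -d '{
 "settings" : {
   "analysis" : {
     "filter" : {
       "wdf_index" : {
         "type" : "word_delimiter",
         "generate_word_parts" : true,
         "catenate_words" : true
       },
       "wdf_search" : {
         "type" : "word_delimiter",
         "generate_word_parts" : true,
         "catenate_words" : false
       }
     },
     "analyzer" : {
       "wdf_index_analyzer" : {
         "type" : "custom",
         "tokenizer" : "whitespace",
         "filter" : [ "wdf_index" ]
       },
       "wdf_search_analyzer" : {
         "type" : "custom",
         "tokenizer" : "whitespace",
         "filter" : [ "wdf_search" ]
       }
     }
   }
 }
}'

A field that should behave this way would then reference wdf_index_analyzer as its analyzer and wdf_search_analyzer as its search_analyzer in the mapping.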

By default, words are split into subwords with the following rules:

  • Split on intra-word delimiters (all non-alphanumeric characters), e.g. "Wi-Fi" -> "Wi", "Fi"
  • Split on case transitions (can be turned off – see the split_on_case_change parameter), e.g. "PowerShot" -> "Power", "Shot"
  • Split on letter-number transitions (can be turned off – see the split_on_numerics parameter), e.g. "SD500" -> "SD", "500"
  • Leading and trailing intra-word delimiters on each subword are ignored, e.g. "//hello---there, 'dude'" -> "hello", "there", "dude"
  • Trailing "'s" is removed from each subword (can be turned off – see the stem_english_possessive parameter), e.g. "O'Neil's" -> "O", "Neil"

Splitting is affected by the following parameters:

  • generate_word_parts – If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
  • generate_number_parts – If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
  • catenate_words – If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
  • catenate_numbers – If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
  • catenate_all – If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
  • split_on_case_change – If true causes “PowerShot” to be two tokens ("Power-Shot" remains two parts regardless). Defaults to true.
  • preserve_original – If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
  • split_on_numerics – If true causes “j2se” to be three tokens; "j" "2" "se". Defaults to true.
  • stem_english_possessive – If true causes trailing “‘s” to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
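To make a couple of the less obvious flags concrete, here is a small, purely illustrative sketch (the index, analyzer, and filter names are made up): with preserve_original and catenate_numbers enabled on top of the defaults, "500-42" yields the original token as well as its parts and the catenated number.

# Hypothetical index for experimenting with preserve_original and catenate_numbers
curl -XPOST 'localhost:9200/wdf_params_index/' -d '{
 "settings" : {
   "analysis" : {
     "filter" : {
       "wdf_preserve" : {
         "type" : "word_delimiter",
         "preserve_original" : true,
         "catenate_numbers" : true
       }
     },
     "analyzer" : {
       "wdf_preserve_analyzer" : {
         "type" : "custom",
         "tokenizer" : "whitespace",
         "filter" : [ "wdf_preserve" ]
       }
     }
   }
 }
}'

# Expected tokens: "500-42" (original), "500", "42", "50042"
curl -XGET 'localhost:9200/wdf_params_index/_analyze?analyzer=wdf_preserve_analyzer&pretty=true' -d '500-42'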

Illustration of the Word Delimiter Token Filter using generate_word_parts="1" and catenate_words="1":

"PowerShot" => 
0:"Power", 1:"Shot" 1:"PowerShot"(where 0,1,1 are token positions)
"A's+B's&C's" => 
0:"A", 1:"B", 2:"C", 2:"ABC"
"Super-Duper-XL500-42-AutoCoder!" => 
0:"Super",
1:"Duper",
2:"XL",
2:"SuperDuperXL",
3:"500"
4:"42",
5:"Auto",
6:"Coder",
6:"AutoCoder"

Let’s consider a simple use case: removing white spaces and underscores/hyphens from a given input string using the Word Delimiter Token Filter.

curl -XPOST 'localhost:9200/word_delimiter_test_index/' -d '{
 "settings" : {
   "analysis" : {
     "analyzer" : {
       "word_delimiter_custom_analyser" : {
         "type" : "custom",
         "tokenizer" : "keyword",
         "filter" : [ "replace-whitespaces", "truncate_underscore" ]
       }
     },
     "filter" : {
       "replace-whitespaces" : {
         "type" : "pattern_replace",
         "pattern" : "\\s+",
         "replacement" : "_"
       },
       "truncate_underscore" : {
         "type" : "word_delimiter",
         "catenate_all" : true,
         "split_on_case_change" : false,
         "split_on_numerics" : false,
         "generate_word_parts" : false,
         "generate_number_parts" : false
       }
     }
   }
 }
}'

The index word_delimiter_test_index defines a custom analyser named word_delimiter_custom_analyser. It is composed of the keyword tokenizer and two token filters: replace-whitespaces and truncate_underscore.

The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term.

The pattern_replace token filter makes it easy to handle string replacements based on a regular expression. The regular expression is defined using the pattern parameter, and the replacement string can be provided using the replacement parameter. Here, replace-whitespaces replaces white spaces with underscores.
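If you want to see what replace-whitespaces emits before the word delimiter filter runs, one option (purely illustrative, not part of the index above) is a throwaway index whose analyzer stops after the pattern_replace step:

# Throwaway index: keyword tokenizer plus only the pattern_replace filter
curl -XPOST 'localhost:9200/pattern_debug_index/' -d '{
 "settings" : {
   "analysis" : {
     "analyzer" : {
       "whitespace_to_underscore" : {
         "type" : "custom",
         "tokenizer" : "keyword",
         "filter" : [ "replace-whitespaces" ]
       }
     },
     "filter" : {
       "replace-whitespaces" : {
         "type" : "pattern_replace",
         "pattern" : "\\s+",
         "replacement" : "_"
       }
     }
   }
 }
}'

# "Qbox - is awesome _" should come back as the single token "Qbox_-_is_awesome__"
curl -XGET 'localhost:9200/pattern_debug_index/_analyze?analyzer=whitespace_to_underscore&pretty=true' -d 'Qbox - is awesome _'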

The truncate_underscore filter is a word_delimiter token filter with the following settings:

"truncate_underscore" : {
    "type" : "word_delimiter",
    "catenate_all" : true,
    "split_on_case_change" : false,
    "split_on_numerics" : false,
    "generate_word_parts" : false,
    "generate_number_parts" : false
}

Let’s now test a few string literals against our word_delimiter_test_index:

  • Qbox - is awesome _
curl -XGET 'localhost:9200/word_delimiter_test_index/_analyze?analyzer=word_delimiter_custom_analyser&pretty=true' -d 'Qbox - is awesome _'
{
 "tokens" : [ {
   "token" : "Qboxisawesome",
   "start_offset" : 0,
   "end_offset" : 22,
   "type" : "word",
   "position" : 0
 } ]
}
  • Qbox - provides _ fully managed Elasticsearch-Hosting Service_
curl -XGET 'localhost:9200/word_delimiter_test_index/_analyze?analyzer=word_delimiter_custom_analyser&pretty=true' -d 'Qbox - provides _ fully managed Elasticsearch-Hosting Service_'
{
 "tokens" : [ {
   "token" : "QboxprovidesfullymanagedElasticsearchHostingService",
   "start_offset" : 0,
   "end_offset" : 59,
   "type" : "word",
   "position" : 0
 } ]
}

Word Delimiter Token Filter advanced settings include:

1. protected_words – A list of words protected from being delimited. It can be provided either as an array, or via protected_words_path, which resolves to a file containing the protected words (one per line).

2. type_table – A custom type mapping table. It is configured using type_table_path.

The path is relative to {ES_HOME}/config/.

# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT

The advanced settings can be configured as follows:

"word_delimiter_filter": { 
    "type": "word_delimiter", 
    "split_on_numerics": true, 
    "split_on_case_change": false, 
    "generate_number_parts": true, 
    "catenate_words": false, 
    "generate_word_parts": true, 
    "catenate_all": false, 
    "protected_words_path": "analysis/protected_words_path.txt",
    "type_table_path": "analysis/type_table.txt" 
}
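Both advanced settings can also be supplied inline rather than through files. The fragment below is only a sketch (the filter name and word list are placeholders); protected_words takes an array of words to leave untouched, and type_table takes the same mappings as the file shown earlier:

"word_delimiter_inline_filter": {
    "type": "word_delimiter",
    "catenate_words": true,
    "protected_words": [ "wi-fi", "j2se" ],
    "type_table": [ "$ => DIGIT", "% => DIGIT", ". => DIGIT", "\\u002C => DIGIT" ]
}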

Segmentation

The Word Delimiter Token Filter contributes to text segmentation. A related problem is sentence segmentation: dividing a string of written language into its component sentences. In English and some other languages, punctuation, particularly the full stop/period character, is a reasonable approximation. However, even in English this problem is not trivial because the full stop is also used for abbreviations, which may or may not terminate a sentence. Segmenting text into topics or discourse units is useful in many natural language processing tasks. It can improve information retrieval or speech recognition significantly, by indexing and recognising documents more precisely or by returning the specific part of a document corresponding to the query. It is also needed in topic detection and tracking systems and in text summarisation.

Give it a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.