NEST by Elastic and contributors

<PackageReference Include="NEST" Version="7.13.2" />


 AnalyzeTokenizersSelector

A tokenizer of type edgeNGram.
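
For instance, an edge n-gram tokenizer is usually registered in the index settings and wired into a custom analyzer. A minimal NEST sketch, where the index name `products`, the tokenizer name `autocomplete_edge`, and the gram sizes are illustrative assumptions rather than anything prescribed by the API:

```csharp
using Nest;

var client = new ElasticClient(); // defaults to http://localhost:9200

var createResponse = client.Indices.Create("products", c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizers(t => t
                // Emits prefixes of each word, e.g. "qu", "qui", "quic", "quick".
                .EdgeNGram("autocomplete_edge", e => e
                    .MinGram(2)
                    .MaxGram(10)
                    .TokenChars(TokenChar.Letter, TokenChar.Digit)))
            .Analyzers(an => an
                .Custom("autocomplete", ca => ca
                    .Tokenizer("autocomplete_edge")
                    .Filters("lowercase"))))));
```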

A tokenizer of type icu_tokenizer. Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables. Part of the `analysis-icu` plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
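
A sketch of trying the ICU tokenizer out through the _analyze API, assuming the `analysis-icu` plugin is installed on the node and that the analyze descriptor's inline-tokenizer overload (which hands you this selector) is used; the sample text is arbitrary:

```csharp
using System;
using Nest;

var client = new ElasticClient();

// Requires the analysis-icu plugin on the Elasticsearch node.
var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Icu(i => i))
    .Text("Hello สวัสดี 世界"));

foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```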

A tokenizer of type keyword that emits the entire input as a single token.
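
A quick sketch of observing that behaviour via the _analyze API (again assuming the inline tokenizer selector; the text is arbitrary):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Keyword(k => k))
    .Text("New York City"));

// Expect exactly one token: "New York City"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```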

A tokenizer of type kuromoji_tokenizer that uses morphological analysis to tokenize Japanese text. Part of the `analysis-kuromoji` plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji.html
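
A sketch of registering it in index settings, assuming the `analysis-kuromoji` plugin is installed; the index, tokenizer, and analyzer names are made up:

```csharp
using Nest;

var client = new ElasticClient();

// Requires the analysis-kuromoji plugin on the Elasticsearch node.
var response = client.Indices.Create("japanese_docs", c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizers(t => t
                .Kuromoji("ja_tokenizer", k => k))
            .Analyzers(an => an
                .Custom("ja_text", ca => ca
                    .Tokenizer("ja_tokenizer")
                    .Filters("lowercase"))))));
```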

A tokenizer of type letter that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters.

Note that this does a decent job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
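
For example, an apostrophe is a non-letter, so the letter tokenizer splits a contraction in two. A sketch using the _analyze API with an inline definition (again via the inline tokenizer selector):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Letter(l => l))
    .Text("You're welcome"));

// Expect: "You", "re", "welcome"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```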

A tokenizer of type lowercase that performs the function of Letter Tokenizer and Lower Case Token Filter together.

It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of Letter Tokenizer and Lower Case Token Filter, there is a performance advantage to doing the two tasks in a single pass, hence this (redundant) implementation.
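
To make that equivalence concrete, a sketch (index and analyzer names are invented) that defines two custom analyzers, one built on the lowercase tokenizer alone and one combining the letter tokenizer with the lowercase token filter; both should emit identical tokens:

```csharp
using Nest;

var client = new ElasticClient();

var response = client.Indices.Create("lowercase_demo", c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(an => an
                // Single pass: the lowercase tokenizer does both jobs at once.
                .Custom("one_step", ca => ca
                    .Tokenizer("lowercase"))
                // Two passes: letter tokenizer, then the lowercase token filter.
                .Custom("two_step", ca => ca
                    .Tokenizer("letter")
                    .Filters("lowercase"))))));
```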

A tokenizer of type nGram.
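
Unlike edgeNGram, it emits character n-grams from every position in a word, not just prefixes. A sketch via the _analyze API (gram sizes are illustrative; uses the inline tokenizer selector as above):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.NGram(n => n.MinGram(2).MaxGram(2)))
    .Text("fox"));

// Expect: "fo", "ox"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```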

The path_hierarchy tokenizer takes something like this:

/something/something/else

And produces tokens:

/something

/something/something

/something/something/else
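
A sketch reproducing exactly that output through the _analyze API ('/' is already the default delimiter, so no extra configuration is needed; uses the inline tokenizer selector as above):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.PathHierarchy(p => p))
    .Text("/something/something/else"));

// Expect: "/something", "/something/something", "/something/something/else"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```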

A tokenizer of type pattern that can flexibly separate text into terms via a regular expression.
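
The default pattern splits on runs of non-word characters, but any regular expression can be supplied. A sketch that splits a comma-separated list (inline tokenizer selector again):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Pattern(p => p.Pattern(",")))
    .Text("red,green,blue"));

// Expect: "red", "green", "blue"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```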

A tokenizer of type standard that provides grammar-based tokenization and is a good choice for most European-language documents.

The tokenizer implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
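
A sketch of what that looks like on text containing an email address (inline tokenizer selector as above); compare it with the uax_url_email sketch below:

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Standard(sd => sd))
    .Text("Email me at john.smith@example.com"));

// The address is broken into pieces rather than kept as a single token.
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```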

A tokenizer of type uax_url_email that works exactly like the standard tokenizer, but tokenizes emails and URLs as single tokens.
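
The same text as in the standard sketch above, run through uax_url_email, keeps the address intact (inline tokenizer selector again):

```csharp
using System;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.UaxEmailUrl(u => u))
    .Text("Email me at john.smith@example.com"));

// Expect: "Email", "me", "at", "john.smith@example.com"
foreach (var token in response.Tokens)
    Console.WriteLine(token.Token);
```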

A tokenizer of type whitespace that divides text at whitespace.
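
A last sketch: whitespace splitting leaves case and punctuation inside tokens untouched (inline tokenizer selector as above):

```csharp
using System;
using System.Linq;
using Nest;

var client = new ElasticClient();

var response = client.Indices.Analyze(a => a
    .Tokenizer(t => t.Whitespace(w => w))
    .Text("The QUICK brown-fox"));

// Expect: "The", "QUICK", "brown-fox"
Console.WriteLine(string.Join(", ", response.Tokens.Select(t => t.Token)));
```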