Introduction to tokenization in NLP

Minh Nguyen
5 min read · Oct 22, 2021

This is the first article in my series on popular tokenizers in recent NLP development.

I. Introduction

  1. What is tokenization?

The process of splitting a sequence of text into units is called tokenization, and these units are called tokens. Tokenization is a very important preprocessing step before feeding input into any NLP model, but people often just skim through it and concentrate on meatier parts like building models.

If we use white spaces as splitting delimiters, would a token like “games!” look reasonable? Or, if we split on both white spaces and punctuation, is it OK for the name “O’Neill” to become [O, ‘, Neill]? Working with languages that don’t separate words by spaces makes tokenization even harder. At this point, it should be abundantly clear that finding a correct or optimal split strategy is not an easy task, especially when we want all the tokens to retain their correct meaning.
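
To make this concrete, here is a small Python sketch comparing the two splitting strategies. The example sentence is my own, chosen only because it contains “games!” and “O’Neill”; it is not the original example from the article.

import re

sentence = "O'Neill loves playing video games!"  # illustrative sentence, not the article's original example

# Splitting on white space keeps punctuation glued to the neighbouring word.
print(sentence.split())
# ["O'Neill", 'loves', 'playing', 'video', 'games!']

# Splitting on white space *and* punctuation tears the name apart.
print(re.findall(r"\w+|[^\w\s]", sentence))
# ['O', "'", 'Neill', 'loves', 'playing', 'video', 'games', '!']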

But is tokenizing really worthwhile? Should we just feed the entire textual input to a model and let it figure things out itself? The answer has a lot to do with the way “machines” do their reading.

2. Issues with machine reading

One of the ultimate goals of any NLP model is to understand human language. That feat cannot be achieved without understanding knowledge written down in textual form. We, humans, are gifted when it comes to working with language. We can relate and connect physical things in the world with their textual names. We can at least guess the context of a sentence to infer its correct meaning. We could even understand the meaning of conversations with friends before we ever went to school.

Unfortunately, a machine is not equipped with such prior knowledge. It doesn’t know the vocabulary or the language’s structure and grammar. It can’t guess the contextual information of a sentence. It can’t even decide where one word starts and ends. Here we are dealing with a “chicken-and-egg” problem: to understand a language, a machine needs to know its structure, but to learn the structure, the machine has to understand the language. How confusing is that?

One popular approach is to break text into chunks/tokens before feeding them to models. Through the training process, with a meticulously designed objective, a model can learn the relationships between the words in a text and then figure out the contextual semantics itself. Sound familiar? That’s exactly where a tokenizer comes in ;)

II. Popular approaches for tokenizing text

  1. Word-level tokenization

Word-level tokenization is the most commonly used approach: it finds word boundaries according to some specification, usually white space and punctuation.

Advantages

  • The method has been developed for a long time and is available in plenty of libraries.
  • The method is intuitively easy to comprehend since the tokenization scheme is close to how humans split a sentence.

Disadvantages

  • If we want to create a vocabulary entry for every unique word, the vocabulary becomes huge. That not only imposes a burden on training NLP models but also makes the inference step very slow.
  • Word-level tokenizers often perform poorly on text containing many misspellings, such as posts from internet forums (the sketch after this list shows a misspelled word collapsing into an unknown token). They also struggle with hyphenated words like “check-in” and compound words like “San Francisco”: should these be treated as one word or two? The same goes for abbreviations like “lmao” and “IDK”: are these several words or one?
  • Last but not least, some languages, such as Japanese, do not separate words with spaces, so splitting on spaces and punctuation is not sufficient to tokenize them correctly.
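
Below is a minimal, hand-rolled sketch of word-level tokenization. The tiny corpus, the regex, and the <unk> convention are all illustrative assumptions, but they show how the vocabulary grows with every unique word and how a misspelled or unseen word collapses into an unknown token.

import re

def tokenize_words(text):
    # Split on white space and punctuation (a common word-level scheme).
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Illustrative mini-corpus; real vocabularies are built from millions of words.
corpus = ["I love NLP.", "Tokenization is an important preprocessing step."]
vocab = {"<unk>": 0}
for sentence in corpus:
    for token in tokenize_words(sentence):
        vocab.setdefault(token, len(vocab))

print(len(vocab))  # one id per unique word ever seen, plus <unk>

# A misspelled or unseen word is mapped to <unk>, losing its meaning entirely.
print([vocab.get(t, vocab["<unk>"]) for t in tokenize_words("I lvoe tokenizaton!")])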

2. Character-level tokenization

As the name suggests, this type of tokenizer splits text into its individual characters, as sketched below.
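
Here is a minimal sketch (the sample text is my own) of character-level tokenization and the sequence-length blow-up it causes:

text = "Tokenization matters!"

# Every character, including white space and punctuation, becomes a token.
char_tokens = list(text)
print(char_tokens)

# The sequence is now far longer than the word-level split of the same text.
print(len(char_tokens), "characters vs.", len(text.split()), "words")  # 21 vs. 2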

Advantages

  • The vocabulary size is dramatically reduced to the number of characters used in the language. For English text, that is roughly the letters, digits, punctuation marks, and a few special symbols, typically a few hundred entries at most.
  • Misspellings, rare words, and unknown words are handled better because they are broken down into characters, and those characters are already in the vocabulary. Because of that, we can construct an embedding for any token.

Disadvantages

  • The main goal of tokenization is not achieved, because characters, at least in English, have no semantic meaning.
  • Reducing the vocabulary size comes at the cost of sequence length. With each word split into all of its characters, the tokenized sequence is much longer than the initial text.

3. Sub-word level tokenization

Understanding the limitations of both word-level and character-level tokenizers, we realize we need a tokenization scheme that builds a manageable-size vocabulary while retaining the tokens’ semantics as much as possible. Here comes a savior: sub-word tokenization, which produces sub-word units that are smaller than words but bigger than single characters.
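
As a quick illustration, assuming the Hugging Face transformers library is installed, a pre-trained WordPiece tokenizer (here bert-base-uncased) keeps common words whole and splits rarer words into sub-word units; the exact splits depend on the vocabulary it was trained with.

from transformers import AutoTokenizer

# Downloads the vocabulary on first use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word stays whole; a rarer word is split into sub-word pieces.
print(tokenizer.tokenize("Tokenization matters"))
# e.g. ['token', '##ization', 'matters']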

Advantages

  • The method provides a good balance between vocabulary size and retaining the meaning of tokens.
  • Its tokenization scheme mostly avoids the out-of-vocabulary (OOV) word problem. Having sub-word tokens (and ensuring the individual characters are part of the sub-word vocabulary) makes it possible to encode words that never appeared in the training data, as the toy sketch after this list shows.
  • Neural networks perform very well with sub-word tokens, excelling at all sorts of tasks: neural machine translation, NER, you name it. Almost all recent state-of-the-art deep neural network models use some form of sub-word tokenizer.
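
To see why OOV words stop being a problem, here is a toy greedy longest-match encoder in the spirit of WordPiece. The hand-made sub-word vocabulary and the “##” continuation prefix are illustrative assumptions; the point is that because every single character is also in the vocabulary, any lowercase word can be encoded.

import string

# Hand-made sub-word vocabulary: a few multi-character pieces plus every
# single lowercase letter (with and without the "##" continuation prefix).
SUBWORDS = {"token", "##ization", "##ing", "play", "##er"}
SUBWORDS |= set(string.ascii_lowercase)
SUBWORDS |= {"##" + c for c in string.ascii_lowercase}

def encode(word):
    # Greedy longest-match encoding, WordPiece-style.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in SUBWORDS:
                pieces.append(piece)
                break
            end -= 1
        start = end
    return pieces

print(encode("tokenization"))  # ['token', '##ization']
print(encode("xylograph"))     # unseen word falls back to single characters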

Disadvantages

Compared to its siblings, this is still a relatively new tokenization scheme, so we need more time to discover more optimal sub-word tokenizers.
