NLP Tokenization in Machine Learning: Python Examples

Last updated: 1st Feb, 2024

Tokenization is a fundamental step in Natural Language Processing (NLP) where text is broken down into smaller units called tokens. These tokens can be words, characters, or subwords, and this process is crucial for preparing text data for further analysis like parsing or text generation. Tokenization also plays a central role in training machine learning models, particularly Large Language Models (LLMs) such as the GPT (Generative Pre-trained Transformer) series, BERT (Bidirectional Encoder Representations from Transformers), and others.

Tokenization is often the first step in preparing text data for machine learning, and LLMs rely on it as an essential preprocessing step. Advanced tokenization techniques (like those used in BERT) allow models to understand the context of words better, leading to more accurate interpretations of word meanings based on the surrounding text. Tokenization also helps in dealing with linguistic nuances like contractions, hyphenations, and morphological variations, making models more adept at understanding natural language. In this blog, we will explore the different types of tokenization methods, with examples and Python code for each type.

Whitespace Tokenization

This method splits the text into tokens based on whitespace (spaces, tabs, newlines). It’s simple and effective for many English texts but may not handle special cases well. Here is an example:

  • Text: “Hello, world! This is an example.”
  • Tokens: [“Hello,”, “world!”, “This”, “is”, “an”, “example.”]

Here is how you could implement whitespace tokenization in Python:

text = "Hello, world! This is an example."
tokens = text.split()
print(tokens)
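
Because str.split() with no arguments splits on any run of whitespace, the same call also handles tabs and newlines:

text = "Hello,\tworld!\nThis is an example."
tokens = text.split()
print(tokens)  # ['Hello,', 'world!', 'This', 'is', 'an', 'example.']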

Punctuation Tokenization

In punctuation tokenization, the text is split based on punctuation marks. This method is useful in texts where punctuation plays a critical role but might not be ideal for handling abbreviations or contractions. Here is an example of punctuation tokenization:

  • Text: “Mr. Smith went to Washington. He’s excited!”
  • Tokens: [“Mr”, “Smith”, “went”, “to”, “Washington”, “He’s”, “excited”]

In Python, we can implement punctuation tokenization using the re (regular expression) module. Here is sample code:

import re

text = "Mr. Smith went to Washington. He's excited!"
# Keep apostrophes inside words so contractions like "He's" stay intact,
# while surrounding punctuation is dropped
tokens = re.findall(r"\b[\w']+\b", text)
print(tokens)  # ['Mr', 'Smith', 'went', 'to', 'Washington', "He's", 'excited']

Word Tokenization

Word tokenization uses language-specific rules to split text into words. This approach handles special cases like hyphenated words, contractions, and punctuation more accurately. For example, consider the text:

Text: “Don’t stop believing, hold on to that feeling!”

A basic whitespace tokenizer would split this into [“Don’t”, “stop”, “believing,”, “hold”, “on”, “to”, “that”, “feeling!”], treating punctuation and contractions as part of the words. However, a more advanced word tokenizer would understand linguistic nuances and handle contractions and punctuation appropriately.

The expected tokens might be: [“Do”, “n’t”, “stop”, “believing”, “,”, “hold”, “on”, “to”, “that”, “feeling”, “!”]

To demonstrate this, we can use the NLTK library, which includes functions specifically designed for word tokenization.

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

text = "Don't stop believing, hold on to that feeling!"
tokens = word_tokenize(text)
print(tokens)  # ['Do', "n't", 'stop', 'believing', ',', 'hold', 'on', 'to', 'that', 'feeling', '!']

In this example, word_tokenize from NLTK is used to accurately tokenize the sentence into individual words, taking into account the contraction “Don’t” and separating punctuation like commas and exclamation marks from the words. This approach is more aligned with how humans naturally parse and understand text, making it valuable for tasks that require a deep understanding of language, such as sentiment analysis or language modeling.

Subword Tokenization

Subword tokenization methods, popularized in the neural machine translation literature, are designed to split text into smaller, meaningful units (subwords) based on a predefined vocabulary. These subwords can range from individual characters to full words, capturing the most frequent and meaningful segments of text for efficient processing. Subword tokenization algorithms have two main components: vocabulary construction and the tokenization procedure.

  • The vocabulary construction procedure involves analyzing a large corpus of text to identify the most common and useful subword units. This procedure aims to create a compact yet comprehensive vocabulary that can represent a wide range of words, including those not seen during training, by combining these subwords.
  • The second component is the tokenization procedure, where the built vocabulary is applied to new or unseen text to break it down into a sequence of tokens from this vocabulary. This process involves finding the longest subword units in the text that exist in the vocabulary and splitting the text accordingly.

A common method for subword tokenization is Byte Pair Encoding (BPE); other widely used methods include WordPiece, SentencePiece, and Unigram language modeling. The following is an example of how subword tokenization works:

  • Text: “unbelievable”
  • Tokens (hypothetical with BPE): [“un”, “believ”, “able”]
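
To make this concrete, here is a minimal sketch of the greedy longest-match procedure described above, run against a tiny hand-built vocabulary (the vocabulary and helper function are hypothetical, and this matching style is closer to WordPiece than to true BPE merge rules):

# Hypothetical toy vocabulary; a real one would be learned from a large corpus
vocab = {"un", "believ", "able", "b", "e", "l", "i", "v", "a", "n", "u"}

def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest subwords found in the vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the window until the substring is in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]  # no subword matches at this position
        tokens.append(word[start:end])
        start = end
    return tokens

print(greedy_subword_tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']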

Byte Pair Encoding (BPE) Tokenization Method

The BPE technique is used as the tokenization step in the GPT family of models. Let's demonstrate it using the Hugging Face tokenizers library, which offers a variety of tokenizers, including BPE. Here is a Python code example.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Create a BPE tokenizer with an unknown-token placeholder
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Trainer to train the tokenizer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Example corpus to train the tokenizer
corpus = ["unbelievable", "belief", "believable", "unbelievably"]

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer)

# Tokenize a text
output = tokenizer.encode("unbelievable")
print(output.tokens)

In this example, we use a hypothetical small corpus to train a BPE tokenizer and then tokenize the word “unbelievable”. The tokenizer learns to break down words into frequent subunits observed in the training corpus.

SentencePiece Tokenization Method

SentencePiece treats input text as a raw stream of characters (encoding whitespace as a special symbol), which makes it language-agnostic and fully reversible. The following code demonstrates how a pre-trained SentencePiece model is used to tokenize text:

import sentencepiece as spm

# Assume spm_model is a pre-trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file='spm_model.model')

text = "Tokenization"
tokens = sp.encode(text, out_type=str)  # out_type=str returns subword strings instead of integer ids
print(tokens)
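
If you do not already have a pre-trained model file, a minimal training sketch looks like the following (the corpus file name, vocabulary size, and model type are illustrative assumptions; the corpus is expected to be plain text with one sentence per line):

import sentencepiece as spm

# Train a small SentencePiece model; produces spm_model.model and spm_model.vocab
spm.SentencePieceTrainer.train(
    input='corpus.txt',        # hypothetical plain-text corpus, one sentence per line
    model_prefix='spm_model',
    vocab_size=1000,
    model_type='bpe'           # 'unigram' (the default) is also supported
)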

Wordpiece Tokenization Method

WordPiece is a subword tokenization method that can be viewed as a language-modeling-based variant of BPE. The WordPiece algorithm was developed by Google to pretrain BERT and has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformer, and MPNet. Its training is very similar to BPE, but the actual tokenization is done differently: words are split greedily into the longest matching pieces from the learned vocabulary. For greater detail, see the Hugging Face WordPiece tokenization page.
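
As a quick illustration, here is a minimal sketch using the Hugging Face transformers library (this assumes the library is installed and the pretrained bert-base-uncased vocabulary can be downloaded); it applies BERT's WordPiece tokenizer to a short sentence:

from transformers import AutoTokenizer

# Load BERT's pretrained WordPiece tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Word-internal pieces are prefixed with "##"
tokens = tokenizer.tokenize("Tokenization is unbelievable!")
print(tokens)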
