NLP: Different Types of Language Models – Examples

Different types of language models in NLP

Have you ever wondered how your smartphone seems to know exactly what you're going to type next? Or how virtual assistants like Alexa and Siri understand and respond to your queries with such precision? The magic behind them is NLP language models. In this blog, we will explore the diverse types of language models in NLP that have evolved over time, each with its own capabilities and applications. From the simplicity of N-gram models, which predict text based on preceding words, to sophisticated neural network-based models like RNNs and LSTMs, and on to the groundbreaking large language models built on Transformers, we will learn about the intricacies of these models, along with real-world applications and Python code examples.

N-gram Language Models

N-gram language models are fundamental in NLP. They estimate the probability of a word based on its preceding words, where the "N" in N-gram is the number of words considered in the context. Different choices of N lead to different types of N-gram models, described below and illustrated in the short extraction sketch that follows the list:

  1. Unigram (1-gram) Language Model: This model views each word in isolation, as seen in a sentence like “The cat sat on the mat,” where words are considered separately [“The”, “cat”, “sat”, “on”, “the”, “mat”]. It’s particularly useful in basic text classification tasks where the frequency of individual words is the main focus.
  2. Bigram (2-gram) Language Model: It analyzes word pairs for predictions, as in “The cat sat on the mat,” forming pairs like [“The cat”, “cat sat”, “sat on”, “on the”, “the mat”]. This model is ideal for autocomplete features in search engines and text editors, predicting the next word based on the previous one.
  3. Trigram (3-gram) Language Model: This model extends the context to sequences of three words, such as [“The cat sat”, “cat sat on”, “sat on the”, “on the mat”] in the sentence “The cat sat on the mat.” It’s well-suited for more complex tasks like speech recognition and basic language translation.
  4. Higher-Order N-grams: Focusing on sequences of four or more words, for example, a 4-gram model in “The cat sat on the mat” would generate [“The cat sat on”, “cat sat on the”, “sat on the mat”]. This approach is beneficial for applications requiring an understanding of longer contextual dependencies but faces challenges like data sparsity and overfitting.
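The following is a minimal sketch of how these unigrams, bigrams, and trigrams can be extracted from the example sentence, using NLTK's ngrams utility and simple whitespace tokenization (choices made here purely for illustration):

from nltk.util import ngrams

sentence = "The cat sat on the mat"
tokens = sentence.split()  # simple whitespace tokenization for illustration

# Extract 1-grams, 2-grams and 3-grams from the same token sequence
for n in (1, 2, 3):
    print(f"{n}-grams:", [" ".join(gram) for gram in ngrams(tokens, n)])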

The following are examples of real-world applications where N-gram language models can be used:

  • Spell Check and Autocorrect: Due to their simplicity and efficiency, N-gram models are well-suited for spell checking and autocorrect features in text editors and word processors. They can quickly predict and suggest the next word based on previous word(s) without needing extensive computational resources.
  • Simple Text Prediction: They are useful in applications requiring basic text prediction, such as in older mobile phone keyboards where computational resources are limited.
  • Search Engine Autocomplete: N-gram models can power the autocomplete features in search engines, providing suggestions based on the most common queries.

The following is an example of how to implement a basic 2-gram (bigram) language model for next word prediction in Python. This example uses the NLTK library, which is a popular toolkit for natural language processing in Python. Make sure to install NLTK (pip install nltk).

import nltk
from nltk.util import bigrams
from nltk.corpus import reuters
from collections import Counter, defaultdict

# Download necessary NLTK data
nltk.download('reuters')
nltk.download('punkt')

# Function to build a bigram model
def build_bigram_model():
    model = defaultdict(lambda: defaultdict(lambda: 0))
    for sentence in reuters.sents():
        for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True):
            model[w1][w2] += 1
    
    # Convert frequencies to probabilities
    for w1 in model:
        total_count = float(sum(model[w1].values()))
        for w2 in model[w1]:
            model[w1][w2] /= total_count
    
    return model

# Build the model
model = build_bigram_model()

# Predict the next word
def predict_next_word(previous_word):
    next_word = model[previous_word]
    # Sort by probability
    next_word = sorted(next_word.items(), key=lambda item: item[1], reverse=True)
    return next_word[0][0] if next_word else None

# Example usage
previous_word = 'economic'
predicted_word = predict_next_word(previous_word)
print(f"The predicted next word after '{previous_word}' is '{predicted_word}'")

The Python code does the following:

  1. Imports necessary modules from NLTK and Python’s standard library.
  2. Downloads the Reuters corpus, which is a collection of news documents. This corpus is used to train the bigram model.
  3. Builds the bigram model by counting the occurrences of each bigram (pair of consecutive words) and converting these counts to probabilities.
  4. Defines a function predict_next_word to predict the most probable next word given a previous word.

Neural Network-based Language Models

Neural network-based language models are superior to N-gram models at capturing complex relationships between words. These language models employ deep learning architectures to understand and generate human language. Many neural language models, especially those based on RNNs, LSTMs (Long Short-Term Memory networks), and GRUs (Gated Recurrent Units), are designed to process sequential data. This makes them adept at understanding context and maintaining state over a sequence of words, which is essential for tasks like sentence completion and text generation. These models require more computational resources than N-gram models and can be slower to train, which might be a limiting factor in resource-constrained scenarios.

  • Recurrent Neural Networks (RNNs): RNNs maintain a hidden state that captures contextual information from previously processed words, providing a way to understand context in a sequence (see the minimal recurrence sketch after this list). However, they struggle with long-range dependencies due to the vanishing gradient problem.
  • Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs): These are advanced versions of RNNs specifically designed to combat the vanishing gradient issue. This allows them to remember information over longer sequences, making them more effective for complex language tasks. This is one of the main reasons why LSTM and GRUs are chosen over basic RNNs and N-grams.
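To make the idea of a hidden state concrete, here is a minimal NumPy sketch of the vanilla RNN recurrence h_t = tanh(W_x x_t + W_h h_(t-1) + b); the dimensions and random weights are illustrative assumptions, not a trained model:

import numpy as np

# Illustrative dimensions (assumptions chosen for the sketch)
input_dim, hidden_dim, seq_len = 8, 16, 5

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # a toy sequence of word vectors
h = np.zeros(hidden_dim)                   # initial hidden state

# The hidden state is updated at every time step, carrying context forward
for t in range(seq_len):
    h = np.tanh(W_x @ x[t] + W_h @ h + b)

print("Final hidden state shape:", h.shape)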

For implementing neural language models, TensorFlow or PyTorch are commonly used. They provide extensive support for building and training various neural network architectures. Here is an example using TensorFlow and Keras for an LSTM model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Example hyperparameters (illustrative values; set these from your own corpus)
vocab_size = 10000    # number of distinct words in the vocabulary
embedding_dim = 100   # dimensionality of the word embeddings

# Model architecture: embedding layer -> LSTM -> softmax over the vocabulary
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(LSTM(units=50))
model.add(Dense(units=vocab_size, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Model training (x_train: input word sequences, y_train: one-hot next words)
# model.fit(x_train, y_train, epochs=num_epochs)
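The model.fit call above assumes x_train and y_train already exist. The following is a minimal sketch of one common way to prepare them, using the Keras Tokenizer to turn sentences into next-word prediction pairs; the toy corpus and the preprocessing choices here are illustrative assumptions:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# A toy corpus; in practice this would be a much larger text collection
corpus = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
# In practice, vocab_size for the model above would be derived from this tokenizer
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

# Build (prefix -> next word) training pairs from each sentence
sequences = []
for line in corpus:
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i + 1])

max_len = max(len(s) for s in sequences)
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')

x_train = sequences[:, :-1]   # every word in the prefix except the last one
y_train = to_categorical(sequences[:, -1], num_classes=vocab_size)  # the next word, one-hot encoded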

The following are examples of real-world applications where neural network-based language models such as RNNs / LSTMs / GRUs can be used:

  • Machine Translation: LSTM and GRU models, with their ability to remember long-range dependencies, are well-suited for machine translation applications, where context and sequence memory are crucial.
  • Speech Recognition: These models are effective in speech recognition systems, as they can handle variable-length input sequences and capture the temporal dependencies in spoken language.
  • Sentiment Analysis: For sentiment analysis, particularly in longer texts where context is key, RNNs and their variants can provide more nuanced understanding than simpler models.

Transformers & Large Language Models

Unlike RNNs and their derivatives such as LSTMs and GRUs, transformer-based language models are the go-to choice for complex language tasks like translation, question answering, and text generation due to their superior ability to understand context and handle long-range dependencies. The key innovation in transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence, regardless of their positional distance. This ability to capture both short- and long-range dependencies makes them incredibly powerful. Models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are prominent examples of transformer-based language models. These models are also termed large language models.
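To give a feel for what self-attention actually computes, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer layer. The toy dimensions, random matrices, and function name are illustrative assumptions, not part of any library:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q @ K^T / sqrt(d_k)) @ V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # each output is a weighted sum of the value vectors

# Toy example: a "sentence" of 4 tokens, each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# In a real transformer, Q, K and V come from learned linear projections of x
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))
output = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(output.shape)  # (4, 8): one context-aware vector per token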

For transformers, Hugging Face's Transformers library is a popular choice. It provides pre-trained models like GPT and BERT, which can be fine-tuned for specific tasks. The Python code given below uses the Hugging Face Transformers library to generate text with a pre-trained GPT-2 language model.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encode text inputs
inputs = tokenizer("AI has taken the world by ", return_tensors="pt")

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=20)

# Decode and print the output text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The following are some of the key steps in the above code:

  • Importing the model and tokenizer
  • Loading the pre-trained model and tokenizer
  • Encoding the input text
  • Text generation
  • Decoding and printing the generated text

When the above code is executed, the following gets printed:

AI has taken the world by  a storm. It has been a long time since I have

The following are examples of real-world applications where transformer-based language models can be used:

  • Advanced Natural Language Understanding (NLU): Tasks like question answering, summarization, and language inference benefit greatly from Transformer models due to their ability to understand complex sentence structures and context.
  • Large-Scale Language Generation: Applications like chatbots, content creation tools, and story generators, where coherent and contextually relevant text generation is required, are ideal use cases for Transformer models like GPT.
  • Contextual Word Embeddings: BERT and similar models are used to generate word embeddings that capture the context of a word within a sentence, significantly improving performance in a variety of NLP tasks like named entity recognition (NER) and part-of-speech (POS) tagging; a short sketch of extracting such embeddings follows this list.
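As a minimal sketch of what contextual word embeddings look like in practice, the following uses Hugging Face's Transformers library with the bert-base-uncased checkpoint (the checkpoint choice and the example sentence are illustrative assumptions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# The same word would get a different vector in a different sentence context
inputs = tokenizer("The bank raised its interest rates", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per (sub)word token, including [CLS] and [SEP]
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # e.g. torch.Size([1, 8, 768])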