Smoothing techniques in NLP are used to estimate the probability (likelihood) of a sequence of words (say, a sentence) occurring together when one or more of the words individually (unigrams) or N-grams such as a bigram ([latex]\frac{w_{i}}{w_{i-1}}[/latex]) or a trigram ([latex]\frac{w_{i}}{w_{i-1}w_{i-2}}[/latex]) in the given set have never occurred in the training data.
In this post, you will go through a quick introduction to various smoothing techniques used in NLP, along with related formulas and examples. The following is a list of some of these smoothing techniques:
You will also quickly learn why smoothing techniques need to be applied. In the examples below, we will use the following sequence of words as the training corpus and test data set.
Based on the training data set, what is the probability of “cats sleep”, assuming the bigram technique is used? With the bigram technique, the probability of the sequence of words “cats sleep” can be calculated as the product of the following:
[latex]P(\text{cats sleep}) = P(\frac{cats}{<s>})\times P(\frac{sleep}{cats})\times P(\frac{</s>}{sleep})[/latex]
You will notice that [latex]P(\frac{sleep}{cats}) = 0[/latex]. Thus, the overall probability of the sequence “cats sleep” would come out to be zero (0). However, a sequence of words should not be assigned zero probability just because it never appeared in the training data.
This is where various different smoothing techniques come into the picture.
In Laplace smoothing, 1 (one) is added to all the counts, and the probability is then calculated. This is the simplest of all the smoothing techniques.
The maximum likelihood estimate (MLE) of a word [latex]w_i[/latex] occurring in a corpus can be calculated as follows, where N is the total number of words and [latex]count(w_{i})[/latex] is the count of the word whose probability is to be calculated.
MLE: [latex]P(w_{i}) = \frac{count(w_{i})}{N}[/latex]
After applying Laplace smoothing, the formula changes as follows. Adding 1 to every count contributes V extra observations, where V is the vocabulary size (the number of distinct words).
Laplace: [latex]P_{Laplace}(w_{i}) = \frac{count(w_{i}) + 1}{N + V}[/latex]
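As a quick illustration with made-up numbers (these counts are purely illustrative and not taken from the corpus above): suppose the corpus contains N = 10 words, the vocabulary has V = 8 distinct words, and count(cats) = 2. Then [latex]P(cats) = \frac{2}{10} = 0.2[/latex], while [latex]P_{Laplace}(cats) = \frac{2 + 1}{10 + 8} = \frac{3}{18} \approx 0.17[/latex]. A word that never occurs in the corpus would get [latex]P_{Laplace} = \frac{0 + 1}{10 + 8} \approx 0.06[/latex] instead of zero.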
Similarly, for N-grams (say, bigrams), the MLE is calculated as follows:
[latex]P(\frac{w_{i}}{w_{i-1}}) = \frac{count(w_{i-1}, w_{i})}{count(w_{i-1})}[/latex]
After applying Laplace smoothing, the bigram formula changes as follows. Again, adding 1 to every count contributes V extra observations for each history word.
Laplace: [latex]P_{Laplace}(\frac{w_{i}}{w_{i-1}}) = \frac{count(w_{i-1}, w_{i}) + 1}{count(w_{i-1}) + V}[/latex]
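To make the bigram formula above concrete, the following is a minimal Python sketch. The toy sentences, function name, and variable names are illustrative and not part of the post's corpus.

```python
from collections import Counter

def laplace_bigram_prob(word, prev_word, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(word | prev_word)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

# Toy training sentences; <s> and </s> mark sentence boundaries.
sentences = [["<s>", "cats", "purr", "</s>"],
             ["<s>", "dogs", "sleep", "</s>"]]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size, including <s> and </s>

# The unseen bigram ("cats", "sleep") still gets a small non-zero probability.
print(laplace_bigram_prob("sleep", "cats", bigram_counts, unigram_counts, V))
```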
This is very similar to “Add One” or Laplace smoothing. Instead of adding 1, as in Laplace smoothing, a delta ([latex]\delta[/latex]) value is added. Thus, the formula to calculate the probability using additive smoothing looks like the following:
[latex]P(\frac{w_{i}}{w_{i-1}}) = \frac{count(w_{i-1}, w_{i}) + \delta}{count(w_{i-1}) + \delta|V|}[/latex]
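Additive smoothing is only a small change to the add-one sketch above: the 1 is replaced by [latex]\delta[/latex]. The following is a minimal, self-contained sketch with an illustrative [latex]\delta[/latex] of 0.1 and hypothetical counts.

```python
def additive_bigram_prob(word, prev_word, bigram_counts, unigram_counts, vocab_size, delta=0.1):
    """Additive (add-delta) smoothed bigram probability; delta = 1 recovers Laplace smoothing."""
    numerator = bigram_counts.get((prev_word, word), 0) + delta
    denominator = unigram_counts.get(prev_word, 0) + delta * vocab_size
    return numerator / denominator

# With delta = 0.1, an unseen bigram gets a smaller boost than with add-one smoothing.
print(additive_bigram_prob("sleep", "cats", {}, {"cats": 2}, vocab_size=8, delta=0.1))
```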
The Good-Turing smoothing technique uses the frequencies of the counts of occurrence of N-grams to adjust the maximum likelihood estimate. For example, consider calculating the probability of the bigram (chatter/cats) from the corpus given above. Note that this bigram has never occurred in the corpus and thus its probability without smoothing would turn out to be zero. As per Good-Turing smoothing, the probability will depend upon the following:
The following is the formula:
For the unknown N-grams, the following formula is used to calculate the probability:
[latex]P_{unknown}(\frac{w_{i}}{w_{i-1}}) = \frac{N_1}{N}[/latex]
In the above formula, [latex]N_1[/latex] is the count of N-grams which appeared exactly once and N is the total number of N-grams.
For the known N-grams, the following formula is used to calculate the probability:
[latex]P(\frac{w_{i}}{w_{i-1}}) = \frac{c*}{N}[/latex]
where c* = [latex](c + 1)\times\frac{N_{c+1}}{N_{c}}[/latex]
In the above formula, c represents the count of occurrence of the n-gram, [latex]N_{c + 1}[/latex] represents the count of n-grams which occurred c + 1 times, [latex]N_{c}[/latex] represents the count of n-grams which occurred c times, and N represents the total count of all n-grams.
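The following is a minimal Python sketch of these formulas, assuming we already have a dictionary of bigram counts; the counts and names below are illustrative only. In practice, the counts of counts [latex]N_{c}[/latex] are themselves smoothed for large c (as in Simple Good-Turing), which this sketch does not do.

```python
from collections import Counter

def good_turing_probs(ngram_counts):
    """Return (prob_for_count, unseen_mass) using basic Good-Turing re-estimation.

    prob_for_count[c] is the smoothed probability of a single n-gram seen c times;
    unseen_mass is the total probability N1 / N reserved for unseen n-grams.
    """
    N = sum(ngram_counts.values())        # total number of n-gram tokens
    Nc = Counter(ngram_counts.values())   # frequency of frequencies: Nc[c] = number of n-grams seen c times
    unseen_mass = Nc[1] / N               # P_unknown = N1 / N

    prob_for_count = {}
    for c in Nc:
        if Nc[c + 1] > 0:                 # c* is only defined when N_{c+1} is non-zero
            c_star = (c + 1) * Nc[c + 1] / Nc[c]
            prob_for_count[c] = c_star / N
    return prob_for_count, unseen_mass

# Illustrative bigram counts, not taken from the post's corpus.
bigram_counts = {("cats", "purr"): 1, ("dogs", "sleep"): 1, ("dogs", "bark"): 2}
print(good_turing_probs(bigram_counts))
```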
The following video provides a great tutorial on Good-Turing smoothing.
In Good-Turing smoothing, it is observed that the counts of n-grams are discounted by a roughly constant (absolute) value, such as 0.75. The same intuition is applied in Kneser-Ney smoothing, where absolute discounting is applied to the count of n-grams, and the product of an interpolation weight and the probability of the word appearing as a novel continuation is added to it.
[latex]P_{Kneser-Ney}(\frac{w_{i}}{w_{i-1}}) = \frac{max(c(w_{i-1},w_{i}) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\times P_{continuation}(w_{i})[/latex]
where [latex]\lambda[/latex] is a normalizing constant which represents the probability mass that has been discounted from the higher-order counts. The following represents how [latex]\lambda[/latex] is calculated, with [latex]|\{w : c(w_{i-1},w) > 0\}|[/latex] being the number of distinct words that follow [latex]w_{i-1}[/latex]:
[latex]\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\times|\{w : c(w_{i-1},w) > 0\}|[/latex]
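The following is a minimal Python sketch of this interpolated Kneser-Ney formula for bigrams, assuming a fixed discount d = 0.75 and illustrative counts (the function and variable names are hypothetical). Here [latex]P_{continuation}(w_{i})[/latex] is computed as the number of distinct words that [latex]w_{i}[/latex] follows, divided by the total number of distinct bigram types.

```python
def kneser_ney_bigram_prob(word, prev_word, bigram_counts, d=0.75):
    """Interpolated Kneser-Ney estimate of P(word | prev_word) for bigrams.

    Assumes prev_word occurs at least once as a bigram history.
    """
    history_count = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev_word)
    distinct_followers = len({w2 for (w1, w2) in bigram_counts if w1 == prev_word})
    total_bigram_types = len(bigram_counts)

    # Continuation probability: in how many distinct contexts does `word` appear?
    continuation = len({w1 for (w1, w2) in bigram_counts if w2 == word}) / total_bigram_types

    # Absolute discounting on the bigram count, interpolated with the continuation probability.
    discounted = max(bigram_counts.get((prev_word, word), 0) - d, 0) / history_count
    lam = d * distinct_followers / history_count
    return discounted + lam * continuation

# Illustrative bigram counts, not taken from the post's corpus.
bigram_counts = {("cats", "purr"): 2, ("cats", "meow"): 1, ("dogs", "sleep"): 3}
print(kneser_ney_bigram_prob("sleep", "cats", bigram_counts))
```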
The following video provides deeper details on Kneser-Ney smoothing.
The Good-Turing technique is combined with interpolation; this outperforms Good-Turing by redistributing different probabilities to different unseen units.
The Good-Turing technique is combined with bucketing.
In this post, you learned about different smoothing techniques used in NLP, such as the following:
Did you find this article useful? Do you have any questions about this article or about the smoothing techniques used in NLP? Leave a comment and ask your questions and I shall do my best to address your queries.