In this post, you will learn how to get started with natural language processing (NLP) using NLTK (Natural Language Toolkit), a Python platform for working with human language data. The post is titled "hello world" because it helps you get started with NLTK while also covering some important aspects of processing language. The following will be covered in this post:
This is what you need to do to set up NLTK.
# Pip install
#
pip install nltk
#
# Import NLTK
#
import nltk
You can get started practicing NLTK commands by downloading the book collection, which comprises several books. Here is what you need to execute:
#
# NLTK Book Download
#
nltk.download()
Executing the above command opens a utility from which you can select and download the collection. This is how it looks:
Select the book collection and click Download. Once the download is complete, you can execute the following command to load the books.
#
# Load the books
#
from nltk.book import *
This is what it looks like when you execute the above command.
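If the book collection has been downloaded, the import prints a listing of the available texts, roughly like the following (the exact banner may vary with the NLTK version):
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908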
Here are some of the common NLTK commands along with what they do:
import nltk
#
# Sentence
#
intro = 'My name is Ajitesh Shukla. I work in HighRadius. I live in Hyderabad.'
#
# Tokenize using word_tokenize method
#
tokens = nltk.word_tokenize(intro)
#
# Print the list of tokens
#
print(tokens)
#
# Print the set of unique tokens
#
print(set(tokens))
Here is what the output looks like:
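Assuming the required tokenizer models are available (the punkt package, which is included in the book collection download), the printed token list would look roughly like this; the set output contains the same tokens without duplicates, in arbitrary order:
['My', 'name', 'is', 'Ajitesh', 'Shukla', '.', 'I', 'work', 'in', 'HighRadius', '.', 'I', 'live', 'in', 'Hyderabad', '.']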
We will try to understand this with one of the texts loaded from nltk.book (text7 – Wall Street Journal). In the example below, the output of common_contexts is to_the and to_their. This implies that the contexts to_the and to_their occur around both of the words, finance and improve. If the output of common_contexts were null / empty, the output of the similar method would also be null / empty.
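Here is a minimal sketch of how these two methods can be invoked on text7, assuming the book collection has already been downloaded; similar prints words that appear in contexts similar to the given word, while common_contexts prints contexts shared by the given words (finance and improve, as discussed above):
import nltk
from nltk.book import text7
#
# Words appearing in contexts similar to 'finance' in the Wall Street Journal text
#
text7.similar('finance')
#
# Contexts shared by the words 'finance' and 'improve'
#
text7.common_contexts(['finance', 'improve'])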
import nltk
from nltk import FreqDist
#
# Sentence
#
intro = 'My name is Ajitesh Shukla. I write blogs on Vitalflux.com. I live in Hyderabad. I love writing blogs. I also have good expertise in cloud computing. I am also good in AWS.'
#
# Tokenize using word_tokenize method
#
tokens = nltk.word_tokenize(intro)
#
# Create an instance of FreqDist
#
freqdist = FreqDist(tokens)
#
# Draw the frequency distribution of tokens (requires matplotlib to be installed)
#
freqdist.plot()
Here is what the output plot looks like:
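If you just want the counts as text rather than a plot, FreqDist also behaves like a counter. Continuing with the freqdist object created above, here is a minimal sketch:
#
# Print the 5 most frequent tokens along with their counts
#
print(freqdist.most_common(5))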
import nltk
#
# Sentence
#
intro = 'My name is Ajitesh Shukla. I write blogs on Vitalflux.com. I live in Hyderabad. I love writing blogs. I also have good expertise in cloud computing. I am also good in AWS.'
#
# Tokenize using word_tokenize method
#
tokens = nltk.word_tokenize(intro)
#
# Condition to filter words meeting criteria
#
long_words = [word for word in tokens if len(word) > 5]
#
# Print the long words
#
print(long_words)
This is what will be printed:
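Assuming the default word_tokenize behavior, the list of long words would look roughly like this:
['Ajitesh', 'Shukla', 'Vitalflux.com', 'Hyderabad', 'writing', 'expertise', 'computing']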
Here is the summary of what you learned in this post related to NLTK setup and some common methods: