In this post, you will learn about the how to read one or more text files using NLTK and process words contained in the text file. As data scientists starting to work on NLP, the Python code sample for reading multiple text files from local storage will be very helpful.
Here is the Python code sample for reading one or more text files. Pay attention to some of the following aspects:
from nltk.corpus import PlaintextCorpusReader
#
# Root folder where the text files are located
#
corpus_root = '/Users/apple/Downloads/nltk_sample/modi'
#
# Read the list of files
#
filelists = PlaintextCorpusReader(corpus_root, '.*')
#
# List down the IDs of the files read from the local storage
#
filelists.fileids()
#
# Read the text from specific file
#
wordslist = filelists.words('virtual_global_investor_roundtable.txt')
Other two important methods on PlaintextCorpusReader are sents (read sentences) and paras (read paragraphs).
Once the words found in specific file is loaded, you can do some of the following operations for processing the text file:
filtered_words = [words for words in set(wordslist) if len(words) > 3]
from nltk.probability import FreqDist
#
# Frequency distribution
#
fdist = FreqDist(wordslist)
#
# Plot the frequency distribution of 30 words with
# cumulative = True
#
fdist.plot(30, cumulative=True)
import matplotlib.pyplot as plt
from nltk.probability import FreqDist
#
# Frequency distribution
#
fdist = FreqDist(wordslist)
#
# Print words having 5 or more characters which occured for 5 or more times
#
frequent_words = [[fdist[word], word] for word in set(wordslist) if len(word) > 4 and fdist[word] >= 5]
#
# Record the frequency count of
#
sorted_word_frequencies = {}
for item in sorted(frequent_words):
sorted_word_frequencies[item[1]] = item[0]
#
# Create the plot
#
plt.bar(range(len(sorted_word_frequencies)), list(sorted_word_frequencies.values()), align='center')
plt.xticks(range(len(sorted_word_frequencies)), list(sorted_word_frequencies.keys()), rotation=80)
plt.title("Words vs Count of Occurences", fontsize=18)
plt.xlabel("Words", fontsize=18)
plt.ylabel("Words Frequency", fontsize=18)
Here is the summary of what you learned in this post regarding reading and processing the text file using NLTK library:
Have you ever wondered how to use OpenAI APIs to create custom chatbots? With advancements…
The combination of Retrieval-Augmented Generation (RAG) and powerful language models enables the development of sophisticated…
When building a Retrieval-Augmented Generation (RAG) application powered by Large Language Models (LLMs), which combine…
Last updated: 25th Jan, 2025 Have you ever wondered how to seamlessly integrate the vast…
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
View Comments
I have a text file that I need to parse to find the most often used words. How do I do this using python and nltk?