In this post, you will learn how to read one or more text files using NLTK and process the words they contain. For data scientists starting to work on NLP, the Python code samples for reading multiple text files from local storage should be helpful.
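Here is the Python code sample for reading one or more text files. It uses PlaintextCorpusReader to load every file under a root folder, fileids() to list the files that were found, and words() to read the words from a specific file: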
from nltk.corpus import PlaintextCorpusReader
#
# Root folder where the text files are located
#
corpus_root = '/Users/apple/Downloads/nltk_sample/modi'
#
# Read the list of files
#
filelists = PlaintextCorpusReader(corpus_root, '.*')
#
# List down the IDs of the files read from the local storage
#
filelists.fileids()
#
# Read the text from specific file
#
wordslist = filelists.words('virtual_global_investor_roundtable.txt')
Two other important methods of PlaintextCorpusReader are sents (read sentences) and paras (read paragraphs).
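The sketch below shows words, sents and paras side by side on more than one file. It is a minimal illustration under stated assumptions: the two sample files are made up and written to a temporary folder, and a naive regular-expression sentence splitter is passed in through the sent_tokenizer argument so that the example does not depend on the Punkt model that sents and paras otherwise rely on. With your own corpus you would normally keep the default tokenizer.

```python
import tempfile
from pathlib import Path

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample files, created only for this illustration
samples = {
    "first.txt": "Hello world. This is NLTK.\n\nA second paragraph here.\n",
    "second.txt": "Another file. It has two sentences.\n",
}

with tempfile.TemporaryDirectory() as corpus_root:
    for name, text in samples.items():
        Path(corpus_root, name).write_text(text, encoding="utf8")

    # Assumption: a naive splitter that breaks on whitespace after '.', '!' or '?'.
    # It stands in for the Punkt model used by default.
    naive_sentences = RegexpTokenizer(r"(?<=[.!?])\s+", gaps=True)

    reader = PlaintextCorpusReader(
        corpus_root, r".*\.txt", sent_tokenizer=naive_sentences
    )

    # The corpus views are lazy, so materialize them while the files still exist
    file_ids = reader.fileids()
    all_words = list(reader.words())                        # every file
    some_words = list(reader.words(["first.txt", "second.txt"]))  # chosen files
    sentences = [list(sent) for sent in reader.sents("first.txt")]
    paragraphs = [list(para) for para in reader.paras("first.txt")]

print(file_ids)
print(sentences)
print(len(paragraphs), "paragraphs in first.txt")
```

Calling words() with no argument reads every file in the corpus, while passing a list of file IDs restricts it to those files. Each sentence is a list of words, and each paragraph is a list of sentences.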
Once the words found in a specific file are loaded, you can perform operations such as the following to process the text:
#
# Keep the unique words that are longer than 3 characters
#
filtered_words = [word for word in set(wordslist) if len(word) > 3]
from nltk.probability import FreqDist
#
# Frequency distribution
#
fdist = FreqDist(wordslist)
#
# Plot the frequency distribution of 30 words with
# cumulative = True
#
fdist.plot(30, cumulative=True)
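If you want the most frequent words as plain values rather than a plot, FreqDist behaves like a Python Counter, so most_common is available. The following is a minimal sketch; the token list is made up and stands in for wordslist.

```python
from nltk.probability import FreqDist

# Hypothetical tokens standing in for wordslist
tokens = ["data", "science", "data", "nltk", "data", "nltk", "python"]

fdist = FreqDist(tokens)

# The two most frequent tokens with their counts
top_two = fdist.most_common(2)
print(top_two)

# Count for a single token, and the total number of tokens seen
data_count = fdist["data"]
total_tokens = fdist.N()
print(data_count, total_tokens)
```

On a real corpus you may want to lowercase the tokens and drop punctuation first, since words() returns punctuation marks as separate tokens.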
import matplotlib.pyplot as plt
from nltk.probability import FreqDist
#
# Frequency distribution
#
fdist = FreqDist(wordslist)
#
# Collect words having 5 or more characters which occurred 5 or more times
#
frequent_words = [[fdist[word], word] for word in set(wordslist) if len(word) > 4 and fdist[word] >= 5]
#
# Record the frequency count of each word, ordered by count
#
sorted_word_frequencies = {}
for item in sorted(frequent_words):
    sorted_word_frequencies[item[1]] = item[0]
#
# Create the plot
#
plt.bar(range(len(sorted_word_frequencies)), list(sorted_word_frequencies.values()), align='center')
plt.xticks(range(len(sorted_word_frequencies)), list(sorted_word_frequencies.keys()), rotation=80)
plt.title("Words vs Count of Occurrences", fontsize=18)
plt.xlabel("Words", fontsize=18)
plt.ylabel("Word Frequency", fontsize=18)
plt.show()
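The filtering and sorting step can also be checked on its own, without any plotting. The sketch below wraps it in a helper; the function name frequent_long_words, its parameters and the token list are all hypothetical and chosen only for illustration.

```python
from nltk.probability import FreqDist


def frequent_long_words(tokens, min_length=5, min_count=5):
    """Return (word, count) pairs for long, frequent words, least frequent first.

    Hypothetical helper, not part of NLTK.
    """
    fdist = FreqDist(tokens)
    selected = [
        (word, count)
        for word, count in fdist.items()
        if len(word) >= min_length and count >= min_count
    ]
    # Sort by count, then alphabetically to keep ties in a stable order
    return sorted(selected, key=lambda pair: (pair[1], pair[0]))


# Made-up tokens: "the" is too short and "invest" is too rare to be kept
tokens = ["market"] * 6 + ["growth"] * 5 + ["the"] * 9 + ["invest"] * 2
result = frequent_long_words(tokens)
print(result)
```

The returned pairs can be unpacked into two lists and passed to plt.bar and plt.xticks in the same way as the dictionary above.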
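To summarize what you learned in this post about reading and processing text files using the NLTK library: PlaintextCorpusReader loads one or more text files from a root folder, its words, sents and paras methods return the words, sentences and paragraphs of a file, and FreqDist counts how often each word occurs so that the counts can be filtered and plotted.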
View Comments
I have a text file that I need to parse to find the most often used words. How do I do this using python and nltk?