Spacy Tokenizer Python Example
In this post, you will quickly learn how to use spaCy for reading and tokenizing a document, whether it comes from a text file or elsewhere. As a data scientist starting out in NLP, this is one of the first pieces of code you will write to read text using spaCy.
First and foremost, make sure you have spaCy installed and the English language model downloaded. The following commands help you set up in a Jupyter notebook.
#
# Install Spacy
#
!pip install spacy
#
# Download the English language model (en_core_web_sm)
#
!python -m spacy download en_core_web_sm
Reading text using spaCy: Once spaCy is installed and the English model is downloaded, the following code can be used to read text from a text file and tokenize it into words. Pay attention to the comments explaining each step:
import spacy
#
# Load the English language model;
# nlp is an instance of the spaCy Language class. A Language object
# contains the language’s vocabulary and other data from the statistical model.
#
nlp = spacy.load('en_core_web_sm')
#
# Create an instance of document;
# doc object is a container for a sequence of Token objects.
#
intro = 'My name is Ajitesh Shukla. I live in Hyderabad, India. I love doing projects in AI / ML.'
doc = nlp(intro)
#
# Read the words; Print the words
#
words = [word.text for word in doc]
print(words)
#
# Read text from a text file
#
modi_speech = nlp(open('/Users/apple/Downloads/nltk_sample/modi/virtual_global_investor_roundtable.txt').read())
#
# Read the words; Print the words
#
words_modi_speech = [word.text for word in modi_speech]
print(words_modi_speech)
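Note that the raw token list includes punctuation and whitespace tokens. If you want only word tokens, here is a minimal sketch using the same modi_speech document and spaCy's built-in is_punct / is_space token flags:
#
# Keep only word tokens by skipping punctuation and whitespace tokens
#
word_tokens = [token.text for token in modi_speech
               if not token.is_punct and not token.is_space]
print(word_tokens[:20])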
Nouns & Verbs: Here is the code for reading noun phrases and verbs from the text file using the document instance:
print("Noun phrases:", [chunk.text for chunk in modi_speech.noun_chunks])
print("Verbs:", [token.lemma_ for token in modi_speech if token.pos_ == "VERB"])
Named Entities: Here is the code for reading named entities from the text file using the document instance:
# Find named entities, phrases and concepts
for entity in modi_speech.ents:
    print(entity.text, entity.label_)
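Since the setup above assumes a Jupyter notebook, you can also visualize the named entities inline using spaCy's built-in displaCy visualizer. A minimal sketch on the same modi_speech document:
from spacy import displacy
#
# Render named entities inline in the notebook
#
displacy.render(modi_speech, style='ent', jupyter=True)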