
Spacy Tokenization Python Example

In this post, you will learn how to use spaCy to read and tokenize a document, whether the text comes from a text file or from a string in your code. If you are a data scientist starting out with NLP, this is likely to be among the first pieces of spaCy code you write.

First, make sure spaCy is installed and the small English pipeline, en_core_web_sm, which includes the English tokenizer, is downloaded. The following commands do this from a Jupyter notebook.

#
# Install Spacy
#
!pip install spacy
#
# Download the small English pipeline (includes the tokenizer)
#
!python -m spacy download en_core_web_sm
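
Before moving on, it can help to confirm that the setup worked. The following is a minimal sketch, assuming the spacy.util.is_package helper is available in your spaCy version; the function name model_status is hypothetical and used only for illustration.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"


def model_status(name: str = MODEL_NAME) -> str:
    """Report whether a spaCy pipeline package is installed (hypothetical helper)."""
    if is_package(name):
        return f"{name} is installed"
    return f"{name} is missing; run: python -m spacy download {name}"


print("spaCy version:", spacy.__version__)
print(model_status())
```

If the pipeline is reported as missing, re-run the download command above and restart the notebook kernel.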

Reading text using spaCy: Once spaCy is installed and the English pipeline is downloaded, the following code reads text from a text file and tokenizes it into words. Note the following:

  • First, load the English pipeline with a command such as spacy.load('en_core_web_sm'). This returns an instance of the spaCy Language class. The older 'en' shortcut relied on shortcut links, which spaCy 3 removed, so the full package name is used here.
  • Pass text to the Language instance, either a string assigned in code or text read from a file. This returns a Doc object, which is a container of Token objects.
  • You can then run operations on the Doc object, such as printing the words or finding the nouns and verbs.
import spacy
#
# Load the model for English language; 
# nlp is an instance of spaCy language class. A Language object 
# contains the language’s vocabulary and other data from the statistical model.
#
nlp = spacy.load('en_core_web_sm')
#
# Create an instance of document;
# doc object is a container for a sequence of Token objects. 
#
intro = 'My name is Ajitesh Shukla. I live in Hyderabad, India. I love doing projects in AI / ML.'
doc = nlp(intro)
#
# Read the words; Print the words
#
words = [word.text for word in doc]
print(words)
#
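# Read text from a text file (the file is assumed to be UTF-8 encoded);
# the with-block closes the file once it has been read
#
with open('/Users/apple/Downloads/nltk_sample/modi/virtual_global_investor_roundtable.txt', encoding='utf-8') as f:
    modi_speech = nlp(f.read())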
#
# Read the words; Print the words
#
words_modi_speech = [word.text for word in modi_speech]
print(words_modi_speech)
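
Text read from a file usually contains line breaks, and the word lists printed above also include punctuation marks as separate tokens. Here is a minimal sketch of one way to keep only the words. It assumes a blank English pipeline (spacy.blank("en"), tokenizer only) is enough for tokenization; the helper read_words and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path

import spacy

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")


def read_words(path):
    """Tokenize a UTF-8 text file, skipping punctuation and whitespace tokens."""
    text = Path(path).read_text(encoding="utf-8")
    doc = nlp(text)
    return [token.text for token in doc if not token.is_punct and not token.is_space]


# Hypothetical sample file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text(
        "I live in Hyderabad, India.\nI love doing projects in AI / ML.",
        encoding="utf-8",
    )
    words = read_words(sample)

print(words)
```

The same function can be pointed at any text file path; token.is_punct and token.is_space are standard Token attributes.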

Nouns & Verbs: Here is the code for extracting noun phrases (noun chunks) and verb lemmas from the text file using the Doc instance:

print("Noun phrases:", [chunk.text for chunk in modi_speech.noun_chunks])
print("Verbs:", [token.lemma_ for token in modi_speech if token.pos_ == "VERB"])
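
Note that noun_chunks returns noun phrases, not individual nouns. To list single nouns, you can filter tokens on pos_ in the same way as for verbs. The sketch below assumes spaCy 3, where a Doc can be built with hand-set pos and lemmas values; the tags here are written by hand purely so the example runs without a trained pipeline. The helper nouns_and_verbs is hypothetical. With a real pipeline you would pass it a Doc such as modi_speech instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc


def nouns_and_verbs(doc):
    """Return individual nouns (common and proper) and a count of verb lemmas."""
    nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
    verbs = Counter(token.lemma_ for token in doc if token.pos_ == "VERB")
    return nouns, verbs


# Hand-annotated Doc (illustrative only; a trained pipeline would assign these).
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["I", "love", "doing", "projects", "in", "Hyderabad"],
    pos=["PRON", "VERB", "VERB", "NOUN", "ADP", "PROPN"],
    lemmas=["I", "love", "do", "project", "in", "Hyderabad"],
)

nouns, verbs = nouns_and_verbs(doc)
print("Nouns:", nouns)
print("Verb counts:", verbs.most_common())
```

Counting lemmas rather than raw words groups forms such as "doing" and "does" under one entry.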

Named Entities: Here is the code for extracting named entities from the text file using the Doc instance:

# Find named entities, phrases and concepts
for entity in modi_speech.ents:
    print(entity.text, entity.label_)
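
For longer documents, it is often easier to read the entities grouped by label. The following is a minimal sketch of that idea. The entities are marked by hand with doc.char_span so the example runs without a trained pipeline; the labels chosen and the helpers mark and entities_by_label are assumptions for illustration. With a loaded pipeline, doc.ents is filled in automatically.

```python
from collections import defaultdict

import spacy


def entities_by_label(doc):
    """Group entity texts under their labels, e.g. PERSON or GPE."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


def mark(doc, phrase, label):
    """Build a labelled span for the first occurrence of phrase (illustrative helper)."""
    start = doc.text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


nlp = spacy.blank("en")
doc = nlp("My name is Ajitesh Shukla. I live in Hyderabad, India.")

# Hand-set entities; a trained pipeline would predict these instead.
doc.ents = [
    mark(doc, "Ajitesh Shukla", "PERSON"),
    mark(doc, "Hyderabad", "GPE"),
    mark(doc, "India", "GPE"),
]

print(entities_by_label(doc))
```

The same entities_by_label function works unchanged on a Doc produced by a trained pipeline.
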
Ajitesh Kumar

