In this blog, we will learn how to implement a content-based recommender system in Python, using a movie recommender system as the working example. Download the movies data from here to work with the example given in this blog.
The key activities in building a movie recommender system based on the content-based recommendation technique are: importing and preparing the data, vectorizing the text that describes each movie, computing similarities between movies, and generating recommendations for a given title.
To start with, we import the data in CSV format. Once the data is imported, the next step is to analyse and prepare it before applying modeling techniques.
import pandas as pd
df = pd.read_csv('sample_data/movies.csv')
df = df[['title', 'genres', 'keywords', 'cast', 'director']]
df = df.fillna('') # Fill missing values with empty strings
df.head()
The dataset contains 24 columns, only a few of which are needed to describe a movie. In the above code, the key features (title, genres, keywords, cast and director) are extracted, and missing values in these features are filled with empty strings.
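To see why the missing values are filled before the text columns are combined, here is a minimal sketch on a made-up two-row DataFrame (the rows and values are illustrative and not taken from the dataset). In pandas, concatenating a string with a missing value yields a missing value, so a single empty field would wipe out the whole combined text for that movie.

```python
import pandas as pd

# Hypothetical toy data: the second movie has no director recorded
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'director': ['Jane Doe', None],
})

# Without fillna, the missing director makes the combined text missing too
combined_raw = toy['title'] + ' ' + toy['director']

# With fillna(''), the title survives even though the director is unknown
filled = toy.fillna('')
combined_filled = filled['title'] + ' ' + filled['director']

print(combined_raw.isna().tolist())
print(combined_filled.tolist())
```

The first print shows which combined values ended up missing; the second shows the combined text after filling.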
The next step is to add a column to the dataframe that holds the value created by combining the other columns representing key attributes such as title, genres, etc. CountVectorizer is then used on that column to vectorize the text. The following code does this.
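from sklearn.feature_extraction.text import CountVectorizer
# Combine the key attributes into a single text column
df['features'] = df['title'] + ' ' + df['genres'] + ' ' + df['keywords'] + ' ' + df['cast'] + ' ' + df['director']
# Count words per movie, dropping English stop words and rare words
vectorizer = CountVectorizer(stop_words='english', min_df=20)
word_matrix = vectorizer.fit_transform(df['features'])
word_matrix.shape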
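The following happens in the above code: CountVectorizer converts the combined text of each movie into a row of word counts. Setting stop_words='english' drops common English words, and min_df=20 ignores words that appear in fewer than 20 movies.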
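A small sketch on a made-up corpus makes the effect of stop_words and min_df visible. The three sentences below are hypothetical stand-ins for the combined text column, and min_df=2 is used only because the toy corpus is tiny; it plays the same role as min_df=20 on the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the combined text of three movies
docs = [
    'the space adventure with aliens',
    'a space adventure in the future',
    'the romantic comedy in paris',
]

# A word must appear in at least 2 documents to be kept,
# and common English stop words are removed first
toy_vectorizer = CountVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.shape)
```

Only words shared by at least two documents survive, so the vocabulary is much smaller than the raw word list. Note that a document containing only rare words ends up as an all-zero row.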
The table of word counts contains 4,803 rows (one for each movie) and 918 columns. The next task is to compute the cosine similarity for each pair of rows:
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(word_matrix)
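To make the similarity computation concrete, the sketch below applies cosine_similarity to a small, made-up count matrix and compares one entry with the value computed by hand (dot product divided by the product of the vector lengths). The counts are illustrative assumptions, not rows from the movie data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical word-count rows for three movies over a four-word vocabulary
counts = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 2],
])

# Pairwise similarities: entry [i, j] compares movie i with movie j
sim_toy = cosine_similarity(counts)

# The same value computed by hand for movies 0 and 1
manual = counts[0] @ counts[1] / (
    np.linalg.norm(counts[0]) * np.linalg.norm(counts[1])
)

print(sim_toy.round(3))
print(round(float(manual), 3))
```

The result is a square, symmetric matrix with 1.0 on the diagonal. Movies that share no words, such as movies 0 and 2 here, get a similarity of 0.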
The final step is to get one or more recommendations based on the input, which is a movie title. The Python code given below returns the top N movies most similar to the input movie title.
def recommend(title, df, sim, count=10):
    # Get the row index of the specified title in the DataFrame
    index = df.index[df['title'].str.lower() == title.lower()]
    # Return an empty list if there is no entry for the specified title
    if (len(index) == 0):
        return []
    # Get the corresponding row in the similarity matrix
    similarities = list(enumerate(sim[index[0]]))
    # Sort the similarity scores in that row in descending order
    recommendations = sorted(similarities, key=lambda x: x[1], reverse=True)
    # Get the top n recommendations, ignoring the first entry in the list since
    # it corresponds to the title itself (and thus has a similarity of 1.0)
    top_recs = recommendations[1:count + 1]
    # Generate a list of titles from the indexes in top_recs
    titles = []
    for i in range(len(top_recs)):
        title = df.iloc[top_recs[i][0]]['title']
        titles.append(title)
    return titles
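The following can be learned from the above code: the title lookup is case-insensitive, an unknown title returns an empty list, and the result is the list of titles with the highest cosine similarity to the input movie, excluding the movie itself.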
You can now invoke the recommend function to get the top 10 recommendations (count defaults to 10). This is how you would invoke the function: recommend('Spectre', df, sim).
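As a possible variation, the sketch below uses NumPy's argsort instead of sorting a Python list, and removes the input movie by its position rather than assuming it is the first entry after sorting. The function name recommend_argsort and the toy data are hypothetical and are not part of the original code; the sketch also assumes the DataFrame rows are in the same order as the rows of the similarity matrix.

```python
import numpy as np
import pandas as pd

def recommend_argsort(title, df, sim, count=10):
    """Hypothetical variant of recommend() that excludes the query by position."""
    # Case-insensitive lookup of the title's row position
    mask = (df['title'].str.lower() == title.lower()).to_numpy()
    matches = np.flatnonzero(mask)
    if len(matches) == 0:
        return []
    pos = matches[0]
    # Sort row positions by similarity, highest first
    order = np.argsort(-sim[pos], kind='stable')
    # Drop the query movie itself, wherever it landed in the ordering
    order = order[order != pos]
    return df['title'].iloc[order[:count]].tolist()

# Made-up data to exercise the function
toy_df = pd.DataFrame({'title': ['Alpha', 'Beta', 'Gamma']})
toy_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])
result = recommend_argsort('beta', toy_df, toy_sim, count=2)
print(result)
```

Excluding by position may matter when two movies have identical feature text, since both then have a similarity of 1.0 and the input movie is not guaranteed to be sorted first.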
You can use the Python code example given in this blog to create a content-based recommender system. In case you have questions, feel free to drop a message.