In this blog, we will learn how to implement a content-based recommender system in Python, using a movie recommender system as the example. Download the movies data from here to work through the example given in this blog.
Building a movie recommender system with the content-based recommendation technique involves the following key activities: importing and preparing the data, combining the descriptive columns into a single text feature, vectorizing that text, computing pairwise cosine similarities, and looking up the movies most similar to a given title.
To start with, we import the data in CSV format. Once the data is imported, the next step is to analyse and prepare it before applying modeling techniques.
import pandas as pd
 
df = pd.read_csv('sample_data/movies.csv')
df = df[['title', 'genres', 'keywords', 'cast', 'director']]
df = df.fillna('') # Fill missing values with empty strings
df.head()
The dataset contains 24 columns, only a few of which are needed to describe a movie. In the code above, key features such as title, genres, etc. are extracted, and missing values in these features are filled with empty strings.
The next step is to add a column to the dataframe that holds the value created by combining the other columns representing key attributes such as title, genres, etc. CountVectorizer is then used on that column to vectorize the text. The following code does this.
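from sklearn.feature_extraction.text import CountVectorizer
 
# Combine the selected columns into one text column
# (assumed here to be all five columns, separated by spaces)
df['features'] = df['title'] + ' ' + df['genres'] + ' ' + df['keywords'] + ' ' + df['cast'] + ' ' + df['director']
 
vectorizer = CountVectorizer(stop_words='english', min_df=20)
word_matrix = vectorizer.fit_transform(df['features'])
word_matrix.shape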
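The following happens in the above code: CountVectorizer turns the text in the features column into a table of word counts. Setting stop_words='english' removes common English stop words, and min_df=20 ignores words that appear in fewer than 20 movies. The fit_transform call returns a sparse matrix with one row per movie and one column per remaining word.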
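To see what these two CountVectorizer settings do, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the names docs, toy_vectorizer, and toy_matrix are illustrative assumptions, and min_df is lowered to 2 so the effect is visible on such a small input.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not taken from movies.csv)
docs = [
    "the action hero saves the city",
    "an action thriller in the city",
    "a romantic comedy in paris",
]

# stop_words='english' drops common words such as "the" and "in";
# min_df=2 keeps only words that occur in at least two documents
toy_vectorizer = CountVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(docs)

# Vocabulary terms listed in column order
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                  # ['action', 'city']
print(toy_matrix.toarray())   # one row per document, one column per kept word
```

Only "action" and "city" survive: every other word is either a stop word or appears in a single sentence. The third sentence ends up as an all-zero row, which is what happens to a movie whose words are all rare.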
The table of word counts contains 4,803 rows—one for each movie—and 918 columns. The next task is to compute cosine similarities for each row pair:
from sklearn.metrics.pairwise import cosine_similarity
 
sim = cosine_similarity(word_matrix)
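Cosine similarity between two count vectors is their dot product divided by the product of their lengths. As a rough illustration of that formula, here is a plain-Python sketch on made-up vectors. The cosine helper and the three vectors are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Sketch-level choice: treat an all-zero vector as similar to nothing
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up word-count rows for three hypothetical movies
movie_a = [1, 1, 0, 2]
movie_b = [1, 0, 1, 2]
movie_c = [0, 0, 3, 0]

print(round(cosine(movie_a, movie_b), 4))  # 0.8333 (dot = 5, both lengths = sqrt(6))
print(cosine(movie_a, movie_c))            # 0.0 (no words in common)
```

Because word counts are never negative, the scores fall between 0 and 1, and a movie compared with itself scores 1.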
The final step is to get one or more recommendations based on the input, which is a movie title. The Python code given below returns the top N recommendations for movies similar to the input movie name.
def recommend(title, df, sim, count=10):
    # Get the row index of the specified title in the DataFrame
    index = df.index[df['title'].str.lower() == title.lower()]
     
    # Return an empty list if there is no entry for the specified title
    if (len(index) == 0):
        return []
 
    # Get the corresponding row in the similarity matrix
    similarities = list(enumerate(sim[index[0]]))
     
    # Sort the similarity scores in that row in descending order
    recommendations = sorted(similarities, key=lambda x: x[1], reverse=True)
     
    # Get the top n recommendations, ignoring the first entry in the list since
    # it corresponds to the title itself (and thus has a similarity of 1.0)
    top_recs = recommendations[1:count + 1]
 
    # Generate a list of titles from the indexes in top_recs
    titles = [df.iloc[i]['title'] for i, _ in top_recs]
 
    return titles
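The following can be learned from the above code: the title lookup is case-insensitive, an empty list is returned when the title is not found in the dataset, the matching row of the similarity matrix is sorted in descending order, and the first entry is skipped on the assumption that it is the queried movie itself.
You can now invoke the recommend function to get the top 10 recommendations (count=10 is the default). This is how you would invoke the function: recommend('Spectre', df, sim).
You can use the Python code example given in this blog to create a content-based recommender system. In case you have questions, feel free to drop a message.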
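One caveat: the function drops the first sorted entry on the assumption that it is the queried movie. If another movie has an identical feature vector, it also scores 1.0, and the queried movie may not sort first. The sketch below is one possible variant that removes the queried row by position instead. The name recommend_excluding_self and the toy data are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def recommend_excluding_self(title, df, sim, count=10):
    # Positions (not index labels) of rows whose title matches, ignoring case
    matches = np.flatnonzero((df['title'].str.lower() == title.lower()).to_numpy())
    if len(matches) == 0:
        return []
    pos = matches[0]

    # Row positions ordered by descending similarity; stable sort keeps row order on ties
    order = np.argsort(-np.asarray(sim[pos]), kind='stable')

    # Drop the queried movie explicitly instead of assuming it sorts first
    order = order[order != pos]
    return df['title'].iloc[order[:count]].tolist()

# Made-up data: 'Alpha' and 'Beta' have identical similarity rows
toy_df = pd.DataFrame({'title': ['Alpha', 'Beta', 'Gamma', 'Delta']})
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])

print(recommend_excluding_self('beta', toy_df, toy_sim, count=2))  # ['Alpha', 'Delta']
```

On this toy input, skipping the first sorted entry would have discarded 'Alpha' and returned 'Beta' as a recommendation for itself. Whether this matters for the real dataset depends on whether it contains movies with identical feature text.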