Which movies should I watch? Collaborative Filtering with Nearest Neighbors


I finally got around to coding this thing. It turned out to be really quite simple, thanks to the awesome blog posts I follow! I still had to code some of the other functions myself, but the principle is the same.

The Concept: Nearest Neighborhood

The standard method of Collaborative Filtering (CF) is known as the Nearest Neighborhood algorithm.
There are two flavors: user-based CF and item-based CF.
Let’s first look at user-based CF. User-based CF is a technique used to predict the items that a user might like on the basis of the ratings given to those items by other users whose taste is similar to that of the target user.
Basically, the idea is to find the users most similar to your target user (the nearest neighbors) and use a weighted average of their ratings of an item as the predicted rating of that item for the target user.
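To make that concrete, here is a minimal sketch of user-based CF in plain Python. The users, movies, and ratings are made up for illustration, and cosine similarity is just one of several measures you could pick; this is not the code from my repository.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. Not real data.
ratings = {
    "ana":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict_user_based(target, movie, k=2):
    """Similarity-weighted average of the k most similar users' ratings."""
    neighbors = [
        (cosine(ratings[target], r), r[movie])
        for user, r in ratings.items()
        if user != target and movie in r
    ]
    neighbors.sort(reverse=True)          # most similar users first
    top = neighbors[:k]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None                       # no usable neighbors
    return sum(sim * rating for sim, rating in top) / total_sim

# Ana has not rated movie D; ben (similar taste) loved it, cara did not.
prediction = predict_user_based("ana", "D")
```

Because ben's ratings look much more like ana's than cara's do, his rating of D counts for more, so the prediction lands closer to his 5 than to cara's 1.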


Item-based CF is an algorithm where the similarities between different items in the dataset are calculated using one of a number of similarity measures, and these similarity values are then used to predict ratings for user-item pairs not present in the dataset.
One key advantage of this algorithm is its stability. That is, the ratings on a given item tend not to change significantly over time, unlike human preferences.
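Here is a rough sketch of the item-based version, again in plain Python with made-up ratings. The function names and the choice of cosine similarity are my assumptions for illustration only.

```python
from math import sqrt

# Hypothetical toy ratings, this time keyed by movie: movie -> {user: rating}.
item_ratings = {
    "A": {"ana": 5, "ben": 4, "cara": 1},
    "B": {"ana": 4, "ben": 5, "cara": 2},
    "C": {"ana": 1, "ben": 2, "cara": 5},
    "D": {"ben": 5, "cara": 1},
}

def item_similarity(i, j):
    """Cosine similarity between two movies over the users who rated both."""
    common = set(item_ratings[i]) & set(item_ratings[j])
    if not common:
        return 0.0
    dot = sum(item_ratings[i][u] * item_ratings[j][u] for u in common)
    norm_i = sqrt(sum(item_ratings[i][u] ** 2 for u in common))
    norm_j = sqrt(sum(item_ratings[j][u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict_item_based(user, movie):
    """Average the user's own ratings, weighted by similarity to `movie`."""
    weighted, total_sim = 0.0, 0.0
    for other, raters in item_ratings.items():
        if other == movie or user not in raters:
            continue
        sim = item_similarity(movie, other)
        weighted += sim * raters[user]
        total_sim += sim
    return weighted / total_sim if total_sim else None

# Predict the missing (ana, D) pair from ana's ratings of A, B, and C.
predicted = predict_item_based("ana", "D")
```

The difference from the user-based sketch is what gets compared: here the similarities are between movies, and the prediction only uses the target user's own ratings.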


So for this post, I'll be using item-based CF, as I will be measuring how close the movies are to each other based on the ratings given to them by the users.
The kNN algorithm is a supervised, non-parametric, lazy learning method used for both classification and regression. Supervised because it needs labeled data points to work. Non-parametric because it doesn’t make any assumptions about the data distribution. Lazy because it does not build a generalized model from the training data points up front; that work is delayed until the testing phase.
This algorithm is based on how similar the features are. If we translate this into the context of recommendation systems, it is simply finding other stuff that is similar to what you already like!
Hence, we first need to assume (or rather, the algorithm assumes) that similar things are normally located near each other. The "birds of a feather flock together" thing. We therefore need some distance metric to measure how far apart they are!
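Two common choices of distance metric are Euclidean distance and cosine distance. The snippet below is only a sketch: the three rating vectors are invented, with each position standing for one hypothetical user's rating of that movie.

```python
from math import dist, sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return dist(a, b)

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical rating vectors (one entry per user) for three movies.
movie_x = [5, 4, 1]
movie_y = [4, 5, 2]
movie_z = [1, 2, 5]

print(euclidean_distance(movie_x, movie_y), euclidean_distance(movie_x, movie_z))
print(cosine_distance(movie_x, movie_y), cosine_distance(movie_x, movie_z))
```

Under both metrics, movie_y comes out nearer to movie_x than movie_z does, which matches the intuition that the same users liked x and y.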
I am going to use a simple example presented in this post. Look through it!
For example, we have two attributes for each data point, and all the data points can be grouped into 2 classes as follows. The next question is: if we add another datapoint to the system (say the circle thingy with the question mark in the middle), which class does it belong to? What's your guess?

We use the kNN algorithm, aka k-nearest neighbors, for this. K because we first look at the k nearest neighbors (like literally) of a datapoint. That is, if k=5, the five nearest datapoints each cast a vote (at least that's what I imagine they would do), the majority wins, and the unlabeled datapoint gets the green class.
But if we set k to 3, the unlabeled datapoint gets the yellow class as shown below.
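The voting idea is easy to try out in code. In this sketch the coordinates are made up (they are not the points from the figure), chosen so that the two nearest neighbors are yellow and the next three are green.

```python
from collections import Counter
from math import dist

# Hypothetical labeled points: ((x, y), class). Not the figure's actual data.
points = [
    ((1.0, 0.0), "yellow"),
    ((0.0, 1.0), "yellow"),
    ((1.5, 0.0), "green"),
    ((0.0, 2.0), "green"),
    ((2.0, 0.5), "green"),
    ((5.0, 5.0), "yellow"),
]

def knn_classify(query, labeled_points, k):
    """Return the majority class among the k points nearest to `query`."""
    nearest = sorted(labeled_points, key=lambda p: dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

unlabeled = (0.0, 0.0)  # the "question mark" datapoint
print(knn_classify(unlabeled, points, k=3))  # 2 yellow votes vs 1 green
print(knn_classify(unlabeled, points, k=5))  # 2 yellow votes vs 3 green
```

The same point gets a different class depending on k, which is why k is usually treated as something to tune rather than a fixed constant.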
So using the concepts above, one could easily imagine how kNN can be of great assistance in recommender systems. In the case of a movie recommender, I could just throw in a movie that I like, and based on the neighbors of this movie (measured on some features in the feature space), the kNN algorithm spews out the list of movies that are nearest to it!
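As a toy version of that, here is a sketch of a nearest-neighbor movie lookup in plain Python. Everything in it is an assumption for illustration: the movie titles are placeholders, the rating vectors are invented, unrated movies are filled with 0, and cosine distance is my pick of metric. The real implementation may well differ.

```python
from math import sqrt

# Hypothetical movie-by-user rating vectors; 0 means "not rated".
movie_vectors = {
    "Movie A": [5, 4, 0, 1],
    "Movie B": [4, 5, 1, 0],
    "Movie C": [0, 1, 5, 4],
    "Movie D": [1, 0, 4, 5],
    "Movie E": [5, 5, 0, 0],
}

def cosine_distance(a, b):
    """1 minus cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def recommend(title, k=2):
    """Return the k movies whose rating vectors are nearest to `title`."""
    query = movie_vectors[title]
    distances = [
        (cosine_distance(query, vector), other)
        for other, vector in movie_vectors.items()
        if other != title
    ]
    return [other for _, other in sorted(distances)[:k]]

print(recommend("Movie A"))  # the two nearest neighbors of Movie A
```

Movies liked by the same users end up close together, so the neighbors of a movie you enjoyed are reasonable candidates for what to watch next.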

The Implementation: 

Refer to my GitHub repository for the implementation!
Over the weekend (some weekend far, far away from now; you have no idea how much has happened since!), I took a fancy to watching Forrest Gump, sponsored on Netflix by a friend. Naturally, I wanted another movie like it, as I loved it so much!


The recommender system that I built recommended the following to me. I think I love them!


References:

3. https://medium.com/analytics-vidhya/k-nearest-neighbors-all-you-need-to-know-1333eb5f0ed0
4. https://heartbeat.comet.ml/recommender-systems-with-python-part-ii-collaborative-filtering-k-nearest-neighbors-algorithm-c8dcd5fd89b2
