Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion...
Transcript of Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion...
![Page 1: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/1.jpg)
Clustering of top 250 movies from IMDB
Mustafa Panbiharwala Siddharth Ajit
![Page 2: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/2.jpg)
Broad goals
• Find out if there’s any intrinsic pattern or clusters among top rated movies (Based on IMDb Ratings).
• Identify similarities among the movies of the same cluster.
• Use Dimensionality Reduction to improve clustering output
• Repeat the above steps for different clustering algorithms.
![Page 3: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/3.jpg)
Data acquisition
• Movie features were obtained using OMDB’s API.
• Plot summary were scrapped from IMDB
https://www.imdb.com/chart/top.
![Page 4: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/4.jpg)
• Data was directly acquired from OMDB’s Database through it’s public API.
• The API had movie’s IMDB id as one of the outputs.
• Used movie’s IMDB to scrap the data from IMDB’s website.
• OMDB’s API had the following Output-
![Page 5: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/5.jpg)
Feature selection
• Features used for clustering : year, runtime ,genre director, actors, plot, language, country.
• Certain features like year of release was discretized to categorical variables by choosing a suitable cutoff value.
• For text heavy columns, important words were pulled after cleaning and sorting according to TF-IDF scores.
![Page 6: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/6.jpg)
Genre
![Page 7: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/7.jpg)
Actors
![Page 8: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/8.jpg)
Directors
![Page 9: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/9.jpg)
Language
![Page 10: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/10.jpg)
Countries
![Page 11: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/11.jpg)
Feature representation
• Features – Genre, Actors, Directors, Language and Countries were one hot encoded.
![Page 12: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/12.jpg)
Text processing of plot summary
These are standard procedures for NLP problems.
• Stripping of all punctuations, stop words etc.
• Tokenizing – breaking text into sentences and words.
• Stemming – reducing the inflected word to the root form.
![Page 13: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/13.jpg)
TF-IDF
• Stands for term frequency–inverse document frequency
• Reveals the importance of a word to a document in a corpus.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log(Total number of documents / Number of documents with term t).
TF-IDF = TF(t) * IDF(t)
![Page 14: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/14.jpg)
Terms with TF-IDF scores
• The list was combed through for relevant words
• Top 20 words were selected and one hot encoded
father, police, family , men , young , war , child , home , young, son , love, money , friend, mother , escape , boy , girl, murder , brother, escape, german , gang
words selected:
![Page 15: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/15.jpg)
Final Dataframe
![Page 16: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/16.jpg)
Curse of dimensionality
• Final data frame is sparse
• As predictor space explodes, clustering becomes difficult due to various reasons – Effect of noise, high correlation among predictors.
• Approach – Dimension reduction ( PCA)
![Page 17: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/17.jpg)
Principal component vectors
• Orthogonal transformation to convert set of correlated variables to a linearly uncorrelated variables called principal vectors.
• The first principal component has the highest variance among all the principal components
• 24 top principal components were used for clustering
• They explained 99.8% of the variance in our dataset.
![Page 18: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/18.jpg)
PCA solution
• Solve using singular value decomposition S = A’ΛA
• A matrix contains eigen vectors of S
• Λ is a diagonal matrix corresponding to each eigen vector
• Retain top q eigen values from Λ
• Now, the predictor subspace has been mapped from p dim to q dim
![Page 19: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/19.jpg)
Predictors sorted by loading vector weights
![Page 20: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/20.jpg)
Tsne 2-D
![Page 21: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/21.jpg)
Clustering
• K means
• EM clustering
• DBSCAN
• Hierarchical clustering
![Page 22: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/22.jpg)
K means clustering
• Randomly assign data points to (1…K) clusters
• Compute centroids of clusters
• Calculate distance measure from data points to corresponding centroids and reassign the points to centroids by distance
• Repeat the above steps until there is no reassignment of data points
![Page 23: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/23.jpg)
EM clustering GMM
![Page 24: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/24.jpg)
(K means)
• K = 5 ( From DBSCAN)
• Cluster formations were difficult to interpret at higher K values
• Minimum SSE was obtained at k = 8
![Page 25: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/25.jpg)
K means results – Cluster 0• No of clusters – 5
• Theme : family, biography
• Genre : Fantasy, comedy,
biography
MoviesStar wars : Episode V
Star wars : Episode IV
The great dictator
Mad Max : Fury road
Whiplash
JawsPassion of Joan of Arc
![Page 26: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/26.jpg)
K means results – Cluster 1
• Theme : Son, family, Murder
• Runtime is very high in comparison
to the median value
• Genre : Dark, History, biography
MoviesThe godfather part I
The godfather part 2
Schindler’s list
Once upon a time in America
Gandhi
Gone with the windBen-Hur
![Page 27: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/27.jpg)
K means results – Cluster 2
• Theme :War, kill
• Runtime is very high in comparison
to the median value
• Genre : Action, Crime, thriller
MoviesDark knightDark knight risesPulp fiction
Inglorious basterdsDjango UnchainedScarface
There will be blood
Once upon a time in west
![Page 28: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/28.jpg)
K means results – Cluster 3
• Theme : young
• Genre : Comedy, Animation, Romance
MoviesSpirited awayFinding NemoModern timesCity lightsCasablancaToy story
Hachi : A dog’s tale
Monty Python
Truman Show
![Page 29: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/29.jpg)
K means results – Cluster 4
• Theme : home, family
• Actor : Al Pacino
• Genre : Thriller, crime
MoviesShawshank RedemptionMatrixThe shiningTerminator 2Beautiful mindTo kill a mockingbirdSe7enPrestigeGood will hunting
V for Vendetta
La la land
Catch me if you can
![Page 30: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/30.jpg)
DBSCAN
• DBSCAN is a density based clustering algorithm.
• Robust towards Outlier Detection.
• The most common distance metric used is Euclidean distance especially for high-dimensional data, this metric can be rendered almost useless due to the so-called “curse of dimensionality”.
![Page 31: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/31.jpg)
DBSCAN Results – Cluster 1
• No of clusters – 5
• Director : Hiyao Miyazaki
• Theme : Young, world
• Genre :Animation, Adventure
Movies
Spirited Away
Finding Nemo
Nausicaa of the valley of wind
Wall -E
The Lion king
![Page 32: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/32.jpg)
Cluster 2
• Director : Christopher Nolan
• Theme : Help, dark
• Actors : Christian Bale, Di Caprio
• Genre : Drama, thriller
Movies
The dark night
Inception
Prestige
Batman begins
![Page 33: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/33.jpg)
Cluster 3
• Actor : Charlie Chaplin
• Theme : love, family
• Genre : Comedy, drama
Movies
City lights
Modern times
The great dictator
The gold rush
![Page 34: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/34.jpg)
Cluster 4
• Director : Stanley Kubrick
• Theme : war, men, German
• Genre :Drama, autobiography
• All the movies in these cluster have
extremely high run time
Movies
Barry Lyndon
Gandhi
Judgement at Nuremberg
Lawrence of Arabia
Schindler's list
Seven Samurai
![Page 35: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/35.jpg)
Cluster 5
• Director : Clint Eastwood
• Theme : war, gang
• Genre :Action, fantasy
• Movies in these cluster are also longer
in comparison to median run time
Movies
Lord of the rings series
The good, the bad and the ugly
Mad Max : Fury road
Gangs of Wasseypur
Roshomon
Yojimbo
![Page 36: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/36.jpg)
Outlier clusters
• DBSCAN assigned many data points as outliers
• Why ?? – DBSCAN looks for spatial density for clustering. In higher dimensions, it is difficult to find such density clusters.
• The clusters formed by DBSCAN are easier to interpret than K means
![Page 37: Clustering of top 250 movies from IMDB - GitHub Pages · Mad Max : Fury road Whiplash Jaws Passion of Joan of Arc. K means results –Cluster 1 •Theme : Son, family, Murder •Runtime](https://reader035.fdocuments.us/reader035/viewer/2022063011/5fc4ca379e100c44846585cc/html5/thumbnails/37.jpg)
Future work
• Scaling of predictors
• In K means clustering, finding the best value of k that is optimum according to gap statistic, silhouette score and has good interpretability
• Different distance measures such as jaccard , Gower distance
• Heirarchical clustering