Final Report: Team 2 Predicting Golden Globe Awards ...€¦ · This power can also be used to...

Final Report: Team 2Predicting Golden Globe Awards & Christmas Day Movie Gross

Deepak Kumar Balasubramaniam, Justin Chan, Ashish Kulkarni, and Dandan Zheng

Department of Computer Science, Stony Brook University,Stony Brook, NY 11790

{dbalasubrama,juchan,ashkulkarni,dazzheng}@cs.stonybrook.edu

http://www.cs.stonybrook.edu/~skiena/591/projects

1 Challenge

Our challenge is focused on making two predictions that revolve around the outcome of movies for the currentyear 2014. The first challenge is to predict the nominees and award winners for the 72nd Golden Globe Awardstaking place on January 11th, 2015. We will focus our efforts to predict the winners of the following fourcategories: Best Director, Best Actor Motion Picture Drama, Best Actress Motion Picture Drama and BestScreenplay.

The second challenge is to predict how much the movies running in theaters on Christmas Day will gross.We consider only the prerelease data associated wih movies to predict the gross of the movies. While boththe Golden Globes and Box office gross revolve around the success of the film industry, the two metrics thatassess a movies success are very different from each other. Our final report that follows will outline or splitapproach to handling the two challenges.

2 History/Background

2.1 History of the Golden Globes

The 2015 Golden Globe Awards represent the 72nd season of film and television accolades bestowed by theHollywood Foreign Press Association (HFPA). The Golden Globes are part of an annual film and televisionaward season which includes awards such as the Academy Award (Oscars), Screen Actors Guild Awards, andthe Directors Guild Awards, and for television the Emmy Awards and the Peoples Choice Awards. For theGolden Globe Awards, the roughly 90 full members [11] of the Hollywood Foreign Press Association vote onwinners for the award, bestowed for different categories between film and television [1]. Members of the HFPAare international journalists responsible for maintaining relations between Hollywood and the internationalcommunity. Our research project aims to predict the upcoming winners of the 72nd annual Golden Globeawards. Nominees will be announced on December 11th, 2014 and the winners are announced January 11th,2015. The scope of our prediction encompasses four of the twelve awards in the film category:

1. Best Director2. Best Actor Motion Picture Drama3. Best Actress Motion Picture Drama4. Best Screenplay

Literature exists with regard to predicting the winners of the Academy Awards, but is sparse for anyother film award. Predicting the outcome of the Golden Globes includes the unique challenge: the GoldenGlobes are the first awards of the season. Past literature for Academy Award prediction often factored inwhether or not the film won other awards, such as the Golden Globes or Directors Guild Award. The awardsseason for films released in 2013 were as follows and reflect a similar schedule for this year’s award season[2]:

01/12/2014 - Golden Globe Awards01/16/2014 - Critics Choice Awards01/18/2014 - Screen Actors Guild Awards

2 D Balasubramaniam, J Chan, A Kulkarni, D Zheng

01/19/2014 - Producers Guild Awards01/25/2014 - Directors Guild Awards02/01/2014 - Writers Guild Awards02/16/2014 - BAFTA Film Awards03/02/2014 - Academy Awards

Predicting the outcome of Golden Globes will involve using information specific to the film, actor, actress,director, or screenwriter. We use the data pulled from the Internet Movie Database (IMDB) to generate met-rics on actor/actress/screenwriter/director performance as well as data about past Golden Globe nominationsand wins in order to make our prediction.

2.2 History of Christmas Day Movie Gross Sales

Christmas is one of the most awaited times of the year where families and friends go out to watch movies andenjoy the holiday spirit. Box Office Mojo’s Brandon Gray states that since the year 2000, Christmas week hasconsistently been the highest grossing week of the year. From movie industry’s point of view, the Christmasholiday provides an excellent platform for movies to make a lot of money and even an average movie canbuild over time if it had grossed fairly during Christmas. Thus Christmas Day Gross is an important metricto measure how well a movie performs.The movie industry is of interest to general public, economists and investors for its high profits and entertain-ment value. Our project aims to predict the movie gross for films which are running in theatres on Christmasday, i.e. December 25th, 2014. It will help people associated with the movie industry to make wise decisions.There are few publications which forecast the movie gross based on pre-release data based on the movie dataobtained from IMDB dataset. There are publications which use post-release data like the first-weekend grossand number of theatres to predict movie gross.

3 Literature Review

The papers explored below offer a cross-section of past research that we think represent relevant informationto leverage for our prediction. The Pardoe et al. paper and the Kaplan et al. paper explore logistic modelingmethods to predict Academy Award winners, relevant to our Golden Globe prediction challenge. The Zhanget al. paper leverages news data to enhance movie gross prediction, and the Simonoff et al. paper appliedregression methods to predict movie gross.

3.1 Pardoe, Iain, and Dean K. Simonton

Pardoe et al. [3] focus on predicting the Academy Award winners of four major categories: picture, director,actor in a leading role and actress in a leading role from those who were nominated each year. Since the goalis to predict the eventual winner from a list of nominees, any information on the nominees that is availablebefore the announcement of the winner is potentially useful, including other Oscar category nominations,previous nominations and wins and other (earlier in the season) movie awards. The primary data source usedwas IMDB.

For previous Oscar winner and nominee data, the main restriction on when a variable enters their model isthe earliest date at which the variable is available. Variables were also omitted for years in which they providedlittle predictive power and counter-intuitive parameter estimates. The paper adopts discrete choice modelsthat are used to provide annual predictions and discusses the modelling process. The modelling approachused allows a 1-year-ahead, out-of sample prediction of the four major Oscars from 1938 to 2006 (earlieryears had yet to accumulate sufficient information to provide satisfactory predictions). The main models thepaper used are the mixed logit (fully general statistical model for examining discrete choices) model andmulti-nominal logit (MNL) model. The authors used a software package called NLOGIT to obtain maximumlikelihood estimates for the MNL and ML, as well as another software package WinBUGS which uses Markovchain Monte Carlo techniques to estimate MNL and ML.

They found that Best Director Oscar was the most predictable, followed by the Oscars for Best Picture,Best Actor and Best Actress. Each of the categories has tended to become more predictable over time,

Predicting Golden Globe Awards & Christmas Day Movie Gross 3

particularly Best Actress, which was very difficult to predict until the early 1970s. The failure of the modelto predict the Best Picture winner in the last three years signifies a reversal in the predictability trend forthe category.

Pardoe et al. found that discrete choice modelling of past data on Oscar nominees in the four majorcategories Best Picture, Director, Actor and Actress enables prediction of the winners in these categorieswith a high degree of success. The authors state with confidence that if recent trends persist, it should bepossible to predict future winners with a prediction success rate of approximately 70% for picture, 93% fordirector, 77% for actor and 77% for actress.

3.2 Kaplan, David

Kaplan [4] applies a logistic regression model to attempt to determine influences for winning an AcademyAward based on: personnel, genre, marketing, and records in other award competitions. The data sourcesused for this paper include the Internet Movie Database and a website called the Videohound Golden MovieRetriever. Data was placed into categories for personnel information, genre, and marketing. The person-nel category included information about the film with regard to who the actors, actresses, directors, andscreenwriters were, specifically their past accomplishments or awards won - accessed through IMDB andVideohound. Genre for the film was determined based on either categorization from IMDB or for nominationfor an award in a specific genre category. Marketing data included the film length and release date, obtainedfrom IMDB.

The model used for this paper was a logistic linear regression. Twenty-two independent variables wereincluded in the first logistic linear regression model. The variables were almost all binary in nature, such asGolden Globe Win for Drama - whether not the film won a Golden Globe Drama award, or Musical - whetheror not the film was categorized as a musical, but also included non-binary variables such as Number of BestDirector Nominations, or the length of the film.

The first model is shown below - the author then produced successive models based on removing insignifi-cant variables as well as variables that were affected by multicollinearity (via calculation of variance inflationfactors).

Kaplan Model 1:

ln(BESTPICi/[1 +BESTPICi]) =

β0 + β1TOTALNOMi+ β2MOSTNOMi+ β3GGWINCOMi

+ β4GGWINDRAMAi+ β5MUSICALi+ β6COMEDY i

+ β7EPICi+ β8BIOPi+ β9EPICBIOPi+ β10PREV DIRWINSi

+ β11MOSTPREVDIRWINSi+ β12PREV DIRNOMSi

+ β13MOSTPREVDIRNOMSi+ β14GGDIRWINi+ β15DGAWINi

+ β16PREV ACTWINi+ β17CURACTNOMSi+ β18GGACTWINSi

+ β19RELEASEi+ β20TIMEi+ β21LONGEST + β22ADAPTi+ εi

Through changes in the model parameters, Kaplan was able to determine the parameters that had the highestvalue for predicting the next Academy Award winner. Based on the paper’s findings, the genre categorizationvariable of “EPICBIOP”, whether or not a film was categorized as an epic and a biographic film, as well asDGAWIN, whether or not the film’s director won the Directors Guild Association award were most predictiveof whether or not a film would win the Academy Award.

This paper demonstrates the successful application of logistic regression modeling to predict the winnerof the Best Picture Award for the Academy Awards using data from the award season. While this model iseffective for the Academy Awards, it may not be entirely relevant for our group’s attempt at a Golden GlobeAward prediction, since the Golden Globes are the first awards of the season.

3.3 Zhang, Wenbin, and Skiena, Steven

Zhang and Skiena [5] aim to improving the movie gross prediction by applying linguistic analysis to newsdata. The paper focuses on pre-release models as a method of predicting the movie gross before the movie is


released. From the recent research work, it was shown that media has the power to estimate future financialmarket like stock prices, volatilities or earnings. This power can also be used to predict movie grosses. Ashigh-grossing movies generate more media exposure, more attention is given to such movies.

Movie data is obtained from IMDB website and news data is obtained via Lydia. Lydia is a high-speedtext processing system, which analyze news publicity and output movie news data. The IMDB dataset pro-vides important movie variables like budget or opening screens. Even the genre of the movie is an importantmovie parameter. Lydia generates entity database consisting of four entities - movie titles, directors, top 3actors and top 15 actors.

In the first part of the paper, different correlation analyses are performed between movie gross and movievariables obtained from IMDB dataset and news variables. The correlation is done between variables as well.The correlation is measured both in strength and significance. The significance was verified with t-test. Inthe second part, regression and k-nearest neighbor classifiers were adopted as modelling methodologies. Re-gression models forecasted grosses and k-NN modeled similar movies. The movie gross prediction was dividedinto three sections: using only the traditional movie variables, using news variables, and then finally withvariables combined.

The data set was divided into two parts - 60% as the training set and the rest as the predicting set. Analysingthe correlation coefficient between traditional movie data variables and movie gross, the following variableswere significant in forecasting the movie gross: budget, opening screens, whether or not it was a holiday,MPAA Rating, whether or not it was a sequel, and the genre of the movie. In the news data, the movietitle references and top actors references have better correlations with gross compared to director refer-ences. Article count, frequency, positive frequency, and negative frequency are highly correlated with eachother. In regression model, budget improves the performance of the model. The high grossing movies arepredicted best by k-NN model whereas the low-grossing movies by regression model. The model built usingthe combination of traditional IMDB data and news data yielded better results. As the figure below shows, anotable improvement in performance is observed when the model uses both the the IMDB data and news data.

Movie gross prediction can be done by using IMDB data only, or news data only or combination of both asthey are highly correlated with movie gross. The combined model yielded better results. Regression modelshave better low-grossing performance and k-NN models have better high grossing performance.

3.4 Simonoff, Jeffrey S., and Ilana R. Sparrow

The authors [6] attempt to predict a films first weekend gross before the movie is released and also immedi-ately after release. A single movie can be the difference between millions of dollars of profits or losses for astudio in a given year. In 1998, moviegoers spent $6.88 billion at the U.S box office.

The authors explored whether or not the star power of the actors affected a movies success, as well aswhether or not big budget movies actually grossed more money. Additionally, they considered whether ornot the first weekends gross can be highly predictive of the total gross, whether or not reviews by prominentcritics have effect on the result, as well as whether or not awards provide a boost to the movie success.

Simonoff and Sparrow found that the a film’s first weekend gross had a high correlation with its total gross.Large film release and small film release movies were separated by the number of theaters that featuredthe movie on its opening. Regression models predicting log gross from information available after the firstweekend. The chosen model is based on only log first weekend gross and log opening screens, with an R2of 73.5% (adding a possible effect of Christmas release adds only 0.8% to the R2 ). Oscar nominations andwins were modeled with <10 theater movies and >10 theatre movies. Domestic grosses and as well as foreigngrosses(60 - 70% of total gross) were modeled.


Relevant details about the movie and its business information were retrieved from the IMDB for all themovies released in the United States during the calendar year 1998. Movies that opened late in 1997 to beconsidered for awards in 1998 were also considered. Genre and MPAA rating, origin country, star power,budget, sequel, release time, number of release screens, first weekend gross revenue, award nominations andwins were the variables considered. Star power was based on whether or not the actor was listed on Enter-tainment Weekly’s list of Top 25 Actors.

The accuracy of the prediction was found to increase after the first weekend of the release for more than 10theater movies. Oscar nominations does not affect the movie gross unless the movie is running for a longperiod of time. Two movies of the same type releasing at the same time will impact one another.

3.5 Additional Literature and Meeting with Yanqing Chen on DeepWalk

On the suggestion of Professor Skiena, we met with Stony Brook Ph.D. candidate Yanqing Chen to discusspotential applications for research on word embeddings as a method of determining how similar actors andactresses were to each other, or how similar movies were to each other [16]. Yanqing is able to calculate thevector representation of items using deep walk embedding. By doing this, we can determine the similaritybetween items based on the information extracted from Wikipedia pages.

The information collected from this method are stored in a graph data structure in high dimensional spaceand similarity is calculated based on the closeness of these points in space. And to get relevant results, filterscan be applied on the items. For example to get a similar person for an actor we can apply a filter that givesus the closest matching persons alone ignoring all other information. Freebase contains data grouped intodifferent categories like people, movies, music, etc. and it is used to choose only the relevant information fromdeep walk output. Due to time constraints, we were unable to incorporate additional parameters from deepwalk embedding, though we expect that incorporating the data would provide better insights into award andmovie gross prediction.

3.6 Devany and Walls - Uncertainty in the Movie Industry

Devany and Walls in analyze the variance of industry movie grosses, and found that the mean of box-officerevenue is dominated by a few blockbuster movies and the probability distribution of box-office outcomeshas infinite variance [17]. The distribution of box-office revenues is a member of the class of probabilitydistributions known as Levy stable distribution (is a continuous probability distribution for a non-negativerandom variable).

The data included 2,015 movies from the years 1984 to 1996. Information on each movie’s were obtainedfrom ACNielson EDI Inc’s historical database, as opposed to IMDB or other web sources. The box-officerevenue data included weekly and weekend box-office revenues for the United States and Canada .

Some of the variables like sequels, genres, ratings, stars, budgets, and opening screens were consideredas potentially important. Also considered is the star power which can potentially move the movie box-officerevenue probability distribution toward more favorable outcomes.

The authors found that motion picture runs are highly variable. As an example, the movie Waterworld,a highly promoted film (starring Kevin Costner) with a production budget of $175 million: it opened on over2000 screens and had fallen to 500 screens by the tenth week . Compared with with HomeAlone, a film witha much smaller budget and which featured no stars: it opened on just over 1000 screens and grew to peak atover 2000 screens in the eighth week before starting a slow decline. The box-office gross for Home Alone wasnearly four times as large as the box-office gross for Waterworld.

Pareto rank law as applied: SRB2 = (B1), where S is the size of box-office revenues, R is the rank(1=highest), and B1 and B2 are parameters. The exponent B2 is an indication of the degree of concentrationof revenues on movies because it indicates the relative frequency of large grossing movies to small grossingmovies. The Pareto rank law can be written as:


logRevenue = logβ1 + β2logRank + β3Star + β4[logRank × Star] + µ

About 65− 70 percent of all motion pictures earn their maximum box-office revenue in the first week ofrelease; the exceptions are those that gain positive word-of-mouth and enjoy long runs. In week 1 the meanand median screen counts are 927 and 994, respectively. By week 10, the median screen count is 264 whilethe mean screen count is 409. About 35, 19, and 12 percent of all box-office revenues are earned in the first,second, and third weeks of a films release, respectively. Nearly 85 percent of all films open on the first day ofthe weekend, Friday. About 13 percent open on Wednesday, 1 percent on Thursday, and less than 1 percentopening on the remaining days combined. The mean revenue in the sample was $17 million and this wasmuch larger than the median of $6.9 million.

The probability distributions of movie box-office revenues and profits are characterized by heavy tails andinfinite variance . Past success does not predict future success because a movies box-office possibilities areLevy-distributed. Forecasts of expected revenues are meaningless because the possibilities do not convergeon a mean; they diverge over the entire outcome space with an infinite variance.

4 Data sets

IMDB: The Internet Movie Database (abbreviated IMDb) is an online database of information related tofilms, television programs, and video games, taking in actors, production crew, fictional characters, biogra-phies, plot summaries, and trivia.

The data lists we took advantage of in the IMDb are as follows: Actors, Actresses, Business, Cinematog-raphers, Countries, Directors, Editors, Genres, Keywords, MPAA Ratings Reasons, Movies, Producers, Pro-duction Companies, Ratings, Release Dates, Running Times, and Writer [7].

IMDb does not provide an API for automated queries. However, most of the data can be downloadedas compressed plain text files and the information can be extracted using the command-line interface toolsprovided. A Python package called IMDbPY can also be used to process the compressed plain text files into anumber of different SQL databases, enabling easier access to the entire dataset for searching or data mining.

IMDbPY: IMDbPY is a Python package useful to retrieve and manage the data of the IMDb moviedatabase. IMDbPY is mainly a tool intended for programmers and developers, but some example scriptswere included. We can search for a movie with a given title, a person with a given name, a character in amovie or a company, and retrieve information for a given movie, person, character or company; the supporteddata access systems are ’http’ (i.e.: the data are fetched through the IMDb’s web server http://akas.imdb.com)and ’sql’, meaning that the data are taken from a SQL database, populated (using the imdbpy2sql.py script)with data taken from the plain text data files. [8]

New Sources of Data In our background report presentation, we reported planning on using the followingdata sources: IMDb, AggData, and Wikipedia. IMDb provides a downloadable dataset that we planned onrelying on, and the initial assumption was that the dataset was complete. Upon further inspection we foundthat the dataset was missing some key parameters. In particular, the dataset did not include the IMDb IDfor the movies and actors, IMDb rating, Metascore, and most importantly, data on awards for people andmovies. Not having these key parameters introduced additional challenges to overcome. One of our initialgoals was to include the IMDb rating, Metascore, and awards data in our movie gross and award prediction.To resolve not having the data, as well as to get more data, we resorted to scraping techniques on IMDb andadditional sources.

In order to get more data, we have scraped from four sources: RottenTomatoes, MetaCritic, BoxOffice-Mojo, as well as IMDb [12][13][14][15]. Scraping data produced additional challenges for our project. Theprimary challenges revolved around data reconciliation, as well as throttling as a result of the sheer amountof data we attempted to retrieve. Data reconciliation is a problem because each website refers to a particularmovie or actor differently. Our challenge was to match data correctly to each movie or person. An example of


this problem would be the movie titled “Titanic”. The title “Titanic” was in fact used for a movie made in1996 as well as in 1997 (the 1997 “Titanic” was the one that featured Leonardo DiCaprio and Kate Winslet)[15]. By querying the sources by name, further reconcile the different movies by also comparing the yearthat the movie was released. The second challenge was handling the sheer amount of queries that we neededto make. Web servers frequently drop clients that have high request rates, requiring us to throttle our webrequests.

From RottenTomatoes we retrieved average critic ratings as well as average user ratings. From Metacriticwe retrieved a movie’s Metascore, a proprietary weighted average of critic ratings. BoxOfficeMojo providesthe number of theaters in which a movie was shown. From scraping IMDb we were able to retrieve theIMDb rating, another weighted average score based on user ratings, and most importantly detailed awardsinformation. The awards data retrieved for each movie includes several awards including the Golden Globes,Academy Awards, BAFTAS, Director’s Guild Awards, Writer’s Guild Awards, and People’s Choice Awardsamong others. The awards for the actors, actresses, directors, and screenwriters are sorted by award and yearearned.

– RottenTomatoes: Average Critic Rating, Average User Rating

– Metacritic.com: Metascore

– BoxOfficeMojo.com: Number of Theaters

– IMDb.com (scraped): IMDb Rating, award information, etc.

Golden Globe Award Prediction Data Composition Our data for people (actors, actresses, writers,directors) contains a combined 1.4 million actors, actresses, writers, and directors in our database. We reducedthe number of movies by requiring the length of the movie to be greater than 70 minutes (based on theAcademy Award definition of a feature length movie), and with data that shows either the budget or grossbeing greater than $100,000. We made the elimination based on gross being greater than $100,000 based onthe observation that for all the Golden Globes won since 2000, none of the movies grossed less than $100,000.It was also noted that the paper by Apte, Forsell and Sidhwa made the same assumption. Based on thissubset of movies, we then selected the actors and actresses who were in the first four billed positions, aswell as the writers and directors. By taking only the actors and actresses in the first four billed positions wefurther reduced the number of people to scrape data for. The assumption behind this is that the importantlead and supporting roles are included in the first four billing positions. With this, we produce 15 years ofcelebrity data, and separated the data by year. As an example, for the year 2014, we gathered data on 879actors from a total number of 91760, 494 actresses from a total number of 51678, 698 directors from a totalnumber of 9642 and 1062 writers from a total number of 12950.

5 Observations

In this section we will discuss some interesting observations found after compiling the entire dataset. We willhave separate observation results for Movie Gross and Golden Globe.

5.1 Movie Gross Prediction Data Observations

In our advanced models, we split movies into two categories based on their gross. We classified movies aslow grossing if their gross was less than $69,150. Movies grossing higher than $69,150 were classified as highgrossing movies.


Fig. 1. Correlation Matrix: High grossing movie

Fig. 2. Correlation Matrix: Low grossing movie


We can see from the correlation matrix in figure 1 for high grossing movies that the gross is highly cor-related with budget, the total number of past award nominations and past award wins accrued by the cast.We also see that movies with adventure, fantasy and action genre are highly positively correlated with grosscompared to other genres. Where as movies with documentary genre also have a strong negative correlation of–0.30. Another interesting observation is that movies with horror and comedy have high negative correlationwith ratings (critic score, IMDb rating, audience score).

From correlation matrix in figure 2 for low grossing movies we can see that we do not have very goodcorrelation coefficients with gross. Unlike high grossing movies, the gross of low grossing movies do no de-pend on genre of the movie. Another interesting observation was that for low grossing movies, documentarygenre is strongly positively correlated with ratings where as thriller and comedy is strongly negatively corre-lated with ratings. It is apparent from this correlation matrix that we do not have strong categorical variablesto make a prediction when compared to the high grossing movies.


5.2 Golden Globe Prediction Data Observations

We show the correlation matrix for all the features extracted with respect to the class variable Next GoldenNominee in the table 1 for actor, actress, directory and writer categories.

Actor Actress Director Writer Parameter Infomation

next golden nominate 1 1 1 1 Binary class variable for nomination for nextyear’s Golden Globes

pre golden win num 0.222 0.222 0.198 0.088 # Golden Globes won and nominated

pre golden nominate num 0.243 0.277 0.246 0.178 # Golden Globes nominated

pre golden total 0.244 0.277 0.246 0.178 # Golden Globes won and nominated

pre oscars win num 0.158 0.186 0.141 0.161 # Academy Awards won

pre oscars nominate num 0.214 0.263 0.259 0.155 # Academy Awards nominated

pre oscars total 0.215 0.263 0.259 0.156 # Academy Awards won and nominated

pre actor win num 0.161 – – – # Screen Actor Guild Awards won

pre actor nominate num 0.231 – – – # Screen Actor Guild Awards nominated

pre actor total 0.232 – – – # Screen Actors Guild Awards won and nomi-nated

pre bafta win num 0.129 0.152 -0.001 0.104 # BAFTA Film Awards won

pre bafta nominate num 0.185 0.246 -0.001 0.133 # BAFTA Film Awards nominated

pre bafta total 0.185 0.245 -0.001 0.133 # BAFTA Film Awards won and nominated

pre independent win num 0.084 – 0.231 0.173 # Independent Spirit Awards won

pre independent nominate num 0.078 – 0.168 0.139 # Independent Spirit Awards nominated

pre independent total 0.079 – 0.170 0.140 # Independent Spirit Awards won and nominated

pre critic win num 0.138 0.218 0.213 0.067 # Critic Choice Awards won

pre critic nominate num 0.188 0.238 0.305 0.144 # Critic Choice Awards nominated

pre critic total 0.188 0.239 0.305 0.144 # Critic Choice Awards won and nominated

pre mtv win num 0.133 -0.005 – – # MTV Movie Awards won

pre mtv nominate num 0.125 -0.006 – – # MTV Movie Awards nominated

pre mtv total 0.125 -0.006 – – # MTV Movie Awards won and nominated

pre wga win num – – – 0.153 # Writer Guild Awards won

pre wga nominate num – – – 0.139 # Writer Guild Awards nominated

pre wga total – – – 0.140 # Writer Guild Awards won and nominated

pre dga win num – – 0.130 – # Director Guild Awards won

pre dga nominate num – – 0.207 – # Director Guild Awards nominated

pre dga total – – 0.206 – # Director Guild Awards won and nominated

pre win total 0.246 0.273 0.209 0.169 total number of awards won

pre nominate total 0.242 0.284 0.238 0.162 total number of awards nomiated

pre total 0.243 0.284 0.238 0.162 total number of awards won and nominated

break years -0.225 -0.269 -0.232 -0.158 # years bewteen his first movie and first award

experience 0.098 0.124 0.067 0.049 # years from his first movie to calculated year

max budget 0.135 0.152 0.122 0.064 max budget of his movies released in calculatedyear

max gross 0.115 0.142 0.157 0.068 max gross of his movies released in calculated year

max meta score 0.151 0.213 0.189 0.160 max metascore of his movies released in calculatedyear

max imdb rating 0.141 0.181 0.135 0.120 max imdbrating of his movies released in calcu-lated year

Table 1. Correlation Matrix for next Golden Globe Award Nominee and variables

We can see that we have a high correlation with the total number of past awards with respect to thenext nominee class variable. This feature is considered in our baseline model. Following were some of theinteresting observations we found:


1. The previous cumulative sum of Golden Globe nominations and Golden Globe wins are highly positivelycorrelated with the class variable for actor and actress where as for director and writer it is not correlated.

2. The number of break years, that is, the difference between the first movie year the actor was casted andthe present year of the instance object, is highly negatively correlated with the class variable for actor,actress and director. From this we can infer that actors/actress/directors who have a lower value forbreak years have a higher chance of getting nominated for Golden Globes.

3. Some of the interesting awards which are also highly correlated with the class variable are “Total numberof Oscar Wins/Nominations”, and “Total number of Screen Actor Guild Awards”.

4. Another interesting observation is that the class variable is not correlated with the experience parameter(number of years the person is in the film industry - industry years).

5. Another interesting observation is that the “Director Guild Awards” and the “Critic Choice Awards”have a high correlation with the class variable for Director category.

.

6 Baseline model and initial prediction

6.1 Movie Gross - Baseline Model and Linear Regression

Baseline 1: Average by Genre for Movie Gross The genre of the movie is one of the important pa-rameters for predicting the movie gross based on previous research. The IMDB dataset defines 21 genres andBox Office Mojo has these genres into sub categories. Some of the genres like “Family” and ”Action” arepositively correlated with gross whereas genres like “Documentry” are negatively correlated. For our baselinemodel, we considered the average movie gross of each genre as the gross for movies belonging to that genre.

In this simple baseline model, we considered 21 genres and averaged together all of the film’s movie grossesof each genre. To visually compare this, a bar chart is shown below with the average gross for each genre.

Fig. 3. Barchart of average gross per genre, from 2000-2014 in order of average gross


From the barchart in figure 3, we observe that movies which are sci-fi, family, adventure, fantasy, anima-tion, and action have a higher average gross. Since our baseline makes a simple prediction for a movies grossbased solely on its genre, we expect large errors for situations where an unpopular movie is assigned a largegross simply because it fits under a certain genre.

Baseline 2: Linear regression for Movie Gross We evaluated the correlations between differentparameters and the gross using Pearson correlations.Applying Pearson correlation on the parameters we con-sidered for the movie gross, we found that our highly anticipated parameters like critic rating, audience score,meta score and IMDb rating have a poor correlation with the movie gross. The highly correlated parametersare budget of the movie and number of theaters which is not surprising. Interestingly, we also found that thelength of the movie and whether it is a sequel has high correlations. Also the total number of awards for themovie cast has a moderate correlation of 0.28.

We fit a linear regression using a single parameter to our data to predict the movie gross. In this model,we do not consider categorical parameters like the genre of the movie or the rating of the movie. We then fita linear regression based on one parameter - Budget, which is an important parameter to predict the moviegross and table 2 shows the results.

We observe a R-square value of 0.5597 for a single parameter and in the next model we included additionalindependent variables based on the correlation table. However when we observed the results obtained fromthe evaluation model, the “Total past awards” had a very high P-value, and some other parameters had verylow P-value. We take P-value of 0.05 as the threshold. The R-square value also did not increase dramatically.

Thus we revisited all the parameters we can use from our dataset, and we observed that “audience score”and “critic score” had a very low P–value which can be considered as an important parameter for fitting alinear regression.

Models Relative Absolute Error (%) Root Mean Square Error ($)

Baseline Model 223.13 49,422,985

Linear Regression (Budget as feature) 78.03 41,347,705

Table 2. Comparison of relative absolute error and root mean square error between baseline model (by genre) andfirst linear regression model (budget as the only feature)

When we compare the results obtained from our linear regression model using only budget to the baselinemodel, we observe that the relative absolute error is largely reduced as shown in table 2. Besides, most of theresults of baseline model and first linear regression model are overestimated. As our baseline model performspoor compared to our linear regression model, in the later analysis, we compare our advanced models withthe linear regression model.


Fig. 4. Comparison of error distributions between the baseline model (by genre) and first linear regression model(budget as the only feature) - the error distribution improves with the linear regression model.

As the error is still high with a relative absolute error of 78.03%, for advanced models we split the moviesinto high and low grossing groups. We consider the lower quartile of the gross as low grossing movies. Wevisualise the error by splitting the movies into two groups, the error distribution based on the low grossingmovies and high grossing movies for our advanced models.

Fig. 5. Linear Regression Movie Gross Error Distribution - the error distribution is split by low grossing and highgrossing movies (Relative Percent Error% = (Predicted Value – Actual Value)/(Actual Value))

In figure 5, the relative percent error shows whether the predicted values by the model are over-estimatedor under-estimated compared to the actual gross value. From this error distribution histogram, it is appar-ent that the distributions are different from each other. The error distributions are both shifted slightly tothe left of zero, indicating that our model underestimates gross. However for low-grossing movies the errordistribution does not have as large of a tail as with the high grossing movie model error distribution.


6.2 Golden Globes Baseline Model

For the Golden globes, as we did not have the data available in the database, we scraped the IMDB websiteand obtained the awards data associated with each person based on their role: actor, actress, director andscreenplay writer. As we must not consider, the awards won by the cast after the year of the training year,we obtained a schema which considers the year of the training data into account. We also extracted fewadditional features for our advanced model, such as the total number of industry years (experience) and thenumber of years he/she took in order to win his/her award nomination (break year).

Baseline Model Our baseline model for predicting the Golden Globe Awards is to calculate the totalnumber of past awards nominations or wins by the person before that particular year. The celebrity who hasthe highest number of past awards has a higher probability of winning the next Golden Globe award. We areconsidering only selection of awards, such as the past Golden Globes, the Academy Awards, BAFTA Awards,MTV Movie Awards, Screen Actors Guild Awards, Directors Guild Award, Independent Spirit Award andthe Critics Choice Award.

Our baseline model produced a prediction from the person’s past awards. A probability distribution isgenerated and sorted to determine the nominations for the four categories.

Table 3 shows the favorability of nominations for 72nd Golden Globe Awards in January 2015 for eachcategory separately.

Best Actor Best Actress Best Director Best Writer

Duvall, Robert Streep, Meryl Eastwood, Clint Allen, Woody

Crowe, Russell MacLaine, Shirley Fincher, David Nolan, Christopher

Hoffman, Philip Seymour Blanchett, Cate Allen, Woody Goldsman, Akiva

Kingsley, Ben Fonda, Jane Murro, Noam McCarthy, Thomas

Washington, Denzel Lange, Jessica Nolan, Christopher Anderson, Wes

Cruise, Tom Keaton, Diane Reiner, Rob Haggis, Paul

Bridges, Jeff Mirren, Helen Aronofsky, Darren Brickman, Marshall

Pitt, Brad Smith, Maggie Clooney, George Linklater, Richard

Harris, Ed Lawrence, Jennifer James, Steve Mamet, David

Clooney, George Bening, Annette Carter, Thomas McQuarrie, Christopher

Table 3. Predictions from baseline model for Golden Globes 2015

7 Development/Evaluation Environment

For development and evaluation, we consider data from 2000 – 2014. This data is divided into 3 buckets,where each block consists of 5 years. For each bucket, the first 3 years is taken to be the training dataset, thenext consecutive year is taken to be the testing data set and the year after that is taken to be the evaluationyear. We end up with:

Training: {2000, 2001, 2002}, {2005, 2006, 2007}, {2010, 2011, 2012}Testing: 2003, 2008, 2013Evaluation: 2004, 2009, 2014

For movie gross prediction, in the model of dividing movies into high-grossing dataset and low-grossingdateset, we utilize weighted F-Measure, ROC, Root Relative Squared Error to evaluate its performance. Inlinear regression models and advanced models, we take advantage of Root Relative Squared Error and RootMean Squared Error to evaluate performance.

For Golden Globes prediction, we use the result buffer obtained from Weka classifiers. This result bufferprovides us with the summary statistics of the classification result. We have the F measure, confusion matrix,relative error and other summary statistics. We also get the predicted value of the golden globe award nominee


for the next year which is used in order to predict the golden globe award nominee and winner for 2015 fordifferent categories.

8 Advanced Model

8.1 Movie Gross

For our advanced model, we realized that the number of theaters was a parameter that is not available untilseveral days after the movie is released.

Fig. 6. This flow chart represents our workflow to predict the movie gross. First the movie is classified as either ahigh or low grossing movie using a J48 decision tree model. Movies classified as high grossing are separated where aM5 pruned tree model is applied. For low grossing movies we apply a M5 rule based model.

In evaluating advanced models for the movie gross prediction, we wanted to first further enhance ourdataset by adding new features. By leveraging our data scraped from the awards page for all of our ac-tors, actresses, directors, and writers, we were able to add two parameters: “total cast award win” and “to-tal cast award nominate”. The two parameters are sums of the number of award nominations or award winsfor the actors, actresses, screenwriters and directors in the movie. Additionally, we incorporated categoricalvariables into our prediction, such as the genre of the movie, and whether or not the movie was a sequel.

Based on our observations during our exploration of the data and baseline model, we determined thatseparate models should be applied for high grossing and low grossing movies. From previous literature,modeling techniques were susceptible to having good performance in predicting high grossing movies withpoor performance with low grossing movies. To address this, we trained a classifier to separate high grossingmovies from low grossing movies. First, we trained the classifier using our training sets where movies weredesignated as high or low grossing based on the lowest grossing quartile. Movies which grossed less than$69,150 were tagged as low grossing movies, and movies which grossed greater than that were tagged as highgrossing movies. The flow chart diagram in figure 6 illustrates the split model process.

Using our training dataset, we tested several classification models and selected a J48 decision tree toclassify which movies were high and low grossing. We also tested: logistic regression, naive bayes, JRIP, andk-NN. We compared model output statistics such as the f-measure, ROC and root relative squared error.Based on the summary statistics from figure 4 we selected the J48 decision tree primarily because of itsperformance within the three statistics we compared. We also tested different k-NN models by varying thenumber of nearest neighbors, NN=1 had the best performance and is listed in the table.


Classifiers F Measure ROC Root relative squared error (%)

Logistic 0.74 0.79 88.87

Naive Bayes 0.75 0.81 97.39

J48 Decision Tree 0.76 0.83 86.47

JRip Ripper 0.76 0.66 92.88

K- Nearest Neighbor 0.69 0.69 116.29

Table 4. Error Statistics - Classification of high and low grossing movies

The J48 model is an implementation of Quinlan’s C4.5 decision tree statistical classification model [26][25].Essentially, a decision tree classifies data as much as possible based on the provided attributes from thedataset. It first takes the attribute that discriminates the data the best and branches from that discrimination.From there it continues to choose the best discriminating attribute to build a decision tree that will classifyfor a target feature [24]. We provide the model with a total of 34 attributes, where 21 of the attributes arebinary columns for the 21 genres possible for the movies. The output tree is not shown because of spaceconstraints, as there are 306 leaves produced from the model. Interestingly, the most differentiating attributefor the movies was “total cast award nominate”. Following that was the sequel attribute, a binary value forwhether or not the movie was a sequel.

Models (High Grossing) Classifiers Relative AbsoluteError (%)

Root Mean Square Error($)

Linear Regression (Budget as feature) Linear Regression 76.82 47,392,684

Advanced Model

Linear Regression 69.28 41,088,911K- Nearest Neighbour 78.48 47,171,766M5P Tree 51.94 35,517,512M5 Rule 55.35 38,101,864

Table 5. Comparison of classifiers for high-grossing model

After our classification step that splits the dataset into a high grossing movie dataset and a low grossingmovie dataset, we tested models on each dataset and compared their performance. For the high grossingmovies, we selected the M5P tree model based on the relative absolute error (51.94%), as well as root meansquared error ($35,517,512). From table 5, we see that the M5 rule based model and linear regression modelswere slightly poorer in performance. Additionally, we can see a vast improvement in relative absolute errorpercentage (now 51.9%) from the baseline model in table 2, where the baseline relative absolute error was223%, and 78% for the one feature linear regression model. The M5P tree is a combination of a pruned treewith linear regression functions from Wang [27]. Each leaf results in a different linear regression model. Onepowerful component of this model is that performance is robust to missing values [27].

From inspection of the M5 pruned model tree (displayed in figure 7), it is interesting to observe thatthe budget is the attribute that differentiates the movies the most. Additional parameters used are “audi-ence score”, which was a social media parameter based on audience ratings of the movie scraped from IMDB,“critic score”, and the genre. The “total cast award nominate” parameter mentioned previously is also adeciding parameter in the tree.


Fig. 7. The M5P tree produced for predicting the gross for movies classified as high-grossing. The LM# refers todifferent linear regression models - 13 different linear regression models were produced.

By looking at the error distribution for the M5P tree model for high grossing movies in figure 8, it seemsthat the error is centered around zero. With 59% of the predictions having positive errors, we can see thatthe model is slightly over predicting the movie gross for high grossing movies. It is apparent that the errordistribution is quite wide, given the x-axis is to the power of 8. From inspection of the data, it is also apparentthat the errors are compounded when the classification model does not correctly classify the movies as highor low grossing. If a low grossing movie is thrown into the high grossing model, the error is very high forthose movies.


Fig. 8. High grossing movie error distributions: linear regression model and advanced model

For the low grossing movies, we again compared different models and found that overall performance for allthe models was significantly worse than with the high grossing movies. We expect this for a couple of reasons.The primary reason is that for low grossing movies, our data for the movies is often quite sparse. Potentiallyimportant parameters such as the budget and length of the movie may not be available. Additionally, for lowgross movies, attributes such as critic or user ratings may not exist because of the films lack of notoriety andpopularity. From previous analysis of our correlation matrix 7 in figure we also knew that the attributes hadcorrelation coefficients were very low. The highest correlation coefficients we had were for critic score (0.20),budget (0.2), audience score (0.19), and IMDB rating (0.18).

Models (Low Grossing) Classifiers Relative AbsoluteError (%)

Root Mean Square Error($)

Linear Regression (Budget as feature) Linear Regression 100 20,157

Advanced Model

Linear Regression 99.75 20,280K- Nearest Neighbour 101.29 20,202M5P Tree 99.29 20,532M5 Rule 98.08 20,158

Table 6. Comparison of classifiers for low-grossing model

Based on the comparison of the summary statistics for model performance on low grossing movies in table6 we chose the M5 rule model. All of the models had relatively low correlation coefficients compared to theirperformance with the high grossing movie dataset, where correlation coefficients were between 0.67 and 0.83.With the low grossing dataset the M5 rule model had the best correlation coefficient of 0.16, with relativeabsolute error and root relative squared error at 98.08% and 100% respectively.


Fig. 9. The M5 Rule output for predicting the gross for movies classified as low-grossing. The two linear regressionequations are also shown.

It is interesting to note that the output of the M5 Rule model in figure 9 only produced two rules. basedon the criteria of whether or not the critic score was less than 0.575 and whether or not the cast had anyaward wins. If the critic score was greater than 0.575 and had award wins, the audience score had the highestcoefficient in the linear regression by a large margin. For the rule if the critic score was less than 0.575 andhad no award wins, the highest coefficient was audience score.


Fig. 10. Low grossing movie error distributions: linear regression model and advanced model

In comparison to the high grossing model error distribution histogram in figure 10, it is clear that theerror is not as straightforward to interpret. It seems that a portion of the error distribution sits around$10,000 however the distribution is not very symmetric. It is also apparent from inspection of the data thatthe classification model did not incorrectly classify any high grossing movies as low grossing movies.

Until this point, our models and predictions have focused on predicting the total domestic gross of themovie. The total domestic gross is the amount the movie grossed from all theater ticket sales, for all thedays it was shown. We adopt this method because of the lack of daily gross data. In order to predict howmuch the movie will gross specifically on Christmas Day we applied a relatively simple technique to extractthis from our model. Our approach was to determine what percentage the Christmas Day gross was of thedomestic lifetime gross of a movie. We went through BoxOfficeMojo and looked at all movies from 2002 to2013 that were released up to 7 days before December 25 and recorded their Christmas Day gross as well astheir domestic total gross. BoxOfficeMojo only has daily gross data from 2002 to 2013. From the 125 moviesfrom which the data was available, we found that that on average (median), the gross on Christmas Dayrepresents 5.99% of the total domestic gross. We simply apply this percentage to our predicted total domesticgross to establish our Christmas Day gross prediction.

Discussion of the advanced model techniques for the Golden Globe prediction follows this section. Afterthe model discussion is complete, we discuss the output of our final prediction and offer our concludingremarks.

8.2 Golden Globes

Baseline Model In the initial stage we build a baseline model which considered the total number ofsignificant award nominations or wins by a celebrity prior to the year of consideration. We considered thissingle parameter to predict the golden globe award nominations for the next year and obtained a similarmodel for each category - actor, actress, director and screenplay.

Award History of a Celebrity In this model, we evaluate each award separately. We analyse the predictivepower of each individual award considering them as individual features in this model. We have the cumulativesum of the number nominations and wins for each celebrity for each award separately prior to the year ofconsideration. We used different classifiers to evaluate the model and plotted the error graphs.

Personal History + Movie History As an enhancement to the model, we extracted the year of the firstmovie the celebrity was casted and the year of the first significant award he/she was nominated or won. Wethen calculated the difference between the years and termed this new feature as “Break Years”. This gives us


the number of years the celebrity took to achieve his/her first nomination. We also calculated the “industryyears” by considering the number of years after the celebrity was casted in his first movie. As a secondenhancement to the model, we then combined some of the movie features from our movie gross project. Weobtained the gross, budget, the imdb rating and the metacritic score associated with the movie the celebritywas casted in that particular year. We then designed our final model by considering all the features aboveand evaluated our model using different classifiers.

Resampling with logistic regression and re-run the model We observed that the results were affectedby the skewness of the data. We had a large number of celebrities in our dataset who were not nominatedcompared to the celebrities who were nominated at least once. We used a simple logistic regression to reduceour dataset to a non-skew dataset for obtaining better results. For our new set, we first considered the actualnominated celebrities and to this set we add the celebrities who were predicted to be nominated by our simplelogistic regression classifier. In order to bolster the model we added celebrities who have high probability ofbeing nominated. After this step of sampling we rerun our model using this new dataset. So we have twomodels, Model 1 which was built on the entire dataset and Model 2 was built on the sampled dataset.

Evaluation and Results In table 7 we evaluate the nominated f-measure error statistics from severalclassifiers to determine which classifier to use for each award. The statistics are evaluated both for the Model1 dataset as well as the Model 2 dataset to ensure that changing the dataset caused the models to change interms of relative performance. Based on the results we found that the classifiers still ranked the same relativeto each other given the different datasets. For the actor category, the K-NN (k=7) performed the best by alarge margin for the Model 1 dataset, and slightly better than the other classifiers in the Model 2 dataset. Inthe actress category, the Naive-Bayes classifier had the best performance. The Director and Writer categoriesboth show the ripper classifier as the best performing classifier.

It is also apparent that the performance across all of the classifiers and all award categories is better withthe Model 2 sampled dataset.

Nominated F-Measure

Models Classifiers Actors Actress Director Writers

Naive Bayes 0.2 0.35 0.18 0.17Logistic 0.09 0.24 0.24 0.08K- Nearest Neighbor 0.48 0.23 0.11 0.14Decision Tree 0.14 0.24 0.32 0.09

MODEL 1

Ripper 0.22 0.21 0.55 0.36

Naive Bayes 0.41 0.49 0.21 0.41Logistic 0.4 0.48 0.33 0.25K- Nearest Neighbor 0.68 0.5 0.38 0.29Decision Tree 0.55 0.47 0.58 0.48

MODEL 2

Ripper 0.49 0.51 0.67 0.62

Table 7. Golden Globe model error analysis for classifier selection. The highlighted cells reflect the best f-measurefor the specific model and category, from which the classifier is selected.

9 Final Prediction and Conclusions

9.1 Movie Gross Final Predictions

Our final movie gross prediction is in the included table below. Overall we are satisfied with our final predictionbased on our time and resource constraints. We have a group of four movies that were classified as highgrossing, and are expected to have a domestic lifetime gross of at least $200 million dollars with ChristmasDay gross of at least $13 million dollars. The two highest grossing movies, “Into the Woods”, and “Unbroken”


are expected to have a domestic lifetime gross of $483 million and $437 million respectively. Based on ourexploration of the data, these predictions seem to make sense. “Into the Woods” has high grossing genres,including the Comedy and Family genres. Additionally, it includes big names for actors, actresses, directorsand writers, such as Meryl Streep (161 award wins, 235 nominations), Emily Blunt (1 Golden Globe win),and director Rob Marshall (1 Oscar win, 13 additional wins, 25 nominations). “Unbroken”, an action/dramafilm, features four time Oscar winning writers the Coen brothers, and is directed by Oscar winning AngelinaJolie. Low grossing films typically have sparse information. It seems that the fact that the movie does nothave very much information available intuitively indicates that the film may not gross a high amount.

Movie Predicted Gross($) Predicted Christmas Gross ($)

Into the Woods 483,991,484 28,969,025

Unbroken 437,138,342 26,164,658

The Interview 269,823,384 16,150,120

Big Eyes 219,996,652 13,167,770

Crimes and Mister Meanors 3,963,342 237,224

Cut! 3,864,054 231,281

Zordon of Eltar 3,245,611 194,264

AgAu 2,634,107 157,663

Cotton 2,373,462 142,062

Gospel Adventures 2,328,042 139,344

A Glimpse of Gods Hand at Work 738,153 44,182

Political Animals 19,924 1,193

Manifest Ecstasy 19,924 1,193

In Search of the Kushtaka 19,924 1,193

Breast Cancer Boy 19,924 1,193

Lady Windermere’s Fan 17,507 1,048

I Before Thee 14,960 895

The Bandit Hound 14,495 868

Misirlou 12,910 773

Baker Hotel 12,463 746

Table 8. Our final movie gross prediction for the 20 movies released on Christmas Day. The Predicted Gross columnis our predicted lifetime gross of the movie. The Predicted Christmas Gross column is our prediction of how much themovie will gross on Christmas Day (simply 5.99% of the Predicted Gross).

From inspecting our results, it is apparent that several films appear to have been misclassified as high-grossing films. “Into the Woods”, “Unbroken”, “The Interview”, and “Big Eyes” are expected to be big hitsbecause of their all-star cast and are classified correctly as high-grossing. However all of the films where thepredicted gross is in the range of two to three million dollars were misclassified as high grossing. This is dueto the sparse amount of data available for the movies, while information that does exist force the movie tobe classified as high-grossing. While these movies were placed in the high grossing model, the predicted grossis among the lowest grosses that the model predicts. Movies like “Breast Cancer Boy”, “Political Animals”,and “Manifest Ecstasy” do appear to make sense as low-grossing movies, as they are documentaries. For verylow grossing movies, it becomes more difficult to predict since it is possible that the screening of the moviemay be limited to one or less theaters that may or may not report their gross data to IMDb.

Total Lifetime Gross vs Christmas Day Gross It is important to note that our modeling has beenfocused on predicting the total lifetime gross of a movie. This is because of the relatively small amount ofdata available for the opening day gross of movies released on Christmas Day. As a simple approach, wecollected the Christmas Day gross and lifetime gross of movies released within one week of Christmas Daybetween 2002 and 2013 (the only year range available). From the data we calculated an averaged percentagethe Christmas Day gross was of the total lifetime gross, which was 5.99% and applied it to our total lifetimegross prediction.


9.2 Golden Globes Final Predictions

Actor Nominations Our final predictions for Golden Globes 2015 for different categories are as shownbelow. As we are predicting awards for Best Actor in a Motion Picture, Best Actor in a Musical or Comedy,as well as Best Supporting actor, we list the top fifteen nominations from our model output based on theirnomination probability, as each category receives five nominations. Following are the predictions of Model-1and Model-2.

Actor Model-1 Prediction Model-2 Prediction Actual Outcome Award Name

Murray, Bill Nominated Nominated Nominated Best Actor in a Motion Pic-ture, Musical or Comedy

Cranston, Bryan Nominated Nominated – –

Downey Jr. , Robert – Nominated

Douglas, Michael Nominated Nominated

Redford, Robert Nominated –

Sheen, Martin Nominated Nominated

Gervais, Ricky Nominated –

Gordon-Levitt, Joseph Nominated –

Kinnear, Greg Nominated –

Lithgow, John Nominated Nominated

Blevins, Ronnie Gene Nominated –

McNairy, Scoot Nominated –

Damon, Matt Nominated –

Bentley, Wes Nominated –

Pitt, Brad Nominated Nominated

Hardy, Tom Nominated –

Forte, Will – Nominated

Pacino, Al – Nominated

Luna, Diego – Nominated

Jackman, Hugh – Nominated

Washington, Denzel – Nominated

Hamm, Jon – Nominated

Tatum, Channing – Nominated

Brendan Gleeson – Nominated

Table 9. Actor Nominations from Model 1 and Model 2, including the actual outcome

From inspection of table 9, we find similar performance between the two models. Both models 1 and 2correctly predicted that Bill Murray would be nominated, while all other predicted nominations are falsepositives. To understand why our models only produced one correct prediction, we looked at the actualnominations and discovered that the movies were released after our initial data collection date in earlyNovember. As a result of this, critical information for our model, such as the mount that it grossed, as wellas audience and critic ratings, were not in our database.

Actress Nominations From table 10, we see that a large number of actress were predicted to be nominatedfor Golden Globes 2015, out of which Model 1 correctly nominated 5 out of the 33 actresses. The total numberof actress we considered for nominations were 15 using Model 2, out of which 4 were correctly predicted.From the results of Model 1 and Model 2, we predict the winners as follows:

– Best Actress in a Motion Picture, Drama - Jennifer Aniston– Best Actress in a Motion Picture, Musical or Comedy - Helen Mirren– Best Supporting Actress in a Motion Picture - Meryl Streep

Both Models 1 and 2 agreed with nominating Helen Mirren and Meryl Streep, while Jennifer Aniston wasselected based on Model 2’s nomination because it predicted she had the highest likelihood of winning.


Actress Model-1 Prediction Model-2 Prediction Actual Outcome Award Name

Bates, Kathy Nominated –

Aniston, Jennifer – Nominated Nominated Best Actress in a MotionPicture, Drama

Bening, Annette Nominated –

Arquette, Patricia – Nominated Nominated Best Supporting Actress in aMotion Picture

Binoche, Juliette Nominated –

Blanchett, Cate Nominated Nominated

Blunt, Emily Nominated – Nominated Best Actress In A MotionPicture, Musical or Comedy

Chastain, Jessica Nominated – Best Supporting Actress in aMotion Picture

Connelly, Jennifer Nominated –

Byrne, Rose - Nominated

Cotillard, Marion Nominated –

Davis, Viola Nominated –

Dern, Laura Nominated –

Fanning, Dakota Nominated –

Farmiga, Vera Nominated –

Fey, Tina Nominated Nominated

Fonda, Jane Nominated Nominated

Johansson, Scarlett Nominated –

Jolie, Angelina Nominated –

Linklater, Lorelei – Nominated

Keaton, Diane Nominated –

Lange, Jessica Nominated –

Moss, Elisabeth – Nominated

Lawrence, Jennifer Nominated –

Leo, Melissa Nominated –

MacLaine, Shirley Nominated Nominated

McCarthy, Melissa Nominated –

Mirren, Helen Nominated Nominated Nominated Best Actress In A MotionPicture, Musical or Comedy

Mo’Nique Nominated –

Moore, Julianne Nominated – Nominated Best Actress In A MotionPicture, Musical or Comedyand Best Actress In A Mo-tion Picture, Drama

Wilson, Ruth –

Benoist, Melissa – Nominated

Sarandon, Susan Nominated –

Sedgwick, Kyra Nominated –

Steinfeld, Hailee – Nominated

Smith, Maggie Nominated –

Streep, Meryl Nominated Nominated Nominated Best Supporting Actress in aMotion Picture

Swank, Hilary Nominated –

Theron, Charlize Nominated –

Rapace, Noomi – Nominated

Watts, Naomi Nominated –

Witherspoon, Reese Nominated – Nominated Best Actress in a MotionPicture, Drama

Table 10. Actress Nomination Predictions from Model 1 and Model 2, including the actual outcome


Director Nominations From table 11, Model 1 produced 3 nominations, out of which 2 were actuallynominated. We are satisfied with this result because the model predicted very few nominations but witha higher precision value of 0.67 and recall value of 0.4. Model 2 predicted a single nomination but witha precision value of 1 and recall value of 0.2. It is worth noting that the model output does not have afixed number of nominations to produce, thus predicting a ”top five” nominations for best director was notpossible. According to Model 1 and Model 2, we predict the Golden Globe award 2015 for the Best Director- Motion Picture will be Wes Anderson.

Director Model-1 Prediction Model-2 Prediction Actual Outcome Award Name

Anderson, Wes Nominated Nominated Nominated Best Director - Motion Picture

Fincher, David Nominated – Nominated Best Director - Motion Picture

Gunn, James Nominated –

Table 11. Director Nominations

Screenwriter Nominations From table 12, the Model 1 predicted one correct nomination out of the twopredicted. We are very happy with the results obtained from Model-2 as it predicted 8+2 nominations(2nominations were book writers). Of the eight, three were actually nominated. Model 1 has a precision valueof 0.5 and a recall value of 0.2. The precision value of Model 2 is 0.375 with a higher recall value of 0.6. Fromthe results of Model 1 and Model 2, we predict Richard Linklater to be the winner of Golden Globe 2015 forthe Best Screenplay Motion Picture.

Writer Model-1 Prediction Model-2 Prediction Actual Outcome Award Name

Knight, Steven Nominated Nominated

Linklater, Richard Nominated Nominated Nominated Best Screenplay - Motion Picture

Anderson, Wes – Nominated Nominated Best Screenplay - Motion Picture

Flynn, Gillian – Nominated Nominated Best Screenplay - Motion Picture

Hageman, Kevin – Nominated

Phil Lord – Nominated

Christopher Miller – Nominated

Hageman, Dan – Nominated

Table 12. Screenwriter Nominations

9.3 Conclusion

To conclude, we have discussed our entire process from beginning to end of predicting the outcome of the72nd Golden Globe awards as well as the Christmas Day movie gross for movies released on ChristmasDay. During the course of the project we found that a significant amount of our time and effort was spentin gathering and cleaning data. Combining data from different sources presents additional challenges. Formovie gross prediction, we found that the budget, genre, critic and audience reviews, as well as the qualityof the cast, screenwriters, and directors (in terms of number of previous award nominations and wins) playa large role in predicting the gross of the movie. Separate models are needed to predict the gross of movies.For Golden Globes, we found out that the number of break years, the total number of past awards andthe total number of past Golden Globe nominations to be an important parameter for predicting the nextaward nomination/winner. The Director’s Guild and Writer’s Guild Awards for different categories also havea higher weightage in predicting the nomination/win. We also observed that number of years in the industrydid not correlate well with the nomination chances.

Additionally, we recognized that our models were sensitive to the amount of data available for each movie.Since more parameters for our models become available closer to the release of the movie, both our Golden


Globe and movie gross predictions would be enhanced by a constantly updating dataset. The nature ofprediction challenges involve constant fluctuations, additions, and changes to data and resulting parametersthat become fixed without re-scraping for content on a frequent basis.

Acknowledgments We would like to thank Professor Steven Skiena for guidance on modeling and dataanalysis techniques. Additionally we sould like to thank Paul St. Denis for his expertise with video editing,and Yanqing Chen for meeting with us to discuss his research.

10 Bibliography

References

1. “Members - Golden Globe Awards Official Website.” OFFICIAL WEBSITE OF THE GOLDEN GLOBEAWARDS. Hollywood Foreign Press Association, n.d. Web. 11 Oct. 2014. - http://www.hfpa.org/members/

2. “Road to the Oscars 2014.” The Internet Movie Database. IMDb.com, n.d. Web. 20 Oct. 2014. - http://www.

imdb.com/oscars/3. Pardoe, Iain, and Dean K. Simonton. “Applying discrete choice models to predict Academy Award winners.”

Journal of the Royal Statistical Society: Series A (Statistics in Society) 171.2 (2008): 375-394.4. Kaplan, David. “And the Oscar Goes to A Logistic Regression Model for Predicting Academy Award Results.”

Journal of Applied Economics and Policy 25.1 (2006): 23-41.5. Zhang, Wenbin, and Skiena, Steven. “Improving movie gross prediction through news analysis.” Proceedings of

the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01. IEEE Computer Society, 2009.

6. Simonoff, Jeffrey S., and Ilana R. Sparrow. ”Predicting movie grosses: Winners and losers, blockbusters and sleep-ers.” Chance 13.3 (2000): 15-24.

7. IMDB Dataset: http://www.imdb.com/interfaces/8. IMDBPy Python Library : http://imdbpy.sourceforge.net/docs/README.txt9. Aggdata Dataset: http://aggdata.com/awards/golden\_globes10. Aggdata Dataset: http://aggdata.com/free\_data\_awards\_locations/academy\_awards11. Hollywood Foreign Press Association (HFPA): http://www.hfpa.org/members/12. Rotten Tomatoes. RottenTomatoes.com, n.d. Web. 14 Nov. 2014. http://www.rottentomatoes.com13. “Metacritic - Movie Reviews, TV Reviews, Game Reviews, and Music Reviews.” Metacritic - Movie Reviews, TV

Reviews, Game Reviews, and Music Reviews. N.p., n.d. Web. 16 Nov. 2014. http://www.metacritic.com14. “Box Office Mojo” Box Office Mojo. N.p., n.d. Web. 16 Nov. 2014. http://boxofficemojo.com15. “IMDb”. IMDb.com, n.d. Web. 14 Nov. 2014. http://www.IMDb.com16. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “DeepWalk: Online Learning of Social Representations.” arXiv

preprint arXiv:1403.6652 (2014).17. De Vany, Arthur, and W. David Walls. ”Uncertainty in the movie industry: Does star power reduce the terror of

the box office?.” Journal of Cultural Economics 23.4 (1999): 285-318.18. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195–197

(1981)19. May, P., Ehrlich, H.C., Steinke, T.: ZIB Structure Prediction Pipeline: Composing a Complex Biological Workflow

through Web Services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1148–1158. Springer, Heidelberg (2006)

20. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, SanFrancisco (1999)

21. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid Information Services for Distributed ResourceSharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, pp. 181–184.IEEE Press, New York (2001)

22. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The Physiology of the Grid: an Open Grid Services Architecturefor Distributed Systems Integration. Technical report, Global Grid Forum (2002)

23. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov24. Pitchumani Angayarkanni, S., and Nadira Banu Kamal. “Data Mining Techniques for Efficient Detection of

Cancerous Masses in Mammogram.” International Journal of Advanced Research in Computer Science 3.1 (2012).25. Quinlan, John Ross. C4.5: programs for machine learning. Vol. 1. Morgan Kaufmann, 1993.26. ”J48.” J48 - Weka.classifiers.trees. Weka - Sourceforge, n.d. Web. 11 Dec. 2014. http://weka.sourceforge.net/

doc.dev/weka/classifiers/trees/J48.html27. Wang, Yong, and Ian H. Witten. “Inducing model trees for continuous classes.” Proceedings of the Ninth European

Conference on Machine Learning. 1997.

Final Report: Team 2 Predicting Golden Globe Awards ...€¦ · This power can also be used to...

Documents

Transcript of Final Report: Team 2 Predicting Golden Globe Awards ...€¦ · This power can also be used to...