
Integrated Movie Database

CSCI 586 Project Report

Muhammad Rizwan Saeed, Santhoshi Priyanka Gooty Agraharam, Ran Ao

University of Southern California

1 Background

The extensive growth of the Web and its associated technologies has focused research attention on the idea of the Semantic Web. The purpose of the Semantic Web is to give meaning to data on the web, so that it augments the capability of both humans and machines to understand and process information effectively [1]. In the domain of the Semantic Web, Ontologies (or vocabularies) define the concepts and relationships used to describe and represent an area of concern [2]. The purpose of creating Ontologies and integrating data is to organize heterogeneous data sources, which simplifies on-demand information access and enables complex analysis on the integrated knowledge base.

In our project, titled “Integrated Movie Database”, we apply the concepts of the Semantic Web and use Ontologies to integrate data coming from various sources and provide a unified view to the user. The integrated data becomes queryable and can be used to extract the information the user requires.

2 Problem Statement

Various websites provide different (and often disjoint) pieces of information related to movies. IMDB¹ (Internet Movie Database) is the most comprehensive online source of information for movies. It provides various attributes related to movies, such as title, genre, run time, casting details and awards, organized across multiple web pages. Similarly, BoxOfficeMojo² is another movie-related website, which mainly focuses on the financial aspects of movies. It keeps track of the domestic and worldwide earnings of movies on a daily, weekly and monthly basis. Other websites, such as Fandango³ and Google Movies⁴, provide lists of movies in local movie theaters and their showtimes. Some websites keep track of the critical reception of movies; for example, Rotten Tomatoes⁵ assigns a score to every movie based on the percentage of positive reviews published about it in notable (print and digital) publications. Because information about movies is distributed across multiple websites, users cannot get a unified view of all the relevant information pertaining to a movie in one place. For example, Google Movies provides only movie showtimes. If the user wants to select a movie to watch from the listed movies, they may require information such as critic and user reviews and gross earnings to make a decision. Since this information is not available on the Google Movies page, the user must browse other websites to get all relevant pieces of information. Similarly, analyzing the box office business of Oscar-winning movies requires accessing multiple web pages of different websites, which can be a time-consuming process for the end user.

Solving this problem of consolidating scattered pieces of relevant information is the goal of our project. We acquire information about movies from multiple web sources and create a framework that organizes and integrates the data and provides a SPARQL⁶ endpoint for querying movie-related data. We want to clarify that the goal of the project is not to create a mash-up application. Mash-up applications usually focus only on presenting different streams of information in different panels or frames of the same application window; those streams cannot be used to run integrated queries. For example, consider a mash-up application that shows information from RottenTomatoes and BoxOfficeMojo in separate panels in the same window. A user may be able to interact with one panel to filter movies with a critic score above 90% and with the other to filter movies that have grossed more than a billion dollars worldwide. However, a query for results that fulfill both criteria simultaneously requires both data streams to be integrated, which is the objective of our project.

1 http://www.imdb.com
2 http://www.boxofficemojo.com
3 http://www.fandango.com
4 http://www.google.com/movies
5 http://www.rottentomatoes.com

Fig. 1: Work Flow

3 Scope

In our project, we focus on data available from the following websites. The attributes crawled from each website are listed in Table 1:

1. Internet Movie Database (www.imdb.com)
2. Box Office Mojo (www.boxofficemojo.com)
3. Rotten Tomatoes (www.rottentomatoes.com)
4. Good Reads (www.goodreads.com)
5. Wikipedia (en.wikipedia.org)
6. Google Movies Cinema Page (crawled in real time based on user query; see Section 4.4 for details)

6 https://www.w3.org/TR/rdf-sparql-query/

The high-level schematic of the project is shown in Figure 1. In Section 4, we discuss in detail the phases of the project and the challenges faced in each phase. In Section 5, we present our conclusions and possible future work.

Website | Extracted Information
IMDB | Title, Release Date, Genre, MPAA Rating, IMDB User Rating, List of Cast Members (Actors/Actresses), Metacritic Score, List of Academy Awards (won), Director
RottenTomatoes | Title, Year, Critic Score (Tomatometer)
BoxOfficeMojo | Title, Release Date, Genre, Run time, Domestic Gross, Worldwide Gross, Budget
GoodReads | Book Title, Year, Author Name, Rating
Wikipedia | Movie links for IMDB and RottenTomatoes
Google Movies | Title, Year, IMDB Link, Showtimes

Table 1: Information extracted from websites

4 Approach

In this section, we discuss the different phases of the development and execution of our project. The project can be divided into four phases:

1. Data Acquisition
2. Data Modeling & Integration
3. Data Linking
4. Querying

4.1 Data Acquisition

Our primary focus is on extracting information from the data sources listed in Section 3. We created Java-based crawlers using the jsoup⁷ library, which provides an API for extracting and manipulating data from web pages. Because each website structures its pages differently, we created a separate crawler for each type of web page. The process and challenges of extracting data from the different web sources are discussed next.

7 http://jsoup.org/

Fig. 2: SPARQL Query to extract Wikipedia pages of Movies from DBpedia

IMDB and RottenTomatoes. For each crawler, we require a list of URLs to crawl. To generate such a list for each website, we started by extracting relevant information from DBpedia⁸. DBpedia is a crowd-sourced effort to publish structured data extracted from Wikipedia⁹, built on the principles of Linked Open Data (LOD) [3]. On DBpedia, movies are represented as instances of the classes dbo:Film¹⁰ and schema:Movie¹¹. Every DBpedia entity has a property foaf:isPrimaryTopicOf¹² that holds the link to the corresponding Wikipedia page. Using SPARQL queries (similar to the one shown in Figure 2), we created a list of Wikipedia pages related to movies. The Wikipedia page for a movie typically contains links to the corresponding IMDB and RottenTomatoes pages in the section named “External links”. We created a crawler for Wikipedia pages to extract those links and thereby generated the list of URLs for the software to crawl on the IMDB and RottenTomatoes websites.
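The link-extraction step can be illustrated with a minimal sketch. The authors' crawlers were written in Java with jsoup; the version below is only an approximation in Python's standard library. The class name `ExternalLinkExtractor`, the function `extract_movie_links` and the sample HTML are assumptions made for illustration, and unlike the crawler described above it does not restrict itself to the “External links” section.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkExtractor(HTMLParser):
    """Collect the first link found for each wanted host (hypothetical helper)."""

    def __init__(self, wanted_hosts):
        super().__init__()
        self.wanted_hosts = wanted_hosts
        self.links = {}  # wanted host -> first matching URL

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        host = urlparse(href).netloc.lower()
        for wanted in self.wanted_hosts:
            # Accept the bare domain and any subdomain such as "www."
            if host == wanted or host.endswith("." + wanted):
                self.links.setdefault(wanted, href)


def extract_movie_links(html):
    """Return the IMDB and RottenTomatoes links found in a page."""
    parser = ExternalLinkExtractor(["imdb.com", "rottentomatoes.com"])
    parser.feed(html)
    return parser.links


# Illustrative page fragment, not real Wikipedia markup.
SAMPLE_HTML = """
<h2>External links</h2>
<ul>
  <li><a href="http://www.imdb.com/title/tt0482571/">The Prestige at IMDb</a></li>
  <li><a href="http://www.rottentomatoes.com/m/prestige/">The Prestige at Rotten Tomatoes</a></li>
  <li><a href="http://example.org/other">Unrelated link</a></li>
</ul>
"""

links = extract_movie_links(SAMPLE_HTML)
```

Keeping only the first match per host is a simplification; a production crawler would also need to handle pages with no such links or with several candidates.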

BoxOfficeMojo and GoodReads. BoxOfficeMojo provides an index of the movies available on its website. We first wrote a crawler to extract the list of URLs from that index and then used another crawler to extract information from the respective movie pages. We also added another dimension to our movie data sets by integrating book-related information into our repository: we crawled multiple user-generated lists on GoodReads that contain the names of books adapted into movies. Statistics for all the extracted data are given in Table 2.

8 http://wiki.dbpedia.org/
9 https://www.wikipedia.org/
10 http://dbpedia.org/ontology/Film
11 http://schema.org/Movie
12 http://xmlns.com/foaf/0.1/isPrimaryTopicOf

Fig. 3: Integrated Movie Ontology

Website | No. of Records Generated
IMDB | Records generated for 36,549 movies; total casting records generated: 856,407
Rotten Tomatoes | Records generated for 10,000 movies
Box Office Mojo | Records generated for 16,945 movies
Good Reads | Records generated for over 3,000 books

Table 2: Statistics about generated data

4.2 Data Modeling & Integration

To model our data, we created the Integrated Movie Ontology, shown in Figure 3. In the center of the figure, we have the Movie class with various attributes. An instance of Movie can be based on an instance of the Book class. Both Book and Movie are subclasses of the class CreateWork. Each book has an associated instance of the Author class and each movie has associated instances of the classes Actor and Director. Actor, Director and Author are subclasses of the class Person. Each movie instance can be associated with multiple instances of the class Award, and each such instance of Award has an associated winner (an instance of the class Person).

We created a Java program using the Apache Jena¹³ API that takes relevant segments of the Ontology and the data file(s) as input and converts the CSV data into RDF triples. Table 3 shows a few example triples for the movie “The Prestige”.

13 http://jena.apache.org/

The Prestige (2006)

PREFIX imo: <http://www.usc.edu/csci586/entertainment#>

subject | predicate | object
http://www.imdb.com/title/tt0482571/ | rdf:type | imo:Movie
http://www.imdb.com/title/tt0482571/ | imo:Title | The Prestige
http://www.imdb.com/title/tt0482571/ | imo:IMDBUserRating | 8.5
http://www.imdb.com/title/tt0482571/ | imo:Year | 2006

Table 3: Sample triples for movie The Prestige
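The CSV-to-RDF conversion described in this section can be sketched as follows. The authors used Java with Apache Jena; this standard-library Python version only mimics the idea by emitting N-Triples lines. The CSV column names (`url`, `Title`, `Year`, `IMDBUserRating`) are an assumed layout, and every value is written as a plain, untyped literal for simplicity.

```python
import csv
import io

IMO = "http://www.usc.edu/csci586/entertainment#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical CSV layout; the real crawler output format is not given in the report.
SAMPLE_CSV = """url,Title,Year,IMDBUserRating
http://www.imdb.com/title/tt0482571/,The Prestige,2006,8.5
"""


def literal(value):
    """Quote a value as a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def csv_to_ntriples(text):
    """Turn each CSV row into one rdf:type triple plus one triple per non-empty column."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{row['url']}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{IMO}Movie> .")
        for column, value in row.items():
            if column == "url" or not value:
                continue
            # The column name doubles as the property name in the imo: namespace.
            lines.append(f"{subject} <{IMO}{column}> {literal(value)} .")
    return lines


triples = csv_to_ntriples(SAMPLE_CSV)
```

Reusing the column name as the property name keeps the sketch short; a real converter would map columns to ontology properties explicitly and attach datatypes to numeric values.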

4.3 Data Linking

Every movie is assigned its own ID in each data source. After the data has been converted into RDF in the previous phase, there can be multiple nodes in the RDF graph corresponding to the same movie, because the linkage between identical movies across data sets has yet to be established. For example, the movie The Prestige (2006) has a different URL in each of the three movie data sets, and the book on which it is based has yet another URL, as shown in Table 4.

The Prestige (2006)
IMDB | http://www.imdb.com/title/tt0482571/
Rotten Tomatoes | http://www.rottentomatoes.com/m/prestige/
Box Office Mojo | http://www.boxofficemojo.com/movies/?id=prestige.htm
Good Reads | http://www.goodreads.com/book/show/239239.The_Prestige

Table 4: Information related to the same movie across websites

To establish links between data sets, we used a tool called FRIL¹⁴ (Fine-grained Record Integration and Linkage Tool), which allows users to load two data sources and configure linkages based on multiple similarity metrics. The interface of FRIL is shown in Figure 4.

Movie-to-Movie Linkage. To link an IMDB movie to the same movie in the RottenTomatoes or BoxOfficeMojo data sets, we performed record linkage based on similar titles and identical release years. For matching titles, we used the edit distance similarity metric. The edit distance between strings $a_1 \ldots a_m$ and $b_1 \ldots b_n$ is the minimum cost of a sequence of editing steps (insertions, deletions, substitutions) that converts one string into the other. Let $d$ be the title similarity score derived from the edit distance and $e$ be the exact matching function. The similarity between two records $M_A$ and $M_B$ can then be represented as:

$$\mathrm{sim}(M_A, M_B) = w_1 \cdot d(M_A.\mathit{Title}, M_B.\mathit{Title}) + w_2 \cdot e(M_A.\mathit{Year}, M_B.\mathit{Year}) \tag{1}$$

14 http://fril.sourceforge.net/

Fig. 4: FRIL - Main Interface

By trial and error, we settled on a value of 50 for both $w_1$ and $w_2$, which gave us quite accurate results. For example, we were able to match titles despite slight differences in letters and punctuation across data sets. Some examples of matches are shown in Table 5. After such matches were found, we connected the two nodes representing the same movie with an owl:sameAs link.

Movie (IMDB) | Movie (Box Office Mojo)
Crank 2: High Voltage | Crank: High Voltage
Love Wedding Marriage | Love, Wedding, Marriage
The Hills have Eyes II | The Hills have Eyes 2

Table 5: Approximate matching of movie titles
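The weighted similarity of Equation 1 can be sketched in a few lines. This is not FRIL's implementation: FRIL has its own configurable scoring for edit distance, so the normalization used in `title_similarity` (one minus the distance divided by the longer title length) is an assumption, as are the record layout and the release year in the usage example.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Map edit distance to [0, 1], where 1.0 means identical (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def movie_similarity(m_a, m_b, w1=50, w2=50):
    """Equation 1: weighted title similarity plus exact match on release year."""
    d = title_similarity(m_a["title"], m_b["title"])
    e = 1.0 if m_a["year"] == m_b["year"] else 0.0
    return w1 * d + w2 * e


# Titles from Table 5; the year is a placeholder chosen for illustration.
imdb_record = {"title": "Love Wedding Marriage", "year": 2011}
bom_record = {"title": "Love, Wedding, Marriage", "year": 2011}
score = movie_similarity(imdb_record, bom_record)
```

With equal weights of 50, a year mismatch caps the score at 50, so any acceptance threshold above 50 effectively makes the year a hard constraint while still tolerating small title differences.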

Movie-to-Book Linkage. To link an IMDB movie to a book, we performed record linkage based on similar titles and the requirement that the publishing year of the book be less than or equal to the release year of the movie. Let $d$ be the title similarity score derived from the edit distance and $l$ be the comparison function that is true when the publishing year of the book is less than or equal to the release year of the movie. The similarity between two records $M_A$ (movie) and $B_A$ (book) can then be represented as:

$$\mathrm{sim}(M_A, B_A) = w_1 \cdot d(M_A.\mathit{Title}, B_A.\mathit{Title}) + w_2 \cdot l(M_A.\mathit{Year}, B_A.\mathit{Year}) \tag{2}$$

Equation 2 is less restrictive than the movie-to-movie comparison of Equation 1 because we are not doing exact matching of years. Hence, this approach turned out to be less precise, and we had to manually clean up the suggested matches. For instance, the movie Avatar (2009) was matched to the book Avatar from the Avatar: The Last Airbender series. Similarly, Amnesia was matched to Amnesiac (edit distance = 1), even though they are unrelated. For this step, we therefore used FRIL to provide an initial record linkage, which was then manually checked for erroneous matches. Based on the corrected matches, we used the basedOn property from our ontology to create links between the relevant instances of Movie and Book.
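The Amnesia/Amnesiac case can be reproduced with a small sketch of Equation 2. As before, this is not FRIL's scoring: the normalization of the edit distance, the record layout and the years in the example are assumptions chosen only to show why a relaxed year condition lets unrelated titles score highly.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance, written recursively with memoization."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(go(i - 1, j) + 1, go(i, j - 1) + 1, go(i - 1, j - 1) + cost)

    return go(len(a), len(b))


def movie_book_similarity(movie, book, w1=50, w2=50):
    """Equation 2: title similarity plus the 'published no later than release' test."""
    a, b = movie["title"].lower(), book["title"].lower()
    longest = max(len(a), len(b)) or 1
    d = 1.0 - edit_distance(a, b) / longest
    l = 1.0 if book["year"] <= movie["year"] else 0.0
    return w1 * d + w2 * l


# Placeholder years: only their order matters for the comparison function l.
movie = {"title": "Amnesia", "year": 2000}
earlier_book = {"title": "Amnesiac", "year": 1990}
later_book = {"title": "Amnesiac", "year": 2010}

false_positive_score = movie_book_similarity(movie, earlier_book)
rejected_score = movie_book_similarity(movie, later_book)
```

Because any earlier publishing year satisfies `l`, the year term contributes its full weight far more often than the exact match in Equation 1, which is consistent with the lower precision and the manual review step reported above.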

4.4 Querying

After data integration and linking, we had an RDF data set of roughly 2.3 million triples. We used the OpenLink Virtuoso¹⁵ server to host the RDF data set. The repository was then queried using a combination of the Apache Jena API and the Virtuoso API. We developed an interface using the Java Swing libraries that takes a query type as input; additional text input is required depending on the type of query selected. The interface is shown in Figure 5. Some of the queries are pre-configured, but users can also issue their own queries. When users choose to run a pre-configured query, the actual SPARQL query is also shown in the text area, where it can be used as a template for custom queries. The list of pre-configured queries and their descriptions is given in Table 6. To show the value of integration, we have divided the test queries into three categories, which are discussed next.

Adding Relevant Data to Movie Theater Page. As discussed in Section 2, websites such as Google Movies do not provide all the information that a user might use to decide which movie to watch, e.g. critic rating, user rating and gross. In this first type of query, the user provides a Google Movies URL for a particular cinema. The software uses a crawler to get information from the page (e.g. title, IMDB link), queries the repository for each movie found, augments the crawled information with the retrieved data and presents the result as a table. Essentially, the user gets information from four websites (IMDB, BoxOfficeMojo, RottenTomatoes, Google Movies) in a single table. The table can be sorted on any column, which lets the user rank the movies showing in a particular cinema by their preferred criterion. Figure 6 shows the result of this query for a particular movie theater in the downtown Los Angeles area.

15 http://virtuoso.openlinksw.com/

Fig. 5: Integrated Movie Database Search Interface

Fig. 6: Augmented movie theater data with additional information

Querying Data from Previously Disjoint Data Sources. Now that we have integrated previously disjoint data sources, we can pose queries that fetch a variety of data in a single query. For example: “Which authors’ books have been most profitable for the movie industry?” or “List all movie adaptations, along with certain attributes, for a particular author”. Both of these queries would otherwise require users to explore multiple websites to find the answer. With the integrated data, this can be done with a single query. The results for the latter query are shown in Figure 7 and the equivalent SPARQL query is shown in Figure 8.
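To make the benefit of owl:sameAs linking concrete, the following sketch answers the integrated query from Section 2 (critic score above 90% and worldwide gross above one billion dollars) over a handful of in-memory triples. It does not use SPARQL, Jena or Virtuoso; it merges sameAs-linked nodes with a small union-find and then filters. The property names `imo:CriticScore` and `imo:WorldwideGross`, the node identifiers and all values are hypothetical.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"

# Toy data: each source describes the same movie under a different node.
TRIPLES = [
    ("imdb:m1", "imo:Title", "Movie One"),
    ("rt:m1", "imo:CriticScore", 94),
    ("bom:m1", "imo:WorldwideGross", 1_200_000_000),
    ("imdb:m1", OWL_SAME_AS, "rt:m1"),
    ("imdb:m1", OWL_SAME_AS, "bom:m1"),
    ("imdb:m2", "imo:Title", "Movie Two"),
    ("rt:m2", "imo:CriticScore", 95),
    ("bom:m2", "imo:WorldwideGross", 300_000_000),
    ("imdb:m2", OWL_SAME_AS, "rt:m2"),
    ("imdb:m2", OWL_SAME_AS, "bom:m2"),
]


def merge_same_as(triples):
    """Group the properties of all nodes connected by owl:sameAs links."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            parent[find(obj)] = find(subject)

    merged = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            merged[find(subject)][predicate] = obj
    return merged


def acclaimed_blockbusters(triples, min_score=90, min_gross=1_000_000_000):
    """Titles whose merged record satisfies both criteria at once."""
    movies = merge_same_as(triples)
    return sorted(
        props["imo:Title"]
        for props in movies.values()
        if "imo:Title" in props
        and props.get("imo:CriticScore", 0) > min_score
        and props.get("imo:WorldwideGross", 0) > min_gross
    )


result = acclaimed_blockbusters(TRIPLES)
```

Without the sameAs links, the score and the gross would sit on unrelated nodes and no single record could satisfy both conditions, which is the limitation of a mash-up noted in Section 2.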

Menu Item | Description
Process Google Movies URL | Extract movies from the Google Movies Cinema Page and augment them with data from the repository
Get Information of Movies based on books by an Author | Provide all attributes related to movies based on books by a user-provided author
Most Oscars Won By a Movie based on Book Adaptation | Top 10 list of books that have been the basis for movies with the most Oscar wins
Directed and Acted in a Movie and Won an Oscar | Find people who directed a movie, acted in it and won either the Best Director or Best Actor/Actress Oscar
Collaborative Path Between Two Actors of Fixed Length | Using the graph of actors and actresses connected with each other through movies, find a path of a particular length between two people
Collaborative Path Between Two Actors of Any Length | Using the graph of actors and actresses connected with each other through movies, find a path of any length between two people
Partial Match Movie Search | Find all movies that partially match a user-provided phrase
Custom Query | Users enter their own queries in the text area

Table 6: Description of pre-configured queries available through the software interface

Fig. 7: List of movies with selected attributes along with the books they were based on

Fig. 8: SPARQL query to extract movies based on books of J.R.R. Tolkien

Path Query: Find Collaboration / Degree of Separation. The third type of query finds a collaborative path between two people based on the movies that they have acted in together. This kind of collaborative query can be found in other domains as well. When a user on LinkedIn¹⁶ clicks on the profile of another professional, they can see how they are connected to that person through other people in their professional network. Similarly, in the research community, two examples of measuring collaboration are the Einstein number and the Erdős number. Einstein himself has an Einstein number of 0. Anybody who has co-authored a paper with Einstein has an Einstein number of 1. For instance, as shown in Figure 9, Ernst Gabor Straus has an Einstein number of 1 and Lee A. Rubel, who collaborated with Ernst Gabor Straus, has an Einstein number of 2. We have calculated a similar metric between two people based on the movies they have acted in. For instance, Figure 10a shows the minimum collaborative path between Al Pacino and Marlon Brando, which is of length 1, since they worked together in The Godfather (1972). Another version of this query finds a path of fixed length between two people, where the desired path length is provided by the user. Figure 10b shows the result of a query that finds a collaborative path of length 3 between the same two actors.

Fig. 9: An example of the Einstein Number

16 http://www.linkedin.com
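Both path queries can be sketched over a small in-memory collaboration graph. The report's software answers them against the RDF repository; the breadth-first and depth-limited searches below are a stand-in technique, not the authors' queries. Only the pairing of Al Pacino and Marlon Brando in The Godfather (1972) comes from the text; "Movie X", "Actor P" and the other placeholders are invented so that a path of length 3 exists.

```python
from collections import defaultdict, deque

# Movie -> cast. Everything except the first entry is placeholder data.
CAST = {
    "The Godfather (1972)": ["Al Pacino", "Marlon Brando"],
    "Movie X": ["Al Pacino", "Actor P"],
    "Movie Y": ["Actor P", "Actor Q"],
    "Movie Z": ["Actor Q", "Marlon Brando"],
}


def build_graph(cast):
    """Connect two people if they appear in the same movie."""
    graph = defaultdict(dict)  # person -> {co-star: a movie they share}
    for movie, people in cast.items():
        for a in people:
            for b in people:
                if a != b:
                    graph[a].setdefault(b, movie)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: returns a path with the fewest links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def paths_of_length(graph, start, goal, length):
    """Depth-limited search: all simple paths with exactly `length` links."""
    results = []

    def extend(path):
        if len(path) - 1 == length:
            if path[-1] == goal:
                results.append(path)
            return
        for neighbor in graph[path[-1]]:
            if neighbor not in path:
                extend(path + [neighbor])

    extend([start])
    return results


graph = build_graph(CAST)
shortest = shortest_path(graph, "Al Pacino", "Marlon Brando")
fixed = paths_of_length(graph, "Al Pacino", "Marlon Brando", 3)
```

Here `shortest` has a single link, matching the length-1 path of Figure 10a, while `fixed` holds the one placeholder path with three links. On a cast graph of the size reported in Table 2, the fixed-length search would need pruning or a cap on the number of results, since the number of simple paths grows quickly with length.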

5 Conclusion and Future Work

Ontologies are becoming an increasingly popular way of organizing data and are used in multiple domains. We have shown the value of Ontology-based integration in the domain of movies and books by answering queries that would otherwise have required exploring multiple web pages.

For future work, we feel that the data set developed for this project can be used for applying Machine Learning techniques and Social Media Analytics. For instance, in a graph of authors/researchers, the number of publications can serve as a measure for finding influential nodes. In the graph of actors/actresses, however, having links to more movies does not give any clue about the popularity or influence of a person. More sophisticated measures of influence need to be established for such collaborative networks. Another interesting direction would be to cluster movies based on multiple features, e.g. budget, gross, rating, cast and awards, to determine whether any interesting patterns of similarity emerge. Hence, automatic determination of influential nodes and finding similar movies or people are interesting next avenues to explore using the data we have generated for this project.

(a) Shortest collaborative path between Al Pacino & Marlon Brando

(b) Collaborative path of length 3 between Al Pacino & Marlon Brando

Fig. 10: Results of Path Queries

References

1. Sunitha Ramanujam, Anubha Gupta, Latifur Khan, Steven Seida, and Bhavani M. Thuraisingham. A relational wrapper for RDF reification. In Trust Management III, Third IFIP WG 11.11 International Conference, IFIPTM 2009, West Lafayette, IN, USA, June 15-19, 2009. Proceedings, pages 196–214, 2009.

2. Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001.

3. Liyang Yu. A Developer’s Guide to the Semantic Web. Springer, 2011.
