Predict the Oscars with Data Science

bit.ly/oscars-dc

Transcript of Predict the Oscars with Data Science

Page 1: Predict the Oscars with Data Science

bit.ly/oscars-dc

Page 2: Predict the Oscars with Data Science

Predicting the Oscars with data science

Page 3: Predict the Oscars with Data Science

Data Science Process

• Frame the question.

• Collect the raw data.

• Process the data.

• Explore the data.

• Communicate results.

Page 4: Predict the Oscars with Data Science

Frame the question

• Who will win the Oscar for Best Picture?

Page 5: Predict the Oscars with Data Science

Collect the Data

• What kind of data do we need?

• Financial data (Budget, box office…)

• Reviews, ratings and scores.

• Awards and nominations.

Page 6: Predict the Oscars with Data Science

Process the data

• How’s the data “dirty” and how can we fix it?

• User input, redundancies, missing data…

• Formatting: adapt the data to meet certain specifications.

• Cleaning: detecting and correcting corrupt or inaccurate records.

Page 7: Predict the Oscars with Data Science

Explore the data

• What are the meaningful patterns in the data?

• How meaningful is each data point for our predictions?

Page 8: Predict the Oscars with Data Science

Goals

• Introduction to a data scientist's tools and methods:

• Jupyter notebooks, numpy, pandas, sklearn…

• Overview of basic machine learning concepts:

• Data formatting and cleaning, Decision trees, Overfitting, Random Forests…

Page 9: Predict the Oscars with Data Science

Jupyter Notebooks

• One of a data scientist’s everyday tools.

• Find the links in our classroom tool.

• Contains cells with code.

Page 10: Predict the Oscars with Data Science

NumPy

• The fundamental package for scientific computing with Python.

• Provides powerful multi-dimensional array objects.

• Many methods for fast operations on arrays.
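
As a small illustration of the points above, here is a sketch of a two-dimensional NumPy array and a couple of fast array operations. The numbers are made up for illustration and do not come from the workshop dataset.

```python
import numpy as np

# Hypothetical ratings for three films from two sources (rows = films).
scores = np.array([[7.5, 8.0],
                   [6.0, 6.5],
                   [9.0, 8.5]])

print(scores.shape)                   # number of rows and columns

# Average across the two sources for each film (one value per row).
mean_per_film = scores.mean(axis=1)

# Element-wise operation on the whole array, no explicit loop needed.
scaled = scores / 10.0
print(mean_per_film)
print(scaled)
```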

Page 11: Predict the Oscars with Data Science

Pandas

• Fundamental high-level building block for doing practical, real world data analysis in Python.

• Built on top of NumPy.

• Offers data structures and operations for manipulating numerical tables and time series.
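
A minimal sketch of a pandas DataFrame, assuming a tiny hypothetical table of films. The column names (`film`, `budget_musd`, `nominations`) and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical table; column names and values are made up.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C"],
    "budget_musd": [30.0, 120.0, 15.0],
    "nominations": [8, 3, 11],
})

# Boolean filtering: keep only the rows with at least 8 nominations.
heavily_nominated = movies[movies["nominations"] >= 8]

# Column-wise aggregation.
mean_budget = movies["budget_musd"].mean()

print(heavily_nominated)
print(mean_budget)
```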

Page 12: Predict the Oscars with Data Science

Scikit-learn

• Python module for machine learning.

• Provides tools for classification, regression, clustering, model selection and preprocessing, built on top of NumPy and SciPy.

Page 13: Predict the Oscars with Data Science

Initial imports and loading data with Pandas
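
The code from this slide is not part of the transcript. A sketch of what the initial imports and loading step might look like follows, assuming the data comes as a CSV. The column names and rows are hypothetical, and an in-memory string stands in for the real file so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd
from sklearn import tree

# Stand-in for a CSV file on disk; with a real file you would pass
# its path (or URL) to pd.read_csv instead. All rows are made up.
csv_text = io.StringIO(
    "film,year,rate,nominations,winner\n"
    "Film A,2001,PG-13,8,1\n"
    "Film B,2001,R,3,0\n"
    "Film C,2002,PG,11,0\n"
)

movies = pd.read_csv(csv_text)
print(movies.shape)
print(movies.dtypes)
```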

Page 14: Predict the Oscars with Data Science

Understanding your data

• .head(n) method: Returns first n rows.

• .value_counts() method: Returns how many times each unique value appears in a column (a pandas Series).
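
A short sketch of both methods on a hypothetical DataFrame (the `film` and `rate` columns and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, made up for illustration.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["R", "PG-13", "R", "PG"],
})

# First two rows of the table.
first_two = movies.head(2)

# How often each rating appears, most frequent first.
rate_counts = movies["rate"].value_counts()

print(first_two)
print(rate_counts)
```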

Page 15: Predict the Oscars with Data Science

Formatting your Data

Page 16: Predict the Oscars with Data Science

Formatting your Data

• The rate values are in a non-numeric format, so we need to assign each rate a unique integer that the model can handle.

• With the .loc indexer (.ix in older pandas versions, removed in pandas 1.0) you select a subset of rows and assign a value to a given column for that subset of observations.
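
A minimal sketch of this encoding step using `.loc`, the current pandas indexer for this kind of row selection (`.ix` was removed in pandas 1.0). The `rate` values, the new `rate_code` column and the integer codes are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG", "PG-13", "R", "PG-13"],
})

# Start with a placeholder code, then overwrite it subset by subset.
movies["rate_code"] = -1
movies.loc[movies["rate"] == "PG", "rate_code"] = 0
movies.loc[movies["rate"] == "PG-13", "rate_code"] = 1
movies.loc[movies["rate"] == "R", "rate_code"] = 2

print(movies)
```

Any row still holding -1 afterwards has a rating that was not covered by the mapping, which makes gaps easy to spot.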

Page 17: Predict the Oscars with Data Science

Cleaning your Data
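
The cleaning code from this slide is not in the transcript. One common cleaning step, shown here only as an assumed example, is filling missing numeric values with the column median. The `budget_musd` column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing budget.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "budget_musd": [30.0, np.nan, 15.0, 120.0],
})

# Detect: count the missing values in the column.
missing_before = movies["budget_musd"].isna().sum()

# Correct: replace missing values with the median of the known ones.
median_budget = movies["budget_musd"].median()
movies["budget_musd"] = movies["budget_musd"].fillna(median_budget)

print(missing_before)
print(movies)
```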

Page 18: Predict the Oscars with Data Science

Decision Trees

• It breaks down a dataset into smaller and smaller subsets.

• The final result is a model with a tree structure that has:

• Decision nodes: ask a question and have two or more branches.

• Leaf nodes: represent a classification or decision.

Page 19: Predict the Oscars with Data Science
Page 20: Predict the Oscars with Data Science

Classification vs Regression

• Classification: predict categories.

• Identifying group membership.

• Regression: predict values.

• Involves estimating or predicting a response.
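
To make the distinction concrete, here is a sketch that fits scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor` on the same made-up feature. The data (nominations, win labels, box office figures) is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# One hypothetical feature: number of nominations per film.
X = np.array([[2], [3], [8], [11]])

# Classification target: a category (1 = won, 0 = did not win).
won = np.array([0, 0, 1, 1])

# Regression target: a continuous value (made-up box office, in millions).
box_office = np.array([40.0, 55.0, 150.0, 210.0])

clf = DecisionTreeClassifier(random_state=0).fit(X, won)
reg = DecisionTreeRegressor(random_state=0).fit(X, box_office)

new_film = np.array([[10]])
predicted_class = clf.predict(new_film)   # a class label
predicted_value = reg.predict(new_film)   # a number

print(predicted_class, predicted_value)
```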

Page 21: Predict the Oscars with Data Science

Classification

Page 22: Predict the Oscars with Data Science

Classification

?

Page 23: Predict the Oscars with Data Science

Creating your first Decision Tree

You will use the scikit-learn and numpy libraries to build your first decision tree. We will need the following to build it:

• target: A one-dimensional numpy array containing the target from the train data.

• features: A multidimensional numpy array containing the features/predictors from the train data.
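
A sketch of how `target` and `features` might be built and passed to scikit-learn's `DecisionTreeClassifier`. The training table, its column names (`nominations`, `rate_code`, `winner`) and the variable name `my_tree` are assumptions for illustration, not the workshop's actual data.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Hypothetical training data, made up for illustration.
train = pd.DataFrame({
    "nominations": [8, 3, 11, 2, 9, 4],
    "rate_code":   [1, 2, 0, 2, 1, 0],
    "winner":      [1, 0, 1, 0, 1, 0],
})

# target: one-dimensional array with the value we want to predict.
target = train["winner"].values

# features: two-dimensional array, one row per film, one column per predictor.
features = train[["nominations", "rate_code"]].values

# Fit the decision tree; random_state makes the result reproducible.
my_tree = tree.DecisionTreeClassifier(random_state=1)
my_tree = my_tree.fit(features, target)

print(target.shape, features.shape)
```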

Page 24: Predict the Oscars with Data Science

Creating your first Decision Tree

Page 25: Predict the Oscars with Data Science

Importances and Score

• .feature_importances_ attribute: tells us how important the features are for the final result.

• .score() method: returns the mean accuracy of the fitted model on the features and target passed to it.
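
A sketch of reading both values from a fitted tree, again on hypothetical data. Note that scoring on the same data the tree was trained on tends to give an optimistic number.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: columns are nominations and rate_code.
features = np.array([[8, 1], [3, 2], [11, 0], [2, 2], [9, 1], [4, 0]])
target = np.array([1, 0, 1, 0, 1, 0])

my_tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# One importance per feature column; the values sum to 1.
importances = my_tree.feature_importances_

# Mean accuracy on the data passed in (here: the training data itself).
train_accuracy = my_tree.score(features, target)

print(importances)
print(train_accuracy)
```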

Page 26: Predict the Oscars with Data Science

Importances and Score

Page 27: Predict the Oscars with Data Science

Predicting
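
The prediction code from this slide is not in the transcript. A minimal sketch follows, assuming a fitted tree and a hypothetical array `test_features` with the same columns as the training features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: columns are nominations and rate_code.
features = np.array([[8, 1], [3, 2], [11, 0], [2, 2], [9, 1], [4, 0]])
target = np.array([1, 0, 1, 0, 1, 0])
my_tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# New, unseen films must have the same columns in the same order.
test_features = np.array([[10, 1], [1, 2]])
prediction = my_tree.predict(test_features)

print(prediction)
```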

Page 28: Predict the Oscars with Data Science

Pretty bad results :(

Let’s improve it!

Page 29: Predict the Oscars with Data Science

Let’s improve it!

Page 30: Predict the Oscars with Data Science

Modify the feature list

Page 31: Predict the Oscars with Data Science

Run the prediction again

Page 32: Predict the Oscars with Data Science

Overfitting

• The resulting model is too closely tied to the training set.

• It doesn’t generalize to new data, which is the point of prediction.
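
One way to see overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below uses synthetic noisy data (not the Oscars dataset) and assumes that limiting `max_depth` and `min_samples_split` is the chosen remedy. Other remedies exist.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise,
# so no model can learn it perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unconstrained tree: free to memorise the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: fewer, larger splits.
shallow = DecisionTreeClassifier(
    max_depth=2, min_samples_split=5, random_state=0
).fit(X_train, y_train)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```

A large gap between the training and test accuracy of the unconstrained tree is the typical signature of overfitting.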

Page 33: Predict the Oscars with Data Science

Random Forest Classifier

• Random Forest Classifiers use many Decision Trees to build a classifier.

• We introduce a bit of randomness.

• Each Tree can give a different answer (a vote). The final classification is the most common amongst the Trees.
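
A sketch of the idea with scikit-learn's `RandomForestClassifier`, on hypothetical data. The parameter values are assumptions. One caveat: scikit-learn combines the trees by averaging their predicted class probabilities rather than counting hard votes, which usually, but not always, gives the same answer as a majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: columns are nominations and rate_code.
features = np.array([[8, 1], [3, 2], [11, 0], [2, 2], [9, 1], [4, 0]])
target = np.array([1, 0, 1, 0, 1, 0])

# Each tree is trained on a random bootstrap sample of the rows and
# considers a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
forest.fit(features, target)

new_film = np.array([[10, 1]])

# The individual trees' answers (one "vote" per tree).
votes = np.array([t.predict(new_film)[0] for t in forest.estimators_])

# The forest's combined answer.
forest_prediction = forest.predict(new_film)

print("votes for class 1:", int(votes.sum()), "of", len(votes))
print("forest prediction:", forest_prediction)
```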

Page 34: Predict the Oscars with Data Science

Random Forest Classifier

Page 35: Predict the Oscars with Data Science

Importances and Score

Page 36: Predict the Oscars with Data Science

Predicting with Random Forest Classifiers

Page 37: Predict the Oscars with Data Science

Results

Page 38: Predict the Oscars with Data Science

1976

Rocky

Page 39: Predict the Oscars with Data Science

1984

Amadeus

Page 40: Predict the Oscars with Data Science

1996

The English Patient

Page 41: Predict the Oscars with Data Science

2009

The Hurt Locker

Page 42: Predict the Oscars with Data Science

And the Oscar goes to…

Page 43: Predict the Oscars with Data Science

La La Land!!

Page 44: Predict the Oscars with Data Science
Page 45: Predict the Oscars with Data Science
Page 46: Predict the Oscars with Data Science

The End

Nothing happened after that.

Right?? RIGHT??

Page 47: Predict the Oscars with Data Science

We can predict the Oscars

Except for 2017 ¯\_(ツ)_/¯

Page 48: Predict the Oscars with Data Science
Page 49: Predict the Oscars with Data Science

More about Thinkful

• Anyone who’s committed can learn programming or data science

• 1-on-1 mentorship is the best way to learn

• Flexibility matters — learn anywhere, anytime

Page 50: Predict the Oscars with Data Science

Our Program

You’ll learn concepts, practice with drills, and build capstone projects — all guided by a personal mentor

Page 51: Predict the Oscars with Data Science

Our Mentors

Mentors have, on average, 10+ years of experience

Page 52: Predict the Oscars with Data Science

Data Science Syllabus

• Managing data with SQL and Python

• Modeling with both supervised and unsupervised models

• Data visualization and communicating with data

• Technical interviews + Career prep

Page 53: Predict the Oscars with Data Science

Web Development Syllabus

• Frontend Development (HTML, CSS, Javascript)

• Backend Development (Node.js)

• Frontend Frameworks (React.js)

• Computer Science Fundamentals

• Technical interviews + Career prep

Page 54: Predict the Oscars with Data Science

Our Results

Job Titles after Graduation

Months until Employed

Page 55: Predict the Oscars with Data Science

Special Prep Course Offer

• Three-week program, includes nine mentor sessions for $250 (regular price $500)

• Introduction to Programming in Python, Data Visualization, and Statistics

• Option to continue into full data science bootcamp

• Talk to me (or email me) if you’re interested