DATA MINING ON /R/NBA
eecs.csuohio.edu/~sschung/CIS660/Data Mining Final...

DATA MINING ON /R/NBA ALEX CHENGELIS AND ANDREW YU

INTRODUCTION

• Raw data processing using an API
• Data processing and storage
• Comment heat maps
• Comment scores based on game action
• Word counting
• Naïve Bayes classifier

TECHNOLOGIES USED

• Python
• NLTK
• PRAW
• Tableau
• CSV and a little Excel

DATA PREPROCESSING

• Gather data using PRAW
• Create an agent for use in Reddit’s API
• Gather URLs to cycle through
• Write the comment, flair, and score to a CSV file
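The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the CSV column order, and the credential parameters are assumptions, and `scrape` needs the third-party PRAW package plus real Reddit API credentials to run.

```python
import csv


def fetch_comments(reddit, thread_url):
    """Return a flat list of the comments in one thread (needs network access)."""
    submission = reddit.submission(url=thread_url)
    # limit=0 discards the "load more comments" placeholders instead of expanding them
    submission.comments.replace_more(limit=0)
    return submission.comments.list()


def write_rows(writer, comments):
    """Write one CSV row per comment (body, author flair, score); return the row count."""
    rows = 0
    for comment in comments:
        flair = comment.author_flair_text or ""  # users without flair yield None
        writer.writerow([comment.body, flair, comment.score])
        rows += 1
    return rows


def scrape(thread_urls, out_path, client_id, client_secret, user_agent):
    """Cycle through the thread URLs and write all comments to one CSV file."""
    import praw  # third-party; imported lazily so the helpers above work without it

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["comment", "flair", "score"])  # assumed column order
        for url in thread_urls:
            write_rows(writer, fetch_comments(reddit, url))
```

Keeping `write_rows` separate from the network call means the CSV step can be checked with stand-in comment objects before any API request is made.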

WHAT OUR DATA LOOKS LIKE

VISUALIZATION OF COMMENTS

Team          City            State           Count  Score    Avg
Lakers        Los Angeles     California        461   7982  17.31
Hornets       Charlotte       North Carolina    124   3293  26.56
Celtics       Boston          Massachusetts     337   9083  26.95
Nuggets       Denver          Colorado           98   3175  32.40
Nets          Brooklyn        New York           65   6062  93.26
Bucks         Milwaukee       Wisconsin          96   1034  10.77
Pelicans      New Orleans     Louisiana          66    222   3.36
Bulls         Chicago         Illinois          320   7364  23.01
NBA           -               -                 170   2077  12.22
Warriors      Oakland         California        310   6312  20.36
Pistons       Detroit         Michigan          110   1859  16.90
76ers         Philadelphia    Pennsylvania      153   4922  32.17
Hawks         Atlanta         Georgia           104   4071  39.14
Suns          Phoenix         Arizona           107    295   2.76
Huskies       Hartford        Connecticut        10     52   5.20
Grizzlies     Memphis         Tennessee         113    408   3.61
Wizards       Washington      D.C.              123   1611  13.10
West          -               -                  19    252  13.26
Mavericks     Dallas          Texas             100   3601  36.01
Heat          Miami           Florida           282  10087  35.77
Rockets       Houston         Texas             212   4929  23.25
Raptors       Toronto         -                 362   9086  25.10
Kings         Sacramento      California         99   1811  18.29
Supersonics   Seattle         Washington        126   4173  33.12
Pacers        Indianapolis    Indiana            61    147   2.41
USA           -               -                  10     22   2.20
Blazers       Portland        Oregon            157   2471  15.74
Thunder       Oklahoma City   Oklahoma          276  11555  41.87
Clippers      Los Angeles     California        138   3935  28.51
Cavaliers     Cleveland       Ohio              960  15608  16.26
Spurs         San Antonio     Texas             310   3705  11.95
Timberwolves  Minneapolis     Minnesota         168   6164  36.69
Knicks        New York        New York          324   4403  13.59
East          -               -                  13    112   8.62
Bandwagon     -               -                 227   6568  28.93
Jazz          Salt Lake City  Utah               46   1649  35.85
Magic         Orlando         Florida            66    426   6.45
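A per-flair summary like the one in this table (comment count, total score, average score) could be computed along these lines. The function name and the (flair, score) input shape are assumptions for illustration; the slides do not show how the table was actually built, and Tableau or Excel may have done this step.

```python
from collections import defaultdict


def summarize_by_flair(rows):
    """rows: iterable of (flair, score) pairs -> {flair: (count, total, average)}."""
    totals = defaultdict(lambda: [0, 0])  # flair -> [comment count, summed score]
    for flair, score in rows:
        totals[flair][0] += 1
        totals[flair][1] += int(score)
    # The average is total score divided by comment count, rounded to two decimals
    return {flair: (n, s, round(s / n, 2)) for flair, (n, s) in totals.items()}
```

As a consistency check against the table, the Lakers row follows the same arithmetic: 7982 / 461 rounds to 17.31.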

GAME 1 COMMENTS

GAME 1 COMMENT SCORES

GAME 2 COMMENTS

GAME 2 COMMENT SCORES

GAME 3 COMMENTS

GAME 3 COMMENT SCORES

GAME 4 COMMENTS

GAME 4 COMMENT SCORES

GAME 5 COMMENTS

GAME 5 COMMENT SCORES

GAME 6 COMMENTS

GAME 6 COMMENT SCORES

GAME 7 – CLEVELAND CHAMPS

GAME 7 – CLEVELAND CHAMPS

USING TIME VARIANT

DOING SOME TEXT MINING

WHAT WE DID WITH WORDS

• Tried an inverted index but ran into some problems.
• 50,000+ comments
• Fell back to a simpler term-frequency count, ignoring the 100 most used English words.
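A term-frequency count of this kind might look like the sketch below. The stop-word set is a short illustrative subset standing in for the 100 most used English words, and the tokenizing regular expression is an assumption; neither is taken from the authors' code.

```python
import re
from collections import Counter

# Illustrative subset only; the slides describe ignoring the 100 most used English words.
COMMON_WORDS = {
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
    "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
    "is", "was", "this", "but", "his", "by", "from", "they", "we", "what",
}


def term_frequency(comments, stop_words=COMMON_WORDS):
    """Count how often each non-stop-word token appears across all comments."""
    counts = Counter()
    for text in comments:
        # Lowercase, then keep runs of letters, digits, and apostrophes as tokens
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts
```

Calling `term_frequency(comments).most_common(40)` would then give a ranked word list similar in shape to the one on the next slide.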

Word Count

game 728

lebron 641

just 402

him 338

warriors 305

cavs 303

curry 287

com 254

team 253

love 236

3 222

he's 222

finals 218

fuck 214

had 205

even 203

nba 203

think 199

shit 197

it's 195

win 190

got 188

i'm 184

best 184

http 176

's 174

don't 173

7 172

series 171

cleveland 166

fucking 165

did 163

kyrie 163

good 162

after 158

back 158

player 157

ever 157

draymond 155

last 153

too 153

CLEVELAND WINS WORD CLOUD

GOLDEN STATE WINS WORD CLOUD

CAN YOU DETERMINE WHO WON BASED ON A COMMENT?

NAÏVE BAYES CLASSIFIER - BASED ON GUIDE BY ANDY BROMBERG

HTTP://ANDYBROMBERG.COM/SENTIMENT-ANALYSIS-PYTHON/

HOW WE BUILT THE NAÏVE BAYES CLASSIFIER

• Used the same Cleveland Wins and Golden State Wins text files.
• A lot like negative and positive sentiment analysis, but with wins.
• Take ¾ of the comments for training and ¼ for testing.
• Strip all punctuation and escape characters.
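The cleaning and the ¾ / ¼ split described above could be sketched as follows. The helper names are hypothetical, and the exact cleaning rules (for example, whether apostrophes were removed) are assumptions rather than details given in the slides.

```python
import string

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean(comment):
    """Collapse newlines and tabs to single spaces, drop punctuation, lowercase."""
    collapsed = " ".join(comment.split())
    return collapsed.translate(_STRIP_PUNCTUATION).lower()


def split_train_test(items, train_fraction=0.75):
    """Return (training items, testing items), cutting at the given fraction."""
    cutoff = int(len(items) * train_fraction)
    return items[:cutoff], items[cutoff:]
```

Because `string.punctuation` includes the apostrophe, a token such as "he's" becomes "hes" under this sketch; a different rule would be needed to keep contractions intact.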

CONT.

• We call the classifier that is included with NLTK, initialize the reference and test sets, and populate them.
• Before this, we created a function that uses a chi-square test to score each word.
• Finally, we use the classifier for predictions.
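The word scoring and classifier steps above might be sketched as below. The chi-square statistic is written out by hand for a 2x2 table so the scoring part runs without NLTK; the authors' function may instead have used NLTK's own association measures. All function names here are hypothetical, and `train_and_evaluate` requires the third-party NLTK package.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """2x2 chi-square: a, b = word count in class 1 / 2; c, d = other words in class 1 / 2."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score_words(cle_tokens, gsw_tokens):
    """Score every word by how unevenly it is spread across the two corpora."""
    cle, gsw = Counter(cle_tokens), Counter(gsw_tokens)
    cle_total, gsw_total = sum(cle.values()), sum(gsw.values())
    return {
        w: chi_square(cle[w], gsw[w], cle_total - cle[w], gsw_total - gsw[w])
        for w in set(cle) | set(gsw)
    }


def best_words(scores, n):
    """Return the n highest-scoring words as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def features(tokens, keep):
    """Bag-of-words feature dict restricted to the selected words."""
    return {t: True for t in tokens if t in keep}


def train_and_evaluate(train_set, test_set):
    """train_set / test_set: lists of (feature_dict, label) pairs. Requires NLTK."""
    import nltk  # third-party; imported lazily so the scoring helpers run without it

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)
```

Varying `n` in `best_words` corresponds to the "10 best", "100 best", and similar feature sets compared in the results.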

RESULTS

Features      Accuracy
All words     57.713%
10 best       55.771%
100 best      60.302%
1000 best     66.235%
10000 best    60.949%
15000 best    58.360%

INTERESTING RESULTS

• Shaun Livingston
• Bench player for the Warriors
• If his name is in a comment, there is a 95.28% chance that the Warriors won

INTERESTING RESULTS

• Harrison Barnes
• Part-time starter, part-time bench player, full-time punching bag
• If his name is in a comment, there is a 94.68% chance that Cleveland won

INTERESTING RESULTS

• Kyrie and LeBron
• In Game 5 both scored 41 points
• If 41 is in a comment, there is a 93.24% chance that Cleveland won

MOST TELLING WORDS FOR BOTH TEAMS

Warriors Most Useful

Word      Chance
Shaun     95.28%
fired     92.91%
range     92.00%
Thunder   91.80%
healthy   91.67%
talent    90.74%
splash    89.25%

Cleveland Most Useful

Word      Chance
Harrison  94.48%
41'       93.24%
Sunday    92.37%
tweet     90.74%
road      89.69%
calls     90.29%
mad       90.29%
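The slides do not say how the "Chance" column was computed. One plausible reading, shown here purely as an assumption, is the share of a word's usage rate that falls in one team's "wins" corpus when both outcomes are treated as equally likely beforehand. The function and parameter names are hypothetical.

```python
def win_chance(word_in_a, total_a, word_in_b, total_b):
    """Estimated P(corpus A | word), assuming equal priors for the two outcomes."""
    rate_a = word_in_a / total_a  # how often the word occurs per token in corpus A
    rate_b = word_in_b / total_b  # the same rate in corpus B
    if rate_a + rate_b == 0:
        return 0.5  # word appears in neither corpus, so it carries no evidence
    return rate_a / (rate_a + rate_b)
```

Under this reading, a word used 20 times in one corpus and once in an equally sized other corpus would score 20 / 21, or about 95%.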

ADDING TIME TO THE EQUATION

HTML SOURCE CODE

CSV TABLE

DATA PREPROCESSING (PYTHON, EXCEL, R)

DATA VISUALIZATION (R) - GAME 1

DATA VISUALIZATION (R) - GAME 2

DATA VISUALIZATION (R) - GAME 3

DATA VISUALIZATION (R) - GAME 4

DATA VISUALIZATION (R) - GAME 5

DATA VISUALIZATION (R) - GAME 6

DATA VISUALIZATION (R) - GAME 7

GAME 7 IN BROADCAST TIME (START @ 8PM)

REDDIT.COM/R/NBA GAMETHREAD COMMENT DENSITY

COMMENT DENSITY: 10:28:20 – 10:31:40 ET
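Comment density over a window like this one can be approximated by bucketing comment timestamps into fixed-width bins. The sketch below assumes timestamps are epoch seconds (PRAW exposes a comment's creation time as `created_utc`); the 10-second bin width and the function names are illustrative choices, not taken from the slides.

```python
from collections import Counter


def in_window(timestamps, start, end):
    """Keep only the timestamps that fall inside [start, end)."""
    return [ts for ts in timestamps if start <= ts < end]


def density(timestamps, bin_seconds=10):
    """Count comments per time bin; each key is the bin's start time in epoch seconds."""
    return Counter(int(ts) - int(ts) % bin_seconds for ts in timestamps)
```

Plotting the bin counts against bin start times gives a density curve, with spikes where many comments arrive at once.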

WOW

I think i speak for the free world when I say: go not GSW

I LOVE YOU BRON BRON

Hollllly s***

HOLY s*** LEBRON

Barnes scarred of the moment

Every time KLove bricks a shot, an angel gets its wings.

Im watching a really laggy stream bro and im behind

Holy F*** Lebron

NO REGARD FOR HUMAN LIFE

HOLY F***ING s***

NAAAAAAAH GET THE F*** OUT!

OH MY GOD

HILY DUCK

HOLY s*** THAT BLOCK

Omfg.....

DAE RIGGED

Where is the Love? The Love. The Loooove....

Holy s*** this game.

HOLY s***

Why is my heart pounding!?

OH s*** ITS DAT BRON

JAMES!!

OH MY GOD

WOW

lebron!!!

HOW THE F***

WOWOW

this defense is so sexy

Can anyone hit a shot