DATA MINING ON /R/NBA
ALEX CHENGELIS AND ANDREW YU
INTRODUCTION
• Raw data processing using an API
• Data processing and storage
• Comment heat maps
• Comment scores based on game action
• Word counting
• Naïve Bayes classifier
TECHNOLOGIES USED
• Python
• NLTK
• PRAW
• Tableau
• CSV and a little Excel
DATA PREPROCESSING
• Gather data using PRAW
• Create an agent for use with Reddit’s API
• Gather URLs to cycle through
• Write the comment, flair, and score to a CSV file
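The steps above can be sketched roughly as follows. This is a minimal sketch, not the authors' script: it assumes a current PRAW release (4 or later), and the credentials, user agent string, output file name, and the helper names `comment_rows`, `write_comments`, `fetch_thread`, and `main` are all placeholders introduced for illustration.

```python
import csv


def comment_rows(comments):
    """Yield (comment text, flair, score) for each comment-like object."""
    for c in comments:
        # author_flair_text is None when the commenter has no flair set.
        yield (c.body, c.author_flair_text or "", c.score)


def write_comments(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["comment", "flair", "score"])
        writer.writerows(rows)


def fetch_thread(reddit, url):
    """Return every comment of one thread as a flat list."""
    submission = reddit.submission(url=url)
    # Expand the "load more comments" stubs so only real comments remain.
    submission.comments.replace_more(limit=None)
    return submission.comments.list()


def main():
    import praw  # imported here so the helpers above work without PRAW

    # Placeholder credentials: register an app with Reddit to get real ones.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="nba-comment-miner (placeholder agent string)",
    )
    thread_urls = []  # fill in the game-thread URLs to cycle through
    comments = []
    for url in thread_urls:
        comments.extend(fetch_thread(reddit, url))
    write_comments(comment_rows(comments), "comments.csv")
```

Calling `main()` runs the whole pipeline. The CSV helpers are kept separate from the network code so they can be checked with stand-in objects before any API call is made.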
WHAT OUR DATA LOOKS LIKE
VISUALIZATION OF COMMENTS
Team          City              State           Count  Score    Avg
Lakers        Los Angeles       California        461   7982  17.31
Hornets       Charlotte         North Carolina    124   3293  26.56
Celtics       Boston            Massachusetts     337   9083  26.95
Nuggets       Denver            Colorado           98   3175  32.40
Nets          Brooklyn          New York           65   6062  93.26
Bucks         Milwaukee         Wisconsin          96   1034  10.77
Pelicans      New Orleans       Louisiana          66    222   3.36
Bulls         Chicago           Illinois          320   7364  23.01
NBA           -                 -                 170   2077  12.22
Warriors      Oakland           California        310   6312  20.36
Pistons       Detroit           Michigan          110   1859  16.90
76ers         Philadelphia      Pennsylvania      153   4922  32.17
Hawks         Atlanta           Georgia           104   4071  39.14
Suns          Phoenix           Arizona           107    295   2.76
Huskies       Hartford          Connecticut        10     52   5.20
Grizzlies     Memphis           Tennessee         113    408   3.61
Wizards       Washington, D.C.  -                 123   1611  13.10
West          -                 -                  19    252  13.26
Mavericks     Dallas            Texas             100   3601  36.01
Heat          Miami             Florida           282  10087  35.77
Rockets       Houston           Texas             212   4929  23.25
Raptors       Toronto           -                 362   9086  25.10
Kings         Sacramento        California         99   1811  18.29
Supersonics   Seattle           Washington        126   4173  33.12
Pacers        Indianapolis      Indiana            61    147   2.41
USA           -                 -                  10     22   2.20
Blazers       Portland          Oregon            157   2471  15.74
Thunder       Oklahoma City     Oklahoma          276  11555  41.87
Clippers      Los Angeles       California        138   3935  28.51
Cavaliers     Cleveland         Ohio              960  15608  16.26
Spurs         San Antonio       Texas             310   3705  11.95
Timberwolves  Minneapolis       Minnesota         168   6164  36.69
Knicks        New York          New York          324   4403  13.59
East          -                 -                  13    112   8.62
Bandwagon     -                 -                 227   6568  28.93
Jazz          Salt Lake City    Utah               46   1649  35.85
Magic         Orlando           Florida            66    426   6.45
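The Count, Score, and Avg columns above are a per-flair aggregation: Avg is Score divided by Count (for the Lakers, 7982 / 461 ≈ 17.31). The slides list Tableau and Excel among the tools, so the Python below is only an equivalent sketch. The function name `summarize_by_flair` and the (flair, score) input shape are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_by_flair(rows):
    """Aggregate (flair, score) pairs into {flair: (count, total, average)}."""
    totals = defaultdict(lambda: [0, 0])  # flair -> [comment count, score sum]
    for flair, score in rows:
        totals[flair][0] += 1
        totals[flair][1] += score
    return {
        flair: (count, total, round(total / count, 2))
        for flair, (count, total) in totals.items()
    }
```

Feeding it the flair and score columns of the CSV file would produce one summary row per team flair.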
GAME 1 COMMENTS
GAME 1 COMMENT SCORES
GAME 2 COMMENTS
GAME 2 COMMENT SCORES
GAME 3 COMMENTS
GAME 3 COMMENT SCORES
GAME 4 COMMENTS
GAME 4 COMMENT SCORES
GAME 5 COMMENTS
GAME 5 COMMENT SCORES
GAME 6 COMMENTS
GAME 6 COMMENT SCORES
GAME 7 – CLEVELAND CHAMPS
GAME 7 – CLEVELAND CHAMPS
USING TIME VARIANT
DOING SOME TEXT MINING
WHAT WE DID WITH WORDS
• Tried an inverted index but ran into some problems.
• More than 50,000 comments
• Took the easier route: a term-frequency count that ignores the 100 most used English words.
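A term-frequency count of this kind could look like the sketch below. The tokenizing rule and the short `STOP_WORDS` set are assumptions: the slides do not say which list of 100 common English words was used, so the set here is only a small stand-in.

```python
import re
from collections import Counter

# Stand-in for the 100 most used English words (the real list is not given).
STOP_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}


def term_frequency(comments, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across all comments."""
    counts = Counter()
    for text in comments:
        # Lowercase, then keep runs of letters, digits, and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts
```

`term_frequency(comments).most_common(40)` would then give a ranked list in the same shape as the Word/Count table that follows.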
Word Count
game 728
lebron 641
just 402
him 338
warriors 305
cavs 303
curry 287
com 254
team 253
love 236
3 222
he's 222
finals 218
fuck 214
had 205
even 203
nba 203
think 199
shit 197
it's 195
win 190
got 188
i'm 184
best 184
http 176
's 174
don't 173
7 172
series 171
cleveland 166
fucking 165
did 163
kyrie 163
good 162
after 158
back 158
player 157
ever 157
draymond 155
last 153
too 153
CLEVELAND WINS WORD CLOUD
GOLDEN STATE WINS WORD CLOUD
CAN YOU DETERMINE WHO WON BASED ON A COMMENT?
NAÏVE BAYES CLASSIFIER - BASED ON GUIDE BY ANDY BROMBERG
HTTP://ANDYBROMBERG.COM/SENTIMENT-ANALYSIS-PYTHON/
HOW WE BUILT THE NAÏVE BAYES CLASSIFIER
• Used the same Cleveland Wins and Golden State Wins text files.
• A lot like negative and positive sentiment analysis, but with wins.
• Take ¾ of the comments for training and ¼ for testing
• Strip all punctuation and escape characters
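The cleaning and the ¾ to ¼ split can be sketched as follows. The function names are illustrative, and shuffling before the split is an assumption; the slides do not say whether the comments were shuffled or simply cut in file order.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean(comment):
    """Drop punctuation, lowercase, and collapse newlines/tabs into spaces."""
    return " ".join(comment.translate(_STRIP_PUNCTUATION).lower().split())


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle reproducibly, then cut into training and testing parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]
```

Applying `split_train_test` separately to the Cleveland-wins and Golden-State-wins comments keeps both labels represented in the training and testing sets.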
CONT.
• We call the classifier that is included with NLTK, initialize the reference and test sets, and populate them.
• Before this, we created a function that uses a chi-square test to score each word.
• Finally, we use the classifier to make predictions.
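The chi-square scoring step can be illustrated in plain Python. This is a sketch under assumptions, not the authors' function: it scores each word with the standard chi-square statistic on a 2x2 table (this word vs. all other words, Cleveland-win vs. Golden-State-win comments), and the names `chi_square`, `score_words`, `best_words`, and `features` are invented for the example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def score_words(cle_words, gsw_words):
    """Score every word by how unevenly it is spread over the two classes."""
    cle, gsw = Counter(cle_words), Counter(gsw_words)
    n_cle, n_gsw = sum(cle.values()), sum(gsw.values())
    scores = {}
    for word in set(cle) | set(gsw):
        a, b = cle[word], gsw[word]  # occurrences of this word per class
        scores[word] = chi_square(a, b, n_cle - a, n_gsw - b)
    return scores


def best_words(scores, n):
    """Return the n highest-scoring words as a set."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {word for word, _ in ranked[:n]}


def features(words, keep):
    """Build a feature dict containing only the selected words."""
    return {word: True for word in words if word in keep}
```

A list of `(features(words, keep), label)` pairs is the input shape that `nltk.NaiveBayesClassifier.train` accepts, so the "N best" rows in the results correspond to different values of `n` in `best_words`.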
RESULTS
Features     Accuracy
All words    57.713%
10 best      55.771%
100 best     60.302%
1000 best    66.235%
10000 best   60.949%
15000 best   58.360%
INTERESTING RESULTS
• Shaun Livingston
• Bench player for the Warriors
• If his name is in a comment:
• 95.28% chance that the Warriors won
INTERESTING RESULTS
• Harrison Barnes
• Part-time starter, part-time bench player, full-time punching bag
• If his name is in a comment:
• 94.68% chance that Cleveland won
INTERESTING RESULTS
• Kyrie and LeBron
• In Game 5 both scored 41 points
• If 41 is in a comment:
• 93.24% chance that Cleveland won
MOST TELLING WORDS FOR BOTH TEAMS
Warriors Most Useful
Word      Chance
Shaun     95.28%
fired     92.91%
range     92.00%
Thunder   91.80%
healthy   91.67%
talent    90.74%
splash    89.25%
Cleveland Most Useful
Word      Chance
Harrison  94.48%
41'       93.24%
Sunday    92.37%
tweet     90.74%
road      89.69%
calls     90.29%
mad       90.29%
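The slides do not say how the Chance column was computed. One plausible reading, offered only as an assumption, is the share of comments containing a word that come from each outcome. The function `win_share` below is a hypothetical illustration of that reading, not the authors' calculation or an NLTK call.

```python
def win_share(word, cle_comments, gsw_comments):
    """Among comments containing `word`, return (Cleveland share, Warriors share).

    Returns None when the word appears in no comment at all.
    """
    word = word.lower()
    in_cle = sum(word in c.lower().split() for c in cle_comments)
    in_gsw = sum(word in c.lower().split() for c in gsw_comments)
    total = in_cle + in_gsw
    if total == 0:
        return None
    return in_cle / total, in_gsw / total
```

If the two comment files differ a lot in size, raw shares like these lean toward the larger file, so counts would need to be normalized per class before comparing words.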
ADDING TIME TO THE EQUATION
HTML SOURCE CODE
CSV TABLE
DATA PREPROCESSING (PYTHON, EXCEL, R)
DATA VISUALIZATION (R) - GAME 1
DATA VISUALIZATION (R) - GAME 2
DATA VISUALIZATION (R) - GAME 3
DATA VISUALIZATION (R) - GAME 4
DATA VISUALIZATION (R) - GAME 5
DATA VISUALIZATION (R) - GAME 6
DATA VISUALIZATION (R) - GAME 7
GAME 7 IN BROADCAST TIME (START @ 8PM)
REDDIT.COM/R/NBA GAMETHREAD COMMENT DENSITY
COMMENT DENSITY: 10:28:20 – 10:31:40 ET
WOW
I think i speak for the free world when I say: go not GSW
I LOVE YOU BRON BRON
Hollllly s***
HOLY s*** LEBRON
Barnes scarred of the moment
Every time KLove bricks a shot, an angel gets its wings.
Im watching a really laggy stream bro and im behind
Holy F*** Lebron
NO REGARD FOR HUMAN LIFE
HOLY F***ING s***
NAAAAAAAH GET THE F*** OUT!
OH MY GOD
HILY DUCK
HOLY s*** THAT BLOCK
Omfg.....
DAE RIGGED
Where is the Love? The Love. The Loooove....
Holy s*** this game.
HOLY s***
Why is my heart pounding!?
OH s*** ITS DAT BRON
JAMES!!
OH MY GOD
WOW
lebron!!!
HOW THE F***
WOWOW
this defense is so sexy
Can anyone hit a shot
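Comment density for a window like the one above comes down to counting comments per fixed time bin. The slides credit R for these plots, so the Python below is only an equivalent sketch. The 20-second bin width and the function name `comment_density` are assumptions, and the timestamps are expected as `datetime` objects taken from each comment's creation time.

```python
from collections import Counter
from datetime import timedelta


def comment_density(timestamps, start, bin_seconds=20):
    """Count comments per time bin, keyed by the start time of each bin."""
    counts = Counter()
    for ts in timestamps:
        offset = int((ts - start).total_seconds())
        if offset >= 0:  # ignore anything posted before the window opens
            counts[offset // bin_seconds] += 1
    return {
        start + timedelta(seconds=index * bin_seconds): n
        for index, n in sorted(counts.items())
    }
```

Plotting the resulting counts against bin start times gives a density curve in which spikes mark moments such as the block the comments above are reacting to.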