scattertext: a tool for visualizing differences in language
![Page 1: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/1.jpg)
1
Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler
Scattertext: A Tool for Visualizing Differences in Language
![Page 2: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/2.jpg)
2
Word frequency
• Women and men tend to use different terms on Facebook.
• As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the presidential debate.
• Reveal differences in
  • content,
  • perceived strengths and weaknesses,
  • communication style
• These are often obvious after being surfaced
![Page 3: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/3.jpg)
3
Outline
• Previous work
  • Ways of visualizing word association
• Scattertext
  • Open-source Python/D3 framework for visualizing these differences
  • Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell cars.
  • We’re hiring senior data scientists + devs in Austin and Seattle.
![Page 4: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/4.jpg)
4
OKCupid: an online dating site
hobos
almond butter
100 Years of Solitude
Bikram yoga
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Which words and phrases statistically distinguish ethnic groups and genders?
![Page 5: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/5.jpg)
5
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
High distance: white men ignore k-pop
Low distance: white men disproportionately mention Phish
The smaller the distance from the top left, the higher the association with white men.
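The distance-from-the-corner idea can be sketched in a few lines. This is only an illustration, not Rudder's actual procedure: it assumes both axes are frequency ranks scaled to [0, 1], with x the rank among everyone else and y the rank within the group, so the top-left corner is (0, 1). The function names and the term counts are made up for the example.

```python
import math

def percentile_ranks(freqs):
    """Scale each term's frequency rank to [0, 1]; the most frequent term gets 1."""
    ordered = sorted(freqs, key=lambda term: (freqs[term], term))
    denom = max(len(ordered) - 1, 1)
    return {term: i / denom for i, term in enumerate(ordered)}

def distance_from_top_left(group_freqs, other_freqs):
    """Euclidean distance of each shared term from the corner (x=0, y=1)."""
    terms = set(group_freqs) & set(other_freqs)
    x = percentile_ranks({t: other_freqs[t] for t in terms})  # everyone else
    y = percentile_ranks({t: group_freqs[t] for t in terms})  # the group
    return {t: math.hypot(x[t], y[t] - 1.0) for t in terms}

# Toy counts, invented for illustration only.
group = {"phish": 90, "k-pop": 1, "pizza": 50, "movies": 60}
others = {"phish": 2, "k-pop": 80, "pizza": 55, "movies": 70}

distances = distance_from_top_left(group, others)
ranked = sorted(distances, key=distances.get)  # most group-associated first
```

With these toy counts, the term the group uses most and everyone else uses least lands on the corner itself, and the term with the opposite pattern lands farthest away.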
![Page 6: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/6.jpg)
6
Source: Christian Rudder. Dataclysm. 2014.
my blue eyes
![Page 7: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/7.jpg)
7
Scattertext: Democrats vs Republicans: 2012 Convention Speeches
![Page 8: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/8.jpg)
8
Word Use Reflecting Gender and Personality in Facebook Statuses
• Objective:
  • Find words, phrases, and topics that correlate with
    • gender, and
    • Big 5 personality type
• Data source:
  • My Personality App
  • 75k voluntary participants in Facebook-based survey, >300 million words
  • Agreed to give researchers access to statuses.
• Scoring algorithm
  • Linear regression weights, 2000 LDA topics.
Lyle Ungar, 2013 AAAI Tutorial
Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.
![Page 9: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/9.jpg)
9
Lyle Ungar, 2013 AAAI Tutorial
The good:
• Word clouds force you to hunt for the most impactful terms
• You end up examining the long tail in the process
• Compactly represent a lot of phrases and topics
![Page 10: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/10.jpg)
10
Lyle Ungar, 2013 AAAI Tutorial
The bad:
• “Mullets of the Internet” --Jeffrey Zeldman, 2005
• Longer phrases are more prominent.
• Ranking is unclear
• Does size indicate higher frequency?
![Page 11: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/11.jpg)
11
Lyle Ungar, 2013 AAAI Tutorial
![Page 12: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/12.jpg)
12
Lyle Ungar, 2013 AAAI Tutorial
![Page 13: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/13.jpg)
13
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
NYT: 2012 Political Convention Word Use by Party
![Page 14: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/14.jpg)
14
Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
![Page 15: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/15.jpg)
15
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior

$$\log\frac{\#(w,A)}{|A|-\#(w,A)} - \log\frac{\#(w,B)}{|B|-\#(w,B)}$$
Log-odds for word w, categories A, B

$$\log\frac{\#(w,A)+\#(w,C)}{|A|+|C|-\#(w,A)-\#(w,C)} - \ldots$$
Log-odds w/ Dirichlet prior, given background corpus C
• Difference in z-score accounts for variation in word frequencies.
• Words with differences < 1.96 are greyed out.
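A minimal sketch of this scoring, assuming raw background-corpus counts are used directly as the Dirichlet prior (as in the formula above) and using the approximate variance 1/(#(w,A)+#(w,C)) + 1/(#(w,B)+#(w,C)) from Monroe et al. The function name and the toy counts are illustrative, not taken from the talk or from Scattertext.

```python
import math

def log_odds_z_scores(counts_a, counts_b, counts_c):
    """z-scores of the A-vs-B log-odds ratio, with background counts C as the prior."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n_c = sum(counts_c.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        prior = counts_c.get(word, 0)
        if prior == 0:
            # No prior mass: a zero count in A or B would give log(0),
            # so this sketch simply skips such words.
            continue
        y_a = counts_a.get(word, 0) + prior
        y_b = counts_b.get(word, 0) + prior
        log_odds_a = math.log(y_a / (n_a + n_c - y_a))
        log_odds_b = math.log(y_b / (n_b + n_c - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Toy counts, invented for illustration only.
counts_a = {"auto": 5, "tax": 1, "the": 10}
counts_b = {"auto": 1, "tax": 5, "the": 10}
counts_c = {"auto": 2, "tax": 2, "the": 50}

scores = log_odds_z_scores(counts_a, counts_b, counts_c)
# Terms whose |z| falls below 1.96 would be the ones greyed out on the chart.
significant = {w for w, z in scores.items() if abs(z) >= 1.96}
```

Positive scores lean toward category A and negative scores toward B; a word used equally often in two equally sized categories scores zero.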
![Page 16: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/16.jpg)
16
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
• Pros:
  • Popular among major CL researchers (3rd edition of J+M)
  • Favors words which appear less frequently in the background corpus.
  • Natural linear word listing
• Cons:
  • You have to pick a representative, large background corpus.
  • If the corpus is small, divide by 0 issue
  • Probably only practical for unigrams
  • Inefficient use of space on chart
![Page 17: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/17.jpg)
17
Repo: https://github.com/JasonKessler/scattertext
$ pip install scattertext
Why the plots look the way they do: http://bit.ly/scattertextdevelopment
Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates
Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
Hands-on Tutorial
![Page 18: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/18.jpg)
18
CDK Global: Finding Words that Sell Cars
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Conversion rate of everyone who read review: 15/20 = 75%
Median conversion rate: 22%
![Page 19: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/19.jpg)
19
CDK Global: Finding Words that Sell Cars
5 star review words: Love, Comfortable, Features, Solid, Amazing
<3 star review words: Transmission, Problem, Issue, Dealership, Times
![Page 20: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/20.jpg)
20
CDK Global: Finding Words that Sell Cars
5 star review words: Love, Comfortable, Features, Solid, Amazing
High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet
<3 star review words: Transmission, Problem, Issue, Dealership, Times
![Page 21: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/21.jpg)
21
CDK Global: Finding Words that Sell Cars
5 star review words: Love, Comfortable, Features, Solid, Amazing
High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet
<3 star review words: Transmission, Problem, Issue, Dealership, Times
Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]
![Page 22: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/22.jpg)
22
CDK Global: Finding Words that Sell Cars (SUV Specific)
5 star review words: Love, Comfortable, Features, Solid, Amazing
High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet
<3 star review words: Transmission, Problem, Issue, Dealership, Times
Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]
The worst thing you can say about an SUV may be:
I saved money and got all these amazing features!
![Page 23: Scattertext: A Tool for Visualizing Differences in Language](https://reader036.vdocuments.net/reader036/viewer/2022062822/587e6c6e1a28ab38068b48e1/html5/thumbnails/23.jpg)
23
Thank you.
[first].[last]@gmail.com
Please see https://github.com/JasonKessler/scattertext for more info on this project.
We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more.
https://jobs.cdkglobal.com/