discovertext: tools for text
DESCRIPTION
A talk prepared for a presentation at the Digital Methods Initiative 2014 Winter School held at the University of Amsterdam.
TRANSCRIPT
Tools for Text
Dr. Stuart Shulman
@stuartwshulman
Prepared for the Digital Methods Initiative Winter School 2014
University of Amsterdam
Acknowledgements
Richard Rogers
The National Science Foundation
Mark J. Hoy
Plan of Attack
A few high-level thoughts
Five pillars of text analytics
Getting started on DiscoverText
A small collaborative project
The twittersifter.com beta release
“A funny thing happened…”
A brief history of DiscoverText
A Master Metaphor: Sifter
An Open Source Kernel
Three Primary Tasks in CAT
Classification of Text
A 2,500-year-old problem
Plato argued it would be frustrating
It still is…
Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is “essential”
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”
“What should be avoided then, is the blind use of any method without a validation step.”
(Patent Pending)
Three Important Books
One Particularly Important Idea
Five Pillars of Text Analytics
Search
Filter
Code
Cluster
Classify
You can execute all five using DT
Pillar #1: Search
Search for Negative Cases
Defined Search (Multi-term)
Pillar #2: Filters
Remember this filter
Another Common Filter
Pillar #3: Human Coding
Keystroke Coding is Fast
Coding Off a List is Faster
Data Cleaning is Fundamental
Pillar #4: Clustering
Latent Dirichlet Allocation (LDA) Topic Models
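The deck does not describe how DiscoverText implements LDA. As a generic illustration of the technique only, here is a minimal collapsed Gibbs sampler in plain Python. The toy documents, the hyperparameters `alpha` and `beta`, and the function name `lda_gibbs` are all assumptions made for this sketch.

```python
import random


def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Sample a new topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + v * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus, invented for illustration.
docs = [
    "bridge traffic lanes closed bridge".split(),
    "traffic lanes bridge closed".split(),
    "mystery novel detective novel".split(),
    "detective mystery novel".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {k}: {top}")
```

Topic numbering is arbitrary and the result depends on the seed, so the top words per topic should be read by a human before any topic is given a name.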
LDA on the Christie Data
Data is still processing…
Pillar #5: Machine-Learning
Getting Started on DiscoverText
Use the Key in Your Email
Note the Peer Visibility Setting
Peers Make Collaboration Possible
Perhaps a Trending Topic
The Basics
Raw Data
Subsets of Data
Data Humans or Machines Classify
Grab Some Twitter Data
Create an Empty Archive
Log in to a Twitter Account
Enable via OAuth
Ready to Query Twitter
Use Operators to Refine Queries
Set the Frequency of Fetches
Data Will Start Flowing
Data List View
Best List Settings for Twitter Data
Use Buckets to Refine Lists
Search results go into buckets
“Defined search” is a multi-term filter
Metadata filters are also useful for buckets
Buckets focus the text analytic process
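The bucket idea can be sketched outside DiscoverText as a pair of filters over a list of items. This is an illustration, not the product's implementation: the `Item` class, the `lang` metadata field, the function names, and the sample texts are all hypothetical, and matching is a plain case-insensitive substring test.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One unit of text plus its metadata (field names are hypothetical)."""
    text: str
    meta: dict = field(default_factory=dict)


def defined_search(items, any_terms=(), none_terms=()):
    """Keep items with at least one of any_terms and none of none_terms."""
    bucket = []
    for item in items:
        lowered = item.text.lower()
        has_wanted = not any_terms or any(t.lower() in lowered for t in any_terms)
        has_unwanted = any(t.lower() in lowered for t in none_terms)
        if has_wanted and not has_unwanted:
            bucket.append(item)
    return bucket


def metadata_filter(items, key, value):
    """Keep items whose metadata field `key` equals `value`."""
    return [item for item in items if item.meta.get(key) == value]


# Invented sample archive.
archive = [
    Item("Christie speaks about the bridge", {"lang": "en"}),
    Item("Agatha Christie novels on sale", {"lang": "en"}),
    Item("Christie habla del puente", {"lang": "es"}),
]

# Multi-term search: mentions "christie" but not the novelist.
bucket = defined_search(archive, any_terms=["christie"], none_terms=["agatha", "novel"])
# Narrow the bucket further with a metadata filter.
english_bucket = metadata_filter(bucket, "lang", "en")
print([item.text for item in english_bucket])
```

Each step returns a smaller list, which is the sense in which buckets focus the later coding and classification work.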
Create a Dataset to Code
Any archive or bucket
Use the random sampling tool
Standard: All coders get all items
Triage: Coders get next uncoded item
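The difference between the two dataset styles can be shown in a few lines. This is a sketch under stated assumptions: the function and class names are hypothetical, the archive is a list of placeholder strings, and simple random sampling without replacement stands in for whatever the sampling tool does internally.

```python
import random


def sample_dataset(items, size, seed=None):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    pool = list(items)
    return rng.sample(pool, min(size, len(pool)))


def standard_assignment(dataset, coders):
    """Standard style: every coder receives every item."""
    return {coder: list(dataset) for coder in coders}


class TriageQueue:
    """Triage style: each request hands out the next uncoded item once."""

    def __init__(self, dataset):
        self._pending = list(dataset)

    def next_item(self):
        """Return the next uncoded item, or None when the queue is empty."""
        return self._pending.pop(0) if self._pending else None


archive = [f"tweet {i}" for i in range(100)]
dataset = sample_dataset(archive, size=10, seed=42)

standard = standard_assignment(dataset, ["ana", "ben"])
queue = TriageQueue(dataset)
first_for_ana = queue.next_item()
first_for_ben = queue.next_item()
```

Standard datasets produce overlapping codings, which is what reliability measurement needs. Triage spreads the work so that each item is coded once.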
Select from Three Coding Styles
Default: Mutually Exclusive Codes
Option 1: Non-Mutually Exclusive Codes
Option 2: User-Defined Codes (Grounded Theory)
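One way to picture the three styles is as validation rules on what a coder may select for a single item. The sketch below is hypothetical throughout: the `CodeScheme` class and the example codes are invented, and treating user-defined codes as non-mutually exclusive is an assumption the slides do not confirm.

```python
class CodeScheme:
    """A set of codes plus the rule for how many a coder may apply per item."""

    def __init__(self, codes, mutually_exclusive=True, user_defined=False):
        self.codes = set(codes)
        self.mutually_exclusive = mutually_exclusive
        self.user_defined = user_defined

    def apply(self, selected):
        """Validate one coder's selection for one item and return it as a set."""
        chosen = set(selected)
        if not chosen:
            raise ValueError("at least one code is required")
        if self.mutually_exclusive and len(chosen) > 1:
            raise ValueError("this scheme allows exactly one code per item")
        unknown = chosen - self.codes
        if unknown:
            if not self.user_defined:
                raise ValueError(f"unknown codes: {sorted(unknown)}")
            self.codes |= unknown  # grounded theory: new codes join the scheme
        return chosen


default = CodeScheme(["positive", "negative", "neutral"])
multi = CodeScheme(["humor", "news", "opinion"], mutually_exclusive=False)
grounded = CodeScheme([], mutually_exclusive=False, user_defined=True)

single = default.apply(["positive"])
several = multi.apply(["humor", "opinion"])
emergent = grounded.apply(["sarcasm"])
```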
Assign Peers to Code a Dataset
How many coders?
How many items need to be coded?
How many test or training sets?
There are no cookbook answers
Look at Inter-Rater Reliability
Highly reliable coding (easy tasks)
Unreliable coding (interesting tasks)
If humans can’t, neither can machines
Some tasks better suited for machines
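The slides do not say which reliability statistic DiscoverText reports. As one common choice for two coders, Cohen's kappa corrects raw agreement for the agreement expected by chance. The sketch below uses invented labels and a hypothetical function name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each coder's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codings for eight items.
coder_1 = ["on", "on", "off", "off", "on", "off", "on", "off"]
coder_2 = ["on", "on", "off", "on", "on", "off", "off", "off"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))  # 0.5: 75% raw agreement against 50% expected by chance
```

A low kappa on a task is a warning about the training data as much as about the coders, which is the point of "If humans can’t, neither can machines."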
Adjudication: The Secret Sauce
Expert review or consensus process
Invalidate false positives
Identify strong and weak coders
Exclude false positives from training sets
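The adjudication steps above can be sketched as two small functions: one scores each coder against the adjudicated decisions, the other keeps only confirmed items for training. The data layout, names, and labels are hypothetical, and "share of decisions upheld" is just one plausible way to separate strong from weak coders.

```python
def coder_validity(codings, adjudicated):
    """Share of each coder's decisions that the adjudicator upheld.

    codings: {coder: {item_id: label}}
    adjudicated: {item_id: final_label}
    """
    scores = {}
    for coder, decisions in codings.items():
        judged = [item for item in decisions if item in adjudicated]
        if not judged:
            continue
        upheld = sum(decisions[item] == adjudicated[item] for item in judged)
        scores[coder] = upheld / len(judged)
    return scores


def training_set(codings, adjudicated, target):
    """Item ids coded `target` by at least one coder and confirmed on review."""
    proposed = {
        item
        for decisions in codings.values()
        for item, label in decisions.items()
        if label == target
    }
    return sorted(item for item in proposed if adjudicated.get(item) == target)


# Invented codings for four items.
codings = {
    "ana": {1: "relevant", 2: "relevant", 3: "irrelevant", 4: "relevant"},
    "ben": {1: "relevant", 2: "irrelevant", 3: "relevant", 4: "relevant"},
}
adjudicated = {1: "relevant", 2: "relevant", 3: "irrelevant", 4: "irrelevant"}

scores = coder_validity(codings, adjudicated)
clean = training_set(codings, adjudicated, "relevant")
print(scores, clean)
```

Items 3 and 4 were coded "relevant" by at least one coder but invalidated on review, so they never reach the training set.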
Use Classification Scores as Filters
Iteration plays a critical role
Train, classify, filter
Repeat until the model is trusted
Each round weeds out false positives
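One round of train, classify, filter can be sketched with a small naive Bayes model in plain Python. Naive Bayes is swapped in here for illustration; the deck does not say which classifier DiscoverText uses. The function names, sample texts, class labels, and the 0.8 threshold are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}


def score(model, text, target):
    """Posterior probability that `text` belongs to class `target`."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    log_post = {}
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        logp = math.log(n_docs / total_docs)
        for word in tokenize(text):
            if word in model["vocab"]:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                logp += math.log((counts[word] + 1) / (total_words + vocab_size))
        log_post[label] = logp
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return math.exp(log_post[target] - top) / norm


def filter_by_score(model, items, target, threshold):
    """Keep items whose score for `target` meets the threshold."""
    return [t for t in items if score(model, t, target) >= threshold]


# Invented human-coded items and uncoded items.
labeled = [
    ("governor christie bridge scandal", "christie"),
    ("christie press conference new jersey", "christie"),
    ("agatha christie mystery novel", "other"),
    ("poirot mystery novel review", "other"),
]
unlabeled = [
    "christie bridge press conference",
    "new agatha christie novel",
    "governor of new jersey",
]

model = train(labeled)
likely = filter_by_score(model, unlabeled, "christie", threshold=0.8)

# Next round: a human reviews `likely`; confirmed items join the training
# data and the model is retrained. Here all of them are treated as confirmed.
labeled += [(text, "christie") for text in likely]
model = train(labeled)
```

In practice the review step between rounds is where false positives get removed, so the retraining line should only receive items a human has checked.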
Classifier Histograms: More Filtering
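A histogram of classification scores shows where the model is confident and where it is not, which helps in picking a filter threshold. The sketch below bins scores between 0 and 1 into equal-width bins. The scores are invented and the function name is hypothetical.

```python
def score_histogram(scores, bins=10):
    """Count scores in [0, 1] into equal-width bins; 1.0 falls in the last bin."""
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        index = min(int(s * bins), bins - 1)
        counts[index] += 1
    return counts


# Invented classification scores.
scores = [0.02, 0.05, 0.11, 0.48, 0.52, 0.91, 0.97, 0.99, 1.0]
hist = score_histogram(scores, bins=5)
for i, count in enumerate(hist):
    low, high = i / 5, (i + 1) / 5
    print(f"{low:.1f}-{high:.1f} {'#' * count}")
```

Items piled up near 0 or 1 are the easy cases. The middle bins hold the items most worth sending back to human coders.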
Track Your Progress
Running the Classifier
Filter by Classification
Filtered List >95% Not Chris Christie
http://beta.twittersifter.com