discovertext: tools for text

74
Tools for Text Dr. Stuart Shulman @stuartwshulman [email protected] Prepared for the Digital Methods Initiative Winter School 2014 University of Amsterdam 1

Upload: stuart-shulman

Post on 27-Jun-2015

928 views

Category:

Technology


1 download

DESCRIPTION

A talk prepared for a presentation at the Digital Methods Initiative 2014 Winter School held at the University of Amsterdam.

TRANSCRIPT

Page 1: DiscoverText: Tools for Text

Tools for TextDr. Stuart Shulman @stuartwshulman [email protected]

Prepared for the Digital Methods Initiative Winter School 2014

University of Amsterdam!1

Page 2: DiscoverText: Tools for Text

Acknowledgements

Richard RogersThe National Science Foundation

Mark J. Hoy

!2

Page 3: DiscoverText: Tools for Text

Plan of Attack

A few high level thoughts

Five pillars of text analytics

Getting started on DiscoverText

A small collaborative project

The twittersifter.com beta release!3

Page 4: DiscoverText: Tools for Text

“A funny thing happened…”

A brief history of DiscoverText

!

!4

Page 5: DiscoverText: Tools for Text

A Master Metaphor: Sifter

!5

Page 6: DiscoverText: Tools for Text

An Open Source Kernel

!6

Page 7: DiscoverText: Tools for Text

Three Primary Tasks in CAT

!7

Page 8: DiscoverText: Tools for Text

Classification of Text

A 2500 year-old problem

Plato argued it would be frustrating

It still is…

!8

Page 9: DiscoverText: Tools for Text

Grimmer & Stewart “Text as Data”Political Analysis (2013)Volume is a problem for scholars

Coders are expensive Groups struggle to accurately label text at scale

Validation of both humans and machines is “essential” Some models are easier to validate than others

All models are wrong Automated models enhance/amplify, but don’t replace humans

There is no one right way to do this “Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

!9

Page 10: DiscoverText: Tools for Text

!10

(Patent Pending)

Page 11: DiscoverText: Tools for Text

Three Important Books

!11

Page 12: DiscoverText: Tools for Text

One Particularly Important Idea

!12

Page 13: DiscoverText: Tools for Text

Five Pillars of Text Analytics

SearchFilterCode

ClusterClassify

You can execute all five using DT!13

Page 14: DiscoverText: Tools for Text

Pillar #1: Search

!14

Page 15: DiscoverText: Tools for Text

Search for Negative Cases

!15

Page 16: DiscoverText: Tools for Text

Defined Search (Multi-term)

!16

Page 17: DiscoverText: Tools for Text

Pillar #2: Filters

!17

Remember this filter

Page 18: DiscoverText: Tools for Text

Another Common Filter

!18

Page 19: DiscoverText: Tools for Text

!19

Page 20: DiscoverText: Tools for Text

Pillar#3: Human Coding

!20

Page 21: DiscoverText: Tools for Text

Keystroke Coding is Fast

!21

Page 22: DiscoverText: Tools for Text

Coding Off a List is Faster

!22

Page 23: DiscoverText: Tools for Text

Data Cleaning is Fundamental

!23

Page 24: DiscoverText: Tools for Text

Pillar #4: Clustering

!24

Page 25: DiscoverText: Tools for Text

!25

Page 26: DiscoverText: Tools for Text

Latent Dirichlet Allocation (LDA) Topic Models

!26

Page 27: DiscoverText: Tools for Text

LDA on the Christie Data

!27

Data is still processing…

Page 28: DiscoverText: Tools for Text

Pillar#5: Machine-Learning

!28

Page 29: DiscoverText: Tools for Text

Getting Started on DiscoverText

!29

Page 30: DiscoverText: Tools for Text

Use the Key in Your Email

!30

Page 31: DiscoverText: Tools for Text

Note the Peer Visibility Setting

!31

Page 32: DiscoverText: Tools for Text

Peers Make Collaboration Possible

!32

Page 33: DiscoverText: Tools for Text

!33

Page 34: DiscoverText: Tools for Text

!34

Page 35: DiscoverText: Tools for Text

!35

Page 36: DiscoverText: Tools for Text

Perhaps a Trending Topic

!36

Page 37: DiscoverText: Tools for Text

!37

Page 38: DiscoverText: Tools for Text

The Basics

!38

Raw Data

Subsets of Data

Data Humans or Machines Classify

Page 39: DiscoverText: Tools for Text

!39

Page 40: DiscoverText: Tools for Text

Grab Some Twitter Data

!40

Page 41: DiscoverText: Tools for Text

Create an Empty Archive

!41

Page 42: DiscoverText: Tools for Text

Login to a Twitter Account

!42

Page 43: DiscoverText: Tools for Text

Enable via OAuth

!43

Page 44: DiscoverText: Tools for Text

Ready to Query Twitter

!44

Page 45: DiscoverText: Tools for Text

Use Operators to Refine Queries

!45

Page 46: DiscoverText: Tools for Text

Set the Frequency of Fetches

!46

Page 47: DiscoverText: Tools for Text

Data Will Start Flowing

!47

Page 48: DiscoverText: Tools for Text

Data List View

!48

Page 49: DiscoverText: Tools for Text

Best List Settings for Twitter Data

!49

Page 50: DiscoverText: Tools for Text

Use Buckets to Refine Lists

Search results go into buckets

“Defined search” is a multi-term filter

Meta data filters also useful for buckets

Buckets focus the text analytic process!50

Page 51: DiscoverText: Tools for Text

!51

Page 52: DiscoverText: Tools for Text

Create a Dataset to Code

Any archive or bucket

Use the random sampling tool

Standard: All coders get all items

Triage: Coders get next uncoded item!52

Page 53: DiscoverText: Tools for Text

!53

Page 54: DiscoverText: Tools for Text

Select from Three Coding Styles

Default: Mutually Exclusive Codes

Option 1: Non-Mutually Exclusive Codes

Option 2: User-Defined Codes (Grounded Theory)

!54

Page 55: DiscoverText: Tools for Text

!55

Page 56: DiscoverText: Tools for Text

Assign Peers to Code a Dataset

How many coders?

How many items need to be coded?

How many test or training sets?

There are no cookbook answers!56

Page 57: DiscoverText: Tools for Text

Look at Inter-Rater Reliability

Highly reliable coding (easy tasks)

Unreliable coding (interesting tasks)

If humans can’t, neither can machines

Some tasks better suited for machines!57

Page 58: DiscoverText: Tools for Text

Adjudication: The Secret Sauce

Expert review or consensus process

Invalidate false positives

Identify strong and weak coders

Exclude false positives from training sets!58

Page 59: DiscoverText: Tools for Text

!59

Page 60: DiscoverText: Tools for Text

!60

Page 61: DiscoverText: Tools for Text

Use Classification Scores as Filters

Iteration plays a critical role

Train, classify, filter

Repeat until the model is trusted

Each round weeds out false positives!61

Page 62: DiscoverText: Tools for Text

Classifier Histograms: More Filtering

!62

Page 63: DiscoverText: Tools for Text

Track Your Progress

!63

Page 64: DiscoverText: Tools for Text

!64

Page 65: DiscoverText: Tools for Text
Page 66: DiscoverText: Tools for Text

!66

Page 67: DiscoverText: Tools for Text

Running the Classifier

!67

Page 68: DiscoverText: Tools for Text

!68

Page 69: DiscoverText: Tools for Text

!69

Filter by Classification

Page 70: DiscoverText: Tools for Text

Filtered List >95% Not Chris Christie

!70

Page 71: DiscoverText: Tools for Text

http://beta.twittersifter.com

Page 72: DiscoverText: Tools for Text
Page 73: DiscoverText: Tools for Text
Page 74: DiscoverText: Tools for Text

Thanks for Having Me!

Dr. Stuart Shulman @[email protected] discovertext.comtwittersifter.com

!74