discovertext: tools for text

Tools for Text

Dr. Stuart Shulman @stuartwshulman [email protected]

Prepared for the Digital Methods Initiative Winter School 2014

University of Amsterdam

Acknowledgements

Richard Rogers

The National Science Foundation

Mark J. Hoy

Plan of Attack

A few high-level thoughts

Five pillars of text analytics

Getting started on DiscoverText

A small collaborative project

The twittersifter.com beta release

“A funny thing happened…”

A brief history of DiscoverText

A Master Metaphor: Sifter

An Open Source Kernel

Three Primary Tasks in CAT

Classification of Text

A 2,500-year-old problem

Plato argued it would be frustrating

It still is…

Grimmer & Stewart, “Text as Data,” Political Analysis (2013)

Volume is a problem for scholars

Coders are expensive

Groups struggle to accurately label text at scale

Validation of both humans and machines is “essential”

Some models are easier to validate than others

All models are wrong

Automated models enhance/amplify, but don’t replace humans

There is no one right way to do this

“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

(Patent Pending)

Three Important Books

One Particularly Important Idea

Five Pillars of Text Analytics

Search

Filter

Code

Cluster

Classify

You can execute all five using DiscoverText (DT)

Pillar #1: Search

Search for Negative Cases

Defined Search (Multi-term)

Pillar #2: Filters

Remember this filter

Another Common Filter

Pillar #3: Human Coding

Keystroke Coding is Fast

Coding Off a List is Faster

Data Cleaning is Fundamental

Pillar #4: Clustering

Latent Dirichlet Allocation (LDA) Topic Models

LDA on the Christie Data

Data is still processing…

Pillar #5: Machine Learning

Getting Started on DiscoverText

Use the Key in Your Email

Note the Peer Visibility Setting

Peers Make Collaboration Possible

Perhaps a Trending Topic

The Basics

Raw Data

Subsets of Data

Data Humans or Machines Classify

Grab Some Twitter Data

Create an Empty Archive

Log in to a Twitter Account

Enable via OAuth

Ready to Query Twitter

Use Operators to Refine Queries
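
To illustrate what refining a query with operators can look like, here is a minimal sketch that assembles a query string from parts. It assumes the commonly documented Twitter search operators (OR between alternatives, parentheses for grouping, double quotes for exact phrases, a leading minus for exclusion, and from: for a single account). The helper name build_query and the example terms are hypothetical, and the operators DiscoverText itself accepts may differ.

```python
def build_query(any_of=(), phrases=(), exclude=(), from_user=None):
    """Assemble a search string from hypothetical query parts.

    Assumes standard Twitter search operators: OR between alternatives,
    double quotes for exact phrases, a leading minus to exclude a term,
    and from: to restrict results to one account.
    """
    parts = []
    if any_of:
        alternatives = " OR ".join(any_of)
        # Parenthesize only when there is more than one alternative.
        parts.append(f"({alternatives})" if len(any_of) > 1 else alternatives)
    parts.extend(f'"{p}"' for p in phrases)
    parts.extend(f"-{t}" for t in exclude)
    if from_user:
        parts.append(f"from:{from_user.lstrip('@')}")
    return " ".join(parts)


# Example terms are invented for illustration only.
query = build_query(any_of=["christie", "#christie"], exclude=["agatha"])
print(query)
```

Excluding a term at fetch time is a blunt instrument; the same refinement can be made later with filters and classifiers, once the data is in an archive.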

Set the Frequency of Fetches

Data Will Start Flowing

Data List View

Best List Settings for Twitter Data

Use Buckets to Refine Lists

Search results go into buckets

“Defined search” is a multi-term filter

Metadata filters are also useful for buckets

Buckets focus the text analytic process
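
The bucket idea can be sketched in a few lines of Python. This is not DiscoverText's implementation: the item fields, the function names defined_search and metadata_filter, and the sample texts are all assumptions made for illustration, and matching is reduced to a case-insensitive substring test.

```python
def defined_search(items, terms, match_all=False):
    """Return the items whose text contains the given terms.

    match_all=False keeps items containing ANY term; True requires ALL.
    Matching is a case-insensitive substring test, a simplification.
    """
    terms = [t.lower() for t in terms]
    combine = all if match_all else any
    return [
        item for item in items
        if combine(t in item["text"].lower() for t in terms)
    ]


def metadata_filter(items, field, value):
    """Keep items whose metadata field equals the given value."""
    return [item for item in items if item.get(field) == value]


# Hypothetical archive items; the fields are invented for illustration.
archive = [
    {"text": "Chris Christie holds a press conference", "lang": "en"},
    {"text": "Rereading Agatha Christie this winter", "lang": "en"},
    {"text": "Christie spreekt de pers toe", "lang": "nl"},
]

# A "bucket" here is simply the saved result of a search or filter.
bucket = defined_search(archive, ["chris", "press"], match_all=True)
bucket_en = metadata_filter(archive, "lang", "en")
print(len(bucket), len(bucket_en))
```

Because each bucket is just a smaller list, searches and filters can be chained, which is how a large archive gets narrowed to the items worth coding.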

Create a Dataset to Code

Any archive or bucket

Use the random sampling tool

Standard: All coders get all items

Triage: Coders get next uncoded item
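
A rough sketch of the difference between the two assignment modes, using only the Python standard library. The function names and item IDs are hypothetical, and triage is simplified to fixed turn-taking; in a live project, whichever coder is free would pull the next uncoded item.

```python
import random


def sample_dataset(items, size, seed=0):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(items, min(size, len(items)))


def assign_standard(dataset, coders):
    """Standard: every coder receives every item."""
    return {coder: list(dataset) for coder in coders}


def assign_triage(dataset, coders):
    """Triage: each item goes to one coder only.

    Simplification: coders take turns in a fixed order.
    """
    queues = {coder: [] for coder in coders}
    for i, item in enumerate(dataset):
        queues[coders[i % len(coders)]].append(item)
    return queues


bucket = [f"tweet-{n}" for n in range(100)]  # placeholder item IDs
dataset = sample_dataset(bucket, size=10)
standard = assign_standard(dataset, ["ana", "ben"])
triage = assign_triage(dataset, ["ana", "ben"])
print(len(standard["ana"]), len(triage["ana"]))
```

Standard assignment produces overlapping codes, which is what makes reliability checks possible; triage trades that overlap for throughput.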

Select from Three Coding Styles

Default: Mutually Exclusive Codes

Option 1: Non-Mutually Exclusive Codes

Option 2: User-Defined Codes (Grounded Theory)

Assign Peers to Code a Dataset

How many coders?

How many items need to be coded?

How many test or training sets?

There are no cookbook answers

Look at Inter-Rater Reliability

Highly reliable coding (easy tasks)

Unreliable coding (interesting tasks)

If humans can’t, neither can machines

Some tasks better suited for machines
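
One common way to put a number on inter-rater reliability is Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance. The slides do not say which statistic DiscoverText reports, so treat this as a sketch of the general idea; the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders on the same eight items.
coder_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
coder_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))  # 0.5
```

Here the coders agree on 6 of 8 items (0.75), but half of that agreement is expected by chance, so kappa comes out at 0.5. A value of 1.0 means perfect agreement and 0 means no better than chance.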

Adjudication: The Secret Sauce

Expert review or consensus process

Invalidate false positives

Identify strong and weak coders

Exclude false positives from training sets
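
The adjudication steps can be sketched as follows, assuming an expert has already recorded a final label for each reviewed item. The data layout and the function names coder_validity and training_set are hypothetical, not DiscoverText's API.

```python
def coder_validity(codings, adjudicated):
    """Share of each coder's codings that the adjudicator upheld."""
    upheld, total = {}, {}
    for item_id, coder, label in codings:
        if item_id not in adjudicated:
            continue  # skip items not yet reviewed
        total[coder] = total.get(coder, 0) + 1
        if label == adjudicated[item_id]:
            upheld[coder] = upheld.get(coder, 0) + 1
    return {coder: upheld.get(coder, 0) / total[coder] for coder in total}


def training_set(codings, adjudicated):
    """Keep only codings that match the adjudicated label, one per item."""
    kept = {}
    for item_id, _coder, label in codings:
        if adjudicated.get(item_id) == label:
            kept[item_id] = label
    return kept


# Hypothetical codings: (item ID, coder, label).
codings = [
    (1, "ana", "relevant"), (1, "ben", "relevant"),
    (2, "ana", "relevant"), (2, "ben", "not relevant"),
    (3, "ana", "not relevant"), (3, "ben", "not relevant"),
]
adjudicated = {1: "relevant", 2: "not relevant", 3: "not relevant"}

validity = coder_validity(codings, adjudicated)
clean = training_set(codings, adjudicated)
print(validity)
print(clean)
```

In this toy example the invalidated coding on item 2 never reaches the training set, and the per-coder validity scores show which coder's work the adjudicator upheld more often.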

Use Classification Scores as Filters

Iteration plays a critical role

Train, classify, filter

Repeat until the model is trusted

Each round weeds out false positives
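
One round of train, classify, filter might look like the sketch below. The slides do not name the classifier DiscoverText uses, so a small multinomial naive Bayes model with add-one smoothing stands in here; the training texts, labels, and the 0.7 threshold are invented for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    word_counts, label_counts = {}, Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def score(model, text):
    """Return {label: probability} for one text, with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        log_p = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                log_p += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = log_p
    # Normalize log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    exp = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}


def filter_by_score(model, texts, label, threshold):
    """Keep texts whose probability for `label` meets the threshold."""
    return [t for t in texts if score(model, t)[label] >= threshold]


# Hypothetical human-coded training items.
coded = [
    ("governor christie press conference", "Chris Christie"),
    ("christie bridge scandal new jersey", "Chris Christie"),
    ("agatha christie mystery novel", "Not Chris Christie"),
    ("christie novel poirot mystery", "Not Chris Christie"),
]
uncoded = ["christie bridge press", "new agatha christie novel", "christie"]

model = train(coded)
keep = filter_by_score(model, uncoded, "Chris Christie", threshold=0.7)
print(keep)
```

Items that clear the threshold can be spot-checked by human coders, and any false positives found there feed the next round of adjudication and retraining.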

Classifier Histograms: More Filtering

Track Your Progress

Running the Classifier

Filter by Classification

Filtered List >95% Not Chris Christie

http://beta.twittersifter.com

Thanks for Having Me!

Dr. Stuart Shulman @stuartwshulman [email protected] discovertext.com twittersifter.com
