Text Mining Three Cases

Upload: felicity-knight

Post on 11-Jan-2016


Page 1: Text Mining Three Cases. 2 Outline Federalist Papers SVDPDF VAERS

Text Mining

Three Cases


Outline

Federalist Papers

SVDPDF

VAERS

http://zlin.ba.ttu.edu/sassrc.rar

http://zlin.ba.ttu.edu/DMTM9.rar


Federalist Papers


Who wrote The Federalist Papers?

Hamilton

STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.

Madison


About the Data

Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at

http://www.yale.edu/lawweb/avalon/federal/fed.htm, or http://www.constitution.org/fed/federa00.htm.

Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)


Corpus: The Federalist Papers corpus is a collection of 85 essays.

Terms and Tokens: The Federalist Papers taken as a whole contain over 190,000 tokens and approximately 8,800 unique tokens.
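The difference between tokens and unique tokens can be illustrated with a minimal Python sketch. The tokenizer (lowercased runs of letters and apostrophes) and the sample sentence are assumptions made for illustration; SAS Text Miner parses text differently, so this sketch would not reproduce the counts quoted above.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

# Made-up sample sentence, not a quotation from the essays.
sample = "The people of the state and the people of the union"

tokens = tokenize(sample)
term_counts = Counter(tokens)   # one entry per unique token

print(len(tokens))              # total number of tokens
print(len(term_counts))         # number of unique tokens
print(term_counts.most_common(3))
```

Every occurrence of a word counts as a token, while each distinct word counts only once as a unique token, which is why the second number is much smaller for a long corpus.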


The Federalist Papers Diagram

EM Clustering

Logistic Regression

TARGET: 1 = Madison; 0 = Hamilton; missing = unknown


Federalist Papers Clusters

Cluster     Hamilton   Madison   Unknown
Cluster 1       24         1         0
Cluster 2       27        14        11

These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
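As a quick check on the cluster counts above (24, 1, 0 and 27, 14, 11), the following Python sketch totals the essays per author and computes the share of each author's essays that landed in Cluster 2. It only re-tabulates the numbers on the slide and does not perform EM clustering; the variable and function names are illustrative.

```python
# Counts copied from the cluster slide.
clusters = {
    "Cluster 1": {"Hamilton": 24, "Madison": 1, "Unknown": 0},
    "Cluster 2": {"Hamilton": 27, "Madison": 14, "Unknown": 11},
}

authors = ["Hamilton", "Madison", "Unknown"]

# Column totals: how many essays each author label has overall.
totals = {a: sum(row[a] for row in clusters.values()) for a in authors}

def share(cluster, author):
    """Fraction of an author's essays that fall in the given cluster."""
    return clusters[cluster][author] / totals[author]

for a in authors:
    print(a, totals[a], round(share("Cluster 2", a), 3))
```

The totals agree with the attribution counts given earlier (51 Hamilton, 15 Madison, 11 unknown), and Cluster 2 holds all of the unknown essays but only about half of Hamilton's.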


Logistic Regression Classification of The Federalist Papers


Text Mining Results

Text mining matches the results of Mosteller and Wallace.

The predictions in the second column from the right show the strength of the decision.

The record with a predicted value of 0.709119 corresponds to essay 56, so the model thinks that this essay has the weakest association with Madison of all of the unknown essays.

Essay 63, with a predicted value of 0.999691, has the strongest association with Madison.

All of the essays in question have a stronger association with Madison than Hamilton, hence the classification into the Madison category.
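The decision rule described here can be sketched in Python. The two predicted values are the ones quoted in the text; the 0.5 cutoff is an assumption (the slides do not state one), and the log-odds conversion is included only to show how far apart the two predictions sit on the scale a logistic regression works in.

```python
import math

# Predicted probabilities of Madison authorship quoted in the text.
predicted = {56: 0.709119, 63: 0.999691}

CUTOFF = 0.5  # assumed cutoff; not stated in the slides

def classify(p, cutoff=CUTOFF):
    """Assign the Madison label when the predicted probability exceeds the cutoff."""
    return "Madison" if p > cutoff else "Hamilton"

def log_odds(p):
    """Logit of p, the linear-predictor scale of a logistic regression."""
    return math.log(p / (1.0 - p))

for essay, p in sorted(predicted.items()):
    print(essay, classify(p), round(log_odds(p), 2))
```

Both essays fall on the Madison side of the cutoff, but essay 56 sits much closer to it than essay 63 does.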


Characteristics of a Document

A document consists of letters, words, sentences, paragraphs, punctuation, and possibly structural items such as chapters and sections.

The elements of a document can be counted (for example, the number of characters, words, or sentences) or summarized (for example, mean, median, or kurtosis).
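Counting and summarizing can be sketched with the Python standard library. The splitting rules (words as runs of letters, sentences ending at ".", "!" or "?") and the sample text are simplifying assumptions, and the kurtosis shown is the population excess kurtosis, which may differ from the definition a given statistics package uses.

```python
import re
import statistics

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

def describe(text):
    """Count basic elements of a document and summarize word lengths."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(w) for w in words]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": statistics.mean(lengths),
        "median_word_length": statistics.median(lengths),
        "kurtosis_word_length": excess_kurtosis(lengths),
    }

# Made-up sample document.
doc = "We the people. We meet again! Do we agree?"
summary = describe(doc)
print(summary)
```

Each document then becomes a short row of numbers, which is what makes two documents comparable.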


Comparing Two Documents

         Word Size   Sentence Size   Paragraph Size   Word Freq   Sentence Freq   Paragraph Freq
Doc 1
Doc 2


Contingency Table Comparing Essay 1 to Essay 37
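The contingency table itself did not survive in this transcript. As a rough illustration of the idea, the Python sketch below builds a 2 x k table of word counts for two documents and computes Pearson's chi-square statistic. The word list and the counts are made-up placeholders, not the values for Essay 1 and Essay 37, and the use of the chi-square statistic is an assumption about how such a table might be analyzed.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if word usage were independent of the document.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

words = ["upon", "whilst", "there"]   # hypothetical marker words
doc_a = [30, 0, 20]                   # placeholder counts, not real data
doc_b = [5, 15, 10]                   # placeholder counts, not real data

stat = chi_square([doc_a, doc_b])
dof = (2 - 1) * (len(words) - 1)      # (rows - 1) * (columns - 1)
print(round(stat, 2), dof)
```

A large statistic relative to the degrees of freedom suggests the two documents use the listed words at different rates.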



Text Miner Static Analysis


Text Miner Interactive Analysis


SVDPDF


SAS Education Course Descriptions

The data represents a collection of 130 course summaries obtained from http://support.sas.com.

The original 130 files were PDF files stored in one location on an HTTP server.

A SAS DATA step was used to read the files from the server and write them to a local directory.

The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set.

The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.


Static Analysis with SAS Text Miner


Text Miner Settings


Interactive Results


Applications of Concept Lists

A company can have specific conceptual goals. For example, are customers concerned about
– brand integrity
– quality
– price
– features, styles, and selection
– availability
– customer support?


Market Research for Quality

What terms are most similar to the term “quality”?
– Find Similar
– Filter

What documents address quality?
– Filter on synonyms and similar terms
– Find similar documents

What secondary concepts reflect information on quality?
– SVD coefficients
– Concept links
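One common way to ask which terms are most similar to a given term is cosine similarity between term vectors. The Python sketch below applies it to a toy term-by-document count matrix. The terms and counts are invented, and the sketch skips the SVD step entirely; how SAS Text Miner's Find Similar computes similarity is not stated in the slides, so this is an illustration of the general idea rather than of that tool.

```python
import math

# Toy term-by-document count matrix; terms and counts are invented.
term_doc = {
    "quality":     [3, 0, 2, 0],
    "reliability": [2, 0, 1, 0],
    "price":       [0, 4, 0, 3],
}

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(term, matrix):
    """Rank all other terms by cosine similarity to the given term."""
    scores = {t: cosine(matrix[term], vec)
              for t, vec in matrix.items() if t != term}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(most_similar("quality", term_doc))
```

Terms that occur in the same documents score close to 1, while terms that never co-occur score 0.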


VAERS


VAERS

VAERS (the Vaccine Adverse Event Reporting System) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines.

No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious.

VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events.

Department of Health and Human Services, Public Health Service


VAERS

Data was obtained from http://www.vaers.org/.

Data was downloaded in September 2002 as a series of CSV files.

A SAS DATA step was used to read and process the data.

The original data had 131,464 observations and 59 variables.

Cleaning and screening reduced the data set to 48,523 observations and 44 variables.

The data set has 6 text variables. The original data had 21, but 15 were sparsely populated.
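Screening out sparsely populated text variables can be sketched in Python with the standard csv module. The original work used a SAS DATA step, so this is a stand-in rather than the method actually used. The sample rows, the column names other than SYMPTOM_TEXT, and the 50% fill threshold are all assumptions made for illustration.

```python
import csv
import io

# Tiny made-up CSV standing in for a VAERS extract; NOTE_A and NOTE_B
# are hypothetical column names.
raw = """ID,SYMPTOM_TEXT,NOTE_A,NOTE_B
1,fever and rash,,
2,headache,,seen by physician
3,swelling at injection site,,
4,fever,,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

def fill_rate(rows, column):
    """Fraction of rows with a non-blank value in the given column."""
    filled = sum(1 for r in rows if (r[column] or "").strip())
    return filled / len(rows)

MIN_FILL = 0.5  # assumed threshold; the slides do not give one

# Keep only the columns that are populated often enough.
kept = [c for c in rows[0] if fill_rate(rows, c) >= MIN_FILL]
print(kept)
```

Applied to the real data, a rule of this kind would be one way to arrive at a reduced set of text variables such as the 6 retained out of 21.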


VAERS Sample Entries

15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam.
DEAF

Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV
ASTHMA

Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live.
INFECT


VAERS Text Fields

SYMPTOM_TEXT: Full text description of the adverse reaction entered by a medical professional

SYM01: Brief description of primary symptom

SYM02-SYM05: Additional symptoms in decreasing importance


VAERS Initial Diagram


Equivalent Terms for Patient
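The list of equivalent terms shown on this slide is not preserved in the transcript. The Python sketch below illustrates the general idea of mapping abbreviations and variants onto one canonical term before counting. The synonym dictionary is hypothetical; only the abbreviation "Pt" is taken from the VAERS sample entries.

```python
import re

# Hypothetical synonym list; the actual list on the slide is not in the transcript.
EQUIVALENTS = {"pt": "patient", "pts": "patient", "patients": "patient"}

def normalize(text, equivalents=EQUIVALENTS):
    """Lowercase, tokenize, and replace each token by its canonical term."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [equivalents.get(tok, tok) for tok in tokens]

print(normalize("Pt experienced chicken pox from head to toe"))
```

After this step, reports that say "Pt" and reports that say "patient" contribute to the same term.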


Property Panel for VAERS Text Miner Analysis


Interactive Results


Clusters Window

Why only one term when five were requested?


Cases with Fever

Last 16 entries out of 98


Headache Terms


Headache Documents


Terms Most Similar to Headache


Documents Most Similar to Headache


First 11 out of 65 Documents Filtered by Headache Terms


VAERS Predictive Modeling Diagram


Logistic Regression Model Effects Plot


Logistic Regression Lift Plot