
Post on 11-Jan-2016


Text Mining

Three Cases

Outline

Federalist Papers
SVDPDF
VAERS

http://zlin.ba.ttu.edu/sassrc.rar

http://zlin.ba.ttu.edu/DMTM9.rar


Federalist Papers

Who wrote The Federalist Papers?

Hamilton

Madison

STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.

About the Data

Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at

http://www.yale.edu/lawweb/avalon/federal/fed.htm, or http://www.constitution.org/fed/federa00.htm.

Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)


Corpus: The Federalist Papers corpus is a collection of 85 essays.

Terms and Tokens: The Federalist Papers taken as a whole contain over 190,000 tokens and approximately 8,800 unique tokens.
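To make the distinction between tokens and unique tokens concrete, here is a minimal Python sketch. The tokenizer (lowercased alphabetic runs) and the sample sentence are assumptions for illustration; SAS Text Miner applies its own parsing rules, so its counts for the real corpus would differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical stand-in text; the real corpus would be the 85 essays.
sample = "The people of the State of New York. The people decide."

tokens = tokenize(sample)   # every occurrence counts as one token
counts = Counter(tokens)    # one entry per unique token (term)

print("tokens:", len(tokens))
print("unique tokens:", len(counts))
```

Every occurrence of a word is a token, while a unique token is counted once however often it appears.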


The Federalist Papers Diagram

EM Clustering

Logistic Regression

TARGET: 1 – Madison; 0 – Hamilton; missing – unknown


Federalist Papers Clusters

            Hamilton  Madison  Unknown
Cluster 1         24        1        0
Cluster 2         27       14       11

These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
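The cluster counts reported above can be checked with a short Python sketch. This is only a cross-tabulation of the published counts, not the EM clustering itself; the helper names are hypothetical.

```python
# Counts taken from the cluster results reported above.
clusters = {
    "Cluster 1": {"Hamilton": 24, "Madison": 1, "Unknown": 0},
    "Cluster 2": {"Hamilton": 27, "Madison": 14, "Unknown": 11},
}

def column_total(author):
    """Total number of essays for one author group across both clusters."""
    return sum(row[author] for row in clusters.values())

def share_in_cluster(cluster, author):
    """Fraction of an author group's essays that fall in the given cluster."""
    return clusters[cluster][author] / column_total(author)

for author in ("Hamilton", "Madison", "Unknown"):
    share = share_in_cluster("Cluster 2", author)
    print(author, column_total(author), round(share, 3))
```

The column totals (51, 15, and 11) agree with the authorship counts given in About the Data, and Cluster 2 holds all of the unknown essays but only about half of Hamilton's.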


Logistic Regression Classification of The Federalist Papers

Text Mining Results

Text mining matches the results of Mosteller and Wallace.

The predictions in the second column from the right show the strength of the decision.

The record with a predicted value of 0.709119 corresponds to essay 56, so the model gives this essay the weakest association with Madison of all of the unknown essays.

Essay 63, with a predicted value of 0.999691, has the strongest association with Madison.

All of the essays in question have a stronger association with Madison than Hamilton, hence the classification into the Madison category.
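The classification step can be sketched in Python as a cutoff on the predicted probability. The 0.5 cutoff is an assumption (the slides do not state the decision rule), and only the two predicted values quoted above are used; the other nine are not given in the text.

```python
# Predicted probabilities of Madison authorship quoted in the text.
# Values for the other nine disputed essays are not given, so they are omitted.
predicted = {56: 0.709119, 63: 0.999691}

CUTOFF = 0.5  # assumed decision threshold, not stated in the source

def classify(p, cutoff=CUTOFF):
    """Assign Madison when the predicted probability reaches the cutoff."""
    return "Madison" if p >= cutoff else "Hamilton"

# List essays from weakest to strongest association with Madison.
for essay, p in sorted(predicted.items(), key=lambda item: item[1]):
    print(essay, p, classify(p))
```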


Characteristics of a Document

A document consists of
– letters
– words
– sentences
– paragraphs
– punctuation
– possible structural items: chapters, sections.

The elements of a document can be
– counted (for example, the number of characters, words, or sentences)
– summarized (for example, mean, median, or kurtosis).
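As a rough illustration of counting and summarizing document elements, the Python sketch below uses only the standard library. The splitting rules (alphabetic runs for words, terminal punctuation for sentences) and the sample text are simplifying assumptions.

```python
import re
import statistics

def describe(text):
    """Count basic elements of a document and summarize word length."""
    words = re.findall(r"[A-Za-z']+", text)
    # Assumed rule: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lengths = [len(w) for w in words]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": statistics.mean(word_lengths),
        "median_word_length": statistics.median(word_lengths),
    }

sample = "We the people. We decide!"  # hypothetical sample text
summary = describe(sample)
print(summary)
```

The same pattern extends to sentence and paragraph lengths, or to higher moments such as kurtosis, by summarizing a different list of measurements.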


Comparing Two Documents

        Word Size  Sentence Size  Paragraph Size  Word Freq  Sentence Freq  Paragraph Freq
Doc 1
Doc 2


Contingency Table Comparing Essay 1 to Essay 37
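A contingency table of this kind cross-tabulates word counts for two documents. The Python sketch below builds a small one and computes a Pearson chi-square statistic by hand. The two texts are hypothetical stand-ins (not Essay 1 or Essay 37), the three words are illustrative only, and the slides do not say which statistic, if any, was applied to the table.

```python
import re
from collections import Counter

def word_counts(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocabulary]

def chi_square(table):
    """Pearson chi-square statistic for a table of counts (rows = documents)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:  # skip empty columns
                stat += (observed - expected) ** 2 / expected
    return stat

vocabulary = ["upon", "while", "whilst"]                   # illustrative words
doc_a = "upon reflection and upon review, while we wait"   # hypothetical text
doc_b = "whilst we wait, whilst others act, upon request"  # hypothetical text

table = [word_counts(doc_a, vocabulary), word_counts(doc_b, vocabulary)]
print(table, round(chi_square(table), 3))
```

A larger statistic indicates that the two documents distribute their use of these words differently; with counts this small it is only a demonstration of the arithmetic.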



Text Miner Static Analysis


Text Miner Interactive Analysis

SVDPDF


SAS Education Course Descriptions

The data represents a collection of 130 course summaries obtained from http://support.sas.com.

The original 130 files were PDF files stored in one location on an HTTP server.

A SAS DATA step was used to read the files from the server and write them to a local directory.

The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set.

The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.


Static Analysis with SAS Text Miner


Text Miner Settings


Interactive Results


Applications of Concept Lists

A company can have specific conceptual goals. For example, are customers concerned about
– brand integrity
– quality
– price
– features, styles, and selection
– availability
– customer support?


Market Research for Quality

What terms are most similar to the term “quality”?
– Find Similar
– Filter

What documents address quality?
– Filter on synonyms and similar terms
– Find similar documents

What secondary concepts reflect information on quality?
– SVD coefficients
– Concept links
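One common way to rank terms by similarity to "quality" is cosine similarity between term vectors. The Python sketch below assumes that approach and uses a hypothetical term-by-document count matrix; it is not necessarily what the Find Similar function in SAS Text Miner computes internally, and in practice the vectors would usually be SVD coefficients rather than raw counts.

```python
import math

# Hypothetical term-by-document counts (one list entry per document).
term_doc = {
    "quality":     [3, 0, 2, 0],
    "reliability": [2, 0, 1, 0],
    "price":       [0, 4, 0, 1],
}

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term):
    """Rank all other terms by similarity to the given term, best first."""
    scored = [(cosine(term_doc[term], vec), other)
              for other, vec in term_doc.items() if other != term]
    return sorted(scored, reverse=True)

ranking = most_similar("quality")
for score, other in ranking:
    print(other, round(score, 3))
```

In this toy matrix "reliability" co-occurs with "quality" in the same documents and ranks first, while "price" never shares a document with it and scores zero.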


VAERS


VAERS

VAERS (the Vaccine Adverse Event Reporting System) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines.

No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious.

VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events.

Department of Health and Human Services, Public Health Service


VAERS Data

Data was obtained from http://www.vaers.org/.

Data was downloaded in September 2002 as a series of CSV files.

A SAS DATA step was used to read and process the data.

The original data had 131,464 observations and 59 variables.

Cleaning and screening reduced the data set to 48,523 observations and 44 variables.

The data set has 6 text variables. The original data had 21, but 15 were sparsely populated.
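Screening out sparsely populated variables can be sketched in Python as follows. The original work used a SAS DATA step, so this is an analogous illustration, not the author's code. The 50% fill threshold, the tiny inline extract, and the column name SYM19 are assumptions.

```python
import csv
import io

def fill_rates(rows, columns):
    """Fraction of rows with a non-empty value, per column."""
    n = len(rows)
    return {c: sum(1 for r in rows if (r.get(c) or "").strip()) / n
            for c in columns}

def keep_populated(rows, columns, min_fill=0.5):
    """Keep columns whose fill rate reaches min_fill (assumed threshold)."""
    rates = fill_rates(rows, columns)
    return [c for c in columns if rates[c] >= min_fill]

# Tiny hypothetical extract; SYM19 is a made-up, sparsely populated column.
raw = (
    "SYMPTOM_TEXT,SYM01,SYM19\n"
    "fever after dose,FEVER,\n"
    "rash on arm,RASH,\n"
    "headache,,\n"
)
rows = list(csv.DictReader(io.StringIO(raw)))
kept = keep_populated(rows, ["SYMPTOM_TEXT", "SYM01", "SYM19"])
print(kept)
```

With a real download, `io.StringIO(raw)` would be replaced by an open file handle for each CSV file.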


VAERS Sample Entries

15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam.
DEAF

Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV
ASTHMA

Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live.
INFECT


VAERS Text Fields

SYMPTOM_TEXT: Full text description of the adverse reaction entered by a medical professional

SYM01: Brief description of primary symptom

SYM02-SYM05: Additional symptoms in decreasing importance


VAERS Initial Diagram


Equivalent Terms for Patient
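Treating several surface forms as one term can be sketched in Python with a simple lookup applied after tokenization. The synonym entries below are hypothetical; the actual equivalent-term list used in the analysis is not reproduced in this transcript.

```python
import re

# Hypothetical synonym list mapping variants to one canonical term.
EQUIVALENTS = {"pt": "patient", "pts": "patient", "patients": "patient"}

def normalize(text):
    """Tokenize, lowercase, and replace known variants with the canonical term."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [EQUIVALENTS.get(token, token) for token in tokens]

normalized = normalize("Pt experienced chicken pox from head to toe")
print(normalized)
```

Mapping variants to a single term before counting keeps abbreviations such as "Pt" from being treated as separate terms.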


Property Panel for VAERS Text Miner Analysis


Interactive Results


Clusters Window

Why only one term when five were requested?


Cases with Fever

Last 16 entries out of 98


Headache Terms


Headache Documents


Terms Most Similar to Headache


Documents Most Similar to Headache


First 11 out of 65 Documents Filtered by Headache Terms


VAERS Predictive Modeling Diagram


Logistic Regression Model Effects Plot


Logistic Regression Lift Plot
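A lift plot compares the response rate among the highest-scored cases with the overall response rate. The Python sketch below computes non-cumulative lift per bin; the scores and outcomes are hypothetical, since the plotted values are not reproduced in this transcript.

```python
def lift_by_bin(scores, outcomes, bins=5):
    """Sort cases by descending score, split them into equal-sized bins, and
    return each bin's response rate divided by the overall response rate.
    Cases left over when the count is not divisible by `bins` are ignored."""
    ranked = [y for _, y in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    overall = sum(ranked) / len(ranked)
    size = len(ranked) // bins
    return [
        (sum(ranked[i * size:(i + 1) * size]) / size) / overall
        for i in range(bins)
    ]

# Hypothetical predicted probabilities and observed 0/1 outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

lift = lift_by_bin(scores, outcomes)
print([round(value, 2) for value in lift])
```

A lift above 1 in the first bins means the model concentrates the target cases among its highest-scored records.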
