Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design Robert B. Allen Drexel University


Post on 15-Dec-2015


Page 1

Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design

Robert B. Allen

Drexel University

Page 2

Theme of Access

Rich cultural record
– Massive data sets. OCR text available.

Text processing of the OCR. Too much content for manual processing
– Segmentation
– Metadata assignment
– Constraints

Beyond Traditional Approaches:
– Events
– Models
– Interfaces

Page 3

Constraints for Processing and “Understanding”

Page numbers
Surrounding articles
Sections/Features
Cyclic patterns (e.g., “drought”)
Named entities (People, organizations, places, etc.)
Event threads
Community models
Comparisons across resources
• multiple newspapers
• with digitized books
• with historical records
• with archives and special collections

Page 4

Segmentation
with Ilya Waldstein and Weizhong Zhu

Result                                      Count   Percent
Correct                                        64        67
Minor errors (e.g., merging a few words)        6         6
Combined two or more articles                  11        12
Too much segmentation                          14        15

• Finding meaningful regions in the text.

• Several methods. Look for headings and then merge other sections of text.

• Different sections have different problems.

• Tradeoff: more hand-entered knowledge helps but takes effort
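
The "look for headings, then merge" idea can be sketched roughly as follows. This is a minimal illustration, not the method used in the project: the heading test (a short line that is mostly upper-case) and the names `looks_like_heading` and `segment` are assumptions made for the example.

```python
def looks_like_heading(line, min_letters=8, upper_ratio=0.8):
    """Assumed heuristic: a heading is a short line made mostly of upper-case letters."""
    letters = [c for c in line if c.isalpha()]
    if len(letters) < min_letters or len(line) > 80:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= upper_ratio


def segment(lines):
    """Start a new segment at each heading and merge the following lines into it."""
    segments = []
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if looks_like_heading(text):
            current = {"heading": text, "body": []}
            segments.append(current)
        elif current is None:
            # Text before the first heading goes into an untitled segment.
            current = {"heading": None, "body": [text]}
            segments.append(current)
        else:
            current["body"].append(text)
    return segments
```

A rule this simple would be expected to produce both kinds of error listed in the table: a missed heading combines two articles, and a sub-heading or all-caps advertisement line splits one article too often.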

Page 5

Finding Genres/Subjects (IPTC Codes)
with Keyword Matching (Ad hoc technique)

Ads: Medicines

drugs, cure, cures, liver, kidneys, prescriptions, drug, pains, blood, nervous, eye, pain, dying, bone, extract, potency, ache, brain, skin, rectum, chronic, tonic, stomach, remedies, constipation, bottle, bladder, medicine, pills

Chess

chess, check, checkmate, mate, pawn, rook, game, match, win, problems, capture

Weather report

weather, report, temperature, degrees, rain, sun, snow, warmer, colder, temperatures, cloudy, icy, rainy

• Does better with more concrete categories
• Again, more hand-entered knowledge helps.
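
A keyword matcher of this kind can be sketched in a few lines. The keyword sets below are abbreviated from the lists above; the scoring rule (count keyword hits, require at least `min_hits`) and the function names are assumptions for illustration, not the procedure actually used.

```python
import re
from collections import Counter

# Abbreviated from the slide's keyword lists; real lists would be longer.
GENRE_KEYWORDS = {
    "Ads: Medicines": {"drugs", "cure", "cures", "liver", "kidneys", "tonic",
                       "remedies", "medicine", "pills"},
    "Chess": {"chess", "check", "checkmate", "mate", "pawn", "rook", "game", "match"},
    "Weather report": {"weather", "report", "temperature", "degrees", "rain",
                       "snow", "cloudy"},
}


def tokenize(text):
    """Lower-case the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def score_genres(text):
    """Count how many tokens of the text fall in each genre's keyword set."""
    counts = Counter(tokenize(text))
    return {genre: sum(counts[w] for w in words)
            for genre, words in GENRE_KEYWORDS.items()}


def classify(text, min_hits=2):
    """Return the best-scoring genre, or None if too few keywords matched (assumed threshold)."""
    scores = score_genres(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Concrete genres such as weather reports have a compact, distinctive vocabulary, which is why a matcher like this tends to do better on them than on broad subjects.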

Page 6

Sections and Features Articles
with Catherine Hall

J WILLIAM LEE
E M EARLE SON
THE STORE THAT SAVES YOU MONEY
NATIONAL BISCUIT COMPANY
ADVANCE SPRING STYLES

Page 7

Text Mining for Words: Text for Holidays, Oct-Dec 1916, Philadelphia Evening Ledger

[Figure: two charts with x-axis ticks from 1 to 88; one y-axis runs 0 to 60, the other 0 to 250. Series labels: “Thanksgiving” and “Christmas”.]
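
Counts like those charted for “Thanksgiving” and “Christmas” can be produced with a simple per-issue tally. The sketch below assumes the OCR text is already available as a chronological list of `(label, text)` pairs; that input shape and the name `term_counts_by_issue` are hypothetical.

```python
import re
from collections import Counter


def term_counts_by_issue(issues, terms):
    """issues: list of (label, text) pairs in chronological order (assumed input shape).
    Returns {term: [count in first issue, count in second issue, ...]}."""
    wanted = [t.lower() for t in terms]
    series = {t: [] for t in wanted}
    for _label, text in issues:
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for t in wanted:
            series[t].append(counts[t])
    return series
```

Each returned list is one time series, ready to be plotted against issue number.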

Page 8

Towards “models” for event streams. Oct-Dec 1916, Philadelphia Evening Ledger

[Figure: two charts with x-axis ticks from 1 to 89; one y-axis runs 0 to 35, the other 0 to 90. Series labels: “Campaign” and “Election”.]

Page 9

Beyond Keyword-based Search Engines: Finding Important Events by Comparing Multiple Sources of Evidence

Combining information from two newspapers
3 months from 1906, Washington Times, Washington Herald
Find distinctive words, then overlaps of those distinctive words in the two newspapers

Oct 29 1906: awful breaking bridge camden coach dempsey drawbridge heroism motorman picked submerged surface survivors thoroughfare trestle windows

Nov 18 1906: colon dillon hopes lacking princeton princetons teams tigers yale

Dec 31 1906: ambulances awful belt coaches cotta crowded empty horribly identified mangled relief rescuers splintered takoma terra
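
One way to sketch the two-step idea (distinctive words per paper, then their overlap) is shown below. The slide does not say how distinctiveness was measured; the ratio of a word's share of one day's text to its share of the whole period is an assumption chosen for simplicity, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def distinctive_words(daily_texts, day, top_n=20, min_count=2):
    """Words whose share of `day`'s text most exceeds their share of the whole period.
    daily_texts: {day_label: text} for ONE newspaper (assumed input shape)."""
    day_counts = Counter(tokenize(daily_texts[day]))
    all_counts = Counter()
    for text in daily_texts.values():
        all_counts.update(tokenize(text))
    day_total = sum(day_counts.values())
    all_total = sum(all_counts.values())
    scores = {}
    for word, n in day_counts.items():
        if n < min_count:
            continue  # ignore one-off words, which are often OCR noise
        scores[word] = (n / day_total) / (all_counts[word] / all_total)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_n])


def shared_distinctive_words(paper_a, paper_b, day, **kwargs):
    """Distinctive words for `day` that appear in BOTH newspapers' lists."""
    a = distinctive_words(paper_a, day, **kwargs)
    b = distinctive_words(paper_b, day, **kwargs)
    return sorted(a & b)
```

Requiring agreement between two independently digitized papers filters out words that are distinctive only because of OCR errors or one paper's advertising, leaving terms tied to the day's events.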

Page 10

Towards Novel User Interfaces

Page 11

Focus-Context Timeline for History

(Allen, 1999)

Page 12

Narrative Timeline

Causes of American Civil War

Page 13

Interviews with Historians on Interface Needs:
Two Themes: Search and Information Management

The Chicago Tribune database is good for searching names, but broader topics are hard to research – e.g., race relations brings back too many results.

A log of all searches – “this is a huge issue for me.” Editing a book manuscript recently, she found it “hugely taxing” to find items she hadn’t cited.

Searches lead to other searches, so she would like ways to see how searches are nested within each other and to get back to earlier search results. A visual map telling you where you are in your search would be especially helpful. A system that lets her easily use multiple windows.

[The historian] used newspapers to fill in gaps in research and corroborate information from other sources. Exploratory searching included looking at larger issues and events such as elections and campaigns. She used newspapers to find public opinion about changes in liquor license laws – to get a sense of “the texture of the city… how the city was thinking.”

Page 14

Page 15

Image Genres

Select images based on IPTC image genres
Cluster the images based on features
Learn to classify those clusters
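
The cluster-then-classify steps could look roughly like the sketch below. Nothing here is taken from the project: it assumes each image has already been reduced to a numeric feature vector (which features is left open), uses plain k-means for the clustering step, and stands in nearest-centroid assignment for the learned classifier.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to `point`; doubles as a nearest-centroid classifier."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def kmeans(points, k, iterations=20):
    """Plain k-means. The first k points seed the centroids so the run is deterministic
    (an assumption made for the example; random restarts are more usual)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids
```

Once the centroids are fixed, `nearest(centroids, features)` assigns a new image to a cluster; a trained classifier could replace that step.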

Page 16

STATEHOOD MEASURE WILL PASS THE HOUSE Republicans Determine to Rush Hamilton Bill Through to Be Ready for Senate in December margin but NSW Mexico which hasnearly I nearly double the population of Arizona is largely Republican at present The Republicans in their rule will provide that no amendment shall be con sidered

I THE T MEs71 I world Fair Contests it OFFEH NO lTf acid the three employes of the District or National + t tional Government collecting respectively the < Uteat number ofLouis Sti 4 Louis Exposition coupons to the Worlds Fair for 4 one week and payaIixpenses i pxpenses Note District or National Government ewtploUli es SUKonly Uli only the coupon

The Washington Times for 1904 was digitized from USNP microfilm to METS-ALTO format. The ALTO files have OCR along with fonts, point size, and coordinates. However, the OCR ranges from good to bad to ugly….
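
As an illustration of what the ALTO files make available, the sketch below pulls each OCR word with its coordinates and font information using only the Python standard library. The element and attribute names (`TextStyle`, `TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, `STYLEREFS`, `FONTFAMILY`, `FONTSIZE`) come from the ALTO schema; the sample document and the function names are made up for the example, and real files may attach style references at other levels.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def read_alto_words(xml_text):
    root = ET.fromstring(xml_text)
    # Map style IDs to (font family, point size) from the <TextStyle> declarations.
    styles = {}
    for el in root.iter():
        if local_name(el.tag) == "TextStyle":
            styles[el.get("ID")] = (el.get("FONTFAMILY"), el.get("FONTSIZE"))
    words = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        line_refs = (line.get("STYLEREFS") or "").split()
        for s in line:
            if local_name(s.tag) != "String":
                continue
            # Assumption: use the word's own style if present, else the line's.
            refs = (s.get("STYLEREFS") or "").split() or line_refs
            font = next((styles[r] for r in refs if r in styles), (None, None))
            words.append({
                "text": s.get("CONTENT"),
                "hpos": float(s.get("HPOS", 0)),
                "vpos": float(s.get("VPOS", 0)),
                "width": float(s.get("WIDTH", 0)),
                "height": float(s.get("HEIGHT", 0)),
                "font": font[0],
                "size": font[1],
            })
    return words


# Made-up fragment for illustration only; not taken from the Washington Times files.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles><TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="18"/></Styles>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine STYLEREFS="TXT_0">
      <String CONTENT="STATEHOOD" HPOS="100" VPOS="50" WIDTH="300" HEIGHT="40"/>
      <SP/>
      <String CONTENT="MEASURE" HPOS="420" VPOS="50" WIDTH="250" HEIGHT="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

words = read_alto_words(SAMPLE)
```

Point size and position are the kind of layout evidence that can help separate headlines from body text even where the OCR text itself is poor.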