Digital History and Text Mining

For HIST 511, Topics in Public History: Digital History October 13, 2011

By Roger Bilisoly, Ph.D.

Department of Mathematical Sciences, CCSU

Overview of Talk

Some complexities of text: tagging, multiple versions of a work

Some free visualization tools

An extended example analyzing the Dictionary of Canadian Biography. This example was inspired by a discussion that Associate Professor William Turkel posted on his blog.

Top: My photo of a decorative wall pattern in the Miracle Mile Shops, Las Vegas, NV http://www.flickr.com/photos/66082566@N00/4281884117/

Right: An English Hieroglyphic Bible published by Isaiah Thomas in 1788. This is from

Left: King Ashur-nasir-pal at Brooklyn Museum, photo by wallyg http://www.flickr.com/photos/wallyg/2440285854/sizes/m/

More on tagging:
http://www.brooklynmuseum.org/community/blogosphere/2008/07/15/collection-preview-and-re-thinking-tagging/
http://www.steve.museum/

http://www.guerrillagirls.com/

Online visitors can add tags.

Tags are given by users and vetted by staff.

Anyone can join and contribute.

Unfortunately, access to tag data seems limited.

User gets to choose one of the following: keep it (green), trash it (red), not sure (yellow).

Folksonomy: Study of tags

science and individualism of tagging. Analysis of tags is still young: tag listing and clouds are still common at Flickr.com, Delicious.com, etc.

Possible future analyses:
Collocations, popular in linguistics, which would be more informative.
Short phrases instead of single words.

It takes several steps to obtain an electronic version of a text.

http://documents.nytimes.com/looking-over-the-shoulder-of-charles-dickens-the-man-who-wrote-of-a-christmas-carol
http://www.themorgan.org/home.asp

The original manuscript is at the Morgan Library and Museum in New York City, and this facsimile is available online at NYTimes.com.

This image is from an 1845 edition scanned by Google and available from: http://books.google.com/books

Pick an edition and then convert to electronic text.

huge error levels--only later did they come to the more polished levels seen today. In fact, many of our eBooks were done totally without any supervision--by people who had never heard of Project Gutenberg--

This quote by Michael Hart is from: http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Mission_Statement_by_Michael_Hart

Gutenberg.org EBook #46,

Marley was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail.

Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.

Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don't know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, and sole mourner. And even Scrooge was not so dreadfully cut up by the sad event, but that he was an excellent man of business on the very day of the funeral, and solemnised it with an undoubted bargain.

Raw Text

A Christmas Carol

Written in only six weeks

Dickens wanted to publish in time for the Christmas of 1843.

Modifications were made later: Dickens modified the text for his readings, which started in 1858 (see the example on the next slide). There have also been many adaptations. The first was for the theater in 1844 (though not by Dickens). See http://en.wikipedia.org/wiki/List_of_A_Christmas_Carol_adaptations

A Christmas Carol: 1868 reading version vs. 1843 original version

MARLEY was dead, to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it. And Scrooge's name was good upon 'Change for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don't know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, his sole mourner.

Marley was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail.

Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.

Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don't know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, and sole mourner. And even Scrooge was not so dreadfully cut up by the sad event, but that he was an excellent man of business on the very day of the funeral, and solemnised it with an undoubted bargain.

1868 version condensed by Dickens for his readings. http://gaslight.mtroyal.ca/carol.htm

The 1845 version seen earlier. http://www.gutenberg.org/files/46/46-h/46-h.htm

The Canterbury Tales, which has no original manuscripts.

Ellesmere ms.
Experience / though noon Auctoritee
Were in this world / were right ynogh to me
To speke of wo / that is in mariage
ffor lordynges / sith I .xij. yeer was of Age

Hengwrt ms.
Experience / thogh noon Auctoritee
Were in this world / is right ynogh for me
To speke of wo / that is in mariage
ffor lordynges / sith þat I twelf yeer was of age

Cambridge ms.
Experyment / þough none auctoryte
Were in þis worlde is ri t/ ynou e for me
To speke of woo þat ys in mariage
ffor lordynges siþen I twelfe yere was of age

Corpus ms.
Experiment þough non auctorite.
Were in þis world is right ynough for me
To speke of wo þat is in mariage
ffor lordynges syn I twelue eer was of age /

Lansdowne ms.
Experiment þouhe none auctorite
Where in þis werlde is riht y-nouhe for me
To speke of woo þat is in mariage
For lordeinges sen .I. twelue ere was of Age

Harleian ms.
Experiens þough noon auctorite
were in þis world. it were ynough for me
To speke of wo þat is in mariage
For lordyngs syns I twelf er was of age

Petworth ms.
Experience thou e noon autorite
were in þis world ri t ynou e for me
To speke of woo þat is in mariage
ffor lordingges siþ I twelue ere was of age

The Cambridge ms. completed by Egerton ms.
Experience / though noon auctoritee
Were in this world / is right I-now for me
To speken of woo / that is in mariage
ffor lordynges / syn I twelue er was of age

Manuscripts available at http://www.kankedort.net/ECT_manuscripts.htm

Ngram analysis of tweets

Many Eyes project by IBM

Ngram

http://ngrams.googlelabs.com/

Free tool: http://Trendistic.indextank.com/

Gardasil claim occurred on 9/12/2011.

http://www-958.ibm.com/software/data/cognos/manyeyes/

Four visualization tools available as of 10-2011

There are already over 230,000 data sets available, and you can upload yours. This Jane Austen data set was already uploaded.

Sense and Sensibility by Matthew Hurst, posted on his blog, Text Mining, Visualization and Social Media, on 9-24-2011.

Lucy (Steele) is highlighted on the left. Common keywords are on the right. Each line represents one chapter.

From http://datamining.typepad.com/

A Taste of Text Mining: Analyzing Text with Computers

Extracting information from the Web
  Power of regular expressions
  Example used here inspired by William Turkel (Associate Professor of History at the University of Western Ontario)

Concordancing
  A powerful technique from corpus linguistics
  Example here uses the corpus obtained by Turkel's approach

Introduction of some information retrieval (IR) ideas

Extracting Information from the Web

This is done continuously by spiders written by companies like Google to update their search engines. Crawling the Web requires sophisticated programming, but scraping information from a particular site is not so hard. The following example is based on ideas given in six blog posts by Turkel:

http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-1.html
http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-2.html
http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-3.html
http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-4.html
http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-5.html
http://digitalhistoryhacks.blogspot.com/2006/03/text-mining-dcb-part-6.html

Turkel harvests the online Dictionary of Canadian Biography (DCB).

This Web site allows searches using a form. Its terms of use, however, do not forbid downloading all of its records. What is not forbidden must be done!

A Browser Trick

Form submission can be automated because:
(1) The queries are shown in the URL box.
(2) These queries have patterns.

Two stages:
(1) Obtain all 592 links to Canadians.
(2) For each link, access it via a program and grab its HTML.

Below are the URLs for the 1st, 2nd, 3rd, and 4th requests for 20 Canadian records.

http://www.biographi.ca/009004-110.01-e.php?PHPSESSID=06378al70rmt2mu4ho8mf7c952&q2=&q3=&q10=I&q7=&q5=&q1=&interval=20
http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=21&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=41&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=61&&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952

In these URLs, interval= gives the number of records per request and sk= gives the starting point.

Only eight lines of Perl code download all 592 DCB URLs for Canadians flourishing prior to 1700.

use LWP::Simple;
open(OUT, ">canadian_bio.txt");
$url_part1 = 'http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=100&sk=';
$url_part2 = '&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7';
for ($i = 1; $i < 601; $i += 100) {       # sk = 1, 101, ..., 501
    $doc = get "$url_part1$i$url_part2";  # fetch 100 records at a time
    print OUT "$doc\n\n\n";
}
close(OUT);

The reason this is so short is that the LWP module has commands for working with the Web.

$doc = get "$url_part1$i$url_part2";

This line of code queries the DCB Web page, and the returned HTML is stored in the variable $doc.

A Small Sample of the Downloaded DCB Links

<td> <a href="009004-110.01-e.php?&q10=I&sk=1&s=3&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">Descending</a></td>
</tr>
<tr>
<td class="td_data">1.</td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">ABRAHAM, JOHN</a></td>
<td class="td_data">1000-1700 (Volume I)</td>
</tr>
<tr>
<td class="td_data">2.</td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=2&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AERNOUTSZ, JURRIAEN</a></td>
<td class="td_data">1000-1700 (Volume I)</td>
</tr>
<tr>
<td class="td_data">3.</td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=3&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGARIATA</a></td>
<td class="td_data">1000-1700 (Volume I)</td>
</tr>
<tr>
<td class="td_data">4.</td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=4&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGRAMONTE, JUAN DE</a></td>
<td class="td_data">1000-1700 (Volume I)</td>
</tr>

We want the URL for each individual Canadian, which is contained in the <a href lines above (on the slide, these URLs are bold and red). The URLs are extracted (with Perl) and then used to download each biography (again with Perl).
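The extraction step is not shown on the slide. A minimal sketch of how it might look follows; it is not the author's program. The sub name extract_bio_urls, the base URL in $base (inferred from the search URLs shown earlier), and the shortened PHPSESSID value in the sample are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed base URL, inferred from the search URLs shown earlier.
my $base = 'http://www.biographi.ca/';

# Return the absolute URL of every biography link found in a chunk of HTML.
sub extract_bio_urls {
    my ($html) = @_;
    my @urls;
    # Biography links point at 009004-119.01-e.php; the sorting links
    # (009004-110.01-e.php) are skipped because they do not match.
    while ($html =~ /<a href="(009004-119\.01-e\.php\?[^"]*)">/g) {
        push @urls, $base . $1;
    }
    return @urls;
}

# Shortened sample in the same shape as the downloaded links.
my $sample = <<'HTML';
<td> <a href="009004-110.01-e.php?&q10=I&sk=1&s=3&PHPSESSID=abc">Descending</a></td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100&&PHPSESSID=abc">ABRAHAM, JOHN</a></td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=2&interval=100&&PHPSESSID=abc">AERNOUTSZ, JURRIAEN</a></td>
HTML

print "$_\n" for extract_bio_urls($sample);
```

Printing one URL per line gives a file in the form that a download loop reading one URL per line would expect.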

Now we download the biographies themselves and print them to canadian_bios.txt.

use LWP::Simple;
open(IN, "canadian_biography_urls.txt");
open(OUT, ">canadian_bios.txt");
while (<IN>) {
    chomp;
    sleep(1);                                # Be a polite spider
    $doc = get "$_";                         # Download biographies
    @lines = split(/\n/, $doc);
    $flag = 0;
    foreach $x (@lines) {
        if ($x =~ /<\/BODY>/) { $flag = 0 }  # Stop at the end of the body
        if ($flag) { print OUT "$x\n" }      # Print out biography
        if ($x =~ /<BODY>/) { $flag = 1 }    # Start after the opening tag
    }
}
close(IN);
close(OUT);

The results are still in HTML, but another program can remove the HTML tags.

<P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, governor of Port Nelson; fl. 1672–89.</P>
<P CLASS="ParagraphFormat"> He joined the HBC about 1672 and served in James Bay 1672–75 and 1676–78 under Governor Charles B<SPAN CLASS="SmallCaps">ayly</SPAN>, against whom he brought charges of mismanagement. In 1679 Abraham was appointed second to John N<SPAN CLASS="SmallCaps">ixon</SPAN>, successor and although he absconded with an advance of salary at sailing time, he was engaged in 1681 as mate of the <I>Diligence</I> (Capt. N<SPAN CLASS="SmallCaps">ehemiah </SPAN>W<SPAN CLASS="SmallCaps">alker</SPAN>) and wintered in James Bay.</SPAN></P>

All the HTML tags are in <>, which can be removed by a program:

ABRAHAM, JOHN, governor of Port Nelson; fl. 1672–89. He joined the HBC about 1672 and served in James Bay 1672–75 and 1676–78 under Governor Charles Bayly, against whom he brought charges of mismanagement. In 1679 Abraham was appointed second to John Nixon, successor and although he absconded with an advance of salary at sailing time, he was engaged in 1681 as mate of the Diligence (Capt. Nehemiah Walker) and wintered in James Bay.
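The tag-removing program is not shown. A minimal sketch, assuming that every tag sits between a < and the next >, could be the following; the sub name strip_tags is hypothetical and the sample is a shortened version of the biography above.

```perl
use strict;
use warnings;

# Remove anything between < and >, then tidy the leftover whitespace.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;       # drop every <...> tag
    $html =~ s/\s+/ /g;          # collapse runs of whitespace
    $html =~ s/^\s+|\s+$//g;     # trim both ends
    return $html;
}

my $html = '<P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, '
         . 'governor of Port Nelson.</P> <P CLASS="ParagraphFormat"> He served '
         . 'under Governor Charles B<SPAN CLASS="SmallCaps">ayly</SPAN>.</P>';

print strip_tags($html), "\n";
```

This simple substitution is enough for markup like the sample, but it does not decode character entities and would be fooled by a > inside an attribute value, so a real HTML parser is the safer choice for messier pages.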

Note that the top version is still valid HTML:

Now extract dates with a concordancing program.

The key is constructing regular expressions (regexes) to find a text pattern of interest, which in this case is a 4-digit number starting with 1.

$target = '(\D1\d\d\d\D)';
\D stands for a non-digit
\d stands for a digit

At right, all the matches of the regex above are shown after sorting. Looking at this concordance reveals a variety of patterns.
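To make the idea concrete, here is a small sketch of a year concordance. It is not the concordancing program used for the slide. It also swaps in a variant of the pattern: lookarounds check the neighbouring characters without consuming them, so both years in a range such as 1495-1521 are found. The sub name concordance, the context width, and the sample sentence are assumptions for illustration.

```perl
use strict;
use warnings;

# Variant of the slide's pattern: a 4-digit number starting with 1 that is
# neither preceded nor followed by another digit.
my $year_re = qr/(?<!\d)(1\d\d\d)(?!\d)/;

# Return one "year: context" line per match, sorted by year.
sub concordance {
    my ($text, $width) = @_;
    my @hits;
    while ($text =~ /$year_re/g) {
        # $-[1] and $+[1] are the start and end offsets of the captured year.
        my ($year, $start, $end) = ($1, $-[1], $+[1]);
        my $from    = $start > $width ? $start - $width : 0;
        my $context = substr($text, $from, $end + $width - $from);
        push @hits, [$year, $context];
    }
    return map  { "$_->[0]: $_->[1]" }
           sort { $a->[0] <=> $b->[0] } @hits;
}

my $text = 'He joined the HBC about 1672 and was engaged in 1681 as mate; '
         . 'compare the range 1495-1521 and the doubtful 1522?';

print "$_\n" for concordance($text, 15);
```

Sorting by the matched year puts similar contexts next to each other, which is what makes conventions such as ranges and question marks easy to spot.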

The concordancing program is from

Complication: Dates have more than one use.

Dates have many uses and conventions, which complicates their analysis:
Although volume 1 of the DCB covers 1000-1700, there are references to modern texts, hence dates in the 20th century appear.
There are ranges of dates (e.g., 1495-1521).
Dates followed by question marks (e.g., 1522?).
Dates in square brackets (none shown here).

Years Appearing in Volume 1 of the DCB: My Results (top) v. Turkel's (bottom)

Top produced by me using SAS. Bottom from http://photos1.blogger.com/blogger/4745/1988/1600/dcbo-vol1-dates.jpg

Turkel points out that many of the date-peaks do correspond to notable events in early Canadian history. For

second voyage, and 1666 was the first census of New France.

Term-Document Matrix For The DCB

Here documents are the biographies and terms are years.

Remember that there are complications.

Range of years: 1495-1521
Years in doubt use a question mark: 1522?
Years of publication for references: (London, 1962)
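One way to hold such a matrix in Perl is a hash of hashes, with one row per biography and one column per year. The sketch below is an illustration under assumptions, not the program behind the slides: the sub name year_matrix and the sample texts are hypothetical, and restricting matches to 1000-1700 is one possible way to keep publication years such as 1962 out of the counts.

```perl
use strict;
use warnings;

# Build $matrix{$name}{$year} = number of times $year occurs in that biography.
# Only years from 1000 to 1700 are counted (an assumption made here).
sub year_matrix {
    my (%bios) = @_;
    my %matrix;
    for my $name (keys %bios) {
        while ($bios{$name} =~ /(?<!\d)(1[0-6]\d\d|1700)(?!\d)/g) {
            $matrix{$name}{$1}++;
        }
    }
    return %matrix;
}

# Hypothetical snippets standing in for two stripped biographies.
my %bios = (
    'FIRST'  => 'Arrived in 1663; council of 1663; left 1664 (London, 1962).',
    'SECOND' => 'Active 1662-1663, perhaps until 1668?',
);

my %matrix = year_matrix(%bios);
for my $name (sort keys %matrix) {
    my $row = $matrix{$name};
    print "$name: ", join(', ', map { "$_ x$row->{$_}" } sort keys %$row), "\n";
}
```

A sparse structure like this stores only the years that actually occur, which matters because most biographies mention only a handful of the possible years.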

Part of the DCB Name-Year Matrix

Angles between Canadians in the DCB

This output is from the programming language Mathematica.

Which Canadians are the most alike with respect to years noted in DCB?

Louis Gaudais-Dupont and Mézy de Saffray
1661, 1662, 1663 (6 times), 1664
1663 (8 times), 1664 (3 times), 1665 (twice)
Angle between them is 21.5°

Thalour du Perron and Sieur de Monts
1662 (twice), 1663 (twice), 1668
1662 (3 times), 1663 (twice)
Angle between them is 22.4°
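These angles can be checked by hand. Treat each person's year counts as a vector; the cosine of the angle is the dot product divided by the product of the two vector lengths. For the first pair the dot product is 6×8 + 1×3 = 51 and the lengths are √39 and √77, so the cosine is about 0.931. The slide's figures came from Mathematica; the Perl below is only a sketch of the same calculation, and the sub name angle_deg and the variable names are assumptions. Which list of years belongs to which person does not affect the angle.

```perl
use strict;
use warnings;

# Angle in degrees between two sparse count vectors (hash refs: year => count).
sub angle_deg {
    my ($u, $v) = @_;
    my ($dot, $uu, $vv) = (0, 0, 0);
    $uu += $_ ** 2 for values %$u;
    $vv += $_ ** 2 for values %$v;
    for my $year (keys %$u) {
        $dot += $u->{$year} * $v->{$year} if exists $v->{$year};
    }
    die "empty vector\n" if $uu == 0 || $vv == 0;
    my $cos = $dot / sqrt($uu * $vv);
    $cos = 1 if $cos > 1;                        # guard against rounding
    my $pi = 4 * atan2(1, 1);
    return atan2(sqrt(1 - $cos ** 2), $cos) * 180 / $pi;   # arccos via atan2
}

# Year counts copied from the two pairs above.
my %pair1_a = (1661 => 1, 1662 => 1, 1663 => 6, 1664 => 1);
my %pair1_b = (1663 => 8, 1664 => 3, 1665 => 2);
my %pair2_a = (1662 => 2, 1663 => 2, 1668 => 1);
my %pair2_b = (1662 => 3, 1663 => 2);

printf "%.1f\n", angle_deg(\%pair1_a, \%pair1_b);   # prints 21.5
printf "%.1f\n", angle_deg(\%pair2_a, \%pair2_b);   # prints 22.4
```

An angle of 0° would mean the two biographies mention years in exactly the same proportions, and 90° would mean they share no years at all.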

Wordle.net word cloud using the DCB

Easier to do, but less informative.

References

Language and Computers: A Practical Introduction to the Computer Analysis of Language, by Geoff Barnbrook

Practical Text Mining with Perl (PTMP), by Roger Bilisoly

Corpus Linguistics: Investigating Language Structure and Use, by Biber, Conrad and Reppen

Concept Data Analysis: Theory and Applications, by Claudio Carpineto and Giovanni Romano (a more technical book)

Programming for Linguists: Perl for Language Researchers, by Michael Hammond

Corpora in Applied Linguistics, by Susan Hunston

Beginning Regular Expressions, by Andrew Watt

Text Mining: Predictive Methods for Analyzing Unstructured Information, by Sholom Weiss, Nitin Indurkhya, Tong Zhang and Fred Damerau (a more technical book)

Geometry and Meaning, by Dominic Widdows

eXtensible Markup Language (XML) and the Text Encoding Initiative (TEI)

<text>
<body><div0>
<head>The following is a Copy of a LETTER sent by the Author's Master to the Publisher. </head><div1>
<p n="1"> <name reg="Wheatley, Phillis" type="personal">PHILLIS</name> was brought from <name rend="italic" type="geographical">Africa</name> to <name type="geographical" key="italic">America</name>, in the Year 1761, between Seven and Eight Years of Age. Without any Assistance from School Education, and by only what she was taught in the Family, she, in sixteen Months Time from her Arrival, attained the English Language, to which she was an utter Stranger before, to such a Degree, as to read any, the most difficult Parts of the Sacred Writings, to the great Astonishment of all who heard her. </p>
<p n="2">As to her WRITING, her own Curiosity led her to it; and this she learnt in so short a Time, that in the Year 1765, she wrote a Letter to the <name reg="Occum, Samson" type="personal">Rev. Mr. OCCOM</name> <note resp="editor" type="biographical">Samson Occum (1723-1792) was a converted Mohegan Indian who became a Christian minister. He was a friend of Susanna Wheatley, Phillis Wheatley's mistress, and a friend and correspondent of Phillis Wheatley.</note>, the <name type="ethnological" rend="italic">Indian</name> Minister, while in <name type="geographical" rend="italic">England</name>. </p>
<p n="3">She has a great Inclination to learn the Latin Tongue, and has made some Progress in it. This Relation is given by her Master who bought her, and with whom she now lives. </p>
<signed><name reg="Wheatley, John" type="personal">JOHN WHEATLEY</name>.</signed>
<dateline rend="italic"><name rend="italic" type="geographical">Boston</name>, <date><distinct rend="italic">Nov.</distinct> 14, 1772</date>.</dateline>
</div1>
</div0></body></text>
</TEI.2>

Letter from John Wheatley to the Publisher sent Nov. 14, 1772.

From the Early Americas Digital Archive (http://www.mith2.umd.edu/eada/) Supported by Maryland Institute for Technology in the Humanities (MITH) http://www.mith2.umd.edu/eada/html/display.php?docs=wheatley_letter.xml&action=show

The DCB uses HTML, which uses tags to tell a Web browser how to display its biographies. It would be useful to have additional tags that encode information for human consumption. A protocol called XML (a form of SGML) was created to do just this. The XML tags are red at left. The TEI Consortium (http://www.tei-c.org/index.xml) produces standards and encourages the encoding of information in literary and linguistic texts. Unfortunately, this kind of tagging is done by humans at present (see example at left), which is labor intensive.

No year tags!
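Once a text carries tags like these, pulling out the tagged information takes only a few lines. The sketch below lists each name element with its type attribute; the sub name tagged_names is hypothetical, the sample is a fragment of the letter above, and the regex approach assumes that name elements are never nested. A proper XML parser would be the more robust choice for whole TEI files. Years such as 1761 carry no tag in this letter, so they would still have to be found with a pattern for 4-digit numbers.

```perl
use strict;
use warnings;

# Return a [type, text] pair for every <name ...>...</name> element.
sub tagged_names {
    my ($xml) = @_;
    my @found;
    while ($xml =~ /<name\b([^>]*)>(.*?)<\/name>/gs) {
        my ($attrs, $text) = ($1, $2);
        my ($type) = $attrs =~ /\btype="([^"]*)"/;   # undef if no type attribute
        push @found, [ defined $type ? $type : '', $text ];
    }
    return @found;
}

my $xml = '<name reg="Wheatley, Phillis" type="personal">PHILLIS</name> was '
        . 'brought from <name rend="italic" type="geographical">Africa</name> to '
        . '<name type="geographical" key="italic">America</name>, in the Year 1761';

print "$_->[0]: $_->[1]\n" for tagged_names($xml);
```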

Learning to Program

From teaching STAT 527, students vary in their like of

sounds interesting to you.

The Programming Historian teaches Python. By William J. Turkel, Adam Crymble and Alan MacEachern http://niche-canada.org/programming-historian/

NICHE = Network in Canadian History & Environment

Also see William Turkel's home page: http://history.uwo.ca/faculty/turkel/