![Page 1: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/1.jpg)
HTRC Use Cases
![Page 2: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/2.jpg)
HathiTrust Corpus Usage Patterns
[Diagram; label: HathiTrust Corpus]
![Page 3: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/3.jpg)
HathiTrust Corpus Usage Patterns (cont’d)
[Diagram; labels: HathiTrust Corpus, Chapter 1, Page IV, Table of Contents]
![Page 4: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/4.jpg)
Word Counts from HTRC Sample*
• Top 10 words
– the (1,092,274,158)
– of (729,347,125)
– and (515,034,460)
– to (429,304,807)
– in (337,513,888)
– a (315,487,516)
– that (167,847,940)
– is (163,694,582)
– was (138,907,857)
– I (123,743,522)
• Bottom 10 tokens
– ¿°‘»– ¿° ¿– ¿°° 1 ¿¦– ¡••••••««•– ¡•••■••– ¡►♦»– ¡—— – ¡„¡ – ¡■° 1 ¡•¦ 1 ¡►
*Public Domain non-Google digitized HT materials, 250,000 volumes
Occurrence Num of unique tokens
1 109
2 217
3 360
4 526
5 583
6 551
7 541
8 515
9 416
10 356
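The two tables on this slide (top words, and number of unique tokens per occurrence count) can be illustrated with a minimal sketch. The slides do not say how the HTRC sample was tokenized, so whitespace splitting and the tiny `sample` string here are assumptions for illustration only.

```python
from collections import Counter


def token_counts(text):
    """Count tokens, assuming simple whitespace tokenization (an assumption)."""
    return Counter(text.split())


def occurrence_histogram(counts):
    """Map each occurrence count to the number of unique tokens seen that many times."""
    return Counter(counts.values())


# Toy input, not HTRC data.
sample = "the cat and the dog and the bird"
counts = token_counts(sample)
top = counts.most_common(2)          # most frequent tokens first
hist = occurrence_histogram(counts)  # e.g. hist[1] = tokens that occur exactly once

for occurrence in sorted(hist):
    print(occurrence, hist[occurrence])
```

On a real corpus the same two counters would be built incrementally per volume and merged, since a single in-memory string is not practical at 250,000 volumes.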
![Page 5: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/5.jpg)
OCR Corrections on HTRC Sample
Total number of N-grams: 20,173,974,251
Total number of N-grams (minus numbers-only tokens and other easy-to-spot noise): 19,282,108,416
Number of corrections made: 131,571,046
Number of valid correction rules: 99,455
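The slide reports counts but not the correction method itself. As a rough sketch of what rule-based correction could look like, the code below first drops numbers-only tokens and then applies a lookup table of correction rules, counting how many corrections were made. The rules in `RULES`, the noise pattern, and the function names are hypothetical and are not taken from the HTRC work.

```python
import re

# Hypothetical correction rules (OCR misreading -> intended word); not HTRC's rules.
RULES = {"tbe": "the", "aud": "and"}

# Assumed definition of "numbers only": digits with optional separators.
NUMBERS_ONLY = re.compile(r"^[\d.,]+$")


def filter_noise(tokens):
    """Drop tokens made up only of digits and numeric separators."""
    return [t for t in tokens if not NUMBERS_ONLY.match(t)]


def apply_rules(tokens, rules):
    """Apply the rules token by token; return corrected tokens and the correction count."""
    corrected = []
    n_corrections = 0
    for token in tokens:
        fixed = rules.get(token, token)
        if fixed != token:
            n_corrections += 1
        corrected.append(fixed)
    return corrected, n_corrections


tokens = "tbe cat aud tbe dog 1,234".split()
kept = filter_noise(tokens)
fixed, n_corrections = apply_rules(kept, RULES)
print(" ".join(fixed), n_corrections)
```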
![Page 6: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/6.jpg)
HTRC Online Tools for Simple Analysis
![Page 7: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/7.jpg)
Tag Cloud Viewer
![Page 8: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/8.jpg)
Topic Modeling
• Uses MALLET Topic Modeling to cluster
• Top 8 topics, showing at most 200 keywords for each topic
![Page 9: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/9.jpg)
Concept Mapping
• Sentiment Analysis
– six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)
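One simple way to score text against the six emotions is a word-to-emotion lexicon lookup. The sketch below assumes that approach; the slide does not describe how the tool actually maps text to emotions, and the six-word `LEXICON` is a hypothetical stand-in for a real emotion lexicon.

```python
from collections import Counter

# The six core emotions named on the slide.
EMOTIONS = ("Love", "Joy", "Surprise", "Anger", "Sadness", "Fear")

# Hypothetical toy lexicon; a real one would hold thousands of entries.
LEXICON = {
    "adore": "Love",
    "delight": "Joy",
    "astonished": "Surprise",
    "furious": "Anger",
    "grief": "Sadness",
    "dread": "Fear",
}


def emotion_profile(text):
    """Count lexicon hits per emotion; emotions with no hits score 0."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    hits = Counter(LEXICON[w] for w in words if w in LEXICON)
    return {emotion: hits.get(emotion, 0) for emotion in EMOTIONS}


profile = emotion_profile("Her grief turned to dread, then to furious grief.")
print(profile)
```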
![Page 10: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/10.jpg)
Correlation-Ngram Viewer
![Page 11: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/11.jpg)
Visualization for Extracted Entities
• Date Entity to Simile Timeline
• Network Analysis
• Location Entity to Google Map
SEASR Project, UIUC, http://seasr.org
![Page 12: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/12.jpg)
Named Entity (NE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Entity tags shown: NE:Person, NE:Time, NE:Location, NE:Organization
SEASR Project, UIUC, http://seasr.org
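To make the tagging idea concrete, here is a minimal gazetteer-based sketch that labels the example sentence. It is not the SEASR tagger: a real NE tagger uses a trained model, and the phrase-to-label pairing in `GAZETTEER` is an assumed reading of the slide, which shows the four tags but not the tagger's output.

```python
# Hypothetical gazetteer: the pairing of phrases with labels is an assumption.
GAZETTEER = {
    "Mayor Rex Luthor": "NE:Person",
    "today": "NE:Time",
    "Alderwood": "NE:Location",
    "Boynton Laboratory": "NE:Organization",
}


def tag_entities(text, gazetteer):
    """Return (start offset, phrase, label) for every gazetteer match, in text order."""
    found = []
    for phrase, label in gazetteer.items():
        start = text.find(phrase)
        while start != -1:
            found.append((start, phrase, label))
            start = text.find(phrase, start + len(phrase))
    return sorted(found)


text = (
    "Mayor Rex Luthor announced today the establishment of a "
    "new research facility in Alderwood. It will be known as "
    "Boynton Laboratory."
)
entities = tag_entities(text, GAZETTEER)
for start, phrase, label in entities:
    print(f"{start:3d}  {label:16s} {phrase}")
```

Exact string matching keeps the sketch short but cannot handle unseen names or ambiguous words, which is why production taggers rely on statistical models instead.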
![Page 13: HTRC Use Cases](https://reader036.vdocuments.net/reader036/viewer/2022081514/568132bb550346895d997b0c/html5/thumbnails/13.jpg)
Metadata Enrichment
• Gender
• Genre
• Structural
– Chapters
– Front matter
– Indexes
– Bibliographies
• Part-of-Speech (POS) tagging
Example source: http://www.stanford.edu/~mjockers/cgi-bin/drupal/node/17