big data and programming (history 9808a) 27 october 2014
TRANSCRIPT
![Page 1: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/1.jpg)
Big Data and Programming (History 9808A)
27 October 2014
![Page 2: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/2.jpg)
Today’s Agenda
- Proposals
  - How are we with the due date?
- A Short Introduction to Big Data
- A Big Data Project: People In Motion
![Page 3: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/3.jpg)
Data Deluge
- Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
- Library of Congress = 200 terabytes
  - “Transferring ‘Libraries of Congress’ of Data”
- IP traffic is around 667 exabytes
- It’s a deluge...
- “Big Data”: too large for current software to handle
- Don’t be intimidated
  - Not all DH sources (yet)
- Instructive video: David McCandless, “The Beauty of Data Visualization”
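To put the two figures on this slide on one scale, here is a quick back-of-the-envelope conversion. It assumes decimal (SI) prefixes, which the slide does not specify, and the variable names are only illustrative.

```python
# Assumption: decimal (SI) prefixes; the slide does not say which convention it uses.
TERABYTE = 10**12
EXABYTE = 10**18

library_of_congress_bytes = 200 * TERABYTE  # figure quoted on the slide
ip_traffic_bytes = 667 * EXABYTE            # figure quoted on the slide

# How many "Libraries of Congress" fit into the quoted IP traffic figure?
libraries_of_congress = ip_traffic_bytes // library_of_congress_bytes
print(f"{libraries_of_congress:,} Libraries of Congress")
# prints: 3,335,000 Libraries of Congress
```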
![Page 4: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/4.jpg)
Big Data for History
- Tools for journalists, lit scholars and others
- Where does history fit in?
  - “Digital history does not offer truths, but only a new way of interpreting and understanding traces of the past.” (S. Graham, I. Milligan, & S. Weingart)
- Blog Leaders
  - Taryn: “…we have to have a better understanding of how programming works so we can at least engage with Computer Scientists to help develop the complex systems required…”
  - Tamar: The Strange Case of Belgium/Ancestry.com
  - Nick K.: The Case of the Missing API
![Page 5: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/5.jpg)
New approach: Crowdsourcing
- An “online, distributed problem-solving and production model.”
- Examples:
  - Wikipedia
  - reCAPTCHA (Luis von Ahn)
  - Others...
    - Transcribe Bentham
    - Census transcription
![Page 6: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/6.jpg)
![Page 7: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/7.jpg)
![Page 8: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/8.jpg)
A Database for Your Project?
- Think about how you might use a database
  - but perhaps not too big!
  - Databases can be very small and still be DH-worthy
- Are there public docs out there that you can digest?
  - Google Refine
- Incorporate a search function into your website?
- Resources
  - MS Excel (spreadsheet)
  - MS Access (relational database)
  - Google Refine
    - Cleaning data
![Page 9: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/9.jpg)
People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph
![Page 10: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/10.jpg)
What we are working towards
- ‘Unbiased’ links connecting individuals/households over several census years
- A comprehensive infrastructure of longitudinal data
- [Diagram: linked censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916, US 1880, US 1900]
![Page 11: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/11.jpg)
Stage 1: 1871 to 1881
- [Diagram: 100% of 1871 Census and 100% of 1881 Census, joined by Automatic Linking; 4,277,807 records; 3,601,663 records]
- Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
![Page 12: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/12.jpg)
Teaching a Computer to be a Genealogist
- Training with existing manually-created (True) links
  - Ontario Industrial Proprietors – 8429 links
  - Logan Township – 1760 links
  - St. James Church, Toronto – 232 links
  - Quebec City Boys – 1403 links
- Bias concerns
  - Think of any?
- [Map labels: Logan Twp, Guelph]
![Page 13: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/13.jpg)
Attributes for Automatic Linking
- Last Name – string
- First Name – string
- Gender – binary
- Birthplace – code
- Age – number
- Marital status – single, married, divorced, widowed, unknown
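The attribute list above could be modelled as a small record type. This is only a sketch: the class name `CensusRecord`, the field names, and the "M"/"F" encoding for gender are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values, taken from the slide's marital status list.
MARITAL_STATUSES = {"single", "married", "divorced", "widowed", "unknown"}


@dataclass(frozen=True)
class CensusRecord:
    """One person in one census year (hypothetical layout)."""
    last_name: str                  # string
    first_name: str                 # string
    gender: str                     # binary; "M" or "F" is an assumed encoding
    birthplace: str                 # standardized birthplace code
    age: Optional[float]            # number, in years; None if unreadable
    marital_status: str = "unknown"

    def __post_init__(self):
        # Reject values outside the categories named on the slide.
        if self.marital_status not in MARITAL_STATUSES:
            raise ValueError(f"unexpected marital status: {self.marital_status!r}")


example = CensusRecord("Smith", "John", "M", "ON", 24.0, "single")
```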
![Page 14: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/14.jpg)
Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:
![Page 15: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/15.jpg)
Data Cleaning and Standardization
- Cleaning
  - Names – remove non-alphanumeric characters; remove titles
  - Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  - All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
- Standardization
  - Birthplace codes and granularity
  - Marital status
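A minimal sketch of what these cleaning steps might look like. The title list, the unit table, and the English/French marital status table are all hypothetical stand-ins; the slides do not give the project's actual rules.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical title list; the project's real list is not given on the slide.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir", "mme", "mlle"}

# How many of each unit make up one year (English and French spellings).
UNITS_PER_YEAR = {
    "year": 1, "years": 1, "an": 1, "ans": 1,
    "month": 12, "months": 12, "mois": 12,
    "week": 52, "weeks": 52, "semaine": 52, "semaines": 52,
    "day": 365, "days": 365, "jour": 365, "jours": 365,
}

# Assumed English/French spellings mapped to the slide's categories.
MARITAL_STATUS = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
}


def strip_accents(text: str) -> str:
    """Drop combining accent marks so 'mariée' and 'mariee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw: str) -> str:
    """Remove non-alphanumeric characters, then drop title words."""
    text = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in text.split() if w.lower() not in TITLES]
    return " ".join(words)


def parse_age(raw: str) -> Optional[float]:
    """Turn '25', '3 months' or '3 mois' into an age in years."""
    match = re.fullmatch(r"\s*(\d+)\s*([^\W\d_]*)\.?\s*", raw)
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2).lower()
    if unit == "":
        return float(number)          # a bare number is taken as years
    if unit not in UNITS_PER_YEAR:
        return None                   # unknown unit: leave for manual review
    return number / UNITS_PER_YEAR[unit]


def standardize_marital_status(raw: str) -> str:
    """Map English or French spellings onto one of the slide's categories."""
    key = strip_accents(raw).strip().lower()
    return MARITAL_STATUS.get(key, "unknown")
```

For example, `clean_name("Rev. John O'Brien")` gives `"John OBrien"`, `parse_age("3 mois")` gives `0.25`, and `standardize_marital_status("Mariée")` gives `"married"`.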
![Page 16: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/16.jpg)
Computational Expense
- Very expensive to compare all the possible pairs of records
- Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
- Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, or about 40.5 days for each attribute compared. (Big Data)
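The estimate can be checked with a few lines of arithmetic, using only the numbers on the slide (the variable names are illustrative). Note that 40.5 days is the time for one attribute per record pair; with the 2 attributes in the formula, the total is about 81 days.

```python
records_1871 = 3_500_000
records_1881 = 4_000_000
attributes_compared = 2
comparisons_per_second = 4_000_000
seconds_per_day = 60 * 60 * 24

# Every 1871 record is paired with every 1881 record.
record_pairs = records_1871 * records_1881

days_per_attribute = record_pairs / comparisons_per_second / seconds_per_day
days_total = days_per_attribute * attributes_compared

print(f"{record_pairs:,} record pairs")
print(f"{days_per_attribute:.1f} days per attribute, {days_total:.1f} days in total")
# prints: 14,000,000,000,000 record pairs
# prints: 40.5 days per attribute, 81.0 days in total
```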
![Page 17: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/17.jpg)
Managing Computational Expense
- Blocking
  - By first letter of last name
  - By birthplace
- Using HPC (high-performance computing)
  - Running the system on multiple processors in parallel
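A rough sketch of blocking: records are only compared when they share a block key, so most of the possible pairs are never generated. Combining both criteria into one key is an assumption here (the slide lists them separately), and the dictionary layout and sample records are made up for illustration.

```python
from collections import defaultdict


def block_key(record):
    """Assumed key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(records_a, records_b):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Made-up sample records, not real census data.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Stewart", "birthplace": "QC"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of", len(census_1871) * len(census_1881))
# prints: 2 candidate pairs instead of 6
```

Because each block can be processed without reference to the others, blocks are also a natural unit of work to spread across multiple processors.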
![Page 18: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/18.jpg)
Record Comparison
- Comparing strings
  - String measures: first letter, “edit distance”, sound
- Age
  - +/- 2 years
- Required exact matches
  - Gender
  - Birthplace
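The comparison rules above might be combined roughly as follows. Several things here are assumptions rather than the project's method: Levenshtein distance stands in for "edit distance", American Soundex stands in for the "sound" measure, the thresholds are invented, and the age check assumes the second census was taken 10 years after the first.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """American Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last_digit = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last_digit:
            code += digit
        last_digit = digit  # a vowel resets this, so a repeated code counts again
    return (code + "000")[:4]


def names_similar(a: str, b: str, max_edits: int = 2) -> bool:
    """Same first letter, and either a small edit distance or the same sound."""
    a, b = a.lower(), b.lower()
    return a[:1] == b[:1] and (edit_distance(a, b) <= max_edits
                               or soundex(a) == soundex(b))


def is_candidate_match(rec_1871, rec_1881, census_gap=10, age_tolerance=2):
    """Exact match on gender and birthplace, age within +/- 2, similar names."""
    if rec_1871["gender"] != rec_1881["gender"]:
        return False
    if rec_1871["birthplace"] != rec_1881["birthplace"]:
        return False
    expected_age = rec_1871["age"] + census_gap
    if abs(rec_1881["age"] - expected_age) > age_tolerance:
        return False
    return (names_similar(rec_1871["last_name"], rec_1881["last_name"])
            and names_similar(rec_1871["first_name"], rec_1881["first_name"]))
```

With these rules, a made-up "John Smith, M, ON, age 24" in 1871 would be paired with "Jon Smyth, M, ON, age 35" in 1881, since the names differ by one edit each and the age is within two years of the expected 34.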
![Page 19: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/19.jpg)
Linkage Results 1871-81-91-1901
- Over 500,000 links…
- About 20%
![Page 20: Big Data and Programming (History 9808A) 27 October 2014](https://reader030.vdocuments.net/reader030/viewer/2022032703/56649d095503460f949dbffe/html5/thumbnails/20.jpg)
Coding Playtime
- W3C tutorials
- The Programming Historian: http://programminghistorian.org/
- Codecademy: http://www.codecademy.com/learn