HIST*4170 Data: Big and Small
Transcript of HIST*4170 Data: Big and Small, 29 January 2013
Today’s Agenda
• Blog Updates
• A Short Introduction to Databases
• A Big Data Project: People in Motion
• Special Guest: Dr. Rebecca Lenihan
Blog Highlights
• Ambition
  • Consider scalability
  • Consider source availability – local advantage?
• Keep your eye on the academic value
  • What do you want to teach? Learn?
• Themes: war, sport, family, mapping
• Intellectual property/privacy
• Resources:
  • Google SketchUp – to make 3D buildings
Data Deluge
• Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
• Library of Congress = 200 terabytes
  • “Transferring ‘Libraries of Congress’ of Data”
• IP traffic is around 667 exabytes
• It’s a deluge...
• Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference)
• “Big Data”: too large for current software to handle
• Don’t be intimidated
  • Not all DH sources are this big (yet)
Introduction to Databases
• Database – a system that allows for the efficient storage and retrieval of information
• We associate them with computers, which changed a lot
• Problems: organization and efficient retrieval
  • Organization requires a data structure
  • Efficient retrieval requires thorough algorithms
• Potential for the humanities? ...new problems, new questions, visualization, and objects worthy of study and reflection.
Database Design
• The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain.
• Relational databases are more efficient because they store information separately
  • Attributes
  • Relationships
• The Quamen reading is a nice introduction
• Not as complicated as you might think, but following the rules is important
• We will apply...
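The idea of attributes and relationships can be made concrete with a small relational example. Below is a minimal sketch using Python's built-in sqlite3 module; the table layout and sample data are invented for illustration, not taken from the course project:

```python
import sqlite3

# Relational design: information stored separately in two tables,
# connected by a relationship (household_id). Names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE household (id INTEGER PRIMARY KEY, township TEXT)")
cur.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    household_id INTEGER REFERENCES household(id),
    last_name TEXT, first_name TEXT, age INTEGER)""")
cur.execute("INSERT INTO household VALUES (1, 'Logan Township')")
cur.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 'Smith', 'John', 34),
                 (2, 1, 'Smith', 'Mary', 31)])

# Ask a question about the state of the domain:
# who lives in Logan Township?
rows = cur.execute("""SELECT p.first_name, p.last_name
                      FROM person p
                      JOIN household h ON p.household_id = h.id
                      WHERE h.township = 'Logan Township'
                      ORDER BY p.id""").fetchall()
print(rows)  # [('John', 'Smith'), ('Mary', 'Smith')]
```

Because the township is stored once in `household` rather than repeated on every person row, correcting it later means changing one record, not many.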
New Approach: Crowdsourcing
• An “online, distributed problem-solving and production model.”
  • Daren C. Brabham (2008), “Crowdsourcing as a Model for Problem Solving: An Introduction and Cases,” Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90
  • Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...”
• reCAPTCHA
  • Luis von Ahn
• Others...
  • Google?
There are limitations...
• Organization
• Quality control
• Selection
A Database for Your Project?
• Think about how you might use a database
  • but perhaps not too big!
• Databases can be very small and still be DH-worthy
• Are there public docs out there that you can digest?
  • Google Refine
• Incorporate a search function into your website?
• Resources
  • MS Excel (spreadsheet)
  • MS Access (relational database)
  • Google Refine – cleaning data
Assignment for Next Week
• Reading: TBD (3D guns?)
• Help someone else out with their project
  • Read their blog
  • Comment and provide detailed feedback
  • Find a collaborator?
People in Motion: Creating Longitudinal Data from the Canadian Historical Census
• ‘Unbiased’ links connecting individuals/households over several census years
• A comprehensive infrastructure of longitudinal data
What we are working towards:
[Diagram: links spanning the 1851, 1871, 1881, 1891, 1901, 1906, 1911, and 1916 Canadian censuses, plus the US 1880 and US 1900 censuses]
Current Work
[Diagram: automatic linking between 100% of the 1871 census (3,601,663 records) and 100% of the 1881 census (4,277,807 records)]
Partners and collaborators: FamilySearch (Church of Jesus Christ of Latter-day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
Existing (True) Links
• Ontario Industrial Proprietors – 8,429 links
• Logan Township – 1,760 links
• St. James Church, Toronto – 232 links
• Quebec City Boys – 1,403 links
• Bias concerns
  – family context
  – others?
[Map: Logan Township and Guelph, Ontario]
Attributes for Automatic Linking
• Last name – string
• First name – string
• Gender – binary
• Birthplace – code
• Age – number
• Marital status – single, married, divorced, widowed, unknown
Automatic Linkage
• The challenges:
  1) Identify the same person
  2) Deal with attribute characteristics
  3) Manage computational expense
• The system:
Data Cleaning and Standardization
• Cleaning
  – Names – remove non-alphanumeric characters; remove titles
  – Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  – All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
• Standardization
  – Birthplace codes and granularity
  – Marital status
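The cleaning steps above can be sketched in a few lines of Python. The title list and French/English mappings below are illustrative assumptions, not the project's actual lookup tables:

```python
import re

# Hypothetical lookup tables for illustration only.
TITLES = {"mr", "mrs", "rev", "dr"}
MARITAL = {"married": "M", "mariee": "M",
           "single": "S", "celibataire": "S",
           "widowed": "W", "veuve": "W", "veuf": "W"}

def clean_name(raw):
    """Remove non-alphanumeric characters and leading titles."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", "", raw).lower().split()
    return " ".join(t for t in tokens if t not in TITLES)

def clean_age(raw):
    """Transform non-numerical ages ('3 months' / '3 mois') to years."""
    m = re.match(r"(\d+)\s*(months?|mois)", raw.strip().lower())
    if m:
        return int(m.group(1)) // 12   # infants become age 0
    return int(raw)

print(clean_name("Mr. John O'Brien"))  # john obrien
print(clean_age("3 months"))           # 0
print(MARITAL["mariee"])               # M
```

Standardization works the same way: each raw value is mapped onto a fixed code so that records from different census years become directly comparable.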
Computational Expense
• Very expensive to compare all the possible pairs of records
• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
• Run-time estimate: (3.5M × 4M record pairs, two attributes compared per pair) / (4M pair comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 40.5 days. (Big Data)
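The back-of-envelope arithmetic can be reproduced directly. This sketch assumes "4M comparisons per second" counts whole record-pair comparisons; a slower per-attribute rate would scale the figure up accordingly:

```python
# Brute-force all-pairs run-time estimate for linking two censuses.
records_1871 = 3.5e6
records_1881 = 4.0e6
pairs = records_1871 * records_1881   # 1.4e13 candidate pairs
rate = 4e6                            # assumed pair comparisons per second
seconds = pairs / rate
days = seconds / 60 / 60 / 24
print(round(days, 1))  # 40.5
```

Over a month of compute for one pair of censuses is what makes the brute-force approach untenable, and motivates the blocking and HPC strategies that follow.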
Managing Computational Expense
• Blocking
  – by first letter of last name
  – by birthplace
• Using HPC
  – running the system on multiple processors in parallel
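Blocking can be sketched as follows: records are grouped by a cheap key (here, first letter of last name plus birthplace, per the slide), and only records sharing a key are ever compared. The toy data is invented:

```python
from collections import defaultdict

# Toy records: (last_name, birthplace_code). Invented for illustration.
census_a = [("smith", "ON"), ("smyth", "ON"), ("tremblay", "QC")]
census_b = [("smith", "ON"), ("taylor", "ON"), ("tremblay", "QC")]

def block_key(rec):
    last_name, birthplace = rec
    return (last_name[0], birthplace)

# Index one census by block key.
blocks = defaultdict(list)
for rec in census_b:
    blocks[block_key(rec)].append(rec)

# Only compare within matching blocks.
candidate_pairs = [(a, b) for a in census_a for b in blocks[block_key(a)]]
print(len(candidate_pairs))  # 3, versus 9 for all-pairs comparison
```

Even on three records per side, blocking cuts the comparisons from 9 to 3; on millions of records the reduction is what brings the 40-day estimate down to something tractable.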
Record Comparison
• Comparing strings
  – Jaro-Winkler
  – Edit distance
  – Double Metaphone
• Age
  – ± 2 years
• Exact matches
  – Gender
  – Birthplace
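A hypothetical sketch of how these rules combine into a match decision, using edit (Levenshtein) distance for the fuzzy name comparison; Jaro-Winkler or Double Metaphone would slot into the same place. The thresholds and the 10-year adjustment for the 1871→1881 gap are illustrative assumptions:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def records_match(r1871, r1881):
    """Each record is (last, first, gender, birthplace, age)."""
    return (edit_distance(r1871[0], r1881[0]) <= 1 and  # fuzzy last name
            edit_distance(r1871[1], r1881[1]) <= 1 and  # fuzzy first name
            r1871[2] == r1881[2] and                    # exact gender
            r1871[3] == r1881[3] and                    # exact birthplace
            abs(r1871[4] - (r1881[4] - 10)) <= 2)       # age +/- 2 years

rec_1871 = ("smith", "john", "M", "ON", 24)
rec_1881 = ("smyth", "john", "M", "ON", 35)
print(records_match(rec_1871, rec_1881))  # True
```

Here "smith"/"smyth" pass at edit distance 1, gender and birthplace match exactly, and a reported age of 35 in 1881 is within two years of the expected 34.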
Linkage Results

Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45