HIST*4170 Data: Big and Small
Transcript of HIST*4170 Data: Big and Small, 29 January 2013
Today’s Agenda
• Blog Updates
• A Short Introduction to Databases
• A Big Data Project: People in Motion
• Special Guest: Dr. Rebecca Lenihan
Blog Highlights
• Ambition
  • Consider scalability
  • Consider source availability – local advantage?
• Keep your eye on the academic value
  • What do you want to teach? Learn?
• Themes: war, sport, family, mapping
• Intellectual property/privacy
• Resources:
  • Google SketchUp – to make 3D buildings
Data Deluge
• Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
• Library of Congress = 200 terabytes
  • “Transferring ‘Libraries of Congress’ of Data”
• IP traffic is around 667 exabytes
• It’s a deluge...
• Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference)
• “Big Data”: too large for current software to handle
• Don’t be intimidated
  • Not all DH sources are this big (yet)
Introduction to Databases
• Database – a system that allows for the efficient storage and retrieval of information
• We associate them with computers, which changed a lot
• Problems: organization and efficient retrieval
  • Organization requires a data structure
  • Efficient retrieval requires thorough algorithms
• Potential for the humanities? ...new problems, new questions, visualization, and objects worthy of study and reflection.
Database Design
• The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain.
• Relational databases are more efficient because they store information separately
  • Attributes
  • Relationships
• The Quamen reading is a nice introduction
• Not as complicated as you might think, but following the rules is important
• We will apply...
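The idea of attributes and relationships can be made concrete with a small relational example. Below is a minimal sketch using Python's built-in sqlite3 module; the table layout and sample data are invented for illustration, not taken from the course project:

```python
import sqlite3

# Relational design: information stored separately in two tables,
# connected by a relationship (household_id). Names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE household (id INTEGER PRIMARY KEY, township TEXT)")
cur.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    household_id INTEGER REFERENCES household(id),
    last_name TEXT, first_name TEXT, age INTEGER)""")
cur.execute("INSERT INTO household VALUES (1, 'Logan Township')")
cur.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 'Smith', 'John', 34),
                 (2, 1, 'Smith', 'Mary', 31)])

# Ask a question about the state of the domain:
# who lives in Logan Township?
rows = cur.execute("""SELECT p.first_name, p.last_name
                      FROM person p
                      JOIN household h ON p.household_id = h.id
                      WHERE h.township = 'Logan Township'
                      ORDER BY p.id""").fetchall()
print(rows)  # [('John', 'Smith'), ('Mary', 'Smith')]
```

Because the township is stored once in `household` rather than repeated on every person row, correcting it later means changing one record, not many.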
New Approach: Crowdsourcing
• An “online, distributed problem-solving and production model.”
  • Daren C. Brabham (2008), “Crowdsourcing as a Model for Problem Solving: An Introduction and Cases,” Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90
  • Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...”
• reCAPTCHA
  • Luis von Ahn
• Others...
  • Google?
There are limitations...
• Organization
• Quality control
• Selection
A Database for Your Project?
• Think about how you might use a database
  • but perhaps not too big!
• Databases can be very small and still be DH-worthy
• Are there public docs out there that you can digest?
  • Google Refine
• Incorporate a search function into your website?
• Resources
  • MS Excel (spreadsheet)
  • MS Access (relational database)
  • Google Refine – cleaning data
Assignment for Next Week
• Reading: TBD (3D guns?)
• Help someone else out with their project
  • Read their blog
  • Comment and provide detailed feedback
  • Find a collaborator?
People in Motion: Creating Longitudinal Data from the Canadian Historical Census
• ‘Unbiased’ links connecting individuals/households over several census years
• A comprehensive infrastructure of longitudinal data
What we are working towards:
[Diagram: links spanning the 1851, 1871, 1881, 1891, 1901, 1906, 1911, and 1916 Canadian censuses, plus the US 1880 and US 1900 censuses]
Current Work
[Diagram: automatic linking between 100% of the 1871 census (3,601,663 records) and 100% of the 1881 census (4,277,807 records)]
Partners and collaborators: FamilySearch (Church of Jesus Christ of Latter-day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
Existing (True) Links
• Ontario Industrial Proprietors – 8,429 links
• Logan Township – 1,760 links
• St. James Church, Toronto – 232 links
• Quebec City Boys – 1,403 links
• Bias concerns
  – family context
  – others?
[Map: Logan Township and Guelph, Ontario]
Attributes for Automatic Linking
• Last name – string
• First name – string
• Gender – binary
• Birthplace – code
• Age – number
• Marital status – single, married, divorced, widowed, unknown
Automatic Linkage
• The challenges:
  1) Identify the same person
  2) Deal with attribute characteristics
  3) Manage computational expense
• The system:
Data Cleaning and Standardization
• Cleaning
  – Names – remove non-alphanumeric characters; remove titles
  – Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  – All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
• Standardization
  – Birthplace codes and granularity
  – Marital status
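The cleaning steps above can be sketched in a few lines of Python. The title list and French/English mappings below are illustrative assumptions, not the project's actual lookup tables:

```python
import re

# Hypothetical lookup tables for illustration only.
TITLES = {"mr", "mrs", "rev", "dr"}
MARITAL = {"married": "M", "mariee": "M",
           "single": "S", "celibataire": "S",
           "widowed": "W", "veuve": "W", "veuf": "W"}

def clean_name(raw):
    """Remove non-alphanumeric characters and leading titles."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", "", raw).lower().split()
    return " ".join(t for t in tokens if t not in TITLES)

def clean_age(raw):
    """Transform non-numerical ages ('3 months' / '3 mois') to years."""
    m = re.match(r"(\d+)\s*(months?|mois)", raw.strip().lower())
    if m:
        return int(m.group(1)) // 12   # infants become age 0
    return int(raw)

print(clean_name("Mr. John O'Brien"))  # john obrien
print(clean_age("3 months"))           # 0
print(MARITAL["mariee"])               # M
```

Standardization works the same way: each raw value is mapped onto a fixed code so that records from different census years become directly comparable.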
Computational Expense
• Very expensive to compare all the possible pairs of records
• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
• Run-time estimate: (3.5M × 4M record pairs, two attributes compared per pair) / (4M pair comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 40.5 days. (Big Data)
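The back-of-envelope arithmetic can be reproduced directly. This sketch assumes "4M comparisons per second" counts whole record-pair comparisons; a slower per-attribute rate would scale the figure up accordingly:

```python
# Brute-force all-pairs run-time estimate for linking two censuses.
records_1871 = 3.5e6
records_1881 = 4.0e6
pairs = records_1871 * records_1881   # 1.4e13 candidate pairs
rate = 4e6                            # assumed pair comparisons per second
seconds = pairs / rate
days = seconds / 60 / 60 / 24
print(round(days, 1))  # 40.5
```

Over a month of compute for one pair of censuses is what makes the brute-force approach untenable, and motivates the blocking and HPC strategies that follow.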
Managing Computational Expense
• Blocking
  – by first letter of last name
  – by birthplace
• Using HPC
  – running the system on multiple processors in parallel
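Blocking can be sketched as follows: records are grouped by a cheap key (here, first letter of last name plus birthplace, per the slide), and only records sharing a key are ever compared. The toy data is invented:

```python
from collections import defaultdict

# Toy records: (last_name, birthplace_code). Invented for illustration.
census_a = [("smith", "ON"), ("smyth", "ON"), ("tremblay", "QC")]
census_b = [("smith", "ON"), ("taylor", "ON"), ("tremblay", "QC")]

def block_key(rec):
    last_name, birthplace = rec
    return (last_name[0], birthplace)

# Index one census by block key.
blocks = defaultdict(list)
for rec in census_b:
    blocks[block_key(rec)].append(rec)

# Only compare within matching blocks.
candidate_pairs = [(a, b) for a in census_a for b in blocks[block_key(a)]]
print(len(candidate_pairs))  # 3, versus 9 for all-pairs comparison
```

Even on three records per side, blocking cuts the comparisons from 9 to 3; on millions of records the reduction is what brings the 40-day estimate down to something tractable.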
Record Comparison
• Comparing strings
  – Jaro-Winkler
  – Edit distance
  – Double Metaphone
• Age
  – ± 2 years
• Exact matches
  – Gender
  – Birthplace
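A hypothetical sketch of how these rules combine into a match decision, using edit (Levenshtein) distance for the fuzzy name comparison; Jaro-Winkler or Double Metaphone would slot into the same place. The thresholds and the 10-year adjustment for the 1871→1881 gap are illustrative assumptions:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def records_match(r1871, r1881):
    """Each record is (last, first, gender, birthplace, age)."""
    return (edit_distance(r1871[0], r1881[0]) <= 1 and  # fuzzy last name
            edit_distance(r1871[1], r1881[1]) <= 1 and  # fuzzy first name
            r1871[2] == r1881[2] and                    # exact gender
            r1871[3] == r1881[3] and                    # exact birthplace
            abs(r1871[4] - (r1881[4] - 10)) <= 2)       # age +/- 2 years

rec_1871 = ("smith", "john", "M", "ON", 24)
rec_1881 = ("smyth", "john", "M", "ON", 35)
print(records_match(rec_1871, rec_1881))  # True
```

Here "smith"/"smyth" pass at edit distance 1, gender and birthplace match exactly, and a reported age of 35 in 1881 is within two years of the expected 34.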
Linkage Results

Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45