Next on OPRAH – Bringing Data Out of the Closet Walter Giesbrecht, Data Librarian York University...

Post on 11-Jan-2016


Next on OPRAH –

Bringing Data Out of the Closet

Walter Giesbrecht, Data Librarian, York University

Jeff Moon, Head, Documents Unit, Queen’s University

OLA SuperConference

Friday, 1 February, 2002

Not this Data …

… but these kinds!

Before we get all shaken up about data and statistics, with warnings that such and such a percent of people get such and such a disease after following such and such a personal habit...

… it is useful to note that:

• 80% of those who go insane drink coffee, tea, or beer

• 98% of those who commit suicide sleep indoors

• and darned near 100% of those injured in traffic accidents are people who move from one place to another!

Let’s take a look at

Data and Statistical Analysis…

Have you ever seen the movie “Twins”?

Think of “Arnie” as the “Data” continuum…

Tables, Charts, Graphs

(from books, journals, the web, etc...)

A ‘number’

Raw Survey Data

# French Mother Tongue (1996) in Ontario

Employment levels by occupation class

Annual inflation rate from 1914 to present

Aggregate Data Microdata

Coded responses of surveyed individuals

Canada - Employment, Telecommunication Equipment Industry

479,285

1914    7.2
1915    7.3
…       …
1990    93.3
1991    98.5
1992    100
1993    101.8
1994    102
1995    104.2
1996    105.8
1997    107.6
1998    108.6
1999    110.5
2000    112.1
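A time series like the one listed here is usually explored by looking at the change from one year to the next. The sketch below is illustrative only: it copies the figures from the slide and treats them as index levels, which is an assumption (the slide does not say how the values are defined). The function name `percent_change` is made up for this example.

```python
# Year/value pairs copied from the slide.
# Assumption: the values are index levels, so year-over-year
# percent change is (current / previous - 1) * 100.
series = {
    1914: 7.2, 1915: 7.3,
    1990: 93.3, 1991: 98.5, 1992: 100.0, 1993: 101.8,
    1994: 102.0, 1995: 104.2, 1996: 105.8, 1997: 107.6,
    1998: 108.6, 1999: 110.5, 2000: 112.1,
}


def percent_change(values):
    """Return {year: percent change from the previous year}.

    Only consecutive years are compared, so the gap in the slide
    (1916 to 1989 are elided) produces no entry.
    """
    years = sorted(values)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        if curr - prev == 1:
            changes[curr] = (values[curr] / values[prev] - 1) * 100
    return changes


changes = percent_change(series)
for year, pct in changes.items():
    print(f"{year}: {pct:.1f}%")
```

Skipping non-consecutive years keeps the elided stretch of the series from being misread as a single one-year jump.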

Aggregate Data:

A Number

Tables, Charts, Graphs Time Series

Sources of Aggregate Data…

Statistics Canada is generally the first stop for Canadian Data:

• The Canada Year Book (print)

• The Daily (web)

• Canadian Social Trends (web/print)

• CANSIM / E-Stat (web) – time series…

• “Canadian Statistics” (web)

• Beyond 20/20 Files – multidimensional tables…

Survey Data (microdata):

             Age  Sex  MarStat  Children  Income  Occ  Educ
Person 1      24   M      1        1        5      1     7
Person 2      34   F      1        0        3      5     3
Person 3      52   F      2        2        4      3     3
Person 4      64   F      1        3        6      4     4
Person 5      23   M      3        1        7      2     6
Person 6      63   F      4        1        5      6     3
…
Person "n"    29   M      1        0        5      2     2

Statistical analysis software is used to generate meaningful results… e.g. SPSS, SAS.

“variables” (columns)

“respondents” (rows)
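To make the idea of analysing microdata concrete, here is a minimal sketch in Python (standing in for SPSS or SAS, which the slide names) that computes a few descriptive statistics from the six fully listed respondents above. The numeric codes are left uninterpreted because the slide gives no codebook; the names `FIELDS`, `ROWS`, and `respondents` are chosen for this example only.

```python
from collections import Counter
from statistics import mean, stdev

# Variables (columns) and respondents (rows) copied from the table above.
# Codes such as MarStat=1 are kept as-is; their meaning is not given.
FIELDS = ("Age", "Sex", "MarStat", "Children", "Income", "Occ", "Educ")
ROWS = [
    (24, "M", 1, 1, 5, 1, 7),
    (34, "F", 1, 0, 3, 5, 3),
    (52, "F", 2, 2, 4, 3, 3),
    (64, "F", 1, 3, 6, 4, 4),
    (23, "M", 3, 1, 7, 2, 6),
    (63, "F", 4, 1, 5, 6, 3),
]
respondents = [dict(zip(FIELDS, row)) for row in ROWS]

# Counts and percentages for a categorical variable.
sex_counts = Counter(r["Sex"] for r in respondents)
sex_percent = {k: 100 * v / len(respondents) for k, v in sex_counts.items()}

# Average and (sample) standard deviation for a numeric variable.
ages = [r["Age"] for r in respondents]
mean_age = mean(ages)
sd_age = stdev(ages)

print(sex_counts)
print({k: round(v, 1) for k, v in sex_percent.items()})
print(round(mean_age, 1), round(sd_age, 1))
```

With only six rows the results mean little; the point is that each statistic is computed from the individual records rather than read off a published table.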

Sources of Survey Data…

Once again, Statistics Canada is generally the first stop for Canadian Data:

• The “Data Liberation Initiative” (DLI) provides access to hundreds of publicly released survey data files.

Polling Companies (Environics, CROP, etc.) produce microdata files as well.

For US & international data, the “Inter-university Consortium for Political & Social Research” (ICPSR) is a key source.

Aggregate Data: Postcard (“Fixed”)

Survey Data: Camera (“Flexible”)

Think of “Danny” as the “Statistical Analysis” continuum…

Counts, Percentages, Averages, Standard Deviations, Tests of Significance

Descriptive Statistics: Counts, Percentages, Averages, Standard Deviations

Inferential Statistics: Significance testing

Tables, Charts, Graphs

A ‘number’

Raw Survey Data

Data continuum…

Statistical Analysis continuum…

Aggregate / Descriptive

Microdata / Inferential

To review…

Data: Aggregate & Survey Data (Microdata)

Statistical Analysis: Counts, Percentages, Averages, Standard Deviations, Cross-tabulations, t-tests, Regression, etc.
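As one illustration of the inferential end of the continuum, the sketch below computes a chi-square test of independence by hand for a 2×2 cross-tabulation. The counts are invented for the example and do not come from any survey; the function name `chi_square` is likewise made up. The critical value 3.841 is the standard chi-square cutoff for 1 degree of freedom at the 0.05 level.

```python
def chi_square(table):
    """Return the chi-square statistic for a table of observed counts.

    Expected counts assume the row and column variables are independent:
    expected = row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are two groups, columns are yes / no answers.
observed = [
    [30, 20],
    [10, 40],
]

stat = chi_square(observed)
CRITICAL_05_DF1 = 3.841  # chi-square critical value, df = 1, alpha = 0.05
significant = stat > CRITICAL_05_DF1

print(f"chi-square = {stat:.3f}, significant at 0.05: {significant}")
```

A descriptive summary would stop at the percentages in each row; the test adds a judgment about whether a difference that large could plausibly be due to sampling chance.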

Reference Question Example:

How many of you have had a patron arrive at the Reference Desk with a newspaper article reporting Statistics Canada data?

Globe & Mail, Dec 17, 2001, p A15

“…71% of 15- to 17-year-olds use online chat rooms, double the proportion of the only slightly older 20-24-year-olds.”

First, note that the article says:

“Statistics Canada, in a study released last week…”

So… where do you go from here?

First… Let’s try:

http://www.statcan.ca/start.html

Which leads you to the following:

Canadian Social Trends, Winter 2001

Which leads, in turn to:

Here is the statistic quoted in the Globe…

and here is the source…

So… how do we check out this source?

General Social Survey, 2000

DLI Web Site (or Local Data Centre)

http://www.statcan.ca/english/Dli/dli.htm

Documentation

and Data…

So… going to your campus “Data Centre”

http://library.queensu.ca/webdoc/ssdc/key.htm

AGEGR5 less than or equal to 3

Results…

79.9 %

65.9 %

71 %

48 %

Canadian Social Trends vs. our cross-tab?

“An errata will be issued for the table appearing in CST because the table does not show percentages for those who used the Net in the last month but for those who used the Net in the last year.”

“The difference in the numbers is because I used the variable H19 while your client is using the variable H20. H19 asked respondents who had used the Internet in the last year, if they had ever used the Internet to connect to an ONLINE CHAT SERVICE. H20 asked respondents how often they used the Internet to connect to an online chat service in the last month.”

Reply from Statistics Canada…
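The reply explains why two cross-tabs of the same file can disagree: they are built on different variables. The sketch below shows the mechanics with a handful of invented records. The variable names AGEGR5, H19, and H20 come from the slides, but the codings used here (1 = yes, 2 = no, with H20 collapsed to yes/no) and the helper `percent_yes` are assumptions for illustration, not the actual General Social Survey codebook.

```python
from collections import defaultdict

# Invented records. Assumed codings: AGEGR5 is an age-group code,
# H19 and H20 are 1 = yes, 2 = no (H20 recoded from a frequency question).
records = [
    {"AGEGR5": 1, "H19": 1, "H20": 1},
    {"AGEGR5": 1, "H19": 1, "H20": 1},
    {"AGEGR5": 1, "H19": 1, "H20": 2},
    {"AGEGR5": 1, "H19": 2, "H20": 2},
    {"AGEGR5": 2, "H19": 1, "H20": 1},
    {"AGEGR5": 2, "H19": 1, "H20": 2},
    {"AGEGR5": 2, "H19": 2, "H20": 2},
    {"AGEGR5": 2, "H19": 2, "H20": 2},
    {"AGEGR5": 3, "H19": 1, "H20": 1},
    {"AGEGR5": 3, "H19": 2, "H20": 2},
    {"AGEGR5": 4, "H19": 1, "H20": 1},  # excluded by the age filter
    {"AGEGR5": 4, "H19": 2, "H20": 2},  # excluded by the age filter
]


def percent_yes(rows, var, max_agegr5=3, yes_code=1):
    """Cross-tab AGEGR5 by `var`, returning the percent answering yes.

    Only rows with AGEGR5 <= max_agegr5 are kept, mirroring the
    "AGEGR5 less than or equal to 3" selection in the slides.
    """
    totals = defaultdict(int)
    yes = defaultdict(int)
    for r in rows:
        if r["AGEGR5"] > max_agegr5:
            continue
        totals[r["AGEGR5"]] += 1
        if r[var] == yes_code:
            yes[r["AGEGR5"]] += 1
    return {group: 100 * yes[group] / totals[group] for group in sorted(totals)}


by_h19 = percent_yes(records, "H19")  # ever used chat (last-year users)
by_h20 = percent_yes(records, "H20")  # used chat in the last month
print("H19:", by_h19)
print("H20:", by_h20)
```

The same respondents yield different percentages depending on which question is tabulated, which is the kind of mismatch the Statistics Canada reply describes.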

So… let’s try again with H19

So we need…

The numbers match!

AND… you’ll note the table now says “last 12 months”

Original Table…

Revised…

Dec 2001

Jan 2002

So…

We can use survey files to verify published results.

But…

We can also use survey files to expand on published results and explore new avenues of research.

For example…

1. What is the influence of gender, education, or income on Internet use?

2. Are there differences between provinces? Between URBAN and RURAL dwellers?

3. Or any number of other “dimensions”… any question asked in the survey.

Aggregate Data: Postcard (“Fixed”)

Survey Data: Camera (“Flexible”)

Sources of Aggregate Data…

• print

– e.g., Canada Year Book, STC print publications

• CD-ROM

– e.g., 1996 Census Profiles, LFHR, other DSP products

• Web-based

– The Daily

– “Canadian Statistics”

– PDF versions of print publications

– Beyond 20/20 Files – multidimensional tables…

– CANSIM / E-Stat – time series

Beyond 20/20: what is it?

• Used to display multidimensional data, i.e., more than 3 dimensions or characteristics at once

– e.g., age, sex (usually 3!), geography, date, etc.

• allows user to customize the display of the data

• very useful for aggregate data, less so for microdata

Beyond 20/20: what is it used for/in?

• used in an increasing number of STC products

– many CD-ROM DSP products

  • e.g., LFHR, ITC, Profiles, Nation Series, Dimensions, etc.

– one of the available formats on E-Stat

CANSIM

• acronym for CANadian Socio-Economic Information Management System

• time-series data

• available

– direct from STC ($)

– via E-Stat (free to registered institutions)

– via DLI (from UofT)

CANSIM II via E-Stat

Dealing with data really isn’t that hard ...

Don’t be afraid to ask for help!