ICPSR Data Exploration Tools

Post on 11-Nov-2014


Part I of a workshop conducted by ICPSR. This deck describes data exploration tools.


ICPSR AT 50: Facilitating Research and Data Sharing

Part I: Data Exploration
IASSIST, Vancouver, BC
May 31, 2011

Welcome to Vancouver! Our Agenda

• Data Exploration
– A Continuing Quest to Ease Your Search
– Social Science Variables Database
– Bibliography of Data-related Literature

• Data Sharing
– 2010 US Census Data
– Public Data Collections

• Data Management
– Data Management Plans
– Computing & Data Sharing in Secure Environments
– Managing Restricted Contracts

Managing the Clock

• Intro and Data Exploration (9:30-10:30)
– Break

• Data Sharing (10:45-11:30)
– Break

• Data Management (11:45-12:30)
– Escape!

Disclaimer: Times are approximate!

What is ICPSR? - Then and Now -

• One of the world’s oldest and largest social science data archives, est. 1962

• Data distributed on punch cards, then reel-to-reel tape, now:
– Data available on demand
– Over 7,000 studies with over 65,000 data sets

• Membership organization among 21 universities, now:
– Currently about 700 members world-wide
– Federal funding of public collections

What We Do – It’s About Data!

• Seek research data and pertinent documents from researchers (PIs, research agencies, government)

• Process and preserve the data and documents

• Disseminate data

• Provide education, training, & instructional resources

Why People Use ICPSR

• Write articles, papers, or theses using real research data

• Conduct secondary research to support findings of current research or to generate new findings

• Use as intro material in grant proposals

• Preserve/disseminate primary research data
– Fulfill data management plan (grant) requirements

• Study or teach quantitative methods

Data Exploration

The Challenge – Hoards of Data & Metadata

How does one make sense of:

• 7,000 studies
• 65,000 datasets
• 550,000 files
• Millions of variables
• 60,000 bibliographic citations

Data Exploration - Integrated Search -

Better Search for Better Results

[Diagram: Search Results, with three components: docs, subjects, PIs, etc.; SSVD (the variables); data-related biblio]

Integrating ICPSR’s Search
“Sponsored by SOLR/Lucene”

• In 2009, an improved search engine
• Later, construction of full-text search
• Faceted search to narrow large result sets

Reviewing the Study Home Page

The Search Continues: Automatic Search Updates

• Receive automatic updates on the study or series

• And updates on your query

Data Exploration - The Social Science Variables Database -


The Social Science Variables Database (SSVD)

Sanda Ionescu, Documentation Specialist

sandai@umich.edu

The Social Science Variables Database at ICPSR

• Enables ICPSR users to search variables across datasets
• Assists in:
– Data discovery
– Comparison / harmonization projects
– Data harvesting
– Data analysis
– Question mining for designing new research

The Social Science Variables Database at ICPSR

Tool for teaching
• Research Methods:
– Concept operationalization
– Effect of question wording, context, and answer categories on variable distributions
• Substantive classes:
– Cultural / social changes reflected in different question wordings, or elicited answers (longitudinal or time series data)

The Social Science Variables Database at ICPSR

• Officially launched Spring 2009.
• Pre-launch: two to three years’ preparation period
– Gather variable-level documentation; apply/refine selection criteria, quality checks
– Build database to host variable descriptions
– Initial upload: 3,500 files describing data from about 1,300 studies.

The Social Science Variables Database at ICPSR

• Variables documented using the Data Documentation Initiative (DDI) specification
• DDI: a standard for documenting social science data, written in XML
– Easy to parse / process
– Allows fine-grained searches
– Flexible display in a variety of formats
– Highly shareable, promotes interoperability
– Ideal archival format (ASCII, not software dependent)
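As a rough illustration of the "easy to parse" point, the sketch below reads one variable description with Python's standard library. The fragment is hand-written and simplified: the element names (var, labl, qstn, qstnLit, catgry, catValu) are modeled on DDI Codebook markup, but namespaces and most attributes are left out, and the variable itself is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modeled on DDI Codebook variable markup.
DDI_FRAGMENT = """
<dataDscr>
  <var name="V101">
    <labl>Voted in last election</labl>
    <qstn><qstnLit>Did you vote in the last general election?</qstnLit></qstn>
    <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
    <catgry><catValu>2</catValu><labl>No</labl></catgry>
  </var>
</dataDscr>
"""

def read_variables(xml_text):
    """Return one dict per <var>, holding fields a variable search might index."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("var"):
        variables.append({
            "name": var.get("name"),
            # findtext("labl") only looks at direct children, so it picks up
            # the variable label and not the category labels.
            "label": var.findtext("labl", default=""),
            "question": var.findtext("qstn/qstnLit", default=""),
            "categories": {
                cat.findtext("catValu"): cat.findtext("labl")
                for cat in var.findall("catgry")
            },
        })
    return variables

variables = read_variables(DDI_FRAGMENT)
print(variables[0]["name"], "-", variables[0]["label"])
```

Because each piece of the description sits in its own element, the label, question text, and category labels can be indexed or displayed separately, which is what makes fine-grained searching possible.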

The Social Science Variables Database at ICPSR

DDI variable descriptions
• Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages

• Include question text if available in the source documentation

The Social Science Variables Database at ICPSR

Relational database
• Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)

• Compatible with both DDI 2 and 3 (input and output)

• Oracle Text searches used in Beta-testing phase
– Slow retrieval
– Limited to 500 results

The Social Science Variables Database at ICPSR

• Search: switched to Solr/Lucene in autumn 2009
– Easy indexing
– Faster searches, unlimited hits
– Facets/filters imported from Study Descriptions (also DDI compatible): Series, Study, Time Period, Geography

• Storage: XML files are now indexed and searched directly; they are no longer uploaded to the database
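To make the facet idea concrete, here is a minimal sketch of how a faceted, filtered request to a Solr index could be assembled. The request parameters (q, fq, facet, facet.field, wt) are standard Solr query parameters; the endpoint address and the field names are assumptions made for the example, since the slides do not describe the actual index schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and field names, chosen only for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/variables/select"
FACET_FIELDS = ["series", "study", "time_period", "geography"]

def build_search_url(query, filters=None):
    """Build a Solr select URL that asks for facet counts and applies filters."""
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per facet the interface should offer.
    params += [("facet.field", field) for field in FACET_FIELDS]
    # Each selected facet value becomes a filter query that narrows the results.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_search_url("trust in government", {"geography": "United States"})
print(url)
```

The response to such a request would carry both the matching documents and a count per facet value, which an interface can show as clickable filters.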

The Social Science Variables Database at ICPSR

• Current content:
– 2,602 studies (48 percent of ICPSR holdings with data and setups)
– 6,493 datasets
– Approx. 1.7 million variables

• Continues to grow by including:
– All new releases, if suitable
– Retrofits as made available by small-scale projects

The Social Science Variables Database at ICPSR

• DDI fields searched:
– Variable name
– Variable label
– Question text sequence
– Descriptive text
– Category label

• Variable notes – not indexed / searched, but they are displayed

The Social Science Variables Database at ICPSR

The Public Search Features:
• Stemming
• “Phrase searches”
• Fielded searches (terms are combined with a default Boolean “and”; the operators “or” and “not” are ignored)
– Variable label
– Question text
– Value labels

http://www.icpsr.umich.edu/icpsrweb/ICPSR/
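The behavior listed above (stemming, fielded searches, an implicit "and") can be imitated in a few lines. This is only a toy sketch: the suffix rules are far cruder than the stemmer a Solr/Lucene index would use, phrase searching is left out, and the variables and field names are invented for the example.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # toy rules, not a real stemming algorithm
IGNORED = {"and", "or", "not"}       # "and" is implicit; "or" and "not" are dropped

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(text):
    return {toy_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())}

def fielded_search(variables, field, query):
    """Return variables whose chosen field contains every query term (by stem)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    wanted = {toy_stem(w) for w in words if w not in IGNORED}
    return [v for v in variables if wanted <= stems(v[field])]

# Hypothetical variables, invented for the example.
VARIABLES = [
    {"name": "V1", "label": "Voted in last election",
     "question": "Did you vote in the elections held last year?"},
    {"name": "V2", "label": "Party identification",
     "question": "Which party do you usually support?"},
]

hits = fielded_search(VARIABLES, "label", "voting elections")
print([v["name"] for v in hits])
```

Note that a query such as "voting or party" returns nothing here: once "or" is dropped, both remaining terms must appear in the same field, and no label contains both.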

The Social Science Variables Database at ICPSR

Projected improvements/additional features:
• Enable selection of multiple filters
• Enable users to toggle on/off stemming
• Enable searching “within” results (adding new query to a result set)
• Show / hide response categories on result page
• Create interface for selecting results and exporting selection in a particular format
• From individual variable display, enable navigation to previous or next variable (to show context)

The Social Science Variables Database at ICPSR

Usage data (source: Google Analytics)

Data Exploration - The Bibliography of Data-related Literature -


ICPSR Bibliography of Data-related Literature

Elizabeth Moss, Assistant Librarian, ICPSR

eammoss@umich.edu

ICPSR Bibliography of Data-related Literature

What we will cover:

• What it is and how to access it

• How and why we developed it

• Main features

• How instructors find it useful

• You are a good source

What it is and how to access it


It’s really a searchable database . . .

. . . containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR

. . . that can generate study bibliographies associating each study with the literature about it

. . . now included in the integrated search on the ICPSR Web site

How and why we developed it

• Brainchild of Richard Rockwell, former ICPSR director

• Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it

• ICPSR membership and federally-funded archives continue to support it

How and why we developed it

What’s in the collection?

• Resources using data in the ICPSR holdings as the primary data source

• Resources using ICPSR data in a comparison with the primary dataset investigated

• Resources "about" an ICPSR dataset or study series.

How and why we developed it

http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/methodology.jsp


How and why we developed it

Demonstrate impact of data for funding

Main features

http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/index.jsp

Main features

Search features:
• Searches the full text of the elements of citations, e.g., title, author, journal
• Boolean “and” is assumed, and phrase searching uses quotation marks:
adolescents and “mental health” (this works)
• No Boolean “or” or “not”:
Havens or “Havens, Jennifer” (this doesn’t work; “or” becomes “and”)
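The search rules on this slide can be sketched as follows. The code is an illustration, not the Bibliography's actual implementation: it assumes operator words are simply dropped so that the remaining terms are all required, it matches by plain substring, and the citations are invented placeholders.

```python
import re

OPERATOR_WORDS = {"and", "or", "not"}

def parse_query(query):
    """Split a query into lowercase terms; quoted text stays together as a phrase."""
    terms = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        term = (phrase or word).lower()
        # Bare operator words are dropped, so every remaining term is required.
        if phrase or term not in OPERATOR_WORDS:
            terms.append(term)
    return terms

def search_citations(citations, query):
    """Return citations whose author, title, and journal text contain every term."""
    terms = parse_query(query)
    hits = []
    for citation in citations:
        text = " ".join(citation[key] for key in ("author", "title", "journal")).lower()
        if all(term in text for term in terms):
            hits.append(citation)
    return hits

# Placeholder citations, invented for the example.
CITATIONS = [
    {"author": "Doe, Jane", "title": "Adolescents and mental health services",
     "journal": "Example Journal"},
    {"author": "Doe, John", "title": "Household income over time",
     "journal": "Example Review"},
]

print(parse_query('adolescents and "mental health"'))
print([c["author"] for c in search_citations(CITATIONS, 'Doe or "Doe, Jane"')])
```

The second query mirrors the slide's example: a true "or" would return both authors, but because every term is required, only the citation matching the full phrase comes back.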

Main features

Linking from the search results:

• To full text for journals
– Directly via DOI
– Using OpenURL via Google Scholar and WorldCat

• To full text of reports and other resources via PDF or HTML links

• To the detailed, fielded publication record
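A minimal sketch of how such full-text links can be built from a citation record is shown below. The DOI resolver address (https://doi.org/) and the OpenURL key names follow the published conventions; the link resolver address, the record fields, and the sample DOI are placeholders assumed for the example.

```python
from urllib.parse import urlencode, quote

# Hypothetical resolver address; a library would substitute its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

def doi_link(doi):
    """Link straight to the publisher copy through the DOI resolver."""
    return "https://doi.org/" + quote(doi, safe="/")

def openurl_link(citation):
    """Describe a journal article in OpenURL key/value form for a link resolver."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": citation["title"],
        "rft.jtitle": citation["journal"],
        "rft.date": citation["year"],
        "rft.aulast": citation["author_last"],
    }
    return RESOLVER_BASE + "?" + urlencode(params)

def full_text_link(citation):
    """Prefer the DOI when the record has one; otherwise fall back to OpenURL."""
    if citation.get("doi"):
        return doi_link(citation["doi"])
    return openurl_link(citation)

# Placeholder record, invented for the example.
record = {"title": "An example article", "journal": "Example Journal",
          "year": "2010", "author_last": "Doe", "doi": "10.1234/example.5678"}
print(full_text_link(record))
```

The design choice here is simply an order of preference: a DOI points at one known copy, while an OpenURL lets the reader's own library decide where the full text is available.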

Main features

Internal and external linking from the detailed citation record:

• To the related study(s)

• To other citation records of publications by the same author

• To other articles in the same journal (but outside the search)

• To full text options

Main features

Exporting citations:

• From search results: Up to 500 records in RIS format, exports directly to EndNote

• From individual detailed record: Export the citation in RIS format
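For readers unfamiliar with RIS, the sketch below writes journal citations in that tagged format and applies the 500-record cap mentioned above. The tags used (TY, AU, TI, JO, PY, ER) are standard RIS tags; the record layout and the sample citation are assumptions made for the example, not the Bibliography's internal structure.

```python
MAX_EXPORT = 500  # export limit quoted on the slide

def to_ris(citation):
    """Format one journal citation as an RIS record."""
    lines = ["TY  - JOUR"]  # record type: journal article
    lines += [f"AU  - {author}" for author in citation["authors"]]
    lines.append(f"TI  - {citation['title']}")
    lines.append(f"JO  - {citation['journal']}")
    lines.append(f"PY  - {citation['year']}")
    lines.append("ER  - ")  # end of record
    return "\n".join(lines)

def export_ris(citations):
    """Join at most MAX_EXPORT records into one RIS file body."""
    return "\n".join(to_ris(c) for c in citations[:MAX_EXPORT]) + "\n"

# Placeholder citation, invented for the example.
sample = {"authors": ["Doe, Jane", "Doe, John"], "title": "An example article",
          "journal": "Example Journal", "year": 2010}
print(export_ris([sample]))
```

Each line is a two-letter tag, two spaces, a hyphen, and the value, which is why reference managers such as EndNote can import the file without any custom mapping.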

Main features

Filtering and sorting features:

• Filter search results by author, pub type, journal, pub. year

• Coming soon—pub year range filter (similar to that in study search)

• Sort search results by relevance, pub date (oldest or newest), title, recency

Browse from main Bibliography page:

• By author name (no authority control)
Juster, F. (2)
Juster, F. Thomas (22)
Juster, F.T. (1)

• By journal title name (authority control)
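The author example shows what the absence of authority control means in practice: one person appears under three spellings with separate counts. The sketch below groups the slide's three entries under a crude key (surname plus first initial). The key is an assumption for illustration only; it can wrongly merge different people who share a surname and initial, which is why real authority control relies on curated name records.

```python
from collections import Counter

# Name variants and counts as listed on the slide.
raw_counts = Counter({"Juster, F.": 2, "Juster, F. Thomas": 22, "Juster, F.T.": 1})

def author_key(name):
    """Reduce 'Surname, Given names' to 'surname, first initial' (a crude heuristic)."""
    surname, _, given = name.partition(",")
    initials = [ch for ch in given if ch.isalpha()]
    first_initial = initials[0].lower() if initials else ""
    return f"{surname.strip().lower()}, {first_initial}"

merged = Counter()
for name, count in raw_counts.items():
    merged[author_key(name)] += count

print(merged)
```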


Main features

Link from individual study pages:
• to the dynamically-generated study bibliography
• to series collections, when applicable

Link from series description pages:
• to series bibliographies from the series page

How instructors find it useful

Senior seminar classes

• Profs choose dataset and ask students to think of a research question

• Bibliography allows students to see the wide variety of topics available for a single dataset

How instructors find it useful

Research proposal design

• Good for finding studies that examine what a student wants to propose

• Does the data they would want already exist?

• If so, are there survey questions they could replicate?

• Authors’ suggestions for future research

How instructors find it useful

Undergraduate introduction

• Research papers—Good starting point for finding literature on a particular topic

• Finding data—Starting with the Bibliography can be more intuitive

How instructors find it useful

From the ICPSR blog:

“I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”

How instructors find it useful

From the ICPSR blog:

“As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”

The EDRL – an Online Module

How instructors find it useful

You are a good source

Get credit for your work AND let us know about that of others:

• Send a citation via the Web form

• Or send them in an email to bibliography@icpsr.umich.edu

• If you have a large library, we can take EndNote XML imports, or even RIS-format imports


You are a good source

A final request:

• When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data

• Encourage others to do it, too

• Here’s how and why

Let’s Take a Break
Return at 10:45