Towards a Data Network for Integrated Social Science Research
Micah Altman, Harvard University
Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Science
E: [email protected]
http://maltman.hmdc.harvard.edu/
[Presented at the DLF Meeting 2008]
This Talk
Why is Access to Social Science Data Important?
What are Challenges to Integrated Access?
Social Science and Cyberinfrastructure
  Google ++ (--?)
  Dataverse Network (DVN): Virtual Archiving
  Data Preservation Alliance for Social Sciences (Data-PASS): Replicated Institutional Preservation
  The Social Science Research Computing Environment (RCE): Social Science & Research Workflows
Conclusions
Micah Altman, Senior Research Scientist
Introduction Access Challenges Google++
DVN Data-PASS RCE Conclusions
[Soc. Sci. Data Networks, DLF 2008](Page 2)
Related Work
Articles
  M. Altman and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April). 2007.
  M. Altman, et al. “Data Preservation Alliance for the Social Sciences: A Model for Collaboration”, Proceedings of DigCCurr07, Chapel Hill. April 2007.
  G. King. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, 32, 2 (November, 2007): 173–199.
  M. Altman. “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. Forthcoming 2008.
Collaborators & Co-conspirators
  Margaret Adams, Ken Bollen, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Lois Timms-Ferrarra, Marc Maynard, Amy Pienta
Research Support
  Thanks to the Library of Congress (PA#NDP03-1), the National Institute on Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
What is Digital Social-Science Data?
DIGITAL
  Optical: DVD, CD
  Magnetic: Tapes, ‘Floppies’
  Paper: cards, tapes
SOCIAL SCIENCE
  Social: class, crime, social movements, culture, folklore, family
  Economic: wealth, prosperity, labor, business, equity
  Psychology: cognition, attitudes, stereotypes
  Politics: justice, democracy, public policy, public administration, international conflict
DATA
  Raw measurements
  Numeric tables
  Administrative records (& email)
  Video and audio interviews, transcripts (& blogs)
  Digital objects (web sites, interactive databases)
Data Access is the Key to Science
Science is not (only) about being scientific
Scientific progress requires community: competition and collaboration in the pursuit of common goals
Without access to the same materials, no community exists
… data is the nucleus of collaboration.
The value of an article that can’t be replicated: ?
Scholarly articles are summaries, not the actual research results
But: data access is spotty by field, and finding the data is still hard
Hard for journal editors to verify
If you find it, how do you know it’s the same?
Replication projects show: most published articles in social science cannot be replicated
… data is necessary for replication and verification
Data Access is a Key to Democracy
Statistics = state-istics
The state tax authority: counting people, estimating wealth
Reformers use data to assess the performance of the state
Science informs public policy continually
In modern democracy: the public needs a direct source of information
How Data Is Lost
Data Intentionally Discarded
  “It was just too long ago, I generally keep data for something like 10 years beyond the last time I do something with them.”
  “Destroyed, in accord with APA 5-year post-publication rule.”
Unintentional Hardware Problems
  “Some data were collected, but the data file was lost in a technical malfunction.”
Destroyed for Confidentiality Reasons
  “The material…was considered sensitive data. Institutional review boards.. required us to promise to destroy the data after a certain period of time...”
Acts of Nature
  “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”
Discarded or Lost in a Move
  “As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”
Obsolescence
  “Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”
Simply Lost
  “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”
Research by:
Challenges to Research and Policy
Legal Challenges
Technical Privacy Challenges
Data Deluge
New Forms of Research
Legal Requirements
[Figure: diagram of legal requirements affecting data access. Labels: Personal Information; Open access; Intellectual Property; Contract; Sponsor Interests; HIPAA; FERPA; 45 CFR 26; Invasion of Privacy; Defamation; FOIA; State FOI; Publisher Interests; DMCA; Copyright; Trademark; Patent; Trade Secret; Contracts; Licenses; Click-wrap agreements; Contributor Interests; Consumer Interest]
Technical Privacy Challenges
Some challenging findings…
  Large, sparse datasets can “leak” private information when correlated with external data, even when significantly sub-sampled, perturbed, etc. [Narayanan and Shmatikov 2008]
  Repeated release of perturbation-masked geospatial point data leaks increasing amounts of information. Combining with aggregation masking does not help [Zimmerman and Pavlik 2008]
  Possible to identify other relationships in networks if you can generate seemingly innocuous relationships in the same network [Backstrom et al. 2007]
  Pseudonymous communication can be linked through textual analysis [Tomkins et al. 2004]
  K-anonymized data still vulnerable if homogeneous, or if the attacker has enough background knowledge. L-diversity offered as replacement [Machanavajjhala et al. 2007]
Additional anonymization challenges for geospatial data
  Very fine grained location – versus multi-state aggregation mask required by HIPAA, and large social science surveys
  Background knowledge very likely
    Easy to integrate with other datasets; some data points may be directly observable
  Sequences of locations even more challenging
    May cross aggregation units; repetitive, temporally correlated; induces unique networks
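To make the k-anonymity point concrete, here is a minimal sketch that measures k (the smallest group sharing the same quasi-identifiers) and the simplest "distinct values" form of l-diversity for a table. The function name, field names, and records are hypothetical, invented for illustration; they do not come from the cited papers.

```python
from collections import defaultdict

def anonymity_report(rows, quasi_identifiers, sensitive):
    """Return (k, l) for a table of dict rows.

    k: size of the smallest group of rows sharing quasi-identifier values.
    l: fewest distinct sensitive values found inside any such group
       (the simple "distinct" variant of l-diversity).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l

# Hypothetical toy records, already generalized (zip truncated, age binned).
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diabetes"},
]

k, l = anonymity_report(records, ["zip", "age"], "diagnosis")
print(k, l)  # prints: 2 1
```

The table is 2-anonymous, yet the first group is homogeneous (l = 1): anyone known to be in it is known to have the flu, which is the weakness noted above.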
Management of Legal Risks
Embedding all sensitive data access in a digital library can greatly improve subject privacy:
  Authentication, vetting, and access control
  Standardized license terms governing analysis (derived from metadata and data characteristics)
  Models can be run on-line without access to raw data
  Monitoring and auditing of data use
  Limit sequence of analyses by a user, in some cases (for promising results, see [Dwork et al. 2006])
Licensing and Intellectual Property Protections
  Standard license terms and metadata
  Click-through agreements, vetting workflows
  Authentication, auditing, logging
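One way to picture "models run on-line without access to raw data" together with "limit sequence of analyses by a user" is a query server that adds Laplace noise to counts and charges each query against a per-user privacy budget. This is a rough sketch in the spirit of noise-calibration approaches, not a description of any deployed system; the class name, parameters, and data are assumptions made for illustration.

```python
import random

class PrivateCounter:
    """Answers count queries with Laplace noise under a total epsilon budget."""

    def __init__(self, rows, total_epsilon, seed=None):
        self._rows = rows                  # raw data never leaves this object
        self._remaining = total_epsilon    # hypothetical per-user budget
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for row in self._rows if predicate(row))
        # A count changes by at most 1 when one person is added or removed,
        # so the Laplace noise scale is 1 / epsilon.
        scale = 1.0 / epsilon
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)
        return true_count + noise

# Hypothetical usage: two queries fit the budget, a third is refused.
server = PrivateCounter([{"age": 30}, {"age": 70}], total_epsilon=1.0, seed=0)
first = server.noisy_count(lambda r: r["age"] > 65, epsilon=0.5)
second = server.noisy_count(lambda r: True, epsilon=0.5)
```

Smaller epsilon means more noise per answer but more answers before the budget runs out; the right trade-off depends on the data and is not addressed here.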
Long-Term Social Science Data Needs*…
Social science –> human activities and perceptions
Computational capacity of human brain: 10^14 – 10^19?
Future storage of a human history: 10^30 bytes/person?
Compare to 10^10 bytes – for a long high-res FMRI session
* Or, “what are you thinking?”
Social Science Data Deluge*…
Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated at 10’s of TB
“Ambient” data increasingly becoming subject of social science research
Annual data deluge (2002):
  Web (surface): 167 TB
  Radio: 3,500 TB
  Television: 69,000 TB
  Web (deep): 92,000 TB
  Email (originals): 441,000 TB
  Telephone: 18,000,000 TB
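A quick worked sum of the 2002 figures on this slide shows the scale gap. The 50 TB value standing in for "10's of TB" of archived holdings is an assumption chosen only to make the ratio computable.

```python
# Annual flows in terabytes, copied from the slide (2002 figures).
flows_tb = {
    "Web (surface)": 167,
    "Radio": 3_500,
    "Television": 69_000,
    "Web (deep)": 92_000,
    "Email (originals)": 441_000,
    "Telephone": 18_000_000,
}

total_tb = sum(flows_tb.values())
print(total_tb)  # prints: 18605667

# Assumption: 50 TB as a stand-in for the "10's of TB" held in archives.
archive_holdings_tb = 50
ratio = total_tb / archive_holdings_tb
print(f"ambient data is roughly {ratio:,.0f} times the archived holdings")
```

Under that assumption, one year of ambient data is several hundred thousand times larger than everything in the archives, and telephone traffic alone accounts for nearly all of it.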
* Or, “what are you thinking?”
Research Infrastructure Challenges
Social science challenges…
  Few definitive answers
  Complex conceptual primitives
  Complex theories of behavior
  Reliance on observational data
  Specification uncertainty
  Changing evidence base (blogs, video, continuously recorded behavioral data)
Some trends
  Compute-intensive inferential statistics
  Specification searches
  Sensitivity analyses
  Curse of dimensionality
  Data explosion
  Changing evidence base
  Agent-based models
Why Infrastructure for Data?
Accessibility:
  Most large data sets: in public archives
  Most data in published articles: not accessible, results not replicable without the original author
  Most data sets from federal grants: not publicly available
Problems even with professional archives:
  Data in different archives have different identifiers
  Archives change identifiers, links
  Changes to data are made; identifiers are reused or removed; old data are lost
Data sets are not like books
  Static data files (even if on the web): unreadable after a few years
  When storage methods change: some data sets are lost; others have altered content!
Why not a single centralized infrastructure?
  Single point of failure
  Data is heterogeneous in format, origin, size, effort needed to collect or analyze, IRB access rules, etc.
  Data producers want credit, control, and visibility
Requirements
  Recognition: for data producers, distributors, related publishers
  Rule-based Public Distribution
  Authorization: fulfill requirements the author originally met
  Validation: check that data exists, without authorization
  Persistence: decades from now. . . .
  Verification: meaning of data remains unchanged, even as formats and computer systems change
  Ease of Use: researchers are not archivists
  Standardize and Document Legal Protections: IRB, intellectual property
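The verification requirement (same meaning despite changed formats) can be illustrated with a format-independent fingerprint: normalize each number to a canonical text form, then hash. This is only a sketch of the general idea and is not the Universal Numeric Fingerprint algorithm; the function name and the choice of seven significant digits are assumptions.

```python
import hashlib

def numeric_fingerprint(values, digits=7):
    """Hash a column of numbers by value rather than by file representation.

    Each value is rounded to `digits` significant digits and written in a
    canonical exponential form, so 1, 1.0 and "1.00" all hash the same.
    """
    digest = hashlib.sha256()
    for value in values:
        canonical = format(float(value), f".{digits - 1}e")
        digest.update(canonical.encode("ascii") + b"\n")
    return digest.hexdigest()

# The same data as read from two hypothetical file formats.
from_binary_file = [1, 2.5, 0.1 + 0.2]   # carries floating-point round-off
from_text_file = ["1.0", "2.50", 0.3]    # parsed from text

same = numeric_fingerprint(from_binary_file) == numeric_fingerprint(from_text_file)
changed = numeric_fingerprint([1, 2.5, 0.31]) != numeric_fingerprint(from_text_file)
```

Here `same` and `changed` are both true: representation differences are ignored, while a real change in a value alters the fingerprint.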
Emerging Technologies* Social Science Data
Google++
Virtual-Hosted archives
Workflow systems
Data networks
* Plus Ça Change, Plus C'est Fou
Google++ (--?)
[Figure: images joined by + signs, ending in = ?]
* Can you count how many ’s are in this picture?
Privacy? Law? Preservation? Analysis?
Virtual Archiving: The Dataverse Network*
An Open-Source, Federated, Web 2.0 Data Network
Gateway to over 20,000 social science studies (world’s largest catalog)
Web Virtual Hosting 2.0 Service
Federated access to other networks
Unified access to major U.S. research data archives, government data
Open service – endowed hosting
Open source – GPL-Affero-3
Discovery Services
  Simple & fielded search
  Virtual collection browsing
Management
  Ingest
  Curation & review
  Virtual Hosting and administration
Metadata delivery
  Descriptive and structural
  Provenance (chain-of-custody metadata)
  Human and OAI interfaces
Preservation
  Standards based
  Reformatting
  Universal Numeric Fingerprints
Enhanced Delivery
  Replication
  Layered analysis services
To date: 132 Dataverses; 23,058 Studies; 576,387 Files (April 28, 2008)
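The OAI interface mentioned above means a harvester can request metadata with a plain HTTP query. The sketch below only builds an OAI-PMH ListRecords request URL; the verb and the metadataPrefix and set parameters are standard OAI-PMH, while the base URL and set name are placeholders, not a real Dataverse endpoint.

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # restrict harvesting to one named set
    return base_url + "?" + urlencode(params)

# Placeholder endpoint and set name, for illustration only.
url = oai_list_records_url("http://example.org/oai", set_spec="my_collection")
print(url)
# prints: http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=my_collection
```

A harvester would fetch this URL, parse the returned XML, and follow any resumption token to page through the remaining records.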
DVN Screenshots
http://dvn.iq.harvard.edu/
Some Dataverse Uses
Future Researchers: discovery; linking; forward citation; verification; analysis
Journals, for replication
Authors, for their own data
Teachers, in depth analysis
Sections of scholarly organizations, to organize existing data
Granting agencies
Research centers
Archives
Major Research Projects
Academic departments, universities, centers, libraries
DVN: Data Citations
Citations are a traditional formal mechanism to link together intellectual works
Citations glue together: Regulations, Publications, and Evidence
But, lack of rules for citing numeric data:
  No consistency in practice
  No fixed rules for copyeditors
  Sometimes in the list of references; sometimes a casual mention in the text
  Sometimes the archive is noted
  Sometimes a version number exists
  Sometimes the version number is listed (if it exists)
  Archive numbers are sometimes given, if they exist
  Sometimes the author is noted
  Date of creation is sometimes given
  URLs often given, rarely persist
  Dates of access: protect the researcher, do not help find the data
  The data may not be available publicly
  The data may no longer exist
A Unified Citation Standard for Quantitative Data
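The slide body did not survive extraction, so the following is only a sketch of what a consistently structured data citation could look like in code. It assumes a citation carries author, year, title, a persistent identifier, and a data fingerprint; the exact elements and punctuation of the proposed standard are defined in Altman and King (2007), not here, and every value below is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCitation:
    """A data citation with fixed, always-present elements (illustrative)."""
    authors: str
    year: int
    title: str
    identifier: str   # persistent identifier, so the link outlives any URL
    fingerprint: str  # lets a reader verify they hold the same data

    def render(self):
        return (f'{self.authors}, {self.year}, "{self.title}", '
                f'{self.identifier} {self.fingerprint}')

# Placeholder values, not a real study.
citation = DataCitation(
    authors="A. Author",
    year=2008,
    title="Example Study",
    identifier="hdl:EXAMPLE/1",
    fingerprint="UNF:EXAMPLE",
)
print(citation.render())
# prints: A. Author, 2008, "Example Study", hdl:EXAMPLE/1 UNF:EXAMPLE
```

Making every element a required field is the point of the design: a citation cannot be built with the author, identifier, or fingerprint "sometimes" missing.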
DVN: What’s New
Timeline
  Version 1.0 (release) Dec. 2007
  Version 1.1 March 2008
  Version 1.2 April 2008
New Stuff
  OAI enhancements: Export Custom sets (1.2); Import DC, FGDC (1.1) as well as DDI
  Data services: zip delivery of remote files (1.2); plain-text and tab-delimited exports (1.2)
  Java 6 Support (1.2)
  Workflow Support Enhancements
    Terms of use on login, upload, and download, configurable at network, dataverse, and study level (1.1, 1.2)
    Enhanced workflows for account requests, password recovery, non-privileged (“drop box”) submissions, submissions review (1.1, 1.2)
  Network Admin UI Enhancements
    JHove validation of individual studies (1.2)
    Batch ingest (1.2)
  Numerous other performance, end-user, curator, and network UI enhancements
Future: 2.0 (summer)
  Data Services: save analyses to R, additional formats
  GUI for assigning geographic bounding box for study
  Support harvesting of DVN through LOCKSS
  Export multiple citation formats
  And many more features scheduled, including Open Journal Integration, GenePattern workflow integration
See: http://thedata.org/software/releases
DataPASS
Collaboration for Preservation
Partnership Agreements
  Agreement to establish good practice
  Preservation copies of data collected
  Transfer Protocol: in case of archival failure
Cooperating Operations
  Central database of leads for acquisition
  Development of shared procedures
  Review of acquisitions
Joint “Not-bad” practices
  Identification & selection
  Metadata
  Security
  Confidentiality
Shared Catalog
  Unified Discovery
  Content exchange
  Layered Services
"Nothing new that is really interesting comes without collaboration" -- James Watson
Data Rescued Examples
U.S. Information Agency Surveys
  Directly informed U.S. foreign policy through surveys of foreign public opinion
  Previously, only surveys from 1970-1990 were held in the national archives
  Collaboration between NARA and Roper to create a much more complete series spanning 1950-1990
  Surveys conducted in Europe, Latin America, and Asian countries; subjects include nuclear arms control
  Recent subjects include US-Soviet relations, the US strike on Libya, the Soviet invasion of Afghanistan, economic matters, terrorism, economic summits, arms control, drug trafficking, democratization, and conflicts in El Salvador and Nicaragua
Longitudinal Study of Personality Development, by Jack and Jeanne Humphrey Block
  The most intensive study of human personality development in existence
  Thirty year longitudinal study
  Mixed methods – quantitative, audio, video
  More than 100 instruments, and 1000’s of measures (variables)
  Resulted in more than 100 publications
  (Also shows how whiny kids are more likely to grow up to be conservatives.)
National Network of State Polls
  Diverse membership of 50 members in 38 states
  Covers a tremendous range of local and national issues
  Data imminently at risk
Selected Topics & Sponsors
Topics: political activity, political activism, voting behavior, protest activity, voter registration, fundraising, political alienation, relationship to the Black community, feminism, racial identity, attitudes toward abortion, attitudes toward federal programs; television viewing habits, effects of having children on the marriage, giving too much/little independence, discipline, overscheduling, overprotecting, measuring levels of success in teaching values, self-control, good citizenship, good money habits, religion, worries that parents have of the future facing their children; problems facing parents and children, from drugs, sex, and violence to the lack of various family and religious values; daycare, mothers working, childrearing, taxes, government spending, morals, children’s issues, economy, jobs, education, crime, health care, social security, local school administration, standardized testing, impact of poor scores on teachers, higher academic standards needed, too much/little homework, summer school, teachers, administrators, quality of academics, discipline matters, class size, level of science and math skills taught, Shakespeare, life skills, athletics, citizenship; role of the US in the world and assessing US performance, terrorism, war in Iraq, respondent-identified level of understanding of foreign affairs, US and foreign aid, assisting emerging democracies, enhancing national security, image of the US abroad; seriousness of welfare problems (abuse, fraud, generational, etc.); assessing list of remedies (limit duration, require job training, provide day care, unannounced visits, business tax breaks for hiring recipients, penalize recipients who have more children, etc.); profiling welfare recipients (e.g. more likely to be better/worse parents, lazy or hardworking, from troubled families); defining the American Ideal, how to teach kids what it means to be American, national identity, appreciation of freedoms in the US, importance of voting, ashamed of nation's history of racism, job US does in teaching immigrant children, bilingualism, fly an American flag; most about the meaning of the rights the Constitution guarantees, assessing the level of appreciation of those rights in the US and how it is perceived to the international community; aging; Money Managers; union organizations, employers, and labor market institutions; tort law reforms; crime and urbanization; law and social control; natural disasters; awareness of self
Sponsors: NSF, NIH, The Danforth Foundation, The Ford Foundation, The David and Lucile Packard Foundation, Ewing Marion Kauffman Foundation, State Farm Insurance, Ronald McDonald House Charities, Advertising Council, American Federation of Teachers, the Annenberg Institute, the George Gund Foundation, the National School Boards Association, U.S. Department of Education, GE Foundation, Nellie Mae Education Foundation, Wallace Foundation, Bill & Melinda Gates Foundation, Pew Charitable Trusts, National Constitution Center, Alliance for Aging Research, American Federation for Aging Research, the MacArthur Foundation, NIMH
Replication as Institutional Insurance
Schema driven: capture inter-archival preservation commitments
Asymmetric: resource commitments proportional to holdings
Versioned: versioned data and citations
Integration: LOCKSS + Archival Replication Schema + DVN technology + archival workflows
Data-PASS Syndicated Storage Project
External Causes of Preservation Failure
  Third party attacks
  Institutional funding
  Change in legal regimes
Quis custodiet ipsos custodes?
  Unintentional curatorial modification
  Loss of institutional knowledge & skills
  Intentional removal
  Change in institutional mission
Workflow Systems*
Emerging tools for integration of research process in natural sciences
Orchestrate Data Collection, Transformation, Analysis
Examples: Taverna, Kepler, GenePattern, VisTrails
Most are science and grid-oriented
Address different parts of scholarly work lifecycle
Not focused on social science tasks
* Or “life on the grid”
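As a toy picture of what "orchestrate collection, transformation, analysis" means, the sketch below runs named steps in order and records which steps produced the result. It does not use or imitate the API of Taverna, Kepler, GenePattern, or VisTrails; the function, step names, and data are invented for illustration.

```python
def run_workflow(steps, data=None):
    """Run (name, function) steps in order; return the result and a provenance log."""
    provenance = []
    for name, func in steps:
        data = func(data)        # each step consumes the previous step's output
        provenance.append(name)  # record what was done, in order
    return data, provenance

# Hypothetical three-stage pipeline over made-up survey responses.
steps = [
    ("collect", lambda _: [" 3", "1 ", "2"]),
    ("transform", lambda raw: [int(s.strip()) for s in raw]),
    ("analyze", lambda xs: sum(xs) / len(xs)),
]

result, log = run_workflow(steps)
print(result, log)  # prints: 2.0 ['collect', 'transform', 'analyze']
```

Real workflow systems add scheduling, distributed execution, and much richer provenance; the recorded step list is the part that matters for replication.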
Intersection of DL and Workflows
GenePattern
  Genomics workflow system
  Supports construction of complex reproducible data analysis pipelines
  Targeted to local operations, but can make use of some job queueing systems (LSF, SGE)
  http://www.broad.mit.edu/cancer/software/genepattern/
Integration project
  Extends coverage of total research lifecycle
  DVN will store GenePattern analyses as they evolve
  When analyses are published, dissemination, preservation and reuse should be seamless
  Funded project in early planning stage
New Social Science From Social Science “Research Computing Environment”
Project
  Assess need for high performance computing among social scientists at Harvard
  Prototype interfaces to make grid computing usable by social scientists
Examples
  Harvesting and analysis of blogs for virtual political opinion surveys
  Continuous collection of CSPAN, real-time subject coding, continuous dissemination
  Cell phone data: movement, proximity to others, social network analysis
  Participative goals-based redistricting
  Agent-based models of emerging institutions
  FMRI analyses of reaction to political and social scenarios
Modal Features**
  Analyses emerge through exploration and interactions
  Data collection from non-experimental, non-instrumental sources
  Increasing scale of data
  Compute limited
  Data confidentiality
  High-level analysis tools
  Remote collaboration is part of projects
** Meta-features of social science: messy data + an abundance of plausible models
Mind the Gaps
No tool covers entire scholarly research lifecycle
Most tools immature
Poor integration across most tools
Many tools for hard science do not meet social science needs for non-experimental messy data (“strange sensors”), confidentiality, complex inferential methods …
Decoupling of dissemination, formal publication, citation, peer-review
No tools integrate comprehensive, standard, flexible control over privacy, intellectual property
[Figure: tools mapped against stages of the scholarly research lifecycle]
Lifecycle stages: design, publishing, dissemination, preservation, reuse, collection, processing, integration, analysis
Tools: cati / capi; sweave / statdocs; citations / identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; workflow systems
For More Information
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
Contact me: http://maltman.hmdc.harvard.edu/ <[email protected]>