Documenting Data Transformations
Transcript of Documenting Data Transformations
“Provenance and Social Science Data”, 15 March 2017
Documenting Data Transformations
George Alter, University of Michigan
Why Metadata?
• Data are useless without Metadata – “data about data”
• Metadata should:
  – Include all information about data creation
  – Describe transformations to variables
  – Be easy to create
• Our goal: Automated capture of metadata
A few words about ICPSR
• World’s largest archive of social science data
• Consortium established 1962
• 760+ member institutions around the world
• Founding member and home office for the DDI Alliance
Powered by DDI Metadata
ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML
Codebooks (pdf and online) are rendered from the DDI.
Searchable database of 4.5M variables
Click here for online codebook
Online codebook shows variable in context of dataset
Link to online crosstab tool
What question was asked?
How was the question coded?
Link to online graph tool
Searchable database of 4.5M variables
Click here for variable comparison
Variable comparison display
Click here for online codebook
Search for datasets with 3 desired variables
Check boxes for variable comparison
Crosswalk for American National Election Study (ANES) and General Social Survey (GSS)
Columns link to 70 datasets
134 tags in 8 lists
Variable comparison display
Variables linked to online codebooks
Metadata for the American National Election Study
What question was asked?
Who answered this question?
How was the question coded?
How do we know who answered the question?
It’s in the pdf.
When data arrive at the archive…
• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.
How is research data created?
• Most surveys are conducted with computer assisted interview software (CAI)
  – CATI – Computer-assisted Telephone Interview
  – CAPI – Computer-assisted Personal Interview
  – CAWI – Computer Aided Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire
  – i.e. the program is the metadata
[Diagram: Computer Assisted Interviewing (CAI) produces the original data. A "CAI to DDI" conversion (Colectica, MQDS, others) produces the original metadata as DDI XML.]
We already have tools to convert CAI to machine-readable metadata.
[Diagram: the CAI workflow above, plus statistical packages (SPSS, SAS, Stata, R) whose command scripts turn the original data into revised data. The original metadata (DDI XML) is unchanged.]
What happens when a project modifies the data.
The modified data no longer match the metadata.
[Diagram: the workflow above, plus a "Stat Package to DDI" step that extracts metadata from the SPSS/SAS/Stata/R data file into a new DDI XML file (extracted metadata).]
Metadata are re-created after the data are transformed.
Transformations are documented by hand.
Statistics packages have limited metadata
• Variable names
• Variable labels
• Value labels
• No provenance
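As a rough illustration of the gap listed above, the sketch below models what can be recovered from a data file alone. The class, the function names, the example label, and the `provenance` field are all hypothetical; they are not part of DDI or of any statistics package.

```python
from dataclasses import dataclass, field


@dataclass
class VariableMetadata:
    # The three items a statistics package typically stores with a variable.
    name: str
    label: str = ""
    value_labels: dict = field(default_factory=dict)
    # Hypothetical extra field: the commands that created or changed the
    # variable. Data files do not carry this, so extraction leaves it empty.
    provenance: list = field(default_factory=list)


def extract_from_data_file(name, label, value_labels):
    """Mimic metadata extraction from a data file: provenance stays empty."""
    return VariableMetadata(name=name, label=label, value_labels=dict(value_labels))


def has_provenance(variable):
    """Report whether any transformation history is attached to the variable."""
    return len(variable.provenance) > 0


# Example values are invented for illustration only.
z = extract_from_data_file("Z", "Example derived variable", {8: "Low"})
print(has_provenance(z))

# Provenance has to come from somewhere else, for example the command script.
z.provenance.append("generate Z=8 if X<3")
print(has_provenance(z))
```

The point of the sketch is only that names, labels, and value labels survive in the file, while the history of a variable must be captured from another source.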
[Diagram: the workflow above with three new components. A Script Parser reads the command scripts (SPSS, SAS, Stata, R) and produces Standard Data Transformation Language (SDTL). An XML Updater uses the SDTL and the original metadata (DDI XML) to produce revised metadata (DDI XML) that matches the revised data.]
Automating the capture of transformation metadata.
Missing links that we will build.
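The script-parser step might look something like the following minimal sketch, which turns two shapes of Stata command into package-neutral records. It is an assumption-laden toy: the regular expression covers only `generate`/`replace ... if ...` lines, and the dictionary keys are invented for illustration, not the actual SDTL schema.

```python
import re

# Hypothetical pattern covering only: generate|replace VAR=VALUE if CONDITION
STATA_COMMAND = re.compile(r"^(generate|replace)\s+(\w+)\s*=\s*(\S+)\s+if\s+(.+)$")


def parse_stata_line(line):
    """Translate one supported Stata command into a neutral dictionary."""
    text = line.strip()
    match = STATA_COMMAND.match(text)
    if match is None:
        raise ValueError(f"unsupported command: {text!r}")
    verb, variable, value, condition = match.groups()
    return {
        # Invented operation names, not SDTL vocabulary.
        "operation": "create_variable" if verb == "generate" else "update_variable",
        "variable": variable,
        "value": value,
        "condition": condition.strip(),
        "source_language": "Stata",
        "source_text": text,
    }


def parse_stata_script(script):
    """Parse every non-empty line of a script, preserving command order."""
    return [parse_stata_line(line) for line in script.splitlines() if line.strip()]


script = """
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
"""

for record in parse_stata_script(script):
    print(record["operation"], record["variable"], record["value"], record["condition"])
```

Keeping the original command text next to the neutral description is a design choice of this sketch: it lets a later step attach the exact source line to the variable it affected.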
What statistics packages should be covered?
ICPSR Downloads by Format
Format           All downloads   Studies with all formats
Delimited text   43%             29%
SPSS             22%             24%
SAS              10%             12%
Stata            19%             23%
R                5%              12%
Excel            0%              1%
Other            0%              0%
Total            100%            100%
Number           378,007         154,663
Why do we need an SDTL?
Input Data: X = 2, 3, 4, -1 (Output Data not yet shown)
SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.
Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3
SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;
Why do we need an SDTL?
SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.
    Input X | Output X   Y   Z
    2       | 2              8
    3       | 3
    4       | 4          9
    -1      | -1
Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3
    Input X | Output X   Y   Z
    2       | 2              8
    3       | 3
    4       | 4          9
    -1      |            9
SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;
    Input X | Output X   Y   Z
    2       | 2          .   8
    3       | 3          .   .
    4       | 4          9   .
    -1      | .          .   8
What happens when a missing value is in a logical comparison?
• SPSS
  – Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”
• Stata
  – Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.
• SAS
  – Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.
Missing Values in Comparisons
SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.
    Input X | Output X   Y   Z
    2       | 2              8
    3       | 3
    4       | 4          9
    -1      | NULL
Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3
    Input X | Output X   Y   Z
    2       | 2              8
    3       | 3
    4       | 4          9
    -1      | ∞          9
SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;
    Input X | Output X   Y   Z
    2       | 2          .   8
    3       | 3          .   .
    4       | 4          9   .
    -1      | -∞         .   8
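The three behaviours can be imitated in a few lines of Python. This is a minimal sketch of the comparison rules as the slides describe them, not an emulation of the packages themselves. The function name and the `missing_rank` parameter are invented for illustration, and `None` stands for "missing" in the output.

```python
import math

INPUT_X = [2, 3, 4, -1]
MISSING_CODE = -1  # the value that all three scripts declare or recode as missing


def run(xs, missing_rank):
    """Apply 'Y=9 if X>3' and 'Z=8 if X<3' to each input value.

    missing_rank says how a missing X behaves inside a comparison:
      None       every comparison is treated as false (SPSS-like)
      math.inf   missing is larger than any number (Stata-like)
      -math.inf  missing is smaller than any number (SAS-like)
    Returns a list of (X, Y, Z) rows.
    """
    rows = []
    for x in xs:
        missing = x == MISSING_CODE
        rank = missing_rank if missing else x
        y = 9 if rank is not None and rank > 3 else None
        z = 8 if rank is not None and rank < 3 else None
        rows.append((None if missing else x, y, z))
    return rows


for package, rank in [("SPSS", None), ("Stata", math.inf), ("SAS", -math.inf)]:
    print(package, run(INPUT_X, rank))
```

Only the row with the missing value differs between the three runs: it receives neither Y nor Z under the SPSS-like rule, Y=9 under the Stata-like rule, and Z=8 under the SAS-like rule.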
Benefits of automated metadata capture
• Metadata will be better
  – All the information in the CAI can be included.
  – Variable transformations can be described
• Automation will lower costs
  – Metadata will not be discarded and re-created
• All metadata will be standardized and machine readable
  – Codebooks with rich information can be rendered at will
• If we make it easy and beneficial, researchers will use it.
Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)
Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan
Questions?
George Alter