A Data Model, Workflow, and Architecture for Integrating Data
David Massart, PhD
San Francisco – Feb. 12, 2015
Who Am I?
Outline
• Data model: Resources, facts, and actions
• Data acquisition workflow
• Data integration and curation
• Views
• Architecture
Resources, Facts, and Actions
• “Resources” are described with “facts” collected during data acquisition “actions”
Resource
• Anything of interest (e.g., product, customer, geographical area)
• Characterized by:
– A type
– An identity
Fact
• Basic property of resources
• Characterized by
– Property name (e.g., weight)
– Value (e.g., 155 pounds)
– Timestamp (e.g., 2015-02-12)
Data Acquisition Action
• Occurs when a tool is used to acquire facts about resources from a data source at a given time
• Characterized by
– Action identifier (e.g., #1)
– Timestamp (e.g., 2015-02-12 09:35:12)
– Tool (e.g., web crawler)
– Data source (e.g., http://census.gov)
• A data source can be a database, a human curator, a website, etc.
Action-Fact Fragment Data Model
{
  "action id": 1,
  "action timestamp": "2015-02-12 19:30:01",
  "tool": "zettadownloader",
  "data source": "http://api.census.gov/data/2012/acs5?get=B25082_001E,B25111_001E,NAME&for=zip+code+tabulation+area:*",
  "resource type": "area",
  "resource id": "94114",
  "fact property": "value",
  "fact value": "8508810400",
  "fact timestamp": "2012-07-01"
}
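As an illustration, the fragment above could be represented in code as an immutable record. This is a minimal sketch, not the author's implementation: the class name `Fragment`, its snake_case field names, and the `to_json` helper are assumptions made for the example, and the data source is shortened to the `http://census.gov` example used earlier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a stored fragment is never modified
class Fragment:
    # Action part: which acquisition produced this fragment
    action_id: int
    action_timestamp: str
    tool: str
    data_source: str
    # Fact part: what was learned about which resource
    resource_type: str
    resource_id: str
    fact_property: str
    fact_value: str
    fact_timestamp: str

    def to_json(self) -> str:
        # Map snake_case field names back to the space-separated keys
        return json.dumps({k.replace("_", " "): v for k, v in asdict(self).items()})


fragment = Fragment(
    action_id=1,
    action_timestamp="2015-02-12 19:30:01",
    tool="zettadownloader",
    data_source="http://census.gov",
    resource_type="area",
    resource_id="94114",
    fact_property="value",
    fact_value="8508810400",
    fact_timestamp="2012-07-01",
)
print(fragment.to_json())
```

Freezing the dataclass mirrors the idea that fragments are only ever created, never changed in place.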
Data Acquisition Workflow
Acquisition & Caching
• Acquire
– Obtain raw data from a data source*
– Turn it into JSON records, if needed
• Cache
– Store raw data with a timestamp
– Makes actions re-playable at any time
* data acquisition action
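The acquire-then-cache step can be sketched as follows. Everything here is an assumption for illustration: `RawCache` is an in-memory stand-in for a real raw-data store, `fake_fetch` replaces a real download, and the raw data is assumed to be simple comma-separated text.

```python
from datetime import datetime, timezone


class RawCache:
    """In-memory stand-in for a raw-data store (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def store(self, action_id, source, raw, fetched_at):
        self._entries[action_id] = {"source": source, "raw": raw, "timestamp": fetched_at}

    def load(self, action_id):
        return self._entries[action_id]


def to_records(raw):
    """Turn comma-separated raw text into JSON-compatible records (list of dicts)."""
    header, *rows = [line.split(",") for line in raw.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]


def acquire(action_id, source, fetch, cache):
    """Fetch raw data, cache it with a timestamp, and return records."""
    raw = fetch(source)
    fetched_at = datetime.now(timezone.utc).isoformat()
    cache.store(action_id, source, raw, fetched_at)
    return to_records(raw)


def replay(action_id, cache):
    """Rebuild the records of a past action from the cache, without refetching."""
    return to_records(cache.load(action_id)["raw"])


def fake_fetch(source):
    # Placeholder for a real download; the values are made up for the example
    return "zip,population\n94114,32100\n94113,15000\n"


cache = RawCache()
records = acquire(1, "http://census.gov", fake_fetch, cache)
print(replay(1, cache) == records)
```

Because the raw data is kept, the later workflow steps can be rerun on an old action even if the source has since changed.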
Identification
• Identify the resources described in records
• Can be easy (e.g., zip codes for geographical areas)
• Or very complex (e.g., bibliographical references)
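To make the contrast concrete, here is a rough sketch of both cases. The function names, the record fields (`zip`, `title`, `year`), and the matching heuristic for references are all assumptions; real bibliographic matching is far harder than this normalized-title key suggests.

```python
import re


def identify_area(record):
    # Easy case: the zip code itself is the resource identity
    return ("area", record["zip"])


def identify_reference(record):
    # Naive heuristic: lowercase the title, collapse punctuation and
    # whitespace, and combine it with the publication year
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return ("reference", f"{title}|{record['year']}")


a = identify_reference({"title": "A Data Model: Workflow!", "year": 2015})
b = identify_reference({"title": "a data model  workflow", "year": 2015})
print(identify_area({"zip": "94114"}), a == b)
```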
Normalization
• Replace properties and values found in records with identifiers from data dictionaries and controlled vocabularies
• Examples:
– Country name -> ISO 3166-1 (e.g., Belgium -> BE, or 056 in numeric form)
– Data source -> Data source ids
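A minimal sketch of normalization with lookup tables is shown below. The country codes are ISO 3166-1 numeric codes; the data source id and the record field names are hypothetical values chosen for the example.

```python
# ISO 3166-1 numeric codes, kept as strings to preserve leading zeros
COUNTRY_CODES = {"belgium": "056", "france": "250", "united states": "840"}

# Hypothetical data dictionary mapping data sources to internal ids
DATA_SOURCE_IDS = {"http://census.gov": 13}


def normalize(record):
    """Return a copy of the record with names replaced by identifiers."""
    out = dict(record)
    out["country"] = COUNTRY_CODES[record["country"].strip().lower()]
    out["data source"] = DATA_SOURCE_IDS[record["data source"]]
    return out


normalized = normalize({"country": " Belgium", "data source": "http://census.gov"})
print(normalized)
```

An unknown name raises `KeyError` here, which surfaces values missing from the vocabulary instead of letting them through unnormalized.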
Fragmentation
• Break records into actions and facts
• Store action-fact fragments
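Fragmentation could look like the sketch below, which emits one fragment per property of a record. The function signature, the `id_field` parameter, and the sample record are assumptions; the fragment keys follow the action-fact fragment example shown earlier.

```python
def fragment_record(record, action, resource_type, id_field, fact_timestamp):
    """Break one record into action-fact fragments, one per property."""
    fragments = []
    for prop, value in record.items():
        if prop == id_field:
            continue  # the identifier names the resource; it is not a fact
        fragments.append({
            **action,  # every fragment carries the full action description
            "resource type": resource_type,
            "resource id": record[id_field],
            "fact property": prop,
            "fact value": value,
            "fact timestamp": fact_timestamp,
        })
    return fragments


action = {
    "action id": 1,
    "action timestamp": "2015-02-12 19:30:01",
    "tool": "zettadownloader",
    "data source": "http://census.gov",
}
record = {"zip": "94114", "population": "32100", "households": "16000"}  # made-up values
fragments = fragment_record(record, action, "area", "zip", "2012-07-01")
print(len(fragments))
```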
Data Integration & Curation: Principles
• Data integration and data curation are data acquisition actions
• Allowed:
– Data creations only (i.e., data acquisition actions)
• Not allowed:
– Data deletions*
– Data updates
* Except for legal reasons
Data Integration
• Occurs at the Action-Fact-Fragment level
• Is required when inconsistencies are detected
• E.g., two or more fragments have different values for the same property of the same resource with the same timestamp
aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     X     #13     area   94114  population  32100   20120701
#2   20150125     Y     #6      area   94114  population  30100   20120701
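Detecting such an inconsistency amounts to grouping fragments by resource, property, and fact timestamp, then looking for groups with more than one distinct value. The sketch below assumes fragments are dicts keyed by shortened column names (`atimestamp` and `ftimestamp` stand for the action and fact timestamps); the function name is hypothetical.

```python
from collections import defaultdict

fragments = [
    {"aid": 1, "atimestamp": "20150124", "tool": "X", "source": 13, "rtype": "area",
     "rid": "94114", "fproperty": "population", "fvalue": 32100, "ftimestamp": "20120701"},
    {"aid": 2, "atimestamp": "20150125", "tool": "Y", "source": 6, "rtype": "area",
     "rid": "94114", "fproperty": "population", "fvalue": 30100, "ftimestamp": "20120701"},
]


def find_conflicts(fragments):
    """Map each (rtype, rid, fproperty, ftimestamp) key to its conflicting values."""
    groups = defaultdict(set)
    for f in fragments:
        key = (f["rtype"], f["rid"], f["fproperty"], f["ftimestamp"])
        groups[key].add(f["fvalue"])
    # Keep only keys for which more than one distinct value was recorded
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


conflicts = find_conflicts(fragments)
print(conflicts)
```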
Data Integration (cont.)
• Possible resolutions:
– Do nothing
– Select one of the existing values
– Derive a new value from existing ones
– Add the correct value
• Results in the addition of a new fragment
aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     X     #13     area   94114  population  32100   20120701
#2   20150125     Y     #6      area   94114  population  30100   20120701
#3   20150212     Z     #101    area   94114  population  31100   20120701
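One of the resolutions, deriving a new value from existing ones, can be sketched as below. Averaging is only one possible derivation and is an assumption of this example, as are the function name and the dict layout (`atimestamp` and `ftimestamp` stand for the action and fact timestamps). The existing fragments are left untouched; the result is recorded as a new fragment.

```python
existing = [
    {"aid": 1, "atimestamp": "20150124", "tool": "X", "source": 13, "rtype": "area",
     "rid": "94114", "fproperty": "population", "fvalue": 32100, "ftimestamp": "20120701"},
    {"aid": 2, "atimestamp": "20150125", "tool": "Y", "source": 6, "rtype": "area",
     "rid": "94114", "fproperty": "population", "fvalue": 30100, "ftimestamp": "20120701"},
]


def resolve_by_mean(fragments, key, action):
    """Return a new list with one extra fragment holding the mean of the conflicting values."""
    rtype, rid, prop, ftime = key
    values = [
        f["fvalue"] for f in fragments
        if (f["rtype"], f["rid"], f["fproperty"], f["ftimestamp"]) == key
    ]
    derived = {
        **action,  # integration is itself a data acquisition action
        "rtype": rtype, "rid": rid, "fproperty": prop,
        "fvalue": sum(values) // len(values),  # integer mean
        "ftimestamp": ftime,
    }
    return fragments + [derived]  # nothing is updated or deleted


integration_action = {"aid": 3, "atimestamp": "20150212", "tool": "Z", "source": 101}
resolved = resolve_by_mean(existing, ("area", "94114", "population", "20120701"), integration_action)
print(resolved[-1]["fvalue"])
```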
Data Curation
• Occurs at the Action-Fact-Fragment level
• Consists of adding, updating, or removing a fact about a resource by adding a new fragment
aid  atime-stamp  tool  source  rtype  rid    fproperty   fvalue  ftime-stamp
#1   20150124     W     #24     area   94113  population  15000   20120701
#2   20150125     W     #24     area   94113  population  17000   20120701
#3   20150126     W     #24     area   94113  population  -       20120701
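Reading the table above, the three fragments add, update, and then remove the same fact. A sketch of how a reader might derive the current value follows. It rests on two assumptions not stated in the slides: the fragment with the most recent action timestamp wins, and a value of `-` marks a removal.

```python
curated = [
    {"aid": 1, "atimestamp": "20150124", "rtype": "area", "rid": "94113",
     "fproperty": "population", "fvalue": "15000", "ftimestamp": "20120701"},
    {"aid": 2, "atimestamp": "20150125", "rtype": "area", "rid": "94113",
     "fproperty": "population", "fvalue": "17000", "ftimestamp": "20120701"},
    {"aid": 3, "atimestamp": "20150126", "rtype": "area", "rid": "94113",
     "fproperty": "population", "fvalue": "-", "ftimestamp": "20120701"},
]


def current_value(fragments, rtype, rid, prop, ftime):
    """Return the value from the most recent action, or None if absent or removed."""
    matching = [
        f for f in fragments
        if (f["rtype"], f["rid"], f["fproperty"], f["ftimestamp"]) == (rtype, rid, prop, ftime)
    ]
    if not matching:
        return None
    # YYYYMMDD strings sort chronologically; the action id breaks ties
    latest = max(matching, key=lambda f: (f["atimestamp"], f["aid"]))
    return None if latest["fvalue"] == "-" else latest["fvalue"]


print(current_value(curated[:2], "area", "94113", "population", "20120701"))
print(current_value(curated, "area", "94113", "population", "20120701"))
```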
Views
• Built from fragments
• Application-specific (allows for optimization)
• Read-only and expendable
• Special cases
– State: All facts at a given timestamp
– Fact trending: Evolution of a given fact over time
– Action visualization: All facts generated by a given action
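The three special cases can be sketched as plain functions over a list of fragments. The sample data and function names are hypothetical, and the state view assumes that "at a given timestamp" means the latest fact per resource and property whose fact timestamp is not later than the requested one.

```python
fragments = [
    {"aid": 1, "rtype": "area", "rid": "94114", "fproperty": "population",
     "fvalue": 30000, "ftimestamp": "20100701"},
    {"aid": 2, "rtype": "area", "rid": "94114", "fproperty": "population",
     "fvalue": 32100, "ftimestamp": "20120701"},
    {"aid": 2, "rtype": "area", "rid": "94113", "fproperty": "population",
     "fvalue": 15000, "ftimestamp": "20120701"},
]


def state_view(fragments, at):
    """Latest known value of every fact whose timestamp is <= `at`."""
    state = {}
    for f in sorted(fragments, key=lambda f: f["ftimestamp"]):
        if f["ftimestamp"] <= at:  # YYYYMMDD strings compare chronologically
            state[(f["rtype"], f["rid"], f["fproperty"])] = f["fvalue"]
    return state


def trend_view(fragments, rtype, rid, prop):
    """Evolution of one fact over time as (timestamp, value) pairs."""
    return sorted(
        (f["ftimestamp"], f["fvalue"]) for f in fragments
        if (f["rtype"], f["rid"], f["fproperty"]) == (rtype, rid, prop)
    )


def action_view(fragments, aid):
    """All fragments generated by a given action."""
    return [f for f in fragments if f["aid"] == aid]


print(state_view(fragments, "20110101"))
print(trend_view(fragments, "area", "94114", "population"))
```

Each view is computed from the fragments alone, so it can be discarded and rebuilt at any time.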
Data Collection Architecture
Application Architecture
Conclusion
• Simple
– Fragment data model
• Flexible
– Allows for easily building expendable views
• Scalable
– E.g., using resource ids as sharding key
• Robust
– Any action can easily be cancelled
– Any state can easily be restored
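The scalability and robustness points can be illustrated with a short sketch. It assumes that cancelling an action means rebuilding views without that action's fragments, that restoring a state means rebuilding views from fragments acquired up to a given action timestamp, and that shards are chosen by hashing the resource id. The function names and the hashing scheme are hypothetical.

```python
import hashlib

fragments = [
    {"aid": 1, "atimestamp": "20150124", "rid": "94114", "fproperty": "population", "fvalue": 32100},
    {"aid": 2, "atimestamp": "20150125", "rid": "94114", "fproperty": "population", "fvalue": 30100},
    {"aid": 3, "atimestamp": "20150212", "rid": "94114", "fproperty": "population", "fvalue": 31100},
]


def without_action(fragments, aid):
    """Cancel an action: ignore its fragments when rebuilding views."""
    return [f for f in fragments if f["aid"] != aid]


def as_of(fragments, atimestamp):
    """Restore a state: keep only fragments acquired up to `atimestamp`."""
    return [f for f in fragments if f["atimestamp"] <= atimestamp]


def shard_for(rid, n_shards):
    """Pick a shard from a stable hash of the resource id."""
    digest = hashlib.sha256(rid.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


print([f["aid"] for f in without_action(fragments, 3)])
print([f["aid"] for f in as_of(fragments, "20150125")])
```

Sharding on the resource id keeps all fragments about one resource on the same shard, so per-resource views need no cross-shard reads.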
More details available at http://zettadatanet.wordpress.com
These slides are available at http://www.slideshare.net/dmassart/a-data-model-workflow-and-architecture-for-integrating-data