Foundations VII: Data life-cycle, Mining and
Knowledge Discovery
Deborah McGuinness and Joanne Luciano
With Peter Fox and Li Ding
CSCI-6962-01
Week 13, November 29, 2010
Contents
• Review assignment
• More advanced topics: life cycle, mining and adding to your knowledge base
• Summary
• Next week (your presentations)
Semantic Web Methodology and Technology Development Process
• Establish and improve a well-defined methodology vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
Use Case
Small Team, mixed skills
Analysis
Adopt Technology Approach
Leverage Technology Infrastructure
Rapid Prototype
Open World:
Evolve, Iterate, Redesign, Redeploy
Use Tools
Science/Expert Review & Iteration
Develop model/ontology
Evaluation
Data->Information->Knowledge
Data Life Cycle
• Life cycle (we will define these shortly)
– Acquisition, curation, preservation
– Long term stewardship
• Data and information – we use this to get to the discussion of knowledge
– Content; the values
– Context; the background, setting, etc.
– Structure; organization and form
• Representation/storage
– Analog
– Digital (and born digital)
Why it is important
• 1976 NASA Viking mission to Mars (A. Hesseldahl, “Saving Dying Data,” Forbes, Sep. 12, 2002. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html)
• 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675)
• R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt
Why (cont’d)
• e-science aims to derive new knowledge from (possibly) multiple sources of data
• The data needs to be persistent, available and usable
• The rate of creation of knowledge representations is increasing; they are a representation of the known ‘facts’ based on the data
• We studied KR creation, engineering, evolution and iteration
• Knowledge needs a life-cycle as well
At the heart of it
• Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc.
• Inability to know the inter-relations, assumptions and missing information
• We’ll look at a (data) use case for this shortly
• But first we will look at what, how and who in terms of the full life cycle
What to collect?
• Documentation
– Metadata
– Provenance
• Ancillary Information
• Knowledge
Who does this?
• Roles:
– Data creator
– Data analyst
– Data manager
– Data curator
How it is done
Acquisition
Curation
Preservation
• Usually refers to the full life cycle
• Archiving is a component
• Stewardship is the act of preservation
• Intent is that ‘you can open it any time in the future’ and that ‘it will be there’
• This involves steps that may not be conventionally thought of
• Think 10, 20, 50, 200 years: looking historically gives some guide to future considerations
Some examples and experience
• NASA
• NOAA
• Library community
• Note:
– Mostly in relation to publications, books, etc., but some for data
– Note that knowledge is in publications, but the structural form is meant for humans, not computers, despite advances in text analysis
– Very little for the type of knowledge we are considering: in machine-accessible form
Back in the day...
SEEDS Working Group on Data Lifecycle
• Second Workshop Report
o https://esdswg.eosdis.nasa.gov/documents/W2_Bothwell.pdf
o Many LTA recommendations
• Earth Sciences Data Lifecycle Report
o https://esdswg.eosdis.nasa.gov/documents/lta_prelim_rprt2.pdf
o Many lessons learned from USGS experience, plus some recommendations
• SEEDS Final Report (2003) - Section 4
o https://esdswg.eosdis.nasa.gov/documents/FinRec.pdf
o Final recommendations vis-a-vis data lifecycle
MODIS Pilot Project
• GES DISC, MODAPS, NOAA/CLASS, ESDIS effort
• Transferred some MODIS Level 0 data to CLASS
Mostly Technical Issues
• Data Preservation
o Bit-level integrity
o Data readability
• Documentation
• Metadata
• Semantics
• Persistent Identifiers
• Virtual Data Products
• Lineage Persistence
• Required ancillary data
• Applicable standards
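Bit-level integrity, the first item in the list above, is commonly checked by recording a checksum when a file is archived and recomputing it later. The following is a minimal sketch of that idea, not a description of what any of the archives named here actually run; the choice of SHA-256, the function names and the file name `granule.hdf` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest


# Demonstration on a throwaway file (hypothetical granule name).
with tempfile.TemporaryDirectory() as tmp:
    granule = Path(tmp) / "granule.hdf"
    granule.write_bytes(b"level-0 payload")
    recorded = sha256_of(granule)             # digest stored with the archive copy
    intact = verify_fixity(granule, recorded)
    granule.write_bytes(b"level-0 pay1oad")   # simulate a single corrupted byte
    corruption_detected = not verify_fixity(granule, recorded)
```

A check like this only shows that the bits are unchanged; it says nothing about data readability, which is the second item in the list.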
Mostly Non-Technical Issues
• Policy (constrained by money…)
• Front end of the lifecycle
o Long-term planning, data formats, documentation...
• Governance and policy
• Legal requirements
• Archive to archive transitions
• Money (intertwined with policy)
• Cost-benefit trades
• Long-term needs of NASA Science Programs
• User input
o Identifying likely users
• Levels of service
• Funding source and mechanism
HDF4 Format "Maps" for Long Term Readability
C. Lynnes, GES DISC
R. Duerr and J. Crider, NSIDC
M. Yang and P. Cao, The HDF Group
Use case: a real live one; deals mostly with structure and (some) content
HDF = Hierarchical Data Format
NSIDC = National Snow and Ice Data Center
GES = Goddard Earth Science
DISC = Data and Information Service Center
In the year 2025...
A user of HDF-4 data will run into the following likely hurdles:
• The HDF-4 API and utilities are no longer supported...
o ...now that we are at HDF-7
• The archived API binary does not work on today's OS's
o ...like Android 3.1
• The source does not compile on the current OS
o ...or is it the compiler version, gcc v. 7.x?
• The HDF spec is too complex to write a simple read program...
o ...without re-creating much of the API
What to do?
HDF Mapping Files
Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now)
• XML
• Stored separately from, but close to the data files
• Includes
o internal metadata
o variable info
o chunk-level info: byte offsets and length, linked blocks, compression information
Task funded by ESDIS project
• The HDF Group, NSIDC and GES DISC
Map sample (extract)
<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5">
  <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer"> 0 0 </hdf4:Attribute>
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2"> 180 360 </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>
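The map extract above carries enough information (byte offset, length, compression, datatype, dimensions) to read the array without the HDF-4 library. The sketch below illustrates that idea in Python and is not the project's prototype reader. It makes several assumptions: the namespace URI is hypothetical because the extract does not show it, the DEFLATE block is treated as a standard zlib stream, only signed integers in a single block are handled, and the demonstration uses a small synthetic file rather than real MODIS data.

```python
import struct
import tempfile
import xml.etree.ElementTree as ET
import zlib
from pathlib import Path

# Hypothetical namespace URI; the slide extract does not show the real one.
HDF4_NS = "http://example.org/hdf4-map"
NS = {"hdf4": HDF4_NS}


def read_sds(map_xml: str, data_path: Path) -> list:
    """Read one 2-D SDS from a data file using only its XML map."""
    sds = ET.fromstring(map_xml)
    dtype = sds.find("hdf4:Datatype", NS)
    rows, cols = (int(n) for n in sds.find("hdf4:Dataspace", NS).text.split())
    block = sds.find("hdf4:Datablock/hdf4:Block", NS)

    # Seek to the byte offset given in the map and read exactly nbytes.
    with data_path.open("rb") as handle:
        handle.seek(int(block.get("offset")))
        raw = handle.read(int(block.get("nbytes")))
    if "DEFLATE" in block.get("compression", ""):
        raw = zlib.decompress(raw)  # assumes a standard zlib stream

    # Signed integers only in this sketch: "BE" -> ">", size 2 -> "h".
    order = ">" if dtype.get("byteOrder") == "BE" else "<"
    code = {"1": "b", "2": "h", "4": "i"}[dtype.get("dtypeSize")]
    flat = struct.unpack(f"{order}{rows * cols}{code}", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Synthetic demonstration: a 2 x 3 array stored after a 16-byte header.
values = [[1, -2, 3], [4, 5, -6]]
payload = zlib.compress(struct.pack(">6h", *[v for row in values for v in row]))
header = b"\x00" * 16  # stand-in for whatever precedes the block in a real file
demo_map = f"""<hdf4:SDS xmlns:hdf4="{HDF4_NS}" objName="Demo">
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2"> 2 3 </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="{len(header)}" nbytes="{len(payload)}"
                compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>"""

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "demo.hdf"
    data_file.write_bytes(header + payload)
    recovered = read_sds(demo_map, data_file)
```

The point of the design is that everything the reader needs is plain text plus byte arithmetic, so it should remain writable long after the original API stops compiling.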
Status and Future
Status
• Map creation utility (part of HDF)
• Prototype read programs
o C
o Perl
• Paper in TGRS special issue
• Inventory of HDF-4 data products within EOSDIS
Possible Future Steps
• Revise XML schema
• Revise map utility and add to HDF baseline
• Implement map creation and storage operationally
o e.g., add to ECS or S4PA metadata files
Examples of NASA context
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Contextual Information:
• Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
• Instrument/sensor calibration data and method
• Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
• Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product
7th Joint ESDSWG meeting, October 22, Philadelphia, PA
Data Lifecycle Workshop sponsored by the Technology Infusion Working Group
Contextual Information (continued):
• Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive
• Quality assessment information
• Validation record, including identification of validation data sets
• Data structure and format, with definition of all parameters and fields
• In the case of earth based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record
• A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
• Information received back from users of the data set or product
However…
• Even groups like NASA do not have a governance model for this work
• Governance: definition
• Stakeholders:
– NASA for integrity of their data holdings (is it their responsibility?)
– Public for value for and return on investment
– Scientists for future use (intended and unintended)
– Historians
NOAA
Library community
• OAIS
• OAI (PMH and ORE)
Metadata Standards - PREMIS
• Provide a core preservation metadata set with broad applicability across the digital preservation community
• Developed by an OCLC and RLG sponsored international working group
– Representatives from libraries, museums, archives, government, and the private sector.
• Based on the OAIS reference model
Metadata Standards - PREMIS
• Maintained by the Library of Congress
• Editorial board with international membership
• User community consulted on changes through the PREMIS Implementers Group
• Version 1 was released in June 2005
• Version 2 was just released
PREMIS - Entity-Relationship Diagram
• Intellectual Entities: “a coherent set of content that is reasonably described as a unit”. For example, a web site, data set or collection of data sets
• Objects: “a discrete unit of information in digital form”. For example, a data file
• Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement
• Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated
• Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it
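To make the five PREMIS entities more concrete, here is a rough sketch of how they might relate as plain Python data classes. It is a simplification for illustration only and is not the PREMIS schema or data dictionary; every class and field name is an assumption, and the links between entities are reduced to simple lists.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Agent:
    name: str                      # person, organization, or software program


@dataclass
class Event:
    kind: str                      # e.g. "created", "archived", "migrated"
    agents: List[Agent] = field(default_factory=list)


@dataclass
class Rights:
    statement: str                 # e.g. a deposit agreement


@dataclass
class DigitalObject:
    identifier: str                # e.g. a data file
    events: List[Event] = field(default_factory=list)
    rights: List[Rights] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str                     # e.g. a data set or collection of data sets
    objects: List[DigitalObject] = field(default_factory=list)


def agents_for(obj: DigitalObject) -> Set[str]:
    """Names of every agent tied to a preservation event on this object."""
    return {agent.name for event in obj.events for agent in event.agents}


# Hypothetical example records.
data_file = DigitalObject(
    identifier="file-001",
    events=[
        Event("created", [Agent("Dr. Spock")]),
        Event("migrated", [Agent("migration-tool")]),
    ],
    rights=[Rights("deposit agreement")],
)
collection = IntellectualEntity("Example data set", [data_file])
```

With this structure, `agents_for(data_file)` answers the question of who has touched a file over its life, which is the kind of lookup the Events and Agents entities exist to support.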
PREMIS - Types of Objects
• Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity”
• File
• Bitstream - “contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes”
Metadata Standards - METS
• Metadata Encoding and Transmission Standard
• An initiative of the Digital Library Federation
• Based on the Making of America II project
METS - What’s Its Purpose?
• Provides the means to convey the metadata necessary for
– management of digital objects within a repository
– exchange of objects between repositories (or between repositories and their users)
• Designed to facilitate
– shared development of information management tools/services
– interoperable exchange of digital materials
METS - What’s its status?
• Version 1.6 was released in Sept. 2007
• Maintained by the Library of Congress
• International Editorial Board
• NISO registration as of 2006
Backup Materials - MODIS Contextual Info
Instrument/sensor characteristics
Processing Algorithms & Scientific Basis
Ancillary Data
Processing History including Source Code
Quality Assessment Information
Validation Information
Other Factors that can Influence the Record
Bibliography
Information from users
• Data Errors found
• Quality updates
• Things that need further explanation
• Metadata updates/additions?
• Community contributed metadata????
Back to why you need to…
• E-science uses data and it needs to be around when what you create goes into service and you go on to something else
• That’s why someone on the team must address life-cycle (data, information and knowledge – we’ll get to the latter shortly) and work with other team members to implement organizational, social and technical solutions to the requirements
What would you need to do?
(Digital) Object Identifiers
• Object is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog
• Examples:
– DOI
– URI
– XRI
Versioning
Mining
• We will start with data but the ideas apply to information and knowledge bases as well
• Definition
• History
• Our interest
SAM: Smart Assistant for Earth Science Data Mining
PI: Rahul Ramachandran
Co-I: Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair
Science Motivation
• Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects of dust fertilization
• Analysis entails
– Mine MODIS L1B data for dust storm events and identify the swath of area influenced by the passage of the dust storms
– Examine correlations between fertilization, plankton growth and DMS production
Current Analysis Process
• MODIS aerosol products don’t provide speciation
• Locate and download all the data to their local machine
• Write code to classify and detect dust accurately [3-4 month effort]
• Write code to classify and detect other dust aerosols [3-4 month effort]
• Write code to segment the detected region in order to account for advection effect and correlation coefficient [2 month effort]
Analysis with SAM
• Create a workflow to perform classification using many different state-of-the-art classifiers on distributed data
• Create a workflow to segment detected regions using image processing services on distributed data
Bottom line:
• Scientist does not have to write all the code to perform the analysis
• Can compose workflows that utilize distributed data/services
• Can share the workflow with others to collaborate, reuse and modify
Conducting Science using Internet as the Primary Computer
Mash-ups Example: Yahoo Pipes
Data Mining in the ‘new’ Distributed Data/Services Paradigm
Too many choices!!
• And that’s only part of the toolkit
• ADaM-IVICS toolkit has over 100 algorithms
SAM Objectives
• Improve usability of Earth Science data by existing data mining services for research, by incorporating semantics into the workflow composition process.
– Semantic search capable of mapping a conceptual task
– Assistance in mining workflow composition
– Verification that services are connected in a semantically correct fashion
Ontology Use
Semi-automated Workflow Composition
Filtering services based on data format
Filtering service options based on both data format and task selected
Final Workflow
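The filtering steps named on these slides can be sketched in a few lines. This is an illustration only and does not reflect how SAM is implemented: SAM uses an ontology, whereas this sketch stands in plain string matching for the semantic checks, and the service names, formats and tasks in the catalog are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Service:
    name: str
    task: str            # conceptual task the service performs
    input_format: str
    output_format: str


# Hypothetical catalog entries; not the actual SAM or ADaM service list.
CATALOG = [
    Service("kmeans_classifier", "classification", "HDF4", "HDF4"),
    Service("bayes_classifier", "classification", "netCDF", "netCDF"),
    Service("region_grower", "segmentation", "HDF4", "GeoTIFF"),
    Service("format_converter", "conversion", "netCDF", "HDF4"),
]


def candidates(catalog: List[Service], data_format: str,
               task: Optional[str] = None) -> List[Service]:
    """Services that accept data_format and, if given, perform task."""
    return [
        s for s in catalog
        if s.input_format == data_format and (task is None or s.task == task)
    ]


def is_valid_chain(chain: List[Service]) -> bool:
    """Check that each service's output format feeds the next one's input."""
    return all(a.output_format == b.input_format
               for a, b in zip(chain, chain[1:]))


by_format = [s.name for s in candidates(CATALOG, "HDF4")]
by_format_and_task = [s.name for s in candidates(CATALOG, "HDF4", "classification")]
workflow = [CATALOG[0], CATALOG[2]]   # classify, then segment
```

Narrowing first by format and then by task mirrors the order of the slides, and `is_valid_chain` is a crude stand-in for the verification that services are connected correctly.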
Hypothesis
• In remote ocean locations there is a positive correlation between the area-averaged atmospheric aerosol loading and oceanic chlorophyll concentration
• There is a time lag between oceanic dust deposition and the photosynthetic activity
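One simple way to test a hypothesis of this shape is to compute the Pearson correlation between the aerosol series and the chlorophyll series shifted by a candidate lag. The sketch below shows that calculation on made-up numbers; it is not the analysis behind the results in these slides, and the function names and the synthetic `aot` and `chl` values are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(aot, chl, lag):
    """Correlate aot[t] with chl[t + lag] for a non-negative lag in time steps."""
    if lag < 0 or lag > len(aot) - 2:
        raise ValueError("lag out of range")
    return pearson(aot[:len(aot) - lag], chl[lag:])


def best_lag(aot, chl, max_lag):
    """Lag in 0..max_lag with the largest correlation."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(aot, chl, lag))


# Synthetic series: chlorophyll responds linearly to AOT two steps earlier.
aot = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.4, 0.8, 0.2, 0.6]
chl = [0.1, 0.1] + [0.05 + 0.5 * a for a in aot[:-2]]
```

On this synthetic input `best_lag(aot, chl, 3)` picks out the two-step delay that was built in. Real satellite series would also need gap handling and significance testing, which this sketch leaves out.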
Primary source of ocean nutrients
[Figure labels: wind-blown dust (Sahara), sediments from river, ocean upwelling]
Factors modulating dust-ocean photosynthetic effect
[Figure labels: Sahara, dust, SST, clouds, nutrients, chlorophyll]
Objectives
• Use satellite data to determine whether atmospheric dust loading and phytoplankton photosynthetic activity are correlated.
• Determine physical processes responsible for observed relationship
Preliminary Results
Data and Method
• Data sets obtained from SeaWiFS and MODIS during 2000 – 2006 are employed
• MODIS derived AOT
The areas of study
1 - Tropical North Atlantic Ocean
2 - West coast of Central Africa
3 - Patagonia
4 - South Atlantic Ocean
5 - South Coast of Australia
6 - Middle East
7 - Coast of China
8 - Arctic Ocean
*Figure: annual SeaWiFS chlorophyll image for 2001
Tropical North Atlantic Ocean: dust from Sahara Desert
[Figure: chlorophyll vs. AOT; the annotated values are all negative]
Arabian Sea: dust from Middle East
[Figure: chlorophyll vs. AOT; the annotated values are all positive]
Summary and future work
• Dust impacts ocean photosynthetic activity: positive correlations in some areas, NEGATIVE correlations in other areas, especially in the Saharan basin
• Hypothesis for explaining observations of negative correlation: In areas that are not nutrient limited, dust reduces photosynthetic activity
• But also need to consider the effect of clouds and ocean currents. Also need to isolate the effects of dust: the MODIS AOT product includes contributions from dust, DMS, biomass burning, etc.
Case for SAM
• MODIS aerosol products don’t provide speciation
• Why is this data analysis hard?
– Need to classify and detect dust accurately
– Need to classify and detect other aerosols (e.g., DMS) accurately
– Need to segment the detected region in order to account for advection effects and correlation coefficient
• What will SAM provide?
– Capability to create a workflow to perform classification
– Capability to create a workflow to segment detected regions
Bottom line:
• Scientist does not have to write all the code to perform the analysis
• Can compose workflows that utilize distributed data/services
• Can share the workflow with others to collaborate, reuse and modify
Knowledge Discovery
• Has a broad meaning
– Finding ontologies
– Creating new knowledge from
  • Previous knowledge
  • New sources (data, information)
  • Modeling
• We’ll look at a mining approach as an example
Ingest/pipelines: problem definition
• Data is coming in faster and in greater volumes, outstripping our ability to perform adequate quality control
• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision
• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects
• The task of event determination and feature classification is onerous and we don't do it until after we get the data
20080602 Fox VSTO et al.
Use cases
• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005 1849:09UT taken by the ACOS Mark IV polarimeter?
• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21, 2008, 1900UT missing?
• Why does this image look bad?
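Questions like "who added the comments to this file?" can only be answered if each processing step records what was done, to which product, and by which person or program. As a rough sketch of that idea, the following keeps such records in a list and queries them. It does not describe the provenance system behind these use cases; the record fields, identifiers, agents and timestamps are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProvenanceRecord:
    artifact: str     # identifier of the data product
    action: str       # what was done to it
    agent: str        # person or program that did it
    timestamp: str    # ISO 8601, UTC


# Hypothetical records, deliberately out of time order.
LOG = [
    ProvenanceRecord("image-001", "commented", "observer-on-duty", "2008-03-21T19:30:00Z"),
    ProvenanceRecord("image-001", "acquired", "instrument-control", "2008-03-21T19:00:00Z"),
    ProvenanceRecord("image-001", "calibrated", "calibration-pipeline", "2008-03-21T19:05:00Z"),
    ProvenanceRecord("image-002", "acquired", "instrument-control", "2008-03-21T19:03:00Z"),
]


def who_did(log: List[ProvenanceRecord], artifact: str, action: str) -> List[str]:
    """Agents that performed the given action on the given artifact."""
    return [r.agent for r in log if r.artifact == artifact and r.action == action]


def history(log: List[ProvenanceRecord], artifact: str) -> List[ProvenanceRecord]:
    """All records for an artifact, oldest first."""
    return sorted((r for r in log if r.artifact == artifact),
                  key=lambda r: r.timestamp)


commenters = who_did(LOG, "image-001", "commented")
steps = [r.action for r in history(LOG, "image-001")]
```

An empty `history` for a product that should exist is itself informative: it points at the "why are the images missing?" style of question in the list above.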
Summary
• (Data) life cycle – key actions
– A
– B
• Mining (data, information and knowledge) – key results and work in progress
– A
– B
• Facilitating new discoveries
– A
Next week
• This week’s assignments:
– Reading: None
– Assignment: None
• Next class (week 14 – December 6):
– Class presentation III: Use case iteration
• Term assignment due – December 6 before class
• Office hours this week – by appointment or drop in
– Winslow 2104 (Professor McGuinness)
– Winslow 2143 (Professor Luciano)
• Questions?