PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o...
![Page 1: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/1.jpg)
http://research.microsoft.com
![Page 2: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/2.jpg)
A Tidal Wave of Scientific Data
![Page 3: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/3.jpg)
• Experimental Science
• Theoretical Science
• Newton's Laws, Maxwell's Equations…
• Computational Science
• Simulation of complex phenomena
• Data-Intensive Science
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
![Page 5: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/5.jpg)
An edited collection of 26 short technical essays, divided into 4 sections
![Page 6: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/6.jpg)
The Problem for the e-Scientist
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share with others
• Query and Vis tools
• Building and executing models
• Integrating data and Literature
• Documenting experiments
• Curation and long-term preservation
The Generic Problems
(With thanks to Jim Gray)
[Diagram: four sources (Experiments & Instruments, Simulations, Literature, Other Archives), each labeled "facts", alongside "Questions" and "Answers"]
![Page 7: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/7.jpg)
![Page 8: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/8.jpg)
Monitoring
Collation
Quality assurance
Aggregation
Analysis
Reporting
Forecasting
Distribution
Done poorly, but a few notable counter-examples
Done poorly to moderately, not easy to find
Sometimes done well, generally discoverable and available, but could be improved
Integration
(I. Zaslavsky & CSIRO, BOM, WMO)
![Page 9: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/9.jpg)
Environmental Ecosystem
Action Knowledge
Inform
![Page 10: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/10.jpg)
Environmental Ecosystem
Analysis
Insight
Publish
Data
Action Knowledge
Communicate
Decide
Implement
Inform
![Page 11: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/11.jpg)
![Page 12: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/12.jpg)
Data Variety – The Spice of Life
Manual Measurement
Automated Measurement
Sample Collection
Historical Photographs
Counting
Satellite
Relatively
Ubiquitous
Motes Aircraft Surveys Model Output
Typing
![Page 13: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/13.jpg)
![Page 14: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/14.jpg)
Source Data (Swath format)
Reprojected Data (Sinusoidal format)
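The step from swath to sinusoidal format amounts to mapping each pixel's longitude and latitude onto a sinusoidal grid. A minimal sketch of the forward sinusoidal projection on a sphere follows. The function name and the default radius (the sphere radius commonly quoted for the MODIS sinusoidal grid) are assumptions for illustration, not details taken from the slides, and a full reprojection would also need resampling onto grid cells.

```python
import math

# Assumed sphere radius in metres, as commonly quoted for the MODIS sinusoidal grid.
SPHERE_RADIUS_M = 6371007.181


def lonlat_to_sinusoidal(lon_deg, lat_deg, radius=SPHERE_RADIUS_M, lon0_deg=0.0):
    """Forward sinusoidal projection on a sphere; returns (x, y) in metres."""
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg - lon0_deg)
    x = radius * dlon * math.cos(lat)  # east-west distance shrinks with cos(latitude)
    y = radius * lat                   # north-south distance is proportional to latitude
    return x, y
```

For example, `lonlat_to_sinusoidal(90.0, 60.0)` places the point at half the x-distance it would have on the equator, because cos(60°) is 0.5.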
![Page 15: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/15.jpg)
Why Make this Distinction?
• Provenance and trust vary widely
• Data acquisition, early processing, and reporting are done by anyone from a large government agency to individual scientists.
• Smaller data sets are often passed around by email; big data downloads can take days (if they succeed at all)
• Data sharing concerns and patterns vary
• Open access followed by (non-repeatable and tedious) pre-processing
• True science-ready data sets, but with concerns about misuse and misunderstanding, particularly for hard-won data.
• Computational tools differ.
• Not everyone can get an account at a supercomputer center
• Very large computations require engineering (error handling)
• Space and time aren't always simple dimensions
Complex shared detector Simple instrument (if any)
Complex and Heavy process by experts Ad hoc observations and models
KB
PB
GB
TB
Science happens when PBs, TBs, GBs, and KBs can be mashed up simply
![Page 16: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/16.jpg)
Scientist
Source Metadata
Scientific Results
AzureMODIS
Service Web Role Portal
Request Queue
Source Imagery Download Sites
Reprojection Queue
Reduction Queue
Data Collection Stage
Reprojection Stage
Analysis/Reduction Stage
Catharine van Ingen (Microsoft Research), Jie Li, Marty Humphrey (UVA), Youngryel Ryu (UCB), Deb Agarwal (BWC/LBL)
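The queues and stages listed on this slide suggest a staged, queue-driven pipeline: requests feed data collection, whose output feeds reprojection, whose output feeds reduction. The sketch below imitates that shape in a single process with Python's standard `queue` module. The stage functions, the task fields, and the tile names are hypothetical placeholders; the slide gives no implementation details, and a cloud deployment would use hosted queues and separate workers instead.

```python
import queue


def collect(task):
    # Hypothetical stand-in for downloading the source imagery for a tile.
    return {**task, "source": f"swath-{task['tile']}"}


def reproject(task):
    # Hypothetical stand-in for the swath-to-sinusoidal reprojection.
    return {**task, "reprojected": True}


def reduce_tile(task):
    # Hypothetical stand-in for the analysis/reduction step.
    return {**task, "result": f"summary-{task['tile']}"}


def run_stage(inbox, work, outbox):
    """Drain inbox, apply work to each task, and put the output on outbox."""
    while True:
        try:
            task = inbox.get_nowait()
        except queue.Empty:
            return
        outbox.put(work(task))


def run_pipeline(tiles):
    request_q, reprojection_q, reduction_q, results_q = (
        queue.Queue() for _ in range(4)
    )
    for tile in tiles:
        request_q.put({"tile": tile})
    run_stage(request_q, collect, reprojection_q)      # data collection stage
    run_stage(reprojection_q, reproject, reduction_q)  # reprojection stage
    run_stage(reduction_q, reduce_tile, results_q)     # analysis/reduction stage
    return [results_q.get_nowait() for _ in range(results_q.qsize())]
```

Because each stage only reads from one queue and writes to the next, a stage can be retried or scaled out without the others knowing, which is the main appeal of this layout.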
![Page 17: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/17.jpg)
Continuation “R2”: 2012+
Science and engineering objectives
“Solve the carbon balance problem”
“Build an interoperable data system”
Pilot study “R1”: 2009
20 million observations
Engineering success
Collaborators:
Humberto da Rocha (USP)
Andreas Terzis (JHU)
Juliana Salles, Rob Fatland (MSR)
Brito Cruz (FAPESP)
![Page 18: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/18.jpg)
![Page 19: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/19.jpg)
Common Problems with Data
To use data from different sources
o Non-standard formats, scales, and units
o Lack of data quality control
o Lack of metadata
o Difficult to repurpose data for different (my) tools
To share data
o Lack of incentive (no credit)
o Need extra resources and tools
Hidden problems, seldom addressed
o Versioning
o Provenance
o Curation
Data Sources: SQL, CSV, Data Cube, XML
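One common way to cope with non-standard units and missing metadata is to normalize every record to a single schema at ingest time and to reject records whose units are unknown. The following is a minimal sketch of that idea. The field names, the choice of temperature as the example quantity, and the `normalize` helper are all hypothetical and not part of the slides.

```python
# Conversion functions to a common unit (degrees Celsius), keyed by source unit.
TO_CELSIUS = {
    "C": lambda v: v,
    "F": lambda v: (v - 32.0) * 5.0 / 9.0,
    "K": lambda v: v - 273.15,
}


def normalize(record):
    """Return a copy of record with temperature in Celsius and its source kept."""
    unit = record.get("unit")
    if unit not in TO_CELSIUS:
        # Missing or unknown unit metadata: refuse to guess.
        raise ValueError(f"unknown or missing unit: {unit!r}")
    return {
        "site": record["site"],
        "temperature_c": TO_CELSIUS[unit](record["value"]),
        "source": record.get("source", "unknown"),  # minimal provenance field
    }
```

Keeping the `source` field on every normalized record is a small step toward the provenance problem listed above, since later users can still tell where a value came from.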
![Page 20: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/20.jpg)
Data Sources: SQL, CSV, Data Cube, XML
Cloud Service, HPC Cluster, DB Server, Data Server, … Web server …
Applications
… Android, iPhone, Windows Phone, WebOS
Java, Silverlight, .NET, AJAX, PHP, Excel, MATLAB
Current State of Data Ecosystem
![Page 21: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/21.jpg)
Advance data discoverability, accessibility, and consumability
Marketplace SQL Spatial
PivotViewer
maps http://www.odata.org
OData is a Web protocol for querying and updating data that provides a way to unlock your data and free it from data silos
It does this by building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores.
In Open Source/Specifications Promise
An application of a set of internet standards: HTTP, Atom (RFC 4287), AtomPub (RFC 5023), REST semantics
Existing standards + easy data access API
Adding Geospatial data support – Feedback from the Community encouraged – www.odata.org
It allows you to form URLs based on what you know about the underlying data
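To illustrate how such URLs are formed, the sketch below builds an OData query URL from system query options such as `$filter` and `$top`. The service root, the `Observations` entity set, and the `Temperature` property are hypothetical; only the query-option names come from the OData URL conventions.

```python
from urllib.parse import quote, urlencode


def odata_url(service_root, entity_set, **options):
    """Build an OData query URL; each keyword becomes a $-prefixed query option."""
    query = urlencode(
        {f"${name}": value for name, value in options.items()},
        quote_via=quote,  # encode spaces as %20 rather than '+'
        safe="$",         # keep the leading '$' of option names readable
    )
    url = f"{service_root.rstrip('/')}/{entity_set}"
    return f"{url}?{query}" if query else url


# Hypothetical service: the five observations with temperature above 20.
print(odata_url(
    "https://example.org/service.svc",
    "Observations",
    filter="Temperature gt 20",
    top=5,
))
```

Because the whole query lives in the URL, the same request can be issued from a browser, a spreadsheet, or a script without a dedicated client library.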
![Page 22: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/22.jpg)
![Page 23: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/23.jpg)
NodeXL
Binary and source code:
http://nodexl.codeplex.com
Network graph visualization
![Page 24: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/24.jpg)
NodeXL Network Overview Discovery and Exploration add-in for Excel 2007/2010
A minimal network can illustrate the ways different locations have different values for centrality and degree
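A small worked example makes the point concrete. The sketch below computes degree and closeness centrality for a hypothetical five-node network in plain Python. The slide does not say which centrality measure it means, so closeness is an assumption here, and the code is independent of NodeXL itself.

```python
from collections import deque

# Hypothetical network: hub "A" linked to "B", "C", "D"; "E" hangs off "D".
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]


def build_graph(edges):
    """Undirected adjacency sets built from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph):
    """Number of neighbours of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def closeness(graph):
    """Closeness: (n - 1) / sum of shortest-path lengths (connected graph only)."""
    scores = {}
    for start in graph:
        dist = {start: 0}
        frontier = deque([start])
        while frontier:  # breadth-first search from start
            node = frontier.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    frontier.append(nxt)
        scores[start] = (len(graph) - 1) / sum(dist.values())
    return scores
```

In this network "A" has the highest degree (3) and the highest closeness (0.8), while "B" and "E" share the same degree (1) but differ in closeness because "E" sits farther from the rest of the network.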
![Page 25: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/25.jpg)
![Page 26: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/26.jpg)
interactive exploration
cinematic narrative
http://www.digitalnarratives.net
![Page 27: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/27.jpg)
http://www.digitalnarratives.net/
![Page 28: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/28.jpg)
• Speech recognition technologies used to 'crack' audio files
• Indexing automatic transcripts as text does not work
• 'Real' enterprise automatic transcription accuracy is only 50-80%
• 50-140% accuracy improvement over indexing automatic transcripts
• index word alternatives – robust to recognizer errors
• index timing – navigate to exact point in video
• No need to invest in H/W infrastructure
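The two indexing ideas above (word alternatives and timing) can be sketched as an inverted index whose postings keep every candidate word the recognizer proposed, together with the time offset where it was heard. This is an illustrative sketch only: the segment layout, the confidence values, and the function names are hypothetical and do not describe the actual system.

```python
from collections import defaultdict


def build_index(segments):
    """Map each candidate word to (video_id, start_seconds, confidence) postings.

    Each segment carries every alternative proposed for one time slot,
    not only the recognizer's top guess.
    """
    index = defaultdict(list)
    for video_id, start, alternatives in segments:
        for word, confidence in alternatives:
            index[word.lower()].append((video_id, start, confidence))
    return index


def search(index, word):
    """Return postings for word, most confident first."""
    return sorted(index.get(word.lower(), []), key=lambda p: p[2], reverse=True)


# Hypothetical recognizer output: at 12.5 s the top guess is wrong,
# but the correct word survives as an alternative.
SEGMENTS = [
    ("talk1", 12.5, [("whether", 0.6), ("weather", 0.4)]),
    ("talk1", 40.0, [("weather", 0.9)]),
]
```

A search for "weather" then finds both positions, including the one a plain transcript index would miss, and each hit carries the offset needed to jump to that point in the video.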
![Page 29: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/29.jpg)
ScienceCinema
• Use NLP and Bing Search to expand word dictionary
• Enables discovery of speech content
• 1,000 hours of AV content currently available
• NEW
http://www.osti.gov/sciencecinema
http://research.microsoft.com/mavis
![Page 30: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/30.jpg)
Seamless Rich Social Media Virtual Sky
Web application for science and education
Goals
Integration of data sets and one-click contextual access
Easy access and use
Tours for sharing information/insights
Updates
API for extensibility
Excel Add-in for easy data integration
We invite you to experience it! www.worldwidetelescope.org
![Page 31: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/31.jpg)
![Page 32: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/32.jpg)
Natural User Interfaces (NUI)
• Rethinking ways in which people will interact with computers/technologies of the future
• Re-evaluating everything from their (non-) physical design to the human needs and interaction models
• Revolutionize the way we think about technology and what it can do on our behalf
http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/
![Page 33: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/33.jpg)
NUI – Kinect SDK and WWT
![Page 34: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/34.jpg)
![Page 35: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/35.jpg)
![Page 37: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata](https://reader033.fdocuments.us/reader033/viewer/2022052005/6018c8a0975fd343a1021324/html5/thumbnails/37.jpg)