Addressing the Challenges of the Scientific Data Deluge
Transcript of Addressing the Challenges of the Scientific Data Deluge
Addressing the Challenges of the Scientific Data Deluge
Kenneth Chiu, SUNY Binghamton
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
Autoscaling Project
• Traditional sensor-network research focuses on energy, routing, etc.
• In “environmental observatories”, management is the problem.
• Adding a sensor takes a lot of manual reconfiguration.
  – Calibration, recalibration.
  – QA/QC is also a major issue.
• What corrections have been applied to the data, and what calibrations/maintenance have been applied to the sensor?
• With U. Wisconsin, SDSC, and Indiana University.
Motivation
• Adding a sensor requires a great deal of manual effort.
  – Reconfiguring the datalogger.
  – Reconfiguring data acquisition software.
  – Reconfiguring QA/QC triggers.
  – Reconfiguring database tables.
• QA/QC is not very automated.
• Result: Sensor networks are not very scalable.
• Goal: Automate.
Metadata for each final table
• Describes each final table.
• Used to generate forms dynamically for data retrieval from the website.
• Entered manually.
Approach
• Use an agent-based, bottom-up approach.
• Agents coordinate among themselves as much as possible.
• Unify communications. All communications done via data streams.
• Data streams represented as content-based, publish-subscribe systems.
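The content-based routing idea can be illustrated with a toy broker (a hypothetical sketch, not the project's actual implementation): subscribers register predicates over message content rather than topic names, and each published event is delivered to every subscriber whose predicate matches.

```python
# Toy content-based publish-subscribe broker: subscribers register a
# predicate over message content; publish() routes each event to every
# subscriber whose predicate accepts it.

class ContentBroker:
    def __init__(self):
        self.subscribers = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self.subscribers.append((predicate, callback))

    def publish(self, event):
        # Match on content, not on topic names: each predicate
        # inspects the event dictionary itself.
        for predicate, callback in self.subscribers:
            if predicate(event):
                callback(event)

broker = ContentBroker()
received = []

# A hypothetical QA agent that only cares about zero wind-speed readings.
broker.subscribe(
    lambda e: e.get("type") == "wind_speed" and e.get("value") == 0.0,
    received.append,
)

broker.publish({"type": "wind_speed", "value": 0.0, "station": "buoy-1"})
broker.publish({"type": "wind_speed", "value": 4.2, "station": "buoy-1"})
broker.publish({"type": "water_temp", "value": 0.0, "station": "buoy-1"})
```

Only the first event reaches the QA agent; the predicate, not a channel name, decides delivery.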
Long-Term Ecological Research (LTER)
[Architecture diagram: at Trout Lake Station, buoy sensors feed a data-logger and ORB; configuration, QA, and environment agents exchange CIMA configuration and environment events over ARTS and other connections; on the University of Wisconsin campus, data reaches Oracle via JDBC/ODBC and is served to web browsers by a web server; other locations connect similarly.]
Agents
• Characteristics:
  – Autonomous.
  – Bottom-up.
  – Distributed coordination.
  – Independent/loosely coupled.
• Can be thought of as a “style” for implementing distributed systems.
Sensor Metadata
• Each sensor has intrinsic and extrinsic properties.
  – Intrinsic: type, model number, etc.
    • Static: cannot be changed.
    • Dynamic: SDI-12 address.
  – Extrinsic: location, sampling rate, etc.
• Use code generation techniques to generate the proper code based on the sensor data.
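A minimal sketch of such metadata-driven generation (the metadata fields, the `read_sdi12` helper, and the sensor values are all hypothetical, chosen only to illustrate the idea): the generator emits polling code as source text from a sensor's intrinsic and extrinsic properties.

```python
# Sketch of metadata-driven code generation: given a sensor's metadata,
# emit a small polling routine as source text.  The field names and the
# read_sdi12() helper referenced in the template are hypothetical.

SENSOR_TEMPLATE = '''\
def poll_{name}():
    """Read {model} at SDI-12 address {address} every {rate} s."""
    return read_sdi12("{address}")
'''

def generate_polling_code(sensor):
    return SENSOR_TEMPLATE.format(
        name=sensor["name"],
        model=sensor["model"],            # intrinsic, static
        address=sensor["sdi12_address"],  # intrinsic, dynamic
        rate=sensor["sampling_rate"],     # extrinsic
    )

code = generate_polling_code({
    "name": "water_temp_1",
    "model": "T-100",        # made-up model number
    "sdi12_address": "3",
    "sampling_rate": 60,
})
```

Adding a sensor then means adding a metadata record, not hand-editing datalogger programs.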
Automatic Sensor Detection and Inventory
[Diagram: automatic sensor detection. Components: sensor, datalogger, an instrument agent on the acquisition computer, a field station computer, a web service fronting the sensor metadata repository, and a data center database. Steps: 1: detection event; 2–3: request/response with the metadata repository; 4: generate the datalogger program; 5: upload; 6–7: data flows to the data center database.]
QA/QC
• Malfunctioning anemometer detected as an abnormal occurrence of zero wind speed values.
[Chart: frequency of zero hourly-average wind speed values per month, Jan 1995 through Jan 2003 (y-axis 0–250).]
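The anemometer check described above amounts to a frequency test. A minimal sketch (the threshold and data are made up for illustration): count the zero readings in each month and flag months where the count is anomalously high.

```python
from collections import Counter

def flag_zero_months(readings, threshold=50):
    """readings: list of (month, wind_speed) pairs.  Flag months whose
    count of zero values exceeds threshold (an illustrative number,
    not a calibrated QA/QC rule)."""
    zero_counts = Counter(m for m, speed in readings if speed == 0.0)
    return sorted(m for m, n in zero_counts.items() if n > threshold)

# A healthy month has a few calm hours; a stuck anemometer reports
# zero almost continuously.
readings = ([("1999-01", 0.0)] * 120 + [("1999-01", 3.1)] * 10
            + [("1999-02", 0.0)] * 5 + [("1999-02", 2.7)] * 100)
flagged = flag_zero_months(readings)
```

A QA agent subscribing to wind-speed events could run this kind of test continuously instead of waiting for a human to notice the pattern.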
Another Example
• Buoy was pulled down in the water by the ice.
[Charts: water temperature (deg C, −2 to 4) over two winters (23-Nov to 21-Feb and 15-Nov to 13-Feb), one showing displaced sensors and one a normal winter.]
Hu and Benson
Crystal Grid Framework
• Seeks to develop standards and middleware for integrating instrument and sensor data into wide-area infrastructures, such as grid computing.
• With Indiana University.
Motivation
• The process of collecting and generating data is often critical.
  – Current mechanisms for monitoring and control either require physical presence or use ad hoc protocols and formats.
• Instruments and sensors are already “wired”.
  – Usually via obscure, or perhaps proprietary, protocols.
• Using standard mechanisms and protocols can give these devices a grid presence.
  – Benefit from a single, unified paradigm and terminology.
  – Single set of standards; exploit existing grid standards.
  – Simplifies end-to-end provenance tracking.
  – Faster, seamless interactions between data acquisition and data processing.
  – Greater interoperability and compatibility.
• Philosophy: Push grid standards as close to the instrument or sensor as possible (but no further!). Deal with “impedance mismatches” close to the instrument, so as to localize complexity.
Goals
• Develop a set of standard grid services for accessing and controlling instruments.
  – Based on Web standards such as WSDL, SOAP, XML, etc.
• Develop an instrument ontology for describing instruments.
  – Applications use the description to interact.
• Develop middleware that abstracts and layers functionality.
  – Minor differences in instruments should result in only minor loss of functionality to the application.
• Move metadata and provenance as close to the instrument as possible.
Overview
[Diagram: a data pipeline of acquisition, analysis, and curation components over a physical network transport. Each component pairs a device-independent application module (its code) with a device-dependent virtualization module (instrument access) over a shared implementation; the instrument (controller, sensor 1, sensor 2) is presented to the scientist through remote access and a GUI.]
Distributed X-Ray Crystallography
• Crystallographer, chemist, and technician may be separated.
  – Large resources such as synchrotrons.
  – Convenience and productivity.
  – Expanding usage to smaller institutions.
• Data collection, analysis, and curation may be separated.
• Approximate data requirements: 1–10 TB/year.
  – Currently stored at IU.
• Real-time data collection and control.
• Collaboration with IU, Sydney, JCU, and Southampton.
X-Ray Crystallography
• Scientists are (understandably) very reluctant to let you install software on the acquisition machine.
  – Use a proxy box to access files via CIFS or NFS.
  – Scan for files which indicate activity.
    • Unfortunately, scientists can manually create files, which can confuse the scanner. No ideal solution.
• For sensor data, request-response is not ideal.
  – Push data using one-way messages.
  – In WSDL 2.0, consider “connecting” out-only services to in-only services.
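The file-scanning approach on the proxy box can be sketched as a simple polling pass over a mounted directory listing (the filename pattern is hypothetical; as noted above, a manually created file matching it would still confuse the scanner):

```python
import re

# Sketch of the proxy-box scanner: from a directory listing of the
# CIFS/NFS-mounted acquisition share, report filenames that look like
# newly written diffraction frames.  The frame_NNNN.img pattern is a
# made-up convention for illustration.

FRAME_RE = re.compile(r"^frame_\d{4}\.img$")

def new_frames(listing, already_seen):
    """Return frame-like filenames not seen in a previous scan."""
    frames = {name for name in listing if FRAME_RE.match(name)}
    return sorted(frames - already_seen)

seen = {"frame_0001.img"}
listing = ["frame_0001.img", "frame_0002.img", "notes.txt", "frame_02.img"]
fresh = new_frames(listing, seen)
```

The scanner infers activity purely from filenames, which is exactly why a stray hand-created file can trigger a false positive.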
X-Ray Crystallography
[Diagram: persistent grid services (portal, instrument manager, data archive) at Indiana University and the University of Sydney; non-persistent instrument services at Argonne National Labs and the University of Southampton, each using a proxy box that reads diffractometer output from the acquisition machine via CIFS. Non-grid services are distinguished from grid services.]
TASCS: Center for Technology for Advanced Scientific Component Software
• Multi-institution DOE project.
• Seeks to develop a common component architecture for scientific components.
• My focus within it is to develop a BabelRMI/Proteus implementation.
  – And develop C++ reflection techniques to improve dynamic connection abilities.
• With LLNL and many other institutions.
Babel
• Language interoperability toolkit developed at LLNL.
• Allows writing objects in a number of languages, including non-OOP ones such as Fortran.
• Began as a purely in-process tool, now includes an RMI interface.
Proteus
• Started off as a unification API for messaging over multiple standards and implementations, such as CORBA, JMS, SOAP.
• Moving towards focusing on multiprotocol web services.
• Though almost always bound to SOAP, WSDL actually fully supports almost any protocol.
Runtime
[Diagram: Babel RMI over Proteus. On the caller side, the stub, IOR, and C++ skeleton invoke an RMI stub, which passes a serializable object through a Babel–Proteus (B-P) adapter into Proteus; on the callee side, Proteus hands the serializable object through a B-P adapter to the skeleton, IOR, and C++ stub, reaching the user implementation. Layers are marked generated, library, user, and Babel–Proteus generated; the two Proteus instances communicate via WSIT.]
Multiprotocol
[Diagram: two processes, each containing a client, a Proteus layer, and providers A and B; process 1's provider A talks to process 2's provider A over protocol A, and provider B to provider B over protocol B, across the network.]
Lake Sunapee
• Most e-Science/cyberinfrastructure R&D is for institutional science.
  – Assumes significant resources and expertise.
• Much less work on CI for citizen science, non-profit organizations, etc.
• This project explores how to engage them in the development of cyberinfrastructure and e-Science.
  – Also with a focus on how to use e-Science to engage and educate K-12.
  – Also with a focus on how to train CS students to better engage scientists.
• With U. Wisconsin, U. Michigan, LSPA, and IES.
• Hold a series of workshops to understand needs.
• Research and develop systems that give them accessible means to interpret the sensor data.
• Course component: seminar/project course where students will work with citizen scientists in small groups to define and implement e-Science projects with the lake association.
• Semantic publish-subscribe.
  – Content-based publish-subscribe needs a content model.
  – Semantic web/description logics provide an ideal content model.
Many Small Datasets
• Much ecological data is characterized not by a few large datasets, but by many small datasets.
  – e-Science has up to now focused mostly on a few large datasets.
Flexible Electronics and Nanotechnology
• Work with Howard Wang in BU ME.
• “Ontologies” for materials science processes (internal).
• Undergraduate education project (NSF).
Material Processes
• The product of materials science research is the characterization of a process (vibration, heating, chemical, electrical, etc.).
• Applying such research means finding a sequence of processes that will transform a material A (with certain properties, such as particle size) into a material B (with certain other properties).
• Searching the research literature for this is very difficult.
• This is also a type of path-finding problem.
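The path-finding framing can be sketched with breadth-first search over a hypothetical process graph (all state and process names here are invented for illustration): material states are nodes, characterized processes are edges, and a transformation recipe is a path.

```python
from collections import deque

# Sketch: treat each characterized process as an edge from one material
# state to another, and find a transforming sequence by BFS.  States
# and process names are made up for illustration.

processes = {
    ("powder", "annealing"): "coarse_grain",
    ("powder", "milling"): "fine_grain",
    ("fine_grain", "sintering"): "dense_solid",
    ("coarse_grain", "rolling"): "sheet",
}

def find_process_path(start, goal):
    """Return a shortest list of process names transforming start
    into goal, or None if no sequence exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for (src, proc), dst in processes.items():
            if src == state and dst not in visited:
                visited.add(dst)
                queue.append((dst, path + [proc]))
    return None

path = find_process_path("powder", "dense_solid")
```

With the literature encoded this way, "how do I get from A to B?" becomes a graph query rather than a manual literature search.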
[Diagram: two process nodes. Each is an anonymous node with a hasName edge to “annealing” and a tempSchedule edge to “a schedule” (or “a different schedule”).]
Conceptually, the schedule is just a function that gives the temperature as output given the time as input. One question is whether to represent it partially in the graph model, or to treat its representation as completely outside the model. For example, a function can be represented as a table, a Fourier series, wavelets, etc.
The anonymous node only serves to “bind” the other nodes together; you can think of it as representing the process as a whole.
Information is sparse.
Undergraduate Education
• Groups of nanotechnology students develop senior design projects with CS students.
Programs: Australia, Canada, China, Finland, Florida, New Zealand, Israel, South Korea, Taiwan, United Kingdom, Wisconsin.
First meeting: San Diego, March 7–9, 2005.
Source: T. Kratz
Vision and Driving Rationale for GLEON
• A global network of hundreds of instrumented lakes, data, researchers, and students.
• Predict lake ecosystem responses to natural and anthropogenically mediated events:
  – Through improved data inputs to simulation models.
  – To better plan and preserve freshwater resources on the planet.
• More or less a grass-roots organization.
• Led by Peter Arzberger at SDSC, and with U. Wisconsin.
Why develop such a network?
• Global e-science becoming increasingly possible
• Developments in sensors and sensor networks allow some key measurements to be automated
Porter, Arzberger, C. Lin, F. P. Lin, Kratz, et al. (2005)
July 2005 Issue
Source: T. Kratz
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
Research Challenges
• The biggest challenge is data.
• Much time and effort is spent managing data in time-consuming and human-intensive ways.
  – Often stored in Excel, text files, SAS.
  – Metadata in notebooks and gray matter.
• No incentives to make data reusable.
  – Providing data is not valued academically.
• Too much manual work involved in acquisition.
  – Means much is not captured automatically and semantically.
• Standardization of things such as ontologies is very slow, and tends to be top-down.
  – Can we first build a system that provides some benefit without forcing scientists through a painful standardization process?
Cyberinfrastructure and e-Science
• There have been huge improvements in hardware.
• There have been huge local improvements in software.
• Not so many improvements in large-scale integration and interoperability.
Data, Data, and More Data!
• Data is the driver of science.
• Recent advances in technology have given us the ability to acquire and generate prodigious amounts of data.
• Processing power, disk, memory have increased at exponential rates.
It’s Not a Few Huge Datasets
• Huge datasets get more attention.
  – More glamorous.
  – A traditional type of CS problem.
  – Easier to think about.
• But it’s the number of different datasets that is the real problem.
  – With one big dataset, efforts can be concentrated on the problem.
  – Not very amenable to traditional CS “thinking”, since there is a very significant human-in-the-loop component.
  – The best CS research is useless if the human ignores it.
We Are The Same! (More or Less)
Technology advances fast.
People advance slowly! People compose our institutions, our organizations, our modes of practice.
Result: The old ways of doing things don’t cut it. But we haven’t yet figured out the new ways.
Technology Impacts Slowly
• Technologies often require many systemic changes to bring benefits.
  – Sometimes they require other complementary technologies to be invented.
• The steam engine was invented in 1712, but did not become a huge economic success until the 1800s.
• The motor and generator were invented in the early 1800s.
  – Real benefits did not occur until the 1900s.
Steam To Electric
• Steam-powered factories were built around a single large engine.
• Belts and other mechanical drives distributed power.
• If you brought a motor to a factory foreman:
  – His factory wasn’t built for it.
  – He might not be able to power it.
    • Chicken-and-egg problem.
  – He doesn’t even know how to use it.
• It took decades.
• Similarly, I believe we are in the early stages when it comes to computer technology.
Socio-Technical Problem
• What will it take to figure out how to use all this data?
• Not a pure CS problem; people’s actions affect how easy it is to use all the data.
• Many problems these days are sociotechnical in nature.
  – Password security is a solved problem.
  – Interoperability is a solved problem.
• Figuring out how to use data is even harder than power, since power distribution is physical and easy to see.
  – Data/info flow is hard to see.
A Vision
• A scientist sits in his office.
• He wonders: “Do children who live closer to cell towers have higher rates of autism?”
• How much time would it take him to test this hypothesis?
  – Find the data.
  – Reformat the data, convert it, etc.
  – Run some analysis tools. Maybe find time on a large resource.
• But the data is out there!
  – Many hypotheses are never tested because it would take too much work.
• This vision also applies to business, military, medicine, industry, management, etc.
• There are a million sources of data out there.
  – Real-time data streams, archived data, scientific publications, etc.
• How can we build a flexible infrastructure that will allow analyses to be composed and answered on the fly?
• How do we go from data+computation to knowledge?
RDF-like Data Model
• We hypothesize that part of the problem is that RDBMSs are based on data models that do not fit scientific data well.
  – This “impedance mismatch” is a barrier.
• Thus, develop models that more closely resemble the mental model scientists use when thinking about data.
  – The less a priori structure imposed on the data, the better.
Goals
• Allow some common subset of code and design to be used for many kinds of scientific data and applications.
• Suggest a data and information architecture for querying and storage.
• Provide some fundamental semantics. Each discipline would then refine these semantics.
• Don’t get bogged down trying to figure out everything. Just try to find some LCD.
• This is a logical model of data. We also need a “physical” model to handle transport, archiving, etc., and a mapping from the physical model to the logical model. For example, an image file has more than just the raw intensities, but some metadata may not be in the file. We don’t want the logical model to be concerned with how the data is actually arranged.
• Promote bottom-up, grass-roots approaches to building standards.
One Person’s Metadata Is Another Person’s Data
• The distinction between data and metadata is artificial and problematic.
  – What is metadata in one context becomes data in another. For example, suppose you are taking the temperature at a set of locations (determined via GPS). For each reading, the temperature is the data and the location is the metadata. But now suppose you need the error of the location. Does the error become the meta-metadata of the location metadata?
  – A made-up example based loosely on crystallography: the spatial correction is based on a calibration image obtained from a brass plate, so the calibration image is metadata for the set of frames. Now suppose the temperature of the brass plate when the image was made is needed. The temperature is now meta-metadata.
• Use a graph-based model.
  – Based on RDF.
  – The actual data is stored as a graph.
    • Contrast with models like E-R, where the graph “models” the data rather than actually being the data.
    • A node in E-R might be “customer”, representing the class of entities that are customers rather than any specific customer.
• The model:
  – Each node is a datum.
  – Each edge denotes an association/attribute/property.
  – Nodes can be grouped into nodesets, which are also nodes.
    • A node may be in more than one nodeset.
  – A node-edge-node triple can also be a node.
  – The main difference from RDF is an attempt to build reification into the model.
    • Somewhat similar to a hypergraph.
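One minimal rendering of this model (an illustrative sketch, not the actual implementation; the class and example values are invented): every datum is a node, a triple is itself a node so edges can attach to other edges, and a nodeset is a node that groups members.

```python
# Minimal sketch of the graph model described above: every datum is a
# node, a (subject, edge-label, object) triple is itself a node
# (built-in reification), and a nodeset is a node grouping other nodes.

class Node:
    def __init__(self, value=None):
        self.value = value    # the datum, or None for anonymous nodes
        self.members = set()  # non-empty only for nodesets

graph = []  # list of (subject, label, obj); subjects/objects are Nodes

def add_triple(subject, label, obj):
    triple = Node((subject, label, obj))  # the triple is itself a node
    graph.append(triple.value)
    return triple

reading = Node(13)
t = add_triple(reading, "unit", Node("Celsius"))
# Attach a property to the triple itself -- reification built in:
add_triple(t, "recorded_by", Node("buoy-1"))

# A nodeset grouping two readings; the set is itself a node and can
# carry its own attributes:
batch = Node()
batch.members = {reading, Node(20)}
add_triple(batch, "set_attr_1", Node("calibrated"))
```

Because the triple `t` is a first-class node, provenance like `recorded_by` hangs off the statement itself rather than off either endpoint.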
• The edge with the attribute name set_attr_1 is an attribute of a nodeset.
• The edge with the attribute name triple_prop is an attribute of the above edge.
[Diagram: nodes 13 and 20 grouped into nodesets, with temperature and angle edges; a set_attr_1 edge attached to a nodeset, and a triple_prop edge attached to another edge.]
Complete Capture of Raw Data
• Complete digital capture of data and metadata.
  – Already digital.
• Must have full provenance and other metadata.
Put Everything In the Triplestore
• Unify semantic networks and data graphs.
• Metadata relationships can use reified triples.
• Don’t wait for standards; people take too long to decide.
  – Bottom-up standards tend to work better.
  – There must first be demand for the standard.
• All data is read-only.
But We Can Never Store That Much
• Maybe we can.
• But to drive a technology, first need to show a need.
• RDBMS have had several decades of research to improve performance.
Publications Are Data
• In some fields, such as materials science, papers are 80% boilerplate text.
• It’s better to publish this directly as structured, semantic data.
  – No NL.
    • Use NL annotations where needed.
• A scientist runs experiments.
  – All data is captured.
• She reaches a point where she wishes to publish.
• She reviews her experimental data (all captured with provenance and full metadata, sensor calibration, etc.) and drags and drops what is most relevant.
• She creates a narrative by creating annotated links between experiments to explain the insights.
  – Typically at most one page of text, maybe less.
• She clicks a button to submit for publication.
Closer Ties Between Theoreticians and Practitioners
• In the real world, semantic data treatments will likely need to deal with uncertainty, quantitativeness, ambiguity, and fuzziness.
  – There is research in these areas, but not a lot of penetration into practice, which prevents good feedback to the theoreticians.
  – For example, many practitioners don’t even know about polyhierarchies (Clay Shirky).
    • Attempts to create ontologies often get stuck on trying to figure out which class is the parent.
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
Distributed Triplestores
• Published in e-Science 2007.
• With IU student Tharaka Devadithya.
Motivation
• Data in some domains is dynamically structured.
• Predefining structures (e.g., schemas in an RDBMS) creates a barrier to storing such data.
  – Certain minute details may get discarded.
• Scientists generally store experiment details in text or binary files (e.g., spreadsheets, word processing documents).
  – These files can be stored in databases as BLOBs.
  – However, it is not possible to query these data efficiently.
  – Sharing data with collaborators requires that everyone can read the format used by the author.
Storing Dynamically Structured Data
• An RDBMS can be used by modifying its schema each time the structure of the data changes.
  – Not a feasible option if the schemas need to be modified very frequently.
• Data can be stored in a file system, with a hierarchical directory structure to organize it.
  – The author needs to remember the organization of the data.
  – Difficult to share data among collaborators.
• There is a strong requirement for a store of dynamically structured data that does not hinder efficient querying.
Dynamic Structures with Databases
| Timestamp | Value | Units |
|---|---|---|
| 2006-10-12 14:23:33 | 25.2 | Celsius |
| 2006-10-12 16:44:25 | 25.5 | Celsius |

Adding a Timezone column (a new column; what value goes in the existing rows?):

| Timestamp | Timezone | Value | Units |
|---|---|---|---|
| 2006-10-12 14:23:33 | EST (or NULL?) | 25.2 | Celsius |
| 2006-10-12 16:44:25 | EST (or NULL?) | 25.5 | Celsius |

Splitting the timestamp into separate Date and Time columns:

| Date | Time | Timezone | Value | Units |
|---|---|---|---|---|
| 2006-10-12 | 14:23:33 | EST (or NULL?) | 25.2 | Celsius |
| 2006-10-12 | 16:44:25 | EST (or NULL?) | 25.5 | Celsius |
![Page 64: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/64.jpg)
68
Dynamic Structures with Databases…more issues
• Suppose the following information is stored about a sensor.
– Manufacturer
– Measurement type (e.g., temperature, humidity)
– Measurement units
• What if there is one sensor whose manufacturer is not known?
– Insert NULL into the Manufacturer field?
• Now, what if the purchase date must be stored for only one sensor?
– Add a new column? What value goes in this column for the other sensors?
– Add another table and join it with the original table?
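The sparse-column problem can be seen in a few lines of SQL. This is an illustrative sketch only; the sensor IDs, manufacturer name, and dates are made up.

```python
# Sketch of the sparse-column problem using sqlite3: storing the
# purchase date for one sensor forces NULLs into every other row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (id TEXT, manufacturer TEXT, mtype TEXT, units TEXT)")
con.execute("INSERT INTO sensor VALUES ('s1', 'Acme', 'temperature', 'Celsius')")
# One sensor's manufacturer is unknown -> NULL in the Manufacturer field.
con.execute("INSERT INTO sensor VALUES ('s2', NULL, 'humidity', 'percent')")

# The purchase date is known for only one sensor: add a column anyway,
# and every other row gets NULL.
con.execute("ALTER TABLE sensor ADD COLUMN purchased TEXT")
con.execute("UPDATE sensor SET purchased = '2006-10-12' WHERE id = 's2'")

rows = con.execute("SELECT id, manufacturer, purchased FROM sensor ORDER BY id").fetchall()
print(rows)
```

Every new one-off attribute repeats this pattern, which is what makes a fixed relational schema awkward for dynamically structured data.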
![Page 65: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/65.jpg)
69
Semantic Web Solution
• Semantic web solutions have been used successfully in both scientific and commercial environments.
– Do not impose any structure on the data.
– Data is modeled as a directed graph.
• The Resource Description Framework (RDF) is the most commonly used standard for representing such graphs.
– Can be used to describe any property of any resource.
![Page 66: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/66.jpg)
70
RDF and Triplestores
• Triple
– Subject: the resource being described
– Predicate: the property being described
– Object: the value of the property
• E.g., (methyl-cyanide, crystallographer, John)
– The crystallographer for methyl-cyanide is John.
• A graph in RDF is represented as a set of triples.
– Each triple connects a subject node to an object node in the graph.
• A persistent set of such triples is known as a triplestore.

[Diagram: Subject --Predicate--> Object]
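A triplestore can be sketched in a few lines (an illustrative toy, not any of the implementations discussed here): triples are (subject, predicate, object) tuples, and a query is a pattern in which None acts as a wildcard.

```python
# Toy triplestore: a set of (subject, predicate, object) tuples,
# queried by pattern matching with None as a wildcard.

def match(store, pattern):
    """Return all triples matching a (subject, predicate, object)
    pattern, where None matches anything."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

store = {
    ("methyl-cyanide", "crystallographer", "John"),
    ("methyl-cyanide", "location", "IUMSC"),
    ("John", "designation", "Scientist"),
}

# Who is the crystallographer for methyl-cyanide?
result = match(store, ("methyl-cyanide", "crystallographer", None))
print(result)
```

Because the store is just a set of triples, new properties (a purchase date, a timezone) are simply new triples; no schema change is needed.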
![Page 67: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/67.jpg)
71
Example of RDF Graph
![Page 68: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/68.jpg)
72
XML Databases
• Proposed as suitable for such dynamically structured data.
• Commercial databases are starting to provide native support for XML.
• XML is extensible and does not impose any structure on the data.
– Therefore, it allows structures to be built dynamically.
• Suffers from update anomalies.
![Page 69: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/69.jpg)
73
Update Anomalies with XML

• Assume an XML database is used for storing information about crystallography experiments, as follows.

<experiment>
  <crystallographer>
    <name>John Smith</name>
    <designation>Scientist</designation>
    <address>...</address>
  </crystallographer>
  <startTime>...</startTime>
  <location>IUMSC</location>
  ...
</experiment>

• Results in storing redundant information.
– The address of John Smith will be the same for all experiments.
– What happens if he changes his address? Update all previous XML fragments?
• Solution: normalize certain details, as in a relational DBMS.
– E.g., separate the address information from the experiment details and provide a link (reference) to an address document.
![Page 70: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/70.jpg)
74
• However, in order to normalize, the schema must be known in advance.
• This is not possible when data is added arbitrarily, without conforming to any predefined schema.
• The user has to determine how to normalize the data.
• Solution: normalize everything
– resulting in only attribute-value pairs. E.g.,

<experiment>
  <crystallographer ref="JohnSmith"/>
</experiment>
<JohnSmith>
  <name>John Smith</name>
</JohnSmith>
…
– Very similar to the RDF model.
![Page 71: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/71.jpg)
75
Need for a Distributed Triplestore
• Origination points
• Ownership
• Scalability
– Large number of triples.
• E.g., consider a table in an RDBMS with 15 columns. Migrating its data to a triplestore results in 15 triples for each row of the table.
– There will also be
• data from more than one table
• data that normally does not get stored in a database
– This leads to scalability issues.
• E.g., querying would be slow; indices might often need to be fetched from or stored to disk.
– To go beyond the scalability limits of a single triplestore, triples need to be distributed across multiple triplestores.
![Page 72: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/72.jpg)
77
Our Approach
• Clients access the triplestores via a mediator.
• The mediator maintains several indexes to facilitate efficient querying.
• When the mediator receives a query, it
– breaks the query down into several sub-queries
– finds out which triplestores are capable of responding to each sub-query.
• The indexes are mainly used to
– build a cost model for the query
– eliminate the triplestores that cannot produce results for a given sub-query.
![Page 73: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/73.jpg)
78
Types of Indexes at the Mediator
• Predicate Index
– Contains details about the predicates in each triplestore.
– Certain fields are used for cost estimation of sub-queries.
• Node Index
– Maintains a list of nodes in the triple graph, along with the triplestores in which these nodes exist.
– Contains only resources (e.g., ns:crystallographer); literals (e.g., "John Smith") are not stored.
– Used to eliminate certain triplestores when sub-querying.
• Edge Index
– Two edge indexes are used, for outgoing and incoming edges, respectively.
– Used to avoid querying triplestores that do not have the corresponding edges from or to them.
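A hypothetical sketch of how a node index lets the mediator prune triplestores for a sub-query. The class and method names here are invented for illustration; they are not the paper's actual design.

```python
# Mediator sketch: a node index maps each subject resource to the set
# of triplestores that contain it, so a sub-query is only sent to
# stores that can possibly answer it.

class Mediator:
    def __init__(self, stores):
        self.stores = stores          # store name -> set of triples
        self.node_index = {}          # subject resource -> set of store names
        for name, triples in stores.items():
            for s, p, o in triples:
                self.node_index.setdefault(s, set()).add(name)

    def candidate_stores(self, subject):
        # Stores not listed under this subject are eliminated up front.
        return self.node_index.get(subject, set())

    def query(self, subject, predicate):
        results = []
        for name in self.candidate_stores(subject):
            for s, p, o in self.stores[name]:
                if s == subject and p == predicate:
                    results.append(o)
        return results

stores = {
    "storeA": {("methyl-cyanide", "crystallographer", "John")},
    "storeB": {("John", "designation", "Scientist")},
}
med = Mediator(stores)
print(med.query("methyl-cyanide", "crystallographer"))  # only storeA is contacted
```

The predicate and edge indexes described above would prune further, and also feed the cost model for choosing among the remaining stores.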
![Page 74: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/74.jpg)
79
Future Work
• Minimize joins between triplestores.
– Identify frequent joins.
– Instruct the triplestores to redistribute their triples so that most future joins are performed locally.
• Avoid the extra network hop through the mediator by using a mediator cache.
• Consider network communication when estimating costs for the query plan.
![Page 75: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/75.jpg)
80
Parallel XML Parsing
• Published in Grid 2006, CCGrid 2007, e-Science 2007, IPDPS 2008, ICWS 2008 (streaming), HiPC 2008 (streaming).
• With BU students Yinfei Pan and Ying Zhang.
![Page 76: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/76.jpg)
81
Motivation
• XML has gained wide prevalence as a data format for input and output.
• Multicore CPUs are becoming widespread.– Plans for 100 cores.
• If you have 100 cores, and you are only using one to read and write your output, that could be a significant waste.
![Page 77: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/77.jpg)
82
Parallel XML Parsing
• How can XML parsing be parallelized?– Task parallelism.– Pipeline parallelism.– Data parallelism.
![Page 78: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/78.jpg)
83
• Task parallelism.
– Multiple independent processing steps.
– The sauce for a dish can be made in parallel with the main part.

[Diagram: Step 1 runs on Core 1; Steps 2A and 2B then run in parallel on Cores 1 and 2; finally Step 3 runs on Core 1. Time flows downward.]
![Page 79: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/79.jpg)
84
• Pipeline parallelism.
– Multiple stages, all performed simultaneously in parallel.
– If you are making two cakes (but only have one oven), you can start mixing the batter for the second cake while the first one is in the oven.

[Diagram: Cores 1-3 run Stages 1-3 respectively. At each time step, Core 1 starts Stage 1 on a new datum while Cores 2 and 3 run Stages 2 and 3 on earlier data (Stage 1/Data C, Stage 2/Data B, Stage 3/Data A; each datum advances one stage per step).]
![Page 80: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/80.jpg)
85
• Data parallelism
– Divide the data up, process multiple pieces in parallel.

[Diagram: Input Chunks 1-3 are processed on Cores 1-3 to produce Output Chunks 1-3, which are then merged into the final output.]
![Page 81: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/81.jpg)
86
But XML is Inherently Sequential
• How can a chunk be parsed without knowing what came before?
• The parser doesn’t know what state to start in.• Could do various scanning forwards and
backwards, but it is ad hoc, and tricky.– Special characters like < can be in comments.
<element attr=“value”>content</element>
![Page 82: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/82.jpg)
87
Previous work
• We used a fast, sequential preparse scan.
– It builds an outline of the document (the skeleton).
– The skeleton is used to guide the full parse, by first decomposing the XML document into well-formed fragments at well-defined, unambiguous positions.
– The XML fragments are parsed separately on each core using the libxml2 APIs.
– The results are merged into the final DOM using the libxml2 APIs.
• The preparse is sequential, however, so Amdahl's law kicks in. We scale well to about 4 cores.
• So how can we parallelize the preparse?
![Page 83: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/83.jpg)
88
Example: The Preparsing DFA
• The preparsing DFA has two actions: START and END, which are used to build the skeleton during execution of the DFA.
[State diagram: the preparsing DFA with states 0-7. Transitions are labeled with the characters <, >, /, !, ', ", and name characters; entering an element name triggers the START action, and the transitions that complete a tag trigger END.]
![Page 84: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/84.jpg)
89
Example of running preparsing DFA
<foo>sample</foo>
[Trace: the DFA steps through its states over the input; START is emitted as the element name foo is recognized, and END when the closing tag is recognized.]
How can this be parallelized?
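Ignoring attributes, comments, and quoting, the sequential preparse can be sketched as a single scan that emits START and END events to form the skeleton. This is a toy version for illustration, not the actual preparser (which is the DFA above).

```python
# Much-simplified preparse: scan the text once, emitting a START event
# for each open tag and an END event for each close tag. The resulting
# event list is the document outline (skeleton).

def preparse(text):
    events, i = [], 0
    while i < len(text):
        if text[i] == "<":
            close = text.find(">", i)        # end of this tag
            if text[i + 1] == "/":
                events.append(("END", text[i + 2:close]))
            else:
                events.append(("START", text[i + 1:close]))
            i = close + 1
        else:
            i += 1                           # character content: skip
    return events

print(preparse("<foo>sample</foo>"))
# [('START', 'foo'), ('END', 'foo')]
```

Even in this simplified form the scan is inherently sequential: whether a given `<` starts a tag depends on everything read before it, which is exactly the problem the meta-DFA addresses.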
![Page 85: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/85.jpg)
90
Meta-DFA

• Goal
– Pursue simultaneously all possible states at the beginning of a chunk when a processor is about to parse the chunk.
• Achieved by:
– Transforming the original DFA into a meta-DFA whose transition function runs multiple instances of the original DFA in parallel via sub-DFAs.
– For each state q of the original DFA, the meta-DFA includes a complete copy of the DFA as a sub-DFA, which begins execution in state q at the beginning of the chunk.
– During actual execution, the meta-DFA transitions from one set of states to another.
![Page 86: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/86.jpg)
91
Output Merging

• Since the meta-DFA pursues multiple possibilities simultaneously, there are also multiple outputs when a chunk is finished.
– One corresponding to each possible initial state.
• We know the state at the end of the first chunk definitively.
– This is used to select which output of the second chunk is the correct one.
– The definitive state at the end of the second chunk is then known.
– Etc.
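The meta-DFA and output-merging steps can be sketched generically (a toy illustration over an arbitrary DFA, not the paper's implementation): each chunk is run from every possible start state in parallel, and the per-chunk result tables are then stitched together sequentially once each chunk's true start state is known.

```python
# Meta-DFA sketch: run a chunk from every possible start state, then
# select the correct thread through the chunks after the fact.

def run_chunk(delta, states, chunk):
    """Map each possible start state to its end state for this chunk.
    Each chunk can be processed independently, i.e., in parallel."""
    results = {}
    for q0 in states:
        q = q0
        for ch in chunk:
            q = delta(q, ch)
        results[q0] = q
    return results

def stitch(chunk_tables, start_state):
    """Merge: the known start state of chunk i selects the correct
    entry in chunk i's table, which gives chunk i+1's start state."""
    q = start_state
    for table in chunk_tables:     # chunk tables in document order
        q = table[q]
    return q

# Toy DFA: state 0 = outside quotes, state 1 = inside quotes;
# a double quote toggles the state, everything else is ignored.
delta = lambda q, ch: 1 - q if ch == '"' else q
chunks = ['ab"c', 'd"e"', 'f']
tables = [run_chunk(delta, {0, 1}, c) for c in chunks]  # parallelizable part
print(stitch(tables, 0))  # final state after the whole input
```

Here the per-chunk work dominates, while the stitch is a cheap sequential pass over one table entry per chunk, which is what makes the scheme scale.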
![Page 87: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/87.jpg)
92
Performance Evaluation
• Machine:
– Sun E6500 with 30 400 MHz US-II processors
– Operating system: Solaris 10
– Compiler: g++ 4.0 with the -O3 option
– XML library: libxml2 2.6.16
• Tests:
– We take the average of ten runs.
– The test file is taken from the well-known Protein Data Bank (PDB) project, sized to 34 MB.
– All speedups are measured against parsing with stand-alone libxml2.
![Page 88: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/88.jpg)
93
• The full parsing process is:
– First do a parallel preparse using a meta-DFA. This generates an outline of the document known as the skeleton.
– Then use techniques based on parallel depth-first tree search to parallelize the full parse.
– Subtrees of the document are parsed using unmodified libxml2.
![Page 89: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/89.jpg)
94
Preparser Speedup
• Parallel preparser relative to the non-parallel preparser
![Page 90: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/90.jpg)
95
Speedup on parallel full parsing
• After applying our meta-DFA technique to parallelize the preparsing stage, the parallel full parsing is now scalable.
![Page 91: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/91.jpg)
96
Summary

• Data-parallel XML parsing is challenging because the parser does not know in which state to begin a chunk.
– One solution is simply to begin the parser in all states simultaneously.
• This can be achieved by modeling the parser as a DFA with actions, then transforming the DFA into a meta-DFA (a product machine).
• The meta-DFA runs multiple instances of the original DFA, one instance for each state of the original DFA.
• The number of states in the meta-DFA is finite, so it is itself a DFA and can be executed by a single core.
– The parallelism of the meta-DFA is logical parallelism.
![Page 92: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/92.jpg)
97
Future Work
• Parallelizing XPath
– Significantly more challenging; but due to Amdahl's law, parsing must be parallelized first.
• Offload preparsing to an FPGA or perhaps a GPU.
![Page 93: Addressing the Challenges of the Scientific Data Deluge](https://reader035.fdocuments.us/reader035/viewer/2022062805/56814c50550346895db9624b/html5/thumbnails/93.jpg)
98
Acknowledgements
• Grateful for the support provided by the NSF and the DOE for this work.
– NSF awards 0836667, 0753178, 0513687, and 0446298
– DOE award DE-FG02-07ER25803