Building a Scientific Data Warehouse Supporting Petascale
Science and Data Mining
XLDB 2012
Clark Gaylord, Chief Information Officer
Virginia Tech Transportation Institute
[email protected]
11 September 2012
Background
VTTI was established in August 1988 by agreement between US DOT and the University Transportation Centers Program
• Largest university-level research center at Virginia Tech
  – Approximately 300 faculty, staff, and students working on over 150 projects
  – $80 million awarded
  – Approximately $30 million in annual expenditures
  – Largest supporter of both undergraduate and graduate students
11 September 2012 CKG – Building Data Warehouse 2
Unique Facilities
Instrumented Vehicles
The Virginia Smart Road
The Virginia Smart Road
• Advanced Control Room
• Weather capabilities
• Variable Lighting Systems
• Pavement Testing
VTTI Naturalistic Driving Research
Epidemiological Data Collection
• Reactive
• Precise knowledge about crash risk
• Information about important circumstances and scenarios that lead to crashes
• Very limited pre-crash information

Empirical Data Collection
• Proactive
• Provides important ordinal crash risk info
• Imprecise, relies on unproven safety surrogates
• Experimental situations modify driver behavior

Large-Scale Naturalistic Data Collection
• “Natural” driver behavior in full driving context
• Detailed pre-crash/crash info including driver performance/behavior, driver error, and vehicle kinematics
• Can utilize combination of crash, near-crash, and other safety surrogate data
Naturalistic Method
• Study participants use an instrumented vehicle for an extended period (e.g., several months to two years)
• Able to get detailed pre-crash/crash information along with routine driving behaviors
• Highly capable data acquisition
• Able to collect crash pre-cursor data and driver performance/behavior data using sensors and video cameras
SHRP2 Naturalistic Driving Study
• Strategic Highway Research Program
• Funded through the Transportation Research Board of the National Academies
• Large-scale nationwide naturalistic driving study
– Six regional centers
Scale of SHRP2
100-Car
• 150 vehicle years (100 vehicles, 18 months)
• 43,000 hours
• 2,000,000 miles
• 6 TB total storage
  – 94% video
• 700 GB sensor database
• More constrained by instrumentation

SHRP2
• 4,000 vehicle years (3,000 distinct vehicles, 2,000 at a time for two years)
• 2,000,000 hours
• 60,000,000 miles
• 1.5 PB total storage
  – 85% video
• 250 TB sensor data
• ~400 a priori research questions
• 20-30 year life cycle for research, data mining
SHRP2 Data Gathering
• Real time health check
• Automatic crash reports
• Bulk data are harvested every few months
• Total 1-2 TB/day
• Sites send data from the regional centers via Internet2
Future Naturalistic Studies
• Many more studies coming
• None (yet) planned as large as SHRP2
– Commercial vehicles
• Some more “epoch-based”
Experiences analyzing data
• Data analysis for VTTI’s legacy naturalistic studies has focused on using individual trip data files (both sensor and video)
• Identification of events from sensor algorithms, coupled with effort-intensive data reduction and annotation
• Difficult or expensive to scale this method to larger studies
• Analysis methods and infrastructure were not suitable to perform larger scale data mining – Very useful for “case study” (e.g. crash investigation, random samples)
• Some success extracting data to a database for mining
Application support
• Desktop and cluster: Matlab, R
  – SAS only on desktop, mostly due to licensing cost on cluster
• Cluster: python, shell for data ingestion, other utility tasks
• Custom Windows applications for visualization on desktop
• Legacy Windows computational/simulation software
Typical Analysis Workflow
• Researcher tries to pull all data into Matlab (or R)
• Researcher eventually learns some things can be expressed better in SQL
• Researcher finds out not everything performs well in SQL
• Researcher pulls all data into Matlab (or R)
Data types
• Sensor data:
– Time series data
– Not on same unified time mesh
• Compressed Video (h.264)
• Geospatial data (GIS/SQL)
• Other sources
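Because the sensor channels are not on a unified time mesh, any analysis that compares channels must first resample them onto a common set of timestamps. A minimal pure-Python sketch, with illustrative rates and units (not VTTI’s actual pipeline):

```python
# Sketch (assumed, illustrative): align two sensor streams that were
# sampled at different rates onto one shared time mesh.

def interpolate(series, t):
    """Linearly interpolate a [(timestamp, value), ...] series at time t."""
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside series range")

def resample(series, mesh):
    """Resample a series onto a shared mesh of timestamps."""
    return [interpolate(series, t) for t in mesh]

# 10 Hz accelerometer vs. 1 Hz speed sensor, resampled to the 10 Hz mesh:
accel = [(i * 100, 0.1 * i) for i in range(11)]   # (ms, g)
speed = [(i * 1000, 50 + i) for i in range(2)]    # (ms, mph)
mesh = [t for t, _ in accel]
speed_10hz = resample(speed, mesh)
```

Real pipelines would use vectorized interpolation and handle gaps and interrupt-driven channels, but the alignment step itself is this simple.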
Data structures
• Legacy studies were “big rectangle” synchronized measurements
• SHRP2 and other current studies are more of an “AV-pair” pattern:
  – File_ID
  – Timestamp
  – VariableID
  – Value
  – Status
• A VariableID may have one observation per file or several thousand observations per file over time
  – Commonly 10 Hz, 20 Hz, 1 Hz, or interrupt driven
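The AV-pair pattern is a few lines of SQL. This sketch uses SQLite as a stand-in for the production DB2 system, with column names from the slide and made-up sample values:

```python
import sqlite3

# Minimal sketch of the AV-pair (entity-attribute-value) time-series table;
# SQLite stand-in for DB2, illustrative data only.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE collected_data (
        file_id    INTEGER,
        timestamp  INTEGER,   -- ms offset within the trip file
        variableid INTEGER,
        data       REAL,
        status     INTEGER    -- QA/sanity flag
    )
""")

# One variable (e.g. x-acceleration at 10 Hz) produces one row per sample:
rows = [(1895896, 561198 + 100 * i, -396, 0.09 + 0.001 * i, 0)
        for i in range(10)]
con.executemany("INSERT INTO collected_data VALUES (?,?,?,?,?)", rows)

# Pulling one channel for one trip file is a simple indexed range scan:
(n,) = con.execute(
    "SELECT COUNT(*) FROM collected_data "
    "WHERE file_id = ? AND variableid = ?", (1895896, -396)).fetchone()
```

With an index on (file_id, variableid, timestamp), this access pattern is what makes the sub-second per-file queries shown later possible.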
Why database?
• Performance and scalability
  – With 100-Car, 200,000 files were collected; computations routinely took weeks to perform
  – With the 100-Car data in a database, this could take less than an hour
• Common interface (JDBC/ODBC) supports many tools
• Expressive semantics, accessibility of SQL
• Maturity of technology
• Good support for indexing and partitioning
• Natural metadata
• Typed data – not just strings and AV pairs
• Not so much referential integrity, etc.
File-oriented approach?
• File-oriented technologies, e.g. Hadoop, have promise but need further investigation and feasibility/proof-of-concept work
• Not optimized for computationally intensive environments or floating-point algorithms
• Less mature, accessible, and ubiquitous than SQL/databases
• Potentially ultimately more scalable or cost-effective
• Lower software licensing costs
  – Open-source databases, e.g. PostgreSQL, are also an option
• Perhaps a 3-5 year horizon?
Schema for Instrumentation Data
• Collected data have variables:
  – File ID, variable ID, timestamp, data value, sanity
  – Up to about twenty tables have this structure
• Each of these tables exists for a data value type:
  – Integers: short/int/long
  – Floats: real/double
  – Strings
• Each of these has different tables for:
  – “hot” vs. “cold”
  – “low-frequency” – this reflects a specific DB2-ism
• Plus separate tables for “PII” data
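One way to picture the per-type tables is a small router that sends each sample to the table matching its value type. This is an assumed simplification (SQLite stand-in, hypothetical table and helper names; the real warehouse also splits by hot/cold tier and frequency):

```python
import sqlite3

# Sketch (assumed): one table per value type, so data stay typed rather
# than being stringified into a single generic column.
con = sqlite3.connect(":memory:")
for name, sqltype in [("data_int", "INTEGER"),
                      ("data_float", "REAL"),
                      ("data_string", "TEXT")]:
    con.execute(f"CREATE TABLE {name} (file_id INTEGER, timestamp INTEGER, "
                f"variableid INTEGER, data {sqltype}, status INTEGER)")

# Hypothetical routing helper: pick the table from the value's type.
TYPE_TABLE = {int: "data_int", float: "data_float", str: "data_string"}

def insert_sample(file_id, ts, varid, value, status=0):
    table = TYPE_TABLE[type(value)]
    con.execute(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)",
                (file_id, ts, varid, value, status))

insert_sample(810, 0, -396, -0.04)   # float  -> data_float
insert_sample(810, 0, 12, 3)         # int    -> data_int
insert_sample(810, 0, 99, "P")       # string -> data_string
```

In the production schema the metadata table, not the Python type, would determine the destination, but the routing idea is the same.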
(Simplified) Schema
• Collected data
  – File_ID
  – Timestamp
  – VariableID
  – Data
    • Float
    • Int
    • String
  – QA Status
• Many of these by type, tier, index type
• Metadata
  – VariableID
  – Module name
  – Variable name
  – Units
• File_Info
  – File_ID
  – Datafile_ID
  – Filetype
    • Video/Audio
  – Filename
Various metadata
FILE_INFO
FILE_ID BIGINT
FILE_GROUP_ID BIGINT
FILE_TYPE_ID INTEGER
DATA_FILE_ID BIGINT
FILE_NAME VARCHAR(512)
FILE_PATH VARCHAR(512)
FILE_HASH_VALUE VARCHAR(512)
HASH_TYPE VARCHAR(512)
KEY_INITIALIZATION_VECTOR VARCHAR(512)
ENCRYPTION_KEY VARCHAR(512)
KEY_TYPE VARCHAR(512)
INSERTTIME TIMESTAMP
LASTUPDATETIME TIMESTAMP
FILE_SIZE BIGINT
DATA_FILE_EXTRA_INFORMATION
FILE_ID BIGINT
MINIMUM_TIME BIGINT
MAXIMUM_TIME BIGINT
ACQUISITION_BOARD_BOARD_ID DOUBLE
ACQUISITION_BOARD_STRING_ID VARCHAR(8000)
STORAGE_BOARD_BOARD_ID DOUBLE
STORAGE_BOARD_STRING_ID VARCHAR(8000)
INSERTTIME TIMESTAMP
LASTUPDATETIME TIMESTAMP
FILE_GROUP
FILE_GROUP_ID BIGINT
FILE_NAME_BASE VARCHAR(512)
HDD_SERIAL VARCHAR(512)
COPY_DATE_TIME TIMESTAMP
FILE_HEADERS
FILE_ID BIGINT
HEADER XML
HEADERSOURCEID SMALLINT
INSERTTIME TIMESTAMP
LASTUPDATETIME TIMESTAMP
METADATA
MODULENAME VARCHAR(128)
VARIABLENAME VARCHAR(128)
VARIABLEID INTEGER
TABLENAME VARCHAR(128)
COLUMNNAME VARCHAR(128)
COLLECTEDFREQUENCY DOUBLE
ISCOLLECTED SMALLINT
ISDEMUXED SMALLINT
ISCOMPUTED SMALLINT
ISSTANDARD SMALLINT
UNITS VARCHAR(128)
CLASS VARCHAR(8)
SOLTYPE VARCHAR(16)
FINAL_TABLE VARCHAR(128)
SUMMARY_INFO
FILE_ID BIGINT
VEHICLE_MANAGEMENT_ID INTEGER
PARTICIPANT_ID INTEGER
LOCATION_CODE VARCHAR(4)
COLLECTED_DATE_TIME TIMESTAMP
COLLECTION_MODE VARCHAR(25)
COLLECTION_PHASE VARCHAR(50)
VIDEO_FILE_EXTRA_INFORMATION
FILE_ID INTEGER
DEGREESROTATION SMALLINT
INSERTTIME TIMESTAMP
LASTUPDATETIME TIMESTAMP
OFFSET INTEGER
ALIGNMENT_VARIABLE VARCHAR(128)
Sample data
FILE_ID    STATUS  VARIABLEID  TIMESTAMP  DATA
---------  ------  ----------  ---------  ------
1,895,896  0       -396        561,198    0.0928
1,895,896  0       -396        561,299    0.0986
1,895,896  0       -396        561,398    0.1015
1,895,896  0       -396        561,499    0.1073
1,895,896  0       -396        561,598    0.1131
1,895,896  0       -396        561,699    0.1102
1,895,896  0       -396        561,798    0.1073
1,895,896  0       -396        561,899    0.1131
1,895,896  0       -396        561,998    0.116
1,895,896  0       -396        562,099    0.1247
1,895,896  0       -396        562,198    0.1305
1,895,896  0       -396        562,299    0.1276
1,895,896  0       -396        562,398    0.1247
1,895,896  0       -396        562,499    0.1247
1,895,896  0       -396        562,598    0.1276

[Entire file’s x_accel takes < 0.5 second to query.]
Sample summary query
FILE_ID  VARIABLEID  MODULENAME  VARIABLENAME  COUNT_DATA  AVERAGE_DATA
-------  ----------  ----------  ------------  ----------  ------------
810      -396        IMU         Accel_X          636      -0.0417
810      -397        IMU         Accel_Y          636      -0.0141
810      -398        IMU         Accel_Z          636      -0.9831
811      -396        IMU         Accel_X        2,903      -0.0338
811      -397        IMU         Accel_Y        2,903      -0.0091
811      -398        IMU         Accel_Z        2,903      -0.9857
822      -396        IMU         Accel_X           54      -0.0276
822      -397        IMU         Accel_Y           54      -0.0056
822      -398        IMU         Accel_Z           54      -0.9869
831      -396        IMU         Accel_X           81      -0.0265
831      -397        IMU         Accel_Y           81      -0.0051
831      -398        IMU         Accel_Z           81      -0.9871
838      -396        IMU         Accel_X        6,363      -0.0045
838      -397        IMU         Accel_Y        6,363       0.0018
838      -398        IMU         Accel_Z        6,363      -0.9903
857      -396        IMU         Accel_X       10,928      -0.0240
857      -397        IMU         Accel_Y       10,928      -0.0168
857      -398        IMU         Accel_Z       10,928      -0.9872
859      -396        IMU         Accel_X          403       0.0066
859      -397        IMU         Accel_Y          403       0.0127
859      -398        IMU         Accel_Z          403      -0.9870
862      -396        IMU         Accel_X          413      -0.0547
862      -397        IMU         Accel_Y          413      -0.0312
862      -398        IMU         Accel_Z          413      -0.9841
... [for over 6,000 file_id‘s]
Less than 20 seconds for query results
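The summary above is a straightforward aggregate joined to the metadata table. A sketch of the query shape, with SQLite standing in for DB2 and fabricated sample rows (one file, one variable):

```python
import sqlite3

# Sketch of the per-file summary query: GROUP BY file/variable, joined to
# metadata for human-readable names. Illustrative data, not real VTTI rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE collected_data "
            "(file_id INTEGER, timestamp INTEGER, variableid INTEGER, "
            " data REAL, status INTEGER)")
con.execute("CREATE TABLE metadata "
            "(variableid INTEGER, modulename TEXT, variablename TEXT)")
con.execute("INSERT INTO metadata VALUES (-396, 'IMU', 'Accel_X')")
con.executemany("INSERT INTO collected_data VALUES (?,?,?,?,?)",
                [(810, 100 * i, -396, -0.04, 0) for i in range(636)])

row = con.execute("""
    SELECT d.file_id, d.variableid, m.modulename, m.variablename,
           COUNT(d.data)         AS count_data,
           ROUND(AVG(d.data), 4) AS average_data
    FROM collected_data d
    JOIN metadata m ON m.variableid = d.variableid
    GROUP BY d.file_id, d.variableid
""").fetchone()
```

The same query over thousands of files is what the partitioned warehouse answers in under 20 seconds.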
Approach to Data Center Design
• Technical and performance specs
• Balance cost with performance & availability
• Focus on more mature technology
  – While still needing to push the state of the art
• Matlab/R/SAS researchers can add SQL to their skill set
  – Not so much Java/C++
• Other (non-programmer) analysts need visualization tools
• Systems programmers can use python, Java, C++
High Performance?
• What do we mean by “high performance”?
– Actually we do “high throughput”…
• Computational and communication resources that are beyond those normally achievable by individual desktop workstations or stand-alone servers in typical enterprise environments.
Infrastructure to support data-intensive science
• Large (parallel) file system
– Especially for unstructured data
• Hierarchical storage
• Compute cluster
• Distributed workflow
• Structured data warehouse
– Parallel database using PostgreSQL, DB2, …
VTTI Smart Data Center Infrastructure
[Diagram: the original slide is an architecture diagram; the recoverable components are summarized below.]
• VTTI storage array: 1 PB, GPFS, 10 GigE
• VTTI compute cluster: 48 nodes (12 x 4) Dell C6100, each 4 CPUs x 6 cores (24 cores); Dell R710 Linux head nodes; Platform PCM hybrid Linux/Windows; intra-cluster 10 GigE fabric
• VTTI Scientific Data Warehouse: InfoSphere (DB2), ~400 TB, IBM/SGI
  – InfoSphere DW head node (active): IBM x3650, 128 GB RAM, IBM DS3400 (12 x 450 GB SAS)
  – InfoSphere DW head node (standby): IBM x3650, 128 GB RAM
  – InfoSphere DW ETL node (active): IBM x3650, 128 GB RAM, IBM DS3400 (12 x 450 GB SAS)
  – DB2 DW workers (each): 8 partitions/worker, 20 TB/partition, IBM DS4800 SATA; plus a standby worker
  – Dell R710 SQL Server replication
• SGI DMF “Archive”: 5+ PB disk/tape
• Networks: VT Data Center 1 G LAN, HPC 10 G / 1 G LAN; researchers connect via the Internet
Data warehouse building block
[Diagram: each building block is an IBM x3650 (8 cores, 128 GB RAM) attached via Fibre Channel and SAS to its storage: DS3400 4 TB SAS arrays with EXP3000 4 TB SAS expansions, DS3512/EXP3512 30 TB NL-SAS arrays, and DS3400/EXP3000 9 TB SATA arrays.]