An Introduction to Data Intensive Computing
Chapter 2: Data Management
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
What Are the Choices?
• Databases (SQL Server, Oracle, DB2)
• File Systems
• Distributed File Systems (Hadoop, Sector)
• Clustered File Systems (GlusterFS, …)
• NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)
What Is the Fundamental Trade-off?
Scale up vs. scale out
Section 2.1 Databases
Advice From Jim Gray
1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."
Pattern 1: Put the metadata in a database and point to files in a file system.
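A minimal sketch of this pattern using JDBC. The database URL, table name, columns, and file paths are assumptions for illustration, not part of any of the systems described here; the point is that the database holds only the metadata and a pointer, while the bulky files stay in the file system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FileCatalog {
    public static void main(String[] args) throws Exception {
        // Assumed connection string and credentials.
        Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/catalog", "user", "password");

        // Assumed schema: one row of metadata per file, with the path as the pointer.
        PreparedStatement insert = db.prepareStatement(
                "INSERT INTO images (survey, band, taken_on, path) VALUES (?, ?, ?, ?)");
        insert.setString(1, "sdss");
        insert.setString(2, "r");
        insert.setDate(3, java.sql.Date.valueOf("2008-01-15"));
        insert.setString(4, "/data/sdss/r/frame-001234.fits");  // the file itself stays on the file system
        insert.executeUpdate();

        // Queries run against the small metadata table; only the matching files are opened.
        PreparedStatement query = db.prepareStatement(
                "SELECT path FROM images WHERE survey = ? AND band = ?");
        query.setString(1, "sdss");
        query.setString(2, "r");
        ResultSet rs = query.executeQuery();
        while (rs.next()) {
            System.out.println("Read pixels from: " + rs.getString("path"));
        }
        db.close();
    }
}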
Example: Sloan Digital Sky Survey
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB processed catalogs
  – 2.5 Terapixels of images
• Catalog uses Microsoft SQL Server
• Started in 1992, finished in 2008
• JHU SkyServer serves millions of queries
Example: Bionimbus Genomics Cloud
www.bionimbus.org
[Architecture diagram: a GWT-based front end sits over database services (PostgreSQL), analysis pipeline & re-analysis services, data ingestion services, an ID service, large data cloud services (Hadoop, Sector/Sphere), elastic/utility cloud services (Eucalyptus, OpenStack), and intercloud services (UDT, replication).]
Section 2.2 Distributed File Systems
Hadoop's Large Data Cloud
[Diagram: Hadoop's large data cloud, like Sector/Sphere, provides storage services and compute services.]
Hadoop’s Stack
Applica+ons
Hadoop Distributed File System (HDFS)
Hadoop’s MapReduce
Data Services NoSQL Databases
Pattern 2: Put the data into a distributed file system.
Hadoop Design
• Designed to run over commodity components that fail.
• Data replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low latency access.
• Designed for write once, read many.
• Append operation planned for the future.
Hadoop Distributed File System (HDFS) Architecture
[Diagram: the client exchanges control messages with a single Name Node; the data itself flows directly between the client and Data Nodes spread across racks.]
• HDFS is block-based.
• Written in Java.
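A minimal sketch of writing and then reading a file through the HDFS Java API. The name node address and the path are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed name node address; the client asks the name node where blocks live,
        // then reads and writes the blocks directly from and to the data nodes.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/tutorial/sample.txt");

        // Write once ...
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("hello, hdfs\n");
        out.close();

        // ... read many. Scans are streamed with high throughput rather than low latency.
        FSDataInputStream in = fs.open(path);
        byte[] buffer = new byte[1024];
        int n = in.read(buffer);
        System.out.println(new String(buffer, 0, n));
        in.close();
    }
}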
Sector Distributed File System (SDFS) Architecture
• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide area operations.
Sector Distributed File System (SDFS) Architecture
[Diagram: the client exchanges control messages with master nodes (there can be more than one) and with a security server; the data itself flows directly between the client and slave nodes spread across racks.]
• Sector is file-based.
• Written in C++.
• Security server.
• Multiple masters.
GlusterFS Architecture
• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.
[Diagram: the client reads and writes data directly to GlusterFS server bricks spread across racks.]
• File-based.
Section 2.3 NoSQL Databases
Evolution
• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache
Scaling RDBMSs
• Master-slave database systems
  – Writes go to the master; reads go to the slaves
  – Writing to the slaves can become a bottleneck; reads can be inconsistent
• Sharded databases (see the sketch below)
  – Applications and queries must understand the sharding schema
  – Both reads and writes scale
  – No native, direct support for joins across shards
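A minimal sketch of how an application might route reads and writes under a sharding schema. The hash-mod-N rule, the shard URLs, and the table are assumptions for illustration; real deployments often use range- or directory-based sharding instead.

import java.sql.Connection;
import java.sql.DriverManager;

public class ShardRouter {
    // Assumed shard URLs; each shard holds a disjoint slice of the rows.
    private static final String[] SHARDS = {
        "jdbc:postgresql://shard0/app",
        "jdbc:postgresql://shard1/app",
        "jdbc:postgresql://shard2/app"
    };

    // The application must know the sharding schema: here, hash of the user id mod N.
    static Connection connectionFor(String userId) throws Exception {
        int shard = Math.abs(userId.hashCode()) % SHARDS.length;
        return DriverManager.getConnection(SHARDS[shard]);
    }

    public static void main(String[] args) throws Exception {
        // Reads and writes for one user both go to that user's shard, so both scale out.
        try (Connection c = connectionFor("user-42")) {
            // A join across users on different shards cannot be pushed down to the database;
            // the application would have to fetch rows from each shard and combine them itself.
            c.createStatement().executeUpdate(
                "UPDATE profiles SET last_seen = now() WHERE user_id = 'user-42'");
        }
    }
}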
NoSQL Systems
• Suggests "no SQL support," but also "not only SQL"
• One or more of the ACID properties not supported
• Joins generally not supported
• Usually flexible schemas
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra
• Quite a few recent open source systems
Pattern 3: Put the data into a NoSQL application.
CAP – Choose Two Per Operation
[Diagram: a triangle with Consistency (C), Availability (A), and Partition-resiliency (P) at its corners.]
• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum (BigTable, HBase).
CAP Theorem
• Proposed by Eric Brewer, 2000
• Three properties of a system: consistency, availability, and partition tolerance
• You can have at most two of these three properties for any shared-data system
• Scale-out requires partitions
• Most large web-based systems choose availability over consistency
Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
Eventual Consistency
• If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol (see the sketch below).
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
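A toy sketch of the idea behind gossip-based eventual consistency: each replica periodically exchanges its version-stamped values with a peer and keeps the newer version (last-write-wins). This is only an illustration of the concept, not the protocol used by Dynamo or any other particular system.

import java.util.HashMap;
import java.util.Map;

public class GossipReplica {
    // Each key maps to a value stamped with the time it was written.
    static class Versioned {
        final String value;
        final long timestamp;
        Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    private final Map<String, Versioned> store = new HashMap<>();

    // A local write only touches this replica; other replicas learn of it later via gossip.
    void put(String key, String value) {
        store.put(key, new Versioned(value, System.currentTimeMillis()));
    }

    // One round of gossip with a peer: both sides end up with the newer version of every key.
    void gossipWith(GossipReplica peer) {
        mergeFrom(peer.store);
        peer.mergeFrom(this.store);
    }

    private void mergeFrom(Map<String, Versioned> other) {
        for (Map.Entry<String, Versioned> e : other.entrySet()) {
            Versioned mine = store.get(e.getKey());
            if (mine == null || e.getValue().timestamp > mine.timestamp) {
                store.put(e.getKey(), e.getValue());   // last write wins
            }
        }
    }

    public static void main(String[] args) {
        GossipReplica a = new GossipReplica(), b = new GossipReplica();
        a.put("profile:42", "v1");   // b has not seen this yet: the replicas are inconsistent
        a.gossipWith(b);             // after enough gossip rounds, all replicas converge
        System.out.println(b.store.get("profile:42").value);   // prints v1
    }
}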
Different Types of NoSQL Systems
• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB
HBase Architecture
[Diagram: clients connect through a Java client or a REST API; an HBaseMaster coordinates the HRegionServers, each of which stores its data on disk.]
Source: Raghu Ramakrishnan
HRegionServer
• Records are partitioned by column family into HStores
  – Each HStore contains many MapFiles
• All writes to an HStore are applied to a single memcache
• Reads consult the MapFiles and the memcache
• Memcaches are flushed as MapFiles (HDFS files) when full
• Compactions limit the number of MapFiles
[Diagram: within an HRegionServer, writes go into an HStore's memcache and are flushed to disk as MapFiles; reads consult both.]
Source: Raghu Ramakrishnan
Facebook’s Cassandra
• Modeled aier BigTable’s data model • Modeled aier Dynamo’s eventual consistency • Peer to peer storage architecture using consistent hashing (Chord hashing)
33
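A minimal sketch of consistent hashing as used for peer-to-peer placement: nodes and keys are hashed onto the same ring, and a key is stored on the first node clockwise from its hash. The hash function and node names are assumptions for illustration (real systems use stronger hashes and virtual nodes).

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    // Hash both node names and keys onto the same circular space.
    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;   // simple non-cryptographic hash, for illustration only
    }

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // A key belongs to the first node at or after its hash, wrapping around the ring.
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println(ring.nodeFor("row-1234"));
        // Adding or removing a node only remaps the keys in one arc of the ring,
        // which is why peer-to-peer stores can grow and shrink incrementally.
        ring.addNode("node-d");
        System.out.println(ring.nodeFor("row-1234"));
    }
}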
Databases vs. NoSQL Systems
• Scalability: databases scale to 100s of TB; NoSQL systems scale to 100s of PB.
• Functionality: databases offer full SQL-based queries, including joins; NoSQL systems offer optimized access to sorted tables (tables with single keys).
• Optimized for: databases are optimized for safe writes; data clouds are optimized for efficient reads.
• Consistency model: databases use ACID (Atomicity, Consistency, Isolation & Durability), so the database is always consistent; NoSQL systems use eventual consistency, so updates eventually propagate through the system.
• Parallelism: difficult for databases because of the ACID model (shared-nothing is possible); the basic NoSQL design incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to a data center.
Section 2.3 Case Study: Project Matsu
Zoom Levels / Bounds
• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images
Source: Andrew Levine
Build Tile Cache in the Cloud - Mapper
[Diagram: step 1, the mapper's input is a bounding box key (e.g. minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5) with an image as the value; step 2, the mapper resizes and/or cuts up the original image into pieces; step 3, the mapper outputs one (bounding box, image) pair per piece.]
Source: Andrew Levine
Build Tile Cache in the Cloud - Reducer
[Diagram: step 1, the reducer's input is a bounding box key (e.g. minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375) with the image pieces that fall inside it; step 2, the reducer assembles the images based on the bounding box.]
• Output to HBase
• Builds up layers for WMS for various datasets
Source: Andrew Levine
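A skeleton of this tiling job in the Hadoop MapReduce Java API. The key and value types (a bounding box serialized as Text, an image as BytesWritable) and the helpers cutIntoTiles and assembleTile are assumptions standing in for the actual Project Matsu code, which is not shown in the slides.

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TileCacheJob {

    // Mapper: input key is a bounding box serialized as text ("minx,miny,maxx,maxy"),
    // input value is the encoded image; output is one (bounding box, piece) pair per tile.
    public static class TileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text boundingBox, BytesWritable image, Context context)
                throws IOException, InterruptedException {
            byte[] bytes = Arrays.copyOf(image.getBytes(), image.getLength());
            // cutIntoTiles is a placeholder for the real resize-and-slice logic.
            for (Map.Entry<String, byte[]> tile : cutIntoTiles(boundingBox.toString(), bytes).entrySet()) {
                context.write(new Text(tile.getKey()), new BytesWritable(tile.getValue()));
            }
        }
    }

    // Reducer: all pieces that share a bounding box arrive together and are assembled
    // into one tile, which can then be written into the HBase table for the WMS layer.
    public static class TileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text boundingBox, Iterable<BytesWritable> pieces, Context context)
                throws IOException, InterruptedException {
            byte[] assembled = assembleTile(boundingBox.toString(), pieces);
            context.write(boundingBox, new BytesWritable(assembled));
        }
    }

    // Placeholders: the actual image manipulation is not shown in the slides.
    static Map<String, byte[]> cutIntoTiles(String box, byte[] image) {
        throw new UnsupportedOperationException("image slicing not shown");
    }
    static byte[] assembleTile(String box, Iterable<BytesWritable> pieces) {
        throw new UnsupportedOperationException("image assembly not shown");
    }
}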
HBase Tables
• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase schema
  – Layers, Styles, Projection, Size
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image
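A minimal sketch of writing and reading one tile with the HBase Java client, following the schema above. The layer name, style, projection, tile size, and row key are made-up values for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class WmsTileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Table name: the WMS layer (made-up name here).
        HTable table = new HTable(conf, "landsat_rgb");

        byte[] rowId      = Bytes.toBytes("-135.0,45.0,-112.5,67.5"); // bounding box of the image
        byte[] family     = Bytes.toBytes("default_epsg4326");        // style name and projection
        byte[] qualifier  = Bytes.toBytes("256x256");                  // width x height
        byte[] imageBytes = new byte[0];                               // encoded buffered image would go here

        // Write one tile.
        Put put = new Put(rowId);
        put.add(family, qualifier, imageBytes);
        table.put(put);

        // A WMS query for the same layer, style, projection, and size becomes a simple Get.
        Get get = new Get(rowId);
        get.addColumn(family, qualifier);
        Result result = table.get(get);
        byte[] tile = result.getValue(family, qualifier);
        System.out.println("tile bytes: " + (tile == null ? 0 : tile.length));

        table.close();
    }
}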
Section 2.4 Distributed Key-Value Stores
S3
Pattern 4: Put the data into a distributed key-value store.
S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt.
S3 Keys
• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).
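A minimal sketch of putting and getting an object with the AWS SDK for Java. The bucket name follows the domain-style pattern above, and the credentials and file name are placeholders.

import java.io.File;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

public class S3Example {
    public static void main(String[] args) {
        // The access key functions as the user name and the secret key as the password.
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        String bucket = "tutorial.osdc.org";   // bucket names must be unique across AWS
        String key = "dataset1.txt";           // keys must be unique within the bucket

        // Put: the object is then addressable as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
        s3.putObject(bucket, key, new File("dataset1.txt"));

        // Get: read the value back (values can be as large as 5 TB).
        S3Object object = s3.getObject(bucket, key);
        System.out.println("content length: " + object.getObjectMetadata().getContentLength());
    }
}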
S3 Security
• AWS access key (user name)
  – This functions as your S3 user name. It is an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as the password)
AWS Account Information
[Screenshot: the AWS account page listing the access keys, which serve as the user name and password.]
Other Amazon Data Services
• Amazon SimpleDB
• Amazon Elastic Block Store (EBS)
Section 2.5 Moving Large Data Sets
The Basic Problem
• TCP was never designed to move large data sets over wide area high performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.
TCP Throughput vs RTT and Packet Loss
[Chart: TCP throughput (Mb/s, 0-1000) versus round trip time (ms, 1-400), with curves for packet loss rates from 0.01% to 0.5%; throughput falls sharply as the RTT grows from LAN to US, US-EU, and US-ASIA distances and as the loss rate increases.]
Source: Yunhong Gu, 2007, experiments over a wide area 1G network.
The Solution
• Use parallel TCP streams, e.g. GridFTP (see the sketch after this list).
• Use specialized network protocols, e.g. UDT, FAST, etc.
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.
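A toy sketch of the parallel-streams idea: the sender splits a file into ranges and pushes each range over its own TCP connection, so one slow stream does not cap the aggregate throughput. The host, port, file name, and framing are assumptions for illustration; real tools such as GridFTP handle negotiation, integrity checking, and reassembly on the receiving side.

import java.io.DataOutputStream;
import java.io.RandomAccessFile;
import java.net.Socket;

public class ParallelSender {
    public static void main(String[] args) throws Exception {
        final String host = "receiver.example.org";  // assumed receiver host
        final int port = 9000;                       // assumed port
        final String file = "dataset1.bin";          // assumed local file
        final int streams = 4;

        final long length;
        try (RandomAccessFile f = new RandomAccessFile(file, "r")) {
            length = f.length();
        }
        final long chunk = (length + streams - 1) / streams;

        Thread[] workers = new Thread[streams];
        for (int i = 0; i < streams; i++) {
            final long offset = i * chunk;
            final long size = Math.min(chunk, length - offset);
            workers[i] = new Thread(() -> {
                // Each range travels over its own TCP connection, so the congestion and
                // flow-control windows of the streams grow independently.
                try (Socket socket = new Socket(host, port);
                     RandomAccessFile in = new RandomAccessFile(file, "r")) {
                    DataOutputStream out = new DataOutputStream(socket.getOutputStream());
                    out.writeLong(offset);   // simple framing: tell the receiver where this range goes
                    out.writeLong(size);
                    in.seek(offset);
                    byte[] buffer = new byte[64 * 1024];
                    long remaining = size;
                    while (remaining > 0) {
                        int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                        if (n < 0) break;
                        out.write(buffer, 0, n);
                        remaining -= n;
                    }
                    out.flush();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}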
Case Study: Bio-mirror
"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." -- Don Gilbert, August 2010, bio-mirror.net
Moving 113 GB of Bio-mirror Data
Site      RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Distance (km)
NCSA          10        139          139        1            200
Purdue        17        125          125        1            500
ORNL          25        361          120        3          1,200
TACC          37        616          120        5          2,000
SDSC          65        750          475        1.6        3,300
CSTNET       274       3722          304       12         12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times in minutes. Source: http://gridip.bio-mirror.net/biomirror/
Case Study: CGI 60 Genomes
• Trace by Complete Genomics showing performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gbps on a 1G link. Source: Complete Genomics.
Resource Use

Protocol         CPU Usage*     Memory*
GridFTP (UDT)    1.0% - 3.0%    40 MB
GridFTP (TCP)    0.1% - 0.6%     6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
Questions?
For the most current version of these notes, see rgrossman.com