Managing Big Data (Chapter 2, SC 11 Tutorial)


An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman, University of Chicago and Open Data Group

Collin Bennett, Open Data Group

November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

What Are the Choices?

• Databases (SQL Server, Oracle, DB2)
• File systems
• Distributed file systems (Hadoop, Sector)
• Clustered file systems (GlusterFS, …)
• NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)

What Is the Fundamental Trade-Off?

Scale up vs. scale out.

Section 2.1 Databases

Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Pattern 1: Put the metadata in a database and point to files in a file system.

Example: Sloan Digital Sky Survey
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• Catalog uses Microsoft SQL Server
• Started in 1992, finished in 2008
• JHU SkyServer serves millions of queries

Example: Bionimbus Genomics Cloud

www.bionimbus.org

[Architecture diagram] A GWT-based front end sits over database services (PostgreSQL), analysis pipeline and re-analysis services, data ingestion services, large data cloud services (Hadoop, Sector/Sphere), elastic utility cloud services (Eucalyptus, OpenStack), intercloud services, and an ID service (UDT, replication).

Section 2.2 Distributed File Systems

Hadoop's Large Data Cloud

[Diagram] Hadoop's large data cloud provides storage services and compute services; Sector/Sphere is organized the same way.

Hadoop's Stack

[Diagram] Applications sit on top of Hadoop's MapReduce, data services, and NoSQL databases, which in turn run over the Hadoop Distributed File System (HDFS).

Pattern 2: Put the data into a distributed file system.

Hadoop Design
• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• Append operation planned for the future.

Hadoop Distributed File System (HDFS) Architecture

[Architecture diagram] A single Name Node handles control traffic from clients, while Data Nodes spread across racks serve the data to clients directly.

• HDFS is block-based.
• Written in Java. (A minimal client sketch follows.)
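To make the control/data split concrete, here is a minimal sketch of an HDFS client using the Hadoop Java API. It is illustrative only: the NameNode address and the path are assumptions, and in practice the configuration usually comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address (older releases use fs.default.name).
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the client streams the blocks to Data Nodes,
        // while only metadata operations go through the Name Node.
        Path path = new Path("/user/tutorial/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}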

Sector Distributed File System (SDFS) Architecture

• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide area operations.

Sector Distributed File System (SDFS) Architecture

[Architecture diagram] Multiple Master Nodes handle control traffic from clients, a Security Server controls access to the masters, and Slave Nodes spread across racks serve the data to clients directly.

• Sector is file-based (not block-based).
• Written in C++.
• Security server.
• Multiple masters.

GlusterFS Architecture
• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.

[Architecture diagram] Clients talk directly to GlusterFS server bricks spread across racks; with no metadata server, the location of data is computed algorithmically.

• File-based.

Section 2.3 NoSQL Databases

Evolution
• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache

Scaling RDBMSs
• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Propagating writes to the slaves can be a bottleneck, and reads can be inconsistent.
• Sharded databases
  – Applications and queries must understand the sharding schema (see the routing sketch below).
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
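As a toy illustration of why the application must understand the sharding schema, the routing logic usually amounts to hashing a shard key to pick a database. Everything below (the shard count, the JDBC URLs) is hypothetical.

import java.util.Arrays;
import java.util.List;

// Toy shard router: the application hashes the shard key (here, a user id)
// to choose which database holds that user's rows. Cross-shard joins would
// have to be assembled in application code.
public class ShardRouter {
    private final List<String> shardUrls;   // hypothetical JDBC URLs, one per shard

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    public String shardFor(long userId) {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.size());
        return shardUrls.get(shard);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://shard0.example.org/app",
                "jdbc:postgresql://shard1.example.org/app",
                "jdbc:postgresql://shard2.example.org/app"));
        System.out.println("user 42 lives on " + router.shardFor(42L));
    }
}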

NoSQL Systems

• The name suggests no SQL support, but is also read as "Not Only SQL".
• One or more of the ACID properties are not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
• Quite a few recent open source systems.

Pattern 3: Put the data into a NoSQL application.

CAP – Choose Two Per Operation

[Diagram] The three corners are Consistency (C), Availability (A), and Partition-resiliency (P).

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g. Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without quorum (e.g. BigTable, HBase).

CAP Theorem
• Proposed by Eric Brewer, 2000.
• Three properties of a system: consistency, availability, and partitions.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert and Lynch, SIGACT News 2002.

Eventual Consistency
• If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol.
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.

Different Types of NoSQL Systems

• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB

HBase Architecture

[Architecture diagram] Clients reach HBase through a Java client or a REST API. An HBaseMaster coordinates a set of HRegionServers, each of which serves its regions from disk.

Source: Raghu Ramakrishnan

HRegionServer
• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.

[Diagram] Inside an HRegionServer, writes go to the memcache, which is flushed to MapFiles on disk; reads consult both.

Source: Raghu Ramakrishnan

Facebook's Cassandra

• Data model modeled after BigTable.
• Eventual consistency modeled after Dynamo.
• Peer-to-peer storage architecture using consistent hashing (Chord-style hashing); a minimal ring sketch follows.
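To show what consistent hashing buys in a peer-to-peer design like Dynamo or Cassandra, here is a minimal ring sketch. It is not Cassandra's code: the node names, virtual-node count, and MD5-based hash are illustrative choices.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeMap;

// Minimal consistent-hash ring: keys and nodes hash onto the same ring, and a
// key is stored on the first node clockwise from its position. Adding or
// removing a node only moves the keys adjacent to it, which is what makes
// peer-to-peer scaling cheap.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        Long slot = ring.ceilingKey(hash(key));
        if (slot == null) slot = ring.firstKey();   // wrap around the ring
        return ring.get(slot);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(16);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("row key 'user:42' -> " + ring.nodeFor("user:42"));
    }
}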

Databases vs. NoSQL Systems

• Scalability: databases scale to 100s of TB; NoSQL systems to 100s of PB.
• Functionality: databases provide full SQL-based queries, including joins; NoSQL systems provide optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; data clouds are optimized for efficient reads.
• Consistency model: databases provide ACID (Atomicity, Consistency, Isolation & Durability), so the database is always consistent; NoSQL systems provide eventual consistency, so updates eventually propagate through the system.
• Parallelism: difficult in databases because of the ACID model, though shared-nothing designs are possible; the basic NoSQL design incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to a data center.

Section 2.3 Case Study: Project Matsu

Zoom Levels / Bounds
• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images

Source: Andrew Levine

Build Tile Cache in the Cloud - Mapper

• Step 1: Input to the mapper. The mapper input key is a bounding box, e.g. (minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); the mapper input value is the image for that box.
• Step 2: Processing in the mapper. The mapper resizes and/or cuts up the original image into pieces.
• Step 3: Mapper output. One output record per piece, with the piece's bounding box as the output key and the piece as the output value. (A simplified sketch follows.)

Source: Andrew Levine
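The mapper step can be sketched with the Hadoop MapReduce Java API. This is not the Project Matsu code: the bounding-box key format is an assumption, and the actual image slicing is left as a comment so the key/value flow stays visible.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of a tile-cache mapper: the input key is a bounding box
// ("minx,miny,maxx,maxy"), the input value is the image bytes. The real job
// would crop and resize the image for each child box; here we only compute
// the four child bounding boxes and re-emit the value.
public class TileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable image, Context context)
            throws IOException, InterruptedException {
        String[] b = key.toString().split(",");
        double minx = Double.parseDouble(b[0]), miny = Double.parseDouble(b[1]);
        double maxx = Double.parseDouble(b[2]), maxy = Double.parseDouble(b[3]);
        double midx = (minx + maxx) / 2, midy = (miny + maxy) / 2;

        // One output record per quadrant of the input bounding box.
        emit(context, minx, miny, midx, midy, image);
        emit(context, midx, miny, maxx, midy, image);
        emit(context, minx, midy, midx, maxy, image);
        emit(context, midx, midy, maxx, maxy, image);
    }

    private void emit(Context ctx, double minx, double miny, double maxx, double maxy,
                      BytesWritable image) throws IOException, InterruptedException {
        ctx.write(new Text(minx + "," + miny + "," + maxx + "," + maxy), image);
    }
}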

Build Tile Cache in the Cloud - Reducer

• Step 1: Input to the reducer. The reducer input key is a bounding box, e.g. (minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); the reducer input values are the image pieces for that box.
• Step 2: Reducer output. Assemble the images based on the bounding box.
  – Output to HBase.
  – Builds up layers for WMS for various datasets.

(A matching reducer sketch follows.)

Source: Andrew Levine
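A matching reducer sketch, again simplified and not the Project Matsu code: it receives all pieces that share a bounding box and would assemble them into one tile and write it to HBase; the assembly and the HBase write are stubbed out with comments.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch of the tile-cache reducer: all image pieces that share a
// bounding-box key arrive together. A real reducer would mosaic them into one
// tile and write it to HBase (for example via TableOutputFormat); here we just
// count the pieces and emit a placeholder value to show the shape of the job.
public class TileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text boundingBox, Iterable<BytesWritable> pieces, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (BytesWritable piece : pieces) {
            count++;   // a real job would draw `piece` into the mosaic here
        }
        // Placeholder output: the assembled tile bytes would go here.
        context.write(boundingBox, new BytesWritable(new byte[] { (byte) count }));
    }
}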

HBase Tables

• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase schema: layers, styles, projection, size.
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image

(An illustrative write against this layout follows.)
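As a rough illustration of this table layout, using the HBase Java client of that era (not code from Project Matsu), storing one tile could look like the following. The layer name, row-key format, and column-family string are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative write of one tile into a WMS layer table. The row key is the
// bounding box, the column family encodes style and projection, the qualifier
// is the tile size, and the value is the encoded image.
public class StoreTile {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wms_layer_landsat");      // hypothetical layer name

        byte[] rowKey = Bytes.toBytes("-135.0,45.0,-112.5,67.5");  // bounding box
        byte[] family = Bytes.toBytes("default_style_EPSG4326");   // style + projection
        byte[] qualifier = Bytes.toBytes("256x256");               // width x height
        byte[] tileBytes = new byte[0];   // placeholder for the encoded buffered image

        Put put = new Put(rowKey);
        put.add(family, qualifier, tileBytes);
        table.put(put);
        table.close();
    }
}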

Section 2.4 Distributed Key-Value Stores

S3

Pattern 4: Put the data into a distributed key-value store.

S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt.

S3 Keys

• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).

S3 Security

• AWS access key: functions as your S3 user name. It is an alphanumeric text string that uniquely identifies a user.
• AWS secret key: functions as your password.

AWS Account Information

[Screenshot] The Access Keys section of the AWS account page shows the access key ID (the user name) and the secret access key (the password). (A minimal client sketch follows.)
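A minimal sketch with the AWS SDK for Java, reusing the bucket and object names from the S3 Buckets slide; the placeholder access key and secret key stand in for your own credentials.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

// The access key / secret key pair plays the user name / password role
// described above; the bucket is the key-value store's namespace.
public class S3Example {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(
                new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // Upload a local file as the object "dataset1.txt" in the bucket.
        s3.putObject("tutorial.osdc.org", "dataset1.txt", new File("dataset1.txt"));

        // Fetch it back; the object body is readable as a stream.
        S3Object obj = s3.getObject("tutorial.osdc.org", "dataset1.txt");
        System.out.println("Content length: " + obj.getObjectMetadata().getContentLength());
    }
}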

Other Amazon Data Services

• Amazon SimpleDB
• Amazon's Elastic Block Storage (EBS)

Section 2.5 Moving Large Data Sets

The Basic Problem

• TCP was never designed to move large data sets over wide area high performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

[Figure] TCP throughput (Mb/s, up to 1000) vs. round trip time (1-400 ms) for packet loss rates of 0.01%, 0.05%, 0.1%, and 0.5%; typical LAN, US, US-EU, and US-Asia RTTs are marked. Throughput drops sharply as RTT and packet loss grow. (A back-of-the-envelope model follows.)

Source: Yunhong Gu, 2007, experiments over a wide area 1G network.
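The shape of these curves follows the well-known Mathis et al. estimate, throughput ≈ MSS / (RTT · √loss). The sketch below only reproduces that trend for a few RTT and loss values; it is not the experiment behind the figure.

// Rough illustration of why TCP throughput collapses with RTT and loss,
// using the Mathis et al. estimate: throughput ≈ MSS / (RTT * sqrt(loss)).
// This reproduces the shape of the curves above, not the measured values.
public class TcpThroughputEstimate {
    public static void main(String[] args) {
        double mssBits = 1460 * 8;                      // typical MSS in bits
        double[] rttMs = {1, 10, 100, 200, 400};        // LAN to intercontinental
        double[] loss = {0.0001, 0.0005, 0.001, 0.005}; // 0.01% .. 0.5%

        for (double p : loss) {
            for (double rtt : rttMs) {
                double mbps = mssBits / ((rtt / 1000.0) * Math.sqrt(p)) / 1e6;
                System.out.printf("loss=%.2f%% rtt=%3.0fms -> ~%7.1f Mb/s%n",
                        p * 100, rtt, mbps);
            }
        }
    }
}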

The Solution

• Use parallel TCP streams (GridFTP).
• Use specialized network protocols (UDT, FAST, etc.).
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.

Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." -- Don Gilbert, August 2010, bio-mirror.net

Moving 113 GB of Bio-mirror Data

Site     RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA     10         139         139         1         200
Purdue   17         125         125         1         500
ORNL     25         361         120         3         1,200
TACC     37         616         120         5         2,000
SDSC     65         750         475         1.6       3,300
CSTNET   274        3,722       304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/

Case Study: CGI 60 Genomes

• Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.

Source: Complete Genomics.

Resource Use

Protocol        CPU usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Questions?

For the most current version of these notes, see rgrossman.com