The Big Data Stack


Transcript of The Big Data Stack

Page 1: The Big Data Stack

The Big Data Stack

Zubair Nabi, [email protected]

7 January, 2014

Page 2: The Big Data Stack
Page 3: The Big Data Stack

Data Timeline

Timeline of global data volume: ~5 EB (through 2003), 2.7 ZB (2012), projected ~8 ZB (2015)

Page 4: The Big Data Stack

Example*: Facebook

• 2.5B – content items shared
• 2.7B – ‘Likes’
• 300M – photos uploaded
• 105TB – data scanned every 30 minutes
• 500+TB – new data ingested
• 100+PB – data warehouse

* VP Engineering, Jay Parikh, 2012

Page 5: The Big Data Stack

Example: Facebook’s Haystack*

• 65B photos
  – 4 images of different sizes stored for each photo
  – For a total of 260B images and 20PB of storage

• 1B new photos uploaded each week
  – An increment of 60TB

• At peak traffic, 1M images served per second
• An image request is like finding a needle in a haystack

*Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.

Page 6: The Big Data Stack

More Examples

• The LHC at CERN generates 22PB of data annually (after throwing away around 99% of readings)

• The Square Kilometre Array (under construction) is expected to generate hundreds of PB each day

• Farecast, a part of Bing, searches through 225B flight and price records to advise customers on their ticket purchases

Page 7: The Big Data Stack

More Examples (2)

• The amount of annual traffic flowing over the Internet is around 700EB

• Walmart handles in excess of 1M transactions every hour (25PB in total)

• 400M Tweets every day

Page 8: The Big Data Stack

Big Data

• Large datasets whose processing and storage requirements exceed all traditional paradigms and infrastructure
  – On the order of terabytes and beyond

• Generated by web 2.0 applications, sensor networks, scientific applications, financial applications, etc.

• Radically different tools needed to record, store, process, and visualize

• Moving away from the desktop
• Offloaded to the “cloud”
• Poses challenges for computation, storage, and infrastructure

Page 9: The Big Data Stack

The Stack

• Presentation layer
• Application layer: processing + storage
• Operating system layer
• Virtualization layer (optional)
• Network layer (intra- and inter-data center)
• Physical infrastructure layer

Can roughly be called the “cloud”

Page 10: The Big Data Stack

Presentation Layer

• Acts as the user-facing end of the entire ecosystem

• Forwards user queries to the backend (potentially the rest of the stack)

• Can be both local and remote
• For most web 2.0 applications, the presentation layer is a web portal

Page 11: The Big Data Stack

Presentation Layer (2)

• For instance, the Google search website is a presentation layer
  – Takes user queries
  – Forwards them to a scatter-gather application (see the sketch after this list)
  – Presents the results to the user (within a time bound)
• Made up of many technologies, such as HTTP, HTML, AJAX, etc.
• Can also be a visualization library
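The scatter-gather step above can be made concrete with a minimal Python sketch: the front end fans a query out to backend index shards in parallel, gathers whatever partial results arrive within the latency budget, and merges them. The shard list, the search_shard() stub, and the ranking step are invented purely for illustration.

```python
# Minimal scatter-gather sketch (illustration only: the shard list,
# search_shard() stub, and ranking step are invented for this example).
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

SHARDS = ["shard-0", "shard-1", "shard-2"]   # hypothetical backend index shards
TIME_BOUND = 0.2                             # answer within a fixed latency budget (seconds)

def search_shard(shard, query):
    # Stand-in for an RPC to one index shard; returns (doc_id, score) pairs.
    return [("%s/doc%d" % (shard, i), len(query) / (i + 1)) for i in range(3)]

def scatter_gather(query):
    results = []
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = [pool.submit(search_shard, s, query) for s in SHARDS]  # scatter
        try:
            for fut in as_completed(futures, timeout=TIME_BOUND):
                results.extend(fut.result())                             # gather
        except FuturesTimeout:
            pass                              # drop shards that miss the deadline
    return sorted(results, key=lambda r: -r[1])[:10]                     # merge and rank

if __name__ == "__main__":
    print(scatter_gather("big data stack"))
```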

Page 12: The Big Data Stack

Application Layer

• Serves as the back-end
• Either computes a result for the user, or fetches a previously computed result or content from storage
• The execution is predominantly distributed
• The computation itself might entail cross-disciplinary (across sciences) technology

Page 13: The Big Data Stack

Processing

• Can be a custom solution, such as a scatter-gather application

• Might also be an existing data-intensive computation framework, such as MapReduce, Spark, MPI, etc., or a stream-processing system, such as IBM InfoSphere Streams, Storm, S4, etc.

• Analytics engines: R, Matlab, etc.

Page 14: The Big Data Stack

Numbers Everyone Should Know*

* Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.

Operation                            Time (ns)     Scaled time
L1 cache reference                   0.5           0.5 s
Branch mispredict                    5             5 s
L2 cache reference                   7             7 s
Mutex lock/unlock                    25            25 s
Main memory reference                100           1 m 40 s
Send 2K over 1 Gbps network          20,000        5 h 30 m
Read 1 MB sequentially from memory   250,000       ~3 days
Disk seek                            10,000,000    ~6 days
Read 1 MB sequentially from disk     20,000,000    8 months
Send packet CA -> NL -> CA           150,000,000   4.75 years

Page 15: The Big Data Stack

Ubiquitous Computation: Machine Learning

• Making predictions based on existing data
• Classifying emails into spam and non-spam (see the sketch after this list)
• American Express analyzes the monthly expenditures of its cardholders to suggest products to them

• Facebook uses it to figure out the order of Newsfeed stories, friend and page recommendations, etc.

• Amazon uses it to make product recommendations while Netflix employs it for movie recommendations
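A minimal sketch of the spam/non-spam classification mentioned above, assuming scikit-learn is available; the tiny training set is invented for illustration, and a real deployment would train on millions of labelled emails.

```python
# Minimal spam/non-spam classifier sketch (assumes scikit-learn is installed;
# the toy training data below is invented purely for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap meds limited offer",     # spam
    "meeting agenda for monday", "lecture notes attached",  # non-spam
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features + Naive Bayes: a classic baseline for text classification
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free offer just for you", "slides for the lecture"]))
```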

Page 16: The Big Data Stack

Case Study: MapReduce

• Designed by Google to process large amounts of data
  – Google’s “hammer for 80% of their data crunching”
  – The original paper has 9,000+ citations
• The user only needs to write two functions
• The framework abstracts away work distribution, network connectivity, data movement, and synchronization
• Can seamlessly scale to hundreds of thousands of machines
• The open-source version, Hadoop, is used by everyone from Yahoo and Facebook to LinkedIn and The New York Times

Page 17: The Big Data Stack

Case Study: MapReduce (2)

• Used for embarrassingly parallel applications and most divide-and-conquer algorithms
• For instance, the count of each word in a billion-document library can be calculated in less than 10 lines of custom code
• Data is stored on a distributed filesystem
• map() -> groupBy -> reduce() (see the sketch below)
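A minimal sketch of the word-count example, written as the map() -> groupBy -> reduce() flow in plain Python. The three "pipeline" lines at the bottom stand in for what the framework does across a cluster; on Hadoop or MapReduce proper, only the two functions would be user code.

```python
# Word count expressed as map() -> groupBy -> reduce().
# The user-supplied logic is just the two functions; the three lines at the
# bottom simulate locally what MapReduce/Hadoop does across a cluster.
from itertools import groupby

def map_fn(doc):                       # emit (word, 1) for every word
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):           # sum the counts for one word
    return word, sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = sorted(kv for doc in docs for kv in map_fn(doc))          # map
grouped = groupby(pairs, key=lambda kv: kv[0])                    # groupBy
print([reduce_fn(w, (c for _, c in kvs)) for w, kvs in grouped])  # reduce
```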

Page 18: The Big Data Stack

Case Study: Storm

• Used to analyze “data in motion”
  – Originally designed at BackType, which was later acquired by Twitter; now an Apache open-source project
• Each datapoint, called a tuple, passes through a processing pipeline
• The user only needs to provide the code for each operator and a graph specification (topology), sketched below

Source (spout) -> Operator(s) (bolts) -> Sink
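The spout-to-bolt pipeline above can be illustrated with a framework-free Python toy; this is not the actual Storm API (real topologies are typically declared in Java against Storm's spout and bolt interfaces), just the shape of the dataflow.

```python
# Toy spout -> bolt -> sink pipeline (illustration only; real Storm topologies
# declare spouts and bolts against the Storm API and run on a cluster).
def spout():                               # source: emits a stream of tuples
    for line in ["error: disk full", "ok", "error: timeout", "ok"]:
        yield (line,)

def filter_bolt(stream):                   # operator: keep only error tuples
    for (line,) in stream:
        if line.startswith("error"):
            yield (line,)

def count_bolt(stream):                    # operator: append a running count
    count = 0
    for tup in stream:
        count += 1
        yield (*tup, count)

def sink(stream):                          # sink: final side effect
    for tup in stream:
        print(tup)

# The "topology" is just the composition of the stages.
sink(count_bolt(filter_bolt(spout())))
```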

Page 19: The Big Data Stack

Storage

• Most Big Data solutions revolve around data without any structure (possibly from heterogeneous sources)

• The scale of the data makes a cleaning phase next to impossible

• Therefore, storage solutions need to explicitly support unstructured and semi-structured data

• Traditional RDBMS being replaced by NoSQL and NewSQL solutions
  – Varying from document stores to key-value stores

Page 20: The Big Data Stack

Storage (2)

1. Relational database management systems (RDBMS): IBM DB2, MySQL, Oracle DB, etc. (structured data)

2. NoSQL: key-value stores, document stores, graphs, tables, etc. (semi-structured and unstructured data)
   – Document stores: MongoDB, CouchDB, etc.
   – Graphs: FlockDB, etc.
   – Key-value stores: Dynamo, Cassandra, Voldemort, etc.
   – Tables: BigTable, HBase, etc.

3. NewSQL: The best of both worlds: Spanner, VoltDB, etc.

Page 21: The Big Data Stack

NoSQL

• Different semantics:
  – RDBMS provide ACID semantics:
    • Atomicity: the entire transaction either succeeds or fails
    • Consistency: data within the database remains consistent after each transaction
    • Isolation: transactions are sandboxed from each other
    • Durability: transactions persist across failures and restarts
  – Overkill for most user-facing applications
  – Most applications are more interested in availability and are willing to sacrifice consistency, leading to eventual consistency (see the toy example below)
• High throughput: most NoSQL databases sacrifice consistency for availability, leading to higher throughput (in some cases by an order of magnitude)
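A toy illustration of the availability-over-consistency trade-off: a write is acknowledged by one replica and propagated asynchronously, so a read from another replica may briefly return stale data before the replicas converge. All names are invented; real stores such as Dynamo or Cassandra use quorums, vector clocks or timestamps, and anti-entropy in far richer ways.

```python
# Toy eventual consistency: last-write-wins replicas with async propagation.
# Everything here is invented for illustration.
class Replica:
    def __init__(self):
        self.data = {}                     # key -> (timestamp, value)

    def put(self, key, value, ts):
        cur = self.data.get(key)
        if cur is None or ts > cur[0]:     # last write wins
            self.data[key] = (ts, value)

    def get(self, key):
        v = self.data.get(key)
        return v[1] if v else None

a, b = Replica(), Replica()
a.put("cart:42", ["book"], ts=1)           # acknowledged by one replica only
print(b.get("cart:42"))                    # -> None (stale read: not yet replicated)
b.put("cart:42", ["book"], ts=1)           # asynchronous propagation catches up
print(a.get("cart:42"), b.get("cart:42"))  # replicas have converged
```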

Page 22: The Big Data Stack

Case Study: BigTable*

• A distributed, multi-dimensional table (see the data-model sketch below)
• Indexed by both row key and column key
• Rows are maintained in lexicographic order and are dynamically partitioned into tablets
• Implemented atop GFS
• Multiple tablet servers and a single master

* Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 205-218.
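Conceptually, Bigtable behaves like a sparse, sorted map from (row key, column key, timestamp) to a value. The following in-memory sketch mimics that data model only; it is not the Bigtable or HBase API, and the class and method names are invented.

```python
# In-memory sketch of the Bigtable data model: a sparse, sorted map from
# (row key, column key, timestamp) -> value. Illustration only, not an API.
import bisect, time

class TinyTable:
    def __init__(self):
        self._keys = []      # sorted list of (row, column, -timestamp)
        self._cells = {}

    def put(self, row, col, value, ts=None):
        key = (row, col, -(ts if ts is not None else time.time()))
        if key not in self._cells:
            bisect.insort(self._keys, key)   # keep rows in lexicographic order
        self._cells[key] = value

    def get(self, row, col):
        # Newest version comes first because timestamps are stored negated.
        i = bisect.bisect_left(self._keys, (row, col, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, col):
            return self._cells[self._keys[i]]

    def scan(self, start_row, end_row):
        # Range scan over a lexicographic interval of rows, as a tablet would serve.
        for key in self._keys:
            if start_row <= key[0] < end_row:
                yield key[0], key[1], self._cells[key]

t = TinyTable()
t.put("com.example/index", "contents:html", "<html>...</html>")
t.put("com.example/index", "anchor:cnn.com", "Example")
print(t.get("com.example/index", "contents:html"))
print(list(t.scan("com.", "com.f")))
```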

Page 23: The Big Data Stack

Case Study: Spanner*

• A database that stretches across the globe, seamlessly operating across hundreds of datacenters, millions of machines, and trillions of rows of information
• Took Google four and a half years to design and develop
• Time is of the essence in distributed systems; (possibly geo-distributed) machines, applications, processes, and threads need to be synchronized

* James C. Corbett, et al. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 251-264.

Page 24: The Big Data Stack

Case Study: Spanner (2)

• Spanner is built around a “TrueTime API”, which makes use of atomic clocks and GPS (see the commit-wait sketch below)
• Ensures consistency for the entire system
• Even if two commits (with an agreed-upon ordering) take place at opposite ends of the globe (say, the US and China), their ordering will be preserved

• For instance, the Google ad system (an online auction where ordering matters) can span the entire globe
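The commit-wait idea behind TrueTime can be sketched as follows: the clock returns an uncertainty interval [earliest, latest], and a commit only becomes visible once its timestamp is guaranteed to be in the past, so any later commit anywhere receives a larger timestamp. The epsilon bound, class names, and single-process "clock" are invented for illustration.

```python
# Sketch of TrueTime-style commit-wait (illustrative only; the epsilon bound,
# class names, and the single-process "clock" are invented for this example).
import time

EPSILON = 0.005  # assumed clock uncertainty (s); real bounds come from GPS/atomic clocks

class TrueTime:
    def now(self):
        t = time.time()
        return t - EPSILON, t + EPSILON    # [earliest, latest] interval

    def after(self, ts):
        earliest, _ = self.now()
        return earliest > ts               # ts is guaranteed to be in the past

def commit(tt):
    _, latest = tt.now()
    commit_ts = latest                     # a timestamp no node can still consider "future"
    while not tt.after(commit_ts):         # commit-wait: block until uncertainty has passed
        time.sleep(EPSILON / 10)
    return commit_ts                       # any later commit anywhere gets a larger timestamp

tt = TrueTime()
t1 = commit(tt)
t2 = commit(tt)
assert t1 < t2                             # external consistency: commit order preserved
print(t1, t2)
```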

Page 25: The Big Data Stack

Framework Plurality

Page 26: The Big Data Stack

Cluster Managers

• Mix different programming paradigms
  – For instance, batch processing with stream processing
• Cluster consolidation
  – No need to manually partition the cluster across multiple frameworks
• Data sharing
  – Pass data from, say, MapReduce to Storm and vice versa
• Higher-level job orchestration
  – The ability to have a graph of heterogeneous job types
• Examples include YARN, Mesos, and Google’s Omega

Page 27: The Big Data Stack

Operating System Layer

• Consists of the traditional operating system stack with the usual suspects, Windows, variants of *nix, etc.

• Alternatives specialized for the cloud or for multicore systems also exist

• Exokernels, multikernels, and unikernels

Page 28: The Big Data Stack

Virtualization Layer

• Allows multiple operating systems to run on top of the same physical hardware

• Enables infrastructure sharing, isolation, and optimized utilization

• Different allocation strategies possible
• Easier to dedicate CPU and memory, but not the network
• Allocation either in the form of VMs or containers
• VMware, Xen, LXC, etc.

Page 29: The Big Data Stack

Network Layer

• Connects the entire ecosystem together
• Consists of the entire protocol stack
• Tenants assigned to virtual LANs
• Multiple protocols available across the stack
• Most datacenters employ traditional Ethernet as the L2 fabric, although optical, wireless, and InfiniBand fabrics are not far-fetched

• Software Defined Networks have also enabled more informed traffic engineering

• Run-of-the-mill tree topologies are being replaced by radical recursive and random topologies

Page 30: The Big Data Stack

Physical Infrastructure Layer

• The physical hardware itself
• Servers and network elements
• Mechanisms for power distribution, wiring, and cooling
• Servers are connected in various topologies using different interconnects
• Such facilities are dubbed datacenters
• Modular, self-contained, container-sized datacenters can be moved at will
• “We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle, Google*

* Luiz André Barroso and Urs Hölzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (1st ed.). Morgan & Claypool Publishers.

Page 31: The Big Data Stack

Power Generation

• According to the New York Times in 2012, datacenters are collectively responsible for the energy equivalent of 7-10 nuclear power plants running at full capacity

• Datacenters have started using renewable energy sources, such as solar and wind power

• Engendering the paradigm of “move computation wherever renewable sources exist”

Page 32: The Big Data Stack

Heat Dissipation

• The scale of the setup necessitates radical cooling mechanisms

• Facebook in Prineville, US, “the Tibet of North America”
  – Rather than use inefficient water chillers, the datacenter pulls the outside air into the facility and uses it to cool down the servers

• Google in Hamina, Finland, on the banks of the Baltic Sea
  – The cooling mechanism pulls sea water through an underground tunnel and uses it to cool down the servers

Page 33: The Big Data Stack
Page 34: The Big Data Stack

Case Study: Google

• All that infrastructure enables Google to:
  – Index 20B web pages a day
  – Handle in excess of 3B search queries daily
  – Provide email storage to 425M Gmail users
  – Serve 3B YouTube videos a day

Page 35: The Big Data Stack

Q