Balogh György: BigData
BigData
with brawn or brain?
What is BigData?
● Data volumes that cannot be handled by traditional solutions (e.g. relational databases)
● More than 100 million data rows, typically multiple billions
Global rate of data production
(per second)
● 30 TB/sec (22,000 films)
● Digital media
○ 2 hours of YouTube video
● Communication
○ 3000 business emails
○ 300,000 SMS
● Web
○ Half a million page views
● Logs
○ Billions of log entries
BigData Market
Why now?
● Long term trends
○ The amount of stored data has doubled every ~40 months since the 1980s
○ Moore’s law: the number of transistors on integrated circuits doubles every 18 months
Different exponential trends
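A quick sketch of what those doubling periods imply over a single decade (purely illustrative arithmetic based on the 40-month and 18-month figures above):

# Growth over 10 years (120 months) for a quantity that doubles every `doubling_months`.
def growth_over_decade(doubling_months):
    return 2 ** (120 / doubling_months)

print(round(growth_over_decade(40)))   # stored data: ~8x per decade
print(round(growth_over_decade(18)))   # transistor count: ~100x per decade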
Hard drives in 1991 and in 2012
● 1991: 40 MB, 3500 RPM, 0.7 MB/sec, full scan: ~1 minute
● 2012: 4 TB (x 100,000), 7200 RPM, 120 MB/sec (x 170), full scan: ~8 hours (x 480)
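A back-of-the-envelope check of those full-scan times (capacity divided by sequential throughput, using the figures above):

# Full scan time = capacity / sequential read speed.
def full_scan_seconds(capacity_mb, mb_per_sec):
    return capacity_mb / mb_per_sec

print(full_scan_seconds(40, 0.7))                # 1991: ~57 seconds, about a minute
print(full_scan_seconds(4_000_000, 120) / 3600)  # 2012: ~9 hours, the same ballpark as the 8 hours above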
Data access is becoming a scarce resource! -> Paradigm shift
Google’s hardware 1998
Google’s hardware 2013
● 12 data centers worldwide
● More than a million nodes
● A data center costs $600 million to build
● Oregon data center
○ 15,000 m²
○ draws the power of 30,000 homes
Google’s hardware 2013
● Cheap commodity hardware
○ each has its own battery!
● Modular data centers
○ Standard container
○ 1160 servers per container
● Efficiency: 11% overhead (power
transformation, cooling)
Google cannot afford inefficiency
● Thought experiment: a 3% improvement in data compression and data processing speed would save Google a whole data center! (on the order of a billion dollars of capital cost, plus operating costs)
● Optimal code is essential, since everything gets multiplied by a million nodes!
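A rough sketch of the arithmetic behind that thought experiment, using the figures from the previous slides (12 data centers at roughly $600 million each); how the two 3% gains combine is my own assumption:

# Hypothetical back-of-the-envelope numbers, not Google's actual accounting.
data_centers = 12              # from the slide above
build_cost_usd = 600e6         # per data center, from the slide above
saving = 1 - 0.97 * 0.97       # assume the 3% compression and 3% speed gains compound (~5.9%)

capital_saved_usd = data_centers * build_cost_usd * saving
print(capital_saved_usd / 1e6) # ~425 million dollars of build cost alone;
                               # with operating costs on top, the order of one data center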
Distributed storage and processing
● Data is distributed and
replicated
● Process data where it is
(moving data is costly)
● Increase data access speed by
increasing the number of nodes
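A minimal sketch of the idea in Python (a toy two-node scatter/gather; the node count and the status-code counting task are just placeholders):

from collections import Counter

# Toy "cluster": the dataset is split into partitions, one per node (and replicated in practice).
log_lines = ["GET /a 200", "GET /b 404", "POST /a 200", "GET /a 200"]
num_nodes = 2
partitions = [log_lines[i::num_nodes] for i in range(num_nodes)]

# Each node processes only its local partition (process data where it is)...
def local_count(partition):
    return Counter(line.split()[-1] for line in partition)   # count HTTP status codes

partial_results = [local_count(p) for p in partitions]       # runs in parallel on a real cluster

# ...and only the small partial results are moved over the network and merged.
total = sum(partial_results, Counter())
print(total)   # Counter({'200': 3, '404': 1})

Adding nodes shrinks each partition, so total scan time drops roughly in proportion to the node count.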
Example: BigQuery
● SQL queries on terabytes of data in seconds
● Data is distributed over thousands of nodes
● Each node processes one part of the dataset
● Thousands of nodes work for us for a few
milliseconds
select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
from publicdata:samples.natality
where ever_born = 1
group by year
order by year;
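For each year, the query returns the record-weighted average age of mothers at the birth of their first child (ever_born = 1) over the public natality sample; each of the thousands of nodes scans only its own slice of the table.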
Hadoop
Inefficiency can waste huge
resources
● A 300-node cluster running Hadoop + Hive
● = one node running Vectorwise
● Vectorwise holds the world speed record in analytical database queries on a single node
Clever ways to improve efficiency
● Lossless data compression (even 50x!)
● Clever lossy compression of data (e.g. OLAP cubes)
● Cache-aware implementations (asymmetric hardware trends make memory access the bottleneck)
Lossless data compression
○ Compression can boost sequential data access even 50 times! (100 MB/sec -> 5 GB/sec)
■ Less data -> fewer I/O operations
■ A single CPU can decompress data at up to 5 GB/sec
○ gzip decompression is very slow
○ snappy, lzo and lz4 can reach 1 GB/sec decompression speed
○ The decompression used by column-oriented databases can reach 5 GB/sec (PFOR)
■ Two billion integers per second! (almost one integer per clock cycle!)
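A small sketch of why compressed storage speeds up scanning (the disk and decompression speeds are the figures quoted above; the 50x ratio is from the same example):

# Effective scan throughput when data is stored compressed:
# the disk delivers compressed bytes and the CPU expands them on the fly.
disk_mb_per_sec = 100          # sequential read speed of the drive
compression_ratio = 50         # 50x lossless compression, as in the example above
decompress_mb_per_sec = 5000   # ~5 GB/sec decompression (fast codecs such as PFOR)

# Uncompressed bytes produced per second, limited by whichever side is slower.
effective = min(disk_mb_per_sec * compression_ratio, decompress_mb_per_sec)
print(effective)   # 5000 MB/sec: the 100 MB/sec disk now feeds a 5 GB/sec scan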
Example: clever lossy compression
(LogDrill)
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
2011-01-08 00:00 GET 200 2
2011-01-08 00:01 GET 200 2
2011-01-08 00:02 GET 404 1
2011-01-08 00:02 POST 200 1
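The six raw requests above collapse into per-minute counts keyed by (minute, method, status code). A minimal sketch of that aggregation (field positions are read off the sample lines; the real LogDrill pipeline is of course more involved):

from collections import Counter

raw = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]   # first and last of the six sample lines above

counts = Counter()
for line in raw:
    f = line.split()
    minute = f[0] + " " + f[1][:5]    # truncate the timestamp to the minute
    method, status = f[5], f[11]      # positions taken from the sample log format
    counts[(minute, method, status)] += 1

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)  # e.g. 2011-01-08 00:00 GET 200 1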
Cache-aware programming
● CPU speed has been increasing by about 60% a year
● Memory speed has been increasing by only 10% a year
● The growing gap is bridged by multi-level cache memories
● The cache is usually under-exploited
● Exploiting it well can bring a 100x speed-up!
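A minimal illustration of cache-friendly versus cache-hostile access patterns (a sketch using numpy; the array size is arbitrary):

import time
import numpy as np

a = np.random.rand(4096, 4096)   # ~128 MB, row-major (C order) layout

def timed(f):
    t0 = time.perf_counter(); f(); return time.perf_counter() - t0

# Row-wise sums walk memory sequentially, so cache lines are fully used.
rows = timed(lambda: sum(float(a[i, :].sum()) for i in range(a.shape[0])))

# Column-wise sums jump 4096 * 8 bytes between elements: mostly cache misses.
cols = timed(lambda: sum(float(a[:, j].sum()) for j in range(a.shape[1])))

print(rows, cols)   # the strided version is typically several times slower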
Lesson learned
● Put effort into deeply understanding your problem before Hadoop-ing it!
○ Modern analytical databases with multi-node scaling give orders of magnitude better performance
○ Clever aggregation can make the big data problem disappear entirely
○ If you do use Hadoop, be cost-effective! (cheap commodity hardware, not expensive servers!)