Balogh György: BigData
BigData
with brawn or brain?
What is BigData?
● Data volumes that cannot be handled by traditional solutions (e.g. relational databases)
● More than 100 million data rows, typically multiple billions
Global rate of data production
(per second)
● 30 TB/sec (22,000 films)
● Digital media
○ 2 hours of YouTube video
● Communication
○ 3000 business emails
○ 300,000 SMS
● Web
○ Half a million page views
● Logs
○ Billions of log entries
BigData Market
Why now?
● Long term trends
○ The amount of stored data has doubled every ~40 months since the 1980s
○ Moore’s law: the number of transistors on integrated circuits doubles every 18 months
Different exponential trends
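A quick sketch of what those doubling periods imply over a single decade (purely illustrative arithmetic based on the 40-month and 18-month figures above):

# Growth over 10 years (120 months) for a quantity that doubles every `doubling_months`.
def growth_over_decade(doubling_months):
    return 2 ** (120 / doubling_months)

print(round(growth_over_decade(40)))   # stored data: ~8x per decade
print(round(growth_over_decade(18)))   # transistor count: ~100x per decade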
Hard drives in 1991 and in 2012
● 1991: 40 MB, 3500 RPM, 0.7 MB/sec, full scan: ~1 minute
● 2012: 4 TB (x 100,000), 7200 RPM, 120 MB/sec (x 170), full scan: ~8 hours (x 480)
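A back-of-the-envelope check of those full-scan times (capacity divided by sequential throughput, using the figures above):

# Full scan time = capacity / sequential read speed.
def full_scan_seconds(capacity_mb, mb_per_sec):
    return capacity_mb / mb_per_sec

print(full_scan_seconds(40, 0.7))                # 1991: ~57 seconds, about a minute
print(full_scan_seconds(4_000_000, 120) / 3600)  # 2012: ~9 hours, the same ballpark as the 8 hours above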
Data access is becoming a scarce resource! -> Paradigm shift
Google’s hardware 1998
Google’s hardware 2013
● 12 data centers worldwide
● More than a million nodes
● A data center costs $600 million to build
● Oregon data center
○ 15,000 m²
○ draws the power of 30,000 homes
Google’s hardware 2013
● Cheap commodity hardware
○ each has its own battery!
● Modular data centers
○ Standard container
○ 1160 servers per container
● Efficiency: 11% overhead (power
transformation, cooling)
Google cannot afford inefficiency
● Thought experiment: a 3% improvement in data compression and data processing speed would save Google a whole data center! (on the order of a billion dollars of capital cost, plus operating costs)
● Optimal code is essential, since everything gets multiplied by a million nodes!
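A rough sketch of the arithmetic behind that thought experiment, using the figures from the previous slides (12 data centers at roughly $600 million each); how the two 3% gains combine is my own assumption:

# Hypothetical back-of-the-envelope numbers, not Google's actual accounting.
data_centers = 12              # from the slide above
build_cost_usd = 600e6         # per data center, from the slide above
saving = 1 - 0.97 * 0.97       # assume the 3% compression and 3% speed gains compound (~5.9%)

capital_saved_usd = data_centers * build_cost_usd * saving
print(capital_saved_usd / 1e6) # ~425 million dollars of build cost alone;
                               # with operating costs on top, the order of one data center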
Distributed storage and processing
● Data is distributed and
replicated
● Process data where it is
(moving data is costly)
● Increase data access speed by
increasing the number of nodes
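A minimal sketch of the idea in Python (a toy two-node scatter/gather; the node count and the status-code counting task are just placeholders):

from collections import Counter

# Toy "cluster": the dataset is split into partitions, one per node (and replicated in practice).
log_lines = ["GET /a 200", "GET /b 404", "POST /a 200", "GET /a 200"]
num_nodes = 2
partitions = [log_lines[i::num_nodes] for i in range(num_nodes)]

# Each node processes only its local partition (process data where it is)...
def local_count(partition):
    return Counter(line.split()[-1] for line in partition)   # count HTTP status codes

partial_results = [local_count(p) for p in partitions]       # runs in parallel on a real cluster

# ...and only the small partial results are moved over the network and merged.
total = sum(partial_results, Counter())
print(total)   # Counter({'200': 3, '404': 1})

Adding nodes shrinks each partition, so total scan time drops roughly in proportion to the node count.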
Example: BigQuery
● SQL queries on terabytes of data in seconds
● Data is distributed over thousands of nodes
● Each node processes one part of the dataset
● Thousands of nodes work for us for a few
milliseconds
select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
from publicdata:samples.natality
where ever_born = 1
group by year
order by year;
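For each year, the query returns the record-weighted average age of mothers at the birth of their first child (ever_born = 1) over the public natality sample; each of the thousands of nodes scans only its own slice of the table.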
Hadoop
Inefficiency can waste huge
resources
● A 300-node cluster running Hadoop + Hive
● = one node running Vectorwise
● Vectorwise holds the world speed record in analytical database queries on a single node
Clever ways to improve efficiency
● Lossless data compression (even 50x!)
● Clever lossy compression of data (e.g. OLAP cubes)
● Cache-aware implementations (asymmetric hardware trends make memory access the bottleneck)
Lossless data compression
○ Compression can boost sequential data access even 50 times! (100 MB/sec -> 5 GB/sec)
■ Less data -> fewer I/O operations
■ A single CPU can decompress data at up to 5 GB/sec
○ gzip decompression is very slow
○ snappy, lzo and lz4 can reach 1 GB/sec decompression speed
○ The decompression used by column-oriented databases can reach 5 GB/sec (PFOR)
■ Two billion integers per second! (almost one integer per clock cycle!)
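A small sketch of why compressed storage speeds up scanning (the disk and decompression speeds are the figures quoted above; the 50x ratio is from the same example):

# Effective scan throughput when data is stored compressed:
# the disk delivers compressed bytes and the CPU expands them on the fly.
disk_mb_per_sec = 100          # sequential read speed of the drive
compression_ratio = 50         # 50x lossless compression, as in the example above
decompress_mb_per_sec = 5000   # ~5 GB/sec decompression (fast codecs such as PFOR)

# Uncompressed bytes produced per second, limited by whichever side is slower.
effective = min(disk_mb_per_sec * compression_ratio, decompress_mb_per_sec)
print(effective)   # 5000 MB/sec: the 100 MB/sec disk now feeds a 5 GB/sec scan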
Example: clever lossy compression
(LogDrill)
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
2011-01-08 00:00 GET 200 2
2011-01-08 00:01 GET 200 2
2011-01-08 00:02 GET 404 1
2011-01-08 00:02 POST 200 1
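The six raw requests above collapse into per-minute counts keyed by (minute, method, status code). A minimal sketch of that aggregation (field positions are read off the sample lines; the real LogDrill pipeline is of course more involved):

from collections import Counter

raw = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]   # first and last of the six sample lines above

counts = Counter()
for line in raw:
    f = line.split()
    minute = f[0] + " " + f[1][:5]    # truncate the timestamp to the minute
    method, status = f[5], f[11]      # positions taken from the sample log format
    counts[(minute, method, status)] += 1

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)  # e.g. 2011-01-08 00:00 GET 200 1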
Cache-aware programming
● CPU speed has been increasing by about 60% a year
● Memory speed has been increasing by only 10% a year
● The growing gap is bridged by multi-level cache memories
● The cache is usually under-exploited
● Exploiting it well can bring a 100x speed-up!
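A minimal illustration of cache-friendly versus cache-hostile access patterns (a sketch using numpy; the array size is arbitrary):

import time
import numpy as np

a = np.random.rand(4096, 4096)   # ~128 MB, row-major (C order) layout

def timed(f):
    t0 = time.perf_counter(); f(); return time.perf_counter() - t0

# Row-wise sums walk memory sequentially, so cache lines are fully used.
rows = timed(lambda: sum(float(a[i, :].sum()) for i in range(a.shape[0])))

# Column-wise sums jump 4096 * 8 bytes between elements: mostly cache misses.
cols = timed(lambda: sum(float(a[:, j].sum()) for j in range(a.shape[1])))

print(rows, cols)   # the strided version is typically several times slower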
Lesson learned
● Put effort into deeply understanding your problem before Hadoop-ing it!
○ Modern analytical databases with multi-node scaling give orders of magnitude better performance
○ Clever aggregation can make the big data problem disappear entirely
○ If you do use Hadoop, be cost-effective! (cheap commodity hardware, not expensive servers!)