Big Data: an introduction

Dr. ir. ing. Bart Vandewoestyne, Sizing Servers Lab, Howest, Kortrijk, March 28, 2014

description

Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.

Transcript of Big Data: an introduction

Page 1: Big Data: an introduction

Big Data

Big Data: an introduction

Dr. ir. ing. Bart Vandewoestyne

Sizing Servers Lab, Howest, Kortrijk

March 28, 2014

Page 2: Big Data: an introduction

Outline

1 Introduction: Big Data?

2 Big Data Technology

3 Big Data in my company?

4 IWT TETRA project

5 Conclusions

Page 3: Big Data: an introduction

Introduction: Big Data?

Page 4: Big Data: an introduction

Exponential growth of data

[Figure: “Big Data: This is just the beginning” (© 2013 International Business Machines Corporation). Data volume in exabytes and percentage of uncertain data, 2010 to 2015, with a “You are here” marker. Data sources shown: Sensors & Devices, VoIP, Enterprise Data, Social Media.]

Page 5: Big Data: an introduction

Big Data definition

The definition of Big Data depends on whom you ask:

Big Data

“Multiple terabytes or petabytes.” (according to some professionals)

“I don’t know.” (today’s big may be tomorrow’s normal)

“Relative to its context.”

Page 6: Big Data: an introduction

Quotes on Big Data

“Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs.

It’s about recognizing that for some problems other storage solutions are better suited.

Page 7: Big Data: an introduction

The Three V’s

Volume The amount of data is big.

Variety Different kinds of data:

structured

semi-structured

unstructured

Velocity Speed issues to consider:

How fast is the data available for analysis?

How fast can we do something with it?

Other V’s: Veracity, Variability, Validity, Value, ...

Page 8: Big Data: an introduction

Structured data

Structured data

Pre-defined schema imposed on the data

Highly structured

Usually stored in a relational database system

Example

numbers: 20, 3.1415, ...

dates: 21/03/1978

strings: “Hello World”

...

Roughly 20% of all data out there is structured.
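
As a minimal sketch of what a pre-defined schema means in practice, the Python snippet below uses the standard-library sqlite3 module with an in-memory database. The table name `measurements` and its columns are hypothetical, chosen only to mirror the number, date, and string examples above; they do not come from the slides.

```python
import sqlite3

# Hypothetical table; names are illustrative, not from the presentation.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        id     INTEGER PRIMARY KEY,
        value  REAL NOT NULL,   -- numbers such as 3.1415
        taken  TEXT NOT NULL,   -- dates stored as ISO 8601 text
        label  TEXT NOT NULL    -- strings such as "Hello World"
    )
    """
)
conn.execute(
    "INSERT INTO measurements (value, taken, label) VALUES (?, ?, ?)",
    (3.1415, "1978-03-21", "Hello World"),
)

# The schema rejects rows that do not fit: a missing label violates NOT NULL.
try:
    conn.execute(
        "INSERT INTO measurements (value, taken) VALUES (?, ?)",
        (20, "1978-03-21"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT value, taken, label FROM measurements").fetchall()
```

The point of the sketch is that the structure is imposed before any data arrives, so a row that does not match it never gets stored.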

Page 9: Big Data: an introduction

Semi-structured data

Semi-structured data

Inconsistent structure.

Cannot be stored in rows and tables in a typical database.

Information is often self-describing (label/value pairs).

Example

XML, SGML, ...

BibTeX files

logs

tweets

sensor feeds

. . .
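
To illustrate "self-describing" label/value pairs, here is a small Python sketch that parses two JSON records with the standard-library json module. The sensor records are invented for illustration, and JSON is used only as a convenient stand-in for the formats listed above.

```python
import json

# Two hypothetical sensor records; the second lacks "unit" and adds "tags".
raw_feed = """
{"sensor": "s1", "value": 21.5, "unit": "C"}
{"sensor": "s2", "value": 40, "tags": ["outdoor"]}
"""

records = [json.loads(line) for line in raw_feed.strip().splitlines()]

# Each record carries its own labels, so code must cope with missing keys.
units = [record.get("unit", "unknown") for record in records]

# The set of labels is only known after looking at the data itself.
all_labels = sorted({label for record in records for label in record})
```

Because the two records do not share the same set of labels, they do not map cleanly onto one fixed set of table columns, which is the inconsistency the slide refers to.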

Page 10: Big Data: an introduction

Unstructured data

Definition (Unstructured data)

Lacks structure or parts of it lack structure.

Example

multimedia: videos, photos, audio files, ...

email messages

free-form text

word processing documents

presentations

reports

. . .

Experts estimate that 80 to 90% of the data in any organization is unstructured.

Page 11: Big Data: an introduction

Data Storage and Analysis

Storage capacity of hard drives has increased massively over the years.

Access speeds have not kept up.

Example (Reading a whole disk)

Year   Storage Capacity   Transfer Speed   Time
1990   1370 MB            4.4 MB/s         ≈ 5 minutes
2010   1 TB               100 MB/s         > 2.5 hours

Solution: work in parallel!

Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes.
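
The read times on this slide can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and perfectly even distribution over the drives; the function name is made up for this example.

```python
MB = 10**6   # decimal megabyte in bytes (an assumption about the slide's units)
TB = 10**12  # decimal terabyte in bytes


def read_time_seconds(capacity_bytes, speed_bytes_per_s, drives=1):
    """Time to read everything when the data is spread evenly over `drives`."""
    return capacity_bytes / (speed_bytes_per_s * drives)


t_1990 = read_time_seconds(1370 * MB, 4.4 * MB)        # a bit over 5 minutes
t_2010 = read_time_seconds(1 * TB, 100 * MB)           # well over 2.5 hours
t_parallel = read_time_seconds(1 * TB, 100 * MB, 100)  # under 2 minutes

print(f"1990: {t_1990 / 60:.1f} min")
print(f"2010: {t_2010 / 3600:.1f} h")
print(f"2010, 100 drives: {t_parallel / 60:.1f} min")
```

Capacity grew by a factor of roughly 730 while transfer speed grew by a factor of roughly 23, which is why reading a full disk takes so much longer in 2010 than in 1990.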

Page 12: Big Data: an introduction

Working in parallel

Problems

1 Hardware failure?

2 Combining data from different disks for analysis?

Solutions

1 HDFS: Hadoop Distributed Filesystem

2 MapReduce: programming model
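
To give a feel for the MapReduce programming model, the sketch below counts words in plain Python. It is not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and everything runs in a single process, whereas a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each line stands in for a chunk of data held on a different disk.
lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The map step only ever looks at one chunk, and the reduce step only ever looks at one key, which is what makes it possible to combine data from different disks without a single machine seeing everything.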

Page 13: Big Data: an introduction

Big Data Technology

Page 14: Big Data: an introduction

Big Data Landscape

Page 15: Big Data: an introduction

Hadoop

Hadoop is VMware, but the other way around.

Page 16: Big Data: an introduction

Hadoop as the opposite of a virtual machine

VMware

1 take one physical server

2 split it up

3 get many small virtual servers

Hadoop

1 take many physical servers

2 merge them all together

3 get one big, massive, virtual server

Page 17: Big Data: an introduction

Hadoop: core functionality

HDFS Self-healing, high-bandwidth, clustered storage.

MapReduce Distributed, fault-tolerant resource management, coupled with scalable data processing.

Page 18: Big Data: an introduction

HDFS architecture

Page 19: Big Data: an introduction

MapReduce

Page 20: Big Data: an introduction

MapReduce

Page 21: Big Data: an introduction

Apache Hadoop essentials: technology stack

Page 22: Big Data: an introduction

Pig

MapReduce requires programmers to

think in terms of map and reduce functions,

more than likely use the Java language.

Pig provides a high-level language (Pig Latin) that can be used by

Analysts

Data Scientists

Statisticians

Etc.

Page 23: Big Data: an introduction

Hive

Originated at Facebook to analyze log data.

HiveQL: Hive Query Language, similar to standard SQL.

Queries are compiled into MapReduce jobs.

Has command-line shell, similar to e.g. MySQL shell.

Page 24: Big Data: an introduction

Example Hadoop distributions

Page 25: Big Data: an introduction

NoSQL

Page 26: Big Data: an introduction

RDBMS: Codd’s 12 rules

Codd’s 12 rules

A set of rules designed to define what is required from a database management system in order for it to be considered relational.

Rule 0 The Foundation rule

Rule 1 The Information rule

Rule 2 The guaranteed access rule

Rule 3 Systematic treatment of null values

Rule 4 Active online catalog based on the relational model

. . . . . .

Page 27: Big Data: an introduction

ACID

ACID

A set of properties that guarantee that database transactions are processed reliably.

Atomicity A transaction is all or nothing.

Consistency Only transactions that leave the data in a valid state are committed.

Isolation Simultaneous transactions will not interfere.

Durability Written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
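
The "all or nothing" behaviour can be sketched with Python's standard-library sqlite3 module. The `accounts` table, the names in it, and the `transfer` helper are hypothetical and exist only for this illustration; the sketch shows atomicity and a consistency check, not isolation or durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdraft
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid: both rows change
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit to bob has already been executed when the debit fails, yet the rollback undoes it, so no half-finished transfer is ever visible.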

Page 28: Big Data: an introduction

Scaling up

What if you need to scale up your RDBMS in terms of

dataset size,

read/write concurrency?

This usually involves

breaking Codd’s rules,

loosening ACID restrictions,

forgetting conventional DBA wisdom,

losing most of the desirable properties that made the RDBMS so convenient in the first place.

NoSQL to the rescue!

Page 29: Big Data: an introduction

NoSQL

NoSQL

‘Invented’ by Carlo Strozzi in 1998 (for his file-based database)

“Not only SQL”

It’s NOT about

saying that SQL should never be used,

saying that SQL is dead.

Page 30: Big Data: an introduction

NoSQL databases

Four emerging NoSQL categories:

Page 31: Big Data: an introduction

Use the right tool for the right job!

http://db-engines.com/

Page 32: Big Data: an introduction

Big Data in my company?

Page 33: Big Data: an introduction

Typical RDBMS scaling story

1. Initial Public Launch

From local workstation → remotely hosted MySQL instance.

2. Service popularity ↑, too many reads hitting the database

Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire.

3. Popularity ↑↑, too many writes hitting the database

Scale MySQL vertically by buying a beefed-up server:

16 cores

128 GB of RAM

banks of 15 k RPM hard drives

Costly
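
Step 2 of the story above (cache common queries, let cached data expire) can be sketched as follows. This is an in-process stand-in written for illustration, not the memcached client API; the class `TTLCache`, the key `"top10"`, and the fake clock are all assumptions made for the example.

```python
import time


class TTLCache:
    """Tiny stand-in for a query cache: entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: may be stale relative to the database
        value = compute()    # cache miss or expired entry: query the database
        self.store[key] = (now + self.ttl, value)
        return value


# Hypothetical "database query" that records how often it really runs.
calls = []


def slow_query():
    calls.append(1)
    return "result"


fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = TTLCache(ttl=60, clock=lambda: fake_now[0])
first = cache.get_or_compute("top10", slow_query)   # miss: runs the query
second = cache.get_or_compute("top10", slow_query)  # hit: no database call
fake_now[0] = 61.0
third = cache.get_or_compute("top10", slow_query)   # expired: runs it again
```

Between the first and third call a reader can see a value that no longer matches the database, which is the sense in which reads are "no longer strictly ACID".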

Page 34: Big Data: an introduction

Typical RDBMS scaling story

4. New features → query complexity ↑, now too many joins

Denormalize your data to reduce joins. (That’s not what they taught me in DBA school!)

5. Rising popularity swamps the server; things are too slow

Stop doing any server-side computations.

Page 35: Big Data: an introduction

Typical RDBMS scaling story

6. Some queries are still too slow

Periodically prematerialize the most complex queries, and try to stop joining in most cases.

7. Reads are OK, writes are getting slower and slower. . .

Drop secondary indexes and triggers (no indexes?).

If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.

Page 36: Big Data: an introduction

Use-cases of Big Data

‘Core Big Data’ company

Big Data

crunching,

hacking,

processing,

analyzing,

. . .

‘General Big Data’ company

Business Analytics

improve decision-making,

gain operational insights,

increase overall performance,

track and analyze shopping patterns,

. . .

Both

Explore! Discover hidden gems!

Page 37: Big Data: an introduction

Some examples

Intrusion detection based on server log data

Real-time security analytics

Fraud detection

Customer behavior based sentiment analysis of social media

Campaign analytics

Page 38: Big Data: an introduction

Big Data in your company

Page 39: Big Data: an introduction

IWT TETRA project

Page 40: Big Data: an introduction

IWT TETRA project

Data mining: from relational database to Big Data (original Dutch title: “Data mining: van relationele database naar Big Data”).

Dates

Submitted: 12/03/2014

Notification of acceptance: July, 2014

Runs from 01/10/2014 – 01/10/2016

People involved

Wannes De Smet (researcher)

Bart Vandewoestyne (researcher)

Johan De Gelas (project coordinator)

Interested? → Come talk to us!

Page 41: Big Data: an introduction

Project plan, work packages

RDBMS vs. Distributed Processing

Technology Choice

MapReduce & Alternatives

Big Data Stack Analysis

BI Optimization

Distributed Processing Optimization

Infrastructure & Cloud Analysis

Dissemination

Page 42: Big Data: an introduction

WP1: RDBMS vs. Distributed Processing

Key question

When to switch from a ‘traditional’ technology to ‘Big Data’ technology?

Evaluate traditional database systems (Virtuoso, VoltDB, ...)

Find their limitations.

Strengths? Weaknesses?

Page 43: Big Data: an introduction

WP2: Analyse Big Data technology stack

Key idea

Get acquainted with Hadoop and its most important software components.

Find the best way to set up, administer and use Hadoop.

Get familiar with the most important software components (Pig, Hive, HBase, ...).

Find out how easy it is to integrate Hadoop into existing architectures.

Page 44: Big Data: an introduction

WP3: Alternatives for MapReduce

Key question

What are valuable alternatives to MapReduce?

Faster querying (compared to Pig & Hive)

Lightning-fast cluster computing

Distributed and fault-tolerant realtime computation

Apache Storm

Page 45: Big Data: an introduction

WP4: BI optimization

Key questions

Where can existing BI solutions be optimized?

How can current BI solutions interact with Big Data technology?

Virtuoso, MS SQL Server 2014, VoltDB, ...

Apache Sqoop

Page 46: Big Data: an introduction

WP5: Distributed Processing optimization

Key question

Where can Big Data technology be performance tuned?

How is the data stored?

Optimal settings for Hadoop, MapReduce, ...

Benchmarks such as TestDFSIO, TeraSort, NNBench, MRBench, ...

Page 47: Big Data: an introduction

WP6: Infrastructure & Cloud analysis

Key question

What hardware best fits the (Big Data) needs?

Perform hardware monitoring.

Analyze cloud solutions.

Formulate best practices.

Give advice on hardware choice.

Page 48: Big Data: an introduction

WP7: Dissemination & project follow-up

Key idea

Spread the message!

Document case-studies.

Prepare for education.

Presentations at events.

Blogs, articles, ...

Workshops

Page 49: Big Data: an introduction

Conclusions

Page 50: Big Data: an introduction

Conclusions

“Big” can be small too.

The Big Data landscape is huge.

The right tool for the right job!

We can help → advice, case studies

Your company can benefit from Big Data technology.

Be brave in your quest...

Page 51: Big Data: an introduction

Questions?

Questions?

[email protected]

[email protected]

[email protected]
