Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success

© ALTOROS Systems | CONFIDENTIAL Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success Andrei Yurkevich Chief Technology Officer [email protected]

description

Watch this presentation by Andrei Yurkevich, Altoros's President and CTO, to learn the main challenges that cause big data projects to fail. Discover a strategy that can help you mitigate risks when planning a large-scale, long-term project. See vivid examples of the mistakes Altoros made and learn how the issues were overcome with a prototype. See more at http://blog.altoros.com/big-data-analytics-2013-in-london.html

Transcript of Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success

Page 1: Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success

© ALTOROS Systems | CONFIDENTIAL

Andrei Yurkevich, Chief Technology Officer

[email protected]

Page 2

About Altoros

• Hadoop/NoSQL performance engineering

• Cluster automation and server templates on Joyent, AWS, SoftLayer, Rackspace, CloudStack, and OpenStack using Chef/Puppet, RightScale, and SCALR

• 300+ employees globally (UK, USA, Denmark, Switzerland, Norway, Belarus, Argentina)

Featured customers | Partners

Page 3

It's a Mad Mad Big Data World

Page 4

It's a Mad Mad Big Data World

Page 5

It's a Mad Mad Big Data World

5^6 combinations

Page 6

It's a Mad Mad Big Data World

5^6 combinations

15,625
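The figure on these slides can be read as 5^6, which equals the 15625 shown. A minimal sketch of that count, assuming one choice out of five candidates in each of six layers of a stack; the slides do not name the layers or the candidates, so both constants below are placeholders.

```python
from itertools import product

# Assumed for illustration only: the slides give the numbers, not their meaning.
OPTIONS_PER_LAYER = 5  # candidate tools per layer
LAYERS = 6             # layers in the technology stack

# Enumerate every possible stack: one option picked per layer.
stacks = list(product(range(OPTIONS_PER_LAYER), repeat=LAYERS))
combination_count = len(stacks)

print(combination_count)
```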

Page 7

It's a Mad Mad Big Data World

Page 8

Big Data Traps

• No clear business goals

• Big amounts of data from many sources

• Architecture design

• The variety of tools

• Compatibility of technologies/platforms

• Lack of professionals

• All features in one release

• Budget

Page 9

Project Requirements

1 million sensors generate 2.5 TB of data daily
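To get a feel for what this load means, the daily volume can be converted into a per-sensor volume and a sustained ingest rate. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load across the day, neither of which is stated on the slide.

```python
SENSORS = 1_000_000
DAILY_BYTES = 2.5e12            # 2.5 TB per day, assuming decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average volume produced by a single sensor in one day.
bytes_per_sensor_per_day = DAILY_BYTES / SENSORS

# Sustained write rate the storage layer must absorb, assuming an even load.
ingest_mb_per_second = DAILY_BYTES / SECONDS_PER_DAY / 1e6

print(f"{bytes_per_sensor_per_day / 1e6:.1f} MB per sensor per day")
print(f"{ingest_mb_per_second:.1f} MB/s sustained ingest")
```

Real traffic is rarely uniform, so peak rates would be higher than this average.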

Page 10

Project Requirements

Functional requirements | Value
The amount of data added daily | 2.5 TB
Data type | raw data, processed data
Data storage time, raw data | min. a week
Data storage time, processed data | min. a year
Response time for building reports based on a pre-set template | < 30 sec
Response time for building reports for a custom period of time | < 6 hours
Uptime | 99%
Fault-tolerance | required
Deployment cost per day | < $1,000

Non-functional requirements

• Infrastructure-independent architecture

• Scalability

• Open-source tools
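Two of these requirements translate directly into capacity numbers: the raw-data retention period sets a minimum storage size, and the uptime target sets a downtime budget. The sketch below works both out; it ignores replication and compression, which the slide does not specify.

```python
DAILY_RAW_TB = 2.5        # data added daily
RAW_RETENTION_DAYS = 7    # raw data kept for at least a week
UPTIME = 0.99             # required uptime

# Minimum raw-data capacity, before any replication factor or compression.
raw_storage_tb = DAILY_RAW_TB * RAW_RETENTION_DAYS

# Downtime allowed per year by a 99% uptime target.
downtime_hours_per_year = (1 - UPTIME) * 365 * 24

print(f"{raw_storage_tb} TB of raw data at minimum retention")
print(f"{downtime_hours_per_year:.1f} hours of downtime allowed per year")
```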

Page 11

Infrastructure

Feature | Amazon AWS | Joyent | Rackspace
Types of a contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand
Types of instances (classified by compute units) | General Purpose, Compute optimized, Memory optimized, Storage optimized | Standard, High Memory, High CPU, High Storage, High I/O | General Purpose
Storage options | EBS, S3, Low-cost storage | Network storage based on ZFS | Cloud Block Storage, Cloud Files
Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows
A management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel
A Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API
Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia
Estimated cost per month | $18,300 | $17,500 | $21,350
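The monthly estimates can be compared against the requirement of less than $1,000 per day by converting them to a daily figure. This sketch assumes a 30-day month and assumes that the monthly estimate covers the same scope as the "deployment cost per day" requirement; the slides do not confirm either.

```python
MONTHLY_COST_USD = {"Amazon AWS": 18_300, "Joyent": 17_500, "Rackspace": 21_350}
DAYS_PER_MONTH = 30        # assumption: the slides do not say how a month is counted
DAILY_BUDGET_USD = 1_000   # requirement: deployment cost per day < $1,000

# Convert each monthly estimate to an average daily cost.
daily_cost = {name: cost / DAYS_PER_MONTH for name, cost in MONTHLY_COST_USD.items()}

# Flag which providers stay under the daily budget.
within_budget = {name: cost < DAILY_BUDGET_USD for name, cost in daily_cost.items()}

for name, cost in daily_cost.items():
    print(f"{name}: ${cost:,.2f} per day, within budget: {within_budget[name]}")
```

Under these assumptions all three providers fit the budget, so cost alone does not separate them.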

Page 12

Infrastructure

Option 2, Option 1

Feature | Amazon AWS | Joyent | Rackspace
Types of a contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand
Types of instances (classified by compute units) | General Purpose, Compute optimized, Memory optimized, Storage optimized | Standard, High Memory, High CPU, High Storage, High I/O | General Purpose
Storage options | EBS, S3, Low-cost storage | Network storage based on ZFS | Cloud Block Storage, Cloud Files
Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows
A management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel
A Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API
Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia
Estimated cost per month | $18,300 | $17,500 | $21,350

Score: 1.5, 3.5

Legend: a good fit | a normal fit | a bad fit

Page 13

Choosing a Database

Features | HBase | Cassandra | MongoDB | MySQL Cluster
License | Apache | Apache | AGPL | GPL
Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC
Data model | Column family | Column family | JSON documents | Tables
Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL
Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key
Replication between nodes | yes | yes | yes | yes
Replication between data centers | no | yes | no | yes
Capability to store 2.5 TB daily | yes | yes | yes | yes
Implementation Experience | 1+ | 1+ | 2+ | 5+
Score | 2 | 3 | 2 | 5

Legend: a good fit | a normal fit | a bad fit

Page 14

Choosing a Database

Features | HBase | Cassandra | MongoDB | MySQL Cluster
License | Apache | Apache | AGPL | GPL
Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC
Data model | Column family | Column family | JSON documents | Tables
Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL
Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key
Replication between data centers | no | yes | no | yes
Capability to store 2.5 TB daily | yes | yes | yes | yes
Implementation Experience | 1+ | 1+ | 2+ | 5+
Deployment cost per day | $450 | $400 | $500 | $1,500
Score | 2.5 | 4 | 2.5 | 0

Legend: a good fit | a normal fit | a bad fit
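Once deployment cost is added, MySQL Cluster drops to a score of 0. The slide does not explain the scoring, but one plausible reading is that $1,500 per day breaks the budget requirement of less than $1,000 per day. A minimal sketch of that budget filter, using the costs from the table:

```python
DAILY_BUDGET_USD = 1_000  # requirement: deployment cost per day < $1,000

deployment_cost_per_day = {
    "HBase": 450,
    "Cassandra": 400,
    "MongoDB": 500,
    "MySQL Cluster": 1_500,
}

# Databases whose daily cost breaks the budget requirement.
over_budget = [name for name, cost in deployment_cost_per_day.items()
               if cost >= DAILY_BUDGET_USD]

# Remaining candidates, cheapest first.
candidates = sorted(
    (name for name in deployment_cost_per_day if name not in over_budget),
    key=deployment_cost_per_day.get,
)

print("Over budget:", over_budget)
print("Candidates:", candidates)
```

The three remaining candidates match the shortlist on the next slide (Cassandra, MongoDB, HBase).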

Page 15

Choosing a database: Cassandra, MongoDB, HBase

Storing Raw Data

Page 16

Choosing a Database

Feature | HBase | Cassandra | MongoDB
Replication between data centers | Asynchronous, needs testing | Replicas can span data centers with synchronous replication | Not supported
A cluster admin node | NameNode | Any node | mongos process
Implementation Experience | 1+ | 1+ | 2+
Time spent on inserting 30 MB of data | 7 sec | 9 sec | 20 sec
Deployment cost per day | $450 | $400 | $500
Score | 2 | 2.5 | 0

Legend: a good fit | a normal fit | a bad fit
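The insert timings can be turned into throughput and compared with the rate needed to absorb 2.5 TB per day. This is a back-of-the-envelope sketch: it assumes the 30 MB test measured a single writer, that throughput scales linearly with parallel writers, and that units are decimal. None of these is stated on the slide, so the writer counts are only indicative.

```python
import math

INSERT_MB = 30
insert_seconds = {"HBase": 7, "Cassandra": 9, "MongoDB": 20}

# 2.5 TB per day expressed in decimal MB per second.
REQUIRED_MB_PER_SECOND = 2.5e6 / 86_400

# Measured throughput of the 30 MB insert test.
throughput_mb_per_second = {name: INSERT_MB / secs
                            for name, secs in insert_seconds.items()}

# Parallel writers needed if one writer achieves the measured rate (assumption).
writers_needed = {name: math.ceil(REQUIRED_MB_PER_SECOND / rate)
                  for name, rate in throughput_mb_per_second.items()}

for name, rate in throughput_mb_per_second.items():
    print(f"{name}: {rate:.2f} MB/s, about {writers_needed[name]} parallel writers")
```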

Page 17

Architecture of the System

Page 18

Examples of reports

Storing Processed Data

Page 19

Prototype's Correspondence to the Initial Requirements

A requirement | The prototype features
Storing of 2.5 TB of daily raw data for a week | Capable
Storing of 1.5 TB of processed data for a year | Capable
Response time for building reports based on a pre-set template | ~25 sec
Response time of less than 6 hours for building a custom report | ~7 hours
Scalability | Good
Infrastructure independence | Yes
Using open-source tools | For all components
Fault-tolerance | Yes
Deployment cost per day < $1,000 | ~$600
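The measurable rows of this comparison can be checked mechanically against the limits from the project requirements. A small sketch, treating the approximate prototype figures as exact values:

```python
# (requirement, limit, prototype measurement); each measurement must stay below its limit.
# Prototype figures are approximate on the slide ("~") and used here as exact values.
checks = [
    ("Pre-set template report, seconds", 30, 25),
    ("Custom report, hours", 6, 7),
    ("Deployment cost per day, USD", 1_000, 600),
]

# Requirements the prototype does not meet.
failed = [name for name, limit, measured in checks if measured >= limit]

for name, limit, measured in checks:
    status = "OK" if measured < limit else "MISSED"
    print(f"{name}: limit < {limit}, prototype ~{measured} -> {status}")
```

Only the custom report misses its target (about 7 hours against a limit of 6), which is the kind of gap a prototype is meant to expose before the full build.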

Page 20

How to Make a Big Data Project Work

• Properly visualize and test the functionality

• Detect bottlenecks and change a technology/tool/database before it is implemented in the real system

• Get a real vision of the final solution

• Make sure you stick to the budget

Page 21

Andrei Yurkevich, President/CTO

[email protected]