BIG DATA: From mammoth to elephant


Roman Nikitchenko, 10.05.2015

BIG DATA: FROM MAMMOTH TO ELEPHANT

MAMMOTH: the only real truth we know about them is their remains. Do you feel your enterprise data infrastructure is going the same way?

Come and see in the nearest data center...

2

TWO YEARS AGO

● Our exciting, highly scalable realtime BIG DATA solution with a broad technology stack in production.

3

This is our PRESENT DAY...

... yet it is powered by

4

OUR INITIAL STATE: TOP VIEW

[Architecture diagram: healthcare provider data (labs, cares ...) arrives as the inbound flow, passes through storage and SQL DBs (processed inbound data, application data, outbound information) and is served to CLIENT APPLICATIONS; the outbound flow goes mostly to insurance companies.]

5

OUR INITIAL STATE: TOP VIEW

[The same architecture diagram, annotated:]

● Inbound data archives (pretty short cycle)
● One SQL DB per application
● Huge amount of data, with a serious share of duplicates
● How about retention and investigation of data issues?

6

OUR INITIAL STATE: YELLOW ALARMS

[The same architecture diagram, annotated:]

● The outbound flow is slow because of RDBMS processing
● The inbound data retention cycle is short, so investigating data over a prolonged period is hard
● The huge overall number of SQL databases means high operational complexity
● One application DB per service client makes inter-application analytics and monitoring extremely hard

7

8

BIG DATA: WHAT TO RUN FOR?

MORE STORAGE: better ways to store huge data volumes, cheaper, safer and easier.

9

BIG DATA: WHAT TO RUN FOR?

MORE POWER: scalable, effective distributed processing models that open new opportunities like machine learning.

10

BIG DATA: WHAT TO RUN FOR?

More flexible data structures, closer to the subject area and the real world.

11

OUR MAIN ENEMY WAS... RDBMS LIMITS

● Good for anything
● Not so good for anything in particular

12

WHY SQL IS EVIL: MASSIVE ANALYSIS

Massive analysis is about massive access to your data objects.

[Diagram: your database feeds many parallel processing tasks; each task must first transform database structure into subject-area object structure; the distributed processing results then have to be collected and joined effectively.]

13

RDBMS LIMITS: SUBJECT AREA OBJECT COLLECTION

When you go for massive processing, object collection becomes too complex. Think about scanning the data of 100,000,000 people.

  Patient:  FirstName | LastName | Address | Payer
            John      | Smith    | 1       | 2
            Kate      | Davis    | 2       | 1
            Samuel    | Brown    | 3       | 2

  Address:  ID | City     | Street
            1  | New York | 1020, Blue lake
            2  | Atlanta  | 203, Bricks av.
            3  | Seattle  | 120, Green drv.

  Payer:    ID | Name      | State
            1  | SaferLife | GA
            2  | YourGuard | CA

Assembled object: Kate Davis, Atlanta, 203, Bricks av., SaferLife, GA
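To make this concrete, below is a minimal JDBC sketch of the join needed to assemble just one subject-area object from the tables above; the MySQL URL, credentials and class name are hypothetical. At a 100,000,000-row scale, this per-object assembly work is exactly what hurts.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PatientAssembler {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details, for illustration only.
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://localhost/patients", "user", "secret")) {
                // A three-way join: every subject-area object costs two extra
                // lookups beyond the Patient row itself.
                PreparedStatement stmt = db.prepareStatement(
                    "SELECT p.FirstName, p.LastName, a.City, a.Street, "
                    + "y.Name, y.State "
                    + "FROM Patient p "
                    + "JOIN Address a ON a.ID = p.Address "
                    + "JOIN Payer y ON y.ID = p.Payer "
                    + "WHERE p.LastName = ?");
                stmt.setString(1, "Davis");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s %s, %s, %s, %s, %s%n",
                            rs.getString(1), rs.getString(2), rs.getString(3),
                            rs.getString(4), rs.getString(5), rs.getString(6));
                    }
                }
            }
        }
    }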

14

RDBMS LIMITS: TABLE STRUCTURE MODIFICATION

Let it be the Patients table:

  FirstName | LastName | Address | Payer
  John      | Smith    | 1       | 2
  Kate      | Davis    | 2       | 1
  Samuel    | Brown    | 3       | 2

And now let us add a new «Birthday» column. Easy as pie!

  ALTER TABLE Patient ADD Birthday ...

Now do this with a 2,000,000,000-row MySQL table in production. And what do you do when your table grows further?
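For contrast, here is a minimal sketch of the same change in HBase, assuming the HBase 1.x client API and a hypothetical 'patient' table with a 'd' column family: a column exists simply because some row carries it, so there is no table-wide migration at all.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AddBirthday {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table patients = conn.getTable(TableName.valueOf("patient"))) {
                // No ALTER TABLE: the new "Birthday" qualifier simply appears
                // in the rows that are written with it.
                Put put = new Put(Bytes.toBytes("kate.davis"));
                put.addColumn(Bytes.toBytes("d"),           // column family
                              Bytes.toBytes("Birthday"),    // brand new "column"
                              Bytes.toBytes("1980-04-12"));
                patients.put(put);
            }
        }
    }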

15

ANY RELATIONAL DATA MODEL, SOONER OR LATER...

16

RDBMS LIMITS: HOW TO SCALE?

[Diagram: your SQL database is split into shards, each with its own processing; the distributed processing results still have to be joined.]

● How to partition the data?
● What to do when a new shard is added? (see the sketch below)
● Do you need another cluster for processing?
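A small sketch of why the "new shard" question is painful with naive modulo partitioning; the class, keys and shard counts are made up for illustration.

    public class ShardRouter {
        // Naive modulo partitioning: key -> shard index.
        static int shardFor(String key, int shardCount) {
            return Math.abs(key.hashCode() % shardCount);
        }

        public static void main(String[] args) {
            String[] keys = {"kate.davis", "john.smith", "samuel.brown"};
            for (String key : keys) {
                // Growing from 4 to 5 shards changes the home of most keys,
                // so almost the whole data set has to be rebalanced.
                System.out.printf("%s: shard %d -> shard %d%n",
                    key, shardFor(key, 4), shardFor(key, 5));
            }
        }
    }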

17

If you need to store a plain-text log, a collection of objects kept for a long time, or current user session attributes, do you really need SQL?

18

OUR INITIAL BIG PLAN

[Diagram: per-application SQL DBs feed a cross-application data storage through ETL; the new storage takes both the small realtime request load and the batch analytics and reporting load.]

● One-time ETL as the initial step and as a backup strategy.
● Full migration to Apache HBase.
● Realtime synchronization as a transition-period solution.

19

WHY HADOOP (INITIALLY 1.x)?

● OPEN SOURCE framework for big data: both distributed storage and processing.
● Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, the file system defaults to a replication factor of 3).
● Horizontal scalability from a single computer up to thousands of nodes.

20

The world's first DATA OS: a 10,000-node computer... Yet you can start production with just 4 servers, 1 of them for management and coordination. A single server is enough for a development environment.

21

HBASE MOTIVATION: WHY HBASE?

LATENCY, SPEED, AND ALL THE HADOOP PROPERTIES

22

WHY HBASE?

[Diagram: the per-node stack, repeated across the cluster: Region server (database) over TaskTracker (distributed processing) over DataNode (file system) over the hardware node.]

● Good both for OLTP and for batch load (see the read sketch below).
● Natural scaling and reliability with Hadoop.
● Data processing locality; natural sharding with regions.
● Coordination with ZooKeeper.
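A minimal sketch of the OLTP side, under the same assumptions as before (HBase 1.x client API, hypothetical 'patient' table): a keyed read is served by the single region that owns the key, so latency does not depend on table size.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OltpRead {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table patients = conn.getTable(TableName.valueOf("patient"))) {
                // Random read by row key: routed to exactly one region server.
                Result row = patients.get(new Get(Bytes.toBytes("kate.davis")));
                byte[] last = row.getValue(Bytes.toBytes("d"),
                                           Bytes.toBytes("LastName"));
                System.out.println(last == null ? "(no such row)"
                                                : Bytes.toString(last));
            }
        }
    }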

23

ZOOKEEPER: because coordinating distributed systems is a zoo.

● Quorum-based service for fast distributed system coordination.
● Came into our stack with Apache HBase, which needed it for coordination. Now it is part of the core Hadoop infrastructure.
● Yet we also use it for our own applications (a registration sketch follows).
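A minimal sketch of the kind of application-level coordination we mean, assuming the plain ZooKeeper Java client; the quorum addresses and paths are hypothetical, and the parent path is assumed to exist. A worker registers itself with an ephemeral node that disappears automatically when its session dies, so its peers learn about the failure.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ServiceRegistration {
        public static void main(String[] args) throws Exception {
            // Connect to the quorum; watch events are ignored in this sketch.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
                                         15000, event -> {});
            // Ephemeral + sequential: the node vanishes when this process
            // (its session) dies. Parent path /services/etl must exist.
            String path = zk.create("/services/etl/worker-",
                                    "host1:9090".getBytes("UTF-8"),
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                    CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("Registered as " + path);
            Thread.sleep(Long.MAX_VALUE); // stay registered while alive
        }
    }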

24

Finally, we went to initial production with HADOOP 2.0.

HADOOP 2.x CORE:
● FILE SYSTEM
● RESOURCE MANAGEMENT
● DISTRIBUTED PROCESSING
● COORDINATION

25

REAL INITIAL APPROACH

[Diagram: per node, Region server (database) over NodeManager (resource management) over DataNode (file system) over the hardware node; distributed processing and coordination span the whole cluster.]

● ZooKeeper instances are distributed among the cluster.
● MapReduce is not a service in Hadoop 2.x, just a YARN application.

26

FIRST REAL RESULT: CLOSE TO THE PLAN, BUT NOT EXACTLY IT

[Same diagram as the initial plan: per-application SQL DBs feed the cross-application data storage through ETL, which serves small realtime requests and the batch analytics and reporting load.]

Daily ETL satisfied our daily reporting needs and gave the SQL infrastructure a major offload. Direct profit: massive processing is much faster and can handle inter-application data.

DO NOT PUT ON ROSE-COLORED GLASSES, though.

27

THE APPROACH WE FIXED MUCH LATER

[Diagram: instead of a single ETL stream produced by the SQL server's JOIN over Table1 through Table4, multiple parallel ETL streams are bulk-loaded into the BIG DATA shards.]

28

HADOOP: DON'T DO IT YOURSELF

Because of a number of factors, starting from our distributed team's support needs, we selected ...

29

HADOOP as INFRASTRUCTURE

30

WHERE TO GO FROM HERE?

31

The admission of temporary residents into Canada is a privilege, not a right.

http://www.cic.gc.ca/

SEARCH / SECONDARY INDICES

32

SEARCH / SECONDARY INDICES

NO SEARCH COMES OUT OF THE BOX OTHER THAN A LINEAR SCAN OVER THE TABLE WITH FILTERS. The same turned out to apply to secondary indices in HBase (see the scan sketch below).
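This is what "linear scan with filters" looks like in client code, again assuming the HBase 1.x API and the hypothetical patient table: the filter is evaluated server-side, which saves network traffic, but every region still reads through all of its rows.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LinearSearch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table patients = conn.getTable(TableName.valueOf("patient"))) {
                // "Search" without an index: scan the whole table; region
                // servers drop non-matching rows before returning results.
                Scan scan = new Scan();
                scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("d"), Bytes.toBytes("LastName"),
                    CompareOp.EQUAL, Bytes.toBytes("Davis")));
                try (ResultScanner results = patients.getScanner(scan)) {
                    for (Result row : results) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }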

33

SEARCH / SECONDARY INDICES: HOW WE MADE IT

● HBase handles user data changes.
● Indexes are built on SOLR.
● The NGData Lily indexer transforms data changes into SOLR index updates.

34

HBASE: DATA AND SEARCH INTEGRATION

[Diagram: the client just puts (or deletes) data in HBase; the Lily HBase NRT indexer picks the changes up through REPLICATION and translates them into SOLR index updates; the SOLR cloud does the real indexing and serves search requests over HTTP, returning search responses; Apache ZooKeeper does all the coordination, keeping search and indexing together.]
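On the query side, clients talk to SOLR directly. A minimal SolrJ sketch, assuming SolrJ 5.x and a hypothetical 'patients' collection kept in sync by the Lily indexer; the idea is that the indexed document carries the HBase row key, so the full record can then be fetched from HBase.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class PatientSearch {
        public static void main(String[] args) throws Exception {
            // CloudSolrClient locates SOLR nodes through the same ZooKeeper quorum.
            try (CloudSolrClient solr =
                     new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
                solr.setDefaultCollection("patients");
                QueryResponse response = solr.query(new SolrQuery("lastName:Davis"));
                for (SolrDocument doc : response.getResults()) {
                    // "id" is assumed to hold the HBase row key.
                    System.out.println(doc.getFieldValue("id"));
                }
            }
        }
    }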

35

GOING REALTIME

● Kafka is a high-throughput distributed messaging system.
● It allows true realtime system reaction through the publish-subscribe approach.
● New services can subscribe to the data event stream (see the publisher sketch below).

[Diagram: new data feeds both the batch load path and the realtime load path.]
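A minimal publisher sketch, assuming the Java producer API (Kafka 0.8.2 and later) and a hypothetical 'inbound-data' topic: whoever receives new data publishes an event, and any number of subscribers can react to it.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DataEventPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by row key so all events for one record stay ordered
                // within a partition; the value would be the record payload.
                producer.send(new ProducerRecord<>("inbound-data",
                                                   "kate.davis", "record-payload"));
            }
        }
    }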

36

GOING REALTIME GENTLY

● Kafka can be separated from the Hadoop infrastructure or can have a backup cluster.
● Data publishers can switch to another cluster.
● Subscribers (including Spark on Hadoop) keep two places of subscription (sketched below).
● So you are free to put a Kafka cluster into MAINTENANCE or to back up subscribers.
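A rough sketch of the "two places of subscription" idea, assuming the newer KafkaConsumer API (Kafka 0.9 and later) and hypothetical cluster addresses; offset reconciliation between the clusters is deliberately left out.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class FailoverSubscriber {
        static KafkaConsumer<String, String> connect(String bootstrap) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrap);
            props.put("group.id", "reporting");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("inbound-data"));
            return consumer;
        }

        public static void main(String[] args) {
            // Two places of subscription: the main and the backup cluster.
            String[] clusters = {"kafka1:9092", "backup-kafka1:9092"};
            int active = 0;
            while (true) {
                try (KafkaConsumer<String, String> consumer =
                         connect(clusters[active])) {
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(1000);
                        for (ConsumerRecord<String, String> r : records) {
                            System.out.println(r.key() + " -> " + r.value());
                        }
                    }
                } catch (Exception e) {
                    // Active cluster down or in maintenance: switch over.
                    active = 1 - active;
                }
            }
        }
    }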

37

This is our PRESENT DAY...

... yet it is powered by

38

SO WHERE ARE WE GOING?

39

OVER BIG DATA: REACTIVE MANIFESTO

MOTIVATION: "... users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday's software architectures."

40

OVER BIG DATA: REACTIVE MANIFESTO

"... we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems." (http://www.reactivemanifesto.org/)

41

OVER BIG DATA: REACTIVE MANIFESTO

RESPONSIVE: "Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively."

42

OVER BIG DATA: REACTIVE MANIFESTO

RESILIENT: "The system stays responsive in the face of failure. ... The client of a component is not burdened with handling its failures."

All services here are located through ZooKeeper, which is quorum-based, so resilience is achieved.

43

OVER BIG DATA: REACTIVE MANIFESTO

ELASTIC: "Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs."

Both HDFS and HBase allow dynamic node addition and removal; YARN already handles most of the resource allocation work and keeps making progress.

44

OVER BIG DATA: REACTIVE MANIFESTO

MESSAGE DRIVEN: "Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling."

Applications send asynchronous messages, and any application can subscribe, not only Hadoop services.

45

LESSONS LEARNED

● No transition in one step. You enter the Big Data world step by step.

● Change your mind first. You should stop thinking in the old style; do not simply try to map your existing approaches onto new tools.

● No silver bullet. Don't ruin your existing infrastructure: extend it. NoSQL is not always good, and some cases really should stay on SQL. Use the right tool.

● As you progress, you pay more and more attention to operations and to reactive system properties.

46

QUESTIONS?

47