Elephant grooming: quality with Hadoop

Roman Nikitchenko, 14.02.2015

ELEPHANT GROOMING WITH QUALITY

SUBJECTIVE

First ever world DATA OS

A 10,000-node computer... Recent technology changes focus on higher scale, better resource control, lower latency, and higher security and fault tolerance.

x MAX+

BIG DATA

Hadoop in one picture

OPEN SOURCE framework for big data. Both distributed storage and processing.

Provides RELIABILITY and fault tolerance BECAUSE OF SOFTWARE design. Example: the file system uses a replication factor of 3 by default.

Unique horizontal scalability from a single computer up to thousands of nodes.

HADOOP

This is what everybody told you. Starting from your Hadoop distribution vendor.

Small, dirty, clumsy, always hungry … good if alive at all

HADOOP What you really get is ...

YARN

Our HADOOP is healthy and is growing.

TODAY WE FOCUS ON QUALITY

Billions of medical records processed

every day

10s of millions of medical histories

QUALITY

Fears and reality

Feeling of something wrong

FEAR OF

WHAT IS THE ROOT CAUSE?

LOOKS TOO EASY

How everyone (who usually sells something) depicts Hadoop complexity

GREAT BIG INFRASTRUCTURE AROUND

SMALL CUTE CORE

YOUR APPLICATION

SAFE and FRIENDLY

How it looks from a real user's point of view

Feeling of something wrong

CORE HADOOP

COMPLETELY

YOUR APPLICATION

FEAR OF

NO BACKUPS

REALITY

Most failures in Hadoop are not about your functionality but about infrastructure. Any testing strategy has to account for this.

REALITY

BIG DATA

Failure is normal case

Failures are normal in Hadoop and happen EVERY DAY. This is a MAJOR difference for software. Do your tests cover severe performance degradation or the disaster recovery procedure?

REALITY

No more isolated testing. You are grooming not only the elephant.

REALITY OF

Infrastructure is really complex. Just here: Hive, Hadoop, Giraph, Tez, Pig, Tomcat, WildFly...

REALITY

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE THEM.

REALITY

VERIFICATION INFRASTRUCTURE

BIGGEST BIG DATA failure IS ...

NO DATA

The same goes for testing infrastructure

STAGING

COMMON GIRL ISSUES

● Everyone needs a staging environment, but who should maintain it?

● Can you drop all data on the staging cluster if another team needs it?

● This is exactly what usually happens in production.

● Grants limited isolation from other teams' work.

● Don't get addicted to this drug! You can miss serious integration issues in production.

● Your environment definitely gets underused this way.

STAGING

TIME DIVISION MULTIPLEXING

STAGING

Try to find something

MULTINODE CLUSTERS

Multinode — hard to investigate issues. Single node — does not cover scalability cases.

STAGING

PHYSICAL CLUSTERS

● Good start: a USED single-chassis server with 4 nodes (1 master + 3 workers), each 2x6 cores, 64G RAM. HDD is up to you. About $5K (2014).

● You can save on SSD and sophisticated I/O. Do not save on CPU. Have a memory upgrade plan.

SUBJECTIVE

VIRTUAL STAGING CLUSTERS

NOT SO REAL ELEPHANT

● If your production cluster is virtual, here you go!

● Public clouds — unclear budget and resources. Great fast start.

● Single node virtual machine for developers — hard to support.

OUR ENVIRONMENT FOR AUTOMATED TESTING

SUREFIRE

Integration testing

Unit testing

TESTING UTILITY

MINICLUSTER

ARTIFACTS!

LOCAL WORKERS

TESTING SEQUENCE

org.apache.hbase hbase-testing-util

org.apache.hadoop hadoop-minicluster

● Everything runs in a SINGLE JVM scope, including the code under test, so you can attach a debugger or profiler, or measure test coverage.

● The environment starts in about 10 seconds. Everything is just a test dependency in Maven. It could actually be used even in the unit testing sequence.

● All services are dynamic, so more than one developer can run tests on a single host. Configuration is dynamic too.
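One common way to let several test environments share a host is to ask the operating system for free ports instead of hard-coding them. The sketch below is written under that assumption only: it reads "dynamic" as "ports chosen at startup", and `allocate_service_ports` is a hypothetical helper, not part of any Hadoop or HBase test utility.

```python
import socket


def allocate_service_ports(service_names, host="127.0.0.1"):
    """Return a {service name: free TCP port} map (hypothetical helper).

    All sockets stay open until every port has been picked, so the OS
    cannot hand out the same port twice within one call.
    """
    sockets = []
    try:
        ports = {}
        for name in service_names:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sockets.append(sock)
            sock.bind((host, 0))  # port 0 asks the OS for any free port
            ports[name] = sock.getsockname()[1]
        return ports
    finally:
        for sock in sockets:
            sock.close()
```

The returned map would then be written into the test configuration before the services start. Another process can still grab a port between allocation and service startup, so a retry on bind failure is worth considering.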

OUR OWN WRAPPER

STEP FORWARD

DEVELOPMENT ENVIRONMENT

LOCAL WORKERS

● Now the mini-cluster is a service. It starts in about 30 seconds. Static service ports. YARN and logging like in a real cluster. But mostly we reuse auto-testing cluster components.

● A developer can use local workers for MapReduce and Spark (single JVM with its code) or can use cluster services close to a REAL cluster.

● HBase Lily indexer, Solr and Hive are started as separate JVMs. Everything is pulled in through Maven dependencies.

MINICLUSTER CORE PREVIOUS SLIDE

SINGLE JVM SEPARATE JVMS

Hadoop: don't do it yourself

STAGING

● Yet we build a real hardware staging cluster.

● We use Cloudera solutions, ready Maven artifacts, exchange experience at conferences and much more.

Unit testing

Integration testing

● Everything outside Big Data is to be checked BEFORE Big Data adds complexity.

● No elephants before the integration test phase. Packages are to be ready. Only simple things like general logic checks can be done before.

● Any test environment is to be created from scratch or at least checked for consistency.

TEST STRATEGY

BETTER TEST YOUR ARCHITECTURE, ONLY THEN IMPLEMENTATION

TESTING IN PRODUCTION

● You cannot avoid testing in production if you do Hadoop.

● It is great if your application can work in the same cluster but with different data.

● Bring security and resource control so your test runs cannot harm production jobs.

● If your solution is non-realtime and has unused time slots, use them.

APPROACH TO PACKAGE AND DEPENDENCY MANAGEMENT

MAKING IT EASIER

Lowering verification efforts

WHY HADOOP TESTING IS SO HARD?

Source of complexity | What to do

Extra work because of unfamiliar environment | Ensure verification engineers have adequate Linux knowledge!

Inadequate environment: with memory overcommitted before the test, you get everything wrong | Provide adequate hardware resources which can reproduce production issues, including scalability ones

Issues come from outside of your tests | Check test pre-conditions!
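The advice to check test pre-conditions can be sketched as a small gate that runs before the suite. This is an illustration under stated assumptions: the function names are hypothetical, swap usage is used here only as a rough proxy for memory overcommitment, and the parser expects Linux `/proc/meminfo`-style text.

```python
import shutil


def enough_free_disk(path: str, min_free_bytes: int) -> bool:
    """True if the file system holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def swap_in_use_kb(meminfo_text: str) -> int:
    """Return SwapTotal - SwapFree (kB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)


def failed_preconditions(checks):
    """checks maps a name to a zero-argument callable returning bool.

    Returns the names of all checks that did not pass, in insertion order.
    """
    return [name for name, check in checks.items() if not check()]
```

A test run would build the `checks` mapping once (for example free disk on the data directory and zero swap in use) and refuse to start if `failed_preconditions` returns anything, so that an unhealthy environment is reported as such rather than as a product defect.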

● QA should monitor code quality metrics, not only developers.

● Project size metrics matter. Track the correlation between lines of code, the number of comment lines and uncovered lines.

● Branch coverage matters a great deal in scalable solutions. Think about statement coverage if you use Scala.
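Tracking correlation between size metrics can be as simple as computing a Pearson coefficient over per-build values. The sketch below assumes the metric series have already been exported as plain lists of numbers; how they are exported is outside its scope, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Called with lines of code per build as `xs` and uncovered lines per build as `ys`, a value drifting toward 1 would suggest that new code is arriving without tests.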

We do it with

MEASURE AND PUSH YOUR CODE QUALITY TO GET SOLUTION QUALITY

● Force developers to reuse approaches, solutions, infrastructure, components. The probability of finding something unexpected in an already tested, reused approach is much lower.

● Force QA to automate test processes, environment setup, release engineering. Consider automatic code validation before integration.

PUT EFFORTS TO LOWER EFFORTS

AUTO-GROOMING… at least some steps

Unit testing

Integration testing

Development in private branch: Developer works in a private branch. Builds are local with all possible checks.

Integration branch push: After review, the change is pushed into a remote integration branch.

Continuous integration environment: Integration branches are monitored and locally merged to master. Once integration passes, the merged branch is pushed to master.

MASTER build: MASTER gets built with all checks. Metrics go to SonarQube, results are published.

Further release engineering

NEVER!

WHEN TO STOP VERIFICATION?

No surprise.

SCALING QUALITY

Grooming growing elephant.

UNIFIED HADOOP DOES NOT EXIST

● You start with 4 virtual machines inside an AMD 4 cores / 16G desktop to try things out.

● Then you start to buy i5 4 cores / 32G desktops to build something working.

● Then you start adding E5 12 cores / 48G servers to get results.

● As you grow you 100% go heterogeneous with better hardware. A 'partial' failure on a unified cluster puts you in this state too.

● So think about heterogeneous RIGHT FROM THE START. Both from design and testing point of view.

Every cluster is different. Add scale configuration to your tests and client configuration

1x node development mini-cluster

Large production cluster

4x nodes staging cluster

CLUSTERS DIFFER
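Adding scale configuration to tests could look like the sketch below: each target cluster gets a small profile, and test inputs are sized from it instead of being hard-coded. The class and function names are hypothetical. The staging values follow the 4-node chassis described earlier (3 workers, 2x6 cores each); the development core count is an assumption, and a production profile would come from the real inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterProfile:
    """Scale facts a test needs to know about its target cluster."""
    name: str
    worker_nodes: int
    cores_per_node: int

    @property
    def total_cores(self) -> int:
        return self.worker_nodes * self.cores_per_node


# Illustrative values only; see the assumptions in the text above.
PROFILES = {
    "dev": ClusterProfile("1x node development mini-cluster", 1, 4),
    "staging": ClusterProfile("4x nodes staging cluster", 3, 12),
}


def scaled_row_count(profile: ClusterProfile, rows_per_core: int) -> int:
    """Size the test input so every core gets the same amount of work."""
    return profile.total_cores * rows_per_core
```

The same test can then run against any cluster by selecting a profile, which keeps scale assumptions visible in one place rather than scattered across test data.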

PERFORMANCE

Do you know your enemy?

● Majors: memory, CPU, I/O, network. QA MUST detect resource usage skews.

● CPU is the easiest to see. Always balance between optimization and new hardware. Most important for QA is to understand how it is used.

● Memory usage can be hard to understand. Avoid swapping almost at all costs. The global trend is that this resource is vital.

● HDDs are cheap. Just buy more space if you need it.

● Network comes last but is hard to tame. If your bottleneck is the switch, replace it. It is hard to upgrade the channel for every node, though, so here QA should track architecture scalability.
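Detecting resource usage skew can start from a very simple rule: compare each node against the cluster median. The sketch below assumes per-node usage numbers (CPU percent, memory, I/O wait, network throughput) have already been collected by some monitoring tool; `skewed_nodes` and the 1.5 threshold are illustrative choices, not a recommended standard.

```python
from statistics import median


def skewed_nodes(usage_by_node, factor=1.5):
    """Return node names whose usage exceeds `factor` times the median.

    usage_by_node maps a node name to one numeric usage value.
    """
    if not usage_by_node:
        return []
    typical = median(usage_by_node.values())
    if typical <= 0:
        return []
    return sorted(
        node for node, usage in usage_by_node.items()
        if usage > factor * typical
    )
```

Running this per resource after each load test gives QA a cheap early signal that work is not spreading evenly across the cluster.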

Give your verification team access to production! They must detect issues before they are reported by the support team.

Establish and constantly monitor data quality metrics and resource usage in production.

Be proactive. Explain the unknown before it gets you into trouble.

TESTING FOREVER

I WANT BETTER ELEPHANT

CURRENT GAP

● Resolving resource bottlenecks is a manual process. No silver bullet.

● Much easier if you can reproduce it on a single node or within a single VM.

CURRENT GAP

● Operational mistakes happen and it is really hard to handle them through verification.

● Usually it can be handled by better design. So prefer to test your approaches and architecture, not just implementation.

Questions and discussion
