Post on 18-Jul-2015
2
First ever worldDATA OS
10.000 nodes computer... Recent technology changes are focused on higher scale, better resource control, lower latency, higher security and fault tolerance.
4
OPEN SOURCE framework for big data. Both distributed storage and processing.
Provides RELIABILITY and fault tolerance BECAUSE OF SOFTWARE design. Example — File system as replication factor 3 as default one.
Unique horisontal scalability from single computer up to thousands of nodes.
8
TODAY WE FOCUS ON QUALITY
Billions of medical records processed
every day
10s of millions medical histories
11
How everyone (who usually sells something) depicts
Hadoop complexity
GREAT BIG INFRASTRUCTURE AROUND
SMALL CUTE CORE
YOUR APPLICATION
SAFE and FRIENDLY
12
How it looks from the real user point of view
Feeling of something wrong
CORE HADOOPC
OM
PLETELY
UN
KN
OW
N
INFR
AS
TR
UC
TU
RE
SO
METH
ING
YO
U
UN
DER
STA
ND
YOUR APPLICATION
FEAR OF
14
Most of failures in Hadoop are not about your functionality but about infrastructure. Any testing strategies are to account it.
REALITY
15
BIG DATA
BIG DATA
BIG DATA
BIG DATA
BIG DATA
BIG DATA
BIG DATA
BIG DATA
BIG DATA
Failure is normal case
Failures are normal in Hadoop and happen EVERY DAY. This is MAJOR difference for software. Do your test cover severe performance degradation or disaster recovery procedure?
REALITY
16
No more isolated testing. You are geooming not only elephant
REALITY OF
Infrastructure is really complex. Just here: Hive, Hadoop, Giraf, Tez, Pig, Tomcat, Wildfly...REALITY
18
VERIFICATION INFRASTRUCTURE
BIGGEST BIG DATA failure IS ...
NO DATAThe same about testing infrastructure
19
STAGINGCOMMON GIRL ISSUES
● All people need staging environment but who should perform maintenance?
● Can you drop all data on staging cluster if other team need it?
● This is exactly what usually happens in production.
20
● Grants limited isolation from other teams work.
● Don't addict to this drug! You can miss serious integration issues in production.
● Your environment definitely gets underused this way.
STAGINGTIME DIVISION MULTIPLEXING
21
STAGING
Try to find something
here
MULTINODE CLUSTERS
Multinode — hard to investigate issues. Single node — does not cover scalability cases.
22
STAGINGPHYSICAL CLUSTERS
● Good start: USED single chassis server: 4 nodes (1 master + 3 workers), each 2x6 cores, 64G RAM. HDD is up to you. About $5K (2014).
● You can save on SSD and siphisticated I/O. Do not save on CPU. Have memory upgrade plan.
23
SUBJECTIVE
VIRTUAL STAGING
CLUSTERSNOT SO REAL ELEPHANT
● If you production cluster is virtual — here you go!
● Public clouds — unclear budget and resources. Great fast start.
● Single node virtual machine for developers — hard to support.
24
OUR ENVIRONMENT FOR AUTOMATED TESTING
SUREFIRE
Integration testing
Unit testing
TESTING UTILITY
MINICLUSTER
ARTIFACTS!
LOCAL WORKERS
TESTING SEQUENCE
org.apache.hbase hbase-testing-util
org.apache.hadoop hadoop-mnicluster
● Everything is in SINGLE JVM scope including code under test so you can attach debugger, profiler or measure test coverage.
● Environment starts in about 10 seconds. Everything is just test dependency in maven. Actually could be used even in unit testing sequence.
● All services are dynamic so more than one developer can run tests on single host. Configuration is dynamic.
OUR OWN WRAPPER
25
STEP FORWARDDEVELOPMENT ENVIRONMENT
LOCAL WORKERS
● Now mini-cluster is service. Starts in about 30 seconds. Static service ports. YARN and logging like in real cluster. But mostly we reuse auto-testing cluster components.
● Developer can use local workers for MapReduce and Spark (single JVM with its code) or can use cluster services close to REAL cluster.
● Hbase Lily indexer, SOLR and Hive are started as separate JVM. Everything is taken through Maven dependency.
MINICLUSTER CORE PREVIOUS SLIDE
SINGLE JVM SEPARATE JVMS
26
Hadoop: don't do it yourself
STAGING● Yet we build real hardware staring cluster.
● We use Cloudera solutions, ready Maven artifacts, exchange experience on conferences and much more.
27
Unit testing
Integration testing
● Everything outside Big Data is to be checked BEFORE Big Data adds complexity.
● No elephants before integratino test phase. Packages are to be ready. Only simple things like general logic checks can be done before.
● Any test environment is to be created from scratch or at least checked for consistency.
TEST STRATEGY
29
TESTING IN PRODUCTION
● You cannot avoid testing in production if you do Hadoop.
● It is bright if your application can work into the same cluster but with different data.
● Bring security and resource control so your test runs cannot harm production jobs.
● If your solution is non-realtime and have unused time slots, use them.
32
WHY HADOOP TESTING IS SO HARD?
Source of complexity What to do
Extra work because of unfamiliar environment
Assure verification engineers have adequate Linux knowledge!
Inadequate environment. Having memory overcommitment before test you get everything wrong
Provide adequate hardware resources which can reproduce production issues including scalability ones
Issues come from outside of your tests
Check test pre-conditions!
33
● QA are to monitor code quality metrics. Not only developers.
● Project size metrics mater. Track correlation between lines of code number of comment lines and not covered lines.
● Branches coverage matters extremely in scalable solutions. Think about statement coverage if you use Scala.
We do it with
MEASURE AND PUSH YOUR CODE QUALITY TO GET SOLUTION QUALITY
34
● Force developers to reuse approaches, solutions, infrastructure, components. Probability to find something unexpected in already tested reused approach is much lower.
● Force QA to automate test processes, environment setup, release engineering. Consider automatice code validation before integration.
PUT EFFORTS TO LOWER EFFORTS
35
AUTO-GROOMING… at least some steps
Unit testing
Integration testing
MASTER gets built with all checks.
Metrics go to SonarQube, results are published
MASTER build
After review change is pushed into remote integration branch
Integration branches are monitored, locally merged to master and
built
Developer works in private branch.
Builds are local with all possible checks.
Integration
branch push
Continuous integration environment
Development in private branchR
eview
inte
grat
ion
build
Further release engineering
On integration passed merged branch is pushed to master
38
UNIFIED HADOOP DOES NOT EXISTS
● Started with 4 virtual machines inside AMD 4 cores / 16G desktop to try.
● Then you start to buy i5 4 cores / 32G desktops to build something working.
● Then you start adding E5 12 cores / 48G servers to get results.
● As you grow you 100% go heterogeneous with better hardware. 'Partial' failure on unified cluster puts you in this state.
● So think about heterogeneous RIGHT FROM THE START. Both from design and testing point of view.
39
Every cluster is different. Add scale configuration to your tests and client configuration
1x node development mini-cluster
Large producion cluster
4x nodes staging cluster
CLUSTERS DIFFER
40
PERFORMANCEDo you know your enemy?
● Majors: memory, CPU, I/O, network. QA MUST detect resource usage skews.
● CPU is most easy to see. Always balance between optimization and new hardware. Most important for QA is to understand how it is used.
● Memory usage can be hard to understand. Avoid swapping almost at all costs. Global trend this resource is vital.
● HDD are cheap. Just buy more space if you need.
● Network comes last but is hard to tame. If your bottleneck is switch, replace it but it's hard to upgrade channel for every node so here QA should track architecture scalability.
41
Give your verifiaction team access to production! They must detect issues before they are reported by support team.
Establish and constantly monitor data quality metrics and resource usage on production.
Be proactive. Explain unknown before it puts you into troubles.
TESTING FOREVER
43
CURRENT GAP● Resource bottlenecks resolution is
manual process. No silver bullet.
● Much more easy if you can reproduce it on single node or in single VM scope.
44
CURRENT GAP
● Operational mistakes happen and it is really hard to handle them by verification.
● Usually it can be handled by better design. So prefer to test your approaches and architecture, not just implementation.