Hadoop application architectures

Hadoop Ecosystem Architectures

BigData + Oracle/SQL Server Databases

Summary from Absolute SW slides

BigData Failures

>50% of Hadoop initiatives fail. Why?

Start: Assume Hadoop replaces a database and the DB apps

Progression: Assume Hadoop supplements the DB and is not a complete replacement.

Some of the batch jobs can migrate to Hadoop

This may solve the problem of having to pay the next round of licensing fees for the next higher step up in DB capacity

Most of these initiatives still fail. Why?

Hadoop/DB Migrations

Takes too long to migrate the DB schema to Hadoop for the longer batch queries. Too long => increased cost => :(

Vendor training is not adequate to get business logic implemented in an API on top of Hadoop quickly (tools, e.g. Sqoop)

For devops/production/customization: confusion over which components to use; workflows w/Oozie; Pig+UDFs or Spark or Hive+UDFs; HBase

Fix: Use REST APIs/services + Hadoop MR + Spark Shell; training

What is a better strategy?

Besides going all in with Hadoop and buying the Cloudera/MapR/Hortonworks sales pitch; what is missing?

Goal: quickly establish a user base; not 2 years. ~6 months

Mix REST services with Hadoop/HDFS. Tableau is one example; better to custom develop

Start w/ open-source Hadoop; not CM or Ambari

Build the source; learn to apply the patches to Jira bugs (used to be important). Drives understanding of internals for configuration, skills for production

Open Source strategy

Normally takes 1-2y

Training reduces time from POC to deployment to 6 months for first use case

Training on both REST services to establish a corporate agile strategy/template with Hadoop takes years to develop.

Different than Hadoop vendor training for implementing business logic

Covers REST examples w/Spring and/or Guice and building the source, removing the unnecessary components to keep the code base small; adding integration tests specific to a customer deployment using iTest; puppet scripts and how to deploy from a single source tree using Jenkins

Use case: DB Queries

Misconception: replacing DB queries in a complex schema with Hadoop Hive/Pig/Spark queries as a strategy

Develop REST BE/FE template/skills (<1h implementation). Can deploy w/HDFS (w/wo indexes) queries. Why?

• Faster perf, less code to do the same thing, less admin; lower cost at small scale. REST services are closer to a db than Hadoop. :) users

With training, REST services take 1h-1 day to build.

Hadoop impediments: having to provision a cluster, understanding what the XML files do, running benchmarks, configuring Kerberos, setting ACLs, versioning data, testing backup and recovery strategies, testing auditing, etc.
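
To give a sense of how small a query-style REST service can be, here is a minimal sketch. It is not the stack the slides recommend (Jetty, Guice, Maven in Java); it uses only the Python standard library so it stays self-contained. The `ROWS` sample data, the `/rows` route, and the helper names `query_rows` and `start_server` are all hypothetical, and the in-memory list stands in for whatever store (HDFS files, a DB) a real service would read.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical sample data; a real service would read from HDFS or a DB.
ROWS = [
    {"id": 1, "dept": "sales", "amount": 120},
    {"id": 2, "dept": "ops", "amount": 75},
    {"id": 3, "dept": "sales", "amount": 30},
]


def query_rows(rows, filters):
    """Return rows whose fields equal every filter value (compared as strings)."""
    return [r for r in rows if all(str(r.get(k)) == v for k, v in filters.items())]


class RowsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/rows":
            self.send_error(404, "unknown resource")
            return
        # parse_qs returns a list per parameter; keep the first value of each.
        filters = {k: v[0] for k, v in parse_qs(url.query).items()}
        body = json.dumps(query_rows(ROWS, filters)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging to keep the example quiet


def start_server(port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), RowsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    conn = HTTPConnection("127.0.0.1", server.server_address[1])
    conn.request("GET", "/rows?dept=sales")
    print(json.loads(conn.getresponse().read()))
    conn.close()
    server.shutdown()
    server.server_close()
```

None of the cluster-side impediments listed above (provisioning, XML configuration, Kerberos, ACLs) appear here, which is the contrast the slide draws at small scale.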

REST + Hadoop

Successful deployments contain a mix of homegrown services + Hadoop components

Training to develop REST services quickly

No Spring, no J2EE, no Glassfish, no complex s/w with millions of lines of code.

DI with Google Guice; Maven; Jetty; FE using jQuery or Twitter Bootstrap. Keep the BE and FE simple first before looking at web frameworks like Play, Django, Ruby on Rails, node.js, etc.

Training materials: no Guice, w/Guice

Package REST services with Hadoop distro using the Bigtop Skills

https://github.com/dougc333/jettyj2ee

https://github.com/dougc333/DistServersPOC

Back to Hadoop

K/V storage; why?

Add nodes to scale out horizontally; i.e. need more memory to handle more data <=> more DB rows (problem/soln)

M/R spills to disk; speeding up data reads is OK, but M/R is still a problem; Spark/Scala in-memory computation w/KV store
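
The "add nodes to scale out horizontally" point can be illustrated with a small sketch of hash partitioning: each key is assigned to a node by hashing, so adding nodes lowers the number of keys (and thus the memory) each node has to hold. This is a simplified illustration in Python, not how any particular store does it; `node_for` and `keys_per_node` are hypothetical names, and plain modulo placement is an assumption chosen for brevity (it reshuffles most keys when the node count changes, which real stores avoid with techniques such as range splitting or consistent hashing).

```python
import hashlib
from collections import Counter


def node_for(key, num_nodes):
    """Map a key to a node index using a stable hash.

    md5 is used only because it is deterministic across runs;
    Python's built-in hash() is salted per process for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def keys_per_node(keys, num_nodes):
    """Count how many keys land on each node."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


if __name__ == "__main__":
    keys = [f"row-{i}" for i in range(10_000)]
    for n in (2, 4, 8):
        loads = keys_per_node(keys, n)
        print(n, "nodes -> max keys on one node:", max(loads))
```

Running it shows the per-node load shrinking as the node count grows, which is the "more rows, more memory, more nodes" relationship in the slide.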

Building a data repository: customize the CDK to reflect the schemas. Productionize using Guice. Spring too rigid; not morphlines (like sed)

Hive/Pig/Oozie/Sqoop

Departments pick their own tools/approach based on the problem description

HTTPFS isn't an API

Add REST API

Hive/Pig slow to develop. Developing UDFs takes time; production code is hard to maintain/modify when buried behind the production firewall

Better with beeline add jar

Scala/Spark

Some parts of Scala/Spark not parallelizable

Parallelize over threads in ExecutionContext vs. workers in separate JVMs

Takes 3x to get something right for users

1) Learning; everything new (vendor training good)

2) Know what is important for your own use case; focus time on soln here; code is different than first time, e.g. Scala teaching

3) Now know what the problem definition is and probably what the best soln is; can focus on execution and making the service fast and usable
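
The Scala/Spark slide's contrast between threads in one ExecutionContext and workers in separate JVMs can be sketched by analogy. The example below is Python, not Scala: a `ThreadPoolExecutor` stands in for a thread pool inside one process, and separately launched interpreters stand in for workers in separate JVMs. All function names are hypothetical. The point is the data path rather than speed: threads receive their chunks by reference, while separate processes share nothing, so every chunk must be serialized in and every result serialized out. (In standard CPython the global interpreter lock also means the thread version will not speed up this CPU-bound loop.)

```python
import json
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def square_sum(chunk):
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run_in_threads(data, parts):
    """Threads share the parent's memory: chunks are handed over by reference."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(square_sum, split(data, parts)))


# Program text run by each separate worker interpreter.
WORKER_SOURCE = (
    "import json, sys\n"
    "chunk = json.load(sys.stdin)\n"
    "print(json.dumps(sum(x * x for x in chunk)))\n"
)


def run_chunk_in_subprocess(chunk):
    """Ship one chunk to a fresh interpreter as JSON and read the result back."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(chunk),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def run_in_processes(data, parts):
    """Separate processes share nothing; threads here only wait on the children."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(run_chunk_in_subprocess, split(data, parts)))


if __name__ == "__main__":
    data = list(range(1000))
    print(run_in_threads(data, 4), run_in_processes(data, 4))
```

Both functions return the same total; what differs is the startup and serialization cost paid per worker in the process version, which is one reason moving from in-process threads to distributed workers does not automatically make a job faster.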

Analytics Use case: Model building

Models take a long time to build. Example: Random Forest

4h on 8GB MacBook (~2010; R)

4h on AWS Large instance (R)

16h (Mahout; not same impl as R) on M/R in AWS cluster on 4 nodes. More nodes not faster

Soln: Distributed + MultiTenant. Not Mahout

http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778
