Hadoop application architectures
Uploaded by doug-chang
Hadoop Ecosystem Architectures
BigData + Oracle/SQL Server Databases
Summary from Absolute SW slides
BigData Failures
• >50% of Hadoop initiatives fail. Why?
• Start: assume Hadoop replaces a database and the DB apps.
• Progression: assume Hadoop supplements the DB and is not a complete replacement.
  • Some of the batch jobs can migrate to Hadoop.
  • This may solve the problem of having to pay the next round of licensing fees for the next higher step up in DB capacity.
• Most of these initiatives still fail. Why?
Hadoop/DB Migrations
• Takes too long to migrate the DB schema to Hadoop for the longer batch queries. Too long => increased cost => :(
• Vendor training is not adequate to get business logic implemented in an API on top of Hadoop quickly (tools, e.g. Sqoop).
  • For devops/production/customization
• Confusion over which components to use: workflows w/Oozie; Pig+UDFs or Spark or Hive+UDFs; HBase
• Fix: use REST APIs/services + Hadoop MR + Spark Shell; training
What is a better strategy?
• Besides going all in with Hadoop and buying the Cloudera/MapR/Hortonworks sales pitch, what is missing?
• Goal: quickly establish a user base in ~6 months, not 2 years.
• Mix REST services with Hadoop/HDFS. Tableau is one example; better to custom develop.
• Start w/ open source Hadoop, not CM or Ambari; build the source; learn to apply the patches to Jira bugs (used to be important).
• Drives understanding of internals for configuration and skills for production.
Open Source strategy
• Normally takes 1-2y
• Training reduces time from POC to deployment to 6 months for first use case
• Training on both REST services to establish a corporate agile strategy/template with Hadoop takes years to develop.
• Different than Hadoop vendor training for implementing business logic
• Covers:
  • REST examples w/Spring and/or Guice
  • building the source, removing the unnecessary components to keep the code base small
  • adding integration tests specific to a customer deployment using iTest
  • Puppet scripts and how to deploy from a single source tree using Jenkins
Use case: DB Queries
• Misconception: replacing DB queries in a complex schema with Hadoop Hive/Pig/Spark queries as a strategy
• Develop REST BE/FE template/skills (<1h implementation). Can deploy w/HDFS (w/wo indexes) queries. Why?
  • Faster perf, less code to do the same thing, less admin; lower cost at small scale. REST services are closer to a DB than Hadoop. :) users
• With training, REST services take 1h-1 day to build.
• Hadoop impediments: having to provision a cluster, understanding what the XML files do, running benchmarks, configuring Kerberos, setting ACLs, versioning data, testing backup and recovery strategies, testing auditing, etc.
REST + Hadoop
• Successful deployments contain a mix of homegrown services + Hadoop components
• Training to develop REST services quickly
  • No Spring, no J2EE, no Glassfish, no complex s/w with millions of lines of code.
  • DI with Google Guice; Maven; Jetty; FE using jQuery or Twitter Bootstrap. Keep the BE and FE simple first before looking at web frameworks like Play, Django, Ruby, node.js, etc.
  • Training materials: no Guice, w/Guice
• Package REST services with Hadoop distro using the Bigtop skills
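A minimal sketch of the kind of small REST back end this slide argues for, in plain Java with no framework. It is not the training material itself: the class name `LookupService`, the `RecordStore` interface, the `/records/{id}` route, and the sample row are all hypothetical, and the JDK's built-in `com.sun.net.httpserver.HttpServer` stands in for Jetty so the example needs no dependencies. The store is handed in through the constructor, which is roughly the "no Guice" form of DI.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class LookupService {
    /** Hypothetical storage abstraction; a real one might read files from HDFS. */
    interface RecordStore {
        String find(String id); // returns null when the id is unknown
    }

    private final RecordStore store;

    // Manual constructor injection: the "no Guice" variant of DI.
    LookupService(RecordStore store) {
        this.store = store;
    }

    /** Builds the JSON body. No escaping is done, so ids and values must be plain text. */
    static String toJson(String id, String value) {
        return value == null
                ? "{\"error\":\"not found\"}"
                : "{\"id\":\"" + id + "\",\"value\":\"" + value + "\"}";
    }

    /** Starts an HTTP server that answers requests to /records/{id}; port 0 picks a free port. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/records/", exchange -> {
            String id = exchange.getRequestURI().getPath().substring("/records/".length());
            String value = store.find(id);
            byte[] body = toJson(id, value).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(value == null ? 404 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> rows = Map.of("42", "example-row"); // stand-in data
        HttpServer server = new LookupService(rows::get).start(0);
        int port = server.getAddress().getPort();
        // Call the service once, print the response, then shut down.
        try (InputStream in = URI.create("http://localhost:" + port + "/records/42").toURL().openStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            server.stop(0);
        }
    }
}
```

In a "w/Guice" version, a module would presumably bind `RecordStore` to a concrete implementation and the constructor would carry `@Inject`; the handler code itself would not need to change.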
Back to Hadoop
• K/V storage; why?
  • Add nodes to scale out horizontally; i.e. need more memory to handle more data <=> more DB rows problem/soln
• M/R spills to disk; speeding up data reads is OK but M/R is still a problem; Spark/Scala in-memory computation w/KV store
• Building a data repository: customize the CDK to reflect the schemas. Productionize using Guice. Spring too rigid, not morphlines (like sed)
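To make the "add nodes to scale out horizontally" point concrete, here is a toy sketch of hash-partitioning keys across nodes. Everything in it is an illustrative assumption: `PartitionedStore` is a hypothetical class, each "node" is just an in-memory `HashMap`, and simple modulo placement is used only because it is short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedStore {
    // Each map stands in for the memory of one node (assumption for illustration).
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    /** Picks a node from the key's hash; floorMod keeps the index non-negative. */
    int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    void put(String key, String value) {
        nodes.get(nodeFor(key)).put(key, value);
    }

    String get(String key) {
        return nodes.get(nodeFor(key)).get(key);
    }

    /** Number of rows held by the most heavily loaded node. */
    int largestPartition() {
        int max = 0;
        for (Map<String, String> node : nodes) {
            max = Math.max(max, node.size());
        }
        return max;
    }

    public static void main(String[] args) {
        for (int nodeCount : new int[] {1, 2, 4}) {
            PartitionedStore store = new PartitionedStore(nodeCount);
            for (int i = 0; i < 100_000; i++) {
                store.put("row-" + i, "v" + i);
            }
            System.out.println(nodeCount + " node(s): largest partition holds "
                    + store.largestPartition() + " rows");
        }
    }
}
```

The same rows need less memory per node as nodes are added. Modulo placement remaps most keys whenever the node count changes, so real stores use more careful schemes; HBase, for example, splits row-key ranges into regions.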
Hive/Pig/Oozie/Sqoop
• Departments pick their own tools/approach based on the problem description
• HTTPFS isn't an API. Add a REST API.
• Hive/Pig slow to develop. Developing UDFs takes time; production code is hard to maintain/modify when buried behind the production firewall. Better with beeline add jar.
Scala/Spark
• Some parts of Scala/Spark not parallelizable
• Parallelize over threads in ExecutionContext vs. workers in separate JVMs
• Takes 3x to get something right for users
  1) Learning; everything is new (vendor training good)
  2) Know what is important for your own use case; focus time on the soln here; code is different from the first time, e.g. Scala teaching
  3) Now know what the problem definition is and probably what the best soln is; can focus on execution and making the service fast and usable
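The "threads in ExecutionContext vs. workers in separate JVMs" bullet can be illustrated from the single-JVM side. The sketch below uses a Java `ExecutorService` as a stand-in for a Scala `ExecutionContext` (Scala can wrap one via `ExecutionContext.fromExecutorService`); the class `ThreadParallelSum` and the sum-of-squares workload are hypothetical, chosen only because the range splits cleanly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadParallelSum {
    /** Sums the squares of 1..n by splitting the range across a fixed thread pool. */
    static long sumOfSquares(int n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + threads - 1) / threads; // ceiling division
            for (int t = 0; t < threads; t++) {
                final int from = t * chunk + 1;
                final int to = Math.min(n, (t + 1) * chunk);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i * i;
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000_000, 4));
    }
}
```

All threads here share one heap, so the chunks are handed over without copying. Workers in separate JVMs, as in a Spark cluster, need tasks and data serialized and shipped between processes, which is one reason the two approaches behave differently.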
Analytics Use case: Model building
• Models take a long time to build. Example: Random Forest
  • 4h on 8GB MacBook (~2010; R)
  • 4h on AWS Large instance (R)
  • 16h (Mahout; not same impl as R) on M/R in AWS cluster on 4 nodes. More not faster
• Soln: Distributed + MultiTenant. Not Mahout