Strata+Hadoop NY 2015: Use case examples of building applications on Hadoop with CDAP
Uploaded by: cask-data-inc
PROPRIETARY & CONFIDENTIAL
Why Cask?
@jgrayla
SIMPLE ACCESS TO POWERFUL TECHNOLOGY
Cask’s goal is to enable every developer and enterprise to
quickly and easily build and run modern data applications
using open source big data technologies like Hadoop
Introduction to Data Applications
Data Applications, also known as Operational Analytics, integrate analytics into applications in order to drive actionable intelligence, turning insights into action.
Apps that utilize data insights to enhance the customer experience, achieve a business objective, improve a business process or enable new products or lines of business.
The Path to Data Apps
ENTERPRISE DATA LAKES
• Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume
• Batch and Streaming ETL: Code-free self-service creation and management of pipelines
• SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs
• Data as a Service: Easily expose generic or custom REST APIs on any data
BIG DATA ANALYTICS
• 360° Customer View: Integrate data from any source and expose through queries and APIs
• Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs
• Time Series Analysis: Store, process and serve massive volumes of time-series data
• Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events
PRODUCTION DATA APPS
• Recommendation Engines: Build models in batch using historical data and serve them in realtime
• Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data
• NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time
• Internet of Things: Ingestion, storage and processing of events that is highly-available, scalable and consistent
Data App Customer Examples
E-Commerce: Mature company with existing dependence on SaaS services and legacy apps, moving to a Data Lake and custom Apps
Enterprise: Large global enterprise with multiple Lakes and Apps attempting to centralize platform and extend to LoB
SaaS: Technically advanced company with product teams focused on minimizing the time-to-market for Apps
Data App in SaaS
Deep Technical Talent with Product Focus
Reduce time-to-market
NRT Event Monitoring
High-throughput, real-time event ingestion
Consistent processing of events with persistence
Scalable and guaranteed NRT event processing
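As a rough illustration of what "perform defined actions within a specified time" can mean in NRT event monitoring, here is a minimal single-process sketch. It is not CDAP code: the `WindowMonitor` class, its parameters and the event names are hypothetical, and it assumes events for a given key arrive in timestamp order.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class WindowMonitor:
    """Hypothetical monitor: fires `action` when one key sees
    `threshold` events within `window_s` seconds."""

    def __init__(self, window_s: float, threshold: int,
                 action: Callable[[str, int], None]) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self.action = action
        self._seen: Dict[str, Deque[float]] = {}

    def observe(self, key: str, ts: float) -> None:
        times = self._seen.setdefault(key, deque())
        times.append(ts)
        # Drop timestamps that have fallen out of the time window.
        while times and ts - times[0] > self.window_s:
            times.popleft()
        if len(times) >= self.threshold:
            self.action(key, len(times))
            times.clear()  # reset so one burst triggers one action


alerts: List[str] = []
monitor = WindowMonitor(window_s=10.0, threshold=3,
                        action=lambda key, n: alerts.append(f"{key}:{n}"))

# (event key, timestamp in seconds); illustrative data only.
for key, ts in [("login_failed", 0.0), ("login_failed", 4.0),
                ("checkout", 5.0), ("login_failed", 9.0),
                ("login_failed", 30.0)]:
    monitor.observe(key, ts)

print(alerts)
```

A production system would also need the persistence and scale-out guarantees listed on this slide, which an in-memory sketch like this does not provide.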
Data App in E-Commerce
Unfamiliar Technical Talent with Shift to DIY
Address skill gaps
Consumer Intelligence
Ingestion of multiple customer data sources
Real-time analytics and machine learning
Targeting and optimization for consumer web
Data App in the Enterprise
Highly variable talent with organizational gaps and executive pressure
Provide reference architecture and abstractions
Enterprise Data Hub
Centralized data storage & management in Data Lakes
Tools and APIs that extend data into lines of business
Multi-tenant PaaS for governance and accessibility
Big Data Application Example
Big Data Technology Explosion
2006: Core Hadoop (HDFS, MR)
2008: HBase, ZooKeeper, Core Hadoop
2009: Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
2010: Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
2011: Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
2012: Spark, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
Present: Sentry, Tez, Parquet, Ranger, Knox, Spark, YARN, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
Hadoop Alone = 47+ projects across 6 distros — Merv Adrian, Gartner Analyst
Data App Challenges
E-Commerce
• Emphasis and time spent drastically skewed towards ops and integration logic
• Difficult for existing Java devs to use APIs like HBase and build apps without dev lifecycle tools
• Significant time and effort to bring applications from development into production
Enterprise
• Lack of standards and best practices leads to divergent architectures across org
• Gaps between IT and LoB make it difficult to cooperate and separate areas of concern
• Multiple OSS projects and Hadoop distributions make ops and governance a challenge
SaaS
• Real-time, scale-out data ingestion into HDFS and HBase means building infrastructure
• Difficult to provide guarantees such as high-availability and data consistency
• Tight coupling of ingestion pipeline and processing pipeline creates operational complexity
The Power of Abstraction
Hadoop is a diverse collection of open source infrastructure projects, and the universe of open source infrastructure continues to expand and accelerate.
This stuff is already too hard
There is a dire need for abstraction, to provide standardization and encapsulation,
to enable accessibility, flexibility and reusability.
Analogy of Hadoop to RDBMS
[Chart: Market Size over time, 1970 to 2030. Annotations: RDBMS Created; BI Market; $30-$40bn. Layers: Infrastructure, Middleware & Management, Applications. Value accrual over time, marked $ to $$$.]
Analogy of Hadoop to RDBMS
[Chart: Market Maturity over time, 2007 to 2016. Stages: Hadoop Created; Pioneers Adopt and extend Hadoop; Early Adopters Adopt; Fast Followers.]
Still early days…
CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and Running Data Applications on Hadoop
Integrates the Latest Big Data Technologies
Supports All Major Hadoop Distributions
Fully Open Source and Highly Extensible
Key Features
CASK DATA APPLICATION PLATFORM
Infrastructure INTEGRATION: Provide an integrated product experience with out-of-the-box capabilities
Architecture STANDARDS: Define a reference architecture to standardize support for mixed infrastructure
Programming ABSTRACTIONS: Utilize abstraction layers to encapsulate complex patterns and insulate developers
Production SERVICES: Provide development tools and runtime services to enable production apps and data
• Application Container Architecture
• Reusable Programming Abstractions
• Global User and Machine Metadata
Applications
Programs: MapReduce, Spark, Tigon, Workflow, Service, Worker
Datasets: Table, Avro, Parquet, Timeseries, OLAP Cube, Geospatial, ObjectStore
Metadata
Self-Service Ingestion and ETL for Hadoop Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
DISCOVER data using user and machine generated metadata
INGEST any data from any source in real-time and batch
BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS any data to any destination in real-time and batch
Data Apps on CDAP
E-Commerce
• CDAP is an integrated platform that accelerates onboarding and keeps focus on business logic
• Dataset Patterns and App Templates provide higher-level, domain-specific APIs
• Tools and runtime services enable a true dev lifecycle and faster time to production
Enterprise
• CDAP defines a set of standards and embeds best practices to drive a common architecture
• Tools, reusable abstractions and self-service interfaces enable non-IT users to access Hadoop
• CDAP provides portability across many versions of major distros and is going beyond Hadoop
SaaS
• CDAP Streams capability provides scale-out, highly available real-time data ingest
• Tephra transaction engine integrates with Datasets and Programs for data consistency
• Streams separate ingestion from processing and combine support for real-time, batch and SQL
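The idea of separating ingestion from processing can be sketched in a few lines. This is only an illustration of the general pattern (an append-only log with per-consumer read offsets), not the CDAP Streams API: the `Stream` class, its methods and the consumer names are hypothetical, and the sketch ignores persistence, partitioning and transactions.

```python
from typing import Dict, List


class Stream:
    """Hypothetical append-only event log; each consumer keeps its own offset."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        self._offsets: Dict[str, int] = {}

    def write(self, event: dict) -> None:
        # Ingestion only appends; it never waits on any consumer.
        self._events.append(event)

    def read(self, consumer: str, max_events: int) -> List[dict]:
        # Each consumer resumes from where it last stopped.
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


stream = Stream()
for amount in (5, 7, 11):
    stream.write({"amount": amount})

# A realtime consumer reads one event at a time...
first = stream.read("realtime", max_events=1)
# ...while a batch consumer independently reads everything written so far.
batch_total = sum(e["amount"] for e in stream.read("batch", max_events=100))

# New writes do not disturb either consumer's position.
stream.write({"amount": 2})
rest = stream.read("realtime", max_events=10)

print(first, batch_total, [e["amount"] for e in rest])
```

Because the writer and the readers share only the log, the realtime and batch paths can be scaled, restarted or replaced independently, which is the operational benefit the slide points to.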
Demo
CDAP Community 100% Open Source (ASL2)
Website: http://cdap.io
Mailing List: [email protected] [email protected]
IRC: #cdap on freenode.net
CDAP Enterprise 100% Commercially Supported
Website: http://cask.co
Contact Sales: [email protected]
Contact Me: [email protected] or @jgrayla
Accelerate Your Big Data Journey
Tap In @ cask.co
Download Now or Learn More at cask.co