Accelerate Big Data Application Development with Cascading and HDP, Hortonworks and Concurrent...
Transcript of Accelerate Big Data Application Development with Cascading and HDP, Hortonworks and Concurrent...
Page 1
Accelerate Big Data Application Development with Cascading and HDP
April 22, 2014
Page 2
Agenda
• Take advantage of the latest Hadoop processing frameworks like YARN and Tez in HDP 2.1
• How developers can create future proof, data-driven applications built on Apache Hadoop with Cascading
• How Cascading accelerates Hadoop application development by abstracting the platforms underneath
Page 3
Speakers
Ajay Singh, Director of Technical Channels, Hortonworks
Supreet Oberoi, VP of Field Engineering, Concurrent
Page 4
Our Mission: Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop
• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquartered in Palo Alto, CA; 300+ employees and growing
Reseller Partners: (logos not reproduced)
Page 5
A data architecture under pressure from new data
Applications: Business Analytics, Custom Applications, Packaged Applications
Data System (Repositories): RDBMS, EDW, MPP
Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
New data arriving from:
• OLTP, ERP, CRM Systems
• Unstructured documents, emails
• Clickstream
• Server logs
• Sentiment, Web Data
• Sensor, Machine Data
• Geolocation
Source: IDC
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020
Page 6
A Modern Data Architecture
Applications: Business Analytics, Custom Applications, Packaged Applications
Dev & Data Tools: Build & Test
Operational Tools: Manage & Monitor
Data System: Repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (Governance & Integration, Data Access, Data Management, Security, Operations)
Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Page 7
New types of data
Hadoop Value:
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products, right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
Page 8
Enterprise Hadoop: Core Foundation of Hadoop Applications
Page 9
Core Capabilities of Enterprise Hadoop
• Data Management: Store and process all of your Corporate Data Assets
• Data Access: Access your data simultaneously in multiple ways (batch, interactive, real-time)
• Governance & Integration: Load data and manage according to policy
• Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• Operations: Deploy and effectively manage the platform
• Presentation & Application: Enable both existing and new applications to provide value to the organization
• Enterprise Mgmt & Security: Empower existing operations and security tools to manage Hadoop
• Deployment Options: Provide deployment choice across physical, virtual, cloud
Page 10
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
• Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
• Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
• Data Management: YARN: Data Operating System; HDFS (Hadoop Distributed File System), nodes 1 to N
• Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox)
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premise, Cloud
Page 11
Hadoop is wholly integrated into the data center
Applications, Dev & Data Tools, Operational Tools
HANA; BusinessObjects BI
Data System: RDBMS, EDW, MPP alongside HDP 2.1 (Governance & Integration, Data Access, Data Management, Security, Operations)
Infrastructure
Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Page 12
Developing Apps on Hadoop
• Spring XD Framework
– Consistent configuration & Java API across a wide range of Hadoop ecosystem projects
• Microsoft .NET SDK For Hadoop
– API access to HDP on Windows and the HDInsight service
– LINQ libraries for accessing Hive
• Cascading
– Delivers an easy-to-use abstraction layer for developing Hadoop applications
– Supports development in Scala & Clojure
– Hortonworks to certify, support & deliver the Cascading SDK with Hortonworks Data Platform
DRIVING INNOVATION THROUGH DATA
ACCELERATE BIG DATA APPLICATION DEVELOPMENT WITH CASCADING AND HDP
Supreet Oberoi | April 22, 2014
VP Field Engineering, Concurrent Inc
HORTONWORKS PARTNERS WITH CONCURRENT
• The Cascading SDK will now be integrated with the Hortonworks Data Platform (HDP)
• Hortonworks will certify and support Cascading™ SDK with HDP
• Cascading will support Apache Tez; companies using Cascading or domain-specific languages on Cascading can seamlessly migrate to HDP supporting Apache Tez
The partnership benefits users by combining the power and simplicity of Cascading with the reliability and stability of HDP.
Confidential
AGENDA
• Who is Concurrent
• What is Cascading
• Where is it used
• What problems does Cascading solve
• What is included in the Cascading kit
ABOUT CONCURRENT, INC.
GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application development and management
Products and Technology
• CASCADING: The most widely used application infrastructure for building Big Data applications with over 150,000 downloads each month
• DRIVEN: Enterprise Data Application management for Big Data apps
Proven - Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
PRODUCTS AND TECHNOLOGY
Open Source: Big Data Application Development (Simple, Reliable, Repeatable)
• Open Source Community: Focused on Data App Development
• Project home of Cascading
• Collection of sub-projects / tools
Commercial: Unmatched Application Insight (Visibility into your Data Applications)
• Data App Management: Realtime monitoring, Performance Management, Operational Control, Data Provenance, Compliance, Governance
www.concurrentinc.com/products
BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data
BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
BUSINESSES DEPEND ON US
• Scalding (Scala)
• Machine learning (linear algebra) to improve
• User experience
• Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
• IPO
BUSINESSES DEPEND ON US
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
CASCADING DEPLOYMENTS
DRIVING ADVANTAGE WITH DATA APPLICATIONS
• Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
• Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
• Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
• Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
• Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
• Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
BIG DATA — THE NEXT PHASE OF MATURITY
“It’s all about the Apps”
There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications
Connecting Business and Data
• Business Strategy: Loyalty and promotions analysis; Retention campaigns; Marketing campaign optimization; Fraud detection; Risk management; Scientific research; Remote monitoring and diagnosis; and more
• Data & Technology: Your Data & Systems (Hadoop, EDW, Mainframe, System Logs, NoSQL DBs, etc.)
• Challenges: Leveraging existing skill sets, existing systems, past investments and existing business processes
PRODUCTS OVERVIEW
CASCADING
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
Architecture diagram: Enterprise Java and Scripting (Scala, Clojure, JRuby, Jython, Groovy) sit on top of Cascading (Process Planner; Processing API, Integration API, Scheduler API), which runs on Apache Hadoop and connects to Data Stores and a Scheduler
KEY CASCADING CONCEPTS
• Tap: a data source or sink
• Pipe: the processing steps applied to the data
• Flow: taps connected to pipes, forming a unit of work
SOME COMMON PATTERNS
• Functions
• Filters
• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)
• Aggregations
  ‣ Count, Average, etc
  ‣ Rolling windows
Diagram: topologies built from data, filter and function steps (Pipeline, Split, Join, Merge)
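The patterns listed on this slide correspond to familiar collection operations. The following is a minimal sketch in plain Java (java.util.stream only), not the Cascading API; the class and method names are hypothetical and exist only to show the shape of a filter, a function, a merge, a grouping with a count, and an inner join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration using plain Java collections.
// This is NOT the Cascading API; it only mirrors the shape of each pattern.
class PipePatternsSketch {

    // Filter: drop records that fail a predicate (here: blank lines).
    static List<String> filterBlank(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    // Function: transform each record (here: lower-case it).
    static List<String> toLower(List<String> lines) {
        return lines.stream()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    // Merge (union): concatenate two streams that share a record layout.
    static List<String> merge(List<String> left, List<String> right) {
        return Stream.concat(left.stream(), right.stream())
                .collect(Collectors.toList());
    }

    // Grouping + aggregation: group equal records and count each group.
    static Map<String, Long> groupAndCount(List<String> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r, TreeMap::new, Collectors.counting()));
    }

    // Inner join: keep only keys present on both sides, pairing their values.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(left).entrySet()) {
            String match = right.get(entry.getKey());
            if (match != null) {
                joined.add(entry.getKey() + ":" + entry.getValue() + ":" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> merged = merge(Arrays.asList("Rain", " ", "sun"), Arrays.asList("RAIN"));
        // prints {rain=2, sun=1}
        System.out.println(groupAndCount(toLower(filterBlank(merged))));
    }
}
```

In Cascading itself these steps would be expressed as pipes connected into a flow rather than as in-memory collections; the sketch only shows what each pattern does to the data.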
WORD COUNT EXAMPLE

// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling
// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
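To see what the word count flow computes without a cluster, the same logic can be sketched in plain Java. This is an illustration only: it uses neither Cascading nor Hadoop, the class name WordCountSketch is hypothetical, and skipping empty tokens is a simplification of the sketch rather than a statement about how RegexSplitGenerator behaves. The delimiter regex is the one from the slide.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the flow's logic: split lines into tokens,
// group by token, count each group. No Cascading or Hadoop involved.
class WordCountSketch {

    // Same delimiter regex as passed to RegexSplitGenerator on the slide.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // simplification: skip empty tokens between adjacent delimiters
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(Arrays.asList(
                "rain shadow (rain)", "dry, dry air."));
        // Tab-separated output, mirroring the tab-delimited sink in the slide.
        counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Run as written, this prints four tab-separated lines in sorted order: air 1, dry 2, rain 2, shadow 1.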
CASCADING OVERVIEW
Proven application development framework for building Data applications
www.cascading.org
Framework addresses:
• Build Data Apps that are scale-free: Design principles ensure best practices at any scale
• Test-Driven Development: Efficiently test code and process local files before you deploy on a cluster
• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets
• Operational Complexity: Simple - Package up into one jar and hand to operations
• Application Portability: Write once, then run on different computation fabrics
• Systems Integration: Hadoop never lives alone. Easily integrate to your existing systems
OPERATIONAL READINESS: DISCIPLINE & ABILITY TO MEASURE
• Visibility into app development
• Business SLA
• Balance & Controls
• Application testing
• Data quality
• Process to “productionalize” apps
• High fidelity execution analysis
• Real-time monitoring
• …
PRODUCTS AND TECHNOLOGY
Open Source: Big Data Application Development (Simple, Reliable, Repeatable)
• LINGUAL: Simplifying Systems Integration
• PATTERN: Enabling Machine Scoring Algorithms
Commercial: Unmatched Application Insight (Visibility into your Data Applications)
www.concurrentinc.com/products
CASCADING ECOSYSTEM IS MORE THAN CASCADING FRAMEWORK
Lingual, Pattern and other Dynamic Programming Languages such as Scalding are part of the Cascading Ecosystem and are included as part of the Cascading kit
http://www.cascading.org/extensions/
LINGUAL
• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC
• Great for migration of data and for integrating with non-Big Data assets; extends the life of existing IT assets in an organization
Architecture diagram: CLI / Shell and Enterprise Java sit on top of Lingual (Query Planner, Catalog; JDBC API, Lingual API, Provider API), which runs on Cascading and Apache Hadoop and connects to Data Stores
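Because Lingual exposes a JDBC API, an application could query it through the standard java.sql interfaces. The sketch below is hedged: the URL layout `jdbc:lingual:<platform>;schemas=<dir>` is an assumption that should be checked against the Lingual documentation, and the schema directory, schema name and table name are hypothetical. Only the java.sql calls are standard. Without the Lingual driver on the classpath the program reports the connection failure and exits normally.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of querying Lingual through plain JDBC.
class LingualJdbcSketch {

    // ASSUMPTION: URL layout "jdbc:lingual:<platform>;schemas=<dir>".
    // Verify against the Lingual documentation before relying on it.
    static String connectionUrl(String platform, String schemaDir) {
        return "jdbc:lingual:" + platform + ";schemas=" + schemaDir;
    }

    // Standard JDBC: run a query and return the first column of each row as text.
    static List<String> firstColumn(String url, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                values.add(rows.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical schema directory, schema name and table name.
        String url = connectionUrl("local", "data/example");
        String sql = "SELECT \"NAME\" FROM \"EXAMPLE\".\"EMPLOYEE\"";
        try {
            for (String value : firstColumn(url, sql)) {
                System.out.println(value);
            }
        } catch (SQLException e) {
            // Expected when no Lingual JDBC driver is on the classpath.
            System.err.println("Could not run query: " + e.getMessage());
        }
    }
}
```

Keeping the URL construction separate from the query code means the same query method could be pointed at a different platform without changing the SQL, which is the portability the slide describes.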
SCALDING
• Scalding is a language binding to Cascading for Scala
- The name Scalding comes from combining SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix math…
• Scalding has very large commercial deployments at:
- Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
- eBay - Use cases include search analytics and other production data pipelines
DRIVEN OVERVIEW
What is Driven? The first application performance management product for Big Data applications
Capabilities
• Visualize your Data App: No more black box! Instantly visualize your running app in real-time
• Diagnose App Failures: Identify where and how your app failed… all without sorting through logs
• Track App Performance: For all your apps, view and compare history of your app’s runtime performance
• Insight into your Applications: At any moment, quickly understand what your app is doing on your cluster
Benefits
• Accelerate Time to Market
• Build Reliable Applications
• Optimize Application Performance
Key Features
• Application visualization
• Dashboard performance view
• Application performance history
• Insights for each application (workflow, telemetry, error types)
• Team collaboration and management
Works with: LINGUAL, PATTERN, SCALDING, CASCALOG
www.cascading.io
Driven is free for developer use (cloud)
CASCADING AVAILABILITY
Cascading, Lingual and Pattern are open source projects freely available to the general public under Apache License 2.0
• Availability: Cascading 2.5, Lingual 1.1 and Pattern 1.0-WIP are all Available Now
• License: Apache License 2.0 (Cascading, Lingual and Pattern)
• Support: Community Forums & Mailing List, Enterprise Support (Cascading, Lingual and Pattern)
Summary
• APM for Big Data | The first application performance management product for Big Data applications
• For Developers and Operators | Significantly improves developer productivity and operations control by providing an unprecedented level of insight into building and managing enterprise-grade data applications
• Collaboration | Facilitates and encourages user collaboration to build enterprise data applications
• Community Integration | Driven is a free cloud service integrated with the Cascading open source community
• Licensing | Driven is free for development (cloud only) and licensable for production or on-premise deployments
• Deployment Options | Deploy in the cloud or on-premise
Accelerate Time to Market: Process visualization and monitoring capabilities in a rich UI
Build Reliable Apps: Detailed insight into data processing logic and algorithms
Optimize App Performance: Key application behavior metrics with historical data to trend performance
GET STARTED WITH CASCADING ON HDP 2.1
1. Download HDP 2.1
2. Take Cascading for a spin by running the Impatient tutorial at http://docs.cascading.org/impatient/
DRIVING INNOVATION THROUGH DATA
THANK YOU
Supreet Oberoi | April 18, 2014
Page 13
The Largest Hadoop Community Events in Europe and North America
AMSTERDAM April 2-3
SAN JOSE June 3-5
• 6 tracks, 3 days, and 120+ sessions to choose from
• Community Focused - Sessions voted on by the public and selected by a committee of industry luminaries
• Deep Dive Technical Content - Including a Committer track with content presented by Apache committers
• Business and Technical Topics
• Community Activities - Hadoop Summit will host community meet-ups and birds of a feather sessions
www.hadoopsummit.org
Page 14
Questions?
Use the Q/A panel to ask your questions
Download the Hortonworks Sandbox and Cascading
• Cascading and HDP 2.1 Sandbox
• Hortonworks Sandbox
• Cascading Impatient Tutorial