Talend Spark Meetup
Source: files.meetup.com/14077672/Talend Spark Meetup updated.pdf
![Page 1](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/1.jpg)
1 ©2016 Talend
Talend – Spark Meetup
Edward Ost
![Page 2](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/2.jpg)
2
Talend: A History of Innovation and Growth
Timeline, 2006 to 2016 (Revenue Growth):
• Data Integration
• Master Data Management
• Data Quality
• Big Data
• Application Integration
• Hadoop 2.0
• Spark & Cloud
• Data Preparation
Key Facts
• Founded in 2006
• 550+ employees worldwide
• 7 countries
• 1300+ customers
• 2M+ open source downloads
![Page 3](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/3.jpg)
3
Top Big Data Challenges
Talend Directly Addresses these Challenges
Source: Gartner, 12 September 2013, G00255160, "Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype"
![Page 4](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/4.jpg)
4
Talend 6.0
Unleashing the Power of Spark with Real-time Big Data Integration
• Talend Real-time Big Data: The first data integration platform on Spark
• Internet of Things: Delivers an end-to-end integration platform for IoT
• Continuous Delivery: Provides Continuous Delivery data integration with unmatched productivity
• New Insight: Easily access master data from Big Data, Mobile, and Cloud Apps using MDM REST APIs
• Smarter, More Secure Data: New data masking and semantic discovery capabilities
![Page 5](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/5.jpg)
5
Talend Remains Ahead of the Curve for Big Data
[Table: supported platform versions by Talend release. Columns: Talend 5.6.x (Dec 2014), Talend 6 (Sept 2015), Talend 6.1 (Dec 2015), Talend 6.2 (Jun 2016). Row groups: NoSQL, Hadoop Distros, Hadoop Cloud. Platform names appeared as logos, so the version cells (including BigInsights 4.x) cannot be matched to platforms in this transcript.]
* Tech Preview
![Page 6](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/6.jpg)
6
The More Data, The Better Talend Performs
Number of Records Processed (in Millions) and speedup:
• 5: 2X Faster
• 9.5: 3.5X Faster
• 19: 3.8X Faster
• 37: 5.4X Faster
• 75: 7.8X Faster
Benchmark: MCG Global Services
![Page 7](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/7.jpg)
7
Easily Convert MapReduce to Spark
MapReduce Performance (runs on disk)
One Click
Spark Performance (runs in-memory & on disk)
5X Faster
![Page 8](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/8.jpg)
8
Information Supply Chain Drivers
Technical Concerns
• Decouple source systems
• Increase agility
• Reduce process latency
• Avoid re-engineering
• At scale
Business Drivers
• Evolving business network
• Data Broker ecosystem
• Transform Data into Information
• Onboarding data sources rapidly
• Accelerate insight
![Page 9](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/9.jpg)
9
DataVault
Step 1: Establish the Business Keys (Hubs)
Step 2: Establish the relationships between the Business Keys (Links)
Step 3: Establish descriptions around the Business Keys (Satellites)
Step 4: Add standalone components such as calendars and code/description tables for decoding in Data Marts
Step 5: Tune for query optimization by adding performance tables such as Bridge tables and Point-In-Time structures
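Steps 1 to 3 can be sketched in plain Python. This is a minimal illustration, not the Talend implementation: the `hash_key` and `to_vault` helpers, the MD5 hash keys, the `"CRM"` record source, and the sample rows are all assumptions made for the example. The column names come from the USERS table on the following slide.

```python
import hashlib

def hash_key(*business_keys):
    """Hypothetical helper: deterministic key built from one or more business keys."""
    normalized = "|".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def to_vault(user_rows, record_source, load_dts):
    """Split flat USERS rows into hub, link, and satellite rows."""
    hubs_user, hubs_account, links, sats = {}, {}, {}, []
    for row in user_rows:
        user_hk = hash_key(row["User_ID"])
        account_hk = hash_key(row["Account_ID"])
        link_hk = hash_key(row["User_ID"], row["Account_ID"])
        meta = {"load_dts": load_dts, "record_source": record_source}
        # Step 1: hubs hold only the business key (one row per distinct key)
        hubs_user[user_hk] = {"user_hk": user_hk, "User_ID": row["User_ID"], **meta}
        hubs_account[account_hk] = {"account_hk": account_hk,
                                    "Account_ID": row["Account_ID"], **meta}
        # Step 2: links record the relationship between business keys
        links[link_hk] = {"link_hk": link_hk, "user_hk": user_hk,
                          "account_hk": account_hk, **meta}
        # Step 3: satellites carry the descriptive attributes
        sats.append({"user_hk": user_hk, "First_NM": row["First_NM"],
                     "Last_NM": row["Last_NM"],
                     "Status_CODE": row["Status_CODE"], **meta})
    return hubs_user, hubs_account, links, sats

# Made-up sample rows: two users that share one account
users = [
    {"User_ID": "U-1", "Account_ID": "A-100", "First_NM": "Ada",
     "Last_NM": "Lovelace", "Status_CODE": "ACTIVE"},
    {"User_ID": "U-2", "Account_ID": "A-100", "First_NM": "Alan",
     "Last_NM": "Turing", "Status_CODE": "ACTIVE"},
]
hub_user, hub_account, link_user_account, sat_user = to_vault(users, "CRM", "2016-01-01")
print(len(hub_user), len(hub_account), len(link_user_account), len(sat_user))  # 2 1 2 2
```

Two users on one account yield two user hub rows, one account hub row, two link rows, and two satellite rows.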
![Page 10](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/10.jpg)
10
Simple Data Vault Design Flow - Relational
ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT
USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT
Identify Business Keys
Identify Attributes
Establish Linkages
Control Lineage
Control History
De-Normalize
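One common way to "Control History" and "Control Lineage" is to append a new satellite version only when the descriptive attributes change, and to stamp every row with a load timestamp and record source. The sketch below assumes that approach. The `hash_diff` and `apply_changes` helpers, the chosen attribute subset, and the sample values are hypothetical and are not taken from the deck.

```python
import hashlib

# Assumed subset of ACCOUNTS attributes tracked in the satellite
ATTRS = ["Company_NM", "City", "State", "Status_CODE"]

def hash_diff(row, attrs=ATTRS):
    """Fingerprint of the descriptive attributes, used to detect changes."""
    payload = "|".join(str(row.get(a, "")) for a in attrs)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def apply_changes(satellite, incoming, load_dts, record_source):
    """Append a new satellite version only when the attributes changed."""
    latest = {}
    for rec in satellite:  # satellite is ordered oldest to newest
        latest[rec["Account_ID"]] = rec["hash_diff"]
    for row in incoming:
        diff = hash_diff(row)
        if latest.get(row["Account_ID"]) != diff:
            satellite.append({
                "Account_ID": row["Account_ID"],
                **{a: row.get(a) for a in ATTRS},
                "hash_diff": diff,
                "load_dts": load_dts,            # history: when this version arrived
                "record_source": record_source,  # lineage: where it came from
            })
            latest[row["Account_ID"]] = diff
    return satellite

account = {"Account_ID": "A-100", "Company_NM": "Acme", "City": "Bethesda",
           "State": "MD", "Status_CODE": "NEW"}
sat_account = []
apply_changes(sat_account, [account], "2016-01-01", "CRM")
apply_changes(sat_account, [account], "2016-01-02", "CRM")  # unchanged, no new row
apply_changes(sat_account, [dict(account, Status_CODE="ACTIVE")], "2016-01-03", "CRM")
print(len(sat_account))  # 2
```

Existing rows are never updated in place, which keeps the satellite auditable.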
![Page 11](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/11.jpg)
11
Simple Data Vault Design Flow – Big Data
ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT
USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT
Identify Business Keys
Identify Attributes
Establish Linkages
Control Lineage
Control History
De-Normalize
Create PIT & BRIDGE records
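A point-in-time (PIT) record can be pictured as a lookup row: for each business key and snapshot date, it stores which satellite version was current. The following is a rough sketch under that reading. The `build_pit` function, the satellite names, and the dates are invented for illustration, and bridge records are not covered.

```python
from datetime import date

def build_pit(hub_keys, satellites, snapshot_dates):
    """For each key and snapshot, find the latest satellite load on or before it.

    satellites maps a satellite name to a list of (key, load_date) pairs.
    """
    pit = []
    for snap in snapshot_dates:
        for key in hub_keys:
            row = {"key": key, "snapshot": snap}
            for name, rows in satellites.items():
                loads = [ld for k, ld in rows if k == key and ld <= snap]
                # None means the satellite had no row for this key yet
                row[name + "_load_date"] = max(loads) if loads else None
            pit.append(row)
    return pit

# Hypothetical satellites hanging off the account hub
satellites = {
    "sat_details": [("A-100", date(2016, 1, 1)), ("A-100", date(2016, 2, 1))],
    "sat_status": [("A-100", date(2016, 2, 10))],
}
pit_rows = build_pit(["A-100"], satellites, [date(2016, 1, 15), date(2016, 2, 15)])
for r in pit_rows:
    print(r)
```

Downstream queries can then join each satellite on an exact key and load date instead of scanning its history for the current version.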
![Page 12](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/12.jpg)
12
Spark and Data Vault Design Notes
• Focus on business keys and simplicity in source extracts
• Autonomous extracts enable parallel processing
• Capture and preserve auditable data in raw data vault
• Defer more complex business rules to the business vault
• Consider point-in-time tables for operational data vault
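The point about autonomous extracts can be illustrated without Spark: when each extract depends only on its own source table, the extracts can run side by side. The sketch below uses a thread pool as a stand-in for a cluster. The in-memory `SOURCES` dictionary and the `extract` function are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for two source tables (made-up rows)
SOURCES = {
    "ACCOUNTS": [{"Account_ID": "A-100", "Company_NM": "Acme"}],
    "USERS": [{"User_ID": "U-1", "Account_ID": "A-100"}],
}

def extract(table):
    """Autonomous extract: reads one table, applies no business rules."""
    rows = SOURCES[table]
    # Keep the raw attributes and only tag each row with its origin
    return table, [dict(r, record_source=table) for r in rows]

# No extract waits on another, so they can be submitted in parallel
with ThreadPoolExecutor(max_workers=2) as pool:
    raw_vault = dict(pool.map(extract, SOURCES))

print(sorted(raw_vault))  # ['ACCOUNTS', 'USERS']
```

Business rules are deliberately absent from `extract`. Under the notes above they would be applied later, in the business vault.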
![Page 13](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/13.jpg)
13
Basic Ingest
![Page 14](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/14.jpg)
14
Data Vault – Relational Model
• Extract data and write it to DV-ready CSV files
• Push to S3/RDS
• Use ELT to De-Normalize into Columnar DataMart
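The ELT step can be sketched with SQLite standing in for the target database. The deck names S3/RDS, so SQLite is purely an assumption to keep the example self-contained. The table names, hash key values, and sample rows are hypothetical. The point is that the transform runs inside the database as a `CREATE TABLE ... AS SELECT` over already-loaded vault tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Load: vault tables are already in the database (sample rows are made up)
CREATE TABLE hub_account (account_hk TEXT PRIMARY KEY, Account_ID TEXT);
CREATE TABLE sat_account (account_hk TEXT, load_dts TEXT,
                          Company_NM TEXT, Status_CODE TEXT);
INSERT INTO hub_account VALUES ('hk1', 'A-100');
INSERT INTO sat_account VALUES ('hk1', '2016-01-01', 'Acme', 'NEW');
INSERT INTO sat_account VALUES ('hk1', '2016-02-01', 'Acme', 'ACTIVE');

-- Transform: de-normalize hub + latest satellite version into a flat mart table
CREATE TABLE mart_account AS
SELECT h.Account_ID, s.Company_NM, s.Status_CODE, s.load_dts AS valid_from
FROM hub_account h
JOIN sat_account s ON s.account_hk = h.account_hk
WHERE s.load_dts = (SELECT MAX(s2.load_dts)
                    FROM sat_account s2
                    WHERE s2.account_hk = h.account_hk);
""")

mart_rows = conn.execute(
    "SELECT Account_ID, Company_NM, Status_CODE FROM mart_account"
).fetchall()
print(mart_rows)  # [('A-100', 'Acme', 'ACTIVE')]
```

SQLite is row-oriented, so this shows only the de-normalizing query shape, not the columnar storage the slide refers to.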
![Page 15](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/15.jpg)
15
Data Vault – Big Data Analytics
• Sqoop data directly into S3/DV (Redshift)
• Use ELT to De-Normalize into Columnar DataMart
![Page 16](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/16.jpg)
16
Data Vault with Spark – Big Data Real Time
• Sqoop data directly into S3/DV (Hive)
• Transform to Data Vault with Spark Batch
• Operational Data Vault with Spark Streaming
• ELT to De-Normalize into Columnar DataMart
![Page 17](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/17.jpg)
17
• OLTP
• Systems of Engagement
• Data Warehouse
• Analytics
• BI
From Data to Information
• Supply Chain
• Collaboration
• Self-Service
• On-Demand
![Page 18](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/18.jpg)
18
Lambda Architecture
[Diagram; labels grouped as extracted]
• Sources: IoT, NoSQL, Web Logs, App. Events, Systems of records, ERP, DBMS
• Batch layer: Extract, Load, Transform
• Streaming layer: Ingest, Transform, Update
• Consumers: Reporting, Data Mining, MDD/OLAP, Dashboarding, Data Discovery, API, Analytics, Applications
• Feedback loop: Learn, Act
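The interplay of the batch and streaming layers can be sketched in a few lines. This is a simplified illustration of the general Lambda pattern, not of the Talend jobs behind the slide. The `LambdaView` class, its per-account event count, and the sample events are assumptions.

```python
from collections import Counter

class LambdaView:
    """Toy Lambda setup: event counts per account from two layers."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from the full master dataset
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, event):
        # Every event is stored for the batch layer and applied to the speed view
        self.master.append(event)
        self.speed_view[event["account"]] += 1

    def run_batch(self):
        # Batch layer: recompute from scratch, then reset the speed view.
        # Simplification: assumes no events arrive while the batch is running.
        self.batch_view = Counter(e["account"] for e in self.master)
        self.speed_view.clear()

    def query(self, account):
        # Serving: merge the batch view with the recent increments
        return self.batch_view[account] + self.speed_view[account]

view = LambdaView()
view.ingest({"account": "A-100"})
view.ingest({"account": "A-100"})
before_batch = view.query("A-100")  # served by the speed view only
view.run_batch()
view.ingest({"account": "A-100"})
after_batch = view.query("A-100")   # batch view plus one recent event
print(before_batch, after_batch)  # 2 3
```

Queries stay current between batch runs because the speed view covers whatever the last batch has not yet seen.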
![Page 19](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/19.jpg)
19
Talend Big Data Resources
• Discover the Talend Big Data Jumpstart Sandbox
• Starting the Talend Big Data Sandbox
• Big Data Sandbox Forum
• Get It Right, in Real Time with SPARK
• Using AWS EMR, Redshift, and Spark to Power Your Analytics
• TalendForge Big Data Forum
• Data Vault Basics
• Data Vault Series – Agile Modeling not an Option Anymore
![Page 20](https://reader031.fdocuments.us/reader031/viewer/2022022510/5adb45047f8b9a52528df91c/html5/thumbnails/20.jpg)
20
Questions
Edward Ost
Channels Technical Director
301-666-1039