Big Data in the Cloud with Informatica Cloud and Amazon Redshift
Rahul Pathak, Amazon Redshift Product Management
Nicolas Brisoux, Informatica Cloud Platform Adoption
Darren Cunningham, Informatica Cloud Marketing
@infacloud #redshift
Today’s Agenda
• Informatica and Amazon Strategic Partnership
• Amazon Redshift Overview
• Informatica Cloud Redshift Connector
• Demonstration
• Discussion
• Next Steps
Informatica: The Information Management Leader
B2B Data Exchange
Informatica supports the requirements of cross-organizational
data exchange, so users apply familiar & trusted data integration
tools and techniques to the growing practice of B2B data integration.
Cloud Data Integration
Enterprise Data Integration
Complex Event Processing
Informatica received high praise for its services from customers. For deployments involving systems
monitoring use cases, Informatica offers a five-day stand‐up of
RulePoint.
Ultra Messaging
In spite of the new entrants, Informatica remains the market
leader in this highly demanding part of the messaging market.
Data Quality
Master Data Management
Application ILM
Informatica Cloud: our fastest growing product line
Today’s Focus: Cloud Data Integration
Informatica Cloud and Amazon Redshift: Enabling cost-effective data warehousing
• Redshift Connector pre-release announced in February
• General availability this month (August)
InformaticaCloud.com/Amazon-Redshift
Rahul Pathak | [email protected] | @rahulpathak
Senior Product Manager
Amazon Redshift
AWS Database Services
Amazon RDS
Fully managed SQL database service for OLTP workloads
Amazon DynamoDB
Fully managed NoSQL service for massively scalable, high throughput, low latency workloads
Amazon Redshift
Fully managed, fast and powerful, petabyte-scale data warehouse service
Amazon ElastiCache
Fully managed Memcached-compliant in-memory caching service
We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
Data warehousing done the AWS way
• Pay as you go, no up front costs
• Fast, cheap, easy to use
• SQL
• Easy to provision
Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business; provision in minutes
Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Progress Since Launch on Feb 14, 2013
• Fastest growing service in AWS history
• Well over 1,000 customers; adding over 100 per week
• Obtained SOC1 & SOC2 certification with more in progress
• Deployed in US East (N. Virginia), US West (Oregon), EU (Ireland) and Asia Pacific (Tokyo)
• Additional global regions coming soon
Amazon Redshift Customers
• 5x – 20x reduction in query times; 4x cost reduction over HIVE
• 20x – 40x reduction in query times
• Nokia: 50% reduction in costs, 2x improvement in query times
Amazon Redshift Customer: bit.ly
“When we want to answer a question with Redshift, we just write a SQL query and get an answer within a few minutes – if not seconds.”
- Sean O’Connor, Engineer at bit.ly
Bit.ly provides social link sharing analytics, managing over 300 million shortens and 5 billion clicks each month
Amazon Redshift Customer: HasOffers
“Amazon Redshift introduces a major opportunity to improve the performance of our real-time reporting, allowing us to run queries up to 50 times faster than our current OLAP solution.”
- Niek Sanders, VP of Engineering, HasOffers
HasOffers records and reports billions of desktop and mobile interactions for performance marketers
Amazon Redshift Customer: Infor
“This is the formula for fast and broad adoption, where customers can get consistent, accurate, and useful data fast - in weeks not months or years.”
- Ali Shadman, SVP, Business Cloud & Upgrades, Infor
Infor is the world’s third largest ERP vendor, serving over 70,000 customers in 194 countries
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID    Age   State   Amount
123   20    CA      500
345   25    WA      250
678   40    FL      125
957   37    WA      375
• With row storage you do unnecessary I/O
• To get total amount, you have to read everything
• With column storage, you only read the data you need
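As a rough illustration of the row versus column point, the following minimal Python sketch counts how many values each layout has to read in order to total the Amount column of the sample table. It is a toy model and not Redshift's storage engine; the function names are hypothetical.

```python
# Sample rows from the slide: (ID, Age, State, Amount)
ROWS = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]


def total_amount_row_store(rows):
    """Row layout: every field of every row is read just to sum one column."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        total += row[3]          # but only Amount is used
    return total, values_read


def total_amount_column_store(rows):
    """Column layout: each column is stored on its own, so only Amount is read."""
    columns = {
        "id": [r[0] for r in rows],
        "age": [r[1] for r in rows],
        "state": [r[2] for r in rows],
        "amount": [r[3] for r in rows],
    }
    amounts = columns["amount"]  # the other three columns are never touched
    return sum(amounts), len(amounts)


if __name__ == "__main__":
    print(total_amount_row_store(ROWS))
    print(total_amount_column_store(ROWS))
```

Both functions return the same total; the second element of each result shows how many values had to be read to get there.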
• Columnar compression saves space & reduces I/O
• Amazon Redshift analyzes and compresses your data
analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
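To give a feel for why an encoding such as delta can save space on a column like listid, here is a minimal Python sketch of the general delta idea: keep the first value and then store only the differences between neighbours. This is an assumption-laden illustration of the technique, not Redshift's on-disk format, and the function names and sample values are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values with a running sum over the differences."""
    return list(accumulate(encoded))


if __name__ == "__main__":
    list_ids = [1001, 1002, 1003, 1005, 1008]  # hypothetical, nearly sequential IDs
    print(delta_encode(list_ids))
    print(delta_decode(delta_encode(list_ids)))
```

When values are close together, the differences are small numbers that fit in fewer bytes than the original values, which is where the space and I/O savings come from.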
• Keep track of the minimum and maximum value for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
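The zone map idea in the bullets above can be sketched in a few lines of Python: record the minimum and maximum of each block, then skip any block whose range cannot contain the value a query is looking for. This is a simplified model under assumed block contents, not Redshift's implementation, and the function names are hypothetical.

```python
def build_zone_map(blocks):
    """Record the (min, max) pair for each block of column values."""
    return [(min(block), max(block)) for block in blocks]


def scan_equal(blocks, zone_map, target):
    """Return the matching values and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # the block cannot contain the target, so skip the read
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read


if __name__ == "__main__":
    blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]  # hypothetical sorted column data
    zone_map = build_zone_map(blocks)
    print(zone_map)
    print(scan_equal(blocks, zone_map, 14))
```

In this example only one of the three blocks is read. The benefit depends on how well the data is ordered: if every block spans the full range of values, nothing can be skipped.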
• Use direct-attached storage to maximize throughput
• Hardware optimized for high performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you
Amazon Redshift architecture
• Leader Node
  – SQL endpoint
  – Stores metadata
  – Coordinates query execution
• Compute Nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Single node version available
[Architecture diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC; the leader node and three compute nodes (each labeled 128 GB RAM, 16 TB disk, 16 cores) are connected by 10 GigE (HPC); ingestion, backup, and restore go through Amazon S3]
Amazon Redshift runs on optimized hardware
HS1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed customer storage
• Optimized for I/O intensive workloads
• High disk density
• Runs in HPC - fast network
• HS1.8XL available on Amazon EC2
Amazon Redshift lets you start small and grow big
Extra Large Node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores
• Single Node (2 TB)
• Cluster 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
• Cluster 2-100 Nodes (32 TB – 1.6 PB)
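The capacity ranges quoted above follow from multiplying the node count by the storage per node. A minimal Python sketch of that arithmetic, using the per-node figures from the slide (the function and dictionary names are hypothetical):

```python
# Compressed storage per node in TB, as quoted on the slide
NODE_STORAGE_TB = {"HS1.XL": 2, "HS1.8XL": 16}


def cluster_capacity_tb(node_type, node_count):
    """Total cluster storage in TB: nodes x storage per node."""
    return NODE_STORAGE_TB[node_type] * node_count


if __name__ == "__main__":
    for node_type, count in [("HS1.XL", 2), ("HS1.XL", 32), ("HS1.8XL", 2), ("HS1.8XL", 100)]:
        print(node_type, count, cluster_capacity_tb(node_type, count), "TB")
```

The largest case, 100 HS1.8XL nodes, works out to 1,600 TB, which is the 1.6 PB figure on the slide.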
[Diagram: one single XL node, a 32-node XL cluster, and a 100-node 8XL cluster. Note: Nodes not to scale]
Amazon Redshift is priced to let you analyze all your data
Simple Pricing
• Number of Nodes x Cost per Hour
• No charge for Leader Node
• No upfront costs
• Pay as you go

                     Price Per Hour for    Effective Hourly   Effective Annual
                     HS1.XL Single Node    Price Per TB       Price per TB
On-Demand            $ 0.850               $ 0.425            $ 3,723
1 Year Reservation   $ 0.500               $ 0.250            $ 2,190
3 Year Reservation   $ 0.228               $ 0.114            $ 999
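The per-TB figures in the pricing table can be reproduced from the hourly node price. The sketch below assumes 2 TB per HS1.XL node (from the hardware slide) and 8,760 hours per year (365 x 24, an assumption that matches the slide's numbers); the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours; assumed, consistent with the slide's totals
TB_PER_HS1_XL_NODE = 2      # compressed storage per HS1.XL node, from the slide


def effective_prices(price_per_node_hour):
    """Return (effective hourly price per TB, effective annual price per TB in whole dollars)."""
    hourly_per_tb = price_per_node_hour / TB_PER_HS1_XL_NODE
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return round(hourly_per_tb, 3), round(annual_per_tb)


if __name__ == "__main__":
    for label, price in [("On-Demand", 0.850), ("1 Year Reservation", 0.500), ("3 Year Reservation", 0.228)]:
        print(label, effective_prices(price))
```

The 3 Year Reservation row comes to $998.64 per TB per year before rounding, which the slide shows as $999.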
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Slides not intended for redistribution.
Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
  – AES-256; hardware accelerated
  – All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
[Security diagram: a leader node and three compute nodes (each labeled 128 GB RAM, 16 TB disk, 16 cores) on 10 GigE (HPC), shown with a Customer VPC and an Internal Security Group; SQL clients/BI tools connect to the leader node over JDBC/ODBC; ingestion, backup, and restore use Amazon S3 / Amazon DynamoDB]
Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
  – Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
Amazon Redshift works with your existing analysis tools
More coming soon…
JDBC/ODBC
Amazon Redshift
Connect using drivers from PostgreSQL.org
Amazon Redshift integrates with multiple data sources
• Amazon Elastic MapReduce
• Amazon DynamoDB
• Amazon Elastic Compute Cloud (EC2)
• AWS Storage Gateway Service
• Amazon Simple Storage Service (S3)
• Corporate Data Center
• Amazon Relational Database Service (RDS)
Informatica Cloud Architecture Overview
[Diagram with numbered callouts 1-4; labels: Secure Agent, Your Company, Marketplace, Amazon Redshift]
Map Once. Deploy Anywhere.
• On premise
• Hadoop
• 3rd-party applications
• Cloud
Cloud Amazon Redshift Connector Demo
Nicolas Brisoux, Cloud Platform Adoption
Best practices to remember…
• The Amazon S3 bucket that holds the data files must be created in the same region as your cluster
• Files are deleted from the Amazon S3 bucket when the upload is complete
• Choose a batch size where the number of batches matches the number of slices in your cluster
  – Each XL node has 2 slices; each 8XL node has 16
  – If you have a 2-node XL cluster (4 slices) and 40,000 rows of data, choose a batch size of 10,000 (40,000 / 4)
• The Informatica Cloud Redshift connector can maximize Amazon’s parallel processing capabilities this way
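The batch-size rule of thumb above can be written out as a small calculation. This is a minimal Python sketch using the slice counts from the slide; the function name is hypothetical, and rounding up when the rows do not divide evenly is an assumption, since the slide only covers the evenly divisible case.

```python
# Slices per node, as stated on the slide
SLICES_PER_NODE = {"XL": 2, "8XL": 16}


def batch_size(node_type, node_count, row_count):
    """Rows per batch so that the number of batches matches the number of slices."""
    slices = SLICES_PER_NODE[node_type] * node_count
    # Integer ceiling division: round up so there are never more batches than slices.
    return -(-row_count // slices)


if __name__ == "__main__":
    # The slide's example: a 2-node XL cluster and 40,000 rows
    print(batch_size("XL", 2, 40_000))
```

For the slide's example this gives 4 slices and a batch size of 10,000, so each slice receives one batch.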
Informatica Cloud Amazon Redshift demonstration
[Diagram labels: Firewall, Informatica Cloud Secure Agent, Metadata, Mappings]
1. Authenticate and retrieve Data Synchronization Task
2. Retrieve Account Data
3. Perform lookup on SLA level
4. Put Account Data & SLA Level into Flat File
5. Transfer compressed Flat File
6. Initiate load from Amazon S3
7. Load data into Amazon Redshift
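The later steps of the demonstration (write a flat file, compress it, load it from Amazon S3) can be approximated by hand. The Python sketch below is only an illustration under assumptions: the slides do not describe the connector's internals, the table, bucket, and key names are hypothetical, the credentials are placeholders, and the upload to Amazon S3 is left out because it needs an AWS SDK. The statement it builds uses Redshift's COPY command, which is how data is loaded from Amazon S3.

```python
import csv
import gzip
import io


def compressed_flat_file(rows):
    """Write rows to a comma-delimited flat file in memory and gzip it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return gzip.compress(buffer.getvalue().encode("utf-8"))


def copy_statement(table, bucket, key):
    """Build a Redshift COPY statement for a gzipped, comma-delimited file in S3.

    The credential values are placeholders, not real keys.
    """
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        "CREDENTIALS 'aws_access_key_id=<access-key-id>;"
        "aws_secret_access_key=<secret-access-key>' "
        "GZIP DELIMITER ',';"
    )


if __name__ == "__main__":
    # Hypothetical account rows with an SLA level already looked up
    payload = compressed_flat_file([(1001, "Gold"), (1002, "Silver")])
    print(len(payload), "compressed bytes")
    print(copy_statement("accounts", "example-bucket", "accounts.csv.gz"))
```

The statement would be run through any SQL client connected to the cluster once the file is in the bucket.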
PowerCenter Mappings and Informatica Cloud
• If you want to reuse your existing PowerCenter mappings with Informatica Cloud and Redshift you have 2 options:
• Use the PowerCenter Repository Manager to export your existing workflows and import them into Informatica Cloud using the PowerCenter Tasks feature
Or…
• Keep your existing mappings in PowerCenter and stage the data
• Create a DSS task in Informatica Cloud to move the data to Redshift from the staging area
• This task can be managed from PowerCenter
Why Informatica Cloud Integration for Redshift?
1. Map Once, Deploy Anywhere
2. Rapid Connectivity & Deployment
3. Advanced Integration Delivered Easily
4. Excellence in batch and real-time integration
InformaticaCloud.com
Next Steps
• Get started with Amazon Redshift
• Get started with Informatica Cloud
• InformaticaCloud.com
• Learn more about our Redshift Connector
• InformaticaCloud.com/Amazon-Redshift
Discussion
Rahul Pathak, Amazon Redshift Product Management
Nicolas Brisoux, Informatica Cloud Platform Adoption
Darren Cunningham, Informatica Cloud Marketing
@infacloud #redshift
InformaticaCloud.com