AWS Summit 2014 Redshift

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. May 2014. Usos e Melhores Práticas para Amazon Redshift (Uses and Best Practices for Amazon Redshift). Eric Ferreira, Sr. Database Engineer.

Description

Presentations from the AWS Summit São Paulo 2014. Download the content prepared by our specialists to help you on your journey to the cloud.

Transcript of AWS Summit 2014 Redshift

1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. May 2014. Usos e Melhores Práticas para Amazon Redshift (Uses and Best Practices for Amazon Redshift). Eric Ferreira, Sr. Database Engineer.

2. Amazon Redshift: fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year.

3. [AWS analytics pipeline diagram] Collect: Amazon Kinesis, AWS Direct Connect. Store: Amazon S3, Amazon DynamoDB, Amazon Glacier. Analyze: Amazon Redshift, Amazon EMR, Amazon EC2.

4. Amazon Redshift: petabyte scale; massively parallel; relational data warehouse; fully managed, zero admin. A lot faster, a lot cheaper, a whole lot simpler.

5. Common Customer Use Cases
- Traditional enterprise DW: reduce costs by extending the DW rather than adding hardware; migrate completely from existing DW systems; respond faster to the business.
- Companies with Big Data: improve performance by an order of magnitude; make more data available for analysis; access business data via standard reporting tools.
- SaaS companies: add analytic functionality to applications; scale DW capacity as demand grows; reduce hardware and software costs by an order of magnitude.

6. Amazon Redshift Customers

7. Growing Ecosystem

8. Data Loading Options: parallel upload to Amazon S3; AWS Direct Connect; AWS Import/Export; Amazon Kinesis; data integration partners and systems integrators.

9. Amazon Redshift Architecture
- Leader node: SQL endpoint (JDBC/ODBC); stores metadata; coordinates query execution.
- Compute nodes: local, columnar storage; execute queries in parallel; load, backup, and restore via Amazon S3; load from Amazon DynamoDB or over SSH.
- 10 GigE (HPC) interconnect between nodes; ingestion, backup, and restore paths.
- Two hardware platforms, optimized for data processing: DW1 (HDD) scales from 2 TB to 1.6 PB; DW2 (SSD) scales from 160 GB to 256 TB.

10. Amazon Redshift Node Types
- DW1 (HDD): optimized for I/O-intensive workloads; high disk density; on demand at $0.85/hour; as low as $1,000/TB/year; scales from 2 TB to 1.6 PB.
  DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage.
  DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate.
- DW2 (SSD): high performance at smaller storage sizes; high compute and memory density; on demand at $0.25/hour; as low as $5,500/TB/year; scales from 160 GB to 256 TB.
  DW2.L (new): 16 GB RAM, 2 cores, 160 GB compressed SSD storage.
  DW2.8XL (new): 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage.

11. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. With row storage you do unnecessary I/O; to get the total amount, you have to read everything.

    ID  | Age | State | Amount
    123 | 20  | CA    | 500
    345 | 25  | WA    | 250
    678 | 40  | FL    | 125
    957 | 37  | WA    | 375

12. With column storage, you read only the data you need (in the table above, only the Amount column).

13. Data compression: COPY compresses automatically; you can analyze and override; more performance, less cost.

    analyze compression listing;

    Table   | Column         | Encoding
    --------+----------------+----------
    listing | listid         | delta
    listing | sellerid       | delta32k
    listing | eventid        | delta32k
    listing | dateid         | bytedict
    listing | numtickets     | bytedict
    listing | priceperticket | delta32k
    listing | totalprice     | mostly32
    listing | listtime       | raw
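The suggested encodings can also be set explicitly in the table DDL. A minimal sketch, assuming the TICKIT listing schema and the encodings reported above:

    create table listing (
        listid         integer      not null encode delta,
        sellerid       integer      not null encode delta32k,
        eventid        integer      not null encode delta32k,
        dateid         smallint     not null encode bytedict,
        numtickets     smallint     not null encode bytedict,
        priceperticket decimal(8,2) encode delta32k,
        totalprice     decimal(8,2) encode mostly32,
        listtime       timestamp    encode raw  -- raw = no compression
    );

In practice, letting COPY choose encodings on the first load and overriding only where ANALYZE COMPRESSION suggests a clear win is the lower-effort path.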
14. Zone maps: track the minimum and maximum value for each block; skip over blocks that don't contain relevant data. [Diagram: a sorted column stored in blocks, with per-block min/max pairs 10-324, 375-623, and 637-959.]

15. Direct-attached storage: use local storage for performance; maximize scan rates; automatic replication and continuous backup; HDD and SSD platforms.

16. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize.

17. Load: load in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection; data is automatically distributed and sorted according to the DDL; scales linearly with the number of nodes.

18. Backup/restore: backups to Amazon S3 are automatic, continuous, and incremental; configurable system snapshot retention period; take user snapshots on demand; cross-region backups for disaster recovery; streaming restores let you resume querying faster.

19. Resize: resize while remaining online; a new cluster is provisioned in the background; data is copied in parallel from node to node; you are charged only for the source cluster.

20. Resize (cont.): automatic SQL endpoint switchover via DNS; the source cluster is decommissioned; simple operation via the Console or API.

21. Amazon Redshift is priced to let you analyze all your data: number of nodes x cost per hour; no charge for the leader node; no upfront costs; pay as you go.

    DW1 (HDD)          | Price per hour (DW1.XL) | Effective annual price per TB
    On-demand          | $0.850                  | $3,723
    1-year reservation | $0.500                  | $2,190
    3-year reservation | $0.228                  | $999

    DW2 (SSD)          | Price per hour (DW2.L)  | Effective annual price per TB
    On-demand          | $0.250                  | $13,688
    1-year reservation | $0.161                  | $8,794
    3-year reservation | $0.100                  | $5,498
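A quick sanity check of the on-demand DW1 figure (arithmetic added here: annual node cost divided by the node's compressed storage):

    $0.850/hour x 8,760 hours/year = $7,446/year per DW1.XL node
    $7,446/year / 2 TB             = $3,723 per TB per year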
22. Amazon Redshift has security built in: SSL to secure data in transit; load encrypted files from S3; ECDHE for perfect forward secrecy; encryption to secure data at rest (AES-256, hardware accelerated; all blocks on disk and in Amazon S3 encrypted); on-premises HSM and CloudHSM support; no direct access to compute nodes; audit logging and AWS CloudTrail integration; Amazon VPC support; SOC 1/2/3, PCI-DSS Level 1, FedRAMP. [Diagram: JDBC/ODBC into the customer VPC; leader and compute nodes in an internal VPC; 10 GigE (HPC) interconnect; ingestion, backup, restore.]

23. Amazon Redshift continuously backs up your data and recovers from failures: replication within the cluster and backup to Amazon S3 maintain multiple copies of data at all times; backups to Amazon S3 are continuous, automatic, and incremental, designed for eleven nines of durability; continuous monitoring and automated recovery from drive and node failures; snapshots can be restored to any Availability Zone within a region; backups to a second region for disaster recovery are easily enabled.

24. 50+ new features since launch in February 2013:
- Regions: N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney.
- Certifications: PCI, SOC 1/2/3, FedRAMP, PCI-DSS Level 1, others.
- Security: load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy.
- Manageability: snapshot sharing, backup/restore/resize progress indicators, cross-region backups.
- Query: regex, cursors, MD5, SHA1, time zones, workload queue timeout, HLL, concurrency up to 50 slots.
- Ingestion: S3 manifest, LZOP/LZO, JSON built-ins, UTF-8 4-byte, invalid character substitution, CSV, automatic datetime format detection, epoch, ingest from SSH, JSON, EMR.
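As an illustration of the S3 manifest feature listed above, a single COPY can name an explicit list of files. A minimal sketch, assuming a hypothetical bucket layout and placeholder credentials:

    -- s3://mybucket/venue.manifest might contain:
    -- {"entries": [
    --   {"url": "s3://mybucket/venue/part-0000", "mandatory": true},
    --   {"url": "s3://mybucket/venue/part-0001", "mandatory": true}
    -- ]}
    copy venue
    from 's3://mybucket/venue.manifest'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    manifest
    delimiter '|';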
25. MicroStrategy: the industry's best BI, web, and mobile applications, on demand in the cloud. May 2014.

26. You've Got the Data. Now What?

27. Deriving Big Insights from Big Data: trends in business analytics; popular use cases; agile business intelligence; governed self-service; information-driven mobile apps; Redshift certified; customer success.

28. Popular Use Cases (retail)
- Customer analytics: information-driven purchasing; reviews, searches, pricing comparisons, social networks, recommendations; omni-channel; Customer 360.
- Sales enablement: improve sales efficiency and effectiveness; data blending (sales, product, marketing); CRM integration; mobility (BYOD, caching).
- Retail: information-driven experience (interaction, videos, documents); in-store apps (analytics, customer 360, personal shopper); store of the future; real-time decisions; inventory management.

29. Self-Service Analytics Revolutionizes Traditional BI: boost user satisfaction while massively increasing productivity. More productive: more content per creator (5-10x more content). More producers: more users can create content (5-10x more content creators). More collaborative: peer-to-peer sharing (5-10x more sharing). Together, >100x more content creation and consumption.

30. Governed Self-Service

31. World-Class Information-Driven Mobile Apps: the future of dashboards. More than graphs: multi-media, transaction-enabled, live data, comprehensive data. Intuitive: easy to use, guided workflows for a consistent user experience, personalized for each user.

32. Business User Access to 1000s of Data Sources: faster access to your data.
- MicroStrategy modeled data: enterprise-certified single version of the truth.
- Personal or departmental: spreadsheets, Access databases, CSV, public data downloads, etc.
- Cloud-based data: Salesforce.com, NetSuite, Facebook, Eloqua, Google Docs, etc.
- Relational databases: Redshift (certified integration), Oracle, SQL Server, MySQL, Teradata, Netezza, etc.
- Big Data and Hadoop: EMR, MapR, Cloudera, Hortonworks, etc.
- Enterprise applications: SAP, Oracle e-Business, Siebel, PeopleSoft, etc.

33. Customer Success: Netflix has deployed the MicroStrategy business intelligence platform on top of Amazon Elastic MapReduce (Amazon EMR) for interactive insights. Netflix analysts use advanced visualizations to explore the performance of its streaming service closer to recorded time, directly accessing Hadoop data on an ad hoc basis and without writing code.

34. MicroStrategy Analytics Enterprise: business agility with trusted, governed data. AWS Partner; comprehensive analytics platform; #1 in mobile analytics.
- Big Data analytics: transform your growing Big Data resources into insight and profit.
- Self-service analytics: see and understand data in minutes, no IT needed.
- Enterprise-grade business intelligence: produce and publish trusted analytics to improve your business operations.

35. Experience MicroStrategy on AWS Today!

36. Ingestion Best Practices
Goal: with 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead.
- Preferred method: COPY from S3; it loads data in sorted order through the compute nodes.
- Use a single COPY command and split the data into multiple files.
- Gzip large datasets (strongly recommended):

    copy time
    from 's3://mybucket/data/timerows.gz'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    gzip
    delimiter '|';

- If you must ingest through SQL, use multi-row inserts and avoid large numbers of singleton insert/update/delete operations:

    insert into category_stage values
        (default, default, default, default),
        (20, default, 'Country', default),
        (21, 'Concerts', 'Rock', default);

- To copy from another table, use CREATE TABLE AS or INSERT INTO ... SELECT.

37. Ingestion Best Practices (cont'd)
- Verify load data files: in US East, S3 provides eventual consistency, so confirm the files are in S3 (Listing Object Keys).
- Query Redshift after the load. This query returns entries for loading the tables in the TICKIT database:

    select query, trim(filename), curtime, status
    from stl_load_commits
    where filename like '%tickit%'
    order by query;

    query | btrim                     | curtime                    | status
    ------+---------------------------+----------------------------+--------
    22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
    22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
    22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
    22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
    22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
    22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
    22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 |      1
    22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 |      1
    22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 |      1
    22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 |      1
    22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 |      1

38. Ingestion Best Practices (cont'd)
- There is no upsert statement. Use a staging table to perform an upsert: join the staging table to the target, update, then insert.
- Primary key constraints are NOT enforced; if you COPY the same data twice, you will have duplicate rows.
- Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count:

    set wlm_query_slot_count to 3;

- Run the ANALYZE command whenever you have made a non-trivial number of changes to your data, to keep table statistics current.
- A helpful system table for troubleshooting data loads: STL_LOAD_ERRORS records the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed.
- Check the character set: only UTF-8 is supported.
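A minimal sketch of the staging-table upsert pattern described above, assuming hypothetical sales (target) and sales_staging tables keyed by saleid:

    begin transaction;

    -- update rows that already exist in the target
    update sales
    set qty = s.qty, price = s.price
    from sales_staging s
    where sales.saleid = s.saleid;

    -- insert rows that are new to the target
    insert into sales
    select s.*
    from sales_staging s
    left join sales t on t.saleid = s.saleid
    where t.saleid is null;

    end transaction;

    drop table sales_staging;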
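For load troubleshooting, a query along these lines against STL_LOAD_ERRORS (the columns named here exist in that system table) surfaces the most recent failures:

    select starttime, filename, line_number, colname, err_reason
    from stl_load_errors
    order by starttime desc
    limit 10;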
39. Choose a Sort Key
Goal: skip over data blocks to minimize I/O.
- Sort based on range or equality predicates (the WHERE clause).
- If you access recent data frequently, sort on a TIMESTAMP column.

40. Choose a Distribution Key
Goal: distribute data evenly across nodes and minimize data movement among nodes (co-located joins and co-located aggregates).
- Consider using the join key as the distribution key (JOIN clause).
- With multiple joins, use the foreign key of the largest dimension as the distribution key.
- Consider using the GROUP BY column as the distribution key.
- Avoid keys used as equality filters as your distribution key.
- For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin.
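A minimal DDL sketch combining the two choices above, assuming a hypothetical sales fact table that joins to a large listing dimension on listid and is usually filtered by date:

    create table sales (
        saleid   integer      not null,
        listid   integer      not null,  -- join key to the largest dimension
        buyerid  integer      not null,
        qty      smallint     not null,
        price    decimal(8,2),
        saletime timestamp    not null
    )
    distkey (listid)    -- co-locates the join with the listing table
    sortkey (saletime); -- range filters on recent data skip blocks via zone maps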
41. Query Performance Best Practices
- Encode date and time using the TIMESTAMP data type instead of CHAR.
- Specify constraints: Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them; loading and/or applications need to be aware of this.
- Specify a redundant predicate on the sort column:

    select * from tab1, tab2
    where tab1.key = tab2.key
    and tab1.timestamp > '1/1/2013'
    and tab2.timestamp > '1/1/2013';

42. Workload Manager
- Lets you manage and adjust query concurrency: increase query concurrency up to 50, define user groups and query groups, segregate short- and long-running queries, and help improve the performance of individual queries.
- Be aware that query workload is distributed to every compute node: increasing concurrency may not always help, due to resource contention (CPU, memory, and I/O). Total throughput may increase by letting one query complete first while other queries wait.

43. Workload Best Practices
- Organizing and keeping your load files in S3 allows re-runs and scenario testing as you evolve your workflow on the platform; keep them in S3 or Glacier for fiscal/legal reasons.
- For data updated over the short term, consider a short-term version of the table for staging and a long-term version once the data stabilizes.
- Round-robin distribution when you don't have a good distribution key: see Part 1 for a query that checks distribution skew; trade off against co-located joins.
- Loading the target (final) table: use a chronological date/timestamp column as the first sort key; VACUUM is then needed less often and runs faster. When the first sort column has low cardinality/resolution (i.e., date instead of timestamp), subsequent sort columns should match common filters and/or grouping columns.

44. Workload Best Practices (cont.)
- Use the UNLOAD command to archive data that is not needed for business reasons; data that must exist only for fiscal/legal reasons can be re-loaded as needed.
- Consider applying retention policies less often than the regular workflow: a weekly/monthly process during a less busy time.
- Make space provision for data growth.
- Make sure all queries have date/timestamp range filters (> and <).

45. ... Snapshot -> spin up query clusters -> tear down
- High ratio: consider performance above space needs when choosing the number of nodes.
- Normalization rule of thumb: de-normalize only to avoid large non-collocated joins.
- Slowly Changing Dimensions (type II): keep normalized; match the distkey with the fact table.

46. Space Management
- Redshift has a single pool of space used for tables and temporary segments.
- Loads need 2.5 times the space of the data being loaded if the table has a sort key.
- VACUUM may need 2.5 times the size of the table.
- Monitor free space via the Performance tab in the console, CloudWatch alarms, or SQL.

47. Space Management (cont.)
Table sizes:

    select trim(pgdb.datname) as "database",
           trim(pgn.nspname)  as "schema",
           trim(a.name)       as "table",
           b.mbytes,
           a.rows
    from (select db_id, id, name, sum(rows) as rows
          from stv_tbl_perm
          group by db_id, id, name) as a
    join pg_class     as pgc  on pgc.oid  = a.id
    join pg_namespace as pgn  on pgn.oid  = pgc.relnamespace
    join pg_database  as pgdb on pgdb.oid = a.db_id
    join (select tbl, count(*) as mbytes
          from stv_blocklist
          group by tbl) as b on a.id = b.tbl
    order by mbytes desc, a.db_id, a.name;

Free space:

    select sum(capacity)/1024 as capacity_gbytes,
           sum(used)/1024     as used_gbytes,
           (sum(capacity) - sum(used))/1024 as free_gbytes
    from stv_partitions
    where part_begin = 0;

- Redshift lets you resize your cluster up and down and across node types, online (read-only access during the resize).

48. For more best practices, search YouTube for "Amazon Redshift Best Practices". Thank you! AWS Summit 2014