December 2013 HUG: InfiniDB for Hadoop

1. Bay Area Hadoop Users Group
   Turning the Tables with InfiniDB for Hadoop
   December 18, 2013

2. Agenda
   InfiniDB Background
   InfiniDB Technical Foundations
     o Parallelism
     o Partitioning Model
     o Additional I/O Efficiencies
   (My)SQL for Hadoop
   When to use Columnar/InfiniDB for Hadoop
   InfiniDB Benchmarks
   Copyright 2013 Calpont. All Rights Reserved.

3. InfiniDB Background
   Platforms:
     o InfiniDB
     o InfiniDB for the Cloud
     o InfiniDB for Hadoop
   Versions:
     o InfiniDB launched Feb 2010
     o InfiniDB 4 latest release available October 2013
     o Added InfiniDB for Hadoop
   Source code at https://github.com/infinidb
   GPL v2
   No restrictions on syntax, scale, or performance

4. InfiniDB Background - Customer Base

5. InfiniDB Background - Platforms
   InfiniDB: Local Disk, GlusterFS, Windows*
     http://www.calpont.com/products/tryinfinidb
   InfiniDB for Hadoop: CDH or HDP
     http://www.calpont.com/products/tryinfinidb
   InfiniDB for the Cloud: Any availability zone

6. InfiniDB Background - InfiniDB for Hadoop
   InfiniDB is a non-map/reduce engine
   Reads and writes natively to HDFS
   [Diagram labels: Pig/Hive, HBase, Map Reduce, InfiniDB for Hadoop, Hadoop Distributed File System]

7. InfiniDB Background - InfiniDB for Hadoop
   Is InfiniDB a Database?
     Not a General Purpose DBMS.
   Is InfiniDB NoSQL?
     Only in the sense that we discarded traditional DBMS architectures.
   Is InfiniDB an SQL for Hadoop technology?
     Yes, but not general purpose SQL.
     InfiniDB is highly optimized for analytic workloads/queries.
   [Customer quote:] InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn't do with Hadoop, Mongo, or Teradata.

8. InfiniDB Foundation - Parallelism
   User Module: Processes SQL Requests
   Performance Module: Executes the Queries
   Single Server or MPP
   Local disk / EBS, GlusterFS / HDFS

9. InfiniDB Foundation - Parallelism
   Purpose-built C++ engine
   Parallelism is at the thread level
   Example: 12 PM Servers with 8 cores each yields 96 parallel processing engines.
   SQL is translated into thousands or tens of thousands of discrete jobs or primitives.
   The UM sends primitives to the processing engines.

10. InfiniDB Foundation - Parallelism
   User Module: Processes SQL Requests
   Performance Module: Executes the Queries
   Single Server or MPP
   Primitives are issued to thread queue within PM
   Fixed thread count at PM
   Local disk / EBS, GlusterFS / HDFS

11. Fully Parallel SQL + Full SQL Syntax
   SQL Operations are translated into thousands of jobs via custom Distribution of Work (DoW):
     o Parallel/Distributed Data Access
     o Parallel/Distributed Joins (Inner, Outer)
     o Parallel/Distributed Sub-queries (From, Where, Select)
     o Parallel/Distributed Group By, Distinct, and Aggregation
     o Extensible with Parallel/Distributed User Defined Functions
   Results are returned to User Module in Reduce Phase

12. InfiniDB Data Partitioning
   2-Dimensional Partitioning Model
   Vertical Partitioning by Column
     o Not Column-Family (no relation to HBase)
     o Only do I/O for columns requested
   Horizontal Partitioning by range of rows
     o Meta-data stored within in-memory structure

13. InfiniDB Data Partitioning
   Partition elimination can occur based on:
     o Columns not included in SQL.
     o Based on filter expressed within query.
     o Based on filter expressed on a join table: Table1 filter can drive Table2 I/O elimination
     o Intersection between filters: Filter1 and Filter2 does I/O on intersection

14. Column Restriction and Projection
   [Diagram labels: Column # Four, Column # Six, Column # Seventeen; Extent # 5, Extent # 27; Filter 1, Filter 2, Filter 3; Projection]
   Automatic Vertical Partitioning + Horizontal Partitioning
   Just-In-Time Materialization

15. Additional I/O Efficiency
   Techniques to Avoid Unnecessary I/O
     o Vertical Partitioning: read only the columns required
     o Horizontal Partition: focus on the rows required
     o Just-in-time materialization
   Techniques for Efficient I/O
     o Columnar compression reduces I/O from disk
     o Global data buffer cache can reduce disk I/O (in-memory)
     o Avoidance of Random I/O

16. InfiniDB Design Principles
   Scalable
   Fast
   Simple

17. (My)SQL for Hadoop - Engine=InfiniDB
   InfiniDB uses standard Engine=InfiniDB syntax:

     CREATE TABLE `game_warehouse`.`dim_title` (
       `id` INT,
       `name` VARCHAR(45),
       `publisher` VARCHAR(45),
       `release_date` DATE,
       `language` INT,
       `platform_name` VARCHAR(45),
       `version` VARCHAR(45)
     ) ENGINE=InfiniDB;

18. (My)SQL for Hadoop
   Leverage existing tools that connect to MySQL
   Expose Structured Data to the Business
   Familiar User Privilege Administration
   MicroStrategy, JasperSoft, Pentaho
   MySQL ease of use + Hadoop Scale + Columnar Performance

19. Syntax Support
   Broad MySQL SQL syntax
   Analytic/windowing functions included with InfiniDB 4
   No indexing needed. Partitioning is automatic.
   [Figure: InfiniDB Supported Syntax]

20. When to Use InfiniDB for Hadoop
   Query Size (Vision/Scope) defines workloads:
   [Chart: axis "Query Size/Vision/Scope" with ticks 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000. Regions labeled OLTP/NoSQL Workloads and ROLAP/Analytic/Reporting Workloads.]
   General purpose DBMS missed the target (dated database technology generally not optimal)

21. What is your typical query?
   [Chart: axis "Query Vision/Scope" with ticks 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000. Regions labeled OLTP/NoSQL Workloads and Analytic Workloads.]
   There is no average query.
   The challenges are at the extremes:
     o The challenge of high concurrency levels with small queries.
     o The challenge of latency for very large queries.
   Most use cases imply multiple data technologies.

22. Columnar Appropriate Workloads
   [Chart: axis "Query Vision/Scope" with ticks 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000]
   OLTP/NoSQL Workloads: Pure Columnar about 10x worse I/O for single record lookups
   ROLAP/Analytic/Reporting Workloads: Pure Columnar about 10x better I/O for large data access patterns

23. Columnar Appropriate Workloads
   Data Dimensions and InfiniDB for Hadoop
   [Diagram labels: Unstructured Data, Structured; Schema on read, Schema on write; Small Queries, Large Queries; Transform (ETL), Targeted Extract; Pre-defined queries, Ad-hoc queries]

24. InfiniDB Query Performance
   Percona Star Schema Benchmark (SSB)
     o Q1 Series: 2 table Joins
     o Q2 Series: 3 table Joins
     o Q3 Series: 4 table Joins
     o Q5 Series: 5 table Joins

25. 1000 Genomes Data Set
   289 Billion Rows
   Fast load Rate
     o Millions rows/sec
     o Billions rows/hour
   Scalable load rate
   1000 Genomes data set on AWS

26. 1000 Genomes Data Set
   ~ 24 trillion base nucleotide values
   Scaling: 4 > 8 > 16 Performance Modules
   Fast Analytics: Millions of rows/second
   Scalable Analytics
   Automatic parallelism
   [Chart labels: Seconds; per core; Performance Modules (PMs) Active]
   Figure 2 - TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein

27. Impala-InfiniDB Benchmark (Piwik Data Set)
   Figure 1 - Piwik Standard Query Performance
   Figure 2 - Piwik Ad-Hoc Query Performance
   Piwik is an Open Source alternative to Google Analytics
   Queries 1-6 offered are Piwik production queries
   Queries 7-9 are additional ad-hoc queries covering all data
   Amazon 5-node cluster

28. Columnar Appropriate Workloads
   Data Dimensions and InfiniDB for Hadoop
   [Diagram labels: Structured; Schema on read, Schema on write; Small Queries, Large Queries; Transform (ETL), Targeted Extract; Ad-hoc queries; InfiniDB]
   Figure 2 - Piwik Ad-Hoc Query Performance

29. Download Today
   InfiniDB and InfiniDB for Hadoop: www.calpont.com
   InfiniDB for the Cloud: InfiniDB AMI in any AWS Availability Zone/Region
   Services Inquiries: [email protected]
   Twitter: @InfiniDB @jtommaney
   2013 Calpont Corporation. Calpont, the Calpont logo, InfiniDB, and the InfiniDB logo are trademarks of Calpont Corporation. AWS is a trademark of Amazon.com, Inc., and Apache Hadoop is a trademark of the Apache Software Foundation. Other product names and logos may be trademarks of their respective owners.
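
Editor's sketch (not part of the original deck). Slides 9 through 13 describe SQL being broken into discrete primitives, primitives being queued to a fixed number of threads on each Performance Module (12 PMs with 8 cores each giving 96 engines), and partition elimination driven by in-memory metadata. The Python below is a minimal illustration of that flow, not InfiniDB's implementation. It assumes the in-memory metadata is a per-extent minimum and maximum value, and every name in it (Extent, make_extents, eliminate, scan_primitive, count_between) is hypothetical. A thread pool stands in for the PM thread queue; it shows the structure only, since Python threads do not give real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    """A horizontal slice (range of rows) of one column, plus in-memory metadata."""
    extent_id: int
    min_value: int   # assumed metadata: smallest value stored in this extent
    max_value: int   # assumed metadata: largest value stored in this extent
    values: List[int]


def make_extents(values: List[int], rows_per_extent: int) -> List[Extent]:
    """Split one column into fixed-size row ranges and record min/max for each."""
    extents = []
    for start in range(0, len(values), rows_per_extent):
        chunk = values[start:start + rows_per_extent]
        extents.append(Extent(len(extents), min(chunk), max(chunk), chunk))
    return extents


def eliminate(extents: List[Extent], low: int, high: int) -> List[Extent]:
    """Keep only extents whose [min, max] range can overlap the filter [low, high]."""
    return [e for e in extents if e.max_value >= low and e.min_value <= high]


def scan_primitive(extent: Extent, low: int, high: int) -> int:
    """One discrete job: count the matching rows inside a single extent."""
    return sum(1 for v in extent.values if low <= v <= high)


def count_between(extents: List[Extent], low: int, high: int, threads: int) -> int:
    """Issue one primitive per surviving extent to a fixed-size pool, then reduce."""
    survivors = eliminate(extents, low, high)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partial_counts = pool.map(lambda e: scan_primitive(e, low, high), survivors)
        return sum(partial_counts)


if __name__ == "__main__":
    # Hypothetical column: 1,000 ascending values, split into 10 extents of 100 rows.
    column = list(range(1000))
    extents = make_extents(column, rows_per_extent=100)

    # Only extents whose value range overlaps 250..420 need any I/O.
    print([e.extent_id for e in eliminate(extents, low=250, high=420)])  # [2, 3, 4]
    print(count_between(extents, 250, 420, threads=8))                   # 171
```

In this toy data set the filter touches 3 of 10 extents, so 7 extents are never read. The benefit depends on how well the stored values cluster within extents; with randomly ordered values, most extent ranges would overlap the filter and little would be eliminated.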
