Taming Big Data with Big SQL 3.0


Transcript of Taming Big Data with Big SQL 3.0

1. Taming Big Data with Big SQL. Session 3477. Berni Schiefer ([email protected]), Bert Van der Linden ([email protected]). © 2014 IBM Corporation.

2. Please Note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

3. Agenda
- Internet of Things & Big Data
- BigInsights
- Big SQL 3.0: Architecture, Performance, Best practices

4. Systems of Insight from Data Emanating from the Internet of Things (IoT)

5. How Big is the Internet of Things?

6. Where is Big Data coming from?
- 2+ billion people on the Web by end of 2011
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters in 2009, 200M by 2014
- 12+ TBs of tweet data every day
- 25+ TBs of log data every day
- ? TBs of data every day

7. A major gas and electric utility has 10 million meters. The meters used to be read once a month, then once an hour. Now they are installing smart meters that are read every 15 minutes = 350 billion transactions a year.

8. The Big Data Conundrum
- The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps growing
- The percentage of available data an enterprise can analyze is decreasing
- This means enterprises are getting more naive over time

9. Big Data is All Data and All Paradigms
- Transactional & Application Data: Volume, Structured, Throughput
- Machine Data: Velocity, Structured, Ingestion
- Social Data: Variety, Unstructured, Veracity
- Enterprise Content: Variety, Unstructured, Volume

10. Big Data Landing zone eco-system

11. Big Data Landing zone eco-system, with Big SQL

12. InfoSphere BigInsights builds on open source Hadoop capabilities for enterprise class deployments
- (Diagram: the IBM Big Data Platform, with analytic applications such as BI/Reporting, Exploration/Visualization, Functional Apps, Industry Apps, Predictive Analytics and Content Analytics on top of Stream Computing, Data Warehouse and InfoSphere BigInsights; BigInsights combines open source Hadoop components with Big SQL, advanced engines, connectors, workload optimization, administration and security, plus visualization and discovery, development tools, accelerators, systems management, and Information Integration & Governance.)
- Business benefits: quicker time-to-value due to IBM technology and support; reduced operational risk; enhanced business knowledge with a flexible analytical platform; leverages and complements existing software
13. Announcing BigInsights v3.0 (© 2013 IBM Corporation)
- Apache Hadoop as the base
- Standard Edition (breadth of capabilities): spreadsheet-style tool, web console, dashboards, pre-built applications, Eclipse tooling, RDBMS connectivity, Big SQL 3.0, Jaql, platform enhancements, ...
- Enterprise Edition (enterprise class): Accelerators, GPFS FPO, Adaptive MapReduce, text analytics, enterprise integration, monitoring and alerts, Big R, InfoSphere Streams*, Watson Explorer*, Cognos BI*, ...
- * Limited use license included
- Available for Linux on POWER (Redhat) and Intel x64 Linux (Redhat/SUSE)

14. Common Hadoop core in all Hadoop distributions (current as of April 27, 2014)

Component   BigInsights 3.0   HortonWorks HDP 2.0   MapR 3.1   Pivotal HD 1.1   Cloudera CDH5
Hadoop      2.2               2.2                   1.0.3      2.0.5 *          2.3
HBase       0.96.0            0.96.0                0.94.13    0.94.8           0.96.1
Hive        0.12.0            0.12                  0.11       0.11.0           0.12.0
Pig         0.12.0            0.12                  0.11.0     0.10.1           0.12.0
Zookeeper   3.4.5             3.4.5                 3.4.5      3.4.5            3.4.5
Oozie       4.0.0             4.0.0                 3.3.2      3.3.2            4.0.0
Avro        1.7.5             X                     X          X                1.7.5
Flume       1.4.0             1.4.0                 1.4.0      1.3.1            1.4.0
Sqoop       1.4.4             1.4.4                 1.4.4      1.4.2            1.4.4

15. What is Big SQL 3.0?
- Comprehensive SQL functionality: IBM SQL/PL support, including stored procedures (SQL-bodied and external) and functions (SQL-bodied and external); IBM Data Server JDBC and ODBC drivers
- Leverages the advanced IBM SQL compiler/runtime: high performance native (C++) runtime
- Replaces MapReduce with an advanced message-passing runtime: data flows between nodes without requiring persisting intermediate results; continuously running daemons; advanced workload management allows resources to remain constrained; low latency, high throughput
- (Diagram: a SQL-based application uses the IBM data server client to reach the Big SQL engine and SQL MPP runtime in InfoSphere BigInsights, over data sources such as CSV, Seq, Parquet, RC, Avro, ORC, JSON and custom formats.)

16. Big SQL 3.0 Architecture
- Head (coordinator) node: listens for the JDBC/ODBC connections, compiles and optimizes the query, and coordinates the execution of the query
- Big SQL worker processes reside on compute nodes (some or all)
- Worker nodes stream data between each other as needed
- Workers can spill large data sets to local disk if needed, which allows Big SQL to work with data sets larger than available memory
- (Diagram: management nodes host the Big SQL coordinator, Hive metastore, NameNode and JobTracker; each compute node runs a TaskTracker, DataNode and a Big SQL worker over GPFS/HDFS.)

17. Big SQL 3.0 works with Hadoop
- All data is Hadoop data: in files in HDFS (SEQ, RC, delimited, Parquet, ...); you never need to copy data to a proprietary representation
- All data is cataloged in the Hive metastore: it is the Hadoop catalog, and it is flexible and extensible
- All Hadoop data is in a Hadoop filesystem: HDFS or GPFS-FPO
- (A hedged CREATE HADOOP TABLE sketch follows slide 18 below.)

18. Big SQL 3.0 Architecture (cont.)
- Big SQL's runtime execution engine is all native code
- For common table formats a native I/O engine is used (e.g. delimited, RC, SEQ, Parquet, ...)
- For all others, a Java I/O engine is used: this maximizes compatibility with existing tables and allows for custom file formats and SerDes
- All Big SQL built-in functions are native code
- Customer-built UDx's can be developed in C++ or Java
- Existing Big SQL UDFs can be used with a slight change in how they are registered
- (Diagram: each compute node's Big SQL worker hosts the runtime, the native I/O engine, the Java I/O engine with SerDe and I/O format support, and native and Java UDFs.)
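[Editor's note] To make slides 15-17 concrete, here is a minimal, hedged sketch of exposing existing HDFS files as a Big SQL table so the definition is recorded in the Hive metastore without copying the data. The table name, column names and LOCATION path are illustrative (not from the deck), and exact clauses vary by file format and release.

-- Illustrative sketch: declare existing delimited files in HDFS as a
-- Big SQL table; the data stays in place and the definition is
-- cataloged in the Hive metastore.
CREATE HADOOP TABLE sales.daily_sales (
  perkey        INT,
  prodkey       INT,
  storekey      INT,
  quantity_sold INT,
  price         DECIMAL(10,2),
  cost          DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/bigsql/sales/daily_sales';

Because the definition lives in the Hive metastore, the same files remain readable by Hive, Pig, or MapReduce jobs, which is the point slides 15-17 are making.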
19. Big SQL 3.0 Enterprise security
- Users may be authenticated via the operating system, Lightweight Directory Access Protocol (LDAP), or Kerberos
- User authorization mechanisms include full GRANT/REVOKE based security; group and role based hierarchical security; and object level, column level, or row level (fine-grained) access controls
- Auditing: you may define audit policies and track user activity
- Transport Layer Security (TLS): protects the integrity and confidentiality of data between the client and Big SQL

20. Row Based Access Control - 4 easy steps

1) Create and grant access and roles *

CREATE ROLE BRANCH_A_ROLE
GRANT ROLE BRANCH_A_ROLE TO USER newton
GRANT SELECT ON BRANCH_TBL TO USER newton

2) Create permissions *

CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
  FOR ROWS WHERE (VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A_ROLE') = 1
                  AND BRANCH_TBL.BRANCH_NAME = 'Branch_A')
  ENFORCED FOR ALL ACCESS
  ENABLE

3) Enable access control *

ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL

4) Select as the Branch_A user

CONNECT TO TESTDB USER newton
SELECT * FROM BRANCH_TBL

EMP_NO  FIRST_NAME  BRANCH_NAME
------  ----------  -----------
2       Chris       Branch_A
3       Paula       Branch_A
5       Pete        Branch_A
8       Chrissie    Branch_A

4 record(s) selected.

Data (full contents of the table, without row access control):

EMP_NO  FIRST_NAME  BRANCH_NAME
------  ----------  -----------
1       Steve       Branch_B
2       Chris       Branch_A
3       Paula       Branch_A
4       Craig       Branch_B
5       Pete        Branch_A
6       Stephanie   Branch_B
7       Julie       Branch_B
8       Chrissie    Branch_A

* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
21. Column Based Access Control

1) Create and grant access and roles *

CREATE ROLE MANAGER
CREATE ROLE EMPLOYEE
GRANT SELECT ON SAL_TBL TO USER socrates
GRANT SELECT ON SAL_TBL TO USER newton
GRANT ROLE MANAGER TO USER socrates
GRANT ROLE EMPLOYEE TO USER newton

2) Create permissions *

CREATE MASK SALARY_MASK ON SAL_TBL FOR
  COLUMN SALARY RETURN
  CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'MANAGER') = 1
       THEN SALARY
       ELSE 0.00
  END
  ENABLE

3) Enable access control *

ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL

4a) Select as an EMPLOYEE

CONNECT TO TESTDB USER newton
SELECT * FROM SAL_TBL

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       0
2       Chris       0
3       Paula       0

3 record(s) selected.

4b) Select as a MANAGER

CONNECT TO TESTDB USER socrates
SELECT * FROM SAL_TBL

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       250000
2       Chris       200000
3       Paula       1000000

3 record(s) selected.

Data (full contents of the table, without column masking):

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       250000
2       Chris       200000
3       Paula       1000000

* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
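[Editor's note] Both access-control slides note that steps 1-3 require SECADM authority. As a hedged aside, the deck does not show how that authority is granted; the statement below is standard DB2 syntax and the user name is illustrative, so treat it as an assumption about a typical Big SQL setup rather than something shown on the slides.

-- Illustrative only: designate a security administrator who can then
-- create the roles, row permissions, and column masks above.
-- Run by a user who already holds SECADM (e.g. the instance owner).
GRANT SECADM ON DATABASE TO USER secadm1;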
22. Big SQL 3.0 Other enterprise features
- Federation: join between your Hadoop data and other external relational platforms; the optimizer determines the most efficient execution path
- Open integration across business analytic tools: IBM Optim Data Studio performance tool portfolio; superior enablement for IBM software (e.g. Cognos); enhanced support by 3rd party software (e.g. Microstrategy)
- Mixed workload cluster management: capacity sharing with the rest of the cluster; specify %cpu and %memory to dedicate to Big SQL 3.0; SQL based workload management; integration with Platform Symphony to manage mixed cluster workloads
- Support for standard development tools

23. Workload Management

(1) Create service classes:

create service class BIGDATAWORK
create service class HIGHPRIWORK under BIGDATAWORK
create service class LOWPRIWORK under BIGDATAWORK

(2) Identify workloads and associate them to service classes:

create workload SALES_WL current client_appname('SalesSys')
  service class HIGHPRIWORK
create workload ITEMCOUNT_WL current client_appname('InventorySys')
  service class LOWPRIWORK

(3a) Avoid thrashing by queueing low priority work:

create threshold LOW_CONCURRENT for service class LOWPRIWORK
  under BIGDATAWORK activities enforcement database enable
  when concurrentdbcoordactivities > 5
  and queued activities unbounded continue

(3b) Stop high priority jobs if the SLA cannot be met:

create threshold HIGH_CONCURRENT for service class HEAVYQUERIES
  under BIGDATAWORK activities enforcement database enable
  when concurrentdbcoordactivities > 30
  and queued activities > 0 stop execution

(4a) Stop very long running jobs:

create threshold LOWPRI_WL_TIMEOUT for service class LOWPRIWORK
  under BIGDATAWORK activities enforcement database enable
  when activitytotaltime > 30 minutes stop execution

(4b) Stop jobs that return too many rows:

create threshold TOO_MANY_ROWS_RETURNED for service class HIGHPRIWORK
  under BIGDATAWORK enforcement database
  when sqlrowsreturned > 30 stop execution

(5) Collect data for long running jobs:

create threshold LONGRUNINVENTORYACTIVITIES for service class LOWPRIWORK
  activities enforcement database
  when activitytotaltime > 15 minutes collect activity data with details continue

(6) Report on system activity (a hedged sketch of enabling and querying this monitor follows slide 26 below):

create event monitor BIGDATAMONACT for activities write to table

24. Using existing standard SQL tools: Eclipse
- Using existing SQL tooling against big data
- Same setup as for existing SQL sources
- Support for standard authentication

25. Using existing standard SQL tools: SQuirreL SQL
- Using existing SQL tooling against big data
- Support for authentication (not supported for Hive, BUT supported by Big SQL!)

26. Using BigSheets in BigInsights: data discovery
- Discovery and analytics in a spreadsheet-like environment.
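[Editor's note] Step (6) on slide 23 only creates the activities event monitor. The following is a hedged sketch of how captured activity data is typically examined with standard DB2 event-monitor behavior; the target table name and columns are assumptions (a WRITE TO TABLE activities monitor usually materializes an ACTIVITY_<monitor> table), so names may differ in a given installation.

-- Turn the monitor on so threshold (5) actually captures activity data.
SET EVENT MONITOR BIGDATAMONACT STATE 1;

-- Inspect the long-running activities that were collected.
SELECT appl_id, activity_id, time_started, time_completed
FROM   ACTIVITY_BIGDATAMONACT
ORDER BY time_started DESC
FETCH FIRST 20 ROWS ONLY;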
27. Big SQL 3.0 Performance
- Query rewrites: exhaustive query rewrite capabilities (~150 query transformations); leverages additional metadata such as constraints and nullability
- Optimization: statistics- and heuristic-driven query optimization; a query optimizer based upon decades of IBM RDBMS experience; hundreds or thousands of access plan options are considered
- Tools and metrics: highly detailed explain plans and query diagnostic tools; an extensive number of available performance metrics
- Example query from the slide:

SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC

- (The slide traces this query through query transformation and access plan generation, showing nested-loop, hash-join and zigzag-join orderings over PERIOD, DAILY_SALES, PRODUCT and STORE, down to the runtime access section executed as multiple threads connected by table queues.)

28. Statistics are key to performance
- Table statistics: cardinality (count), number of files, total file size
- Column statistics (this applies to column group statistics also): minimum value, maximum value, cardinality (non-nulls), distribution (number of distinct values), number of null values, average length of the column value (for string columns), histogram, most frequent values (MFV)
- (A hedged sketch of supplying statistics and constraint metadata follows slide 29 below.)

29. Performance, Benchmarking, Benchmarketing
- Performance matters to customers
- Benchmarking appeals to engineers and drives product innovation
- Benchmarketing is used to convey performance in a memorable and appealing way
- SQL over Hadoop is in the Wild West of benchmarketing: 100x claims! Compared to what? Conforming to what rules?
- The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations: formed in August 1988; TPC-H and TPC-DS are the most relevant to SQL over Hadoop; the read/write nature of those workloads is not suitable for HDFS
- The Big Data Benchmarking Community (BDBC) has been formed
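[Editor's note] A hedged sketch of how the metadata from slides 27-28 is typically supplied to the Big SQL optimizer. The schema, table and column names reuse the illustrative query above and are assumptions, and clause details vary by release, so treat this as an outline rather than exact syntax from the deck.

-- Gather table- and column-level statistics for the optimizer
-- (illustrative table/column names).
ANALYZE TABLE bigsql.daily_sales
  COMPUTE STATISTICS FOR COLUMNS perkey, prodkey, storekey, quantity_sold;

-- Informational (NOT ENFORCED) constraints give the rewrite engine extra
-- metadata without the cost of validating every row; assumes a matching
-- primary key has been declared on bigsql.store.
ALTER TABLE bigsql.daily_sales
  ADD CONSTRAINT fk_store FOREIGN KEY (storekey)
  REFERENCES bigsql.store (storekey)
  NOT ENFORCED ENABLE QUERY OPTIMIZATION;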
30. Power and Performance of Standard SQL
- Everyone loves performance numbers, but that's not the whole story: how much work do you have to do to achieve those numbers?
- A portion of our internal performance numbers are based upon read-only versions of TPC benchmarks
- Big SQL is capable of executing all 22 TPC-H queries without modification, and all 99 TPC-DS queries without modification
- The slide contrasts TPC-H query 21 as originally written with the version rewritten for Hive. The original query:

SELECT s_name, COUNT(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (SELECT *
              FROM lineitem l2
              WHERE l2.l_orderkey = l1.l_orderkey
                AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (SELECT *
                  FROM lineitem l3
                  WHERE l3.l_orderkey = l1.l_orderkey
                    AND l3.l_suppkey <> l1.l_suppkey
                    AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait DESC, s_name

- The version rewritten for Hive replaces the correlated EXISTS/NOT EXISTS predicates with a cascade of nested subqueries: per-order COUNT(DISTINCT l_suppkey) and MAX(l_suppkey) aggregates over lineitem are joined back via RIGHT OUTER JOINs, explicit JOINs connect nation, supplier, lineitem and orders (with n_name = 'INDONESIA' substituted for the parameter), and the result is filtered on the count/max columns before the final GROUP BY s_name and ORDER BY numwait DESC, s_name. (The slide shows the full rewritten query; only its shape is summarized here.)

31. Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries
- Big SQL is up to 41x faster than Hive 0.12
- * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the 1TB Classic BI Workload in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.

32. Comparing Big SQL and Hive 0.12 for Decision Support Queries
- Big SQL is 10x faster than Hive 0.12 (total elapsed time)
- * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the 1TB Modern BI Workload in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.

33. How many times faster is Big SQL than Hive 0.12?
- Max speedup of 74x, average speedup of 20x (queries sorted by speedup ratio, worst to best)
- * Same internal test, workload and 9-node configuration as slide 32 (1TB Modern BI Workload derived from TPC-DS, 43 queries); results as of April 22, 2014.
34. Big SQL 3.0 Best Practices
- Ensure you have a homogeneous and balanced cluster: utilize the IBM reference architecture
- Choose an optimized file format (if possible): ORC or Parquet
- Choose appropriate data types: use the smallest and most precise datatype available
- Define informational constraints: primary key, foreign key, check constraints
- Ensure you have good statistics: current and comprehensive
- Use the full power of SQL available to you: don't constrain yourself to Hive syntax/capability

35. BigInsights Big SQL 3.0: Summary
- Big SQL provides rich, robust, standards-based SQL support for data stored in BigInsights: uses the IBM common client ODBC/JDBC drivers
- Big SQL fully integrates with SQL applications and tools: existing queries run with no or few modifications*; existing JDBC- and ODBC-compliant tools can be leveraged
- Big SQL provides faster and more reliable performance: Big SQL uses more efficient access paths to the data; queries processed by Big SQL no longer need to use MapReduce; Big SQL is optimized to move data over the network more efficiently
- Big SQL provides enterprise grade data management: security, auditing, workload management, ...

36. Questions?

37. We Value Your Feedback
- Don't forget to submit your Impact session and speaker feedback! Your feedback is very important to us; we use it to continually improve the conference.
- Use the Conference Mobile App or the online Agenda Builder to quickly submit your survey: navigate to Surveys to see a view of surveys for sessions you've attended.

38. Thank You

39. Legal Disclaimer
- © IBM Corporation 2014. All Rights Reserved.
- The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM's current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
- References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
- If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
- If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete: All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
- Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus Sametime Unyte). Subsequent references can drop "IBM" but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the registered or trademark symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation: IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
- If you reference Adobe in the text, please mark the first use and include the following; otherwise delete: Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
- If you reference Java in the text, please mark the first use and include the following; otherwise delete: Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
- If you reference Microsoft and/or Windows in the text, please mark the first use and include the following, as applicable; otherwise delete: Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
- If you reference Intel and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete: Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
- If you reference UNIX in the text, please mark the first use and include the following; otherwise delete: UNIX is a registered trademark of The Open Group in the United States and other countries.
- If you reference Linux in your presentation, please mark the first use and include the following; otherwise delete: Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
- If the text/graphics include screenshots, no actual IBM employee names may be used (even your own); if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.
40. Background: What is Hadoop?

41. What is Hadoop?
- Hadoop is not a single piece of software; you can't "install hadoop"
- It is an ecosystem of software that works together: Hadoop Core (APIs), HDFS (file system), MapReduce (data processing framework), Hive (SQL access), HBase (NoSQL database), Sqoop (data movement), Oozie (job workflow), ... There is a LOT of Hadoop software
- However, there is one common component they all build on: HDFS

42. HDFS configuration (shared-nothing cluster)
- (Diagram: one NameNode and many DataNodes, each DataNode with its own local disks.)
- NN = NameNode, which manages all the metadata
- DN = DataNode, which reads/writes the file data

43. HDFS
- Driving principles: files are stored across the entire cluster; programs are brought to the data, not the data to the programs
- The distributed file system (DFS) stores blocks across the whole cluster: blocks of a single file are distributed across the cluster; a given block is typically replicated for resiliency; just like a regular file system, the contents of a file are up to the application
- (Diagram: a logical file divided into blocks 1-4, with each block replicated on several nodes of the cluster.)

44. Hadoop I/O
- Hadoop (HDFS) doesn't dictate file content/structure; it is just a filesystem! It provides standard APIs to list directories, open files, delete files, etc. In particular it allows your task to ask "where does each block live?"
- Hadoop provides a framework for creating splittable data sources: a data source is typically file(s), but not necessarily; a large input is split into pieces, each piece to be processed in parallel; each split indicates the host(s) on which that split can be found; for files, a split typically refers to an HDFS block, but not necessarily
- (Diagram: a logical file divided into splits, each processed by an application instance on the cluster, producing results.)

45. InputFormat
- This splitting process is encapsulated in the InputFormat interface: Hadoop has a large library of InputFormats for various purposes, and you can create and provide your own as well
- An InputFormat does the following: it is configured with a set of name/value pair properties; once configured you can ask it for a list of InputSplits; each input split has a list of hosts on which the data for the split is recommended to be processed (optional) and a size in bytes (optional); given an InputSplit, an InputFormat can produce a RecordReader
- A RecordReader does the following: acts as an input stream to read the contents of the split; produces a stream of records; there is no fixed definition of a "record", it depends upon the input type
- Let's look at an example of an InputFormat

46. InputFormat example: TextInputFormat
- Purpose: reads input file(s) line by line; each read produces one line of text
- Configuration: configured with the names of one or more (HDFS) files to process
- Splits: each split it produces represents a single HDFS block of a file
- RecordReader: when opened, finds the first newline of the block it is to read; each read produces the next available line of text in the block; it may read into the next block to ensure the last line is fully read, even if that block is physically located on another host!
- (Diagram: a logical text file divided into splits, readers, and records (lines of text).)
47. Hadoop MapReduce
- MapReduce is a way of writing parallel processing programs
- Built around InputFormats (and OutputFormats)
- Programs are written in two pieces: Map and Reduce
- Programs are submitted to the MapReduce job scheduler, the JobTracker: the JobTracker asks the InputFormat for its splits; for each split, it tries to schedule the processing on a host on which the split lives; hosts are chosen based upon available processing resources
- The program is shipped to a host and given a split to process
- The output of the program is written back to HDFS

48. MapReduce: Mappers
- Mappers are small programs (typically), distributed across the cluster, local to the data
- Each is handed a portion of the input data (called a split)
- Each mapper parses, filters, and/or transforms its input
- Produces grouped (key, value) pairs
- (Diagram, map phase: a logical input file feeds map+sort tasks 1-4, whose output is later copied and merged by reducers that write logical output files back to the DFS.)

49. MapReduce: The Shuffle
- The shuffle is transparently orchestrated by MapReduce
- The output of each mapper is locally grouped together by key
- One node is chosen to process the data for each unique key
- (Diagram: the copy/merge step between the map+sort tasks and the reducers.)

50. MapReduce: Reduce Phase
- Reducers are small programs (typically) that aggregate all of the values for the key they are responsible for
- Each reducer writes its output to its own file
- (Diagram, reduce phase: merged mapper output flows into the reducers, which write logical output files to the DFS.)

51. Joins in MapReduce
- Hadoop is used to group data together at the same reducer based upon the join key
- Mappers read blocks from each table in the join; the key is the value of the join key, the value is the record to be joined
- Each reducer receives a mix of records from each table with the same join key
- The reducers produce the results of the join
- Example query from the slide (one reducer per department key):

select e.fname, e.lname, d.dept_name
from employees e, depts d
where e.salary > 30000
  and d.dept_id = e.dept_id
52. Joins in MapReduce (cont.)
- For N-way joins involving different join keys, multiple jobs are used: a first map/reduce job joins employees and depts on dept_id and writes temporary files, and a second job joins that intermediate result with emp_phones on emp_id to produce the final results.

select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number
from employees e, depts d, emp_phones p
where e.salary > 30000
  and d.dept_id = e.dept_id
  and p.emp_id = e.emp_id