Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
IIH-5529
Olivier Bernin & Rizaldy Ignacio
October 30th
© 2014 IBM Corporation
Please Note
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Agenda
• Big Data & the Hybrid Data Warehouse
  A quick recap on Big Data
  The Hybrid Data Warehouse
• Big SQL 3.0 – SQL on Hadoop without compromises
  What is Big SQL?
  Features
• Setting up a Hybrid Data Warehouse with Big SQL
  Federation overview
  Configuration
Information is at the Center of a New Wave of Opportunity…
• 2009: 800,000 petabytes → 2020: 35 zettabytes – 44x as much data and content over the coming decade
• 80% of the world’s data is unstructured

… And Organizations Need Deeper Insights
• 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
• 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
• 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
Hadoop
• Framework / ecosystem of components aimed at distributing work across (very) big clusters for parallel processing
• Components
  A distributed file system running on commodity hardware (HDFS)
  The MapReduce programming model & associated APIs
  Hive, HBase, Sqoop, etc.
• Hadoop nodes store and process data
  Bringing the program to the data
  Easy scalability – just add more nodes
Bringing the Program to the Data
[Diagram: a logical file is split into blocks (1, 2, 3) distributed across cluster nodes; the application ships compute to the data, and MapReduce jobs run locally on each split to produce the result.]
Hadoop for Big Data?
• Well suited for Big Data
  Distributed, parallel execution
  Cheap(er) per storage unit
  Flexible: no schema
  Highly available
• But …
  Data needs to be integrated to generate insights
  Prioritization of applications
  Capitalizing on investment
[Diagram: workloads mapped across the two systems – high priority / high performance SQL applications; low priority, low performance SQL applications; semi-structured data applications; experimental applications]
Big Data Reference Architecture
[Diagram: all data sources (raw, structured, text, video/audio, network/sensor) feed Information Ingestion and Operational Information; zones include a Real-time Analytic Zone (stream processing, Streams), a Landing Area, Analytics Zone and Archive, and the Hybrid Data Warehouse (Exploration, Integrated Warehouse, and Mart Zones) covering discovery, deep reflection, operational and predictive workloads (text analytics, data mining, entity analytics, machine learning); Data Integration and Master Data underpin the whole, surfaced through Decision Management, BI and Predictive Analytics, and Analytic Applications.]
Hybrid Data Warehouse – Business Scenarios
1. Application and Data Portability
• Analytic Sandbox
  Analytic dev / test
  Exploratory analytics
  Lower cost analytics
  Unstructured data
• DW Off-load
  Complete off-load
  Low priority applications
  Cold data
• Hadoop for ETL
  Big or unstructured data sets
  Using MapReduce
2. Hybrid Data Warehouse
• Joining with Archive
  Archived & hot data
• Correlating data across multiple sources and types
  Structured & unstructured
• Access Data where Optimal
  Where it makes the most performance sense
SQL on Hadoop?
• Hadoop / HDFS great for storing lots of data cheaply
• But …
  Requires strong programming expertise
  Steep learning curve
• Yet many, if not most, use cases are about structured data
• Why not use SQL in the places where its strengths shine?
  Familiar, widely used syntax
  Separation of what you want vs. how to get it
  Robust ecosystem of tools
Big SQL 3.0
• IBM BigInsights’ SQL-on-Hadoop solution
  BigInsights is IBM’s enterprise-ready Hadoop distribution
  Big SQL is a standard component of BigInsights
• Integrates seamlessly with other components
• Big SQL applies SQL to your data in Hadoop
  Performant
  Rich SQL
  Compliance
What is Big SQL?
• Architected for low latency and high throughput
• MapReduce replaced with a modern MPP architecture
  Compiler and runtime are native code
  Worker daemons live on cluster nodes
  Continuously running
• Processing happens locally at the data
  Message passing allows data to flow directly between nodes
• Operations occur in memory with the ability to spill to disk
  Supports aggregations and sorts larger than available RAM
[Diagram: an SQL-based application connects through the IBM Data Server Client to Big SQL’s SQL MPP Runtime inside InfoSphere BigInsights; supported data sources include Parquet, CSV, Seq, RC, Avro, ORC, JSON and custom formats.]
Architecture
[Diagram: the management nodes host the Big SQL Master Node and Big SQL Scheduler, alongside the database service, Hive Metastore, Hive Server, a DDL FMP and UDF FMPs. Each compute node hosts a Big SQL Worker Node with Java I/O and native I/O FMPs, a UDF FMP and temp data, co-located with the HDFS Data Node, MR Task Tracker, HDFS data and other services.]
*FMP = Fenced mode process
Open Architecture
• No proprietary storage format
• A “table” is simply a view on your Hadoop data
• Table definitions shared with Hive
  The Hive Metastore catalogs table definitions
  Reading/writing data logic is shared with Hive
  Definitions can be shared across the Hadoop ecosystem
• Can still use MapReduce & other Hadoop ecosystem tools
  Sometimes SQL is not the answer
[Diagram: Pig, Sqoop and Big SQL all access the Hadoop cluster through the Hive APIs and share the Hive Metastore.]
Big SQL 3.0 – At a Glance
• Application Portability & Integration
  Data shared with the Hadoop ecosystem
  Comprehensive file format support
  Superior enablement of IBM software
  Enhanced by third party software
• Performance
  Modern MPP runtime
  Powerful SQL query rewriter
  Cost based optimizer
  Optimized for concurrent user throughput
  Results not constrained by memory
• Federation
  Distributed requests to multiple data sources within a single SQL statement
  Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza
• Enterprise Features
  Advanced security/auditing
  Resource and workload management
  Self tuning memory management
  Comprehensive monitoring
• Rich SQL
  Comprehensive SQL support
  IBM SQL PL compatibility
Next in Jasmine A!
IIH-4666: Big SQL Is Here: Blistering Fast SQL Access To Your Hadoop Data
Rich SQL Support
• Modern SQL capabilities
• All standard join operations
  Standard and ANSI join syntax – inner, outer, and full outer joins
  Equality, non-equality, cross join support
  Multi-value join (WHERE (c1, c2) = …)
  UNION, INTERSECT, EXCEPT
• Full support for subqueries
  In SELECT, FROM, WHERE and HAVING
  Correlated and uncorrelated
  Equality, non-equality subqueries
  EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
• SQL procedures, flow of control, etc.
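The subquery and set-operation constructs listed above are standard SQL, so they can be tried out anywhere. As a runnable sketch (using SQLite from Python purely for illustration; the table names are made up and not from the deck):

```python
import sqlite3

# Illustrative tables to exercise two of the constructs listed above:
# a correlated EXISTS subquery and the EXCEPT set operation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (custkey INTEGER, name TEXT);
CREATE TABLE orders (custkey INTEGER, totalprice REAL);
INSERT INTO customers VALUES (1, 'acme'), (2, 'globex');
INSERT INTO orders VALUES (1, 100.0), (1, 250.0);
""")

# Correlated EXISTS subquery: customers that have at least one order.
with_orders = conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.custkey = c.custkey)
""").fetchall()

# EXCEPT: customers without any order.
no_orders = conn.execute("""
    SELECT custkey FROM customers
    EXCEPT
    SELECT custkey FROM orders
""").fetchall()

print(with_orders)  # [('acme',)]
print(no_orders)    # [(2,)]
```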
Drivers & Tooling Support
• Big SQL 3.0 adopts IBM’s standard Data Server Client drivers
  Robust, standards compliant ODBC, JDBC, and .NET drivers
  Same driver used for DB2 LUW, DB2/z and Informix
  Expands support to numerous languages (Python, Ruby, Perl, etc.)
• Together with rich SQL support, provides application portability
  Allows interaction with external Information Management tools
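From Python, the Data Server Client is used via the ibm_db driver and a DB2-style connection string. A minimal sketch, in which the hostname, port, database name and credentials are placeholders (substitute your cluster’s values) and the actual connect call is shown but not executed:

```python
# Build a DB2-style connection string as accepted by ibm_db.connect().
# All values below are placeholder assumptions, not from the deck.
def build_dsn(host, port, database, user, password):
    return (
        f"DATABASE={database};"
        f"HOSTNAME={host};"
        f"PORT={port};"
        f"PROTOCOL=TCPIP;"
        f"UID={user};"
        f"PWD={password};"
    )

dsn = build_dsn("bigsql-master.example.com", 51000, "bigsql",
                "biadmin", "secret")

# With the ibm_db driver installed and a reachable cluster, the
# connection and a query through a nickname would look like:
#   import ibm_db
#   conn = ibm_db.connect(dsn, "", "")
#   stmt = ibm_db.exec_immediate(conn, "SELECT * FROM TD_CUSTOMERS")
print(dsn)
```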
Federation at a Glance
• Federation enables Big SQL 3.0 to access a variety of existing relational data stores and run queries across the two systems
  Joins, mixed workloads
• Wide range of vendors & systems supported
  DB2 LUW, Oracle, Netezza, Teradata
  More coming …
• Fully integrated into the Big SQL processing engine
  Optimizer will pick the most efficient plan based on the remote data source’s capabilities
Federation – A Closer Look
• Server: a remote data source.
• Wrapper: library handling communication between the engine & the remote database client.
• Nickname: a local name representing an object (table, view, etc.) in the remote data source.
• User mapping: an association between a local and a remote authorization ID.
The Wrapper
• A library to access a particular type of data source
  Typically one wrapper for each external vendor and / or version
• One wrapper to be registered for each data source type, regardless of the number of data sources of that type
• Implemented by means of a library of routines called the wrapper module
  Performs remote query execution using the remote system’s client APIs
• Registered in the federated database using the CREATE WRAPPER statement
The Server
• Defines the properties and options of a specific data source
  Type and version of the data source
  Database name for the data source
  Access parameters & other metadata specific to the data source
• A server is defined using the CREATE SERVER statement
• A wrapper for this type of data source must have been previously registered to the federated server
• Multiple servers can be defined for the same remote data source instance
  e.g. multiple servers may be defined for two different databases from a remote Oracle instance
Nicknames & Mappings
• Nickname
  A nickname “maps” a remote table or view into Big SQL
  Once declared, can be used transparently by the application
• Mappings
  Possible to map other objects from the remote data source locally
  • User IDs
  • Data types
  • Functions
Federation
[Diagram: the client/application submits SQL to the Big SQL 3.0 engine; the optimizer resolves nicknames and the runtime passes remote statements through the wrappers and the data source client to the relational data source.]
Putting it All Together
[Diagram: the Big SQL 3.0 management node hosts the Big SQL engine (optimizer, runtime, wrappers, nicknames and data source client); the client/application submits SQL, remote statements flow to the remote relational data source, and Big SQL worker nodes on the compute nodes process the Hadoop side.]
Practical Example
What the application submits:

  SELECT c.cust_nation, SUM(o.totalprice)
  FROM customers c, orders o
  WHERE o.orderstatus = 'OPEN'
    AND c.custkey = o.custkey
    AND c.mktsegment = 'BUILDING'
  GROUP BY c.cust_nation

What Big SQL federation pushes down to each source:

  SELECT o.custkey, o.totalprice
  FROM orders o
  WHERE o.orderstatus = 'OPEN'

  SELECT c.custkey, c.cust_nation
  FROM customers c
  WHERE c.mktsegment = 'BUILDING'

What Big SQL federation then does:
• Join the rows from both sources
• Sort them by cust_nation and sum up the total order price for each nation
• Return the result to the application
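The decomposition above can be mimicked in plain Python: each source evaluates its local filter, then the federation engine joins the streams and aggregates. A sketch with made-up data values (not from the deck):

```python
from collections import defaultdict

# Two "remote" tables, with invented rows for illustration.
orders = [
    {"custkey": 1, "orderstatus": "OPEN",   "totalprice": 100.0},
    {"custkey": 2, "orderstatus": "CLOSED", "totalprice": 50.0},
    {"custkey": 1, "orderstatus": "OPEN",   "totalprice": 25.0},
]
customers = [
    {"custkey": 1, "mktsegment": "BUILDING", "cust_nation": "FRANCE"},
    {"custkey": 2, "mktsegment": "AUTO",     "cust_nation": "PERU"},
]

# Pushed-down remote statements: each source applies its own filter.
open_orders = [o for o in orders if o["orderstatus"] == "OPEN"]
building_custs = {c["custkey"]: c["cust_nation"]
                  for c in customers if c["mktsegment"] == "BUILDING"}

# Federation engine: join on custkey, then group by nation and sum.
totals = defaultdict(float)
for o in open_orders:
    nation = building_custs.get(o["custkey"])
    if nation is not None:
        totals[nation] += o["totalprice"]

print(dict(totals))  # {'FRANCE': 125.0}
```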
Optimization & Push-down
• Federated execution plan chosen by the cost based optimizer
• Different plans depending on how much work is executed locally vs. how much is executed on the remote data source (“push-down”)
String Comparison
• Things to consider when comparing string types
  Collation sequence – A > B ?
  Blank sensitivity – “ABC” = “ABC ” ?
  Empty string as NULL – “” = NULL ?
• Big SQL uses BINARY collation
  Byte-for-byte comparison of the data – Hive behaviour
  Blank sensitive; empty strings are not NULL
• More string operation processing can be pushed to the remote server if the collations are compatible
  Use the COLLATING_SEQUENCE parameter to indicate compatibility with the BINARY collation
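The three behaviours above can be illustrated with Python strings, which (like Big SQL’s BINARY collation) compare byte for byte, are blank sensitive, and treat the empty string as a value rather than NULL:

```python
a, b = "ABC", "ABC "     # b has a trailing blank

# Blank sensitivity: under a blank-sensitive comparison these differ.
assert a != b

# Empty string is not NULL: "" is a real value, unlike None.
assert "" is not None

# Binary collation: byte order, so lowercase sorts after uppercase.
assert "a" > "B"

# A case-insensitive collation (as some remote sources use) would
# compare differently:
assert "a".casefold() == "A".casefold()

print("all comparisons behave as with BINARY collation")
```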
Remote Data Source String Comparison Behaviour
• Big SQL must be made aware of the remote data source’s string comparison behaviour
  OPTIONS parameter when declaring the server

Data Source | Collation | Blank Sensitive | Empty Strings are NULL | SERVER OPTIONS parameter (recommended)
DB2 LUW     | Identity  | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Oracle      | Binary    | Y               | Y                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Teradata    | ASCII     | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Netezza     | Binary    | Y               | N                      | COLLATING_SEQUENCE=Y, PUSHDOWN=Y
Federation – Installation & Support
• Available out of the box
  Set up by the BigInsights installation
  Wrappers available for the following RDBMSs: DB2, Teradata, Oracle, Netezza
• Versions supported

RDBMS    | Versions                | Wrapper Library
DB2 LUW  | 9.7, 9.8, 10.1, 10.5    | libdb2drda.so
Teradata | 12, 13                  | libdb2teradata.so
Oracle   | 11g, 11gR1, 11gR2       | libdb2net8.so
Netezza  | 4.6, 5.0, 6.0, 7.1, 7.2 | libdb2rcodbc.so
Federation – Set-up Overview
• Set up the environment
• Set up the data source client
  Environment variables and/or entries in $HOME/sqllib/cfg/db2dj.ini
  If db2dj.ini is modified, reload the db2profile
• Create the wrapper
• Define the server object on Big SQL
• Create the nicknames
• Create the user & data type mappings
Federation – Example: Detailed Set-up for Teradata
• Set up the data source client
  Add the Teradata client location to your environment
    export TERADATA_LIB_DIR=/opt/teradata/client/lib64
  Run the djxlinkTeradata tool
    su root
    <HOME>/sqllib/bin/djxlinkTeradata
  Set the TERADATA_CHARSET variable in the db2dj.ini file
    TERADATA_CHARSET=ASCII
• Create the wrapper
  CREATE WRAPPER TERA LIBRARY 'libdb2teradata.so'
Federation – Example: Detailed Set-up for Teradata (continued)
• Define the server object on Big SQL
  CREATE SERVER TERASERV TYPE TERADATA VERSION 13 WRAPPER TERA
    AUTHORIZATION 'terauser' PASSWORD 'terapwd'
    OPTIONS (NODE 'teranode.ibm.com', PUSHDOWN 'Y', COLLATING_SEQUENCE 'N')
• Create the user & data type mappings
  CREATE USER MAPPING FOR dbuser SERVER TERASERV
    OPTIONS (REMOTE_AUTHID 'terauser', REMOTE_PASSWORD 'terapwd')
• Create the nickname
  CREATE NICKNAME TD_CUSTOMERS FOR TERASERV.SALES.CUSTOMERS
Federation – Example: Detailed Set-up for Teradata (continued)
• Run a query involving the remote table through the nickname
  SELECT * FROM TD_CUSTOMERS
Conclusion
• Big Data offers new opportunities & potential for competitive advantages
• Hadoop is the solution, but must be integrated with existing data management infrastructure
• Big SQL 3.0 offers a performant, rich SQL-on-Hadoop solution with powerful federation capabilities, on which to build your hybrid data warehouse
Reference
• IBM InfoSphere BigInsights 3.0 Knowledge Center
• IBM Big SQL 3.0
• Set up and use federation in InfoSphere BigInsights Big SQL V3.0

Want to Learn More?
• Try it for yourself at the drop-in lab
  LCI-6260: Federating Big SQL 3.0 with a Relational Data Store
• Coming next in this room!
  IIH-4666: Big SQL Is Here: Blistering Fast SQL Access To Your Hadoop Data
• Get it
  BigInsights Quick Start Edition VM – ibm.co/quickstart
  Analytics for Hadoop Service – bluemix.net
  Big SQL Tech Preview – bigsql.imdemocloud.com
• Learn it
  Follow online tutorials – ibm.biz/tutorial
  Enroll in online classes – BigDataUniversity.com
  HadoopDev links all available: watch video demos, read articles, etc. – https://developer.ibm.com/hadoop
We Value Your Feedback!
• Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference.
• Access the Insight Conference Connect tool to quickly submit your surveys from your smartphone, laptop or conference kiosk.
Acknowledgements and Disclaimers Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2014. All rights reserved. — U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, Information Management and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at
•“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
•Other company, product, or service names may be trademarks or service marks of others.