DB2 Information Integrator Version 8.1 Overview for DB2 User Group of Northern California
Michael Lee – Oct 14th, 2003
IBM Q4 Events!
• Oct 14 - Chat with the Lab – Data Warehouses and Information Integration: Best Practices and Performance Characterization
• Oct 21-23 - DB2 Multiplatform Tools Proof of Technology
• Oct 27-31 – Data Management Technical Conference
• Nov 11-13 - DB2 Information Integrator Proof of Technology
• Education!
– Oct 27-31 - DB2 Universal Database Performance Tuning and Monitoring Workshop
– Nov 17-20 - DB2 UDB Advanced Administration Workshop
Agenda
Integration - Some Background
DB2 and Information Integration
DB2 Information Integrator
DB2 Information Integrator Performance Topics
DB2 Information Integrator - The Fine Print
Different Types of Integration Required
[Figure: layers of integration (user interaction, process integration, information integration, application connectivity, build to integrate) connecting "legacy" apps, ERP and HR systems, and new e-business applications (supply chain management, customer relationship management, product development management, service provider integration) with consumers, trading partners, service providers, and suppliers through a B2B portal and a B2C site, across value-chain heterogeneous environments.]
What is Information Integration?
Information integration refers to a category of middleware which lets applications access data as though it were in a single database, whether or not it is. It enables the integration of data and content sources to provide real-time read and write access, to transform data for business analysis and data interchange, and to manage data placement for performance, currency, and availability.
The State of the Business World
[Figure: the many kinds of business data: audio and video, computer-generated output, Web content, scanned documents, images, text, spatial, relational, hierarchical, files, temporal.]
History of DB2 federated database products
[Figure: product timeline]
• DataJoiner V1, V2: heterogeneous data access, distributed joins, heterogeneous replication, spatial data support
• Parallel Edition V1 and Common Server V2, leading to DB2 UDB V5: object relational; OLAP, OLTP and BI; full parallelism; integrated tools; Web integration
• DB2 UDB V7: heterogeneous data access (read-only), limited data sources, distributed joins, enhanced O-R support; OLAP, OLTP, and BI; spatial data support
• DB2 UDB & II V8: heterogeneous data access (read and write), heterogeneous replication, most data sources (relational, non-relational, life sciences), custom wrappers, materialized query tables over relational nicknames, Net Search Extender over relational nicknames, Web services, MQ, and more
http://www.almaden.ibm.com/cs/garlic/
Information Integration with DB2 Universal Database
WebSphere MQSeries User-Defined Functions, MQListener
Web services deployment with DADX, WebSphere SOAP Server, SOAP User-Defined Functions
XML
• XML Extender
• XML Publishing functions (delivered with DB2 UDB V8.1.2)
Federation to IBM Relational Sources
• DB2 UDB
• DB2 for z/OS and OS/390
• DB2 for iSeries
• Informix Dynamic Server (Versions 7 & 9)
• Informix Extended Parallel Server (XPS)
What Problems Does DB2 Information Integrator Address?
Data Access
• Heterogeneous Data Federation
Data Placement
• Heterogeneous Data Replication
• Heterogeneous Data Caching
General Availability: June 13, 2003
Data Federation Functions
Transparency
• Appears to be one source
Heterogeneity
• Integrates data from diverse sources
• Relational, Structured, XML, messages, Web, …
Extensibility
• Federate almost any data source.
• Development tooling provided
High Function
• Full query support against all data
• Capabilities of sources as well
Autonomy
• Non-disruptive to data sources, existing applications, systems.
Performance
• Optimization of distributed queries
Federated Database Architecture
[Figure: a database application sends SQL through the SQL API (JDBC/ODBC) to the federated database server. The server uses its global catalog and wrappers to reach a relational data source holding TRANSACTIONS and a second source holding ITEMS.]
Sample ITEMS rows: 00001|SONY|Television|... 00002|RCA|VideoPlayer|.. 00003|SONY|VideoRecorder....... 00004|SONY|DVDPlayer
Application request: list the number of TV sales per manufacturer in 2001.
Query submitted to the federated server:
SELECT I.man, count(*)
FROM transactions T, items I
WHERE I.id=T.item_id AND I.category='Television' AND YEAR(T.tran_date)=2001
GROUP BY I.man;
Query sent to the relational data source:
SELECT tran_date, item_id FROM transactions
Replication architecture for integrating information
[Figure: changes are captured from sources (IMS, DB2/zOS, DB2/iSeries, DB2/UDB, Sybase, SQL Server, IBM Informix, Oracle, or ANY source through an external application) by log-based or trigger-based capture into staging tables, then applied through the federated engine to targets (DB2/zOS, DB2/iSeries, DB2/UDB, Sybase, SQL Server, IBM Informix, Oracle, Teradata).]
Combined with the federation engine -> a powerful integration tool!
Multi-tiered Caching with Materialized Query Tables
Improve query performance and availability
Administrator defines Materialized Query Table
• Precomputed or frequently used values
• Any data from the federated system
• Application indicates ability to use cache
• Implicit or explicit use
Developer enables cache use
• If enabled, reads are handled from the cache, writes passed through to the source
• If not, reads and writes passed through to source
Cache refresh managed:
• Manually
• By replication
• Various refresh strategies under design
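The read and write routing on this slide can be sketched in a few lines of Python. This is a toy model, not DB2 code: the class `CachedNickname`, its methods, and the sample data are hypothetical names chosen for illustration, and a plain dict stands in for the remote source.

```python
class CachedNickname:
    """Toy model of a nickname with an optional local cache (MQT-like)."""

    def __init__(self, source, cache_enabled=False):
        self.source = source            # dict standing in for the remote source
        self.cache = dict(source)       # local snapshot taken at definition time
        self.cache_enabled = cache_enabled

    def read(self, key):
        # Cache enabled: reads are served locally; otherwise pass through.
        if self.cache_enabled:
            return self.cache.get(key)
        return self.source.get(key)

    def write(self, key, value):
        # Writes are always passed through to the source, never to the cache.
        self.source[key] = value

    def refresh(self):
        # Manual refresh: re-copy the source into the local cache.
        self.cache = dict(self.source)


remote = {"C200": "Smith"}
nick = CachedNickname(remote, cache_enabled=True)
nick.write("C200", "Jones")
stale = nick.read("C200")    # served from the cache, which is not refreshed yet
nick.refresh()
fresh = nick.read("C200")
print(stale, fresh)
```

The stale read before `refresh()` is the trade-off behind the refresh options listed on the slide: a cache answers quickly but only reflects the source as of its last refresh.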
Reaching Relational Data Sources
[Figure: applications (WebSphere, query tools, other vendor software; C, Java, Cobol, RPG; JDBC/ODBC, embedded SQL) connect through a DB2 interface (DDF on z/OS, RDB on iSeries, DB2 RT Client on UNIX, Linux, Windows) to a DB2 instance acting as the federated server. The DB2 database holds the wrapper, server, user mapping, and nickname definitions.]
Federated server operating systems: Windows 2000, AIX, Solaris, HP-UX, Linux/Intel
Wrapper: data source; client software; configuration file or directory
• DRDA: DB2 (z/OS, iSeries, LUW (UDB), VM and VSE); DB2 Connect (built-in); DB2 DB Directory
• NET8: Oracle; Oracle client; tnsnames.ora
• CTLIB: Sybase; Sybase Open Client; interfaces
• INFORMIX: Informix; Informix Client SDK; sqlhosts
• TERADATA: Teradata; Teradata CLI; /etc/hosts
• DJXMSSQL3: SQL Server; ODBC drivers; odbc.ini/System DSN
• ODBC: ODBC sources; ODBC drivers; odbc.ini/System DSN
Relational wrappers require client libraries/configurations.
Crystal Decisions
Vision
• As a world-leading information infrastructure company, Crystal Decisions helps businesses make better decisions by bringing together their people and their information.
Challenge
• Improve response time for complex queries over distributed heterogeneous data sources
Solution
• Relational Connect provides transparent, globally optimized access to heterogeneous, distributed data.
• Crystal Reports accesses the distributed data as if it were a single database. Response time improvement of up to 98% seen in house.
Business Value
• "Users can provide coherent reporting accessing non-DB2 data sources and discover new ways to meet the information needs of their organizations. And the more information business analysts can incorporate into their views of their company's activities, the more effectively they can steer their companies in the direction of higher profits." - Janet Wood, vice president of business development, Crystal Decisions.
Competitive Value
• "DB2 Relational Connect provides Crystal Reports with the fastest federated querying capability on the market today." - Trevor Smith, Program Manager, Business Development Group, Crystal Decisions
Reaching Non-Relational Data Sources
Same architecture as for Life Sciences Data Connect
[Figure: applications (WebSphere, query tools, other vendor software; C, Java, Cobol, RPG; JDBC/ODBC, embedded SQL) connect through a DB2 interface (DDF on z/OS, RDB on iSeries, DB2 RT Client on UNIX, Linux, Windows) to a DB2 instance acting as the federated server. The DB2 database holds the wrapper, server, user mapping, and nickname definitions.]
Federated server operating systems (check for specific wrappers): Windows 2000, AIX, Solaris, HP-UX, Linux/Intel
Wrapper: what it reaches
• Flatfile: table-structured file
• XML: XML file
• Excel: Excel 97/2000, Excel file
• Documentum: Dctm client, Documentum data store
• HMMER: HMMER daemon, HMMER data source
• Entrez: NCBI Website
• BioRS: BioRS Server
• Extended Search: ES client, ES Server
• BLAST: BLAST daemon, BLAST data source
Aventis
Vision
• A leader in the discovery and development of innovative pharmaceutical products dedicated to improving life through the discovery and development of innovative products.
Challenge
• Increase drug research efficiency and encourage interdisciplinary cooperation between chemists and biologists. Scientific users require integrated view of chemical & biological information stored in distributed Oracle sources, as well as external non-relational sources.
Solution
• DiscoveryLink provides federated access to Oracle databases and external sources such as genomics & proteomics data across four worldwide locations.
• Sophisticated scientific mining algorithms
Business Value
• Increased research productivity leading to drug innovation and reduced time-to-market
Competitive Value
• "DiscoveryLink allows us to access and mine the physical data in a way never before possible, significantly speeding up the drug discovery and development process." -- Peter Loupos, Global
Federated Concepts
Wrapper: the data access code; this understands data source capabilities
Server: a specific data source, e.g. a database on a DB2 instance
User Mapping: information needed to connect to a specific server
Passthru Session: a special mode that allows users to submit SQL statements directly to a data source
Nickname: a specific data set managed by a server, mapped to rows and columns in DB2 UDB
Index Specification: an index catalog entry for a nickname
Type Mapping: a mapping between a data source type and a DB2 UDB data type
Function Mapping: a mapping between a data source function and a DB2 UDB function
Function Template: a virtual function definition for a data source function that cannot be executed on DB2 UDB
Option: an additional attribute specific to each source to customize an object
DB2 Information Integration Features
Insert/Update/Delete against relational nicknames
• Used by Heterogeneous replication in DPropR.
• Supported operations include:
– insert with values, insert with subselect
– searched updates/deletes
– positioned updates/deletes
Materialized Query Tables over relational nicknames
• Allows previously cached query results to be used to answer queries.
Heterogeneous Replication
Enhanced Control Center GUI for Federated System
Transparent DDL
• Allows CREATE TABLE to be issued on DB2 to create a remote table and a nickname for this remote table in one step.
Large Object Support
• LOB retrieval for all relational wrappers
• LOB Insert/Update/Delete for Oracle
More DB2 Information Integration Features
Expand data source support
• Relational Wrappers
– ODBC, Teradata wrappers
• Non-Relational Wrappers
– Text, Excel, XML, ...
– HMMER, Entrez, BioRS wrapper, ...
– Lotus Extended Search wrapper
Expanded operating system support
• In addition to AIX ® and Windows NT, servers that use Linux, HP-UX, Solaris ™ Operating Environment, and Windows 2000 operating systems can now be used as DB2 federated servers.
– For now, Linux on Intel only
Garlic wrapper planning technology
• Allows non-relational wrappers to be written more "easily"
• 'Quirky' data sources can be modeled more accurately.
• Life Sciences Data Connect wrappers are all now using Garlic wrapper planning technology.
Net Search Extender: text search against relational nicknames
• useful for those sources that do not support text searches
• text indexes can be defined over relational nicknames on DB2 II
Control Center
• Tools to configure and administer standard wrappers
• Plug-in architecture allows custom wrappers to be administered
Wrapper Administration Tools: Plug-in architecture
Wrapper Administration Tools: Dynamic Discovery
Dynamic Discovery mechanism enables new data sets to be discovered and automatically configured
Wrapper writer optionally provides GUI and discovery routine for custom wrappers
Wrapper Module
Shared library with specific entry points
Represents a class of data sources
Dynamically loaded on demand by UDB
[Figure: a client application sends SQL to DB2/UDB, which loads the flatfile and Teradata wrappers.]
create wrapper teradata library 'teradata.so'
create wrapper flatfile library 'libdb2lsfile.a'
Server
Represents a specific data source
Provides information about data source capabilities
[Figure: a client application sends SQL to DB2/UDB, whose Teradata wrapper reaches a Teradata server and whose flatfile wrapper reaches a flat file.]
create server mapping from mytd to node "tdatsvr" type teradata VERSION 3.0 protocol "teradata" authid "tempx" password "xxpibm";
create server ff wrapper flatfile;
db2 identifies a specific db2 SID
Remote User
Information needed to identify end user to a particular server
[Figure: a client application sends SQL to DB2/UDB; the Teradata wrapper connects to the Teradata server holding the store data with uid: j15user1 and pwd: secret; a flatfile wrapper is also shown.]
create user mapping from user to server mytd authid "tempx" password "xxxxxm" ;
Nickname
Represents a particular collection of data at a server
Mapped to rows and columns in UDB
Acts like an alias
[Figure: a client application sends SQL to DB2/UDB; nicknames in the DB2 catalogs point through the drda wrapper to store data on a Teradata server and through the flatfile wrapper to the MOLECULES and TESTS files.]
Nicknames
To create the DB2 nickname:
create nickname td_div for mytd.dw_dss.lu_division;
To create the flat file nickname:
CREATE NICKNAME tests (id char(5), score smallint, type char(20)) for server ff OPTIONS(FILE_PATH '/home/etlin/sql/flattests.txt', COLUMN_DELIMITER '|');
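To make the flat-file nickname concrete, the following Python sketch shows roughly what mapping a delimited file to typed rows involves. It is an illustration only and says nothing about how the flat-file wrapper is implemented; the function `read_flat_file` and the sample file contents are hypothetical, while the column layout and the `|` delimiter come from the CREATE NICKNAME statement.

```python
import csv
import io

# Column layout taken from the CREATE NICKNAME statement:
# id char(5), score smallint, type char(20)
COLUMNS = [("id", str), ("score", int), ("type", str)]


def read_flat_file(text, delimiter="|"):
    """Turn delimited lines into rows shaped like the nickname's columns."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        row = tuple(cast(value.strip()) for (_, cast), value in zip(COLUMNS, fields))
        rows.append(row)
    return rows


# Hypothetical file contents; the real flattests.txt is not shown in the slides.
sample = "00001|87|unit\n00002|92|regression\n"
rows = read_flat_file(sample)
print(rows)
```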
A table EMP exists on DB2 for OS/390. Owner is J15USER3.
EMPNO EMPNAME
100 Smith
200 Jones
300 Adams
400 Miller
500 Bennett
CREATE NICKNAME O_EMP FOR DB2OS390.J15USER3.EMP
A table OFFICE exists on Teradata. Owner is J15USER1.
EMPNO OFFICENO
100 C200
200 C202
300 C204
400 C206
500 C208
CREATE NICKNAME TD_OFFICE FOR TERADATA.J15USER1.OFFICE
Putting it all together...
SELECT O_EMP.EMPNAME, TD_OFFICE.OFFICENO
FROM o_emp, td_office
WHERE O_EMP.EMPNO = TD_OFFICE.EMPNO
Result:
EMPNAME OFFICENO
Smith C200
Jones C202
Adams C204
Miller C206
Bennett C208
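The same join can be tried without any DB2 installation. The sketch below is an analogy, not federation itself: it uses Python's built-in `sqlite3` module, attaches two in-memory databases as stand-ins for the two remote sources, and uses temporary views in place of nicknames. The schema names and the use of views are assumptions made for the illustration; the table contents come from the slides.

```python
import sqlite3

# One connection plays the federated server; each attached in-memory
# database stands in for a separate remote source.
fed = sqlite3.connect(":memory:")
fed.execute("ATTACH DATABASE ':memory:' AS db2os390")
fed.execute("ATTACH DATABASE ':memory:' AS teradata")

fed.execute("CREATE TABLE db2os390.emp (empno INTEGER, empname TEXT)")
fed.execute("CREATE TABLE teradata.office (empno INTEGER, officeno TEXT)")
fed.executemany(
    "INSERT INTO db2os390.emp VALUES (?, ?)",
    [(100, "Smith"), (200, "Jones"), (300, "Adams"), (400, "Miller"), (500, "Bennett")],
)
fed.executemany(
    "INSERT INTO teradata.office VALUES (?, ?)",
    [(100, "C200"), (200, "C202"), (300, "C204"), (400, "C206"), (500, "C208")],
)

# Temporary views play the role of nicknames: local names for remote tables.
fed.execute("CREATE TEMP VIEW o_emp AS SELECT * FROM db2os390.emp")
fed.execute("CREATE TEMP VIEW td_office AS SELECT * FROM teradata.office")

result = fed.execute(
    "SELECT o_emp.empname, td_office.officeno "
    "FROM o_emp, td_office "
    "WHERE o_emp.empno = td_office.empno "
    "ORDER BY o_emp.empno"
).fetchall()
print(result)
```

As with nicknames, the query text names only the local objects; where the rows physically live is hidden behind the definitions.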
DB2 Information Integrator Packaging
II Advanced Edition
• Full Function DB2 ESE
• Net Search Extender
II Standard Edition
• Heterogeneous Federation
• Caching
• Limited Use DB2 ESE
II Replication Edition
• Heterogeneous Replication
• Limited use DB2 ESE
DB2 ESE
• Full Function DBMS
• Includes II Foundation for DB2
– Homogeneous Federation & Replication to DB2 & IDS
– XML store
Two Other Editions
• DB2 II Advanced Edition Unlimited
• DB2 II Developer's Edition
Connector Licensing
DB2 UDB, Informix IDS, Informix XPS, WebSphere MQ, Web services, and OLE DB - provided with base server license.
Relational databases - access only one database instance of Oracle, Microsoft SQL Server, Sybase or Teradata databases, or any one data source instance accessed using the ODBC connector.
Microsoft Excel - access to all Excel documents that reside on one or more servers.
Flat files - access to all flat files that reside on one or more servers.
XML - access to all XML documents that reside on one or more servers.
Documentum - access to one repository of Documentum documents.
Extended Search - access to all data sources accessible through Extended Search.
BLAST - access to all data sources accessible through one BLAST daemon.
Entrez - access to all supported data sources accessible through Entrez. Access to PubMed and Nucleotide databases is supported by DB2 Information Integrator.
HMMER - access to all data sources accessible through one HMMER daemon.
BioRS - access to all the databanks supported by a single installation of BioRS.
Upgrades from DataJoiner and Relational Connect
These customers: are entitled to these upgrades
• DataJoiner Server: 1 DB2 II SE Processor
• DataJoiner Additional Processor: 1 DB2 II SE Processor
• DataJoiner Relational Data Source: 1 DB2 II Connector
• DataJoiner Spatial Extender Processor: 1 DB2 Spatial Extender Processor
• Relational Connect: 2 DB2 II SE Processors, 2 DB2 II Connectors
• Life Sciences Data Connect: 2 DB2 II SE Processors, 1 DB2 II Connector
Sunsetting DB2 DataJoiner and DB2 Relational Connect
• Withdrawal from Marketing: September 9, 2003
• Withdrawal from Service: September 9, 2004
• See IBM Announcement Letter # 903-106: Software Withdrawal: DB2 Relational Connect, DB2 Life Sciences Data Connect, DB2 DataJoiner, DB2 DataJoiner Spatial Extender
DB2 Information Integrator – Capability Comparison
[Table: functions compared across four columns: DataJoiner, DB2 V7.2 & RC/LSDC, DB2 V8.1, and II V8.1. Cells carry support marks, some qualified as (R/O) or (1PC).]
Federated Reach:
• DB2/Informix Read/Write
• Oracle, Sybase, SQL Server, Teradata R/W
• ODBC R/W
• Extended search wrapper (R/O)
• XML wrapper (R/O)
• MQ series 1PC
• Excel, flat file, Life Sciences wrappers (R/O)
Caching - MQTs over nicknames
Web services provider
Heterogeneous replication
Wrapper architecture
Application Dev’t via standard tooling (e.g. WebSphere Studio, MS Visual Studio)
Web services consumer
Control Center – Discovery for nicknames
Wrapper toolkit for developers
DB2 Information Integrator Performance Topics
Cost-based query optimization
Performance of queries against a single federated data source
Performance of queries against multiple federated data sources
Application development with and without DB2 II
A Simple Join Query
Show me: the number of TV sales per manufacturer in year 2001
SQL to DB2:
SELECT I.man, count(*)
FROM transactions T, items I
WHERE I.id=T.item_id and I.cat='Television' and YEAR(T.tran_date)=2001
GROUP BY I.man;
Global Plan (top to bottom): RETURN, GROUP BY, SORT, Hash Join, with two inputs:
• Relational query: SHIP over Nickname: TRANSACTIONS
• Non-relational plan: Evaluate cat='Television', RPD over Nickname: ITEMS
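The shape of this plan can be imitated in plain Python to show what the federated server does with the two inputs. This is a sketch under stated assumptions, not the DB2 executor: the function name and the transaction rows are hypothetical, the ITEMS rows follow the `id|man|category` sample shown earlier in the deck, and the year predicate is applied locally because the remote statement shown earlier carries no date predicate.

```python
from collections import Counter

# ITEMS rows as pipe-delimited text (id|man|category), as in the deck's sample.
ITEMS_FILE = [
    "00001|SONY|Television",
    "00002|RCA|VideoPlayer",
    "00003|SONY|VideoRecorder",
    "00004|SONY|DVDPlayer",
]
# Hypothetical (tran_date, item_id) rows, as returned by the shipped query.
TRANSACTIONS = [
    ("2001-03-14", "00001"),
    ("2001-07-02", "00001"),
    ("2000-12-30", "00001"),
    ("2001-05-09", "00002"),
]


def tv_sales_per_manufacturer(item_lines, transactions, year):
    # Build side: scan ITEMS and evaluate the category predicate.
    build = {}
    for line in item_lines:
        item_id, man, category = line.split("|")
        if category == "Television":
            build[item_id] = man
    # Probe side: rows shipped from the relational source. The year
    # predicate and the GROUP BY are evaluated here, at the federated server.
    counts = Counter()
    for tran_date, item_id in transactions:
        if int(tran_date[:4]) == year and item_id in build:
            counts[build[item_id]] += 1
    return dict(counts)


sales = tv_sales_per_manufacturer(ITEMS_FILE, TRANSACTIONS, 2001)
print(sales)
```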
DB2 Query Processing Extended for Federated Data
Parse Query
Check Semantics
Rewrite Query
• Transform a user query based on heuristics & data source knowledge
Pushdown Analysis
• Analyze how to decompose a user query
Optimize Access Plan
• Generate an optimal query execution plan using cost estimates including data source knowledge: database statistics, indexes, source functions, server capacity, network capacity (but assumes DB2-like behavior!)
Remote SQL Generator
• Produce efficient DBMS-specific SQL for SQL-speaking sources
Generate Executable Code
• Generate the executable for the plan for local and remote operations
Execute Plan
• Drive execution of query over local & distributed data
• Allows function compensation & virtual database view
Relational Data Source Pushdown Analysis
Determine what portion of a query can be executed outside DB2 for relational sources
Will not dictate how much really gets pushed down to the data source in most cases
What matters:
• data source capabilities
– e.g. Can it handle a join operation?
• characteristics of the data
– e.g. Can the sorting sequence affect predicates on a column?
• function mappings
– e.g. Can this source evaluate COUNT(*)?
• and more
Actual pushdown is cost-based
Just because processing can be pushed down doesn't mean it will be.
Influenced by estimates of rows processed/returned.
Example: two-table join with nicknames ORA.T1 and ORA.T2 to a single remote source that is "nearly" a Cartesian product. May be better to do the join at the Federated server to avoid retrieval of many rows.
Example: Retrieving (10,000 + 25) rows to do a local join is faster than retrieving a (10,000 * 25) = 250,000-row remote join result
SELECT .... from ORA.T1, ORA.T2 where T1.a = T2.b
[Figure: ORA.T1 (25 rows) and ORA.T2 (10,000 rows) on a single remote Oracle source]
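The arithmetic behind this example is easy to check. The sketch below counts only rows shipped to the federated server, which is one input to the optimizer's decision; the function `rows_shipped` and the `join_selectivity` parameter are hypothetical names for the illustration, not part of the DB2 cost model.

```python
def rows_shipped(t1_rows, t2_rows, join_selectivity):
    """Rows sent to the federated server under each strategy."""
    local_join = t1_rows + t2_rows                             # fetch both tables
    remote_join = round(t1_rows * t2_rows * join_selectivity)  # fetch the join result
    return local_join, remote_join


# Slide figures: 25 and 10,000 rows. Selectivity 1.0 models the
# worst case of a full Cartesian product.
local_join, remote_join = rows_shipped(25, 10_000, 1.0)
print(local_join, remote_join)
```

With a selective join the comparison flips: at a selectivity of 0.0001 the remote join returns about 25 rows, far fewer than the 10,025 needed for a local join, which is why the decision has to rest on estimates and not on a fixed rule.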
Q4:
            Rows
           RETURN
           (   1)
            Cost
             I/O
             |
              5
            SHIP
           (   2)
         4.51969e+07
         1.80339e+06
        /------+-----\
    6e+06         2.39966e+07
   NK: TPCD        NK: TPCD
    ORDERS         LINEITEM
Reading the Explain plan for Q4
Q4: join between nicknames to two tables (Orders and Lineitem) on the remote source
Orders: 6M rows, Lineitem: ~24M rows
No processing at the Federated server - topmost operator is SHIP.
5 rows are estimated to be returned to the Federated server
What statement is SHIPped to the remote source? Look in details of the SHIP operator in EXPLAIN output
Original Query:
SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
FROM TPCD.ORDERS
WHERE O_ORDERDATE >= DATE('1993-07-01')
AND O_ORDERDATE < DATE('1993-07-01') + 3 MONTHS
AND EXISTS (SELECT * FROM TPCD.LINEITEM WHERE L_ORDERKEY = O_ORDERKEY AND L_COMMITDATE < L_RECEIPTDATE)
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY;
RED = Nicknames, PINK = Remote source table names
What is shipped to the remote source:
SELECT A0."O_ORDERPRIORITY", COUNT(*)
FROM "TPCH"."ORDERS" A0
WHERE (EXISTS (SELECT A1."L_RETURNFLAG" FROM "TPCH"."LINEITEM" A1 WHERE (A1."L_ORDERKEY" = A0."O_ORDERKEY") AND (A1."L_COMMITDATE" < A1."L_RECEIPTDATE")))
AND (TO_DATE('19930701 000000','YYYYMMDD HH24MISS') <= A0."O_ORDERDATE")
AND (A0."O_ORDERDATE" < TO_DATE('19931001 000000','YYYYMMDD HH24MISS'))
GROUP BY A0."O_ORDERPRIORITY"
ORDER BY 1 ASC;
Even for a completely pushed down query...
The remote query text is not the same as the original due to
• SQL differences on remote source
• Federated optimization includes a rewrite phase!
Q4: What is shipped to the remote source?
Q14: What went wrong?
             Rows
            RETURN
            (   1)
             Cost
              I/O
              |
               1         (rows output)
            GRPBY        (operator name)
            (   2)       (operator no.)
            370486       (Cost estimate)
            27108.4      (I/O estimate)
              |
          2.39966e+06
             SHIP
            (   3)
            369851
            27108.4
        /------+-----\
    800000        2.39966e+07
   NK: TPCD        NK: TPCD
     PART          LINEITEM
SELECT 100.00 * SUM(CASE WHEN P_TYPE LIKE 'PROMO%' THEN L_EXTENDEDPRICE*(1-L_DISCOUNT) ELSE 0 END) / SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS PROMO_REVENUE
FROM TPCD.LINEITEM, TPCD.PART
WHERE L_PARTKEY = P_PARTKEY
AND L_SHIPDATE >= DATE('1995-09-01')
AND L_SHIPDATE < DATE('1995-09-01') + 1 MONTH
Join is pushed down to the remote database, but SUM/CASE/GROUPBY is done locally. Thus *many* rows are processed locally. Why? Not all versions of the remote database can deal with a CASE expression of this type
Q20: A single-source 5-table join
RETURN (1)
  UNIQUE (2)
    NLJOIN (3)
      NLJOIN (4)
        TBSCAN (5)
          SORT (6)
            NLJOIN (7)
              SHIP (8): NK: TPCD NATION
              SHIP (11): NK: TPCD SUPPLIER
        SHIP (13): NK: TPCD PART, NK: TPCD PARTSUPP
      FILTER (18)
        GRPBY (19)
          SHIP (20): NK: TPCD LINEITEM
4 smaller remote queries are generated below each SHIP operator in this 5-table join
Callouts on the plan: a local join; a pushed-down join (SHIP 13 covers both PART and PARTSUPP); a local join.
Even though few operations are pushed down, Q20 is faster on Federated than when executed using the native interface to the other database. This is unusual.
Understanding performance of a query to 1 remote source
Consider a Federated query Q. Get DB2EXFMT output for Q.
• Watch for portions of the query that are not pushed down
For a fully pushed-down query: Find the remote query Q' from the SHIP operator
If you can, try executing the remote query Q' directly at the remote source to isolate any performance issues.
Try executing Q' by hand with a SELECT COUNT(*) to verify the number of rows returned
Compare execution time of Q at the Federated server and Q' at the remote source.
If you know how, get an EXPLAIN of Q' at the remote source
Useful tool: db2batch -d <dbname> -f <query file> -o p 2 executes query at Federated server and measures elapsed and CPU time
Technique generalizes to case with more than one Q' (query Q decomposed into two or more remote queries)
[Figure: elapsed time of Q compared with Q']
A simple distributed two-source query
SELECT P_PARTKEY, SUM (100.00 * L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS PROMO_REVENUE
FROM TPCD2.LINEITEM, TPCD.PART
WHERE L_PARTKEY = P_PARTKEY and P_TYPE LIKE 'PROMO%'
AND L_SHIPDATE >= DATE('1995-09-01')
AND L_SHIPDATE < DATE('1995-09-01') + 3 MONTH
GROUP BY P_PARTKEY
ORDER BY PROMO_REVENUE
FETCH FIRST 20 ROWS ONLY;
[Figure: the DB2 Federated Server between source TPCD (PART) and source TPCD2 (LINEITEM)]
Join LINEITEM and PART tables located on sources TPCD2 and TPCD
Join predicate L_PARTKEY = P_PARTKEY
Predicate on PART: consider only promotional parts
Predicate on LINEITEM: consider only items shipped between certain dates
Does the simple distributed query perform well?
Qualitative criteria: Yes.
• Minimal Data Movement
• Work is pushed down
Quantitative criteria: Compare Federated with temporary co-location ("move and join")
Comparison: Elapsed Time of Federated distributed query vs. time to
• Export & move qualifying PART rows from source TPCD to source TPCD2 and...
• ...do a local query on source TPCD2 between LINEITEM and the "moved" PART rows stored in a temporary table
Consider:
• Federated moves 864K + 30K = 894K rows
• Manual co-location moves only 30K, but must export/insert rows.
What if we make a "mistake" and move LINEITEM to PART instead?
              :
              |
           29917.6
            SORT
           (   6)
              |
           32329.1
           HSJOIN
           (   7)
        /------+-----\
    864484         29917.6
     SHIP            SHIP
    (   8)         (  10)
      |               |
 2.39966e+07       800000
  NK: TPCD2       NK: TPCD
  LINEITEM          PART
[Figure: at the DB2 Federated Server, LINEITEM rows arrive from source TPCD2 and about 30k PART rows arrive from source TPCD.]
Results of the comparison experiment
Elapsed Time, sec.
• Federated: 19
• Move Part to Lineitem, Local Join: 12
• Move Lineitem to Part, Local Join: 48
Compare "move and join" to Federated join
• in both directions (Part to Lineitem and vice-versa).
Results:
• move-and-join can be faster than a Federated join or
• it can be slower because it is dominated by cost to export/load a large temporary table.
"Move and join" is not always worth the trouble
Elapsed Time, sec.
• Federated: 32.5
• Move Part to Lineitem, Local Join: 32.3
• Move Lineitem to Part, Local Join: 52.4
Repeat the experiment with a modified query in which the join is not as "lopsided"
• Predicate on PART table changed to select much more data than before
• Almost as many PART rows as LINEITEM rows participate in the distributed join
Result: Federated does as well as the best "move and join" experiment.
Comparing "move and join" with Federated
Move-and-join: temporarily move data to be joined from two (or more) sources to one source
• This strategy is NOT available for DB2 II today
Move-and-join may be quicker than Federated join if
• one side of the join is much smaller than the other
• you correctly choose to move the small side
But
• It takes considerable knowledge of the query to implement move-and-join correctly and efficiently
• It requires additional scripting and administration
• If both sides of the join are of comparable size, performance will usually be similar to Federated
• Need authority to create new tables, insert data
• Performance may be much worse than Federated if you make a mistake
Federated is a good bet in a dynamic environment!
[Figure: local DB2 for "scratch" temp tables]
Comparing performance of distributed queries in an application with and without DB2 Federated
[Figure: the federated application holds one connection to the DB2 Federated Server, which reaches Oracle, Excel/ODBC, and DB2; the non-federated application holds a connection to each individual data source.]
Comparing performance of distributed queries with and without DB2 Information Integrator
Without DB2 Federated: Application connects to each source, issues SQL in its dialect, retrieves appropriate data from each, inserts into local temporary tables, and processes query locally
With DB2 Federated: Application connects only to Federated server and submits queries against nicknames to several sources
• Federated manages the decomposition and processing of the query
• Can also create join and union views over nicknames to make multiple remote tables appear as one to the application
Results w/J2EE servlets issuing queries involving 3 remote data sources
• 40 - 65% less code with DB2 II
• 50 - 65% less time required to develop with DB2 II
Query: Time with Federated; Time without Federated
• 1: 3.5 sec; 3.4 sec
• 2: 0.24 sec; 0.16 sec
• 3: 54.2 sec; 170.1 sec
• 4: 6.5 sec; 81.2 sec
• 5: 15.1 sec; 9.9 sec
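A few lines of Python summarize the timing table. The figures are copied from the slide; the totals and the list of queries that ran faster with Federated are simple derived values, and the variable names are chosen for this illustration.

```python
# (query number, seconds with Federated, seconds without Federated)
timings = [
    (1, 3.5, 3.4),
    (2, 0.24, 0.16),
    (3, 54.2, 170.1),
    (4, 6.5, 81.2),
    (5, 15.1, 9.9),
]

total_with = round(sum(w for _, w, _ in timings), 2)
total_without = round(sum(wo for _, _, wo in timings), 2)
faster_with_federated = [q for q, w, wo in timings if w < wo]

print(total_with, total_without, faster_with_federated)
```

Federated is slightly slower on the three short queries and much faster on the two long ones, so the total elapsed time is dominated by queries 3 and 4.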
Tuning stuff ...
Pay attention to pushdown
• Ensure that appropriate work is being pushed down to remote sources
• Affected by collating sequences, data types, null attributes
Pay attention to sorting and temporary tablespaces
• Even with appropriate pushdown, DB2 II will be doing lots of work
– Results from remote sources become temporary tables on DB2 II
– Final sorts, aggregates may need to be performed on DB2 II, especially with multiple remote data sources
• Monitor sortheap, sheapthresh
– ensure most sorts take place in memory
• Appropriately size bufferpool for temporary tablespaces
– ensure spills from sortheap have a place to land in bufferpool
– ensure adequate space for temporary tables being materialized on DB2 II
• Storage underlying temporary tablespaces
– ensure good performance when temporary tables are forced to disk
Temporary tables aren't reusable for multiple users
• Consider using MQTs to persist frequently used data
Current parallelism limitations in Federated performance
No SMP intra-query parallelism at the Federated server (yet)
• A query involving one or more nicknames will run using only one db2 agent
• A single user running a distributed query can't exploit the full power of (say) an 8-way SMP
Watch out when making EEE instance Federated. Joins between local partitioned tables (EEE) and nicknames run on coordinator partition only - no parallelism!
[Figure: an EEE instance with partitions P1, P2, P3, P4; a JOIN between partitioned table T and a nickname.]
This join will run on one partition only.
Better... create an MQT over the nickname. The MQT is a local table that can be joined in parallel with the partitioned table
Upcoming Features...
2 Phase Commit not yet supported
• No triggers on nicknames that cause updates to another site
• Only read-only IMS, CICS Transactions to federated sources
Nicknames on remote stored procedures not yet supported
Replication with DB2 II requires DB2 DataPropagator Version 8
• Version 7 tolerated, but replication subscriptions can't be changed
RUNSTATS on nicknames not yet supported
• Can drop and recreate, but dependent objects are dropped as well
• GET_STATS (IBM supplied program) is available from DB2 II support site
– http://www-3.ibm.com/software/data/integration/db2ii/support.html
Optimizer assumes DB2 UDB behavior for relational data sources
May need to run db2updv8 on databases after upgrading to FixPak 2
Summary: II Federated Technology
Open infrastructure for integrating diverse and distributed data for real time access
• Structured, semi-structured, and unstructured data, standards-based extensibility
• Leverage native source capabilities
• Query engine provides optimized cross-source queries
An extension of industrial-strength technology
• Robust, high function, high performance
• Benefit from R & D in relational DBMS
• Benefit from experience in content management
Eases application development
• Transparency, heterogeneity, high function
• APIs for modern environments (XML, J2EE, Web Services)
Application Autonomy
• Does not disrupt existing applications
For More Information
DB2 Information Integrator Home Page
• http://www-3.ibm.com/software/data/integration/db2ii/
DB2 Information Integrator Documentation
• Now in DB2 UDB Documentation (with DB2 UDB Documentation FP2)
IBM Systems Journal on Information Integration
• http://www.research.ibm.com/journal/sj41-4.html
Information Integration White Papers
• http://www-3.ibm.com/software/data/pubs/papers/#ii