XPEDIA: XML Processing for Data Integration Amit Shvarchenberg and Rafi Sayag


Amit Shvarchenberg and Rafi Sayag

Manish Bhide, Manoj K Agarwal
IBM India Research Lab, India
{abmanish,manojkag}@in.ibm.com

Amir Bar-Or, Sriram Padmanabhan
IBM Software Group, USA
{baroram,srp}@us.ibm.com

Srinivas K. Mittapalli, Girish Venkatachaliah
IBM Software Group, India
{smittapa,girish}@in.ibm.com

XPEDIA - Introduction
- XPEDIA stands for XML Processing for Data Integration.
- XML documents have become popular.
- XPEDIA is designed to improve data integration for XML documents.
- XPEDIA uses parallelization and an ELT flow.

ETL In Databases
Extract, transform, and load (ETL):
- Extracting data from outside sources
- Transforming data to fit operational needs
- Loading it into the end target (database or data warehouse)

Typical ETL Scenario With XML

Typical ETL Scenario With XML: Zoom-In on Flow-1

The Read_XML_Table operator simply reads the XML documents.

XML Hierarchy Tree

The Equi-Hierarchical Join operator

The operator goes over every Country sub-tree in the XML document.

The operator finds the set of employees working in each department in that country.

The operator creates a new element named Dept2, which contains a list of all employees working in that department.
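The join described above can be sketched in a few lines of Python. This is only an illustration, not XPEDIA's implementation: the sample document, the `Employee` and `DeptID` element names, and the function name are assumptions made for the sketch. Only `Country` and `Dept2` come from the slides.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names below are assumptions.
SAMPLE = """
<Company>
  <Country>
    <Dept><DeptID>D1</DeptID></Dept>
    <Dept><DeptID>D2</DeptID></Dept>
    <Employee><EName>Ann</EName><DeptID>D1</DeptID></Employee>
    <Employee><EName>Bob</EName><DeptID>D2</DeptID></Employee>
    <Employee><EName>Cy</EName><DeptID>D1</DeptID></Employee>
  </Country>
</Company>
"""

def equi_hierarchical_join(root):
    """Join Dept and Employee on DeptID, separately inside each Country."""
    for country in root.findall("Country"):        # the scope
        # Group this country's employees by the join key.
        by_dept = {}
        for emp in country.findall("Employee"):
            by_dept.setdefault(emp.findtext("DeptID"), []).append(emp)
        # Emit one Dept2 element per department, listing its employees.
        for dept in country.findall("Dept"):
            dept_id = dept.findtext("DeptID")
            dept2 = ET.SubElement(country, "Dept2")
            ET.SubElement(dept2, "DeptID").text = dept_id
            for emp in by_dept.get(dept_id, []):
                dept2.append(copy.deepcopy(emp))
    return root

joined = equi_hierarchical_join(ET.fromstring(SAMPLE))
for dept2 in joined.iter("Dept2"):
    names = [e.findtext("EName") for e in dept2.findall("Employee")]
    print(dept2.findtext("DeptID"), names)
```

Because the grouping dictionary is rebuilt for every `Country`, employees are never matched against departments of another country, which is the point of scoping the join.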

The Aggregation operator

The operator calculates the total salary of all the employees in a department.

The operator adds the result to the XML document as totalSalary.
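A minimal sketch of this aggregation step, assuming each `Dept2` element holds `Employee` children with an integer `Salary` (both names, and the sample document, are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the output of the join step, with salaries attached.
DOC = """
<Company>
  <Country>
    <Dept2>
      <DeptID>D1</DeptID>
      <Employee><Salary>100</Salary></Employee>
      <Employee><Salary>50</Salary></Employee>
    </Dept2>
    <Dept2>
      <DeptID>D2</DeptID>
      <Employee><Salary>80</Salary></Employee>
    </Dept2>
  </Country>
</Company>
"""

def add_total_salary(root):
    """Sum the salaries inside each Dept2 and store the sum as totalSalary."""
    # list(...) takes a snapshot so the tree can be modified while looping.
    for dept in list(root.iter("Dept2")):
        total = sum(int(s.text) for s in dept.findall("Employee/Salary"))
        ET.SubElement(dept, "totalSalary").text = str(total)
    return root

root = add_total_salary(ET.fromstring(DOC))
for dept in root.iter("Dept2"):
    print(dept.findtext("DeptID"), dept.findtext("totalSalary"))
```

The running total starts from zero for every `Dept2`, so the aggregation restarts for each department.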

The Shredder Operator

The operator writes the totalSalary in the modified XML document to the relational database.
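Shredding can be illustrated with Python's built-in `sqlite3` module standing in for the relational database. The table name `dept_totals`, its columns, and the sample document are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical modified document that already carries totalSalary values.
DOC = """<Company><Country>
  <Dept2><DeptID>D1</DeptID><totalSalary>150</totalSalary></Dept2>
  <Dept2><DeptID>D2</DeptID><totalSalary>80</totalSalary></Dept2>
</Country></Company>"""

def shred_totals(root, conn):
    """Write one relational row per Dept2 element; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dept_totals "
        "(dept_id TEXT, total_salary INTEGER)"
    )
    rows = [
        (d.findtext("DeptID"), int(d.findtext("totalSalary")))
        for d in root.iter("Dept2")
    ]
    conn.executemany("INSERT INTO dept_totals VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")   # in-memory database for the sketch
shred_totals(ET.fromstring(DOC), conn)
print(conn.execute("SELECT * FROM dept_totals ORDER BY dept_id").fetchall())
```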

Problem
- Today, databases support a limited representation of XML documents.
- Processing an XML document requires full extraction and parsing of the document.
- XML documents grow larger with time.
- A need for complex transformations has arisen.

Problem: Computational Model
- Relational data is represented in the form of rows and columns.
- In this model, each XML document is represented as a single row and a single column.
- There is a need for a technique that handles complex data flows while preserving the simple specification.

Problem: Scalability
- In relational data, the size of a row/tuple is seldom larger than a few KB.
- XML documents, which are composed of many small objects, often grow to over 1 GB.

There is a need to handle such large complex objects in an efficient manner.

Because of the hierarchical structure of XML, it is challenging to execute a single transformation job (composed of a series of operators) in a parallel environment.

The Solution: XPEDIA
- Computational Model
- ELT Support
- Scalability: parallelism

XPEDIA Computational Model
- XPEDIA uses a dataflow model consisting of operators and edges.
- The key difference in the XPEDIA model: the data that flows between operators is an ordered list of XML documents that comply with a single XML schema.

Example

...

List:

XPEDIA Computational Model (cont.)

Operators can iterate over a sub-vector of a document object.

The iterated vector is defined as the scope vector of the operator.
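One way to picture the scope vector is as the list of nodes an operator visits in each document. The sketch below is an assumption-laden illustration: `run_operator`, `tag_visited`, and the sample documents are hypothetical names, and a plain Python list plays the role of the ordered list of XML documents.

```python
import xml.etree.ElementTree as ET

def run_operator(documents, scope_path, operator):
    """Apply `operator` to every item of the scope vector, document by document."""
    for doc in documents:                            # ordered list of documents
        for scope_item in doc.findall(scope_path):   # the scope vector
            operator(scope_item)
    return documents

# Two hypothetical documents that follow the same schema.
docs = [
    ET.fromstring("<Company><Country/><Country/></Company>"),
    ET.fromstring("<Company><Country/></Company>"),
]

def tag_visited(country):
    """Toy operator: mark each scope item it sees."""
    country.set("visited", "yes")

run_operator(docs, "Country", tag_visited)
print([ET.tostring(d, encoding="unicode") for d in docs])
```

Here the scope vector is the sequence of `Country` elements, and the operator sees one `Country` at a time without any knowledge of the rest of the document.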

XML Operators

Filter Operator: filters one of the vectors within a scope.

Project Operator: iterates over a single vector and generates a new output vector based on a set of select expressions.
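The two operators can be sketched as follows. The function signatures, the use of Python callables as predicates and select expressions, and the sample document are all assumptions made for illustration. XPEDIA's actual operator interface is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<Company><Country>
  <Employee><EName>Ann</EName><Salary>100</Salary></Employee>
  <Employee><EName>Bob</EName><Salary>40</Salary></Employee>
</Country></Company>"""

def filter_vector(root, scope_path, vector_tag, predicate):
    """Within each scope item, drop the vector items that fail the predicate."""
    for scope in root.findall(scope_path):
        for item in scope.findall(vector_tag):   # findall returns a list copy
            if not predicate(item):
                scope.remove(item)
    return root

def project_vector(root, scope_path, vector_tag, out_tag, select):
    """Within each scope item, build a new vector from the select expressions."""
    for scope in root.findall(scope_path):
        for item in scope.findall(vector_tag):
            out = ET.SubElement(scope, out_tag)
            for name, expr in select.items():
                ET.SubElement(out, name).text = expr(item)
    return root

root = ET.fromstring(DOC)
filter_vector(root, "Country", "Employee",
              lambda e: int(e.findtext("Salary")) >= 50)
project_vector(root, "Country", "Employee", "Name",
               {"Value": lambda e: e.findtext("EName").upper()})
print(ET.tostring(root, encoding="unicode"))
```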

XML Operators: Aggregate Operator
Produces statistics by aggregating one of the vectors. The aggregation restarts for each scope item.

XML Operators: Equi-Hierarchical-Join
Performs an equality-based join between two vectors that are contained within the scope instance.

XML Operators: Read/Write Table

Read Table Operator: reads all the rows of a single table and outputs a relational tuple or an XML document.

Write Table Operator: used for writing relational or XML data into a table.

XML Operators: Output Stage Operator

Input:

Attribute     XPath
Department    /Company/Country/Dept
Project       /Company/Country/Emplyee/PName
Emp ID        /Company/Country/Emplyee/Einfo/EmpID

Transforms a relational input into an XML document. The input is a table and a mapping from each relational attribute to an XPath in the XML document.
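The mapping idea can be sketched as below. This is a simplified illustration under stated assumptions: each relational row becomes one XML document, paths contain only plain child steps, and `row_to_xml` is a hypothetical helper name. The paths are copied from the slide as written.

```python
import xml.etree.ElementTree as ET

# Attribute-to-XPath mapping, copied from the slide as written.
MAPPING = {
    "Department": "/Company/Country/Dept",
    "Project": "/Company/Country/Emplyee/PName",
    "Emp ID": "/Company/Country/Emplyee/Einfo/EmpID",
}

def row_to_xml(row, mapping):
    """Build one XML document from one relational row (an assumption)."""
    root = None
    for column, path in mapping.items():
        steps = path.strip("/").split("/")
        if root is None:
            root = ET.Element(steps[0])
        elif root.tag != steps[0]:
            raise ValueError("all paths must share one root element")
        node = root
        for step in steps[1:]:
            child = node.find(step)          # reuse an existing ancestor
            if child is None:
                child = ET.SubElement(node, step)
            node = child
        node.text = str(row[column])         # leaf receives the column value
    return root

row = {"Department": "Research", "Project": "XPEDIA", "Emp ID": 7}
print(ET.tostring(row_to_xml(row, MAPPING), encoding="unicode"))
```

Shared path prefixes such as `/Company/Country` are created once and reused, which is what gives the output its hierarchy.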

ELT Optimization In XPEDIA

ELT
- ELT (Extract, Load, Transform) takes parts of the ETL job flow and converts them into SQL/XML queries.
- ELT is a technique for gaining efficiency and performance by shifting a significant part of the processing into the database.

ELT In XPEDIA
- Databases such as DB2 9, Oracle 11g and SQL Server 2005 have built-in XQuery and SQL/XML query engines.
- XPEDIA applies rewriting techniques to transform parts of the ETL job flow into SQL/XML.
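The benefit of pushing work into the database can be shown with a small relational analogy. The sketch uses Python's `sqlite3` and plain SQL over a hypothetical `employee` table as a stand-in for the SQL/XML queries XPEDIA generates. It does not show XPEDIA's rewriting itself.

```python
import sqlite3

def make_db():
    """Create a small in-memory table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (dept_id TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("D1", 100), ("D2", 80), ("D1", 50)],
    )
    return conn

def totals_etl(conn):
    """ETL style: move every row out, then aggregate in the application."""
    totals = {}
    for dept_id, salary in conn.execute("SELECT dept_id, salary FROM employee"):
        totals[dept_id] = totals.get(dept_id, 0) + salary
    return totals

def totals_elt(conn):
    """ELT style: the database aggregates; only the result is moved."""
    query = "SELECT dept_id, SUM(salary) FROM employee GROUP BY dept_id"
    return dict(conn.execute(query).fetchall())

conn = make_db()
print(totals_etl(conn), totals_elt(conn))
```

Both functions return the same totals, but the second transfers one row per department instead of one row per employee.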

How XPEDIA converts ETL to ELT
The following tasks are required for converting ETL to ELT:
1. Rewrite the ETL flow in terms of simpler operators.
2. Convert each operator into a SQL/XML query.
3. Merge the SQL/XML queries of adjacent operators into a single SQL/XML query.
4. Convert the merged SQL/XML queries to an ELT job definition which can be executed on XPEDIA.

Simplify The ETL Flow

Most of the operators in XPEDIA can be directly converted to a SQL/XML query.

Complex operators, like the OutputStage, are difficult to translate to SQL/XML queries directly.

We need to rewrite complex operators in terms of simpler operators.

Example
The algorithm to convert the OutputStage operator to a set of simpler operators:

Step 1: Apply the XMLize operator to the relational data to obtain a flat XML document.

Example (cont.)

Step 2:

Example (cont.)

Step 3: Use the Project operator to add and drop nodes so that every output node sits at the correct height.

Step 4: Use the Project operator to change the names of nodes.

Query Generation and Merging

The XPEDIA ELT optimizer has a set of algorithms for converting operators to SQL/XML queries.

The XPEDIA ELT optimizer uses a set of rules for merging these SQL/XML queries.

Generating The ELT Job Definition

The generated SQL/XML queries are mapped to the XPEDIA job definition.

XPEDIA translates the job definition to a Read Table operator, and the rest of the ETL flow remains the same.

The Result
- We can now use a single SQL/XML query to replace the operators between the XML data source and the RDBMS.
- ELT allows us to use only Read/Write Table operators.
- Benefit: a reduction in the size of the data that needs to be moved.

XPEDIA ELT Conclusion

XPEDIA is able to use the native XML processing capabilities of the database engine to greatly improve performance.

If the database does not have native XML support, or if the data is present in a flat file, XPEDIA cannot use the ELT optimizer.

XPEDIA Parallel Processing

Parallel Processing of XML Data
XPEDIA supports 2 types of job parallelism:
- Pipeline: each operator is handled by a different resource.
- Partitioning: the XML document is divided into several partitions, each processed separately.

Pipelining Limitations
- Pipelining limits scalability: it can use only as many resources as there are operators.
- In pipelining, each resource needs to work on the entire data.
- Partitioning allows better use of the available resources on large documents.
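Partitioned execution can be sketched with a worker pool. Everything specific here is an assumption for illustration: the sample document, the choice of `Country` as the partition node, the per-partition work (summing salaries), and the use of threads, which keep the sketch short but do not give real CPU parallelism in CPython.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document with four Country sub-trees.
DOC = "<Company>" + "".join(
    f"<Country><Employee><Salary>{s}</Salary></Employee></Country>"
    for s in (10, 20, 30, 40)
) + "</Company>"

def process_partition(countries):
    """Work done independently on one partition: sum its salaries."""
    return sum(int(s.text) for c in countries for s in c.iter("Salary"))

def run_partitioned(root, workers):
    """Split the Country sub-trees across workers and combine the results."""
    countries = root.findall("Country")
    partitions = [countries[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_partition, partitions))

print(run_partitioned(ET.fromstring(DOC), workers=2))
```

Unlike a pipeline, the number of workers here is independent of the number of operators, and each worker touches only its own share of the document.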

Partitions Generation
- XPEDIA identifies which nodes are optimal for partitioning.
- The chosen partition is then divided between resources using one of the following methods:
  - Round Robin
  - Chunking Scheme

Shallow Parsing
- Dividing the work requires some parsing.
- The parsing that is done is only partial, from the root node to the partition node.
- Since the shallow parsing overhead differs between partitions, load balancing is sometimes applied when choosing chunk sizes.

What Have We Gained With XPEDIA
- A performance gain of up to 70% from using XPEDIA ETL tools so that more processing is done inside the database engine.
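The two assignment methods named under Partitions Generation can be contrasted with a short sketch. It assumes that round robin deals items out one at a time and that chunking hands each resource one contiguous block of near-equal size. The slides do not spell out either scheme, so both definitions are assumptions.

```python
def round_robin(items, n):
    """Deal items out in turn: item i goes to resource i mod n."""
    return [items[i::n] for i in range(n)]

def chunking(items, n):
    """Give each resource one contiguous chunk of near-equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

partition_nodes = list(range(7))    # stand-ins for seven partition nodes
print(round_robin(partition_nodes, 3))
print(chunking(partition_nodes, 3))
```

With chunking, the chunk boundaries are the natural place to adjust sizes for load balancing, since each resource receives a single contiguous range.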

What Have We Gained With XPEDIA
- Using XPEDIA to partition the ETL job across multiple nodes is scalable and can speed up the ETL job by up to 2.9 times on a 4-processor configuration.
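To put the reported speedup in perspective, parallel efficiency is the speedup divided by the number of processors. The helper below is a hypothetical convenience function applied to the figures quoted above; for 2.9 times on 4 processors it gives 2.9 / 4 = 0.725, or 72.5% of ideal linear scaling.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / processors

# Figures quoted in the slide: up to 2.9x on a 4-processor configuration.
efficiency = parallel_efficiency(2.9, 4)
print(f"{efficiency:.1%}")
```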

Summary
- We saw how XPEDIA deals with the new problems that have arisen.
- Parallel processing techniques are used for handling large XML documents.
- The XPEDIA ELT system is able to take advantage of the native XML processing capabilities of the database engine and greatly improve performance.

Questions?
