Ab Initio - Introductory_course - I

92
 Asia’s Lar gest Global Software & Serv ices Company Confidential Ab Initio 1 Introduction

description

Abinition course I

Transcript of Ab Initio - Introductory_course - I

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    1

    Introduction

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    2

    Agenda

    z WhatisDatawarehousing?z WhyDatawarehousing?z ETLprocessz VariousETLtoolsz IntroductionaboutAb Initioz whyAb Initioz HowUnixinvolvedwithAb Initioz GDEwindowz EMERepositoryz Sandboxes UserandStandardSandboxz Ab Initio Componentsz Creationofsimplegraphs

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Data warehousing and

    ETL Process

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    4

    DataWarehouse

    DataWarehouseisacollectionoflogical

    DataMarts,eachofwhichisdesignedfora

    particularlineofbusinessi.e.Sales,Marketing(designedtofavor/facilitatedata

    analysisandreporting).

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    5

    Why

    Datawarehousing?

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    6

    ETLprocess

    DataisfirststoredtemporarilyinaStagingTable/Area

    andiscalledStagingData

    i.e.

    Dataqueuedforprocessing.

    TheprocessingtoolreadstheStagedData,performsqualitativeprocessing,filtering,

    cleansing(AsrequiredfortheOLAPi.e.reporting/analysis)and

    finallyloads/writes

    themintoDataWarehouse.

    Allthesedataflow(bothinwardandoutward)anddataprocessingactivities

    (ExtractionfromSourceSystem Transformationofdatabycleansing/filtering

    LoadingintoDataWarehouse)areperformedusinganETLtooli.e.Ab

    Initio,

    Informatica

    etc.

    ThisentireprocessissaidtobeasETLprocess.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    7

    Extract,Transformation,Load(ETL)functionalities

    Extract: ThefirstphaseofanETLprocessistoextractthedatafromthesourcesystems.

    Eachseparatesystemmayalsouseadifferentdataorganization/

    format.

    Commondatasourceformatsarerelationaldatabases,andflatfiles,butother

    sourceformatsexist.Extractionconvertsthedataintorecordsandcolumns.

    Transform: Thetransformphaseappliesaseriesofrulesorfunctionstotheextracteddata.

    Examples: Deriveanewcalculatedvalue(e.g.sale_amount =qty*unit_price) Summarizemultiplerowsofdata(e.g.totalsalesforeachregion) Load:o Theloadphaseloadsthedataintothedatawarehouse.Depending

    onthe

    requirementsoftheorganization,thisprocessrangeswidely.

    o SimpleOverwriteolddatawithnew.o Morecomplexsystems>Maintenanceofhistoryandaudittrailofallchangesto

    thedata

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    8

    VariousPopularETLTools

    Tool Name Company Name

    Informatica Informatica Corporation

    DT/studio Embarcadero technologies

    Datastage IBM

    Abinitio Abinitio Software corporation

    Talend Talend corporation

    Pentaho Pentaho corporation

    Datajunction Pervasive Software

    Oracle warehouse builder Oracle Corporation

    Microsoft SQl Server Integration Microsoft

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Introduction-Ab-Initio

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    10

    IntroductiontoAbinitio

    Data processing tool from Ab Initio software corporation (http://www.abinitio.com)

    Latin for from the beginning Designed to support largest and most complex business applications Graphical, intuitive, and fits the way your business works.

    Focus:Moving Data -

    Move small and large volumes of data in an efficient manner.Deal with the complexity associated with business data.

    High Performance Scalable Solutions

    Better productivityUsage:

    Data Warehousing Batch Processing Data Movement Data Transformation

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    11

    ProductConstituents

    CooperatingSystem(Co>Ops)

    GraphicalDevelopmentEnvironment

    (GDE)

    SSHREXECTELNETDCOM

    EMEDBConduct>ITCF

    Product Functionality

    GDE UserInterfaceforcreatingGraphsandPlansinAb

    Initio

    DataProfiler Ab

    Initio

    ToolforDataProfiling

    Co>Ops ServerComponentforrunningdeployedAb

    Initio

    programs

    EME Ab

    Initio

    TechnicalRepository PartofCo>OpsInstall

    Database Ab

    Initio

    ServerDatabaseComponents

    Conduct>IT Ab

    Initio

    ServerComponentforrunningAb

    Initio

    Plans

    Continuous

    FlowAb

    Initio

    ServerComponentsforrunningCFprograms

    Allservercomponentsareinstalledbydefault

    AB_HOME

    referstoinstallationlocationofAb

    Initio VariousConnectorsandPlugins

    installedinAB_HOME/Connectors

    &

    AB_HOME/plugins

    locationAllbinariesandlibraryfilesavailableinAB_HOME/bin

    &AB_HOME/lib

    respectively

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    12

    Ab

    Initio

    ProductArchitecture

    Native Operating System (Unix, Windows, OS/390)

    The Ab Initio Co>Operating System

    Component Library

    Development Environments

    GDE Shell

    3rd PartyComponents

    User-definedComponents

    User Applications

    Ab Initio

    EME

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    13

    ProductArchitecture

    Unix Shell Script or NT Batch Filey Supplies parameter values to underlying

    programs through arguments and environment variables

    y Controls the flow of data through pipesy Usually generated using the GDE

    Operating System( Unix , Windows NT )

    UserPrograms

    Co>Operating SystemAb Initio Built-in Component Programs (Partitions, Transforms etc)

    Host Machine 1

    Operating System

    UserPrograms

    Host Machine 2

    Co-OperatingSystem

    GDE

    y Ability to graphically design batch programs comprising Ab Initio components, connected by pipes

    y Ability to test run the graphical design and monitor its progress

    y Ability to generate a shell script or batch file from the graphical design

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    14

    Co>OperatingSystemandGDE

    Co>Operating System Layered on the top of the operating system. Unites a network of computing resources CPUs, storage disks,

    programs, datasets into a data-processing system with scalable

    performance.

    GDE

    can talk to the Co-operating system using several protocols like

    Telnet, Rexec and FTP

    It is GUI for building applications in Ab Initio

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    15

    The Graph Model

    Graph

    isthelogicalmodularunitofanapplication. consistsofseveralcomponentsthatformsthebuildingblocksof

    anAbInitio

    application

    StartScript(HostSetup)

    LocaltotheGraph EndScript

    LocaltotheGraph

    Component isaprogramthatdoesaspecifictypeofjobandcanbecontrolledbyitsparameter

    settings.Ex:Join,Reformatetc

    ComponentOrganizer Groupsallcomponentsunderdifferentcategories.

    SetupCommand AbInitioHost(AIH)file BuildsuptheenvironmenttorunanAbInitioapplication.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    16

    Partsoftypicalgraph

    Files Formats Components Flows Layouts Building with mp job Building with mp run

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    17

    The Graph Model: Naming the pieces

    A Sample Graph

    L1

    Customers

    L1*

    Scoreout*

    deselect*

    L1*

    Select

    L1

    GoodCustomers

    L1

    OtherCustomers

    DatasetComponents

    Datasets

    Flows

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    18

    The Graph Model: A closer look

    ASampleGraph

    Ports

    Record format metadata

    Expression Metadata

    Layout

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    19

    RuntimeEnvironment

    Agraph,afterdevelopment,isdeployedtothebackendserverasaUnixshellscriptorWindowsNTbatchfile.

    ThisbecomestheexecutabletorunatthebackendwiththehelpoftheCooperatingsystem.

    TheexecutioncanbedonefromtheGDEitselformanuallyfromthebackend AbInitioruntimeenvironmentisdifferentfromthedevelopmentenvironment.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioUnixandAbinitio

    Unix serves as backend for Ab-initio. All the graphs/Jobs in Ab-initio can be accessible through Unix(backend) Putty connectivity Environment Quick Overview:

    $AI_RUN,$AI_BINrun directory, .ksh scripts$AI_PLAN, $AI_SERIAL_

    $AI_DMLrecord format files $AI_XFRtransform files $AI_MPgraphs $AI_DBdatabase config files $AI_SERIAL - serial source data, other serial data $AI_MFS - Ab Initio multifile directory in training will also contain partition

    directories (more about this later!)

    $AI_LOG - A location to place logging files, etc

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Sandboxes are work areas used to develop, test or run code associated with a given project. Only one version of the code can be held within the sandbox at any time. The EME Datastore contains all versions of the code that have been checked into it.

    Check-in

    Check-out

    Sandboxes and EME

    Check-out

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioAbinitioEnvironment

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioAbinitioEnvironment

    Jobrun

    How a job runs

    The execution of an Ab Initio graph is a job. To run a job, need to invoke a shell script that the GDE generates from a

    graph.

    The script process initiates job processes that control the execution of the programs represented by the graph.

    Graph->mp/graph1.mp ; Shell script->run/graph1.ksh

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioAbinitioEnvironment

    Jobrun

    You can invoke the script in two ways:

    From the GDE

    From a command line

    To invoke the script from the GDE, click the Run button or choose Run > Start from the GDE menu bar.

    To invoke the script through command line,

    For bin script: ksh scriptname.ksh in bin path.

    To run a graph from backend: $AI_RUN Graphname.ksh parameters(if needed) in run path.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioCreationofaGraph

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Components - Overview

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioComponentOrganizer

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    28

    Asamplegraph

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    29

    A sample korn shell script

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioDatasetComponentProperties

    Double click on acomponent to bringup its Properties Page

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioViewingPortProperties

    Click on the Ports Tabto view the Port(s)Properties

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    32

    DMLsandXFRs

    DML Ab Initio stores metadata in the form of record formats. Metadata can be embedded within a component or can be stored

    external to the graph in a file with a .dml extension. XFR

    Data can be transformed with the help of transform functions. Transform functions can be embedded within a component or can be

    stored external to the graph in a file with a .xfr extension.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    33

    DataMetadataLanguageorDML

    DMLSyntax Recordtypesbeginwithrecord andendwithend Fieldsaredeclared:data_type(len)field_name; Fieldnamesconsistofletters(az,AZ),digits(09)andunderscores(_)and

    areCasesensitive Keywords/Reservedwordsarerecord, end,date.

    SomeoftheDataTypesavailable String Decimal Integer

    StoringDatainbinaryform DateandDatetime EBCDICandASCIIrecords NullinAbInitio Nonexistenceofcolumnvalues.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioInTextview(specialsymbolasdelimiter)

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioRecordformat

    InGraphicalform(gridview)

    DML format created for a data

    0345John Smith0212Sam Spade0322Elvis Jones0492Sue West0121Mary Forth0221Bill Black

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioEditingTypesinGDE

    DML creation

    Field name Field type Field length

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioMoreRecordFormatEditing

    View Attributes.

    Field Type drop-down

    Length can be delimiter string

    Date format goes here

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioAutoDMLcreationinTablecomponent

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioDMLcreation Usefileoptionindataset

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    40

    TransformFunctions:XFRs

    User-defined function producing one or more output from one or more input

    Associated with transform components Rules that computes expression from input values and local variable and

    assigns the result to output objects Syntax

    Functions :output-records : : function-name (input-records) =

    begin

    assignments

    End;

    Assignments :

    Direct Mapping without any transformation: out.* :: in.*

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioInputfilesettings

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioInputData

    RecordView

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioInputfileView Backend

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioOutput

    Settings(Propagatingfrominput)

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    45

    LookupFile

    Serial or MultifilesHeld in main memory Searching and Retrieval is key-based and faster as compared to files stored on disks

    associates key values with corresponding data values to index records and retrieve them

    Lookup parameters Key Record Format

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    46

    BasicComponents

    FilterbyExpression Reformat RedefineFormat Sort Join Replicate Dedup Rollup

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    47

    FilterbyExpression

    Readsrecordfrominput port Evaluatetheselect_expr Ifresultistrue,recordwrittentoout port Ifresultisfalse,recordwrittentodeselect port

    true?

    expr

    No

    Input port

    DeselectOut port

    Yes

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    48

    Diagnostic Ports

    REJECT Input records that caused error

    ERROR Associated error message

    LOG Logging records

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioFilterbyExpression

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioFilteredoutput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    51

    Reformat

    1.

    Readsrecordfrominput

    port

    2.

    Recordpassesasargumenttotransformfunctionorxfr

    3.

    Recordswrittentoout

    ports,ifthefunctionreturnsasuccessstatus

    4.

    Recordswrittentoreject

    ports,ifthefunctionreturnsafailurestatus

    5.

    ParametersofReformatComponent

    Count Transform(Xfr)Function RejectThreshold Abort NeverAbort UseLimit&Ramp Limit Numberoferrorstotolerate Ramp ScaleoferrorstotolerateperInput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioReformatrejectthreshold

    A drop-down menu specifying the number of errors to tolerate.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioTransformfunctionalityinReformat

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioReformattedoutput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    55

    Sort

    SortComponent

    Readsrecordsfrominputport,sortsthembykey,writesresulttooutputportParameters

    Key Maxcore

    Keys

    Akeyidentifiesafieldorsetoffieldstoorganizeadataset

    SingleField:employee_number MultiplefieldorCompositekey:(last_name;first_name) Modifiers:employee_numberdescending

    Maxcore:Maximummemoryusageinbytes

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioSortFunctionality

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioSortedoutput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    58

    Join

    1.

    Readsrecordsfrommultipleinputports

    2.

    Operatesonrecordswithmatchingkeysusingamultiinputtransformfunction

    3.

    Writesresulttotheoutputport

    PORTS PARAMETERS

    inoutunusedreject(optional)error(optional)log(optional)

    countkeyoverridekeytransformlimitramp

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioJoinParameters

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioJoinedoutput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioRollup

    Rollup evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.

    Parameters:

    check-sort,sorted input limit,Ramp

    logging log_group

    log_input log_intermediate

    log_output grouped-input

    error_group key

    key-method major-key

    log_reject max-core

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioRollup - functionality

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioRollup

    Output

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioBuiltinFunctionsforRollup

    The following aggregation functions are predefined and are only available in the rollup component:

    avg max min count first Product last Sum Multi-stage Transform

    initialize,iterate,finalize,use of variables

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioRollupWizard

    Note the use of an aggregation function in the expression

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioSimpleandComplexComponents

    Inthesecomponentstherecordformatmetadata

    typicallychanges(goesthroughatransformation)

    frominputtooutput

    Inthesecomponentstherecordformat

    metadatadoesnotchangefrominputtooutput

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    67

    PriorityAssignment

    ThePriority

    istheorderofevaluationofrulesinatransformfunction.

    Anexample

    Ajoincomponentmayhaveatransformfunctionwithprioritizedrulesas

    out.ssn:1:in1.ssn;

    out.ssn:2:in2.ssn;

    out.ssn:3:"999999999";

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioPriorityAssignmentcontd

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioUsinglookupinsteadofJoin

    Using Last- Visitsas a lookup file

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioUsingalookupfileinaTransformFunction

    Output record format:recorddecimal(4) id;string(8) city;decimal(3) amount;date(YYYY/MM/DD) dt;

    end

    Input 0 record format:recorddecimal(4) id;string(6) name;string(8) city;decimal(3) amount;

    end

    Transform function:out :: lookup_info(in) =begin

    out.id : : in.id;out.city : : in.city;out.amount : : in.amount;out.dt :1 : lookup(Last-Visits, in.id).dt;out.dt :2 : 1900/01/01;

    end;

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioTheGDEDebugger

    TheGDEhasabuiltindebuggercapability ToenabletheDebugger,Debugger:EnableDebugger TheDebuggerToolbar

    Enable Debugger

    Add Watcher File

    Isolate Components

    Remove All Watchers

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    72

    MultistageTransform

    Datatransformationinmultiplestagesfollowingseveralsetsof

    rules

    Eachsetofruleformonetransformfunction

    Informationispassedacrossstagesbytemporaryvariables

    Stagesincludeinitialization,iteration,finalizationandmore

    Fewmultistagecomponentsareaggregate,rollup,scan

    Aggregate/Rollup/Scan

    Generatessummaryrecordsforgroupofinputrecords

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioDatabaseComponents

    *

    Join with DB* Truncate Table

    Deletes all the rows in a specified DB table

    * Run SQL Executes SQL statements in a DB

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    74

    BuiltInFunctions

    AbInitiobuiltinfunctionsareDMLexpressionsthat

    canmanipulatestrings,dates,andnumbers accesssystemproperties

    Functioncategories

    Datefunctions:now(),today(),date_to_int(),.. Inquiryanderrorfunctions:is_defined(),is_valid(),force_error(),.. Lookupfunctions:lookup(),lookup_local(),.. Mathfunctions:ceiling(),floor(),.. Miscellaneousfunctions:decimal_round(),hash_value(),.. Stringfunctions:string_substring(),is_blank(),..

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    75

    Components contd..

    Name Description

    Normalize Generates multiple data records from each input data recordSeparate a data record with a vector field into several individual records, each containing one element of the vector.

    Denormalize Sorted

    Consolidates groups of related data records into a single output record with a vector field for each groupRequires Grouped Input

    Validate Records

    Separates valid data records from invalid data records

    Check Order Tests whether data records are sorted according to a key-specifier.

    Compare Records

    Compares data records from two flows one by one

    Generate Records

    Generates a specified number of data records with fields of specified lengths and types.

    Gather Logs Collects the output from the log ports of components for analysis of a graph after execution

    Sample Selects a specified number of data records at random from one or multiple input flows

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Mechanism by which some or all constituents of an application datasets and processing modules are replicated into a number of

    partitions, each spawning a process.

    This makes the Ab initio to process considerable huge volume (in millions) of records with an optimum usage of hardware available.

    The power of Ab Initio lies in the fact that it can process data in parallel

    runtime environment

    Types of Parallelismz Component Parallelismz Pipeline Parallelismz Data Parallelism

    ParallelisminAbInitio

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Component Parallelism is achieved when different instances of same

    component run on separate data sets. Component parallelism scales to the

    number of branches of a graph the more branches a graph has, the greater

    the component parallelism. If a graph has only one branch, component

    parallelism cannot occur.

    ComponentParallelism

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Pipeline parallelism occurs when several connected program components on

    the same branch of a graph execute simultaneously. In this kind the two

    processing stages of the graph run concurrently.

    PipelineParallelism

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    When data is divided into segments or partitions and multiple instances of

    program components run simultaneously on each partition

    Expanded View

    Linear View

    DataParallelism

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    80

    DataParallelism

    Multifiles

    Aglobalviewofasetofordinaryfilescalledpartitions usuallylocatedondifferentdisksorsystems

    AbInitioprovidesshelllevelutilitiescalledm_commands forhandlingmultifiles(copy,delete,moveetc.)

    MultifilesresideonMultidirectoriesEachisrepresentedusingURLnotationwithmfile astheprotocolpart:

    mfile://pluto.us.com/usr/ed/mfs1/new.dat

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    //host1/vol4/pA/mydir/myfile.dat

    //host2/vol3/pB/mydir/myfile.dat

    //host3/vol7/pC/mydir/myfile.dat

    ControlPartition

    DataPartition on Host1

    DataPartition on Host2

    DataPartition on Host3

    Afilespanningacrosspartitionsonsame/differenthosts

    mfile://host1/u/jo/mfs/mydir/myfile.dat

    //host1/u1/jo/mfs/mydir /myfile.dat

    AMultifile

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    82

    DataPartitioningComponents

    Data can be partitioned using

    Partition by Round-robin

    Partition by Key

    Broadcast

    Partition by Expression

    Partition by Range

    Partition by Percentage

    Partition by Load Balance

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Writes records to each partition evenly Block-size records go into one partition before moving on to the next.

    RoundrobinPartition

    BCD

    FCDBGB

    DF

    D

    BC

    D

    FC

    DB

    GB

    DF

    E

    E

    E

    E

    A

    AA

    A

    A

    AA

    AD

    Partition 0 Partition 1 Partition 2

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioA Data Parallel Application: The Global View

  • Asias Largest Global Software & Services CompanyConfidential

    Ab InitioPartitioning by Key

    BC

    D

    FC

    DBGB

    DF

    D

    Partition 0 Partition 1 Partition 2

    BCD

    FCDBGB

    DF

    E

    E

    A

    AA

    A

    E

    E

    A

    AA

    AD

    BC

    D

    FC

    DBGB

    DF

    D

    Partition 0 Partition 1 Partition 2

    BCD

    FCDBGB

    DF

    E

    E

    A

    AA

    A

    E

    E

    A

    AA

    AD

    A hash code computed using the key determines which partition a record will be written on, meaning that records with the same key value will go to the same partition

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    86

    DepartitioningComponents

    Gather Reads data records from the flows connected to the input port Combines the records arbitrarily and writes to the output

    Concatenate Concatenate appends multiple flow partitions of data records one

    after another

    Merge Combines data records from multiple flow partitions that have been

    sorted on a key

    Maintains the sort order

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    87

    Factors:Phases&Checkpoints

    Phasing:Breaking an application into phases limits the contention for

    Main memory.Processor(s).

    Breaking an application into phases costDisk space.

    Checkpoint - Purpose: Provide same functionality as phase Additional: Provide restart capability

    How does it work ? At job start, output datasets are copied to temporary files (in .WORK-

    serial or .WORK-parallel directories) At checkpoint completion, intermediate datasets and job state are

    stored in temporary files Recovery information is stored in host and vnode directories

    represented by AB_WORK_DIR defined in the Ab Initio environment

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    Directory dedicated to Co>Ops Should have enough free space; Cannot be NFS or NAS mounted Holds Storage of Internal Log Files (used in recovery of Ab Initio Graph) Used when components are connected via name pipes Sub-directories of AB_WORK_DIR

    host Holds Control Node Recovery Files vnode Holds Processing Node Recovery Files data Holds files for Layouts cache Holds Cache Files needed by remote components

    Important logging information in host and vnode directories Usually does not have data files. Components with host layouts or database layouts, data written to data

    subdirectory AB_WORK_DIR fill up leads to non-recovery of Ab Initio Jobs.

    88

    AB_WORK_DIR

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    89

    Performance:DebuggingLogfile

    A sample log file ..

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    90

    Performance:DebuggingLogfile

    ReadingtheLog:CPU CPUtime:totalprocessingforcomponent Status:[Running:Finished] Skew:amongCPUtimesofeachpartition Vertex:component

    ReadingtheLog:DATA Databytes:#processed Records:#processed Status:[unopened:opened:closed] Skew:amongdatabytesinpartitions Flow:linkbetweencomponents

    datatrackinginfoisdisplayedonflowsinGDE Vertex:component Port:ofcomponent

    Interpretingthelog Computedatabytes/secthroughcomponent,ineachpartition Lookforserialization:effectiveCPU=(cpu time)/(elapsedtime) compareopenvs.closedpartitions:serialized whensomepartitionsremainopenlongafter

    othershaveclosed dataskew Deadlock:no changeinrecordcountsovercoupleofintervals

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    91

    Performance:Inanutshell..

    AvoidSortsasitisconsumingmorememory.

    AvoidcomponentslikeJoinwithDB(hitting

    dbforeachandeveryrecord).

    UseLookups.

    UseInmemoryJoin/Rollup.

    AssignDrivingPortofJoincorrectly.

    Filteringunrequireddatabeforeprocessing.

    Phasing.

  • Asias Largest Global Software & Services CompanyConfidential

    Ab Initio

    92

    THANK YOU

    Slide Number 1Slide Number 2Slide Number 3Slide Number 4Slide Number 5Slide Number 6Slide Number 7Slide Number 8Slide Number 9Slide Number 10Slide Number 11Slide Number 12Slide Number 13Slide Number 14Slide Number 15Slide Number 16Slide Number 17Slide Number 18Slide Number 19Slide Number 20Slide Number 21Slide Number 22Slide Number 23Slide Number 24Slide Number 25Slide Number 26Slide Number 27Slide Number 28Slide Number 29Slide Number 30Slide Number 31Slide Number 32Slide Number 33Slide Number 34Slide Number 35Slide Number 36Slide Number 37Slide Number 38Slide Number 39Slide Number 40Slide Number 41Slide Number 42Slide Number 43Slide Number 44Slide Number 45Slide Number 46Slide Number 47Slide Number 48Slide Number 49Slide Number 50Slide Number 51Slide Number 52Slide Number 53Slide Number 54Slide Number 55Slide Number 56Slide Number 57Slide Number 58Slide Number 59Slide Number 60Slide Number 61Rollup - functionalitySlide Number 63Slide Number 64Slide Number 65Slide Number 66Slide Number 67Slide Number 68Slide Number 69Slide Number 70Slide Number 71Slide Number 72Slide Number 73Slide Number 74Slide Number 75Slide Number 76Slide Number 77Slide Number 78Slide Number 79Slide Number 80Slide Number 81Slide Number 82Slide Number 83Slide Number 84Slide Number 85Slide Number 86Slide Number 87Slide Number 88Slide Number 89Slide Number 90Slide Number 91Slide Number 92