
I. WHAT IS ETL PROCESS

In computing, extract, transform, and load (ETL) refers to a process in database usage and especially in data warehousing that:

Extracts data from outside sources

Transforms it to fit operational needs, which can include quality levels

Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)

ETL systems are commonly used to integrate data from multiple applications, typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales and purchasing.

Extract

The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, since extracting data correctly sets the stage for the success of all subsequent processes.

Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization and/or format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping. The streaming of the extracted data source and load on-the-fly to the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format appropriate for transformation processing.

An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely or in part.

Transform

The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:

Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), such as roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).

Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)

Encoding free-form values (e.g., mapping "Male" to "M")

Deriving a new calculated value (e.g., sale_amount = qty * unit_price)

Sorting

Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data

Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)

Generating surrogate-key values

Transposing or pivoting (turning multiple columns into multiple rows or vice versa)

Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)

Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)

Lookup and validate the relevant data from tables or referential files for slowly changing dimensions.

Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
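
As an illustration, the following SQL sketch applies several of the transformations above (column selection, null filtering, code translation and a derived calculated value) in a single pass. The table and column names (src_employee, stg_employee, gender_code, qty, unit_price) are hypothetical and are used only to make the ideas concrete.

-- Hypothetical staging load that selects columns, filters out null salaries,
-- translates coded values and derives a calculated value.
INSERT INTO stg_employee (roll_no, salary, gender, sale_amount)
SELECT
    roll_no,                        -- column selection: age is not loaded
    salary,
    CASE gender_code                -- code translation: 1/2 becomes M/F
        WHEN 1 THEN 'M'
        WHEN 2 THEN 'F'
        ELSE 'U'
    END AS gender,
    qty * unit_price AS sale_amount -- derived calculated value
FROM src_employee
WHERE salary IS NOT NULL;           -- ignore records where salary is null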

Load

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely.

II. OBJECTIVE OF ETL TESTING

The objective of ETL testing is to assure that the data loaded from source to destination after business transformation is accurate. It also involves the verification of data at the various intermediate stages between source and destination.

Regardless of the tools and databases involved, the following two documents are the two hands of an ETL tester. It is also important that both documents be in a complete state before ETL testing starts; continuous change in them leads to inaccurate testing and re-work.

ETL mapping sheets

DB schema of Source, Target and any middle stage that is in between.

An ETL mapping sheet contains all the information about the source and destination tables, including each and every column and its look-up in reference tables.

An ETL tester needs to be comfortable with SQL, as ETL testing may involve writing large queries with multiple joins to validate data at any stage of ETL. ETL mapping sheets provide significant help while writing queries for data verification, and the DB schema should also be kept handy to verify any detail in the mapping sheets.
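
For instance, a typical data-verification query built from a mapping sheet compares the transformed source data with the target using a set difference. The table and column names below (src_customer, tgt_dim_customer) are illustrative assumptions, and EXCEPT may be written as MINUS on some databases such as Oracle.

-- Rows produced by the mapping rule on the source side that are missing
-- or different in the target; an empty result means the rule is satisfied.
SELECT customer_id, UPPER(customer_name) AS customer_name
FROM src_customer
EXCEPT
SELECT customer_id, customer_name
FROM tgt_dim_customer;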

III. DWH

What is DWH

A data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

The data stored in the warehouse are uploaded from the operational systems (such as marketing and sales). The data may pass through an operational data store for additional operations before they are used in the DW for reporting.

The typical ETL-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.

A data warehouse constructed from integrated data source systems does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered to be part of a distributed operational data store layer. Data federation or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports drilling down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.

Benefits of DWH

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

Congregate data from multiple sources into a single database so a single query engine can be used to present data.

Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long running, analysis queries in transaction processing databases.

Maintain data history, even if the source transaction systems do not.

Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.

Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.

Present the organization's information consistently.

Provide a single common data model for all data of interest regardless of the data's source.

Restructure the data so that it makes sense to the business users.

Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.

Add value to operational business applications, notably customer relationship management (CRM) systems.

DWH Architecture

IV. DIMENSIONAL DATA MODELING

Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship modeling (ER). Dimensional Modeling does not necessarily involve a relational database. The same modeling approach, at the logical level, can be used for any physical form, such as multidimensional database or even flat files. According to data warehousing consultant Ralph Kimball,[1] DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. According to him, although transaction-oriented ER is very useful for the transaction capture, it should be avoided for end-user delivery.

Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact; timestamp, product, register#, store#, etc. are elements of dimensions. Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc. Because the different business process areas share some but not all dimensions, efficiency in design, operation, and consistency is achieved by using conformed dimensions, i.e. using one copy of the shared dimension across subject areas. The term "conformed dimensions" was originated by Ralph Kimball.

Dimensional modeling process

The dimensional model is built on a star-like schema, with dimensions surrounding the fact table. To build the schema, the following design model is used:

Choose the business process

Declare the grain

Identify the dimensions

Identify the facts

Choose the business process

The process of dimensional modeling builds on a 4-step design method that helps to ensure the usability of the dimensional model and the use of the data warehouse. The basics in the design build on the actual business process which the data warehouse should cover. Therefore, the first step in the model is to describe the business process which the model builds on. This could for instance be a sales situation in a retail store. To describe the business process, one can choose to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design guides like the Unified Modeling Language (UML).

Declare the grain

After describing the business process, the next step in the design is to declare the grain of the model. The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be “An individual line item on a customer slip from a retail store”. To clarify what the grain means, you should pick the central process and describe it with one sentence. Furthermore, the grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step to alter the grain due to new information gained on what your model is supposed to be able to deliver.

Identify the dimensions

The third step in the design process is to define the dimensions of the model. The dimensions must be defined within the grain from the second step of the 4-step process. Dimensions are the foundation of the fact table, and are where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory etc. These dimensions are where all the data is stored. For example, the date dimension could contain data such as year, month and weekday.

Identify the facts

After defining the dimensions, the next step in the process is to make keys for the fact table. This step is to identify the numeric facts that will populate each fact table row. This step is closely related to the business users of the system, since this is where they get access to data stored in the data warehouse. Therefore most of the fact table rows are numerical, additive figures such as quantity or cost per unit, etc.

Fact Table

In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non-additive, and semi-additive measures.

Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
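
A minimal sketch of such a SALES fact table, assuming hypothetical surrogate keys for the Day, Product and Store dimensions, could look like this; the grain is enforced by the composite primary key made up of the foreign keys.

-- Grain: sales volume by Day by Product by Store.
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,   -- foreign key to the date dimension
    product_key  INTEGER NOT NULL,   -- foreign key to the product dimension
    store_key    INTEGER NOT NULL,   -- foreign key to the store dimension
    sales_volume DECIMAL(12,2),      -- additive measure
    PRIMARY KEY (date_key, product_key, store_key)
);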

Types of Measures

Additive - Measures that can be added across any dimension.

Non Additive - Measures that cannot be added across any dimension.

Semi Additive - Measures that can be added across some dimensions.

Types of Fact Tables

There are basically three fundamental measurement events, which characterize all fact tables.

Transactional

A transactional fact table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. Typically a transactional fact table holds data of the most detailed level, causing it to have a great number of dimensions associated with it.

Snapshot

The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.

Cumulative

This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it's important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.

Dimension Table

In data warehousing, a dimension table is one of the set of companion tables to a fact table. The fact table contains business facts (or measures) and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables. Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text).

Dimension table rows are uniquely identified by a single key field. It is recommended that the key field be a simple integer because a key value is meaningless, used only for joining fields between the fact and dimension tables.

A dimensional data element is similar to a categorical variable in statistics.

Typically dimensions in a data warehouse are organized internally into one or more hierarchies. "Date" is a common dimension, with several possible hierarchies:

"Days (are grouped into) Months (which are grouped into) Years",

"Days (are grouped into) Weeks (which are grouped into) Years"

"Days (are grouped into) Months (which are grouped into) Quarters (which are grouped into) Years"

Types of Dimensions

Conformed Dimension

Junk Dimension

Schema

A schema is a collection of database objects, including tables, views, indexes, and synonyms.

There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. The most common data-warehouse schema model is a star schema. However, a significant but smaller number of data warehouses use third-normal-form (3NF) schemas, or other schemas which are more highly normalized than star schemas.

Star Schema

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram of a star schema resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables.

A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

A star query is a join between a fact table and a number of lookup tables. Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other.

Cost-based optimization recognizes star queries and generates efficient execution plans for them. (Star queries are not recognized by rule-based optimization.)

A typical fact table contains keys and measures. For example, a simple fact table might contain the measure Sales, and keys Time, Product, and Market. In this case, there would be corresponding dimension tables for Time, Product, and Market. The Product dimension table, for example, would typically contain information about each product number that appears in the fact table. A measure is typically a numeric or character column, and can be taken from one column in one table or derived from two columns in one table or two columns in more than one table.

A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table normally has a concatenated index on the key columns to facilitate this type of join.
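
A star query over such a schema joins the fact table to each dimension on its key and aggregates the measure. The sketch below assumes illustrative table and column names (sales_fact, dim_time, dim_product, dim_market) rather than any particular product's schema.

-- Total sales by product and market for a given year.
SELECT p.product_name,
       m.market_name,
       SUM(f.sales) AS total_sales
FROM sales_fact f
JOIN dim_time    t ON f.time_key    = t.time_key      -- primary-key to foreign-key joins
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_market  m ON f.market_key  = m.market_key
WHERE t.calendar_year = 2014
GROUP BY p.product_name, m.market_name;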

The main advantages of star schemas are that they:

Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.

Provide highly optimized performance for typical data warehouse queries.

Snowflake Schema

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a Product table, a Product_Category table, and a Product_Manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
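
A sketch of that normalization, with assumed key and column names, is shown below; each level of the product hierarchy becomes its own table joined through foreign keys.

-- Snowflaked product dimension: product references category and manufacturer.
CREATE TABLE Product_Category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(100)
);

CREATE TABLE Product_Manufacturer (
    manufacturer_key  INTEGER PRIMARY KEY,
    manufacturer_name VARCHAR(100)
);

CREATE TABLE Product (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(100),
    category_key     INTEGER REFERENCES Product_Category (category_key),
    manufacturer_key INTEGER REFERENCES Product_Manufacturer (manufacturer_key)
);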

DATA MART

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data. This enables each department to use, manipulate and develop their data any way they see fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

Types of Data Marts

Dependent and Independent Data Marts

There are two basic types of data marts: dependent and independent. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data, or both.

The main difference between independent and dependent data marts is how you populate the data mart; that is, how you get data out of the sources and into the data mart. This step, called the Extraction, Transformation, and Loading (ETL) process, involves moving data from operational systems, filtering it, and loading it into the data mart.

With dependent data marts, this process is somewhat simplified because formatted and summarized (clean) data has already been loaded into the central data warehouse. The ETL process for dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form.

With independent data marts, however, you must deal with all aspects of the ETL process, much as you do with a central data warehouse. The number of sources is likely to be fewer and the amount of data associated with the data mart is less than the warehouse, given your focus on a single subject.

The motivations behind the creation of these two types of data marts are also typically different. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access of data relevant to a specific department. The creation of independent data marts is often driven by the need to have a solution within a shorter time.

INDUSTRY STANDARD SCD

A Slowly Changing Dimension (SCD) is a well-defined strategy to manage both current and historical data over time in a data warehouse. You must first decide which type of slowly changing dimension to use based on your business requirements.

Type 1 (Overwriting): Only one version of the dimension record exists. When a change is made, the record is overwritten and no historic data is stored. Preserves history? No.

Type 2 (Creating Another Dimension Record): There are multiple versions of the same dimension record, and new versions are created while old versions are still kept upon modification. Preserves history? Yes.

Type 3 (Creating a Current Value Field): There are two versions of the same dimension record, old values and current values, and old values are kept upon modification of current values. Preserves history? Yes.

Type 1

This methodology overwrites old with new data, and therefore does not track historical data.

Example of a supplier table:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins, use integer rather than character keys.

If the supplier relocates its headquarters to Illinois, the record is overwritten:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL
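
In SQL, a Type 1 change is a plain overwrite of the existing row. The statement below is a sketch; the dimension table name dim_supplier is an assumption.

-- Type 1: overwrite in place, no history kept.
UPDATE dim_supplier
SET Supplier_State = 'IL'
WHERE Supplier_Code = 'ABC';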

The disadvantage of the Type 1 method is that there is no history in the data warehouse. However, it has the advantage of being easy to maintain.

Type 2

This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.

For example, if the supplier relocates to Illinois the version numbers will be incremented sequentially:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
123           ABC            Acme Supply Co  IL              1

Another method is to add 'effective date' columns.

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  ETL_EFFCT_STRT_DT  ETL_END_DT
123           ABC            Acme Supply Co  CA              01-Jan-2000        21-Dec-2004
123           ABC            Acme Supply Co  IL              22-Dec-2004        9999-12-31
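
With effective-date columns, a Type 2 change is typically applied in two steps: expire the currently active row and insert a new version. The sketch below assumes a dim_supplier table with the columns shown above and uses the dates from the example.

-- Step 1: close out the currently active row for the natural key.
UPDATE dim_supplier
SET ETL_END_DT = DATE '2004-12-21'
WHERE Supplier_Code = 'ABC'
  AND ETL_END_DT = DATE '9999-12-31';

-- Step 2: insert the new active version of the record.
INSERT INTO dim_supplier
    (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, ETL_EFFCT_STRT_DT, ETL_END_DT)
VALUES
    (123, 'ABC', 'Acme Supply Co', 'IL', DATE '2004-12-22', DATE '9999-12-31');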

Type 3

This method tracks changes using separate columns and preserves limited history; the history is limited to the number of columns designated for storing historical data. The original table structure of Type 1 and Type 2 is unchanged, but Type 3 adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state; only the previous history is stored.

Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       25-Dec-2004     IL

This record contains a column for the original state and a column for the current state, so it cannot track the changes if the supplier relocates a second time.

One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State which would track only the most recent historical change.
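
A sketch of such a change in SQL, assuming a dim_supplier table: the current value is shifted into the history column before being overwritten. With the Previous_Supplier_State variant this same shift is applied on every change.

-- Type 3: keep one level of history in a dedicated column.
UPDATE dim_supplier
SET Original_Supplier_State = Current_Supplier_State,  -- preserve the prior state
    Current_Supplier_State  = 'IL',
    Effective_Date          = DATE '2004-12-25'
WHERE Supplier_Code = 'ABC';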

ETL TESTING

Inputs:

Test Strategy Document

Test plan document

Mapping specification

Macro and Micro design

Copy books

Schema file

Test data files (COBOL, .csv)

Test Environment:

DataStage (ETL tool)

DB2 (Database)

PuTTY (UNIX)

ETL Testing Life cycle

ETL testing techniques

Smoke Test:

A smoke test checks the count of records present in the source and the target. Only if the counts match on both sides do we proceed further; if they do not match, the primary key and attribute validations will all fail, which blocks all the test cases. The smoke test is therefore performed first.

For example, if 2000 records are present on the source side, then the same number of records should move to the target, i.e. all 2000 records should reach the target.
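
A smoke test usually reduces to comparing two counts. The query below is a sketch with assumed source and target table names (src_orders, tgt_fact_orders).

-- The two counts should be identical; any difference blocks further testing.
SELECT 'source' AS side, COUNT(*) AS record_count FROM src_orders
UNION ALL
SELECT 'target' AS side, COUNT(*) AS record_count FROM tgt_fact_orders;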

Primary Key Validation:

The primary key is the key which uniquely identifies a record. After the smoke test passes (the counts match), we perform primary key validation. Here, we validate the primary keys in source and target after applying the transformation and occurrence rules, checking whether they match. If they match, primary key validation passes; otherwise it fails.
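
One common way to do this is a set difference on the key columns in both directions; both queries should return no rows. The table names continue the earlier assumption, and EXCEPT may be MINUS on some databases.

-- Keys present in the source but missing in the target.
SELECT order_id FROM src_orders
EXCEPT
SELECT order_id FROM tgt_fact_orders;

-- Keys present in the target but not in the source.
SELECT order_id FROM tgt_fact_orders
EXCEPT
SELECT order_id FROM src_orders;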

Attribute Validation:

After primary key validation we perform attribute validation, i.e. we check whether each attribute is moving correctly to PP as per the mapping transformation. We add an attribute along with the primary key, apply the transformation logic, and do a source-minus-target comparison. If everything matches, the source-minus-target count is 0; otherwise the validation fails and an error count is reported.

For example, if the PP transformation is 'convert to char', then the data we get in the target should be in char format.
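
Continuing the earlier sketch, attribute validation adds the transformed attribute next to the key and repeats the set difference; a zero-row result means the attribute moved correctly. The 'convert to char' rule is expressed here with a CAST.

-- Apply the mapped transformation on the source side and compare
-- key plus attribute against the target.
SELECT order_id, CAST(order_qty AS CHAR(10)) AS order_qty
FROM src_orders
EXCEPT
SELECT order_id, order_qty
FROM tgt_fact_orders;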

Duplicate Checking:

Duplicate checking verifies whether duplicates are being moved to the target. The main aim of the duplicate test is to validate that the same data does not move to the target more than once.
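
A duplicate check is usually a GROUP BY on the business key; any row returned indicates a duplicate. Table and key names are again assumed.

-- Keys that appear more than once in the target.
SELECT order_id, COUNT(*) AS occurrences
FROM tgt_fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;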

TDQ (Technical Data Quality): TDQ is performed before applying the transformation. In TDQ we check whether nulls, spaces, or invalid dates are being handled correctly. There are different scenarios to cover while performing TDQ, for example:

1. If the source field data type is varchar(20) or integer, the source value is null or spaces, and the target field is also varchar, then null should be moved to the PP file.

2. If the source field data type is varchar(50) or integer, the source value is null or spaces, and the target field is integer, then 0 should be moved to the target.

3. If the source field is varchar(20) or integer, the source value is null or spaces, and the target field is a date, then '1111-11-11' should be moved to the target.

4. If the source field contains any invalid value and the target field is a date, then '1212-12-12' should be moved to the target.
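
The defaulting rules above can be expressed, and therefore tested, with CASE expressions of roughly this shape. The table and column names are placeholders, and values are kept as strings here; the actual load would cast them to the target data types.

-- Expected TDQ defaults for a source column mapped to different target types.
SELECT
    src_col,
    CASE WHEN src_col IS NULL OR TRIM(src_col) = ''
         THEN NULL ELSE src_col END         AS expected_when_target_is_varchar,
    CASE WHEN src_col IS NULL OR TRIM(src_col) = ''
         THEN '0' ELSE src_col END          AS expected_when_target_is_integer,
    CASE WHEN src_col IS NULL OR TRIM(src_col) = ''
         THEN '1111-11-11' ELSE src_col END AS expected_when_target_is_date
FROM src_table;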

CDC (Change Data Capture):

Change Data Capture is the process of capturing changes made at the data source. It improves the operational efficiency and ensures data synchronization. It easily identifies the data that has been changed and makes the data available for further use.

CDC is applied when DML (Data Manipulation Language) operations (Insert, Update, and Delete) are performed on the data.

APPROACH:

● Execute SQL scripts to check for active records in every individual table.

● Attribute values are compared for two periods, Period 1 and Period 2.

● The values are compared for four scenarios, Matching, Unmatching, Insertion and Deletion.

● Once the scripts are executed for these four scenarios in both the periods, then the results are compared manually to check for the correctness of the CDC that has been applied.
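
As a sketch, the period-to-period comparison can itself be written in SQL. The tables period1_snapshot and period2_snapshot and the key and attribute names below are assumptions used for illustration.

-- Insertion: keys present only in Period 2.
SELECT p2.record_key
FROM period2_snapshot p2
LEFT JOIN period1_snapshot p1 ON p1.record_key = p2.record_key
WHERE p1.record_key IS NULL;

-- Deletion: keys present only in Period 1.
SELECT p1.record_key
FROM period1_snapshot p1
LEFT JOIN period2_snapshot p2 ON p2.record_key = p1.record_key
WHERE p2.record_key IS NULL;

-- Unmatching (updated): same key, different attribute values.
SELECT p1.record_key
FROM period1_snapshot p1
JOIN period2_snapshot p2 ON p2.record_key = p1.record_key
WHERE p1.attribute_value <> p2.attribute_value;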

We are using two types of files: Delta file and Full Snapshot file.

➢ Delta File:

Delta processes compare the last historical file with the current one to identify the changes that have occurred. The delta processing application will only update the data that has changed in the source systems.

Delta load

The delta load process extracts only that data which has changed since the last time a build was run. The delta load process is used for extracting data for the operational data store of IBM® Rational® Insight data warehouse. This topic is an overview of the delta load implementation.

To run the delta load process, you need to store the date and time of the last successful build of the ETL (extract, transform, and load) process. The CONFIG.ETL_INFO table in the data warehouse is defined for this purpose. Every time an ETL job is run, some variables are initialized. For the delta load process, the following two variables are used:

The MODIFIED_SINCE variable.

The ETL job searches the CONFIG.ETL_INFO table to get the date and time of the last successful ETL run and sets that value in the MODIFIED_SINCE variable, which is used later in the ETL build to determine whether there are changes to the data since the last run.

The ETL_START_TIME variable

The ETL job gets the system date and time and stores that value to the ETL_START_TIME variable. After the ETL job is over, the value stored in this variable is used for updating the CONFIG.ETL_INFO table.

Whether the delta load process works for a specific product or not depends upon the data service through which the product data is extracted.
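
Conceptually, the delta extraction uses those two variables as sketched below. CONFIG.ETL_INFO, MODIFIED_SINCE and ETL_START_TIME come from the description above, while the column names, the job name and the source table (src_work_items) are assumptions made for the sake of the example.

-- 1. Read the timestamp of the last successful build into MODIFIED_SINCE.
SELECT last_successful_run
FROM CONFIG.ETL_INFO
WHERE job_name = 'MY_ETL_JOB';            -- hypothetical key column and value

-- 2. Extract only the rows changed since that timestamp.
SELECT *
FROM src_work_items
WHERE last_modified >= :MODIFIED_SINCE;   -- bound to the value read in step 1

-- 3. After a successful build, record ETL_START_TIME as the new watermark.
UPDATE CONFIG.ETL_INFO
SET last_successful_run = :ETL_START_TIME
WHERE job_name = 'MY_ETL_JOB';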

➢ Full Snapshot File:

A full snapshot file contains all active source records. It captures the information each period regardless of whether a change occurs.

CDC SCENARIOS:

Matching Scenario: No content change occurs between Period 1 and Period 2. The same records are present in both periods if Period 2 is a full file; if Period 2 is a delta file, the records which have not undergone any update will not be present in Period 2. The records always move from Period 1 to the target with the effective date as the Period 1 date and the end date as 12/31/9999.

Unmatched Scenario: The content of the records differs between Period 1 and Period 2. The records which underwent some update in Period 1 should be expired, with the end date set to the effective date of Period 2. The updated versions of those records will be active in Period 2 with an end date of 12/31/9999. All the records from Period 1 and Period 2 will be moved to the target.

Insertion Scenario: New records are inserted in Period 2. The effective date of these records will be the same as the effective date of Period 2 and the end date will be 12/31/9999. All the records, along with the newly inserted records, will be moved to the target.

Deletion Scenario: The records which got expired in Period 1 will not be present in Period 2, i.e. those records are deleted. However, the deleted records are still moved to the target from Period 1, with the effective date and end date set to the effective date of Period 1.

Default Checking: This check verifies whether the defaulted columns are being moved to the target correctly.

For example, if the transformation rule specifies null as the default, we should get null in the target.

REPORTING TECHNIQUES

A report is a route map that keeps track of every result captured from the different testing scenarios. Testing is carried out in three cycles and a report is generated for every cycle. We maintain a 'Test Result Summary' report which explains the overall status of testing.

Test Case Metrics:

Execution of test cases leads to creation of metrics which are then incorporated in various reports for management as well as test team reporting:

1. Initial metrics revolve around the total number of test cases required in each area, the number of test cases completed, the percentage completed, the start date of test case preparation and the target completion date.

2. Metrics are created once the test cases are put into execution mode, in terms of Pass/Fail status, and are updated accordingly in Quality Center for tracking purposes.

3. During execution mode, defect numbers and the volume of defects generated during each test cycle are also closely monitored and reported. The defects are reported by each integration area to measure the data quality or code issues and any repeat defects for a particular area.

Useful Unix Commands

cd dirname --- change directory. You basically 'go' to another directory, and you will see the files in that directory when you do 'ls'. You always start out in your 'home directory', and you can get back there by typing 'cd' without arguments.

cd .. will get you one level up from your current position. You don't have to walk along step by step - you can make big leaps or avoid walking around by specifying pathnames.

mkdir dirname --- make a new directory

mv - move or rename files or directories

rm - remove files or directories

rmdir - remove a directory

cp filename1 filename2 --- copies a file

diff - display differences between text files

diff filename1 filename2 --- compares files, and shows where they differ

wc filename --- tells you how many lines, words, and characters there are in a file

grep - searches files for a specified string or expression

ls - list names of files in a directory

ls -l --- lists your files in 'long format', which contains lots of useful information, e.g. the exact size of the file, who owns the file and who has the right to look at it, and when it was last modified.

ls -a --- lists all files, including the ones whose filenames begin in a dot, which you do not always want to see. There are many more options, for example to list files by size, by date, recursively etc.

cat --- The most common use of cat is to read the contents of files, and cat is often the most convenient program for this purpose. All that is necessary to open a text file for viewing on the display monitor is to type the word cat followed by a space and the name of the file and then press the ENTER key. For example, the following will display the contents of a file named file1:

cat file1

OTHERS

Test plan:

A test plan is a document detailing a systematic approach to testing a system such as a machine or software. The plan typically contains a detailed understanding of what the eventual workflow will be.

Test strategy:

A test strategy is an outline that describes the testing approach of the software development cycle. It is created to inform project managers, testers, and developers about some key issues of the testing process. This includes the testing objective, methods of testing new functions, total time and resources required for the project, and the testing environment.

Test strategies describe how the product risks of the stakeholders are mitigated at the test-level, which types of test are to be performed, and which entry and exit criteria apply. They are created based on development design documents. System design documents are primarily used and occasionally, conceptual design documents may be referred to. Design documents describe the functionality of the software to be enabled in the upcoming release. For every stage of development design, a corresponding test strategy should be created to test the new feature sets.

DB

A database is an organized collection of data. The data are typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies).

DBMS

Database management systems (DBMSs) are specially designed applications that interact with the user, other applications, and the database itself to capture and analyze data. A general-purpose database management system (DBMS) is a software system designed to allow the definition, creation, querying, update, and administration of databases. Well-known DBMSs include MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Microsoft Access, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro. A database is not generally portable across different DBMSs, but different DBMSs can inter-operate by using standards such as SQL and ODBC or JDBC to allow a single application to work with more than one database.

RDBMS

An RDBMS is used to establish relationships between database objects.

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd, of IBM's San Jose Research Laboratory. Many popular databases currently in use are based on the relational database model.

RDBMSs have become a predominant choice for the storage of information in new databases used for financial records, manufacturing and logistical information, personnel data, and much more. Relational databases have often replaced legacy hierarchical databases and network databases because they are easier to understand and use. However, relational databases have been challenged by object databases, which were introduced in an attempt to address the object-relational impedance mismatch in relational databases, and by XML databases.

ORDBMS

An object-relational database (ORD), or object-relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with pure relational systems, it supports extension of the data model with custom data-types and methods.

An object-relational database can be said to provide a middle ground between relational databases and object-oriented databases (OODBMS). In object-relational databases, the approach is essentially that of relational databases: the data resides in the database and is manipulated collectively with queries in a query language; at the other extreme are OODBMSes in which the database is essentially a persistent object store for software written in an object-oriented programming language, with a programming API for storing and retrieving objects, and little or no specific support for querying.

The basic goal for the object-relational database is to bridge the gap between relational databases and the object-oriented modeling techniques used in programming languages such as Java, C++, Visual Basic .NET or C#. However, a more popular alternative for achieving such a bridge is to use standard relational database systems with some form of object-relational mapping (ORM) software. Whereas traditional RDBMS or SQL-DBMS products focused on the efficient management of data drawn from a limited set of data-types (defined by the relevant language standards), an object-relational DBMS allows software developers to integrate their own types and the methods that apply to them into the DBMS.

ODS

An operational data store (or "ODS") is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.

Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis.

OLTP

Online transaction processing, or OLTP, is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application.

Contrasting OLTP and Data Warehousing Environments

OLAP

In computing, online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).

OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, the drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region’s sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases, hierarchical databases and relational databases.

DATA MINING

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. It works on data stored in data warehouses, and that data may come from all parts of the business, from production to management. Managers also use data mining to decide upon marketing strategies for their products and to compare and contrast themselves with competitors. Data mining turns its data into real-time analysis that can be used to increase sales, promote a new product, or discontinue a product that adds no value to the company.

SDLC

WATERFALL MODEL

The waterfall model is a sequential design process, often used in software development processes, in which progress is seen as flowing steadily downwards (like a waterfall) through the phases of Conception, Initiation, Analysis, Design, Construction, Testing, Production/Implementation, and Maintenance.

The waterfall development model originates in the manufacturing and construction industries: highly structured physical environments in which after-the-fact changes are prohibitively costly, if not impossible. Since no formal software development methodologies existed at the time, this hardware-oriented model was simply adapted for software development.

AGILE MODEL & METHODOLOGY

Agile software development is a group of software development methods based on iterative and incremental development, where requirements and solutions evolve through collaboration between self-organizing, cross-functional teams. It promotes adaptive planning, evolutionary development and delivery, a time-boxed iterative approach, and encourages rapid and flexible response to change. It is a conceptual framework that promotes foreseen interactions throughout the development cycle. The Agile Manifesto introduced the term in 2001.

V-MODEL

The V-model represents a software development process (also applicable to hardware development) which may be considered an extension of the waterfall model. Instead of moving down in a linear way, the process steps are bent upwards after the coding phase, to form the typical V shape. The V-model demonstrates the relationships between each phase of the development life cycle and its associated phase of testing. The horizontal and vertical axes represent time or project completeness (left-to-right) and level of abstraction (coarsest-grain abstraction uppermost), respectively.

SPIRAL MODEL

The spiral model is a software development process combining elements of both design and prototyping-in-stages, in an effort to combine advantages of top-down and bottom-up concepts. Also known as the spiral lifecycle model (or spiral development), it is a systems development method (SDM) used in information technology (IT). This model of development combines the features of the prototyping and the waterfall model. The spiral model is intended for large, expensive and complicated projects.

The spiral model combines the idea of iterative development (prototyping) with the systematic, controlled aspects of the waterfall model. It allows for incremental releases of the product, or incremental refinement through each pass around the spiral. The spiral model also explicitly includes risk management within software development. Identifying major risks, both technical and managerial, and determining how to lessen them helps keep the software development process under control.

STLC

Contrary to popular belief, software testing is not just a single activity. It consists of a series of activities carried out methodically to help certify your software product. These activities (stages) constitute the Software Testing Life Cycle (STLC).

The different stages in the Software Test Life Cycle are Requirement Analysis, Test Planning, Test Case Development, Test Environment Setup, Test Execution and Test Cycle Closure.

Each of these stages has definite entry and exit criteria, activities and deliverables associated with it.

In an ideal world you would not enter the next stage until the exit criteria for the previous stage are met, but practically this is not always possible. So, for this tutorial, we will focus on the activities and deliverables for the different stages in the STLC. Let's look into them in detail.

Requirement Analysis

During this phase, the test team studies the requirements from a testing point of view to identify the testable requirements. The QA team may interact with various stakeholders (client, business analyst, technical leads, system architects, etc.) to understand the requirements in detail. Requirements could be either Functional (defining what the software must do) or Non-Functional (defining system performance, security, availability). Automation feasibility for the given testing project is also assessed in this stage.

Activities

Identify types of tests to be performed.

Gather details about testing priorities and focus.

Prepare the Requirement Traceability Matrix (RTM); a minimal RTM sketch is shown at the end of this subsection.

Identify test environment details where testing is supposed to be carried out.

Automation feasibility analysis (if required).

Deliverables

RTM

Automation feasibility report (if applicable)
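
To make the RTM concrete, here is a minimal sketch in Python; the requirement and test case IDs (REQ-001, TC-001, and so on) are hypothetical and used only for illustration.

    # Minimal Requirement Traceability Matrix (RTM) sketch: each requirement ID
    # maps to the test cases that cover it. All IDs are hypothetical.
    rtm = {
        "REQ-001": ["TC-001", "TC-002"],
        "REQ-002": ["TC-003"],
        "REQ-003": [],  # not yet covered by any test case
    }

    # A simple coverage check: list requirements with no associated test cases.
    uncovered = [req for req, cases in rtm.items() if not cases]
    print("Requirements without test coverage:", uncovered)

In practice the RTM is usually maintained in a spreadsheet or test management tool; the point is simply that every requirement is traceable to at least one test case.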

Test Planning

This phase is also called the Test Strategy phase. Typically, in this stage, a Senior QA manager determines effort and cost estimates for the project and prepares and finalizes the Test Plan.

Activities

Preparation of test plan/strategy document for various types of testing

Test tool selection

Test effort estimation

Resource planning and determining roles and responsibilities.

Training requirement

Deliverables

Test plan/strategy document.

Effort estimation document.

Test Case Development

This phase involves the creation, verification, and rework of test cases and test scripts. Test data is identified/created, then reviewed and reworked as well.

Activities

Create test cases and automation scripts (if applicable)

Review and baseline test cases and scripts

Create test data (If Test Environment is available)

Deliverables

Test cases/scripts

Test data
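
As an illustration only, a reviewed and baselined automated test case might look like the following pytest-style sketch; the function under test (calculate_discount) and the expected behaviour are hypothetical, not taken from any real project.

    # Hypothetical function under test; in a real project this would be imported
    # from the application code rather than defined next to the tests.
    def calculate_discount(order_total):
        """Return a 10% discount for orders of 1000 or more, otherwise 0."""
        return order_total * 0.10 if order_total >= 1000 else 0.0

    # Test cases (hypothetical IDs TC-001 and TC-002), traceable back to the RTM.
    def test_discount_applied_for_large_orders():
        assert calculate_discount(1500) == 150.0

    def test_no_discount_for_small_orders():
        assert calculate_discount(500) == 0.0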

Test Environment Setup

The test environment determines the software and hardware conditions under which a work product is tested. Test environment set-up is one of the critical aspects of the testing process and can be done in parallel with the Test Case Development stage. The test team may not be involved in this activity if the customer/development team provides the test environment; in that case, the test team is required to do a readiness check (smoke testing) of the given environment.

Activities

Understand the required architecture and environment set-up, and prepare the hardware and software requirement list for the Test Environment.

Set up the test environment and test data

Perform a smoke test on the build (a minimal smoke test sketch follows this subsection)

Deliverables

Environment ready with test data set up

Smoke test results
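
For example, a smoke test of a newly deployed build can be as simple as checking that the application's key pages or services respond at all. The sketch below uses only the Python standard library; the URLs are placeholders for a hypothetical test environment, not real addresses.

    import urllib.request

    # Placeholder endpoints for the application under test; replace these with
    # the actual test environment URLs.
    SMOKE_ENDPOINTS = [
        "http://test-env.example.com/health",
        "http://test-env.example.com/login",
    ]

    def smoke_test(endpoints):
        """Return True if every endpoint responds with HTTP 200."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=5) as response:
                    if response.status != 200:
                        print("FAIL:", url, "returned", response.status)
                        return False
            except Exception as exc:  # connection errors, timeouts, HTTP errors
                print("FAIL:", url, "raised", exc)
                return False
        print("Smoke test passed: environment is ready for test execution.")
        return True

    if __name__ == "__main__":
        smoke_test(SMOKE_ENDPOINTS)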

Test Execution

During this phase, the test team will carry out the testing based on the test plans and the test cases prepared. Bugs will be reported back to the development team for correction, and retesting will be performed.

Activities

Execute tests as per plan

Document test results and log defects for failed cases

Map defects to test cases in the RTM (a minimal sketch follows this subsection)

Retest the defect fixes

Track the defects to closure

Deliverables

Completed RTM with execution status

Test cases updated with results

Defect reports
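
Continuing the RTM sketch from the Requirement Analysis stage, execution results and defect IDs can be recorded against each test case; all IDs below are hypothetical.

    # Execution status and linked defects recorded per (hypothetical) test case.
    execution_results = {
        "TC-001": {"status": "PASS", "defects": []},
        "TC-002": {"status": "FAIL", "defects": ["BUG-101"]},
        "TC-003": {"status": "PASS", "defects": []},
    }

    # Collect the defects logged for failed cases, ready to be tracked to closure.
    open_defects = sorted(
        defect
        for result in execution_results.values()
        for defect in result["defects"]
    )
    print("Defects to track to closure:", open_defects)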

Test Cycle Closure

The testing team will meet, discuss, and analyze the testing artifacts to identify strategies that should be implemented in the future, taking lessons from the current test cycle. The idea is to remove process bottlenecks for future test cycles and to share best practices for similar projects in the future.

Activities

Evaluate cycle completion criteria based on Time, Test coverage, Cost, Software, Critical Business Objectives, and Quality

Prepare test metrics based on the above parameters.

Document the learning out of the project

Prepare Test closure report

Provide qualitative and quantitative reporting of the quality of the work product to the customer.

Analyze test results to find the defect distribution by type and severity (a small sketch follows this subsection).

Deliverables

Test Closure report

Test metrics
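
As a small illustration of test metrics preparation, the sketch below computes the defect distribution by type and severity from a hypothetical list of defects; the records and categories are invented for the example.

    from collections import Counter

    # Hypothetical defect records gathered at the end of a test cycle.
    defects = [
        {"id": "BUG-101", "type": "Functional", "severity": "Major"},
        {"id": "BUG-102", "type": "UI", "severity": "Minor"},
        {"id": "BUG-103", "type": "Functional", "severity": "Critical"},
    ]

    by_type = Counter(d["type"] for d in defects)
    by_severity = Counter(d["severity"] for d in defects)

    print("Defect distribution by type:", dict(by_type))
    print("Defect distribution by severity:", dict(by_severity))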

DEFECT LIFE CYCLE

What is a Defect/Bug?

A bug can be defined as abnormal behavior of the software. No software exists without bugs. The elimination of bugs from the software depends upon the efficiency of the testing done on it. A bug is a specific concern about the quality of the Application Under Test (AUT).

Bug Life Cycle:

In the software development process, a bug has a life cycle. The bug should go through this life cycle to be closed. A specific life cycle ensures that the process is standardized. The bug attains different states during its life cycle.

The different states of a bug can be summarized as follows:

1. New  2. Open  3. Assign  4. Test  5. Verified  6. Deferred  7. Reopened  8. Duplicate  9. Rejected  10. Closed

Description of Various Stages:

1. New: When the bug is posted for the first time, its state will be “NEW”. This means that the bug is not yet approved.

2. Open: After a tester has posted a bug, the test lead verifies that the bug is genuine and changes the state to “OPEN”.

3. Assign: Once the lead changes the state to “OPEN”, he assigns the bug to the corresponding developer or developer team. The state of the bug is now changed to “ASSIGN”.

4. Test: Once the developer fixes the bug, he has to assign the bug to the testing team for the next round of testing. Before releasing the software with the bug fixed, he changes the state of the bug to “TEST”, which indicates that the bug has been fixed and released to the testing team.

5. Deferred: A bug changed to the deferred state is expected to be fixed in a future release. There are many reasons for moving a bug to this state: the priority of the bug may be low, there may be a lack of time before the release, or the bug may not have a major effect on the software.

6. Rejected: If the developer feels that the bug is not genuine, he rejects the bug. Then the state of the bug is changed to “REJECTED”.

7. Duplicate: If the bug is reported twice, or two bugs describe the same issue, then one bug's status is changed to “DUPLICATE”.

8. Verified: Once the bug is fixed and the status is changed to “TEST”, the tester tests the bug. If the bug is not present in the software, he approves that the bug is fixed and changes the status to “VERIFIED”.

9. Reopened: If the bug still exists even after it has been fixed by the developer, the tester changes the status to “REOPENED”. The bug traverses the life cycle once again.

10. Closed: Once the bug is fixed, it is tested by the tester. If the tester feels that the bug no longer exists in the software, he changes the status of the bug to “CLOSED”. This state means that the bug is fixed, tested, and approved.
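
One way to make the allowed state changes explicit is to model the life cycle as a simple state machine. The transition table below is only a sketch based on the states described above; real defect-tracking tools define their own workflows.

    # Allowed transitions between defect states, sketched from the description
    # above. Actual workflows vary between defect-tracking tools.
    TRANSITIONS = {
        "NEW":       {"OPEN", "REJECTED", "DUPLICATE"},
        "OPEN":      {"ASSIGN"},
        "ASSIGN":    {"TEST", "DEFERRED"},
        "TEST":      {"VERIFIED", "REOPENED"},
        "VERIFIED":  {"CLOSED"},
        "REOPENED":  {"ASSIGN"},
        "DEFERRED":  {"ASSIGN"},
        "DUPLICATE": set(),
        "REJECTED":  set(),
        "CLOSED":    set(),
    }

    def change_state(current, new):
        """Return the new state if the transition is allowed, else raise an error."""
        if new not in TRANSITIONS[current]:
            raise ValueError("Illegal transition: %s -> %s" % (current, new))
        return new

    # Example: a defect that is fixed, verified, and closed.
    state = "NEW"
    for nxt in ("OPEN", "ASSIGN", "TEST", "VERIFIED", "CLOSED"):
        state = change_state(state, nxt)
    print("Final state:", state)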

While defect prevention is much more effective and efficient in reducing the number of defects, most organizations conduct defect discovery and removal. Discovering and removing defects is an expensive and inefficient process. It is much more efficient for an organization to conduct activities that prevent defects.

Guidelines on deciding the Severity of a Bug:

The severity indicates the impact each defect has on testing efforts or on the users and administrators of the application under test. This information is used by developers and management as the basis for assigning the priority of work on defects.

A sample guideline for the assignment of severity levels during the product test phase includes the following (a small decision sketch follows the list):

1. Critical / Show Stopper — An item that prevents further testing of the product or function under test can be classified as a Critical bug. No workaround is possible for such bugs. Examples of this include a missing menu option or a security permission required to access a function under test.

2. Major / High — A defect that does not function as expected/designed, or causes other functionality to fail to meet requirements, can be classified as a Major bug. A workaround can be provided for such bugs. Examples of this include inaccurate calculations or the wrong field being updated.

3. Average / Medium — Defects that do not conform to standards and conventions can be classified as Medium bugs. Easy workarounds exist to achieve functionality objectives. Examples include matching visual and text links that lead to different end points.

4. Minor / Low — Cosmetic defects that do not affect the functionality of the system can be classified as Minor bugs.
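
The guideline above can also be summarized as a small decision helper. The sketch below only encodes the rules of thumb from the list and is not a substitute for tester judgement.

    def suggest_severity(blocks_testing, functional_impact, standards_violation,
                         workaround_available):
        """Suggest a severity level following the rough guideline above.

        blocks_testing:       the defect prevents further testing of the function
        functional_impact:    behavior does not meet requirements/design
        standards_violation:  the defect breaks standards or conventions
        workaround_available: another way exists to achieve the objective
        """
        if blocks_testing and not workaround_available:
            return "Critical / Show Stopper"
        if functional_impact:
            return "Major / High"
        if standards_violation and workaround_available:
            return "Average / Medium"
        return "Minor / Low"

    # Example: an inaccurate calculation with a manual workaround available.
    print(suggest_severity(blocks_testing=False, functional_impact=True,
                           standards_violation=False, workaround_available=True))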

Guidelines on writing Bug Description:

A bug can be expressed as “result followed by the action”; that is, the unexpected behavior that occurs when a particular action takes place is given as the bug description. For example: “Confirmation pop-up does not appear after clicking the Save button.”

1. Be specific. State the expected behavior that did not occur (such as a pop-up that did not appear) and the behavior that occurred instead.

2. Use present tense.

3. Don’t use unnecessary words.

4. Don’t add exclamation points. End sentences with a period.

5. DON’T USE ALL CAPS. Format words in upper and lower case (mixed case).

6. Always mention the steps to reproduce the bug.

Keys Related To Dimensional Data Modeling

Business Key:

A business key or natural key is a key that identifies the uniqueness of a row based on columns that exist naturally in a table according to business rules. Examples of business keys are the customer code in a customer table, or the composite of sales order header number and sales order item line number within a sales order details table.

Natural Key:

A natural key is a key that is formed of attributes that already exist in the real world. For example, a USA citizen's social security number could be used as a natural key.

Surrogate Key:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.
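
To see the difference between these keys in practice, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, customer_code, customer_sk) are hypothetical and chosen only for this example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # customer_sk is a surrogate key: a system-generated integer with no business
    # meaning. customer_code is the business/natural key: it comes from the source
    # data and is unique according to business rules.
    cur.execute("""
        CREATE TABLE customer (
            customer_sk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            customer_code TEXT NOT NULL UNIQUE,               -- business/natural key
            customer_name TEXT
        )
    """)

    cur.execute(
        "INSERT INTO customer (customer_code, customer_name) VALUES (?, ?)",
        ("CUST-001", "Acme Corp"),
    )
    conn.commit()

    # The surrogate key is assigned by the database, not derived from the data.
    print(cur.execute("SELECT customer_sk, customer_code FROM customer").fetchall())

In a dimensional model, the surrogate key is typically used as the primary key of the dimension table and referenced from fact tables, while lookups against the source system use the business key.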

Important Links for your reference

For Manual Testing and others:

http://www.onestoptesting.com/introduction/

www.softwaretestinghelp.com/

www.guru99.com

For DWH:

http://docs.oracle.com/cd/B10501_01/server.920/a96520/concept.htm

http://www.kimballgroup.com/

For SQL:

http://beginner-sql-tutorial.com/sql-operators.htm

http://docs.oracle.com/cd/B12037_01/server.101/b10759/preface.htm