Data Management

JBIMS MIM Second Year (2015 – 18) Data Management 15-I-131 Mufaddal Nullwala

Transcript of Data Management

Page 1: Data Management

JBIMS MIM Second Year (2015 – 18)

Data Management, 15-I-131, Mufaddal Nullwala

Page 2: Data Management

What is a Database?

A database is a collection of interrelated data, or an organised mechanism to manage, store, and retrieve data.

Properties of a database:

• Efficient

• Robust

• Stable

Examples:

• Student information

• Bank register / book of accounts

• Employee master

Page 3: Data Management

What is a Database Management System?

It is software used to manage and access a database in an efficient way.

Advantages:

• It gives you data whenever you require it, in a few clicks

• Searching for critical information is easy

Examples:

• Oracle 11g

• MSSQL

• MySQL

Page 4: Data Management

ER Diagram

An ER diagram is a visual representation of data that describes how data are related to each other.

Components of an E-R diagram are:

• Entity - An entity can be any object, place, person or class.

• Attribute - An attribute describes a property or characteristic of an entity.

• Relationship - A relationship describes relations between entities. There are three types of relationship that exist between entities.

Page 5: Data Management

Relationships in an ER Diagram

For a binary relationship set the mapping cardinality must be one of the following types:

One to one

One to many

Many to one

Many to many

Page 6: Data Management

Going up in this structure is called generalisation, where entities are clubbed together to represent a more generalised view.

Specialisation is the opposite of generalisation. In specialisation, a group of entities is divided into sub-groups based on their characteristics.

Page 7: Data Management

Database Keys:

Keys are used to establish and identify relationships between tables.

Types of Keys:

PRIMARY KEY

• Serves as the row-level addressing mechanism in the relational database model.

• It can be formed through the combination of several items.

• Indicates uniqueness within records or rows in a table.

FOREIGN KEY

• A column or set of columns within a table that are required to match those of a primary key of a second table.

• It is the primary key from another table; this is the only way join relationships can be established.

Page 8: Data Management

Primary Key: In Table A, Parcel No. is the primary key, but it is a foreign key in Table B.
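As a sketch of this parcel example, the two tables might be declared like so in SQLite via Python; the table and column names here are hypothetical, invented only for illustration:

```python
import sqlite3

# Hypothetical tables: parcel_no is the primary key of "parcels" (Table A)
# and a foreign key in "assessments" (Table B).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""CREATE TABLE parcels (
    parcel_no INTEGER PRIMARY KEY,   -- row-level addressing mechanism
    owner     TEXT
)""")
conn.execute("""CREATE TABLE assessments (
    assessment_id INTEGER PRIMARY KEY,
    parcel_no     INTEGER REFERENCES parcels(parcel_no),  -- foreign key
    value         REAL
)""")

conn.execute("INSERT INTO parcels VALUES (101, 'A. Shah')")
conn.execute("INSERT INTO assessments VALUES (1, 101, 250000.0)")

# The foreign key is what makes the join relationship possible:
row = conn.execute("""SELECT p.owner, a.value
                      FROM parcels p JOIN assessments a
                      ON p.parcel_no = a.parcel_no""").fetchone()
print(row)  # ('A. Shah', 250000.0)
```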

Page 9: Data Management

CRUD Operations

• Create new tables & records

• Retrieve records from tables

• Update table definitions and record data

• Delete existing tables and records
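The four CRUD operations above can be sketched with Python's built-in sqlite3 module against a hypothetical students table:

```python
import sqlite3

# A minimal CRUD sketch; the "students" table is hypothetical.
conn = sqlite3.connect(":memory:")

# Create: a new table and new records
conn.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students VALUES (1, 'Asha'), (2, 'Ravi')")

# Retrieve: records from the table
names = [r[0] for r in conn.execute("SELECT name FROM students ORDER BY roll_no")]
print(names)  # ['Asha', 'Ravi']

# Update: record data
conn.execute("UPDATE students SET name = 'Asha K.' WHERE roll_no = 1")

# Delete: existing records (DROP TABLE would delete the table itself)
conn.execute("DELETE FROM students WHERE roll_no = 2")

remaining = conn.execute("SELECT name FROM students").fetchall()
print(remaining)  # [('Asha K.',)]
```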

Page 10: Data Management

What is OLTP?

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

Online Transaction Processing is characterised by a large number of short online transactions:

• INSERT

• UPDATE

• DELETE

OLTP systems are used for order entry, financial transactions, CRM (Customer Relationship Management), retail sales, etc. Such systems have a large number of users who conduct short transactions.

An important attribute of an OLTP system is its ability to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized.
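A short OLTP-style transaction can be sketched as follows; the accounts table and the transfer scenario are hypothetical, chosen only to show the UPDATE pattern and atomic commit/rollback:

```python
import sqlite3

# Sketch of a short OLTP transaction: a transfer is two UPDATEs that must
# commit or roll back together.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 500.0), (2, 100.0)")

try:
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
    conn.execute("COMMIT")       # both changes become visible atomically
except sqlite3.Error:
    conn.execute("ROLLBACK")     # on failure, neither change is applied

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(300.0,), (300.0,)]
```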

Page 11: Data Management

Why is OLTP Important?

Source of data, or operational data

To control and run fundamental business tasks

Reveals a snapshot of ongoing business process

Short and fast inserts and updates initiated by end users

Typically very fast (Performance Optimised)

Space requirements: can be relatively small if historical data is archived

Database Design Highly Optimised

Operational data is critical to running the business; therefore back it up religiously

Page 12: Data Management

Design Principles

Application oriented

Used to run Business

Detailed Data

Current, up to date

Isolated Data

Repetitive Access

Clerical Users

Performance Sensitive

Few records accessed at a time (tens)

Read / Update access

No Data Redundancy

Database Size (100 MB - 100 GB)

Page 13: Data Management

Business Cases

E-commerce applications (e.g. Amazon, Flipkart)

ERP Solutions

CRM

SCM

Page 14: Data Management

Data Warehouse

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.

DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.

The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.

Page 15: Data Management

Data Warehouse (continued)

A collection of data that is used primarily in organisational decision making

A decision support database that is maintained separately from the organisation’s operational databases.

A data warehouse is a:

• subject-oriented,

• integrated,

• time-varying,

• non-volatile

collection of data.

Page 16: Data Management

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Page 17: Data Management

Characteristics of Data Warehouse

Subject oriented: Data are organised based on how the users refer to them.

Integrated: All inconsistencies regarding naming convention and value representations are removed.

Nonvolatile: Data are stored in read-only format and do not change over time.

Time variant: Data are not current but normally time series.

Page 18: Data Management

Why a Separate Data Warehouse?

Performance

• Operational databases are designed & tuned for known workloads

• Complex OLAP queries would degrade performance, taxing operations

• Special data organisation, access & implementation methods are needed for multidimensional views & queries

Function

• Missing data: Decision support requires historical data, which operational databases do not typically maintain

• Data consolidation: Decision support requires consolidation (aggregation, summarisation) of data from many heterogeneous sources: operational databases, external sources.

• Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

Page 19: Data Management

The Complete Decision Support System (Source: Franconi)

[Figure: information sources (operational DBs, semistructured sources, etc.) are extracted, transformed, loaded and refreshed into the data warehouse server (Tier 1), which also feeds data marts; the warehouse and marts serve OLAP servers (Tier 2, e.g. MOLAP and ROLAP), which in turn serve clients/DSS (Tier 3): analysis, query/reporting, and data mining tools.]

Page 20: Data Management

Three-Tier Architecture

Warehouse database server

Almost always a relational DBMS; rarely flat files

OLAP servers

Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.

Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.

Clients

Query and reporting tools

Analysis tools

Data mining tools (e.g., trend analysis, prediction)

Page 21: Data Management

Data Marts

A data mart is a scaled-down version of a data warehouse that focuses on a particular subject area.

A data mart is a subset of an organisational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.

Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organisation.

Usually designed to support the unique business requirements of a specified department or business process

Implemented as the first step in proving the usefulness of the technologies to solve business problems

E.g. departmental subsets that focus on selected subjects; a marketing data mart covers customers, products, and sales

Page 22: Data Management

Why a Data Mart?

A data mart is the access layer of the data warehouse environment that is used to get data out to the users.

The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart including all the hardware, software and data.

This enables each department to isolate the use, manipulation and development of their data. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

Organizations build data warehouses and data marts because the information in the database is not organized in a way that makes it readily accessible, requiring queries that are too complicated or resource-consuming.

Page 23: Data Management

From the Data Warehouse to Data Marts

[Figure: the data warehouse holds organisationally structured, normalised, detailed data with more history; data marts hold departmentally and individually structured information with less history.]

Page 24: Data Management

Characteristics of the Departmental Data Mart

• Small

• Flexible

• Customised by Department

• OLAP

• Source is departmentally structured data warehouse


Page 25: Data Management

Metadata

The last, and perhaps most important, component of DW environments.

It is information that is kept about the warehouse rather than information kept within the warehouse.

The metadata is simply data about data.

It is important for designing, constructing, retrieving, and controlling the warehouse data.

Page 26: Data Management

Types of Metadata

Technical metadata: includes where the data come from, how the data were changed, how the data are organised, how the data are stored, who owns the data, who is responsible for the data and how to contact them, who can access the data, and the date of last update.

Business metadata: includes what data are available, where the data are, what the data mean, how to access the data, predefined reports and queries, and how current the data are.

Page 27: Data Management

Applications of Data Warehouses

Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis

Page 28: Data Management

What is OLAP?

Definition - OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling, thereby providing the insight and understanding users need for better decision making. Users can pivot, filter, drill down and drill up data and generate a number of views.

Application - It is the foundation for many kinds of business applications for Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.

Page 29: Data Management

An OLAP structure created from the operational data is called an OLAP cube. As Figure shows, the cube holds data more like a 3D spreadsheet rather than a relational database, allowing different views of the data to be quickly displayed

Page 30: Data Management

The term OLAP was first introduced by E. F. Codd, who pioneered Relational Database Management Systems (RDBMS). Below are the twelve rules defined by Codd that OLAP technology must support.

Multidimensional conceptual view

Supports EIS (Executive Information System) slice and dice operations and is usually required in financial modeling.

Transparency Is part of an open system that supports heterogeneous data sources. Furthermore, the end user should not be concerned about the details of data access or conversions.

Accessibility Presents the user with a single logical schema of the data. OLAP engines act as middleware, sitting between heterogeneous data sources and an OLAP front-end.

Consistent reporting performance Performance should not degrade as the number of dimensions in the model increases.

Client/server architecture Requires open, modular systems. Not only should the product be client/server, but the server component of an OLAP product should allow various clients to be attached with minimal effort and programming for integration.

Generic dimensionality Not limited to 3-D and not biased toward any particular dimension. A function applied to one dimension should also be able to be applied to another.

Dynamic sparse-matrix handling

Related both to the idea of nulls in relational databases and to the notion of compressing large files, a sparse matrix is one in which not every cell contains data. OLAP systems should accommodate varying storage and data-handling options.

Multiuser support Supports multiple concurrent users, including their individual views or slices of a common database.

Unrestricted cross-dimensional operations

All dimensions are created equal, so all forms of calculation must be allowed across all dimensions, not just the measures dimension.

Intuitive data manipulation Users shouldn't have to use menus or perform complex multiple step operations when an intuitive drag and drop action will do.

Flexible reporting Users should be able to print just what they need, and any changes to the underlying model should be automatically reflected in reports.

Unlimited dimensional and aggregation levels Supports at least 15, and preferably 20, dimensions.

Page 31: Data Management

The OLAP Report, one of the most internationally authoritative sources of information on OLAP products and applications, defines OLAP in five keywords: Fast Analysis of Shared Multidimensional Information, or FASMI for short.

Fast The system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

Analysis The system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user.

Shared The system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner.

Multidimensional The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies.

Information The capacity of various products is measured in terms of how much input data they can handle, not how many gigabytes they take to store it.

Page 32: Data Management

OLAP Operations

Roll-up Decreases the number of dimensions; removes row headers.

Drill-down Increases the number of dimensions; adds new headers.

Page 33: Data Management

Slice

• Performs a selection on one dimension of the given cube, resulting in a sub-cube.

• Reduces the dimensionality of the cubes.

• Sets one or more dimensions to specific values and keeps a subset of dimensions for selected values.

Page 34: Data Management

Dice

• Define a sub-cube by performing a selection of one or more dimensions.

• Refers to range select condition on one dimension, or to select condition on more than one dimension.

• Reduces the number of member values of one or more dimensions.

Pivot (or rotate)

• Rotates the data axis to view the data from different perspectives.

• Groups data with different dimensions.
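The slice, dice, and roll-up operations can be sketched on a toy in-memory cube in plain Python; all product, region, and quarter values below are hypothetical:

```python
# A toy "cube" of sales figures keyed by (product, region, quarter).
cube = {
    ("TV",    "North", "Q1"): 10, ("TV",    "South", "Q1"): 7,
    ("TV",    "North", "Q2"): 12, ("TV",    "South", "Q2"): 9,
    ("Phone", "North", "Q1"): 20, ("Phone", "South", "Q1"): 15,
    ("Phone", "North", "Q2"): 25, ("Phone", "South", "Q2"): 18,
}

# Slice: fix one dimension (region = "North"), leaving a 2-D sub-cube.
north = {(p, q): v for (p, r, q), v in cube.items() if r == "North"}

# Dice: select on more than one dimension (product and region).
dice = {k: v for k, v in cube.items() if k[0] == "Phone" and k[1] == "North"}

# Roll-up: aggregate away the quarter dimension (sum over all quarters).
rollup = {}
for (p, r, q), v in cube.items():
    rollup[(p, r)] = rollup.get((p, r), 0) + v

print(north[("TV", "Q1")])         # 10
print(sum(dice.values()))          # 45
print(rollup[("Phone", "South")])  # 33
```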

Page 35: Data Management

OLAP Architectures

MOLAP                                           ROLAP
Information retrieval is fast.                  Information retrieval is comparatively slow.
Uses sparse arrays to store data sets.          Uses relational tables.
Best suited for inexperienced users,            Best suited for experienced users.
since it is very easy to use.
Maintains a separate database for data cubes.   May not require space other than that
                                                available in the data warehouse.
DBMS facility is weak.                          DBMS facility is strong.
Static database.                                Dynamic database.

Page 36: Data Management

Dimensional Modelling

Dimensional modelling is one of the methods of data modelling that helps us store data in such a way that it is relatively easy to retrieve it from the database.

Different ways of storing data give us different advantages. For example, ER modelling gives us the advantage of storing data in such a way that there is less redundancy. Dimensional modelling, on the other hand, gives us the advantage of storing data in such a fashion that it is easier to retrieve information once the data is stored in the database.

Page 37: Data Management

Dimensional Modeling vs. ER Modeling

Dimensional Models are designed for reading, summarising and analysing numeric information, whereas Relational Models are optimised for adding and maintaining data using real-time operational systems.

Page 38: Data Management

Dimensional Modeling

It is composed of "fact" and "dimension" tables.

A "fact" is a numeric value that a business wishes to count or sum.

A "dimension" is essentially an entry point for getting at the facts. Dimensions are things of interest to the business.

Page 39: Data Management

Dimensional Modeling

Benefits

• Faster Data Retrieval

• Better Understandability

• Extensibility

https://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide

Page 40: Data Management

Star Schema

The star schema architecture is the simplest data warehouse schema.

It is called a star schema because the diagram resembles a star, with points radiating from a centre.

The centre of the star consists of the fact table, and the points of the star are the dimension tables.

Page 41: Data Management

Star Schema

Page 42: Data Management

Star Schema

Fact Tables A fact table typically has two types of columns: foreign keys to dimension tables, and measures, i.e. columns that contain numeric facts. A fact table can contain fact data at a detail or aggregated level.

A dimension is a structure usually composed of one or more hierarchies that categorise data.

http://datawarehouse4u.info/Data-warehouse-schema-architecture-star-schema.html
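A minimal star schema can be sketched in SQLite via Python; the retail-style fact and dimension tables below are hypothetical, invented only for illustration:

```python
import sqlite3

# A central sales fact table whose foreign keys point at dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL                    -- the numeric measure (fact)
);
INSERT INTO dim_product VALUES (1, 'TV', 'Electronics'), (2, 'Phone', 'Electronics');
INSERT INTO dim_date VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 200.0);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT d.quarter, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.quarter ORDER BY d.quarter
""").fetchall()
print(rows)  # [('Q1', 300.0), ('Q2', 150.0)]
```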

Page 43: Data Management

Snowflake Schema

The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, in which the tables that describe the dimensions are normalised.

Page 44: Data Management

Snowflake Schema

Page 45: Data Management

ETL Process

Page 46: Data Management

ETL Process

Page 47: Data Management

ETL Process

The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.

Page 48: Data Management

ETL Steps

Initiation

Build reference data

Extract from sources

Validate

Transform

Load into staging tables

Audit reports

Publish

Archive

Clean up

Page 49: Data Management

ETL Process

Page 50: Data Management

Steps of ETL process

Extracts data from homogeneous or heterogeneous data sources

Transforms the data for storing it in proper format or structure for querying and analysis purpose

Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)
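The three steps can be sketched end to end in plain Python; the source records and field names below are hypothetical:

```python
# A toy ETL pipeline: extract from two "sources", normalise formats, load.
def extract():
    # Extract: pull rows from two heterogeneous sources
    source_a = [{"name": "Asha", "date": "25/12/2016"}]   # dd/mm/yyyy
    source_b = [{"name": "Ravi", "date": "2016-12-26"}]   # yyyy-mm-dd
    return source_a, source_b

def transform(source_a, source_b):
    # Transform: normalise both date formats to yyyy-mm-dd
    rows = []
    for r in source_a:
        d, m, y = r["date"].split("/")
        rows.append({"name": r["name"], "date": f"{y}-{m}-{d}"})
    rows.extend(source_b)  # already in the target format ("pass through")
    return rows

def load(rows, warehouse):
    # Load: append the cleaned rows into the target store
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
# [{'name': 'Asha', 'date': '2016-12-25'}, {'name': 'Ravi', 'date': '2016-12-26'}]
```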

Page 51: Data Management

Extraction

Extracting the data from different sources – the data sources can be files (like CSV, JSON, XML) or RDBMSs, etc.

This is the first step in the ETL process. It covers data extraction from the source system and makes it accessible for further processing. The main objective of the extraction step is to retrieve all required data from the source system with as few resources as possible. The extraction step should be designed in a way that it does not negatively affect the source system. Most data projects consolidate data from different source systems, and each separate source uses a different format. Common data-source formats include RDBMSs and files such as CSV, JSON, and XML. Thus the extraction process must convert the data into a format suitable for further transformation.

Page 52: Data Management

Transformation

Transforming the data – this may involve cleaning, filtering, validating and applying business rules.

In this step, certain rules are applied to the extracted data. The main aim of this step is to load the data into the target database in a cleaned, uniform format (depending on the organization's requirements). This is necessary because when data is collected from different sources, each source has its own standards. For example, suppose we have two different data sources A and B: in source A the date format is dd/mm/yyyy, and in source B it is yyyy-mm-dd.

Page 53: Data Management

Transformation continued..

In the transforming step we convert these dates to a general format. The other things that are carried out in this step are:

Cleaning (e.g. “Male” to “M” and “Female” to “F” etc.)

Filtering (e.g. selecting only certain columns to load)

Enriching (e.g. deriving First Name, Middle Name and Last Name from Full Name)

Splitting a column into multiple columns and vice versa

Joining together data from multiple sources

In some cases data does not need any transformation; such data is said to be "rich data" or "direct move" or "pass through" data.
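The cleaning, enriching, splitting, and filtering steps listed above can be sketched on a single hypothetical record:

```python
# One extracted record; all field names and values are hypothetical.
record = {"full_name": "Asha K Patel", "gender": "Female", "internal_code": "X1"}

# Cleaning: standardise coded values (e.g. "Female" -> "F")
record["gender"] = {"Male": "M", "Female": "F"}.get(record["gender"], record["gender"])

# Enriching / splitting: derive first, middle, and last name from full name
first, middle, last = record["full_name"].split()
record.update({"first_name": first, "middle_name": middle, "last_name": last})

# Filtering: keep only the columns we intend to load
keep = ("first_name", "middle_name", "last_name", "gender")
clean = {k: record[k] for k in keep}
print(clean)
# {'first_name': 'Asha', 'middle_name': 'K', 'last_name': 'Patel', 'gender': 'F'}
```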

Page 54: Data Management

Loading

Loading - data is loaded into a data warehouse or any other database or application that houses data.

This is the final step in the ETL process. In this step, the extracted and transformed data is loaded into the target database. To make the data load efficient, it is necessary to index the database and disable constraints before loading the data.

All three steps of the ETL process can run in parallel. Data extraction takes time, so the transformation step is executed simultaneously, preparing data for loading. As soon as some data is ready, it is loaded without waiting for completion of the previous steps.

Page 55: Data Management

ETL Tools

1. Oracle Warehouse Builder (OWB)

2. SAP Data Services.

3. IBM Infosphere Information Server.

4. SAS Data Management.

5. PowerCenter Informatica.

6. Elixir Repertoire for Data ETL.

7. Data Migrator (IBI)

8. SQL Server Integration Services (SSIS)

Page 56: Data Management

OLTP vs. OLAP

Page 57: Data Management

“Thank you”