Introduction to Teradata

Introduction to the Teradata Database


Brief Introduction to Teradata

Transcript of Introduction to Teradata

Page 1: Introduction to Teradata

Introduction to the Teradata Database

Page 2: Introduction to Teradata

Course Modules

This course consists of:

Module 1: Teradata Database Overview
Module 2: Relational Database Concepts
Module 3: Teradata and the Data Warehouse
Module 4: Components and Architecture
Module 5: Databases and Users
Module 6: Data Distribution and Access
Module 7: Secondary Indexes and Full-Table Scans
Module 8: Fault Tolerance and Data Protection
Module 9: Client Tools and Utilities

Page 3: Introduction to Teradata

Teradata Database Overview

Module 1

Page 4: Introduction to Teradata

What is the Teradata Database?

• Relational Database Management System
• Built on a Parallel Architecture
• Runs on MP-RAS UNIX, Microsoft Windows 2000/2003 Server, and SuSE Linux

Page 5: Introduction to Teradata

Teradata Parallel Architecture

• More warehouse data
• Linear Scalability (10GB to 100+TB)
• Hashing provides for automatic data distribution
• Parallel-Aware Optimizer
• Single Administrative View
• Ad hoc queries with ANSI

Page 6: Introduction to Teradata

Teradata Database Advantages

• Proven Linear Scalability - increased workload without decreased throughput
• Most Concurrent Users - multiple complex queries
• Unconditional Parallelism - sorts, aggregations and full-table scans are performed in parallel
• Mature Optimizer - robust and parallel aware, handles complex queries, multiple joins per query, ad hoc processing
• Low TCO - ease of setup and maintenance, robust parallel utilities, no re-orgs, automatic data distribution, low disk to data ratio, robust expansion utility
• High Availability - no single point of failure, fault-tolerant architecture
• Single View of the Business - single database server for multiple clients

Page 7: Introduction to Teradata

Teradata Database Manageability

Things Teradata Database Administrators never have to do!

• Reorganize data or index space
• Pre-allocate table or index space
• Physically format partitions or disk space
• Pre-prepare data for loading (convert, sort, split, etc.)
• Ensure that queries run in parallel
• Unload/reload data spaces due to expansion

** The Administrator knows that if the data is to be doubled, the system can be easily expanded to accommodate it.
** The amount of work required to create a table which will contain 100 rows is the same as that to create a table which will contain 1,000,000,000 rows.

Page 8: Introduction to Teradata

Teradata Database Features

• Designed to process large quantities of detail data
• Ideal for data warehouse applications
• Parallelism makes easy access to very large tables possible
• Open architecture - uses industry standard components
• Performance increase is linear as components are added
• Runs as a database server to client applications
• Runs on multiple hardware platforms (SMP) and Teradata hardware (MPP)

Page 9: Introduction to Teradata

Module 2

Relational Database Concepts

Page 10: Introduction to Teradata

What is a Database?

A database is a collection of permanently stored data that is:

– Logically related - the data relates to other data.
– Shared - many users may access the data.
– Protected - access to data is controlled.
– Managed - the data has integrity and value.

Page 11: Introduction to Teradata

Logical/Relational Modeling

The Logical Model
• Should be designed without regard to usage
  – Accommodates a wide variety of front end tools
  – Allows the database to be created more quickly
• Should be the same regardless of data volume
• Data is organized according to what it represents (real world business data in table (relational) form)
• Includes all the data definitions within the scope of the application or enterprise
• Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization
• Process of reducing a complex data structure into a simple, stable one
• Involves removing redundant attributes, keys, and relationships from the conceptual data model

Page 12: Introduction to Teradata

Relational Databases

The employee table has:
– Nine columns of data
– Six rows of data - one per employee
– Only one row format for the entire table
– Missing data values represented by nulls
– Column and row order are arbitrary

Relational Databases are founded on Set Theory and based on the Relational Model.
• A Relational Database consists of a collection of logically related tables.
• A table is a two dimensional representation of data consisting of rows and columns.

Page 13: Introduction to Teradata

Primary Keys

Page 14: Introduction to Teradata

Foreign Keys

Page 15: Introduction to Teradata

Answering Questions with a Relational Database

Page 16: Introduction to Teradata

Relational Advantages

Advantages of a Relational Database compared to other database methodologies include:
• More flexible than other types
• Allows businesses to quickly respond to changing conditions
• Being data-driven vs. application driven
• Modeling the business, not the processes
• Makes applications easier to build because the data does more of the work
• Supporting trend toward end-user computing
• Being easy to understand
• No need to know the access path
• Solidly founded in set theory

Page 17: Introduction to Teradata

Module 3

Teradata and the Data Warehouse

Page 18: Introduction to Teradata

Evolution of Data Processing

A transaction is a logical unit of work.

Page 19: Introduction to Teradata

The Advantage of Using Detail Data

Page 20: Introduction to Teradata

Data Warehouse Usage Evolution

Page 21: Introduction to Teradata

Active Data Warehousing

• Performance - response time within seconds
• Scalability
  – large amounts of detailed data
  – mixed workloads (both tactical and strategic queries) for mission critical applications
  – concurrent users
• Availability and Reliability - 7 x 24
• Data Freshness - accurate, up to the minute data, including access to operational data store level information

Page 22: Introduction to Teradata

The Data Warehouse

A central, enterprise-wide database that contains information extracted from operational systems.

• Based on enterprise-wide model
• Can begin small but may grow large rapidly
• Populated by extraction/loading of data from operational systems
• Responds to end-user “what if” queries
• Minimizes data movement/synchronization
• Provides a Single View of the business

Page 23: Introduction to Teradata

Data Marts

A data mart is a special purpose subset of enterprise data for a particular function or application. It may contain detail or summary data or both.

Data mart types:
– Independent - created directly from operational systems to a separate physical data store
– Logical - exists as a subset of existing data warehouse via Views
– Dependent - created from data warehouse to a separate physical data store

Page 24: Introduction to Teradata

Module 4

Components and Architecture

Page 25: Introduction to Teradata

What is a Node?

• Teradata software, gateway software and channel-driver software run as processes
• Parsing Engines (PE) and Access Module Processors (AMP) are Virtual Processors (VPROC) which run under control of Parallel Database Extensions (PDE)
• Each AMP is associated with a Virtual Disk (VDISK)
• A single node is called a Symmetric Multi-Processor (SMP)
• All AMPs and PEs communicate via the BYNET

Page 26: Introduction to Teradata

Major Components of the Teradata Database

Page 27: Introduction to Teradata

The Parsing Engine (PE)

The Parsing Engine is responsible for:
• Managing individual sessions (up to 120 sessions per PE)
• Parsing and optimizing your SQL requests
• Building query plans with the parallel-aware, cost-based, intelligent Optimizer
• Dispatching the optimized plan to the AMPs
• EBCDIC/ASCII input conversion (if necessary)
• Sending the answer set response back to the requesting client

Page 28: Introduction to Teradata

The BYNET

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
• Automatic load balancing of message traffic
• Automatic reconfiguration after fault detection
• Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs on the system:
• Between nodes, the BYNET hardware carries broadcast and point-to-point communications
• On a node, BYNET software and PDE together control which AMPs receive a

Page 29: Introduction to Teradata

The Access Module Processor (AMP)

The AMP is responsible for:
• Storing rows to and retrieving rows from its VDISK
• Lock management
• Sorting rows and aggregating columns
• Join processing
• Output conversion and formatting (ASCII, EBCDIC)
• Creating answer sets for clients
• Disk space management and accounting
• Special utility protocols
• Recovery processing

Page 30: Introduction to Teradata

The MPP System

• The BYNET (both software and hardware) connects two or more SMP Nodes to create a Massively Parallel Processing (MPP) system.
• The Teradata Database is linearly expandable by adding nodes.

Page 31: Introduction to Teradata

Teradata Database Software

Page 32: Introduction to Teradata

Channel-Attached Client Software

CLI (Call-Level Interface)
• Request and response control
• Buffer allocation and initialization
• Lowest level interface to the Teradata Database
• Library of routines for blocking/unblocking requests and responses to/from RDBMS
• Performs logon and logoff functions

TDP (Teradata Director Program)
• Manages session traffic between CLI and the Teradata Database
• Session balancing across multiple PEs
• Failure notification (application failure, Teradata Database restart)
• Logging, verification, recovery, restart, security

Page 33: Introduction to Teradata

Network-Attached Client Software

ODBC
• Call-level interface
• Teradata Database ODBC driver is used to connect applications with the Teradata Database

MTDP (Micro Teradata Director Program)
• Performs many TDP functions including session management but not session balancing across PEs

MOSI (Micro Operating System Interface)
• Provides operating system and network protocol independent interface.

Page 34: Introduction to Teradata

Module 5

Databases and Users

Page 35: Introduction to Teradata

Databases and Users Defined

• Databases and Users are the repositories for objects:
  – Tables - require Perm Space
  – Views - do not require Perm Space
  – Macros - do not require Perm Space
  – Triggers - do not require Perm Space
  – Stored Procedures - require Perm Space
• Space limits are specified for each database and for each user:
  – Perm Space - maximum amount of space available for permanent tables
  – Spool Space - maximum amount of work space available for request processing
  – Temp Space - maximum amount of space available for global temporary tables
• A database is created with the CREATE DATABASE command.
• A user is created with the CREATE USER command.
• The only difference between a database and a user is that the user has a password and may log on to the system.
• A database or user with no Perm Space may not contain permanent tables but may contain views and macros.

Page 36: Introduction to Teradata

Teradata Database Space Management

• A new database or user must be created from an existing database or user.
• All Perm Space limits are subtracted from the owner.
• Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
• Perm Space currently not being used is available for Spool Space or Temp Space.
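The accounting rules above can be sketched in a few lines of Python. This is only a toy model, not Teradata code: the `Space` class, its `create_child` method, the database names other than DBC, and the sizes are all hypothetical, chosen to show that carving a child out of an owner leaves the system-wide total unchanged.

```python
class Space:
    """Toy model of a database or user that owns Perm Space (hypothetical)."""

    def __init__(self, name, perm_limit):
        self.name = name
        self.perm_limit = perm_limit  # units still owned by this object
        self.children = []

    def create_child(self, name, perm_limit):
        # A new database or user is created from an existing owner,
        # and its Perm Space limit is subtracted from that owner.
        if perm_limit > self.perm_limit:
            raise ValueError(f"{self.name} has only {self.perm_limit} units left")
        self.perm_limit -= perm_limit
        child = Space(name, perm_limit)
        self.children.append(child)
        return child

    def total_perm(self):
        # This object's own limit plus everything below it in the hierarchy.
        return self.perm_limit + sum(c.total_perm() for c in self.children)


dbc = Space("DBC", 1000)                     # assumed system size: 1000 units
sales = dbc.create_child("Sales", 400)
web = sales.create_child("Web_Sales", 150)

print(dbc.perm_limit, sales.perm_limit, web.perm_limit)  # 600 250 150
print(dbc.total_perm())                                   # 1000
```

The total stays at 1000 no matter how the hierarchy is subdivided, which is the zero-sum property the slide describes.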

Page 37: Introduction to Teradata

Module 6

Data Distribution and Access

Page 38: Introduction to Teradata

How Does the Teradata Database Distribute Rows?

– The Teradata Database uses a hashing algorithm to randomly distribute table rows across the AMPs.
– The Primary Index choice determines whether the rows of a table will be evenly or unevenly distributed across the AMPs.
– Evenly distributed table rows result in evenly distributed workloads.
– Each AMP is responsible for its subset of the rows of each table.
– The rows are not placed in any particular order.

The benefits of unordered rows include:
– No maintenance needed to preserve order.
– The order is independent of any query being submitted.

The benefits of hashed distribution include:
– The distribution is the same regardless of data volume.
– The distribution is based on row content, not data demographics.
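To make the idea of hashed distribution concrete, here is a minimal Python sketch. It assumes a four-AMP system and uses MD5 purely as a stand-in hash; Teradata's real hashing algorithm and hash-map mechanism are not shown in these slides, so `amp_for` and `NUM_AMPS` are illustrative names only.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a Primary Index value to an AMP number.

    MD5 is only a stand-in here; it is not Teradata's hashing algorithm.
    """
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    row_hash = int.from_bytes(digest[:4], "big")  # keep 32 bits of the digest
    return row_hash % num_amps


# Distribute 10,000 unique order numbers and count the rows on each AMP.
rows_per_amp = Counter(amp_for(order_number) for order_number in range(1, 10001))
for amp in sorted(rows_per_amp):
    print(amp, rows_per_amp[amp])
```

The same value always lands on the same AMP, and because placement depends only on the value itself, the spread stays roughly even as the table grows.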

Page 39: Introduction to Teradata

Primary Key (PK) vs. Primary Index (PI)

• The PK is a relational modeling convention which uniquely identifies each row.
• The PI is a Teradata convention which determines row distribution and access.
• A well designed database will have tables where the PI is the same as the PK as well as tables where the PI is defined on columns different from the PK.
• Join performance and known access paths might dictate a PI that is different from the PK.

Page 40: Introduction to Teradata

Primary Indexes

• The physical mechanism used to assign a row to an AMP
• A table must have a Primary Index
• The Primary Index cannot be changed

UPI
• If the index choice of column(s) is unique, we call this a UPI (Unique Primary Index).
• A UPI choice will result in even distribution of the rows of the table across all AMPs.
• Reasons to choose a UPI: UPIs guarantee even data distribution, eliminate duplicate row checking, and are always a one-AMP operation.

NUPI
• If the index choice of column(s) isn't unique, we call this a NUPI (Non-Unique Primary Index).
• A NUPI choice will result in row distribution that is only as even as the index is unique.
• NUPIs can cause skewed data.

Why would you choose an Index that is different from the Primary Key?
1. Join performance
2. Known access paths

Page 41: Introduction to Teradata

Defining the Primary Index

• The Primary Index (PI) is defined at table creation.
• Every table must have one Primary Index.
• The Primary Index may consist of 1 to 64 columns.
• The Primary Index of a table may not be changed.
• The Primary Index is the mechanism used to assign a row to an AMP.
• The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
• Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
• Non-Unique Primary Indexes result in row distribution whose evenness depends on the number of duplicate values. This may cause skewed distribution.

Page 42: Introduction to Teradata

Row Distribution via Hashing

Page 43: Introduction to Teradata

Unique Primary Index (UPI) Access

Page 44: Introduction to Teradata

Non-Unique Primary Index (NUPI) Access

Page 45: Introduction to Teradata

UPI Row Distribution

• Order_Number values are unique (UPI).• The rows will distribute evenly across the AMPs.

Page 46: Introduction to Teradata

NUPI Row Distribution

• Customer_Number values are non-unique (NUPI).• Rows with the same PI value distribute to the same AMP causing skewed row distribution.

Page 47: Introduction to Teradata

Highly Non-Unique NUPI Row Distribution

• Order_Status values are highly non-unique (NUPI).

• Only two values exist. The rows will be distributed to two AMPs.

• This table will not perform well in parallel operations.

• Highly non-unique columns are poor PI choices.

• The degree of uniqueness is critical to efficiency.
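The effect of a highly non-unique Primary Index can be illustrated with a short Python sketch. It assumes an eight-AMP system, a made-up Orders table of 1,000 rows, and MD5 as a stand-in hash (not Teradata's algorithm); `amp_for` and `amps_used` are hypothetical helper names.

```python
import hashlib
from collections import Counter


def amp_for(pi_value, num_amps):
    # Stand-in hash (MD5), not Teradata's algorithm.
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def amps_used(pi_values, num_amps=8):
    """Return a Counter of rows per AMP for the given PI values."""
    return Counter(amp_for(v, num_amps) for v in pi_values)


# Hypothetical Orders table: 1,000 rows, status is "O" (open) or "C" (closed).
order_numbers = range(1, 1001)
order_status = ["O" if n % 4 else "C" for n in order_numbers]

by_number = amps_used(order_numbers)  # unique values
by_status = amps_used(order_status)   # only two distinct values

print(len(by_number), "AMPs hold rows when the PI is Order_Number")
print(len(by_status), "AMPs hold rows when the PI is Order_Status")
```

With only two distinct values, at most two AMPs can ever receive rows, however many AMPs the system has; the remaining AMPs sit idle during parallel operations on this table.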

Page 48: Introduction to Teradata

Partitioned Primary Index (PPI)

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#)

Partitioned Primary Indexes:
• Improve performance on range constraint queries
• Use partition elimination to reduce the number of rows accessed

The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI)
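Partition elimination can be sketched as follows. The Python below is an illustration under stated assumptions, not Teradata's implementation: the sample rows, the choice of monthly partitions, and the `orders_between` helper are all hypothetical, and the rows stand for the slice of the Orders table held by a single AMP.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows held by one AMP: (order_number, order_date).
orders = [
    (1001, date(2008, 1, 15)),
    (1002, date(2008, 1, 28)),
    (1003, date(2008, 2, 3)),
    (1004, date(2008, 2, 19)),
    (1005, date(2008, 3, 7)),
    (1006, date(2008, 3, 30)),
]

# Group rows into monthly partitions, as a PPI on Order_Date might.
partitions = defaultdict(list)
for order_number, order_date in orders:
    partitions[(order_date.year, order_date.month)].append((order_number, order_date))


def orders_between(start, end):
    """Return the matching rows and the number of rows actually read."""
    first, last = (start.year, start.month), (end.year, end.month)
    rows_read = 0
    result = []
    for month, rows in partitions.items():
        # Partition elimination: skip months that cannot overlap the range.
        if month < first or month > last:
            continue
        for row in rows:
            rows_read += 1
            if start <= row[1] <= end:
                result.append(row)
    return result, rows_read


matches, rows_read = orders_between(date(2008, 2, 1), date(2008, 2, 29))
print([order_number for order_number, _ in matches], rows_read)  # [1003, 1004] 2
```

Only the February partition is read, so two rows are touched instead of six; without partitioning the same range query would have to examine every row on the AMP.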

Page 49: Introduction to Teradata

Module 7

Secondary Indexes and Full-Table Scans

Page 50: Introduction to Teradata

Secondary Indexes

• A secondary index is an alternate path to the rows of a table.
• A table may have from 0 to 32 secondary indexes.
• A secondary index:
  – does not affect table row distribution.
  – is chosen to improve access performance.
  – may reference from 1 to 64 table columns.
  – may be defined at table creation.
  – may be defined after the table is created.
  – may be dropped at any time.
  – uses a sub-table which utilizes Perm Space.
  – may impact table maintenance performance (row inserts, row updates and/or row deletes).

Page 51: Introduction to Teradata

Defining a Secondary Index

Unique Secondary Index (USI)
• A Unique Secondary Index requires unique column values in each row.
• Access to a referenced value requires 2 AMPs (serial operation) and returns 0 or 1 rows.
• SQL to create:
CREATE UNIQUE INDEX (social_security) ON Employee;

Non-Unique Secondary Index (NUSI)
• A Non-Unique Secondary Index (NUSI) allows duplicate column values in the rows.
• Access to a referenced value requires all AMPs (parallel operation) and returns 0 to n rows.
• SQL to create:
CREATE INDEX (last_name) ON Employee;
CREATE INDEX (last_name, first_name) ON Employee;
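The difference between the two access patterns can be modeled roughly in Python. Everything here is an assumption made for illustration: the four-AMP system, the MD5 stand-in hash, the sample employees, and the `usi_lookup` and `nusi_lookup` helpers. In particular, the NUSI function simply has every AMP check its own rows, whereas a real system would consult an index sub-table on each AMP.

```python
import hashlib

NUM_AMPS = 4  # hypothetical system size


def amp_for(value):
    # Stand-in hash (MD5), not Teradata's algorithm.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


# Base table: each AMP stores rows keyed by the primary index (employee number).
base = [dict() for _ in range(NUM_AMPS)]
# USI sub-table: each AMP maps an index value to the row's primary index value.
usi = [dict() for _ in range(NUM_AMPS)]

employees = [
    (1, "111-11-1111", "Smith"),
    (2, "222-22-2222", "Jones"),
    (3, "333-33-3333", "Smith"),
]
for emp_no, ssn, last_name in employees:
    base[amp_for(emp_no)][emp_no] = (emp_no, ssn, last_name)
    usi[amp_for(ssn)][ssn] = emp_no


def usi_lookup(ssn):
    """Two serial steps: the AMP holding the index entry, then the AMP holding the row."""
    emp_no = usi[amp_for(ssn)].get(ssn)
    if emp_no is None:
        return []
    return [base[amp_for(emp_no)][emp_no]]


def nusi_lookup(last_name):
    """Every AMP checks its own rows (in parallel on a real system)."""
    return sorted(row for amp in base for row in amp.values() if row[2] == last_name)


print(usi_lookup("222-22-2222"))  # zero or one row
print(nusi_lookup("Smith"))       # zero to n rows
```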

Page 52: Introduction to Teradata

Other Types of Secondary Indexes

• Join Index
  – Define a pre-join table on frequently joined columns (with optional aggregation) without denormalizing the database.
  – Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing their rows to the same AMP.
  – Define a summary table without denormalizing the database.
  – Can be defined on one or several tables.
• Sparse Index
  – Any join index, whether simple or aggregate, multi-table or single-table, can be sparse.
  – Uses a constant expression in the WHERE clause of its definition to narrowly filter its row population.

Page 53: Introduction to Teradata

• Hash Index
  – Used for the same purposes as single-table join indexes.
  – Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing them to the same AMP.
  – Can be defined on one table only.
• Value-Ordered NUSI
  – Very efficient for range conditions and conditions with an inequality on the secondary index column set.

Page 54: Introduction to Teradata

Primary Index vs. Secondary Index

Page 55: Introduction to Teradata

Full-Table Scans

• Every data block of the table is read once.
• All AMPs scan their portion of the table in parallel.
• The Primary Index choice will affect parallel scan performance (UPI is even; NUPI is potentially skewed).
• Full-table scans typically occur when:
  – the index columns are not used in the query
  – a non-equality or range test is specified for the index columns

Page 56: Introduction to Teradata

Module 8

Fault Tolerance and Data Protection

Page 57: Introduction to Teradata

Locks

4 Types of Locks
• Exclusive - prevents any other type of concurrent access
• Write - prevents other Read, Write, Exclusive locks
• Read - prevents Write and Exclusive locks
• Access - prevents Exclusive locks only

3 Levels of Locks
• Database - applies to all tables/views in the database
• Table/View - applies to all rows in the table/view
• Row Hash - applies to all rows with same row hash

Page 58: Introduction to Teradata

Lock requests are based on the SQL request:
• SELECT - requests a Read lock
• UPDATE - requests a Write lock
• CREATE TABLE - requests an Exclusive lock

Lock requests may be upgraded or downgraded:
• LOCKING TABLE Table1 FOR ACCESS . . .
• LOCKING TABLE Table1 FOR EXCLUSIVE . . .
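The four lock types listed earlier amount to a compatibility matrix. The following Python sketch encodes exactly the "prevents" rules from that list; the `BLOCKS` table and `is_compatible` function are illustrative names, not part of any Teradata interface.

```python
# Lock types from the slide, weakest to strongest.
ACCESS, READ, WRITE, EXCLUSIVE = "Access", "Read", "Write", "Exclusive"
LOCK_TYPES = [ACCESS, READ, WRITE, EXCLUSIVE]

# For each lock already held, the set of requested locks it prevents.
BLOCKS = {
    ACCESS: {EXCLUSIVE},
    READ: {WRITE, EXCLUSIVE},
    WRITE: {READ, WRITE, EXCLUSIVE},
    EXCLUSIVE: {ACCESS, READ, WRITE, EXCLUSIVE},
}


def is_compatible(held, requested):
    """True if `requested` can be granted while `held` is in place."""
    return requested not in BLOCKS[held]


# Print the full matrix: rows are held locks, columns are requested locks.
print("held \\ requested".ljust(18) + "".join(t.ljust(11) for t in LOCK_TYPES))
for held in LOCK_TYPES:
    cells = ("yes" if is_compatible(held, req) else "no" for req in LOCK_TYPES)
    print(held.ljust(18) + "".join(c.ljust(11) for c in cells))
```

The matrix shows why downgrading a SELECT to an Access lock is useful: an Access request is still granted while another session holds a Write lock, whereas a Read request would wait.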

Page 59: Introduction to Teradata

Transient Journal

• Maintains a copy on each AMP of before images of all rows affected.
• Provides rollback of changed rows in the event of TXN failure.
• Activities are automatic and transparent to user.
• Before images are reapplied to table if TXN fails.
• Before images are discarded upon TXN completion.

Successful TXN
BEGIN TRANSACTION
UPDATE Row A - Before image Row A recorded (Add $100 to checking)
UPDATE Row B - Before image Row B recorded (Subtract $100 from savings)
END TRANSACTION - Discard before images

Page 60: Introduction to Teradata

Failed TXN
BEGIN TRANSACTION
UPDATE Row A - Before image Row A recorded
UPDATE Row B - Before image Row B recorded
(Failure occurs)
(Rollback occurs) - Reapply before images
(Terminate TXN) - Discard before images
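The successful and failed transactions above can be mimicked with a small Python model. The `ToyTable` class and the account balances are hypothetical; the sketch shows only the before-image idea, not how the Transient Journal is actually stored on each AMP.

```python
class ToyTable:
    """Toy table with a before-image journal (illustrative only)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.journal = {}

    def begin(self):
        self.journal = {}

    def update(self, key, new_value):
        # Record the before image the first time a row is touched.
        self.journal.setdefault(key, self.rows[key])
        self.rows[key] = new_value

    def end(self):
        # Transaction completed: discard the before images.
        self.journal.clear()

    def rollback(self):
        # Transaction failed: reapply the before images, then discard them.
        self.rows.update(self.journal)
        self.journal.clear()


accounts = ToyTable({"checking": 500, "savings": 800})

# Successful transaction: both updates are kept.
accounts.begin()
accounts.update("checking", accounts.rows["checking"] + 100)
accounts.update("savings", accounts.rows["savings"] - 100)
accounts.end()
print(accounts.rows)  # {'checking': 600, 'savings': 700}

# Failed transaction: the first update is undone.
accounts.begin()
try:
    accounts.update("checking", accounts.rows["checking"] + 100)
    raise RuntimeError("simulated failure before the second update")
except RuntimeError:
    accounts.rollback()
print(accounts.rows)  # {'checking': 600, 'savings': 700}
```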

Page 61: Introduction to Teradata

RAID Protection

RAID 1 (Mirroring)
• Each physical disk in the array has an exact copy in the same array.
• The array controller can read from either disk and write to both.
• When one disk of the pair fails, there is no change in performance.
• Mirroring reduces available disk space by 50%.
• Array controller reconstructs failed disks quickly.

RAID 5 (Parity)
• Data and parity striped across rank of 4 disks.
• If a disk fails, any missing block may be reconstructed using the other three disks.
• Parity reduces available disk space by 25% in a 4-disk rank.
• Reconstruction of failed disks takes longer than RAID 1.

Summary:
• RAID-1 - Good performance with disk failures; higher cost in terms of disk space
• RAID-5 - Reduced performance with disk failures; lower cost in terms of disk space
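The space figures and the reconstruction claim can both be checked with a short Python sketch. It assumes byte-wise XOR parity, which is the usual way RAID 5 parity is computed; the helper names and the four-byte blocks are made up for the example.

```python
def usable_fraction(raid_level, disks_per_rank=4):
    """Fraction of raw disk space left for data."""
    if raid_level == 1:
        return 0.5  # every disk has a mirror
    if raid_level == 5:
        return (disks_per_rank - 1) / disks_per_rank  # one disk's worth of parity
    raise ValueError("only RAID 1 and RAID 5 are modeled here")


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytes(len(blocks[0]))
    for block in blocks:
        result = bytes(a ^ b for a, b in zip(result, block))
    return result


print(usable_fraction(1), usable_fraction(5))  # 0.5 0.75

# One stripe across a 4-disk rank: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The disk holding data[1] fails; rebuild it from the other three disks.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'BBBB'
```

Rebuilding a block means reading the three surviving disks, which is why a degraded RAID 5 rank is slower than a degraded mirror that simply reads the surviving copy.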

Page 62: Introduction to Teradata

Fallback

A Fallback table is fully available in the event of an unavailable AMP.
A Fallback row is a copy of a primary row stored on a different AMP in the same CLUSTER of AMPs.

Benefits of Fallback:
• May be specified at the table or database level
• Permits access to table data during AMP off-line period
• Adds a level of data protection beyond disk array RAID 1 & 5
• Highest level of data protection is RAID 1 and Fallback
• Automatically restores data changed during AMP off-line
• Critical for high availability applications

Costs of Fallback:
• Twice the disk space for table storage is needed
• Twice the I/O for INSERTs, UPDATEs and DELETEs is needed
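A rough Python model shows why a Fallback table stays available with one AMP down. The placement rules are assumptions for the sketch only: rows are assigned to primary AMPs by a simple modulo, and each fallback copy goes to the next AMP in a hypothetical four-AMP cluster. The slides say only that the copy lives on a different AMP in the same cluster.

```python
CLUSTER = [0, 1, 2, 3]  # hypothetical cluster of four AMPs


def primary_amp(row_id):
    # Assumption: simple modulo placement stands in for hashing.
    return CLUSTER[row_id % len(CLUSTER)]


def fallback_amp(row_id):
    # Assumption: the fallback copy goes to the next AMP in the cluster.
    position = CLUSTER.index(primary_amp(row_id))
    return CLUSTER[(position + 1) % len(CLUSTER)]


def readable_rows(row_ids, down_amps):
    """Rows that still have at least one copy on an AMP that is up."""
    return [
        r for r in row_ids
        if primary_amp(r) not in down_amps or fallback_amp(r) not in down_amps
    ]


rows = list(range(8))
print(readable_rows(rows, {2}))     # [0, 1, 2, 3, 4, 5, 6, 7]
print(readable_rows(rows, {2, 3}))  # [0, 1, 3, 4, 5, 7]
```

With a single AMP down every row is still reachable through its other copy. In this model, losing two AMPs in the same cluster leaves rows 2 and 6 with no available copy, which illustrates why Fallback protects against one AMP failure per cluster.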

Page 63: Introduction to Teradata

Recovery Journal for Down AMPs

Recovery Journal is:
• Automatically activated when an AMP is taken off-line
• Maintained by other AMPs in the cluster
• Totally transparent to users of the system

While AMP is off-line:
• Journal is active
• Table updates continue as normal
• Journal logs Row-IDs of changed rows for down-AMP

When AMP is back on-line:
• Restores rows on recovered AMP to current status
• Journal discarded when recovery complete

Page 64: Introduction to Teradata

Cliques

• A clique is a defined set of nodes with failover capability.
• All nodes in a clique are able to access the vdisks of all AMPs in the clique.
• If a node fails, its vprocs will migrate to the remaining nodes in the clique.
• Each node can support 128 Vprocs.
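Vproc migration within a clique can be sketched in Python. The balancing rule (send each vproc to the surviving node with the fewest vprocs) is an assumption made for the example, as are the node and vproc names; only the limit of 128 vprocs per node comes from the slide.

```python
MAX_VPROCS_PER_NODE = 128  # limit stated on the slide


def fail_node(clique, failed):
    """Return a new node-to-vprocs mapping after `failed` goes down."""
    survivors = {node: list(vprocs) for node, vprocs in clique.items() if node != failed}
    if not survivors:
        raise RuntimeError("no remaining nodes in the clique")
    for vproc in clique[failed]:
        # Assumed policy: move each vproc to the least-loaded surviving node.
        target = min(survivors, key=lambda node: len(survivors[node]))
        if len(survivors[target]) >= MAX_VPROCS_PER_NODE:
            raise RuntimeError("surviving nodes cannot hold more vprocs")
        survivors[target].append(vproc)
    return survivors


clique = {
    "node1": ["amp0", "amp1"],
    "node2": ["amp2", "amp3"],
    "node3": ["amp4", "amp5"],
}
after = fail_node(clique, "node2")
print(after)
# {'node1': ['amp0', 'amp1', 'amp2'], 'node3': ['amp4', 'amp5', 'amp3']}
```

Because every node in the clique can reach every vdisk in the clique, the migrated AMPs keep access to their own data; the surviving nodes simply carry a heavier load until the failed node returns.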

Page 65: Introduction to Teradata

Archiving and Recovering Data

Archive Recovery Utility (ARC)
• Runs on IBM, UNIX, Linux and Win2K
• Archives data from RDBMS
• Restores data from archive media
• Permits data recovery to a specified checkpoint

Other Archive Applications
• BakBone NetVault
• Symantec NetBackup

Common uses of ARC
• Dump database objects for backup or disaster recovery.
• Restore non-fallback tables after disk failure.
• Restore tables after corruption from failed batch processes.
• Recover accidentally dropped tables, views, or macros.
• Recover from miscellaneous user errors.
• Copy a table and restore it to another Teradata Database.

Page 66: Introduction to Teradata

Module 9

Client Tools and Utilities

Page 67: Introduction to Teradata

Query Tools - BTEQ

Page 68: Introduction to Teradata

Query Tools – Teradata SQL Assistant

SQL front-end to Teradata Database and other ODBC compliant databases

Page 69: Introduction to Teradata

Fast Load Utility

• Fast batch utility for loading a single empty table
• Automatic checkpoint/restart capability
• Errors reported and collected in error tables
• Supports INMOD routines and Access Modules
• Loads data in two phases

Page 70: Introduction to Teradata

MultiLoad Utility

• Loads/maintains up to five empty or populated tables
• Performs block level operations against target tables
• Affected data blocks are written once
• Multiple operations with one pass of input files
• Uses conditional logic to apply updates
• Supports INSERT, UPDATE, DELETE and UPSERT operations
• Supports INMOD routines and Access Modules
• Errors reported and collected in error tables
• Provides automatic checkpoint/restart capability

Page 71: Introduction to Teradata

FastExport Utility

• Exports large volumes of formatted data from one or more tables on the Teradata Database to a host file or user-written application
• Supports multiple sessions
• Export from multiple tables
• Provides automatic checkpoint/restart capability

Page 72: Introduction to Teradata

TPump Utility
• Allows near real-time updates from transactional systems into the warehouse
• Allows constant loading of data into a table
• Performs INSERT, UPDATE, DELETE, and ATOMIC UPSERT operations, or a combination, to more than 60 tables at a time
• High-volume SQL-based continuous update of multiple tables
• Allows target tables to:
  – Have secondary indexes, referential integrity, constraints and enabled triggers
  – Be MULTISET or SET
  – Be populated or empty
• Allows conditional processing
• Supports automatic restarts
• No session limit - use as many sessions as necessary

Page 73: Introduction to Teradata

• No limit to the number of concurrent instances
• Uses row-hash locks, allowing concurrent updates on the same table
• Can be stopped at any time with work committed with no ill effect
• Designed for highest possible throughput
• Gives users control over the rate per minute (throttle) at which statements are sent to the database, either dynamically or by script

Page 74: Introduction to Teradata

Teradata Parallel Transporter

• Parallel Extract, Transform and Load (end-to-end parallelism) eliminates sequential bottlenecks
• Data Streams eliminate the overhead of persistent storage
• Single SQL-like scripting language
• Access to various data sources
• Open API enables Third Party and user application integration

Page 75: Introduction to Teradata

Teradata Parallel Transporter Operators

Page 76: Introduction to Teradata

Teradata Manager

Graphical system management tool. Collects, analyzes, and displays:
– Performance information

Page 77: Introduction to Teradata

Teradata Dynamic Workload Manager

Query workload management tool (formerly Teradata Dynamic Query Manager) that:
• Restricts (i.e. runs, suspends, schedules later or rejects) queries based on set thresholds
• Based on analysis control:
  – Too long
  – Too many rows
• Based on object control:
  – User ID
  – Table
  – Day/time
  – Group ID
• Based on environmental factors:
  – CPU
  – Disk utilization
  – Network activity
  – Number of users
• Logs workload performance for analysis

Page 78: Introduction to Teradata

Analyst Tools – Teradata Visual Explain

– Provides the ability to capture and graphically represent the steps of a query plan and perform comparisons of two or more plans
– Stores query plans in a Query Capture Database (QCD)

Page 79: Introduction to Teradata

Analyst Tools – Teradata System Emulation Tool

– Emulates a target system by exporting and importing all information necessary to emulate in a test environment
– Use with Target Level Emulation to generate query plans on a test system as if they were run on the target system
– Verifies queries and reproduces optimizer related issues in a test environment

Page 80: Introduction to Teradata

Analyst Tools – Teradata Index Wizard

Recommends secondary indexes for tables, based on a particular workload

Page 81: Introduction to Teradata

Analyst Tools – Teradata Statistics Wizard

– Recommends and automates the Statistics Collection process
– Recommends Statistics to be re-collected due to table growth

Page 82: Introduction to Teradata

Thank You