Introduction to Teradata

Date posted: 01-Nov-2014

Introduction to the Teradata Database

Course Modules

This course consists of:

- Module 1: Teradata Database Overview
- Module 2: Relational Database Concepts
- Module 3: Teradata and the Data Warehouse
- Module 4: Components and Architecture
- Module 5: Databases and Users
- Module 6: Data Distribution and Access
- Module 7: Secondary Indexes and Full-Table Scans
- Module 8: Fault Tolerance and Data Protection
- Module 9: Client Tools and Utilities

Module 1

Teradata Database Overview

What is the Teradata Database?

- A Relational Database Management System
- Built on a Parallel Architecture
- Runs on MP-RAS UNIX, Microsoft Windows 2000/2003 Server, and SuSE Linux

Teradata Parallel Architecture

- More warehouse data
- Linear Scalability (10 GB to 100+ TB)
- Hashing provides for automatic data distribution
- Parallel-Aware Optimizer
- Single, Administrative View
- Ad hoc queries with ANSI

Teradata Database Advantages

- Proven Linear Scalability - increased workload without decreased throughput
- Most Concurrent Users - multiple complex queries
- Unconditional Parallelism - sorts, aggregations and full-table scans are performed in parallel
- Mature Optimizer - robust and parallel aware, handles complex queries, multiple joins per query, ad hoc processing
- Low TCO - ease of setup and maintenance, robust parallel utilities, no reorgs, automatic data distribution, low disk-to-data ratio, robust expansion utility
- High Availability - no single point of failure, fault-tolerant architecture
- Single View of the Business - single database server for multiple clients

Teradata Database Manageability

Things Teradata Database Administrators never have to do!

- Reorganize data or index space
- Pre-allocate table or index space
- Physically format partitions or disk space
- Pre-prepare data for loading (convert, sort, split, etc.)
- Ensure that queries run in parallel
- Unload/reload data spaces due to expansion

** The Administrator knows that if the data is to be doubled, the system can be easily expanded to accommodate it.

** The amount of work required to create a table which will contain 100 rows is the same as that to create a table which will contain 1,000,000,000 rows.

Teradata Database Features

- Designed to process large quantities of detail data
- Ideal for data warehouse applications
- Parallelism makes easy access to very large tables possible
- Open architecture - uses industry standard components
- Performance increase is linear as components are added
- Runs as a database server to client applications
- Runs on multiple hardware platforms (SMP) and Teradata hardware (MPP)

Module 2

Relational Database Concepts

What is a Database?

A database is a collection of permanently stored data that is:

- Logically related - the data relates to other data.
- Shared - many users may access the data.
- Protected - access to data is controlled.
- Managed - the data has integrity and value.

Logical/Relational Modeling

The Logical Model:

- Should be designed without regard to usage
  - Accommodates a wide variety of front-end tools
  - Allows the database to be created more quickly
- Should be the same regardless of data volume
- Organizes data according to what it represents (real-world business data in table (relational) form)
- Includes all the data definitions within the scope of the application or enterprise
- Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization:

- Process of reducing a complex data structure into a simple, stable one
- Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

Relational Databases are founded on Set Theory and based on the Relational Model. A Relational Database consists of a collection of logically related tables. A table is a two-dimensional representation of data consisting of rows and columns.

The employee table has:

- Nine columns of data
- Six rows of data - one per employee
- Only one row format for the entire table
- Missing data values represented by nulls
- Column and row order are arbitrary

Primary Keys

Foreign Keys

Answering Questions with a Relational Database

Relational Advantages

Advantages of a Relational Database compared to other database methodologies include:

- More flexible than other types
  - Allows businesses to respond quickly to changing conditions
- Data-driven rather than application-driven
  - Models the business, not the processes
  - Makes applications easier to build because the data does more of the work
- Supports the trend toward end-user computing
- Easy to understand
  - No need to know the access path
- Solidly founded in set theory

Module 3

Teradata and the Data Warehouse

Evolution of Data Processing

A transaction is a logical unit of work.

The Advantage of Using Detail Data

Data Warehouse Usage Evolution

Active Data Warehousing

- Performance - response time within seconds
- Scalability - large amounts of detailed data, mixed workloads (both tactical and strategic queries) for mission-critical applications, and concurrent users
- Availability and Reliability - 7 x 24
- Data Freshness - accurate, up-to-the-minute data, including access to operational data store level information

The Data Warehouse

A central, enterprise-wide database that contains information extracted from operational systems.

- Based on an enterprise-wide model
- Can begin small but may grow large rapidly
- Populated by extraction/loading of data from operational systems
- Responds to end-user "what if" queries
- Minimizes data movement/synchronization
- Provides a Single View of the business

Data Marts

A data mart is a special-purpose subset of enterprise data for a particular function or application. It may contain detail or summary data or both.

Data mart types:

- Independent - created directly from operational systems to a separate physical data store
- Logical - exists as a subset of an existing data warehouse via Views
- Dependent - created from the data warehouse to a separate physical data store

Module 4

Components and Architecture

What is a Node?

- Teradata software, gateway software and channel-driver software run as processes
- Parsing Engines (PE) and Access Module Processors (AMP) are Virtual Processors (VPROC) which run under control of Parallel Database Extensions (PDE)
- Each AMP is associated with a Virtual Disk (VDISK)
- A single node is called a Symmetric Multi-Processor (SMP)
- All AMPs and PEs communicate via the BYNET

Major Components of the Teradata Database

The Parsing Engine (PE)

The Parsing Engine is responsible for:

- Managing individual sessions (up to 120 sessions per PE)
- Parsing and optimizing your SQL requests
- Building query plans with the parallel-aware, cost-based, intelligent Optimizer
- Dispatching the optimized plan to the AMPs
- EBCDIC/ASCII input conversion (if necessary)
- Sending the answer set response back to the requesting client

The BYNET

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:

- Automatic load balancing of message traffic
- Automatic reconfiguration after fault detection
- Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs on the system:

- Between nodes, the BYNET hardware carries broadcast and point-to-point communications
- On a node, BYNET software and PDE together control which AMPs receive a

The Access Module Processor (AMP)

The AMP is responsible for:

- Storing rows to and retrieving rows from its VDISK
- Lock management
- Sorting rows and aggregating columns
- Join processing
- Output conversion and formatting (ASCII, EBCDIC)
- Creating answer sets for clients
- Disk space management and accounting
- Special utility protocols
- Recovery processing

The MPP System

The BYNET (both software and hardware) connects two or more SMP Nodes to create a Massively Parallel Processing (MPP) system. The Teradata Database is linearly expandable by adding nodes.

Teradata Database Software

Channel-Attached Client Software

CLI (Call-Level Interface)

- Request and response control
- Buffer allocation and initialization
- Lowest-level interface to the Teradata Database
- Library of routines for blocking/unblocking requests and responses to/from the RDBMS
- Performs logon and logoff functions

TDP (Teradata Director Program)

- Manages session traffic between CLI and the Teradata Database
- Session balancing across multiple PEs
- Failure notification (application failure, Teradata Database restart)
- Logging, verification, recovery, restart, security

Network-Attached Client Software

ODBC

- Call-level interface
- The Teradata Database ODBC driver is used to connect applications with the Teradata Database

MTDP (Micro Teradata Director Program)

- Performs many TDP functions, including session management but not session balancing across PEs

MOSI (Micro Operating System Interface)

- Provides an interface that is independent of the operating system and network protocol

Module 5

Databases and Users

Databases and Users Defined

Databases and Users are the repositories for objects:

- Tables - require Perm Space
- Views - do not require Perm Space
- Macros - do not require Perm Space
- Triggers - do not require Perm Space
- Stored Procedures - require Perm Space

Space limits are specified for each database and for each user:

- Perm Space - maximum amount of space available for permanent tables
- Spool Space - maximum amount of work space available for request processing
- Temp Space - maximum amount of space available for global temporary tables

A database is created with the CREATE DATABASE command. A user is created with the CREATE USER command. The only difference between a database and a user is that the user has a password and may log on to the system. A database or user with no Perm Space may not contain permanent tables but may contain views and macros.

Teradata Database Space Management

- A new database or user must be created from an existing database or user.
- All Perm Space limits are subtracted from the owner.
- Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
- Perm Space currently not being used is available for Spool Space or Temp Space.
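
The subtraction rule can be pictured with a small Python model. This is only an illustrative sketch: the class, the database names below the root, and the sizes in GB are assumptions made up for the example, not Teradata objects or commands.

```python
class SpaceOwner:
    """A database or user with a Perm Space limit (illustrative model only)."""

    def __init__(self, name, perm_limit, owner=None):
        self.name = name
        self.perm_limit = perm_limit  # GB still held by this database/user
        self.owner = owner
        self.children = []

    def create_child(self, name, perm_limit):
        """Create a database/user whose Perm Space is taken from this owner."""
        if perm_limit > self.perm_limit:
            raise ValueError(f"{self.name} has only {self.perm_limit} GB to give")
        self.perm_limit -= perm_limit  # the limit is subtracted from the owner
        child = SpaceOwner(name, perm_limit, owner=self)
        self.children.append(child)
        return child


def total_perm(node):
    """Sum the Perm Space limits of a node and everything created beneath it."""
    return node.perm_limit + sum(total_perm(c) for c in node.children)


# Hypothetical hierarchy: sizes are invented for the example.
root = SpaceOwner("DBC", 1000)
sysdba = root.create_child("Sysdba", 600)
hr = sysdba.create_child("HR", 200)
```

After these calls the root holds 400 GB, Sysdba 400 GB and HR 200 GB, and total_perm(root) is still 1000: creating databases moves space around but never adds any.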

Module 6

Data Distribution and Access

How Does the Teradata Database Distribute Rows?

- The Teradata Database uses a hashing algorithm to randomly distribute table rows across the AMPs.
- The Primary Index choice determines whether the rows of a table will be evenly or unevenly distributed across the AMPs.
- Evenly distributed table rows result in evenly distributed workloads.
- Each AMP is responsible for its subset of the rows of each table.
- The rows are not placed in any particular order.

The benefits of unordered rows include:

- No maintenance needed to preserve order.
- The order is independent of any query being submitted.

The benefits of hashed distribution include:

- The distribution is the same regardless of data volume.
- The distribution is based on row content, not data demographics.
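
A rough Python sketch of the idea follows. It is not Teradata's hashing algorithm: MD5 and the modulo step are stand-ins chosen for the example, and the four-AMP system size and order numbers are assumptions.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a Primary Index value to an AMP number.

    MD5 stands in for the database's own row-hash function, and the
    modulo stands in for its mapping from hash values to AMPs.
    """
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    row_hash = int.from_bytes(digest[:4], "big")  # 32-bit stand-in row hash
    return row_hash % num_amps


# The AMP depends only on the row's PI value, so the same value always
# lands on the same AMP, no matter how many other rows exist.
order_numbers = range(7001, 8001)
rows_per_amp = Counter(amp_for(n) for n in order_numbers)
```

Printing rows_per_amp shows the 1,000 unique values spread over all four AMPs in roughly equal shares, with no ordering maintained on any AMP.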

Primary Key (PK) vs. Primary Index (PI)

- The PK is a relational modeling convention which uniquely identifies each row.
- The PI is a Teradata convention which determines row distribution and access.
- A well-designed database will have tables where the PI is the same as the PK as well as tables where the PI is defined on columns different from the PK.
- Join performance and known access paths might dictate a PI that is different from the PK.

Primary Indexes

- The physical mechanism used to assign a row to an AMP
- A table must have a Primary Index
- The Primary Index cannot be changed

UPI

- If the index choice of column(s) is unique, we call this a UPI (Unique Primary Index).
- A UPI choice will result in even distribution of the rows of the table across all AMPs.
- Reasons to choose a UPI: UPIs guarantee even data distribution, eliminate duplicate row checking, and are always a one-AMP operation.

NUPI

- If the index choice of column(s) isn't unique, we call this a NUPI (Non-Unique Primary Index).
- A NUPI choice will result in row distribution that is only as even as the index values are unique.
- NUPIs can cause skewed data.

Why would you choose an Index that is different from the Primary Key?

1. Join performance
2. Known access paths

Defining the Primary Index

- The Primary Index (PI) is defined at table creation.
- Every table must have one Primary Index.
- The Primary Index may consist of 1 to 64 columns.
- The Primary Index of a table may not be changed.
- The Primary Index is the mechanism used to assign a row to an AMP.
- The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
- Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
- Non-Unique Primary Indexes distribute rows less evenly as the number of duplicate values grows. This may cause skewed distribution.

Row Distribution via Hashing

Unique Primary Index (UPI) Access

Non-Unique Primary Index (NUPI) Access

UPI Row Distribution

Order_Number values are unique (UPI). The rows will distribute evenly across the AMPs.

NUPI Row Distribution

Customer_Number values are non-unique (NUPI). Rows with the same PI value distribute to the same AMP, which can cause skewed row distribution.

Highly Non-Unique NUPI Row Distribution

- Order_Status values are highly non-unique (NUPI).
- Only two values exist, so the rows will be distributed to at most two AMPs.
- This table will not perform well in parallel operations.
- Highly non-unique columns are poor PI choices.
- The degree of uniqueness is critical to efficiency.
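
The three cases above (unique, non-unique, highly non-unique) can be compared with a short Python simulation. It is a sketch under stated assumptions: MD5 with a modulo stands in for the real hashing, and the eight-AMP system, the 10,000 orders, the 500 customers and the status codes "O" and "C" are invented for the example.

```python
import hashlib
from collections import Counter

NUM_AMPS = 8  # hypothetical system size


def amp_for(pi_value):
    """Stand-in hash: MD5 digest reduced to an AMP number (not Teradata's hash)."""
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def distribution(pi_values):
    """Return the number of rows each AMP would hold for these PI values."""
    counts = Counter(amp_for(v) for v in pi_values)
    return [counts.get(amp, 0) for amp in range(NUM_AMPS)]


# Invented demographics for a 10,000-row Orders table.
order_numbers = list(range(1, 10001))                            # unique
customer_numbers = [n % 500 for n in order_numbers]              # 20 orders per customer
order_status = ["O" if n % 4 else "C" for n in order_numbers]    # only two values

upi_rows = distribution(order_numbers)
nupi_rows = distribution(customer_numbers)
skewed_rows = distribution(order_status)
```

Comparing the three lists shows every AMP busy for the unique column, somewhat lumpier counts for Customer_Number, and at least six of the eight AMPs holding no rows at all for Order_Status.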

Partitioned Primary Index (PPI)

Partitioned Primary Indexes:

- Improve performance on range constraint queries
- Use partition elimination to reduce the number of rows accessed

Figure: The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#)

Figure: The Orders table defined with a Primary Index on Order_Number (O_#), partitioned by Order_Date (O_Date) (PPI)
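
Partition elimination can be sketched in a few lines of Python. The example looks at the rows held by a single AMP and assumes monthly partitions on the order date; the order numbers, dates and function names are hypothetical, and a real system would apply the same logic on every AMP in parallel.

```python
from collections import defaultdict
from datetime import date


def partition_of(order_date):
    """Assumed partitioning expression: one partition per calendar month."""
    return (order_date.year, order_date.month)


# Invented rows: (order_number, order_date)
orders = [
    (7001, date(2014, 1, 5)),
    (7002, date(2014, 1, 20)),
    (7003, date(2014, 2, 3)),
    (7004, date(2014, 2, 17)),
    (7005, date(2014, 3, 9)),
    (7006, date(2014, 3, 30)),
]

# Rows on this AMP are grouped by partition.
partitions = defaultdict(list)
for row in orders:
    partitions[partition_of(row[1])].append(row)


def select_range(start, end):
    """Return matching order numbers and the number of rows actually read."""
    first, last = partition_of(start), partition_of(end)
    rows_read = 0
    matches = []
    for key, rows in partitions.items():
        if key < first or key > last:
            continue  # partition eliminated: none of its rows are read
        rows_read += len(rows)
        matches.extend(num for num, d in rows if start <= d <= end)
    return sorted(matches), rows_read


february, rows_read = select_range(date(2014, 2, 1), date(2014, 2, 28))
```

Here the February query returns orders 7003 and 7004 after reading only two of the six rows; without partitioning, the same range test would have to examine all six.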

Module 7

Secondary Indexes and Full-Table Scans

Secondary Indexes

A secondary index is an alternate path to the rows of a table. A table may have from 0 to 32 secondary indexes.

A secondary index:

- does not affect table row distribution.
- is chosen to improve access performance.
- may reference from 1 to 64 table columns.
- may be defined at table creation.
- may be defined after the table is created.
- may be dropped at any time.
- uses a sub-table which utilizes Perm Space.
- may impact table maintenance performance (row inserts, row updates and/or row deletes).

Defining a Secondary Index

Unique Secondary Index (USI)

- A Unique Secondary Index requires unique column values in each row.
- Access to a referenced value requires 2 AMPs (serial operation) and returns 0 or 1 rows.
- SQL to create:

    CREATE UNIQUE INDEX (social_security) ON Employee;

Non-Unique Secondary Index (NUSI)

- A Non-Unique Secondary Index (NUSI) allows duplicate column values in the rows.
- Access to a referenced value requires all AMPs (parallel operation) and returns 0 to n rows.
- SQL to create:

    CREATE INDEX (last_name) ON Employee;
    CREATE INDEX (last_name, first_name) ON Employee;
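
The difference between the two access paths can be sketched in Python. This is a simplified model, not the database's implementation: MD5 with a modulo stands in for the hashing, the employee rows are invented, and the sketch assumes that the USI sub-table is distributed by the hash of the index value while each AMP keeps a NUSI sub-table only for its own rows.

```python
import hashlib

NUM_AMPS = 4  # hypothetical system size


def amp_for(value):
    """Stand-in hash: MD5 digest reduced to an AMP number (not Teradata's hash)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


# Invented rows: (employee_number [the PI], social_security, last_name)
employees = [
    (1001, "111-11-1111", "Smith"),
    (1002, "222-22-2222", "Jones"),
    (1003, "333-33-3333", "Smith"),
]

base = [dict() for _ in range(NUM_AMPS)]   # per-AMP base rows, keyed by PI value
usi = [dict() for _ in range(NUM_AMPS)]    # per-AMP USI sub-table: ssn -> PI value
nusi = [dict() for _ in range(NUM_AMPS)]   # per-AMP NUSI sub-table: name -> PI values

for emp_no, ssn, last_name in employees:
    home = amp_for(emp_no)
    base[home][emp_no] = (emp_no, ssn, last_name)
    usi[amp_for(ssn)][ssn] = emp_no                        # hashed on the index value
    nusi[home].setdefault(last_name, []).append(emp_no)    # local to the row's AMP


def usi_lookup(ssn):
    """Two serial steps: the index AMP, then the AMP holding the base row."""
    steps = [amp_for(ssn)]
    emp_no = usi[steps[0]].get(ssn)
    if emp_no is None:
        return None, steps
    steps.append(amp_for(emp_no))  # may coincide with the first AMP by chance
    return base[steps[1]][emp_no], steps


def nusi_lookup(last_name):
    """Every AMP checks its own sub-table and returns its local matches."""
    rows = []
    for amp in range(NUM_AMPS):
        for emp_no in nusi[amp].get(last_name, []):
            rows.append(base[amp][emp_no])
    return sorted(rows)
```

Calling usi_lookup("222-22-2222") returns the Jones row after two steps, while nusi_lookup("Smith") has to ask all four AMPs before it can return both Smith rows.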

Other Types of Secondary Indexes

Join Index

- Define a pre-join table on frequently joined columns (with optional aggregation) without denormalizing the database.
- Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing their rows to the same AMP.
- Define a summary table without denormalizing the database.
- Can be defined on one or several tables.

Sparse Index

- Any join index, whether simple or aggregate, multi-table or single-table, can be sparse.
- Uses a constant expression in the WHERE clause of its definition to narrowly filter its row population.

Hash Index

- Used for the same purposes as single-table join indexes.
- Create a full or partial replication of a base table with a PI on a FK column to facilitate joins of large tables by hashing them to the same AMP.
- Can be defined on one table only.

Value-Ordered NUSI

- Very efficient for range conditions and conditions with an inequality on the secondary index column set.

Primary Index vs. Secondary Index

Full-Table Scans

- Every data block of the table is read once.
- All AMPs scan their portion of the table in parallel.
- The Primary Index choice will affect parallel scan performance (UPI is even; NUPI is potentially skewed).

Full-table scans typically occur when:

- the index columns are not used in the query
- a non-equality or range test is specified for the index columns

Module 8

Fault Tolerance and Data Protection

Locks

4 Types of Locks

- Exclusive - prevents any other type of concurrent access
- Write - prevents other Read, Write, Exclusive locks
- Read - prevents Write and Exclusive locks
- Access - prevents Exclusive locks only

3 Levels of Locks

- Database - applies to all tables/views in the database
- Table/View - applies to all rows in the table/view
- Row Hash - applies to all rows with same row hash

Lock requests are based on the SQL request:

- SELECT - requests a Read lock
- UPDATE - requests a Write lock
- CREATE TABLE - requests an Exclusive lock

Lock requests may be upgraded or downgraded:

    LOCKING TABLE Table1 FOR ACCESS . . .
    LOCKING TABLE Table1 FOR EXCLUSIVE . . .
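
The four lock types form a small compatibility table, written out below as a Python sketch. The table is taken directly from the "prevents" rules listed above; the function and variable names are made up for the example.

```python
# For each lock already held, the lock types it prevents others from acquiring.
PREVENTS = {
    "EXCLUSIVE": {"ACCESS", "READ", "WRITE", "EXCLUSIVE"},
    "WRITE": {"READ", "WRITE", "EXCLUSIVE"},
    "READ": {"WRITE", "EXCLUSIVE"},
    "ACCESS": {"EXCLUSIVE"},
}

# Default lock requested by each kind of SQL statement, per the list above.
DEFAULT_LOCK = {"SELECT": "READ", "UPDATE": "WRITE", "CREATE TABLE": "EXCLUSIVE"}


def compatible(held, requested):
    """True if `requested` can be granted while `held` is in place."""
    return requested not in PREVENTS[held]


def can_run(statement, held_locks):
    """True if the statement's default lock conflicts with none of the held locks."""
    wanted = DEFAULT_LOCK[statement]
    return all(compatible(held, wanted) for held in held_locks)
```

For example, can_run("SELECT", ["WRITE"]) is False, because a default Read lock must wait for the writer, whereas compatible("WRITE", "ACCESS") is True. That is why downgrading a SELECT with LOCKING ... FOR ACCESS lets it read a table that is being updated.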

Transient Journal

- Maintains a copy on each AMP of before images of all rows affected.
- Provides rollback of changed rows in the event of TXN failure.
- Activities are automatic and transparent to user.
- Before images are reapplied to table if TXN fails.
- Before images are discarded upon TXN completion.

Successful TXN

    BEGIN TRANSACTION
    UPDATE Row A       Before image Row A recorded (Add $100 to checking)
    UPDATE Row B       Before image Row B recorded (Subtract $100 from savings)
    END TRANSACTION    Discard before images

Failed TXN

    BEGIN TRANSACTION
    UPDATE Row A       Before image Row A recorded
    UPDATE Row B       Before image Row B recorded
    (Failure occurs)
    (Rollback occurs)  Reapply before images
    (Terminate TXN)    Discard before images
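
The two sequences can be mimicked with a short Python sketch. It models only the before-image idea on a single in-memory table; the function name, the failure flag and the account balances are assumptions made for illustration.

```python
def run_transaction(rows, updates, fail_before_end=False):
    """Apply (key, delta) updates, keeping before images for rollback.

    Returns True if the transaction completed, False if it was rolled back.
    """
    journal = []  # before images, in the order they were recorded
    for key, delta in updates:
        journal.append((key, rows[key]))  # record the before image first
        rows[key] += delta                # then change the row
    if fail_before_end:
        # Rollback: reapply before images, newest first.
        for key, before in reversed(journal):
            rows[key] = before
        completed = False
    else:
        completed = True
    journal.clear()  # before images are discarded in both cases
    return completed


# Invented balances for the checking/savings transfer.
accounts = {"checking": 500, "savings": 900}
transfer = [("checking", 100), ("savings", -100)]

ok = run_transaction(accounts, transfer)
after_success = dict(accounts)

failed = run_transaction(accounts, transfer, fail_before_end=True)
after_failure = dict(accounts)
```

The successful run leaves checking at 600 and savings at 800. The failed run applies both updates, then restores the before images, so the balances end exactly where they started.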

RAID Protection

RAID 1 (Mirroring)

- Each physical disk in the array has an exact copy in the same array.
- The array controller can read from either disk and write to both.
- When one disk of the pair fails, there is no change in performance.
- Mirroring reduces available disk space by 50%.
- Array controller reconstructs failed disks quickly.

RAID 5 (Parity)

- Data and parity striped across rank of 4 disks.
- If a disk fails, any missing block may be reconstructed using the other three disks.
- Parity reduces available disk space by 25% in a 4-disk rank.
- Reconstruction of failed disks takes longer than RAID 1.

Summary:

- RAID 1 - Good performance with disk failures; higher cost in terms of disk space
- RAID 5 - Reduced performance with disk failures; lower cost in terms of disk space
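
Parity reconstruction in a 4-disk rank comes down to XOR, as the Python sketch below shows for a single stripe. The block contents are invented, and real arrays also rotate the parity block across the disks, which this sketch leaves out.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 4-disk rank: three data blocks plus one parity block.
data = [b"TERA", b"DATA", b"BASE"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild its block from the other three.
rebuilt = xor_blocks([data[0], data[2], parity])

# Space given up to protection.
raid5_overhead = 1 / 4   # one block in four holds parity
raid1_overhead = 1 / 2   # every block is stored twice
```

The rebuilt block equals b"DATA". The rebuild has to read all three surviving disks, which is why RAID 5 runs slower with a failed disk, while a RAID 1 read simply goes to the surviving mirror.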

Fallback

A Fallback table is fully available in the event of an unavailable AMP. A Fallback row is a copy of a primary row stored on a different AMP in the same CLUSTER of AMPs.

Benefits of Fallback:

- May be specified at the table or database level
- Permits access to table data during AMP off-line period
- Adds a level of data protection beyond disk array RAID 1 & 5
- Highest level of data protection is RAID 1 and Fallback
- Automatically restores data changed during AMP off-line
- Critical for high availability applications

Costs of Fallback:

- Twice the disk space for table storage is needed
- Twice the I/O for INSERTs, UPDATEs and DELETEs is needed
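
A minimal Python sketch of the idea follows. The placement rule used here (the next AMP in the cluster) is an assumption made for the example, not the database's actual algorithm, and the cluster size and rows are invented.

```python
CLUSTER = [0, 1, 2, 3]  # AMP numbers in one hypothetical cluster


def fallback_amp(primary_amp):
    """Hypothetical placement rule: the next AMP in the same cluster."""
    position = CLUSTER.index(primary_amp)
    return CLUSTER[(position + 1) % len(CLUSTER)]


primary = {amp: {} for amp in CLUSTER}    # primary rows held by each AMP
fallback = {amp: {} for amp in CLUSTER}   # fallback copies held by each AMP


def insert(row_id, row, primary_amp):
    """Every insert is written twice: primary copy plus fallback copy."""
    primary[primary_amp][row_id] = row
    fallback[fallback_amp(primary_amp)][row_id] = row


def read(row_id, primary_amp, down_amps=frozenset()):
    """Read the primary copy, or the fallback copy if the primary AMP is down."""
    if primary_amp not in down_amps:
        return primary[primary_amp][row_id]
    backup = fallback_amp(primary_amp)
    if backup in down_amps:
        raise RuntimeError("both copies are unavailable")
    return fallback[backup][row_id]


insert(1, "row-1", primary_amp=0)
insert(2, "row-2", primary_amp=3)
stored_copies = sum(len(t) for t in primary.values()) + sum(len(t) for t in fallback.values())
```

With AMP 0 marked as down, read(1, 0, down_amps={0}) still returns "row-1" from AMP 1. The cost is visible in stored_copies: two logical rows occupy four row slots.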

Recovery Journal for Down AMPs

Recovery Journal is:

- Automatically activated when an AMP is taken off-line
- Maintained by other AMPs in the cluster
- Totally transparent to users of the system

While AMP is off-line:

- Journal is active
- Table updates continue as normal
- Journal logs Row-IDs of changed rows for down-AMP

When AMP is back on-line:

- Restores rows on recovered AMP to current status
- Journal discarded when recovery complete

Cliques

- A clique is a defined set of nodes with failover capability.
- All nodes in a clique are able to access the vdisks of all AMPs in the clique.
- If a node fails, its vprocs will migrate to the remaining nodes in the clique.
- Each node can support 128 vprocs.

Archiving and Recovering Data

Archive Recovery Utility (ARC)

- Runs on IBM, UNIX, Linux and Win2K
- Archives data from the RDBMS
- Restores data from archive media
- Permits data recovery to a specified checkpoint

Other Archive Applications

- BakBone NetVault
- Symantec NetBackup

Common uses of ARC

- Dump database objects for backup or disaster recovery.
- Restore non-fallback tables after disk failure.
- Restore tables after corruption from failed batch processes.
- Recover accidentally dropped tables, views, or macros.
- Recover from miscellaneous user errors.
- Copy a table and restore it to another Teradata Database.

Module 9

Client Tools and Utilities

Query Tools - BTEQ

Query Tools - Teradata SQL Assistant

SQL front-end to the Teradata Database and other ODBC-compliant databases

FastLoad Utility

- Fast batch utility for loading a single empty table
- Automatic checkpoint/restart capability
- Errors reported and collected in error tables
- Supports INMOD routines and Access Modules
- Loads data in two phases

MultiLoad Utility

- Loads/maintains up to five empty or populated tables
- Performs block-level operations against target tables
- Affected data blocks are written once
- Multiple operations with one pass of input files
- Uses conditional logic to apply updates
- Supports INSERT, UPDATE, DELETE and UPSERT operations
- Supports INMOD routines and Access Modules
- Errors reported and collected in error tables
- Provides automatic checkpoint/restart capability

FastExport Utility

- Exports large volumes of formatted data from one or more tables on the Teradata Database to a host file or user-written application
- Supports multiple sessions
- Exports from multiple tables
- Provides automatic checkpoint/restart capability

TPump Utility

- Allows near real-time updates from transactional systems into the warehouse
- Allows constant loading of data into a table
- Performs INSERT, UPDATE, DELETE, and ATOMIC UPSERT operations, or a combination, to more than 60 tables at a time
- High-volume SQL-based continuous update of multiple tables
- Allows target tables to:
  - Have secondary indexes, referential integrity, constraints and enabled triggers
  - Be MULTISET or SET
  - Be populated or empty
- Allows conditional processing
- Supports automatic restarts
- No session limit - use as many sessions as necessary
- No limit to the number of concurrent instances
- Uses row-hash locks, allowing concurrent updates on the same table
- Can be stopped at any time with work committed with no ill effect
- Designed for highest possible throughput
- Gives users control over the rate per minute (throttle) at which statements are sent to the database, either dynamically or by script

Teradata Parallel Transporter

- Parallel Extract, Transform and Load (end-to-end parallelism) eliminates sequential bottlenecks
- Data Streams eliminate the overhead of persistent storage
- Single SQL-like scripting language
- Access to various data sources
- Open API enables third-party and user application integration

Teradata Parallel Transporter Operators

Teradata Manager

Graphical system management tool that collects, analyzes, and displays:

- Performance information

Teradata Dynamic Workload Manager

Query workload management tool (formerly Teradata Dynamic Query Manager) that:

- Restricts queries (i.e., runs, suspends, schedules later or rejects them) based on set thresholds
  - Based on analysis control: too long, too many rows
  - Based on object control: user ID, table, day/time, group ID
  - Based on environmental factors: CPU, disk utilization, network activity, number of users
- Logs workload performance for analysis

Analyst Tools - Teradata Visual Explain

- Provides the ability to capture and graphically represent the steps of a query plan and perform comparisons of two or more plans
- Stores query plans in a Query Capture Database (QCD)

Analyst Tools - Teradata System Emulation Tool

- Emulates a target system by exporting and importing all the information needed to reproduce it in a test environment
- Use with Target Level Emulation to generate query plans on a test system as if they were run on the target system
- Verifies queries and reproduces optimizer-related issues in a test environment

Analyst Tools - Teradata Index Wizard

Recommends secondary indexes for tables, based on a particular workload

Analyst Tools - Teradata Statistics Wizard

- Recommends and automates the Statistics Collection process
- Recommends Statistics to be re-collected due to table growth

Thank You