Tera-Tom Teradata Basics


Description

Tera-Tom on Teradata Basics is designed to explain both data warehousing concepts and the basics behind the brilliance of Teradata: Teradata architecture, basic fundamentals, and Teradata utilities.

Transcript of Tera-Tom Teradata Basics

Teradata Basics


Teradata Basics
Introduction
Overview
The Ten Rules of Data Warehousing
Rule # 1 - Start Building Towards A Central Data Warehouse
Rule # 2 - Build for the User
Rule # 3 - Let the IT Department Lead the Way to "User Utopia"
Rule # 4 - Build the Foundation Around Detail Data
Rule # 5 - Build Data Marts from the Detail
Rule # 6 - Make Scalability Your Best Friend
Rule # 7 - Model the Data Correctly
Rule # 8 - Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic
Rule # 9 - Take a Building Block Approach
Rule # 10 - Buy a Teradata Data Warehouse
Teradata - The Shining Star
Overview
Parallel Processing
Components of a Personal Computer
Teradata Spreads Data over Multiple Processors
Teradata has Linear Scalability
A Logical View of the Teradata Architecture
Parsing Engine (PE)
Access Module Processor (AMP)
The BYNET
Teradata Building Block Approach
Teradata Tables
Teradata Spreads the Data Evenly Across the AMPs
Primary Indexes
There are two types of Primary Indexes
The Hash Map
How the Hash Map and Primary Index Work Together
Retrieving the Data
The Full Table Scan
Secondary Indexes
Join Indexes
Teradata Databases, Users and Space
Overview
Databases and Users
Three Types of Teradata Space
What is a View?
What is a Macro?
Access Rights for Teradata Users
Automatic, Implicit, and Explicit Rights
Data Protection
Overview
Transaction Concept & Transient Journal
FALLBACK Protection
Down AMP Recovery Journal (DARJ)
Redundant Array of Independent Disks (RAID)
Cliques
Permanent Journal
Locking Modes in Teradata
Referential Integrity
Loading the Data
Overview
Fastload
Multiload
Tpump
Conclusion - A Final Thought on Teradata

Introduction

Overview

A full 40% of Fortune's "U.S. Most Admired" companies use Teradata. What do they know that your company needs to know? I've been in the computer business for more than 27 years. I've witnessed so much since the early days of punch cards, assembler languages, and COBOL programming. With that in mind, the most magnificent, ingenious technology that I've ever seen is a database from the NCR Corporation called "Teradata."

"The wave of the future is coming and there is no fighting it."

Anne Morrow Lindbergh

Teradata is absolutely the wave of the future in data warehousing. I introduced this technology to a great friend, Morgan Jones. He immediately recognized that Teradata is the gold standard for all data warehousing, and as a result, we've partnered to write this book. So, sit back, relax, and enjoy. With our guidance, you will soon realize why Teradata is the greatest technology on the planet!

The Ten Rules of Data Warehousing

What weapon was deemed so powerful that experts claimed it would end all wars? Believe it or not, it was the crossbow! Throughout history, people have improved technology and advanced society through foresight and ingenuity. Just when we think something is impossible it becomes a reality. Who would have dreamed we could send a person to the moon, or that someone could run a mile in under four minutes? Ingenuity and the desire to improve are attributes of the human race, and both are found in numerous avenues, from sports to business.

"Expect the unexpected, or you won't find it."

Roger von Oech

When Frank Lloyd Wright began to design the Imperial Hotel in Tokyo, he discovered the unexpected: just eight feet below the surface of the ground lay a sixty-foot bed of soft mud. Since Japan is a land of frequent shakes and tremors, Wright was faced with what appeared to be an insurmountable obstacle. This gave him an idea: Why not float the Imperial Hotel building on the bed of mud, and let it absorb the shock of any quake? Critics and cynics alike laughed at such an impossible idea. Frank Lloyd Wright built the hotel anyway. Shortly after the grand opening of the hotel, Japan suffered its worst earthquake in fifty-two years. All around Tokyo buildings were destroyed, but the Imperial Hotel stood firm.

For a long time the mainframe and OLTP industry laughed at those who recommended the data warehouse design principles set forth in this book. But those companies that build one based upon these rules will join the ranks of the elite. Consider this: ten of the top 13 global communications companies use Teradata; nine of the top 16 global retailers use Teradata; and eight of the top 20 global banks use Teradata.

The ability to continually improve is one of Teradata's greatest strengths. The database was designed in 1976 and has continually improved ever since. Teradata has averaged one data warehouse installation per week for the past decade. Through continual improvement based on customer feedback from many of the largest data warehouse sites, Teradata has been able to identify itself as "the data warehouse of choice for award winning data warehouses."

This book begins with the 10 cardinal rules to follow for data warehouse success. It illustrates how Teradata helps customers follow these rules. Then it explains the brilliance of how Teradata works. By the end, the reader will have a real grasp of essential Teradata concepts.

Rule # 1 - Start Building Towards A Central Data Warehouse

Moments after midnight on July 30, 1945, the Navy cruiser "USS Indianapolis" suffered a fatal torpedo hit from a Japanese submarine. It had been traveling unescorted through the Philippine Sea. Within 12 minutes of the deadly hit, the ship sank. Over 300 men were killed and nearly 900 were stranded in shark-infested seas. Tragically, those who survived until daylight faced four torturous days in the water and battled continuous shark attacks before being stumbled upon by a passing ship. In the end, only 316 souls survived. With a crew of 1,199 people, this was one of the worst military disasters of World War II for the United States.

Most people assume that war is cruel, but the heart-wrenching story above becomes even more tragic when the following facts are revealed: First, the ship's captain did not have all of the facts, and second, the Navy did not provide the captain with a single version of the truth. The Captain's request for a destroyer escort was denied even though the regional Naval command knew another ship had been attacked just two days earlier, plus multiple enemy sightings had occurred within the previous five days. Not only were these crucially relevant facts withheld, but also the captain of the "Indianapolis" was told that his passage route was clear and there would be no need for a destroyer escort.

"To withhold news is to play God."

John Hess

Had everyone involved with the "USS Indianapolis" adhered to a single version of the truth, with detail data to back them up, this disaster may have never occurred. Likewise, if your company doesn't maintain detail data in a Centralized Data Warehouse, you will never know which version of the truth to believe. Each division of a business will have its own view of the truth. Summarized data, such as a data mart, does have its place in knowledge management, but it should always be built from the detail data within the central data warehouse.

Most companies don't have a Central Data Warehouse. Why? Because they don't have proper leadership or direction. Company leaders often let different branches of the company create data marts that are effective short-term solutions, reflecting departmental leadership that is interested mainly in the short term. Such leaders don't plan on being with a particular department forever, so they are only interested in keeping things simple, controlled, and beneficial to them.

"We're all in this alone."

Lily Tomlin

For example, imagine a company that made cars on an assembly line. Instead of using a giant plant with the latest and greatest technology, the company builds cars in 300 small garages. Each garage is owned by a different department and has different needs. In addition, every user's access is restricted to his or her garage. With this structure, leaders feel safe, but building cars, logistically, is a nightmare. In fact, just moving cars from one garage to the next would be a joke. This scenario may seem simple-minded, but that is how most data warehouses are built. Each part of some data warehouses operates alone.

Now, imagine a giant car assembly plant where the assembly line was managed by the idea of "There is no I in Team." This plant would continually improve processes, finding better ways to work together. Everyone has an idea what the others are doing, and new ideas are welcome. Management is able to run the entire plant with one team of dedicated professionals, and decisions are made cooperatively, concisely, and clearly.

This style of management is the idea behind a central data warehouse. From the top layer of management down through the entire company, they are one solid team. An experienced data warehouse team saves valuable money and resources, and users can work with the entire data warehouse. Executives may ask any question targeted to any part of the business. Decisions are made with long-term vision, and every employee is confident that when they need answers - the data warehouse will provide them.

"If I have seen further it is by standing on the shoulders of giants."

Isaac Newton

When asked how he had discovered the Law of Gravity, Isaac Newton did not grab all of the glory for himself. He claimed that his work stood on the foundation of those early scientists who had gone before him. Likewise, a central data warehouse allows users to stand on the shoulders of another giant. This giant, built right, allows major corporations to make decisions and act on those decisions quickly.

In 1993, I was asked to train one of the world's largest retailers on its Teradata data warehouse. I flew to Bentonville, Arkansas, and an employee met me at the airport then escorted me to the classroom. As we walked down the hallways, most employees seemed to be moving at a pace I had never seen before. They were practically running. I asked, "What's up? Why is everyone hurrying?" The employee replied, "It's work time!" I was shocked. In all of the places I had previously worked, we strolled. This place had a leadership that I've never encountered anywhere. H. Ross Perot described this kind of team when he said, "When building a team, I first look for people who love to win, if I can't find any of those, then I look for people who hate to lose." This was a cohesive team of employees so motivated and so empowered that they thought they could take over the world!

As I grew to know the team, I asked them how long it took top management to make a decision, and how long it took to implement that decision at thousands of stores nationwide. They simply said, "About two hours!" I was amazed. Today, this team continues to have one of the single greatest data warehouses ever built. They use it extensively and it grows stronger every day.

While visiting with this team, management decided at one point that stores across the country should place Halloween displays and candy near the cash registers. In less than two hours, stores moved their Halloween candy from the normal candy aisles to end-caps near the cash register. Every store participated but one!

When asked why he didn't participate, the store manager said he had simply run out of time to create the displays plus move the Halloween candy from his normal candy aisle to the end-caps. Management was ticked. Telling the manager they would get back to him, they then asked the DBA to query the data warehouse to see how much this snafu had cost the company. The DBA came back and reported that the store actually sold almost the same amount of Halloween candy as forecasted. Management was surprised and honestly a little disappointed with the answer. But then the DBA added somewhat sheepishly, "I found something else, too." "Go ahead," replied members of the management team. He said, "I found out they actually sold about 40% more normal candy than we forecasted for this holiday." Management got on the phone immediately and told the other thousand stores: "Move those goblins and Halloween candy back to the normal candy aisles!"

What that DBA did was to use his instinct and the data warehouse to find out exactly what was going on with the business at that time. He was armed with a system that had cross-functional analysis. A central data warehouse gives good management great confidence because they see the whole picture. When users can ask any question, at any time, and on any data, their knowledge is unlimited.

Most Teradata Central Data Warehouse sites will tell you most of their Return On Investment (ROI) came from areas they never suspected. Thomas Jefferson once said, "We don't know one millionth of a percent about anything." When we explained Teradata to Jefferson he did not build another Monticello, but he did retract his statement! Companies with a centralized data warehouse know about a million percent more than companies that have invested in stovepipe applications and 300 different data marts.

Actually, any company planning on competing in this millennium must think long-term and begin building a centralized data warehouse. If not, that company will be on the short end of the stick when competing with a company that chose to build one. That thought should sound scarier than a goblin near the cash registers on Halloween!

If you think about it, every major decision in business makes someone happy. If you are armed with facts supported by a central data warehouse and you do your homework, your business decisions will make your shareholders happy. However, if you are making decisions with a data mart strategy, those decisions are more likely to make your competitors happy.

There are many companies that are fearful of such an undertaking. They want a central data warehouse, but wonder: "What if it fails? Which database should we choose? What type of hardware do we need? Should we do an RFP?" Decisions, decisions! It would literally take me about 30 seconds to make a decision on Teradata. There would be no RFP. We used to wade in swimming pools of data; today we are swamped in oceans of data. Teradata is built for this type of environment. This book explains the fundamentals of Teradata. Anyone with any experience or knowledge about data warehouse environments will clearly see why Teradata is the best solution.

Rule # 2 - Build for the User

"A learned person is not one who gives the right answers; it is the one who asks the right questions."

Claude Levi-Strauss

Users are the heart of the data warehouse, and they get better with each day of experience. They make decisions that affect the company's bottom line. That's why the data warehouse is built around the business user. Building a data warehouse is simple: find out what data the business users need and what type of queries they want to ask but are not able to ask today. Then, find out if the data is available and if those queries can be answered. With those answers, you will exceed users' expectations.

An experienced data warehouse user is usually shocked when he or she first uses Teradata. Its sheer power and flexibility enables users to ask questions they have never been able to ask before. On a recent consulting trip of mine, a young DBA got antsy when a particular query took more than a minute or so with Teradata. So I asked, "Well, how long did that same query take with your OLTP-based data warehouse?" He retorted, "We couldn't even run this query on the old system." I said, "So, what's wrong with two minutes?" He added, "You know, some of our business users are so used to how long our queries used to run that they will be sitting, staring at the screen, without realizing that Teradata has already brought back the answer!" With Teradata, users can expand their thinking by using intuition and keen business sense without technology barriers.

The building of an enterprise data warehouse begins with top management, but then cascades down to a relationship between the IT department and the business user community.

The IT department must realize they have a supporting role. That role is to please the business user by making data available so the business user can easily ask questions and get answers. It's also the IT department's role to build a system that allows users to ask questions on their own without IT intervention. Forget about building a system where users ask IT to run the queries for them. When users need information, the IT department should eventually be able to say, "Ask the question yourself; it is all available to you."

The business users are actually the stars; however, the entire business community must take responsibility for the warehouse's success. These users must continually educate themselves and other users on the capabilities of the data warehouse, new tools, and new techniques that will enhance its potential. Those same users must help IT help them. If both understand their respective roles and work together to help the company, then the data warehouse will be a huge success.

Rule # 3 - Let the IT Department Lead the Way to "User Utopia"

Few sports challenges are as grueling or demanding as the Tour de France. But victory at this event eluded Lance Armstrong, a powerful young cyclist from Austin, Texas. Lance excelled in individual competition, even winning the World Championships. But despite his hard work, Lance could not overcome the Europeans' strong and proud tradition at the Tour de France. A few years ago, Lance was thrown into the battle of his life, not against others but against himself. He discovered that he had cancer and was given virtually no chance of surviving. Suddenly he found out how little cycling really meant in life. With all his might, Lance battled his way back to health, beating the odds. Now he found out how very much cycling could mean in life. His bicycle became a tool to reclaim the future. He found a spot as a team member for the U.S. Postal Service team. With a new perspective and a new depth of character, Lance led that team to victory in the next Tour de France. And he repeated this victory again for the next two years!

To win the premier event in the cycling world, Lance Armstrong had to totally rethink his role. In the same way, the key members of any company seeking success with its data warehouse must rethink their roles. The IT department plays a key role in a data warehouse. What do users know about technical issues? Not enough to build a data warehouse. So, technical issues are the responsibility of the IT department. The danger with this train of thought is that while the IT department has years of experience with handling company transactions through production databases and applications, most are new at data warehousing. A data-warehousing environment can be extremely different than anything an IT department has ever built or used before. Therefore, it's a bad idea to build a data warehouse without the help of experienced people.

An OLTP environment gets more and more predictable each month. It is designed to be tweaked and tuned in order to maximize a company's environment. On the other hand, a data warehouse is an unpredictable environment where the only way to gain control is to actually give up control. In data warehousing, the user must be allowed the freedom to ask the questions and they will blossom in an environment where flexibility is accepted and welcomed.

"The only sure weapon against bad ideas is better ideas."

A. Whitney Griswold

If the IT department decides to build hundreds of data marts that will please each and every department, then they are missing the boat. Data warehouse experience is a hard teacher because it gives the test first, and the lesson afterwards. Abraham Lincoln once said, "A house divided cannot stand." With that in mind, build the data warehouse so it will stand strong for a long time.

What's the formula? First and foremost, start by building your data warehouse around detail data. Bring transaction data, along with key details, from the OLTP systems into the data warehouse. Then, as known queries are identified, build data marts to enhance their performance, and also insist that data marts are created and maintained directly from the detail data. Doing so will build a foundation that will stand.

Next, the IT department needs to keep an open mind about creating an environment called "User Utopia". Have you ever been there? In "User Utopia" the user confidently asks queries without fear of being charged by the minute. The user has meta-data so he or she becomes intimate with the data, then makes informed decisions. The user should also be able to ask monster queries with the full backing of IT. Recently, on one such query, the IT department wanted to pull the plug. But the DBA held out, granting the user more time. When the query finished running, the information it brought back from the detail data saved the company millions of dollars. Overall, a user will get the majority of his or her answers back quickly from data marts, but he or she also needs the capability of going back to the detail data for more information. This is "User Utopia."

Here is the message for IT: Don't follow the idea that "if you build it, they will come." Instead, become a leader: go to the users and build it together.

Rule # 4 - Build the Foundation Around Detail Data

Business is always trying to predict the unpredictable! The US Air Force Reserve's 53rd Weather Reconnaissance Squadron is a special force that flies their planes directly into tropical storms and hurricanes. Using a WC-130 Hercules aircraft they fly into storms at low altitudes between 1,000 and 10,000 feet, taking weather readings that are relayed to the National Hurricane Center in Florida. They measure wind speeds, measure the pressure and structure of the storm, and, most importantly, locate the eye of the storm. The data collected by these "Hurricane Hunters" is used to determine when and where a storm might hit the coast and how strong it will be at that time. Teradata has no fear of detail data; its virtual processors will fly right into the thick of your data warehouse to bring back valuable information for decision support. You see, Teradata enables you to understand the storms in your business today while helping you predict when and where the next storm will hit tomorrow.

I estimate that 80% of today's data warehouses are built on "summary (summarized) data." Therefore, 80% of all data warehouses will never come close to realizing their full potential. Your data warehouse does not have to be one of them!

"A bird does not sing because it has the answers, it sings because it has a song."

A data warehouse built on detail data does not "sing because it has a song, it sings because it has the answers." When you capture detail data, answers to an infinite number of questions are available. But if this is truly the case, then why doesn't everybody build around detail data? Well, there are two reasons. One is price! Like a bird, many companies decide to go "cheap cheap". But watch out! The real expense is not the cost of the data warehouse; it is the money that you will not make without one. The second reason is power! Many companies don't have the wingspan to fly through the detail, so they "soar" with the summary. In addition, some companies don't want to pay for the disk space it actually takes to keep detail data, but believe me, that cost is a small price to pay for success.

"Once you miss the first buttonhole it becomes difficult to button your shirt."

Many companies use the same database for their data warehouse as they have done for their OLTP system. This is a critical mistake. In essence, they have missed the first buttonhole and most likely will lose their shirt on their data warehouse adventure.

At this point, companies no longer have a choice of using detail data. They must summarize for performance reasons. As one marine told his boot camp soldiers jokingly, "The beatings will continue until the morale improves." Similarly, a database designed for OLTP takes a continual beating when it tries to query large amounts of detail data.

Companies building true data warehouses don't compromise on price, and will have a data warehouse that is built for decision support, not one that specializes in OLTP. With this decision, you have buttoned the first buttonhole and are well on your way to reaching the top.

Detail data is the foundation that data warehouses are built upon. Users can ask any question, anytime, and conduct data mining, OLAP, ROLAP, SQL and SPL functions, build data marts directly from the detail data, and can easily maintain and grow the environment on a daily basis. Now that's a tune well worth singing. Make a note of it!

Rule # 5 - Build Data Marts from the Detail

"You cannot teach a man anything; you can only help him find it within himself."

Galileo

Galileo was a smart man. How did he know so much about life and data marts? When we explained data marts to Galileo, he said, "You cannot build a data mart directly from the OLTP systems, you can only build a data mart directly from the detail within." He was right!

Many companies build data mart after data mart directly from the OLTP systems, and their universe begins to revolve around continual maintenance. Then as things get worse, as Galileo predicted, their universe begins to revolve around the son. The son of a gun sent in to replace them!

Why does this happen? At first, things work out great, but soon there are more and more requests for additional information. As a result, more and more data marts are created, and soon the system looks like a giant spider web. Different data marts start to yield different results on like data, and the actual maintenance of this complicated spider web takes up most of IT's time. Meanwhile, short-term dreams turn into long-term nightmares like this one: A man and his wife had had a big argument just before he went on a business trip. Feeling rather contrite about his harsh words, he arranged to send his wife some flowers and asked the florist to write on the card, "I'm sorry. I love you." The beautiful bouquet arrived at the door. But then his wife read the words the florist had actually written in haste, "I'm sorry I love you."

The top reasons to build data marts directly from detail data are:

Users can get answers from the data mart, but must validate their findings or check out additional information from the detail that built it.

There is only one consistent version of the truth

Maintenance is easy

If a user comes up with a data mart answer that does not make sense, then he or she has the ability to drill down into the detail and investigate. Sometimes summary data can spark interest and finding out the "why" can result in big bucks.

If users don't trust the data, they won't use the system. When a data warehouse is built on a foundation of detail data and then data marts are erected from that foundation, you have a winning combination. The results will always be consistent and trustworthy. However, you should only build data marts when there is a credible business case, and you should be ready to drop them when they are no longer needed. The life span of a data mart is relatively short compared to that of its mother and father (better known as the detail data). If you build data marts from the detail, they are easy to manage, easy to drop, and easy to change.

Rule # 6 - Make Scalability Your Best Friend

"Plan your life for a million tomorrows, and live your life as if tomorrow may be your last."

Morgan Jones

The roar of class-6 rapids on a river in Suriname can be almost deafening against the dense walls of the jungle. Especially when you are 9 years old. Our mission was to lower our canoe down the waterfall with ropes. The Trio Amer-Indian who anchored our 40-foot dugout canoe let go of the anchor rope too quickly. Without warning, the heavy boat began a freefall through the rocky water with my father hanging onto the side for dear life. He disappeared under the rocky waters and I knew for sure we had lost him. My heart pounded against my chest. As I rallied myself to grasp this loss as only a nine year old can, the Indians abruptly began cheering wildly above the roar of the river. My dad had resurfaced a hundred yards downstream, battered and bruised; but he was alive! In just one short minute I determined that I would love my family every day as if there were no tomorrow.

Just as I made my family my best friend, a data warehouse must make scalability its best friend. A data warehouse that does not scale will have no tomorrow. It is only a matter of time until the warehouse disappears in rocky waters, never to come up for air. Don't let go of the anchor rope.

The data-warehousing environment will throw obstacles in your way every single day. A data warehouse must be planned to meet today's needs. But it must also be capable of meeting tomorrow's challenges. The future cannot be predicted, so plan for unlimited growth, or linear scalability, both vertical and horizontal. There are so many data warehouses that start out with sizzling performance, but as they grow, they eventually and inevitably hit the scalability "wall". However, before they hit the wall, there is a pattern of diminishing performance.

A data warehouse designed without scalability in mind is doomed before it is begun. It can never reach its potential. Take the scalability question out of the equation by investing in a database that allows you to start small, but grows linearly.

In today's fast-paced world, Gigabytes soon become Terabytes. It may not sound like much, but it weighs a ton on the shoulders of giants. Listen to these measurements and pick your data warehouse's life span. For example, if you lived for a million seconds (Megabyte), then you would live for 11.5 days. In comparison, if you lived for a billion seconds (Gigabyte), then you would live for 31.7 years. Plus, if you lived for a trillion seconds (Terabyte), then you would live for 31,688 years!

How nice it would be if, on your 31,688th birthday, people would say, "You sure look good for your age".

Data warehouses hit the wall of scalability because they cannot grow at the same rate as the amount of data being gathered. Teradata allows for unlimited "linear scalability", a building block approach to data warehousing that ensures that as building blocks are added, the system continues at the same performance level.

This is why the largest data warehouses in the world use Teradata. I was lucky to be in the right place at the right time, and taught during the beginning stages at what are considered the two largest data warehouse sites in the world: Southwestern Bell (SBC) and Wal-Mart.

Wal-Mart's data warehouse started with less than 30 gigabytes, and SBC started with less than 200 gigabytes and 100 users. Both warehouses:

Started small and simple;

Used Teradata from the beginning;

Have built the largest Enterprise Data Warehouse in their respective industries;

Continue to realize additional Return On Investment (ROI) on an annual basis;

Have grown to more than 10 Terabytes of data, and are still growing;

Have thousands of users (some estimates are shocking);

Have educated and experienced data warehouse staffs;

Have educated and experienced data warehouse users;

Experience continual growth without boundaries;

Have experienced linear performance by Teradata in every single upgrade (from gigabytes to terabytes and from terabytes to tens of terabytes);

Are impressed with Teradata's power and performance; and

Are committed to the excellence of Teradata.

A data warehouse is built in small building blocks. Linear Scalability is described in three ways:

First, building blocks are added until the performance requirements of your environment are met. (Guaranteed Success);

Second, every time the data doubles, building blocks are doubled, and the system maintains its performance level. (Guaranteed Success); and

Third, any time the environment changes, building blocks are added until performance requirements are met. (Guaranteed Success)

Scalability is not just about growing the data volume. It also means growing, or increasing, the number of users. Many systems work flawlessly until as few as 5 users are added; then they slow down to a crawl. Companies need a system where growth and performance are easily calculated and implemented. That means the number of users, the size and complexity of queries, the volume of data, and the number of applications being used can be calculated and compared to the current system's actual size. If more power, speed, or size is needed, then the company can simply add building blocks to the system until the requirements are met.

Rule # 7 - Model the Data Correctly

"You will find only what you bring in."

Yoda, Jedi Master in Star Wars

We model a database for the same reasons that Boeing builds an aircraft model to test flight characteristics in a wind tunnel. It's simpler and cheaper to model than to reconstruct the plane by iterations until you get it right. A proper data model should be designed to reflect the business components and possible relationships.

Here are three rules for modeling data in a data warehouse:

1. Model the data quickly

2. Normalize the detail data

3. Use a dimensional model for data marts.

The "3rd Normal Form" believes each column in a table should be directly related to the primary key, the whole key, and nothing but the key. Data is placed into tables where it makes the most sense and has no repeating groups, derived data, or optional columns. This allows users to ask any question, at any time, on all data within the enterprise. Users do not have to strive for "3rd Normal Form," but just normalize the data the best they can. There will be fewer columns in a table, but a lot more tables overall. This model is easier to maintain, incredibly flexible, and allows a user to ask any question on any data at any time.

A "Star-Schema" model is comprised of a fact table and a number of dimension tables. The fact table is a table with a multi-part key. Each element of the key is, itself a foreign key, to a single dimension table. The remaining fields in the fact table are known as facts, and are numeric, continuously valued, and additive. Facts can be thought of as measurements taken at the intersection of all of the dimensions. Dimension attributes are mostly textual, and are almost always the source of constraints and report breaks. This model enhances performance on known queries, or in other words, queries users run repeatedly day after day.

Most database modelers prefer to create a logical model in "3rd Normal Form," but most database engines are overcome by physical limitations, so they must compromise the model. The four most difficult functions for a database to handle are:

Join tables

Aggregate data

Sort data

Scan large volumes of data.

In order to get around these system limitations, vendors will suggest a model to avoid joins, use summarized data to avoid aggregation, store data in sorted order to avoid sorts, and overuse indexes to avoid large scans. With these workarounds, vendors are also giving up the ability to compete! That is like placing a ball and chain around the runner's leg and saying, "I wish you all the best in the marathon!" Come on! Whose side are these vendors really on?

Teradata is the only database engine I have seen that has the power and maturity to use a "3rd Normal Form" physical model on databases exceeding a terabyte in size. Because of the physical limitations, other databases have had to use a "Star-Schema" model to enhance performance, but have given up on the ability to perform ad-hoc queries and data mining.

A "normalized" model is one that should be used for the central data warehouse. It allows users to ask any question, at any time, on information from any place within the enterprise. This is the central philosophy of a data warehouse. It leads to the power of ad-hoc queries and data mining, whereby advanced tools discover relationships that are not easily detected, but do exist naturally in the business environment.

A "Star-Schema" model enhances performance on known queries because we build our assumptions into the model. While these assumptions may be correct for the first application, they may not be correct for others. Flexibility is a big issue, but data marts can be dropped and added with relative ease if each is built directly from the detail data.

Remember, build the data warehouse around detail data using a normalized model. Then, as query patterns emerge and performance for well-known queries becomes a priority, "Star Schema" data marts can be created by extracting summarized or departmental data from the centralized data warehouse. The user will then have access to both the data marts for repetitive queries, and the central warehouse for other queries.

Because data marts can be an administrative nightmare, Teradata enables "Star-Schema" access without requiring physical data marts. By setting up a join index as the intersection of your "Star-Schema" model, you can create a "Star-Schema" structure directly from your "3rd Normal Form" data model. Best of all, once it is created, the data is automatically maintained as the underlying tables are updated.
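As a rough illustration of that idea, the sketch below defines a hypothetical aggregate join index over the normalized detail tables from the earlier sketch. The object and column names are ours, not Teradata's; the point is simply that the "Star-Schema" view is defined once and then kept current by the database itself.

-- A hedged sketch (hypothetical names): a join index that pre-joins and
-- pre-aggregates normalized detail tables so star-style queries run without
-- a physical data mart. Teradata maintains it automatically as the
-- underlying tables are updated.

CREATE JOIN INDEX Sales_Star_JI AS
SELECT   sd.sale_date,
         p.product_name,
         c.city,
         SUM(sd.sale_amount) AS total_sales
FROM     Sales_Detail sd,
         Product p,
         Customer c
WHERE    sd.product_id  = p.product_id
AND      sd.customer_id = c.customer_id
GROUP BY sd.sale_date, p.product_name, c.city;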

Keep in mind, 80% of data warehouse queries are repetitive, but 80% of the Return On Investment (ROI) is actually provided by the other 20% of the queries that go against detailed data in an iterative environment. By using a normalized model for your central data warehouse and a "Star-Schema" model on data marts, you can enhance the possibility of realizing an 80% Return on Investment and still enhance the performance on 80% of your queries.

Rule # 8 - Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic

"Experience is a hard teacher because she gives the test first, the lesson afterwards."

Scottish Proverb

Did you know that three-fourths of the people in the world hate fractions, and that 40% of data warehouse failures are caused by a technical issue? There are many traps and pitfalls in every data warehouse venture. One winter day a hunter met a bear in the forest. The bear said, "I'm hungry. I want a full stomach." The man replied, "Well, I'm cold. I would like a fur coat." "Let's compromise," said the bear, and he quickly gobbled up the hunter. They both got what they asked for. The bear went away with a full belly and the man left wrapped in a fur coat. With that in mind, good judgment comes from experience; experience comes from bad judgment. You have shown good judgment by reading this book; so let our experience keep your company from having a bad data warehouse experience.

Author Daniel Boorstin wrote in The Discoverers, "The greatest obstacle to discovering the shape of the earth, the continents, and the oceans was not ignorance, but rather the illusion of knowledge." There is a lot of "illusion of knowledge" being spread around in the data-warehousing environment. Before you decide on any data warehouse product, ask yourself, and the vendor, these questions:

As my data demands increase, will the system be able to physically load the data? Our experience shows that many systems are not capable of handling very large volumes of data. Do the math!

As the data grows in volume, can the system meet the performance requirements? Do the math!

As the number of users grows, will the system be able to scale? Do the math!

As my environment changes, will the system be flexible enough to allow changes quickly and easily? Do the math!

Will the system need so many Database Administrators (DBAs) that my system's cost skyrockets? Do the math!

If we suddenly merged with another company and needed to incorporate into their mainframe or LAN environment, would the system be able to connect and include them? Do the math!

Can I continue to meet my batch window timeframes? Do the math!

Could I become the hero of the company one day, only to have some technical glitch blamed on me because of my poor foresight and be thrown out of the company into a giant mud puddle? Do the bath!

Rule # 9 - Take a Building Block Approach

"Be not afraid of growing slowly; be afraid only of standing still."

Chinese Proverb

Ever since Vasco Núñez de Balboa discovered the Pacific coast of Panama in 1513, kings and businessmen alike dreamed of the impossible: to cut a waterway across the mountainous isthmus, creating a shortcut between the Atlantic and Pacific Oceans. Those dreams turned into reality during the Industrial Revolution. It took almost forty years of trial and error before the world's greatest engineering feat since the Pyramids was completed in 1914. Ships move through the locks of the canal, rising 85 feet above sea level before they descend to the opposite side. Since its grand opening in 1920, the Panama Canal has revolutionized trans-oceanic traffic joining East and West. Its 50-mile stretch saves every vessel about 8,000 extra nautical miles of travel around the bottom tip of South America. Several modifications have been engineered through the years to accommodate the increasing size of ships.

Data warehouses, like the Panama Canal, must be built over time and changed over time to meet new demands. A data warehouse must grow with the environment, but the environment is unpredictable. All sailors know that they can't direct the wind, but that they can adjust their sails. In comparison, all data warehouse users know they can't direct the environment, but they can adjust their warehouse. Sometimes the data warehouse will grow quickly and sometimes it will grow slowly, but it should always be growing.

So, take a building block approach to data warehousing. Teradata allows you to expand without boundaries - one building block at a time. Plus, adding on building blocks is easy.

There are two aspects to a building block approach. First, you need to add applications to your data warehouse in three to six month intervals. Once the first application works, then you are ready for more projects. As you become more experienced with this approach, you can add multiple projects in parallel by involving multiple organizations.

The second aspect of the building block approach is in the actual data warehouse architecture. It doesn't matter if yours is the smallest data warehouse in the world, the largest, or falls somewhere in between, power and scalability always fuel success.

Not long ago a customer flew out to San Diego for a Teradata demonstration and benchmark. The benchmark ran late into the evening, but the numbers were more than 50% better than the competition. The customer was extremely impressed, but before buying he demanded to see the system scalability that everyone had been talking about. Although it was already late, a Teradata employee was called in the middle of the night, arrived within 10 minutes (in pajamas), hooked up the building blocks, and ran a utility called "config." She ran another called "reconfig," and in less than two hours the system size doubled.

As the environment changes in terms of users, data, complexity, capacity, batch windows, time changes, events, or opportunities, users should be able to continue building applications and architecture. The more a Teradata system grows, the more Teradata outshines the competition.

Rule # 10 - Buy a Teradata Data Warehouse

"Men occasionally stumble over the truth, but most of them pick themselves up and hurry off as if nothing had happened."

Winston Churchill

Winston Churchill led Britain through World War II, during what he called that country's "finest hour." When users see consistent data, the system, too, is in "its finest hour". Teradata gives users the ability to ask questions they could never ask before. Users trust Teradata because of its industry performance and reputation, and because it "never gives in". Constant use gives users optimal business experience, and no matter what a user asks, the system responds with a hearty, "Yes, Sir!"

When we explained Teradata to Churchill he said,

"A data WARe-house that consists of 250 Data marts is like poison; and if I were the MIS department responsible for maintaining them, I'd take it".

Teradata guarantees an Enterprise Data Warehouse with no scalability issues. Data loads like lightning and system administration is a breeze. You can pick the performance level that meets your requirements for today and forever. The database can be normalized around detail data, and because of Teradata's power, users have the flexibility to ask any question, at any time, on any data.

All other databases are suspect when it comes to data loading capabilities, scalability, reference sites, decades of data warehouse experience, flexibility, ease of system administration, and the ability to handle the complex queries of today's users. These users are good!

Teradata - The Shining Star

Overview

Teradata has always been at the top of the data warehouse game, even if the experts weren't bright enough to know it. The incredible vision that the original designers had was tremendous. It was so far to the left of genius that most thought the idea was impossible.

"Only he who attempts the ridiculous may achieve the impossible."

Don Quixote

The Teradata database was originally designed in 1976, and many of the fundamental concepts still remain today. Nearly 25 years later, Teradata is still considered ahead of its time.

In 1976, IBM mainframes dominated the computer business. Everyone who was anyone had an IBM Mainframe. However, the original founders of Teradata noticed that it took about 4 years for IBM to produce a new mainframe. They also noticed a little company called Intel. Intel created a new PC chip every 2 years. With mainframes moving forward every 4 years and PC chips every 2 years, Teradata recognized their vision: to network enough PC chips together that the mainframe would be overpowered, yet the cost would be hundreds of times cheaper than a mainframe. The Teradata team estimated the power surge would come in 1990.

IBM laughed out loud! They said, "Let's get this straight: you are going to network a bunch of PC chips together and overpower our mainframes? That's like plowing a field with 1,000 chickens!" In fact, IBM salespeople are still trying to dismiss Teradata as just a bunch of PCs in a cabinet.

Teradata was convinced it could produce a product that would power large amounts of data and achieve the impossible: using PC technology in mainframe territory. Its founders agreed with Napoleon Bonaparte, who asserted, "The word impossible is not in my dictionary!" Sure enough, when we looked in his dictionary, that word was not there. And it is not in Teradata's Data Dictionary, either! The Teradata team set two goals: build a database that could

Perform parallel processing; and

Accommodate a Terabyte of data

Driving in the car one evening, Morgan's eight-year old daughter Kara piped up from the back seat, "Daddy, can you buy Teradata in the store? I mean, what does Teradata really do?" Morgan thought for a moment and then replied, "Do you remember when you went on the Easter egg hunt last spring? Well, imagine that we had fifty eggs and you were the only child there. If I asked you to find all the purple eggs, would you be able to do that?" Kara said, "Sure! But it might take me a long time." Morgan continued, "What if we now let fifty children go in and I asked them to show me all of the purple eggs. How long would that take?" His daughter responded, "It wouldn't take any time at all because each child would only have to look at one egg." That is precisely how Teradata works. It divides up huge tasks among its processors and tackles each portion simultaneously, with amazing speed. And it doesn't matter if you have a trillion eggs in your basket!

In 1984, the DBC/1012 was introduced. Since then, Teradata has been the dominant force in data warehousing. Teradata got the chickens plowing, and is considered outstanding. Meanwhile, IBM's plow is out rusting in its field.

Parallel Processing

"An invasion of armies can be resisted, but not an idea whose time has come."

Victor Hugo

The idea of parallel processing gives Teradata the ability to have unlimited users, unlimited power, and unlimited scalability. This is an idea whose time has come. And, it all starts with something called "parallel processing". So what is parallel processing? Let us explain:

It was 10 p.m. on a Saturday night and two friends were having dinner and drinks. One of the friends looked at his watch and said, "I have to get going." The other friend responded, "What's the hurry?" His friend went on to tell him that he had to leave to do his laundry at the Laundromat. The other friend could not believe his ears. He responded, "What?! You're leaving to do your laundry on a Saturday night?! Do it tomorrow!" His buddy went on to explain that there were only 10 washing machines at the laundry. "If I wait until tomorrow, it will be crowded and I will be lucky to get one washing machine. I have 10 loads of laundry, so I will be there all day. If I go now, there will be nobody there, and I can do all 10 loads at the same time. I'll be done in less than an hour and a half."

This story describes what we call "Parallel Processing." Teradata is the only database in the world that loads data, backs up data, and processes data in parallel. Teradata was born to be parallel, and instead of allowing just 10 loads of wash to be done simultaneously, Teradata allows for hundreds, even thousands, of loads to be done simultaneously. Teradata users may not be washing clothes, but this is the technology that has been cleaning every database's clock in performance tests.

"After enlightenment, the laundry"

Zen Proverb

"After parallel processing the laundry, enlightenment!"

Teradata Zen Proverb

With the computer world now seeing Terabytes of data, hundreds to thousands of users are asking a wide variety of complex questions and need instantaneous access to data. In short, this is the technology needed in a data warehouse environment. What we find most fascinating is that Teradata has unlimited power, grows without boundaries, and was born out of the PC (personal computer) world by people with vision.

Components of a Personal Computer

"A ship in harbor is safe, but that's not why ships are built."

John Shedd

In 1805 the pivotal Battle of Trafalgar matched Britain's flotilla of battle ships against the almighty Spanish Armada. Spain had huge battleships, some having four tiered decks of cannons. But Britain's Admiral Horatio Nelson used two lines of ships to sail circles around the Armada, attacking them at their most vulnerable point, the stern. That battle paralyzed the Armada and turned the world of naval warfare upside down. Teradata stunned the data-warehousing world by taking personal computer technology right into the mighty, mainframe-dominated environment and beating them on their own turf. Armed with a "lightweight" technology built on Intel processor chips, memory, a hard drive, and an operating system, Teradata achieved the unthinkable: lightning-fast processing speed managing terabytes of data.

A Personal Computer (PC) is made up of the following components:

Processor Chip - This is the brain of the computer. All tasks are done at the direction of the processor.

Memory - This is the hand of the computer. The memory allows data to be viewed, manipulated, changed, or altered. Data is brought in from the hard drive and the processor works with the data in memory. Once changes are made in memory, the processor can command that the information be written back to disk.

Hard Drive - This is the spine of the computer. The hard drive stores data, applications, and the Operating System inside the PC. The hard drive, also called the disk drive, holds the contents of the data for the system on its disk.

For example, suppose you made three new good friends this month and want to add their names to your list. Opening that document brings it up from the hard drive and displays it on your screen. As you type in the new names, the processor applies your changes to the document while it is still held in memory. Upon completion, you close the document and the processor writes all the changes to the disk where it is stored.

In the picture below, we see the basic components of a Personal Computer. Note that the hard drive also holds a file called "Best_Friends," which lists eight best friends.

Teradata Spreads Data over Multiple Processors

"I don't mind starting the season with unknowns. I just don't like finishing the season with them."

Coach Lou Holtz

With Teradata you will never finish with any unknowns about your business; you can know it all! One reason why this assertion is true can be found by looking at the unique way this database places the data into the system and processes it. Teradata takes every table in the system and spreads the data across multiple processors. Each processor works on its portion of the database in parallel when requested to do so. This is why we call it parallel processing. In the previous example, one processor held the list of eight best friends on its disk. In that case, a single processor would have to read all eight rows.

The Teradata example on the next page shows two processors, each having direct access to its own physical disk. The "Best_Friends" table has been spread out evenly across both processors. When we ask the system for a list of best friends, both processors will retrieve data in parallel and return combined results over the connecting network. This could easily double the speed of the previous example.

Even though we still need to read eight records, each processor is responsible for reading only four of them while, simultaneously, the other processor reads the remaining four. So, how could we double the speed of this system again?

Teradata has Linear Scalability

"Every ceiling, when reached becomes a floor upon which one walks and now can see a new ceiling."

Tom Stoppard

There is no ceiling on the Teradata database's ability to grow. Any time you want to double the speed, simply double the number of processors. This is called "Linear Scalability." It allows unlimited growth with minimal effect on response time. Each time a new processor is added to Teradata, new storage disk is also added. By doing so, the system can continually grow without the disks ever becoming the data bottleneck.

Notice in the system below there are four processors, and that each is assigned two rows of data. When we ask for our "Best_Friends," the system will read all eight rows. Since data is spread evenly over four processors, Teradata reads two rows simultaneously across four processors. Now, the system is four times faster.

Most data warehouses have tables that hold millions, even billions, of rows. Teradata allows you to decide how many processors are needed to get the desired response time. This is called the "Divide and Conquer" approach. To achieve their desired response times, some customers have thousands of processors. Tasks are divided up among the AMPs and processed in parallel.

A Logical View of the Teradata Architecture

"You are either making history or you are history."

Leonard Sweet

A frustrated choral director was preparing for a concert when he suddenly stopped and said, "I've got to tell you, eight years ago I was directing another choir in this anthem, and they made the same mistake you're making." He continued, "Do any of you have a clue as to what the mistake is?" Just then a voice from the choir called out, "Same director!"

Many data warehouse environments have an architecture that is not designed for Decision Support, yet company officials wonder why their data warehouse failed when it never had a chance to succeed. In ancient days, Solomon wrote, "Where there is no vision, the people perish." It is no different today. Company leaders must cast a new vision that enables Decision Support with technology that can handle it, or their companies, too, will be history.

The following picture shows a logical view of Teradata. The illustration shows a proper architecture for a data warehouse. In the example, a user logs on to Teradata from a LAN or mainframe host and is then given a session with a "Parsing Engine" processor (PE). The user then submits a query using SQL.

The PE checks SQL syntax, then checks to see if the user has proper rights (authority) to access the table. Next, the PE creates a plan for the Access Module Processors (AMPs) to execute. The PE passes the plan to the AMPs over the BYNET. The AMPs obtain information on their disks, then pass it to the PE over the BYNET. The PE then passes the data back to the user.

Parsing Engine (PE)

"Even a stopped clock is right twice a day."

Polish Proverb

A man and his son were riding a bicycle built for two when they came to a steep hill. It took a great deal of struggle for them to complete what proved to be a very steep climb. When they got to the top, the father in front said, "Boy, that sure was a hard climb!" His son in the back responded, "Yes it was, Dad. And if I hadn't kept the brakes on all the way we would have rolled down backwards." Teradata has an ingenious way to keep this type of situation from happening inside the data warehouse. Most databases make educated guesses about the best way to retrieve data. The Teradata PE or "Optimizer" has both the experience and design to KNOW the best way to retrieve data.

When users log on to Teradata, they are connected to a Parsing Engine (PE). When a user submits a query, the PE takes action. The PE creates a PLAN that tells the AMPs exactly what to do in order to get the data. The PE knows how many AMPs are in the system, how many rows are in the table, and the best way to get to the data. Teradata's PE has been continually enhanced since 1984. It has such a great reputation for speeding up data access that it has earned the name "The OPTIMIZER."

The PE loves to serve valid Teradata users, but it was raised like a guard dog. A good guard dog loves its family, but it barks and may bite when strangers approach. The PE always checks a user's security (access) rights to ensure the user has the proper authority to obtain the information being requested. If the user has authority, the PE instructs the AMPs to get the data. If the user doesn't have proper access rights, the query is rejected.

The PE doesn't like to brag, but it did graduate at the top of its class. Customers like Wal-Mart, Anthem Blue Cross and Blue Shield, Bank of America, AT&T, and Southwestern Bell have continually pushed the data warehouse envelope. This has given the PE years of experience in guiding AMPs to answer complex questions, some of which have never been asked before in their respective industries. This experience allows users to ask any question, regardless of its complexity. The PE isn't called "The Optimizer" for nothing. It needs no tuning by a Database Administrator (DBA) or hints from the user. Teradata users ask the questions, and Teradata returns the answers.

Access Module Processor (AMP)

"Wise men talk because they have something to say; fools talk because they have to say something."

Plato

Two men decided to go ice fishing. They found a good spot on some ice and began digging. As soon as they finished the hole, they heard a voice from above saying, "There are no fish here." Taking that as a sign, they moved about thirty feet and began digging again. A second time they heard the voice saying, "There are no fish here." So they moved another thirty feet and began to dig a third hole. This time the impatient voice spoke from above, "There are no fish here in this ice skating rink!" Some people just don't listen. But this is never the case with Teradata's "Access Module Processors."

The Access Module Processor (AMP) is a processor of few words. It keeps its mouth shut and its ears open. Each AMP listens to the PE via the BYNET network for instructions. Each AMP retrieves data from its disk or writes data to its disk. The AMP is the worker bee of the system. It is the perfect employee: it never complains, rarely calls in sick, and lives to take direction from its boss, the Parsing Engine (PE). The best way to picture an AMP is as a computer processor attached to its own disk.

Every AMP has its own disk, and it is the only AMP allowed to read or write data on that disk. This is referred to as a "Shared-Nothing" architecture. Although AMPs are the perfect workers, they are not the perfect playmates. Even as children, AMPs would never share toys with the other AMPs on the playground. Each AMP has its own disk, and it shares it with no other AMP; hence, "Shared-Nothing."

Teradata spreads the rows of a table evenly across all AMPs in the system. When the PE asks the AMPs to get the data, each AMP reads only the rows on its own disk. Because this is done simultaneously, all AMPs should finish at about the same time. As a matter of fact, when we explained this philosophy to Confucius he stated, "A query is only as fast as the slowest AMP." Confucius, however, did say not to quote him!

Again, an AMP's job is to read and write data to its disk. The AMP takes its direction from the Parsing Engine (PE). The number of AMPs varies per system. Today, some Teradata systems have just four AMPs, while others have more than 2,000!

The BYNET

"Even if you're on the right track, you'll still get run over if you just sit there."

Will Rogers

The BYNET ensures communication between AMPs and PEs is on the right track and that it happens rapidly. When communication between AMPs and PEs is necessary, the BYNET operates as a communication superhighway.

There are always two BYNETs per system. They are called "BYNET 0" and "BYNET 1." The duplication is insurance in case one BYNET fails, and it also enhances performance. As an example, think of the two BYNETs as two telephone lines in your home. AMPs and PEs can talk to one another over either BYNET, or over both.

Morgan Jones, co-author, has been talking to his four-year old son, David, about AMPs, PEs, and the BYNET. Little David asked, "Daddy, what happens when the AMPs and PEs get lonely?" Morgan replied, "They talk to each other over the BYNET".

Here are the steps that outline exactly how the AMPs, PEs, and BYNETs work together. A user performs a LOGON to Teradata. A PE is assigned to manage all SQL for that particular user. The user then asks Teradata a question. Next:

The PE checks the user's SQL Syntax;

The PE checks the user's security rights;

The PE comes up with a plan for the AMPs to follow;

The PE passes the plan along to the AMPs over the BYNET;

The AMPs follow the plan and retrieve the data requested;

The AMPs pass the data to the PE over the BYNET; and

The PE then passes the final data to the user.
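
If you are curious about the plan the PE builds for the AMPs, Teradata lets you prefix a query with the EXPLAIN keyword; instead of running the query, it returns the plan in plain English. Below is a minimal sketch using the Best_Friends table from the earlier example. The exact wording of the plan varies by release, so treat this simply as a way to peek at the PE's work.

-- Ask the PE for its plan without executing the query.
EXPLAIN
SELECT Friend_Num, Friend_Name
FROM   Best_Friends
WHERE  Friend_Num = 8;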

Teradata Building Block Approach

"Better a diamond with a flaw, than a pebble without one."

Anonymous

Teradata builds its data warehouses in building blocks called "nodes." Each building block is a gem composed of four Intel processors. Each node is connected flawlessly to other nodes through two BYNETs. The AMPs and PEs reside inside the node's memory. Each node is connected to a disk array where each AMP has direct access to one virtual disk.

Below is a picture of a single Teradata node. It has four Intel processors, and the AMPs and PEs reside in memory. Each AMP is directly attached to its own virtual disk.

The following picture shows two nodes connected together over the BYNETs.

Teradata Tables

"Nearly everyone takes the limits of his own vision for the limits of the world. A few do not. Join them."

Arthur Schopenhauer

Do you have one of those notoriously messy "junk" drawers in your kitchen? You know the one we're talking about: the one next to the silverware drawer. This drawer may contain old washer and dryer warranties, matches, half-used flashlight batteries, straws, odd nuts, bolts and washers, corncob holders, etc. Fortunately, the dresser drawers in your bedroom are typically much more organized! In fact, you probably store your clothing in those drawers neatly so you can get to what you need quickly.

Relational databases store data much like we organize our dresser drawers. Just as you might put all of your t-shirts in one drawer and your socks in another, the database stores data about one topic in one table and data about another topic in another table. For example, a database might contain a customer table, called "CustomerTable," with columns such as CustomerID, CustomerName, CityName, and Order Number. Another table, the OrderTable, might hold columns like Order Number, Order Date, Item No, Quantity, and Customer ID.

An example of each table follows:

CUSTOMER TABLE called "CustomerTable"

CustomerID (PK)   CustomerName    CityName    Order Number (FK)   Customer Rep
1001              JC Penney       Dallas      105372              Dreyer
1002              Office Depot    Columbia    105799              Crocker
1003              Dillards        Atlanta     106227              Smith

ORDER TABLE called "OrderTable"

Order Number (PK)   Order Date    Item No   Quantity   Customer ID (FK)
105372              03/07/2001    212       20         1001
105799              04/18/2001    296       52         1002
106227              10/17/2001    325       17         1003

The data stored in the CustomerTable is logically related to the data stored in the OrderTable. Both tables have a column called "Order Number." These tables make up an extended family, joined by the "marriage" of the "Order Number" columns in each table.

Earlier programming languages referred to files, records, and fields. Relational databases use the terms "Tables," "Rows," and "Columns." Each row of a table is comprised of one or more fields, each identified by a column name. A row is the smallest unit that can be inserted into a table. A column is the smallest unit within a table that can be updated or modified. The data value stored in each column must match the data type defined for that column. For example, you cannot enter the name of a city in a column that is defined as a decimal data type. A column that is defined but holds no data value contains a NULL, which is sometimes displayed as a "?".
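
As a small, hedged illustration using the CustomerTable shown earlier (the exact column definitions and the new customer values are ours to assume), the first statement below leaves two columns out of the INSERT, so those columns are stored as NULL. The second statement, if run, would be rejected because the value does not match the column's data type.

-- Order Number and Customer Rep are omitted, so they are stored as NULL.
INSERT INTO CustomerTable (CustomerID, CustomerName, CityName)
VALUES (1004, 'Ace Hardware', 'Denver');

-- Assuming CustomerID is a numeric column, this INSERT would fail
-- because a city name is not a valid numeric value:
-- INSERT INTO CustomerTable (CustomerID, CustomerName, CityName)
-- VALUES ('Denver', 'Ace Hardware', 1004);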

One column, or combination of columns, in each table is chosen to be the "Primary Key (PK)." This is a logical modeling term. The primary key contains a unique value for each row and enforces the uniqueness of that row. The PK cannot be null and should contain values that will not change. In the CustomerTable, the primary key is the CustomerID column. Each customer has a unique CustomerID, and the data in the columns of every row must be consistent with the unique CustomerID for that row.

The rows in a table need not be stored in any particular order. This is also called being "arbitrary" or an "unordered set." Before the table is defined, the order of the columns is also arbitrary. It doesn't matter if you place CustomerName before CityName or after it. However, once the table is created, the order of the columns (i.e., the row format for the table) must remain the same, and you cannot have multiple row formats within a table.

What forms the relationship between the tables in a relational database? It is formed by a key that is common to both tables. A "Foreign Key (FK)" is a column in one table that is a Primary Key (PK) in another table. The PK and FK relationship allows the two tables to relate to one another. When you need to display data from more than one table, you can JOIN the two tables by matching a common key between them. A great choice is to match the primary key of one table to the foreign key of the other table. Remember that a table may have only one PK, but it may have multiple FKs.
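
To make this concrete, here is a hedged sketch of such a JOIN between the CustomerTable and OrderTable shown earlier. We assume the column names are written without spaces (Order_Number rather than "Order Number"); the idea is simply to match the PK of one table to the FK of the other.

-- CustomerID is the PK of CustomerTable and an FK in OrderTable,
-- so it is used as the matching key for the join.
SELECT c.CustomerName
      ,c.CityName
      ,o.Order_Number
      ,o.Quantity
FROM   CustomerTable AS c
INNER JOIN OrderTable AS o
ON     c.CustomerID = o.CustomerID
WHERE  c.CustomerName = 'JC Penney';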

Here is a quick reference chart for Primary and Foreign Keys:

PRIMARY KEY                        FOREIGN KEY
Not optional                       Optional
Comprised of one or more columns   Comprised of one or more columns
Can only have one PK per table     Can have multiple FKs per table
No duplicates allowed              Duplicates allowed
No changes allowed                 Changes allowed
No nulls allowed                   Nulls allowed

Teradata Spreads the Data Evenly Across the AMPs

"A chain is only as strong as its weakest link"

Because Teradata spreads data evenly, no AMP or disk is ever the weakest link. Teradata is the only database that strings hundreds, even thousands, of processors together to achieve awesome processing power for today's data warehouses. Today, the AMPs (Access Module Processors) are software processors that reside in memory. Teradata always attempts to spread data evenly so each AMP will manage approximately the same amount of data. As a result, the rows of every table are distributed across all of the AMPs. In other words, every AMP stores a portion of every table in the database on its virtual disk (VDISK). If a data warehouse has 200 tables, then each AMP will hold a portion of all 200 tables. This method of data distribution is unique to Teradata.

There are some significant benefits to handling data this way:

First, when each AMP has nearly the same number of table rows, no single AMP becomes a data bottleneck. The AMPs can all retrieve their portion of the data in parallel, so you do not have AMPs sitting idle while one or two others are chugging away. Baseball legend Casey Stengel once said, "It's easy to get good players. Gettin' 'em to play together, that's the hard part." AMPs love to work together in parallel.

Second, each AMP is unaware of any data except its own portion. The only AMP that can read or write a particular row of data is the AMP that owns that row. This makes retrieving a particular row very efficient, because each AMP does only its own work.

Third, each AMP automatically groups all of its rows by the tables from which they come. Have you ever been to a large aquarium and seen one of those displays that looks like a very tall, clear cylinder? As you walk around the glass, the fish tend to swim in schools. Teradata does the same thing with the rows on the AMPs to boost performance. When you ask for data from any given table, an AMP goes immediately to that particular group of rows and then selects what you need. It doesn't need to look through the rows of many tables before it finds what you need. This is how parallel processing works: the AMPs retrieve data in parallel, then pass it over the BYNET to the Parsing Engine (PE), and the PE ensures the data is delivered to the user. Keep in mind that the BYNET is an internal Teradata network over which the PEs and the AMPs communicate.

The example below shows the information we have just discussed. Notice that the system has four AMPs, and three tables: "Employee," "Customer," and "Order." Notice each AMP holds a portion of the rows for every table. AMP1, for example, holds 1/4th of the Employee table rows, 1/4th of the Customer table rows, and 1/4th of the Order table rows.

Plus, the data is spread evenly for all tables. If a query asks for all rows in the Customer table, then each AMP will retrieve its Customer table rows in parallel with the other AMPs. Each AMP will then pass its data to the PE via the BYNET. Because the data in the Customer table is spread evenly among all AMPs, each should finish reading at about the same time.

Also, notice how each AMP keeps the tables separate. Just like schools of fish, the rows of the Employee table are grouped together, and the rows of the Customer and Order tables are each grouped together as well. This is important in a data warehouse environment because a single query often reads millions of rows. Performance is enhanced when table rows are grouped together and Teradata is permitted to bring blocks of rows into memory.
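
You can actually check this even spread for yourself. Teradata provides hashing functions in SQL: nesting HASHROW, HASHBUCKET, and HASHAMP converts a Primary Index value into the number of the AMP that owns the row. The sketch below assumes the CustomerTable from earlier, with CustomerID as its Primary Index; grouping by the owning AMP shows how many rows each AMP holds.

-- HASHROW    : the row hash computed from the Primary Index value
-- HASHBUCKET : the hash map bucket that row hash falls into
-- HASHAMP    : the AMP number stored in that bucket
SELECT HASHAMP(HASHBUCKET(HASHROW(CustomerID))) AS Owning_AMP
      ,COUNT(*)                                 AS Row_Count
FROM   CustomerTable
GROUP BY 1
ORDER BY 1;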

Primary Indexes

"Every road has two directions."

Russian Proverb

When world-renowned explorer, Dr. David Livingstone, was working in Africa, a group of friends wrote to him saying, "We would like to send other men to you. Have you found a good road into your area yet?" According to a member of his family, Dr. Livingstone sent this message in response, "If you have men who will only come if there is a good road, I don't want them. I want men who will come if there is no road at all."

Although it doesn't have to cut its way through the dense African jungle, the PRIMARY INDEX (PI) is the trailblazer in Teradata that paves the way for the rest of the data to follow. The PI is so important to Teradata functionality that every table in the database is required to have one. As the quote above states, "Every road has two directions." The Primary Index is used in two directions:

1. The Primary Index WILL DETERMINE which rows go to which AMPs; and

2. The Primary Index is ALWAYS the FASTEST RETRIEVAL method.

If the user doesn't define a PRIMARY INDEX when creating a table, the system will automatically choose one by default. Once it is defined, the PI column cannot be dropped or changed. The table would need to be re-created in order to change the PI.
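
One common way to "change" a Primary Index, sketched below using a hypothetical employee table like the ones created later in this section, is to create a new table with the desired PI, copy the rows over, and then swap the tables. Treat this as an outline only; in practice you would also consider backups and access rights.

-- 1. Create a new version of the table with the desired Primary Index.
CREATE TABLE employee_new
(emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
UNIQUE PRIMARY INDEX(emp);

-- 2. Copy the rows from the old table into the new one.
INSERT INTO employee_new
SELECT * FROM employee;

-- 3. Drop the old table and give the new table the original name.
DROP TABLE employee;
RENAME TABLE employee_new TO employee;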

There are two types of Primary Indexes

"A man who chases two rabbits catches none."

Roman Proverb

"A man who chases two rabbits misses both by a HARE! A person who chases two Primary Indexes misses both by an ERR!"

Tera-Tom Coffing

Each table may have only one Primary Index, but every table must have a Primary Index defined. It is either a UPI or a NUPI; in other words, a Unique Primary Index (UPI) or a Non-Unique Primary Index (NUPI). The Primary Index is created when the table is created. An example of creating a Unique Primary Index on the column emp follows:

CREATE TABLE employee
(emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
UNIQUE PRIMARY INDEX(emp);

An example of creating a Non-Unique Primary Index is listed below. Notice that you never see the prefix "NON"; a Primary Index defined without the UNIQUE keyword is non-unique:

CREATE TABLE TomC.employee
(emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
PRIMARY INDEX(dept);

PRIMARY INDEXES may be defined on one column, or on a set of columns viewed as a composite unit. Up to 16 columns may be defined as a Primary Index. An example of creating a multi-column Unique Primary Index follows:

CREATE TABLE employee
(emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
UNIQUE PRIMARY INDEX(emp, dept);

"Being related hardly insures relatability."

Michael E. Angier

All of the tables in a Teradata database are related to each other, but it is the Primary Key and the Primary Index that ensure their relatability in day-to-day use. What is the difference between a PRIMARY KEY and a PRIMARY INDEX? A Primary Key is a logical term: it labels the column(s) that enforce the uniqueness of each row in a table, and PKs determine the relationships among tables. A Primary Index is a physical term: it labels the column(s) used to distribute, store, and locate the rows of data.

To illustrate, imagine a library. The Primary Key, the logical concept, is like the layout of the library. Which part of the library is reserved for fiction? What about non-fiction? And where will the card catalog reside? Once the library is logically correct, it is ready to receive books. In the same way, a Primary Key on a table helps to logically determine what data to track in the table.

The Primary Index is much like a card catalog in the library. Inside the card catalog drawers are thousands of index cards that provide the book's title, author, publisher, and the Dewey Decimal number. By taking that index card, you can immediately find where that book is shelved within the library. The Primary Index column value for a Teradata table tells where the row should reside. It's also the fastest mechanism to retrieve data.

Teradata uses the Primary Index to distribute each table's rows to the proper AMPs. Teradata also uses the Primary Index to retrieve rows at lightning speed.
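
If you ever forget which column was chosen as a table's Primary Index, Teradata will tell you. Both statements below are standard Teradata SQL; the table name employee refers to the example table created earlier in this section.

-- SHOW TABLE returns the table's full CREATE TABLE statement,
-- including its PRIMARY INDEX clause.
SHOW TABLE employee;

-- HELP INDEX lists the indexes on the table and identifies
-- which one is the Primary Index.
HELP INDEX employee;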

Exactly how does Teradata actually accomplish this? Well, I'm glad you asked! Let's look at the HASH MAP next:

The Hash Map

"The map is not the territory."

Alfred Korzybski

The first map of all the known lands in the world has been attributed to the Greek philosopher Anaximander of Miletus (610-ca. 546 BC). He may have been the first person to attempt such a map, although others had drawn local maps before. The "Hash Map" was created so Teradata could make the most of its parallel processing roots. It is the hash map that tells which AMP holds a particular row. It does not contain any data rows; it just shows where to find them. Overall, the purpose of the hash map is to spread the data as evenly as possible.

Once a travel agent received a call from a man asking, "Is it possible to see England from Canada?" The agent said, "No." The man replied, "But they look so close on the map!"

Teradata uses a map called the "HASH MAP" in combination with the PRIMARY INDEX to distribute data rows. The HASH MAP is not a two-dimensional array, although it appears that way in diagrams. It is more like a honeycomb with myriad buckets. But while the honeycomb holds honey in its buckets, the HASH MAP buckets contain just one thing: the number of an AMP. All AMPs and PEs use the very same HASH MAP.

The picture on the following page shows the hash map for a four-AMP system. This is a simplified illustration; the actual hash map has 65,536 buckets. On the diagram, notice that inside each bucket is an AMP number, and that the AMP numbers go 1, 2, 3, 4 and then start over again. Why? Because this is the hash map for a four-AMP system.

Hash Map

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

The next diagram shows the hash map for an eight-AMP system. As before, this is a simplified illustration. Notice that the AMP numbers for this hash map go 1, 2, 3, 4, 5, 6, 7, 8 and then start over again. Why? Because this hash map is for an eight-AMP system.

1 2 3 4 5 6

7 8 1 2 3 4

5 6 7 8 1 2

3 4 5 6 7 8

1 2 3 4 5 6

7 8 1 2 3 4

5 6 7 8 1 2

3 4 5 6 7 8

How the Hash Map and Primary Index Work Together

"Choice, not chance, determines destiny."

Anonymous

The choice made for the Primary Index determines the exact AMP destination for each row in a table. It must not be left up to chance!

Here is how the Hash Map and Primary Index work together: When a table is being loaded with data, the rows will be spread among all AMPs. The Hash Map determines the actual DESTINATION AMP for each row of the table.

The destination is determined using the "Whiz-Bang Formula" (a secret NCR formula). First we'll explain the theory, and then we will invent our own Whiz-Bang Formula to show you how it works conceptually.

Let's start with a table to load on our four-AMP system. Imagine you have listed your eight best friends in a table called "Best_Friends." The table has two columns, titled "Friend_Num" and "Friend_Name." We've chosen only even numbers for Friend_Num because our friends are so even-tempered. We have also made Friend_Num the Unique Primary Index (UPI) of the table.

Best_Friends Table

Friend_Num   Friend_Name
2            Ben Hon
4            Joe Davis
6            Mary Gray
8            John Davis
10           Don Roy
12           Sam Mills
14           Kyle Marx
16           Lyn Jones

For this example, Teradata will spread the table rows across the four-AMP system. A picture of the four-AMP configuration follows:

Since there is a four-AMP configuration, the system will use a four-AMP hash map. Here is an illustration:

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

Instead of trying to figure out the NCR Whiz-Bang formula (a secret), we can show you the theory of distributing and retrieving data with our own formula. We call it the:

Coffing/Jones Whiz-Bang formula: Take a row's Primary Index value and divide it by 2. The answer points to a hash map bucket, and that bucket tells which AMP will hold the row.

Let's take our first row and determine on which AMP it will reside. Remember, we take the Primary Index value of the row, apply the Coffing/Jones Whiz-Bang formula (divide by 2), and the answer points to a bucket in the hash map. Inside that bucket is the number of the AMP on which the row will reside. Here is our first row:

Friend_Num   Friend_Name
2            Ben Hon

Since we designated Friend_Num as the Primary Index, we simply apply the Coffing/Jones Whiz-Bang Formula to the Friend_Num value of 2 (divide by 2):

2 divided by 2 = 1

The hash map bucket number is one. Let's check the hash map to see bucket number 1 and to see what AMP number is inside that bucket. As seen in the picture below, the first bucket in the hash map says the row's destination is AMP 1.

Let's look at another random row:

Friend_Num   Friend_Name
16           Lyn Jones

Since we designated Friend_Num as the Primary Index, we simply apply the Coffing/Jones Whiz-Bang Formula to the Friend_Num value of 16 (divide by 2), and the answer is:

16 divided by 2 = 8

Thus, the hash map bucket number is now eight. Let's check our hash map, find bucket number eight, and determine which AMP number is inside that bucket. As you can see below, bucket eight in the hash map says the row's destination is AMP four.

If we continue the process for every row, Friend_Nums 2 and 10 land on AMP 1, 4 and 12 on AMP 2, 6 and 14 on AMP 3, and 8 and 16 on AMP 4. Each AMP holds exactly two rows, and the system looks like this:

Best_Friends Table

Friend_Num   Friend_Name
2            Ben Hon
4            Joe Davis
6            Mary Gray
8            John Davis
10           Don Roy
12           Sam Mills
14           Kyle Marx
16           Lyn Jones

HASH MAP

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

1 2 3 4 1 2

3 4 1 2 3 4

Remember, the Teradata hashing formula is a secret, and the Coffing/Jones Whiz-Bang Formula makes no claim to have cracked the code. Its purpose is simply to show you, in theory, how the hash map distributes and locates rows. What you should understand is that the real formula is mathematical (similar in spirit to the Coffing/Jones Whiz-Bang Formula) and that it is consistent. When we divided Friend_Num 2 by two, we got bucket one in the hash map. If we ran the formula a million more times, we would still get the same result.

"If you always do what you always did, you'll always get what you always got."

Verne Hill

In summary, Teradata will always be able to find a row if it knows the Primary Index value. It can rerun the hash formula, point to the bucket in the hash map, and then retrieve the row from the correct AMP. The Teradata hashing formula always does what it always did, and always gets what it always got. Because it always runs the same formula, it is consistent.
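
In real Teradata you can watch this consistency for yourself. The sketch below uses Teradata's own hashing functions; run against any Primary Index value, it returns the same row hash, the same bucket, and therefore the same owning AMP every time. (The real formula will, of course, produce different bucket and AMP numbers than our Coffing/Jones Whiz-Bang Formula did.)

-- For the Primary Index value 8, show the row hash, the hash map
-- bucket, and the AMP that owns the row. Rerunning this a million
-- times would return exactly the same answer.
SELECT HASHROW(8)                       AS Row_Hash
      ,HASHBUCKET(HASHROW(8))           AS Bucket_Number
      ,HASHAMP(HASHBUCKET(HASHROW(8)))  AS Owning_AMP;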

Retrieving the Data

When Teradata needs to retrieve data, the fastest and most efficient way is via the Primary Index. An example of SQL showing how Teradata retrieves the data follows:

SELECT Friend_Num, Friend_Name
FROM   Best_Friends
WHERE  Friend_Num = 8;

The Parsing Engine understands that the user wants two columns, "Friend_Num" and "Friend_Name," returned. The PE gets excited when it notices that we are after Friend_Num eight, because it recognizes that Friend_Num is the PRIMARY INDEX. The PE then runs the hash formula on the value eight. For explanation purposes we use the Coffing/Jones formula, which merely divides the PI value by two. When the PE divides eight by two, it gets an answer of four. It looks in bucket four and sees AMP number four. The PE passes the retrieval plan to ONLY AMP four, since this is a one-AMP operation.

The Full Table Scan

"What matters is not the size of the dog in the fight, but the size of the fight in the dog."

Coach Bear Bryant

When we travel the globe teaching Teradata classes, we often ask students, "Are Full Table Scans acceptable in a data warehouse?"

About 80% of the time students respond, "NO!" After we complete training they respond, "Heck YES!"

Tom told me that he wrestled his way through high school and college. I said, "Really? I didn't think the classes were that difficult myself!" Actually, Tom earned a wrestling scholarship to college and achieved All-American status. His wrestling coach drilled into the wrestlers' minds that it is not the size of the opponent that is to be feared, but the size of their will. The truth is that most databases do not have the FIGHT in them to handle a Full Table Scan. That's why so many students are surprised at Teradata's ability to handle Full Table Scans.

A Full Table Scan (FTS) is a query that reads every row of a table. The table may be small or may hold billions of rows. With Teradata, a Full Table Scan means every AMP reads only the rows it owns, in parallel with all other AMPs in the system. Depending on the number of AMPs, this can make a Full Table Scan hundreds or even thousands of times faster.

For example, imagine a table that has 100 rows in a system that has 10 AMPs. Each AMP owns 10 rows. On a Full Table Scan, each AMP reads its 10 rows and then passes the information over the BYNET to the PE. This is roughly 10 times faster than reading all 100 rows through a single processor. But what happens with systems that have hundreds, or even thousands, of AMPs? Well, one major telecommunications company copied a 3.5 billion-row table in just 18 minutes. The 1,900 AMPs in its system helped return results very rapidly. Talk about efficiency!

A Full Table Scan brings most traditional databases to their knees, but Teradata was born to be parallel. Teradata was specifically designed for data warehousing. When you ask decision support questions like, "Who are my best and worst customers?" you are asking the system to read through an entire table. Full Table Scans are a fundamental and important part of data warehousing. They allow users to literally ask any question, about any data, at any time. Teradata has the experience, power, and architecture to handle Full Table Scans.

An example of a query asking for a Full Table Scan is:

SELECT Friend_Num, Friend_Name
FROM   Best_Friends;

In this example, the Parsing Engine receives the SQL and checks the syntax and security. If the user passes these tests, the query continues. The PE knows this query asks to return all records. This is a Full Table Scan. Therefore, it passes the AMPs a plan that says, "Retrieve all of your Best_Friends table rows, and then pass them to me (PE) over the BYNET." With that in mind:

Each AMP reads the Best_Friends rows it individually owns.

Each AMP passes its rows to the PE over the BYNET.

Let's run through the SQL again and see the result:

SELECT Friend_Num, Friend_Name
FROM   Best_Friends;

8 rows returned

Friend_Num   Friend_Name
6            Mary Gray
14           Kyle Marx
8            John Davis
16           Lyn Jones
2            Ben Hon
10           Don Roy
4            Joe Davis
12           Sam Mills

In this chapter, we have shown you two opposite approaches to retrieving data. In our first query, we used the Primary Index to retrieve one row. In the next query, we used a Full Table Scan (FTS) to retrieve all the rows. One approach is the fastest way, and the other is the slowest. But are these the only options for retrieving data? No. There is another option: the Secondary Index.

Secondary Indexes