PCI Express and Bus Architecture

PCI Express: An Overview

PCI Express has generated a lot of excitement in the PC enthusiast scene in a short amount of time. And with good reason, since it promises to rid the PC of its bandwidth woes and enable a new class of applications.

Introduction

With the launch of Intel's 900-series chipsets and the recent return of SLI to the video card scene, PCI Express has finally arrived on the PC enthusiast scene in a big way. PCI Express-enabled motherboards will become more and more common, and with the new bus's increasing ubiquity will come the inevitable confusion that accompanies the rise of any new technology, especially one as complex and feature-rich as PCI Express. In this article, we'll take a detailed look at the features of PCI Express - what it is, what it isn't, and how it improves on the venerable interconnect scheme that we've all come to know and curse: PCI.

Basic PC system architecture

No doubt most Ars readers are familiar with the basic layout of a PC system, but it's worthwhile to do a brief recap in order to set the stage for the discussion that follows.

Logically, an average PCI system is laid out in something like the following manner:


Figure 1: PCI system layout

The core logic chipset acts as a switch or router, and routes I/O traffic among the different devices that make up the system.

In reality, the core logic chipset is split into two parts: the northbridge and the southbridge (or I/O bridge). This split exists for a couple of reasons, the most important being that three types of devices naturally work very closely together and need fast access to each other: the CPU, the main memory, and the video card. In a modern system, the video card's GPU is functionally a second (or third) CPU, so it needs to share privileged access to main memory with the CPU(s). As a result, these three devices are all clustered together off of the northbridge.

The northbridge is tied to a secondary bridge, the southbridge, which routes traffic from the different I/O devices on the system: the hard drives, USB ports, Ethernet ports, etc. The traffic from these devices is routed through the southbridge to the northbridge and then on to the CPU and/or memory.


Figure 2: northbridge and southbridge

As is evident from the diagram above, the PCI bus is attached to the southbridge. This bus is usually the oldest, slowest bus in a modern system, and is the one most in need of an upgrade.

For now, the main thing that you should take away from the previous diagram is that the modern PC is a motley collection of specialized buses of different protocols and bandwidth capabilities. This mix of specialized buses designed to attach different types of hardware directly to the southbridge is something of a continuously evolving hack that has been gradually and collectively engineered by the PC industry as it tries to get around the limitations of the aging PCI bus. Because the PCI bus can't really cut it for things like Serial ATA, Firewire, etc., the trend has been to attach interfaces for both internal and external I/O directly to the southbridge. So today's southbridge is sort of the Swiss Army Knife of I/O switches, and thanks to Moore's Curves it has been able to keep adding functionality in the form of new interfaces that keep bandwidth-hungry devices from starving on the PCI bus.

In an ideal world, there would be one primary type of bus and one bus protocol that connects all of these different I/O devices - including the video card/GPU - to the CPU and main memory. Of course, this "one bus to rule them all" ideal is never, ever going to happen in the real world. It won't happen with PCI Express, and it won't happen with Infiniband (although it technically could happen with Infiniband if we threw away all of today's PC hardware and started over from scratch with a round of natively Infiniband-compliant devices).

Still, even though the utopian ideal of one bus and one bus protocol for every device will never be achieved, there has to be a way to bring some order to the chaos. Luckily for us, that way has finally arrived in the form of PCI Express (a.k.a. PCIe).

With Intel's recent launch of its 900-series chipsets and NVIDIA and ATI's announcements of PCI Express-compatible cards, PCIe will shortly begin cropping up in consumer systems. This article will give you the lowdown on what you can expect from the bus technology that will dominate the personal computer for the coming decade.

Note: A few of the more server-specific features of PCI Express are not covered in this article. These include hot plugging and hot swapping, as well as reliability-oriented features like packet retries and such.

A primer on PCI

Before I go into detail on PCIe, it helps to understand how PCI works and what its limitations are.

The PCI bus debuted over a decade ago at 33MHz, with a 32-bit bus and a peak theoretical bandwidth of 132MB/s. This was pretty good for the time, but as the rest of the system got more bandwidth hungry, both the bus speed and the bus width were cranked up in an effort to keep pace. Later flavors of PCI included a 64-bit, 33MHz combination with a peak bandwidth of 264MB/s, and a more recent 64-bit, 66MHz combination good for 528MB/s.
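Those peak figures are just the product of bus width and clock rate. As a quick back-of-the-envelope illustration (my own sketch, not anything from the PCI spec), here's the arithmetic in C:

```c
#include <stdio.h>

/* Peak theoretical bandwidth of a single-pumped parallel bus:
 * bytes per transfer multiplied by transfers per second. */
static double peak_mb_per_s(int bus_width_bits, double clock_mhz)
{
    double bytes_per_transfer = bus_width_bits / 8.0;
    return bytes_per_transfer * clock_mhz;   /* MHz * bytes = MB/s */
}

int main(void)
{
    printf("PCI 32-bit @ 33MHz: %4.0f MB/s\n", peak_mb_per_s(32, 33.0));
    printf("PCI 64-bit @ 33MHz: %4.0f MB/s\n", peak_mb_per_s(64, 33.0));
    printf("PCI 64-bit @ 66MHz: %4.0f MB/s\n", peak_mb_per_s(64, 66.0));
    return 0;
}
```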

PCI uses a shared bus topology to allow for communication among the different devices on the bus; the different PCI devices (i.e., a network card, a sound card, a RAID card, etc.) are all attached to the same bus, which they use to communicate with the CPU. Take a look at the following diagram to get a feel for what a shared bus looks like.

Figure 3: the shared bus


Because all of the devices attached to the bus must share it among themselves, there has to be some kind of bus arbitration scheme in place for deciding who gets access to the bus and when, especially in situations where multiple devices need to use the bus at the same time. Once a device has control of the bus, it becomes the bus master, which means that it can use the PCI bus to talk to the CPU or memory via the chipset's southbridge.

Speaking of the southbridge, the large system diagram that I presented on the first page - the one with the PCI devices attached to the southbridge - represents how things are actually configured in the real world, as opposed to the idealized representation given immediately above. The southbridge, the northbridge, and the CPU all combine to fill the host or root role, which we'll discuss in a bit more detail momentarily. For now, it will suffice to note that the root runs the show: it detects and initializes the PCI devices, and it controls the PCI bus by default. Put another way, the purpose of the PCI bus is to connect I/O devices to the root, so that the root can read from them and write to them, and generally use them to talk either to storage devices or to the outside world.

The shared bus topology's main advantages are that it's simple, cheap, and easy to implement - or at least, that's the case as long as you're not trying to do anything too fancy with it. Once you start demanding more performance and functionality from a shared bus, then you run into its limitations. Let's take a look at some of those limitations, in order to motivate our discussion of PCI Express's improvements.

From the CPU's perspective, PCI devices are accessible via a fairly straightforward load-store mechanism. There's a flat, unified chunk of address space dedicated to PCI use, which looks to the CPU much like a flat chunk of main memory address space; the primary difference is that at each range of addresses there sits a PCI device instead of a group of memory cells containing code or data.


Figure 4: memory space

So in the same way that the CPU accesses memory by performing loads and stores to specific addresses, it accesses PCI devices by performing reads and writes to specific addresses.
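To give a feel for what that load-store access looks like from the software side, here's a minimal C sketch of memory-mapped I/O. The base address and register offsets are invented for illustration; in a real driver they would come from the device's configuration space rather than being hard-coded.

```c
#include <stdint.h>

/* Hypothetical device whose registers have been mapped into the CPU's
 * address space. The base address and register offsets are made up for
 * illustration only. */
#define EXAMPLE_DEVICE_BASE  0xFEB00000u
#define REG_STATUS           0x00u    /* status register  */
#define REG_COMMAND          0x04u    /* command register */

/* 'volatile' tells the compiler these are device registers, not plain
 * memory, so every load and store must actually reach the bus. */
static inline uint32_t mmio_read32(uintptr_t base, uintptr_t off)
{
    return *(volatile uint32_t *)(base + off);
}

static inline void mmio_write32(uintptr_t base, uintptr_t off, uint32_t val)
{
    *(volatile uint32_t *)(base + off) = val;
}

void poke_device(void)
{
    /* From the CPU's point of view this is just a load and a store to an
     * address; the chipset turns them into PCI bus transactions. */
    uint32_t status = mmio_read32(EXAMPLE_DEVICE_BASE, REG_STATUS);
    if (status & 0x1u)
        mmio_write32(EXAMPLE_DEVICE_BASE, REG_COMMAND, 0x1u);
}
```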

When a PCI-enabled computer boots up, it must initialize the PCI subsystem by assigning chunks of the PCI address space to the different devices so that they'll be accessible to the CPU. Once the devices are initialized and know which parts of the address space they "own," they start listening to the bus for any commands and data that might be directed their way. Once an individual PCI device "hears" an address that it owns being placed on the bus, it reads in whatever data follows that address.
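From the device's side, "owning" a range just means comparing every address it hears against the range it was assigned at boot. A toy model of that decode step (the structure and function names are mine, not PCI terminology):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of PCI address decoding: each device is assigned a chunk of
 * the address space at boot and afterwards claims any transaction whose
 * address falls inside that chunk. */
struct pci_device_model {
    uint32_t base;   /* start of the range assigned at initialization */
    uint32_t size;   /* size of that range in bytes                   */
};

static bool device_claims_address(const struct pci_device_model *dev,
                                  uint32_t bus_address)
{
    return bus_address >= dev->base &&
           bus_address <  dev->base + dev->size;
}
```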

This scheme works fine when there are only a few devices attached to the bus, listening to it for addresses and data. But the nature of a bus is that any device that's attached to it and is "listening" to it injects a certain amount of noise onto the bus. Thus the more devices that listen to the bus - and thereby place an electrical load on it - the more noise there is on the bus and the harder it becomes to get a clean signal through.

Sharing the bus

In this respect, the shared bus is kind of like the following slightly loopy scenario: Imagine an office building in which there is only one phone line that everyone shares. People work all day in their cubicles with their phones off the hook and their hands-free speakerphones turned on, listening for the front-office secretary to call out their name: "Mr. Smith, Ms. Jones is here at my desk and wants to talk to you, so I'm going to put her on. Now pay attention, because here she is..." With only a few employees this lame scheme would be a pain but it would at least be feasible. But in an office of hundreds, the amount of ambient background noise pouring into each speakerphone would combine to make the entire line a noisy mess, and it would be very hard to hear your name called out above the racket.

This load-related noise phenomenon, along with clock skew issues, is the reason that PCI buses are limited to five card-based devices at most. (If you solder PCI devices directly onto the motherboard, the signal is cleaner so you can put a few more than five on a single bus.)

What this means in real life is that if you want to put more than five PCI devices on a system, then you must use PCI-to-PCI bridge chips configured in the following manner:

Figure 5: PCI-to-PCI bridge chips

This hierarchical tree structure, outlined above, is one of the features that distinguishes PCI from peer-to-peer and point-to-point next-generation interconnects like HyperTransport and Infiniband. The root at the top of the diagram is the master controller which is responsible for initializing and configuring all of the PCI devices in the system at boot-up. This makes every PCI device a slave device, with one master controlling them. And because the master must enumerate all of the devices and configure the entire system at boot time, there can be no hot-plugging or hot-swapping.

Excursus: organizing bus traffic

Generally speaking, there are two pairs of categories into which all bus traffic can be placed. The first pair of categories is address traffic and data traffic. The data is the information that you're using the bus to send to or receive from a device that's attached to it, and the address is the location of the particular device (or the region within a particular device) where the information is being sent.


So any bus which supports multiple devices will need a way of handling both address traffic and data traffic, and of distinguishing between the two.

The second pair of categories, which overlaps the first pair, is command traffic and read/write traffic. A command consists of a chunk of data containing some type of configuration or control information (= a specific type of data) which is sent to a particular device (= a particular address) on the bus. So command traffic includes both address and data traffic. Examples of command traffic are initialization instructions for a device, a device reset signal, a configuration command that causes the device to switch operating modes, etc. Command traffic allows the CPU to control how the PCI device handles the data that flows in and out of it.

Read/write traffic is the most important type of traffic, because it consists of the actual information that is being sent to the device. For instance, a PCI RAID controller uses read and write traffic to send and receive the actual files which it reads from and writes to its attached hard disks, a PCI sound card uses read/write traffic to get the sound data that it puts out through its speaker jack, and so on. Like command traffic, read/write traffic consists of addresses coupled with data, and so accounts for part of both of these types of traffic.

Different buses and bus protocols have different ways of handling these four overlapping types of traffic. For instance, many common bus types actually consist of two separate buses: an address bus and a data bus. Addresses are placed on the address bus and data is placed on the data bus, with the result that data is able to flow quickly between devices because each type of traffic has its own dedicated bus.

The alternative to this would be to "multiplex" address and data onto the same bus. This involves first placing the address on the bus, and then following it with the data that is to be sent to that address. PCI takes this approach, with a single 32-bit bus on which addresses and data are multiplexed. In fact, remember the office phone line analogy? "Mr. Smith, Ms. Jones is here at my desk and wants to talk to you, so I'm going to put her on. Now pay attention, because here she is..." The "Mr. Smith" in this sentence would be the address, and Ms. Jones' speech to Mr. Smith would be the data.
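A toy model of that address-then-data sequence might look like the following; it's purely illustrative and leaves out everything that makes real PCI signaling interesting (the FRAME#, IRDY#, and TRDY# handshake, byte enables, wait states, and so on).

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of a multiplexed bus: the same 32 lines carry the address
 * first, then the data words that follow it. */
static void bus_drive(uint32_t value, const char *phase)
{
    printf("bus <= 0x%08X  (%s)\n", value, phase);
}

static void multiplexed_write(uint32_t address, const uint32_t *data, int words)
{
    bus_drive(address, "address phase");      /* who the transfer is for */
    for (int i = 0; i < words; i++)
        bus_drive(data[i], "data phase");     /* the payload itself      */
}

int main(void)
{
    uint32_t payload[2] = { 0xDEADBEEF, 0x01234567 };
    multiplexed_write(0xE0000000u, payload, 2);
    return 0;
}
```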

Obviously multiplexing is a little less bandwidth-efficient than having two dedicated buses, because address traffic takes up precious bandwidth that could be put to better use carrying data. But a multiplexed bus is a lot cheaper than a pair of dedicated buses, because half the number of bus lines are needed, and the devices on the bus need half the number of pins.

The other popular way of handling bus traffic is to split it into control traffic and read/write traffic and give each its own bus. To return to our office analogy, this would be like installing a separate line for management to use to talk to employees.

PCI and MSI

Later versions of the PCI specification opt in part for the last method of organizing bus traffic outlined above, and have what is called a "side-band bus" for transmitting some types of command traffic. The side-band bus is a smaller bus consisting of a few lines dedicated to the transmission of control and configuration information. Of course, this side-band bus increases pin count, power draw, cost, etc., so it's not an ideal solution.

Even more recent versions of the PCI spec dictate a method for using standard read and write operations to pass one type of command and control traffic between PCI devices and the CPU. This method, called Message Signaled Interrupts (MSI), sets aside a special message space in PCI's flat memory space for passing a certain type of control message called an interrupt. This message space is kind of like a bulletin board: to signal an interrupt, a device posts a short message to a special address, and the chipset sees that write and delivers the interrupt to the CPU. As we'll see below, PCI Express expands the MSI approach to cover not just interrupts but all side-band control signals. But we're getting ahead of ourselves...
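For the curious, here's a rough sketch of what MSI looks like from the device's side: the device is told an address and a payload at configuration time, and signaling an interrupt is nothing more than an ordinary memory write. The structure and function names are invented for illustration.

```c
#include <stdint.h>

/* Toy model of an MSI-capable device. Software programs it with a
 * message address and message data; to raise an interrupt, the device
 * performs an ordinary memory write of that data to that address,
 * with no dedicated interrupt wire required. */
struct msi_config {
    volatile uint32_t *message_address;  /* where to write */
    uint32_t           message_data;     /* what to write  */
};

static void device_raise_interrupt(const struct msi_config *msi)
{
    /* This looks like any other write on the bus; the chipset recognizes
     * the target address and delivers an interrupt to the CPU. */
    *msi->message_address = msi->message_data;
}
```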

Summary of PCI's shortcomings

To summarize, PCI as it exists today has some serious shortcomings that prevent it from providing the bandwidth and features needed by current and future generations of I/O and storage devices. Specifically, its highly parallel shared-bus architecture holds it back by limiting its bus speed and scalability, and its simple, load-store, flat memory-based communications model is less robust and extensible than a routed, packet-based model.

PCI-X: wider and faster, but still outdated

The PCI-X spec was an attempt to update PCI as painlessly as possible and allow it to hobble along for a few more years. This being the case, the spec doesn't really fix any of the inherent problems outlined above. In fact, it actually makes some of the problems worse.

The PCI-X spec essentially doubled the bus width from 32 bits to 64 bits, thereby increasing PCI's parallel data transmission abilities and enlarging its address space. The spec also ups PCI's basic clock rate to 66MHz with a 133MHz variety on the high end, providing yet another boost to PCI's bandwidth and bringing it up to 1GB/s (at 133MHz).

The latest version of the PCI-X spec (PCI-X 266) also double-pumps the bus, so that data is transmitted on the rising and falling edges of the clock. While this improves PCI-X's peak theoretical bandwidth, its real-world sustained bandwidth gains are more modest. (See this article for more on the relationship between peak theoretical bandwidth and real-world bandwidth.)

While both of these moves significantly increased PCI's bandwidth and its usefulness, they also made it more expensive to implement. The faster a bus runs, the more sensitive it becomes to noise; manufacturing standards for high-speed buses are exceptionally strict for this very reason, since shoddy materials and/or wide margins of error translate directly into noise at higher clock speeds. This means that the higher-speed PCI-X bus is more expensive to make.

The higher clock speed isn't the only thing that increases PCI-X's noise problems and manufacturing costs. The other factor is the increased bus width. Because the bus is wider and consists of more wires, there's more noise in the form of crosstalk. Furthermore, all of those new wires are connected at their endpoints to multiple PCI devices, which means an even larger load on the bus and thus more noise injected into the bus by attached devices. And then there's the fact that the PCI devices themselves need 32 extra pins, which increases the manufacturing cost of each individual device and of the connectors on the motherboard.

All of these factors, when taken together with the increased clock rate, combine to make PCI-X a more expensive proposition than PCI, which keeps it out of mainstream PCs. It should also be noted that most of the problems with increasing bus parallelism and double-pumping the bus also plague recent forms of DDR, and especially the DDR-II spec.

And after all of that pain, you still have to deal with PCI's shared-bus topology and all of its attendant ills. Fortunately, there's a better way.

PCI Express: the next generation

PCI Express (PCIe) is the newest name for the technology formerly known as 3GIO. Though the PCIe specification was finalized in 2002, PCIe-based devices have just now started to debut on the market.

PCIe's most drastic and obvious improvement over PCI is its point-to-point bus topology. Take a look at the following diagram, and compare it to the layout of the PCI bus.

Figure 6: the shared switch

Figure 3 (repeated for comparison): the shared bus

In a point-to-point bus topology, a shared switch replaces the shared bus as the single shared resource by means of which all of the devices communicate. Unlike in a shared bus topology, where the devices must collectively arbitrate among themselves for use of the bus, each device in the system has direct and exclusive access to the switch. In other words, each device sits on its own dedicated bus, which in PCIe lingo is called a link.

Like a router in a network or a telephone switchbox, the switch routes bus traffic and establishes point-to-point connections between any two communicating devices on a system. To return to our office analogy from the previous section, each employee has his or her own private line to the front desk; so instead of shouting over a shared line to get a particular employee's attention, the front desk secretary uses a switchboard to connect employees directly to incoming callers and to each other.

In the point-to-point diagram above, the CPU at the top can talk to any of the PCIe devices by "dialing" that device's address and opening up a direct and private communications link, via the switch, with it. Of course, as with a modern telephone call, or even better, an internet connection between a browser and a website, the two communicating parties only think they're talking to each other via a private, direct, continuous link; in reality, though, the communications stream is broken up into discrete packets of data, which the switch routes (like a postal worker delivering addressed envelopes) back and forth between the two parties.
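Conceptually, the switch's job is simple: look at where a packet is headed and hand it to the port whose device owns that address. Here's a toy routing table in C; the structure and the address ranges are invented for illustration, not anything from the PCIe spec.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of a PCIe-style switch: each downstream port is assigned an
 * address range, and a packet is forwarded to whichever port's range
 * contains the packet's destination address. */
struct port_route {
    int      port;
    uint64_t base;    /* start of the range behind this port */
    uint64_t limit;   /* one past the end of that range      */
};

static int route_packet(const struct port_route *table, int entries,
                        uint64_t dest_address)
{
    for (int i = 0; i < entries; i++)
        if (dest_address >= table[i].base && dest_address < table[i].limit)
            return table[i].port;
    return -1;   /* no port claims this address */
}

int main(void)
{
    struct port_route table[] = {
        { 1, 0xE0000000u, 0xE0100000u },   /* say, a network card */
        { 2, 0xE0100000u, 0xE0200000u },   /* say, a video card   */
    };
    printf("packet for 0xE0180000 goes to port %d\n",
           route_packet(table, 2, 0xE0180000u));
    return 0;
}
```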

Enabling Quality of Service

The overall effect of the switched fabric topology is that it allows the "smarts" needed to manage and route traffic to be centralized in one single chip: the switch. With a shared bus, the devices on the bus must use an arbitration scheme to decide among themselves how to distribute a shared resource (i.e., the bus). With a switched fabric, the switch makes all the resource-sharing decisions.

By centralizing the traffic-routing and resource-management functions in a single unit, PCIe also enables another important and long overdue next-generation function: quality of service (QoS). PCIe's switch can prioritize packets, so that real-time streaming packets (e.g., a video stream or an audio stream) can take priority over packets that aren't as time critical. This should mean fewer dropped frames in your first-person shooter and lower audio latency in your digital recording software.

Backwards compatibility

Now, you've probably heard that PCIe is backwards-compatible with PCI, and that operating systems can boot on and use a PCIe-based system without modification. So you're no doubt wondering how PCI's load-store model, described previously, can be compatible with the switched packet-based model outlined here. The answer is more straightforward than you might think.

PCI and PCI Express, like many computer systems designed to transmit data, implement a part of the OSI network stack. This article is not the place for a detailed breakdown of a network stack, but the basic idea behind it is easy enough to grasp.

PCI implements the first four layers of the OSI stack, which specify the physical aspects of transmission (i.e., the wire-level signals) up through the higher-level load-store interface that software uses to send and receive via PCI. PCI Express's designers have left this load-store-based, flat memory model unchanged. So a legacy application that wants to communicate via PCIe still executes a read from or a write to a specific address. The next two stack levels down, however, take this read or write request and convert it into a packet by appending routing and flow control information, as well as CRC information, placing it in a frame, and then sending it to its destination.

So the application still thinks that it's reading from or writing to a memory address when it talks to a PCI device, but behind the scenes there's a totally different network of protocols and signals at work shuffling that read or write request along to its destination.

This brings us back to the topic of command and control signals. As I hinted at earlier, PCIe takes all of PCI's side-band signals and converts them to MSI-style signals (which are just loads and stores) so that they can be encapsulated into packets and routed just like any other read/write traffic. Of course, this means that all types of PCIe traffic - whether command or read/write, address or data - are transmitted over a single bus.

It's important to note at this point that the two pairs of bus traffic types are logically divided under PCIe, even if they're not physically separated onto different buses. The first two types of traffic, address and data, are combined in the form of the packet. The core of a packet consists of an address combined with a chunk of data, so the packet structure fuses these two types.

The packets themselves, though, generally fall into the two other categories: command and read/write. In fact, literature on a packet-based bus system like PCIe or RAMBUS will often talk of command packets and data packets, the latter being the more common name for what I'm calling read/write packets.
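To make the fusion of those categories concrete, here's a deliberately simplified packet model. This is not the actual PCIe Transaction Layer Packet layout (which carries additional fields such as requester IDs, tags, and traffic-class bits); it just shows how a single structure combines a type (command vs. read/write), an address, and a chunk of data.

```c
#include <stdint.h>

/* Deliberately simplified packet model, not the real PCIe TLP format.
 * It illustrates how one packet fuses the two pairs of traffic types:
 * command vs. read/write (the 'type' field) and address vs. data
 * (the 'address' and 'payload' fields). */
enum packet_type {
    PKT_READ_REQUEST,    /* "send me the data at this address"   */
    PKT_WRITE_REQUEST,   /* "store this payload at this address" */
    PKT_COMMAND          /* configuration/control traffic        */
};

struct simple_packet {
    enum packet_type type;
    uint64_t         address;       /* where the packet is headed       */
    uint32_t         length;        /* payload size in bytes            */
    uint8_t          payload[128];  /* the data being read or written   */
    uint32_t         crc;           /* integrity check added lower down */
};
```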


Traffic runs in lanes

When PCIe's designers started thinking about a true next-generation upgrade for PCI, one of the issues that they needed to tackle was pin count. In the section on PCI above, I covered some of the problems with the kind of large-scale data parallelism that PCI exhibits (e.g., noise, cost, poor frequency scaling). PCIe solves this problem by taking a serial approach.

As I noted previously, a connection between a PCIe device and a PCIe switch is called a link. Each link is composed of one or more lanes, and each lane is capable of transmitting one byte at a time in both directions at once. This full-duplex communication is possible because each lane is itself composed of one pair of signals: send and receive.

Figure 7: Links and lanes

In order to transmit PCIe packets, which are composed of multiple bytes, a one-lane link must break down each packet into a series of bytes, and then transmit the bytes in rapid succession. The device on the receiving end must collect all of the bytes and then reassemble them into a complete packet. This disassembly and reassembly must happen rapidly enough to be transparent to the next layer up in the stack, which means that it requires some processing power on each end of the link. The upside, though, is that because each lane is only one byte wide, very few pins are needed to transmit the data. You might say that this serial transmission scheme is a way of turning processing power into bandwidth; this is in contrast to the old PCI parallel approach, which turns bus width (and hence pin counts) into bandwidth. It so happens that, thanks to Moore's Curves, processing power is cheaper than bus width, so PCIe's tradeoff makes a lot of sense.


I stated earlier that a link can be composed of "one or more lanes", so let me clarify that now. One of PCIe's nicest features is the ability to aggregate multiple individual lanes together to form a single link. In other words, two lanes could be coupled together to form a single link capable of transmitting two bytes at a time, thus doubling the link bandwidth. Likewise, you could combine four lanes, or eight lanes, and so on.
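Here's a rough sketch of that byte-at-a-time striping, with the same packet sent over a one-lane link and then a four-lane link. Real PCIe adds framing symbols and line coding on the wire; this toy version ignores all of that.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy illustration of spreading a packet's bytes across the lanes of a
 * link: byte 0 goes down lane 0, byte 1 down lane 1, and so on, wrapping
 * around. The receiver reassembles the bytes in the same order. */
static void transmit(const uint8_t *packet, size_t len, int lanes)
{
    printf("-- link with %d lane(s) --\n", lanes);
    for (size_t i = 0; i < len; i++)
        printf("lane %d carries byte %zu (0x%02X)\n",
               (int)(i % (size_t)lanes), i, (unsigned)packet[i]);
}

int main(void)
{
    uint8_t packet[8] = { 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80 };

    transmit(packet, sizeof packet, 1);   /* one byte per transfer step   */
    transmit(packet, sizeof packet, 4);   /* four bytes per transfer step */
    return 0;
}
```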

A link that's composed of a single lane is called an x1 link; a link composed of two lanes is called an x2 link; a link composed of four lanes is called an x4 link, etc. PCIe supports x1, x2, x4, x8, x12, x16, and x32 link widths.
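As the next paragraph notes, each lane signals at 2.5Gbps per direction, so the raw numbers scale linearly with link width. The sketch below runs through the supported widths; the encoding-overhead adjustment (first-generation PCIe spends 10 bits on the wire for every byte of data, leaving roughly 250MB/s of payload per lane per direction) is background I'm adding, not something covered in this article.

```c
#include <stdio.h>

#define RAW_GBPS_PER_LANE   2.5   /* first-generation PCIe, per direction   */
#define ENCODING_EFFICIENCY 0.8   /* 8b/10b coding: 8 data bits per 10 sent */

int main(void)
{
    const int widths[] = { 1, 2, 4, 8, 12, 16, 32 };
    const int n = sizeof widths / sizeof widths[0];

    for (int i = 0; i < n; i++) {
        double raw_gbps  = widths[i] * RAW_GBPS_PER_LANE;
        double data_mbps = raw_gbps * ENCODING_EFFICIENCY * 1000.0 / 8.0;
        printf("x%-2d link: %5.1f Gbps raw, ~%5.0f MB/s usable per direction\n",
               widths[i], raw_gbps, data_mbps);
    }
    return 0;
}
```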

PCIe's bandwidth gains over PCI are considerable. A single lane is capable of transmitting 2.5Gbps in each direction, simultaneously. Add two lanes together to form an x2 link and you've got 5Gbps, and so on with each link width. These high transfer speeds are very good news, and will enable a new class of applications, like...

PCIe, the GPU, and you

...SLI video card rendering.

When announcements of Alienware's new PCIe-based SLI technology hit the wires, I saw a few folks claiming that the company had somehow rebranded some basic PCIe functionality. If you've made it this far in the article, though, then you probably noticed that no single one of the PCIe capabilities that I've outlined thus far seems specifically enabling of this kind of vid card cooperation. That's because it's PCIe's whole, high-bandwidth, next-generation package that allows this functionality, and not any one feature.

3D rendering involves moving a lot of data around, very quickly, between the video card, the CPU, and main memory. In current systems the AGP bus is a bottleneck. You can tell just how much of a bottleneck it is by observing how much RAM vendors are cramming into high-end video cards. All of that RAM is needed so that the GPU doesn't have to go out to main memory to get rendering data.

This picture changes when you add PCIe into the mix. Two video cards placed in a pair of x16 slots will have high-bandwidth pipes connecting them to each other, to main memory, and to the CPU. They can use all of that bandwidth to cooperate on rendering chores at a level that wouldn't have been feasible with previous bus technologies.

For more on PCIe and graphics, check out the following links.

Alienware announces dual PCI-Express graphics subsystem
PCI Express for graphics: Analyzing ATI and NVIDIA's PCI-E strategies
NVIDIA's SLI resurrects GPU teaming: Kickin' it old school - with 32 pipes

And be sure to stick around Ars, because this PCIe article is just the groundwork for our future coverage of all things PCIe, including graphics.


At this point, I want to use the last two articles in the list above to bring up two other features of PCIe that are worth taking a look at, especially because they factor into the emerging SLI GPU scene.

Lane negotiation at startup

In the last article linked in the above list (the one on NVIDIA's SLI), TR notes that no currently available motherboard has two x16 links. Now, some boards have two x16 slots, but those slots are connected to the bridge by x8 links. What gives? This can be kind of confusing, so a diagram will help.

Figure 8: lane negotiation

At startup, PCIe devices negotiate with the switch to determine the maximum number of lanes that the link can consist of. This link width negotiation depends on the maximum width of the link itself (i.e., the actual number of physical signal pairs that the link consists of), on the width of the connector into which the device is plugged, and on the width of the device itself. (It also depends on the width of the switch's interface, but we'll leave that out and assume that the switch's interface width equals the physical link width.)

Now, a PCIe-compliant device has a certain number of lanes built into it. NVIDIA's first SLI cards, for example, are all x16 cards, which means that they have enough copper contacts at their bottom edges to support 16 lanes. This also means that they need to be plugged into a connector slot that supports at least 16 lanes. If the connector has fewer than 16 lanes, then it won't have enough contacts to pick up all of the signals coming out of the card. If it supports more, then those extra lanes are simply ignored.


However, just because the card and connector are x16 doesn't mean the link itself is x16. The physical link could have enough copper traces for exactly sixteen lanes, or for fewer than sixteen, or for more than sixteen. If the link has only enough signal pairs to support fewer than sixteen lanes, then the switch and the device will negotiate to figure this out, and they'll use only the lanes that the link has. If the link supports more than sixteen lanes, then the extra lanes will be ignored.

If you take a close look at the diagram above, then you'll see how this works. Extra lanes are ignored, while too few lanes means that the devices on each end just throttle back their bandwidth accordingly.

There is one situation depicted above that just won't work, and that's the last one with the text above it in red. Plugging an x16 card into an x8 connector doesn't work, because there aren't enough contacts in the connector to pick up all of the lanes coming out of the card.

This link width negotiation allows for some flexibility in designing systems and integrating devices with different lane widths, but it will make for some headache in the consumer space. People will have to figure out how to match link widths with device widths, and they'll be initially confused by situations in which the link is one width and the connector another, as is the case with an NVIDIA card plugged into an x16 slot attached to an x8 link.

The NVIDIA card plugged into the x8 link will talk to the switch and figure out that the link is only x8. It will then train down accordingly and transmit data at the appropriate x8 rate.
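The negotiation rule being described boils down to "train to the smallest width that the card, the connector, and the link all support, with the one hard failure being a card that's physically wider than its connector." Here's a rough sketch of that logic; the function and its return convention are mine, not anything from the spec.

```c
#include <stdio.h>

/* Toy model of link-width negotiation: the card, the connector slot, and
 * the physical link each have a lane count, and the link trains to the
 * smallest of the three. A card wider than its connector simply doesn't
 * fit. Returns the negotiated width, or -1 if the card can't be plugged in. */
static int negotiate_link_width(int card, int connector, int link)
{
    if (card > connector)
        return -1;                          /* e.g. an x16 card in an x8 slot */

    int width = card;
    if (connector < width) width = connector;
    if (link      < width) width = link;
    return width;
}

int main(void)
{
    printf("x16 card, x16 slot, x16 link -> x%d\n", negotiate_link_width(16, 16, 16));
    printf("x16 card, x16 slot, x8 link  -> x%d\n", negotiate_link_width(16, 16, 8));
    printf("x16 card, x8 slot,  x16 link -> %d (won't fit)\n", negotiate_link_width(16, 8, 16));
    return 0;
}
```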

(If you're confused, just go back and meditate on the previous diagram some more. It took me a while of staring at it before it sank in for me, too, and I'm the one who made the diagram!)

PCIe to PCI bridging

One thing that you're going to hear a lot about in the coming months is PCIe-to-PCI bridging. Fortunately, it's a lot easier to grasp than the whole training and lane width thing.

Basically, a PCIe-to-PCI bridge translates PCIe packets back into regular old PCI signals, allowing a legacy PCI device to be plugged into a PCIe system. This bridging can happen anywhere, from on the motherboard to on the card itself. NVIDIA is taking the latter approach with its first-generation PCIe cards: there's a PCIe-to-PCI bridge embedded on the card, which means that the card itself is still a "PCI" card even though it fits into a PCIe slot.

ATI, in contrast, has cards that support PCIe natively and therefore don't need the bridge chip.

I don't expect these bridges to make a whole lot of difference in anyone's life in the near term, and in the long term they'll disappear entirely as companies like NVIDIA rework their product lines for native PCIe support. The translation chip will add some cost to the device, but its impact on performance (if any) will be very hard to quantify and absolutely impossible to isolate. Still, expect this talk about bridging to play a role in the graphics wars in the next few months. My advice, though, is to ignore it and focus on the benchmarks, which are all that matter anyway.


Conclusion: PCI Express in the real world

A good example of PCIe-to-PCI bridging on the motherboard is in Intel's new 900-series chipsets. These chipsets employ PCIe-to-PCI bridge logic integrated directly into the southbridge. This allows legacy PCI devices to coexist with new PCIe devices in the same system.

I won't go into detail about these chipsets, because that's been done in the reviews accessible under the link above. What I will do, though, is give you one last diagram, showing you how PCIe is used in newly announced chipsets.

Figure 9: PCIe usage in new chipsets

As you can see, PCIe links hang off of both the northbridge and the southbridge. Just as the northbridge and southbridge combined with the CPU to fill the role of PCI host (or root), the northbridge and southbridge together fulfill the role of the PCIe switch. In Intel's design, the north and south bridges act as PCIe switches joined by a single, high-bandwidth link.

I began this article with a discussion of how PCI's limitations have caused different buses to be absorbed into the chipset. Thus the chipset in a pre-PCIe system functions as a switch, with the various attached devices connected in something resembling a hacked-up switched fabric. PCIe brings some order to this chaos by making the core logic chipset into a bona fide switch - a PCIe switch. It also turns some of the attached buses into PCIe links, and it makes the PC as a system more cleanly extensible and future-proof by eliminating the need for one specialized bus after another.