
PCI Express: An Overview

by Jon Stokes 

Introduction

With the launch of Intel's 900-series chipsets and the recent return of SLI to the video card

scene, PCI Express has finally arrived on the PC enthusiast scene in a big way. PCI

Express-enabled motherboards are going to become more and more common, and

with the new bus's increasing ubiquity will come the inevitable confusion that accompanies

the rise of any new technology, especially one as complex and feature-rich as PCI Express.

In this article, we'll take a detailed look at the features of PCI Express: what it is, what it

isn't, and how it improves on the venerable interconnect scheme that we've all come to

know and curse: PCI.

Basic PC system architecture

No doubt most Ars readers are familiar with the basic layout of a PC system, but it's

worthwhile to do a brief recap in order to set the stage for the discussion that follows.

Logically, an average PCI system is laid out in something like the following manner:


Figure 1: PCI system layout

The core logic chipset acts as a switch or router, and routes I/O traffic among the different

devices that make up the system.

In reality, the core logic chipset is split into two parts: the northbridge and the southbridge

(or I/O bridge). This split is there for a couple of reasons, the most important of which is the

fact that there are three types of devices that naturally work very closely together, and so

they need to have faster access to each other: the CPU, the main memory, and the video

card. In a modern system, the video card's GPU is functionally a second (or third) CPU, so

it needs to share privileged access to main memory with the CPU(s). As a result, these

three devices are all clustered together off of the northbridge.

The northbridge is tied to a secondary bridge, the southbridge, which routes traffic from the

different I/O devices on the system: the hard drives, USB ports, Ethernet ports, etc. The

traffic from these devices is routed through the southbridge to the northbridge and then on

to the CPU and/or memory.


Figure 2: northbridge and southbridge

 As is evident from the diagram above, the PCI bus is attached to the southbridge. This bus

is usually the oldest, slowest bus in a modern system, and is the one most in need of an

upgrade.

For now, the main thing that you should take away from the previous diagram is that the

modern PC is a motley collection of specialized buses of different protocols and bandwidth

capabilities. This mix of specialized buses designed to attach different types of hardware

directly to the southbridge is something of a continuously evolving hack that has been

gradually and collectively engineered by the PC industry as it tries to get around the limitations of the aging PCI bus. Because the PCI bus can't really cut it for things like Serial

 ATA, Firewire, etc., the trend has been to attach interfaces for both internal and external I/O

directly to the southbridge. So today's southbridge is sort of the Swiss Army Knife of I/O

switches, and thanks to Moore's Curves it has been able to keep adding functionality in the

form of new interfaces that keep bandwidth-hungry devices from starving on the PCI bus.


In an ideal world, there would be one primary type of bus and one bus protocol that

connects all of these different I/O devices, including the video card/GPU, to the CPU and

main memory. Of course, this "one bus to rule them all" ideal is never, ever going to happen

in the real world. It won't happen with PCI Express, and it won't happen with Infiniband

(although it technically could happen with Infiniband if we threw away all of today's PC hardware and started over from scratch with a round of natively Infiniband-compliant

devices).

Still, even though the utopian ideal of one bus and one bus protocol for every device will

never be achieved, there has to be a way to bring some order to the chaos. Luckily for us, that

way has finally arrived in the form of PCI Express (a.k.a. PCIe).

With Intel's recent launch of its 900-series chipsets and NVIDIA and ATI's announcements

of PCI Express-compatible cards, PCIe will shortly begin cropping up in consumer systems.

This article will give you the lowdown on what you can expect from the bus technology that

will dominate the personal computer for the coming decade.

Note: A few of the more server-specific features of PCI Express are not covered in this

article. These include hot plugging and hot swapping, as well as reliability-oriented features

like packet retries and such.

A primer on PCI

Before I go into detail on PCIe, it helps to understand how PCI works and what its

limitations are.

The PCI bus debuted over a decade ago at 33MHz, with a 32-bit bus and a peak

theoretical bandwidth of 132MB/s. This was pretty good for the time, but as the rest of

the system got more bandwidth hungry both the bus speed and bus width were cranked

up in an effort to keep pace. Later flavors of PCI included a 64-bit, 33MHz bus combination

with a peak bandwidth of 264MB/s, and a more recent 64-bit, 66MHz combination with a

peak bandwidth of 533MB/s.
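
Where these peak figures come from is simple arithmetic: bytes per transfer times transfers per second. Here's a quick sketch in C (the function name and nominal clock values are just for illustration):

#include <stdio.h>

/* Peak theoretical bandwidth of a parallel bus in MB/s:
   bytes moved per transfer, times transfers per second. */
static double peak_mb_per_s(int bus_width_bits, double clock_mhz,
                            int transfers_per_clock)
{
    return (bus_width_bits / 8.0) * clock_mhz * transfers_per_clock;
}

int main(void)
{
    printf("PCI 32-bit/33MHz: %.0f MB/s\n", peak_mb_per_s(32, 33, 1)); /* 132 */
    printf("PCI 64-bit/33MHz: %.0f MB/s\n", peak_mb_per_s(64, 33, 1)); /* 264 */
    /* 528 with a nominal 66MHz clock; the commonly quoted 533MB/s figure
       uses the exact 66.66MHz rate. */
    printf("PCI 64-bit/66MHz: %.0f MB/s\n", peak_mb_per_s(64, 66, 1));
    return 0;
}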

PCI uses a shared bus topology to allow for communication among the different devices on the bus; the various PCI devices (e.g., a network card, a sound card, a RAID

card, etc.) are all attached to the same bus, which they use to communicate with the

CPU. Take a look at the following diagram to get a feel for what a shared bus looks like.


Figure 3: the shared bus

Because all of the devices attached to the bus must share it among themselves, there

has to be some kind of bus arbitration scheme in place for deciding who gets access

to the bus and when, especially in situations where multiple devices need to use the bus

at the same time. Once a device has control of the bus, it becomes the bus master,

which means that it can use the PCI bus to talk to the CPU or memory via the chipset's

southbridge.

Speaking of the southbridge, the large system diagram that I presented on the first page

(the one with the PCI devices attached to the southbridge) represents how things are

actually configured in the real world, as opposed to the idealized representation given

immediately above. The southbridge, the northbridge, and the CPU all combine to fill

the host or root role, which we'll discuss in a bit more detail momentarily. For now, it

will suffice to note that the root runs the show: it detects and initializes the PCI devices, and it controls the PCI bus by default. Put another way, the

purpose of the PCI bus is to connect I/O devices to the root, so that the root can read

from them and write to them, and just generally use them to talk either to storage

devices or to the outside world.

The shared bus topology's main advantages are that it's simple, cheap, and easy to

implement, or at least that's the case as long as you're not trying to do anything too

fancy with it. Once you start demanding more performance and functionality from a

shared bus, then you run into its limitations. Let's take a look at some of those

limitations, in order to motivate our discussion of PCI Express's improvements.

From the CPU's perspective, PCI devices are accessible via a fairly straightforward

load-store mechanism. There's a flat, unified chunk of address space dedicated to PCI

use, which looks to the CPU much like a flat chunk of main memory address space, the


primary difference being that at each range of addresses there sits a PCI device instead

of a group of memory cells containing code or data.

Figure 4: memory space

So in the same way that the CPU accesses memory by performing loads and stores to

specific addresses, it accesses PCI devices by performing reads and writes to specific

addresses.
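
In driver code, this memory-mapped model looks like ordinary pointer dereferences. A minimal sketch in C (the base address and register offset below are invented for illustration; a real driver gets the base from the device's configuration space and uses the operating system's MMIO accessors):

#include <stdint.h>

#define DEV_MMIO_BASE 0xFEB00000u  /* hypothetical address range "owned" by the device */
#define REG_STATUS    0x04u        /* hypothetical register offset within that range */

static inline uint32_t dev_read_status(void)
{
    /* This load travels over the bus to the device rather than to a DRAM cell;
       'volatile' keeps the compiler from optimizing the access away. */
    return *(volatile uint32_t *)(uintptr_t)(DEV_MMIO_BASE + REG_STATUS);
}

static inline void dev_write_control(uint32_t value)
{
    *(volatile uint32_t *)(uintptr_t)(DEV_MMIO_BASE + 0x08u) = value;  /* hypothetical control register */
}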

When a PCI-enabled computer boots up, it must initialize the PCI subsystem by

assigning chunks of the PCI address space to the different devices so that they'll be

accessible to the CPU. Once the devices are initialized and know which parts of the

address space they "own," they start listening to the bus for any commands and

data that might be directed their way. Once an individual PCI device "hears" an address

that it owns being placed on the bus, then it reads any data following behind that

address.
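
Conceptually, the boot-time assignment looks something like the sketch below (deliberately simplified: real firmware discovers each device's required region size by probing its Base Address Registers and honors alignment rules, none of which is shown here):

#include <stdint.h>

struct pci_device {
    const char *name;
    uint32_t    region_size;  /* how much address space the device needs */
    uint32_t    base;         /* start of the range the device will "own" */
};

/* Hand out non-overlapping chunks of the PCI address window. */
static void assign_pci_addresses(struct pci_device *devs, int count,
                                 uint32_t window_start)
{
    uint32_t next = window_start;
    for (int i = 0; i < count; i++) {
        devs[i].base = next;              /* device listens for [base, base+size) */
        next += devs[i].region_size;
    }
}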

This scheme works fine when there are only a few devices attached to the bus, listening

to it for addresses and data. But the nature of a bus is that any device that's attached to

it and is "listening" to it injects a certain amount of noise onto the bus. Thus the more

devices that listen to the bus, and thereby place an electrical load on the bus, the

more noise there is on the bus and the harder it becomes to get a clean signal through.


Sharing the bus

In this respect, the shared bus is kind of like the following slightly loopy scenario:

Imagine an office building in which there is only one phone line that everyone shares.

People work all day in their cubicles with their phones off the hook and their hands-free speakerphones turned on, listening for the front-office secretary to call out their name,

"Mr. Smith, Ms. Jones is here at my desk and wants to talk to you, so I'm going to put

her on. Now pay attention, because here she is..." With only a few employees this lame

scheme would be a pain but it would at least be feasible. But in an office of hundreds,

the amount of ambient background noise pouring into each speakerphone would

combine to make the entire line a noisy mess, and it would be very hard to hear your

name called out above the racket.

This load-related noise phenomenon, along with clock skew issues, is the reason that

PCI buses are limited to five card-based devices at most. (If you solder PCI devices

directly onto the motherboard, the signal is cleaner so you can put a few more than five

on a single bus.)

What this means in real life is that if you want to put more than five PCI devices on a

system, then you must use PCI-to-PCI bridge chips configured in the following manner:

Figure 5: PCI-to-PCI bridge chips

This hierarchical tree structure, outlined above, is one of the features that distinguishes

PCI from peer-to-peer and point-to-point next-generation interconnects like


HyperTransport and Infiniband. The root at the top of the diagram is the master

controller which is responsible for initializing and configuring all of the PCI devices in the

system at boot-up. This makes every PCI device a slave device, with one master

controlling them. And because the master must enumerate all of the devices and

configure the entire system at boot time, there can be no hot-plugging or hot-swapping.

Excursus: organizing bus traffic 

Generally speaking, there are two pairs of categories into which all bus traffic can be

placed. The first pair of categories is address traffic and data traffic. The data is the

information that you're using the bus to send or receive from a device that's attached to

it, and the address is the location of the particular device (or the region within a particular

device) where the information is being sent. So any bus which supports multiple devices

will need a way of handling both address traffic and data traffic, and of distinguishing

between the two.

The second pair of categories, which overlaps the first pair, is command

traffic and read/write traffic. A command consists of a chunk of data containing some

type of configuration or control information (= a specific type of data) which is sent to a

particular device (= a particular address) on the bus. So command traffic includes both

address and data traffic. Examples of command traffic are initialization instructions for a

device, a device reset signal, a configuration command that causes the device to switch

operating modes, etc. Command traffic allows the CPU to control how the PCI device

handles the data that flows in and out of it.

Read/write traffic is the most important type of traffic, because it consists of the actual

information that is being sent to the device. For instance, a PCI RAID controller uses

read and write traffic to send and receive the actual files which it reads from and writes

to its attached hard disks, a PCI sound card uses read/write traffic to get the sound data

that it puts out through its speaker jack, and so on. Like command traffic, read/write

traffic consists of addresses coupled with data, and so accounts for part of both of these

types of traffic.

Different buses and bus protocols have different ways of handling these four

overlapping types of traffic. For instance, many common bus types actually consist of

two separate buses: an address bus and a data bus. Addresses are placed on the

address bus and data is placed on the data bus, with the result that data is able to flow

quickly between devices because each type of traffic has its own dedicated bus.


The alternative to this would be to "multiplex" address and data onto the same bus. This

involves first placing the address on the bus, and then following it with the data that is to

be sent to that address. PCI takes this approach, with a single 32-bit bus on which

addresses and data are multiplexed. In fact, remember the office phone line analogy?

"Mr. Smith, a Ms. Jones is here at my desk and wants to talk to you, so I'm going to puther on. Now pay attention, because here she is..." The "Mr. Smith" in this sentence

would be the address, and Mrs. Jones' speech to Mr. Smith would be the data.

Obviously multiplexing is a little less bandwidth-efficient than having two dedicated

buses, because address traffic takes up precious bandwidth that could be put to better

use carrying data traffic. But multiplexed buses are a lot cheaper than dual-bus designs,

because half the number of bus lines are needed, and the devices on the bus need half

the number of pins.
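
The contrast is easy to see in a toy sketch (C, with made-up helper functions standing in for driving the physical bus lines; real PCI signaling also involves FRAME#, IRDY#/TRDY# handshaking, and so on):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for driving a set of bus lines (purely illustrative). */
static void drive_lines(const char *lines, uint32_t value)
{
    printf("%s <= 0x%08x\n", lines, (unsigned)value);
}

/* Dedicated address and data buses: both values go out at once. */
static void dual_bus_write(uint32_t addr, uint32_t data)
{
    drive_lines("ADDR", addr);
    drive_lines("DATA", data);
}

/* Multiplexed bus (PCI's approach): the same AD lines carry the
   address phase first, then the data phase. */
static void muxed_bus_write(uint32_t addr, uint32_t data)
{
    drive_lines("AD", addr);  /* address phase */
    drive_lines("AD", data);  /* data phase */
}

int main(void)
{
    dual_bus_write(0x1000, 0xCAFE);
    muxed_bus_write(0x1000, 0xCAFE);
    return 0;
}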

The other popular way of handling bus traffic is to split it into control traffic and

read/write traffic and give each its own bus. To return to our office analogy, this would

be like installing a separate line for management to use to talk to employees.

PCI and MSI

Later versions of the PCI specification opt in part for the last method of organizing bus

traffic outlined above, and have what is called a "side-band bus" for transmitting some

types of command traffic. The side-band bus is a smaller bus consisting of a few lines

dedicated to the transmission of control and configuration information. Of course, this side-band bus increases pin count, power draw, cost, etc., so it's not an optimal

solution.

Even more recent versions of the PCI spec dictate a method for using standard read

and write operations to pass one type of command and control traffic to PCI devices.

This method, called Message Signaled Interrupts (MSI), sets aside a special message

space in the PCI flat memory space for passing a certain type of control message called

an interrupt. This message space is kind of like a bulletin board: a device posts an

interrupt message there with a plain memory write, and the host then reads the message and acts on it. As we'll see below, PCI Express

expands the MSI spec to include not just interrupts but all side-band control signals. But

we're getting ahead of ourselves...
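
In load-store terms, an MSI is nothing more than a small memory write aimed at a special address. A hedged sketch (the address and vector values are invented; in a real system the OS programs them into the device's configuration space during setup):

#include <stdint.h>

#define MSI_TARGET_ADDR 0xFEE00000u  /* hypothetical message address programmed by the OS */
#define MSI_VECTOR      0x45u        /* hypothetical message data identifying the interrupt */

/* From the bus's point of view, "raising an interrupt" is just this store,
   which travels across the fabric like any other write. */
static void device_signal_interrupt(void)
{
    *(volatile uint32_t *)(uintptr_t)MSI_TARGET_ADDR = MSI_VECTOR;
}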

Summary of PCI's shortcomings 

To summarize, PCI as it exists today has some serious shortcomings that prevent it

from providing the bandwidth and features needed by current and future generations of


I/O and storage devices. Specifically, its highly parallel shared-bus architecture holds it

back by limiting its bus speed and scalability, and its simple, load-store, flat memory-

based communications model is less robust and extensible than a routed, packet-based

model.

PCI-X: wider and faster, but still outdated

The PCI-X spec was an attempt to update PCI as painlessly as possible and allow it to

hobble along for a few more years. This being the case, the spec doesn't really fix any

of the inherent problems outlined above. In fact, it actually makes some of the problems

worse.

The PCI-X spec essentially doubled the bus width from 32 bits to 64 bits, thereby

increasing PCI's parallel data transmission abilities and enlarging its address space.

The spec also ups PCI's basic clock rate to 66MHz with a 133MHz variety on the high end, providing yet another boost to PCI's bandwidth and bringing it up to 1GB/s (at

133MHz).

The latest version of the PCI-X spec (PCI-X 266) also double-pumps the bus, so that

data is transmitted on the rising and falling edges of the clock. While this improves PCI-

X's peak theoretical bandwidth, its real-world sustained bandwidth gains are more

modest. (See this article for more on the relationship between peak theoretical

bandwidth and real-world bandwidth.)
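
The arithmetic behind those figures, sketched quickly (nominal clock values; real-world sustained numbers land lower, as noted above):

#include <stdio.h>

int main(void)
{
    /* A 64-bit bus moves 8 bytes per transfer. */
    double pcix_133 = 8 * 133.0 * 1;  /* ~1064 MB/s: the "1GB/s" figure */
    double pcix_266 = 8 * 133.0 * 2;  /* double-pumped: ~2128 MB/s peak */
    printf("PCI-X 133: %.0f MB/s\n", pcix_133);
    printf("PCI-X 266: %.0f MB/s\n", pcix_266);
    return 0;
}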

While both of these moves significantly increased PCI's bandwidth and its usefulness,

they also made it more expensive to implement. The faster a bus runs, the more sensitive it

becomes to noise; manufacturing standards for high-speed buses are exceptionally

strict for this very reason; shoddy materials and/or wide margins of error translate

directly into noise at higher clock speeds. This means that the higher-speed PCI-X bus

is more expensive to make.

The higher clock speed isn't the only thing that increases PCI-X's noise problems and

manufacturing costs. The other factor is the increased bus width. Because the bus is

wider and consists of more wires, there's more noise in the form of crosstalk. Furthermore, all of those new wires are connected at their endpoints to multiple PCI

devices, which means an even larger load on the bus and thus more noise injected into

the bus by attached devices. And then there's the fact that the PCI devices themselves

need 32 extra pins, which increases the manufacturing cost of each individual device

and of the connectors on the motherboard.


 All of these factors, when taken together with the increased clock rate, combine to make

PCI-X a more expensive proposition than PCI, which keeps it out of mainstream

PCs. And it should also be noted that most of the problems with increasing bus

parallelism and double-pumping the bus also plague recent forms of DDR, and

especially the DDR-II spec.

 And after all of that pain, you still have to deal with PCI's shared-bus topology and all of

its attendant ills. Fortunately, there's a better way.

PCI Express: the next generation

PCI Express (PCIe) is the newest name for the technology formerly known as 3GIO.

Though the PCIe specification was finalized in 2002, PCIe-based devices have just

now started to debut on the market.

PCIe's most drastic and obvious improvement over PCI is its point-to-point bus

topology. Take a look at the following diagram, and compare it to the layout of the PCI

bus.

Figure 6: shared switch


Figure 3: the shared bus

In a point-to-point bus topology, a shared switch replaces the shared bus as the single

shared resource by means of which all of the devices communicate. Unlike in a shared

bus topology, where the devices must collectively arbitrate among themselves for use of

the bus, each device in the system has direct and exclusive access to the switch. In

other words, each device sits on its own dedicated bus, which in PCIe lingo is called

a link.

Like a router in a network or a telephone switchbox, the switch routes bus traffic and

establishes point-to-point connections between any two communicating devices on a

system. To return to our office analogy from the previous section, each employee has

his or her own private line to the front desk; so instead of shouting over a shared line to

get a particular employee's attention, the front desk secretary uses a switchboard to

connect employees directly to incoming callers and to each other.

In the point-to-point diagram above, the CPU at the top can talk to any of the PCIe

devices by "dialing" that device's address and opening up a direct and private

communications link, via the switch, with it. Of course, as with a modern telephone call,

or even better, an internet connection between a browser and a website, the two

communicating parties only think they're talking to each other via a private, direct,

continuous link; in reality, though, the communications stream is broken up into discrete

packets of data, which the switch routes, like a postal worker delivering addressed

envelopes, back and forth between the two parties.

Enabling Quality of Service

The overall effect of the switched fabric topology is that it allows the "smarts" needed to

manage and route traffic to be centralized in a single chip: the switch. With a shared

bus, the devices on the bus must use an arbitration scheme to decide among


themselves how to distribute a shared resource (i.e., the bus). With a switched fabric,

the switch makes all the resource-sharing decisions.

By centralizing the traffic-routing and resource-management functions in a single unit,

PCIe also enables another important and long overdue next-generation function: quality

of service (QoS). PCIe's switch can prioritize packets, so that real-time streaming

packets (e.g., a video stream or an audio stream) can take priority over packets that

aren't as time critical. This should mean fewer dropped frames in your first-person

shooter and lower audio latency in your digital recording software.

Backwards compatibility 

Now, you've probably heard that PCIe is backwards-compatible with PCI, and that

operating systems can boot on and use a PCIe-based system without modification. So

you're no doubt wondering how PCI's load-store model, described previously, can be compatible with the switched packet-based model outlined here. The answer is more

straightforward than you might think.

PCI and PCI Express, like many computer systems designed to transmit data,

implement part of the OSI network stack. This article is not the place for a detailed

breakdown of a network stack, but the basic idea behind it is easy enough to grasp.

PCI implements the first four layers of the OSI stack, which specify the physical aspects

of transmission (i.e., the wire-level signals) up through the higher-level load-store

interface that software uses to send and receive via PCI. PCI Express's designers have

left this load-store-based, flat memory model unchanged. So a legacy application that

wants to communicate via PCIe still executes a read from or a write to a specific

address. The next two stack levels down, however, take this read or write request and

convert it into a packet by appending routing and flow control information, as well as

CRC information, placing it in a frame, and then sending it to its destination.

So the application still thinks that it's reading from or writing to a memory address when

it talks to a PCI device, but behind the scenes there's a totally different network of

protocols and signals at work shuffling that read or write request along to its destination.

This brings us back to the topic of command and control signals. As I hinted at

earlier, PCIe takes all PCI side-band signals and converts them to MSI signals (which

are load-store) so that they can be encapsulated into packets and routed just like any


other read/write traffic. Of course, this means that all types of PCIe traffic, whether

command or read/write, address or data, are transmitted over a single bus.

It's important to note at this point that the two pairs of bus traffic types are logically

divided under PCIe, even if they're not physically separated onto different buses. The

first two types of traffic, address and data, are combined in the form of the packet. The

core of a packet consists of an address combined with a chunk of data; so the packet

structure fuses these two types.

The packets themselves, though, generally fall into the two other categories: command

and read/write. In fact, literature on a packet-based bus system like PCIe or RAMBUS

will often talk of command packets and data packets, the latter being the more

common name for what I'm calling read/write packets.

Traffic runs in lanes 

When PCIe's designers started thinking about a true next-generation upgrade for PCI,

one of the issues that they needed to tackle was pin count. In the section on PCI above,

I covered some of the problems with the kind of large-scale data parallelism that PCI

exhibits (e.g., noise, cost, poor frequency scaling). PCIe solves this problem by

taking a serial approach.

As I noted previously, a connection between a PCIe device and a PCIe switch is

called a link. Each link is composed of one or more lanes, and each lane is capable of

transmitting one byte at a time in both directions at once. This full-duplex

communication is possible because each lane is itself composed of one pair of signals:

send and receive.


Figure 7: Links and lanes

In order to transmit PCIe packets, which are composed of multiple bytes, a one-lane link

must break down each packet into a series of bytes, and then transmit the bytes in rapid

succession. The device on the receiving end must collect all of the bytes and then

reassemble them into a complete packet. This disassembly and reassembly

must happen rapidly enough that it's transparent to the next layer up in the stack.

This means that it requires some processing power on each end of the link. The upside,

though, is that because each lane is only one byte wide, very few pins are needed to

transmit the data. You might say that this serial transmission scheme is a way of turning

processing power into bandwidth; this is in contrast to the old PCI parallel approach,

which turns bus width (and hence pin counts) into bandwidth. It so happens that thanks

to Moore's Curves, processing power is cheaper than bus width, hence PCIe's tradeoff

makes a lot of sense.

I stated earlier that a link can be composed of "one or more lanes", so let me clarify that now. One of PCIe's nicest features is the ability to aggregate multiple individual lanes

together to form a single link. In other words, two lanes could be coupled together to

form a single link capable of transmitting two bytes at a time, thus doubling the link

bandwidth. Likewise, you could combine four lanes, or eight lanes, and so on.


 A link that's composed of a single lane is called an x1 link; a link composed of two lanes

is called an x2 link; a link composed of four lanes is called an x4 link, etc. PCIe supports

x1, x2, x4, x8, x12, x16, and x32 link widths.
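
On a multi-lane link, the consecutive bytes of a packet are dealt out across the lanes and reassembled in order on the far side. A rough sketch of the striping idea (simplified; it ignores the framing and encoding that the physical layer adds, and assumes the packet fits the buffers):

#include <stddef.h>
#include <stdint.h>

#define MAX_BYTES_PER_LANE 256  /* arbitrary bound for this sketch */

/* Deal packet bytes round-robin across the lanes of an xN link:
   byte 0 to lane 0, byte 1 to lane 1, ..., then back to lane 0. */
static void stripe_packet(const uint8_t *packet, size_t len,
                          uint8_t lanes[][MAX_BYTES_PER_LANE], int lane_count)
{
    for (size_t i = 0; i < len; i++)
        lanes[i % lane_count][i / lane_count] = packet[i];
}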

PCIe's bandwidth gains over PCI are considerable. A single lane is capable of

transmitting 2.5Gbps in each direction, simultaneously. Add two lanes together to form

an x2 link and you've got 5 Gbps, and so on with each link width. These high transfer

speeds are very good news, and will enable a new class of applications, like...

PCIe, the GPU, and you

...SLI video card rendering.

When announcements of Alienware's new PCIe-based SLI technology hit the wires, I

saw a few folks claiming that the company had somehow rebranded some basic PCIe functionality. If you've made it this far in the article, though, then you probably noticed

that no single one of the PCIe capabilities that I've outlined thus far seems to specifically

enable this kind of vid card cooperation. That's because it's PCIe's whole, high-

bandwidth, next-generation package that allows this functionality, and not any one

feature.

3D rendering involves moving a lot of data around, very quickly, between the video

card, the CPU, and main memory. In current systems the AGP bus is a bottleneck. You

can tell just how much of a bottleneck it is by observing how much RAM vendors are

cramming into high-end video cards. All of that RAM is needed so that the GPU doesn't

have to go out to main memory to get rendering data.

This picture changes when you add PCIe into the mix. Two video cards placed in a pair

of x16 slots will have high-bandwidth pipes connecting them to each other, to main

memory, and to the CPU. They can use all of that bandwidth to cooperate on rendering

chores at a level that wouldn't have been feasible with previous bus technologies.

For more on PCIe and graphics, check out the following links.

   Alienware announces dual PCI-Express graphics subsystem 

  PCI Express for graphics: Analyzing ATI and NVIDIA's PCI-E strategies 

  NVIDIA's SLI resurrects GPU teaming: Kickin' it old school, with 32 pipes

 And be sure to stick around Ars, because this PCIe article is just the groundwork for our

future coverage of all things PCIe, including graphics.


At this point, I want to use the last two articles in the list above to bring up two other

features of PCIe that are worth taking a look at, especially because they factor in to the

emerging SLI GPU scene.

Lane negotiation at startup 

In the last article linked in the above list (the one on NVIDIA's SLI) TR notes that no

currently available motherboard has two x16 links. Now, some boards have two x16

slots, but those slots are connected to the bridge by x8 links. What gives? This can be

kind of confusing, so a diagram will help.

Figure 8: lane negotiation

 At startup, PCIe devices negotiate with the switch to determine the maximum number of

lanes that the link can consist of. This link width negotiation depends on the maximum

width of the link itself (i.e., the actual number of physical signal pairs that the link

consists of), on the width of the connector into which the device is plugged, and the

width of the device itself. (It also depends on the width of the switch's interface, but we'll

leave that out and assume that the switch's interface width equals the physical link

width.)

Now, a PCIe-compliant device has a certain number of lanes built into it. So NVIDIA's

first SLI cards are all x16 cards, which means that they have enough copper connectors


at their bottom contact edges to support 16 lanes. This also means that they need to be

plugged into a connector slot that supports at least  16 lanes. If the connector had fewer

than 16 lanes, then it wouldn't have enough contacts to understand all of the signals

coming out of the card. If it supports more, then those extra lanes can be ignored.

However, just because the card and connector are x16 doesn't mean the link itself is

x16. The physical link itself could have enough copper traces for exactly sixteen lanes,

or some number less than sixteen, or some number greater than sixteen. If the link has

only enough signal pairs to support less than sixteen lanes, then the switch and the

device will negotiate to figure this out, and they'll use only the lanes that the link has. If

the link supports more than sixteen lanes, then the extra lanes will be ignored.

If you take a close look at the diagram above, then you'll see how this works. Extra

lanes are ignored, while too few lanes means that the devices on each end just throttle

back their bandwidth accordingly.

There is one situation depicted above that just won't work, and that's the last one with

the text above it in red. Plugging an x16 card into an x8 connector doesn't work,

because there aren't enough contacts in the connector to pick up all of the lanes coming

out of the card.

This link width negotiation allows for some flexibility in designing systems and

integrating devices with different lane widths, but it will make for some headache in the

consumer space. People will have to figure out how to match link widths with device

widths, and they'll be initially confused by situations in which the link is one width and

the connector another, as is the case with an NVIDIA card plugged into an x16 slot

attached to an x8 link.

The NVIDIA card plugged into the x8 link will talk to the switch and figure out that the

link is only x8. It will then train down accordingly and transmit data at the appropriate x8

rate.

(If you're confused, just go back and meditate on the previous diagram some more. It

took me a while of staring at it before it sank in for me, too, and I'm the one who made the diagram!)
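
In effect, the width a link trains to is just the widest setting that every piece of the path can support. A simplified sketch (it glosses over the actual training sequence, which happens at the physical layer, and over the fact that an x16 card simply won't fit an x8 connector):

/* Negotiated link width: the narrowest element in the path wins. */
static int negotiated_width(int device_lanes, int connector_lanes,
                            int link_lanes, int switch_port_lanes)
{
    int width = device_lanes;
    if (connector_lanes   < width) width = connector_lanes;
    if (link_lanes        < width) width = link_lanes;
    if (switch_port_lanes < width) width = switch_port_lanes;
    return width;  /* e.g. x16 card, x16 slot, x8 link, x16 port -> trains to x8 */
}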

PCIe to PCI bridging


One thing that you're going to hear a lot about in the coming months is PCI to PCIe

bridging. Fortunately, it's a lot easier to grasp than the whole training and lane width

thing.

Basically, a PCI to PCIe bridge translates PCIe packets back into regular old PCI

signals, allowing a legacy PCI device to be plugged into a PCIe system. This bridging

can happen anywhere, from on the motherboard to on the card. NVIDIA is taking such

an approach with their first-generation PCIe cards. There's a PCIe-to-PCI bridge

embedded on the card, which means that the card itself is still a "PCI" card even though

it fits into a PCIe slot.

 ATI, in contrast, has cards that support PCIe natively and therefore don't need the

bridge chip.

I don't expect these bridges to make a whole lot of difference in anyone's life in the near-term, and in the long-term they'll disappear entirely as companies like NVIDIA

rework their product line for native PCIe support. The translation chip will add some cost

to the device, but its impact on performance (if any) will be very hard to quantify and

absolutely impossible to isolate. Still, expect this talk about bridging to play a role in the

graphics wars in the next few months. My advice, though, is to ignore it and focus on

the benchmarks, which are all that matter anyway.

Conclusion: PCI Express in the real world

 A good example of PCIe-to-PCI bridging on the motherboard is in Intel's new 900-series

chipsets. These chipsets employ PCIe-to-PCI bridge logic integrated directly into the

southbridge. This allows legacy PCI devices to coexist with new PCIe devices in the

same system.

I won't go into detail about these chipsets, because that's been done in the reviews

accessible under the link above. What I will do, though, is give you one last diagram,

showing you how PCIe is used in newly announced chipsets.


Figure 9: PCIe usage in new chipsets

 As you can see, PCIe links hang off of both the northbridge and the southbridge. Just as

the northbridge and southbridge combined with the CPU to fill the role of PCI host (or

root), the northbridge and southbridge join with each other to fulfill the role of the PCIe

switch. In Intel's design, the north and south bridges are PCIe switches joined by a

single, high-bandwidth PCIe link.

I began this article with a discussion of how PCI's limitations have caused different specialized buses to be

absorbed into the chipset. Thus the chipset in a pre-PCIe system functions as a switch,

with the various attached devices connected in something resembling a hacked-up

switched fabric. PCIe brings some order to this chaos by making the core logic chipset

into a bona fide switch: a PCIe switch. It also turns some of the attached buses into PCIe buses, and it makes the PC as a system more cleanly extensible and future-proof

by eliminating the need for one specialized bus after another.


PCI Express Primer  

PCI Express is a serial point to point link that operates at 2.5 Gbits/sec in each direction and

which is meant to replace the legacy parallel PCI bus. PCI Express (PCIe) is designed to provide software compatibility with older PCI systems; the hardware, however, is completely

different. Since PCIe is point to point, there is no arbitration for resources on the link. Each

send/receive signal pair (one in each direction) is referred to as a "lane", and multiple lanes can be aggregated to form a single, higher-bandwidth connection. The following sections describe some of the details of the PCIe interface.
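
As a quick sketch, here is how the raw signaling rate turns into usable bandwidth per link width (the 10-bits-per-byte factor comes from the 8B/10B encoding described under the physical layer below):

#include <stdio.h>

int main(void)
{
    const double gbits_per_lane = 2.5;   /* raw rate, each direction      */
    const double encoding = 8.0 / 10.0;  /* 8B/10B: 10 wire bits per byte */
    const int widths[] = { 1, 2, 4, 8, 16, 32 };

    for (int i = 0; i < 6; i++) {
        /* ~250 MB/s of payload capacity per lane, per direction */
        double mb_per_s = gbits_per_lane * encoding * 1000.0 / 8.0 * widths[i];
        printf("x%-2d link: %5.0f MB/s each direction\n", widths[i], mb_per_s);
    }
    return 0;
}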

PCI Express Topology

As can be seen in the figure below, a PCI Express fabric consists of three types of devices:

the root complex, switches, and endpoints. The root complex is generally associated with the processor and is responsible for configuring the fabric at power-up. Since PCIe connections are point to point, switches are used to expand the fabric. PCIe endpoints are

the I/O devices in the fabric: the sources of, and destinations for, the data.

PCI Express Layers


PCIe is implemented in three of the OSI model layers: the transaction layer, the data link

layer, and the physical layer. The following figure displays the layers as connected between two PCIe devices.

As can be seen in the figure, the user logic interfaces to the transaction layer. The user

forms Transaction Layer Packets, or TLPs, which contain a header, a data payload, and

optionally an end-to-end CRC (the ECRC). The ECRC, if used, is generated by the user logic at the transmitter and checked by the user logic at the receiver. The data link layer is

responsible for link management, including error detection. In this layer, a CRC (called the Link CRC, or LCRC) is appended and a sequence number is prepended to the Transaction

Layer Packet. When a packet is transmitted from the data link layer, the receiver sends

back an ACK (success) or NACK (failure) to the transmitter, which retransmits in the case of an error. These ACKs and NACKs are sent via special packets that originate from the data link layer, called Data Link Layer Packets, or DLLPs. The physical layer consists of two

differential pairs with 8B/10B-encoded data, allowing for DC balance on the transmission media and for clock recovery at the destination. Framing information is added to the data

link layer packet, and this is encoded and driven onto the link. The following diagram displays the encapsulation of packets in PCIe:
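
In rough outline, the nesting looks like the C-style structures below (field sizes are simplified for illustration; the real formats pack headers, sequence numbers, and framing symbols at the bit level):

#include <stdint.h>

/* Built by user logic at the transaction layer. */
struct tlp {
    uint8_t  header[16];    /* routing and command information                */
    uint8_t  payload[128];  /* data, up to the programmed maximum payload     */
    uint32_t ecrc;          /* optional end-to-end CRC, checked by user logic */
};

/* The same TLP after the data link layer adds its fields. (The ACK/NACK
   and flow-control DLLPs are separate, smaller packets not shown here.) */
struct dl_wrapped_tlp {
    uint16_t   seq_num;     /* prepended sequence number, used for retries */
    struct tlp tlp;
    uint32_t   lcrc;        /* appended link CRC                           */
};

/* What the physical layer frames, 8B/10B-encodes, and drives onto the link. */
struct framed_tlp {
    uint8_t               start_symbol;
    struct dl_wrapped_tlp body;
    uint8_t               end_symbol;
};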


The transaction layer supports the notion of Virtual Channels and Traffic Classes which can

be used for real-time isochronous and prioritized data transport. The maximum data

payload (MDP) in a PCIe system is a system-wide, user-defined parameter. The desired MDP

is requested in a PCIe configuration register, which is read by the root complex. After polling all of the MDP values in the system, the lowest value is written to a separate configuration register on each side of the link. Legal values of the MDP are 128 bytes through 4096 bytes

in powers of 2. A transmitter must not send a packet which exceeds the programmed MDP.
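
A sketch of that negotiation (hypothetical function and variable names; the real mechanism lives in standard configuration registers):

/* The root complex picks the system-wide MDP: the smallest payload size
   that every device in the fabric reports it can support. */
static unsigned pick_system_mdp(const unsigned supported_mdp[], int device_count)
{
    unsigned mdp = 4096;  /* largest legal value */
    for (int i = 0; i < device_count; i++) {
        if (supported_mdp[i] < mdp)
            mdp = supported_mdp[i];
    }
    return mdp;  /* legal results: 128, 256, 512, 1024, 2048, or 4096 bytes */
}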

PCI Express Transactions

PCIe provides four types of transactions that originate at the transaction layer: memory,

I/O, configuration, and message. In general, memory transactions are the basic method of transferring data. I/O transactions are provided for backward compatibility with PCI (which

provided them for backward compatibility with ISA) and are not recommended for future use. Configuration transactions are similar to those of the same name in the PCI bus and

are used by the root complex to configure the system upon power-up. Message transactions are new and are used to send interrupts and error conditions, as well as other information

through the fabric. Transactions can be further classified as posted, non-posted, and


completion. A memory write operation is an example of a posted transaction since it does

not require a response from the destination. A memory read request is a non-posted

transaction that will later cause a completion transaction with the read data. The completion transaction is initiated by the destination when the read data is available. Both I/O read and

I/O write are non-posted transactions, as are configuration read and write. Message transactions are posted.
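
The split between posted and non-posted requests can be summarized in a few lines (a sketch; the enum names are invented):

#include <stdbool.h>

enum pcie_txn { MEM_WRITE, MEM_READ, IO_READ, IO_WRITE,
                CFG_READ, CFG_WRITE, MESSAGE };

/* Posted transactions are fire-and-forget; everything else is answered
   later by a completion (with data, in the case of reads). */
static bool is_posted(enum pcie_txn t)
{
    switch (t) {
    case MEM_WRITE:
    case MESSAGE:
        return true;
    default:
        return false;
    }
}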

Flow Control

PCIe implements a point to point (not end to end) credit policy for managing buffers. The

data link layer sends Data Link Layer Packets which indicate the amount of receiver buffer

space available in units of credits. The transmitter must ensure that the buffer space is not exceeded prior to commencing a transmission.
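
A minimal sketch of the credit idea (invented names; real PCIe tracks separate credit pools for headers and data, per virtual channel):

#include <stdbool.h>

/* Transmitter-side view of the credits for one link. */
struct fc_state {
    unsigned credits_available;  /* advertised by the receiver via DLLPs */
};

/* Called when a flow-control DLLP arrives granting more buffer space. */
static void fc_credits_granted(struct fc_state *fc, unsigned credits)
{
    fc->credits_available += credits;
}

/* Transmit only if the packet fits in the receiver's advertised buffers. */
static bool fc_try_send(struct fc_state *fc, unsigned packet_credits)
{
    if (packet_credits > fc->credits_available)
        return false;  /* hold the packet until more credits arrive */
    fc->credits_available -= packet_credits;
    return true;
}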