proposal to fix ordering problem in PCI 2.1


We would like to solicit comments on the proposal included below to
modify the PCI protocol in order to fix a problem related to
transaction ordering.

The motivation for the proposal is purely technical.  Basically, we
found what we believe to be a reasonable solution to the problem
described in Section 3.11, Item 6, of the PCI 2.1 Specification, and
we would like to propose it as a way of making the protocol more
robust.  HP will submit the proposal to the SIG, but in the meantime
we would appreciate getting feedback from this reflector.

The proposal has a very long section entitled "Detailed Description of
the Problem" that lists scenarios where things can go wrong due to the
ordering problem.  The reason for this is that there is disagreement
as to the severity of the problem.  The purpose of the scenarios is to
show that the problem is likely to manifest itself.  However, the
scenarios are rather theoretical.  We would appreciate hearing more
concrete ones---or hearing arguments that support the contrary opinion
that it won't manifest itself.

By the way, we think the Delayed Transaction mechanism introduced in
the 2.1 protocol, and the Producer-Consumer model that goes with it,
are very elegant.  The intention of this proposal is not to attack or
put down the protocol.  On the contrary, we want to make it more
robust, so that the Producer-Consumer model is guaranteed to work in
all cases.

Francisco Corella
Hewlett-Packard Company

------------------------ here is the proposal ------------------------


This proposal offers a solution to a known problem in the PCI
protocol.

The problem is the fact that the completion of a given delayed
transaction may be given to a different transaction initiated by a
different master.  This allows a Delayed Request to, in effect, pass a
Posted Write, causing ordering violations which in turn may result in
data inconsistency and synchronization failures.  The problem is
documented in Section 3.11 of the PCI Local Bus Specification Revision
2.1, Item 6, pages 116-117.  However, the severity of the problem is
underestimated in the Specification.  Only one of several scenarios
resulting in incorrect behavior is discussed there.

The solution is to use 4 reserved pins to implement a Local Master ID.
The Arbiter communicates the Local Master ID of the current
Delayed Transaction to the target, which can thus reserve the
completion of the transaction for the master that requested it and
avoid giving it to a different master.  

In addition to solving the ordering problem, a Local Master ID would
allow a sophisticated target to enforce fairness among masters that
compete for access to it, in order to prevent starvation of one of the
masters by the others, something which is impossible with the current
PCI protocol.

		 Detailed Description of the Problem

A Delayed Completion is matched against a Delayed Request on the basis
of Address and Byte Enables.  The identity of the master that issued
the transaction is not used in the comparison, simply because there is
no Master ID in PCI, and thus the target does not know what master
issued the original Delayed Request that generated the Delayed
Completion, nor what Master is issuing the current Delayed Request
that is being matched against the completion.  Thus if the new
request is issued by a different master, the new master will obtain
the completion originally intended for the Master of the original
request.
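
The matching rule described above can be illustrated with a minimal
sketch.  It is not taken from the Specification; the names
DelayedRequest and match_key_2_1 are hypothetical, and the master
field exists only so that the example can show what the target does
not see.

```python
from typing import NamedTuple


class DelayedRequest(NamedTuple):
    master: str        # known to the arbiter only; the target never sees it
    address: int
    byte_enables: int


def match_key_2_1(req: DelayedRequest) -> tuple:
    """Fields a 2.1 target can compare; the master is not among them."""
    return (req.address, req.byte_enables)


original = DelayedRequest(master="A", address=0x1000, byte_enables=0b1111)
repeat_by_a = DelayedRequest(master="A", address=0x1000, byte_enables=0b1111)
same_addr_by_b = DelayedRequest(master="B", address=0x1000, byte_enables=0b1111)

# Both later requests look identical to the target.
print(match_key_2_1(repeat_by_a) == match_key_2_1(original))     # True
print(match_key_2_1(same_addr_by_b) == match_key_2_1(original))  # True
```

Because the two keys are equal, a target in this model has no basis
for withholding the completion from Master B.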

This may cause ordering violations, data inconsistency and
synchronization failure in several scenarios.  Here are some examples:

Scenario 1  

(This scenario is a more concrete version of the one described in the
PCI 2.1 Specification, Section 3.11, Item 6.)

Two masters, A and B, on a PCI bus communicate with a target that
resides on the same bus and implements Delayed Transaction
Termination.  The target is the PCI interface of a device.  Commands
can be sent to the device by writing to a location X in the address
range assigned to the target.  The state of the device can be checked
by reading that same location X.  Master A periodically polls the
state of the device using a Delayed Read transaction.  Master B
periodically sends a command as a Posted Write transaction, and then
checks the resulting state of the device using a Delayed Read
transaction.

A concrete realization of this scenario could be an application where
a PC is used for process control in a factory or a laboratory.  The
device behind the PCI target would be controlling the process.  Master
A would be periodically collecting information for display on a
control panel, while Master B would be the host bridge of the PC and
would be sending commands to the device on behalf of the CPU.

Master A issues a Delayed Read request to X.  The target, using the
Delayed Transaction mechanism, retries the transaction while latching
the request.  When it obtains the data giving the current state of the
device, it readies a Delayed Completion entry, and waits for Master A
to issue the request again.  But before Master A retries its request,
Master B sends a command to the device as a Posted Write, and then
checks the resulting state of the device by issuing a Delayed Read
request.  This new request is matched by the target against the
waiting completion.  The completion is given to Master B, and no
interrogation of the device takes place.  Thus Master B sees the state
of the device before its command was executed, when it thinks it is
observing the result of the command.
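
The sequence of events above can be replayed with a toy model.  This
is only a sketch under simplifying assumptions: the classes Device and
Target21 are hypothetical, the target holds a single delayed entry,
and a return value of None stands for a Retry termination.

```python
class Device:
    """Hypothetical device: location X returns the last command executed."""
    def __init__(self):
        self.state = "idle"


class Target21:
    """Toy target with one delayed entry, matched on address only."""
    def __init__(self, device):
        self.device = device
        self.request = None      # latched Delayed Read Request (address)
        self.completion = None   # (address, data) waiting to be claimed

    def posted_write(self, address, command):
        self.device.state = command      # the command takes effect

    def fetch(self):
        """The target obtains the data for the latched request."""
        if self.request is not None:
            self.completion = (self.request, self.device.state)
            self.request = None

    def read(self, address):
        """Return data, or None to signal Retry."""
        if self.completion and self.completion[0] == address:
            data = self.completion[1]
            self.completion = None
            return data
        if self.request is None:
            self.request = address
        return None


X = 0x1000
target = Target21(Device())

first_try_a = target.read(X)      # Master A: latched and retried
target.fetch()                    # completion now holds "idle"
target.posted_write(X, "start")   # Master B sends its command
seen_by_b = target.read(X)        # Master B's read matches A's completion

print(first_try_a, seen_by_b, target.device.state)   # None idle start
```

In this model Master B reads "idle" although the device state is
already "start", which is the stale observation described above.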

The same error may occur if the target does not use the Delayed
Transaction Termination mechanism, but is separated from the two
masters by a bridge.

As a workaround, the Specification recommends that Master B issue two
Reads after the Posted Write.  The first Read will eliminate any
waiting completion, allowing the second one to obtain timely data.
However, this workaround is not practical.  In cases where Read
transactions have side effects, it is not feasible.  In the case of
existing applications, it would require extensive modifications to the
code.  In the case of new applications and in situations where Read
transactions have no side-effects, determining what Writes must be
followed by dummy reads would impose an unreasonable burden on
programmers of device drivers, and any mistake would result in
hard-to-track intermittent failures.

Scenario 2

The Producer-Consumer Model is the basic synchronization mechanism
provided by the PCI protocol.  It is based on the fact that, if a
master, called the Producer, writes to a location called Data, and
then to a location called Flag, and if a second master, called the
Consumer, sees the value written to Flag by the Producer and then
reads from Data, then the Consumer is guaranteed to see the value
written to Data by the Producer.  This guarantee is supposed to hold
no matter where Producer, Consumer, Data and Flag are placed in an
arbitrary tree of PCI buses.  But this mechanism may fail due to the
problem that we are discussing.

Suppose that, in addition to the Producer and the Consumer, there is a
third master, called the Observer.  The Observer reads from Data
occasionally but does not synchronize with the Producer and the
Consumer.  (For example, the Observer could be sampling Data at
regular or random intervals to gather statistics or for some other
reason.)  The Producer and Consumer are not aware of the existence of
the Observer, but follow the Producer-Consumer model to synchronize
among themselves.  The three masters (Producer, Consumer, Observer)
and the two targets containing the Data and Flag locations are placed
anywhere in an arbitrary tree of PCI buses.  The target containing
Data implements Delayed Transaction Termination.

The Observer sends a Delayed Read request to Data.  The Data target
latches the request and retries the transaction.  Then the Data target
obtains the current value of Data, prepares the Delayed Completion
entry, and waits for the Observer to issue the request again.  But
before the Observer retries its request, the Producer writes to Data
and then to Flag, the Consumer reads from Flag and sees the value
written by the Producer, and then the Consumer sends a Delayed Read
request to Data.  The Data Target matches this new request to the old
Completion entry, and gives the Completion to the Consumer.  Thus the
Consumer sees the old value of Data rather than the one written by the
Producer, a violation of the Producer-Consumer model.

As a variant of this scenario, the Data target does not implement
Delayed Transaction Termination, but is located behind a bridge.  For
example, Producer, Consumer, Observer and Flag are on a PCI bus B1,
while Data is on a different bus B2 linked to B1 by a bridge.  It is
then the bridge that delivers the Completion to the wrong master.

As another variant, the Producer, the Consumer and the Observer are
CPUs and the Flag is in main memory.  The host bridge acts on behalf
of the three processors.

As yet another variant, all masters are PCI controllers doing DMA to
access Data in main memory.  The Flag can be in main memory or in PCI.

Scenario 3

The Producer-Consumer Model allows the Producer to set the Flag as
soon as its Posted Write to Data completes on the Producer's bus,
without waiting for the Posted Write to reach its final destination.
There are cases, however, where a master needs to wait for full
completion of a Posted Write.  Here are some examples:

    1. The master intends to communicate the completion of the Posted
    Write to another master through a sideband signal.  Sideband
    signals are explicitly allowed by the PCI protocol, Section 2.3.

    2. The master intends to communicate the completion of the Posted
    Write to a CPU through an interrupt.

    3. The master intends to send a second transaction to a different
    target, and must ensure that the second transaction reaches its
    destination only after the first transaction has taken effect.
    The relative timing of transactions sent to different targets is
    important if the two targets are two devices that jointly control
    and/or monitor the same external process, or if the two targets
    are two ports of the same device.

The PCI protocol provides a second synchronization mechanism that
addresses this need.  After sending a Posted Write, a master may send
a Read to the same address.  This read cannot pass the Posted Write,
and hence it will complete at the destination bus after the Posted
Write does.  Furthermore, if the target has a queue of incoming
transactions, the Read cannot pass the Posted Write on that queue.
The Read can complete inside the target only after the Write does.
Thus when the master receives the Read's completion, it knows that the
Write has taken effect.

This second synchronization mechanism also fails due to the problem
that we are discussing.  Suppose that two Masters A and B are sending
Posted Writes to a remote target behind a bridge, and are using
subsequent Reads to ensure completion.  Master A sends a Write to a
location X on the remote target, followed by a Read.  The Read is
latched by the bridge and retried.  Eventually the bridge receives a
Completion from the remote target, stores it, and waits for Master A
to retry its request.  In the meantime, Master B sends a Posted Write
to the same location X, followed by a Read.  The Posted Write is
queued by the bridge, and then the Read is completed using the
existing Completion.  Thus Master B's Read completes while the Posted
Write is queued at the bridge, well before it has reached its final
destination, and the synchronization mechanism is defeated.
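
The bridge case can be sketched in the same way.  The class Bridge21
below is hypothetical: it models a posted-write queue and a single
Delayed Read Completion slot matched on address only, and a dict
stands in for the remote target.  None again stands for Retry.

```python
from collections import deque


class Bridge21:
    """Toy bridge: posted-write queue plus one delayed completion slot."""
    def __init__(self, remote_memory):
        self.remote = remote_memory     # dict standing in for the remote target
        self.write_queue = deque()      # posted writes not yet delivered
        self.request = None             # latched Delayed Read Request
        self.completion = None          # (address, data)

    def posted_write(self, address, value):
        # The write completes on the near bus as soon as it is queued.
        self.write_queue.append((address, value))

    def run_remote_side(self):
        """Drain queued writes, then perform the latched read remotely."""
        while self.write_queue:
            address, value = self.write_queue.popleft()
            self.remote[address] = value
        if self.request is not None:
            self.completion = (self.request, self.remote[self.request])
            self.request = None

    def read(self, address):
        """Return data, or None for Retry.  Matching uses the address only."""
        if self.completion and self.completion[0] == address:
            data, self.completion = self.completion[1], None
            return data
        if self.request is None:
            self.request = address
        return None


X = 0x2000
remote = {X: 0}
bridge = Bridge21(remote)

bridge.posted_write(X, 1)       # Master A: Write ...
bridge.read(X)                  # ... then Read, latched and retried
bridge.run_remote_side()        # completion (value 1) now waits for A

bridge.posted_write(X, 2)       # Master B: Write, queued at the bridge
data_for_b = bridge.read(X)     # B's Read is served from A's completion

print(data_for_b, remote[X], len(bridge.write_queue))   # 1 1 1
```

Master B's Read returns while its Write (value 2) is still in the
queue and the remote location still holds 1.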

As a variant of this scenario, the two masters and the target may be
on the same PCI bus, with the target implementing Delayed Transaction
termination.  In that case Master B's Read receives Master A's
completion from the target itself.  By then Master B's Posted Write
must have reached the target, but may still be in the target's
incoming queue.  Thus Master B's Read may complete while its Posted
Write has not really been seen by the target.

Scenario 4

Two devices A and B act as masters on a PCI bus, and a third device C
acts as target on a remote bus connected to the first bus by a bridge.
A and B send Delayed Writes to a location X in C.  

For example, device C could control a piece of equipment in a
process-control application, and devices A and B could be connected to
sensors.  When a sensor detects an event, the corresponding device, A
or B, writes a value to X that indicates the type of event that has
been sensed.

Master A sends a Delayed Write request with the value 0 addressed to
X.  The bridge latches the request and retries the transaction.  Then
the bridge issues the transaction on the remote bus, where it is
completed by X.  The bridge stores a DWC entry with a "normal"
termination status and waits for Master A to reissue its request.
Before Master A reissues its request, Master B sends a Delayed Write
request, with value 1, addressed to X.  This request is matched to the
existing completion using only Address and Byte Enables for the
comparison.  (The value written is not part of the DWC entry and is
not used for the comparison; see the description of DWC in the PCI 2.1
Specification.)  Thus Master B obtains the
completion that was intended for Master A, and no DWR entry with the
value of 1 is latched by the bridge.  When Master A retries its
transaction, it is latched by the bridge and forwarded again to X.
Thus the target sees two writes of 0 instead of a write of 0 and a
write of 1.
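
A toy model of the Delayed Write case follows.  The class Bridge21 is
hypothetical and holds one DWR entry and one DWC entry; as in the text
above, the DWC entry records the address but not the value.

```python
class Bridge21:
    """Toy bridge handling Delayed Writes; entries matched on address only."""
    def __init__(self):
        self.remote_writes = []   # values that actually reached location X
        self.dwr = None           # latched Delayed Write Request (address, value)
        self.dwc = None           # Delayed Write Completion: address only

    def run_remote_side(self):
        """Issue the latched write on the remote bus and store a DWC."""
        if self.dwr is not None:
            address, value = self.dwr
            self.remote_writes.append(value)
            self.dwc, self.dwr = address, None

    def write(self, address, value):
        """Return True when the master sees completion, False for Retry."""
        if self.dwc == address:   # the value is not part of the DWC entry
            self.dwc = None
            return True
        if self.dwr is None:
            self.dwr = (address, value)
        return False


X = 0x3000
bridge = Bridge21()

bridge.write(X, 0)             # Master A: latched, retried
bridge.run_remote_side()       # 0 is written to X; the DWC waits for A
b_done = bridge.write(X, 1)    # Master B: matches A's DWC, value 1 is lost
bridge.write(X, 0)             # Master A retries: no DWC left, latched again
bridge.run_remote_side()
a_done = bridge.write(X, 0)    # Master A finally sees its completion

print(b_done, a_done, bridge.remote_writes)   # True True [0, 0]
```

Both masters believe their writes completed, yet the value 1 never
reaches the remote bus.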

In the process-control example, device C is led to believe that two
events coded 0 have occurred, when in fact there has been one event
coded 0 and one event coded 1.

			  Proposed solution

Four reserved pins are used to encode a Local Master ID.  This
provides 16 distinct codes, which is more than the number of masters
electrically feasible on a single bus.  The Arbiter knows the identity
of the master of the current transaction and uses these four Master ID
lines to communicate that identity to the target.

The Master ID can simply be the index into the array of GNT# lines of
the Arbiter.  Only the Arbiter needs to know the correspondence
between IDs and masters.  The masters themselves do not need to know
their own IDs.  Thus the configuration process need not be modified.

In any given cycle, the Arbiter asserts the Master ID corresponding to
the GNT# line that it asserted in the previous cycle.  Thus the ID of
the master that drives the transaction is present on the Master ID
lines during the address phase, and can be read by the target at the
same time it reads the address.
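
The one-cycle relationship between GNT# and the Master ID lines can be
sketched as follows.  The function master_id_lines is hypothetical,
and the value driven when no GNT# was asserted in the previous cycle
(idle_id) is purely an assumption of this example; the proposal does
not define it.

```python
def master_id_lines(gnt_per_cycle, idle_id=0xF):
    """For each cycle, the ID the arbiter drives: the index of the GNT#
    line it asserted in the previous cycle (idle_id when there was none)."""
    ids = []
    previous_gnt = None
    for gnt in gnt_per_cycle:
        ids.append(idle_id if previous_gnt is None else previous_gnt)
        previous_gnt = gnt
    return ids


# Index of the asserted GNT# line in each cycle; None means no grant.
grants = [2, 2, None, 0, 0]
print(master_id_lines(grants))   # [15, 2, 2, 15, 0]
```

Each entry of the result lags the grant sequence by one cycle, so a
master granted the bus in cycle n is identified on the ID lines in
cycle n+1.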

The target of a Delayed Read Request records the Master ID as part of
the DRR entry.  When the target obtains the data for the completion,
it copies the Master ID to the DRC entry.  Incoming Delayed Read
requests are matched to existing DRC entries using Command, Address,
Byte Enables, and Master ID for the comparison.  Thus a request issued
by one master cannot be given the completion of a request to the same
address issued by a different master.
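
As a sketch of the proposed matching rule, the toy target below keys
its entries on Command, Address, Byte Enables and Master ID.  The
class TargetWithMasterId and the callable read_device are
hypothetical; None stands for Retry, and the Posted Write is modeled
by changing the device state directly.

```python
class TargetWithMasterId:
    """Toy target keyed on (command, address, byte enables, master ID)."""
    def __init__(self, read_device):
        self.read_device = read_device   # callable returning current data
        self.requests = set()            # latched DRR keys
        self.completions = {}            # DRC key -> data

    def fetch_all(self):
        """Obtain data for every latched request and create DRC entries."""
        for key in sorted(self.requests):
            self.completions[key] = self.read_device()
        self.requests.clear()

    def read(self, address, byte_enables, master_id):
        key = ("read", address, byte_enables, master_id)
        if key in self.completions:
            return self.completions.pop(key)
        self.requests.add(key)
        return None                      # Retry


state = {"value": "idle"}
target = TargetWithMasterId(lambda: state["value"])
X, BE = 0x1000, 0b1111

target.read(X, BE, master_id=0)                 # Master A: latched
target.fetch_all()                              # completion for A: "idle"
state["value"] = "start"                        # Master B's Posted Write
first_try_b = target.read(X, BE, master_id=1)   # not given A's completion
target.fetch_all()
seen_by_b = target.read(X, BE, master_id=1)
seen_by_a = target.read(X, BE, master_id=0)     # A's own, older completion

print(first_try_b, seen_by_b, seen_by_a)        # None start idle
```

Master B is retried instead of receiving Master A's completion, and
its second attempt returns data obtained after its own write.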

Incoming Delayed Read requests are also matched to existing DRR
entries using Command, Address, Byte Enables, and Master ID for the
comparison.  Thus a target may contain two DRR entries with the same
Address and Byte Enables but different Master IDs.  If the target is a
bridge, when those two entries reach the remote bus, they result in
identical transactions.  Once one of those transactions is issued on
the remote bus, the second transaction cannot be issued until the
first has completed, to avoid an ordering violation with respect to an
intervening Posted Write.  In other words, a bridge is not allowed to
have two outstanding Delayed Read Requests for the same Address and
Byte Enables simultaneously.  (NOTE: The bridge is not allowed to
satisfy the two DRR entries with the same completion, even in the case
where no intervening Posted Writes are seen by the bridge.  This is
because Reads can have side effects, and thus the number of Reads that
reach the target may matter.)

Similarly, a master representing a multi-function device must not have
two outstanding Delayed Read Requests for the same Address and Byte
Enables simultaneously.

Master IDs are used for Delayed Write Transactions in the same manner
as for Delayed Read Transactions.

Implementation of the solution should be optional, both for targets
and for arbiters.  A target may ignore the Master ID lines, in which
case it may give a completion to the wrong master.  An arbiter may
fail to drive the Master ID lines.  A target that uses the Master ID
lines can be used with an arbiter that does not drive them, if the
lines are pulled up by resistors on the motherboard.  In that case all
targets will treat all delayed requests as if they came from the same
master.  Systems that implement the optional solution will be more
robust than systems that do not.
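
The fallback behavior can be sketched as follows.  The sketch assumes
that a pulled-up line reads as 1, so that undriven lines yield the ID
0b1111; the proposal does not state the polarity of the lines, and the
names below are hypothetical.

```python
PULLED_UP_ID = 0b1111   # assumed value of four undriven, pulled-up lines


def sampled_master_id(driven_id):
    """ID a target samples: the arbiter's value, or all ones if undriven."""
    return PULLED_UP_ID if driven_id is None else driven_id


def entry_key(request):
    """Key a new target would use for an (address, byte_enables, driven_id)."""
    address, byte_enables, driven_id = request
    return (address, byte_enables, sampled_master_id(driven_id))


# New arbiter: masters 0 and 1 are told apart.
print(entry_key((0x1000, 0xF, 0)) == entry_key((0x1000, 0xF, 1)))        # False

# Old arbiter: lines undriven, both requests look like one master.
print(entry_key((0x1000, 0xF, None)) == entry_key((0x1000, 0xF, None)))  # True
```

With the lines undriven, every request carries the same ID and the
comparison reduces to Address and Byte Enables, as in a 2.1 system.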

		       Backwards Compatibility

Hardware that implements the solution (which we shall call "new"
hardware) can be used together with hardware that does not ("old"
hardware).

(*) An old target may be used on a new bus.  It will not be connected
to the 4 Master-ID pins.  If the target does not implement the Delayed
Transaction mechanism, and in particular if it is a pre-2.1 target,
then it will function correctly.  If the target is a 2.1 target that
implements the Delayed Transaction mechanism, it will function as it
does in a 2.1 system.  Of course the solution will not be effective
for this target, i.e. the target may give a completion to the wrong
master.

(*) An old master may be used on a new bus.  If the master can only
have one outstanding Delayed Transaction at a time, then it will
function correctly.  This will be the case for some 2.1 masters and
for all single-function devices.  If the master can have multiple
outstanding Delayed Transactions, and if there are two Delayed
Transactions with the same type and address outstanding at the same
time, a new target will not be able to distinguish between those two
transactions, and may give the completion of one of them to the other.
However, the new target will distinguish between a transaction from
such a master and a transaction from a different master, and will not
give the completion of one of them to the other.  Thus in this case
the solution will only be partially effective.  Masters that can have
multiple outstanding transactions include some 2.1 masters as well as
all pre-2.1 bridges and multifunction devices.

(*) A new controller may be used on an old bus, provided that the 4
Master-ID lines are pulled up by resistors.  When functioning as a
target, the controller will treat all transactions as if they came
from the same master.  Thus the new controller will behave as a 2.1
controller would.  Of course the solution will not be effective in
this case.