
Re: Read Speed



|<><><><><> Original message from André David <><><><><>
|
|Hi!
|
|I am working in a group developing a PCI board for data acquisition.
|Since our priority is getting it running, bus-mastering capabilities for
|things such as DMA to the host memory are not on the front line of
|development.
|
|Now, since I'm the guy behind the device driver, I have done some
|benchmarking with a simple device driver (in Linux, of course) using a
|standard PCI VGA adapter.
|This "driver" just uses memcpy() to transfer some data between the main
|memory and the board's framebuffer.
|
|I have tried three different processor/chipset combinations and the
|results I get are:
|
|(results after tweaking BIOS and MTRR parameters)
|
|                            Reading (MB/s)    Writing (MB/s)
|Intel 440FX (PII@233)            7.03             36.16
|Intel 440BX (2*PII@400)          8.62            102.4
|VIA KT133 (Athlon@900)           7.46            119.6
|
|Now this points to a pattern in which the north bridge seems unable to
|read from the board at a reasonable speed. I know writing is always
|easier than reading (from the specs, a single data phase read is slower
|than a single data phase write: 4 clock cycles vs. 3).
|
|The north bridge's behaviour is hard to justify even if we assume that
|all the reads are single data phase reads (4 clock cycles), even with
|medium devsel (1 more clock cycle lost) and a wait state from the VGA
|board (another clock cycle lost), because this would still give a total
|of only 6 clock cycles per transfer, or 22 MB/s total bandwidth.

It is likely that there is some additional latency on the VGA board as well.
It is quite likely forcing two transactions on the bus.  The first one completes
without a data transfer and the second is a deferred read that actually transfers
the data.  You should see something like the following on the bus.

+0		ReadMem		aaaaaaaa
+1				aaaaaaaa
+2		DevSel		aaaaaaaa
+3		Irdy		xxxxxxxx
+4		Irdy + Stop	xxxxxxxx
+5		dead cycle	xxxxxxxx
+6		ReadMem		aaaaaaaa
+7				aaaaaaaa
+8		DevSel		aaaaaaaa
+9		Irdy		xxxxxxxx
+10		Irdy + Trdy	xxxxxxxx
+11		dead cycle	xxxxxxxx
+12
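
For reference, assuming the standard 33 MHz PCI clock, that pattern
(12 clocks per 4-byte transfer) caps read throughput at roughly
33.33 MHz / 12 * 4 bytes = ~11.1 MB/s, before any extra wait states
from the target.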

If you're getting 8.62 MB/s, that implies 4 bytes every 15-16 PCI clock cycles,
so the VGA board is probably adding additional wait states.  I haven't measured
the chipsets you show above, but I have seen 13 MB/s in a configuration that
wasn't particularly ideal, so I would expect that you can do better than
that with the 440BX or KT133 if the PCI target device is willing.
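
(Again assuming a 33 MHz bus: 33.33 MHz * 4 bytes / 8.62 MB/s comes out
to about 15.5 clocks per 4-byte transfer.)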

|So my questions are:
|
|- Since it looks like north bridges have always been like this, has
|anyone found one that is not?

I think your performance limitation is a result of the device you are
reading from.  You can probably improve your numbers by doing 64-bit
reads with the FPU, MMX, or SSE units.
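
A rough, untested sketch of such a 64-bit read loop using MMX (the
function name is mine, and in-kernel MMX use also needs the FPU state
saved around it, e.g. with kernel_fpu_begin()/kernel_fpu_end()):

/*
 * Copy 'len' bytes (a multiple of 8) from the memory-mapped PCI
 * region 'src' into host memory at 'dst' with 64-bit MMX loads,
 * so each PCI read moves 8 bytes instead of 4.
 */
static void mmx_read(void *dst, const void *src, unsigned long len)
{
	unsigned long i;

	for (i = 0; i < len; i += 8)
		__asm__ __volatile__(
			"movq (%0), %%mm0\n\t"	/* 64-bit read from the board */
			"movq %%mm0, (%1)\n\t"	/* 64-bit write to host memory */
			:
			: "r" ((const char *)src + i), "r" ((char *)dst + i)
			: "memory");

	__asm__ __volatile__("emms");	/* leave the FPU usable again */
}

SSE loads would let you go to 128 bits per read in the same way.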

|- Is it reasonable (logical) that the north bridge behaves like this?

Yes, a 32-bit read from the CPU has never been combined with other
32-bit reads to consecutive addresses on x86 processors.  I believe this
is primarily for compatibility with devices that are not completely
PCI compliant, as well as to avoid introducing additional complexity into
the design.

|- Since I have only talked about commodity PCs, could there be
|something on the industrial market that does not suffer from this
|apparent "feature"?

Probably not.  If you anticipate transferring a large amount of data
from your device to the CPU, programmed I/O isn't the way to do it, as
you have seen.  Best performance will result if your device does a bus
mastered write to system memory.  If you're going the other direction,
then programmed I/O can generally do as well as a bus mastered read from
system memory on modern chipsets.
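
And for when you do get to bus mastering, the Linux side of a
device-writes-to-host setup usually boils down to something like the
following (the register offsets here are hypothetical; substitute
whatever your board actually implements):

#include <linux/pci.h>
#include <asm/io.h>

#define DEV_REG_DMA_ADDR  0x10	/* hypothetical board registers */
#define DEV_REG_DMA_LEN   0x14

/*
 * Allocate a coherent DMA buffer and point the board at it, so the
 * device can bus-master its data straight into host memory.
 */
static int setup_dma(struct pci_dev *pdev, unsigned long regs,
		     size_t len, void **buf, dma_addr_t *handle)
{
	pci_set_master(pdev);		/* enable bus mastering */

	*buf = pci_alloc_consistent(pdev, len, handle);
	if (!*buf)
		return -ENOMEM;

	writel(*handle, regs + DEV_REG_DMA_ADDR);	/* bus address */
	writel(len, regs + DEV_REG_DMA_LEN);		/* transfer size */
	return 0;
}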

The general rule of thumb is: always write, never read; latency kills.

TJ Merritt
tjm@codegen.com
1-415-834-9111