RE: long burst writes under NT 4.0?
Jim,
You are not doing anything wrong. You are just running into the
limitations of trying to use the x86 processor to do write bursting
for you.
You asked whether you can get your memory mapped as write-combined.
Not really. There is a facility that drivers can call to map a range of
memory as USWC (write combined), and some video drivers call this
interface. It resets the MTRR registers, if possible, to give a USWC
format to some memory. You could look for that interface; try looking at
a video driver sample in the NT DDK.
Otherwise, you are just running into the fact that the uncached writes
coming out of the CPU don't gather in posted write buffers in the 440BX
in greater numbers than 4 LWords at a time to allow greater bursts.
You won't get greater bursts, no matter how hard you try.
This is the reason most high performance devices are set up as bus masters,
so that they can master their own traffic with longer bursts. (Of course
they then run into the chipset cacheline latency delays on reads or writes,
where the chipset asserts STOP#. I also believe this is the source of the
recommendation that you received to use the Memory Write and Invalidate
command.)
That recommendation does not really apply in your situation where you are
just using the CPU to write to a slave PCI device.
Sorry to let you down as such, but I don't think you will be able to
speed things up. The PCI slave writes out of an x86 CPU with the average
Intel chipset just won't give you good sustained long burst rates, which
seems to be what you are looking for.
-David O'Shea
david.j.oshea@intel.com
-----Original Message-----
From: Jim Foote [mailto:foote@parc.xerox.com]
Sent: Thursday, February 17, 2000 9:11 AM
To: pci-sig@znyx.com
Subject: long burst writes under NT 4.0?
Apologies up front for the length of this message, and if it's the wrong
forum. I've tried to summarize it at the top and leave the gory details for
later. The whole message is available at
ftp://parcftp.parc.xerox.com/transient/foote/BurstDebug.htm
Bottom line question: Is there a way to get long PCI combining write bursts
out of NT 4.0?
Or, stated differently, what's the right way to track down the cause of
short PCI target burst writes? I suspect that I've misconfigured my system
in some way that it's convinced that it can't burst more than 4 Lwords at a
time. I'm just stumped as to how I go about tracking down the
misconfiguration.
Short summary: I'm trying to figure out why my Intel 440BX or VIA MVP3
based systems won't burst more than 4 Lwords at a time out of main memory
and through a PLX 9050 target bridge chip to a static RAM buffer. With a
logic analyzer I can see that bursts are happening, but the 440BX is
toggling FRAME# after at most 4 Lwords even though the 9050 is holding STOP#
high. What I'm trying to figure out is what's possessing the 440BX to limit
the size of its combining bursts to 4 Lwords.
Ugly details:
Test setup:
Tyan S1830 Tsunami AT motherboard with 500MHz PIII and 256MB memory
http://www.tyan.com/products/html/s1830s_sl.html
Adaptec PCI SCSI adapter
PLX 9050 RDK prototyping board
http://www.plxtech.com/products/9050/briefs/9050rdk.pdf
Windows NT 4.0 service pack 6a
KRF's WinDriver 4.0 for the PLX 9050
http://www.krftech.com/windriver.html
HP 15400C logic analyzer monitoring PCI bus and PLX 9050 local bus at 4ns
sampling
AMI BIOS version 1112991500; some of the potentially relevant BIOS settings:
  Advanced Chipset settings:
    Master Latency Timer (Clks): 224
    Multi-Trans Timer (Clks): 224
    PCI1 to PCI0 access: disabled
    PIIx4 Passive Release: enabled
    PIIx4 Delayed Transaction: disabled
  Plug & Play Settings:
    PCI Latency Timer (PCI Clocks): 224
(So here I've set the various latency timers to the maximum values that the
BIOS setup program will allow me to enter.)
The KRF WinDriver code sets up the memory mappings for the PLX 9050's
configuration registers and local address space, but otherwise just calls
memcpy() to do block writes and relies on the underlying software and
hardware configuration to turn these block operations into burst transfers.
Here's the relevant section of the test code using the WinDriver API:
static void WriteDBlock (P9050_HANDLE hPlx, DWORD dwOffset, PVOID buf,
                         DWORD dwBytes, P9050_ADDR addrSpace, P9050_MODE mode)
{
    WD_TRANSFER trans;
    DWORD dwAddr = hPlx->addrDesc[addrSpace].dwAddr + dwOffset;

    BZERO(trans);
    if (!hPlx->addrDesc[addrSpace].fIsMemory)
        exit(-1); /* error if it's not memory */
    trans.cmdTrans = WM_SDWORD;
    trans.dwPort = dwAddr;
    trans.fAutoinc = TRUE;
    trans.dwBytes = dwBytes;
    trans.dwOptions = 0;
    trans.Data.pBuffer = buf;
    WD_Transfer (hPlx->hWD, &trans); /* This boils down to a memcpy() (?) */
}
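static void TestBurstWrite(burstObjHandle p)
{
    /* The RDK has 128 Kbytes of static RAM. Use that static RAM to simulate
       a FIFO and do burst writes in target mode. */
    unsigned bufSize = 64*1024;
    /* memset() stores only the low byte (0x55) of its fill value; repeated
       in every byte, that still yields the Lword pattern 0x55555555. */
    DWORD inputVal = 0x55555555;
    char *writeBufPtr = (char *)malloc(bufSize);

    if (writeBufPtr == NULL)
        exit(-1); /* error if the buffer can't be allocated */
    (void)memset((void *)writeBufPtr, (int)(inputVal & 0xFF), bufSize);
    WriteDBlock (p->hPlx, 0, (PVOID)writeBufPtr,
                 bufSize, P9050_ADDR_SPACE2, P9050_MODE_DWORD);
    free(writeBufPtr);
}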
When I run this test case I get the logic analyzer trace shown in:
ftp://parcftp.parc.xerox.com/transient/foote/brst_fst.jpg
This trace shows that the 440BX is indeed doing combining bursts, but it is
only bursting at most 4 Lwords between toggles of FRAME#, at a net bandwidth
of about 50Mbytes/second. The 9050 is never asserting STOP#, so as far as I
can tell the 9050 would happily accept larger bursts if only the 440BX would
send them.
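As a rough cross-check of the bandwidth figure above, here is a minimal sketch of how burst length limits throughput on the bus. It assumes a 33 MHz, 32-bit PCI bus, and it lumps the address phase, turnaround and idle clocks of each transaction into a single overhead_clocks parameter. That parameter and the function name are illustrative assumptions; the overhead value is not a number read off the trace.

```c
#include <assert.h>

/* Assumed bus parameters: 33 MHz clock, 32-bit (4-byte) data path. */
#define PCI_CLOCK_MHZ            33.0
#define PCI_BYTES_PER_DATA_PHASE 4.0

/*
 * Estimate sustained write bandwidth in Mbytes/second when every PCI
 * transaction moves `data_phases` Lwords and costs `overhead_clocks`
 * extra clocks (address phase, turnaround, idle) on top of the data.
 */
double pci_burst_mbytes_per_sec(unsigned data_phases, unsigned overhead_clocks)
{
    double total_clocks;

    assert(data_phases > 0);
    total_clocks = (double)data_phases + (double)overhead_clocks;
    return PCI_CLOCK_MHZ * PCI_BYTES_PER_DATA_PHASE
           * (double)data_phases / total_clocks;
}
```

Under these assumptions, pci_burst_mbytes_per_sec(4, 6) gives 52.8 Mbytes/second, which is in the neighborhood of the 50 Mbytes/second measured. The same per-transaction overhead spread over longer bursts would come much closer to the 132 Mbytes/second ceiling of this model.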
PLX's technical support claims that they've been able to use a two card
configuration of a PLX 9054 RDK board acting as a bus mastering DMA
initiator to send streams to a PLX 9050 RDK in target mode at over
120Mbytes/sec, so it seems that the 9050 is able to sink fast streams if
only I could convince the 440BX to send them.
I've tried the same setup with a Tyan s1590s Trinity AT motherboard with an
AMD K6-III 450MHz CPU and a VIA MVP3 chipset
(http://www.tyan.com/products/html/s1590s.html) and seen virtually the
identical result. So my intuition is that either I've somehow misconfigured
both of the different BIOSes on the two motherboards, or more likely, the
Windows NT 4.0 drivers that WinDriver is layered on aren't properly
configured to support longer PCI target mode bursts.
Neither PLX technical support nor KRF's technical support have been able to
suggest a fix or a debugging strategy.
So what's the right debugging strategy from here?
For example, is it possible that NT drivers are resetting the latency timers
from what the bios sets them to at startup? Seems unlikely, but if I knew
how to read these registers from NT I guess I could verify that the before
and after values were the same.
PLX tech support suggested that I might be able to get longer bursts by
using a 'Memory Write and Invalidate' command and a large Cache Line Size
instead of just a 'Memory Write' command. But they didn't think that this
was something that the KRF WinDriver would support directly and didn't have
any insights into how one might get this operation from NT. And in any case
the Intel 440BX data sheet says in section 3.3.3 that Bit 4 of the PCI
Command Register is hardwired to disable this command. And for that matter,
the PLX 9050 data sheet says the same thing for its PCI Configuration ID
register in its section 4.2.1.
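The registers discussed in the last two paragraphs sit at fixed offsets in the standard PCI configuration header, so a before-and-after comparison could be sketched as below. The offsets and bit positions are the standard ones (Command at 0x04, Cache Line Size at 0x0C, Latency Timer at 0x0D, Memory Write and Invalidate Enable as Command bit 4). The struct and function names are hypothetical, and how the raw header bytes are obtained is left open: in an NT kernel-mode driver HalGetBusData with the PCIConfiguration data type is presumably the route, but that call is not shown here and is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets and bits from the standard PCI configuration header. */
#define PCI_CFG_COMMAND         0x04  /* 16-bit, little-endian */
#define PCI_CFG_CACHE_LINE_SIZE 0x0C  /* 8-bit, in units of 32-bit words */
#define PCI_CFG_LATENCY_TIMER   0x0D  /* 8-bit, in PCI clocks */
#define PCI_CMD_MEMORY_SPACE    (1u << 1)
#define PCI_CMD_BUS_MASTER      (1u << 2)
#define PCI_CMD_MWI_ENABLE      (1u << 4)

/* Hypothetical summary of the burst-related settings of one device. */
struct pci_burst_settings {
    int      valid;                 /* 0 if the buffer was too short */
    int      memory_space_enabled;
    int      bus_master_enabled;
    int      mwi_enabled;           /* Memory Write and Invalidate Enable */
    unsigned cache_line_dwords;
    unsigned latency_timer_clocks;
};

/* Decode the burst-related fields from a raw configuration header dump. */
struct pci_burst_settings decode_pci_burst_settings(const uint8_t *cfg,
                                                    size_t len)
{
    struct pci_burst_settings s = {0};
    unsigned command;

    if (cfg == NULL || len < 0x10)
        return s;                   /* need at least the first 16 bytes */

    command = (unsigned)cfg[PCI_CFG_COMMAND]
            | ((unsigned)cfg[PCI_CFG_COMMAND + 1] << 8);

    s.valid                = 1;
    s.memory_space_enabled = (command & PCI_CMD_MEMORY_SPACE) != 0;
    s.bus_master_enabled   = (command & PCI_CMD_BUS_MASTER) != 0;
    s.mwi_enabled          = (command & PCI_CMD_MWI_ENABLE) != 0;
    s.cache_line_dwords    = cfg[PCI_CFG_CACHE_LINE_SIZE];
    s.latency_timer_clocks = cfg[PCI_CFG_LATENCY_TIMER];
    return s;
}
```

Decoding one dump taken right after boot and another taken after the NT drivers have loaded, then comparing latency_timer_clocks and mwi_enabled, would show whether anything rewrote the values the BIOS programmed. A device that hardwires bit 4 to zero will always report mwi_enabled as 0.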
I found an article on Intel's web site
(http://developer.intel.com/design/PentiumII/applnots/24442201.pdf) that
gives hints to miniport device driver writers on flags to pass to
VideoPortMapMemory() and VideoPortGetDeviceBase() to enable Write Combining
memory. But this seems specific to video adapters; it's not clear how I'd
use this for a more vanilla data mover.
If somebody knows an obvious fix I sure wouldn't turn it down! But mostly
I'm looking for a debugging strategy that might help me narrow down the
cause, or a pointer to the right set of documentation to tell me how to
enable longer write bursts.
Thanks,
Jim Foote
foote@parc.xerox.com