
RE: Write Combining difference between 98 and NT



See below...

-----Original Message-----
From: Paul Slade [mailto:Paul@pmis.freeserve.co.uk]
Sent: Monday, August 14, 2000 1:05 PM
To: Dave New; pci-sig@znyx.com
Subject: Re: Write Combining difference between 98 and NT


Thanks for your thoughts - I've added replies to your questions below.


> I'll throw out a couple of thoughts:
>
> 1. I would concentrate on the timing on the CPU side of the things.  I
> suspect that the WC flush occurs under two different conditions -- buffer
> full, or timeout.  If the timeout is sufficiently short, any delay in
> writing to the WC buffer would result in a flush of a partial buffer.

What timeout are you referring to: a timeout in the processor's memory
manager, the PCI chipset or the target PCI device? Intel don't seem to
document any internal timeouts on the WC buffers.

-- I would assume the PCI Chipset.  The idea would be to not hold a
   'partial' combined write forever, but to go ahead and flush it if
   only a couple of bytes are written within some specific number of
   bus cycles.  Purely my guess, though, since I'm not familiar with
   the chipset in question (hoping that someone who knows chimes in
   to save me from further embarrassment 8-)...

> 2. With the above in mind, I would concentrate on what things can cause a
>     user-mode program that is writing directly to mapped memory locations
>     to run slower on NT.  The first thing I would suspect is NT's vaunted
>     security mechanisms. <snip>

Under NT the user mode pointer is obtained by the driver using the
VideoPortGetDeviceBase() Video Miniport Driver DDK call. The InIOSpace field
is set to VIDEO_MEMORY_SPACE_P6CACHE to specify that the memory should be
mapped as Write Combined memory.

I would be very surprised if the OS traps accesses to the mapped region
since the whole point of the VideoPortGetDeviceBase() call is to allow very
fast access to display memory.

-- Agreed.  I'm familiar with the raw DDK calls, rather than ones exposed
   via a Video miniport, but I'm sure the intent is the same.

Additionally, if I structure my data such that each call to the function
writes out a multiple of eight 32 bit words, then the code runs
significantly faster (around 20%) under NT than 98. This would seem to
indicate that there is little or no OS overhead being incurred by the
accesses.

-- It would seem to indicate that most of the overhead is being caused
   by the function call, but that seems rather excessive...
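
A minimal sketch of the "multiple of eight 32 bit words" idea, for anyone
who wants to try it. It assumes that one WC buffer holds 32 bytes (eight
32-bit words), which is my understanding of the P6 family, and that the
target tolerates padding words being written after the real data.
write_data_padded() and WC_WORDS are hypothetical names, not part of the
original writeData() code, and the destination here is an ordinary pointer
standing in for the WC-mapped region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one WC buffer is 32 bytes, i.e. eight 32-bit words. */
#define WC_WORDS 8u

/*
 * Hypothetical helper: copy `count` words from `src` to `dst`, then pad
 * with `fill` until the total is a multiple of WC_WORDS, so each call
 * ends on a full-buffer boundary (given that dst itself is 32-byte
 * aligned).  Returns the number of words actually written.
 * `dst` must have room for the padded length.
 */
size_t write_data_padded(volatile uint32_t *dst, const uint32_t *src,
                         size_t count, uint32_t fill)
{
    /* Round count up to the next multiple of WC_WORDS. */
    size_t padded = (count + WC_WORDS - 1) / WC_WORDS * WC_WORDS;
    size_t i;

    assert(dst != NULL && (src != NULL || count == 0));

    /* Ascending, sequential stores: the real data first... */
    for (i = 0; i < count; i++)
        dst[i] = src[i];

    /* ...then the padding words up to the rounded length. */
    for (; i < padded; i++)
        dst[i] = fill;

    return padded;
}
```

Whether the padding actually produces full bursts on the PCI side is
something only the analyzer can confirm; the sketch just guarantees the
word count.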

<snip>

> 4. Unless the call to writeData() is inlined by the compiler/linker, there
>     would probably be at least a couple of memory hits outside of your
>     WC buffer area (or the CPU's instruction cache), that is to the
>     system stack area.  That may be enough to throw things off in this case.

True, but I would expect that after the first iteration through the
function all code and data would be cached for all subsequent
iterations. There is not a huge amount of code or data being accessed
(much less than 1KByte).

> It sounds like you have a PCI bus analyzer hooked up, but that you don't
> have an ICE (or equivalent) watching what the CPU is doing on its side
> of the bridge.  <snip>

I hooked up a standard logic analyzer to some of the PCI signals.
Unfortunately I don't have access to an ICE.

Cheers,

-Paul

-- Cheers,

---- DaveN