
RE: Write Combining difference between 98 and NT

I'll throw out a couple of thoughts:
1. I would concentrate on the timing on the CPU side of things.  I
    believe that the WC flush occurs under two different conditions -- buffer
    full and timeout.  If the timeout is sufficiently short, any delay in
    writing to the WC buffer would result in a flush of a partial buffer.
2. With the above in mind, I would concentrate on what things can cause a
    user-mode program that is writing directly to mapped memory locations
    to run slower on NT.  The first thing I would suspect is NT's vaunted
    security mechanisms.  While 98 will essentially let you party on just
    about any location with abandon, NT will generally trap any direct I/O
    or memory accesses that don't look like they belong to the user-mode
    app.  Depending on the situation, it may allow the access to occur, but
    of course, only after a delay due to the additional processing overhead
    incurred.
3. You didn't say exactly how you arrived at getting a user-mode pointer to
    a physical memory block (I'm assuming) hosted on your PCI adapter under
    NT.  If you didn't use ZwMapViewOfSection() [see the 'memmap' or 'mapmem'
    example (I can't remember which) in the NT DDK], then you probably
    don't have an unencumbered user-mode pointer that is also valid in
    kernel-mode.  Using ZwMapViewOfSection() will let you get such a
    pointer (after all the pre-requisite PhysToVirt(), etc. stuff -- see the
    example, which shows how to get a pointer to video RAM that is valid
    in both kernel and user mode).  You can then use such a pointer to
    party on the memory without any interference from the OS kernel.
    Of course, Microsoft does not recommend using such system interfaces
    (even though they document them and provide example code!) because
    it circumvents the NT security mechanism.  In other words, the presence
    of such a utility on a system could allow someone to exploit the
    memory mapping for nefarious purposes.
4. Unless the call to writeData() is inlined by the compiler/linker, there
    would probably be at least a couple of memory hits outside of your
    WC buffer area (or the CPU's instruction cache), that is, to the system
    stack area.  That may be enough to throw things off in this case.
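
To illustrate point 4, here is a minimal sketch of keeping the write routine
in the same translation unit so the compiler has a chance to inline it.  The
names write_words() and write_region() are hypothetical stand-ins (the real
writeData() is in assembler and was not posted), and "static inline" is only
a hint that the compiler is free to ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for writeData(): stores n 32-bit words to the
 * destination. The destination is volatile so every store is emitted. */
static inline void write_words(volatile uint32_t *dst,
                               const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical caller: feeds the region in chunks of `chunk` words and
 * returns the number of calls made. If write_words() is inlined, no
 * call/return stack traffic occurs between chunks. */
size_t write_region(volatile uint32_t *region, const uint32_t *src,
                    size_t total, size_t chunk)
{
    assert(chunk > 0);
    size_t calls = 0;
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        write_words(region + off, src + off, n);
        calls++;
    }
    return calls;
}
```

Whether inlining actually happened is best confirmed from the generated
assembly.  Independently of inlining, picking a chunk size that is a
multiple of 8 words should avoid leaving a partly filled WC buffer at a call
boundary, assuming the region itself starts on a 32-byte boundary.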
It sounds like you have a PCI bus analyzer hooked up, but that you don't
have an ICE (or equivalent) watching what the CPU is doing on its side
of the bridge.  If you hook up an ICE, and capture the code execution,
you may see a *lot* of extraneous activity on the part of the NT kernel
that you don't see under 98, thus allowing the WC buffer to time out and
flush its contents.
-- DaveN
Dave New, NewD@elcsci.com    | Machine vision?
ESI Vision Products Division |      At least *they* can see the future...
3980 Ranchero Drive          |
Ann Arbor, MI  48108         |        Opinions expressed are mine.    | PGP
(734) 332-7010 VOX           | 08 12 9F AF 5B 3E B2 9B  6F DC 66 5A 41 0B AB
(734) 332-7077 FAX

-----Original Message-----
From: Paul Slade [mailto:Paul@pmis.freeserve.co.uk]
Sent: Saturday, August 12, 2000 5:49 PM
To: pci-sig@znyx.com
Subject: Write Combining difference between 98 and NT

I have a PCI device on which I am writing data to large regions of
consecutive memory locations. The memory region is mapped using the MTRRs
as Write Combined memory by the device driver. The actual writes to memory
are performed by user level code. I have two versions of the driver - one
for Windows 98 and one for Windows NT 4.0. The user application that
actually performs the writes is the same for both OSes.
The application performs the writes in chunks, so each call to a particular
function writes the next N words to the WC memory region.
I am seeing a significant performance difference in the bandwidth achieved
in the two OSes. Further analysis of the writes being performed across the
PCI bus revealed that the bursts occurring across the bus were significantly
different.
Under Windows 98, you see the expected result, with bursts of multiples of 8
words occurring across the PCI bus as the WC buffers get flushed once full.
Under Windows NT, however, bursts do not seem to be generated for WC blocks
that are filled across two calls to the function that writes data to the
region. In other words, if a particular call to the function does not
completely fill a WC buffer, then that buffer gets written out using between
1 and 4 partial writes according to how much data was actually written during
that call. It does not seem to wait until the next call to the function
which would have completed the WC buffer. Any WC buffer that does get
completely filled is written out as an 8 word burst, thus indicating that WC
is actually enabled. Also, consecutive WC flushes never seem to get
amalgamated into single longer bursts of 16, 24, 32, etc. words as is the
case under Windows 98.
This seems to impact upon the performance by around 25%.
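
The two behaviours described above can be reproduced with a toy model. This
is only a sketch under stated assumptions, not a description of the real
hardware: it models a single 8-word WC buffer, evicts it as soon as it is
full, and assumes a partly filled buffer goes out as one partial write per
64-bit chunk (which matches the "between 1 and 4 partial writes" seen on the
analyzer). It does not model the merging of consecutive bursts. The names
wc_simulate() and struct wc_stats are made up for the illustration.

```c
#include <assert.h>

#define WC_LINE_WORDS 8u  /* 8 x 32-bit words = one 32-byte WC buffer */

struct wc_stats {
    unsigned long full_bursts;    /* buffers evicted completely full */
    unsigned long partial_writes; /* transactions from partly filled buffers */
};

/* Evict a buffer that currently holds `filled` words. */
static void wc_evict(struct wc_stats *s, unsigned filled)
{
    assert(filled <= WC_LINE_WORDS);
    if (filled == 0)
        return;
    if (filled == WC_LINE_WORDS)
        s->full_bursts++;
    else
        s->partial_writes += (filled + 1) / 2; /* ASSUMPTION: 64-bit chunks */
}

/* Write `calls` chunks of `words_per_call` consecutive words starting at a
 * buffer-aligned address. If flush_between_calls is nonzero, whatever is in
 * the buffer is evicted at the end of every call (the behaviour seen on NT). */
struct wc_stats wc_simulate(unsigned words_per_call, unsigned calls,
                            int flush_between_calls)
{
    struct wc_stats s = {0, 0};
    unsigned long addr = 0; /* index of the next word to be written */
    unsigned filled = 0;    /* words currently held in the buffer */

    for (unsigned c = 0; c < calls; c++) {
        for (unsigned w = 0; w < words_per_call; w++) {
            filled++;
            addr++;
            if (addr % WC_LINE_WORDS == 0) { /* reached end of the line */
                wc_evict(&s, filled);
                filled = 0;
            }
        }
        if (flush_between_calls) {
            wc_evict(&s, filled);
            filled = 0;
        }
    }
    wc_evict(&s, filled); /* drain whatever is left at the end */
    return s;
}
```

With two calls of 12 words each, the model gives three full bursts when
nothing is flushed between calls, and two full bursts plus four partial
writes when the buffer is flushed at every call boundary.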
In my test code, the calls to the function are generated from a FOR loop
that simply calls the function on each iteration and does nothing else, e.g.:
for (i = 0; i < 1000000; i++)
No data is accessed by the writeData() function that would not already be in
the primary data cache.
The writeData() function is written in assembler and does not contain any
instruction that to my knowledge would force a WC buffer to get flushed. I've
checked the list of events in Intel's documentation and cannot see any that
would occur on each call to the function.
So my question is this - Does anyone know what could be causing the WC
buffers to get flushed between successive calls to the function - and why
would this only occur under Windows NT 4.0 and not under Windows 98? Is
there a way to speed things up under NT?
FYI: I am using a 700MHz Intel PIII processor on a Tyan S1867 Thunder 2500
motherboard. This contains a ServerWorks ServerSet III HE Chipset. I am dual
booting the PC so the test was performed on the same hardware under
both OSes. The speed difference was also noticed on a motherboard using the
Intel 440BX chipset.
Thanks for any help offered.