:Hello!
:
:I'm writing a message-digest utility, which operates on a file and
:can use either stdio:
:
:	while (not eof) {
:		char buffer[BUFSIZE];
:		size = read(.... buffer ...);
:		process(buffer, size);
:	}
:
:or mmap:
:
:	buffer = mmap(... file_size, PROT_READ ...);
:	process(buffer, file_size);
:
:I expected the second way to be faster, as it is supposed to avoid
:one memory copy (no user-space buffer). But in reality, on a
:CPU-bound (rather than IO-bound) machine, using mmap() is considerably
:slower. Here are the tcsh's time results:

    read() is likely going to be faster because it does not involve any
    page fault overhead.  The VM system only faults 16 or so pages ahead,
    which is only 64KB, so the fault overhead is very high for the data
    rate.

    Why does the extra copy not matter?  Well, it's fairly simple,
    actually.  It's because your buffer is smaller than the L1 cache,
    and/or also simply because the VM fault overhead is higher than the
    cost of copying an extra 64KB.

    read() loops typically use buffer sizes in the 8K-64K range.  L1
    caches are typically 16K (for celeron class cpus) through 64K, or
    more for higher end cpus.  L2 caches are typically 256K-1MB, or more.
    The copy bandwidth from or to the L1 cache is usually around 10x
    faster than main memory, and the copy bandwidth from or to the L2
    cache is usually around 4x faster.  (Note that I'm talking copy
    bandwidth here, not random access.  The L1 cache is ~50x faster or
    more for random access.)

    So the cost of the extra copy in a read() loop using a reasonable
    buffer size (~8K-64K, i.e. L1 or L2 access) is virtually nil compared
    to the cost of accessing the kernel's buffer cache (which involves
    main memory accesses for files > L2 cache).
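
    For anyone who wants to try the two approaches side by side, here is
    a minimal C sketch, not the original poster's code.  The names
    digest_read() and digest_mmap() are made up for illustration,
    process() is a stand-in byte sum rather than a real digest, and the
    64K BUFSIZE is just an assumed value from the range discussed above.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BUFSIZE (64 * 1024)	/* assumed chunk size, within the 8K-64K range */

/* Hypothetical stand-in for the digest's process(): a running byte sum. */
static uint64_t process(uint64_t sum, const unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sum += p[i];
	return sum;
}

/* read() loop: the kernel copies each chunk into one small, reused buffer. */
int digest_read(const char *path, uint64_t *out)
{
	static unsigned char buffer[BUFSIZE];
	uint64_t sum = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buffer, sizeof(buffer))) > 0)
		sum = process(sum, buffer, (size_t)n);
	close(fd);
	if (n < 0)
		return -1;
	*out = sum;
	return 0;
}

/* mmap(): no copy, but pages are faulted in as process() touches them. */
int digest_mmap(const char *path, uint64_t *out)
{
	struct stat st;
	uint64_t sum = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size > 0) {	/* mmap() rejects a zero length */
		void *map = mmap(NULL, (size_t)st.st_size, PROT_READ,
		    MAP_PRIVATE, fd, 0);
		if (map == MAP_FAILED) {
			close(fd);
			return -1;
		}
		sum = process(sum, map, (size_t)st.st_size);
		munmap(map, (size_t)st.st_size);
	}
	close(fd);
	*out = sum;
	return 0;
}
```

    Both functions return 0 on success and store the same sum for the
    same file, so they can be dropped into a small main() and compared
    under time(1).  The page fault count reported for the mmap variant is
    where the overhead described above should show up.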

:On the IO-bound machine, using mmap is only marginally faster:
:
:	Single Pentium4M (Centrino 1GHz) running recent -current:
:	--------------------------------------------------------
:stdio: 27.195u 8.280s 1:33.02 38.1% 10+169k 11221+0io 1pf+0w
:mmap:  26.619u 3.004s 1:23.59 35.4% 10+169k 47+0io 19463pf+0w

    Yes, because it's I/O bound.  As long as the kernel queues some
    readahead to the device it can burn those cpu cycles on whatever it
    wants without really affecting the transfer rate.

:Is this how things are supposed to be, or will mmap() become more
:efficient eventually? Thanks!
:
:	-mi

    It's hard to say.  mmap() could certainly be made more efficient,
    e.g. by faulting in more pages at a time to reduce the actual fault
    rate.  But it's fairly difficult to beat a read copy into a small
    buffer.

					-Matt
					Matthew Dillon
					<dillon_at_backplane.com>

Received on Sun Jun 20 2004 - 16:35:19 UTC