From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from dukat.scot.redhat.com (sct@dukat.scot.redhat.com [195.89.149.246]) by kvack.org (8.8.7/8.8.7) with ESMTP id QAA15806 for ; Fri, 28 May 1999 16:46:17 -0400 From: "Stephen C. Tweedie" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14159.137.169623.500547@dukat.scot.redhat.com> Date: Fri, 28 May 1999 21:46:01 +0100 (BST) Subject: Re: Q: PAGE_CACHE_SIZE? In-Reply-To: References: Sender: owner-linux-mm@kvack.org To: Alan Cox Cc: Rik van Riel , ak@muc.de, ebiederm+eric@ccr.net, linux-kernel@vger.rutgers.edu, linux-mm@kvack.org List-ID: Hi, On Thu, 27 May 1999 23:06:48 +0100 (BST), Alan Cox said: >> A larger page size is no compensation for the lack of a decent >> read-{ahead,back,anywhere} I/O clustering algorithm in the OS. > It isnt compensating for that. If you have 4Gig of memory and a high > performance I/O controller the constant cost per page for VM > management begins to dominate the equation. Its also a win for other > CPU related reasons (reduced tlb misses and the like), and with 4Gig > of RAM the argument is a larger page size isnt a problem. That's still only half of the issue. Remember, there isn't any one block size in the kernel. We can happily use 1k or 2k blocks in the buffer cache. We use 8k chunks for stacks. Nothing is forcing us to use the hardware page size for all of our operations. In short, I think the real answer as far as the VM is concerned is indeed to use clustering, not just for IO (we already do that for mmap paging and for sequential reads), but for pageout too. I really believe that the best unit in which to do pageout is whatever unit we did the pagein in in the first place. This has a lot of really nice properties. If we record sequential accesses when setting up data in the first place, then we can automatically optimise for that when doing the pageout again. For swap, it reduces fragmentation: we can allocate in multi-page chunks and keep that allocation persistent. There are very few places where doing such clustering falls short of what we'd get by increasing the page size. COW is one: retaining per-page COW granularity is an easy way to fragment swap, but increasing the COW chunk size changes the semantics unless we actually export a larger pagesize to user space (because we end up destroying the VM backing store sharing for pages which haven't actually been touched by the user). Finally, there's one other thing worth considering: if we use a larger chunk size for dividing up memory (say, set the buddy heap up in a basic unit of 32k or more), then the slab allocator is actually a very good way of getting page allocations out of that. If we did in fact use the 4k minipage for all kernel get_free_page() allocations as usual, but used the larger 32k buddy heap pages for all VM allocations, then 8K kernel allocations (eg. stack allocations and large NFS packets) become trivial to deal with. The biggest problem we have had with these multi-page allocations up to now is fragmentation in the VM. If we populate the _entire_ VM in multiples of 8k or more then we can never see such fragmentation at all. 8k might actually be a reasonable pagesize even on low memory machines: we found in 2.2 that the increased size of the kernel was compensated for by more efficient swapping so that things still went faster in low memory than under 2.2, and large pages may well have the same tradeoff. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/