From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <3D740C35.9E190D04@zip.com.au>
Date: Mon, 02 Sep 2002 18:11:17 -0700
From: Andrew Morton
MIME-Version: 1.0
Subject: Re: About the free page pool
References: <3D73CB28.D2F7C7B0@zip.com.au> <218D9232-BEBF-11D6-A3BE-000393829FA4@cs.amherst.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Scott Kaplan
Cc: linux-mm@kvack.org
List-ID:

Scott Kaplan wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Monday, September 2, 2002, at 04:33 PM, Andrew Morton wrote:
>
> > Scott Kaplan wrote:
> >> How important is it to maintain a list of free pages? That is, how
> >> critical is it that there be some pool of free pages from which the only
> >> bookkeeping required is the removal of that page from the free list.
> >
> > There are several reasons, all messy.
> >
> > - We need to be able to allocate pages at interrupt time. Mainly
> >   for networking receive.
>
> Okay, this actually seems pretty important, and I suspected that it would
> be a critical issue. I suppose interrupts really do need to be as quick
> as possible, so doing the reclamation work during non-interrupt times is a
> good trade off. That's a sufficient argument for me.
>
> > - We sometimes need to allocate memory from *within* the context of
> >   page reclaim: find a dirty page on the LRU, need to write it out,
> >   need to allocate some memory to start the IO. Where does that
> >   memory come from.
>
> That part could be handled without too much trouble, I believe. If we're
> ensuring that some trailing portion of the inactive list is clean and
> ready for reclamation, then when the situation above arises, just allocate
> space by taking it from the end of the inactive list. There should be no
> problem in doing that.

Yes. But there are the latency issues as well. We'll have a cpu-local
pool of pages with which to satisfy most of these allocations anyway,
I guess.
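As a rough illustration of the cpu-local pool idea, here is a toy
userspace sketch in C. It is not kernel code: the names (struct
cpu_pool, alloc_page_on, free_page_on) and the constants NR_CPUS,
POOL_MAX and BATCH are all hypothetical, and locking is left out
entirely. The point is only that the common allocation is a pop from
a small per-CPU list, and the shared free list is touched in batches.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS  2   /* hypothetical: number of CPUs being modelled */
#define POOL_MAX 4   /* hypothetical: most pages one CPU may hold locally */
#define BATCH    2   /* hypothetical: pages pulled per refill */
#define NR_PAGES 8   /* hypothetical: total pages in the toy "zone" */

struct page { struct page *next; };

static struct page pages[NR_PAGES];
static struct page *global_free;   /* stands in for the shared free list */
static int global_count;

struct cpu_pool { struct page *head; int count; };
static struct cpu_pool pools[NR_CPUS];

/* Put every page on the shared list and empty the per-CPU pools. */
static void pool_init(void)
{
    global_free = NULL;
    global_count = 0;
    for (int i = 0; i < NR_CPUS; i++) {
        pools[i].head = NULL;
        pools[i].count = 0;
    }
    for (int i = 0; i < NR_PAGES; i++) {
        pages[i].next = global_free;
        global_free = &pages[i];
        global_count++;
    }
}

/* Fast path: pop from the local pool. Slow path: refill a batch first. */
static struct page *alloc_page_on(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    struct cpu_pool *p = &pools[cpu];

    if (p->count == 0) {
        /* A real allocator would take a shared lock around this loop. */
        for (int i = 0; i < BATCH && global_free != NULL; i++) {
            struct page *pg = global_free;
            global_free = pg->next;
            global_count--;
            pg->next = p->head;
            p->head = pg;
            p->count++;
        }
    }
    if (p->count == 0)
        return NULL;            /* nothing free anywhere */

    struct page *pg = p->head;
    p->head = pg->next;
    p->count--;
    pg->next = NULL;
    return pg;
}

/* Free to the local pool; spill to the shared list once it is full. */
static void free_page_on(int cpu, struct page *pg)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    struct cpu_pool *p = &pools[cpu];

    if (p->count >= POOL_MAX) {
        pg->next = global_free;
        global_free = pg;
        global_count++;
        return;
    }
    pg->next = p->head;
    p->head = pg;
    p->count++;
}
```

In this sketch only the refill and spill paths touch the shared list,
which is where the latency saving would come from.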

> > - The kernel frequently needs to perform higher-order allocations:
> >   two or more physically-contiguous pages. The way we agglomerate
> >   0-order pages into higher-order pages is by coalescing them in the
> >   buddy. If _all_ "free" pages are out on an LRU somewhere, we don't
> >   have a higher-order pool to draw from.
>
> What is the current approach to this problem? Does the buddy allocator
> interact with the existing VM replacement policy so that, at times, the
> page occupying some particular page frame will be evicted not because it's
> the LRU page, but rather because its page frame is physically adjacent to
> some other free page? In other words, I see the need to allocate
> physically contiguous groups of pages, and that the buddy allocator is
> used for that purpose, but what influence does the buddy allocator have to
> ensure that it can fulfill those higher-order allocations?

The current approach is guess-and-giggle. It seems to work out that
there are enough physically contig pages for it to work. The most
important are 1-order allocations (8k, for kernel stacks). The memory
allocator will retry these allocations indefinitely, so they end up
succeeding, somehow.

I think there's a bug in there, actually. If all zones have enough
free memory but there are no 1-order pages available, then the 1-order
allocator tries to run page reclaim, which will say "nope, nothing needs
doing". Eventually, someone else returns some memory and coalescing
happens.

It's not a very glorious part of the kernel design.

> > It's a ratio of the zone size, and there are a few thresholds in there,
> > for hysteresis, for emergency allocations, etc. See free_area_init_core()
>
> I took a look, and if I'm calculating things correctly, pages_high seems
> to be set so that the free list is at most about 0.8% of the total number
> of pages in the zone. For larger memories (above about 128 MB), that
> percentage decreases.
> So we're keeping a modest pool of a few hundred
> pages -- not too big a deal.

Free memory seems to bottom out at about 2.2M on a 2.5G machine.

Note that the kernel statically allocates about 10M when it boots. This
is basically a bug, and fixing it is a matter of running around shouting
at people. This will happen ;) This is the low-hanging fruit.

> [From a later email:]
> > Well, I'm at a bit of a loss to understand what the objective
> > of all this is. Is it so that we can effectively increase the
> > cache size, by not "wasting" all that free memory?
>
> While I suppose it would be to keep those few hundred pages mapped and
> re-usable by the VM system, it would only make a difference in the miss
> rate under very tense and unlikely circumstances. A few pages can make a
> big difference in the miss rate, but only if those few pages would allow
> the replacement policy to *just barely* keep the pages cached for long
> enough before they are referenced again.

See 10M, above.

> My goal was a different one: I just wanted some further simplification of
> the replacement mechanism. When a free page is allocated, it gets mapped
> into some address space and inserted into the active list (right?).

Inactive, initially. It changes with the vm-of-the-minute though.

> If we
> wanted the active and inactive lists to remain a constant size (and for
> the movement of pages through those lists to be really simple), we could
> immediately evict a page from the active list into the inactive list, and
> then evict some other page from the inactive list to the free list. If we
> did that, though, the use of a free list would be superfluous.
>
> Since the approach I'm describing performs the VM bookkeeping during
> allocation (and, thus, potentially, interrupt) time, it would be a poor
> choice. Evictions from the active and inactive lists must be performed at
> some other time. Doing so is a tad more complicated, and makes the
> behavior of the replacement policy harder to model.
> It seems, however,
> that to keep allocation fast, that bit of added complexity is necessary.
>

Well, we never evict from the active list - just from the tail of the
inactive list. But yes.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/