From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 15 Aug 2001 17:55:54 -0300 (BRT)
From: Marcelo Tosatti
Subject: Re: 0-order allocation problem
In-Reply-To: 
Message-ID: 
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: Linus Torvalds
Cc: Hugh Dickins , linux-mm@kvack.org
List-ID: 

On Wed, 15 Aug 2001, Linus Torvalds wrote:

> 
> [ cc'd to linux-mm and Marcelo, as this was kind of interesting ]
> 
> On Wed, 15 Aug 2001, Hugh Dickins wrote:
> >
> > Exactly as you predict.  A batch of the printks at the usual point,
> > then it recovers and proceeds happily on its way.  Same again each
> > time (except first time clean as usual).  I should be pleased, but
> > I feel dissatisfied.  I guess it's right for create_buffers() to
> > try harder, but I'm surprised it got into that state at all.
> > I'll try to understand it better.
> 
> Ok, then I understand the scenario.
> 
> This could _possibly_ be triggered by other things than swapoff too, but
> it would probably be much harder. What happens is:
> 
>  - we have tons of free memory - so much that both inactive_shortage()
>    and free_shortage() are happy as clams, and kswapd or anybody else
>    won't ever try to balance out the fact that we have unusually low
>    counts of inactive data while having a high "inactive_target".
> 
>    The only strange thing is that _despite_ having tons of memory, we
>    are really short on inactive pages, because swapoff() really ate
>    them all up.
> 
>    This part is fine. We're doing the right thing - if we have tons of
>    memory, we shouldn't care. I'm just saying that it's unusual to be
>    both short on some things and extremely well off on others.
> 
>  - Because we have lots of memory, we can easily allocate that free
>    memory to user pages etc, and nobody will start checking the VM
>    balance because the allocations themselves work out really well and
>    never even feel that they have to wake up kswapd.
>    So we quickly deplete the free pages that used to hide the
>    imbalance.
> 
> Now we're in a situation where we're low on memory, but we're _also_ in
> the unusual situation that we have almost no inactive pages, while at
> the same time having a high inactive target.
> 
> So fairly suddenly _everybody_ goes from "oh, we have tons of memory"
> to "uhhuh, we're several thousand pages short of our inactive target".
> 
> Now, this is really not much of a problem normally, because normal
> applications will just loop on try_to_free_pages() until they're happy
> again. So for normal allocations, the worst that can happen is that
> because of the sudden shift in balance, we'll get a lot of queue
> activity. Not a big deal - in fact that's exactly what we want.
> 
> Not a big deal _except_ for GFP_NOFS (ie buffer) allocations and in
> particular kswapd. Because those are special-cased, and return NULL
> earlier (GFP_NOFS because __GFP_FS isn't set, and kswapd because
> PF_MEMALLOC is set).
> 
> Which is _exactly_ why refill_freelist() will do its extra
> song-and-dance number.
> 
> And guess what? create_buffers() for the "struct page" case doesn't do
> that. It just yields and hopes the situation goes away. And as that is
> the thing that we want to use for writing out swap etc, we get the
> situation where one of the most probable yielders in this case is
> kswapd. And the situation never improves, surprise surprise. Most
> everybody will be in PF_MEMALLOC and not make any progress.
> 
> This is why when you do the full song-and-dance in the create_buffers()
> case too, the problem just goes away. Instead of waiting for things to
> improve, we will actively try to improve them, and sure as hell, we
> have lots of pages that we can evict if we just try. So instead of
> getting a comatose machine, you get one that says a few times "I had
> trouble getting memory", and then it continues happily.
> 
> Case solved.
> 
> Moral of the story: don't just hope things will improve.
> Do something about it.
> 
> Other moral of the story: this "let's hope things improve" problem was
> probably hidden by previously having refill_inactive() scan forever
> until it hit its target. Or rather - I suspect that code was written
> exactly because Rik or somebody _did_ hit this, and made
> refill_inactive() work that way to make up for the simple fact that
> fs/buffer.c was broken.
> 
> And finally: It's not a good idea to try to make the VM make up for
> broken kernel code.
> 
> Btw, the whole comment around the fs/buffer.c braindamage is telling:
> 
>                 /* We're _really_ low on memory. Now we just
>                  * wait for old buffer heads to become free due to
>                  * finishing IO. Since this is an async request and
>                  * the reserve list is empty, we're sure there are
>                  * async buffer heads in use.
>                  */
>                 run_task_queue(&tq_disk);
> 
>                 current->policy |= SCHED_YIELD;
>                 __set_current_state(TASK_RUNNING);
>                 schedule();
>                 goto try_again;
> 
> It used to be correct, say about a few years ago. It's simply not true
> any more: yes, we obviously have async buffer heads in use, but they
> don't just free up when IO completes. They are the buffer heads that
> we've allocated to a "struct page" in order to push it out - and
> they'll be free'd only by page_launder(). Not by IO completion.
> 
> In short: we do have freeable memory. But it won't just come back to
> us.
> 
> So I'd suggest:
>  - the one I already suggested: instead of just yielding, do the same
>    thing refill_freelist() does.
>  - also apply the one-liner patch which Marcelo already suggested some
>    time ago, to just make 0-order allocations of GFP_NOFS loop inside
>    the memory allocator until happy, because they _will_ eventually
>    make progress.
> 
> (The one-liner in itself will probably already help us balance things
> much faster and make it harder to hit the problem spot - but the "don't
> just yield" thing is probably worth it anyway, because when you get
> into this situation many page allocators tend to be of the PF_MEMALLOC
> type, and they will want to avoid recursion in try_to_free_pages() and
> will not trigger the one-liner.)
> 
> So something like the appended (UNTESTED!) should be better.

__GFP_IO is not going to help us that much on anon-intensive workloads
(eg swapoff). Remember we are _never_ going to block on buffer_heads of
in-flight swap pages, because we can't see them in page_launder(). (If
a page is locked, we simply skip it.)

Hugh, could you check which kind of allocation is failing and from
where? (allocation flags, etc)

> How does it work for you?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux-mm.org/