Date: Wed, 15 Aug 2001 13:45:09 -0700 (PDT)
From: Linus Torvalds
Subject: Re: 0-order allocation problem
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Hugh Dickins
Cc: Marcelo Tosatti, linux-mm@kvack.org

[ cc'd to linux-mm and Marcelo, as this was kind of interesting ]

On Wed, 15 Aug 2001, Hugh Dickins wrote:
>
> Exactly as you predict.  A batch of the printks at the usual point,
> then it recovers and proceeds happily on its way.  Same again each
> time (except first time clean as usual).  I should be pleased, but
> I feel dissatisfied.  I guess it's right for create_buffers() to
> try harder, but I'm surprised it got into that state at all.
> I'll try to understand it better.

Ok, then I understand the scenario. This could _possibly_ be triggered
by other things than swapoff too, but it would probably be much harder.

What happens is:

 - we have tons of free memory - so much that both inactive_shortage()
   and free_shortage() are happy as clams, and kswapd or anybody else
   won't ever try to balance out the fact that we have unusually low
   counts of inactive data while having a high "inactive_target".

   The only strange thing is that _despite_ having tons of memory, we
   are really short on inactive pages, because swapoff() really ate
   them all up.

   This part is fine. We're doing the right thing - if we have tons of
   memory, we shouldn't care. I'm just saying that it's unusual to be
   both short on some things and extremely well off on others.

 - Because we have lots of memory, we can easily allocate that free
   memory to user pages etc, and nobody will start checking the VM
   balance, because the allocations themselves work out really well and
   never even feel that they have to wake up kswapd.

   So we quickly deplete the free pages that used to hide the
   imbalance. Now we're in a situation where we're low on memory, but
   we're _also_ in the unusual situation that we have almost no
   inactive pages, while at the same time having a high inactive
   target.

   So fairly suddenly _everybody_ goes from "oh, we have tons of
   memory" to "uhhuh, we're several thousand pages short of our
   inactive target".

Now, this is really not much of a problem normally, because normal
applications will just loop on try_to_free_pages() until they're happy
again. So for normal allocations, the worst that can happen is that
because of the sudden shift in balance, we'll get a lot of queue
activity. Not a big deal - in fact that's exactly what we want.

Not a big deal _except_ for GFP_NOFS (ie buffer) allocations and in
particular kswapd. Because those are special-cased, and return NULL
earlier (GFP_NOFS because __GFP_FS isn't set, and kswapd because
PF_MEMALLOC is set).

Which is _exactly_ why refill_freelist() will do its extra
song-and-dance number.

And guess what? create_buffers() for the "struct page" case doesn't do
that. It just yields and hopes the situation goes away. And as that is
the thing that we want to use for writing out swap etc, we get the
situation where one of the most probable yielders in this case is
kswapd. And the situation never improves, surprise surprise. Most
everybody will be in PF_MEMALLOC and not make any progress.

This is why when you do the full song-and-dance in the create_buffers()
case too, the problem just goes away. Instead of waiting for things to
improve, we will actively try to improve them, and sure as hell, we
have lots of pages that we can evict if we just try.
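[ Roughly, that fall-through looks like the sketch below. This is a
  simplified illustration, not the literal 2.4 mm/page_alloc.c source:
  alloc_slow_path() and try_direct_allocation() are placeholder names
  standing in for the real __alloc_pages() internals. It only shows why
  GFP_NOFS and PF_MEMALLOC allocations come back NULL early while
  ordinary GFP_KERNEL allocations keep looping on try_to_free_pages(). ]

	/* Sketch only - placeholder names, not real kernel functions. */
	static struct page *alloc_slow_path(unsigned int gfp_mask,
					    unsigned int order)
	{
		for (;;) {
			struct page *page;

			/* scan the zone freelists (placeholder name) */
			page = try_direct_allocation(gfp_mask, order);
			if (page)
				return page;

			/*
			 * kswapd and other PF_MEMALLOC tasks must not
			 * recurse into try_to_free_pages(), so they just
			 * get a NULL back here.
			 */
			if (current->flags & PF_MEMALLOC)
				return NULL;

			if (gfp_mask & __GFP_WAIT) {
				if (!order || free_shortage()) {
					int progress = try_to_free_pages(gfp_mask);

					/*
					 * GFP_KERNEL (__GFP_FS set) retries
					 * even when no progress was made;
					 * GFP_NOFS gives up and returns NULL
					 * as soon as progress stops.
					 */
					if (progress || (gfp_mask & __GFP_FS))
						continue;
				}
			}
			return NULL;
		}
	}

[ The one-liner mentioned below just widens that last test from __GFP_FS
  to __GFP_IO, so that 0-order GFP_NOFS allocations take the retry branch
  too instead of returning NULL. ]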
So instead of getting a comatose machine, you get one that says a few
times "I had trouble getting memory", and then it continues happily.

Case solved.

Moral of the story: don't just hope things will improve. Do something
about it.

Other moral of the story: this "let's hope things improve" problem was
probably hidden by previously having refill_inactive() scan forever
until it hit its target. Or rather - I suspect that code was written
exactly because Rik or somebody _did_ hit this, and made
refill_inactive() work that way to make up for the simple fact that
fs/buffer.c was broken.

And finally: It's not a good idea to try to make the VM make up for
broken kernel code.

Btw, the whole comment around the fs/buffer.c braindamage is telling:

		/* We're _really_ low on memory. Now we just
		 * wait for old buffer heads to become free due to
		 * finishing IO. Since this is an async request and
		 * the reserve list is empty, we're sure there are
		 * async buffer heads in use.
		 */
		run_task_queue(&tq_disk);

		current->policy |= SCHED_YIELD;
		__set_current_state(TASK_RUNNING);
		schedule();
		goto try_again;

It used to be correct, say, a few years ago. It's simply not true any
more: yes, we obviously have async buffer heads in use, but they don't
just free up when IO completes. They are the buffer heads that we've
allocated to a "struct page" in order to push it out - and they'll be
free'd only by page_launder(). Not by IO completion.

In short: we do have freeable memory. But it won't just come back to
us.

So I'd suggest:

 - the one I already suggested: instead of just yielding, do the same
   thing refill_freelist() does.

 - also apply the one-liner patch which Marcelo already suggested some
   time ago, to just make 0-order allocations of GFP_NOFS loop inside
   the memory allocator until happy, because they _will_ eventually
   make progress.

   (The one-liner in itself will probably already help us balance
   things much faster and make it harder to hit the problem spot - but
   the "don't just yield" thing is probably worth it anyway, because
   when you get into this situation many page allocators tend to be of
   the PF_MEMALLOC type, and they will want to avoid recursion in
   try_to_free_pages() and will not trigger the one-liner.)

So something like the appended (UNTESTED!) should be better. How does
it work for you?
		Linus

-----

diff -u --recursive --new-file pre4/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- pre4/linux/mm/page_alloc.c	Wed Aug 15 02:39:44 2001
+++ linux/mm/page_alloc.c	Wed Aug 15 13:35:02 2001
@@ -450,7 +450,7 @@
 	if (gfp_mask & __GFP_WAIT) {
 		if (!order || free_shortage()) {
 			int progress = try_to_free_pages(gfp_mask);
-			if (progress || (gfp_mask & __GFP_FS))
+			if (progress || (gfp_mask & __GFP_IO))
 				goto try_again;
 			/*
 			 * Fail in case no progress was made and the
diff -u --recursive --new-file pre4/linux/mm/vmscan.c linux/mm/vmscan.c
--- pre4/linux/mm/vmscan.c	Wed Aug 15 02:39:44 2001
+++ linux/mm/vmscan.c	Wed Aug 15 02:37:07 2001
@@ -788,6 +788,9 @@
 		zone_t *zone = pgdat->node_zones + i;
 		unsigned int inactive;
 
+		if (!zone->size)
+			continue;
+
 		inactive = zone->inactive_dirty_pages;
 		inactive += zone->inactive_clean_pages;
 		inactive += zone->free_pages;
diff -u --recursive --new-file pre4/linux/fs/buffer.c linux/fs/buffer.c
--- pre4/linux/fs/buffer.c	Wed Aug 15 02:39:41 2001
+++ linux/fs/buffer.c	Wed Aug 15 13:37:35 2001
@@ -794,6 +794,17 @@
 	goto retry;
 }
 
+static void free_more_memory(void)
+{
+	balance_dirty(NODEV);
+	page_launder(GFP_NOFS, 0);
+	wakeup_bdflush();
+	wakeup_kswapd();
+	current->policy |= SCHED_YIELD;
+	__set_current_state(TASK_RUNNING);
+	schedule();
+}
+
 /*
  * We used to try various strange things. Let's not.
  * We'll just try to balance dirty buffers, and possibly
@@ -802,15 +813,8 @@
  */
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
-		balance_dirty(NODEV);
-		page_launder(GFP_NOFS, 0);
-		wakeup_bdflush();
-		wakeup_kswapd();
-		current->policy |= SCHED_YIELD;
-		__set_current_state(TASK_RUNNING);
-		schedule();
-	}
+	if (!grow_buffers(size))
+		free_more_memory();
 }
 
 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1408,9 +1412,7 @@
 	 */
 	run_task_queue(&tq_disk);
 
-	current->policy |= SCHED_YIELD;
-	__set_current_state(TASK_RUNNING);
-	schedule();
+	free_more_memory();
 	goto try_again;
 }
 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/