From: Marcelo Tosatti <marcelo@conectiva.com.br>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: Hugh Dickins <hugh@veritas.com>, linux-mm@kvack.org
Subject: Re: 0-order allocation problem
Date: Wed, 15 Aug 2001 17:55:54 -0300 (BRT) [thread overview]
Message-ID: <Pine.LNX.4.21.0108151747570.26574-100000@freak.distro.conectiva> (raw)
In-Reply-To: <Pine.LNX.4.33.0108151304340.2714-100000@penguin.transmeta.com>
On Wed, 15 Aug 2001, Linus Torvalds wrote:
>
> [ cc'd to linux-mm and Marcelo, as this was kind of interesting ]
>
> On Wed, 15 Aug 2001, Hugh Dickins wrote:
> >
> > Exactly as you predict. A batch of the printks at the usual point,
> > then it recovers and proceeds happily on its way. Same again each
> > time (except first time clean as usual). I should be pleased, but
> > I feel dissatisfied. I guess it's right for create_buffers() to
> > try harder, but I'm surprised it got into that state at all.
> > I'll try to understand it better.
>
> Ok, then I understand the scenario.
>
> This could _possibly_ be triggered by other things than swapoff too, but
> it would probably be much harder. What happens is:
>
> - we have tons of free memory - so much that both inactive_shortage() and
> free_shortage() are happy as clams, and kswapd or anybody else won't
> ever try to balance out the fact that we have unusually low counts of
> inactive data while having a high "inactive_target".
>
> The only strange thing is that _despite_ having tons of memory, we are
> really short on inactive pages, because swapoff() really ate them all
> up.
>
> This part is fine. We're doing the right thing - if we have tons of
> memory, we shouldn't care. I'm just saying that it's unusual to be both
> short on some things and extremely well off on others.
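The masking effect Linus describes can be sketched in a tiny userspace model (the function name mirrors the 2.4 inactive_shortage(), but the arithmetic is illustrative, not the kernel's exact accounting):

```c
/* Illustrative model: free pages count as immediately usable, so a
 * large free count hides a shortage of inactive pages until the free
 * pages are consumed. Returns how many pages short of target we are. */
static int inactive_shortage(int inactive, int free, int target)
{
    int available = inactive + free;
    return (available < target) ? (target - available) : 0;
}
```

With, say, inactive=0 and free=10000 against a target of 2000 this reports no shortage; once those free pages are allocated away, the same inactive count suddenly reports a shortage of nearly the full target, which is exactly the sudden shift described below.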
>
> - Because we have lots of memory, we can easily allocate that free memory
> to user pages etc, and nobody will start checking the VM balance
> because the allocations themselves work out really well and never even
> feel that they have to wake up kswapd. So we quickly deplete the free
> pages that used to hide the imbalance.
>
> Now we're in a situation where we're low on memory, but we're _also_ in
> the unusual situation that we have almost no inactive pages, while at
> the same time having a high inactive target.
>
> So fairly suddenly _everybody_ goes from "oh, we have tons of memory" to
> "uhhuh, we're several thousand pages short of our inactive target".
>
> Now, this is really not much of a problem normally, because normal
> applications will just loop on try_to_free_pages() until they're happy
> again. So for normal allocations, the worst that can happen is that
> because of the sudden shift in balance, we'll get a lot of queue activity.
> Not a big deal - in fact that's exactly what we want.
>
> Not a big deal _except_ for GFP_NOFS (ie buffer) allocations and in
> particular kswapd. Because those are special-cased, and return NULL
> earlier (GFP_NOFS because __GFP_FS isn't set, and kswapd because
> PF_MEMALLOC is set).
>
> Which is _exactly_ why refill_freelist() will do its extra song-and-dance
> number.
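A rough userspace sketch of that special-casing (the flag bits here are made up for illustration; the real checks live in the 2.4 page allocator):

```c
/* Illustrative flag bits, not the real 2.4 values. */
#define GFP_FS_BIT      0x1  /* allocation may re-enter the filesystem */
#define PF_MEMALLOC_BIT 0x2  /* task is already inside reclaim/kswapd  */

/* Model of the decision: may this allocation call into synchronous
 * reclaim (try_to_free_pages), or must it return NULL early? */
static int may_enter_reclaim(unsigned int gfp_mask, unsigned int task_flags)
{
    if (task_flags & PF_MEMALLOC_BIT)
        return 0;   /* kswapd etc.: would recurse into reclaim */
    if (!(gfp_mask & GFP_FS_BIT))
        return 0;   /* GFP_NOFS (buffer) allocation: bail out early */
    return 1;
}
```

Everyone who answers 0 here is exactly the population that ends up yielding and hoping, which is why they all stall together.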
>
> And guess what? create_buffers() for the "struct page" case doesn't do
> that. It just yields and hopes the situation goes away. And as that is the
> thing that we want to use for writing out swap etc, we get the situation
> where one of the most probable yielders in this case is kswapd. And the
> situation never improves, surprise surprise. Most everybody will be in
> PF_MEMALLOC and not make any progress.
>
> This is why when you do the full song-and-dance in the create_buffers()
> case too, the problem just goes away. Instead of waiting for things to
> improve, we will actively try to improve them, and sure as hell, we have
> lots of pages that we can evict if we just try. So instead of getting a
> comatose machine, you get one that says a few times "I had trouble getting
> memory", and then it continues happily.
>
> Case solved.
>
> Moral of the story: don't just hope things will improve. Do something
> about it.
>
> Other moral of the story: this "let's hope things improve" problem was
> probably hidden by previously having refill_inactive() scan forever until
> it hit its target. Or rather - I suspect that code was written exactly
> because Rik or somebody _did_ hit this, and made refill_inactive() work
> that way to make up for the simple fact that fs/buffer.c was broken.
>
> And finally: It's not a good idea to try to make the VM make up for broken
> kernel code.
>
> Btw, the whole comment around the fs/buffer.c braindamage is telling:
>
> /* We're _really_ low on memory. Now we just
> * wait for old buffer heads to become free due to
> * finishing IO. Since this is an async request and
> * the reserve list is empty, we're sure there are
> * async buffer heads in use.
> */
> run_task_queue(&tq_disk);
>
> current->policy |= SCHED_YIELD;
> __set_current_state(TASK_RUNNING);
> schedule();
> goto try_again;
>
> It used to be correct, say about a few years ago. It's simply not true any
> more: yes, we obviously have async buffer heads in use, but they don't
> just free up when IO completes. They are the buffer heads that we've
> allocated to a "struct page" in order to push it out - and they'll be
> free'd only by page_launder(). Not by IO completion.
>
> In short: we do have freeable memory. But it won't just come back to us.
>
> So I'd suggest:
> - the one I already suggested: instead of just yielding, do the same
> thing refill_freelist() does.
> - also apply the one-liner patch which Marcelo already suggested some
> time ago, to just make 0-order allocations of GFP_NOFS loop inside the
> memory allocator until happy, because they _will_ eventually make
> progress.
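Marcelo's one-liner can be modelled in userspace like this (names and the "IO frees a page" step are illustrative; the real change makes the allocator loop for order-0 allocations instead of returning NULL):

```c
/* Model: an order-0 allocation keeps retrying inside the allocator,
 * because in-flight writeout eventually frees pages. Returns the
 * number of attempts, or -1 if the allocation gives up. */
static int alloc_attempts(int order, int *free_pages)
{
    int tries = 0;
    for (;;) {
        tries++;
        if (*free_pages > 0) {
            (*free_pages)--;        /* success */
            return tries;
        }
        if (order > 0)
            return -1;              /* higher orders may still fail */
        (*free_pages)++;            /* model: pending writeout completes */
    }
}
```

The point of looping only at order 0 is that a single page is guaranteed to become available once some writeout finishes, whereas a higher-order allocation can starve forever waiting for contiguous pages.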
>
> (The one-liner in itself will probably already help us balance things much
> faster and make it harder to hit the problem spot - but the "don't just
> yield" thing is probably worth it anyway because when you get into this
> situation many page allocators tend to be of the PF_MEMALLOC type, and
> they will want to avoid recursion in try_to_free_pages() and will not
> trigger the one-liner)
>
> So something like the appended (UNTESTED!) should be better.
__GFP_IO is not going to help us that much on anon-intensive workloads
(eg swapoff). Remember we are _never_ going to block on the buffer_heads
of in-flight swap pages, because we can't see them in page_launder():
if a page is locked, we simply skip it.
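The skip being described can be sketched like so (a toy model; the real page_launder() in 2.4's mm/vmscan.c does much more):

```c
struct toy_page {
    int locked;        /* IO in flight */
    int has_buffers;   /* buffer heads attached */
};

/* Toy model: a launder pass frees buffer heads from unlocked pages
 * only; pages locked for in-flight swap IO are simply skipped, so
 * their buffer heads never come back via this path. */
static int launder_pass(struct toy_page *pages, int n)
{
    int i, freed = 0;
    for (i = 0; i < n; i++) {
        if (pages[i].locked)
            continue;              /* can't see it: skip */
        if (pages[i].has_buffers) {
            pages[i].has_buffers = 0;
            freed++;
        }
    }
    return freed;
}
```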
Hugh, could you check which kind of allocation is failing and from where?
(allocation flags, etc).
> How does it work for you?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/