From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: vedran.furac@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, kosaki.motohiro@jp.fujitsu.com,
minchan.kim@gmail.com, akpm@linux-foundation.org,
rientjes@google.com, aarcange@redhat.com
Subject: Re: Memory overcommit
Date: Wed, 28 Oct 2009 09:43:43 +0900 [thread overview]
Message-ID: <20091028094343.d59fec94.kamezawa.hiroyu@jp.fujitsu.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0910271843510.11372@sister.anvils>
On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice value of mmap(1G).
> > Of course, gnome-session only uses 6M bytes of anon.
> > I wonder this is because gnome-session has many children..but need to
> > dig more. Does anyone has idea ?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but fail to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for detailed information.
> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
ok, then, easy handling can't be a help.
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree that part of heuristics, either.
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.
> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there, by trying to factor children into the
> calculation. Intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>
Sorry, I usually don't use X on servers and almost all recent my OOM test
was done under memcg ;(
Thank you for your investigation. Maybe I'll need several steps.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-10-28 0:57 UTC|newest]
Thread overview: 77+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <hav57c$rso$1@ger.gmane.org>
[not found] ` <20091013120840.a844052d.kamezawa.hiroyu@jp.fujitsu.com>
[not found] ` <hb2cfu$r08$2@ger.gmane.org>
[not found] ` <20091014135119.e1baa07f.kamezawa.hiroyu@jp.fujitsu.com>
2009-10-20 21:52 ` Vedran Furač
2009-10-26 1:55 ` KAMEZAWA Hiroyuki
2009-10-26 16:16 ` Vedran Furač
2009-10-27 3:22 ` KAMEZAWA Hiroyuki
2009-10-27 6:10 ` KOSAKI Motohiro
2009-10-27 6:34 ` Minchan Kim
2009-10-27 6:36 ` KAMEZAWA Hiroyuki
2009-10-27 6:55 ` Minchan Kim
2009-10-27 7:45 ` [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: " KAMEZAWA Hiroyuki
2009-10-27 7:56 ` Minchan Kim
2009-10-27 12:38 ` Andrea Arcangeli
2009-10-28 0:22 ` KAMEZAWA Hiroyuki
2009-10-28 0:45 ` Vedran Furač
2009-10-27 7:56 ` KAMEZAWA Hiroyuki
2009-10-27 8:14 ` Minchan Kim
2009-10-27 8:33 ` KAMEZAWA Hiroyuki
2009-10-27 8:52 ` Minchan Kim
2009-10-27 8:56 ` KAMEZAWA Hiroyuki
2009-10-27 17:41 ` Vedran Furač
2009-10-28 0:13 ` KAMEZAWA Hiroyuki
2009-10-27 18:39 ` Hugh Dickins
2009-10-27 18:47 ` Andrea Arcangeli
2009-10-28 0:32 ` KAMEZAWA Hiroyuki
2009-11-05 19:02 ` Pavel Machek
2009-10-28 0:28 ` KAMEZAWA Hiroyuki
2009-10-27 6:46 ` KOSAKI Motohiro
2009-10-27 6:56 ` Minchan Kim
2009-10-27 17:12 ` Vedran Furač
2009-10-27 18:02 ` KOSAKI Motohiro
2009-10-27 18:30 ` Vedran Furač
2009-10-27 20:44 ` Hugh Dickins
2009-10-27 21:04 ` David Rientjes
2009-10-28 0:08 ` Vedran Furač
2009-10-28 0:25 ` David Rientjes
2009-10-28 0:39 ` Vedran Furač
2009-10-28 4:08 ` David Rientjes
2009-10-28 4:55 ` KAMEZAWA Hiroyuki
2009-10-28 5:13 ` David Rientjes
2009-10-28 6:05 ` KAMEZAWA Hiroyuki
2009-10-28 6:17 ` David Rientjes
2009-10-28 6:20 ` KAMEZAWA Hiroyuki
2009-10-29 8:38 ` David Rientjes
2009-10-29 11:11 ` Vedran Furač
2009-10-29 19:53 ` David Rientjes
2009-10-29 23:48 ` KAMEZAWA Hiroyuki
2009-10-30 9:10 ` David Rientjes
2009-10-30 9:36 ` KAMEZAWA Hiroyuki
2009-11-03 20:49 ` David Rientjes
2009-11-04 0:50 ` KAMEZAWA Hiroyuki
2009-11-04 1:58 ` David Rientjes
2009-11-04 2:17 ` KAMEZAWA Hiroyuki
2009-11-04 3:10 ` David Rientjes
2009-11-04 3:19 ` KAMEZAWA Hiroyuki
2009-10-30 13:59 ` Vedran Furač
2009-10-30 19:24 ` David Rientjes
2009-11-02 19:58 ` Vedran Furač
2009-10-28 13:28 ` Vedran Furač
2009-10-28 20:10 ` David Rientjes
2009-10-29 3:05 ` Vedran Furač
2009-10-29 8:35 ` David Rientjes
2009-10-29 11:01 ` Vedran Furač
2009-10-29 19:42 ` David Rientjes
2009-10-30 13:53 ` Vedran Furač
2009-10-30 14:08 ` Thomas Fjellstrom
2009-10-30 15:13 ` Vedran Furač
2009-10-30 14:12 ` Andrea Arcangeli
2009-10-30 14:41 ` Vedran Furač
2009-10-30 15:15 ` Andrea Arcangeli
2009-10-30 16:24 ` Hugh Dickins
2009-11-02 19:56 ` Vedran Furač
2009-10-30 19:44 ` David Rientjes
2009-11-02 19:56 ` Vedran Furač
2009-10-28 0:43 ` KAMEZAWA Hiroyuki [this message]
2009-10-28 2:47 ` KOSAKI Motohiro
2009-10-28 3:17 ` KAMEZAWA Hiroyuki
2009-10-28 4:12 ` David Rientjes
2009-10-28 8:10 ` Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20091028094343.d59fec94.kamezawa.hiroyu@jp.fujitsu.com \
--to=kamezawa.hiroyu@jp.fujitsu.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hugh.dickins@tiscali.co.uk \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=minchan.kim@gmail.com \
--cc=rientjes@google.com \
--cc=vedran.furac@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox