From: David Rientjes <rientjes@google.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>, linux-mm@kvack.org
Subject: Re: oom killer rewrite
Date: Tue, 25 May 2010 18:40:36 -0700 (PDT)
Message-ID: <alpine.DEB.2.00.1005251818070.23584@chino.kir.corp.google.com>
In-Reply-To: <20100526091740.953090a7.kamezawa.hiroyu@jp.fujitsu.com>
On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
> > The only sane badness heuristic will be one that effectively compares all
> > eligible tasks for oom kill in a way that is relative to one another; I'm
> > concerned that a tunable that is based on a pure memory quantity requires
> > specific knowledge of the system (or memcg, cpuset, etc) capacity before
> > it is meaningful. In other words, I opted to use a relative proportion so
> > that when tasks are constrained to cpusets or memcgs or mempolicies they
> > become part of a "virtualized system" where the proportion is then used in
> > calculation of the total amount of system RAM, memcg limit, cpuset mems
> > capacities, etc, without knowledge of what that value actually is. So
> > "echo 3G" may be valid in your example when not constrained to any cgroup
> > or mempolicy but becomes invalid if I attach it to a cpuset with a single
> > node of 1G capacity. With oom_score_adj, we can specify the proportion
> > "of the resources that the application has access to" in comparison to
> > other applications that share those resources to determine oom killing
> > priority. I think that's a very powerful interface and your suggestion
> > could easily be implemented in userspace with a simple divide, thus we
> > don't need kernel support for it.
> >
> I know admins will be able to write a script. But, my point is
> "please don't force admins to write such hacky scripts."
>
It's not necessarily the memory quantity that is interesting in this case
(or proportion of available memory), it's how the badness() score is
altered relative to other eligible tasks that end up changing the oom kill
priority list. If we were to implement a tunable that only took a memory
quantity, it would require specific knowledge of the system's capacity to
make any sense compared to other tasks. An oom_score_adj of 125MB means
vastly different things on a 4GB system compared to a 64GB system and admins
do not want to update their script anytime they add (or hotadd) memory or
run on a variety of systems that don't have the same capacities. What
they want to do is specify the priority of an application either to prefer
it or protect it from oom kill by saying "this application can use 25%
more memory available to it than everything else" or "this application
should be killed if it's not leaving at least 25% to everything else."
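
To make the contrast concrete, here is a minimal sketch, assuming that
oom_score_adj is counted in thousandths of the memory available to the task
(so 250 is the 25% used above). The helper names are hypothetical and are
not kernel interfaces; the second helper is just the "simple divide" that
userspace could do if it really wanted to start from a memory quantity.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def adj_to_bytes(oom_score_adj, capacity_bytes):
    """Memory that a proportional adjustment represents on a given capacity."""
    return oom_score_adj * capacity_bytes // 1000

def bytes_to_adj(quantity_bytes, capacity_bytes):
    """Turn a fixed memory quantity into thousandths of the capacity."""
    return quantity_bytes * 1000 // capacity_bytes

# The same proportion scales with the machine; the same quantity does not.
for capacity in (4 * GIB, 64 * GIB):
    print(capacity // GIB, "GiB:",
          "adj 250 ->", adj_to_bytes(250, capacity) // MIB, "MiB;",
          "125 MiB ->", bytes_to_adj(125 * MIB, capacity), "adj units")
```

The capacity passed in would be whatever the task is constrained to (system
RAM, memcg limit, cpuset mems), which is exactly the value a fixed-quantity
tunable would force the admin to know in advance.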
> For example, an admin runs an application which always uses 3G bytes, and that
> is valid and sane use for the application. When he runs it on a 4G system
> and on an 8G system, he has to change the value for oom_score_adj.
>
That's the same if you were to implement a memory quantity instead of a
proportion for oom_score_adj and depends on how you want to protect or
prefer that application. For a 3G application on a 4G machine, an
oom_score_adj of 250 is legitimate if you want to ensure it never uses
more than 3G and is always killed first when it does. For the 8G machine,
you can't make the same killing choice if another instance of the same
application is using 5G instead of 3G. See the difference? In that case,
it may not be the correct choice for oom kill and we should kill something
else: the 5G memory leaker. That requires userspace intervention to
identify, but unless we mandate the expected memory use is spelled out for
every single application (which we can't), there's no way to use a fixed
memory quantity to determine relative priority.
If you really did always want to kill that 3G task, an oom_score_adj value
of +1000 would always work just like a value of +15 does for oom_adj.
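
A rough model of the 4G and 8G cases, as a sketch only: it assumes the score
is the task's usage in thousandths of capacity plus oom_score_adj, clamped
to 0..1000. That formula and the function name are assumptions made for
illustration, not something quoted from the patch.

```python
GIB = 1024 ** 3

def badness(usage_bytes, capacity_bytes, oom_score_adj=0):
    """Simplified model: usage in thousandths of capacity, plus the adjustment."""
    points = usage_bytes * 1000 // capacity_bytes + oom_score_adj
    return max(0, min(1000, points))  # assumed clamp to the 0..1000 range

# 4G machine: the 3G task with +250 reaches the top of the range.
print(badness(3 * GIB, 4 * GIB, 250))
# 8G machine, two instances with the same +250: the 5G one ranks higher.
print(badness(3 * GIB, 8 * GIB, 250), badness(5 * GIB, 8 * GIB, 250))
# +1000 pins a task at the top of the range regardless of its usage.
print(badness(3 * GIB, 8 * GIB, 1000))
```

Under this model the same adjustment gives the 3G task the top score on the
4G machine, while on the 8G machine the 5G instance outranks it, which is
the difference described above.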
> One good point of old oom_adj is that it's not influenced by environment.
> Then, X-window applications set their oom_adj to a fixed value.
> IIUC, they're hardcoded with a fixed value, now.
>
It _is_ influenced by environment, just indirectly. It's a bitshift on
the badness() score so for any usecase other than a complete
polarization of the score to either always prefer or completely disable
oom killing for a task, it's practically useless. The bitshift will
increase or decrease the score but that score will be ranked according to
the scores of other tasks on the system. So if a task consuming 400K of
memory has a badness score of 100 with an oom_adj value of +10, the end
result is a score of 102400 which would represent about 10% of system
memory on a 4G system but about 0.6% of system memory on a 64GB system.
So the actual preference of a task, minus the usecase of polarizing the
task with oom_adj, is completely dependent on the size of system RAM.
oom_adj must also be altered anytime a task is attached to a cpuset or
memcg (or even mempolicy now) since its effect on badness will skew how
the score is compared relative to all other tasks in that cpuset, memcg,
or attached to the mempolicy nodes.
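
The bitshift arithmetic can be checked with a short sketch. It assumes 4 KiB
pages and a score counted in pages (which is what makes 400K correspond to a
score of 100); the function names are hypothetical, and the special case of
oom_adj -17, which disables oom killing outright, is not modelled.

```python
KIB = 1024
GIB = 1024 ** 3
PAGE_SIZE = 4 * KIB  # assumption: 4 KiB pages, score counted in pages

def apply_oom_adj(points, oom_adj):
    """Shift the score left for a positive oom_adj, right for a negative one."""
    if oom_adj > 0:
        return points << oom_adj
    if oom_adj < 0:
        return points >> -oom_adj
    return points

def share_of_ram(points, ram_bytes):
    """Fraction of RAM that a score of `points` pages would correspond to."""
    return points * PAGE_SIZE / ram_bytes

shifted = apply_oom_adj(100, 10)
for ram in (4 * GIB, 64 * GIB):
    print(ram // GIB, "GiB:", shifted, f"{share_of_ram(shifted, ram):.2%}")
```

The shifted score is the same number on both machines; only what it means
relative to everything else on the system changes.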
> Even if my customer may use only OOM_DISABLE, I think using oom_score_adj
> is too difficult for usual users.
>
How is oom_score_adj, which has a well-defined unit of "proportion of
available memory" more difficult to use than oom_adj, which exponentially
increases a task's badness score with no units other than "badness()
quantum"? The typical use case for these users is simply to polarize the
score, anyway, which is still possible with oom_score_adj: oom_score_adj
of -1000 is equivalent to an oom_adj of -17 and oom_score_adj of +1000 is
equivalent to an oom_adj of +15. That conversion is done automatically for
existing users of oom_adj, so I find oom_score_adj to be much more
predictable and scalable than oom_adj.
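
As an illustration of what such a conversion could look like, here is a
sketch of a linear scaling that matches the two endpoints given above
(-17 to -1000, +15 to +1000). The scaling used for the values in between,
and the function name, are assumptions; only the endpoints come from this
mail.

```python
OOM_DISABLE = -17          # oom_adj value that disables oom killing
OOM_ADJUST_MAX = 15        # largest oom_adj value
OOM_SCORE_ADJ_MAX = 1000   # largest oom_score_adj value

def oom_adj_to_score_adj(oom_adj):
    """Assumed linear scaling that maps -17 to -1000 and +15 to +1000."""
    if not OOM_DISABLE <= oom_adj <= OOM_ADJUST_MAX:
        raise ValueError("oom_adj must be in [-17, 15]")
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX  # pin the top of the range to the maximum
    # Scale by 1000/17 on the magnitude, so the result truncates toward zero.
    scale_divisor = -OOM_DISABLE
    magnitude = abs(oom_adj) * OOM_SCORE_ADJ_MAX // scale_divisor
    return magnitude if oom_adj >= 0 else -magnitude

for value in (-17, -16, 0, 10, 15):
    print(value, "->", oom_adj_to_score_adj(value))
```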