Re: oom killer rewrite - KAMEZAWA Hiroyuki

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>, linux-mm@kvack.org
Subject: Re: oom killer rewrite
Date: Wed, 26 May 2010 11:00:08 +0900	[thread overview]
Message-ID: <20100526110008.eea05fd6.kamezawa.hiroyu@jp.fujitsu.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1005251818070.23584@chino.kir.corp.google.com>

On Tue, 25 May 2010 18:40:36 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 26 May 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > The only sane badness heuristic will be one that effectively compares all 
> > > eligible tasks for oom kill in a way that are relative to one another; I'm 
> > > concerned that a tunable that is based on a pure memory quantity requires 
> > > specific knowledge of the system (or memcg, cpuset, etc) capacity before 
> > > it is meaningful.  In other words, I opted to use a relative proportion so 
> > > that when tasks are constrained to cpusets or memcgs or mempolicies they 
> > > become part of a "virtualized system" where the proportion is then used in 
> > > calculation of the total amount of system RAM, memcg limit, cpuset mems 
> > > capacities, etc, without knowledge of what that value actually is.  So 
> > > "echo 3G" may be valid in your example when not constrained to any cgroup 
> > > or mempolicy but becomes invalid if I attach it to a cpuset with a single 
> > > node of 1G capacity.  When oom_score_adj, we can specify the proportion 
> > > "of the resources that the application has access to" in comparison to 
> > > other applications that share those resources to determine oom killing 
> > > priority.  I think that's a very powerful interface and your suggestion 
> > > could easily be implemented in userspace with a simple divide, thus we 
> > > don't need kernel support for it.
> > > 
> > I know admins will be able to write a script. But, my point is
> > "please don't force admins to write such a hacky scripts."
> > 
> 
> It's not necessarily the memory quantity that is interesting in this case 
> (or proportion of available memory), it's how the badness() score is 
> altered relative to other eligible tasks that end up changing the oom kill 
> priority list.  If we were to implement a tunable that only took a memory 
> quantity, it would require specific knowledge of the system's capacity to 
> make any sense compared to other tasks.  An oom_score_adj of 125MB means 
> vastly different things on a 4GB system compared to 64GB system and admins 
> do not want to update their script anytime they add (or hotadd) memory or 
> run on a variety of systems that don't have the same capacities. 

IMHO, importance of application is consistent under all hosts in the system.
(the system here means a system maintained by a team of admins to do a service.)

It's not be influenced by the amount of memory, other applications, etc..
If influenced, it's a chaos for admins.
It seems that's fundamental difference in ideas among you and me.


> > For example, an admin uses an application which always use 3G bytes adn it's
> > valid and sane use for the application. When he run it on a server with
> > 4G system and 8G system, he has to change the value for oom_score_adj.
> > 
> 
> That's the same if you were to implement a memory quantity instead of a 
> proportion for oom_score_adj and depends on how you want to protect or 
> prefer that application.  For a 3G application on a 4G machine, an 
> oom_score_adj of 250 is legitimate if you want to ensure it never uses 
> more than 3G and is always killed first when it does.  For the 8G machine, 
> you can't make the same killing choice if another instance of the same 
> application is using 5G instead of 3G.  See the difference?  In that case, 
> it may not be the correct choice for oom kill and we should kill something 
> else: the 5G memory leaker.  That requires userspace intervention to 
> identify, but unless we mandate the expected memory use is spelled out for 
> every single application (which we can't), there's no way to use a fixed 
> memory quantity to determine relative priority.
> 

I just don't believe relative priority ;)
Then, my customer will just disable oom or will use panic_on_oom.
That's why I wrote don't take my words serious. 
I wonder if people wants precise control of oom_score_adj, they should
use memcg and put apps into containers. In that case, static priority
and will be useful.

> If you really did always want to kill that 3G task, an oom_score_adj value 
> of +1000 would always work just like a value of +15 does for oom_adj.
> 
> > One good point of old oom_adj is that it's not influenced by environment.
> > Then, X-window applications set it's oom_adj to be fixed value. 
> > IIUC, they're hardcoded with fixed value, now. 
> > 
> 
> It _is_ influenced by environment, just indirectly.  It's a bitshift on 
> the badness() score so for any other usecase other than a complete 
> polarization of the score to either always prefer or completely disable 
> oom killing for a task, it's practically useless.  The bitshift will 
> increase or decrease the score but that score will be ranked according to 
> the scores of other tasks on the system.  So if a task consuming 400K of 
> memory has a badness score of 100 with an oom_adj value of +10, the end 
> result is a score of 102400 which would represent about 10% of system 
> memory on a 4G system but about 1.5% of system memory on a 64GB system.  
> So the actual preference of a task, minus the usecase of polarizing the 
> task with oom_adj, is completely dependent on the size of system RAM.
> 
> oom_adj must also be altered anytime a task is attached to a cpuset or 
> memcg (or even mempolicy now) since its effect on badness will skew how 
> the score is compared relative to all other tasks in that cpuset, memcg, 
> or attached to the mempolicy nodes.
> 

I agree that oom_score_adj is better than current oom_adj.

Thanks,
-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-05-26  2:04 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-19 22:14 David Rientjes
2010-05-20  0:27 ` KAMEZAWA Hiroyuki
2010-05-25  9:42   ` David Rientjes
2010-05-26  0:17     ` KAMEZAWA Hiroyuki
2010-05-26  1:40       ` David Rientjes
2010-05-26  2:00         ` KAMEZAWA Hiroyuki [this message]
2010-05-26  3:26           ` David Rientjes
2010-05-24  1:09 ` KOSAKI Motohiro
2010-05-24  7:07   ` Nick Piggin
2010-05-25  9:46     ` David Rientjes
2010-05-25 10:05       ` Nick Piggin
2010-05-25 10:23         ` David Rientjes
2010-05-25 10:31           ` Nick Piggin
2010-05-25  9:55   ` David Rientjes
2010-05-26  0:02     ` David Rientjes
2010-05-28  5:27       ` KOSAKI Motohiro
2010-05-28  5:25     ` KOSAKI Motohiro
2010-06-01  7:30       ` David Rientjes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100526110008.eea05fd6.kamezawa.hiroyu@jp.fujitsu.com \
    --to=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox