From: Michal Hocko <mhocko@kernel.org>
To: Roman Gushchin <guro@fb.com>
Cc: linux-mm@kvack.org, Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v10 4/6] mm, oom: introduce memory.oom_group
Date: Thu, 5 Oct 2017 14:58:08 +0200
Message-ID: <20171005125808.vsbpxmkabpzq4wsg@dhcp22.suse.cz>
In-Reply-To: <20171005123214.GA15459@castle.dhcp.TheFacebook.com>

On Thu 05-10-17 13:32:14, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > > consumption entities and performs the victim selection by comparing
> > > them based on their memory footprint. Then it kills the biggest task
> > > inside the selected memory cgroup.
> > > 
> > > But there are workloads which are not tolerant of such behavior.
> > > Killing a random task may leave the workload in a broken state.
> > > 
> > > To solve this problem, the memory.oom_group knob is introduced.
> > > It defines whether a memory cgroup should be treated as an
> > > indivisible memory consumer, compared by total memory consumption
> > > with other memory consumers (leaf memory cgroups and other memory
> > > cgroups with memory.oom_group set), and whether all of its tasks
> > > should be killed if the cgroup is selected.
> > > 
> > > If set on memcg A, it means that in case of a system-wide OOM or
> > > a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
> > > belonging to the sub-tree of A will be killed. If the OOM event is
> > > scoped to a descendant cgroup (A/B, for example), only tasks in
> > > that cgroup can be affected. The OOM killer will never touch any
> > > tasks outside of the scope of the OOM event.
> > > 
> > > Also, tasks with oom_score_adj set to -1000 will not be killed.
> > 
> > I would extend the last sentence with an explanation. What about the
> > following:
> > "
> > Also, tasks with oom_score_adj set to -1000 will not be killed because
> > this has been a long-established way to protect a particular process
> > from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> > user-defined configuration might lead to data corruption or other
> > misbehavior.
> > "
> 
> Added, thanks!
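
To make the above concrete: the interface boils down to one per-cgroup
control file plus the long-standing oom_score_adj escape hatch. A minimal
user-space sketch, assuming cgroup2 is mounted at /sys/fs/cgroup and using
a hypothetical "job" cgroup, might look like this:

#include <stdio.h>
#include <stdlib.h>

/* Write a short string to a (pseudo)file, bailing out on any error. */
static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	/*
	 * Path assumes cgroup2 mounted at /sys/fs/cgroup; "job" is a
	 * made-up cgroup. Treat the whole job as one indivisible OOM
	 * victim: if any memcg in this sub-tree is selected by a system-
	 * or memcg-wide OOM, all tasks below "job" are killed together.
	 */
	write_str("/sys/fs/cgroup/job/memory.oom_group", "1");

	/*
	 * A task with oom_score_adj set to -1000 (OOM_SCORE_ADJ_MIN) keeps
	 * its traditional exemption and is skipped even when its cgroup is
	 * killed as a whole.
	 */
	write_str("/proc/self/oom_score_adj", "-1000");

	return 0;
}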
> 
> > 
> > A few mostly nitpicks below, but this looks good other than that. Once
> > the fix mentioned in patch 3 is folded, I will ack this.
> > 
> > [...]
> > 
> > >  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > >  {
> > > -	struct mem_cgroup *iter;
> > > +	struct mem_cgroup *iter, *group = NULL;
> > > +	long group_score = 0;
> > >  
> > >  	oc->chosen_memcg = NULL;
> > >  	oc->chosen_points = 0;
> > >  
> > >  	/*
> > > +	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > > +	 * all tasks belonging to the memcg should be killed.
> > > +	 * So, we mark the memcg as a victim.
> > > +	 */
> > > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > 
> > We have the is_memcg_oom() helper, which is easier to read and understand
> > than the explicit oc->memcg check.
> 
> It's defined in oom_kill.c and not exported, so I'm not sure.

Putting it into oom.h shouldn't be a big deal.
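
For context, the helper in question is a one-liner in mm/oom_kill.c, so the
move would amount to roughly the following sketch (assuming the definition
stays as it is today):

/* Sketch only: lifted as-is into include/linux/oom.h for memcontrol.c. */
static inline bool is_memcg_oom(struct oom_control *oc)
{
	return oc->memcg != NULL;
}

With that in include/linux/oom.h, the check quoted above would read
if (is_memcg_oom(oc) && mem_cgroup_oom_group(oc->memcg)).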
 
> > > +		oc->chosen_memcg = oc->memcg;
> > > +		css_get(&oc->chosen_memcg->css);
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > >  	 * The oom_score is calculated for leaf memory cgroups (including
> > >  	 * the root memcg).
> > > +	 * Non-leaf oom_group cgroups accumulate the score of descendant
> > > +	 * leaf memory cgroups.
> > >  	 */
> > >  	rcu_read_lock();
> > >  	for_each_mem_cgroup_tree(iter, root) {
> > >  		long score;
> > >  
> > > +		/*
> > > +		 * We don't consider non-leaf non-oom_group memory cgroups
> > > +		 * as OOM victims.
> > > +		 */
> > > +		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * If group is not set or we've run out of the group's sub-tree,
> > > +		 * we should set group and reset group_score.
> > > +		 */
> > > +		if (!group || group == root_mem_cgroup ||
> > > +		    !mem_cgroup_is_descendant(iter, group)) {
> > > +			group = iter;
> > > +			group_score = 0;
> > > +		}
> > > +
> > 
> > Hmm, I thought you would go with a recursive oom_evaluate_memcg
> > implementation, which would result in more readable code IMHO. It is
> > true that we would traverse oom_group cgroups more times. But I do not
> > expect very deep memcg hierarchies in the majority of workloads, and
> > even if we had them, this is a cold path which should favor
> > readability over performance. Also, implementing
> > mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> > turns out to be a real problem.
> 
> I've tried to go this way, but I didn't like the result. Both loops
> would share a lot of code (e.g. breaking on finding a previous victim,
> skipping non-leaf non-oom-group memcgs, etc.), so the result is messier.
> And it's actually strange to start a new loop to iterate over exactly
> the same sub-tree which you want to skip in the first loop.

As I've said, I will not insist. It just makes more sense to me to do
the hierarchical behavior in a single place rather than open-code it in
the main loop.
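
Whichever shape the code ends up with, the selection rule being discussed
can be modeled in a few lines of user-space C. The toy below is deliberately
simplified (hard-coded tree, no root-memcg or in-flight-victim handling):
every leaf's footprint is charged either to the leaf itself or to its topmost
ancestor with oom_group set, and the entity with the largest total is chosen.

/* Toy model of the selection rule, not the kernel implementation. */
#include <stdio.h>

struct node {
	const char *name;
	int parent;		/* index of the parent, -1 for the OOM root */
	int oom_group;		/* memory.oom_group set on this cgroup */
	long score;		/* leaf memory footprint, 0 for inner nodes */
	long total;		/* accumulated score */
};

int main(void)
{
	struct node t[] = {
		{ "root", -1, 0,  0, 0 },
		{ "A",     0, 1,  0, 0 },	/* oom_group: killed as a unit */
		{ "A/a1",  1, 0, 50, 0 },
		{ "A/a2",  1, 0, 70, 0 },
		{ "B",     0, 0, 90, 0 },	/* plain leaf memcg */
	};
	int n = sizeof(t) / sizeof(t[0]), i, best = 0;

	for (i = 0; i < n; i++) {
		int victim = i, p;

		if (!t[i].score)
			continue;	/* only leaves carry a footprint */

		/* Charge the topmost oom_group ancestor, if any. */
		for (p = t[i].parent; p >= 0; p = t[p].parent)
			if (t[p].oom_group)
				victim = p;

		t[victim].total += t[i].score;
	}

	for (i = 1; i < n; i++)
		if (t[i].total > t[best].total)
			best = i;

	printf("chosen: %s (score %ld)\n", t[best].name, t[best].total);
	return 0;
}

With these numbers, A wins with 120 even though B is the single biggest
leaf, which is exactly the behavioral difference oom_group introduces.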
 
> > Anyway this is nothing really fundamental so I will leave the decision
> > on you.
> > 
> > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > +{
> > >  	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > >  		return oc->chosen_memcg;
> > >  
> > > -	/* Kill a task in the chosen memcg with the biggest memory footprint */
> > > -	oc->chosen_points = 0;
> > > -	oc->chosen_task = NULL;
> > > -	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > -
> > > -	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > -		goto out;
> > > -
> > > -	__oom_kill_process(oc->chosen_task);
> > > +	/*
> > > +	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> > > +	 * of the chosen memory cgroup, otherwise kill the task with the biggest
> > > +	 * memory footprint.
> > > +	 */
> > > +	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> > > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > > +				      NULL);
> > > +		/* We have one or more terminating processes at this point. */
> > > +		oc->chosen_task = INFLIGHT_VICTIM;
> > 
> > It took me a while to realize we need this because of the return
> > !!oc->chosen_task in out_of_memory. Subtle... Also a reason to hate
> > the oc->chosen_* thingy. As I've said in the other reply, don't worry
> > about this; I will probably turn my hate into a patch ;)
> > 
> > > +	} else {
> > > +		oc->chosen_points = 0;
> > > +		oc->chosen_task = NULL;
> > > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > +
> > > +		if (oc->chosen_task == NULL ||
> > > +		    oc->chosen_task == INFLIGHT_VICTIM)
> > > +			goto out;
> > 
> > How can this happen? There shouldn't be any INFLIGHT_VICTIM in our memcg
> > because we have checked for that already. I can see how we might not find
> > any task because they can all terminate by the time we get here, but no
> > new oom victim should appear while we are under the oom_lock.
> 
> You're probably right, but I would prefer to have this check in place,
> rather than get a panic on an attempt to kill an INFLIGHT_VICTIM task one day.

This would be a bug which you would just paper over.
-- 
Michal Hocko
SUSE Labs

