linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
Date: Mon, 20 Apr 2020 18:51:50 +0800	[thread overview]
Message-ID: <CALOAHbC656aNU0-nOSLtEVyPqxKHNEaVUG5r9evSUuq009i_Mw@mail.gmail.com> (raw)
In-Reply-To: <20200420103142.GK27314@dhcp22.suse.cz>

On Mon, Apr 20, 2020 at 6:31 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-04-20 17:58:03, Yafang Shao wrote:
> > On Mon, Apr 20, 2020 at 5:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> > > > On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> > > [...]
> > > > > > This patch is to improve it.
> > > > > > If no victim found in memcg oom, we should force the current task to
> > > > > > wait until there's available pages. That is similar with the behavior in
> > > > > > memcg1 when oom_kill_disable is set.
> > > > >
> > > > > The primary reason why we force the charge is because we _cannot_ wait
> > > > > indefinitely in the charge path because the current call chain might
> > > > > hold locks or other resources which could block a large part of the
> > > > > system. You are essentially reintroducing that behavior.
> > > > >
> > > >
> > > > Seems my poor English misleads you ?
> > > > The task is NOT waiting in the charge path, while it is really waiting
> > > > at the the end of the page fault, so it doesn't hold any locks.
> > >
> > > How is that supposed to work? Sorry I didn't really study your patch
> > > very closely because it doesn't apply on the current Linus' tree and
> > > your previous 2 patches have reshuffled the code so it is not really
> > > trivial to have a good picture of the overall logic change.
> > >
> >
> > My patch is based on the commit 8632e9b5645b, and I can rebase my
> > patch for better reviewing.
> > Here is the overall logic of the patch.
> > do_page_fault
> >     mem_cgroup_try_charge
> >         mem_cgroup_out_of_memory  <<< over the limit of this memcg
> >             out_of_memory
> >                   if (!oc->chosen)  <<<< no killable tasks found
> >                         Set_an_OOM _state_in_the_task <<<< set the oom state
> >     mm_fault_error
> >         pagefault_out_of_memory  <<<< VM_FAULT_OOM is returned by the
> > previously error
> >             mem_cgroup_oom_synchronize(true)
> >                  Check_the_OOM_state_and_then_wait_here <<<< check the oom state
>
> OK, I see. So this is a hybrid model. My primary concern would be that
> issues seen previously with the #PF based approach will happen again.
> It will be in a reduced form. It is hard to judge whether this is good.
> A rare error case might be even worse because it is harder to predict.
>
> > > > See the comment above mem_cgroup_oom_synchronize()
> > >
> > > Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
> > > oom handling is disabled (aka handed over to the userspace). All other
> > > paths should handle the oom in the charge path.
> >
> > Right. Now this patch introduces another patch to enter
> > mem_cgroup_oom_synchronize().
> >
> > >  Please have a look at
> > > 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> > > for more background and motivation.
> > >
> >
> > Before I send this patch, I have read it carefully.
> >
> > > mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
> > > effect was that all other charge paths outside of #PF were failing
> > > allocations prematurely and that had an effect to user space.
> > >
> >
> > I guess this side effect is caused by the precision of the page
> > counter, for example, the page counter isn't modified immdiately after
> > uncharging the pages - that's the issue we should improve IMHO.
>
> No, this is not really the case. If you have a look at gup users or any
> kernel memory charged then you can find many places to return ENOMEM
> even when it is not really expected. The most notable example was
> MAP_POPULATE failures.
>

Thanks for your explanation. I will take a look at these code.

> > > > > Is the above example a real usecase or you have just tried a test case
> > > > > that would trigger the problem?
> > > >
> > > > On my server I found the memory usage of a container was greater than
> > > > the limit of it.
> > > > From the dmesg I know there's no killable tasks becasue the
> > > > oom_score_adj is set with -1000.
> > >
> > > I would really recommend to address this problem in the userspace
> > > configuration. Either by increasing the memory limit or fixing the
> > > oom disabled userspace to not consume that much of a memory.
> > >
> >
> > This issue can be addressed in the usespace configuration.
> > But note that there're many containers running on one single host,
> > what we should do is try to keep the isolation as strong as possible.
>
> I do agree with this but there is a line between isolation and proper
> configuration that this requires. You will never get 100% isolation in
> the first place. You have to be careful to not share some resources
> which are inherently hard to isolate - e.g. page cache, file system
> metadata and many others.
>
> While your example is not the same it is similar. Disabling the oom
> killer for a task implies that all the resources held by that task
> cannot be reclaimed/rebalanced and so isolation is much harder to
> achieve if possible at all. There is only only relieable way to handle
> OOM in this situation and that is breaking the contract and kill those
> tasks. I am not really convinced this is what we want to do.
>
> > If we don't take any action in the kernel, the users will complain to
> > us that their service is easily effected by the weak isolation of the
> > container.
>
> Yes I can imagine that and we will carefully explain that disabling oom
> for some tasks is a very dangerous thing to do. Maybe we are not
> explicit about that in our documentation now and all the consequences
> are not really clear.
>
> If this turns out to be infeasible then we should be addressing that
> problem for all possible cases and that means to allow breaking the
> oom_score_adj contract and kill also hidden tasks.

Breaking the oom_score_adj contract seems another possible way, that
would be accepted by the user - misconfiguration or bugs in user code
should be punished.  IOW this is the fault of the user and the kernel
should tell the user the result of this fault.

Thanks
Yafang


  reply	other threads:[~2020-04-20 10:52 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-18 15:13 [PATCH 0/3] " Yafang Shao
2020-04-18 15:13 ` [PATCH 1/3] mm: change the return type of out_of_memory() Yafang Shao
2020-04-18 15:13 ` [PATCH 2/3] mm, memcg: introduce a new helper task_in_memcg_oom_set() Yafang Shao
2020-04-18 15:13 ` [PATCH 3/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
2020-04-20  8:13   ` Michal Hocko
2020-04-20  8:52     ` Yafang Shao
2020-04-20  9:14       ` Michal Hocko
2020-04-20  9:58         ` Yafang Shao
2020-04-20 10:31           ` Michal Hocko
2020-04-20 10:51             ` Yafang Shao [this message]
2020-04-20 11:10               ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALOAHbC656aNU0-nOSLtEVyPqxKHNEaVUG5r9evSUuq009i_Mw@mail.gmail.com \
    --to=laoar.shao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox