Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Chris Li <chrisl@kernel.org>
To: Alistair Popple <apopple@nvidia.com>
Cc: "T.J. Mercier" <tjmercier@google.com>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, Yosry Ahmed <yosryahmed@google.com>,
	Tejun Heo <tj@kernel.org>, Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	Yu Zhao <yuzhao@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
Date: Sat, 6 May 2023 15:49:08 -0700	[thread overview]
Message-ID: <ZFbZZPkSpsKMe8iR@google.com> (raw)
In-Reply-To: <877ctm518f.fsf@nvidia.com>

On Fri, May 05, 2023 at 11:53:24PM +1000, Alistair Popple wrote:
> 
> >> Unfortunately I won't be attending LSF/MM in person this year but I am
> >
> > Will you be able to join virtually?
> 
> I should be able to join afternoon sessions virtually.

Great.

> >> The issue with per-page memcg limits is what to do for shared
> >> mappings. The below suggestion sounds promising because the pins for
> >> shared pages could be charged to the smemcg. However I'm not sure how it
> >> would solve the problem of a process in cgroup A being able to raise the
> >> pin count of cgroup B when pinning a smemcg page which was one of the
> >> reason I introduced a new cgroup controller.
> >
> > Now that I think of it, I can see the pin count memcg as a subtype of
> > smemcg.
> >
> > The smemcg can have a limit as well, when it add to a memcg, the operation
> > raise the pin count smemcg charge over the smemcg limit will fail.
> 
> I'm not sure that works for the pinned scenario. If a smemcg already has
> pinned pages adding it to another memcg shouldn't raise the pin count of
> the memcg it's being added to. The pin counts should only be raised in
> memcg's of processes actually requesting the page be pinned. See below
> though, the idea of borrowing seems helpful.

I am very interested in letting smemcg support your pin usage case.
I read your patch thread a bit but I still feel a bit fuzzy about
the pin usage workflow.

If you have some more detailed write up of the usage case, with a sequence
of interaction and desired outcome that would help me understand. Links to
the previous threads work too.

We can set up some meetings to discuss it as well.

> So for pinning at least I don't see a per smemcg limit being useful.

That is fine.  I see you are interested in the <smemcg, memcg> limit.

> > For the detail tracking of shared/unshared behavior, the smemcg can model it
> > as a step 2 feature.
> >
> > There are four different kind of operation can perform on a smemcg:
> >
> > 1) allocate/charge memory. The charge will add on the per smemcg charge
> > counter, check against the per smemcg limit. ENOMEM if it is over the limit.
> >
> > 2) free/uncharge memory. Similar to above just subtract the counter.
> >
> > 3) share/mmap already charged memory. This will not change the smemcg charge
> > count, it will add to a per <smemcg, memcg> borrow counter. It is possible to
> > put a limit on that counter as well, even though I haven't given too much thought
> > of how useful it is. That will limit how much memory can mapped from the smemcg.
> 
> I would like to see the idea of a borrow counter fleshed out some more
> but this sounds like it could work for the pinning scenario.
> 
> Pinning could be charged to the per <smemcg, memcg> borrow counter and
> the pin limit would be enforced against that plus the anonymous pins.
> 
> Implementation wise we'd need a way to lookup both the smemcg of the
> struct page and the memcg that the pinning task belongs to.

The page->memcg_data points to the pin smemcg. I am hoping pinning API or
the current memcg can get to the pinning memcg.

> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> > borrow counter.
> 
> Actually this is where things might get a bit tricky for pinning. We'd
> have to reduce the pin charge when a driver calls put_page(). But that
> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> charged the page to.

Does the pin page share between different memcg or just one memcg?

If it is shared, can the put_page() API indicate it is performing in behalf
of which memcg?
> I will have to give this idea some more tought though. Most drivers
> don't store anything other than the struct page pointers, but my series
> added an accounting struct which I think could reference the borrow
> counter.

Ack.

> 
> > Will that work for your pin memory usage?
> 
> I think it could help. I will give it some thought.

Ack.
> 
> >> 
> >> > Shared Memory Cgroup Controllers
> >> >
> >> > = Introduction
> >> >
> >> > The current memory cgroup controller does not support shared memory
> >> > objects. For the memory that is shared between different processes, it
> >> > is not obvious which process should get charged. Google has some
> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> >> > specific memcg that’s often different from where charging processes
> >> > run. However it faces some difficulties when the charged memcg exits
> >> > and the charged memcg becomes a zombie memcg.
> >> > Other approaches include “re-parenting” the memcg charge to the parent
> >> > memcg. Which has its own problem. If the charge is huge, iteration of
> >> > the reparenting can be costly.
> >> >
> >> > = Proposed Solution
> >> >
> >> > The proposed solution is to add a new type of memory controller for
> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> >> > dma_buf. This shared memory cgroup controller object will have the
> >> > same life cycle of the underlying shared memory.
> >> >
> >> > Processes can not be added to the shared memory cgroup. Instead the
> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
> >> > file, similar to adding a process into the “tasks” API file.
> >> > When a smemcg is added to the memcg, the amount of memory that has
> >> > been shared in the memcg process will be accounted for as the part of
> >> > the memcg “memory.current”.The memory.current of the memcg is make up
> >> > of two parts, 1) the processes anonymous memory and 2) the memory
> >> > shared from smemcg.
> >> >
> >> > When the memcg “memory.current” is raised to the limit. The kernel
> >> > will active try to reclaim for the memcg to make “smemcg memory +
> >> > process anonymous memory” within the limit.
> >> 
> >> That means a process in one cgroup could force reclaim of smemcg memory
> >> in use by a process in another cgroup right? I guess that's no different
> >> to the current situation though.
> >> 
> >> > Further memory allocation
> >> > within those memcg processes will fail if the limit can not be
> >> > followed. If many reclaim attempts fail to bring the memcg
> >> > “memory.current” within the limit, the process in this memcg will get
> >> > OOM killed.
> >> 
> >> How would this work if say a charge for cgroup A to a smemcg in both
> >> cgroup A and B would cause cgroup B to go over its memory limit and not
> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
> >
> > If we separate out the charge counter with the borrow counter, that problem
> > will be solved. When smemcg is add to memcg A, we can have a policy specific
> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
> >
> > If B did not map that page, that page will not be part of <smemcg, memcg B>
> > borrow count. B will not be punished.
> >
> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
> > B will be punished for it.
> >
> > Will that work for your example situation?
> 
> I think so, although I have been looking at this more from the point of
> view of pinning. It sounds like we could treat pinning in much the same
> way as mapping though.

Ack.
> 
> >> > = Benefits
> >> >
> >> > The benefits of this solution include:
> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
> >> 
> >> If we added pinning it could get a bit messier, as it would have to hang
> >> around until the driver unpinned the pages. But I don't think that's a
> >> problem.
> >
> >
> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
> > to model the driver holding the pin ref count as one of the share/mmap operation.
> >
> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
> >
> > We have have different policy option on smemcg.
> > For the simple usage don't care the per memcg borrow counter, it can add the
> > smemcg's charge count to "memory.current".
> >
> > Only the user who cares about per memcg usage of a smemcg will need to maintain
> > per <smemcg, memcg> borrow counter, at additional cost.
> 
> Right, I think pinning drivers will always have to care about the borrow
> counter so will have to track that.

Ack.

Chris

next prev parent reply	other threads:[~2023-05-06 22:49 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-11 23:36 T.J. Mercier
2023-04-11 23:48 ` Yosry Ahmed
2023-04-25 11:36   ` Yosry Ahmed
2023-04-25 18:42     ` Waiman Long
2023-04-25 18:53       ` Yosry Ahmed
2023-04-26 20:15         ` Waiman Long
2023-05-01 16:38     ` Roman Gushchin
2023-05-02  7:18       ` Yosry Ahmed
2023-05-02 20:02       ` Yosry Ahmed
2023-05-03 22:15 ` Chris Li
2023-05-04 11:58   ` Alistair Popple
2023-05-04 15:31     ` Chris Li
2023-05-05 13:53       ` Alistair Popple
2023-05-06 22:49         ` Chris Li [this message]
2023-05-08  8:17           ` Alistair Popple
2023-05-10 14:51             ` Chris Li
2023-05-12  8:45               ` Alistair Popple
2023-05-12 21:09                 ` Jason Gunthorpe
2023-05-16 12:21                   ` Alistair Popple
2023-05-19 15:47                     ` Jason Gunthorpe
2023-05-20 15:09                   ` Chris Li
2023-05-20 15:31                 ` Chris Li
2023-05-29 19:31                   ` Jason Gunthorpe
2023-05-04 17:02   ` Shakeel Butt
2023-05-04 17:36     ` Chris Li
2023-05-12  3:08 ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZFbZZPkSpsKMe8iR@google.com \
    --to=chrisl@kernel.org \
    --cc=apopple@nvidia.com \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=jgg@nvidia.com \
    --cc=kaleshsingh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=muchun.song@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=tj@kernel.org \
    --cc=tjmercier@google.com \
    --cc=yosryahmed@google.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox