* [QUESTION] What memcg lifetime is required by list_lru_add?
@ 2024-11-27 21:04 Alice Ryhl
From: Alice Ryhl @ 2024-11-27 21:04 UTC (permalink / raw)
To: Dave Chinner, Johannes Weiner, Andrew Morton, Nhat Pham
Cc: Qi Zheng, Roman Gushchin, Muchun Song,
Linux Memory Management List, Michal Hocko, Shakeel Butt,
cgroups, open list
Dear SHRINKER and MEMCG experts,
When using list_lru_add() and list_lru_del(), it seems to be required
that you pass the same value of nid and memcg to both calls, since
list_lru_del() might otherwise try to delete it from the wrong list /
delete it while holding the wrong spinlock. I'm trying to understand
the implications of this requirement on the lifetime of the memcg.
Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
to keep the memcg object alive for the duration of list_lru_add().
That rcu locking is used here seems to imply that without it, the
memcg could be deallocated during the list_lru_add() call, which is of
course bad. But rcu is not enough on its own to keep the memcg alive
all the way until the list_lru_del_obj() call, so how does it ensure
that the memcg stays valid for that long? And if there is a mechanism
to keep the memcg alive for the entire duration between add and del,
why is rcu locking needed? I don't see any refcounts being taken on
the memcg.
Is it because the memcg could be replaced by another memcg that has
the same value of memcg_kmem_id(memcg)?
tl;dr: what does list_lru_add actually require from the memcg
pointer's lifetime?
Alice
* Re: [QUESTION] What memcg lifetime is required by list_lru_add?
From: Dave Chinner @ 2024-11-27 22:05 UTC (permalink / raw)
To: Alice Ryhl
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng,
Roman Gushchin, Muchun Song, Linux Memory Management List,
Michal Hocko, Shakeel Butt, cgroups, open list
On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> Dear SHRINKER and MEMCG experts,
>
> When using list_lru_add() and list_lru_del(), it seems to be required
> that you pass the same value of nid and memcg to both calls, since
> list_lru_del() might otherwise try to delete it from the wrong list /
> delete it while holding the wrong spinlock. I'm trying to understand
> the implications of this requirement on the lifetime of the memcg.
>
> Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> to keep the memcg object alive for the duration of list_lru_add().
> That rcu locking is used here seems to imply that without it, the
> memcg could be deallocated during the list_lru_add() call, which is of
> course bad. But rcu is not enough on its own to keep the memcg alive
> all the way until the list_lru_del_obj() call, so how does it ensure
> that the memcg stays valid for that long?
We don't care if the memcg goes away whilst there are objects on the
LRU. memcg destruction will reparent the objects to a different
memcg via memcg_reparent_list_lrus() before the memcg is torn down.
New objects should not be added to the memcg LRUs once the memcg
teardown process starts, so there should never be add vs reparent
races during teardown.
Hence all the list_lru_add_obj() function needs to do is ensure that
the locking/lifecycle rules for the memcg object that
mem_cgroup_from_slab_obj() returns are obeyed.
> And if there is a mechanism
> to keep the memcg alive for the entire duration between add and del,
It's enforced by the -complex- state machine used to tear down
control groups.
tl;dr: If the memcg gets torn down, it will reparent the objects on
the LRU to its parent memcg during the teardown process.
This reparenting happens in the cgroup ->css_offline() method, which
only happens after the cgroup reference count goes to zero and is
waited on via:
kill_css
percpu_ref_kill_and_confirm(css_killed_ref_fn)
<wait>
css_killed_ref_fn
offline_css
mem_cgroup_css_offline
memcg_offline_kmem
{
.....
memcg_reparent_objcgs(memcg, parent);
/*
* After we have finished memcg_reparent_objcgs(), all list_lrus
* corresponding to this cgroup are guaranteed to remain empty.
* The ordering is imposed by list_lru_node->lock taken by
* memcg_reparent_list_lrus().
*/
memcg_reparent_list_lrus(memcg, parent)
}
The cgroup teardown control code then schedules the freeing
of the memcg container via an RCU work callback when the reference
count is globally visible as killed and has gone to zero.
Hence the cgroup infrastructure requires RCU protection for the
duration of unreferenced cgroup object accesses. This allows
subsystems to perform operations on the cgroup object without
needing to hold cgroup references for every access. The complex,
multi-stage teardown process allows cgroup objects to release the
objects they track, hence avoiding the need for every tracked
object to hold a reference count on the cgroup.
See the comment above css_free_rwork_fn() for more details about the
teardown process:
/*
* css destruction is four-stage process.
*
* 1. Destruction starts. Killing of the percpu_ref is initiated.
* Implemented in kill_css().
*
* 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
* and thus css_tryget_online() is guaranteed to fail, the css can be
* offlined by invoking offline_css(). After offlining, the base ref is
* put. Implemented in css_killed_work_fn().
*
* 3. When the percpu_ref reaches zero, the only possible remaining
* accessors are inside RCU read sections. css_release() schedules the
* RCU callback.
*
* 4. After the grace period, the css can be freed. Implemented in
* css_free_rwork_fn().
*
* It is actually hairier because both step 2 and 4 require process context
* and thus involve punting to css->destroy_work adding two additional
* steps to the already complex sequence.
*/
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [QUESTION] What memcg lifetime is required by list_lru_add?
From: Alice Ryhl @ 2024-11-28 12:27 UTC (permalink / raw)
To: Dave Chinner
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng,
Roman Gushchin, Muchun Song, Linux Memory Management List,
Michal Hocko, Shakeel Butt, cgroups, open list
On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be required
> > that you pass the same value of nid and memcg to both calls, since
> > list_lru_del() might otherwise try to delete it from the wrong list /
> > delete it while holding the wrong spinlock. I'm trying to understand
> > the implications of this requirement on the lifetime of the memcg.
> >
> > Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> > to keep the memcg object alive for the duration of list_lru_add().
> > That rcu locking is used here seems to imply that without it, the
> > memcg could be deallocated during the list_lru_add() call, which is of
> > course bad. But rcu is not enough on its own to keep the memcg alive
> > all the way until the list_lru_del_obj() call, so how does it ensure
> > that the memcg stays valid for that long?
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
>
> kill_css
> percpu_ref_kill_and_confirm(css_killed_ref_fn)
> <wait>
> css_killed_ref_fn
> offline_css
> mem_cgroup_css_offline
> memcg_offline_kmem
> {
> .....
> memcg_reparent_objcgs(memcg, parent);
>
> /*
> * After we have finished memcg_reparent_objcgs(), all list_lrus
> * corresponding to this cgroup are guaranteed to remain empty.
> * The ordering is imposed by list_lru_node->lock taken by
> * memcg_reparent_list_lrus().
> */
> memcg_reparent_list_lrus(memcg, parent)
> }
>
> The cgroup teardown control code then schedules the freeing
> of the memcg container via an RCU work callback when the reference
> count is globally visible as killed and has gone to zero.
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of unreferenced cgroup object accesses. This allows
> subsystems to perform operations on the cgroup object without
> needing to hold cgroup references for every access. The complex,
> multi-stage teardown process allows cgroup objects to release the
> objects they track, hence avoiding the need for every tracked
> object to hold a reference count on the cgroup.
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
> * css destruction is four-stage process.
> *
> * 1. Destruction starts. Killing of the percpu_ref is initiated.
> * Implemented in kill_css().
> *
> * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
> * and thus css_tryget_online() is guaranteed to fail, the css can be
> * offlined by invoking offline_css(). After offlining, the base ref is
> * put. Implemented in css_killed_work_fn().
> *
> * 3. When the percpu_ref reaches zero, the only possible remaining
> * accessors are inside RCU read sections. css_release() schedules the
> * RCU callback.
> *
> * 4. After the grace period, the css can be freed. Implemented in
> * css_free_rwork_fn().
> *
> * It is actually hairier because both step 2 and 4 require process context
> * and thus involve punting to css->destroy_work adding two additional
> * steps to the already complex sequence.
> */
Thanks a lot Dave, this clears it up for me.
I sent a patch containing some additional docs for list_lru:
https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@google.com/
Alice
* Re: [QUESTION] What memcg lifetime is required by list_lru_add?
From: Michal Koutný @ 2024-12-03 10:44 UTC (permalink / raw)
To: Dave Chinner, Alice Ryhl
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng,
Roman Gushchin, Muchun Song, Linux Memory Management List,
Michal Hocko, Shakeel Butt, cgroups, open list
On Thu, Nov 28, 2024 at 09:05:34AM GMT, Dave Chinner <david@fromorbit.com> wrote:
> It's enforced by the -complex- state machine used to tear down
> control groups.
True.
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
What's waited for is seeing the "killing" of the _initial_ reference;
the refcount may still be non-zero. I.e. ->css_offline() happens with
some references around (e.g. from struct page^W folio) and only
->css_released() is called after refs drop to zero (and ->css_free()
even after an RCU grace period, given there were any RCU readers who
didn't css_get()).
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
> * css destruction is four-stage process.
> *
> * 1. Destruction starts. Killing of the percpu_ref is initiated.
> * Implemented in kill_css().
> *
> * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
> * and thus css_tryget_online() is guaranteed to fail, the css can be
> * offlined by invoking offline_css(). After offlining, the base ref is
> * put. Implemented in css_killed_work_fn().
> *
> * 3. When the percpu_ref reaches zero, the only possible remaining
> * accessors are inside RCU read sections. css_release() schedules the
> * RCU callback.
> *
> * 4. After the grace period, the css can be freed. Implemented in
> * css_free_rwork_fn().
This is a useful comment.
HTH,
Michal