* [PATCH rfc 1/5] mm: kmem: optimize get_obj_cgroup_from_current()
2023-09-27 15:08 [PATCH rfc 0/5] mm: improve performance of kernel memory accounting Roman Gushchin
@ 2023-09-27 15:08 ` Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct Roman Gushchin
` (3 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2023-09-27 15:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, cgroups, Johannes Weiner, Michal Hocko,
Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton,
Roman Gushchin
Manually inline memcg_kmem_bypass() and active_memcg() to speed up
get_obj_cgroup_from_current() by avoiding duplicate in_task() checks
and reads of the active memcg.
Also add a likely() annotation to __get_obj_cgroup_from_memcg():
obj_cgroup_tryget() should succeed almost every time; it can fail only
in a very unlikely race with the memcg deletion path.
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
mm/memcontrol.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9741d62d0424..16ac2a5838fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1068,19 +1068,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
}
EXPORT_SYMBOL(get_mem_cgroup_from_mm);
-static __always_inline bool memcg_kmem_bypass(void)
-{
- /* Allow remote memcg charging from any context. */
- if (unlikely(active_memcg()))
- return false;
-
- /* Memcg to charge can't be determined. */
- if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
- return true;
-
- return false;
-}
-
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -3007,7 +2994,7 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
objcg = rcu_dereference(memcg->objcg);
- if (objcg && obj_cgroup_tryget(objcg))
+ if (likely(objcg && obj_cgroup_tryget(objcg)))
break;
objcg = NULL;
}
@@ -3016,16 +3003,23 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
- struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
- if (memcg_kmem_bypass())
- return NULL;
+ if (in_task()) {
+ memcg = current->active_memcg;
+
+ /* Memcg to charge can't be determined. */
+ if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD)))
+ return NULL;
+ } else {
+ memcg = this_cpu_read(int_active_memcg);
+ if (likely(!memcg))
+ return NULL;
+ }
rcu_read_lock();
- if (unlikely(active_memcg()))
- memcg = active_memcg();
- else
+ if (!memcg)
memcg = mem_cgroup_from_task(current);
objcg = __get_obj_cgroup_from_memcg(memcg);
rcu_read_unlock();
--
2.42.0
^ permalink raw reply [flat|nested] 10+ messages in thread

* [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
2023-09-27 15:08 [PATCH rfc 0/5] mm: improve performance of kernel memory accounting Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 1/5] mm: kmem: optimize get_obj_cgroup_from_current() Roman Gushchin
@ 2023-09-27 15:08 ` Roman Gushchin
2023-10-02 20:12 ` Johannes Weiner
2023-09-27 15:08 ` [PATCH rfc 3/5] mm: kmem: make memcg keep a reference to the original objcg Roman Gushchin
` (2 subsequent siblings)
4 siblings, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2023-09-27 15:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, cgroups, Johannes Weiner, Michal Hocko,
Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton,
Roman Gushchin
To charge a freshly allocated kernel object to a memory cgroup, the
kernel needs to obtain an objcg pointer. Currently it does so
indirectly: it obtains the memcg pointer first and then calls
__get_obj_cgroup_from_memcg().
Usually a task spends its entire life in the same object cgroup, so
it makes sense to cache the objcg pointer directly on task_struct,
where it can be obtained faster. This requires some work on the fork,
exit and cgroup migration paths, but those paths are much colder.
To avoid any costly synchronization the following rules are applied:
1) A task sets its objcg pointer itself.
2) If a task is being migrated to another cgroup, the least
significant bit of the objcg pointer is set.
3) On the allocation path the objcg pointer is obtained locklessly
using the READ_ONCE() macro and the least significant bit is
checked. If it is set, the task updates its objcg before proceeding
with the allocation.
4) Operations 1) and 2) are synchronized via a new spinlock, so that
if a task is moved twice, the update bit can't be lost.
This keeps the hot path fully lockless. Because the task holds a
reference to the objcg, the objcg can't go away while the task is
alive.
This commit doesn't change the way the remote memcg charging works.
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 10 ++++
include/linux/sched.h | 4 ++
mm/memcontrol.c | 107 +++++++++++++++++++++++++++++++++----
3 files changed, 112 insertions(+), 9 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ab94ad4597d0..84425bfe4124 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -553,6 +553,16 @@ static inline bool folio_memcg_kmem(struct folio *folio)
return folio->memcg_data & MEMCG_DATA_KMEM;
}
+static inline bool current_objcg_needs_update(struct obj_cgroup *objcg)
+{
+ return (struct obj_cgroup *)((unsigned long)objcg & 0x1);
+}
+
+static inline struct obj_cgroup *
+current_objcg_clear_update_flag(struct obj_cgroup *objcg)
+{
+ return (struct obj_cgroup *)((unsigned long)objcg & ~0x1);
+}
#else
static inline bool folio_memcg_kmem(struct folio *folio)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..60de42715b56 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1443,6 +1443,10 @@ struct task_struct {
struct mem_cgroup *active_memcg;
#endif
+#ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *objcg;
+#endif
+
#ifdef CONFIG_BLK_CGROUP
struct gendisk *throttle_disk;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16ac2a5838fb..7f33a503d600 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
return objcg;
}
+static DEFINE_SPINLOCK(current_objcg_lock);
+
+static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
+{
+ struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
+ unsigned long flags;
+
+ old = current_objcg_clear_update_flag(old);
+ if (old)
+ obj_cgroup_put(old);
+
+ spin_lock_irqsave(&current_objcg_lock, flags);
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+ objcg = rcu_dereference(memcg->objcg);
+ if (objcg && obj_cgroup_tryget(objcg))
+ break;
+ objcg = NULL;
+ }
+ rcu_read_unlock();
+
+ WRITE_ONCE(current->objcg, objcg);
+ spin_unlock_irqrestore(&current_objcg_lock, flags);
+
+ return objcg;
+}
+
+static inline void current_objcg_set_needs_update(struct task_struct *task)
+{
+ struct obj_cgroup *objcg;
+ unsigned long flags;
+
+ spin_lock_irqsave(&current_objcg_lock, flags);
+ objcg = READ_ONCE(task->objcg);
+ objcg = (struct obj_cgroup *)((unsigned long)objcg | 0x1);
+ WRITE_ONCE(task->objcg, objcg);
+ spin_unlock_irqrestore(&current_objcg_lock, flags);
+}
+
__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
struct mem_cgroup *memcg;
@@ -3008,19 +3049,26 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
if (in_task()) {
memcg = current->active_memcg;
+ if (unlikely(memcg))
+ goto from_memcg;
- /* Memcg to charge can't be determined. */
- if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD)))
- return NULL;
+ objcg = READ_ONCE(current->objcg);
+ if (unlikely(current_objcg_needs_update(objcg)))
+ objcg = current_objcg_update(objcg);
+
+ if (objcg) {
+ obj_cgroup_get(objcg);
+ return objcg;
+ }
} else {
memcg = this_cpu_read(int_active_memcg);
- if (likely(!memcg))
- return NULL;
+ if (unlikely(memcg))
+ goto from_memcg;
}
+ return NULL;
+from_memcg:
rcu_read_lock();
- if (!memcg)
- memcg = mem_cgroup_from_task(current);
objcg = __get_obj_cgroup_from_memcg(memcg);
rcu_read_unlock();
return objcg;
@@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
mem_cgroup_clear_mc();
}
}
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_fork(struct task_struct *task)
+{
+ task->objcg = (struct obj_cgroup *)0x1;
+}
+
+static void mem_cgroup_exit(struct task_struct *task)
+{
+ struct obj_cgroup *objcg = current_objcg_clear_update_flag(task->objcg);
+
+ if (objcg)
+ obj_cgroup_put(objcg);
+}
+#endif
+
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
@@ -6359,7 +6423,7 @@ static void mem_cgroup_move_task(void)
#endif
#ifdef CONFIG_LRU_GEN
-static void mem_cgroup_attach(struct cgroup_taskset *tset)
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
struct cgroup_subsys_state *css;
@@ -6377,10 +6441,29 @@ static void mem_cgroup_attach(struct cgroup_taskset *tset)
task_unlock(task);
}
#else
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_LRU_GEN */
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(task, css, tset)
+ current_objcg_set_needs_update(task);
+}
+#else
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_MEMCG_KMEM */
+
+#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
+ mem_cgroup_lru_gen_attach(tset);
+ mem_cgroup_kmem_attach(tset);
}
-#endif /* CONFIG_LRU_GEN */
+#endif
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
@@ -6824,9 +6907,15 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.can_attach = mem_cgroup_can_attach,
+#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
.attach = mem_cgroup_attach,
+#endif
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+#ifdef CONFIG_MEMCG_KMEM
+ .fork = mem_cgroup_fork,
+ .exit = mem_cgroup_exit,
+#endif
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
.early_init = 0,
--
2.42.0
* Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
2023-09-27 15:08 ` [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct Roman Gushchin
@ 2023-10-02 20:12 ` Johannes Weiner
2023-10-02 22:03 ` Roman Gushchin
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2023-10-02 20:12 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Shakeel Butt,
Muchun Song, Dennis Zhou, Andrew Morton
On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> return objcg;
> }
>
> +static DEFINE_SPINLOCK(current_objcg_lock);
> +
> +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> +{
> + struct mem_cgroup *memcg;
> + struct obj_cgroup *objcg;
> + unsigned long flags;
> +
> + old = current_objcg_clear_update_flag(old);
> + if (old)
> + obj_cgroup_put(old);
> +
> + spin_lock_irqsave(&current_objcg_lock, flags);
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(current);
> + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> + objcg = rcu_dereference(memcg->objcg);
> + if (objcg && obj_cgroup_tryget(objcg))
> + break;
> + objcg = NULL;
> + }
> + rcu_read_unlock();
Can this tryget() actually fail when this is called on the current
task during fork() and attach()? A cgroup cannot be offlined while
there is a task in it.
> @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> mem_cgroup_clear_mc();
> }
> }
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void mem_cgroup_fork(struct task_struct *task)
> +{
> + task->objcg = (struct obj_cgroup *)0x1;
dup_task_struct() will copy this pointer from the old task. Would it
be possible to bump the refcount here instead? That would save quite a
bit of work during fork().
* Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
2023-10-02 20:12 ` Johannes Weiner
@ 2023-10-02 22:03 ` Roman Gushchin
2023-10-03 14:22 ` Johannes Weiner
0 siblings, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2023-10-02 22:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Shakeel Butt,
Muchun Song, Dennis Zhou, Andrew Morton
On Mon, Oct 02, 2023 at 04:12:54PM -0400, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> > @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > return objcg;
> > }
> >
> > +static DEFINE_SPINLOCK(current_objcg_lock);
> > +
> > +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> > +{
> > + struct mem_cgroup *memcg;
> > + struct obj_cgroup *objcg;
> > + unsigned long flags;
> > +
> > + old = current_objcg_clear_update_flag(old);
> > + if (old)
> > + obj_cgroup_put(old);
> > +
> > + spin_lock_irqsave(&current_objcg_lock, flags);
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(current);
> > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > + objcg = rcu_dereference(memcg->objcg);
> > + if (objcg && obj_cgroup_tryget(objcg))
> > + break;
> > + objcg = NULL;
> > + }
> > + rcu_read_unlock();
>
> Can this tryget() actually fail when this is called on the current
> task during fork() and attach()? A cgroup cannot be offlined while
> there is a task in it.
Highly theoretically it can if it races against a migration of the current
task to another memcg and the previous memcg is getting offlined.
It actually might make sense to apply the same approach for memcgs as well
(saving a lazily-updating memcg pointer on task_struct). Then it will be
possible to ditch this "for" loop. But I need some time to master the code
and run benchmarks. Idk if it will make enough difference to justify the change.
Btw, this is the rfc version, while there is a newer v1 version, which Andrew
already picked for mm-unstable. Both of your comments still apply, just fyi.
>
> > @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> > mem_cgroup_clear_mc();
> > }
> > }
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static void mem_cgroup_fork(struct task_struct *task)
> > +{
> > + task->objcg = (struct obj_cgroup *)0x1;
>
> dup_task_struct() will copy this pointer from the old task. Would it
> be possible to bump the refcount here instead? That would save quite a
> bit of work during fork().
Yeah, it should be possible. It won't save a lot, but I agree it makes
sense. I'll take a look and will prepare a separate patch for this.
Thank you!
* Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
2023-10-02 22:03 ` Roman Gushchin
@ 2023-10-03 14:22 ` Johannes Weiner
2023-10-03 16:06 ` Roman Gushchin
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2023-10-03 14:22 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Shakeel Butt,
Muchun Song, Dennis Zhou, Andrew Morton
On Mon, Oct 02, 2023 at 03:03:48PM -0700, Roman Gushchin wrote:
> On Mon, Oct 02, 2023 at 04:12:54PM -0400, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> > > @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > > return objcg;
> > > }
> > >
> > > +static DEFINE_SPINLOCK(current_objcg_lock);
> > > +
> > > +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> > > +{
> > > + struct mem_cgroup *memcg;
> > > + struct obj_cgroup *objcg;
> > > + unsigned long flags;
> > > +
> > > + old = current_objcg_clear_update_flag(old);
> > > + if (old)
> > > + obj_cgroup_put(old);
> > > +
> > > + spin_lock_irqsave(&current_objcg_lock, flags);
> > > + rcu_read_lock();
> > > + memcg = mem_cgroup_from_task(current);
> > > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > + objcg = rcu_dereference(memcg->objcg);
> > > + if (objcg && obj_cgroup_tryget(objcg))
> > > + break;
> > > + objcg = NULL;
> > > + }
> > > + rcu_read_unlock();
> >
> > Can this tryget() actually fail when this is called on the current
> > task during fork() and attach()? A cgroup cannot be offlined while
> > there is a task in it.
>
> Highly theoretically it can if it races against a migration of the current
> task to another memcg and the previous memcg is getting offlined.
Ah right, if this runs between css_set_move_task() and ->attach(). The
cache would be briefly updated to a parent in the old hierarchy, but
then quickly reset from the ->attach().
Can you please add a comment along these lines?
> I actually might make sense to apply the same approach for memcgs as well
> (saving a lazily-updating memcg pointer on task_struct). Then it will be
> possible to ditch this "for" loop. But I need some time to master the code
> and run benchmarks. Idk if it will make enough difference to justify the change.
Yeah the memcg pointer is slightly less attractive from an
optimization POV because it already is a pretty direct pointer from
task through the cset array.
If you still want to look into it from a simplification POV that
sounds reasonable, but IMO it would be fine with a comment.
> > > @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> > > mem_cgroup_clear_mc();
> > > }
> > > }
> > > +
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +static void mem_cgroup_fork(struct task_struct *task)
> > > +{
> > > + task->objcg = (struct obj_cgroup *)0x1;
> >
> > dup_task_struct() will copy this pointer from the old task. Would it
> > be possible to bump the refcount here instead? That would save quite a
> > bit of work during fork().
>
> Yeah, it should be possible. It won't save a lot, but I agree it makes
> sense. I'll take a look and will prepare a separate patch for this.
I guess the hairiest part would be synchronizing against a migration
because all these cgroup core callbacks are unlocked.
Would it make sense to add ->fork_locked() and ->attach_locked()
callbacks that are dispatched under the css_set_lock? Then this could
be a simple if (p && !(p & 0x1)) obj_cgroup_get(), which would
certainly be nice to workloads where fork() is hot, with little
downside otherwise.
* Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct
2023-10-03 14:22 ` Johannes Weiner
@ 2023-10-03 16:06 ` Roman Gushchin
0 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2023-10-03 16:06 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Shakeel Butt,
Muchun Song, Dennis Zhou, Andrew Morton
On Tue, Oct 03, 2023 at 10:22:55AM -0400, Johannes Weiner wrote:
> On Mon, Oct 02, 2023 at 03:03:48PM -0700, Roman Gushchin wrote:
> > On Mon, Oct 02, 2023 at 04:12:54PM -0400, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> > > > @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > > > return objcg;
> > > > }
> > > >
> > > > +static DEFINE_SPINLOCK(current_objcg_lock);
> > > > +
> > > > +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> > > > +{
> > > > + struct mem_cgroup *memcg;
> > > > + struct obj_cgroup *objcg;
> > > > + unsigned long flags;
> > > > +
> > > > + old = current_objcg_clear_update_flag(old);
> > > > + if (old)
> > > > + obj_cgroup_put(old);
> > > > +
> > > > + spin_lock_irqsave(&current_objcg_lock, flags);
> > > > + rcu_read_lock();
> > > > + memcg = mem_cgroup_from_task(current);
> > > > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > > + objcg = rcu_dereference(memcg->objcg);
> > > > + if (objcg && obj_cgroup_tryget(objcg))
> > > > + break;
> > > > + objcg = NULL;
> > > > + }
> > > > + rcu_read_unlock();
> > >
> > > Can this tryget() actually fail when this is called on the current
> > > task during fork() and attach()? A cgroup cannot be offlined while
> > > there is a task in it.
> >
> > Highly theoretically it can if it races against a migration of the current
> > task to another memcg and the previous memcg is getting offlined.
>
> Ah right, if this runs between css_set_move_task() and ->attach(). The
> cache would be briefly updated to a parent in the old hierarchy, but
> then quickly reset from the ->attach().
Even simpler:
rcu_read_lock();
memcg = mem_cgroup_from_task(current);
---------
Here the task can be moved to another memcg and the previous one
can be offlined, making objcg fully detached.
---------
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
objcg = rcu_dereference(memcg->objcg);
if (objcg && obj_cgroup_tryget(objcg))
---------
Objcg can be NULL here, or it can be non-NULL but lose the last reference
between the objcg check and obj_cgroup_tryget().
---------
break;
objcg = NULL;
}
rcu_read_unlock();
>
> Can you please add a comment along these lines?
Sure, will do.
>
> > I actually might make sense to apply the same approach for memcgs as well
> > (saving a lazily-updating memcg pointer on task_struct). Then it will be
> > possible to ditch this "for" loop. But I need some time to master the code
> > and run benchmarks. Idk if it will make enough difference to justify the change.
>
> Yeah the memcg pointer is slightly less attractive from an
> optimization POV because it already is a pretty direct pointer from
> task through the cset array.
>
> If you still want to look into it from a simplification POV that
> sounds reasonable, but IMO it would be fine with a comment.
I'll come back with some numbers, it's hard to speculate without them. In this case
the majority of savings came from not bumping and decreasing a percpu objcg
refcounter on the slab allocation path - that was quite surprising to me.
>
> > > > @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> > > > mem_cgroup_clear_mc();
> > > > }
> > > > }
> > > > +
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +static void mem_cgroup_fork(struct task_struct *task)
> > > > +{
> > > > + task->objcg = (struct obj_cgroup *)0x1;
> > >
> > > dup_task_struct() will copy this pointer from the old task. Would it
> > > be possible to bump the refcount here instead? That would save quite a
> > > bit of work during fork().
> >
> > Yeah, it should be possible. It won't save a lot, but I agree it makes
> > sense. I'll take a look and will prepare a separate patch for this.
>
> I guess the hairiest part would be synchronizing against a migration
> because all these cgroup core callbacks are unlocked.
Yep.
>
> Would it make sense to add ->fork_locked() and ->attach_locked()
> callbacks that are dispatched under the css_set_lock? Then this could
> be a simple if (p && !(p & 0x1)) obj_cgroup_get(), which would
> certainly be nice to workloads where fork() is hot, with little
> downside otherwise.
Maybe, but then the question is whether it's really worth it. In the final version
the update path doesn't need a spinlock, so it's quite cheap and happens
once on the first allocation, so Idk if it's worth it at all, but I'll take
a look.
I think the bigger question I have here (and probably worth a lsfmmbpf/plumbers
discussion) - what if we introduce a cgroup mount (or even Kconfig) option to
prohibit moving tasks between cgroups and rely solely on fork to enter the right
cgroup (a-la namespaces). I'm starting to think that this is the right path long-term,
things will be not only more reliable, but we also can ditch a lot of
synchronization and get better performance. Obviously not a small project.
Thanks!
* [PATCH rfc 3/5] mm: kmem: make memcg keep a reference to the original objcg
2023-09-27 15:08 [PATCH rfc 0/5] mm: improve performance of kernel memory accounting Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 1/5] mm: kmem: optimize get_obj_cgroup_from_current() Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct Roman Gushchin
@ 2023-09-27 15:08 ` Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 4/5] mm: kmem: scoped objcg protection Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 5/5] percpu: " Roman Gushchin
4 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2023-09-27 15:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, cgroups, Johannes Weiner, Michal Hocko,
Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton,
Roman Gushchin
Keep a reference to the original objcg object for the entire life
of a memcg structure.
This simplifies the synchronization on the kernel memory allocation
paths: pinning a (live) memcg also pins the corresponding objcg.
The memory overhead of this change is minimal because object cgroups
usually outlive their corresponding memory cgroups even without this
change, so it's only an additional pointer per memcg.
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 8 +++++++-
mm/memcontrol.c | 5 +++++
2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 84425bfe4124..412ff0e8694d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -299,7 +299,13 @@ struct mem_cgroup {
#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
- struct obj_cgroup __rcu *objcg;
+ /*
+ * memcg->objcg is wiped out as a part of the objcg reparenting
+ * process. memcg->orig_objcg preserves a pointer (and a reference)
+ * to the original objcg until the end of the memcg's life.
+ */
+ struct obj_cgroup __rcu *objcg;
+ struct obj_cgroup *orig_objcg;
/* list of inherited objcgs, protected by objcg_lock */
struct list_head objcg_list;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f33a503d600..4815f897758c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3803,6 +3803,8 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
objcg->memcg = memcg;
rcu_assign_pointer(memcg->objcg, objcg);
+ obj_cgroup_get(objcg);
+ memcg->orig_objcg = objcg;
static_branch_enable(&memcg_kmem_online_key);
@@ -5297,6 +5299,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
{
int node;
+ if (memcg->orig_objcg)
+ obj_cgroup_put(memcg->orig_objcg);
+
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
kfree(memcg->vmstats);
--
2.42.0
* [PATCH rfc 4/5] mm: kmem: scoped objcg protection
2023-09-27 15:08 [PATCH rfc 0/5] mm: improve performance of kernel memory accounting Roman Gushchin
` (2 preceding siblings ...)
2023-09-27 15:08 ` [PATCH rfc 3/5] mm: kmem: make memcg keep a reference to the original objcg Roman Gushchin
@ 2023-09-27 15:08 ` Roman Gushchin
2023-09-27 15:08 ` [PATCH rfc 5/5] percpu: " Roman Gushchin
4 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2023-09-27 15:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, cgroups, Johannes Weiner, Michal Hocko,
Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton,
Roman Gushchin
Switch to a scope-based protection of the objcg pointer on the
slab/kmem allocation paths. Instead of taking a reference in the
pre-allocation hook with the get_() semantics and putting it
afterwards, rely on the fact that the objcg is pinned by the scope.
This is possible because:
1) if the objcg is received from the current task struct, the task
keeps a reference to the objcg.
2) if the objcg is received from an active memcg (remote charging),
the memcg is pinned by the scope and holds a reference to the
corresponding objcg.
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 6 +++++
mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++++--
mm/slab.h | 10 +++------
3 files changed, 53 insertions(+), 9 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 412ff0e8694d..9a5212d3b9d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1779,6 +1779,12 @@ bool mem_cgroup_kmem_disabled(void);
int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
void __memcg_kmem_uncharge_page(struct page *page, int order);
+/*
+ * The returned objcg pointer is safe to use without additional
+ * protection within a scope, refer to the implementation for the
+ * additional details.
+ */
+struct obj_cgroup *current_obj_cgroup(void);
struct obj_cgroup *get_obj_cgroup_from_current(void);
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4815f897758c..76557370f212 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3074,6 +3074,48 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
return objcg;
}
+__always_inline struct obj_cgroup *current_obj_cgroup(void)
+{
+ struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
+
+ if (in_task()) {
+ memcg = current->active_memcg;
+ if (unlikely(memcg))
+ goto from_memcg;
+
+ objcg = READ_ONCE(current->objcg);
+ if (unlikely(current_objcg_needs_update(objcg)))
+ objcg = current_objcg_update(objcg);
+ /*
+ * Objcg reference is kept by the task, so it's safe
+ * to use the objcg by the current task.
+ */
+ return objcg;
+ } else {
+ memcg = this_cpu_read(int_active_memcg);
+ if (unlikely(memcg))
+ goto from_memcg;
+ }
+ return NULL;
+
+from_memcg:
+ for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+ /*
+ * Memcg pointer is protected by scope (see set_active_memcg())
+ * and is pinning the corresponding objcg, so objcg can't go
+ * away and can be used within the scope without any additional
+ * protection.
+ */
+ objcg = rcu_dereference_check(memcg->objcg, 1);
+ if (likely(objcg))
+ break;
+ objcg = NULL;
+ }
+
+ return objcg;
+}
+
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
struct obj_cgroup *objcg;
@@ -3168,15 +3210,15 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
struct obj_cgroup *objcg;
int ret = 0;
- objcg = get_obj_cgroup_from_current();
+ objcg = current_obj_cgroup();
if (objcg) {
ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
if (!ret) {
+ obj_cgroup_get(objcg);
page->memcg_data = (unsigned long)objcg |
MEMCG_DATA_KMEM;
return 0;
}
- obj_cgroup_put(objcg);
}
return ret;
}
diff --git a/mm/slab.h b/mm/slab.h
index 799a315695c6..8cd3294fedf5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -484,7 +484,7 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
return true;
- objcg = get_obj_cgroup_from_current();
+ objcg = current_obj_cgroup();
if (!objcg)
return true;
@@ -497,17 +497,14 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
css_put(&memcg->css);
if (ret)
- goto out;
+ return false;
}
if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s)))
- goto out;
+ return false;
*objcgp = objcg;
return true;
-out:
- obj_cgroup_put(objcg);
- return false;
}
static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
@@ -542,7 +539,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
obj_cgroup_uncharge(objcg, obj_full_size(s));
}
}
- obj_cgroup_put(objcg);
}
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
--
2.42.0
* [PATCH rfc 5/5] percpu: scoped objcg protection
2023-09-27 15:08 [PATCH rfc 0/5] mm: improve performance of kernel memory accounting Roman Gushchin
` (3 preceding siblings ...)
2023-09-27 15:08 ` [PATCH rfc 4/5] mm: kmem: scoped objcg protection Roman Gushchin
@ 2023-09-27 15:08 ` Roman Gushchin
4 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2023-09-27 15:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, cgroups, Johannes Weiner, Michal Hocko,
Shakeel Butt, Muchun Song, Dennis Zhou, Andrew Morton,
Roman Gushchin
Similar to slab and kmem, switch to a scope-based protection of the
objcg pointer to avoid an unnecessary get()/put() pair on the hot
percpu allocation path.
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
---
mm/percpu.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index a7665de8485f..f53ba692d67a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1628,14 +1628,12 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
if (!memcg_kmem_online() || !(gfp & __GFP_ACCOUNT))
return true;
- objcg = get_obj_cgroup_from_current();
+ objcg = current_obj_cgroup();
if (!objcg)
return true;
- if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) {
- obj_cgroup_put(objcg);
+ if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
return false;
- }
*objcgp = objcg;
return true;
@@ -1649,6 +1647,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
return;
if (likely(chunk && chunk->obj_cgroups)) {
+ obj_cgroup_get(objcg);
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
rcu_read_lock();
@@ -1657,7 +1656,6 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
rcu_read_unlock();
} else {
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
- obj_cgroup_put(objcg);
}
}
--
2.42.0