Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Ivan Babrou <ivan@cloudflare.com>
To: Wei Xu <weixugc@google.com>
Cc: "Michal Hocko" <mhocko@suse.com>,
	"Yosry Ahmed" <yosryahmed@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Shakeel Butt" <shakeelb@google.com>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
	"Waiman Long" <longman@redhat.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, "Greg Thelen" <gthelen@google.com>
Subject: Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
Date: Thu, 7 Sep 2023 18:02:50 -0700	[thread overview]
Message-ID: <CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA@mail.gmail.com> (raw)
In-Reply-To: <CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com>

On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc@google.com> wrote:
>
> On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > Unified flushing allows for great concurrency for paths that attempt to
> > > flush the stats, at the expense of potential staleness and a single
> > > flusher paying the extra cost of flushing the full tree.
> > >
> > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > may be unexpected and problematic, especially when such stats are used
> > > for critical paths such as userspace OOM handling. Additionally, a
> > > userspace reader will occasionally pay the cost of flushing the entire
> > > hierarchy, which also causes problems in some cases [1].
> > >
> > > Opt userspace reads out of unified flushing. This makes the cost of
> > > reading the stats more predictable (proportional to the size of the
> > > subtree), as well as the freshness of the stats. Userspace readers are
> > > not expected to have similar concurrency to in-kernel flushers,
> > > serializing them among themselves and among in-kernel flushers should be
> > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > userspace readers to make sure only a single userspace reader can compete
> > > with in-kernel flushers at a time. This takes away userspace ability to
> > > directly influence or hurt in-kernel lock contention.
> >
> > I think it would be helpful to note that the primary reason this is a
> > concern is that the spinlock is dropped during flushing under
> > contention.
> >
> > > An alternative is to remove flushing from the stats reading path
> > > completely, and rely on the periodic flusher. This should be accompanied
> > > by making the periodic flushing period tunable, and providing an
> > > interface for userspace to force a flush, following a similar model to
> > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > implementation needs to be changed because:
> > > - The cost of reading stats will be very cheap and we won't be able to
> > >   take that back easily.
> > > - There are user-visible interfaces involved.
> > >
> > > Hence, let's go with the change that's most reversible first and revisit
> > > as needed.
> > >
> > > This was tested on a machine with 256 cpus by running a synthetic test
> > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > > the stats every second (including root, top-level, and leaf cgroups --
> > > so total 251 threads). No significant regressions were observed in the
> > > total run time, which means that userspace readers are not significantly
> > > affecting in-kernel flushers:
> > >
> > > Base (mm-unstable):
> > >
> > > real  0m22.500s
> > > user  0m9.399s
> > > sys   73m41.381s
> > >
> > > real  0m22.749s
> > > user  0m15.648s
> > > sys   73m13.113s
> > >
> > > real  0m22.466s
> > > user  0m10.000s
> > > sys   73m11.933s
> > >
> > > With this patch:
> > >
> > > real  0m23.092s
> > > user  0m10.110s
> > > sys   75m42.774s
> > >
> > > real  0m22.277s
> > > user  0m10.443s
> > > sys   72m7.182s
> > >
> > > real  0m24.127s
> > > user  0m12.617s
> > > sys   78m52.765s
> > >
> > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > OK, I can live with that but I still believe that locking involved in
> > the user interface only begs for issues later on as there is no control
> > over that lock contention other than the number of processes involved.
> > As it seems that we cannot make a consensus on this concern now and this
> > should be already helping existing workloads then let's just buy some
> > more time ;)
>
> Indeed, even though the new global mutex protects the kernel from the
> userspace contention on the rstats spinlock, its current
> implementation doesn't have any protection for the lock contention
> among the userspace threads and can cause significant delays to memcg
> stats reads.
>
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second. This is a significant regression.
> In comparison, without contention, each memory.stat read only takes
> 20us-50us in the kernel.  Almost all of the extra read time is spent
> on waiting for the new mutex. The time to flush rstats only accounts
> for 10us-50us (This test creates only 1K memory cgroups and doesn't
> generate any loads other than these stat reader threads).
>
>  Here are some ideas to control the lock contention on the mutex and
> reduce both the median and tail latencies of concurrent memcg stat
> reads:
>
> - Bring back the stats_flush_threshold check in
> mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
> check provides a reasonable bound on the stats staleness while being
> able to filter out a large number of rstats flush requests, which
> reduces the contention on the mutex.
>
> - When contended, upgrade the per-memcg rstats flush in
> mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> contended flushes together.  We can wait for the ongoing flush to
> complete and eliminate repeated flush requests.

Full root memcg flush being slow is one of the issues that prompted this patch:

* https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

I don't want us to regress in this regard.

> - Wait for the mutex and the ongoing flush with a timeout.  We should
> not use busy-wait, though.  We can bail out to read the stats without
> a flush after enough wait.  A long-stalled system call is much worse
> than somewhat stale stats in the corner cases in my opinion.
>
> Wei Xu
>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > Thanks!
> >
> > > ---
> > >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> > >  1 file changed, 20 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 94d5a6751a9e..46a7abf71c73 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> > >  static void flush_memcg_stats_dwork(struct work_struct *w);
> > >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > > +static DEFINE_MUTEX(stats_user_flush_mutex);
> > >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> > >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> > >  static u64 flush_next_time;
> > > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> > >       cgroup_rstat_flush(memcg->css.cgroup);
> > >  }
> > >
> > > +/*
> > > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > > + * @memcg: memory cgroup to flush
> > > + *
> > > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > > + * readers hogging the lock.
> >
> > readers hogging the lock as do_stats_flush drops the spinlock under
> > contention.
> >
> > > + */
> > > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > > +{
> > > +     mutex_lock(&stats_user_flush_mutex);
> > > +     do_stats_flush(memcg);
> > > +     mutex_unlock(&stats_user_flush_mutex);
> > > +}
> > > +
> > >  /*
> > >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > >   *
> > --
> > Michal Hocko
> > SUSE Labs
> >

next prev parent reply	other threads:[~2023-09-08  1:03 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
2023-09-04 14:44   ` Michal Hocko
2023-09-05 15:55     ` Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
2023-09-04 14:45   ` Michal Hocko
2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
2023-09-04 14:50   ` Michal Hocko
2023-09-04 15:29     ` Michal Koutný
2023-09-04 15:41       ` Michal Hocko
2023-09-05 14:10         ` Michal Koutný
2023-09-05 15:54           ` Yosry Ahmed
2023-09-05 16:07             ` Michal Koutný
2023-09-12 11:03             ` Michal Hocko
2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
2023-09-04 15:15   ` Michal Hocko
2023-09-05 15:57     ` Yosry Ahmed
2023-09-08  0:52     ` Wei Xu
2023-09-08  1:02       ` Ivan Babrou [this message]
2023-09-08  1:11         ` Yosry Ahmed
2023-09-11 13:11       ` Michal Hocko
2023-09-11 19:15         ` Wei Xu
2023-09-11 19:34           ` Michal Hocko
2023-09-11 20:01             ` Wei Xu
2023-09-11 20:21               ` Tejun Heo
2023-09-11 20:28                 ` Yosry Ahmed
2023-09-12 11:03                 ` Michal Hocko
2023-09-12 11:09                   ` Yosry Ahmed
2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABWYdi1WNp9f20nRFEExn8QB1MwP7QXwvD6Q8xHHuTO2SUTLkA@mail.gmail.com \
    --to=ivan@cloudflare.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=tj@kernel.org \
    --cc=weixugc@google.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox