From: Yosry Ahmed
Date: Tue, 5 Sep 2023 08:57:49 -0700
Subject: Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
To: Michal Hocko
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230831165611.2610118-1-yosryahmed@google.com> <20230831165611.2610118-5-yosryahmed@google.com>
On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault).
> > For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as is the freshness of the stats. Userspace readers are
> > not expected to have concurrency similar to in-kernel flushers;
> > serializing them among themselves and among in-kernel flushers should
> > be okay. Nonetheless, for extra safety, introduce a mutex when flushing
> > for userspace readers to make sure only a single userspace reader can
> > compete with in-kernel flushers at a time. This takes away userspace's
> > ability to directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 CPUs by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups).
> > Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so 251 threads in total). No significant regressions were observed in
> > the total run time, which means that userspace readers are not
> > significantly affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real	0m22.500s
> > user	0m9.399s
> > sys	73m41.381s
> >
> > real	0m22.749s
> > user	0m15.648s
> > sys	73m13.113s
> >
> > real	0m22.466s
> > user	0m10.000s
> > sys	73m11.933s
> >
> > With this patch:
> >
> > real	0m23.092s
> > user	0m10.110s
> > sys	75m42.774s
> >
> > real	0m22.277s
> > user	0m10.443s
> > sys	72m7.182s
> >
> > real	0m24.127s
> > user	0m12.617s
> > sys	78m52.765s
> >
> > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> >
> > Signed-off-by: Yosry Ahmed
>
> OK, I can live with that but I still believe that the locking involved in
> the user interface only begs for issues later on, as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot reach a consensus on this concern now, and this
> should already be helping existing workloads, let's just buy some
> more time ;)
>
> Acked-by: Michal Hocko

Thanks!

I agree, let's fix problems if/when they arise, maybe it will be just fine :)

I will send a v5 collecting Acks and augmenting the changelog and comment
below as you suggested (probably after we resolve patch 3).

>
> Thanks!
> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >  	cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +	mutex_lock(&stats_user_flush_mutex);
> > +	do_stats_flush(memcg);
> > +	mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> --
> Michal Hocko
> SUSE Labs