From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 29 Mar 2023 11:45:05 -0700
Subject: Re: [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic
To: Michal Hocko
Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
 Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
 Michal Koutný, Vasily Averin, cgroups@vger.kernel.org,
 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, bpf@vger.kernel.org
References: <20230328221644.803272-1-yosryahmed@google.com>
 <20230328221644.803272-6-yosryahmed@google.com>

On Wed, Mar 29, 2023 at 8:58 AM Michal Hocko wrote:
>
> On Tue 28-03-23 22:16:40, Yosry Ahmed wrote:
> > As Johannes notes in [1], stats_flush_lock is currently used to:
> > (a) Protect updates to stats_flush_threshold.
> > (b) Protect updates to flush_next_time.
> > (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
> >
> > However:
> >
> > 1. stats_flush_threshold is already an atomic.
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> >    is lockless. If the reader races with a flush, you could see this:
> >
> >                              if (time_after(jiffies, flush_next_time))
> >    spin_trylock()
> >    flush_next_time = now + delay
> >    flush()
> >    spin_unlock()
> >                              spin_trylock()
> >                              flush_next_time = now + delay
> >                              flush()
> >                              spin_unlock()
> >
> >    which means we already can get flushes at a higher frequency than
> >    FLUSH_TIME during races. But it isn't really a problem.
> >
> >    The reader could also see garbled partial updates, so it needs at
> >    least READ_ONCE and WRITE_ONCE protection.
>
> Just a nit. Sounds more serious than it is actually. This would only
> happen if the compiler decides to split the write.

Thanks for the note, Michal. I honestly quoted Johannes here, as I do
not have much expertise when it comes to this. I will add "if the
compiler decides to split the write" to the commit log if I respin.
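For concreteness, here is a minimal userspace sketch of the concern
(illustrative only: the names are made up, and C11 relaxed atomics
stand in for the kernel's READ_ONCE()/WRITE_ONCE(); both keep the
compiler from splitting or re-reading the access, so a lockless
reader never observes a half-written 64-bit value):

/*
 * A 64-bit deadline written by a serialized writer and read
 * locklessly, standing in for flush_next_time.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t next_deadline;

static void publish_deadline(uint64_t now, uint64_t delay)
{
	/* Writer side: analogous to WRITE_ONCE(flush_next_time, ...) */
	atomic_store_explicit(&next_deadline, now + delay,
			      memory_order_relaxed);
}

static int deadline_passed(uint64_t now)
{
	/* Reader side: analogous to READ_ONCE(flush_next_time) */
	return now > atomic_load_explicit(&next_deadline,
					  memory_order_relaxed);
}

int main(void)
{
	publish_deadline(100, 20);	/* deadline is now 120 */
	printf("%d %d\n", deadline_passed(110), deadline_passed(125));
	return 0;
}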
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> >    factors is currently broken because of the race in 2. But the race
> >    is actually harmless, all we might get is the occasional earlier
> >    flush. If there is no delta, the flush won't do much. And if there
> >    is, the flush is justified.
> >
> > So the lock can be removed altogether. However, the lock also served
> > the purpose of preventing a thundering herd problem for concurrent
> > flushers, see [2]. Use an atomic instead to serve the purpose of
> > unifying concurrent flushers.
> >
> > [1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> > [2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
> >
> > Signed-off-by: Yosry Ahmed
> > Acked-by: Johannes Weiner
>
> Acked-by: Michal Hocko
>
> > ---
> >  mm/memcontrol.c | 18 +++++++++++-------
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ff39f78f962e..65750f8b8259 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >   */
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > -static DEFINE_SPINLOCK(stats_flush_lock);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> >
> > @@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >
> >  static void __mem_cgroup_flush_stats(void)
> >  {
> > -	unsigned long flag;
> > -
> > -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > +	/*
> > +	 * We always flush the entire tree, so concurrent flushers can just
> > +	 * skip. This avoids a thundering herd problem on the rstat global lock
> > +	 * from memcg flushers (e.g. reclaim, refault, etc).
> > +	 */
> > +	if (atomic_read(&stats_flush_ongoing) ||
> > +	    atomic_xchg(&stats_flush_ongoing, 1))
> >  		return;
> >
> > -	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > +	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> >  	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> >  	atomic_set(&stats_flush_threshold, 0);
> > -	spin_unlock_irqrestore(&stats_flush_lock, flag);
> > +	atomic_set(&stats_flush_ongoing, 0);
> >  }
> >
> >  void mem_cgroup_flush_stats(void)
> > @@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
> >
> >  void mem_cgroup_flush_stats_ratelimited(void)
> >  {
> > -	if (time_after64(jiffies_64, flush_next_time))
> > +	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> >  		mem_cgroup_flush_stats();
> >  }
> >
> > --
> > 2.40.0.348.gf938b09366-goog
>
> --
> Michal Hocko
> SUSE Labs
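To make the single-flusher pattern adopted by the patch concrete, a
minimal userspace sketch follows (hypothetical names; C11 atomics
stand in for the kernel's atomic_t API, and the flush body is a
stub). The plain read up front lets concurrent flushers skip without
dirtying the cache line that an unconditional xchg would write on
every call:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int flush_ongoing;

static void flush_everything(void)
{
	puts("flushing the entire tree");
}

static void maybe_flush(void)
{
	/* Cheap read first; only one caller wins the xchg. */
	if (atomic_load(&flush_ongoing) ||
	    atomic_exchange(&flush_ongoing, 1))
		return;		/* a concurrent flusher covers us */

	flush_everything();
	atomic_store(&flush_ongoing, 0);
}

int main(void)
{
	maybe_flush();	/* first caller performs the flush */
	return 0;
}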