From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC484CD4F30 for ; Wed, 4 Sep 2024 21:10:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 118366B02AC; Wed, 4 Sep 2024 17:10:47 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0C9836B02AE; Wed, 4 Sep 2024 17:10:47 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E84466B02B0; Wed, 4 Sep 2024 17:10:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id C6AA36B02AC for ; Wed, 4 Sep 2024 17:10:46 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 3A0CBA0576 for ; Wed, 4 Sep 2024 21:10:46 +0000 (UTC) X-FDA: 82528300092.26.AF9AA6A Received: from mail-lf1-f46.google.com (mail-lf1-f46.google.com [209.85.167.46]) by imf18.hostedemail.com (Postfix) with ESMTP id 4944D1C0019 for ; Wed, 4 Sep 2024 21:10:44 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=0a9RsRuP; spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725484168; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=; b=IdV/yuL7KZI1kTLj38t9iRAxBWY9KmBTAIq4tghPasEQDUZh0ZJVASLzexcQMwKq2JjSG4 cnHUK5Ddz2zHqwzZodjBliTU2hWi/guB0zzcebVWV6UlT4Pqv0MsrUo3cHAkPSpQKO6xod o1odeGwMcBd4ygBc93EM8FYB04l5MCA= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=0a9RsRuP; spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.46 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725484168; a=rsa-sha256; cv=none; b=mdmu3m/UjWD+JKx5DjQzB7RJKBlBjz7llj8PapUK5Bvfh9IXaTAbJN1NKJtiUKmEC/Mory 9bCW4aBCtSHjKqNAgpUq9aJng9HP2mYRlsySzYxJs91FRS8vfw1R8ktjM6800AtwKa4b0z avC5ikYyywpOS6YmU5FFdlZ68oAA9cg= Received: by mail-lf1-f46.google.com with SMTP id 2adb3069b0e04-5333b2fbedaso12381959e87.0 for ; Wed, 04 Sep 2024 14:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725484242; x=1726089042; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=; b=0a9RsRuPOC/Eb42HMaWcD/65Js3XIZWAD/MMQbJfwGOep3DOBZLDP1COHGTNCYtlzE pCTMr4ug0QJT30nPd6zq6i5Cb417PTOFurW7eB8z1lzslPD+4zj6EdPpKik5m2rdm8qb /XnM1b4/dNOq2ouRwDureoncd33Reya/vdeB6CXPlFTov/bICin0UiXNZCFNYNNduW+4 R2eBjMaf2fI6397o7lEDp5S/Ht5rgKUsZlTPocFwQ2yDi4yguMxHIMEJ8e+mZmAeoj4z vc3YYzXOpvRMYdH8r0MmK8ur6DxGPA+pBWPSC+/v9+RG1hdeegMQJ3kE3/b7iiSR6cLN dOEQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725484242; x=1726089042; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=XP8HEjcvcsSjCqyyJiUYtMj75FT90mtxfvTOuGNKzpc=; b=SVVyOikC1bYTmCb2c79OyfyYBqhrkjBC8GvS+FSJrjsHm6+6fhuioouqsLpIF2fV+c CH6UIb8idGGxregouo38tN1HWOMaTrJbtL1/oBXpQAHXxUqnkVR9/xgJX/lTWEyF/Mgl JUniUeg/p+o6Kat/eIYR9IRjlC7SYNiynw6DxJ7Bi0NgUcpSbCQqxoLznKxj+C+AH9PO l97NCO4oiNZNv+hDcQzKosHfpw2HOpcsfqbE1PAFu/CpVCSSiPYJN2zxym7tLTXNaslG fFhZYDNnhtL5O8xabrUJcnz7E0cIIluOgAz203A2mSxFARIw7evIRr4se1nZkIxVpA5d 7Z5A== X-Forwarded-Encrypted: i=1; AJvYcCV2b2qHhjJCMPVjxWadhw6C3qHQooNH4zKZpk+VhAnOFsPvL+2u/pA+VlDmpkoUitgbHVaeEeae6w==@kvack.org X-Gm-Message-State: AOJu0Yx97AtPQ/tBe2oPErtYXGvcu/wQgBaUX9VwesZy+UjQg5HQ889U HfvYDJoCPGk5mZlgPNGv9RifM5RUeeLolBIEWkQ7esQ5ps+J8maTX381MLo7znD34QOpPLKLnYp A0XFhW3EjL/eh0OUCwDGQqwe4PuLPGp0A85fl X-Google-Smtp-Source: AGHT+IHWolUXkuWL//rhfaUvkN/zThvDPlsWG88X63zP37akORnLDoixS2Ft+XAO3y3JFu8SwbH6T6Q9UATa2h6DvsQ= X-Received: by 2002:a05:6512:1286:b0:533:e4d:3374 with SMTP id 2adb3069b0e04-53546ba8fdamr12344003e87.57.1725484241758; Wed, 04 Sep 2024 14:10:41 -0700 (PDT) MIME-Version: 1.0 References: <172547884995.206112.808619042206173396.stgit@firesoul> In-Reply-To: <172547884995.206112.808619042206173396.stgit@firesoul> From: Yosry Ahmed Date: Wed, 4 Sep 2024 14:10:03 -0700 Message-ID: Subject: Re: [PATCH V10] cgroup/rstat: Avoid flushing if there is an ongoing root flush To: Jesper Dangaard Brouer Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: gffkk5orxw9g1ecam69jc8e1o5s83567 X-Rspam-User: X-Rspamd-Queue-Id: 4944D1C0019 X-Rspamd-Server: rspam02 X-HE-Tag: 1725484244-519704 X-HE-Meta: U2FsdGVkX1+9PiF5zA+7TBxA8WaWGAEToOOO7+sEiFl6e6jokkW/GY0wvlp9VWcOjlFR7lEagHVo1OVc/wrlVvtLVsnNQPJAMjPeNAOu96zwKD1wUIKwJd+hpBLz2nbNGymMG2NfvkGI9CQ9ZzOs89Hz+umhGU/DFqgZdkop1oenXpP2GIn7zrVKd758j84C8w7p1EzT2xbDQBGjMEy4ogW3jm8i2Ub8rkrxnp51QZxH+p3CyIgqIQFBB1J+NrKXXO2bTrCcSQT/7Kv2GzUV/TnyKPthqpDmqUDMVVnMHaF27/kZeiCkQy2IK/vsxTceSeGdSp7YqRMsb6af/SmbZiYwhDzK1oGJmH116Q4YJxZF96GQf39Uc7+14b1iq/M8VMU0DoIYNqfSTCZ86SwnsSDDYoXp0jmqFU5FKIhcxeK1XlQgM61LG+l/eY88RWf5rzrVorQRq+mLyW/4Fbx9AAdOA0nJG0kbmQAD3va42tJeBaVzWb7As+t+4+9NQo52OE9/E2HsDPDuh9MdQyAGIhH8bm7fmQNMa9ZIRD14AFnebjW4POEYzigTpUJtopSBALzaklWWBtFGWmMRBavOaQC4KXCYKjjShbhEoWEOVJ5BRCroaRjb7rDM2IPNEPKrY2df88g+I+1WB4mDOQx53coBAMBjJY3iXzxX3gZ/O/b3z3cHV044DgOMFD7Fa/N4rf66qHBmSMpwu/n2XHfFKfXun5pGquh/5d6ZvlQ2C18VoLqHZKNgRLAFFPFhQNddcl2ywj8G3QderhC7L7jvMGLwhhPxDj8ZbmMmqZ4yn4dH66PHQAdp5bXtYsa3xVgzOBkOPhcMzoVsdnH+Yk3aNpfzhQpvFJ+VYDqenLUZDIJAm5MzpS6XOob8nJ1Bh+n6Fe8jfYm9vW1khoHia9rR6HilOA+nN/bEZREFQRf57D/pGRyB4Wz+dU4oYdvq+aYwNFZSbqVtQ1HaSmluPzJ BS2twIoZ DmHoYVAlHZfEDrJxChKOyZHRBSOyG40H+7DH9SioS6AWLq0mUgIDxCHHBNA3A3vOP9sk01weEVz37VKxIhbjITDHAdbGtUrRPgU5iYZQ/wPqbw2EDysrsUVuD7OHrhodGPrQKbhB11m79ENyp0HbtxsFVLrwwynPi8UGCGkUAXnK4jRK9uNKSYKLWEOubwU+HjVSfD2nq8uLhMO0lR2j37k3UZiArGqg1k3Ws X-Bogosity: Ham, tests=bogofilter, spamicity=0.000006, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Sep 4, 2024 at 12:41=E2=80=AFPM Jesper Dangaard Brouer wrote: > > This patch reintroduces and generalizes the "stats_flush_ongoing" concept= to > avoid redundant flushes if there is an ongoing flush at cgroup root level= , > addressing production lock contention issues on the global cgroup rstat l= ock. > > At Cloudflare, we observed significant performance degradation due to > lock contention on the rstat lock, primarily caused by kswapd. The > specific mem_cgroup_flush_stats() call inlined in shrink_node, which > takes the rstat lock, is particularly problematic. > > On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, w= e > noted severe lock contention on the rstat lock, causing 12 CPUs to waste > cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedst= at) > for kthreads revealed that we are burning an average of 20,000 CPU cores > fleet-wide on kswapd, primarily due to spinning on the rstat lock. > > Here's a brief overview of the issue: > - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN thre= ads > to wake up simultaneously. > - The kswapd thread invokes shrink_node (via balance_pgdat), triggering t= he > cgroup rstat flush operation as part of its work. > - balance_pgdat() has a NULL value in target_mem_cgroup, causing > mem_cgroup_flush_stats() to flush with root_mem_cgroup. > > The kernel previously addressed this with a "stats_flush_ongoing" concept= , > which was removed in commit 7d7ef0a4686a ("mm: memcg: restore subtree sta= ts > flushing"). This patch reintroduces and generalizes the concept to apply = to > all users of cgroup rstat, not just memcg. > > In this patch only a root cgroup can become the ongoing flusher, as this = solves > the production issue. Letting other levels becoming ongoing flusher cause= root > cgroup to contend on the lock again. > > Some in-kernel users of the cgroup_rstat_flush() API depend on waiting fo= r the > flush to complete before continuing. This patch introduce the call > cgroup_rstat_flush_relaxed() and use it in those cases that can live with > slightly inaccurate flushes. > > This change significantly reduces lock contention, especially in > environments with multiple NUMA nodes, thereby improving overall system > performance. > > Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). > Signed-off-by: Jesper Dangaard Brouer I am honestly not happy with this patch, see below. > --- > Please include a brief delta vs the previous version here to save reviewers the duplicated effort of figuring it out. > V9: https://lore.kernel.org/all/172245504313.3147408.12138439169548255896= .stgit@firesoul/ > V8: https://lore.kernel.org/all/172139415725.3084888.13770938453137383953= .stgit@firesoul > V7: https://lore.kernel.org/all/172070450139.2992819.13210624094367257881= .stgit@firesoul > V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.= stgit@firesoul/ > V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.= stgit@firesoul/ > V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077= .stgit@firesoul/ > V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.= stgit@firesoul/ > V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.= stgit@firesoul/ > V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519= .stgit@firesoul/ > RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252= .stgit@firesoul/ [..] > @@ -299,6 +301,67 @@ static inline void __cgroup_rstat_unlock(struct cgro= up *cgrp, int cpu_in_loop) > spin_unlock_irq(&cgroup_rstat_lock); > } > > +static inline bool cgroup_is_root(struct cgroup *cgrp) > +{ > + return cgroup_parent(cgrp) =3D=3D NULL; > +} > + > +/** > + * cgroup_rstat_trylock_flusher - Trylock that checks for on ongoing flu= sher > + * @cgrp: target cgroup > + * > + * Function return value follow trylock semantics. Returning true when l= ock is > + * obtained. Returning false when not locked and it detected flushing ca= n be > + * skipped as another ongoing flusher is taking care of the flush. > + * > + * For callers that depend on flush completing before returning a strict= option > + * is provided. > + */ > +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool stric= t) > +{ > + struct cgroup *ongoing; > + > + if (strict) > + goto lock; > + > + /* > + * Check if ongoing flusher is already taking care of this. Desc= endant > + * check is necessary due to cgroup v1 supporting multiple root's= . > + */ > + ongoing =3D READ_ONCE(cgrp_rstat_ongoing_flusher); > + if (ongoing && cgroup_is_descendant(cgrp, ongoing)) > + return false; Why did we drop the agreed upon method of waiting until the flushers are done? This is now a much more intrusive patch which makes all flushers skip if a root is currently flushing. This causes user-visible problems and is something that I worked hard to fix. I thought we got good results with waiting for the ongoing flusher as long as it is a root? What changed? You also never addressed my concern here about 'ongoing' while we are accessing it, and never responded to my question in v8 about expanding this to support non-root cgroups once we shift to a mutex. I don't appreciate the silent yet drastic change made in this version and without addressing concerns raised in previous versions. Please let me know if I missed something. > + > + /* Grab right to be ongoing flusher */ > + if (!ongoing && cgroup_is_root(cgrp)) { > + struct cgroup *old; > + > + old =3D cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp); > + if (old) { > + /* Lost race for being ongoing flusher */ > + if (cgroup_is_descendant(cgrp, old)) > + return false; > + } > + /* Due to lock yield combined with strict mode record ID = */ > + WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current); I am not sure I understand why we need this, do you mind elaborating?