From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 17 Jul 2024 09:04:07 -0700
Subject: Re: [PATCH V4 2/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
To: Jesper Dangaard Brouer
Cc: Shakeel Butt, tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
On Wed, Jul 17, 2024 at 12:46 AM Jesper Dangaard Brouer wrote:
>
>
> On 16/07/2024 23.54, Yosry Ahmed wrote:
> > On Mon, Jul 8, 2024 at 8:26 AM Jesper Dangaard Brouer wrote:
> >>
> >>
> >> On 28/06/2024 11.39, Jesper Dangaard Brouer wrote:
> >>>
> >>>
> >>> On 28/06/2024 01.34, Shakeel Butt wrote:
> >>>> On Thu, Jun 27, 2024 at 11:18:56PM GMT, Jesper Dangaard Brouer wrote:
> >>>>> Avoid lock contention on the global cgroup rstat lock caused by kswapd
> >>>>> starting on all NUMA nodes simultaneously.
> >>>>> At Cloudflare, we observed
> >>>>> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
> >>>>> call inlined in shrink_node, which takes the rstat lock.
> >>>>>
> >> [...]
> >>>>>      static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> >>>>> @@ -312,6 +315,45 @@ static inline void __cgroup_rstat_unlock(struct
> >>>>> cgroup *cgrp, int cpu_in_loop)
> >>>>>          spin_unlock_irq(&cgroup_rstat_lock);
> >>>>>      }
> >>>>> +#define MAX_WAIT  msecs_to_jiffies(100)
> >>>>> +/* Trylock helper that also checks for an ongoing flusher */
> >>>>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> >>>>> +{
> >>>>> +    bool locked = __cgroup_rstat_trylock(cgrp, -1);
> >>>>> +    if (!locked) {
> >>>>> +        struct cgroup *cgrp_ongoing;
> >>>>> +
> >>>>> +        /* Lock is contended, let's check if an ongoing flusher is already
> >>>>> +         * taking care of this, if we are a descendant.
> >>>>> +         */
> >>>>> +        cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> >>>>> +        if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) {
> >>>>
> >>>> I wonder if READ_ONCE() and cgroup_is_descendant() need to happen
> >>>> within an RCU section. On a preemptible kernel, let's say we got
> >>>> preempted in between them, the flusher was unrelated and got freed
> >>>> before we get the CPU. In that case we are accessing freed memory.
> >>>>
> >>>
> >>> I have to think about this some more.
> >>>
> >>
> >> I don't think this is necessary. We are now waiting (for completion) and
> >> not skipping the flush, because as part of the takedown path
> >> cgroup_rstat_exit() is called, which will call cgroup_rstat_flush():
> >>
> >> void cgroup_rstat_exit(struct cgroup *cgrp)
> >> {
> >>         int cpu;
> >>         cgroup_rstat_flush(cgrp);
> >>
> >
> > Sorry for the late response, I was traveling for a bit. I will take a
> > look at your most recent version shortly. But I do have a comment
> > here.
> >
> > I don't see how this addresses Shakeel's concern.
> > IIUC, if the cgroup
> > was freed after READ_ONCE() (and cgroup_rstat_flush() was called),
> > then cgroup_is_descendant() will access freed memory. We are not
> > holding the lock here, so we are not preventing cgroup_rstat_flush()
> > from being called for the freed cgroup, right?
>
> If we go back to only allowing the root cgroup to be the ongoing flusher,
> then we could do a cgroup_rstat_flush(root) in cgroup_rstat_exit() to be
> sure nothing is left waiting in the completion scheme. Right?

I am still not sure I understand how this helps. We still need to call
cgroup_is_descendant() because in cgroup v1 we may have multiple root
cgroups, right? So it is still possible that the cgroup is freed after
READ_ONCE() and cgroup_is_descendant() accesses freed memory, unless of
course we have other guarantees that the root cgroups will not go away.

Since at this point we are not holding the rstat lock, or actually
waiting for the ongoing flush (yet), I don't see how any
cgroup_rstat_flush() calls in the cgroup exit paths will help.

I actually think RCU may not help either for non-root cgroups, because
we call cgroup_rstat_flush() in cgroup_rstat_exit(), which is called
*after* the RCU grace period, and the cgroup is freed right away after
that. We may need to replace kfree(cgrp) with kfree_rcu(cgrp) in
css_free_rwork_fn().

> IMHO the code is getting too complicated with sub-cgroups as ongoing
> flushers, which also required having 'completion' queues per cgroup.
> We should go back to only doing this for the root cgroup.

Because of the multiple root cgroups in cgroup v1, we may still need that
anyway, right? Please let me know if I am missing something.