From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 17 Jul 2024 11:43:17 -0700
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <172070450139.2992819.13210624094367257881.stgit@firesoul> <8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org>

[..]
> > What I think we should be doing is either supporting multiple
> > ongoing_flushers (by replacing the global variable with a per-cgroup
> > flag, perhaps), or biting the bullet and starting to use a mutex so
> > we can drop lock yielding. If there aren't any concerns beyond
> > priority inversion (raised by Shakeel), maybe adding a
> > mutex_lock_timeout() variant as I suggested could fix this issue
> > (and also put a bound on the flushing latency in general)?
> >
>
> The mutex_lock_timeout is an interesting idea, but not a primitive we
> have available today, right?

We don't, but Waiman said it shouldn't be difficult to add:
https://lore.kernel.org/lkml/623e62c5-3045-4dca-9f2c-ed15b8d3bad8@redhat.com/

>
> p.s. I have 5 prod machines running with the mutex change, and I've not
> gotten any SRE complaints.

That's nice!

> >
> > (I suppose the latter is preferable)
> >
> >>
> >> IMHO we should go back to only doing ongoing_flusher for the root
> >> cgroup. There is a really low chance of sub-tree flushes being
> >> concurrent enough to benefit from this, and it causes issues and
> >> needs (ugly) race handling.
> >>
> >> Furthermore, sub-tree flushing doesn't take as long as level 0
> >> flushing, which further lowers the chance of concurrent flushes.
> >
> > I agree that the race handling is not pretty, and we can try to
> > improve the implementation or handle the race in other ways. However,
> > I think that skipping flushes for subtrees is valuable. I am not sure
> > about the cgroup arrangement in your use case, but we do have cgroups
> > with a lot of tasks in them (and/or child cgroups). If there is memory
> > pressure (or they hit the cgroup limit), they all may go into direct
> > reclaim concurrently, so skipping flushes could really be beneficial.
> >
> > Of course, if the difference in complexity is not justified, we can go
> > back to only supporting root cgroups for ongoing_flusher for now. But
> > as I mentioned in the v4 email thread, some of the complexity may
> > still remain anyway as we have multiple root cgroups in v1.
> >
>
> Having an incremental step with "only supporting root cgroups for
> ongoing_flusher for now" is a good step forward IMHO.
> As you could see in the grafana plot, this would be a significant
> production improvement on its own, as it avoids wasting CPU resources
> spinning on the lock.

I am not opposed to this at all. All I am saying is, if we need to
handle most of the complexity anyway due to multiple root cgroups in v1,
then we might as well support subtrees too.

>
> Being able to have multiple root cgroups, due to cgroup v1, does pose
> an implementation problem. Only having a single root would allow a
> race-free cmpxchg scheme.
> Would it be reasonable to only support v2?

The rstat code has so far been v1/v2 agnostic AFAICT, and Google is
still using cgroup v1, so I naturally prefer we keep supporting both v1
and v2 going forward.

> If so, how can I match on this?

cgroup_on_dfl() is basically testing if a cgroup is on v2, but I really
want to keep v1 included if possible :/
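For concreteness, the single-root claim you describe could look roughly
like this (an untested sketch, only for illustration: the
rstat_ongoing_flusher variable and both function names are made up,
while cgroup_parent() and cmpxchg() are real kernel helpers):

/* Untested sketch of a single-root cmpxchg claim, illustration only. */
static struct cgroup *rstat_ongoing_flusher;

static bool rstat_try_become_flusher(struct cgroup *cgrp)
{
	/* Only root cgroups are eligible; a root has no parent. */
	if (cgroup_parent(cgrp))
		return false;

	/* Atomically claim the slot iff no flush is ongoing. */
	return cmpxchg(&rstat_ongoing_flusher, NULL, cgrp) == NULL;
}

static void rstat_done_flushing(struct cgroup *cgrp)
{
	/* Only the owner may clear the slot. */
	WARN_ON_ONCE(cmpxchg(&rstat_ongoing_flusher, cgrp, NULL) != cgrp);
}

Losers of the cmpxchg would then wait for the ongoing flusher to finish
instead of piling up on the lock. With multiple v1 roots the picture is
less clean, which is the implementation problem you point out above.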
>
> >>
> >> Let's get some quick data on flush times from production, to support
> >> my claim:
> >
> > Thanks for the data. I agree that in general root flushes will be a
> > lot more expensive than subtree flushes, but keep in mind that the
> > data may look really different depending on the cgroup setup. As I
> > mentioned, I think we should still support subtree flushes unless the
> > difference in complexity is not justified.
> >
>
> It would be really valuable if you could provide production data on the
> lock-hold times, just like I did with the bpftrace script below...
> Is that possible, please?

Unfortunately, we don't have the infrastructure to do this :/

But again, I am not objecting to only supporting root cgroups as
ongoing flushers for now if there is a justifiable complexity
difference. So this shouldn't be a blocker.
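P.S. Since mutex_lock_timeout() came up again above: to be clear, no
such primitive exists upstream today. What I have in mind is roughly
the following (a sketch of the proposed interface plus a hypothetical
caller, not existing code; cgroup_rstat_mutex is a made-up name for a
mutex-converted rstat lock):

/*
 * Proposed primitive, sketch only -- this does not exist yet.
 * Like mutex_lock_killable(), but give up after the given timeout.
 * Returns 0 on success, -ETIMEDOUT if the lock was not acquired.
 */
int mutex_lock_timeout(struct mutex *lock, unsigned long timeout);

static DEFINE_MUTEX(cgroup_rstat_mutex);	/* hypothetical */

static void cgroup_rstat_flush_bounded(struct cgroup *cgrp)
{
	if (mutex_lock_timeout(&cgroup_rstat_mutex, msecs_to_jiffies(50)))
		return;	/* timed out: callers see slightly stale stats */

	cgroup_rstat_flush_locked(cgrp);
	mutex_unlock(&cgroup_rstat_mutex);
}

That would both resolve the thundering herd problem and put a hard
bound on the flushing latency seen by any single caller.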