From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 12 Oct 2023 15:23:06 -0700
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Shakeel Butt, Andrew Morton
Cc: michael@phoronix.com, Feng Tang, kernel test robot, Johannes Weiner,
 Michal Hocko, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo,
 Michal Koutný, Waiman Long, kernel-team@cloudflare.com, Wei Xu,
 Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <20231010032117.1577496-1-yosryahmed@google.com>
 <20231010032117.1577496-4-yosryahmed@google.com>
 <20231011003646.dt5rlqmnq6ybrlnd@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt wrote:
>
> On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed wrote:
> > > [...]
> > >
> > > Yes this looks better. I think we should also ask intel perf and
> > > phoronix folks to run their benchmarks as well (but no need to block
> > > on them).
> >
> > Anything I need to do for this to happen? (I thought such testing is
> > already done on linux-next)
>
> Just Cced the relevant folks.
>
> Michael, Oliver & Feng, if you have some time/resource available,
> please do trigger your performance benchmarks on the following series
> (but nothing urgent):
>
> https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/

Thanks for that.

> >
> > Also, any further comments on the patch (or the series in general)? If
> > not, I can send a new commit message for this patch in-place.
>
> Sorry, I haven't taken a look yet but will try in a week or so.

Sounds good, thanks.

Meanwhile, Andrew, could you please replace the commit log of this patch
with the following, which contains more up-to-date testing info:

Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg

A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there isn't a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg. The same scheme is used: percpu (now also
per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this blocker to subtree flushes, which helps avoid
unnecessary work when the stats of a small subtree are needed.

Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept;
adding a cgroup to the rstat tree involves a parent loop, and so does
charging. Testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled.
All instances are run in a level 2 cgroup, as well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
  Base:    40198.0 mbps
  Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |    MEDIAN   |    STDDEV   |
------------------------------+-------------+-------------+-------------+
  page_fault1_per_process_ops |             |             |             |
  (A) base                    |  270249.164 |  265437.000 |   13451.836 |
  (B) patched                 |  261368.709 |  255725.000 |   13394.767 |
                              |      -3.29% |      -3.66% |             |
  page_fault1_per_thread_ops  |             |             |             |
  (A) base                    |  242111.345 |  239737.000 |   10026.031 |
  (B) patched                 |  237057.109 |  235305.000 |    9769.687 |
                              |      -2.09% |      -1.85% |             |
  page_fault1_scalability     |             |             |             |
  (A) base                    |    0.034387 |    0.035168 |   0.0018283 |
  (B) patched                 |    0.033988 |    0.034573 |   0.0018056 |
                              |      -1.16% |      -1.69% |             |
  page_fault2_per_process_ops |             |             |             |
  (A) base                    |  203561.836 |  203301.000 |    2550.764 |
  (B) patched                 |  197195.945 |  197746.000 |    2264.263 |
                              |      -3.13% |      -2.73% |             |
  page_fault2_per_thread_ops  |             |             |             |
  (A) base                    |  171046.473 |  170776.000 |    1509.679 |
  (B) patched                 |  166626.327 |  166406.000 |     768.753 |
                              |      -2.58% |      -2.56% |             |
  page_fault2_scalability     |             |             |             |
  (A) base                    |    0.054026 |    0.053821 |  0.00062121 |
  (B) patched                 |    0.053329 |    0.05306  |  0.00048394 |
                              |      -1.29% |      -1.41% |             |
  page_fault3_per_process_ops |             |             |             |
  (A) base                    | 1295807.782 | 1297550.000 |    5907.585 |
  (B) patched                 | 1275579.873 | 1273359.000 |    8759.160 |
                              |      -1.56% |      -1.86% |             |
  page_fault3_per_thread_ops  |             |             |             |
  (A) base                    |  391234.164 |  390860.000 |    1760.720 |
  (B) patched                 |  377231.273 |  376369.000 |    1874.971 |
                              |      -3.58% |      -3.71% |             |
  page_fault3_scalability     |             |             |             |
  (A) base                    |     0.60369 |     0.60072 |   0.0083029 |
  (B) patched                 |     0.61733 |     0.61544 |    0.009855 |
                              |      +2.26% |      +2.45% |             |

All regressions seem to be minimal, and within the normal variance for
the benchmark. The fix for [1] assumed that 3% is noise (and there were
no further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical
workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
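
For anyone skimming the series, here is a minimal userspace sketch (not
the kernel implementation) of the batching scheme the commit log
describes: per-cpu deltas are accumulated per memcg along the ancestor
chain, folded into a per-memcg atomic only once they cross a threshold,
and a reader flushes only when that particular memcg has a significant
amount of pending updates. All names below (mock_memcg, stats_updated,
flush_if_needed, NSTATS_THRESHOLD, NCPUS) are made up for illustration.

/*
 * Userspace sketch of the per-memcg batched threshold, NOT kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NCPUS            4
#define NSTATS_THRESHOLD 64   /* batch size, in the spirit of MEMCG_CHARGE_BATCH */

struct mock_memcg {
	struct mock_memcg *parent;
	int pcpu_pending[NCPUS];  /* stand-in for a per-cpu counter */
	atomic_int pending;       /* per-memcg magnitude of pending updates */
};

/*
 * Update path: usually just a per-cpu increment; the per-memcg atomic is
 * touched only when a cpu's local delta crosses the threshold. The cost
 * scales with tree depth because we walk the ancestors, like rstat does.
 */
static void stats_updated(struct mock_memcg *memcg, int cpu, int abs_delta)
{
	for (; memcg; memcg = memcg->parent) {
		memcg->pcpu_pending[cpu] += abs_delta;
		if (memcg->pcpu_pending[cpu] >= NSTATS_THRESHOLD) {
			atomic_fetch_add(&memcg->pending, memcg->pcpu_pending[cpu]);
			memcg->pcpu_pending[cpu] = 0;
		}
	}
}

/*
 * Read path: flush only if this memcg (not a global counter) has enough
 * pending updates, so a quiet subtree never pays for a busy sibling.
 */
static bool flush_if_needed(struct mock_memcg *memcg)
{
	if (atomic_load(&memcg->pending) <= NCPUS * NSTATS_THRESHOLD)
		return false;                  /* not significant, skip the flush */
	atomic_store(&memcg->pending, 0);      /* pretend we flushed the subtree */
	return true;
}

int main(void)
{
	struct mock_memcg root = { 0 };
	struct mock_memcg child = { .parent = &root };

	for (int i = 0; i < 100000; i++)
		stats_updated(&child, i % NCPUS, 1);

	printf("flush child: %d, flush root: %d\n",
	       flush_if_needed(&child), flush_if_needed(&root));
	return 0;
}

With 100000 single-unit updates spread over the fake cpus, both the
child and the root cross the threshold and get flushed; a memcg with
only a handful of pending updates would skip the flush entirely, which
is the point of making the threshold per-memcg.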