From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 18 Apr 2024 14:00:28 -0700
Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
In-Reply-To: <72e4a55e-a246-4e28-9d2e-d4f1ef5637c2@kernel.org>
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
 cgroups@vger.kernel.org, longman@redhat.com, netdev@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
 kernel-team@cloudflare.com, Arnaldo Carvalho de Melo,
 Sebastian Andrzej Siewior, mhocko@kernel.org, Wei Xu
On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer wrote:
>
> On 18/04/2024 04.21, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer wrote:
> >>
> >> This patch aims to reduce userspace-triggered pressure on the global
> >> cgroup_rstat_lock by introducing a mechanism to limit how often
> >> reading stat files causes cgroup rstat flushing.
> >>
> >> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined
> >> with mem_cgroup_flush_stats_ratelimited() already limits pressure on
> >> the global lock (cgroup_rstat_lock). As a result, reading
> >> memory-related stat files (such as memory.stat, memory.numa_stat,
> >> zswap.current) is already a less userspace-triggerable issue.
> >>
> >> However, other userspace users of cgroup_rstat_flush(), such as when
> >> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar mechanism
> >> to limit pressure on the global lock. Furthermore, userspace can
> >> easily trigger this issue by reading those stat files.
> >>
> >> Typically, normal userspace stats tools (e.g., cadvisor, nomad,
> >> systemd) spawn threads that read io.stat, cpu.stat, and memory.stat
> >> (even from the same cgroup) without realizing that on the kernel
> >> side they share the same global lock. This limit also helps prevent
> >> malicious userspace applications from harming the kernel by reading
> >> these stat files in a tight loop.
> >>
> >> To address this, the patch introduces
> >> cgroup_rstat_flush_ratelimited(), similar to memcg's
> >> mem_cgroup_flush_stats_ratelimited().
> >>
> >> Flushing occurs per cgroup (even though the lock remains global); a
> >> variable named rstat_flush_last_time is introduced to track when a
> >> given cgroup was last flushed. This variable, which contains the
> >> jiffies of the flush, shares properties and a cache line with
> >> rstat_flush_next and is updated at the same time.
> >>
> >> For cpu.stat, we need to acquire the lock (via
> >> cgroup_rstat_flush_hold) because other data is read under the lock,
> >> but we skip the expensive flushing if it occurred recently.
> >>
> >> Regarding io.stat, there is an opportunity outside the lock to skip
> >> the flush, but inside the lock we must recheck to handle races.
> >>
> >> Signed-off-by: Jesper Dangaard Brouer
> >
> > As I mentioned in another thread, I really don't like time-based
> > rate-limiting [1]. Would it be possible to generalize the
> > magnitude-based rate-limiting instead? Have something like
> > memcg_vmstats_needs_flush() in the core rstat code?
>
> I've taken a closer look at memcg_vmstats_needs_flush(), and I'm
> concerned about the overhead of maintaining the stats that it uses as
> a filter:
>
>   static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>   {
>           return atomic64_read(&vmstats->stats_updates) >
>                   MEMCG_CHARGE_BATCH * num_online_cpus();
>   }
>
> I looked at `vmstats->stats_updates` to see how often it gets updated.
> It is updated in memcg_rstat_updated(), which gets inlined into a
> number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
> __count_memcg_events) and also calls cgroup_rstat_updated().
> Counting invocations per second (via funccount):
>
> 10:28:09
> FUNC                            COUNT
> __mod_memcg_state              377553
> __count_memcg_events           393078
> __mod_memcg_lruvec_state      1229673
> cgroup_rstat_updated          2632389
>
> I'm surprised to see how many times per second these are invoked:
> 2,000,304 calls per second originate from memcg_rstat_updated().
> (On a 128 CPU core machine with 39% idle CPU load.)
> Maintaining these stats seems excessive...

Well, the number of calls to memcg_rstat_updated() is not affected by
maintaining stats_updates; maintaining it only adds a few percpu
updates in the common path. I did not see any regressions (after all
optimizations) in any benchmarks with this, including will-it-scale
and netperf.
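To make "a few percpu updates in the common path" concrete, the
batching works roughly as in the simplified sketch below (not the
exact mm/memcontrol.c code; hierarchy handling and other details are
omitted). The shared atomic that memcg_vmstats_needs_flush() reads is
touched at most once per MEMCG_CHARGE_BATCH updates on a given CPU:

  /*
   * Simplified sketch of the update path. Each stat update bumps only
   * a per-CPU counter; the shared atomic is updated when the per-CPU
   * batch crosses MEMCG_CHARGE_BATCH.
   */
  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          struct memcg_vmstats_percpu *statc;

          if (!val)
                  return;

          /* Mark this memcg as having pending updates in the rstat tree. */
          cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

          statc = this_cpu_ptr(memcg->vmstats_percpu);
          statc->stats_updates += abs(val);
          if (statc->stats_updates < MEMCG_CHARGE_BATCH)
                  return;

          /* Fold the per-CPU batch into the shared atomic, then reset. */
          atomic64_add(statc->stats_updates, &memcg->vmstats->stats_updates);
          statc->stats_updates = 0;
  }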
> Then how often does the filter lower pressure on the lock:
>
> MEMCG_CHARGE_BATCH(64) * 128 CPUs = 8192
> 2000304 / (64*128) = 244 times per sec (every ~4ms)
> (assuming memcg_rstat_updated is called with val=1)

This does not tell the whole story, though, because:

1. The threshold (8192 in this case) is per-memcg. I am assuming
2,000,304 is the number of calls per second for the entire system; in
that case, the filtering should be more effective.

2. This assumes that updates and flushes are uniform; I am not sure
that holds in practice.

3. In my experiments, this thresholding drastically improved userspace
read latency under heavy contention (100s or 1000s of readers),
especially the tail latencies.

Generally, I think magnitude-based thresholding is better than
time-based thresholding, especially on larger systems where a lot can
change in a short amount of time. I did not observe any regressions
from this scheme, and I observed very noticeable improvements in
flushing latency.

Taking a step back, I think this series is trying to address two
issues in one go: interrupt handling latency and lock contention. The
two are related, because reducing flushing reduces irq disablement,
but I think it would be better to fix the irq issue separately with a
more fundamental solution (e.g. using a mutex, or dropping the lock at
each CPU boundary). After that, we can evaluate the lock contention
problem with data purely about flushing latency, without the irq
handling problem mixed in. Does this make sense to you?
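Concretely, "dropping the lock at each CPU boundary" would mean
restructuring the flush loop along the lines of this sketch
(illustrative only, not the actual kernel code):

  /*
   * Sketch: yield cgroup_rstat_lock between per-CPU flushes so that
   * irqs are disabled for at most one CPU's worth of flushing work.
   */
  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
  {
          int cpu;

          lockdep_assert_held(&cgroup_rstat_lock);

          for_each_possible_cpu(cpu) {
                  /* ... flush cgrp's updated-tree for this cpu ... */

                  /* Drop the lock at each CPU boundary under contention. */
                  if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                          spin_unlock_irq(&cgroup_rstat_lock);
                          cond_resched();
                          spin_lock_irq(&cgroup_rstat_lock);
                  }
          }
  }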
> > Also, why do we keep the memcg time rate-limiting with this patch?
> > Is it because we use a much larger window there (2s)? Having two
> > layers of time-based rate-limiting is not ideal imo.
>
> I'm also not a fan of having two layers of time-based rate-limiting,
> but they operate at different time scales *and* are not active at the
> same time with the current patch. If you look at the details, I
> excluded memcg from using this, as the comment "memcg have own
> ratelimit layer" (in do_flush_stats) notes.

Right, I meant that generally having two schemes doing very similar
things is not ideal, even if they are not active at the same time. I
think this is an artifact of different subsystems sharing the same
rstat tree for no specific reason. Almost all flushing calls really
need the stats of only one subsystem. If we had separate trees, lock
contention would improve slightly because different subsystems would
no longer compete; we could also let subsystems "customize" their
trees, e.g. by setting different time-based or magnitude-based
rate-limiting thresholds.

I know this is a bigger lift, just thinking out loud :)

> I would prefer removing the memcg time rate-limiting and using this
> more granular 50ms limit (20 flushes/sec) for memcg also. I was
> planning to do that in a followup patchset. The 50ms limit is per
> cgroup, so the total flush rate still scales/increases with the
> number of cgroups, but that is better than an unbounded number of
> lock acquisitions per second from reads.
>
> --Jesper
>
> [1] https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/
>
> sudo ./bcc/tools/funccount -Ti 1 -d 10 \
>   '__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'
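(For reference: based purely on the patch description quoted at the
top, the proposed time-based check amounts to roughly the sketch
below. The helper body here is inferred, not copied from the patch;
rstat_flush_last_time and the 50ms window come from the description.)

  /*
   * Sketch inferred from the patch description: skip the expensive
   * flush if this cgroup was already flushed within the last 50ms.
   */
  static bool cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
  {
          unsigned long last = READ_ONCE(cgrp->rstat_flush_last_time);

          if (time_before(jiffies, last + msecs_to_jiffies(50)))
                  return false;   /* flushed recently; skip */

          cgroup_rstat_flush(cgrp);  /* records jiffies in rstat_flush_last_time */
          return true;
  }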