From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 25 Jul 2023 15:00:34 -0700
Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats
To: Johannes Weiner
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20230725201811.GA1231514@cmpxchg.org>
References: <20230719174613.3062124-1-yosryahmed@google.com>
	<20230725140435.GB1146582@cmpxchg.org>
	<20230725201811.GA1231514@cmpxchg.org>
On Tue, Jul 25, 2023 at 1:18 PM Johannes Weiner wrote:
>
> On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> > On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner wrote:
> > > We used to maintain *all* stats in per-cpu counters at the local
> > > level. memory.stat reads would have to iterate and aggregate the
> > > entire subtree every time.
> > > This was obviously very costly, so we added
> > > batched upward propagation during stat updates to simplify reads:
> > >
> > > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > > Author: Johannes Weiner
> > > Date:   Tue May 14 15:47:12 2019 -0700
> > >
> > >     mm: memcontrol: fix recursive statistics correctness & scalabilty
> > >
> > > However, that caused a regression in the stat write path, as the
> > > upward propagation would bottleneck on the cachelines in the shared
> > > parents. The fix for *that* re-introduced the per-cpu loops in the
> > > local stat reads:
> > >
> > > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > > Author: Johannes Weiner
> > > Date:   Thu Jun 13 15:55:46 2019 -0700
> > >
> > >     mm: memcontrol: don't batch updates of local VM stats and events
> > >
> > > So I wouldn't say it's a regression from rstat. Except for that short
> > > period between the two commits above, the read side for local stats
> > > was always expensive.
> >
> > I was comparing from a 4.15 kernel, so I assumed the major change was
> > from rstat, but that was not accurate. Thanks for the history.
> >
> > However, in that 4.15 kernel the local (non-hierarchical) stats were
> > readily available without iterating percpu counters. There is a
> > regression that was introduced somewhere.
> >
> > Looking at the history you described, it seems like up until
> > 815744d75152 we used to maintain "local" (aka non-hierarchical)
> > counters, so reading local stats was reading one counter, and starting
> > 815744d75152 we started having to loop percpu counters for that.
> >
> > So it is not a regression of rstat, but seemingly it is a regression
> > of 815744d75152. Is my understanding incorrect?
>
> Yes, it actually goes back further. Bear with me.
>
> For the longest time, it used to be local per-cpu counters. Every
> memory.stat read had to do nr_memcg * nr_cpu aggregation. You can
> imagine that this didn't scale in production.
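To illustrate the read-side cost described above, here is a toy Python model of that original scheme; this is not kernel code, and all names are invented. Each memcg keeps only per-cpu deltas, so writes are cheap but every read must walk every CPU of every memcg in the subtree, i.e. O(nr_memcg * nr_cpu) per memory.stat read:

```python
NR_CPUS = 4

class Memcg:
    def __init__(self, parent=None):
        self.parent = parent
        self.percpu = [0] * NR_CPUS  # per-cpu deltas, written locally
        self.children = []
        if parent:
            parent.children.append(self)

    def update(self, cpu, delta):
        # Write side is cheap: touch only this memcg's per-cpu slot.
        self.percpu[cpu] += delta

    def read_local(self):
        # Local read: loop over this memcg's per-cpu counters.
        return sum(self.percpu)

    def read_hierarchical(self):
        # Hierarchical read: aggregate the entire subtree every time.
        total = self.read_local()
        for child in self.children:
            total += child.read_hierarchical()
        return total

root = Memcg()
child = Memcg(parent=root)
child.update(cpu=0, delta=5)
child.update(cpu=3, delta=2)
root.update(cpu=1, delta=1)
print(child.read_local())        # 7
print(root.read_hierarchical())  # 8
```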
>
> We added local atomics and turned the per-cpu counters into buffers:
>
> commit a983b5ebee57209c99f68c8327072f25e0e6e3da
> Author: Johannes Weiner
> Date:   Wed Jan 31 16:16:45 2018 -0800
>
>     mm: memcontrol: fix excessive complexity in memory.stat reporting
>
> Local counts became a simple atomic_read(), but the hierarchy counts
> would still have to aggregate nr_memcg counters.
>
> That was of course still too much read-side complexity, so we switched
> to batched upward propagation during the stat updates:
>
> commit 42a300353577ccc17ecc627b8570a89fa1678bec
> Author: Johannes Weiner
> Date:   Tue May 14 15:47:12 2019 -0700
>
>     mm: memcontrol: fix recursive statistics correctness & scalabilty
>
> This gave us two atomics at each level: one for local and one for
> hierarchical stats.
>
> However, that went too far the other direction: too many counters
> touched during stat updates, and we got a regression report over memcg
> cacheline contention during MM workloads. Instead of backing out
> 42a300353 - since all the previous versions were terrible too - we
> dropped write-side aggregation of *only* the local counters:
>
> commit 815744d75152078cde5391fc1e3c2d4424323fb6
> Author: Johannes Weiner
> Date:   Thu Jun 13 15:55:46 2019 -0700
>
>     mm: memcontrol: don't batch updates of local VM stats and events
>
> In effect, this kept all the stat optimizations for cgroup2 (which
> doesn't have local counters), and reverted cgroup1 back to how it was
> for the longest time: on-demand aggregated per-cpu counters.
>
> For about a year, cgroup1 didn't have to per-cpu the local stats on
> read. But for the recursive stats, it would either still have to do
> subtree aggregation on read, or too much upward flushing on write.
>
> So if I had to blame one commit for a cgroup1 regression, it would
> probably be 815744d. But it's kind of a stretch to say that it worked
> well before that commit.
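A toy model of the 42a300353-style write-side batching described above (again not kernel code; the threshold and names are invented for illustration): updates accumulate in a per-cpu buffer and, once past a threshold, a batch is propagated up the parent chain into a local and a hierarchical counter per level. Reads become O(1), but each flush touches every ancestor's cachelines, which is the contention that 815744d addressed by dropping write-side aggregation of the local counters:

```python
THRESHOLD = 32  # invented; a stand-in for the kernel's batching cutoff

class Memcg:
    def __init__(self, parent=None):
        self.parent = parent
        self.percpu_buf = 0    # stand-in for a per-cpu delta buffer
        self.local = 0         # "atomic" local (non-hierarchical) count
        self.hierarchical = 0  # "atomic" recursive count

    def update(self, delta):
        self.percpu_buf += delta
        if abs(self.percpu_buf) > THRESHOLD:
            batch, self.percpu_buf = self.percpu_buf, 0
            self.local += batch
            # Upward propagation: every ancestor's counter (and
            # cacheline) is touched on each flush.
            node = self
            while node:
                node.hierarchical += batch
                node = node.parent

    def read_local(self):
        return self.local  # O(1) read, no per-cpu loop

root = Memcg()
child = Memcg(parent=root)
for _ in range(40):
    child.update(1)  # crosses the threshold once, flushing a batch of 33
print(child.read_local())  # 33
print(root.hierarchical)   # 33
print(child.percpu_buf)    # 7 (not yet propagated)
```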
>
> For the changelog, maybe just say that there was a lot of back and
> forth between read-side aggregation and write-side aggregation. Since
> with rstat we now have efficient read-side aggregation, attempt a
> conceptual revert of 815744d.

Now that's a much more complete picture. Thanks a lot for all the
history, now it makes much more sense. I wouldn't blame 815744d then;
as you said, it's better framed as a conceptual revert of it. I will
rewrite the commit log accordingly and send a v2. Thanks!

> > > But I want to be clear: this isn't a regression fix. It's a new
> > > performance optimization for the deprecated cgroup1 code. And it comes
> > > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
> >
> > I still think it is, but I can easily be wrong. I am hoping that the
> > memory footprint is not a problem. There are *roughly* 80 per-memcg
> > stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
> > (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
> > on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.
> >
> > Given that struct mem_cgroup is already large, and can easily be 100s
> > of KiBs on a large machine with many cpus, I hope there won't be a
> > noticeable regression.
>
> Yes, the concern wasn't so much the memory consumption but the
> cachelines touched during hot paths.
>
> However, that was mostly because we either had a) write-side flushing,
> which is extremely hot during MM stress, or b) read-side flushing with
> huuuge cgroup subtrees due to zombie cgroups. A small cacheline
> difference would be enormously amplified by these factors.
>
> Rstat is very good at doing selective subtree flushing on reads, so
> the big coefficients from a) and b) are no longer such a big concern.
> A slightly bigger cache footprint is probably going to be okay.

Agreed, maintaining the local counters with rstat is much easier than
the previous attempts.
I will try to bake most of this into the commit log.
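The ~1.5 KiB back-of-envelope estimate in the thread can be checked directly; the item counts are the rough figures quoted above, not exact values from any particular kernel:

```python
PER_MEMCG_ITEMS = 80   # MEMCG_NR_STAT + NR_MEMCG_EVENTS, roughly
PER_LRUVEC_ITEMS = 55  # NR_VM_NODE_STAT_ITEMS, roughly
NR_NODES = 2           # two-node machine, one lruvec per node
COUNTER_BYTES = 8      # one extra 8-byte counter per stat

extra = COUNTER_BYTES * (PER_MEMCG_ITEMS + PER_LRUVEC_ITEMS * NR_NODES)
print(extra)                   # 1520 bytes per memcg
print(round(extra / 1024, 2))  # ~1.48 KiB, i.e. the "~1.5 KiB" above
```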