Date: Tue, 25 Jul 2023 16:18:11 -0400
From: Johannes Weiner
To: Yosry Ahmed
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats
Message-ID: <20230725201811.GA1231514@cmpxchg.org>
References: <20230719174613.3062124-1-yosryahmed@google.com>
	<20230725140435.GB1146582@cmpxchg.org>
On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner wrote:
> > We used to maintain *all* stats in per-cpu counters at the local
> > level. memory.stat reads would have to iterate and aggregate the
> > entire subtree every time. This was obviously very costly, so we added
> > batched upward propagation during stat updates to simplify reads:
> >
> > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > Author: Johannes Weiner
> > Date:   Tue May 14 15:47:12 2019 -0700
> >
> >     mm: memcontrol: fix recursive statistics correctness & scalabilty
> >
> > However, that caused a regression in the stat write path, as the
> > upward propagation would bottleneck on the cachelines in the shared
> > parents. The fix for *that* re-introduced the per-cpu loops in the
> > local stat reads:
> >
> > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > Author: Johannes Weiner
> > Date:   Thu Jun 13 15:55:46 2019 -0700
> >
> >     mm: memcontrol: don't batch updates of local VM stats and events
> >
> > So I wouldn't say it's a regression from rstat. Except for that short
> > period between the two commits above, the read side for local stats
> > was always expensive.
>
> I was comparing from a 4.15 kernel, so I assumed the major change was
> from rstat, but that was not accurate. Thanks for the history.
>
> However, in that 4.15 kernel the local (non-hierarchical) stats were
> readily available without iterating percpu counters. There is a
> regression that was introduced somewhere.
>
> Looking at the history you described, it seems like up until
> 815744d75152 we used to maintain "local" (aka non-hierarchical)
> counters, so reading local stats was reading one counter, and starting
> 815744d75152 we started having to loop percpu counters for that.
>
> So it is not a regression of rstat, but seemingly it is a regression
> of 815744d75152. Is my understanding incorrect?

Yes, it actually goes back further. Bear with me.

For the longest time, it used to be local per-cpu counters. Every
memory.stat read had to do nr_memcg * nr_cpu aggregation. You can
imagine that this didn't scale in production.

We added local atomics and turned the per-cpu counters into buffers:

commit a983b5ebee57209c99f68c8327072f25e0e6e3da
Author: Johannes Weiner
Date:   Wed Jan 31 16:16:45 2018 -0800

    mm: memcontrol: fix excessive complexity in memory.stat reporting

Local counts became a simple atomic_read(), but the hierarchy counts
would still have to aggregate nr_memcg counters.
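
To make the two schemes concrete, here is a toy userspace sketch -
not the actual memcontrol code; NR_CPUS, STAT_BATCH, and the struct
layout are all invented for illustration (the batching is in the
spirit of the kernel's MEMCG_CHARGE_BATCH). It contrasts summing
per-cpu buffers on every read with folding them into one shared
atomic in batches, which is roughly the a983b5e idea for the local
counts:

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS    4    /* illustrative stand-in for the real CPU count */
#define STAT_BATCH 64   /* flush threshold, a la MEMCG_CHARGE_BATCH */

struct counter {
	long percpu[NR_CPUS];  /* per-cpu deltas, written without contention */
	atomic_long total;     /* shared total, read with one atomic load */
};

/* Old local-stat scheme: cheap writes, but reads sum nr_cpu buffers. */
static long read_by_summing(struct counter *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	return sum;
}

/*
 * a983b5e-style scheme: the per-cpu slots become buffers that get
 * folded into the shared atomic once they exceed the batch size.
 * Reads become a single atomic load; writes only occasionally touch
 * the shared cacheline.
 */
static void update_batched(struct counter *c, int cpu, long delta)
{
	c->percpu[cpu] += delta;
	if (c->percpu[cpu] > STAT_BATCH || c->percpu[cpu] < -STAT_BATCH) {
		atomic_fetch_add(&c->total, c->percpu[cpu]);
		c->percpu[cpu] = 0;
	}
}

static long read_batched(struct counter *c)
{
	/* may lag the truth by up to NR_CPUS * STAT_BATCH */
	return atomic_load(&c->total);
}

int main(void)
{
	static struct counter c;   /* zero-initialized */

	for (int i = 0; i < 1000; i++)
		update_batched(&c, i % NR_CPUS, 1);

	printf("batched read: %ld\n", read_batched(&c));    /* 780 */
	printf("exact count:  %ld\n",
	       read_batched(&c) + read_by_summing(&c));     /* 1000 */
	return 0;
}

The local side got cheap; the hierarchical side did not.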
That was of course still too much read-side complexity, so we
switched to batched upward propagation during the stat updates:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty

This gave us two atomics at each level: one for local and one for
hierarchical stats.

However, that went too far in the other direction: too many counters
touched during stat updates, and we got a regression report over
memcg cacheline contention during MM workloads. Instead of backing
out 42a300353 - since all the previous versions were terrible too -
we dropped write-side aggregation of *only* the local counters:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner
Date:   Thu Jun 13 15:55:46 2019 -0700

    mm: memcontrol: don't batch updates of local VM stats and events

In effect, this kept all the stat optimizations for cgroup2 (which
doesn't have local counters), and reverted cgroup1 back to how it was
for the longest time: on-demand aggregated per-cpu counters.

For about a year, cgroup1 didn't have to aggregate per-cpu counters
for the local stats on read. But for the recursive stats, it would
either still have to do subtree aggregation on read, or too much
upward flushing on write.

So if I had to blame one commit for a cgroup1 regression, it would
probably be 815744d. But it's kind of a stretch to say that it worked
well before that commit.

For the changelog, maybe just say that there was a lot of back and
forth between read-side aggregation and write-side aggregation. Since
with rstat we now have efficient read-side aggregation, attempt a
conceptual revert of 815744d.

> > But I want to be clear: this isn't a regression fix. It's a new
> > performance optimization for the deprecated cgroup1 code. And it
> > comes at the cost of a higher memory footprint for both cgroup1
> > AND cgroup2.
>
> I still think it is, but I can easily be wrong. I am hoping that the
> memory footprint is not a problem. There are *roughly* 80 per-memcg
> stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec
> stats (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8
> bytes, so on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB
> per memcg.
>
> Given that struct mem_cgroup is already large, and can easily be
> 100s of KiBs on a large machine with many cpus, I hope there won't
> be a noticeable regression.

Yes, the concern wasn't so much the memory consumption but the
cachelines touched during hot paths. However, that was mostly because
we either had a) write-side flushing, which is extremely hot during
MM stress, or b) read-side flushing with huuuge cgroup subtrees due
to zombie cgroups. A small cacheline difference would be enormously
amplified by these factors.

Rstat is very good at doing selective subtree flushing on reads, so
the big coefficients from a) and b) are no longer such a big concern.
A slightly bigger cache footprint is probably going to be okay.
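
For anyone curious why rstat flushing is selective: updaters mark
their cgroup in the parent's updated-children list, and a flush walks
only those marks instead of the whole subtree (see
cgroup_rstat_updated() and cgroup_rstat_flush() in
kernel/cgroup/rstat.c). Here is a toy, single-threaded sketch of that
idea - invented names, no locking and no per-cpu-ness, unlike the
real thing:

#include <stdio.h>

/* Toy rstat-like node; stats propagate upward only at flush time. */
struct node {
	long pending;                  /* own updates not yet propagated */
	long count;                    /* flushed hierarchical total */
	int on_list;                   /* already marked as updated? */
	struct node *parent;
	struct node *next_updated;     /* sibling link in parent's list */
	struct node *updated_children; /* children with pending updates */
};

/* Writer: O(1) once marked; marks the path up on the first update. */
static void update(struct node *n, long delta)
{
	n->pending += delta;
	for (struct node *p = n; p->parent && !p->on_list; p = p->parent) {
		p->on_list = 1;
		p->next_updated = p->parent->updated_children;
		p->parent->updated_children = p;
	}
}

/* Reader: visits only subtrees that saw updates; returns the delta. */
static long flush(struct node *n)
{
	long delta = n->pending;

	n->pending = 0;
	for (struct node *c = n->updated_children, *next; c; c = next) {
		next = c->next_updated;
		c->next_updated = NULL;
		c->on_list = 0;
		delta += flush(c);
	}
	n->updated_children = NULL;
	n->count += delta;   /* hierarchical total for this node */
	return delta;
}

int main(void)
{
	struct node root = { 0 };
	struct node busy = { .parent = &root };
	struct node idle = { .parent = &root };  /* never updated */
	struct node leaf = { .parent = &busy };

	update(&leaf, 5);
	update(&leaf, 3);
	flush(&root);   /* visits root, busy, leaf - never touches idle */

	printf("leaf=%ld busy=%ld root=%ld idle=%ld\n",
	       leaf.count, busy.count, root.count, idle.count);
	/* prints: leaf=8 busy=8 root=8 idle=0 */
	return 0;
}

A read pays for the cgroups that actually changed since the last
flush, which is why zombie-heavy subtrees stop dominating the read
cost.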