From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5B36BEB64DD for ; Tue, 25 Jul 2023 19:25:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AE0296B0071; Tue, 25 Jul 2023 15:25:00 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A901B6B0074; Tue, 25 Jul 2023 15:25:00 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 958648D0001; Tue, 25 Jul 2023 15:25:00 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 88AA06B0071 for ; Tue, 25 Jul 2023 15:25:00 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 45BCE1C9F77 for ; Tue, 25 Jul 2023 19:25:00 +0000 (UTC) X-FDA: 81051111960.27.6184C78 Received: from mail-lf1-f52.google.com (mail-lf1-f52.google.com [209.85.167.52]) by imf19.hostedemail.com (Postfix) with ESMTP id 5C32D1A0016 for ; Tue, 25 Jul 2023 19:24:58 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=ooRahWUd; spf=pass (imf19.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.52 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1690313098; a=rsa-sha256; cv=none; b=qJcl09qggOSDjZzP8QEIMAo33eCThDc1pYehHRy8u6eHozQHXea+Vp0n8a7h1UHyZjvQoY 8SioHDH8HXpzUJBHWuWQux83FuGIcPJkui5AodWf7YLNWD5coE68r3bbvmWAjb9iYFZ3po 0tM6Mt91AK31B05QyE6EmzH9YC0xbR8= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=ooRahWUd; spf=pass (imf19.hostedemail.com: domain of yosryahmed@google.com designates 209.85.167.52 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1690313098; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=d1uMPkgHteWQUIoqBLGXNOkNPWdTfMsi2o8RVrftxNk=; b=KLWtfH/JjLoCH5JRJoXTRBnMjG03hP5wueYzdDNAeaqh4chG/LiI/iKFybZ2fmLrBq8Czb tFOgWpAW2oX7UooX+0b1SJR2jToX/UC7La9zoCDaDY3VA/u8WJ3byP6HxPrVY97p8DwP7S NHyWQkG5yxkhgYV9c9Z62sG1a0Dzyeg= Received: by mail-lf1-f52.google.com with SMTP id 2adb3069b0e04-4fb41682472so9036633e87.2 for ; Tue, 25 Jul 2023 12:24:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1690313096; x=1690917896; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=d1uMPkgHteWQUIoqBLGXNOkNPWdTfMsi2o8RVrftxNk=; b=ooRahWUd9qepvopG3J9bw8egrolyIsO84D5Vc0l1eY7rKbHLsh4YLsQhAkHszACG43 wiDdZqDCfuuZG0/qiNUsKynyxGflky1uwaaXpJIgWCWwyX50Bx2lq2hxpkz0LEfD9cD7 08HWy/WST4OLNYWEfwBafRDQQdKtdlRQ4NyE9pxSSW2B8prEMlyI6Lq7Vk7jgjQVPxpT 1yfFkkxhTK0RvKM7EfpXF6sG/rTkke8+HFRi1yyIM18vj2TKd8NhpBhNxcwxAm8Fntfv itmeTnNvDP1worfpkzjL3V/Vi4p3Q4MC+X0pnrXtub5xWx6V9wHU4GMu/+yCT0tixuYA hlfA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690313096; x=1690917896; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=d1uMPkgHteWQUIoqBLGXNOkNPWdTfMsi2o8RVrftxNk=; b=gJUsv/3N57Nrs8Su2Nw5NBg7fyuFFM4v+zp/eJyTJZYwkAOwmRAAxxIAUDZcev2VVS iMAHi5LIcCmAhjjtuX8r3mQTcf92FjK54+y12BMTXtaojgnqY+CbYSc7zhJi55ZKiCvG XDt6HKHPwNFsyT5vuH3348GGx4+x/oHam3gzSV6PI7eu14tWkrk4aIP2NhSq850bjg1Q VFGop2h2gvlXTJeAlTTF3eimvL1XROHJ0XbJAi8RgrNMgrKpoWA44gKa/7+iN0htHoAB qHqHXzz6mgWaRN4AmS3Hoj/nqUprnzsKCj/xVeLSS2dbxD1LbjrRspYGAvzaVGdL0TuP I7nQ== X-Gm-Message-State: ABy/qLYC02fV0ofwbqzm9dwXOhUFfEYsMT+SpJe/ojGxe/LUQZfPn5Ir mN1MjL2EaFoZ9VYlUZMk9CwFLYhYafa2Q7Yn07WLSw== X-Google-Smtp-Source: APBJJlFvefXqddr8rAGU+RRiXk66fGsLhtcXxGGBMxnXLz4XrlnM39V3t/SdrcPAseVScTEeLDKfBX1La9gMI4E3YBQ= X-Received: by 2002:a05:6512:baa:b0:4fb:b11:c9a2 with SMTP id b42-20020a0565120baa00b004fb0b11c9a2mr11022614lfv.34.1690313096153; Tue, 25 Jul 2023 12:24:56 -0700 (PDT) MIME-Version: 1.0 References: <20230719174613.3062124-1-yosryahmed@google.com> <20230725140435.GB1146582@cmpxchg.org> In-Reply-To: <20230725140435.GB1146582@cmpxchg.org> From: Yosry Ahmed Date: Tue, 25 Jul 2023 12:24:19 -0700 Message-ID: Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats To: Johannes Weiner Cc: Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 5C32D1A0016 X-Stat-Signature: jh4kkzsg9z37dgupxo96tkr83wurjbz8 X-Rspam-User: X-HE-Tag: 1690313098-572938 X-HE-Meta: U2FsdGVkX182lalsCto5dGQfdRyJGoiMZscHf9x1+Ch8dS+ucBvFgus5/T+OCbQxEpD+NOqC1nhwNVvOcJ2j+u9DsW5JJAFL1hiR2a4UYhpDxqXnTC6ecVnvNEdT61Opuw/VZw/y5PpFILqJIVjNu+EFAxbxKtybFLluQew4W/2b9i4AuGEveCjXd6tAI+CrqrcGP87qif+MfziHx++5BnQbg2SZPas1owqhZyocVw0aTG9w57d866CKbkkmHWW3tsd+BHMT8pkBwFneqXNxuJ8O3cGsHSxtEKP7gWm5C2aRFtge38YatYfdO1WR4SQ4RLEQW9UuGhH7l3Kf8abcnp1P04mVM8NED/vjo+VAkJc9VWByW2DfgFuaRZnI7v1jD2/Z0+vj7R0FnRRS0wLFE/8LwidsJBUOPhZUvDzF6J5JAwwDxHFHVsowJ9nybZbEMPJzRWAzVrPWcDxZ68vlENrQXRuWpYZ2/I8DWW3LkiRKmRwEaji+AaOxsQM/1yBnJXfLt10/6AE3rSFHTovzDoObxGXshAUqBklrVxlVW/OeO4Jwe+4mRAaw0dOS41oTpeqnrTn8q/9kBaqlynvQOKy9gyCpbwZwJNQBoBouw8I0kSvPJDFGK3FIBDnMRqgBRV2PJl3m4l5UpS8NSZJJqp8TSTEJiGBu9XlALBNugRXQX0esiTXoIWXfvt/uyuGWqHFN8WSXJK9STt2kVyg3IMGoR8Otc3A50mJqB2gTevCRRILhZ8h5wYMe6Q5YmZtELcfmfAhkTT03H0G6qZjZjLo7F4sZ5TUekD8tEafrn0Q9eDj2XRC79KK8fJZJT9qF6ax+kctHXCWpKEIgDkz0Mq9ZopDwhlbs29TRlno0zFtcRn71RYE2Fw1Y8bhwdjZeSTZiWqZsmP4ktSvNlX/YBqrkFnTSa/pyICNb5vgOHhDJMkJgNI26BGxO3ObP/USjWQS7hkVF85QxoiRC9EY ylg91+Fz Y2259nKB7O0naGESH3yLWZ0dnYz+Fwg1zuXE8R8WewgMcN2EVnhouB6An3pupESu2VfxgnNcDwbLiGGXhARF9P/KwFXDKPq4Wy4ydoHdvjEEuXKwkxE/30G7d88pytHoNx6HSV0jjjt7Nl6E9ihJSBE1eDMlw/5c5kWLICbvxfE98bmElX2JO3KN4g0qEIktGh4UpP7KJ1EmMb8lLWLP/ozfK757bYjtL+5QB+ROt0DdMXTjwqtxyow1zWxdq3VZzDXP+S79DK6Kaf8bTu6c/wZLLXzOq6AD70eB+139sc1rVOAvHFdrLvFf7Wg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hey Johannes, On Tue, Jul 25, 2023 at 7:04=E2=80=AFAM Johannes Weiner wrote: > > On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote: > > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > > framework keeps track of which cgroups have updates on which cpus. > > > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > > readily available as counters. Instead, the percpu counters for a given > > stat need to be summed to get the non-hierarchical stat value. This > > causes a performance regression when reading non-hierarchical stats on > > kernels where memcg moved to using rstat. This is especially visible > > when reading memory.stat on cgroup v1. There are also some code paths > > internal to the kernel that read such non-hierarchical stats. > > It's actually not an rstat regression. It's always been this costly. > > Quick history: Thanks for the context. > > We used to maintain *all* stats in per-cpu counters at the local > level. memory.stat reads would have to iterate and aggregate the > entire subtree every time. This was obviously very costly, so we added > batched upward propagation during stat updates to simplify reads: > > commit 42a300353577ccc17ecc627b8570a89fa1678bec > Author: Johannes Weiner > Date: Tue May 14 15:47:12 2019 -0700 > > mm: memcontrol: fix recursive statistics correctness & scalabilty > > However, that caused a regression in the stat write path, as the > upward propagation would bottleneck on the cachelines in the shared > parents. The fix for *that* re-introduced the per-cpu loops in the > local stat reads: > > commit 815744d75152078cde5391fc1e3c2d4424323fb6 > Author: Johannes Weiner > Date: Thu Jun 13 15:55:46 2019 -0700 > > mm: memcontrol: don't batch updates of local VM stats and events > > So I wouldn't say it's a regression from rstat. Except for that short > period between the two commits above, the read side for local stats > was always expensive. I was comparing from an 4.15 kernel, so I assumed the major change was from rstat, but that was not accurate. Thanks for the history. However, in that 4.15 kernel the local (non-hierarchical) stats were readily available without iterating percpu counters. There is a regression that was introduced somewhere. Looking at the history you described, it seems like up until 815744d75152 we used to maintain "local" (aka non-hierarchical) counters, so reading local stats was reading one counter, and starting 815744d75152 we started having to loop percpu counters for that. So it is not a regression of rstat, but seemingly it is a regression of 815744d75152. Is my understanding incorrect? > > rstat promises a shot at finally fixing it, with less risk to the > write path. > > > It is inefficient to iterate and sum counters in all cpus when the rsta= t > > framework knows exactly when a percpu counter has an update. Instead, > > maintain cpu-aggregated non-hierarchical counters for each stat. During > > an rstat flush, keep those updated as well. When reading > > non-hierarchical stats, we no longer need to iterate cpus, we just need > > to read the maintainer counters, similar to hierarchical stats. > > > > A caveat is that we now a stats flush before reading > > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > > or memcg_events_local(), where we previously only needed a flush to > > read hierarchical stats. Most contexts reading non-hierarchical stats > > are already doing a flush, add a flush to the only missing context in > > count_shadow_nodes(). > > > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > > machine with 256 cpus on cgroup v1: > > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > > real 0m0.125s > > user 0m0.005s > > sys 0m0.120s > > > > After: > > real 0m0.032s > > user 0m0.005s > > sys 0m0.027s > > > > Signed-off-by: Yosry Ahmed > > Acked-by: Johannes Weiner Thanks! I will reformulate the commit log after we agree on the history. > > But I want to be clear: this isn't a regression fix. It's a new > performance optimization for the deprecated cgroup1 code. And it comes > at the cost of higher memory footprint for both cgroup1 AND cgroup2. I still think it is, but I can easily be wrong. I am hoping that the memory footprint is not a problem. There are *roughly* 80 per-memcg stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so on a two-node machine that's 8 * (80 + 55 * 2) ~=3D 1.5 KiB per memcg. Given that struct mem_cgroup is already large, and can easily be 100s of KiBs on a large machine with many cpus, I hope there won't be a noticeable regression. > > If this causes a regression, we should revert it again. But let's try. Of course. Fingers crossed.