Date: Tue, 25 Jul 2023 10:04:35 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
    linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats
Message-ID: <20230725140435.GB1146582@cmpxchg.org>
References: <20230719174613.3062124-1-yosryahmed@google.com>
In-Reply-To: <20230719174613.3062124-1-yosryahmed@google.com>

On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
>
> For non-hierarchical stats, as memcg moved to rstat, they are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.

It's actually not an rstat regression. It's always been this costly.

Quick history:

We used to maintain *all* stats in per-cpu counters at the local
level. memory.stat reads would have to iterate and aggregate the
entire subtree every time. This was obviously very costly, so we added
batched upward propagation during stat updates to simplify reads:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty

However, that caused a regression in the stat write path, as the
upward propagation would bottleneck on the cachelines in the shared
parents. The fix for *that* re-introduced the per-cpu loops in the
local stat reads:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner
Date:   Thu Jun 13 15:55:46 2019 -0700

    mm: memcontrol: don't batch updates of local VM stats and events

So I wouldn't say it's a regression from rstat. Except for that short
period between the two commits above, the read side for local stats
was always expensive. rstat promises a shot at finally fixing it, with
less risk to the write path.

> It is inefficient to iterate and sum counters in all cpus when the rstat
> framework knows exactly when a percpu counter has an update. Instead,
> maintain cpu-aggregated non-hierarchical counters for each stat. During
> an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate cpus, we just need
> to read the maintained counters, similar to hierarchical stats.
>
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
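[Editor's note: for readers following along, here is a rough userspace
model of the two local read paths being discussed -- summing per-cpu
deltas on every read vs. folding them into a cpu-aggregated counter at
flush time. It is an illustration only, not the kernel code; the struct
and function names below are made up.]

/*
 * Toy userspace model (not kernel code, names are illustrative):
 * per-CPU stat deltas are either summed on every read (the old local
 * stat path) or folded into a cpu-aggregated counter at flush time,
 * which is what the patch does via rstat.
 */
#include <stdio.h>

#define NR_CPUS 256

struct memcg_model {
	long percpu_delta[NR_CPUS];	/* pending per-CPU updates */
	long local;			/* cpu-aggregated, non-hierarchical */
	long hierarchical;		/* subtree total (this + children) */
	struct memcg_model *parent;
};

/* Update path: touch only the local CPU's counter (cheap). */
static void stat_add(struct memcg_model *m, int cpu, long delta)
{
	m->percpu_delta[cpu] += delta;
}

/* Old local read path: sum all per-CPU counters on every read. */
static long read_local_slow(struct memcg_model *m)
{
	long sum = m->local;	/* already-folded portion */

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += m->percpu_delta[cpu];
	return sum;
}

/*
 * Flush: fold per-CPU deltas into the cpu-aggregated 'local' counter
 * and propagate them upward for the hierarchical totals.  Afterwards,
 * reading either counter is O(1).
 */
static void stat_flush(struct memcg_model *m)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		long delta = m->percpu_delta[cpu];

		if (!delta)
			continue;
		m->percpu_delta[cpu] = 0;
		m->local += delta;
		for (struct memcg_model *p = m; p; p = p->parent)
			p->hierarchical += delta;
	}
}

int main(void)
{
	static struct memcg_model parent, child;	/* zero-initialized */

	child.parent = &parent;
	stat_add(&child, 3, 10);
	stat_add(&child, 42, 5);

	printf("pre-flush  local (per-CPU sum): %ld\n", read_local_slow(&child));
	stat_flush(&child);
	printf("post-flush local (O(1) read):   %ld\n", child.local);
	printf("parent hierarchical:            %ld\n", parent.hierarchical);
	return 0;
}

[The memory-footprint tradeoff mentioned at the end of this mail also
falls out of the model: the extra 'local' aggregate is one more counter
kept per stat per cgroup.]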
>
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>  real	0m0.125s
>  user	0m0.005s
>  sys	0m0.120s
>
> After:
>  real	0m0.032s
>  user	0m0.005s
>  sys	0m0.027s
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

But I want to be clear: this isn't a regression fix. It's a new
performance optimization for the deprecated cgroup1 code. And it comes
at the cost of higher memory footprint for both cgroup1 AND cgroup2.
If this causes a regression, we should revert it again. But let's try.