From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 30 Apr 2025 08:05:22 -0700
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Vlastimil Babka
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Jakub Kicinski, Eric Dumazet, Soheil Hassas Yeganeh,
	linux-mm@kvack.org, cgroups@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Meta kernel team
Subject: Re: [PATCH] memcg: multi-memcg percpu charge cache
References: <20250416180229.2902751-1-shakeel.butt@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Wed, Apr 30, 2025 at 11:57:13AM +0200, Vlastimil Babka wrote:
> On 4/16/25 20:02, Shakeel Butt wrote:
> > Memory cgroup accounting is expensive, and to reduce the cost, the
> > kernel maintains a per-cpu charge cache for a single memcg. If a
> > charge request comes for a different memcg, the kernel flushes the
> > old memcg's charge cache, charges the new memcg a fixed amount (64
> > pages), subtracts the requested amount, and stores the remainder in
> > the per-cpu charge cache for the new memcg.
> >
> > This mechanism is based on the assumption that the kernel, for
> > locality, keeps a process on a CPU for long periods of time and that
> > most of the charge requests from that process will be served by that
> > CPU's local charge cache.
> >
> > However, this assumption breaks down for incoming network traffic on
> > a multi-tenant machine. We are in the process of running multiple
> > workloads on a single machine, and when such workloads are network
> > heavy, we are seeing very high network memory accounting cost. We
> > have observed multiple CPUs spending almost 100% of their time in
> > net_rx_action, and almost all of that time is spent in memcg
> > accounting of the network traffic.
> >
> > More precisely, net_rx_action is serving packets from multiple
> > workloads and is observing/serving a mix of packets from these
> > workloads. The memcg switch of the per-cpu cache is very expensive,
> > and we are observing a lot of memcg switches on the machine. Almost
> > all the time is being spent charging the new memcg and flushing the
> > older memcg's cache. So we clearly need a per-cpu cache that supports
> > multiple memcgs for this scenario.
> >
> > This patch implements a simple (and dumb) multi-memcg percpu charge
> > cache.
> > Actually, we started with a more sophisticated LRU-based approach,
> > but the dumb one was always better than the sophisticated one by 1%
> > to 3%, so we are going with the simple approach.
> >
> > Some of the design choices are:
> >
> > 1. Fit all cached memcgs in a single cacheline.
> > 2. The cache array can be a mix of empty slots and memcg-charged
> >    slots, so the kernel has to traverse the full array.
> > 3. The cache drain from reclaim drains all cached memcgs, to keep
> >    things simple.
> >
> > To evaluate the impact of this optimization, on a 72-CPU machine we
> > ran the following workload, where each netperf client runs in a
> > different cgroup. The next-20250415 kernel is used as base.
> >
> >  $ netserver -6
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> >  number of clients | Without patch | With patch
> >  6                 | 42584.1 Mbps  | 48603.4 Mbps (14.13% improvement)
> >  12                | 30617.1 Mbps  | 47919.7 Mbps (56.51% improvement)
> >  18                | 25305.2 Mbps  | 45497.3 Mbps (79.79% improvement)
> >  24                | 20104.1 Mbps  | 37907.7 Mbps (88.55% improvement)
> >  30                | 14702.4 Mbps  | 30746.5 Mbps (109.12% improvement)
> >  36                | 10801.5 Mbps  | 26476.3 Mbps (145.11% improvement)
> >
> > The results show drastic improvement for network-intensive workloads.
> >
> > Signed-off-by: Shakeel Butt
>
> Acked-by: Vlastimil Babka
>
> See below
>
> > ---
> >  mm/memcontrol.c | 128 ++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 91 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1ad326e871c1..0a02ba07561e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1769,10 +1769,11 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
> >  	pr_cont(" are going to be killed due to memory.oom.group set\n");
> >  }
> >  
> > +#define NR_MEMCG_STOCK 7
> >  struct memcg_stock_pcp {
> >  	local_trylock_t stock_lock;
> > -	struct mem_cgroup *cached;	/* this never be root cgroup */
> > -	unsigned int nr_pages;
> > +	uint8_t nr_pages[NR_MEMCG_STOCK];
> > +	struct mem_cgroup *cached[NR_MEMCG_STOCK];
>
> I have noticed memcg_stock is a DEFINE_PER_CPU and not
> DEFINE_PER_CPU_ALIGNED, so I think the intended cacheline usage isn't
> guaranteed now.
>
> Actually I tried compiling and got in objdump -t vmlinux:
>
> ffffffff83a26e60 l     O .data..percpu	0000000000000088 memcg_stock
>
> AFAICS that's aligned to 32 bytes only (0x60 is 96), not 64.
>
> Changing to _ALIGNED gives me:
>
> ffffffff83a2c5c0 l     O .data..percpu	0000000000000088 memcg_stock
>
> 0xc0 is 192, a multiple of 64, so it seems to work as intended and is
> indeed necessary. So you should change it too while adding the comment.
>

Wow, I didn't notice this at all. Thanks a lot. I will fix this in the
next fix diff.