Re: [linux-next:master] [memcg] 01d37228d3: netperf.Throughput_Mbps 37.9% regression

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Vlastimil Babka <vbabka@suse.cz>
To: kernel test robot <oliver.sang@intel.com>,
	Alexei Starovoitov <ast@kernel.org>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
	Michal Hocko <mhocko@suse.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [linux-next:master] [memcg] 01d37228d3: netperf.Throughput_Mbps 37.9% regression
Date: Mon, 10 Mar 2025 10:55:51 +0100	[thread overview]
Message-ID: <7c41d8d7-7d5a-4c3d-97b3-23642e376ff9@suse.cz> (raw)
In-Reply-To: <202503101254.cfd454df-lkp@intel.com>

On 3/10/25 06:50, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a 37.9% regression of netperf.Throughput_Mbps on:

I assume this is some network receive context where gfpflags do not allow
blocking.
 
> commit: 01d37228d331047a0bbbd1026cec2ccabef6d88d ("memcg: Use trylock to access memcg stock_lock.")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [test failed on linux-next/master 7ec162622e66a4ff886f8f28712ea1b13069e1aa]
> 
> testcase: netperf
> config: x86_64-rhel-9.4
> compiler: gcc-12
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> parameters:
> 
> 	ip: ipv4
> 	runtime: 300s
> 	nr_threads: 50%
> 	cluster: cs-localhost
> 	test: TCP_MAERTS
> 	cpufreq_governor: performance
> 
> 
> In addition to that, the commit also has significant impact on the following tests:
> 
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.mmapfork.ops_per_sec  63.5% regression                                        |

Hm interesting, this one at least from the name would be a GFP_KERNEL context?

> | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | nr_threads=100%                                                                                    |
> |                  | test=mmapfork                                                                                      |
> |                  | testtime=60s                                                                                       |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | hackbench: hackbench.throughput  26.6% regression                                                  |
> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | ipc=socket                                                                                         |
> |                  | iterations=4                                                                                       |
> |                  | mode=threads                                                                                       |
> |                  | nr_threads=100%                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec  33.0% regression                               |
> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory        |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | mode=development                                                                                   |
> |                  | nr_threads=100%                                                                                    |
> |                  | test=TCP                                                                                           |
> |                  | test_memory_size=50%                                                                               |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | vm-scalability: vm-scalability.throughput  86.8% regression                                        |
> | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | runtime=300s                                                                                       |
> |                  | size=1T                                                                                            |
> |                  | test=lru-shm                                                                                       |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | netperf: netperf.Throughput_Mbps 39.9% improvement                                                 |

An improvement? Weird.

> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> | test parameters  | cluster=cs-localhost                                                                               |
> |                  | cpufreq_governor=performance                                                                       |
> |                  | ip=ipv4                                                                                            |
> |                  | nr_threads=200%                                                                                    |
> |                  | runtime=300s                                                                                       |
> |                  | test=TCP_MAERTS                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops  68.8% regression                                      |
> | test machine     | 104 threads 2 sockets (Skylake) with 192G memory                                                   |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | mode=thread                                                                                        |
> |                  | nr_task=100%                                                                                       |
> |                  | test=fallocate1                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+

Some of those as well.

Anyway we should not be expecting the localtry_trylock_irqsave() itself be
failing and resulting in a slow path, as that woulre require an allocation
attempt from a nmi. So what else the commit does?

>       0.10 ±  4%     +11.3       11.43 ±  3%  perf-profile.self.cycles-pp.try_charge_memcg
>       0.00           +13.7       13.72 ±  2%  perf-profile.self.cycles-pp.page_counter_try_charge

This does suggest more time spent in try_charge_memcg() because consume_stock() has failed.

And I suspect this:

+       if (!gfpflags_allow_spinning(gfp_mask))
+               /* Avoid the refill and flush of the older stock */
+               batch = nr_pages;

because this will affect the refill even if consume_stock() fails not due to
a trylock failure (which should not be happening), but also just because the
stock was of a wrong memcg or depleted. So in the nowait context we deny the
refill even if we have the memory. Attached patch could be used to see if it
if fixes things. I'm not sure about the testcases where it doesn't look like
nowait context would be used though, let's see.

I've also found this:
https://lore.kernel.org/all/7s6fbpwsynadnzybhdqg3jwhls4pq2sptyxuyghxpaufhissj5@iadb6ibzscjj/

> 
> BTW after the done_restock tag in try_charge_memcg(), we will another
> gfpflags_allow_spinning() check to avoid schedule_work() and
> mem_cgroup_handle_over_high(). Maybe simply return early for
> gfpflags_allow_spinning() without checking high marks.

looks like a small possible optimization that was forgotten?
 ----8<----
From 29e7d18645577ce13d8a0140c0df050ce1ce0f95 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 10 Mar 2025 10:32:14 +0100
Subject: [PATCH] memcg: Avoid stock refill only if stock_lock can't be
 acquired

Since commit 01d37228d331 ("memcg: Use trylock to access memcg
stock_lock.") consume_stock() can fail if it can't obtain
memcg_stock.stock_lock. In that case try_charge_memcg() also avoids
refilling or flushing the stock when gfp flags indicate we are in the
context where obtaining the lock could fail.

However consume_stock() can also fail because the stock was depleted, or
belonged to a different memcg. Avoiding the stock refill then reduces
the caching efficiency, as the refill could still succeed with memory
available. This has caused various regressions to be reported by the
kernel test robot.

To fix this, make the decision to avoid stock refill more precise by
making consume_stock() return -EBUSY when it fails to obtain stock_lock,
and using that for the no-refill decision.

Fixes: 01d37228d331 ("memcg: Use trylock to access memcg stock_lock.")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503101254.cfd454df-lkp@intel.com

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memcontrol.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 092cab99dec7..a8371a22c7f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1772,22 +1772,23 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
  * stock, and at least @nr_pages are available in that stock.  Failure to
  * service an allocation will refill the stock.
  *
- * returns true if successful, false otherwise.
+ * returns 0 if successful, -EBUSY if lock cannot be acquired, or -ENOMEM
+ * if the memcg does not match or there are not enough pages
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+static int consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned int stock_pages;
 	unsigned long flags;
-	bool ret = false;
+	bool ret = -ENOMEM;
 
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
 	if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
 		if (!gfpflags_allow_spinning(gfp_mask))
-			return ret;
+			return -EBUSY;
 		localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
 	}
 
@@ -1795,7 +1796,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 	stock_pages = READ_ONCE(stock->nr_pages);
 	if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
 		WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
-		ret = true;
+		ret = 0;
 	}
 
 	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
@@ -2228,13 +2229,18 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
+	int consume_ret;
 
 retry:
-	if (consume_stock(memcg, nr_pages, gfp_mask))
+	consume_ret = consume_stock(memcg, nr_pages, gfp_mask);
+	if (!consume_ret)
 		return 0;
 
-	if (!gfpflags_allow_spinning(gfp_mask))
-		/* Avoid the refill and flush of the older stock */
+	/*
+	 * Avoid the refill and flush of the older stock if we failed to acquire
+	 * the stock_lock
+	 */
+	if (consume_ret == -EBUSY)
 		batch = nr_pages;
 
 	if (!do_memsw_account() ||
-- 
2.48.1

next prev parent reply	other threads:[~2025-03-10  9:55 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10  5:50 kernel test robot
2025-03-10  9:55 ` Vlastimil Babka [this message]
2025-03-10 10:18   ` Alexei Starovoitov
2025-03-10 10:34     ` Vlastimil Babka
2025-03-10 10:56       ` Alexei Starovoitov
2025-03-10 11:03         ` Vlastimil Babka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7c41d8d7-7d5a-4c3d-97b3-23642e376ff9@suse.cz \
    --to=vbabka@suse.cz \
    --cc=ast@kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=mhocko@suse.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=shakeel.butt@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox