linux-mm.kvack.org archive mirror
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Kiryl Shutsemau <kirill@shutemov.name>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
	 Johannes Weiner <hannes@cmpxchg.org>, Chris Mason <clm@fb.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	 Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
Date: Tue, 19 Aug 2025 10:15:39 -0700
Message-ID: <x3xp3cj6wpgxu5mjsd62fzvuzpn2mxpvlk6sau65si7bk6ncu5@dx6jbuacy42i>
In-Reply-To: <k6fpx5adh45t4jrxgiccq7acubwcgmi746crggxi6e4oihtvpt@thks5zrn53n3>

On Tue, Aug 19, 2025 at 10:15:13AM +0100, Kiryl Shutsemau wrote:
> On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> > 
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a chance to grab the lock
> > and perform critical work.
> 
> Hm. It doesn't necessarily have to be contention on the lock; it may just be
> that you are holding the lock for too long, so the CPU is not available for
> the scheduler.
> 
> > In our fleet, we have seen that performing batched lock freeing has led to
> > significantly lower rates of softlockups, while incurring relatively small
> > regressions (relative to the workload and relative to the variation).
> > 
> > The following are a few synthetic benchmarks:
> > 
> > Test 1: Small machine (30G RAM, 36 CPUs)
> > 
> > stress-ng --vm 30 --vm-bytes 1G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0076 |   -0.0183 |
> > | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> > | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> > +----------------------+---------------+-----------+
> > 
> > stress-ng --vm 20 --vm-bytes 3G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0295 |   -0.0078 |
> > | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> > | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> > +----------------------+---------------+-----------+
> > 
> > Test 2: Big machine (250G RAM, 176 CPUs)
> > 
> > stress-ng --vm 50 --vm-bytes 5G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0362 |   -0.0187 |
> > | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> > | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> > +----------------------+---------------+-----------+
> > 
> > stress-ng --vm 10 --vm-bytes 30G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        2.3130 |   -0.0754 |
> > | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> > | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> > +----------------------+---------------+-----------+
> > 
> > Suggested-by: Chris Mason <clm@fb.com>
> > Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> > 
> > ---
> >  mm/page_alloc.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a8a84c3b5fe5..bd7a8da3e159 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	 * below while (list_empty(list)) loop.
> >  	 */
> >  	count = min(pcp->count, count);
> > +	if (!count)
> > +		return;
> >  
> >  	/* Ensure requested pindex is drained first. */
> >  	pindex = pindex - 1;
> > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	while (count > 0) {
> >  		struct list_head *list;
> >  		int nr_pages;
> > +		int batch = min(count, pcp->batch);
> >  
> >  		/* Remove pages from lists in a round-robin fashion. */
> >  		do {
> > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  
> >  			/* must delete to avoid corrupting pcp list */
> >  			list_del(&page->pcp_list);
> > +			batch -= nr_pages;
> >  			count -= nr_pages;
> >  			pcp->count -= nr_pages;
> >  
> >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> >  			trace_mm_page_pcpu_drain(page, order, mt);
> > -		} while (count > 0 && !list_empty(list));
> > +		} while (batch > 0 && !list_empty(list));
> > +
> > +		/*
> > +		 * Prevent starving the lock for other users; every pcp->batch
> > +		 * pages freed, relinquish the zone lock if it is contended.
> > +		 */
> > +		if (count && spin_is_contended(&zone->lock)) {
> 
> I would rather drop the count thing and do something like this:
> 
> 		if (need_resched() || spin_needbreak(&zone->lock)) {
> 			spin_unlock_irqrestore(&zone->lock, flags);
> 			cond_resched();

Can this function be called from a non-sleepable context?

> 			spin_lock_irqsave(&zone->lock, flags);
> 		}
> 
> > +			spin_unlock_irqrestore(&zone->lock, flags);
> > +			spin_lock_irqsave(&zone->lock, flags);
> > +		}
> >  	}
> >  
> >  	spin_unlock_irqrestore(&zone->lock, flags);
> > 
> > base-commit: 137a6423b60fe0785aada403679d3b086bb83062
> > -- 
> > 2.47.3
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
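
To make the question above concrete, here is a rough userspace model of the
loop under discussion. It is not kernel code. It frees pages in chunks of at
most one batch, drops the lock between chunks only when the lock is marked
contended, and treats yielding as something allowed only when the caller can
sleep. All names (model_lock, drain_stats, drain_in_batches, may_sleep) are
made up for illustration: the contended flag stands in for
spin_is_contended(), and the yields counter stands in for cond_resched().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for zone->lock; not a real lock. */
struct model_lock {
	bool held;
	bool contended;	/* stand-in for spin_is_contended() */
	int drops;	/* times the lock was released mid-drain */
};

struct drain_stats {
	int freed;	/* pages "freed" by the model */
	int yields;	/* stand-in for cond_resched() calls */
};

static void model_lock_acquire(struct model_lock *l)
{
	l->held = true;
}

static void model_lock_release(struct model_lock *l)
{
	l->held = false;
}

/*
 * Free 'count' order-0 pages in chunks of at most 'batch' pages.
 * Between chunks, release and retake the lock if it is contended.
 * Yield only if the (assumed) caller context allows sleeping.
 */
static struct drain_stats drain_in_batches(struct model_lock *lock,
					   int count, int batch,
					   bool may_sleep)
{
	struct drain_stats st = { 0, 0 };

	assert(batch > 0);	/* a zero batch would never make progress */

	model_lock_acquire(lock);
	while (count > 0) {
		int chunk = count < batch ? count : batch;

		st.freed += chunk;
		count -= chunk;

		/* Nothing left to free: fall through to the final release. */
		if (count > 0 && lock->contended) {
			model_lock_release(lock);
			lock->drops++;
			if (may_sleep)
				st.yields++;
			model_lock_acquire(lock);
		}
	}
	model_lock_release(lock);

	return st;
}
```

With count = 2048, batch = 64 and the lock marked contended, this model
releases the lock 31 times and records no yields when may_sleep is false.
Whether free_pcppages_bulk() can actually be reached from atomic context is
still the open question; may_sleep is only a placeholder for that answer.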

