linux-mm.kvack.org archive mirror
From: Johannes Weiner <hannes@cmpxchg.org>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>, Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Date: Fri, 11 Apr 2025 11:39:06 -0400	[thread overview]
Message-ID: <20250411153906.GC366747@cmpxchg.org> (raw)
In-Reply-To: <46f1b2ab-2903-4cde-9e68-e334a0d0df22@suse.cz>

On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> > which helps, but produces only one block at a time. Allocation stalls
> > and THP failure rates are still higher than they could be.
> > 
> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> > the watermarking for kswapd & kcompactd: instead of targeting the high
> > watermark in order-0 pages and checking for one suitable block, simply
> > require that the high watermark is entirely met in pageblocks.
> 
> Hrm.

Hrm!

> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >  	if (!pageblock_aligned(cc->migrate_pfn))
> >  		return COMPACT_CONTINUE;
> >  
> > +	/*
> > +	 * When defrag_mode is enabled, make kcompactd target
> > +	 * watermarks in whole pageblocks. Because they can be stolen
> > +	 * without polluting, no further fallback checks are needed.
> > +	 */
> > +	if (defrag_mode && !cc->direct_compaction) {
> > +		if (__zone_watermark_ok(cc->zone, cc->order,
> > +					high_wmark_pages(cc->zone),
> > +					cc->highest_zoneidx, cc->alloc_flags,
> > +					zone_page_state(cc->zone,
> > +							NR_FREE_PAGES_BLOCKS)))
> > +			return COMPACT_SUCCESS;
> > +
> > +		return COMPACT_CONTINUE;
> > +	}
> 
> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> put us below high wmark quickly and then we never satisfy this?

The high watermark is not aligned, but why does it have to be? It's a
binary condition: met or not met. Compaction continues until it's met.

NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
the effective target is really ALIGN(high_wmark, pageblock_nr_pages),
since that's the point at which NR_FREE_PAGES_BLOCKS snaps above the
(unaligned) mark. But that seems reasonable, no?
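To make that snapping concrete, here is a standalone userspace sketch
(the 512-page block size, the helper names, and the 16746-page mark
below are illustrative, and this simplifies away the order-specific
part of __zone_watermark_ok(); it is not the kernel code):

```c
#include <assert.h>

/* Illustrative: 4K pages, pageblock_order 9 -> 512 pages per block */
#define PAGEBLOCK_NR_PAGES 512UL

/*
 * NR_FREE_PAGES_BLOCKS only counts pages sitting in fully free
 * pageblocks, so it can only hold multiples of PAGEBLOCK_NR_PAGES.
 */
static unsigned long free_pages_blocks(unsigned long whole_free_blocks)
{
	return whole_free_blocks * PAGEBLOCK_NR_PAGES;
}

/* The check is binary: the (unaligned) mark is met or it isn't. */
static int high_wmark_met(unsigned long high_wmark,
			  unsigned long whole_free_blocks)
{
	return free_pages_blocks(whole_free_blocks) >= high_wmark;
}

/* The effective target is the mark rounded up to a whole block. */
static unsigned long effective_target(unsigned long high_wmark)
{
	return (high_wmark + PAGEBLOCK_NR_PAGES - 1) /
		PAGEBLOCK_NR_PAGES * PAGEBLOCK_NR_PAGES;
}
```

With a 16746-page mark, 32 whole free blocks (16384 pages) miss it and
33 blocks (16896 pages) snap above it, so compaction works until the
aligned-up mark and no further.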

The allocator side is using low/min, so we have the conventional
hysteresis between consumer and producer.

For illustration, on my 2G test box, the watermarks in DMA32 look like
this:

  pages free     212057
        boost    0
        min      11164		(21.8 blocks)
        low      13955		(27.3 blocks)
        high     16746		(32.7 blocks)
        promo    19537
        spanned  456704
        present  455680
        managed  431617		(843.1 blocks)
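The block counts in parentheses are just pages divided by the
pageblock size; on this box that is 512 pages per block (4K pages,
pageblock_order 9). A quick standalone check of the conversion (the
helper is mine, not kernel code):

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512.0 /* 2MB blocks of 4K pages, x86-64 */

/* Convert a watermark in pages to (fractional) pageblocks. */
static double pages_to_blocks(unsigned long pages)
{
	return pages / PAGEBLOCK_NR_PAGES;
}
```

So the low-to-high distance the daemons work across here is
(16746 - 13955) / 512, about 5.5 blocks.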

So there are several blocks between the kblahds' wakeup and sleep. The
first allocation to cut into a whole free block will decrease
NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
the remaining space won't change that counter. So the distance between
the watermarks didn't fundamentally change (modulo block rounding).
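A toy model of that accounting (standalone userspace C; the struct and
helper are illustrative stand-ins, not the kernel's free-list code):

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL

/* One pageblock's contribution to NR_FREE_PAGES_BLOCKS. */
struct block {
	unsigned long free_pages;	/* pages still free in this block */
};

/*
 * Allocate @nr pages from @b; returns how many pages leave the
 * NR_FREE_PAGES_BLOCKS counter as a result.
 */
static unsigned long alloc_from_block(struct block *b, unsigned long nr)
{
	unsigned long delta = 0;

	/* Cutting into a fully free block removes the whole block
	 * from the counter at once... */
	if (b->free_pages == PAGEBLOCK_NR_PAGES)
		delta = PAGEBLOCK_NR_PAGES;
	b->free_pages -= nr;
	/* ...and later allocations from the same block change nothing. */
	return delta;
}
```

The first 1-page allocation costs the counter a full block; the next
100 pages from the same block cost it nothing.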

> Doesn't it then happen that, with defrag_mode, in practice kcompactd
> basically always runs until the scanners meet?

Tracing kcompactd calls to compaction_finished() with defrag_mode:

@[COMPACT_CONTINUE]: 6955
@[COMPACT_COMPLETE]: 19
@[COMPACT_PARTIAL_SKIPPED]: 1
@[COMPACT_SUCCESS]: 17
@wakeuprequests: 3

Of course, similar to kswapd, it might not reach the watermarks and
keep running if there is a continuous stream of allocations consuming
the blocks it's making. Hence the ratio between wakeups & continues.

But when demand stops, it'll balance the high mark and quit.

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  	 * meet watermarks.
> >  	 */
> >  	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> > +		unsigned long free_pages;
> > +
> >  		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> >  			mark = promo_wmark_pages(zone);
> >  		else
> >  			mark = high_wmark_pages(zone);
> > -		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
> 
> I think you just removed the only user of this _safe() function. Is the
> cpu-drift control it does no longer necessary?

Ah good catch. This should actually use zone_page_state_snapshot()
below depending on z->percpu_drift_mark.

This is active when there are enough cpus for the vmstat pcp deltas to
exceed low-min. Afaics this is still necessary, but also still
requires a lot of CPUs to matter (>212 cpus with 64G of memory).

I'll send a follow-up fix.
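For illustration, the difference between the two readers can be
modeled like this (toy userspace C; NR_CPUS, the struct, and the
numbers are illustrative stand-ins, not the kernel types):

```c
#include <assert.h>

#define NR_CPUS 4

/* A vmstat counter: a global count plus per-cpu deltas that
 * haven't been folded back in yet. */
struct zone_stat {
	long global;
	long pcp_diff[NR_CPUS];
};

/* Fast read: can be off by up to num_cpus * stat_threshold. */
static long zone_page_state(struct zone_stat *s)
{
	return s->global;
}

/* Safe read: folds in the per-cpu deltas before deciding,
 * in the spirit of zone_page_state_snapshot(). */
static long zone_page_state_snapshot(struct zone_stat *s)
{
	long x = s->global;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		x += s->pcp_diff[cpu];
	return x < 0 ? 0 : x;
}
```

With enough CPUs, the unfolded deltas alone can exceed the low-min
gap, which is exactly when the snapshot variant matters.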

> > +		/*
> > +		 * In defrag_mode, watermarks must be met in whole
> > +		 * blocks to avoid polluting allocator fallbacks.
> > +		 */
> > +		if (defrag_mode)
> > +			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> > +		else
> > +			free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > +					0, free_pages))
> >  			return true;
> >  	}
> >  


Thread overview: 32+ messages
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-14 15:08   ` Zi Yan
2025-03-16  4:28   ` Hugh Dickins
2025-03-17 18:18     ` Johannes Weiner
2025-03-21  6:21   ` kernel test robot
2025-03-21 13:55     ` Johannes Weiner
2025-04-10 15:19   ` Vlastimil Babka
2025-04-10 20:17     ` Johannes Weiner
2025-04-11  7:32       ` Vlastimil Babka
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
2025-03-14 18:36   ` Zi Yan
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
2025-03-14 18:54   ` Zi Yan
2025-03-14 20:50     ` Johannes Weiner
2025-03-14 22:54       ` Zi Yan
2025-03-22 15:05   ` Brendan Jackman
2025-03-23  0:58     ` Johannes Weiner
2025-03-23  1:34       ` Johannes Weiner
2025-03-23  3:46         ` Johannes Weiner
2025-03-23 18:04           ` Brendan Jackman
2025-03-31 15:55             ` Johannes Weiner
2025-03-13 21:05 ` [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance Johannes Weiner
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
2025-03-14 21:05   ` Johannes Weiner
2025-04-11  8:19   ` Vlastimil Babka
2025-04-11 15:39     ` Johannes Weiner [this message]
2025-04-11 16:51       ` Vlastimil Babka
2025-04-11 18:21         ` Johannes Weiner
2025-04-13  2:20           ` Johannes Weiner
2025-04-15  7:31             ` Vlastimil Babka
2025-04-15  7:44             ` Vlastimil Babka
