linux-mm.kvack.org archive mirror
* [PATCH 0/3] mm: page_alloc: freelist hygiene follow-up
@ 2025-02-25  0:08 Johannes Weiner
  2025-02-25  0:08 ` [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25  0:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

One fix, two cleanups. Based on mm-unstable of today.

 mm/page_alloc.c | 490 +++++++++++++++++++++++++-----------------------------
 1 file changed, 230 insertions(+), 260 deletions(-)




* [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25  0:08 [PATCH 0/3] mm: page_alloc: freelist hygiene follow-up Johannes Weiner
@ 2025-02-25  0:08 ` Johannes Weiner
  2025-02-25 10:50   ` Vlastimil Babka
  2025-02-25 13:34   ` Brendan Jackman
  2025-02-25  0:08 ` [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates Johannes Weiner
  2025-02-25  0:08 ` [PATCH 3/3] mm: page_alloc: group fallback functions together Johannes Weiner
  2 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25  0:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

The fallback code searches for the biggest buddy first in an attempt
to steal the whole block and encourage type grouping down the line.

The approach used to be this:

- Non-movable requests will split the largest buddy and steal the
  remainder. This splits up contiguity, but it allows subsequent
  requests of this type to fall back into adjacent space.

- Movable requests go and look for the smallest buddy instead. The
  thinking is that movable requests can be compacted, so grouping is
  less important than retaining contiguity.

c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
conversion") enforces freelist type hygiene, which restricts stealing
to either claiming the whole block or just taking the requested chunk;
no additional pages or buddy remainders can be stolen any more.

The patch mishandled when to switch to finding the smallest buddy in
that new reality. As a result, it may steal the exact request size,
but from the biggest buddy. This causes fracturing for no good reason.

Fix this by committing to the new behavior: either steal the whole
block, or fall back to the smallest buddy.

Remove single-page stealing from steal_suitable_fallback(). Rename it
to try_to_steal_block() to make the intentions clear. If this fails,
always fall back to the smallest buddy.
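
For orientation, the resulting sequence in __rmqueue_fallback() is
roughly the following (a simplified sketch of the diff below; the
unchanged loop bounds and free_area lookups are paraphrased from the
existing code, and tracing is left out):

    for (current_order = MAX_PAGE_ORDER; current_order >= min_order; current_order--) {
            fallback_mt = find_suitable_fallback(area, current_order,
                            start_migratetype, false, &can_steal);
            if (fallback_mt == -1)
                    continue;
            if (!can_steal)
                    break;
            page = get_page_from_free_area(area, fallback_mt);
            page = try_to_steal_block(zone, page, current_order, order,
                                      start_migratetype, alloc_flags);
            if (page)
                    return page;    /* claimed a whole pageblock */
    }

    if (alloc_flags & ALLOC_NOFRAGMENT)
            return NULL;

    /* No whole block could be claimed: split the smallest suitable buddy */
    for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
            fallback_mt = find_suitable_fallback(area, current_order,
                            start_migratetype, false, &can_steal);
            if (fallback_mt == -1)
                    continue;
            page = get_page_from_free_area(area, fallback_mt);
            page_del_and_expand(zone, page, order, current_order, fallback_mt);
            return page;
    }

    return NULL;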

The following is from 4 runs of mmtests' thpchallenge. "Pollute" is
single page fallback, "steal" is conversion of a partially used block.
The numbers for free block conversions (omitted) are comparable.

				     vanilla	      patched

@pollute[unmovable from reclaimable]:	  27		  106
@pollute[unmovable from movable]:	  82		   46
@pollute[reclaimable from unmovable]:	 256		   83
@pollute[reclaimable from movable]:	  46		    8
@pollute[movable from unmovable]:	4841		  868
@pollute[movable from reclaimable]:	5278		12568

@steal[unmovable from reclaimable]:	  11		   12
@steal[unmovable from movable]:		 113		   49
@steal[reclaimable from unmovable]:	  19		   34
@steal[reclaimable from movable]:	  47		   21
@steal[movable from unmovable]:		 250		  183
@steal[movable from reclaimable]:	  81		   93

The allocator appears to do a better job at keeping stealing and
polluting to the first fallback preference. As a result, the numbers
for "from movable" - the least preferred fallback option, and most
detrimental to compactability - are down across the board.

Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 80 +++++++++++++++++++++----------------------------
 1 file changed, 34 insertions(+), 46 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 16dfcf7ade74..9ea14ec52449 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1986,13 +1986,12 @@ static inline bool boost_watermark(struct zone *zone)
  * can claim the whole pageblock for the requested migratetype. If not, we check
  * the pageblock for constituent pages; if at least half of the pages are free
  * or compatible, we can still claim the whole block, so pages freed in the
- * future will be put on the correct free list. Otherwise, we isolate exactly
- * the order we need from the fallback block and leave its migratetype alone.
+ * future will be put on the correct free list.
  */
 static struct page *
-steal_suitable_fallback(struct zone *zone, struct page *page,
-			int current_order, int order, int start_type,
-			unsigned int alloc_flags, bool whole_block)
+try_to_steal_block(struct zone *zone, struct page *page,
+		   int current_order, int order, int start_type,
+		   unsigned int alloc_flags)
 {
 	int free_pages, movable_pages, alike_pages;
 	unsigned long start_pfn;
@@ -2005,7 +2004,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 	 * highatomic accounting.
 	 */
 	if (is_migrate_highatomic(block_type))
-		goto single_page;
+		return NULL;
 
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
@@ -2026,14 +2025,10 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 	if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD))
 		set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 
-	/* We are not allowed to try stealing from the whole block */
-	if (!whole_block)
-		goto single_page;
-
 	/* moving whole block can fail due to zone boundary conditions */
 	if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
 				       &movable_pages))
-		goto single_page;
+		return NULL;
 
 	/*
 	 * Determine how many pages are compatible with our allocation.
@@ -2066,9 +2061,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 		return __rmqueue_smallest(zone, order, start_type);
 	}
 
-single_page:
-	page_del_and_expand(zone, page, order, current_order, block_type);
-	return page;
+	return NULL;
 }
 
 /*
@@ -2250,14 +2243,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 }
 
 /*
- * Try finding a free buddy page on the fallback list and put it on the free
- * list of requested migratetype, possibly along with other pages from the same
- * block, depending on fragmentation avoidance heuristics. Returns true if
- * fallback was found so that __rmqueue_smallest() can grab it.
+ * Try finding a free buddy page on the fallback list.
+ *
+ * This will attempt to steal a whole pageblock for the requested type
+ * to ensure grouping of such requests in the future.
+ *
+ * If a whole block cannot be stolen, regress to __rmqueue_smallest()
+ * logic to at least break up as little contiguity as possible.
  *
  * The use of signed ints for order and current_order is a deliberate
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
+ *
+ * Return the stolen page, or NULL if none can be found.
  */
 static __always_inline struct page *
 __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
@@ -2291,45 +2289,35 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 		if (fallback_mt == -1)
 			continue;
 
-		/*
-		 * We cannot steal all free pages from the pageblock and the
-		 * requested migratetype is movable. In that case it's better to
-		 * steal and split the smallest available page instead of the
-		 * largest available page, because even if the next movable
-		 * allocation falls back into a different pageblock than this
-		 * one, it won't cause permanent fragmentation.
-		 */
-		if (!can_steal && start_migratetype == MIGRATE_MOVABLE
-					&& current_order > order)
-			goto find_smallest;
+		if (!can_steal)
+			break;
 
-		goto do_steal;
+		page = get_page_from_free_area(area, fallback_mt);
+		page = try_to_steal_block(zone, page, current_order, order,
+					  start_migratetype, alloc_flags);
+		if (page)
+			goto got_one;
 	}
 
-	return NULL;
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		return NULL;
 
-find_smallest:
+	/* No luck stealing blocks. Find the smallest fallback page */
 	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
 				start_migratetype, false, &can_steal);
-		if (fallback_mt != -1)
-			break;
-	}
-
-	/*
-	 * This should not happen - we already found a suitable fallback
-	 * when looking for the largest page.
-	 */
-	VM_BUG_ON(current_order > MAX_PAGE_ORDER);
+		if (fallback_mt == -1)
+			continue;
 
-do_steal:
-	page = get_page_from_free_area(area, fallback_mt);
+		page = get_page_from_free_area(area, fallback_mt);
+		page_del_and_expand(zone, page, order, current_order, fallback_mt);
+		goto got_one;
+	}
 
-	/* take off list, maybe claim block, expand remainder */
-	page = steal_suitable_fallback(zone, page, current_order, order,
-				       start_migratetype, alloc_flags, can_steal);
+	return NULL;
 
+got_one:
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
 
-- 
2.48.1




* [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates
  2025-02-25  0:08 [PATCH 0/3] mm: page_alloc: freelist hygiene follow-up Johannes Weiner
  2025-02-25  0:08 ` [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy Johannes Weiner
@ 2025-02-25  0:08 ` Johannes Weiner
  2025-02-25 11:01   ` Vlastimil Babka
  2025-02-25 13:43   ` Brendan Jackman
  2025-02-25  0:08 ` [PATCH 3/3] mm: page_alloc: group fallback functions together Johannes Weiner
  2 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25  0:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

The freelist hygiene patches made migratetype accesses fully protected
under the zone->lock. Remove remnants of handling the race conditions
that existed before from the MIGRATE_HIGHATOMIC code.
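
To illustrate, the caller side in __rmqueue_fallback() now boils down
to the following (condensed sketch; zone->lock is already held by the
caller, which is what makes the looked-up type stable):

    fallback_mt = find_suitable_fallback(area, current_order,
                    start_migratetype, false, &can_steal);
    page = get_page_from_free_area(area, fallback_mt);
    /* fallback_mt cannot change under zone->lock; pass it down as-is */
    page = try_to_steal_block(zone, page, current_order, order,
                              start_migratetype, fallback_mt,
                              alloc_flags);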

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 50 ++++++++++++++++---------------------------------
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ea14ec52449..53d315aa69c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1991,20 +1991,10 @@ static inline bool boost_watermark(struct zone *zone)
 static struct page *
 try_to_steal_block(struct zone *zone, struct page *page,
 		   int current_order, int order, int start_type,
-		   unsigned int alloc_flags)
+		   int block_type, unsigned int alloc_flags)
 {
 	int free_pages, movable_pages, alike_pages;
 	unsigned long start_pfn;
-	int block_type;
-
-	block_type = get_pageblock_migratetype(page);
-
-	/*
-	 * This can happen due to races and we want to prevent broken
-	 * highatomic accounting.
-	 */
-	if (is_migrate_highatomic(block_type))
-		return NULL;
 
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
@@ -2179,33 +2169,22 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < NR_PAGE_ORDERS; order++) {
 			struct free_area *area = &(zone->free_area[order]);
-			int mt;
+			unsigned long size;
 
 			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
 			if (!page)
 				continue;
 
-			mt = get_pageblock_migratetype(page);
 			/*
-			 * In page freeing path, migratetype change is racy so
-			 * we can counter several free pages in a pageblock
-			 * in this loop although we changed the pageblock type
-			 * from highatomic to ac->migratetype. So we should
-			 * adjust the count once.
+			 * It should never happen but changes to
+			 * locking could inadvertently allow a per-cpu
+			 * drain to add pages to MIGRATE_HIGHATOMIC
+			 * while unreserving so be safe and watch for
+			 * underflows.
 			 */
-			if (is_migrate_highatomic(mt)) {
-				unsigned long size;
-				/*
-				 * It should never happen but changes to
-				 * locking could inadvertently allow a per-cpu
-				 * drain to add pages to MIGRATE_HIGHATOMIC
-				 * while unreserving so be safe and watch for
-				 * underflows.
-				 */
-				size = max(pageblock_nr_pages, 1UL << order);
-				size = min(size, zone->nr_reserved_highatomic);
-				zone->nr_reserved_highatomic -= size;
-			}
+			size = max(pageblock_nr_pages, 1UL << order);
+			size = min(size, zone->nr_reserved_highatomic);
+			zone->nr_reserved_highatomic -= size;
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal
@@ -2217,10 +2196,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * may increase.
 			 */
 			if (order < pageblock_order)
-				ret = move_freepages_block(zone, page, mt,
+				ret = move_freepages_block(zone, page,
+							   MIGRATE_HIGHATOMIC,
 							   ac->migratetype);
 			else {
-				move_to_free_list(page, zone, order, mt,
+				move_to_free_list(page, zone, order,
+						  MIGRATE_HIGHATOMIC,
 						  ac->migratetype);
 				change_pageblock_range(page, order,
 						       ac->migratetype);
@@ -2294,7 +2275,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 		page = get_page_from_free_area(area, fallback_mt);
 		page = try_to_steal_block(zone, page, current_order, order,
-					  start_migratetype, alloc_flags);
+					  start_migratetype, fallback_mt,
+					  alloc_flags);
 		if (page)
 			goto got_one;
 	}
-- 
2.48.1




* [PATCH 3/3] mm: page_alloc: group fallback functions together
  2025-02-25  0:08 [PATCH 0/3] mm: page_alloc: freelist hygiene follow-up Johannes Weiner
  2025-02-25  0:08 ` [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy Johannes Weiner
  2025-02-25  0:08 ` [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates Johannes Weiner
@ 2025-02-25  0:08 ` Johannes Weiner
  2025-02-25 11:02   ` Vlastimil Babka
  2025-02-25 13:50   ` Brendan Jackman
  2 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25  0:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Brendan Jackman, linux-mm, linux-kernel

The way the fallback rules are spread out makes them hard to
follow. Move the functions next to each other at least.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 394 ++++++++++++++++++++++++------------------------
 1 file changed, 197 insertions(+), 197 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53d315aa69c4..d02edb7d5e21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1903,6 +1903,43 @@ static void change_pageblock_range(struct page *pageblock_page,
 	}
 }
 
+static inline bool boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return false;
+	/*
+	 * Don't bother in zones that are unlikely to produce results.
+	 * On small machines, including kdump capture kernels running
+	 * in a small area, boosting the watermark can cause an out of
+	 * memory situation immediately.
+	 */
+	if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
+		return false;
+
+	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+			watermark_boost_factor, 10000);
+
+	/*
+	 * high watermark may be uninitialised if fragmentation occurs
+	 * very early in boot so do not boost. We do not fall
+	 * through and boost by pageblock_nr_pages as failing
+	 * allocations that early means that reclaim is not going
+	 * to help and it may even be impossible to reclaim the
+	 * boosted watermark resulting in a hang.
+	 */
+	if (!max_boost)
+		return false;
+
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+
+	return true;
+}
+
 /*
  * When we are falling back to another migratetype during allocation, try to
  * steal extra free pages from the same pageblocks to satisfy further
@@ -1944,41 +1981,38 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
-static inline bool boost_watermark(struct zone *zone)
+/*
+ * Check whether there is a suitable fallback freepage with requested order.
+ * If only_stealable is true, this function returns fallback_mt only if
+ * we can steal other freepages all together. This would help to reduce
+ * fragmentation due to mixed migratetype pages in one pageblock.
+ */
+int find_suitable_fallback(struct free_area *area, unsigned int order,
+			int migratetype, bool only_stealable, bool *can_steal)
 {
-	unsigned long max_boost;
+	int i;
+	int fallback_mt;
 
-	if (!watermark_boost_factor)
-		return false;
-	/*
-	 * Don't bother in zones that are unlikely to produce results.
-	 * On small machines, including kdump capture kernels running
-	 * in a small area, boosting the watermark can cause an out of
-	 * memory situation immediately.
-	 */
-	if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
-		return false;
+	if (area->nr_free == 0)
+		return -1;
 
-	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
-			watermark_boost_factor, 10000);
+	*can_steal = false;
+	for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
+		fallback_mt = fallbacks[migratetype][i];
+		if (free_area_empty(area, fallback_mt))
+			continue;
 
-	/*
-	 * high watermark may be uninitialised if fragmentation occurs
-	 * very early in boot so do not boost. We do not fall
-	 * through and boost by pageblock_nr_pages as failing
-	 * allocations that early means that reclaim is not going
-	 * to help and it may even be impossible to reclaim the
-	 * boosted watermark resulting in a hang.
-	 */
-	if (!max_boost)
-		return false;
+		if (can_steal_fallback(order, migratetype))
+			*can_steal = true;
 
-	max_boost = max(pageblock_nr_pages, max_boost);
+		if (!only_stealable)
+			return fallback_mt;
 
-	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
-		max_boost);
+		if (*can_steal)
+			return fallback_mt;
+	}
 
-	return true;
+	return -1;
 }
 
 /*
@@ -2054,175 +2088,6 @@ try_to_steal_block(struct zone *zone, struct page *page,
 	return NULL;
 }
 
-/*
- * Check whether there is a suitable fallback freepage with requested order.
- * If only_stealable is true, this function returns fallback_mt only if
- * we can steal other freepages all together. This would help to reduce
- * fragmentation due to mixed migratetype pages in one pageblock.
- */
-int find_suitable_fallback(struct free_area *area, unsigned int order,
-			int migratetype, bool only_stealable, bool *can_steal)
-{
-	int i;
-	int fallback_mt;
-
-	if (area->nr_free == 0)
-		return -1;
-
-	*can_steal = false;
-	for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
-		fallback_mt = fallbacks[migratetype][i];
-		if (free_area_empty(area, fallback_mt))
-			continue;
-
-		if (can_steal_fallback(order, migratetype))
-			*can_steal = true;
-
-		if (!only_stealable)
-			return fallback_mt;
-
-		if (*can_steal)
-			return fallback_mt;
-	}
-
-	return -1;
-}
-
-/*
- * Reserve the pageblock(s) surrounding an allocation request for
- * exclusive use of high-order atomic allocations if there are no
- * empty page blocks that contain a page with a suitable order
- */
-static void reserve_highatomic_pageblock(struct page *page, int order,
-					 struct zone *zone)
-{
-	int mt;
-	unsigned long max_managed, flags;
-
-	/*
-	 * The number reserved as: minimum is 1 pageblock, maximum is
-	 * roughly 1% of a zone. But if 1% of a zone falls below a
-	 * pageblock size, then don't reserve any pageblocks.
-	 * Check is race-prone but harmless.
-	 */
-	if ((zone_managed_pages(zone) / 100) < pageblock_nr_pages)
-		return;
-	max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages);
-	if (zone->nr_reserved_highatomic >= max_managed)
-		return;
-
-	spin_lock_irqsave(&zone->lock, flags);
-
-	/* Recheck the nr_reserved_highatomic limit under the lock */
-	if (zone->nr_reserved_highatomic >= max_managed)
-		goto out_unlock;
-
-	/* Yoink! */
-	mt = get_pageblock_migratetype(page);
-	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (!migratetype_is_mergeable(mt))
-		goto out_unlock;
-
-	if (order < pageblock_order) {
-		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
-			goto out_unlock;
-		zone->nr_reserved_highatomic += pageblock_nr_pages;
-	} else {
-		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
-		zone->nr_reserved_highatomic += 1 << order;
-	}
-
-out_unlock:
-	spin_unlock_irqrestore(&zone->lock, flags);
-}
-
-/*
- * Used when an allocation is about to fail under memory pressure. This
- * potentially hurts the reliability of high-order allocations when under
- * intense memory pressure but failed atomic allocations should be easier
- * to recover from than an OOM.
- *
- * If @force is true, try to unreserve pageblocks even though highatomic
- * pageblock is exhausted.
- */
-static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
-						bool force)
-{
-	struct zonelist *zonelist = ac->zonelist;
-	unsigned long flags;
-	struct zoneref *z;
-	struct zone *zone;
-	struct page *page;
-	int order;
-	int ret;
-
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
-								ac->nodemask) {
-		/*
-		 * Preserve at least one pageblock unless memory pressure
-		 * is really high.
-		 */
-		if (!force && zone->nr_reserved_highatomic <=
-					pageblock_nr_pages)
-			continue;
-
-		spin_lock_irqsave(&zone->lock, flags);
-		for (order = 0; order < NR_PAGE_ORDERS; order++) {
-			struct free_area *area = &(zone->free_area[order]);
-			unsigned long size;
-
-			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
-			if (!page)
-				continue;
-
-			/*
-			 * It should never happen but changes to
-			 * locking could inadvertently allow a per-cpu
-			 * drain to add pages to MIGRATE_HIGHATOMIC
-			 * while unreserving so be safe and watch for
-			 * underflows.
-			 */
-			size = max(pageblock_nr_pages, 1UL << order);
-			size = min(size, zone->nr_reserved_highatomic);
-			zone->nr_reserved_highatomic -= size;
-
-			/*
-			 * Convert to ac->migratetype and avoid the normal
-			 * pageblock stealing heuristics. Minimally, the caller
-			 * is doing the work and needs the pages. More
-			 * importantly, if the block was always converted to
-			 * MIGRATE_UNMOVABLE or another type then the number
-			 * of pageblocks that cannot be completely freed
-			 * may increase.
-			 */
-			if (order < pageblock_order)
-				ret = move_freepages_block(zone, page,
-							   MIGRATE_HIGHATOMIC,
-							   ac->migratetype);
-			else {
-				move_to_free_list(page, zone, order,
-						  MIGRATE_HIGHATOMIC,
-						  ac->migratetype);
-				change_pageblock_range(page, order,
-						       ac->migratetype);
-				ret = 1;
-			}
-			/*
-			 * Reserving the block(s) already succeeded,
-			 * so this should not fail on zone boundaries.
-			 */
-			WARN_ON_ONCE(ret == -1);
-			if (ret > 0) {
-				spin_unlock_irqrestore(&zone->lock, flags);
-				return ret;
-			}
-		}
-		spin_unlock_irqrestore(&zone->lock, flags);
-	}
-
-	return false;
-}
-
 /*
  * Try finding a free buddy page on the fallback list.
  *
@@ -3143,6 +3008,141 @@ struct page *rmqueue(struct zone *preferred_zone,
 	return page;
 }
 
+/*
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
+ */
+static void reserve_highatomic_pageblock(struct page *page, int order,
+					 struct zone *zone)
+{
+	int mt;
+	unsigned long max_managed, flags;
+
+	/*
+	 * The number reserved as: minimum is 1 pageblock, maximum is
+	 * roughly 1% of a zone. But if 1% of a zone falls below a
+	 * pageblock size, then don't reserve any pageblocks.
+	 * Check is race-prone but harmless.
+	 */
+	if ((zone_managed_pages(zone) / 100) < pageblock_nr_pages)
+		return;
+	max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages);
+	if (zone->nr_reserved_highatomic >= max_managed)
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	/* Recheck the nr_reserved_highatomic limit under the lock */
+	if (zone->nr_reserved_highatomic >= max_managed)
+		goto out_unlock;
+
+	/* Yoink! */
+	mt = get_pageblock_migratetype(page);
+	/* Only reserve normal pageblocks (i.e., they can merge with others) */
+	if (!migratetype_is_mergeable(mt))
+		goto out_unlock;
+
+	if (order < pageblock_order) {
+		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+			goto out_unlock;
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+	} else {
+		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+		zone->nr_reserved_highatomic += 1 << order;
+	}
+
+out_unlock:
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * Used when an allocation is about to fail under memory pressure. This
+ * potentially hurts the reliability of high-order allocations when under
+ * intense memory pressure but failed atomic allocations should be easier
+ * to recover from than an OOM.
+ *
+ * If @force is true, try to unreserve pageblocks even though highatomic
+ * pageblock is exhausted.
+ */
+static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
+						bool force)
+{
+	struct zonelist *zonelist = ac->zonelist;
+	unsigned long flags;
+	struct zoneref *z;
+	struct zone *zone;
+	struct page *page;
+	int order;
+	int ret;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
+								ac->nodemask) {
+		/*
+		 * Preserve at least one pageblock unless memory pressure
+		 * is really high.
+		 */
+		if (!force && zone->nr_reserved_highatomic <=
+					pageblock_nr_pages)
+			continue;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		for (order = 0; order < NR_PAGE_ORDERS; order++) {
+			struct free_area *area = &(zone->free_area[order]);
+			unsigned long size;
+
+			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+			if (!page)
+				continue;
+
+			/*
+			 * It should never happen but changes to
+			 * locking could inadvertently allow a per-cpu
+			 * drain to add pages to MIGRATE_HIGHATOMIC
+			 * while unreserving so be safe and watch for
+			 * underflows.
+			 */
+			size = max(pageblock_nr_pages, 1UL << order);
+			size = min(size, zone->nr_reserved_highatomic);
+			zone->nr_reserved_highatomic -= size;
+
+			/*
+			 * Convert to ac->migratetype and avoid the normal
+			 * pageblock stealing heuristics. Minimally, the caller
+			 * is doing the work and needs the pages. More
+			 * importantly, if the block was always converted to
+			 * MIGRATE_UNMOVABLE or another type then the number
+			 * of pageblocks that cannot be completely freed
+			 * may increase.
+			 */
+			if (order < pageblock_order)
+				ret = move_freepages_block(zone, page,
+							   MIGRATE_HIGHATOMIC,
+							   ac->migratetype);
+			else {
+				move_to_free_list(page, zone, order,
+						  MIGRATE_HIGHATOMIC,
+						  ac->migratetype);
+				change_pageblock_range(page, order,
+						       ac->migratetype);
+				ret = 1;
+			}
+			/*
+			 * Reserving the block(s) already succeeded,
+			 * so this should not fail on zone boundaries.
+			 */
+			WARN_ON_ONCE(ret == -1);
+			if (ret > 0) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				return ret;
+			}
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return false;
+}
+
 static inline long __zone_watermark_unusable_free(struct zone *z,
 				unsigned int order, unsigned int alloc_flags)
 {
-- 
2.48.1




* Re: [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25  0:08 ` [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy Johannes Weiner
@ 2025-02-25 10:50   ` Vlastimil Babka
  2025-02-25 13:34   ` Brendan Jackman
  1 sibling, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2025-02-25 10:50 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel

On 2/25/25 01:08, Johannes Weiner wrote:
> The fallback code searches for the biggest buddy first in an attempt
> to steal the whole block and encourage type grouping down the line.
> 
> The approach used to be this:
> 
> - Non-movable requests will split the largest buddy and steal the
>   remainder. This splits up contiguity, but it allows subsequent
>   requests of this type to fall back into adjacent space.
> 
> - Movable requests go and look for the smallest buddy instead. The
>   thinking is that movable requests can be compacted, so grouping is
>   less important than retaining contiguity.
> 
> c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
> conversion") enforces freelist type hygiene, which restricts stealing
> to either claiming the whole block or just taking the requested chunk;
> no additional pages or buddy remainders can be stolen any more.
> 
> The patch mishandled when to switch to finding the smallest buddy in
> that new reality. As a result, it may steal the exact request size,
> but from the biggest buddy. This causes fracturing for no good reason.
> 
> Fix this by committing to the new behavior: either steal the whole
> block, or fall back to the smallest buddy.
> 
> Remove single-page stealing from steal_suitable_fallback(). Rename it
> to try_to_steal_block() to make the intentions clear. If this fails,
> always fall back to the smallest buddy.
> 
> The following is from 4 runs of mmtests' thpchallenge. "Pollute" is
> single page fallback, "steal" is conversion of a partially used block.
> The numbers for free block conversions (omitted) are comparable.
> 
> 				     vanilla	      patched
> 
> @pollute[unmovable from reclaimable]:	  27		  106
> @pollute[unmovable from movable]:	  82		   46
> @pollute[reclaimable from unmovable]:	 256		   83
> @pollute[reclaimable from movable]:	  46		    8
> @pollute[movable from unmovable]:	4841		  868
> @pollute[movable from reclaimable]:	5278		12568
> 
> @steal[unmovable from reclaimable]:	  11		   12
> @steal[unmovable from movable]:		 113		   49
> @steal[reclaimable from unmovable]:	  19		   34
> @steal[reclaimable from movable]:	  47		   21
> @steal[movable from unmovable]:		 250		  183
> @steal[movable from reclaimable]:	  81		   93
> 
> The allocator appears to do a better job at keeping stealing and
> polluting to the first fallback preference. As a result, the numbers
> for "from movable" - the least preferred fallback option, and most
> detrimental to compactability - are down across the board.
> 
> Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!




* Re: [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates
  2025-02-25  0:08 ` [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates Johannes Weiner
@ 2025-02-25 11:01   ` Vlastimil Babka
  2025-02-25 13:43   ` Brendan Jackman
  1 sibling, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2025-02-25 11:01 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel

On 2/25/25 01:08, Johannes Weiner wrote:
> The freelist hygiene patches made migratetype accesses fully protected
> under the zone->lock. Remove remnants of handling the race conditions
> that existed before from the MIGRATE_HIGHATOMIC code.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>




* Re: [PATCH 3/3] mm: page_alloc: group fallback functions together
  2025-02-25  0:08 ` [PATCH 3/3] mm: page_alloc: group fallback functions together Johannes Weiner
@ 2025-02-25 11:02   ` Vlastimil Babka
  2025-02-26  8:51     ` Vlastimil Babka
  2025-02-25 13:50   ` Brendan Jackman
  1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-02-25 11:02 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel

On 2/25/25 01:08, Johannes Weiner wrote:
> The way the fallback rules are spread out makes them hard to
> follow. Move the functions next to each other at least.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Vlastimil Babka <vbabka@suse.cz>




* Re: [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25  0:08 ` [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy Johannes Weiner
  2025-02-25 10:50   ` Vlastimil Babka
@ 2025-02-25 13:34   ` Brendan Jackman
  2025-02-25 14:35     ` Vlastimil Babka
  2025-02-25 15:04     ` Johannes Weiner
  1 sibling, 2 replies; 17+ messages in thread
From: Brendan Jackman @ 2025-02-25 13:34 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Mon, Feb 24, 2025 at 07:08:24PM -0500, Johannes Weiner wrote:
> The fallback code searches for the biggest buddy first in an attempt
> to steal the whole block and encourage type grouping down the line.
> 
> The approach used to be this:
> 
> - Non-movable requests will split the largest buddy and steal the
>   remainder. This splits up contiguity, but it allows subsequent
>   requests of this type to fall back into adjacent space.
> 
> - Movable requests go and look for the smallest buddy instead. The
>   thinking is that movable requests can be compacted, so grouping is
>   less important than retaining contiguity.
> 
> c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
> conversion") enforces freelist type hygiene, which restricts stealing
> to either claiming the whole block or just taking the requested chunk;
> no additional pages or buddy remainders can be stolen any more.
> 
> The patch mishandled when to switch to finding the smallest buddy in
> that new reality. As a result, it may steal the exact request size,
> but from the biggest buddy. This causes fracturing for no good reason.
> 
> Fix this by committing to the new behavior: either steal the whole
> block, or fall back to the smallest buddy.
> 
> Remove single-page stealing from steal_suitable_fallback(). Rename it
> to try_to_steal_block() to make the intentions clear. If this fails,
> always fall back to the smallest buddy.

Nit - I think the try_to_steal_block() changes could be a separate
patch, the history might be easier to understand if it went:

[1/N] mm: page_alloc: don't steal single pages from biggest buddy
[2/N] mm: page_alloc: drop unused logic in steal_suitable_fallback()

(But not a big deal, it's not that hard to follow as-is).

>  static __always_inline struct page *
>  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> @@ -2291,45 +2289,35 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>  		if (fallback_mt == -1)
>  			continue;
>  
> -		/*
> -		 * We cannot steal all free pages from the pageblock and the
> -		 * requested migratetype is movable. In that case it's better to
> -		 * steal and split the smallest available page instead of the
> -		 * largest available page, because even if the next movable
> -		 * allocation falls back into a different pageblock than this
> -		 * one, it won't cause permanent fragmentation.
> -		 */
> -		if (!can_steal && start_migratetype == MIGRATE_MOVABLE
> -					&& current_order > order)
> -			goto find_smallest;
> +		if (!can_steal)
> +			break;
>  
> -		goto do_steal;
> +		page = get_page_from_free_area(area, fallback_mt);
> +		page = try_to_steal_block(zone, page, current_order, order,
> +					  start_migratetype, alloc_flags);
> +		if (page)
> +			goto got_one;
>  	}
>  
> -	return NULL;
> +	if (alloc_flags & ALLOC_NOFRAGMENT)
> +		return NULL;

Is this a separate change? Is it a bug that we currently allow
stealing from a fallback type when ALLOC_NOFRAGMENT? (I wonder if
the second loop was supposed to start from min_order).

>  
> -find_smallest:
> +	/* No luck stealing blocks. Find the smallest fallback page */
>  	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
>  		area = &(zone->free_area[current_order]);
>  		fallback_mt = find_suitable_fallback(area, current_order,
>  				start_migratetype, false, &can_steal);
> -		if (fallback_mt != -1)
> -			break;
> -	}
> -
> -	/*
> -	 * This should not happen - we already found a suitable fallback
> -	 * when looking for the largest page.
> -	 */
> -	VM_BUG_ON(current_order > MAX_PAGE_ORDER);
> +		if (fallback_mt == -1)
> +			continue;
>  
> -do_steal:
> -	page = get_page_from_free_area(area, fallback_mt);
> +		page = get_page_from_free_area(area, fallback_mt);
> +		page_del_and_expand(zone, page, order, current_order, fallback_mt);
> +		goto got_one;
> +	}
>  
> -	/* take off list, maybe claim block, expand remainder */
> -	page = steal_suitable_fallback(zone, page, current_order, order,
> -				       start_migratetype, alloc_flags, can_steal);
> +	return NULL;
>  
> +got_one:
>  	trace_mm_page_alloc_extfrag(page, order, current_order,
>  		start_migratetype, fallback_mt);



* Re: [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates
  2025-02-25  0:08 ` [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates Johannes Weiner
  2025-02-25 11:01   ` Vlastimil Babka
@ 2025-02-25 13:43   ` Brendan Jackman
  2025-02-25 15:09     ` Johannes Weiner
  1 sibling, 1 reply; 17+ messages in thread
From: Brendan Jackman @ 2025-02-25 13:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Mon, Feb 24, 2025 at 07:08:25PM -0500, Johannes Weiner wrote:
> The freelist hygiene patches made migratetype accesses fully protected
> under the zone->lock. Remove remnants of handling the race conditions
> that existed before from the MIGRATE_HIGHATOMIC code.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Aside from my WARN bikeshedding, which isn't really about this patch
anyway:

Reviewed-by: Brendan Jackman <jackmanb@google.com>

> -			if (is_migrate_highatomic(mt)) {
> -				unsigned long size;
> -				/*
> -				 * It should never happen but changes to
> -				 * locking could inadvertently allow a per-cpu
> -				 * drain to add pages to MIGRATE_HIGHATOMIC
> -				 * while unreserving so be safe and watch for
> -				 * underflows.
> -				 */
> -				size = max(pageblock_nr_pages, 1UL << order);
> -				size = min(size, zone->nr_reserved_highatomic);
> -				zone->nr_reserved_highatomic -= size;
> -			}
> +			size = max(pageblock_nr_pages, 1UL << order);
> +			size = min(size, zone->nr_reserved_highatomic);
> +			zone->nr_reserved_highatomic -= size;

Now that the locking is a bit cleaner, would it make sense to add a
[VM_]WARN_ON[_ONCE] for underflow?
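
E.g. something like this (untested, just a sketch of the idea):

    size = max(pageblock_nr_pages, 1UL << order);
    /* the accounting should never go negative now */
    VM_WARN_ON_ONCE(size > zone->nr_reserved_highatomic);
    size = min(size, zone->nr_reserved_highatomic);
    zone->nr_reserved_highatomic -= size;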




* Re: [PATCH 3/3] mm: page_alloc: group fallback functions together
  2025-02-25  0:08 ` [PATCH 3/3] mm: page_alloc: group fallback functions together Johannes Weiner
  2025-02-25 11:02   ` Vlastimil Babka
@ 2025-02-25 13:50   ` Brendan Jackman
  2025-02-25 15:14     ` Johannes Weiner
  1 sibling, 1 reply; 17+ messages in thread
From: Brendan Jackman @ 2025-02-25 13:50 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel


On Mon, Feb 24, 2025 at 07:08:26PM -0500, Johannes Weiner wrote:
> The way the fallback rules are spread out makes them hard to
> follow. Move the functions next to each other at least.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Moving code makes blame more tiresome so I wouldn't personally do
this, but if others find it helpful it's fine by me.

Reviewed-by: Brendan Jackman <jackmanb@google.com>

("Reviewed" means "git tells me there are no new lines nor totally
deleted lines").



* Re: [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25 13:34   ` Brendan Jackman
@ 2025-02-25 14:35     ` Vlastimil Babka
  2025-02-25 14:40       ` Brendan Jackman
  2025-02-25 15:04     ` Johannes Weiner
  1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-02-25 14:35 UTC (permalink / raw)
  To: Brendan Jackman, Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel

On 2/25/25 14:34, Brendan Jackman wrote:
> On Mon, Feb 24, 2025 at 07:08:24PM -0500, Johannes Weiner wrote:
>> The fallback code searches for the biggest buddy first in an attempt
>> to steal the whole block and encourage type grouping down the line.
>> 
>> The approach used to be this:
>> 
>> - Non-movable requests will split the largest buddy and steal the
>>   remainder. This splits up contiguity, but it allows subsequent
>>   requests of this type to fall back into adjacent space.
>> 
>> - Movable requests go and look for the smallest buddy instead. The
>>   thinking is that movable requests can be compacted, so grouping is
>>   less important than retaining contiguity.
>> 
>> c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
>> conversion") enforces freelist type hygiene, which restricts stealing
>> to either claiming the whole block or just taking the requested chunk;
>> no additional pages or buddy remainders can be stolen any more.
>> 
>> The patch mishandled when to switch to finding the smallest buddy in
>> that new reality. As a result, it may steal the exact request size,
>> but from the biggest buddy. This causes fracturing for no good reason.
>> 
>> Fix this by committing to the new behavior: either steal the whole
>> block, or fall back to the smallest buddy.
>> 
>> Remove single-page stealing from steal_suitable_fallback(). Rename it
>> to try_to_steal_block() to make the intentions clear. If this fails,
>> always fall back to the smallest buddy.
> 
> Nit - I think the try_to_steal_block() changes could be a separate
> patch, the history might be easier to understand if it went:
> 
> [1/N] mm: page_alloc: don't steal single pages from biggest buddy
> [2/N] mm: page_alloc: drop unused logic in steal_suitable_fallback()
> 
> (But not a big deal, it's not that hard to follow as-is).
> 
>>  static __always_inline struct page *
>>  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>> @@ -2291,45 +2289,35 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>>  		if (fallback_mt == -1)
>>  			continue;
>>  
>> -		/*
>> -		 * We cannot steal all free pages from the pageblock and the
>> -		 * requested migratetype is movable. In that case it's better to
>> -		 * steal and split the smallest available page instead of the
>> -		 * largest available page, because even if the next movable
>> -		 * allocation falls back into a different pageblock than this
>> -		 * one, it won't cause permanent fragmentation.
>> -		 */
>> -		if (!can_steal && start_migratetype == MIGRATE_MOVABLE
>> -					&& current_order > order)
>> -			goto find_smallest;
>> +		if (!can_steal)
>> +			break;
>>  
>> -		goto do_steal;
>> +		page = get_page_from_free_area(area, fallback_mt);
>> +		page = try_to_steal_block(zone, page, current_order, order,
>> +					  start_migratetype, alloc_flags);
>> +		if (page)
>> +			goto got_one;
>>  	}
>>  
>> -	return NULL;
>> +	if (alloc_flags & ALLOC_NOFRAGMENT)
>> +		return NULL;
> 
> Is this a separate change? Is it a bug that we currently allow
> stealing from a fallback type when ALLOC_NOFRAGMENT? (I wonder if
> the second loop was supposed to start from min_order).

It's subtle but not a new condition. Previously ALLOC_NOFRAGMENT would
never take the "goto find_smallest" path, because it means searching
>= pageblock_order only, and any fallback found at those orders always
has can_steal == true. Failure to find a fallback would reach an
unconditional return NULL here. Now we fall through to the search below
(instead of the goto), but ALLOC_NOFRAGMENT must not do that, so it's
made explicit here.
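
Roughly, the old flow was (simplified, and partly from memory since
the loop head isn't in the diff context):

    if (alloc_flags & ALLOC_NOFRAGMENT)
            min_order = pageblock_order;    /* search whole blocks only */

    for (current_order = MAX_PAGE_ORDER; current_order >= min_order; current_order--) {
            /* find_suitable_fallback(); continue if nothing found */

            /*
             * With min_order == pageblock_order, anything found here is
             * a whole block, so can_steal is true and we always take the
             * do_steal path below; find_smallest (and with it single-page
             * stealing) was unreachable for NOFRAGMENT.
             */
            if (!can_steal && start_migratetype == MIGRATE_MOVABLE
                                    && current_order > order)
                    goto find_smallest;

            goto do_steal;
    }

    return NULL;    /* NOFRAGMENT with no free blocks ended up here */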

>>  
>> -find_smallest:
>> +	/* No luck stealing blocks. Find the smallest fallback page */
>>  	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
>>  		area = &(zone->free_area[current_order]);
>>  		fallback_mt = find_suitable_fallback(area, current_order,
>>  				start_migratetype, false, &can_steal);
>> -		if (fallback_mt != -1)
>> -			break;
>> -	}
>> -
>> -	/*
>> -	 * This should not happen - we already found a suitable fallback
>> -	 * when looking for the largest page.
>> -	 */
>> -	VM_BUG_ON(current_order > MAX_PAGE_ORDER);
>> +		if (fallback_mt == -1)
>> +			continue;
>>  
>> -do_steal:
>> -	page = get_page_from_free_area(area, fallback_mt);
>> +		page = get_page_from_free_area(area, fallback_mt);
>> +		page_del_and_expand(zone, page, order, current_order, fallback_mt);
>> +		goto got_one;
>> +	}
>>  
>> -	/* take off list, maybe claim block, expand remainder */
>> -	page = steal_suitable_fallback(zone, page, current_order, order,
>> -				       start_migratetype, alloc_flags, can_steal);
>> +	return NULL;
>>  
>> +got_one:
>>  	trace_mm_page_alloc_extfrag(page, order, current_order,
>>  		start_migratetype, fallback_mt);




* Re: [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25 14:35     ` Vlastimil Babka
@ 2025-02-25 14:40       ` Brendan Jackman
  0 siblings, 0 replies; 17+ messages in thread
From: Brendan Jackman @ 2025-02-25 14:40 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Tue, Feb 25, 2025 at 03:35:25PM +0100, Vlastimil Babka wrote:
> >> -	return NULL;
> >> +	if (alloc_flags & ALLOC_NOFRAGMENT)
> >> +		return NULL;
> > 
> > Is this a separate change? Is it a bug that we currently allow
> > stealing from a fallback type when ALLOC_NOFRAGMENT? (I wonder if
> > the second loop was supposed to start from min_order).
> 
> It's subtle but not a new condition. Previously ALLOC_NOFRAGMENT would
> never take the "goto find_smallest" path, because it means searching
> >= pageblock_order only, and any fallback found at those orders always
> has can_steal == true. Failure to find a fallback would reach an
> unconditional return NULL here. Now we fall through to the search below
> (instead of the goto), but ALLOC_NOFRAGMENT must not do that, so it's
> made explicit here.

Ahhhh yes, thank you for the help. The new explicit code is much
better.

Reviewed-by: Brendan Jackman <jackmanb@google.com>



* Re: [PATCH 1/3] mm: page_alloc: don't steal single pages from biggest buddy
  2025-02-25 13:34   ` Brendan Jackman
  2025-02-25 14:35     ` Vlastimil Babka
@ 2025-02-25 15:04     ` Johannes Weiner
  1 sibling, 0 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25 15:04 UTC (permalink / raw)
  To: Brendan Jackman; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Tue, Feb 25, 2025 at 01:34:32PM +0000, Brendan Jackman wrote:
> On Mon, Feb 24, 2025 at 07:08:24PM -0500, Johannes Weiner wrote:
> > The fallback code searches for the biggest buddy first in an attempt
> > to steal the whole block and encourage type grouping down the line.
> > 
> > The approach used to be this:
> > 
> > - Non-movable requests will split the largest buddy and steal the
> >   remainder. This splits up contiguity, but it allows subsequent
> >   requests of this type to fall back into adjacent space.
> > 
> > - Movable requests go and look for the smallest buddy instead. The
> >   thinking is that movable requests can be compacted, so grouping is
> >   less important than retaining contiguity.
> > 
> > c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
> > conversion") enforces freelist type hygiene, which restricts stealing
> > to either claiming the whole block or just taking the requested chunk;
> > no additional pages or buddy remainders can be stolen any more.
> > 
> > The patch mishandled when to switch to finding the smallest buddy in
> > that new reality. As a result, it may steal the exact request size,
> > but from the biggest buddy. This causes fracturing for no good reason.
> > 
> > Fix this by committing to the new behavior: either steal the whole
> > block, or fall back to the smallest buddy.
> > 
> > Remove single-page stealing from steal_suitable_fallback(). Rename it
> > to try_to_steal_block() to make the intentions clear. If this fails,
> > always fall back to the smallest buddy.
> 
> Nit - I think the try_to_steal_block() changes could be a separate
> patch, the history might be easier to understand if it went:
> 
> [1/N] mm: page_alloc: don't steal single pages from biggest buddy
> [2/N] mm: page_alloc: drop unused logic in steal_suitable_fallback()

There are several ways in which steal_suitable_fallback() could end up
taking a single page, and I'd have to mirror all those conditions in
the caller if I wanted to prevent this. That would be too convoluted.
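
For reference, the old single_page exits were roughly these
(paraphrased from the code removed in the patch):

    if (is_migrate_highatomic(block_type))
            goto single_page;       /* avoid broken highatomic accounting */

    if (!whole_block)
            goto single_page;       /* not allowed to claim the whole block */

    if (!prep_move_freepages_block(zone, page, &start_pfn,
                                   &free_pages, &movable_pages))
            goto single_page;       /* zone boundary conditions */

    /* plus the fall-through when too few pages in the block are compatible */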

> >  static __always_inline struct page *
> >  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> > @@ -2291,45 +2289,35 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> >  		if (fallback_mt == -1)
> >  			continue;
> >  
> > -		/*
> > -		 * We cannot steal all free pages from the pageblock and the
> > -		 * requested migratetype is movable. In that case it's better to
> > -		 * steal and split the smallest available page instead of the
> > -		 * largest available page, because even if the next movable
> > -		 * allocation falls back into a different pageblock than this
> > -		 * one, it won't cause permanent fragmentation.
> > -		 */
> > -		if (!can_steal && start_migratetype == MIGRATE_MOVABLE
> > -					&& current_order > order)
> > -			goto find_smallest;
> > +		if (!can_steal)
> > +			break;
> >  
> > -		goto do_steal;
> > +		page = get_page_from_free_area(area, fallback_mt);
> > +		page = try_to_steal_block(zone, page, current_order, order,
> > +					  start_migratetype, alloc_flags);
> > +		if (page)
> > +			goto got_one;
> >  	}
> >  
> > -	return NULL;
> > +	if (alloc_flags & ALLOC_NOFRAGMENT)
> > +		return NULL;
> 
> Is this a separate change? Is it a bug that we currently allow
> stealing from a fallback type when ALLOC_NOFRAGMENT? (I wonder if
> the second loop was supposed to start from min_order).

No, I don't see how we could hit that right now. With NOFRAGMENT, the
first loop scans whole free blocks only, which, if present, are always
stealable. If there are no blocks, the loop continues through all the
fallback_mt == -1 and then the function returns NULL. Only without
NOFRAGMENT does it run into !can_steal buddies.

IOW, the control flow implicit in min_order, can_steal and the gotos
would make it honor NOFRAGMENT - albeit in a fairly non-obvious way.

The code is just a bit odd. While the function currently looks like
it's two loops following each other, this isn't how it's actually
executed. Instead, the first loop is the main sequence of the
function. The second loop is entered only from a jump in the main loop
under certain conditions, more akin to a function call.

I'm changing the sequence so that all types fall back to the smallest
buddy if stealing a block fails. The easiest way to express that is
removing the find_smallest jump and having the loops *actually* follow
each other as the main sequence of this function.

For that, I need to make that implicit NOFRAGMENT behavior explicit.



* Re: [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates
  2025-02-25 13:43   ` Brendan Jackman
@ 2025-02-25 15:09     ` Johannes Weiner
  2025-02-25 15:19       ` Brendan Jackman
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25 15:09 UTC (permalink / raw)
  To: Brendan Jackman; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Tue, Feb 25, 2025 at 01:43:35PM +0000, Brendan Jackman wrote:
> On Mon, Feb 24, 2025 at 07:08:25PM -0500, Johannes Weiner wrote:
> > The freelist hygiene patches made migratetype accesses fully protected
> > under the zone->lock. Remove remnants of handling the race conditions
> > that existed before from the MIGRATE_HIGHATOMIC code.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Aside from my WARN bikeshedding, which isn't really about this patch
> anyway:
> 
> Reviewed-by: Brendan Jackman <jackmanb@google.com>

Thanks

> > -			if (is_migrate_highatomic(mt)) {
> > -				unsigned long size;
> > -				/*
> > -				 * It should never happen but changes to
> > -				 * locking could inadvertently allow a per-cpu
> > -				 * drain to add pages to MIGRATE_HIGHATOMIC
> > -				 * while unreserving so be safe and watch for
> > -				 * underflows.
> > -				 */
> > -				size = max(pageblock_nr_pages, 1UL << order);
> > -				size = min(size, zone->nr_reserved_highatomic);
> > -				zone->nr_reserved_highatomic -= size;
> > -			}
> > +			size = max(pageblock_nr_pages, 1UL << order);
> > +			size = min(size, zone->nr_reserved_highatomic);
> > +			zone->nr_reserved_highatomic -= size;
> 
> Now that the locking is a bit cleaner, would it make sense to add a
> [VM_]WARN_ON[_ONCE] for underflow?

Yeah I think that would be a nice additional cleanup. Do you want to
send a patch? Otherwise, I can.



* Re: [PATCH 3/3] mm: page_alloc: group fallback functions together
  2025-02-25 13:50   ` Brendan Jackman
@ 2025-02-25 15:14     ` Johannes Weiner
  0 siblings, 0 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-02-25 15:14 UTC (permalink / raw)
  To: Brendan Jackman; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Tue, Feb 25, 2025 at 01:50:09PM +0000, Brendan Jackman wrote:
> 
> On Mon, Feb 24, 2025 at 07:08:26PM -0500, Johannes Weiner wrote:
> > The way the fallback rules are spread out makes them hard to
> > follow. Move the functions next to each other at least.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Moving code makes blame more tiresome so I wouldn't personally do
> this, but if others find it helpful it's fine by me.

I feel you, having had to get good at git blame, git show, swear, git
blame foo^, repeat myself.

Blame has definitely become quite important to understanding code. But
I'd argue it's slightly less important than having understandable code ;)

> Reviewed-by: Brendan Jackman <jackmanb@google.com>
> 
> ("Reviewed" means "git tells me there are no new lines nor totally
> deleted lines").

Thanks Brendan



* Re: [PATCH 2/3] mm: page_alloc: remove remnants of unlocked migratetype updates
  2025-02-25 15:09     ` Johannes Weiner
@ 2025-02-25 15:19       ` Brendan Jackman
  0 siblings, 0 replies; 17+ messages in thread
From: Brendan Jackman @ 2025-02-25 15:19 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Vlastimil Babka, linux-mm, linux-kernel

On Tue, 25 Feb 2025 at 16:09, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Now that the locking is a bit cleaner, would it make sense to add a
> > [VM_]WARN_ON[_ONCE] for underflow?
>
> Yeah I think that would be a nice additional cleanup. Do you want to
> send a patch? Otherwise, I can.

Yep I'll kick off some tests to check it doesn't fire and send it once
that's done.



* Re: [PATCH 3/3] mm: page_alloc: group fallback functions together
  2025-02-25 11:02   ` Vlastimil Babka
@ 2025-02-26  8:51     ` Vlastimil Babka
  0 siblings, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2025-02-26  8:51 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton; +Cc: Brendan Jackman, linux-mm, linux-kernel



On 2/25/25 12:02 PM, Vlastimil Babka wrote:
> On 2/25/25 01:08, Johannes Weiner wrote:
>> The way the fallback rules are spread out makes them hard to
>> follow. Move the functions next to each other at least.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Vlastimil Babka <vbabka@suse.cz>

Sorry, meant:

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>



