* [PATCH v3 0/2] Leave IRQs enabled for per-cpu page allocations
@ 2022-11-18 10:17 Mel Gorman
2022-11-18 10:17 ` [PATCH 1/2] mm/page_alloc: Always remove pages from temporary list Mel Gorman
2022-11-18 10:17 ` [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations Mel Gorman
0 siblings, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2022-11-18 10:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, Yu Zhao, Vlastimil Babka, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM, Mel Gorman
Changelog since V2
o Fix list corruption issue (hughd)
o Tighten up locking protocol (hughd)
o Micro-optimisations (hughd)
Changelog since V1
o Use trylock in free_unref_page_list due to IO completion from
softirq context (yuzhao)
This was a standalone patch but is now split in two. The main change since
v2 is fixing the issues exposed by Hugh's shmfs stress test.
Patch 1 fixes a long-standing issue that is technically a bug but happened
to work because of how the list was used.
Patch 2 leaves IRQs enabled for most PCP allocations, with more details
in the changelog of the patch itself.
--
2.35.3
* [PATCH 1/2] mm/page_alloc: Always remove pages from temporary list
2022-11-18 10:17 [PATCH v3 0/2] Leave IRQs enabled for per-cpu page allocations Mel Gorman
@ 2022-11-18 10:17 ` Mel Gorman
2022-11-18 13:24 ` Vlastimil Babka
2022-11-18 10:17 ` [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations Mel Gorman
1 sibling, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2022-11-18 10:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, Yu Zhao, Vlastimil Babka, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM, Mel Gorman
free_unref_page_list() has neglected to properly remove pages from the
temporary list of pages to free since forever. It works by coincidence
because list_add happened to do the right thing when pages were only ever
added to the PCP lists. However, a later patch adds pages to either the
PCP list or the zone list but only properly deletes the page from the
temporary list in one path, leading to list corruption and a subsequent
failure. As a preparation patch, always delete the pages from one list
properly before adding them to another. On its own, this fixes nothing
and adds a fractional amount of overhead, but it is critical to the next
patch.
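As an illustrative aside rather than part of the patch, the pattern this
enforces is the standard one for moving a node between lists with the
<linux/list.h> helpers; the function below is a hypothetical sketch of the
intent, not code from mm/page_alloc.c:

	/*
	 * A node that is still linked on one list must be unlinked before
	 * it is added to another, otherwise the first list is left pointing
	 * at a node whose lru pointers now belong to the second list.
	 */
	static void move_page_to_list(struct page *page, struct list_head *dst)
	{
		list_del(&page->lru);		/* unlink from the temporary list */
		list_add(&page->lru, dst);	/* then link onto the destination */
	}

	/* list_move(&page->lru, dst) is the equivalent single call. */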
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 218b28ee49ed..1ec54173b8d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3546,6 +3546,8 @@ void free_unref_page_list(struct list_head *list)
list_for_each_entry_safe(page, next, list, lru) {
struct zone *zone = page_zone(page);
+ list_del(&page->lru);
+
/* Different zone, different pcp lock. */
if (zone != locked_zone) {
if (pcp)
--
2.35.3
* [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-18 10:17 [PATCH v3 0/2] Leave IRQs enabled for per-cpu page allocations Mel Gorman
2022-11-18 10:17 ` [PATCH 1/2] mm/page_alloc: Always remove pages from temporary list Mel Gorman
@ 2022-11-18 10:17 ` Mel Gorman
2022-11-18 14:30 ` Vlastimil Babka
1 sibling, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2022-11-18 10:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, Yu Zhao, Vlastimil Babka, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM, Mel Gorman
The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired
using pcp_spin_trylock_irqsave() even though IRQs are disabled and
re-entrancy is impossible.
Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
common case at the cost of some IRQ allocations taking a slower path. If
the PCP lists need to be refilled, the zone lock still needs to disable
IRQs, but that will only happen on PCP refill and drain. If an IRQ is
raised while a PCP allocation is in progress, the trylock will fail and
the allocation will fall back to using the buddy lists directly. Note
that this may not be a universal win if an interrupt-intensive workload
also allocates heavily from interrupt context and contends heavily on
the zone->lock as a result.
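As a rough sketch of the resulting protocol on the free side (condensed
from the hunks below with the drain and free paths folded together; this
is not literal kernel code):

	/* Before: the PCP lock disabled IRQs for the whole operation. */
	spin_lock_irqsave(&pcp->lock, flags);
	free_pcppages_bulk(zone, count, pcp, 0);	/* relied on IRQs being off */
	spin_unlock_irqrestore(&pcp->lock, flags);

	/* After: the PCP lock is a plain spinlock and IRQs stay enabled. */
	if (spin_trylock(&pcp->lock)) {
		/* zone->lock disables IRQs itself on refill and drain */
		free_pcppages_bulk(zone, count, pcp, 0);
		spin_unlock(&pcp->lock);
	} else {
		/* Re-entry from IRQ context or a parallel drain: bypass the
		 * PCP and free to the buddy lists directly. */
		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
	}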
[yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
[hughd@google.com: Fix list corruption, lock improvements, micro-optimisations]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 122 +++++++++++++++++++++---------------------------
1 file changed, 53 insertions(+), 69 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ec54173b8d4..323fec05c4c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -170,21 +170,12 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
_ret; \
})
-#define pcpu_spin_lock_irqsave(type, member, ptr, flags) \
+#define pcpu_spin_trylock(type, member, ptr) \
({ \
type *_ret; \
pcpu_task_pin(); \
_ret = this_cpu_ptr(ptr); \
- spin_lock_irqsave(&_ret->member, flags); \
- _ret; \
-})
-
-#define pcpu_spin_trylock_irqsave(type, member, ptr, flags) \
-({ \
- type *_ret; \
- pcpu_task_pin(); \
- _ret = this_cpu_ptr(ptr); \
- if (!spin_trylock_irqsave(&_ret->member, flags)) { \
+ if (!spin_trylock(&_ret->member)) { \
pcpu_task_unpin(); \
_ret = NULL; \
} \
@@ -197,27 +188,16 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
pcpu_task_unpin(); \
})
-#define pcpu_spin_unlock_irqrestore(member, ptr, flags) \
-({ \
- spin_unlock_irqrestore(&ptr->member, flags); \
- pcpu_task_unpin(); \
-})
-
/* struct per_cpu_pages specific helpers. */
#define pcp_spin_lock(ptr) \
pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
-#define pcp_spin_lock_irqsave(ptr, flags) \
- pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
-
-#define pcp_spin_trylock_irqsave(ptr, flags) \
- pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+#define pcp_spin_trylock(ptr) \
+ pcpu_spin_trylock(struct per_cpu_pages, lock, ptr)
#define pcp_spin_unlock(ptr) \
pcpu_spin_unlock(lock, ptr)
-#define pcp_spin_unlock_irqrestore(ptr, flags) \
- pcpu_spin_unlock_irqrestore(lock, ptr, flags)
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1547,6 +1527,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp,
int pindex)
{
+ unsigned long flags;
int min_pindex = 0;
int max_pindex = NR_PCP_LISTS - 1;
unsigned int order;
@@ -1562,8 +1543,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Ensure requested pindex is drained first. */
pindex = pindex - 1;
- /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
- spin_lock(&zone->lock);
+ spin_lock_irqsave(&zone->lock, flags);
isolated_pageblocks = has_isolate_pageblock(zone);
while (count > 0) {
@@ -1611,7 +1591,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
} while (count > 0 && !list_empty(list));
}
- spin_unlock(&zone->lock);
+ spin_unlock_irqrestore(&zone->lock, flags);
}
static void free_one_page(struct zone *zone,
@@ -3125,10 +3105,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, unsigned int alloc_flags)
{
+ unsigned long flags;
int i, allocated = 0;
- /* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
- spin_lock(&zone->lock);
+ spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
alloc_flags);
@@ -3162,7 +3142,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages added to the pcp list.
*/
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
- spin_unlock(&zone->lock);
+ spin_unlock_irqrestore(&zone->lock, flags);
return allocated;
}
@@ -3179,16 +3159,9 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
- unsigned long flags;
-
- /*
- * free_pcppages_bulk expects IRQs disabled for zone->lock
- * so even though pcp->lock is not intended to be IRQ-safe,
- * it's needed in this context.
- */
- spin_lock_irqsave(&pcp->lock, flags);
+ spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock_irqrestore(&pcp->lock, flags);
+ spin_unlock(&pcp->lock);
}
}
#endif
@@ -3202,12 +3175,9 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count) {
- unsigned long flags;
-
- /* See drain_zone_pages on why this is disabling IRQs */
- spin_lock_irqsave(&pcp->lock, flags);
+ spin_lock(&pcp->lock);
free_pcppages_bulk(zone, pcp->count, pcp, 0);
- spin_unlock_irqrestore(&pcp->lock, flags);
+ spin_unlock(&pcp->lock);
}
}
@@ -3473,7 +3443,6 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
*/
void free_unref_page(struct page *page, unsigned int order)
{
- unsigned long flags;
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
struct zone *zone;
@@ -3501,10 +3470,10 @@ void free_unref_page(struct page *page, unsigned int order)
zone = page_zone(page);
pcp_trylock_prepare(UP_flags);
- pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
free_unref_page_commit(zone, pcp, page, migratetype, order);
- pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_spin_unlock(pcp);
} else {
free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
}
@@ -3516,10 +3485,10 @@ void free_unref_page(struct page *page, unsigned int order)
*/
void free_unref_page_list(struct list_head *list)
{
+ unsigned long __maybe_unused UP_flags;
struct page *page, *next;
struct per_cpu_pages *pcp = NULL;
struct zone *locked_zone = NULL;
- unsigned long flags;
int batch_count = 0;
int migratetype;
@@ -3550,11 +3519,26 @@ void free_unref_page_list(struct list_head *list)
/* Different zone, different pcp lock. */
if (zone != locked_zone) {
- if (pcp)
- pcp_spin_unlock_irqrestore(pcp, flags);
+ if (pcp) {
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(UP_flags);
+ }
+ /*
+ * trylock is necessary as pages may be getting freed
+ * from IRQ or SoftIRQ context after an IO completion.
+ */
+ pcp_trylock_prepare(UP_flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+ if (!pcp) {
+ pcp_trylock_finish(UP_flags);
+ free_one_page(zone, page, page_to_pfn(page),
+ 0, migratetype, FPI_NONE);
+ locked_zone = NULL;
+ continue;
+ }
locked_zone = zone;
- pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+ batch_count = 0;
}
/*
@@ -3569,18 +3553,23 @@ void free_unref_page_list(struct list_head *list)
free_unref_page_commit(zone, pcp, page, migratetype, 0);
/*
- * Guard against excessive IRQ disabled times when we get
- * a large list of pages to free.
+ * Guard against excessive lock hold times when freeing
+ * a large list of pages. Lock will be reacquired if
+ * necessary on the next iteration.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(UP_flags);
batch_count = 0;
- pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
+ pcp = NULL;
+ locked_zone = NULL;
}
}
- if (pcp)
- pcp_spin_unlock_irqrestore(pcp, flags);
+ if (pcp) {
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(UP_flags);
+ }
}
/*
@@ -3781,15 +3770,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct per_cpu_pages *pcp;
struct list_head *list;
struct page *page;
- unsigned long flags;
unsigned long __maybe_unused UP_flags;
- /*
- * spin_trylock may fail due to a parallel drain. In the future, the
- * trylock will also protect against IRQ reentrancy.
- */
+ /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
- pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
return NULL;
@@ -3803,7 +3788,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
- pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -5371,7 +5356,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct page **page_array)
{
struct page *page;
- unsigned long flags;
unsigned long __maybe_unused UP_flags;
struct zone *zone;
struct zoneref *z;
@@ -5453,9 +5437,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
- /* Is a parallel drain in progress? */
+ /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
- pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp)
goto failed_irq;
@@ -5474,7 +5458,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (unlikely(!page)) {
/* Try and allocate at least one page */
if (!nr_account) {
- pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_spin_unlock(pcp);
goto failed_irq;
}
break;
@@ -5489,7 +5473,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}
- pcp_spin_unlock_irqrestore(pcp, flags);
+ pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
--
2.35.3
* Re: [PATCH 1/2] mm/page_alloc: Always remove pages from temporary list
2022-11-18 10:17 ` [PATCH 1/2] mm/page_alloc: Always remove pages from temporary list Mel Gorman
@ 2022-11-18 13:24 ` Vlastimil Babka
0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2022-11-18 13:24 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Hugh Dickins, Yu Zhao, Marcelo Tosatti, Michal Hocko,
Marek Szyprowski, LKML, Linux-MM
On 11/18/22 11:17, Mel Gorman wrote:
> free_unref_page_list() has neglected to properly remove pages from the
> temporary list of pages to free since forever. It works by coincidence
> because list_add happened to do the right thing when pages were only ever
> added to the PCP lists. However, a later patch adds pages to either the
> PCP list or the zone list but only properly deletes the page from the
> temporary list in one path, leading to list corruption and a subsequent
> failure. As a preparation patch, always delete the pages from one list
> properly before adding them to another. On its own, this fixes nothing
> and adds a fractional amount of overhead, but it is critical to the next
> patch.
>
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/page_alloc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 218b28ee49ed..1ec54173b8d4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3546,6 +3546,8 @@ void free_unref_page_list(struct list_head *list)
> list_for_each_entry_safe(page, next, list, lru) {
> struct zone *zone = page_zone(page);
>
> + list_del(&page->lru);
> +
> /* Different zone, different pcp lock. */
> if (zone != locked_zone) {
> if (pcp)
* Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-18 10:17 ` [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations Mel Gorman
@ 2022-11-18 14:30 ` Vlastimil Babka
2022-11-21 12:01 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2022-11-18 14:30 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: Hugh Dickins, Yu Zhao, Marcelo Tosatti, Michal Hocko,
Marek Szyprowski, LKML, Linux-MM
On 11/18/22 11:17, Mel Gorman wrote:
> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> allocating from the PCP must not re-enter the allocator from IRQ context.
> In each instance where IRQ-reentrancy is possible, the lock is acquired
> using pcp_spin_trylock_irqsave() even though IRQs are disabled and
> re-entrancy is impossible.
>
> Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
> common case at the cost of some IRQ allocations taking a slower path. If
> the PCP lists need to be refilled, the zone lock still needs to disable
> IRQs, but that will only happen on PCP refill and drain. If an IRQ is
> raised while a PCP allocation is in progress, the trylock will fail and
> the allocation will fall back to using the buddy lists directly. Note
> that this may not be a universal win if an interrupt-intensive workload
> also allocates heavily from interrupt context and contends heavily on
> the zone->lock as a result.
>
> [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> [hughd@google.com: Fix list corruption, lock improvements, micro-optimisations]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits below:
> @@ -3516,10 +3485,10 @@ void free_unref_page(struct page *page, unsigned int order)
> */
> void free_unref_page_list(struct list_head *list)
> {
> + unsigned long __maybe_unused UP_flags;
> struct page *page, *next;
> struct per_cpu_pages *pcp = NULL;
> struct zone *locked_zone = NULL;
> - unsigned long flags;
> int batch_count = 0;
> int migratetype;
>
> @@ -3550,11 +3519,26 @@ void free_unref_page_list(struct list_head *list)
>
> /* Different zone, different pcp lock. */
> if (zone != locked_zone) {
> - if (pcp)
> - pcp_spin_unlock_irqrestore(pcp, flags);
> + if (pcp) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(UP_flags);
> + }
>
> + /*
> + * trylock is necessary as pages may be getting freed
> + * from IRQ or SoftIRQ context after an IO completion.
> + */
> + pcp_trylock_prepare(UP_flags);
> + pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> + if (!pcp) {
Perhaps use unlikely() here?
> + pcp_trylock_finish(UP_flags);
> + free_one_page(zone, page, page_to_pfn(page),
> + 0, migratetype, FPI_NONE);
Not critical for correctness, but the migratetype here might be stale and we
should do get_pcppage_migratetype(page);
> + locked_zone = NULL;
> + continue;
> + }
> locked_zone = zone;
> - pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> + batch_count = 0;
> }
>
> /*
> @@ -3569,18 +3553,23 @@ void free_unref_page_list(struct list_head *list)
> free_unref_page_commit(zone, pcp, page, migratetype, 0);
>
> /*
> - * Guard against excessive IRQ disabled times when we get
> - * a large list of pages to free.
> + * Guard against excessive lock hold times when freeing
> + * a large list of pages. Lock will be reacquired if
> + * necessary on the next iteration.
> */
> if (++batch_count == SWAP_CLUSTER_MAX) {
> - pcp_spin_unlock_irqrestore(pcp, flags);
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(UP_flags);
> batch_count = 0;
> - pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> + pcp = NULL;
> + locked_zone = NULL;
AFAICS if this block was just "locked_zone = NULL;" then the existing code
would do the right thing.
Or maybe to have simpler code, just do batch_count++ here and
make the relocking check do
if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX)
> }
> }
>
> - if (pcp)
> - pcp_spin_unlock_irqrestore(pcp, flags);
> + if (pcp) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(UP_flags);
> + }
> }
>
> /*
* Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-18 14:30 ` Vlastimil Babka
@ 2022-11-21 12:01 ` Mel Gorman
2022-11-21 16:03 ` Mel Gorman
2022-11-22 9:09 ` Vlastimil Babka
0 siblings, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2022-11-21 12:01 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Hugh Dickins, Yu Zhao, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM
On Fri, Nov 18, 2022 at 03:30:57PM +0100, Vlastimil Babka wrote:
> On 11/18/22 11:17, Mel Gorman wrote:
> > The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> > allocating from the PCP must not re-enter the allocator from IRQ context.
> > In each instance where IRQ-reentrancy is possible, the lock is acquired
> > using pcp_spin_trylock_irqsave() even though IRQs are disabled and
> > re-entrancy is impossible.
> >
> > Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
> > common case at the cost of some IRQ allocations taking a slower path. If
> > the PCP lists need to be refilled, the zone lock still needs to disable
> > IRQs, but that will only happen on PCP refill and drain. If an IRQ is
> > raised while a PCP allocation is in progress, the trylock will fail and
> > the allocation will fall back to using the buddy lists directly. Note
> > that this may not be a universal win if an interrupt-intensive workload
> > also allocates heavily from interrupt context and contends heavily on
> > the zone->lock as a result.
> >
> > [yuzhao@google.com: Reported lockdep issue on IO completion from softirq]
> > [hughd@google.com: Fix list corruption, lock improvements, micro-optimisations]
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits below:
>
Thanks.
> > @@ -3516,10 +3485,10 @@ void free_unref_page(struct page *page, unsigned int order)
> > */
> > void free_unref_page_list(struct list_head *list)
> > {
> > + unsigned long __maybe_unused UP_flags;
> > struct page *page, *next;
> > struct per_cpu_pages *pcp = NULL;
> > struct zone *locked_zone = NULL;
> > - unsigned long flags;
> > int batch_count = 0;
> > int migratetype;
> >
> > @@ -3550,11 +3519,26 @@ void free_unref_page_list(struct list_head *list)
> >
> > /* Different zone, different pcp lock. */
> > if (zone != locked_zone) {
> > - if (pcp)
> > - pcp_spin_unlock_irqrestore(pcp, flags);
> > + if (pcp) {
> > + pcp_spin_unlock(pcp);
> > + pcp_trylock_finish(UP_flags);
> > + }
> >
> > + /*
> > + * trylock is necessary as pages may be getting freed
> > + * from IRQ or SoftIRQ context after an IO completion.
> > + */
> > + pcp_trylock_prepare(UP_flags);
> > + pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> > + if (!pcp) {
>
> Perhaps use unlikely() here?
>
Sure, I'm less convinced these days that unlikely makes a big difference
with improvements to branch prediction but there is no harm in annotating
it. It's probably time to do a round of PROFILE_ALL_BRANCHES checking.
> > + pcp_trylock_finish(UP_flags);
> > + free_one_page(zone, page, page_to_pfn(page),
> > + 0, migratetype, FPI_NONE);
>
> Not critical for correctness, but the migratetype here might be stale and we
> should do get_pcppage_migratetype(page);
>
Yes, the call should move up. Isolated pages should already have been
removed and while it's possible there would still be a race, it's unlikely
and no worse than the existing isolated page checks.
> > + locked_zone = NULL;
> > + continue;
> > + }
> > locked_zone = zone;
> > - pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> > + batch_count = 0;
> > }
> >
> > /*
> > @@ -3569,18 +3553,23 @@ void free_unref_page_list(struct list_head *list)
> > free_unref_page_commit(zone, pcp, page, migratetype, 0);
> >
> > /*
> > - * Guard against excessive IRQ disabled times when we get
> > - * a large list of pages to free.
> > + * Guard against excessive lock hold times when freeing
> > + * a large list of pages. Lock will be reacquired if
> > + * necessary on the next iteration.
> > */
> > if (++batch_count == SWAP_CLUSTER_MAX) {
> > - pcp_spin_unlock_irqrestore(pcp, flags);
> > + pcp_spin_unlock(pcp);
> > + pcp_trylock_finish(UP_flags);
> > batch_count = 0;
> > - pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> > + pcp = NULL;
> > + locked_zone = NULL;
>
> AFAICS if this block was just "locked_zone = NULL;" then the existing code
> would do the right thing.
> Or maybe to have simpler code, just do batch_count++ here and
> make the relocking check do
> if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX)
>
While I think you're right, I think it's a bit subtle, the batch reset would
need to move, rechecked within the "Different zone, different pcp lock."
block and it would be easy to forget exactly why it's structured like
that in the future. Rather than being a fix, it could be a standalone
patch so it would be obvious in git blame but I don't feel particularly
strongly about it.
For the actual fixes to the patch, how about this? It's boot-tested only
as I find it hard to believe it would make a difference to performance.
--8<--
mm/page_alloc: Leave IRQs enabled for per-cpu page allocations -fix
As noted by Vlastimil Babka, the migratetype might be wrong if the PCP
trylock fails, so check the migratetype early. Similarly, the !pcp check
is generally unlikely, so explicitly tagging it makes sense.
This is a fix for the mm-unstable patch
mm-page_alloc-leave-irqs-enabled-for-per-cpu-page-allocations.patch
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 323fec05c4c6..445066617204 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,6 +3516,7 @@ void free_unref_page_list(struct list_head *list)
struct zone *zone = page_zone(page);
list_del(&page->lru);
+ migratetype = get_pcppage_migratetype(page);
/* Different zone, different pcp lock. */
if (zone != locked_zone) {
@@ -3530,7 +3531,7 @@ void free_unref_page_list(struct list_head *list)
*/
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
- if (!pcp) {
+ if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
free_one_page(zone, page, page_to_pfn(page),
0, migratetype, FPI_NONE);
@@ -3545,7 +3546,6 @@ void free_unref_page_list(struct list_head *list)
* Non-isolated types over MIGRATE_PCPTYPES get added
* to the MIGRATE_MOVABLE pcp list.
*/
- migratetype = get_pcppage_migratetype(page);
if (unlikely(migratetype >= MIGRATE_PCPTYPES))
migratetype = MIGRATE_MOVABLE;
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-21 12:01 ` Mel Gorman
@ 2022-11-21 16:03 ` Mel Gorman
2022-11-22 9:11 ` Vlastimil Babka
2022-11-22 9:09 ` Vlastimil Babka
1 sibling, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2022-11-21 16:03 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Hugh Dickins, Yu Zhao, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM
On Mon, Nov 21, 2022 at 12:01:23PM +0000, Mel Gorman wrote:
> On Fri, Nov 18, 2022 at 03:30:57PM +0100, Vlastimil Babka wrote:
> > On 11/18/22 11:17, Mel Gorman wrote:
> > AFAICS if this block was just "locked_zone = NULL;" then the existing code
> > would do the right thing.
> > Or maybe to have simpler code, just do batch_count++ here and
> > make the relocking check do
> > if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX)
> >
>
> While I think you're right, I think it's a bit subtle, the batch reset would
> need to move, rechecked within the "Different zone, different pcp lock."
> block and it would be easy to forget exactly why it's structured like
> that in the future. Rather than being a fix, it could be a standalone
> patch so it would be obvious in git blame but I don't feel particularly
> strongly about it.
>
Ok, less subtle than I initially thought but still deserving of a separate
patch instead of being a fix. This?
--8<--
mm/page_alloc: Simplify locking during free_unref_page_list
While freeing a large list, the PCP lock will be released and reacquired
to avoid long hold times since commit c24ad77d962c ("mm/page_alloc.c: avoid
excessive IRQ disabled times in free_unref_page_list()"). As suggested
by Vlastimil Babka, the lock release/reacquire logic can be simplified by
reusing the logic that acquires a different lock when changing zones.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 25 +++++++++----------------
1 file changed, 9 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 445066617204..08e32daf0918 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,13 +3518,19 @@ void free_unref_page_list(struct list_head *list)
list_del(&page->lru);
migratetype = get_pcppage_migratetype(page);
- /* Different zone, different pcp lock. */
- if (zone != locked_zone) {
+ /*
+ * Either different zone requiring a different pcp lock or
+ * excessive lock hold times when freeing a large list of
+ * pages.
+ */
+ if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
if (pcp) {
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
}
+ batch_count = 0;
+
/*
* trylock is necessary as pages may be getting freed
* from IRQ or SoftIRQ context after an IO completion.
@@ -3539,7 +3545,6 @@ void free_unref_page_list(struct list_head *list)
continue;
}
locked_zone = zone;
- batch_count = 0;
}
/*
@@ -3551,19 +3556,7 @@ void free_unref_page_list(struct list_head *list)
trace_mm_page_free_batched(page);
free_unref_page_commit(zone, pcp, page, migratetype, 0);
-
- /*
- * Guard against excessive lock hold times when freeing
- * a large list of pages. Lock will be reacquired if
- * necessary on the next iteration.
- */
- if (++batch_count == SWAP_CLUSTER_MAX) {
- pcp_spin_unlock(pcp);
- pcp_trylock_finish(UP_flags);
- batch_count = 0;
- pcp = NULL;
- locked_zone = NULL;
- }
+ batch_count++;
}
if (pcp) {
* Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-21 12:01 ` Mel Gorman
2022-11-21 16:03 ` Mel Gorman
@ 2022-11-22 9:09 ` Vlastimil Babka
1 sibling, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2022-11-22 9:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Hugh Dickins, Yu Zhao, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM
On 11/21/22 13:01, Mel Gorman wrote:
> On Fri, Nov 18, 2022 at 03:30:57PM +0100, Vlastimil Babka wrote:
>> On 11/18/22 11:17, Mel Gorman wrote:
>
> While I think you're right, I think it's a bit subtle, the batch reset would
> need to move, rechecked within the "Different zone, different pcp lock."
> block and it would be easy to forget exactly why it's structured like
> that in the future. Rather than being a fix, it could be a standalone
> patch so it would be obvious in git blame but I don't feel particularly
> strongly about it.
>
> For the actual fixes to the patch, how about this? It's boot-tested only
> as I find it hard to believe it would make a difference to performance.
Looks good. Shouldn't make a difference indeed.
>
> --8<--
> mm/page_alloc: Leave IRQs enabled for per-cpu page allocations -fix
>
> As noted by Vlastimil Babka, the migratetype might be wrong if the PCP
> trylock fails, so check the migratetype early. Similarly, the !pcp check
> is generally unlikely, so explicitly tagging it makes sense.
>
> This is a fix for the mm-unstable patch
> mm-page_alloc-leave-irqs-enabled-for-per-cpu-page-allocations.patch
>
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/page_alloc.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 323fec05c4c6..445066617204 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,6 +3516,7 @@ void free_unref_page_list(struct list_head *list)
> struct zone *zone = page_zone(page);
>
> list_del(&page->lru);
> + migratetype = get_pcppage_migratetype(page);
>
> /* Different zone, different pcp lock. */
> if (zone != locked_zone) {
> @@ -3530,7 +3531,7 @@ void free_unref_page_list(struct list_head *list)
> */
> pcp_trylock_prepare(UP_flags);
> pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> - if (!pcp) {
> + if (unlikely(!pcp)) {
> pcp_trylock_finish(UP_flags);
> free_one_page(zone, page, page_to_pfn(page),
> 0, migratetype, FPI_NONE);
> @@ -3545,7 +3546,6 @@ void free_unref_page_list(struct list_head *list)
> * Non-isolated types over MIGRATE_PCPTYPES get added
> * to the MIGRATE_MOVABLE pcp list.
> */
> - migratetype = get_pcppage_migratetype(page);
> if (unlikely(migratetype >= MIGRATE_PCPTYPES))
> migratetype = MIGRATE_MOVABLE;
>
>
* Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations
2022-11-21 16:03 ` Mel Gorman
@ 2022-11-22 9:11 ` Vlastimil Babka
0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2022-11-22 9:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Hugh Dickins, Yu Zhao, Marcelo Tosatti,
Michal Hocko, Marek Szyprowski, LKML, Linux-MM
On 11/21/22 17:03, Mel Gorman wrote:
> On Mon, Nov 21, 2022 at 12:01:23PM +0000, Mel Gorman wrote:
>> On Fri, Nov 18, 2022 at 03:30:57PM +0100, Vlastimil Babka wrote:
>> > On 11/18/22 11:17, Mel Gorman wrote:
>> > AFAICS if this block was just "locked_zone = NULL;" then the existing code
>> > would do the right thing.
>> > Or maybe to have simpler code, just do batch_count++ here and
>> > make the relocking check do
>> > if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX)
>> >
>>
>> While I think you're right, I think it's a bit subtle, the batch reset would
>> need to move, rechecked within the "Different zone, different pcp lock."
>> block and it would be easy to forget exactly why it's structured like
>> that in the future. Rather than being a fix, it could be a standalone
>> patch so it would be obvious in git blame but I don't feel particularly
>> strongly about it.
>>
>
> Ok, less subtle than I initially thought but still deserving of a separate
> patch instead of being a fix. This?
Yeah, thanks!
> --8<--
> mm/page_alloc: Simplify locking during free_unref_page_list
>
> While freeing a large list, the PCP lock will be released and reacquired
> to avoid long hold times since commit c24ad77d962c ("mm/page_alloc.c: avoid
> excessive IRQ disabled times in free_unref_page_list()"). As suggested
> by Vlastimil Babka, the lock release/reacquire logic can be simplified by
> reusing the logic that acquires a different lock when changing zones.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/page_alloc.c | 25 +++++++++----------------
> 1 file changed, 9 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 445066617204..08e32daf0918 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3518,13 +3518,19 @@ void free_unref_page_list(struct list_head *list)
> list_del(&page->lru);
> migratetype = get_pcppage_migratetype(page);
>
> - /* Different zone, different pcp lock. */
> - if (zone != locked_zone) {
> + /*
> + * Either different zone requiring a different pcp lock or
> + * excessive lock hold times when freeing a large list of
> + * pages.
> + */
> + if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
> if (pcp) {
> pcp_spin_unlock(pcp);
> pcp_trylock_finish(UP_flags);
> }
>
> + batch_count = 0;
> +
> /*
> * trylock is necessary as pages may be getting freed
> * from IRQ or SoftIRQ context after an IO completion.
> @@ -3539,7 +3545,6 @@ void free_unref_page_list(struct list_head *list)
> continue;
> }
> locked_zone = zone;
> - batch_count = 0;
> }
>
> /*
> @@ -3551,19 +3556,7 @@ void free_unref_page_list(struct list_head *list)
>
> trace_mm_page_free_batched(page);
> free_unref_page_commit(zone, pcp, page, migratetype, 0);
> -
> - /*
> - * Guard against excessive lock hold times when freeing
> - * a large list of pages. Lock will be reacquired if
> - * necessary on the next iteration.
> - */
> - if (++batch_count == SWAP_CLUSTER_MAX) {
> - pcp_spin_unlock(pcp);
> - pcp_trylock_finish(UP_flags);
> - batch_count = 0;
> - pcp = NULL;
> - locked_zone = NULL;
> - }
> + batch_count++;
> }
>
> if (pcp) {