* [PATCH v3 0/2] Reduce lock contention related with large folio
@ 2023-04-29 8:27 Yin Fengwei
2023-04-29 8:27 ` [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list Yin Fengwei
2023-04-29 8:27 ` [PATCH v3 2/2] lru: allow large batched add large folio to lru list Yin Fengwei
0 siblings, 2 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-04-29 8:27 UTC (permalink / raw)
To: linux-mm, akpm, willy, kirill, yuzhao, ryan.roberts, ying.huang
Cc: fengwei.yin
Ryan tried to enable large folios for anonymous mappings [1].
Unlike large folios for the page cache, which don't trigger frequent page
allocation and freeing, large folios for anonymous mappings are allocated
and freed much more frequently, so they expose some lock contention.
Ryan mentioned the deferred queue lock in [1]. We also hit two other
lock contentions: the lru lock and the zone lock.
This series tries to mitigate the deferred queue lock contention and to
reduce the lru lock contention to some extent.
Patch 1 reduces the deferred queue lock contention by not acquiring the
queue lock when checking whether the folio is in the deferred list. The
page_fault1 test of will-it-scale showed a 60% reduction in deferred
queue lock contention. A sketch of the idea follows; the actual change
is in patch 1.
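The core of patch 1 is a double-checked pattern, shown below as a sketch
lifted from the patch itself (ds_queue, folio and flags are the locals
already present in free_transhuge_page(); see patch 1 for the real diff):

	/* Lockless peek: nothing can queue the folio once it reaches the free path. */
	if (data_race(!list_empty(&folio->_deferred_list))) {
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		/* Re-check under the lock before touching the list. */
		if (!list_empty(&folio->_deferred_list)) {
			ds_queue->split_queue_len--;
			list_del(&folio->_deferred_list);
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}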
Patch 2 reduces the lru lock contention by allowing large folios to be
added to the lru list in batches. The page_fault1 test of will-it-scale
showed a 20% reduction in lru lock contention.
The zone lock contention happens on the large folio free path. It is
related to commit f26b3fa04611 ("mm/page_alloc: limit number of
high-order pages on PCP during bulk free") and will not be addressed by
this series.
[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
Changelog from v2:
- Rebased to v6.3-rc7
- Removed Tested-by: Ryan Roberts <ryan.roberts@arm.com> as the patches
  were updated after Ryan tested them.
- Updated the perf data for the deferred queue lock and lru lock
  contention for v3.
- Recheck whether the folio is in the deferred_list after taking the
  deferred queue lock, as Kirill suggested.
Changelog from v1:
For patch2:
- Add the Reported-by from Huang Ying, which I missed by mistake.
- Fix a kernel panic. The entry passed to folio_batch_add() may not
  reference a folio directly (see the illustrative sketch after this
  changelog):
  - For mlock, add a new interface with an extra nr_pages parameter;
    the caller computes nr_pages from the folio it references directly.
  - For swap, shadow and dax entries passed as the folio parameter,
    treat nr_pages as 1.
  With the fix, the stress test ran for 12 hours without any issue,
  where it previously hit a kernel panic within around 3 minutes.
- Update the lock contention info in the commit message.
- Change the field name from pages_nr to nr_pages, as Ying suggested.
This version still uses PAGEVEC_SIZE as the max nr_pages in the fbatch.
We can revise it after a decision is made about the page order for
anonymous large folios.
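
For background on the panic fix above: some batched entries are not plain
folio pointers, so folio_nr_pages() cannot be called on them. A minimal
illustrative sketch follows; the helpers and constants are assumed from
mm/mlock.c (as of v6.3) and are shown only to explain the problem; they
are not part of this series:

	/* Illustration only; tag values assumed from v6.3 mm/mlock.c. */
	#define LRU_FOLIO	0x1	/* tag: folio was already on an LRU list */
	#define NEW_FOLIO	0x2	/* tag: folio is newly mlocked */

	static inline struct folio *mlock_lru(struct folio *folio)
	{
		/* Tagged pointer: low bits must be stripped before dereferencing. */
		return (struct folio *)((unsigned long)folio + LRU_FOLIO);
	}

folio_batch_add(fbatch, mlock_lru(folio)) therefore stores a pointer that
must not be dereferenced as-is, so folio_nr_pages() on it would read the
wrong address; similarly, swap, shadow and dax entries are xarray value
entries (xa_is_value()), not folios at all. That is why mlock passes
nr_pages explicitly and value entries are counted as one page.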
Yin Fengwei (2):
THP: avoid lock when check whether THP is in deferred list
lru: allow large batched add large folio to lru list
include/linux/pagevec.h | 46 ++++++++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 17 ++++++++++-----
mm/mlock.c | 7 +++----
mm/swap.c | 3 +--
4 files changed, 59 insertions(+), 14 deletions(-)
--
2.34.1

* [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-04-29 8:27 [PATCH v3 0/2] Reduce lock contention related with large folio Yin Fengwei
@ 2023-04-29 8:27 ` Yin Fengwei
2023-05-04 11:48 ` kirill
2023-05-05 0:52 ` Huang, Ying
2023-04-29 8:27 ` [PATCH v3 2/2] lru: allow large batched add large folio to lru list Yin Fengwei
1 sibling, 2 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-04-29 8:27 UTC (permalink / raw)
To: linux-mm, akpm, willy, kirill, yuzhao, ryan.roberts, ying.huang
Cc: fengwei.yin

free_transhuge_page() acquires split queue lock then check
whether the THP was added to deferred list or not. It brings
high deferred queue lock contention.

It's safe to check whether the THP is in deferred list or not
without holding the deferred queue lock in free_transhuge_page()
because when code hit free_transhuge_page(), there is no one
tries to add the folio to _deferred_list.

Running page_fault1 of will-it-scale + order 2 folio for anonymous
mapping with 96 processes on an Ice Lake 48C/96T test box, we could
see the 61% split_queue_lock contention:
- 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
- 63.01% free_transhuge_page
+ 62.91% _raw_spin_lock_irqsave

With this patch applied, the split_queue_lock contention is less
than 1%.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/huge_memory.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 032fb0ef9cd1..2a1df2c24c8e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2799,12 +2799,19 @@ void free_transhuge_page(struct page *page)
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	unsigned long flags;
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	if (!list_empty(&folio->_deferred_list)) {
-		ds_queue->split_queue_len--;
-		list_del(&folio->_deferred_list);
+	/*
+	 * At this point, there is no one trying to add the folio to
+	 * deferred_list. If folio is not in deferred_list, it's safe
+	 * to check without acquiring the split_queue_lock.
+	 */
+	if (data_race(!list_empty(&folio->_deferred_list))) {
+		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+		if (!list_empty(&folio->_deferred_list)) {
+			ds_queue->split_queue_len--;
+			list_del(&folio->_deferred_list);
+		}
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
 }
--
2.34.1

* Re: [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-04-29 8:27 ` [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list Yin Fengwei
@ 2023-05-04 11:48 ` kirill
2023-05-05 1:09 ` Yin, Fengwei
2023-05-29 2:58 ` Yin Fengwei
1 sibling, 2 replies; 15+ messages in thread
From: kirill @ 2023-05-04 11:48 UTC (permalink / raw)
To: Yin Fengwei; +Cc: linux-mm, akpm, willy, yuzhao, ryan.roberts, ying.huang

On Sat, Apr 29, 2023 at 04:27:58PM +0800, Yin Fengwei wrote:
> free_transhuge_page() acquires split queue lock then check
> whether the THP was added to deferred list or not. It brings
> high deferred queue lock contention.
>
> It's safe to check whether the THP is in deferred list or not
> without holding the deferred queue lock in free_transhuge_page()
> because when code hit free_transhuge_page(), there is no one
> tries to add the folio to _deferred_list.
>
> Running page_fault1 of will-it-scale + order 2 folio for anonymous
> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
> see the 61% split_queue_lock contention:
> - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
> - 63.01% free_transhuge_page
> + 62.91% _raw_spin_lock_irqsave
>
> With this patch applied, the split_queue_lock contention is less
> than 1%.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

--
Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-05-04 11:48 ` kirill
@ 2023-05-05 1:09 ` Yin, Fengwei
0 siblings, 0 replies; 15+ messages in thread
From: Yin, Fengwei @ 2023-05-05 1:09 UTC (permalink / raw)
To: kirill; +Cc: linux-mm, akpm, willy, yuzhao, ryan.roberts, ying.huang

Hi Kirill,

On 5/4/2023 7:48 PM, kirill@shutemov.name wrote:
> On Sat, Apr 29, 2023 at 04:27:58PM +0800, Yin Fengwei wrote:
>> free_transhuge_page() acquires split queue lock then check
>> whether the THP was added to deferred list or not. It brings
>> high deferred queue lock contention.
>>
>> It's safe to check whether the THP is in deferred list or not
>> without holding the deferred queue lock in free_transhuge_page()
>> because when code hit free_transhuge_page(), there is no one
>> tries to add the folio to _deferred_list.
>>
>> Running page_fault1 of will-it-scale + order 2 folio for anonymous
>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
>> see the 61% split_queue_lock contention:
>> - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
>> - 63.01% free_transhuge_page
>> + 62.91% _raw_spin_lock_irqsave
>>
>> With this patch applied, the split_queue_lock contention is less
>> than 1%.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Thanks a lot for the reviewing.

Regards
Yin, Fengwei

>

* Re: [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-05-04 11:48 ` kirill
2023-05-05 1:09 ` Yin, Fengwei
@ 2023-05-29 2:58 ` Yin Fengwei
1 sibling, 0 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-05-29 2:58 UTC (permalink / raw)
To: kirill, akpm; +Cc: linux-mm, willy, yuzhao, ryan.roberts, ying.huang

Hi Andrew,

On 5/4/23 19:48, kirill@shutemov.name wrote:
> On Sat, Apr 29, 2023 at 04:27:58PM +0800, Yin Fengwei wrote:
>> free_transhuge_page() acquires split queue lock then check
>> whether the THP was added to deferred list or not. It brings
>> high deferred queue lock contention.
>>
>> It's safe to check whether the THP is in deferred list or not
>> without holding the deferred queue lock in free_transhuge_page()
>> because when code hit free_transhuge_page(), there is no one
>> tries to add the folio to _deferred_list.
>>
>> Running page_fault1 of will-it-scale + order 2 folio for anonymous
>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
>> see the 61% split_queue_lock contention:
>> - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
>> - 63.01% free_transhuge_page
>> + 62.91% _raw_spin_lock_irqsave
>>
>> With this patch applied, the split_queue_lock contention is less
>> than 1%.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
I didn't get the green light for patch2 (which was trying to reduce
lru lock contention) from Matthew. It may need more time to figure out
how to reduce lru lock contention.

I am wondering whether this patch1 (without patch2) can be picked as
- It has nothing to do with patch2
- It could reduce the deferred queue lock contention.
- It got acked-by from Kirill.

Let me know if you want me to resend this patch1. Thanks.

Regards
Yin, Fengwei

>

* Re: [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-04-29 8:27 ` [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list Yin Fengwei
2023-05-04 11:48 ` kirill
@ 2023-05-05 0:52 ` Huang, Ying
2023-05-05 1:09 ` Yin, Fengwei
1 sibling, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-05-05 0:52 UTC (permalink / raw)
To: Yin Fengwei; +Cc: linux-mm, akpm, willy, kirill, yuzhao, ryan.roberts

Yin Fengwei <fengwei.yin@intel.com> writes:

> free_transhuge_page() acquires split queue lock then check
> whether the THP was added to deferred list or not. It brings
> high deferred queue lock contention.
>
> It's safe to check whether the THP is in deferred list or not
> without holding the deferred queue lock in free_transhuge_page()
> because when code hit free_transhuge_page(), there is no one
> tries to add the folio to _deferred_list.
>
> Running page_fault1 of will-it-scale + order 2 folio for anonymous
> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
> see the 61% split_queue_lock contention:
> - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
> - 63.01% free_transhuge_page
> + 62.91% _raw_spin_lock_irqsave
>
> With this patch applied, the split_queue_lock contention is less
> than 1%.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/huge_memory.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 032fb0ef9cd1..2a1df2c24c8e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2799,12 +2799,19 @@ void free_transhuge_page(struct page *page)
>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>  	unsigned long flags;
>
> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	if (!list_empty(&folio->_deferred_list)) {
> -		ds_queue->split_queue_len--;
> -		list_del(&folio->_deferred_list);
> +	/*
> +	 * At this point, there is no one trying to add the folio to
> +	 * deferred_list. If folio is not in deferred_list, it's safe
> +	 * to check without acquiring the split_queue_lock.
> +	 */
> +	if (data_race(!list_empty(&folio->_deferred_list))) {
> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +		if (!list_empty(&folio->_deferred_list)) {
> +			ds_queue->split_queue_len--;
> +			list_del(&folio->_deferred_list);
> +		}
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  	}
> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  	free_compound_page(page);
>  }

* Re: [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list
2023-05-05 0:52 ` Huang, Ying
@ 2023-05-05 1:09 ` Yin, Fengwei
0 siblings, 0 replies; 15+ messages in thread
From: Yin, Fengwei @ 2023-05-05 1:09 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm, akpm, willy, kirill, yuzhao, ryan.roberts

Hi Ying,

On 5/5/2023 8:52 AM, Huang, Ying wrote:
> Yin Fengwei <fengwei.yin@intel.com> writes:
>
>> free_transhuge_page() acquires split queue lock then check
>> whether the THP was added to deferred list or not. It brings
>> high deferred queue lock contention.
>>
>> It's safe to check whether the THP is in deferred list or not
>> without holding the deferred queue lock in free_transhuge_page()
>> because when code hit free_transhuge_page(), there is no one
>> tries to add the folio to _deferred_list.
>>
>> Running page_fault1 of will-it-scale + order 2 folio for anonymous
>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could
>> see the 61% split_queue_lock contention:
>> - 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
>> - 63.01% free_transhuge_page
>> + 62.91% _raw_spin_lock_irqsave
>>
>> With this patch applied, the split_queue_lock contention is less
>> than 1%.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> Thanks!
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Thanks a lot for reviewing.

Regards
Yin, Fengwei

>
>> ---
>>  mm/huge_memory.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 032fb0ef9cd1..2a1df2c24c8e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2799,12 +2799,19 @@ void free_transhuge_page(struct page *page)
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>  	unsigned long flags;
>>
>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	if (!list_empty(&folio->_deferred_list)) {
>> -		ds_queue->split_queue_len--;
>> -		list_del(&folio->_deferred_list);
>> +	/*
>> +	 * At this point, there is no one trying to add the folio to
>> +	 * deferred_list. If folio is not in deferred_list, it's safe
>> +	 * to check without acquiring the split_queue_lock.
>> +	 */
>> +	if (data_race(!list_empty(&folio->_deferred_list))) {
>> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +		if (!list_empty(&folio->_deferred_list)) {
>> +			ds_queue->split_queue_len--;
>> +			list_del(&folio->_deferred_list);
>> +		}
>> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	}
>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }

* [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-04-29 8:27 [PATCH v3 0/2] Reduce lock contention related with large folio Yin Fengwei
2023-04-29 8:27 ` [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list Yin Fengwei
@ 2023-04-29 8:27 ` Yin Fengwei
2023-04-29 22:35 ` Matthew Wilcox
2023-06-20 3:22 ` Matthew Wilcox
1 sibling, 2 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-04-29 8:27 UTC (permalink / raw)
To: linux-mm, akpm, willy, kirill, yuzhao, ryan.roberts, ying.huang
Cc: fengwei.yin

Currently, large folio is not batched added to lru list. Which cause
high lru lock contention after enable large folio for anonymous mapping.

Running page_fault1 of will-it-scale + order 2 folio with 96 processes
on Ice Lake 48C/96T, the lru lock contention could be around 64%:
- 64.31% 0.23% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
- 64.07% folio_lruvec_lock_irqsave
+ 64.01% _raw_spin_lock_irqsave

With this patch, the lru lock contention dropped to 43% with same
testing:
- 42.67% 0.19% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
- 42.48% folio_lruvec_lock_irqsave
+ 42.42% _raw_spin_lock_irqsave

Reported-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/pagevec.h | 46 ++++++++++++++++++++++++++++++++++++++---
 mm/mlock.c              |  7 +++----
 mm/swap.c               |  3 +--
 3 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index f582f7213ea5..9479b7b50bc6 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -10,6 +10,7 @@
 #define _LINUX_PAGEVEC_H
 
 #include <linux/xarray.h>
+#include <linux/mm.h>
 
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE	15
@@ -22,6 +23,7 @@ struct address_space;
 struct pagevec {
 	unsigned char nr;
 	bool percpu_pvec_drained;
+	unsigned short nr_pages;
 	struct page *pages[PAGEVEC_SIZE];
 };
 
@@ -30,12 +32,14 @@ void __pagevec_release(struct pagevec *pvec);
 static inline void pagevec_init(struct pagevec *pvec)
 {
 	pvec->nr = 0;
+	pvec->nr_pages = 0;
 	pvec->percpu_pvec_drained = false;
 }
 
 static inline void pagevec_reinit(struct pagevec *pvec)
 {
 	pvec->nr = 0;
+	pvec->nr_pages = 0;
 }
 
 static inline unsigned pagevec_count(struct pagevec *pvec)
@@ -54,7 +58,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
 static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
 {
 	pvec->pages[pvec->nr++] = page;
-	return pagevec_space(pvec);
+	pvec->nr_pages += compound_nr(page);
+
+	if (pvec->nr_pages > PAGEVEC_SIZE)
+		return 0;
+	else
+		return pagevec_space(pvec);
 }
 
 static inline void pagevec_release(struct pagevec *pvec)
@@ -75,6 +84,7 @@ static inline void pagevec_release(struct pagevec *pvec)
 struct folio_batch {
 	unsigned char nr;
 	bool percpu_pvec_drained;
+	unsigned short nr_pages;
 	struct folio *folios[PAGEVEC_SIZE];
 };
 
@@ -92,12 +102,14 @@ static_assert(offsetof(struct pagevec, pages) ==
 static inline void folio_batch_init(struct folio_batch *fbatch)
 {
 	fbatch->nr = 0;
+	fbatch->nr_pages = 0;
 	fbatch->percpu_pvec_drained = false;
 }
 
 static inline void folio_batch_reinit(struct folio_batch *fbatch)
 {
 	fbatch->nr = 0;
+	fbatch->nr_pages = 0;
 }
 
 static inline unsigned int folio_batch_count(struct folio_batch *fbatch)
@@ -110,6 +122,32 @@ static inline unsigned int fbatch_space(struct folio_batch *fbatch)
 	return PAGEVEC_SIZE - fbatch->nr;
 }
 
+/**
+ * folio_batch_add_nr_pages() - Add a folio to a batch.
+ * @fbatch: The folio batch.
+ * @folio: The folio to add.
+ * @nr_pages: The number of pages added to batch.
+ *
+ * The folio is added to the end of the batch.
+ * The batch must have previously been initialised using folio_batch_init().
+ *
+ * Return: The number of slots still available.
+ * Note: parameter folio may not be direct reference to folio and can't
+ * use folio_nr_pages(folio).
+ * Currently, this function is only called in mlock.c.
+ */
+static inline unsigned folio_batch_add_nr_pages(struct folio_batch *fbatch,
+		struct folio *folio, unsigned int nr_pages)
+{
+	fbatch->folios[fbatch->nr++] = folio;
+	fbatch->nr_pages += nr_pages;
+
+	if (fbatch->nr_pages > PAGEVEC_SIZE)
+		return 0;
+	else
+		return fbatch_space(fbatch);
+}
+
 /**
  * folio_batch_add() - Add a folio to a batch.
  * @fbatch: The folio batch.
@@ -123,8 +161,10 @@ static inline unsigned int fbatch_space(struct folio_batch *fbatch)
 static inline unsigned folio_batch_add(struct folio_batch *fbatch,
 		struct folio *folio)
 {
-	fbatch->folios[fbatch->nr++] = folio;
-	return fbatch_space(fbatch);
+	unsigned int nr_pages;
+
+	nr_pages = xa_is_value(folio) ? 1 : folio_nr_pages(folio);
+	return folio_batch_add_nr_pages(fbatch, folio, nr_pages);
 }
 
 static inline void folio_batch_release(struct folio_batch *fbatch)
diff --git a/mm/mlock.c b/mm/mlock.c
index 617469fce96d..6de3e6d4639f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -243,19 +243,18 @@ bool need_mlock_drain(int cpu)
 void mlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
+	unsigned int nr_pages = folio_nr_pages(folio);
 
 	local_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
 	if (!folio_test_set_mlocked(folio)) {
-		int nr_pages = folio_nr_pages(folio);
-
 		zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
 		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 	}
 
 	folio_get(folio);
-	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
+	if (!folio_batch_add_nr_pages(fbatch, mlock_lru(folio), nr_pages) ||
 	    folio_test_large(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
@@ -278,7 +277,7 @@ void mlock_new_folio(struct folio *folio)
 	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 
 	folio_get(folio);
-	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
+	if (!folio_batch_add_nr_pages(fbatch, mlock_new(folio), nr_pages) ||
 	    folio_test_large(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
 	local_unlock(&mlock_fbatch.lock);
diff --git a/mm/swap.c b/mm/swap.c
index 57cb01b042f6..0f8554aeb338 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -228,8 +228,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 static void folio_batch_add_and_move(struct folio_batch *fbatch,
 		struct folio *folio, move_fn_t move_fn)
 {
-	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
-	    !lru_cache_disabled())
+	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
 		return;
 	folio_batch_move_lru(fbatch, move_fn);
 }
--
2.34.1

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-04-29 8:27 ` [PATCH v3 2/2] lru: allow large batched add large folio to lru list Yin Fengwei
@ 2023-04-29 22:35 ` Matthew Wilcox
2023-05-01 5:52 ` Yin, Fengwei
2023-05-05 5:51 ` Yin, Fengwei
2023-06-20 3:22 ` Matthew Wilcox
1 sibling, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2023-04-29 22:35 UTC (permalink / raw)
To: Yin Fengwei; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
> @@ -22,6 +23,7 @@ struct address_space;
>  struct pagevec {
>  	unsigned char nr;
>  	bool percpu_pvec_drained;
> +	unsigned short nr_pages;

I still don't like storing nr_pages in the pagevec/folio_batch.

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-04-29 22:35 ` Matthew Wilcox
@ 2023-05-01 5:52 ` Yin, Fengwei
0 siblings, 0 replies; 15+ messages in thread
From: Yin, Fengwei @ 2023-05-01 5:52 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

Hi Matthew,

On 4/30/2023 6:35 AM, Matthew Wilcox wrote:
> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>> @@ -22,6 +23,7 @@ struct address_space;
>>  struct pagevec {
>>  	unsigned char nr;
>>  	bool percpu_pvec_drained;
>> +	unsigned short nr_pages;
>
> I still don't like storing nr_pages in the pagevec/folio_batch.
OK. Let me see whether I could find out other way without storing
the nr_pages. Thanks.

Regards
Yin, Fengwei

>

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-04-29 22:35 ` Matthew Wilcox
2023-05-01 5:52 ` Yin, Fengwei
@ 2023-05-05 5:51 ` Yin, Fengwei
2023-05-15 2:14 ` Yin, Fengwei
1 sibling, 1 reply; 15+ messages in thread
From: Yin, Fengwei @ 2023-05-05 5:51 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

Hi Matthew,

On 4/30/2023 6:35 AM, Matthew Wilcox wrote:
> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>> @@ -22,6 +23,7 @@ struct address_space;
>>  struct pagevec {
>>  	unsigned char nr;
>>  	bool percpu_pvec_drained;
>> +	unsigned short nr_pages;
>
> I still don't like storing nr_pages in the pagevec/folio_batch.
>

What about the change like following:

diff --git a/mm/swap.c b/mm/swap.c
index 57cb01b042f6..5e7e9c0734ab 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -228,8 +228,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 static void folio_batch_add_and_move(struct folio_batch *fbatch,
 		struct folio *folio, move_fn_t move_fn)
 {
-	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
-	    !lru_cache_disabled())
+	int nr_pages = folio_nr_pages(folio);
+
+	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled() &&
+		(!folio_test_large(folio) || (nr_pages <= (PAGEVEC_SIZE + 1))))
 		return;
 	folio_batch_move_lru(fbatch, move_fn);
 }

I did testing about the lru lock contention with different folio size
with will-it-scale + deferred queue lock contention mitigated:
- If large folio size is 16K (order 2), the lru lock takes 64.31% cpu runtime
- If large folio size is 64K (order 4), the lru lock takes 24.24% cpu runtime
This is as our expectation: The larger size of folio, the less lru lock
contention.

It's acceptable to not batched operate on large folio which is large
enough. PAGEVEC_SIZE + 1 is chosen here based on following reasons:
- acceptable max memory size per batch: 15 x 16 x 4096 = 983040 bytes
- the folios with size larger than it will not apply batched operation.
  But the lru lock contention is not high already.

I collected data with lru contention when run will-it-scale.page_fault1:

folio with order 2:
Without the change:
- 64.31% 0.23% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
+ 64.07% folio_lruvec_lock_irqsave

With the change:
- 21.55% 0.21% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
+ 21.34% folio_lruvec_lock_irqsave

folio with order 4:
Without the change:
- 24.24% 0.15% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
+ 24.09% folio_lruvec_lock_irqsave

With the change:
- 2.20% 0.09% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
+ 2.11% folio_lruvec_lock_irqsave

folio with order 5:
- 8.21% 0.16% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
+ 8.05% folio_lruvec_lock_irqsave

Regards
Yin, Fengwei

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-05-05 5:51 ` Yin, Fengwei
@ 2023-05-15 2:14 ` Yin, Fengwei
0 siblings, 0 replies; 15+ messages in thread
From: Yin, Fengwei @ 2023-05-15 2:14 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

Hi Matthew,

On 5/5/2023 1:51 PM, Yin, Fengwei wrote:
> Hi Matthew,
>
> On 4/30/2023 6:35 AM, Matthew Wilcox wrote:
>> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>>> @@ -22,6 +23,7 @@ struct address_space;
>>>  struct pagevec {
>>>  	unsigned char nr;
>>>  	bool percpu_pvec_drained;
>>> +	unsigned short nr_pages;
>>
>> I still don't like storing nr_pages in the pagevec/folio_batch.
>>
>
> What about the change like following:
Soft ping.

Regards
Yin, Fengwei

>
> diff --git a/mm/swap.c b/mm/swap.c
> index 57cb01b042f6..5e7e9c0734ab 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -228,8 +228,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  static void folio_batch_add_and_move(struct folio_batch *fbatch,
>  		struct folio *folio, move_fn_t move_fn)
>  {
> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
> -	    !lru_cache_disabled())
> +	int nr_pages = folio_nr_pages(folio);
> +
> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled() &&
> +		(!folio_test_large(folio) || (nr_pages <= (PAGEVEC_SIZE + 1))))
>  		return;
>  	folio_batch_move_lru(fbatch, move_fn);
>  }
>
>
> I did testing about the lru lock contention with different folio size
> with will-it-scale + deferred queue lock contention mitigated:
> - If large folio size is 16K (order 2), the lru lock takes 64.31% cpu runtime
> - If large folio size is 64K (order 4), the lru lock takes 24.24% cpu runtime
> This is as our expectation: The larger size of folio, the less lru lock
> contention.
>
> It's acceptable to not batched operate on large folio which is large
> enough. PAGEVEC_SIZE + 1 is chosen here based on following reasons:
> - acceptable max memory size per batch: 15 x 16 x 4096 = 983040 bytes
> - the folios with size larger than it will not apply batched operation.
>   But the lru lock contention is not high already.
>
>
> I collected data with lru contention when run will-it-scale.page_fault1:
>
> folio with order 2:
> Without the change:
> - 64.31% 0.23% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
> + 64.07% folio_lruvec_lock_irqsave
>
> With the change:
> - 21.55% 0.21% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
> + 21.34% folio_lruvec_lock_irqsave
>
> folio with order 4:
> Without the change:
> - 24.24% 0.15% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
> + 24.09% folio_lruvec_lock_irqsave
>
> With the change:
> - 2.20% 0.09% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
> + 2.11% folio_lruvec_lock_irqsave
>
> folio with order 5:
> - 8.21% 0.16% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
> + 8.05% folio_lruvec_lock_irqsave
>
>
> Regards
> Yin, Fengwei
>

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-04-29 8:27 ` [PATCH v3 2/2] lru: allow large batched add large folio to lru list Yin Fengwei
2023-04-29 22:35 ` Matthew Wilcox
@ 2023-06-20 3:22 ` Matthew Wilcox
2023-06-20 4:39 ` Yin Fengwei
2023-06-20 8:01 ` Yin Fengwei
1 sibling, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2023-06-20 3:22 UTC (permalink / raw)
To: Yin Fengwei; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
> diff --git a/mm/swap.c b/mm/swap.c
> index 57cb01b042f6..0f8554aeb338 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -228,8 +228,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  static void folio_batch_add_and_move(struct folio_batch *fbatch,
>  		struct folio *folio, move_fn_t move_fn)
>  {
> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
> -	    !lru_cache_disabled())
> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
>  		return;
>  	folio_batch_move_lru(fbatch, move_fn);
>  }

What if all you do is:

-	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
-	    !lru_cache_disabled())
+	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())


How does that perform?

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-06-20 3:22 ` Matthew Wilcox
@ 2023-06-20 4:39 ` Yin Fengwei
0 siblings, 0 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-06-20 4:39 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

On 6/20/23 11:22, Matthew Wilcox wrote:
> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 57cb01b042f6..0f8554aeb338 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -228,8 +228,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>>  static void folio_batch_add_and_move(struct folio_batch *fbatch,
>>  		struct folio *folio, move_fn_t move_fn)
>>  {
>> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
>> -	    !lru_cache_disabled())
>> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
>>  		return;
>>  	folio_batch_move_lru(fbatch, move_fn);
>>  }
>
> What if all you do is:
>
> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
> -	    !lru_cache_disabled())
> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
>
>
> How does that perform?
I will give it a try. Thanks.

Regards
Yin, Fengwei

* Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
2023-06-20 3:22 ` Matthew Wilcox
2023-06-20 4:39 ` Yin Fengwei
@ 2023-06-20 8:01 ` Yin Fengwei
1 sibling, 0 replies; 15+ messages in thread
From: Yin Fengwei @ 2023-06-20 8:01 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, akpm, kirill, yuzhao, ryan.roberts, ying.huang

On 6/20/23 11:22, Matthew Wilcox wrote:
> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 57cb01b042f6..0f8554aeb338 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -228,8 +228,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>>  static void folio_batch_add_and_move(struct folio_batch *fbatch,
>>  		struct folio *folio, move_fn_t move_fn)
>>  {
>> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
>> -	    !lru_cache_disabled())
>> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
>>  		return;
>>  	folio_batch_move_lru(fbatch, move_fn);
>>  }
>
> What if all you do is:
>
> -	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
> -	    !lru_cache_disabled())
> +	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
>
>
> How does that perform?
With same hardware: IceLake 48C/96T and using order 2, the test result
is as following:

order2_without_the_patch:
- 65.53% 0.22% page_fault1_pro [kernel.kallsyms] [k] folio_lruvec_lock_irqsave
- 65.30% folio_lruvec_lock_irqsave
+ 65.30% _raw_spin_lock_irqsave

order2_with_the_patch:
- 19.94% 0.26% page_fault1_pro [kernel.vmlinux] [k] folio_lruvec_lock_irqsave
- 19.67% folio_lruvec_lock_irqsave
+ 19.67% _raw_spin_lock_irqsave

Regards
Yin, Fengwei