* [PATCH 0/2] Reduce lock contention related to large folios
From: Yin Fengwei @ 2023-04-17 7:56 UTC
To: linux-mm, akpm, willy, yuzhao, ryan.roberts; +Cc: fengwei.yin
Ryan tried to enable large folios for anonymous mappings [1].

Unlike large folios for the page cache, which are not allocated and freed
frequently, large folios for anonymous mappings are allocated and freed
much more often, so they expose some lock contention.

Ryan mentioned the deferred queue lock in [1]. We also hit contention on
two other locks: the lru lock and the zone lock.

This series mitigates the deferred queue lock contention and reduces the
lru lock contention to some extent.

Patch 1 reduces deferred queue lock contention by not acquiring the queue
lock when checking whether the folio is on the deferred list. The
page_fault1 test of will-it-scale showed a 60% reduction in deferred
queue lock contention.

Patch 2 reduces lru lock contention by allowing large folios to be added
to the lru list in batches. The page_fault1 test of will-it-scale showed
a 20% reduction in lru lock contention.

The zone lock contention happens on the large folio free path. It is
related to commit f26b3fa04611 ("mm/page_alloc: limit number of
high-order pages on PCP during bulk free") and is not addressed by this
series.
[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
Yin Fengwei (2):
  THP: avoid lock when checking whether THP is in deferred list
  lru: allow batched add of large folios to lru list
 include/linux/pagevec.h | 19 +++++++++++++++++--
 mm/huge_memory.c        | 19 ++++++++++++++++---
 mm/swap.c               |  3 +--
 3 files changed, 34 insertions(+), 7 deletions(-)
--
2.30.2
* [PATCH 1/2] THP: avoid lock when checking whether THP is in deferred list
From: Yin Fengwei @ 2023-04-17 7:56 UTC
To: linux-mm, akpm, willy, yuzhao, ryan.roberts; +Cc: fengwei.yin
free_transhuge_page() acquires the split queue lock and then checks
whether the THP was added to the deferred list.

The check can be made without taking the lock: by the time
free_transhuge_page() runs, nobody else can be updating the folio's
_deferred_list.

If the folio is not on the deferred list, it's trivially safe to check
without acquiring the lock.

If the folio is on the deferred list, adding or deleting other nodes
in the list does not change the return value of
list_empty(&folio->_deferred_list).
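For reference, list_empty() reads only the entry's own ->next pointer,
as implemented in include/linux/list.h:

  static inline int list_empty(const struct list_head *head)
  {
  	/* Empty means the entry points back at itself. */
  	return READ_ONCE(head->next) == head;
  }

If the folio was never queued, its _deferred_list still points at itself
from INIT_LIST_HEAD() and nothing else writes it, so list_empty() returns
true. If the folio is queued, its ->next points at another node or at the
queue head, never at itself; deleting a neighbouring node may rewrite
->next, but only to point at yet another external node, so list_empty()
stays false.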
Running page_fault1 of will-it-scale with order-2 folios for anonymous
mappings and 96 processes on an Ice Lake 48C/96T test box, we saw 61%
split_queue_lock contention:
-   71.28%     0.35%  page_fault1_pro  [kernel.kallsyms]  [k] release_pages
   - 70.93% release_pages
      - 61.42% free_transhuge_page
         + 60.77% _raw_spin_lock_irqsave
With this patch applied, the split_queue_lock contention drops to less
than 1%.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/huge_memory.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b0252b418ef01..802082531e606 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2842,12 +2842,25 @@ void free_transhuge_page(struct page *page)
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	unsigned long flags;
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	if (!list_empty(&folio->_deferred_list)) {
+	/*
+	 * At this point, there is no one trying to add the folio to the
+	 * deferred_list, so folio->_deferred_list cannot be concurrently
+	 * updated.
+	 *
+	 * If the folio is not on the deferred_list, it's safe to check
+	 * list_empty(&folio->_deferred_list) without acquiring the lock.
+	 *
+	 * If the folio is on the deferred_list, adding or deleting other
+	 * nodes on the list does not change the result of
+	 * list_empty(&folio->_deferred_list), so the unlocked check is
+	 * safe in that case too.
+	 */
+	if (data_race(!list_empty(&folio->_deferred_list))) {
+		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 		ds_queue->split_queue_len--;
 		list_del(&folio->_deferred_list);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
 }
 
--
2.30.2
* [PATCH 2/2] lru: allow batched add of large folios to lru list
From: Yin Fengwei @ 2023-04-17 7:56 UTC
To: linux-mm, akpm, willy, yuzhao, ryan.roberts; +Cc: fengwei.yin
Currently, large folios are not added to the lru list in batches, which
causes high lru lock contention once large folios are enabled for
anonymous mappings. Batching them reduces how often the lru lock has to
be taken: with order-2 folios, for example, the batch is drained once
per four folios (16 pages > PAGEVEC_SIZE) instead of once per folio.
Running page_fault1 of will-it-scale with order-2 folios and 96
processes on an Ice Lake 48C/96T box, the lru lock contention is around
65%:
-   65.38%     0.17%  page_fault1_pro  [kernel.kallsyms]  [k] folio_lruvec_lock_irqsave
   - 65.21% folio_lruvec_lock_irqsave
With this patch, the lru lock contention drops to 45% in the same test:
-   44.93%     0.17%  page_fault1_pro  [kernel.kallsyms]  [k] folio_lruvec_lock_irqsave
   + 44.75% folio_lruvec_lock_irqsave
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/pagevec.h | 19 +++++++++++++++++--
 mm/swap.c               |  3 +--
 2 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index f582f7213ea52..d719f7ad5a567 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -10,6 +10,7 @@
 #define _LINUX_PAGEVEC_H
 
 #include <linux/xarray.h>
+#include <linux/mm.h>
 
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE	15
@@ -22,6 +23,7 @@ struct address_space;
 struct pagevec {
 	unsigned char nr;
 	bool percpu_pvec_drained;
+	unsigned short pages_nr;
 	struct page *pages[PAGEVEC_SIZE];
 };
 
@@ -30,6 +32,7 @@ void __pagevec_release(struct pagevec *pvec);
 static inline void pagevec_init(struct pagevec *pvec)
 {
 	pvec->nr = 0;
+	pvec->pages_nr = 0;
 	pvec->percpu_pvec_drained = false;
 }
 
@@ -54,7 +57,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
 static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
 {
 	pvec->pages[pvec->nr++] = page;
-	return pagevec_space(pvec);
+	pvec->pages_nr += compound_nr(page);
+
+	if (pvec->pages_nr > PAGEVEC_SIZE)
+		return 0;
+	else
+		return pagevec_space(pvec);
 }
 
 static inline void pagevec_release(struct pagevec *pvec)
@@ -75,6 +83,7 @@ static inline void pagevec_release(struct pagevec *pvec)
 struct folio_batch {
 	unsigned char nr;
 	bool percpu_pvec_drained;
+	unsigned short pages_nr;
 	struct folio *folios[PAGEVEC_SIZE];
 };
 
@@ -92,6 +101,7 @@ static_assert(offsetof(struct pagevec, pages) ==
 static inline void folio_batch_init(struct folio_batch *fbatch)
 {
 	fbatch->nr = 0;
+	fbatch->pages_nr = 0;
 	fbatch->percpu_pvec_drained = false;
 }
 
@@ -124,7 +134,12 @@ static inline unsigned folio_batch_add(struct folio_batch *fbatch,
 		struct folio *folio)
 {
 	fbatch->folios[fbatch->nr++] = folio;
-	return fbatch_space(fbatch);
+	fbatch->pages_nr += folio_nr_pages(folio);
+
+	if (fbatch->pages_nr > PAGEVEC_SIZE)
+		return 0;
+	else
+		return fbatch_space(fbatch);
 }
 
 static inline void folio_batch_release(struct folio_batch *fbatch)
diff --git a/mm/swap.c b/mm/swap.c
index 423199ee8478c..59e3f1e3701c3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -228,8 +228,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 static void folio_batch_add_and_move(struct folio_batch *fbatch,
 		struct folio *folio, move_fn_t move_fn)
 {
-	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
-	    !lru_cache_disabled())
+	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled())
 		return;
 	folio_batch_move_lru(fbatch, move_fn);
 }
--
2.30.2
* Re: [PATCH 0/2] Reduce lock contention related to large folios
From: Ryan Roberts @ 2023-04-17 10:33 UTC
To: Yin Fengwei, linux-mm, akpm, willy, yuzhao
On 17/04/2023 08:56, Yin Fengwei wrote:
> Ryan tried to enable large folios for anonymous mappings [1].
>
> Unlike large folios for the page cache, which are not allocated and freed
> frequently, large folios for anonymous mappings are allocated and freed
> much more often, so they expose some lock contention.
>
> Ryan mentioned the deferred queue lock in [1]. We also hit contention on
> two other locks: the lru lock and the zone lock.
>
> This series mitigates the deferred queue lock contention and reduces the
> lru lock contention to some extent.
>
> Patch 1 reduces deferred queue lock contention by not acquiring the queue
> lock when checking whether the folio is on the deferred list. The
> page_fault1 test of will-it-scale showed a 60% reduction in deferred
> queue lock contention.
>
> Patch 2 reduces lru lock contention by allowing large folios to be added
> to the lru list in batches. The page_fault1 test of will-it-scale showed
> a 20% reduction in lru lock contention.
>
> The zone lock contention happens on the large folio free path. It is
> related to commit f26b3fa04611 ("mm/page_alloc: limit number of
> high-order pages on PCP during bulk free") and is not addressed by this
> series.
I applied this series on top of mine and did some quick perf tests. See
https://lore.kernel.org/linux-mm/d9987135-3a8a-e22c-13f9-506d3249332b@arm.com/.
The change is certainly reducing time spent in the kernel, but there are other
problems I'll need to investigate. So:
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Thanks,
Ryan
>
>
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
>
> Yin Fengwei (2):
>   THP: avoid lock when checking whether THP is in deferred list
>   lru: allow batched add of large folios to lru list
>
>  include/linux/pagevec.h | 19 +++++++++++++++++--
>  mm/huge_memory.c        | 19 ++++++++++++++++---
>  mm/swap.c               |  3 +--
>  3 files changed, 34 insertions(+), 7 deletions(-)
>
* Re: [PATCH 2/2] lru: allow batched add of large folios to lru list
From: Matthew Wilcox @ 2023-04-17 12:25 UTC
To: Yin Fengwei; +Cc: linux-mm, akpm, yuzhao, ryan.roberts
On Mon, Apr 17, 2023 at 03:56:43PM +0800, Yin Fengwei wrote:
> Currently, large folios are not added to the lru list in batches, which
> causes high lru lock contention once large folios are enabled for
> anonymous mappings.
Obviously, I think we should be doing a batched add, but I don't think
this is right.
> @@ -54,7 +57,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
>  static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
>  {
>  	pvec->pages[pvec->nr++] = page;
> -	return pagevec_space(pvec);
> +	pvec->pages_nr += compound_nr(page);
> +
> +	if (pvec->pages_nr > PAGEVEC_SIZE)
> +		return 0;
> +	else
> +		return pagevec_space(pvec);
I assume your thinking here is that we should limit the number of pages
in the batches, but I think we should allow the number of folios to reach
PAGEVEC_SIZE before we drain the batch onto the LRU list. That will
reduce the contention on the LRU lock even further.
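Something like this untested sketch, i.e. keep the page accounting but
stop using it to force a drain:

  static inline unsigned folio_batch_add(struct folio_batch *fbatch,
  		struct folio *folio)
  {
  	fbatch->folios[fbatch->nr++] = folio;
  	fbatch->pages_nr += folio_nr_pages(folio);

  	/* Drain only once all PAGEVEC_SIZE folio slots are used. */
  	return fbatch_space(fbatch);
  }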
* Re: [PATCH 2/2] lru: allow batched add of large folios to lru list
From: Yin Fengwei @ 2023-04-18 1:57 UTC
To: Matthew Wilcox; +Cc: linux-mm, akpm, yuzhao, ryan.roberts
On 4/17/23 20:25, Matthew Wilcox wrote:
> On Mon, Apr 17, 2023 at 03:56:43PM +0800, Yin Fengwei wrote:
>> Currently, large folios are not added to the lru list in batches, which
>> causes high lru lock contention once large folios are enabled for
>> anonymous mappings.
>
> Obviously, I think we should be doing a batched add, but I don't think
> this is right.
>
>> @@ -54,7 +57,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
>>  static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
>>  {
>>  	pvec->pages[pvec->nr++] = page;
>> -	return pagevec_space(pvec);
>> +	pvec->pages_nr += compound_nr(page);
>> +
>> +	if (pvec->pages_nr > PAGEVEC_SIZE)
>> +		return 0;
>> +	else
>> +		return pagevec_space(pvec);
>
> I assume your thinking here is that we should limit the number of pages
> in the batches, but I think we should allow the number of folios to reach
> PAGEVEC_SIZE before we drain the batch onto the LRU list. That will
> reduce the contention on the LRU lock even further.
Yes. My first thought was to limit the number of folios as well.

But the concern is that large folios span a wide range of sizes. In the
extreme case, if all the batched large folios are 2M in size, with
PAGEVEC_SIZE at 15 one batch could hold 30M of memory, which could be
too large for some usages.
Regards
Yin, Fengwei
>
* Re: [PATCH 2/2] lru: allow batched add of large folios to lru list
From: Yin Fengwei @ 2023-04-18 2:37 UTC
To: Matthew Wilcox, Huang, Ying; +Cc: linux-mm, akpm, yuzhao, ryan.roberts
Add Ying, who found out that large folios are not added to the lru list
in batches.
On 4/18/23 09:57, Yin Fengwei wrote:
>
>
> On 4/17/23 20:25, Matthew Wilcox wrote:
>> On Mon, Apr 17, 2023 at 03:56:43PM +0800, Yin Fengwei wrote:
>>> Currently, large folios are not added to the lru list in batches, which
>>> causes high lru lock contention once large folios are enabled for
>>> anonymous mappings.
>>
>> Obviously, I think we should be doing a batched add, but I don't think
>> this is right.
>>
>>> @@ -54,7 +57,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
>>>  static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
>>>  {
>>>  	pvec->pages[pvec->nr++] = page;
>>> -	return pagevec_space(pvec);
>>> +	pvec->pages_nr += compound_nr(page);
>>> +
>>> +	if (pvec->pages_nr > PAGEVEC_SIZE)
>>> +		return 0;
>>> +	else
>>> +		return pagevec_space(pvec);
>>
>> I assume your thinking here is that we should limit the number of pages
>> in the batches, but I think we should allow the number of folios to reach
>> PAGEVEC_SIZE before we drain the batch onto the LRU list. That will
>> reduce the contention on the LRU lock even further.
>
> Yes. My first thought was to limit the number of folios as well.
>
> But the concern is that large folios span a wide range of sizes. In the
> extreme case, if all the batched large folios are 2M in size, with
> PAGEVEC_SIZE at 15 one batch could hold 30M of memory, which could be
> too large for some usages.
>
>
> Regards
> Yin, Fengwei
>
>>
* Re: [PATCH 2/2] lru: allow batched add of large folios to lru list
From: Huang, Ying @ 2023-04-18 6:39 UTC
To: Yin Fengwei; +Cc: Matthew Wilcox, linux-mm, akpm, yuzhao, ryan.roberts
Yin Fengwei <fengwei.yin@intel.com> writes:
> Add Ying, who found out that large folios are not added to the lru list
> in batches.
Thanks!
> On 4/18/23 09:57, Yin Fengwei wrote:
>>
>>
>> On 4/17/23 20:25, Matthew Wilcox wrote:
>>> On Mon, Apr 17, 2023 at 03:56:43PM +0800, Yin Fengwei wrote:
>>>> Currently, large folios are not added to the lru list in batches, which
>>>> causes high lru lock contention once large folios are enabled for
>>>> anonymous mappings.
>>>
>>> Obviously, I think we should be doing a batched add, but I don't think
>>> this is right.
>>>
>>>> @@ -54,7 +57,12 @@ static inline unsigned pagevec_space(struct pagevec *pvec)
>>>>  static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
>>>>  {
>>>>  	pvec->pages[pvec->nr++] = page;
>>>> -	return pagevec_space(pvec);
>>>> +	pvec->pages_nr += compound_nr(page);
>>>> +
>>>> +	if (pvec->pages_nr > PAGEVEC_SIZE)

nr_pages seems better to me.

>>>> +		return 0;
>>>> +	else
>>>> +		return pagevec_space(pvec);
>>>
>>> I assume your thinking here is that we should limit the number of pages
>>> in the batches, but I think we should allow the number of folios to reach
>>> PAGEVEC_SIZE before we drain the batch onto the LRU list. That will
>>> reduce the contention on the LRU lock even further.
>>
>> Yes. My first thought was to limit the number of folios as well.
>>
>> But the concern is that large folios span a wide range of sizes. In the
>> extreme case, if all the batched large folios are 2M in size, with
>> PAGEVEC_SIZE at 15 one batch could hold 30M of memory, which could be
>> too large for some usages.
Yes, I think these are valid concerns. One possibility for balancing
performance against lru cache size is to keep nr_pages of the per-CPU
lru cache below PAGEVEC_SIZE * N, where N is determined by the intended
base order of large folios. For example, N can be 4 if we use 2 as the
intended base order.
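An untested sketch of that idea (LRU_BATCH_BASE_ORDER and
LRU_BATCH_NR_PAGES_MAX are made-up names for illustration):

  /* Assumed base order of large folios; 2 means 4-page folios. */
  #define LRU_BATCH_BASE_ORDER	2
  /* N = 1 << LRU_BATCH_BASE_ORDER, so the budget is PAGEVEC_SIZE * N. */
  #define LRU_BATCH_NR_PAGES_MAX	(PAGEVEC_SIZE << LRU_BATCH_BASE_ORDER)

  static inline unsigned folio_batch_add(struct folio_batch *fbatch,
  		struct folio *folio)
  {
  	fbatch->folios[fbatch->nr++] = folio;
  	fbatch->pages_nr += folio_nr_pages(folio);

  	/* Drain when the folio slots or the page budget run out. */
  	if (fbatch->pages_nr > LRU_BATCH_NR_PAGES_MAX)
  		return 0;
  	return fbatch_space(fbatch);
  }

With base order 2 this caps a batch at 60 pages while still letting up
to 15 order-2 folios accumulate before draining.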
Just my 2 cents.
Best Regards,
Huang, Ying