* [PATCH 0/1] mm: improve folio refcount scalability
@ 2026-02-26 16:27 Gladyshev Ilya
2026-02-26 16:27 ` [PATCH 1/1] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
0 siblings, 2 replies; 7+ messages in thread
From: Gladyshev Ilya @ 2026-02-26 16:27 UTC (permalink / raw)
To: Ilya Gladyshev
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
Kiryl Shutsemau
This patch was previously posted as an RFC and received positive but
limited feedback, so I decided to fix the remaining drawbacks and repost
it as a non-RFC patch. The overall logic, as well as the performance,
remains the same.

Intro
=====
This patch optimizes small-file read performance and overall folio
refcount scalability by refactoring page_ref_add_unless() (the core of
folio_try_get()). It is an alternative to previous attempts to fix
small-read performance by avoiding refcount bumps altogether [1][2].

Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop for increments to avoid
temporarily unlocking the counter in the try_get functions. These CAS
loops become a serialization point for the otherwise scalable and fast
read side.
The proposed implementation separates the "locked" logic from the
counting, allowing the use of an optimistic fetch_add() instead of CAS.
For more details, please refer to the commit message of the patch
itself.

The proposed logic maintains the same public API as before, including
all existing memory barrier guarantees.

Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:
``
	char buf[64];
	int fd = open(/* same file in tmpfs */, O_RDONLY);

	while (true) {
		pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
	}
``
While this is a synthetic load, it does highlight an existing issue and
does not differ much from the benchmarking done for [2].
The benchmark measures operations per second in the inner loop,
aggregated across all workers. Performance was tested on top of the
v6.15 kernel[4] on two platforms. Since threads and processes showed
similar performance on both systems, only the thread results are given
below. The performance improvement scales linearly between the CPU
counts shown.

Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
#threads | vanilla | patched | boost (%)
1 | 1343381 | 1344401 | +0.1
2 | 2186160 | 2455837 | +12.3
5 | 5277092 | 6108030 | +15.7
10 | 5858123 | 7506328 | +28.1
12 | 6484445 | 8137706 | +25.5
/* Cross socket NUMA */
14 | 3145860 | 4247391 | +35.0
16 | 2350840 | 4262707 | +81.3
18 | 2378825 | 4121415 | +73.2
20 | 2438475 | 4683548 | +92.1
24 | 2325998 | 4529737 | +94.7
Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
#threads | vanilla | patched | boost (%)
1 | 1077276 | 1081653 | +0.4
5 | 4286838 | 4682513 | +9.2
10 | 1698095 | 1902753 | +12.1
20 | 1662266 | 1921603 | +15.6
49 | 1486745 | 1828926 | +23.0
97 | 1617365 | 2052635 | +26.9
/* Cross socket NUMA */
105 | 1368319 | 1798862 | +31.5
136 | 1008071 | 1393055 | +38.2
168 | 879332 | 1245210 | +41.6
/* SMT */
193 | 905432 | 1294833 | +43.0
289 | 851988 | 1313110 | +54.1
353 | 771288 | 1347165 | +74.7
[1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3] https://github.com/antonblanchard/will-it-scale
[4] There were no changes to page_ref.h between v6.15 and v6.18, nor any
significant performance changes on the read side in mm/filemap.c.
---
Changes since RFC:
- Drop refactoring patch (sent separately)
- Replace single CAS with CAS loop in failure path to improve
robustness
Based on a quick re-evaluation, this didn't affect performance (only
cold code changed), so I kept the RFC results.
Link to RFC: https://lore.foxido.dev/linux-mm/cover.1766145604.git.gladyshev.ilya1@h-partners.com
---
Gladyshev Ilya (1):
mm: implement page refcount locking via dedicated bit
include/linux/page-flags.h | 5 ++++-
include/linux/page_ref.h | 28 ++++++++++++++++++++++++----
2 files changed, 28 insertions(+), 5 deletions(-)
--
2.43.0
^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/1] mm: implement page refcount locking via dedicated bit
  2026-02-26 16:27 [PATCH 0/1] mm: improve folio refcount scalability Gladyshev Ilya
@ 2026-02-26 16:27 ` Gladyshev Ilya
  2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
  1 sibling, 0 replies; 7+ messages in thread
From: Gladyshev Ilya @ 2026-02-26 16:27 UTC (permalink / raw)
  To: Ilya Gladyshev
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
	Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
	Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
	Kiryl Shutsemau

The current atomic-based page refcount implementation treats a zero
counter as dead and requires a compare-and-swap loop in folio_try_get()
to prevent incrementing a dead refcount. This CAS loop acts as a
serialization point and can become a significant bottleneck during
high-frequency file read operations.

This patch introduces PAGEREF_LOCKED_BIT to distinguish between a
(temporary) zero refcount and a locked (dead/frozen) state. Because
incrementing the counter no longer affects its locked/unlocked state, it
is possible to use an optimistic atomic_add_return() in
page_ref_add_unless_zero() that operates independently of the locked
bit. The locked state is handled after the increment attempt,
eliminating the need for the CAS loop.

If the locked state is detected after atomic_add(), the pageref counter
is reset using a CAS loop, eliminating the theoretical possibility of
overflow.
Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
---
 include/linux/page-flags.h |  5 ++++-
 include/linux/page_ref.h   | 28 ++++++++++++++++++++++++----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7c2195baf4c1..f2a9302104eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -196,6 +196,9 @@ enum pageflags {
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
 
+/* Most significant bit in page refcount */
+#define PAGEREF_LOCKED_BIT	(1 << 31)
+
 #ifndef __GENERATING_BOUNDS_H
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
@@ -257,7 +260,7 @@ static __always_inline bool page_count_writable(const struct page *page)
 	 * The refcount check also prevents modification attempts to other (r/o)
 	 * tail pages that are not fake heads.
 	 */
-	if (!atomic_read_acquire(&page->_refcount))
+	if (atomic_read_acquire(&page->_refcount) & PAGEREF_LOCKED_BIT)
 		return false;
 
 	return page_fixed_fake_head(page) == page;
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b0e3f4a4b4b8..f2f2775af4bb 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -64,7 +64,12 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 static inline int page_ref_count(const struct page *page)
 {
-	return atomic_read(&page->_refcount);
+	int val = atomic_read(&page->_refcount);
+
+	if (unlikely(val & PAGEREF_LOCKED_BIT))
+		return 0;
+
+	return val;
 }
 
 /**
@@ -176,6 +181,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
 {
 	int ret = atomic_sub_and_test(nr, &page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -nr, ret);
 	return ret;
@@ -204,6 +212,9 @@ static inline int page_ref_dec_and_test(struct page *page)
 {
 	int ret = atomic_dec_and_test(&page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -1, ret);
 	return ret;
@@ -228,14 +239,23 @@ static inline int folio_ref_dec_return(struct folio *folio)
 	return page_ref_dec_return(&folio->page);
 }
 
+#define _PAGEREF_LOCKED_LIMIT	((1 << 30) | PAGEREF_LOCKED_BIT)
+
 static inline bool page_ref_add_unless_zero(struct page *page, int nr)
 {
 	bool ret = false;
+	int val;
 
 	rcu_read_lock();
 	/* avoid writing to the vmemmap area being remapped */
-	if (page_count_writable(page))
-		ret = atomic_add_unless(&page->_refcount, nr, 0);
+	if (page_count_writable(page)) {
+		val = atomic_add_return(nr, &page->_refcount);
+		ret = !(val & PAGEREF_LOCKED_BIT);
+
+		/* Undo atomic_add() if counter is locked and scary big */
+		while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
+			val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
+	}
 	rcu_read_unlock();
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
@@ -271,7 +291,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
 
 static inline int page_ref_freeze(struct page *page, int count)
 {
-	int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+	int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_LOCKED_BIT) == count);
 
 	if (page_ref_tracepoint_active(page_ref_freeze))
 		__page_ref_freeze(page, count, ret);
-- 
2.43.0

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] mm: improve folio refcount scalability
  2026-02-26 16:27 [PATCH 0/1] mm: improve folio refcount scalability Gladyshev Ilya
  2026-02-26 16:27 ` [PATCH 1/1] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
@ 2026-02-28 22:19 ` Andrew Morton
  2026-03-01  3:27   ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2026-02-28 22:19 UTC (permalink / raw)
  To: Gladyshev Ilya
  Cc: David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Harry Yoo, Matthew Wilcox, Yu Zhao,
	Baolin Wang, Alistair Popple, Gorbunov Ivan, Muchun Song,
	linux-mm, linux-kernel, Kiryl Shutsemau, Dave Chinner,
	Linus Torvalds

On Thu, 26 Feb 2026 16:27:22 +0000 Gladyshev Ilya <gladyshev.ilya1@h-partners.com> wrote:

> This patch was previously posted as an RFC and received positive but
> limited feedback, so I decided to fix the remaining drawbacks and repost
> it as a non-RFC patch. The overall logic, as well as the performance,
> remains the same.
>
> Intro
> =====
> This patch optimizes small-file read performance and overall folio
> refcount scalability by refactoring page_ref_add_unless() (the core of
> folio_try_get()). It is an alternative to previous attempts to fix
> small-read performance by avoiding refcount bumps altogether [1][2].
>
> Overview
> ========
> The current refcount implementation uses a zero counter as the locked
> (dead/frozen) state, which requires a CAS loop for increments to avoid
> temporarily unlocking the counter in the try_get functions. These CAS
> loops become a serialization point for the otherwise scalable and fast
> read side.
>
> The proposed implementation separates the "locked" logic from the
> counting, allowing the use of an optimistic fetch_add() instead of CAS.
> For more details, please refer to the commit message of the patch
> itself.
>
> The proposed logic maintains the same public API as before, including
> all existing memory barrier guarantees.
>
> Performance
> ===========
> Performance was measured using a simple custom benchmark based on
> will-it-scale[3]. This benchmark spawns N pinned threads/processes that
> execute the following loop:
>
> ``
> 	char buf[64];
> 	int fd = open(/* same file in tmpfs */, O_RDONLY);
>
> 	while (true) {
> 		pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
> 	}
> ``
>
> While this is a synthetic load, it does highlight an existing issue and
> does not differ much from the benchmarking done for [2].

Well it's nice to see the performance benefits from Kiryl's ill-fated
patch
(https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/)

And this approach looks far simpler.

I'll paste the single patch below for others - I think it's not
desirable to prepare a [0/N] for a single-patch "series"!

Thanks, I'll await reviewer feedback for a couple of days then I'll
look at adding this to linux-next for some runtime testing.

> The benchmark measures operations per second in the inner loop,
> aggregated across all workers. Performance was tested on top of the
> v6.15 kernel[4] on two platforms. Since threads and processes showed
> similar performance on both systems, only the thread results are given
> below. The performance improvement scales linearly between the CPU
> counts shown.
>
> Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
>
> #threads | vanilla | patched | boost (%)
> 1        | 1343381 | 1344401 | +0.1
> 2        | 2186160 | 2455837 | +12.3
> 5        | 5277092 | 6108030 | +15.7
> 10       | 5858123 | 7506328 | +28.1
> 12       | 6484445 | 8137706 | +25.5
> /* Cross socket NUMA */
> 14       | 3145860 | 4247391 | +35.0
> 16       | 2350840 | 4262707 | +81.3
> 18       | 2378825 | 4121415 | +73.2
> 20       | 2438475 | 4683548 | +92.1
> 24       | 2325998 | 4529737 | +94.7
>
> Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
>
> #threads | vanilla | patched | boost (%)
> 1        | 1077276 | 1081653 | +0.4
> 5        | 4286838 | 4682513 | +9.2
> 10       | 1698095 | 1902753 | +12.1
> 20       | 1662266 | 1921603 | +15.6
> 49       | 1486745 | 1828926 | +23.0
> 97       | 1617365 | 2052635 | +26.9
> /* Cross socket NUMA */
> 105      | 1368319 | 1798862 | +31.5
> 136      | 1008071 | 1393055 | +38.2
> 168      |  879332 | 1245210 | +41.6
> /* SMT */
> 193      |  905432 | 1294833 | +43.0
> 289      |  851988 | 1313110 | +54.1
> 353      |  771288 | 1347165 | +74.7
>
> [1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
> [3] https://github.com/antonblanchard/will-it-scale
> [4] There were no changes to page_ref.h between v6.15 and v6.18, nor any
> significant performance changes on the read side in mm/filemap.c.
>
> The current atomic-based page refcount implementation treats a zero
> counter as dead and requires a compare-and-swap loop in folio_try_get()
> to prevent incrementing a dead refcount. This CAS loop acts as a
> serialization point and can become a significant bottleneck during
> high-frequency file read operations.
>
> This patch introduces PAGEREF_LOCKED_BIT to distinguish between a
> (temporary) zero refcount and a locked (dead/frozen) state. Because
> incrementing the counter no longer affects its locked/unlocked state, it
> is possible to use an optimistic atomic_add_return() in
> page_ref_add_unless_zero() that operates independently of the locked
> bit. The locked state is handled after the increment attempt,
> eliminating the need for the CAS loop.
>
> If the locked state is detected after atomic_add(), the pageref counter
> is reset using a CAS loop, eliminating the theoretical possibility of
> overflow.
>
> Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
> Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
> Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
> ---
>  include/linux/page-flags.h |  5 ++++-
>  include/linux/page_ref.h   | 28 ++++++++++++++++++++++++----
>  2 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7c2195baf4c1..f2a9302104eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -196,6 +196,9 @@ enum pageflags {
>  
>  #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
>  
> +/* Most significant bit in page refcount */
> +#define PAGEREF_LOCKED_BIT	(1 << 31)
> +
>  #ifndef __GENERATING_BOUNDS_H
>  
>  #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> @@ -257,7 +260,7 @@ static __always_inline bool page_count_writable(const struct page *page)
>  	 * The refcount check also prevents modification attempts to other (r/o)
>  	 * tail pages that are not fake heads.
>  	 */
> -	if (!atomic_read_acquire(&page->_refcount))
> +	if (atomic_read_acquire(&page->_refcount) & PAGEREF_LOCKED_BIT)
>  		return false;
>  
>  	return page_fixed_fake_head(page) == page;
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index b0e3f4a4b4b8..f2f2775af4bb 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -64,7 +64,12 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
>  
>  static inline int page_ref_count(const struct page *page)
>  {
> -	return atomic_read(&page->_refcount);
> +	int val = atomic_read(&page->_refcount);
> +
> +	if (unlikely(val & PAGEREF_LOCKED_BIT))
> +		return 0;
> +
> +	return val;
>  }
>  
>  /**
> @@ -176,6 +181,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
>  {
>  	int ret = atomic_sub_and_test(nr, &page->_refcount);
>  
> +	if (ret)
> +		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
> +
>  	if (page_ref_tracepoint_active(page_ref_mod_and_test))
>  		__page_ref_mod_and_test(page, -nr, ret);
>  	return ret;
> @@ -204,6 +212,9 @@ static inline int page_ref_dec_and_test(struct page *page)
>  {
>  	int ret = atomic_dec_and_test(&page->_refcount);
>  
> +	if (ret)
> +		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
> +
>  	if (page_ref_tracepoint_active(page_ref_mod_and_test))
>  		__page_ref_mod_and_test(page, -1, ret);
>  	return ret;
> @@ -228,14 +239,23 @@ static inline int folio_ref_dec_return(struct folio *folio)
>  	return page_ref_dec_return(&folio->page);
>  }
>  
> +#define _PAGEREF_LOCKED_LIMIT	((1 << 30) | PAGEREF_LOCKED_BIT)
> +
>  static inline bool page_ref_add_unless_zero(struct page *page, int nr)
>  {
>  	bool ret = false;
> +	int val;
>  
>  	rcu_read_lock();
>  	/* avoid writing to the vmemmap area being remapped */
> -	if (page_count_writable(page))
> -		ret = atomic_add_unless(&page->_refcount, nr, 0);
> +	if (page_count_writable(page)) {
> +		val = atomic_add_return(nr, &page->_refcount);
> +		ret = !(val & PAGEREF_LOCKED_BIT);
> +
> +		/* Undo atomic_add() if counter is locked and scary big */
> +		while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
> +			val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
> +	}
>  	rcu_read_unlock();
>  
>  	if (page_ref_tracepoint_active(page_ref_mod_unless))
> @@ -271,7 +291,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
>  
>  static inline int page_ref_freeze(struct page *page, int count)
>  {
> -	int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
> +	int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_LOCKED_BIT) == count);
>  
>  	if (page_ref_tracepoint_active(page_ref_freeze))
>  		__page_ref_freeze(page, count, ret);
> --

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] mm: improve folio refcount scalability
  2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
@ 2026-03-01  3:27   ` Linus Torvalds
  2026-03-01 18:52     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2026-03-01 3:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gladyshev Ilya, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
	Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
	Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
	Kiryl Shutsemau, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2136 bytes --]

On Sat, 28 Feb 2026 at 14:19, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Well it's nice to see the performance benefits from Kiryl's ill-fated
> patch
> (https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/)
>
> And this approach looks far simpler.

This attempt does something completely different, in that it doesn't
actually remove any atomics at all. Quite the opposite, in fact. It
adds *new* atomics - just in a different place.

But if it helps performance, that is certainly still interesting. It's
basically saying that it's not the atomic op itself that is so
expensive, it's literally just the "read + cmpxchg" in
atomic_add_unless() that makes for most of the expense.

And that, in turn, is probably due to the fact that the read in that
loop first tries to make the cacheline shared, and then the cmpxchg
has to turn the shared cacheline exclusive, so you have two cache ops
- and possibly then many more because of bouncing due to this all.

Fine, I can believe that. But if it's purely about the cacheline
shared/exclusive behavior, I think there's a much simpler patch.

That much simpler patch is something we've done before: do *not* read
the old value before the cmpxchg loop. Do the cmpxchg with a default
value, and if we guessed wrong, just do the extra loop iteration.

This attached patch is ENTIRELY UNTESTED. I might have gotten
something wrong. A quick look at the assembler seems to say it
generates the expected code (gcc is not great at this), with the loop
being

	mov    $0x1,%eax
	lea    0x34(%rdi),%rdx
	lea    0x1(%rax),%ecx
	lock cmpxchg %ecx,(%rdx)
	...

ie the first time through we just assume the count is one.

And yes, that assumption may be wrong, but at least we don't go
through the shared state, and since we got the cacheline exclusive the
first time around the loop, the second time around we will get it
right.

What do the numbers look like with this much simpler patch?

(All assuming I didn't screw some logic up and get some conditional
the wrong way around - please check me).

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 855 bytes --]

 include/linux/page_ref.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..ed3f262aa7f1 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -234,8 +234,18 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 
 	rcu_read_lock();
 	/* avoid writing to the vmemmap area being remapped */
-	if (page_count_writable(page, u))
-		ret = atomic_add_unless(&page->_refcount, nr, u);
+	if (page_count_writable(page, u)) {
+		/* Assume count == 1, don't read it! */
+		int old = 1;
+		for (;;) {
+			if (atomic_try_cmpxchg(&page->_refcount, &old, old+1)) {
+				ret = true;
+				break;
+			}
+			if (unlikely(!old))
+				break;
+		}
+	}
 	rcu_read_unlock();
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] mm: improve folio refcount scalability
  2026-03-01  3:27 ` Linus Torvalds
@ 2026-03-01 18:52   ` Linus Torvalds
  2026-03-01 20:26     ` Pedro Falcato
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2026-03-01 18:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gladyshev Ilya, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
	Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
	Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
	Kiryl Shutsemau, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 922 bytes --]

On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> This attached patch is ENTIRELY UNTESTED.

Here's a slightly cleaned up and further simplified version, which is
also actually tested, although only in the "it boots for me" sense.

It generates good code at least with clang:

	.LBB76_7:
		movl	$1, %eax
	.LBB76_8:
		leal	1(%rax), %ecx
		lock	cmpxchgl	%ecx, 52(%rdi)
		sete	%cl
		je	.LBB76_10
		testl	%eax, %eax
		jne	.LBB76_8
	.LBB76_10:

which actually looks both simple and fairly optimal for that sequence.

Of course, since this is very much about cacheline access patterns,
actual performance will depend on random microarchitectural issues
(and not just the CPU core, but the whole memory subsystem).

Can somebody with a good - and relevant - benchmark system try this out?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 820 bytes --]

 include/linux/page_ref.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..d8e4f175f74c 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -234,8 +234,15 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 
 	rcu_read_lock();
 	/* avoid writing to the vmemmap area being remapped */
-	if (page_count_writable(page, u))
-		ret = atomic_add_unless(&page->_refcount, nr, u);
+	if (page_count_writable(page, u)) {
+		/* Assume count == 1, don't read it! */
+		int old = 1;
+		do {
+			ret = atomic_try_cmpxchg(&page->_refcount, &old, old+1);
+			if (likely(ret))
+				break;
+		} while (old);
+	}
 	rcu_read_unlock();
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] mm: improve folio refcount scalability
  2026-03-01 18:52 ` Linus Torvalds
@ 2026-03-01 20:26   ` Pedro Falcato
  2026-03-01 21:16     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Pedro Falcato @ 2026-03-01 20:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Gladyshev Ilya, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
	Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
	Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
	Kiryl Shutsemau, Dave Chinner

On Sun, Mar 01, 2026 at 10:52:57AM -0800, Linus Torvalds wrote:
> On Sat, 28 Feb 2026 at 19:27, Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> >
> > This attached patch is ENTIRELY UNTESTED.
>
> Here's a slightly cleaned up and further simplified version, which is
> also actually tested, although only in the "it boots for me" sense.
>
> It generates good code at least with clang:
>
> 	.LBB76_7:
> 		movl	$1, %eax
> 	.LBB76_8:
> 		leal	1(%rax), %ecx
> 		lock	cmpxchgl	%ecx, 52(%rdi)
> 		sete	%cl
> 		je	.LBB76_10
> 		testl	%eax, %eax
> 		jne	.LBB76_8
> 	.LBB76_10:
>
> which actually looks both simple and fairly optimal for that sequence.
>
> Of course, since this is very much about cacheline access patterns,
> actual performance will depend on random microarchitectural issues
> (and not just the CPU core, but the whole memory subsystem).
>
> Can somebody with a good - and relevant - benchmark system try this out?
>
> Linus

Here are some perhaps interesting numbers from an extremely synthetic
benchmark[1] I wrote just now:

note: xadd_bench is lock addl, cmpxchg_bench is the typical load +
lock cmpxchg loop, and optimistic_cmpxchg_bench is similar to what you
wrote, where we assume 1 and only later do the actual loop. I also
don't claim this is representative of page cache performance, but this
is quite a lot simpler to set up and play around with.

On my Zen 4 AMD Ryzen 7 PRO 7840U laptop:

------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
xadd_bench/threads:1                      2.76 ns         2.76 ns    250435782
xadd_bench/threads:4                      42.1 ns         42.1 ns     15969296
xadd_bench/threads:8                      84.8 ns         84.8 ns      8920800
xadd_bench/threads:16                      226 ns          211 ns      2446928
cmpxchg_bench/threads:1                   3.12 ns         3.12 ns    220339301
cmpxchg_bench/threads:4                   51.1 ns         51.1 ns     12372808
cmpxchg_bench/threads:8                    112 ns          112 ns      6228056
cmpxchg_bench/threads:16                   679 ns          648 ns       930832
optimistic_cmpxchg_bench/threads:1        2.95 ns         2.95 ns    233704391
optimistic_cmpxchg_bench/threads:4        56.2 ns         56.2 ns     11780588
optimistic_cmpxchg_bench/threads:8         140 ns          140 ns      4606440
optimistic_cmpxchg_bench/threads:16        789 ns          746 ns       806400

Here we can see that the optimistic cmpxchg still can't match the
xadd/lock addl performance single-threaded, and degrades quickly - and
worse than straight-up cmpxchg - under load (presumably because of the
cmpxchg miss).

On our internal large 160-core Intel(R) Xeon(R) CPU E7-8891 v4 (older
uarch, sad) machine:

------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
xadd_bench/threads:1                      13.6 ns         13.6 ns     51445934
xadd_bench/threads:4                      41.4 ns          166 ns      4211940
xadd_bench/threads:8                      30.3 ns          242 ns      2190488
xadd_bench/threads:16                     37.3 ns          596 ns      1162336
xadd_bench/threads:64                     24.9 ns         1376 ns       640000
xadd_bench/threads:128                    27.3 ns         3108 ns      1054592
cmpxchg_bench/threads:1                   17.9 ns         17.9 ns     38992029
cmpxchg_bench/threads:4                   54.8 ns          219 ns      3431076
cmpxchg_bench/threads:8                   39.0 ns          312 ns      1698712
cmpxchg_bench/threads:16                  62.2 ns          994 ns       530672
cmpxchg_bench/threads:64                  28.5 ns         1479 ns       665280
cmpxchg_bench/threads:128                 17.2 ns         1838 ns       517376
optimistic_cmpxchg_bench/threads:1        13.6 ns         13.6 ns     51384286
optimistic_cmpxchg_bench/threads:4        70.2 ns          281 ns      2585092
optimistic_cmpxchg_bench/threads:8        58.1 ns          465 ns      1598592
optimistic_cmpxchg_bench/threads:16        106 ns         1694 ns       420832
optimistic_cmpxchg_bench/threads:64       30.8 ns         1767 ns       499264
optimistic_cmpxchg_bench/threads:128      39.3 ns         4632 ns       447104

Here, optimistic seems to match xadd single-threaded, but then very
quickly degrades. In general, optimistic_cmpxchg seems to degrade worse
than cmpxchg, but there is a lot of variance here (and other users
lightly using the machine), so results (particularly those with higher
thread counts) should be taken with a grain of salt (for example, lock
add scaling drastically worse than cmpxchg seems to be a fluke).

TL;DR: I don't think the idea quite works, particularly when a folio is
under contention, because if you have traffic on a cacheline then you
certainly have a couple of threads trying to grab a refcount. And doing
two cmpxchgs just increases traffic and pessimises things.

Also perhaps worth noting that neither solution scales in any way.
[1] https://gist.github.com/heatd/2a6e6c778c3cfd4aa6804b2d598c7a4c
(excuse my C++)

-- 
Pedro

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 0/1] mm: improve folio refcount scalability
  2026-03-01 20:26 ` Pedro Falcato
@ 2026-03-01 21:16   ` Linus Torvalds
  0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2026-03-01 21:16 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Gladyshev Ilya, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Harry Yoo,
	Matthew Wilcox, Yu Zhao, Baolin Wang, Alistair Popple,
	Gorbunov Ivan, Muchun Song, linux-mm, linux-kernel,
	Kiryl Shutsemau, Dave Chinner

On Sun, 1 Mar 2026 at 12:26, Pedro Falcato <pfalcato@suse.de> wrote:
>
> Here we can see that the optimistic cmpxchg still can't match the
> xadd/lock addl performance single-threaded, and degrades quickly - and
> worse than straight-up cmpxchg - under load (presumably because of the
> cmpxchg miss).

Ok, thanks for doing the numbers.

I'm (obviously) a bit surprised at how badly cmpxchg does - it used to
be noticeably worse than "lock add" even for the non-contention case,
but I thought that had long since been fixed.

Clearly that's just not the case - and I had just been overly
optimistic that the "first cmpxchg failed, but second one gets the
value without losing the cacheline in between" would work reliably.

Ho humm. Maybe that "locked" flag is the best we can do.

                Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-03-01 21:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26 16:27 [PATCH 0/1] mm: improve folio refcount scalability Gladyshev Ilya
2026-02-26 16:27 ` [PATCH 1/1] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
2026-03-01  3:27   ` Linus Torvalds
2026-03-01 18:52     ` Linus Torvalds
2026-03-01 20:26       ` Pedro Falcato
2026-03-01 21:16         ` Linus Torvalds