* [RFC PATCH 0/2] mm: improve folio refcount scalability
@ 2025-12-19 12:46 Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 1/2] mm: make ref_unless functions unless_zero only Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
0 siblings, 2 replies; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-19 12:46 UTC (permalink / raw)
To: patchwork
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy,
gladyshev.ilya1, yuzhao, baolin.wang, muchun.song, linux-mm,
linux-kernel
Intro
=====
This patch series optimizes small file read performance and overall folio
refcount scalability by refactoring page_ref_add_unless() [the core of
folio_try_get()]. It is an alternative to previous attempts to fix small
read performance by avoiding refcount bumps [1][2].
Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop on the increment side to avoid
temporarily unlocking the counter in the try_get functions. These CAS loops
become a serialization point for an otherwise scalable and fast read side.
The proposed implementation separates the "locked" logic from the counting,
allowing the use of an optimistic fetch_add() instead of a CAS. For more
details, please refer to the commit message of the patch itself.
The proposed logic maintains the same public API as before, including all
existing memory barrier guarantees.
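For illustration, here is a minimal sketch of the idea (simplified for the
cover letter; the names follow the patch, but this is not the exact patch
code):
``
#define PAGEREF_LOCKED_BIT (1 << 31)

/* try_get with a dedicated locked bit instead of zero-means-dead. */
static bool sketch_try_get(atomic_t *refcount)
{
	/* Optimistic bump: the add does not disturb the locked bit. */
	int val = atomic_add_return(1, refcount);

	if (likely(!(val & PAGEREF_LOCKED_BIT)))
		return true;	/* counter was live, reference taken */

	/* Locked (frozen/dead): best-effort undo of our bump. */
	atomic_cmpxchg_relaxed(refcount, val, PAGEREF_LOCKED_BIT);
	return false;
}
``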
Drawbacks
=========
In theory, an optimistic fetch_add can overflow the atomic_t and reset the
locked state. Currently, this is mitigated via a single CAS operation after
the "failed" fetch_add, which tries to reset the counter to a locked zero.
While this best-effort approach doesn't provide a hard guarantee, it is
unrealistic that there would be 2^31 highly contended try_get calls on a
locked folio with the CAS operation failing in every single one of them.
If this guarantee isn't sufficient, it can be improved by performing a full
CAS loop when the counter is approaching overflow.
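To make that stricter variant concrete, here is a hedged sketch of how it
could look (an assumption, not part of this series): fall back to an exact
CAS path only once the counter approaches the locked bit, and keep the
optimistic increment otherwise. PAGEREF_OVERFLOW_GUARD is an assumed
threshold, not an existing kernel symbol.
``
#define PAGEREF_OVERFLOW_GUARD	(1 << 30)

static bool sketch_try_get_guarded(atomic_t *refcount)
{
	int old = atomic_read(refcount);
	int val;

	/* Slow, exact path only when the counter approaches overflow. */
	while (unlikely(old & PAGEREF_OVERFLOW_GUARD)) {
		if (old & PAGEREF_LOCKED_BIT)
			return false;
		if (atomic_try_cmpxchg(refcount, &old, old + 1))
			return true;
	}

	/* Common case: optimistic increment, as in the patch. */
	val = atomic_add_return(1, refcount);
	if (likely(!(val & PAGEREF_LOCKED_BIT)))
		return true;

	/* Best-effort reset back to the locked state. */
	atomic_cmpxchg_relaxed(refcount, val, PAGEREF_LOCKED_BIT);
	return false;
}
``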
Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:
``
char buf[64];
int fd = open(/* same file in tmpfs */);

while (1) {
	pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
}
``
While this is a synthetic load, it highlights an existing issue and does not
differ much from the benchmarking done for [2].
The benchmark measures operations per second of the inner loop; the results
below are aggregated across all workers. Performance was tested on top of the
v6.15 kernel [4] on two platforms. Since threads and processes showed similar
performance on both systems, only the thread results are provided below. The
performance improvement scales linearly between the CPU counts shown.
Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
#threads | vanilla | patched | boost (%)
1 | 1343381 | 1344401 | +0.1
2 | 2186160 | 2455837 | +12.3
5 | 5277092 | 6108030 | +15.7
10 | 5858123 | 7506328 | +28.1
12 | 6484445 | 8137706 | +25.5
/* Cross socket NUMA */
14 | 3145860 | 4247391 | +35.0
16 | 2350840 | 4262707 | +81.3
18 | 2378825 | 4121415 | +73.2
20 | 2438475 | 4683548 | +92.1
24 | 2325998 | 4529737 | +94.7
Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
#threads | vanilla | patched | boost (%)
1 | 1077276 | 1081653 | +0.4
5 | 4286838 | 4682513 | +9.2
10 | 1698095 | 1902753 | +12.1
20 | 1662266 | 1921603 | +15.6
49 | 1486745 | 1828926 | +23.0
97 | 1617365 | 2052635 | +26.9
/* Cross socket NUMA */
105 | 1368319 | 1798862 | +31.5
136 | 1008071 | 1393055 | +38.2
168 | 879332 | 1245210 | +41.6
/* SMT */
193 | 905432 | 1294833 | +43.0
289 | 851988 | 1313110 | +54.1
353 | 771288 | 1347165 | +74.7
[1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3] https://github.com/antonblanchard/will-it-scale
[4] There were no changes to page_ref.h between v6.15 and v6.18, nor any
significant performance-relevant changes on the read side in mm/filemap.c
Gladyshev Ilya (2):
mm: make ref_unless functions unless_zero only
mm: implement page refcount locking via dedicated bit
include/linux/mm.h | 2 +-
include/linux/page-flags.h | 9 ++++++---
include/linux/page_ref.h | 35 ++++++++++++++++++++++++++---------
3 files changed, 33 insertions(+), 13 deletions(-)
--
2.43.0
* [RFC PATCH 1/2] mm: make ref_unless functions unless_zero only
2025-12-19 12:46 [RFC PATCH 0/2] mm: improve folio refcount scalability Gladyshev Ilya
@ 2025-12-19 12:46 ` Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
1 sibling, 0 replies; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-19 12:46 UTC (permalink / raw)
To: patchwork
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy,
gladyshev.ilya1, yuzhao, baolin.wang, muchun.song, linux-mm,
linux-kernel
There are no users of (folio/page)_ref_add_unless(page, nr, u) with
u != 0 [1], and all current users are "internal" to the page refcounting API.
This allows us to safely drop this parameter and reduce the function
semantics to the "unless zero" case only, which will be optimized in
the following patch.
If needed, the u != 0 variants can be trivially reintroduced later using the
same atomic_add_unless() operation as before.
[1]: The last user was dropped in v5.18 kernel, commit 27674ef6c73f
("mm: remove the extra ZONE_DEVICE struct page refcount"). There is no
trace of discussion as to why this cleanup wasn't done earlier.
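Should such a user ever reappear, a sketch of the reintroduced helper could
look like this (illustrative only; the vmemmap-writability and tracepoint
handling of the removed helper are omitted here):

	static inline bool page_ref_add_unless(struct page *page, int nr, int u)
	{
		/* Add nr to the refcount unless it currently equals u. */
		return atomic_add_unless(&page->_refcount, nr, u);
	}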
Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
---
include/linux/mm.h | 2 +-
include/linux/page-flags.h | 6 +++---
include/linux/page_ref.h | 14 +++++++-------
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c79b3369b82..f652426cc218 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1115,7 +1115,7 @@ static inline int folio_put_testzero(struct folio *folio)
*/
static inline bool get_page_unless_zero(struct page *page)
{
- return page_ref_add_unless(page, 1, 0);
+ return page_ref_add_unless_zero(page, 1);
}
static inline struct folio *folio_get_nontail_page(struct page *page)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0091ad1986bf..7c2195baf4c1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -231,7 +231,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
return page;
}
-static __always_inline bool page_count_writable(const struct page *page, int u)
+static __always_inline bool page_count_writable(const struct page *page)
{
if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
return true;
@@ -257,7 +257,7 @@ static __always_inline bool page_count_writable(const struct page *page, int u)
* The refcount check also prevents modification attempts to other (r/o)
* tail pages that are not fake heads.
*/
- if (atomic_read_acquire(&page->_refcount) == u)
+ if (!atomic_read_acquire(&page->_refcount))
return false;
return page_fixed_fake_head(page) == page;
@@ -268,7 +268,7 @@ static inline const struct page *page_fixed_fake_head(const struct page *page)
return page;
}
-static inline bool page_count_writable(const struct page *page, int u)
+static inline bool page_count_writable(const struct page *page)
{
return true;
}
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..b0e3f4a4b4b8 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -228,14 +228,14 @@ static inline int folio_ref_dec_return(struct folio *folio)
return page_ref_dec_return(&folio->page);
}
-static inline bool page_ref_add_unless(struct page *page, int nr, int u)
+static inline bool page_ref_add_unless_zero(struct page *page, int nr)
{
bool ret = false;
rcu_read_lock();
/* avoid writing to the vmemmap area being remapped */
- if (page_count_writable(page, u))
- ret = atomic_add_unless(&page->_refcount, nr, u);
+ if (page_count_writable(page))
+ ret = atomic_add_unless(&page->_refcount, nr, 0);
rcu_read_unlock();
if (page_ref_tracepoint_active(page_ref_mod_unless))
@@ -243,9 +243,9 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
return ret;
}
-static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
+static inline bool folio_ref_add_unless_zero(struct folio *folio, int nr)
{
- return page_ref_add_unless(&folio->page, nr, u);
+ return page_ref_add_unless_zero(&folio->page, nr);
}
/**
@@ -261,12 +261,12 @@ static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u)
*/
static inline bool folio_try_get(struct folio *folio)
{
- return folio_ref_add_unless(folio, 1, 0);
+ return folio_ref_add_unless_zero(folio, 1);
}
static inline bool folio_ref_try_add(struct folio *folio, int count)
{
- return folio_ref_add_unless(folio, count, 0);
+ return folio_ref_add_unless_zero(folio, count);
}
static inline int page_ref_freeze(struct page *page, int count)
--
2.43.0
* [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 12:46 [RFC PATCH 0/2] mm: improve folio refcount scalability Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 1/2] mm: make ref_unless functions unless_zero only Gladyshev Ilya
@ 2025-12-19 12:46 ` Gladyshev Ilya
2025-12-19 14:50 ` Kiryl Shutsemau
2025-12-19 18:17 ` Gregory Price
1 sibling, 2 replies; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-19 12:46 UTC (permalink / raw)
To: patchwork
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy,
gladyshev.ilya1, yuzhao, baolin.wang, muchun.song, linux-mm,
linux-kernel
The current atomic-based page refcount implementation treats a zero
counter as dead and requires a compare-and-swap loop in folio_try_get()
to prevent incrementing a dead refcount. This CAS loop acts as a
serialization point and can become a significant bottleneck during
high-frequency file read operations.
This patch introduces FOLIO_LOCKED_BIT to distinguish between a
(temporary) zero refcount and a locked (dead/frozen) state. Because
incrementing the counter no longer affects its locked/unlocked state, it is
possible to use an optimistic atomic_fetch_add() in
page_ref_add_unless_zero() that operates independently of the locked bit.
The locked state is handled after the increment attempt, eliminating the
need for the CAS loop.
Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
---
include/linux/page-flags.h | 5 ++++-
include/linux/page_ref.h | 25 +++++++++++++++++++++----
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7c2195baf4c1..f2a9302104eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -196,6 +196,9 @@ enum pageflags {
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
+/* Most significant bit in page refcount */
+#define PAGEREF_LOCKED_BIT (1 << 31)
+
#ifndef __GENERATING_BOUNDS_H
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
@@ -257,7 +260,7 @@ static __always_inline bool page_count_writable(const struct page *page)
* The refcount check also prevents modification attempts to other (r/o)
* tail pages that are not fake heads.
*/
- if (!atomic_read_acquire(&page->_refcount))
+ if (atomic_read_acquire(&page->_refcount) & PAGEREF_LOCKED_BIT)
return false;
return page_fixed_fake_head(page) == page;
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b0e3f4a4b4b8..98717fd25306 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -64,7 +64,12 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
static inline int page_ref_count(const struct page *page)
{
- return atomic_read(&page->_refcount);
+ int val = atomic_read(&page->_refcount);
+
+ if (unlikely(val & PAGEREF_LOCKED_BIT))
+ return 0;
+
+ return val;
}
/**
@@ -176,6 +181,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
{
int ret = atomic_sub_and_test(nr, &page->_refcount);
+ if (ret)
+ ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -nr, ret);
return ret;
@@ -204,6 +212,9 @@ static inline int page_ref_dec_and_test(struct page *page)
{
int ret = atomic_dec_and_test(&page->_refcount);
+ if (ret)
+ ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -1, ret);
return ret;
@@ -231,11 +242,17 @@ static inline int folio_ref_dec_return(struct folio *folio)
static inline bool page_ref_add_unless_zero(struct page *page, int nr)
{
bool ret = false;
+ int val;
rcu_read_lock();
/* avoid writing to the vmemmap area being remapped */
- if (page_count_writable(page))
- ret = atomic_add_unless(&page->_refcount, nr, 0);
+ if (page_count_writable(page)) {
+ val = atomic_add_return(nr, &page->_refcount);
+ ret = !(val & PAGEREF_LOCKED_BIT);
+
+ if (unlikely(!ret))
+ atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
+ }
rcu_read_unlock();
if (page_ref_tracepoint_active(page_ref_mod_unless))
@@ -271,7 +288,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
static inline int page_ref_freeze(struct page *page, int count)
{
- int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+ int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_LOCKED_BIT) == count);
if (page_ref_tracepoint_active(page_ref_freeze))
__page_ref_freeze(page, count, ret);
--
2.43.0
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
@ 2025-12-19 14:50 ` Kiryl Shutsemau
2025-12-19 16:18 ` Gladyshev Ilya
2025-12-19 18:17 ` Gregory Price
1 sibling, 1 reply; 10+ messages in thread
From: Kiryl Shutsemau @ 2025-12-19 14:50 UTC (permalink / raw)
To: Gladyshev Ilya
Cc: patchwork, guohanjun, wangkefeng.wang, weiyongjun1, yusongping,
leijitang, artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> The current atomic-based page refcount implementation treats zero
> counter as dead and requires a compare-and-swap loop in folio_try_get()
> to prevent incrementing a dead refcount. This CAS loop acts as a
> serialization point and can become a significant bottleneck during
> high-frequency file read operations.
>
> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
> (temporary) zero refcount and a locked (dead/frozen) state. Because now
> incrementing counter doesn't affect it's locked/unlocked state, it is
> possible to use an optimistic atomic_fetch_add() in
> page_ref_add_unless_zero() that operates independently of the locked bit.
> The locked state is handled after the increment attempt, eliminating the
> need for the CAS loop.
I don't think I follow.
Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
page_ref_freeze(), but I don't think it does anything to serialize
against freeing the page under you.
Like, if the page is in the process of being freed, the page allocator sets
its refcount to zero and your version of page_ref_add_unless_zero()
successfully acquires a reference to the freed page.
How is it safe?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 14:50 ` Kiryl Shutsemau
@ 2025-12-19 16:18 ` Gladyshev Ilya
2025-12-19 17:46 ` Kiryl Shutsemau
0 siblings, 1 reply; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-19 16:18 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>> The current atomic-based page refcount implementation treats zero
>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>> to prevent incrementing a dead refcount. This CAS loop acts as a
>> serialization point and can become a significant bottleneck during
>> high-frequency file read operations.
>>
>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>
> s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
Ack, thanks
>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>> incrementing counter doesn't affect it's locked/unlocked state, it is
>> possible to use an optimistic atomic_fetch_add() in
>> page_ref_add_unless_zero() that operates independently of the locked bit.
>> The locked state is handled after the increment attempt, eliminating the
>> need for the CAS loop.
>
> I don't think I follow.
>
> Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> page_ref_freeze(), but I don't think it does anything to serialize
> against freeing the page under you.
>
> Like, if the page in the process of freeing, page allocator sets its
> refcount to zero and your version of page_ref_add_unless_zero()
> successfully acquirees reference for the freed page.
>
> How is it safe?
Page is freed only after a successful page_ref_dec_and_test() call,
which will set LOCKED_BIT. This bit will persist until set_page_count(1)
is called somewhere in the allocation path [alloc_pages()], and
effectively block any "use after free" users.
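To illustrate the point above, a simplified sketch of the free/alloc hand-off
(free_the_page() here is a stand-in for the real freeing path, not an actual
kernel function):

	if (page_ref_dec_and_test(page)) {
		/* 1 -> 0, then 0 -> LOCKED_BIT via the cmpxchg in the patch */
		free_the_page(page);	/* refcount stays at LOCKED_BIT */
	}

	/*
	 * Any concurrent page_ref_add_unless_zero() now observes LOCKED_BIT
	 * and fails; the bit is only cleared when the allocator
	 * reinitializes the page with set_page_count(page, 1).
	 */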
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 16:18 ` Gladyshev Ilya
@ 2025-12-19 17:46 ` Kiryl Shutsemau
2025-12-19 19:08 ` Gladyshev Ilya
0 siblings, 1 reply; 10+ messages in thread
From: Kiryl Shutsemau @ 2025-12-19 17:46 UTC (permalink / raw)
To: Gladyshev Ilya
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
> On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> > On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> > > The current atomic-based page refcount implementation treats zero
> > > counter as dead and requires a compare-and-swap loop in folio_try_get()
> > > to prevent incrementing a dead refcount. This CAS loop acts as a
> > > serialization point and can become a significant bottleneck during
> > > high-frequency file read operations.
> > >
> > > This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> >
> > s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
> Ack, thanks
>
> > > (temporary) zero refcount and a locked (dead/frozen) state. Because now
> > > incrementing counter doesn't affect it's locked/unlocked state, it is
> > > possible to use an optimistic atomic_fetch_add() in
> > > page_ref_add_unless_zero() that operates independently of the locked bit.
> > > The locked state is handled after the increment attempt, eliminating the
> > > need for the CAS loop.
> >
> > I don't think I follow.
> >
> > Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> > page_ref_freeze(), but I don't think it does anything to serialize
> > against freeing the page under you.
> >
> > Like, if the page in the process of freeing, page allocator sets its
> > refcount to zero and your version of page_ref_add_unless_zero()
> > successfully acquirees reference for the freed page.
> >
> > How is it safe?
>
> Page is freed only after a successful page_ref_dec_and_test() call, which
> will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
> somewhere in the allocation path [alloc_pages()], and effectively block any
> "use after free" users.
Okay, fair enough.
But what prevents the following scenario?

CPU0                                    CPU1
page_ref_dec_and_test()
  atomic_dec_and_test()  // refcount=0
                                        page_ref_add_unless_zero()
                                          atomic_add_return()  // refcount=1, no LOCKED_BIT
                                        page_ref_dec_and_test()
                                          atomic_dec_and_test()  // refcount=0
                                          atomic_cmpxchg(0, LOCKED_BIT)  // succeeds
  atomic_cmpxchg(0, LOCKED_BIT)  // fails
  // return false to caller
  // Use-after-free: BOOM!
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2025-12-19 14:50 ` Kiryl Shutsemau
@ 2025-12-19 18:17 ` Gregory Price
2025-12-22 12:42 ` Gladyshev Ilya
1 sibling, 1 reply; 10+ messages in thread
From: Gregory Price @ 2025-12-19 18:17 UTC (permalink / raw)
To: Gladyshev Ilya
Cc: patchwork, guohanjun, wangkefeng.wang, weiyongjun1, yusongping,
leijitang, artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> The current atomic-based page refcount implementation treats zero
> counter as dead and requires a compare-and-swap loop in folio_try_get()
> to prevent incrementing a dead refcount. This CAS loop acts as a
> serialization point and can become a significant bottleneck during
> high-frequency file read operations.
>
> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> (temporary) zero refcount and a locked (dead/frozen) state. Because now
> incrementing counter doesn't affect it's locked/unlocked state, it is
> possible to use an optimistic atomic_fetch_add() in
> page_ref_add_unless_zero() that operates independently of the locked bit.
> The locked state is handled after the increment attempt, eliminating the
> need for the CAS loop.
>
Such a fundamental change needs additional validation to show there are no
obvious failures. Have you run this through a model checker to verify that
the only failure condition is the 2^31 overflow condition you describe?
A single benchmark and a short changelog leave me very uneasy about such a
change.
~Gregory
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 17:46 ` Kiryl Shutsemau
@ 2025-12-19 19:08 ` Gladyshev Ilya
2025-12-22 13:33 ` Kiryl Shutsemau
0 siblings, 1 reply; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-19 19:08 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On 12/19/2025 8:46 PM, Kiryl Shutsemau wrote:
> On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
>> On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
>>> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>>>> The current atomic-based page refcount implementation treats zero
>>>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>>>> to prevent incrementing a dead refcount. This CAS loop acts as a
>>>> serialization point and can become a significant bottleneck during
>>>> high-frequency file read operations.
>>>>
>>>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>>>
>>> s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
>> Ack, thanks
>>
>>>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>>>> incrementing counter doesn't affect it's locked/unlocked state, it is
>>>> possible to use an optimistic atomic_fetch_add() in
>>>> page_ref_add_unless_zero() that operates independently of the locked bit.
>>>> The locked state is handled after the increment attempt, eliminating the
>>>> need for the CAS loop.
>>>
>>> I don't think I follow.
>>>
>>> Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
>>> page_ref_freeze(), but I don't think it does anything to serialize
>>> against freeing the page under you.
>>>
>>> Like, if the page in the process of freeing, page allocator sets its
>>> refcount to zero and your version of page_ref_add_unless_zero()
>>> successfully acquirees reference for the freed page.
>>>
>>> How is it safe?
>>
>> Page is freed only after a successful page_ref_dec_and_test() call, which
>> will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
>> somewhere in the allocation path [alloc_pages()], and effectively block any
>> "use after free" users.
>
> Okay, fair enough.
>
> But what prevent the following scenario?
>
> CPU0                                    CPU1
> page_ref_dec_and_test()
>   atomic_dec_and_test()  // refcount=0
>                                         page_ref_add_unless_zero()
>                                           atomic_add_return()  // refcount=1, no LOCKED_BIT
>                                         page_ref_dec_and_test()
>                                           atomic_dec_and_test()  // refcount=0
>                                           atomic_cmpxchg(0, LOCKED_BIT)  // succeeds
>   atomic_cmpxchg(0, LOCKED_BIT)  // fails
>   // return false to caller
>   // Use-after-free: BOOM!
>
But you can't trust that the page is safe to use after
page_ref_dec_and_test() returns false, if I understood your example
correctly. For example, the current implementation can also lead to this
'bug' if you slightly change the order of the atomic ops in your example:
Initial refcount value: 1 from CPU 0

CPU 0                                   CPU 1
page_ref_dec_and_test()                 page_ref_add_unless_zero()
                                          atomic_add_return()  [1 -> 2]
  atomic_dec_and_test()  [2 -> 1]
                                        page_ref_dec_and_test()
                                          atomic_dec_and_test()  [1 -> 0]
                                          /* page is logically freed here */
  return false  [because 1 != 0]
  // Caller with use after free?
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 18:17 ` Gregory Price
@ 2025-12-22 12:42 ` Gladyshev Ilya
0 siblings, 0 replies; 10+ messages in thread
From: Gladyshev Ilya @ 2025-12-22 12:42 UTC (permalink / raw)
To: Gregory Price
Cc: patchwork, guohanjun, wangkefeng.wang, weiyongjun1, yusongping,
leijitang, artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On 12/19/2025 9:17 PM, Gregory Price wrote:
> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>> The current atomic-based page refcount implementation treats zero
>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>> to prevent incrementing a dead refcount. This CAS loop acts as a
>> serialization point and can become a significant bottleneck during
>> high-frequency file read operations.
>>
>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>> incrementing counter doesn't affect it's locked/unlocked state, it is
>> possible to use an optimistic atomic_fetch_add() in
>> page_ref_add_unless_zero() that operates independently of the locked bit.
>> The locked state is handled after the increment attempt, eliminating the
>> need for the CAS loop.
>>
>
> Such a fundamental change needs additional validation to show there's no
> obvious failures. Have you run this through a model checker to verify
> the only failure condition is the 2^31 overflow condition you describe?
Aside from extensive logical reasoning, I validated some racy situations
via tools/memory-model model checking:
1. Increment vs. free race (bad output: use-after-free | memory leak)
2. Free vs. free race (bad output: double free | memory leak)
3. Increment vs. freeze (bad output: both fail)
4. Increment vs. unfreeze (bad output: missed increment)
If there are other scenarios you are concerned about, I will model them
as well. You can find the litmus tests at the end of this email.
> A single benchmark and a short changelog is leaves me very uneasy about
> such a change.
This RFC submission was primarily focused on demonstrating the concept
and the performance gain for the reported bottleneck. I will improve the
changelog (and safety reasoning) for later submissions, as well as the
benchmarking side.
---
Note: I used 32 as the locked bit in the model tests for better readability.
It doesn't affect anything.
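For reference, each test can be checked with herd7 against the LKMM model
roughly as follows (path assumed relative to tools/memory-model):

	herd7 -conf linux-kernel.cfg \
		litmus-tests/folio_refcount/inc_free_race.litmus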
---
diff --git a/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus
new file mode 100644
index 000000000000..4dc7e899245b
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus
@@ -0,0 +1,37 @@
+C free_vs_free_race
+
+(* Result: Never
+ *
+ * Both P0 and P1 try to decrement the refcount.
+ *
+ * Expected result: only one deallocation (r0 xor r1 == 1)
+ * which is equal to r0 != r1 => bad result is r0 == r1
+*)
+
+{
+ int refcount = 2;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_dec_and_test(refcount);
+ if (r0) {
+ r0 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+
+P1(int *refcount)
+{
+ int r1;
+
+ r1 = atomic_dec_and_test(refcount);
+ if (r1) {
+ r1 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+exists (0:r0 == 1:r1)
+
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus
new file mode 100644
index 000000000000..863abba48415
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus
@@ -0,0 +1,34 @@
+C inc_free_race
+
+(* Result: Never
+ *
+ * P0 tries to drop the last reference and free the object.
+ * P1 tries to acquire it.
+ * Expected result: one of them fails (r0 xor r1 == 1),
+ * so bad result is r0 == r1
+*)
+
+{
+ int refcount = 1;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_dec_and_test(refcount);
+ if (r0) {
+ r0 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+
+P1(int *refcount)
+{
+ int r1;
+
+ r1 = atomic_add_return(1, refcount);
+ r1 = (r1 & (32)) == 0;
+}
+
+exists (0:r0 == 1:r1)
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus
new file mode 100644
index 000000000000..6e3a4112080c
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus
@@ -0,0 +1,31 @@
+C inc_freeze_race
+
+(* Result: Never
+ *
+ * P0 tries to freeze a counter with value 3 (can be arbitrary).
+ * P1 tries to acquire a reference.
+ * Expected result: one of them fails (r0 xor r1 == 1),
+ * so the bad result is 0:r0 == 1:r0 (= 0, 1).
+*)
+
+{
+ int refcount = 3;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_cmpxchg(refcount, 3, 32) == 3;
+}
+
+
+P1(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_add_return(1, refcount);
+ r0 = (r0 & (32)) == 0;
+}
+
+exists (0:r0 == 1:r0)
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus
new file mode 100644
index 000000000000..f7e2273fe7da
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus
@@ -0,0 +1,30 @@
+C inc_unfreeze_race
+
+(* Result: Never
+ *
+ * P0 tries to unfreeze the refcount with the saved value 3.
+ * P1 tries to acquire a reference.
+ *
+ * Expected result: P1 fails, or in the end the refcount is 4.
+ * Bad result: a missed refcount increment.
+*)
+
+{
+ int refcount = 32;
+}
+
+P0(int *refcount)
+{
+ smp_store_release(refcount, 3);
+}
+
+
+P1(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_add_return(1, refcount);
+ r0 = (r0 & (32)) == 0;
+}
+
+exists (1:r0=1 /\ refcount != 4)
* Re: [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit
2025-12-19 19:08 ` Gladyshev Ilya
@ 2025-12-22 13:33 ` Kiryl Shutsemau
0 siblings, 0 replies; 10+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 13:33 UTC (permalink / raw)
To: Gladyshev Ilya
Cc: guohanjun, wangkefeng.wang, weiyongjun1, yusongping, leijitang,
artem.kuzin, stepanov.anatoly, alexander.grubnikov,
gorbunov.ivan, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, ziy, harry.yoo, willy, yuzhao,
baolin.wang, muchun.song, linux-mm, linux-kernel
On Fri, Dec 19, 2025 at 10:08:54PM +0300, Gladyshev Ilya wrote:
> On 12/19/2025 8:46 PM, Kiryl Shutsemau wrote:
> > On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
> > > On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> > > > On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> > > > > The current atomic-based page refcount implementation treats zero
> > > > > counter as dead and requires a compare-and-swap loop in folio_try_get()
> > > > > to prevent incrementing a dead refcount. This CAS loop acts as a
> > > > > serialization point and can become a significant bottleneck during
> > > > > high-frequency file read operations.
> > > > >
> > > > > This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> > > >
> > > > s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
> > > Ack, thanks
> > >
> > > > > (temporary) zero refcount and a locked (dead/frozen) state. Because now
> > > > > incrementing counter doesn't affect it's locked/unlocked state, it is
> > > > > possible to use an optimistic atomic_fetch_add() in
> > > > > page_ref_add_unless_zero() that operates independently of the locked bit.
> > > > > The locked state is handled after the increment attempt, eliminating the
> > > > > need for the CAS loop.
> > > >
> > > > I don't think I follow.
> > > >
> > > > Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> > > > page_ref_freeze(), but I don't think it does anything to serialize
> > > > against freeing the page under you.
> > > >
> > > > Like, if the page in the process of freeing, page allocator sets its
> > > > refcount to zero and your version of page_ref_add_unless_zero()
> > > > successfully acquirees reference for the freed page.
> > > >
> > > > How is it safe?
> > >
> > > Page is freed only after a successful page_ref_dec_and_test() call, which
> > > will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
> > > somewhere in the allocation path [alloc_pages()], and effectively block any
> > > "use after free" users.
> >
> > Okay, fair enough.
> >
> > But what prevent the following scenario?
> >
> > CPU0                                    CPU1
> > page_ref_dec_and_test()
> >   atomic_dec_and_test()  // refcount=0
> >                                         page_ref_add_unless_zero()
> >                                           atomic_add_return()  // refcount=1, no LOCKED_BIT
> >                                         page_ref_dec_and_test()
> >                                           atomic_dec_and_test()  // refcount=0
> >                                           atomic_cmpxchg(0, LOCKED_BIT)  // succeeds
> >   atomic_cmpxchg(0, LOCKED_BIT)  // fails
> >   // return false to caller
> >   // Use-after-free: BOOM!
> >
> But you can't trust that the page is safe to use after
> page_ref_dec_and_test() returns false, if I understood your example
> correctly.
True. My bad.
--
Kiryl Shutsemau / Kirill A. Shutemov
Thread overview: 10+ messages
2025-12-19 12:46 [RFC PATCH 0/2] mm: improve folio refcount scalability Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 1/2] mm: make ref_unless functions unless_zero only Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2025-12-19 14:50 ` Kiryl Shutsemau
2025-12-19 16:18 ` Gladyshev Ilya
2025-12-19 17:46 ` Kiryl Shutsemau
2025-12-19 19:08 ` Gladyshev Ilya
2025-12-22 13:33 ` Kiryl Shutsemau
2025-12-19 18:17 ` Gregory Price
2025-12-22 12:42 ` Gladyshev Ilya