From: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	Harry Yoo <harry.yoo@oracle.com>,
	Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Alistair Popple <apopple@nvidia.com>,
	Gorbunov Ivan <gorbunov.ivan@h-partners.com>,
	Muchun Song <muchun.song@linux.dev>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>,
	Kiryl Shutsemau <kirill@shutemov.name>,
	Linus Torvalds <torvalds@linuxfoundation.org>,
	<gladyshev.ilya1@h-partners.com>
Subject: Re: [PATCH 1/1] mm: implement page refcount locking via dedicated bit
Date: Fri, 6 Mar 2026 14:50:08 +0300	[thread overview]
Message-ID: <902d821b-e903-4bf5-89db-070851c95a1a@h-partners.com> (raw)
In-Reply-To: <a3361902-75bf-4e9e-a8c5-1959f9e72915@kernel.org>

This is a combined reply to both of your emails; hope you don't mind.

 >
 > This all made my brain hurt a little 🙂
 >
 >>
 >>   /**
 >> @@ -176,6 +181,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
 >>   {
 >>       int ret = atomic_sub_and_test(nr, &page->_refcount);
 >>
 >> +    if (ret)
 >> +        ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
 >> +
 >
 > It took me a while to figure out why this can't be just an atomic_or():
 > even though concurrent page_ref_add_unless_zero() would see a 0 after
 > incrementing it to 1 to back off, there could be yet another concurrent
 > page_ref_add_unless_zero() that would see the transition from 1 to 2 and
 > continue.
 >
 > What is the performance impact on doing the additional
 > atomic_cmpxchg_relaxed() whenever we free a page, in particular, for
 > anonymous memory where we mostly have just a single reference that we
 > drop during munmap() etc?

I'll try to measure some numbers. Theoretically speaking, in a 
low-contention scenario you will exclusively own the cacheline after the 
atomic_sub, so the relaxed CAS should be cheap. Something along the lines 
of the toy model below is what I have in mind for a first measurement.
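
To be clear about what I intend to measure, here is a minimal userspace 
model using C11 atomics (not the kernel code; LOCKED_BIT, the iteration 
count and the overall structure are just stand-ins for the real thing):

/* toy model: cost of the extra relaxed CAS on the uncontended free path.
 * Build with: gcc -O2 -o refcas refcas.c
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS      (100UL * 1000 * 1000)
#define LOCKED_BIT (1u << 31)		/* stand-in for PAGEREF_LOCKED_BIT */

static atomic_uint refcount;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	uint64_t t0, t1, t2;
	unsigned long i;

	/* current free path: one atomic sub on an exclusively-owned line */
	t0 = now_ns();
	for (i = 0; i < ITERS; i++) {
		atomic_store_explicit(&refcount, 1, memory_order_relaxed);
		atomic_fetch_sub(&refcount, 1);
	}
	t1 = now_ns();

	/* proposed free path: sub, then a relaxed CAS 0 -> LOCKED_BIT on the
	 * same cacheline, which is still in the exclusive state */
	for (i = 0; i < ITERS; i++) {
		unsigned int expected = 0;

		atomic_store_explicit(&refcount, 1, memory_order_relaxed);
		if (atomic_fetch_sub(&refcount, 1) == 1)
			atomic_compare_exchange_strong_explicit(&refcount,
					&expected, LOCKED_BIT,
					memory_order_relaxed,
					memory_order_relaxed);
	}
	t2 = now_ns();

	printf("sub only         : %.2f ns/iter\n", (double)(t1 - t0) / ITERS);
	printf("sub + relaxed CAS: %.2f ns/iter\n", (double)(t2 - t1) / ITERS);
	return 0;
}

A single-threaded run only models the uncontended case, of course; the 
contended case needs real measurements on the kernel side.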

 >>       if (page_ref_tracepoint_active(page_ref_mod_and_test))
 >>           __page_ref_mod_and_test(page, -nr, ret);
 >>       return ret;
 >> @@ -204,6 +212,9 @@ static inline int page_ref_dec_and_test(struct page *page)
 >>   {
 >>       int ret = atomic_dec_and_test(&page->_refcount);
 >>
 >> +    if (ret)
 >> +        ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
 >> +
 >>       if (page_ref_tracepoint_active(page_ref_mod_and_test))
 >>           __page_ref_mod_and_test(page, -1, ret);
 >>       return ret;
 >> @@ -228,14 +239,23 @@ static inline int folio_ref_dec_return(struct folio *folio)
 >>       return page_ref_dec_return(&folio->page);
 >>   }
 >>
 >> +#define _PAGEREF_LOCKED_LIMIT    ((1 << 30) | PAGEREF_LOCKED_BIT)
 >> +
 >>   static inline bool page_ref_add_unless_zero(struct page *page, int nr)
 >>   {
 >>       bool ret = false;
 >> +    int val;
 >>
 >>       rcu_read_lock();
 >>       /* avoid writing to the vmemmap area being remapped */
 >> -    if (page_count_writable(page))
 >> -        ret = atomic_add_unless(&page->_refcount, nr, 0);
 >> +    if (page_count_writable(page)) {
 >> +        val = atomic_add_return(nr, &page->_refcount);
 >> +        ret = !(val & PAGEREF_LOCKED_BIT);
 >> +
 >> +        /* Undo atomic_add() if counter is locked and scary big */
 >> +        while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
 >> +            val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
 >
 > I assume we can't do an atomic_dec(), because we might have concurrent
 > unfreezing (or similar things) happening that overwrote whatever was in
 > there.

Not only that, but you also probably don't want to have to handle the 
"atomic_dec() returned 0" case.

 > Is it really correct to replace _PAGEREF_LOCKED_LIMIT by
 > PAGEREF_LOCKED_BIT, dropping some unrelated references? I assume the
 > reasoning is that we treat any references with PAGEREF_LOCKED_BIT set as
 > irrelevant, and they can get overwritten at any time.

A locked refcount doesn't contain any "references" other than failed 
optimistic increments, so you are right that they don't carry any 
semantic meaning. The only reason to clear them is to prevent overflow, 
so we do it only when it is absolutely required.
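
Just to spell out what the "locked" value is supposed to hold (this 
assumes PAGEREF_LOCKED_BIT is the top bit; the exact position doesn't 
matter for the argument):

/*
 * After the free path / page_ref_freeze():   PAGEREF_LOCKED_BIT | 0
 * After k failed optimistic increments:      PAGEREF_LOCKED_BIT | k
 *
 * None of those k low bits are real references, only the debris left
 * behind by page_ref_add_unless_zero() callers that already backed off.
 * The undo loop only has to fire before k can grow into bit 30 and
 * eventually disturb the locked bit itself, hence:
 */
#define _PAGEREF_LOCKED_LIMIT	((1 << 30) | PAGEREF_LOCKED_BIT)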

 >
 > What I was wondering is whether page_ref_freeze() could actually leave the
 > references set, and only set the PAGEREF_LOCKED_BIT, whereby
 > page_ref_unfreeze() would only clear the PAGEREF_LOCKED_BIT.
 >
 > Similarly, the set_page_refcounted() could add a reference and clear the
 > PAGEREF_LOCKED_BIT. That'd be more expensive on the allocation path ...
 > and I'm not sure if that would really help to turn this
 > atomic_cmpxchg_relaxed() into a simpler atomic_dec() my brain could
 > more easily understand 🙂

I believe resetting the whole counter via a single CAS is cheaper 
(CPU-wise) than an atomic_dec() for each individual failed attempt.
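
To illustrate what I mean (a sketch only; neither variant is verbatim 
from the patch, and the helper names are made up):

/* Variant A: every failed optimistic increment undoes itself. Each loser
 * pays a full atomic RMW on what is by definition a contended cacheline,
 * and, as you note above, it also has to reason about racing with
 * unfreeze and about the undo bringing the counter back to 0. */
static inline bool ref_add_unless_locked_dec(struct page *page, int nr)
{
	int val = atomic_add_return(nr, &page->_refcount);

	if (!(val & PAGEREF_LOCKED_BIT))
		return true;
	atomic_sub(nr, &page->_refcount);	/* per-attempt undo */
	return false;
}

/* Variant B (what the patch does): losers leave their increments behind;
 * a single CAS resets the whole counter back to PAGEREF_LOCKED_BIT, and
 * only when the accumulated debris gets close to overflowing. */
static inline bool ref_add_unless_locked_cas(struct page *page, int nr)
{
	int val = atomic_add_return(nr, &page->_refcount);
	bool ret = !(val & PAGEREF_LOCKED_BIT);

	while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
		val = atomic_cmpxchg_relaxed(&page->_refcount, val,
					     PAGEREF_LOCKED_BIT);
	return ret;
}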

 >
 > I think this patch needs a lot more documentation around what the
 > PAGEREF_LOCKED_BIT means, and how this interacts with e.g., the
 > set_page_count() in set_page_refcounted().

Yep, I'll fix this

 > In general, I like this!
 >

 >>   static inline bool page_ref_add_unless_zero(struct page *page, int nr)
 >>   {
 >>       bool ret = false;
 >> +    int val;
 >>         rcu_read_lock();
 >>       /* avoid writing to the vmemmap area being remapped */
 >> -    if (page_count_writable(page))
 >> -        ret = atomic_add_unless(&page->_refcount, nr, 0);
 >> +    if (page_count_writable(page)) {
 >> +        val = atomic_add_return(nr, &page->_refcount);
 >> +        ret = !(val & PAGEREF_LOCKED_BIT);
 >> +
 >> +        /* Undo atomic_add() if counter is locked and scary big */
 >> +        while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
 >> +            val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
 > It's still early here, but I think there is a problem.
 >
 > Please bear with me 🙂
 >
 >     val = atomic_add_return(nr, &page->_refcount);
 >     ret = !(val & PAGEREF_LOCKED_BIT);
 >
 > This implies that we can grab a reference whenever the locked bit is not set.
 >
 > Including when the refcount is 0.
 >
 > Now, that works fine when racing with concurrent freeing, where we were
 > just able to decrement the refcount but have yet to set the
 > PAGEREF_LOCKED_BIT.
 >
 > But, what about any pages that don't have the PAGEREF_LOCKED_BIT set,
 > but have the refcount at 0 permanently?
 >
 > That's, for example, the case for any pages where we do an explicit
 > set_page_count(page, 0);
 >
 > For example, all pages we add to the page allocator through
 > __free_pages_core().

You are right that refcount == 0 is tricky. However, for a bad outcome 
you need both:
1. some external reference to this page, through which you try to 
increment the refcount;
2. a set_page_count(page, 0) somewhere between freeing and the "it is 
safe to allocate" state.

So adding new pages with a zeroed refcount to the allocator is safe, 
because there are no external references yet. Zeroing a tail page's 
refcount is safe unless someone actually tries to increment it (and that 
is a bug in itself).

Generally, the only unsafe set_page_count() (or any other zeroing) would 
be in the allocator itself, between freeing and allocating. Or maybe I 
missed something and this approach is indeed incorrect.

We can probably also come up with some debug checks to catch bugs in the 
"safe" scenarios; see the sketch below.
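
For instance (a purely hypothetical sketch; the check and its placement 
are invented here, not part of the patch), something like a 
CONFIG_DEBUG_VM assertion in the optimistic-increment path:

static inline bool page_ref_add_unless_zero(struct page *page, int nr)
{
	bool ret = false;
	int val;

	rcu_read_lock();
	/* avoid writing to the vmemmap area being remapped */
	if (page_count_writable(page)) {
		val = atomic_add_return(nr, &page->_refcount);

		/*
		 * Hypothetical debug check: if the counter was 0 and the
		 * locked bit was clear before our increment, somebody reached
		 * a refcount==0 page through an external reference, i.e. one
		 * of the "safe" zeroing assumptions above was violated.
		 */
		VM_WARN_ON_ONCE_PAGE(val == nr, page);

		ret = !(val & PAGEREF_LOCKED_BIT);

		/* Undo atomic_add() if counter is locked and scary big */
		while (unlikely((unsigned int)val >= _PAGEREF_LOCKED_LIMIT))
			val = atomic_cmpxchg_relaxed(&page->_refcount, val,
						     PAGEREF_LOCKED_BIT);
	}
	rcu_read_unlock();

	return ret;
}

It wouldn't catch the places you list below that never go through 
set_page_count() at all, but it would at least make the illegal "grab a 
reference to a permanently-zero page" pattern visible under DEBUG_VM.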

 > That means that someone could easily grab a reference to such pages,
 > including tail pages of allocated compound pages where the refcount is
 > still 0 -- or pages allocated with a frozen refcount where we don't ever
 > do the set_page_refcounted() in the buddy.
 >
 > Bad things will happen when such a wrongly obtained
 > page_ref_add_unless_zero() reference is dropped again to free that page.
 >
 >
 > You'd have to make sure that there is no way we can achieve refcount ==
 > 0 without going through page_ref_dec_and_test(), when actually freeing a
 > page.
 >
 > One piece of the puzzle is handling set_page_count(p, 0) I think. But I
 > suspect that there might be other places where we don't even have the
 > set_page_count().
 >
 > See vmemmap_get_tail() in
 > https://lore.kernel.org/r/20260227194302.274384-13-kas@kernel.org for
 > example, where we know the refcount is 0, because we allocated the page
 > holding memmap with __GFP_ZERO.
 >
 > For example, I think you'd have to make sure that *any* pages in the
 > buddy have their refcount set to PAGEREF_LOCKED_BIT, not 0.
 >
 > So unless I am missing something, this is broken and requires a lot of
 > care to make sure that refcount==0 is handled everywhere accordingly.
 >

---
Ilya



Thread overview: 13+ messages
2026-02-26 16:27 [PATCH 0/1] mm: improve folio refcount scalability Gladyshev Ilya
2026-02-26 16:27 ` [PATCH 1/1] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2026-03-04 19:16   ` David Hildenbrand (Arm)
2026-03-05  8:10     ` David Hildenbrand (Arm)
2026-03-06 11:50       ` Gladyshev Ilya [this message]
2026-03-06 13:10         ` David Hildenbrand (Arm)
2026-03-06 14:29           ` Gladyshev Ilya
2026-02-28 22:19 ` [PATCH 0/1] mm: improve folio refcount scalability Andrew Morton
2026-03-01  3:27   ` Linus Torvalds
2026-03-01 18:52     ` Linus Torvalds
2026-03-01 20:26       ` Pedro Falcato
2026-03-01 21:16         ` Linus Torvalds
2026-03-04 17:34           ` Linus Torvalds
