linux-mm.kvack.org archive mirror
From: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
To: <gladyshev.ilya1@h-partners.com>
Cc: <guohanjun@huawei.com>, <wangkefeng.wang@huawei.com>,
	<weiyongjun1@huawei.com>, <yusongping@huawei.com>,
	<leijitang@huawei.com>, <artem.kuzin@huawei.com>,
	<stepanov.anatoly@huawei.com>, <alexander.grubnikov@huawei.com>,
	<gorbunov.ivan@h-partners.com>, <akpm@linux-foundation.org>,
	<david@kernel.org>, <lorenzo.stoakes@oracle.com>,
	<Liam.Howlett@oracle.com>, <vbabka@suse.cz>, <rppt@kernel.org>,
	<surenb@google.com>, <mhocko@suse.com>, <ziy@nvidia.com>,
	<harry.yoo@oracle.com>, <willy@infradead.org>,
	<yuzhao@google.com>, <baolin.wang@linux.alibaba.com>,
	<muchun.song@linux.dev>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/2] mm: improve folio refcount scalability
Date: Mon, 12 Jan 2026 11:30:38 +0300
Message-ID: <ec023298-c26d-437b-a023-b49509f83c5a@h-partners.com>
In-Reply-To: <cover.1766145604.git.gladyshev.ilya1@h-partners.com>

Gentle ping on this proposal

> Intro
> =====
> This patch optimizes small-file read performance and overall folio
> refcount scalability by refactoring page_ref_add_unless() [the core of
> folio_try_get()]. This is an alternative approach to previous attempts
> to fix small-read performance by avoiding refcount bumps [1][2].
> 
> Overview
> ========
> The current refcount implementation uses a counter value of zero as the
> locked (dead/frozen) state, which requires a CAS loop on increments to
> avoid temporarily unlocking the counter in the try_get functions. These
> CAS loops become a serialization point for an otherwise fast and
> scalable read side.
> 
> Proposed implementation separates "locked" logic from the counting, allowing
> the use of optimistic fetch_add() instead of CAS. For more details, please
> refer to the commit message of the patch itself.
> 
> Proposed logic maintains the same public API as before, including all existing
> memory barrier guarantees.
> 
> Drawbacks
> =========
> In theory, an optimistic fetch_add() can overflow the atomic_t and
> clear the locked state. Currently, this is mitigated by a single CAS
> operation after the "failed" fetch_add, which tries to reset the
> counter back to the locked zero. While this best-effort approach makes
> no strong guarantee, it is unrealistic that there would be 2^31 highly
> contended try_get calls on a locked folio, with the CAS operation
> failing in every one of them.
> 
> If this guarantee isn't sufficient, it can be improved by performing a full
> CAS loop when the counter is approaching overflow.
> 
> Performance
> ===========
> Performance was measured using a simple custom benchmark based on
> will-it-scale[3]. This benchmark spawns N pinned threads/processes that
> execute the following loop:
> ``
> char buf[64];
> int fd = open(/* same file in tmpfs */);
> 
> while (1)
>         pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
> ``
> While this is a synthetic load, it does highlight an existing issue and
> does not differ much from the benchmarking in [2].
> 
> This benchmark measures operations per second in the inner loop and
> sums the results across all workers. Performance was tested on top of
> the v6.15 kernel[4] on two platforms. Since threads and processes
> showed similar performance on both systems, only the thread results
> are provided below. The performance improvement scales linearly
> between the CPU counts shown.
> 
> Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
> 
> #threads | vanilla | patched | boost (%)
>         1 | 1343381 | 1344401 |  +0.1
>         2 | 2186160 | 2455837 | +12.3
>         5 | 5277092 | 6108030 | +15.7
>        10 | 5858123 | 7506328 | +28.1
>        12 | 6484445 | 8137706 | +25.5
>           /* Cross socket NUMA */
>        14 | 3145860 | 4247391 | +35.0
>        16 | 2350840 | 4262707 | +81.3
>        18 | 2378825 | 4121415 | +73.2
>        20 | 2438475 | 4683548 | +92.1
>        24 | 2325998 | 4529737 | +94.7
> 
> Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
> 
> #threads | vanilla | patched | boost (%)
>         1 | 1077276 | 1081653 |  +0.4
>         5 | 4286838 | 4682513 |  +9.2
>        10 | 1698095 | 1902753 | +12.1
>        20 | 1662266 | 1921603 | +15.6
>        49 | 1486745 | 1828926 | +23.0
>        97 | 1617365 | 2052635 | +26.9
>           /* Cross socket NUMA */
>       105 | 1368319 | 1798862 | +31.5
>       136 | 1008071 | 1393055 | +38.2
>       168 |  879332 | 1245210 | +41.6
>                 /* SMT */
>       193 |  905432 | 1294833 | +43.0
>       289 |  851988 | 1313110 | +54.1
>       353 |  771288 | 1347165 | +74.7
> 
> [1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
> [3] https://github.com/antonblanchard/will-it-scale
> [4] There were no changes to page_ref.h between v6.15 and v6.18 or any
>      significant performance changes on the read side in mm/filemap.c
> 
> Gladyshev Ilya (2):
>    mm: make ref_unless functions unless_zero only
>    mm: implement page refcount locking via dedicated bit
> 
>   include/linux/mm.h         |  2 +-
>   include/linux/page-flags.h |  9 ++++++---
>   include/linux/page_ref.h   | 35 ++++++++++++++++++++++++++---------
>   3 files changed, 33 insertions(+), 13 deletions(-)
> 



Thread overview: 12+ messages
2025-12-19 12:46 Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 1/2] mm: make ref_unless functions unless_zero only Gladyshev Ilya
2025-12-19 12:46 ` [RFC PATCH 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
2025-12-19 14:50   ` Kiryl Shutsemau
2025-12-19 16:18     ` Gladyshev Ilya
2025-12-19 17:46       ` Kiryl Shutsemau
2025-12-19 19:08         ` Gladyshev Ilya
2025-12-22 13:33           ` Kiryl Shutsemau
2025-12-19 18:17   ` Gregory Price
2025-12-22 12:42     ` Gladyshev Ilya
2026-01-12  8:30 ` Gladyshev Ilya [this message]
2026-01-12 11:49   ` [RFC PATCH 0/2] mm: improve folio refcount scalability Kiryl Shutsemau
