linux-mm.kvack.org archive mirror
From: Yu Zhao <yuzhao@google.com>
To: Xu Lu <luxu.kernel@bytedance.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>,
	David Hildenbrand <david@redhat.com>, Zi Yan <ziy@nvidia.com>,
	 paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu,  ardb@kernel.org, anup@brainfault.org,
	atishp@atishpatra.org,  xieyongji@bytedance.com,
	lihangjing@bytedance.com,  punit.agrawal@bytedance.com,
	linux-kernel@vger.kernel.org,  linux-riscv@lists.infradead.org,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
Date: Sat, 7 Dec 2024 15:02:57 -0700
Message-ID: <CAOUHufaJ3ix7Prv7UkHzzgz5Hq7UW9T5AZFHdWKgzBW_2hYdLw@mail.gmail.com>
In-Reply-To: <CAPYmKFtvYFmE9qpp=Gyqdjc8nNnr-OPT3X+UittqTPCqzz6XPw@mail.gmail.com>

On Sat, Dec 7, 2024 at 1:03 AM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> Hi Pedro,
>
> On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Fri, Dec 6, 2024 at 1:42 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
> > >
> > > Hi David,
> > >
> > > On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 06.12.24 03:00, Zi Yan wrote:
> > > > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > > > >
> > > > >> This patch series attempts to break through the limitation of MMU and
> > > > >> supports larger base page on RISC-V, which only supports 4K page size
> > > > >> now. The key idea is to always manage and allocate memory at a
> > > > >> granularity of 64K and use SVNAPOT to accelerate address translation.
> > > > >> This is the second version and the detailed introduction can be found
> > > > >> in [1].
> > > > >>
> > > > >> Changes from v1:
> > > > >> - Rebase on v6.12.
> > > > >>
> > > > >> - Adjust the page table entry shift to reduce page table memory usage.
> > > > >>      For example, in SV39, the traditional va behaves as:
> > > > >>
> > > > >>      ----------------------------------------------
> > > > >>      | pgd index | pmd index | pte index | offset |
> > > > >>      ----------------------------------------------
> > > > >>      | 38     30 | 29     21 | 20     12 | 11   0 |
> > > > >>      ----------------------------------------------
> > > > >>
> > > > >>      When we choose 64K as basic software page, va now behaves as:
> > > > >>
> > > > >>      ----------------------------------------------
> > > > >>      | pgd index | pmd index | pte index | offset |
> > > > >>      ----------------------------------------------
> > > > >>      | 38     34 | 33     25 | 24     16 | 15   0 |
> > > > >>      ----------------------------------------------
> > > > >>
> > > > >> - Fix some bugs in v1.
> > > > >>
> > > > >> Thanks in advance for comments.
> > > > >>
> > > > >> [1] https://lwn.net/Articles/952722/
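To make the two layouts quoted above concrete, here is a minimal sketch of how a virtual address would split under each of them. The field widths are read directly off the two tables (4K: 9/9/9/12, 64K: 5/9/9/16); the struct and function names are hypothetical and are not taken from the patch series.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths in bits, lowest field first. Names are hypothetical. */
struct va_layout {
    unsigned offset_bits;
    unsigned pte_bits;
    unsigned pmd_bits;
    unsigned pgd_bits;
};

/* Traditional SV39 layout: offset 11:0, pte 20:12, pmd 29:21, pgd 38:30. */
static const struct va_layout sv39_4k = { 12, 9, 9, 9 };
/* 64K software page layout: offset 15:0, pte 24:16, pmd 33:25, pgd 38:34. */
static const struct va_layout sv39_64k = { 16, 9, 9, 5 };

struct va_parts {
    uint64_t pgd, pmd, pte, offset;
};

/* Return the low `width` bits of *va and shift them out. */
static uint64_t take_bits(uint64_t *va, unsigned width)
{
    assert(width > 0 && width < 64);
    uint64_t v = *va & ((UINT64_C(1) << width) - 1);
    *va >>= width;
    return v;
}

/* Split a 39-bit virtual address according to the given layout. */
static struct va_parts split_va(uint64_t va, const struct va_layout *l)
{
    struct va_parts p;
    p.offset = take_bits(&va, l->offset_bits);
    p.pte = take_bits(&va, l->pte_bits);
    p.pmd = take_bits(&va, l->pmd_bits);
    p.pgd = take_bits(&va, l->pgd_bits);
    return p;
}
```

Both layouts still add up to 39 bits; the 64K layout moves four bits from the pgd index into the offset, so the top-level table shrinks from 512 to 32 entries.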
> > > > >
> > > > > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > > > > in the future? Thanks.
> > > > >
> > > > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > > > > > should have a similar effect to RISC-V’s SVNAPOT, right?
> > > >
> > > > What is the real benefit over 4k + large folios/mTHP?
> > > >
> > > > 64K comes with the problem of internal fragmentation: for example, a
> > > > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > > > a downside.
> > >
> > > The original idea comes from the performance benefits we achieved on
> > > the ARM 64K kernel. We run several real world applications on the ARM
> > > Ampere Altra platform and found these apps' performance based on the
> > > 64K page kernel is significantly higher than that on the 4K page
> > > kernel:
> > > For Redis, the throughput has increased by 250% and latency has
> > > decreased by 70%.
> > > For Mysql, the throughput has increased by 16.9% and latency has
> > > decreased by 14.5%.
> > > For our own newsql database, throughput has increased by 16.5% and
> > > latency has decreased by 13.8%.
> > >
> > > Also, we have compared the performance between 64K and 4k + large
> > > folios/mTHP on ARM Neoverse-N2. The result shows considerable
> > > performance improvement on 64K kernel for both speccpu and lmbench,
> > > even when 4K kernel enables THP and ARM64_CONTPTE:
> > > For speccpu benchmark, 64K kernel without any huge pages optimization
> > > can still achieve 4.17% higher score than 4K kernel with transparent
> > > huge pages as well as CONTPTE optimization.
> > > For lmbench, 64K kernel achieves 75.98% lower memory mapping
> > > latency(16MB) than 4K kernel with transparent huge pages and CONTPTE
> > > optimization, 84.34% higher mmap read open2close bandwidth(16MB), and
> > > 10.71% lower random load latency(16MB).
> > > Interestingly, a kernel with transparent huge page support sometimes
> > > performs worse for both 4K and 64K (for example, in the mmap read
> > > bandwidth bench). We assume this is due to the overhead of huge page
> > > combination and collapse.
> > > Also, if you check the full result, you will find that usually the
> > > larger the memory size used for testing is, the better the performance
> > > of the 64K kernel is (compared to the 4K kernel), unless the memory
> > > size lies in a range where the 4K kernel can apply 2MB huge pages while
> > > the 64K kernel can't.
> > > In summary, for performance sensitive applications which require
> > > higher bandwidth and lower latency, sometimes 4K pages with huge pages
> > > may not be the best choice and 64k page can achieve better results.
> > > The test environment and results are attached.
> > >
> > > As RISC-V has no native 64K MMU support, we introduce a software
> > > implementation and accelerate it via Svnapot. Of course, there will be
> > > some extra overhead compared with native 64K MMU. Thus, we are also
> > > trying to persuade the RISC-V community to support the extension of
> > > native 64K MMU [1]. Please join us if you are interested.
> > >
> >
> > Ok, so you... didn't test this on riscv? And you're basing this
> > patchset off of a native 64KiB page size kernel being faster than 4KiB
> > + CONTPTE? I don't see how that makes sense?
>
> Sorry for the confusion. I didn't intend to use ARM data to support
> this patch, just to explain the idea source. We do prefer 64K MMU for
> the performance improvement it brought to real applications and
> benchmarks.

This breaks ABI, doesn't it? Not only does userspace need to be
recompiled with 64KB alignment, it also must not assume a 4KB base
page size.
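As a rough illustration of what "not assuming 4KB" means for userspace, the sketch below queries the base page size at run time with POSIX sysconf(_SC_PAGESIZE) rather than hard-coding 4096. The helper names are hypothetical; only sysconf() and _SC_PAGESIZE are real interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Ask the kernel for the base page size instead of assuming 4096. */
static size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (size_t)sz;
}

/* Round len up to a multiple of page_size (which must be a power of two). */
static size_t page_align_up(size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (len + page_size - 1) & ~(page_size - 1);
}
```

Code that sizes buffers or mmap() lengths with page_align_up(len, runtime_page_size()) keeps working on a 64KB kernel; code that uses a literal 4096 for alignment or offsets may not.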

> And since RISC-V does not support it yet, we internally
> use this patch as a transitional solution for RISC-V.

Distros need to support this as well. Otherwise it's a tech island.
Also why RV? It can be a generic feature which can apply to other
archs like x86, right? See "page clustering" [1][2].

[1] https://lwn.net/Articles/23785/
[2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/

> And if native
> 64k MMU is available, this patch can be canceled.

Why 64KB? Why not 32KB or 128KB? In general, the less dependency on
h/w, the better. Ideally, *if* we want to consider this, it should be
a s/w feature applicable to all (or most of) archs.


> The only usage of
> this patch I can think of then is to make the kernel support more page
> sizes than MMU, as long as Svnapot supports the corresponding size.
>
> We will try to release the performance data in the next version. There
> have been more issues with applications and OS adaptation:) So this
> version is still an RFC.
>
> >
> > /me is confused
> >
> > How many of these PAGE_SIZE wins are related to e.g userspace basing
> > its buffer sizes (or whatever) off of the system page size? Where
> > exactly are you gaining time versus the CONTPTE stuff?
> > I think MM in general would be better off if we were more transparent
> > with regard to CONTPTE and page sizes instead of hand waving with
> > "hardware page size != software page size", which is such a *checks
> > notes* 4.4BSD idea... :) At the very least, this patchset seems to go
> > against all the work on better supporting large folios and CONTPTE.
>
> By the way, the core modification of this patch is turning the pte
> structure into an array of 16 entries to map a 64K page and accelerating
> it via Svnapot. I think it is all about architectural pte and has
> little impact on pages or folios. Please remind me if anything is
> missed and I will try to fix it.
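For readers following along, the "array of 16 entries" idea can be sketched roughly as below. This is not code from the patch series: the struct and function names are hypothetical, and the bit positions (PPN field at bit 10, N bit at bit 63, ppn[3:0] = 0b1000 for a 64K NAPOT region) reflect my reading of the Svnapot extension and should be checked against the privileged spec.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PPN_SHIFT   10                      /* PPN field starts at bit 10 */
#define PTE_N           (UINT64_C(1) << 63)     /* Svnapot N bit (assumed position) */
#define NAPOT_64K_ORDER 4                       /* 2^4 x 4K = 64K */
#define NAPOT_64K_PTES  (1u << NAPOT_64K_ORDER) /* 16 hardware PTEs */

/* Hypothetical software pte: 16 hardware PTEs mapping one 64K page. */
struct sw_pte {
    uint64_t hw[NAPOT_64K_PTES];
};

/* Build the hardware PTE value for a 64K NAPOT mapping. */
static uint64_t mk_napot_64k_pte(uint64_t pfn, uint64_t prot)
{
    /* The 4K frame number must be 64K aligned (low 4 bits clear). */
    assert((pfn & (NAPOT_64K_PTES - 1)) == 0);
    /* Encode the region size in the low PPN bits: ppn[3:0] = 0b1000. */
    uint64_t ppn = pfn | (UINT64_C(1) << (NAPOT_64K_ORDER - 1));
    return (ppn << PTE_PPN_SHIFT) | PTE_N | prot;
}

/* Fill all 16 hardware slots with the same NAPOT-encoded value. */
static void set_sw_pte(struct sw_pte *p, uint64_t pfn, uint64_t prot)
{
    uint64_t v = mk_napot_64k_pte(pfn, prot);
    for (unsigned i = 0; i < NAPOT_64K_PTES; i++)
        p->hw[i] = v;
}
```

The hardware still walks 4K-granular tables; the identical NAPOT-encoded entries only let the TLB cache the 64K range as a single translation.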
>
> >
> > --
> > Pedro
>
> Thanks,
>
> Xu Lu
>


