From: Yu Zhao
Date: Sat, 7 Dec 2024 15:02:57 -0700
Subject: Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
To: Xu Lu
Cc: Pedro Falcato, David Hildenbrand, Zi Yan, paul.walmsley@sifive.com,
 palmer@dabbelt.com, aou@eecs.berkeley.edu, ardb@kernel.org,
 anup@brainfault.org, atishp@atishpatra.org, xieyongji@bytedance.com,
 lihangjing@bytedance.com, punit.agrawal@bytedance.com,
 linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, Linux MM
References: <20241205103729.14798-1-luxu.kernel@bytedance.com>
 <315752c5-6129-4c8b-bf8c-0cc26f0ad5c5@redhat.com>

On Sat, Dec 7, 2024 at 1:03 AM Xu Lu wrote:
>
> Hi Pedro,
>
> On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato wrote:
> >
> > On Fri, Dec 6, 2024 at 1:42 PM Xu Lu wrote:
> > >
> > > Hi David,
> > >
> > > On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand wrote:
> > > >
> > > > On 06.12.24 03:00, Zi Yan wrote:
> > > > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > > > >
> > > > >> This patch series attempts to break through the limitation of MMU and
> > > > >> supports larger base page on RISC-V, which only supports 4K page size
> > > > >> now. The key idea is to always manage and allocate memory at a
> > > > >> granularity of 64K and use SVNAPOT to accelerate address translation.
> > > > >> This is the second version and the detailed introduction can be found
> > > > >> in [1].
> > > > >>
> > > > >> Changes from v1:
> > > > >> - Rebase on v6.12.
> > > > >>
> > > > >> - Adjust the page table entry shift to reduce page table memory usage.
> > > > >>   For example, in SV39, the traditional va behaves as:
> > > > >>
> > > > >>   ----------------------------------------------
> > > > >>   | pgd index | pmd index | pte index | offset |
> > > > >>   ----------------------------------------------
> > > > >>   | 38     30 | 29     21 | 20     12 | 11   0 |
> > > > >>   ----------------------------------------------
> > > > >>
> > > > >>   When we choose 64K as basic software page, va now behaves as:
> > > > >>
> > > > >>   ----------------------------------------------
> > > > >>   | pgd index | pmd index | pte index | offset |
> > > > >>   ----------------------------------------------
> > > > >>   | 38     34 | 33     25 | 24     16 | 15   0 |
> > > > >>   ----------------------------------------------
> > > > >>
> > > > >> - Fix some bugs in v1.
> > > > >>
> > > > >> Thanks in advance for comments.
> > > > >>
> > > > >> [1] https://lwn.net/Articles/952722/
> > > > >
> > > > > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > > > > in the future? Thanks.
> > > > >
> > > > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > > > > should have similar effect of RISC-V’s SVNAPOT, right?
> > > >
> > > > What is the real benefit over 4k + large folios/mTHP?
> > > >
> > > > 64K comes with the problem of internal fragmentation: for example, a
> > > > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > > > a downside.
> > >
> > > The original idea comes from the performance benefits we achieved on
> > > the ARM 64K kernel. We run several real world applications on the ARM
> > > Ampere Altra platform and found these apps' performance based on the
> > > 64K page kernel is significantly higher than that on the 4K page
> > > kernel:
> > > For Redis, the throughput has increased by 250% and latency has
> > > decreased by 70%.
> > > For Mysql, the throughput has increased by 16.9% and latency has
> > > decreased by 14.5%.
> > > For our own newsql database, throughput has increased by 16.5% and
> > > latency has decreased by 13.8%.
> > >
> > > Also, we have compared the performance between 64K and 4k + large
> > > folios/mTHP on ARM Neoverse-N2. The result shows considerable
> > > performance improvement on 64K kernel for both speccpu and lmbench,
> > > even when 4K kernel enables THP and ARM64_CONTPTE:
> > > For speccpu benchmark, 64K kernel without any huge pages optimization
> > > can still achieve 4.17% higher score than 4K kernel with transparent
> > > huge pages as well as CONTPTE optimization.
> > > For lmbench, 64K kernel achieves 75.98% lower memory mapping
> > > latency(16MB) than 4K kernel with transparent huge pages and CONTPTE
> > > optimization, 84.34% higher map read open2close bandwidth(16MB), and
> > > 10.71% lower random load latency(16MB).
> > > Interestingly, sometimes kernel with transparent pages support have
> > > poorer performance for both 4K and 64K (for example, mmap read
> > > bandwidth bench). We assume this is due to the overhead of huge pages'
> > > combination and collapse.
> > > Also, if you check the full result, you will find that usually the
> > > larger the memory size used for testing is, the better the performance
> > > of 64k kernel is (compared to 4K kernel). Unless the memory size lies
> > > in a range where 4K kernel can apply 2MB huge pages while 64K kernel
> > > can't.
> > > In summary, for performance sensitive applications which require
> > > higher bandwidth and lower latency, sometimes 4K pages with huge pages
> > > may not be the best choice and 64k page can achieve better results.
> > > The test environment and result is attached.
> > >
> > > As RISC-V has no native 64K MMU support, we introduce a software
> > > implementation and accelerate it via Svnapot. Of course, there will be
> > > some extra overhead compared with native 64K MMU. Thus, we are also
> > > trying to persuade the RISC-V community to support the extension of
> > > native 64K MMU [1]. Please join us if you are interested.
> > >
> >
> > Ok, so you... didn't test this on riscv? And you're basing this
> > patchset off of a native 64KiB page size kernel being faster than 4KiB
> > + CONTPTE? I don't see how that makes sense?
>
> Sorry for the misleading. I didn't intend to use ARM data to support
> this patch, just to explain the idea source. We do prefer 64K MMU for
> the performance improvement it brought to real applications and
> benchmarks.

This breaks ABI, doesn't it? Not only userspace needs to be recompiled
with 64KB alignment, it also needs not to assume 4KB base page size.

> And since RISC-V does not support it yet, we internally
> use this patch as a transitional solution for RISC-V.

Distros need to support this as well. Otherwise it's a tech island.
Also why RV? It can be a generic feature which can apply to other
archs like x86, right? See "page clustering" [1][2].

[1] https://lwn.net/Articles/23785/
[2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/

> And if native
> 64k MMU is available, this patch can be canceled.

Why 64KB? Why not 32KB or 128KB? In general, the less dependency on
h/w, the better. Ideally, *if* we want to consider this, it should be
a s/w feature applicable to all (or most of) archs.

> The only usage of
> this patch I can think of then is to make the kernel support more page
> sizes than MMU, as long as Svnapot supports the corresponding size.
>
> We will try to release the performance data in the next version. There
> have been more issues with applications and OS adaptation:) So this
> version is still an RFC.
>
> >
> > /me is confused
> >
> > How many of these PAGE_SIZE wins are related to e.g userspace basing
> > its buffer sizes (or whatever) off of the system page size? Where
> > exactly are you gaining time versus the CONTPTE stuff?
> > I think MM in general would be better off if we were more transparent
> > with regard to CONTPTE and page sizes instead of hand waving with
> > "hardware page size != software page size", which is such a *checks
> > notes* 4.4BSD idea... :) At the very least, this patchset seems to go
> > against all the work on better supporting large folios and CONTPTE.
>
> By the way, the core modification of this patch is turning pte
> structure to an array of 16 entries to map a 64K page and accelerating
> it via Svnapot. I think it is all about architectural pte and has
> little impact on pages or folios. Please remind me if anything is
> missed and I will try to fix it.
>
> >
> > --
> > Pedro
>
> Thanks,
>
> Xu Lu
>
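
As an illustration of the two Sv39 layouts quoted from the cover letter
earlier in this message, here is a minimal Python sketch that splits a
virtual address into its pgd, pmd, pte and offset fields for the 4K and
the 64K case. The bit boundaries come from the two quoted tables. The
names split_va, hw_ptes_per_page, SV39_VA_BITS and INDEX_BITS are made up
for this sketch and are not taken from the patch series. The last helper
only restates the 64K / 4K ratio behind the "array of 16 entries" remark.

```python
# Sketch only: field boundaries are copied from the two Sv39 tables quoted
# in the cover letter; all names here are hypothetical, not from the patches.

SV39_VA_BITS = 39   # bits 38..0 are translated in Sv39
INDEX_BITS = 9      # width of the pte and pmd index fields in both layouts


def split_va(va, page_shift):
    """Split a 39-bit virtual address into (pgd, pmd, pte, offset).

    page_shift is 12 for the traditional 4K layout and 16 for the
    64K software base page layout described in the cover letter.
    """
    index_mask = (1 << INDEX_BITS) - 1
    offset = va & ((1 << page_shift) - 1)
    pte = (va >> page_shift) & index_mask
    pmd = (va >> (page_shift + INDEX_BITS)) & index_mask
    # The pgd index takes whatever bits remain below bit 39:
    # 9 bits (38..30) for 4K, 5 bits (38..34) for 64K.
    pgd_shift = page_shift + 2 * INDEX_BITS
    pgd = (va >> pgd_shift) & ((1 << (SV39_VA_BITS - pgd_shift)) - 1)
    return pgd, pmd, pte, offset


def hw_ptes_per_page(page_shift, hw_page_shift=12):
    """Number of hardware-sized (4K) entries covering one software page."""
    return 1 << (page_shift - hw_page_shift)


if __name__ == "__main__":
    va = (3 << 34) | (5 << 25) | (7 << 16) | 9
    print(split_va(va, 16))       # fields under the 64K layout
    print(split_va(va, 12))       # same address under the 4K layout
    print(hw_ptes_per_page(16))   # hardware entries per 64K page
```

Under this reading of the tables, the pgd index is 5 bits wide with
page_shift=16, so a top-level table indexes 32 entries instead of 512.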