From: Xu Lu <luxu.kernel@bytedance.com>
Date: Mon, 9 Dec 2024 11:36:38 +0800
Subject: Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
To: Yu Zhao
Cc: Pedro Falcato, David Hildenbrand, Zi Yan, paul.walmsley@sifive.com,
    palmer@dabbelt.com, aou@eecs.berkeley.edu, ardb@kernel.org,
    anup@brainfault.org, atishp@atishpatra.org, xieyongji@bytedance.com,
    lihangjing@bytedance.com, punit.agrawal@bytedance.com,
    linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org,
    Linux MM <linux-mm@kvack.org>
Hi Yu Zhao,

On Sun, Dec 8, 2024 at 6:03 AM Yu Zhao wrote:
>
> On Sat, Dec 7, 2024 at 1:03 AM Xu Lu wrote:
> >
> > Hi Pedro,
> >
> > On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato wrote:
> > >
> > > On Fri, Dec 6, 2024 at 1:42 PM Xu Lu wrote:
> > > >
> > > > Hi David,
> > > >
> > > > On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand wrote:
> > > > >
> > > > > On 06.12.24 03:00, Zi Yan wrote:
> > > > > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > > > > >
> > > > > >> This patch series attempts to break through the limitation of MMU and
> > > > > >> supports a larger base page on RISC-V, which only supports 4K page size
> > > > > >> now. The key idea is to always manage and allocate memory at a
> > > > > >> granularity of 64K and use Svnapot to accelerate address translation.
> > > > > >> This is the second version; a detailed introduction can be found in [1].
> > > > > >>
> > > > > >> Changes from v1:
> > > > > >> - Rebase on v6.12.
> > > > > >>
> > > > > >> - Adjust the page table entry shift to reduce page table memory usage.
> > > > > >>   For example, in SV39, the traditional va behaves as:
> > > > > >>
> > > > > >>   ----------------------------------------------
> > > > > >>   | pgd index | pmd index | pte index | offset |
> > > > > >>   ----------------------------------------------
> > > > > >>   | 38     30 | 29     21 | 20     12 | 11   0 |
> > > > > >>   ----------------------------------------------
> > > > > >>
> > > > > >>   When we choose 64K as the basic software page, va now behaves as:
> > > > > >>
> > > > > >>   ----------------------------------------------
> > > > > >>   | pgd index | pmd index | pte index | offset |
> > > > > >>   ----------------------------------------------
> > > > > >>   | 38     34 | 33     25 | 24     16 | 15   0 |
> > > > > >>   ----------------------------------------------
> > > > > >>
> > > > > >> - Fix some bugs in v1.
> > > > > >>
> > > > > >> Thanks in advance for comments.
> > > > > >>
> > > > > >> [1] https://lwn.net/Articles/952722/
> > > > > >
> > > > > > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > > > > > in the future? Thanks.
> > > > > >
> > > > > > Have you thought about doing it for ARM64 4KB as well? ARM64's contig PTE
> > > > > > should have a similar effect to RISC-V's SVNAPOT, right?
> > > > >
> > > > > What is the real benefit over 4k + large folios/mTHP?
> > > > >
> > > > > 64K comes with the problem of internal fragmentation: for example, a
> > > > > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > > > > a downside.
> > > >
> > > > The original idea comes from the performance benefits we achieved on
> > > > the ARM 64K kernel. We ran several real-world applications on the ARM
> > > > Ampere Altra platform and found that their performance on the 64K page
> > > > kernel is significantly higher than on the 4K page kernel:
> > > > For Redis, throughput increased by 250% and latency decreased by 70%.
> > > > For MySQL, throughput increased by 16.9% and latency decreased by 14.5%.
> > > > For our own NewSQL database, throughput increased by 16.5% and
> > > > latency decreased by 13.8%.
> > > >
> > > > Also, we compared the performance of 64K versus 4K + large
> > > > folios/mTHP on ARM Neoverse N2. The results show considerable
> > > > performance improvement on the 64K kernel for both speccpu and lmbench,
> > > > even when the 4K kernel enables THP and ARM64_CONTPTE:
> > > > For the speccpu benchmark, the 64K kernel without any huge page
> > > > optimization still achieves a 4.17% higher score than the 4K kernel with
> > > > transparent huge pages and CONTPTE optimization.
> > > > For lmbench, the 64K kernel achieves 75.98% lower memory mapping
> > > > latency (16MB) than the 4K kernel with transparent huge pages and CONTPTE
> > > > optimization, 84.34% higher mmap read open2close bandwidth (16MB), and
> > > > 10.71% lower random load latency (16MB).
> > > > Interestingly, sometimes the kernel with transparent huge page support
> > > > has poorer performance for both 4K and 64K (for example, the mmap read
> > > > bandwidth bench).
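To make the two quoted index layouts concrete, the address split can be sketched as follows. This is a minimal illustration, not kernel code; `split_va` and the constant names are made up, and only the (hi, lo) field boundaries come from the tables above:

```python
# Sketch: splitting an SV39 virtual address into page-table indices,
# for the traditional 4K base page vs. the series' 64K software page.

def split_va(va, fields):
    """Split va into (pgd, pmd, pte, offset) given inclusive (hi, lo) bit ranges."""
    return tuple((va >> lo) & ((1 << (hi - lo + 1)) - 1) for hi, lo in fields)

SV39_4K = [(38, 30), (29, 21), (20, 12), (11, 0)]   # 9/9/9-bit indices, 12-bit offset
SV39_64K = [(38, 34), (33, 25), (24, 16), (15, 0)]  # 5/9/9-bit indices, 16-bit offset

pgd, pmd, pte, off = split_va(0x12_3456_789A, SV39_64K)
```

Note how widening the offset to 16 bits shrinks the pgd index from 9 bits to 5, and each last-level table now covers 512 x 64K = 32MB of address space instead of 512 x 4K = 2MB, which is where the page table memory savings come from.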
> > > > We assume this is due to the overhead of huge pages'
> > > > combination and collapse.
> > > > Also, if you check the full results, you will find that, usually, the
> > > > larger the memory size used for testing, the better the performance
> > > > of the 64K kernel is (compared to the 4K kernel), unless the memory
> > > > size lies in a range where the 4K kernel can apply 2MB huge pages
> > > > while the 64K kernel can't.
> > > > In summary, for performance-sensitive applications which require
> > > > higher bandwidth and lower latency, sometimes 4K pages with huge pages
> > > > may not be the best choice, and a 64K page can achieve better results.
> > > > The test environment and results are attached.
> > > >
> > > > As RISC-V has no native 64K MMU support, we introduce a software
> > > > implementation and accelerate it via Svnapot. Of course, there will be
> > > > some extra overhead compared with a native 64K MMU. Thus, we are also
> > > > trying to persuade the RISC-V community to support an extension for a
> > > > native 64K MMU [1]. Please join us if you are interested.
> > > >
> > > Ok, so you... didn't test this on riscv? And you're basing this
> > > patchset off of a native 64KiB page size kernel being faster than 4KiB
> > > + CONTPTE? I don't see how that makes sense?
> >
> > Sorry for the confusion. I didn't intend to use the ARM data to support
> > this patch, just to explain where the idea came from. We do prefer a
> > 64K MMU for the performance improvement it brought to real applications
> > and benchmarks.
>
> This breaks ABI, doesn't it? Not only does userspace need to be
> recompiled with 64KB alignment, it also must not assume a 4KB base page
> size.

Yes, it does.

> > And since RISC-V does not support it yet, we internally
> > use this patch as a transitional solution for RISC-V.
>
> Distros need to support this as well. Otherwise it's a tech island.
> Also why RV? It can be a generic feature which can apply to other
> archs like x86, right?
> See "page clustering" [1][2].
>
> [1] https://lwn.net/Articles/23785/
> [2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/
>
> > And if a native
> > 64K MMU is available, this patch can be canceled.
>
> Why 64KB? Why not 32KB or 128KB? In general, the less dependency on
> h/w, the better. Ideally, *if* we want to consider this, it should be
> a s/w feature applicable to all (or most of) archs.

We chose RISC-V because of internal business needs, and chose 64K because
of the benefits we have achieved on ARM 64K. It is a pretty ambitious goal
to apply such a feature to all architectures. We would be glad to do so and
will ask for more assistance if everyone thinks it is better. But for now,
perhaps it is a better choice to try it on RISC-V first? After all, not all
architectures support features like Svnapot or CONTPTE. Of course, for
architectures not supporting Svnapot, applying a bigger page size can still
achieve less metadata memory overhead and fewer page faults.

We are pleased to see that similar things have already been considered
before. We give the most respect to William Lee Irwin and Hugh Dickins and
hope they can continue this work. We will cc them in future emails.

Best Regards,

Xu Lu

> >
> > The only usage of
> > this patch I can think of then is to make the kernel support more page
> > sizes than the MMU, as long as Svnapot supports the corresponding size.
> >
> > We will try to release the performance data in the next version. There
> > have been more issues with application and OS adaptation :) So this
> > version is still an RFC.
> >
> > >
> > > /me is confused
> > >
> > > How many of these PAGE_SIZE wins are related to e.g. userspace basing
> > > its buffer sizes (or whatever) off of the system page size? Where
> > > exactly are you gaining time versus the CONTPTE stuff?
> > > I think MM in general would be better off if we were more transparent
> > > with regard to CONTPTE and page sizes instead of hand waving with
> > > "hardware page size != software page size", which is such a *checks
> > > notes* 4.4BSD idea... :) At the very least, this patchset seems to go
> > > against all the work on better supporting large folios and CONTPTE.
> >
> > By the way, the core modification of this patch is turning the pte
> > structure into an array of 16 entries to map a 64K page and accelerating
> > it via Svnapot. I think it is all about the architectural pte and has
> > little impact on pages or folios. Please remind me if anything is
> > missed and I will try to fix it.
> >
> > >
> > > --
> > > Pedro
> >
> > Thanks,
> >
> > Xu Lu
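For readers unfamiliar with Svnapot, the 16-entry pte idea mentioned above can be sketched roughly as below. This is an illustration, not the series' code; the parts taken from the Svnapot extension are the N bit (bit 63) and the 64 KiB NAPOT encoding ppn[3:0] = 0b1000, while the helper name and the prot value are made up:

```python
# Sketch: mapping one 64K software page as 16 consecutive Sv39 leaf PTEs
# using Svnapot. A 64 KiB NAPOT mapping sets the N bit (bit 63) and
# encodes the size in the low PPN bits (ppn[3:0] = 0b1000).

PTE_V = 1 << 0          # valid bit
PTE_N = 1 << 63         # Svnapot N bit
PPN_SHIFT = 10          # PPN occupies bits 10..53 in an Sv39 PTE

def napot_64k_ptes(ppn, prot):
    """Build the 16 identical hardware PTEs backing one 64K software page.
    `ppn` is the 4K-page PPN of the region base and must be 16-aligned."""
    assert ppn % 16 == 0, "64K NAPOT region must be 64K-aligned"
    napot_ppn = ppn | 0b1000            # ppn[3:0] = 0b1000 marks 64 KiB
    pte = PTE_N | (napot_ppn << PPN_SHIFT) | prot | PTE_V
    return [pte] * 16                   # one entry per 4K hardware page

ptes = napot_64k_ptes(0x1230, prot=0b11001110)  # R/W/X/A/D bits, illustrative
```

A TLB that implements Svnapot can then cache the whole 64K range in a single entry, which is where the translation speedup comes from. On hardware without Svnapot these encodings are reserved, so a fallback would have to write 16 distinct per-page PTEs instead.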