References: <20240826204353.2228736-1-peterx@redhat.com>
In-Reply-To: <20240826204353.2228736-1-peterx@redhat.com>
From: Jiaqi Yan
Date: Tue, 27 Aug 2024 15:36:07 -0700
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gavin Shan,
 Catalin Marinas, x86@kernel.org, Ingo Molnar, Andrew Morton,
 Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 Sean Christopherson, Oscar Salvador, Jason Gunthorpe, Borislav Petkov,
 Zi Yan, Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
 Kefeng Wang, Alex Williamson

On Mon, Aug 26, 2024 at 1:44 PM Peter Xu wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
>   special bits for huge pfn mappings when inject" [JasonG]
> - In
>   gup-fast, reorder _special check v.s. _devmap check, so as to make
>   pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
>   used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on mm-unstable, commit b659edec079c of Aug 26th
> latest, with patch "vma remove the unneeded avc bound with non-CoWed folio"
> reverted, as reported broken [0].
>
> This series implements huge pfnmaps support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).
>
> One trick here is that we're still unmature on PUDs in generic paths here
> and there, as DAX is so far the only user. This patchset will add the 2nd
> user of it. Hugetlb can be a 3rd user if the hugetlb unification work can
> go on smoothly, but to be discussed later.
>
> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping.
> This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka,
> follow_pte(), will need to retire soon due to the fact that it only works
> with ptes. A new set of simple API is introduced (follow_pfnmap* API) to
> be able to do whatever follow_pte() can already do, plus that it can also
> process huge pfnmaps now. Half of this series is about that and converting
> all existing pfnmap walkers to use the new API properly. Hopefully the new
> API also looks better to avoid exposing e.g. pgtable lock details into the
> callers, so that it can be used in an even more straightforward way.
>
> Here, three more options will be introduced and involved in huge pfnmap:
>
>   - ARCH_SUPPORTS_HUGE_PFNMAP
>
>     Arch developers will need to select this option when huge pfnmap is
>     supported in arch's Kconfig. After this patchset applied, both x86_64
>     and arm64 will start to enable it by default.
>
>   - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
>     These options are for driver developers to identify whether current
>     arch / config supports huge pfnmaps, making decision on whether it can
>     use the huge pfnmap APIs to inject them. One can refer to the last
>     vfio-pci patch from Alex on the use of them properly in a device
>     driver.
>
> So after the whole set applied, and if one would enable some dynamic debug
> lines in vfio-pci core files, we should observe things like:
>
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it says that vfio-pci faults in PMDs properly for a
> few BAR0 offsets.
>
> Patch Layout
> ============
>
> Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
> Patch 2:         A tiny cleanup
> Patch 3-8:       Preparation patches for huge pfnmap (include introduce
>                  special bit for pmd/pud)
> Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
>                  then drop follow_pte() API
> Patch 17:        Add huge pfnmap support for x86_64
> Patch 18:        Add huge pfnmap support for arm64
> Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems to
> have plan to support arm64 1G later on top of this series [2].
>
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs. With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.
>

Hi Peter,

I am curious if there is any work needed for unmap_mapping_range?
If a driver maps with remap_pfn_range() at 1G granularity, can it
still unmap at PAGE_SIZE granularity? For example, when handling a
poisoned PFN inside the 1G mapping, it would be great if the mapping
could be split into 2M mappings + 4K mappings, so that only the single
poisoned PFN is lost. (Pretty much like the past proposal* to use
HGM** to improve hugetlb's memory failure handling.)

Probably these questions can be answered by reading your code, which
I plan to do, but I just want to ask in case you have an easy answer
for me.

* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017

> More driver support
> -------------------
>
> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied. Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar. But I don't
> think I know all the drivers to know the full picture.
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e QEMU with 128MB BAR 0, with some prefault test, mprotect()
>   and fork() tests on the bar mapped
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs
>   - Main BAR (8GB) start to use PUD mappings
>   - Sub BAR (??MBs?) start to use PMD mappings
>   - Performance wise, slight improvement comparing to the old PTE mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure driver loads fine with PMD mappings
>
> Credits all go to Alex on help testing the GPU/NIC use cases above.
>
> Comments welcomed, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
>
> Alex Williamson (1):
>   vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
>   mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
>   mm: Drop is_huge_zero_pud()
>   mm: Mark special bits for huge pfn mappings when inject
>   mm: Allow THP orders for PFNMAPs
>   mm/gup: Detect huge pfnmap entries in gup-fast
>   mm/pagewalk: Check pfnmap for folio_walk_start()
>   mm/fork: Accept huge pfnmap entries
>   mm: Always define pxx_pgprot()
>   mm: New follow_pfnmap API
>   KVM: Use follow_pfnmap API
>   s390/pci_mmio: Use follow_pfnmap API
>   mm/x86/pat: Use the new follow_pfnmap API
>   vfio: Use the new follow_pfnmap API
>   acrn: Use the new follow_pfnmap API
>   mm/access_process_vm: Use the new follow_pfnmap API
>   mm: Remove follow_pte()
>   mm/x86: Support large pfn mappings
>   mm/arm64: Support large pfn mappings
>
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/include/asm/pgtable.h    |  30 +++++
>  arch/powerpc/include/asm/pgtable.h  |   1 +
>  arch/s390/include/asm/pgtable.h     |   1 +
>  arch/s390/pci/pci_mmio.c            |  22 ++--
>  arch/sparc/include/asm/pgtable_64.h |   1 +
>  arch/x86/Kconfig                    |   1 +
>  arch/x86/include/asm/pgtable.h      |  80 +++++++-----
>  arch/x86/mm/pat/memtype.c           |  17 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
>  drivers/vfio/vfio_iommu_type1.c     |  16 +--
>  drivers/virt/acrn/mm.c              |  16 +--
>  include/linux/huge_mm.h             |  16 +--
>  include/linux/mm.h                  |  57 ++++++++-
>  include/linux/pgtable.h             |  12 ++
>  mm/Kconfig                          |  13 ++
>  mm/gup.c                            |   6 +
>  mm/huge_memory.c                    |  50 +++++---
>  mm/memory.c                         | 183 ++++++++++++++++++++--------
>  mm/pagewalk.c                       |   4 +-
>  virt/kvm/kvm_main.c                 |  19 ++-
>  21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>
>
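
To make the 1G split question above concrete, here is a
back-of-the-envelope sketch of the arithmetic only. It is not kernel
code and says nothing about how (or whether) this series could do such
a split. It assumes x86_64 sizes (4K/2M/1G); struct split_plan and
plan_split_1g() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 geometry; other architectures and configs differ. */
#define SZ_4K_PAGE (UINT64_C(1) << 12)
#define SZ_2M_PMD  (UINT64_C(1) << 21)
#define SZ_1G_PUD  (UINT64_C(1) << 30)

/* Hypothetical result type, for illustration only. */
struct split_plan {
	uint64_t pmd_index;   /* which 2M slot of the 1G range holds the bad page */
	uint64_t pte_index;   /* which 4K slot of that 2M range is the bad page */
	uint64_t pmd_2m_kept; /* 2M mappings that could stay intact */
	uint64_t pte_4k_kept; /* 4K mappings that could stay in the affected 2M range */
	uint64_t bytes_lost;  /* memory that is no longer mapped */
};

/*
 * Hypothetical helper: given the byte offset of one poisoned 4K page
 * inside a 1G mapping, count what would remain mapped if the mapping
 * were split just enough to drop that single page.
 */
static struct split_plan plan_split_1g(uint64_t poisoned_off)
{
	struct split_plan p;

	assert(poisoned_off < SZ_1G_PUD);

	p.pmd_index = poisoned_off / SZ_2M_PMD;
	p.pte_index = (poisoned_off % SZ_2M_PMD) / SZ_4K_PAGE;
	/* Every 2M range except the one holding the poisoned page stays huge. */
	p.pmd_2m_kept = SZ_1G_PUD / SZ_2M_PMD - 1;
	/* In that one 2M range, every 4K page except the poisoned one stays. */
	p.pte_4k_kept = SZ_2M_PMD / SZ_4K_PAGE - 1;
	p.bytes_lost = SZ_4K_PAGE;
	return p;
}
```

For a poisoned page at offset 0x12345678 into the 1G mapping, this
gives 2M slot 145 and 4K slot 325, with 511 2M mappings and 511 4K
mappings kept and a single 4K page lost.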