From: Kefeng Wang <wangkefeng.wang@huawei.com>
Message-ID: <1147332f-790e-487f-8816-1860b8744ab2@huawei.com>
Date: Fri, 16 Aug 2024 11:05:33 +0800
Subject: Re: [PATCH 00/19] mm: Support huge pfnmaps
To: Peter Xu, Jason Gunthorpe
CC: Sean Christopherson, Oscar Salvador, Axel Rasmussen, Will Deacon,
 Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
 Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
 Thomas Gleixner, Dave Hansen, Alex Williamson, Yan Zhao
References: <20240809160909.1023470-1-peterx@redhat.com>
 <20240814123715.GB2032816@nvidia.com>

On 2024/8/16 3:20, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>
>> There is definitely interest here in extending ARM to support the 1G
>> size too, what is missing?
>
> Currently PUD pfnmap relies on the THP_PUD config option:
>
> config ARCH_SUPPORTS_PUD_PFNMAP
>         def_bool y
>         depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>
> Arm64 unfortunately doesn't yet support 1G dax, so this isn't
> applicable there yet.
>
> Ideally, pfnmap is simple enough compared to real THPs that it
> shouldn't need to depend on THP at all, but we'd first need something
> like the series below to land:
>
> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>
> I sent that a while ago but it didn't collect enough input, so I
> decided to unblock this series from it; x86_64 isn't affected, and
> arm64 will at least start with 2M.
>
>>
>>> The other trick is how to allow gup-fast to work on such huge
>>> mappings even though there's no direct way to tell whether an entry
>>> is a normal page or an MMIO mapping. This series keeps the
>>> pte_special solution, reusing the same idea by setting a special bit
>>> on pfnmap PMDs/PUDs so that gup-fast can identify them and fail
>>> properly.
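(Aside: a minimal sketch of that gup-fast check as I read it, built on
the pmd_special() helper this series adds; illustrative only, not the
literal upstream code:)

/*
 * gup-fast walks page tables locklessly and cannot consult the VMA, so
 * a huge entry mapping raw PFNs with no struct page behind it must be
 * rejected, mirroring the long-standing pte_special() check for PTEs.
 */
static inline bool gup_fast_pmd_allowed(pmd_t pmd)	/* hypothetical name */
{
	if (pmd_special(pmd))
		return false;	/* pfnmap/MMIO: fail gup-fast properly */

	return true;		/* normal THP: safe to try pinning */
}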
>>
>> Makes sense
>>
>>> More architectures / More page sizes
>>> ------------------------------------
>>>
>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>
>>> For example, if arm64 can start to support THP_PUD one day, the huge
>>> pfnmap on 1G will be automatically enabled.

Below is a draft patch to enable THP_PUD on arm64. It has only passed
DEBUG_VM_PGTABLE so far, but with it we can test PUD pfnmaps on arm64.

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..ff0d27c72020 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -184,6 +184,7 @@ config ARM64
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
 	select HAVE_ARCH_VMAP_STACK
 	select HAVE_ARM_SMCCC
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..e013fe458476 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 #define pud_valid(pud)		pte_valid(pud_pte(pud))
 #define pud_user(pud)		pte_user(pud_pte(pud))
 #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
+#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
+#define pud_devmap(pud)		pte_devmap(pud_pte(pud))
+#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_mkwrite(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_trans_huge(pud_t pud)
+{
+	return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+	return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
+}
+#endif
 
 static inline bool pgtable_l4_enabled(void);
 
@@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
 					pmd_pte(entry), dirty);
 }
 
+static inline int pudp_set_access_flags(struct vm_area_struct *vma,
+					unsigned long address, pud_t *pudp,
+					pud_t entry, int dirty)
+{
+	return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
+				       pud_pte(entry), dirty);
+}
+
+#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_devmap(pud_t pud)
 {
 	return 0;
 }
+#endif
 
 static inline int pgd_devmap(pgd_t pgd)
 {
@@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 {
 	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
+
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pud_t *pudp)
+{
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
@@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 #define update_mmu_cache(vma, addr, ptep) \
 	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
+#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
 
 #ifdef CONFIG_ARM64_PA_BITS_52
 #define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
--
2.27.0
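For anyone who wants to poke at the draft, a config sketch of what I
believe must be enabled; everything except DEBUG_VM_PGTABLE is inferred
from the dependency chain quoted earlier, so treat it as an assumption:

CONFIG_TRANSPARENT_HUGEPAGE=y
# HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD is selected by the Kconfig hunk above
CONFIG_DEBUG_VM_PGTABLE=y	# the only test the draft has passed so far
# ARCH_SUPPORTS_PUD_PFNMAP is def_bool y once both dependencies are met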
>>
>> Oh that sounds like a bigger step..
>
> Just to mention, no real 1G THP is needed here for pfnmaps. The real
> gap is only the pud helpers, which so far exist only under
> CONFIG_THP_PUD in huge_memory.c.
>
>>
>>> VFIO is so far the only consumer of huge pfnmaps once this series is
>>> applied. Besides the generic remap_pfn_range() optimization above, a
>>> device driver can also try to optimize its mmap() for better VA
>>> alignment at PMD/PUD sizes. IIUC this normally requires userspace
>>> changes, as the driver doesn't usually decide the VA at which a BAR
>>> is mapped. But I don't know all the drivers well enough to have the
>>> full picture.
>>
>> How does alignment work? In most cases I'm aware of, userspace does
>> not use MAP_FIXED, so the expectation would be for the kernel to
>> automatically select a high alignment. I suppose your cases are
>> working because QEMU uses MAP_FIXED and naturally aligns the BAR
>> addresses?
>>
>>> - x86_64 + AMD GPU
>>>   - Needs Alex's modified QEMU to guarantee proper VA alignment to
>>>     make sure all pages are mapped with PUDs
>>
>> Oh :(
>
> So I suppose this answers the question above. :) Yes, alignment is
> needed.
>
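To illustrate the MAP_FIXED trick discussed above: a userspace sketch of
how a VMM like QEMU can hand the kernel a PUD-aligned VA for a BAR
mapping. The function name and the trimmed error handling are mine, not
QEMU's; treat it as an assumption-laden illustration.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define PUD_SIZE	(1UL << 30)	/* 1G on x86_64/arm64 with 4K pages */

/* Hypothetical helper: map a device BAR at a PUD-aligned address. */
static void *map_bar_pud_aligned(int device_fd, off_t bar_offset,
				 size_t bar_size)
{
	/* Over-reserve so a PUD-aligned address certainly exists inside. */
	size_t span = bar_size + PUD_SIZE;
	void *probe = mmap(NULL, span, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (probe == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t aligned = ((uintptr_t)probe + PUD_SIZE - 1) & ~(PUD_SIZE - 1);

	/* MAP_FIXED atomically replaces the reservation at that address. */
	return mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, device_fd, bar_offset);
}

A real implementation would also munmap() the unused head and tail of
the reservation; the point is only that an aligned VA (plus an aligned
physical BAR) is what lets remap_pfn_range() install PUD-level mappings.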