Date: Wed, 22 May 2024 11:13:53 +1000
Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
From: "Nicholas Piggin"
To: "Christophe Leroy", "Andrew Morton", "Jason Gunthorpe", "Peter Xu", "Oscar Salvador", "Michael Ellerman"
Cc: "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", "linuxppc-dev@lists.ozlabs.org"
References: <99575c2c-7840-4fa4-b84e-aaddc7fef4cb@csgroup.eu>
In-Reply-To: <99575c2c-7840-4fa4-b84e-aaddc7fef4cb@csgroup.eu>

On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>
>
> On 20/05/2024 at 14:54, Nicholas Piggin wrote:
> > On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
> >> On book3s/64, the only user of hugepd is hash in 4k mode.
> >>
> >> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
> >>
> >> Rework hash-4k to use contiguous PMD and PUD instead.
> >>
> >> In that setup there are only two huge page sizes: 16M and 16G.
> >>
> >> 16M sits at PMD level and 16G at PUD level.
> >>
> >> pte_update doesn't know page size, let's use the same trick as
> >> hpte_need_flush() to get page size from segment properties. That's
> >> not the most efficient way but let's do that until callers of
> >> pte_update() provide page size instead of just a huge flag.
> >>
> >> Signed-off-by: Christophe Leroy
> >> ---
> >>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
> >>  arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
> >>  arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
> >>  .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
> >>  .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
> >>  arch/powerpc/include/asm/hugetlb.h            |  4 ++
> >>  .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
> >>  .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
> >>  arch/powerpc/include/asm/page.h               |  8 ----
> >>  arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
> >>  arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
> >>  arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
> >>  arch/powerpc/mm/pgtable.c                     |  2 +-
> >>  arch/powerpc/platforms/Kconfig.cputype        |  1 -
> >>  14 files changed, 43 insertions(+), 167 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> index 6472b08fa1b0..c654c376ef8b 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> @@ -74,21 +74,6 @@
> >>  #define remap_4k_pfn(vma, addr, pfn, prot)	\
> >>  	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
> >>
> >> -#ifdef CONFIG_HUGETLB_PAGE
> >> -static inline int hash__hugepd_ok(hugepd_t hpd)
> >> -{
> >> -	unsigned long hpdval = hpd_val(hpd);
> >> -	/*
> >> -	 * if it is not a pte and have hugepd shift mask
> >> -	 * set, then it is a hugepd directory pointer
> >> -	 */
> >> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
> >> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
> >> -		return true;
> >> -	return false;
> >> -}
> >> -#endif
> >> -
> >>  /*
> >>   * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
> >>   * a matter of returning the PTE bits that need to be modified. On 64K PTE,
> >> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> >> index faf3e3b4e4b2..509811ca7695 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> >> @@ -4,6 +4,7 @@
> >>  #ifdef __KERNEL__
> >>
> >>  #include
> >> +#include
> >>
> >>  /*
> >>   * Common bits between 4K and 64K pages in a linux-style PTE.
> >> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> >>  			    pte_t *ptep, unsigned long pte, int huge);
> >>  unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
> >>  /* Atomic PTE updates */
> >> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >> -					 unsigned long addr,
> >> -					 pte_t *ptep, unsigned long clr,
> >> -					 unsigned long set,
> >> -					 int huge)
> >> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
> >> +						 unsigned long set)
> >>  {
> >>  	__be64 old_be, tmp_be;
> >> -	unsigned long old;
> >>
> >>  	__asm__ __volatile__(
> >>  	"1:	ldarx	%0,0,%3		# pte_update\n\
> >> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >>  	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
> >>  	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
> >>  	: "cc" );
> >> +
> >> +	return be64_to_cpu(old_be);
> >> +}
> >> +
> >> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >> +					 unsigned long addr,
> >> +					 pte_t *ptep, unsigned long clr,
> >> +					 unsigned long set,
> >> +					 int huge)
> >> +{
> >> +	unsigned long old;
> >> +
> >> +	old = hash__pte_update_one(ptep, clr, set);
> >> +
> >> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> >> +		unsigned int psize = get_slice_psize(mm, addr);
> >> +		int nb, i;
> >> +
> >> +		if (psize == MMU_PAGE_16M)
> >> +			nb = SZ_16M / PMD_SIZE;
> >> +		else if (psize == MMU_PAGE_16G)
> >> +			nb = SZ_16G / PUD_SIZE;
> >> +		else
> >> +			nb = 1;
> >> +
> >> +		for (i = 1; i < nb; i++)
> >> +			hash__pte_update_one(ptep + i, clr, set);
> >> +	}
> >>  	/* huge pages use the old page table lock */
> >>  	if (!huge)
> >>  		assert_pte_locked(mm, addr);
> >>
> >> -	old = be64_to_cpu(old_be);
> >>  	if (old & H_PAGE_HASHPTE)
> >>  		hpte_need_flush(mm, addr, ptep, old, huge);
> >>
> >
> > Nice series, I don't know this hugepd code very well but I'll try.
> > Why do you have to replicate the PTE entry here? The hash table refill
> > should always be working on the first PTE of the page otherwise we have
> > bigger problems.
>
> I don't know how book3s/64 works exactly, but on nohash, when you get a
> TLB miss exception the only thing you have is the address and you don't
> know yet that it is a hugepage, so you get the PTE as if it was a 4k page
> and it is only when you read that PTE that you know it is a hugepage.
>
> Ok, on book3s/64 the page size seems to be encoded inside the segment so
> maybe it is a bit different but anyway the TLB miss exception (or DSI ?)
> can happen at any address.

Right. If you think of the hash page table as a software loaded TLB
(which is how Linux kind of thinks of it), then DSI is a TLB miss.
hash_page_x calls find the Linux pte and load that translation into
the hash page table. One of the hard parts is keeping them coherent
with low overhead. This requires pte bits H_PAGE_BUSY as a lock and
H_PAGE_HASHPTE which means it might be in the hash table.
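
To make the busy-bit idea concrete, here is a minimal user-space sketch in C11 atomics. It is only an illustration, not the kernel code: the real hash__pte_update_one() uses a ldarx/stdcx. loop on a big-endian PTE. The bit positions of PTE_BUSY and PTE_HASHPTE and both function names are hypothetical stand-ins for H_PAGE_BUSY and H_PAGE_HASHPTE.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define PTE_BUSY    (UINT64_C(1) << 0)  /* stand-in for H_PAGE_BUSY */
#define PTE_HASHPTE (UINT64_C(1) << 1)  /* stand-in for H_PAGE_HASHPTE */

/*
 * Clear the bits in clr and set the bits in set, waiting while another
 * updater holds the busy bit. Returns the value seen before the update.
 */
static uint64_t pte_update_sketch(_Atomic uint64_t *ptep,
				  uint64_t clr, uint64_t set)
{
	uint64_t old = atomic_load(ptep);

	for (;;) {
		if (old & PTE_BUSY) {
			/* Someone else owns the entry: reload and retry. */
			old = atomic_load(ptep);
			continue;
		}

		uint64_t updated = (old & ~clr) | set;

		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ptep, &old, updated))
			return old;
	}
}

/* The old value tells the caller whether a hash table flush may be needed. */
static int might_be_in_hash_table(uint64_t old)
{
	return (old & PTE_HASHPTE) != 0;
}
```

A caller would check might_be_in_hash_table() on the returned value and only then pay for the flush, which mirrors the "if (old & H_PAGE_HASHPTE)" test in the quoted patch.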

So Linux PTE and hash PTE have to be 1:1 in general. There are probably
cases where we could get away from 1:1, but I would much prefer not to.
Maybe read-only access would be okay though. But the hash_page will
have to always operate on the 0th pte, which I think we get via segment
size masking, same for any set / update / clear of the pte.

> >
> > What paths look at the N > 0 PTEs of a contiguous page entry?
> >
>
> pte_offset_kernel() or pte_offset_map_lock() will land on any contiguous
> PTE based on the address handed to pte_index(), as if it was a standard
> (4k or 64k) page.
>
> pte_index() doesn't know it is a hugepage, that's the reason why we need
> to duplicate the entry.

From the mm/ side of things, hugetlb page tables are always walked via
the huge vma which knows the page size and could align address... I guess
except for fast gup? Which should be read-only. So okay you do need to
replicate huge ptes for fast gup at least. Any others?

There's going to need to be a little more to it. __hash_page_huge sets
PTE accessed and dirty for example, so if we allow any PTE readers to
check the non-0th pte we would have to do something about that. How do
you deal with dirty/accessed bits for other subarchs?

We could just remove the hash_page setting of those bits and just cause
a fault and require Linux mm to set them. At least for hugepages we
could do that probably without any real performance worry.

Thanks,
Nick
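
For reference, the replication arithmetic discussed in this thread can be sketched in plain C as follows. This is a hedged illustration, not kernel code: the PMD and PUD sizes are assumed values for hash with 4K base pages (2M and 256M), and the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_16M (UINT64_C(16) << 20)
#define SZ_16G (UINT64_C(16) << 30)

/* Assumed sizes for hash MMU with 4K pages; check against the real headers. */
#define PMD_SIZE_ASSUMED (UINT64_C(1) << 21)  /* 2M */
#define PUD_SIZE_ASSUMED (UINT64_C(1) << 28)  /* 256M */

/* Hypothetical stand-in for the MMU_PAGE_* page size indexes. */
enum psize_sketch { PSIZE_4K, PSIZE_16M, PSIZE_16G };

/* Number of replicated entries backing one huge page. */
static unsigned int nr_contig_entries(enum psize_sketch psize)
{
	switch (psize) {
	case PSIZE_16M:
		return (unsigned int)(SZ_16M / PMD_SIZE_ASSUMED);
	case PSIZE_16G:
		return (unsigned int)(SZ_16G / PUD_SIZE_ASSUMED);
	default:
		return 1;
	}
}

/*
 * Round a table index down to the 0th entry of its contiguous group.
 * Relies on nr being a power of two.
 */
static size_t first_entry_index(size_t index, unsigned int nr)
{
	return index & ~((size_t)nr - 1);
}
```

Under these assumed sizes a 16M page spans 8 PMD entries and a 16G page spans 64 PUD entries, and any writer that wants to stay on the 0th entry can mask the index with first_entry_index() before touching the table.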