Subject: Re: [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
Date: Wed, 12 Feb 2025 14:44:42 +0000
From: Ryan Roberts <ryan.roberts@arm.com>
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
 Pasha Tatashin, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
 Mark Rutland, Ard Biesheuvel, Dev Jain, Alexandre Ghiti, Steve Capper,
 Kevin Brodsky
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 stable@vger.kernel.org
References: <20250205151003.88959-1-ryan.roberts@arm.com>
 <20250205151003.88959-3-ryan.roberts@arm.com>
 <83103cbb-e2b8-4177-aeaf-c3b6e6b08008@arm.com>
 <25155f6e-8a62-405e-852d-c07a55be0ac5@arm.com>
In-Reply-To: <25155f6e-8a62-405e-852d-c07a55be0ac5@arm.com>

On 06/02/2025 12:55, Ryan Roberts wrote:
> On 06/02/2025 06:15, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>>> CONT_PMD_SIZE). So the function has to figure out the size from the
>>> huge_pte pointer. This was previously done by walking the pgtable to
>>> determine the level, then using the PTE_CONT bit to determine the number
>>> of ptes.
>>
>> Actually, PTE_CONT gets used to determine whether the entry is a normal
>> (i.e. PMD/PUD based) huge page or a cont PTE/PMD based huge page, just to
>> call into either the standard __ptep_get_and_clear() or the specific
>> get_clear_contig(), after determining the count via find_num_contig(),
>> which walks the page table.
>>
>> PTE_CONT presence is only used to determine the switch above but not
>> to determine the number of ptes for the mapping as mentioned earlier.
>
> Sorry, I don't really follow your distinction; PTE_CONT is used to decide
> whether we are operating on a single entry (pte_cont()==false) or on multiple
> entries (pte_cont()==true). For the multiple entry case, the level tells you
> the exact number.
>
> I can certainly tidy up this description a bit, but I think we both agree that
> the value of PTE_CONT is one of the inputs into deciding how many entries need
> to be operated on?
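
For orientation, the pre-patch dispatch being debated here looks roughly like
the sketch below (reconstructed from the removed lines in the diff quoted
further down; a rough sketch of the old huge_ptep_get_and_clear() body, not
verbatim kernel source):

	int ncontig;
	size_t pgsize;
	pte_t orig_pte = __ptep_get(ptep);

	/* PMD/PUD block mapping: a single entry covers the whole huge page. */
	if (!pte_cont(orig_pte))
		return __ptep_get_and_clear(mm, addr, ptep);

	/*
	 * Contiguous mapping: walk the page table to find the level; the
	 * level plus PTE_CONT determine how many entries to operate on.
	 */
	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);

The pte_cont() test is only meaningful for present ptes, which is the bug the
patch fixes by deriving the count from the passed-in size instead.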
>
>>
>> There are two similar functions which determine this:
>>
>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>> 			   pte_t *ptep, size_t *pgsize)
>> {
>> 	pgd_t *pgdp = pgd_offset(mm, addr);
>> 	p4d_t *p4dp;
>> 	pud_t *pudp;
>> 	pmd_t *pmdp;
>>
>> 	*pgsize = PAGE_SIZE;
>> 	p4dp = p4d_offset(pgdp, addr);
>> 	pudp = pud_offset(p4dp, addr);
>> 	pmdp = pmd_offset(pudp, addr);
>> 	if ((pte_t *)pmdp == ptep) {
>> 		*pgsize = PMD_SIZE;
>> 		return CONT_PMDS;
>> 	}
>> 	return CONT_PTES;
>> }
>>
>> find_num_contig() already assumes that the entry is a contig huge page and
>> just finds whether it is a PMD or PTE based one. This always requires a
>> prior determination via pte_cont() that the PTE_CONT bit is set, before
>> calling find_num_contig() in each instance.
>
> Agreed.
>
>> But num_contig_ptes() can get the same information without walking the
>> page table and thus without predetermining whether PTE_CONT is set or not.
>> The size can be derived from the VMA argument when present.
>
> Also agreed. But the VMA is not provided to this function. And because we want
> to use it for kernel space mappings, I think it's a bad idea to pass the VMA.
>
>> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>> {
>> 	int contig_ptes = 0;
>>
>> 	*pgsize = size;
>>
>> 	switch (size) {
>> #ifndef __PAGETABLE_PMD_FOLDED
>> 	case PUD_SIZE:
>> 		if (pud_sect_supported())
>> 			contig_ptes = 1;
>> 		break;
>> #endif
>> 	case PMD_SIZE:
>> 		contig_ptes = 1;
>> 		break;
>> 	case CONT_PMD_SIZE:
>> 		*pgsize = PMD_SIZE;
>> 		contig_ptes = CONT_PMDS;
>> 		break;
>> 	case CONT_PTE_SIZE:
>> 		*pgsize = PAGE_SIZE;
>> 		contig_ptes = CONT_PTES;
>> 		break;
>> 	}
>>
>> 	return contig_ptes;
>> }
>>
>> On a side note, why can't num_contig_ptes() be used all the time and
>> find_num_contig() be dropped? Or am I missing something here?
>
> There are 2 remaining users of find_num_contig() after my series:
> huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
> only legitimately be called for present ptes (so it's safe to check pte_cont()).
> huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
> to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA, but I
> guess you could do the trick where you take the size of the folio that the pte
> points to?
>
> So yes, I think we could drop find_num_contig() and I agree it would be an
> improvement.
>
> But to be honest, grabbing the folio size also feels like a hack to me (we do
> this in other places too). While today the folio size is guaranteed to be
> the same size as the huge pte in practice, I'm not sure there is any spec that
> mandates that?
>
> Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte
> so we can tell the size of both present and non-present ptes, then do the table
> walk trick to find the level. Shrug.
>
>>
>>>
>>> But the PTE_CONT bit is only valid when the pte is present. For
>>> non-present pte values (e.g. markers, migration entries), the previous
>>> implementation was therefore erroneously determining the size. There is
>>> at least one known caller in core-mm, move_huge_pte(), which may call
>>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>>> this case. Additionally the "regular" ptep_get_and_clear() is robust to
>>> being called for non-present ptes so it makes sense to follow the
>>> behaviour.
>>
>> With the VMA argument and num_contig_ptes(), the dependency on PTE_CONT being
>> set and the entry being mapped might not be required.
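
To illustrate the VMA point: a caller that does have the VMA can derive the
size without reading the pte at all, so the lookup works for present and
non-present entries alike. A minimal sketch, assuming the generic hstate
helpers (not code from this series):

	size_t pgsize;
	unsigned long sz = huge_page_size(hstate_vma(vma));	/* size from the VMA's hstate */
	int ncontig = num_contig_ptes(sz, &pgsize);		/* no page table walk, no pte_cont() */

huge_ptep_get_and_clear() instead takes the size as an explicit sz argument,
since it must also work for kernel-space mappings where no VMA exists.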
>>>>
>>> Fix this by using the new sz parameter which is now provided to the
>>> function. Additionally when clearing each pte in a contig range, don't
>>> gather the access and dirty bits if the pte is not present.
>>
>> Makes sense.
>>
>>> An alternative approach that would not require API changes would be to
>>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>>> cleaner to follow other APIs' lead and just pass in the size.
>>
>> Right, changing the arguments in the API will help solve this problem.
>>
>>> While we are at it, add some debug warnings in functions that require
>>> the pte to be present.
>>>
>>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>>> entry offset field (layout of non-present pte). Since hugetlb is never
>>> swapped to disk, this field will only be populated for markers, which
>>> always set this bit to 0, and hwpoison swap entries, which set the offset
>>> field to a PFN; so it would only ever be 1 for a 52-bit PVA system where
>>> memory in that high half was poisoned (I think!). So in practice, this
>>> bit would almost always be zero for non-present ptes and we would only
>>> clear the first entry if it was actually a contiguous block. That's
>>> probably a less severe symptom than if it was always interpreted as 1
>>> and cleared out potentially-present neighboring PTEs.
>>>
>>> Cc:
>>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>>> Signed-off-by: Ryan Roberts
>>> ---
>>>  arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>>>  1 file changed, 29 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 06db4649af91..328eec4bfe55 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>>  			      unsigned long pgsize,
>>>  			      unsigned long ncontig)
>>>  {
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -	unsigned long i;
>>> -
>>> -	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>>> -		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -		/*
>>> -		 * If HW_AFDBM is enabled, then the HW could turn on
>>> -		 * the dirty or accessed bit for any page in the set,
>>> -		 * so check them all.
>>> -		 */
>>> -		if (pte_dirty(pte))
>>> -			orig_pte = pte_mkdirty(orig_pte);
>>> -
>>> -		if (pte_young(pte))
>>> -			orig_pte = pte_mkyoung(orig_pte);
>>> +	pte_t pte, tmp_pte;
>>> +	bool present;
>>> +
>>> +	pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +	present = pte_present(pte);
>>> +	while (--ncontig) {
>>
>> Although this does the right thing by calling __ptep_get_and_clear() once
>> for non-contig huge pages, I wonder whether the cont/non-cont separation
>> should be maintained in the caller huge_ptep_get_and_clear(), keeping the
>> current logical bifurcation intact.
>
> To what benefit?
>
>>> +		ptep++;
>>> +		addr += pgsize;
>>> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +		if (present) {
>>
>> Checking for present entries makes sense here.
>>
>>> +			if (pte_dirty(tmp_pte))
>>> +				pte = pte_mkdirty(pte);
>>> +			if (pte_young(tmp_pte))
>>> +				pte = pte_mkyoung(pte);
>>> +		}
>>>  	}
>>> -	return orig_pte;
>>> +	return pte;
>>>  }
>>>
>>>  static pte_t get_clear_contig_flush(struct mm_struct *mm,
>>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>>  {
>>>  	int ncontig;
>>>  	size_t pgsize;
>>> -	pte_t orig_pte = __ptep_get(ptep);
>>> -
>>> -	if (!pte_cont(orig_pte))
>>> -		return __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> -	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>>
>>> +	ncontig = num_contig_ptes(sz, &pgsize);
>>
>> __ptep_get_and_clear() can still be called here if 'ncontig' is
>> returned as 0, indicating a normal non-contig huge page, thus
>> keeping get_clear_contig() unchanged just to handle contig huge
>> pages.
>
> I think you're describing the case where num_contig_ptes() returns 0? The
> intention, from my reading of the function, is that num_contig_ptes() returns
> the number of ptes that need to be operated on (e.g. 1 for a single entry or N
> for a contig block). It will only return 0 if called with an invalid huge size.
> I don't believe it will ever "return 0 indicating a normal non-contig huge page".
>
> Perhaps the right solution is to add a warning if returning 0?
>
>>>  	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>>  }
>>>
>>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  	pgprot_t hugeprot;
>>>  	pte_t orig_pte;
>>>
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>>  	if (!pte_cont(pte))
>>>  		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>>
>>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>>  		return 0;
>>>
>>>  	orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>>> +	VM_WARN_ON(!pte_present(orig_pte));
>>>
>>>  	/* Make sure we don't lose the dirty or young state */
>>>  	if (pte_dirty(orig_pte))
>>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>  	size_t pgsize;
>>>  	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep))) {
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte)) {
>>>  		__ptep_set_wrprotect(mm, addr, ptep);
>>>  		return;
>>>  	}
>>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>>>  	struct mm_struct *mm = vma->vm_mm;
>>>  	size_t pgsize;
>>>  	int ncontig;
>>> +	pte_t pte;
>>>
>>> -	if (!pte_cont(__ptep_get(ptep)))
>>> +	pte = __ptep_get(ptep);
>>> +	VM_WARN_ON(!pte_present(pte));
>>> +
>>> +	if (!pte_cont(pte))
>>>  		return ptep_clear_flush(vma, addr, ptep);
>>>
>>>  	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> In all the above instances, should num_contig_ptes() not be called to determine
>> whether a given entry is a non-contig or contig huge page, thus dropping the
>> need for the pte_cont() and pte_present() tests as proposed here?
>
> Yeah maybe. But as per above, we have options for how to do that. I'm not sure
> which is preferable at the moment. What do you think? Regardless, I think that
> cleanup would be a separate patch (which I'm happy to add for v2). For this bug
> fix, I was trying to do the minimum.

I took another look at this; I concluded that we should switch from
find_num_contig() to num_contig_ptes() for the 2 cases where we have the vma
and can therefore directly get the huge_ptep size. That leaves one callsite in
huge_ptep_set_wrprotect(), which doesn't have the vma.
One approach would be to grab the folio out of the pte and use the size of the
folio. That's already done in huge_ptep_get(). But actually I think that's a
very brittle approach, because there is nothing stopping the size of the folio
from changing in future (you could have a folio twice the size mapped by 2
huge_ptes, for example).

So I'm actually proposing to keep find_num_contig() and use it additionally in
huge_ptep_get(). That will mean we end up with 2 callsites for
find_num_contig(), and 0 places that infer the huge_pte size from the folio
size. I think that's much cleaner personally. I'll do this for v2.

Thanks,
Ryan

>
> Thanks,
> Ryan
>
>
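
For reference, the folio-size trick described (and rejected as brittle) above
would look something like the sketch below. It is a hypothetical illustration,
not code from the series; it bakes in the assumption that the mapped folio is
exactly huge_pte sized, which is precisely the assumption being questioned, and
it is only meaningful for present ptes:

	/* Hypothetical sketch only: derive the geometry from the mapped folio. */
	pte_t pte = __ptep_get(ptep);
	struct folio *folio = page_folio(pte_page(pte));
	size_t pgsize;
	int ncontig = num_contig_ptes(folio_size(folio), &pgsize);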