From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BCFFEC7115C for ; Wed, 25 Jun 2025 08:45:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 525EB6B00A1; Wed, 25 Jun 2025 04:45:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4FD406B00A2; Wed, 25 Jun 2025 04:45:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 413836B00A3; Wed, 25 Jun 2025 04:45:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 2AFB86B00A1 for ; Wed, 25 Jun 2025 04:45:10 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 97C8D1619A2 for ; Wed, 25 Jun 2025 08:45:09 +0000 (UTC) X-FDA: 83593288338.03.D1AB416 Received: from out-174.mta0.migadu.com (out-174.mta0.migadu.com [91.218.175.174]) by imf07.hostedemail.com (Postfix) with ESMTP id 7F92E40002 for ; Wed, 25 Jun 2025 08:45:07 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=G4CAHzSj; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf07.hostedemail.com: domain of lance.yang@linux.dev designates 91.218.175.174 as permitted sender) smtp.mailfrom=lance.yang@linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1750841107; a=rsa-sha256; cv=none; b=KXx3FDSEkIR8tzomV2TplTCHNiQgQo5QrcTUauAANaDr2I9oVpbzHyqVUIYkg4cci8uLp+ 7LOIkTuNfupLEw0qFWjJQWRpJxVvGwC7sOJyTClNzWdrjmQDr5kGGWqrYR9c9UanwKnvfq 7iD8FBuhVjzosqM4LvlrnSNYQoU6DfM= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=G4CAHzSj; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf07.hostedemail.com: domain of lance.yang@linux.dev designates 91.218.175.174 as permitted sender) smtp.mailfrom=lance.yang@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1750841107; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=c9eJXWiLkEHbzhuAnPgypPWcb3z11lz3Pujb6AiAmuc=; b=DpC//XqBAIXPb+nmRNODyDRO5rCuafuxCxBYDR1kXyF28uhMpEtHm8uP6t9muTWFUsvc+c 5yG7iTFctECzlUq8jEbFNsfR5m/iskcUOvEYQoCLnfAMCiMg8S1yKRT62HtzC1RMwcciyL RNzSBkV4kdSaM5t9VG+4WbcT/vX07bs= Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1750841102; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=c9eJXWiLkEHbzhuAnPgypPWcb3z11lz3Pujb6AiAmuc=; b=G4CAHzSj6ghduZghtJvd0SSnHNyQgbjvNmbKL8+wxYzsjj++dTHT9bjPssGm42SDZYXT4m fFuuOiQUmw9TdQ9A4K1i4Z1XZY1qhpB07s/HL0NajUB1FdmvfYUf1l9v0vG+M2F6U8Cvr0 LiDZ6A36AoDyGYUoU4WVXGpL1ELKOyI= Date: Wed, 25 Jun 2025 16:44:47 +0800 MIME-Version: 1.0 Subject: Re: [PATCH v4 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Content-Language: en-US To: David Hildenbrand Cc: 21cnbao@gmail.com, akpm@linux-foundation.org, baolin.wang@linux.alibaba.com, chrisl@kernel.org, kasong@tencent.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-riscv@lists.infradead.org, lorenzo.stoakes@oracle.com, ryan.roberts@arm.com, v-songbaohua@oppo.com, x86@kernel.org, ying.huang@intel.com, zhengtangquan@oppo.com, Lance Yang References: <20250624152654.38145-1-ioworker0@gmail.com> <2c19a6cf-0b42-477b-a672-ed8c1edd4267@redhat.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Lance Yang In-Reply-To: <2c19a6cf-0b42-477b-a672-ed8c1edd4267@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Rspam-User: X-Stat-Signature: q77oz5chr8mqcwpd8r8u7mt6fwz6xxni X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 7F92E40002 X-HE-Tag: 1750841107-268501 X-HE-Meta: U2FsdGVkX1+amNzH0IlUk1WCn1yi4jE0OCzJ4fhmXFSbU1YmpcbJB9v06a7B79YI3eRJ+ix4rR1HRkHnr9m091sG/GPb1Dk7NZOcN9IT8Ar/Cb4RUe5GNB+gKDEcGD4cVqdl+sBuJbQUIlFWFyrdi0+2l9aJdvQCWi7nLBhHybvtX4/BVGYP+k8XWXhH+bBZ85JrIKXlRytZjWuQmn8imk1GhyOM5gpvO5fZ5rsNn2AwaRA3IiLN1wNutXMZLvyD7Gv+3A0n4pyNSqoHLfoSB9UMp5a9iVvDsVtXyqPUFb3LKlgU8UD/NaGEnzucAKEwAicw3SPybgJRte6nWYERuHqlyu1zHAU3Rf5ddyEwbO/Rgees3UVC1DUJ+UkZHTP0GxU3je86upuj/kS9IzZ3RqOftkzIsDjLy2c9QOvVFv1LpmuUeN6G0Xugom6GYvK/uCn0hLzzpH8LjkT2D6QkGvEktizRapmLOGft/2hZjgj27nGFTUbuUR77swDzRcjgyUkoUYmse7x3eImCjxbshJlCu9yRgYbjbr+1K11iI2T4UGEmhKoAvrnqj8fRUhjq1bDHXBZnJPglHprPEqjNMC0JaqN23KkcaE9jfGkWVhJcs1JLwD5SlbMHE3TldGTm5OtVEma9ULXeqwyx1P9ZCjNSrK5zZ2cWxJpc+XDshjRQolN6RhW+cdGukGfDSVZgUEm3YPkF2/WvKbSStdY30POUOcYatd0TYJOIBBaqPihWHGTYCU7I06FXWQs5CvQjxnqU58qYV4cW2aX3JaYdwrt2DzMtP0UwHl5FW1Pn7XEpWp9Zrlvx0R6wqlIPObz/coiGvh61XV6mbige4xuWwAD8Nb6OUlXnnXMlRiucEVTjEKZsWU2/X5fOTO0lF/lAoHBDLT6xhVFGdn6TUZrX3gpQwBUOLYdhxIP8q6h2/KfMg51p2om0UYiVOfWy+MS0Ct31zHfJOtuhZ3jy1J9 GIdJcDS4 HvPyMaNU1QLIVtk6F2iwVb0YIq4lyXnQet7nholdzW5LLnIA9W0gSuIT2NmuPLM9jKY8HWBVvpLotHD+0h9XeKabc4iiYrKADoNEia6aY9HbwJyreqV2QsB6UaJiKeJo6Fts3aiFL8flRlDgGYUeQ/c9KDoMlZQew+i6I6dsCz2wu/maa6h4fYaLcvIyXAPTwo9wrlVk0pCFEX5qLteheSgwdzdZZV5r/YXFmTzPQucPeBIQJMEaVMsoTyrZJHO6EIW2Lb98z5Isiu9hK0meGJP3/URyHsmYUSgWO1L1GAEOw3El6XNeGYuiw6Q== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2025/6/24 23:34, David Hildenbrand wrote: > On 24.06.25 17:26, Lance Yang wrote: >> On 2025/6/24 20:55, David Hildenbrand wrote: >>> On 14.02.25 10:30, Barry Song wrote: >>>> From: Barry Song >> [...] >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 89e51a7a9509..8786704bd466 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -1781,6 +1781,25 @@ void folio_remove_rmap_pud(struct folio *folio, >>>> struct page *page, >>>>    #endif >>>>    } >>>> +/* We support batch unmapping of PTEs for lazyfree large folios */ >>>> +static inline bool can_batch_unmap_folio_ptes(unsigned long addr, >>>> +            struct folio *folio, pte_t *ptep) >>>> +{ >>>> +    const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; >>>> +    int max_nr = folio_nr_pages(folio); >>> >>> Let's assume we have the first page of a folio mapped at the last page >>> table entry in our page table. >> >> Good point. I'm curious if it is something we've seen in practice ;) > > I challenge you to write a reproducer :P I assume it might be doable > through simple mremap(). Yes! The scenario is indeed reproducible from userspace ;p First, I get a 64KB folio by allocating a large anonymous mapping and advising the kernel with madvise(MADV_HUGEPAGE). After faulting in the pages, /proc/self/pagemap confirms the PFNs are contiguous. Then, the key is to use mremap() with MREMAP_FIXED to move the folio to a virtual address that crosses a PMD boundary. Doing so ensures the physically contiguous folio is mapped by PTEs from two different page tables. The C reproducer is attached. It was tested on a system with 64KB mTHP enabled (in madvise mode). Please correct me if I'm wrong ;) ``` #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #define PAGE_SIZE ((size_t)sysconf(_SC_PAGESIZE)) #define FOLIO_SIZE (64 * 1024) #define NUM_PAGES_IN_FOLIO (FOLIO_SIZE / PAGE_SIZE) #define PMD_SIZE (2 * 1024 * 1024) int get_pagemap_entry(uint64_t *entry, int pagemap_fd, uintptr_t vaddr) { size_t offset = (vaddr / PAGE_SIZE) * sizeof(uint64_t); if (pread(pagemap_fd, entry, sizeof(uint64_t), offset) != sizeof(uint64_t)) { perror("pread pagemap"); return -1; } return 0; } int is_page_present(uint64_t entry) { return (entry >> 63) & 1; } uint64_t get_pfn(uint64_t entry) { return entry & ((1ULL << 55) - 1); } bool verify_contiguity(int pagemap_fd, uintptr_t vaddr, size_t size, const char *label) { printf("\n--- Verifying Contiguity for: %s at 0x%lx ---\n", label, vaddr); printf("Page | Virtual Address | Present | PFN (Physical) | Contiguous?\n"); printf("-----+---------------------------+---------+--------------------+-------------\n"); uint64_t first_pfn = 0; bool is_contiguous = true; int num_pages = size / PAGE_SIZE; for (int i = 0; i < num_pages; ++i) { uintptr_t current_vaddr = vaddr + i * PAGE_SIZE; uint64_t pagemap_entry; if (get_pagemap_entry(&pagemap_entry, pagemap_fd, current_vaddr) != 0) { is_contiguous = false; break; } if (!is_page_present(pagemap_entry)) { printf(" %2d | 0x%016lx | No | N/A | Error\n", i, current_vaddr); is_contiguous = false; continue; } uint64_t pfn = get_pfn(pagemap_entry); char contiguous_str[4] = "Yes"; if (i == 0) { first_pfn = pfn; } else { if (pfn != first_pfn + i) { strcpy(contiguous_str, "No!"); is_contiguous = false; } } printf(" %2d | 0x%016lx | Yes | 0x%-16lx | %s\n", i, current_vaddr, pfn, contiguous_str); } if (is_contiguous) { printf("Verification PASSED: PFNs are contiguous for %s.\n", label); } else { printf("Verification FAILED: PFNs are NOT contiguous for %s.\n", label); } return is_contiguous; } int main(void) { printf("--- Folio-across-PMD-boundary reproducer ---\n"); printf("Page size: %zu KB, Folio size: %zu KB, PMD coverage: %zu MB\n", PAGE_SIZE / 1024, FOLIO_SIZE / 1024, PMD_SIZE / (1024 * 1024)); size_t source_size = 4 * 1024 * 1024; void *source_addr = mmap(NULL, source_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (source_addr == MAP_FAILED) { perror("mmap source"); exit(EXIT_FAILURE); } printf("\n1. Source memory mapped at: %p\n", source_addr); if (madvise(source_addr, source_size, MADV_HUGEPAGE) != 0) { perror("madvise MADV_HUGEPAGE"); } printf("2. Advised kernel to use large folios (MADV_HUGEPAGE).\n"); memset(source_addr, 'A', source_size); printf("3. Faulted in source pages.\n"); int pagemap_fd = open("/proc/self/pagemap", O_RDONLY); if (pagemap_fd < 0) { perror("open /proc/self/pagemap"); exit(EXIT_FAILURE); } if (!verify_contiguity(pagemap_fd, (uintptr_t)source_addr, FOLIO_SIZE, "Source Address (pre-mremap)")) { fprintf(stderr, "\nInitial folio allocation failed. Cannot proceed.\n"); close(pagemap_fd); munmap(source_addr, source_size); exit(EXIT_FAILURE); } uintptr_t search_base = 0x10000000000UL; uintptr_t pmd_boundary = (search_base + PMD_SIZE) & ~(PMD_SIZE - 1); uintptr_t target_vaddr = pmd_boundary - PAGE_SIZE; printf("\n5. Calculated target address to be 0x%lx\n", target_vaddr); munmap((void *)target_vaddr, FOLIO_SIZE); void *new_addr = mremap(source_addr, FOLIO_SIZE, FOLIO_SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, (void *)target_vaddr); if (new_addr == MAP_FAILED) { perror("mremap"); close(pagemap_fd); exit(EXIT_FAILURE); } printf("6. Successfully mremap'd %zu KB to 0x%lx.\n", FOLIO_SIZE / 1024, (uintptr_t)new_addr); bool final_success = verify_contiguity(pagemap_fd, (uintptr_t)new_addr, FOLIO_SIZE, "Target Address (post-mremap)"); printf("\n--- Final Conclusion ---\n"); if (final_success) { printf("✅ SUCCESS: The folio's pages remained physically contiguous after remapping to a PMD-crossing virtual address.\n"); printf(" The reproducer successfully created the desired edge-case memory layout.\n"); } else { printf("❌ UNEXPECTED FAILURE: The pages were not contiguous after mremap.\n"); } close(pagemap_fd); munmap(new_addr, FOLIO_SIZE); return 0; } ``` $ a.out ``` --- Folio-across-PMD-boundary reproducer --- Page size: 4 KB, Folio size: 64 KB, PMD coverage: 2 MB 1. Source memory mapped at: 0x7f2e41200000 2. Advised kernel to use large folios (MADV_HUGEPAGE). 3. Faulted in source pages. --- Verifying Contiguity for: Source Address (pre-mremap) at 0x7f2e41200000 --- Page | Virtual Address | Present | PFN (Physical) | Contiguous? -----+---------------------------+---------+--------------------+------------- 0 | 0x00007f2e41200000 | Yes | 0x113aa0 | Yes 1 | 0x00007f2e41201000 | Yes | 0x113aa1 | Yes 2 | 0x00007f2e41202000 | Yes | 0x113aa2 | Yes 3 | 0x00007f2e41203000 | Yes | 0x113aa3 | Yes 4 | 0x00007f2e41204000 | Yes | 0x113aa4 | Yes 5 | 0x00007f2e41205000 | Yes | 0x113aa5 | Yes 6 | 0x00007f2e41206000 | Yes | 0x113aa6 | Yes 7 | 0x00007f2e41207000 | Yes | 0x113aa7 | Yes 8 | 0x00007f2e41208000 | Yes | 0x113aa8 | Yes 9 | 0x00007f2e41209000 | Yes | 0x113aa9 | Yes 10 | 0x00007f2e4120a000 | Yes | 0x113aaa | Yes 11 | 0x00007f2e4120b000 | Yes | 0x113aab | Yes 12 | 0x00007f2e4120c000 | Yes | 0x113aac | Yes 13 | 0x00007f2e4120d000 | Yes | 0x113aad | Yes 14 | 0x00007f2e4120e000 | Yes | 0x113aae | Yes 15 | 0x00007f2e4120f000 | Yes | 0x113aaf | Yes Verification PASSED: PFNs are contiguous for Source Address (pre-mremap). 5. Calculated target address to be 0x100001ff000 6. Successfully mremap'd 64 KB to 0x100001ff000. --- Verifying Contiguity for: Target Address (post-mremap) at 0x100001ff000 --- Page | Virtual Address | Present | PFN (Physical) | Contiguous? -----+---------------------------+---------+--------------------+------------- 0 | 0x00000100001ff000 | Yes | 0x113aa0 | Yes 1 | 0x0000010000200000 | Yes | 0x113aa1 | Yes 2 | 0x0000010000201000 | Yes | 0x113aa2 | Yes 3 | 0x0000010000202000 | Yes | 0x113aa3 | Yes 4 | 0x0000010000203000 | Yes | 0x113aa4 | Yes 5 | 0x0000010000204000 | Yes | 0x113aa5 | Yes 6 | 0x0000010000205000 | Yes | 0x113aa6 | Yes 7 | 0x0000010000206000 | Yes | 0x113aa7 | Yes 8 | 0x0000010000207000 | Yes | 0x113aa8 | Yes 9 | 0x0000010000208000 | Yes | 0x113aa9 | Yes 10 | 0x0000010000209000 | Yes | 0x113aaa | Yes 11 | 0x000001000020a000 | Yes | 0x113aab | Yes 12 | 0x000001000020b000 | Yes | 0x113aac | Yes 13 | 0x000001000020c000 | Yes | 0x113aad | Yes 14 | 0x000001000020d000 | Yes | 0x113aae | Yes 15 | 0x000001000020e000 | Yes | 0x113aaf | Yes Verification PASSED: PFNs are contiguous for Target Address (post-mremap). --- Final Conclusion --- ✅ SUCCESS: The folio's pages remained physically contiguous after remapping to a PMD-crossing virtual address. The reproducer successfully created the desired edge-case memory layout. ``` Thanks, Lance > >> >>> >>> What prevents folio_pte_batch() from reading outside the page table? >> >> Assuming such a scenario is possible, to prevent any chance of an >> out-of-bounds read, how about this change: >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index fb63d9256f09..9aeae811a38b 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1852,6 +1852,25 @@ static inline bool >> can_batch_unmap_folio_ptes(unsigned long addr, >>       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; >>       int max_nr = folio_nr_pages(folio); >>       pte_t pte = ptep_get(ptep); >> +    unsigned long end_addr; >> + >> +    /* >> +     * To batch unmap, the entire folio's PTEs must be contiguous >> +     * and mapped within the same PTE page table, which corresponds to >> +     * a single PMD entry. Before calling folio_pte_batch(), which does >> +     * not perform boundary checks itself, we must verify that the >> +     * address range covered by the folio does not cross a PMD boundary. >> +     */ >> +    end_addr = addr + (max_nr * PAGE_SIZE) - 1; >> + >> +    /* >> +     * A fast way to check for a PMD boundary cross is to align both >> +     * the start and end addresses to the PMD boundary and see if they >> +     * are different. If they are, the range spans across at least two >> +     * different PMD-managed regions. >> +     */ >> +    if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) >> +        return false; > > You should not be messing with max_nr = folio_nr_pages(folio) here at > all. folio_pte_batch() takes care of that. > > Also, way too many comments ;) > > You may only batch within a single VMA and within a single page table. > > So simply align the addr up to the next PMD, and make sure it does not > exceed the vma end. > > ALIGN and friends can help avoiding excessive comments. >