From: Xu Yu <xuyu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: david@redhat.com
Subject: Re: [PATCH v2 1/2] mm/khugepaged: attempt to map anonymous pte-mapped THPs by pmds
Date: Thu, 7 Dec 2023 15:47:47 +0800
Message-ID: <17f31a27-dcb7-4907-bfa0-6282d202b7ac@linux.alibaba.com>
In-Reply-To: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>
On 12/7/23 11:09 AM, Xu Yu wrote:
> In the anonymous collapse path, khugepaged always collapses a
> pte-mapped hugepage by allocating and copying into a new hugepage.
>
> In some scenarios, we can collapse an anonymous pte-mapped THP by
> merely updating its mapping page tables, in the same way as for
> file/shmem-backed pte-mapped THPs; see commit 58ac9a8993a1
> ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs
> by pmds").
>
> The simplest scenario that satisfies the conditions is, as David
> points out, when no subpage is PageAnonExclusive (so all PTEs must be
> R/O): we can then collapse into a R/O PMD without further action.
>
> Let's start with this simplest scenario.
>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
> mm/khugepaged.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 214 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 88433cc25d8a..85c7a2ab44ce 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1237,6 +1237,197 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> return result;
> }
>
> +static struct page *find_lock_pte_mapped_page(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t *pmd)
> +{
> + pte_t *pte, pteval;
> + struct page *page = NULL;
> +
> + pte = pte_offset_map(pmd, addr);
> + if (!pte)
> + return NULL;
> +
> + pteval = ptep_get_lockless(pte);
> + if (pte_none(pteval) || !pte_present(pteval))
> + goto out;
> +
> + page = vm_normal_page(vma, addr, pteval);
> + if (unlikely(!page) || unlikely(is_zone_device_page(page)))
> + goto out;
> +
> + page = compound_head(page);
> +
> + if (!trylock_page(page)) {
> + page = NULL;
> + goto out;
> + }
> +
> + if (!get_page_unless_zero(page)) {
> + unlock_page(page);
> + page = NULL;
> + goto out;
> + }
> +
> +out:
> + pte_unmap(pte);
> + return page;
> +}
> +
> +static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long haddr, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + struct mmu_notifier_range range;
> + struct page *hpage;
> + pte_t *start_pte, *pte;
> + pmd_t *pmd, pmdval;
> + spinlock_t *pml, *ptl;
> + pgtable_t pgtable;
> + unsigned long addr;
> + int exclusive = 0;
> + bool writable = false;
> + int result, i;
> +
> + /* Fast check before locking page if already PMD-mapped */
> + result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> + if (result == SCAN_PMD_MAPPED)
> + return result;
> +
> + hpage = find_lock_pte_mapped_page(vma, haddr, pmd);
> + if (!hpage)
> + return SCAN_PAGE_NULL;
> + if (!PageHead(hpage)) {
> + result = SCAN_FAIL;
> + goto drop_hpage;
> + }
> + if (compound_order(hpage) != HPAGE_PMD_ORDER) {
> + result = SCAN_PAGE_COMPOUND;
> + goto drop_hpage;
> + }
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> +
> + /* Prevent all access to pagetables */
> + mmap_write_lock(mm);
> +
> + result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
> + if (result != SCAN_SUCCEED)
> + goto up_write;
> +
> + result = check_pmd_still_valid(mm, haddr, pmd);
> + if (result != SCAN_SUCCEED)
> + goto up_write;
> +
> + /* Recheck with mmap write lock */
> + result = SCAN_SUCCEED;
> + start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> + if (!start_pte)
> + goto drop_hpage;
^^^^^^^^^^ should be `goto up_write;`, otherwise the mmap write lock
taken above is leaked on this error path.
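Maybe something like this (untested sketch; SCAN_FAIL here is a
placeholder, a more specific SCAN_* value may fit better). Note that
result is still SCAN_SUCCEED at this point, so the failure path also
wants an explicit error code:

	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	if (!start_pte) {
		result = SCAN_FAIL;	/* placeholder error code */
		goto up_write;		/* mmap write lock is still held */
	}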
> + for (i = 0, addr = haddr, pte = start_pte;
> + i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + struct page *page;
> + pte_t pteval = ptep_get(pte);
> +
> + if (pte_none(pteval) || !pte_present(pteval)) {
> + result = SCAN_PTE_NON_PRESENT;
> + break;
> + }
> +
> + if (pte_uffd_wp(pteval)) {
> + result = SCAN_PTE_UFFD_WP;
> + break;
> + }
> +
> + if (pte_write(pteval))
> + writable = true;
> +
> + page = vm_normal_page(vma, addr, pteval);
> +
> + if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> + result = SCAN_PAGE_NULL;
> + break;
> + }
> +
> + if (hpage + i != page) {
> + result = SCAN_FAIL;
> + break;
> + }
> +
> + if (PageAnonExclusive(page))
> + exclusive++;
> + }
> + pte_unmap_unlock(start_pte, ptl);
> + if (result != SCAN_SUCCEED)
> + goto drop_hpage;
^^^^^^^^^^ same here, should be `goto up_write;`; the mmap write lock
is still held at this point.
> +
> + /*
> + * Case 1:
> + * No subpages are PageAnonExclusive (PTEs must be R/O), we can
> + * collapse into a R/O PMD without further action.
> + */
> + if (!(exclusive == 0 && !writable))
> + goto drop_hpage;
^^^^^^^^^^ and here as well, should be `goto up_write;`.
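While at it, the negated compound condition is hard to parse. By De
Morgan this form is equivalent (behavior unchanged) and reads more
directly:

	/* Case 1 requires no exclusive subpages and no writable PTEs. */
	if (exclusive || writable)
		goto up_write;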
> +
> + /* Collapse pmd entry */
> + vma_start_write(vma);
> + anon_vma_lock_write(vma->anon_vma);
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> + haddr, haddr + HPAGE_PMD_SIZE);
> + mmu_notifier_invalidate_range_start(&range);
> +
> + pml = pmd_lock(mm, pmd); /* probably unnecessary */
> + pmdval = pmdp_collapse_flush(vma, haddr, pmd);
> + spin_unlock(pml);
> + mmu_notifier_invalidate_range_end(&range);
> + tlb_remove_table_sync_one();
> +
> + anon_vma_unlock_write(vma->anon_vma);
> +
> + start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
> + if (!start_pte)
> + goto rollback;
> + for (i = 0, addr = haddr, pte = start_pte;
> + i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + struct page *page;
> + pte_t pteval = ptep_get(pte);
> +
> + page = vm_normal_page(vma, addr, pteval);
> + page_remove_rmap(page, vma, false);
> + }
> + pte_unmap_unlock(start_pte, ptl);
> +
> + /* Install pmd entry */
> + pgtable = pmd_pgtable(pmdval);
> + pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);
> + spin_lock(pml);
> + page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + set_pmd_at(mm, haddr, pmd, pmdval);
> + update_mmu_cache_pmd(vma, haddr, pmd);
> + spin_unlock(pml);
> +
> + result = SCAN_SUCCEED;
> + goto up_write;
> +
> +rollback:
> + spin_lock(pml);
> + pmd_populate(mm, pmd, pmd_pgtable(pmdval));
> + spin_unlock(pml);
> +
> +up_write:
> + mmap_write_unlock(mm);
> +
> +drop_hpage:
> + unlock_page(hpage);
> + put_page(hpage);
> +
> + /* TODO: tracepoints */
> + return result;
> +}
> +
> static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long address, bool *mmap_locked,
> @@ -1251,6 +1442,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> bool writable = false;
> + int exclusive = 0;
> + bool is_hpage = false;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1333,8 +1526,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
> + if (PageAnonExclusive(page))
> + exclusive++;
> +
> page = compound_head(page);
>
> + if (compound_order(page) == HPAGE_PMD_ORDER)
> + is_hpage = true;
> +
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
> @@ -1396,7 +1595,22 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> +
> + if (is_hpage && (exclusive == 0 && !writable)) {
> + int res;
> +
> + res = collapse_pte_mapped_anon_thp(mm, vma, address,
> + mmap_locked, cc);
> + if (res == SCAN_PMD_MAPPED || res == SCAN_SUCCEED) {
> + result = res;
> + goto out;
> + }
> +
> + }
> +
> if (result == SCAN_SUCCEED) {
> + if (!*mmap_locked)
> + mmap_read_lock(mm);
> result = collapse_huge_page(mm, address, referenced,
> unmapped, cc);
> /* collapse_huge_page will return with the mmap_lock released */
--
Thanks,
Yu