From: Xu Yu <xuyu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: david@redhat.com
Subject: Re: [PATCH v2 1/2] mm/khugepaged: attempt to map anonymous pte-mapped THPs by pmds
Date: Thu, 7 Dec 2023 15:47:47 +0800
Message-ID: <17f31a27-dcb7-4907-bfa0-6282d202b7ac@linux.alibaba.com>
In-Reply-To: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>
On 12/7/23 11:09 AM, Xu Yu wrote:
> In the anonymous collapse path, khugepaged always collapses a
> pte-mapped hugepage by allocating and copying into a new hugepage.
>
> In some scenarios, we can collapse an anonymous pte-mapped THP by
> merely updating its mapping page tables, in the same way as for
> file/shmem-backed pte-mapped THPs; see commit 58ac9a8993a1
> ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs
> by pmds").
>
> The simplest scenario that satisfies the conditions is, as David
> points out, when no subpage is PageAnonExclusive (so all PTEs must be
> R/O): we can then collapse into a R/O PMD without further action.
>
> Let's start with this simplest scenario.
>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
> mm/khugepaged.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 214 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 88433cc25d8a..85c7a2ab44ce 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1237,6 +1237,197 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> return result;
> }
>
> +static struct page *find_lock_pte_mapped_page(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t *pmd)
> +{
> + pte_t *pte, pteval;
> + struct page *page = NULL;
> +
> + pte = pte_offset_map(pmd, addr);
> + if (!pte)
> + return NULL;
> +
> + pteval = ptep_get_lockless(pte);
> + if (pte_none(pteval) || !pte_present(pteval))
> + goto out;
> +
> + page = vm_normal_page(vma, addr, pteval);
> + if (unlikely(!page) || unlikely(is_zone_device_page(page)))
> + goto out;
> +
> + page = compound_head(page);
> +
> + if (!trylock_page(page)) {
> + page = NULL;
> + goto out;
> + }
> +
> + if (!get_page_unless_zero(page)) {
> + unlock_page(page);
> + page = NULL;
> + goto out;
> + }
> +
> +out:
> + pte_unmap(pte);
> + return page;
> +}
> +
> +static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long haddr, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + struct mmu_notifier_range range;
> + struct page *hpage;
> + pte_t *start_pte, *pte;
> + pmd_t *pmd, pmdval;
> + spinlock_t *pml, *ptl;
> + pgtable_t pgtable;
> + unsigned long addr;
> + int exclusive = 0;
> + bool writable = false;
> + int result, i;
> +
> + /* Fast check before locking page if already PMD-mapped */
> + result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> + if (result == SCAN_PMD_MAPPED)
> + return result;
> +
> + hpage = find_lock_pte_mapped_page(vma, haddr, pmd);
> + if (!hpage)
> + return SCAN_PAGE_NULL;
> + if (!PageHead(hpage)) {
> + result = SCAN_FAIL;
> + goto drop_hpage;
> + }
> + if (compound_order(hpage) != HPAGE_PMD_ORDER) {
> + result = SCAN_PAGE_COMPOUND;
> + goto drop_hpage;
> + }
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> +
> + /* Prevent all access to pagetables */
> + mmap_write_lock(mm);
> +
> + result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
> + if (result != SCAN_SUCCEED)
> + goto up_write;
> +
> + result = check_pmd_still_valid(mm, haddr, pmd);
> + if (result != SCAN_SUCCEED)
> + goto up_write;
> +
> + /* Recheck with mmap write lock */
> + result = SCAN_SUCCEED;
> + start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> + if (!start_pte)
> + goto drop_hpage;
^^^^^^^^^^ should be `goto up_write;`, otherwise the mmap write lock
taken above is leaked on this error path.
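Maybe something like this (untested sketch; SCAN_FAIL here is a
placeholder, a more specific SCAN_* value may fit better). Note that
result is still SCAN_SUCCEED at this point, so the failure path also
wants an explicit error code:

	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	if (!start_pte) {
		result = SCAN_FAIL;	/* placeholder error code */
		goto up_write;		/* mmap write lock is still held */
	}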
> + for (i = 0, addr = haddr, pte = start_pte;
> + i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + struct page *page;
> + pte_t pteval = ptep_get(pte);
> +
> + if (pte_none(pteval) || !pte_present(pteval)) {
> + result = SCAN_PTE_NON_PRESENT;
> + break;
> + }
> +
> + if (pte_uffd_wp(pteval)) {
> + result = SCAN_PTE_UFFD_WP;
> + break;
> + }
> +
> + if (pte_write(pteval))
> + writable = true;
> +
> + page = vm_normal_page(vma, addr, pteval);
> +
> + if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> + result = SCAN_PAGE_NULL;
> + break;
> + }
> +
> + if (hpage + i != page) {
> + result = SCAN_FAIL;
> + break;
> + }
> +
> + if (PageAnonExclusive(page))
> + exclusive++;
> + }
> + pte_unmap_unlock(start_pte, ptl);
> + if (result != SCAN_SUCCEED)
> + goto drop_hpage;
^^^^^^^^^^ same here, should be `goto up_write;`; the mmap write lock
is still held at this point.
> +
> + /*
> + * Case 1:
> + * No subpages are PageAnonExclusive (PTEs must be R/O), we can
> + * collapse into a R/O PMD without further action.
> + */
> + if (!(exclusive == 0 && !writable))
> + goto drop_hpage;
^^^^^^^^^^ and here as well, should be `goto up_write;`.
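While at it, the negated compound condition is hard to parse. By De
Morgan this form is equivalent (behavior unchanged) and reads more
directly:

	/* Case 1 requires no exclusive subpages and no writable PTEs. */
	if (exclusive || writable)
		goto up_write;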
> +
> + /* Collapse pmd entry */
> + vma_start_write(vma);
> + anon_vma_lock_write(vma->anon_vma);
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> + haddr, haddr + HPAGE_PMD_SIZE);
> + mmu_notifier_invalidate_range_start(&range);
> +
> + pml = pmd_lock(mm, pmd); /* probably unnecessary */
> + pmdval = pmdp_collapse_flush(vma, haddr, pmd);
> + spin_unlock(pml);
> + mmu_notifier_invalidate_range_end(&range);
> + tlb_remove_table_sync_one();
> +
> + anon_vma_unlock_write(vma->anon_vma);
> +
> + start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
> + if (!start_pte)
> + goto rollback;
> + for (i = 0, addr = haddr, pte = start_pte;
> + i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + struct page *page;
> + pte_t pteval = ptep_get(pte);
> +
> + page = vm_normal_page(vma, addr, pteval);
> + page_remove_rmap(page, vma, false);
> + }
> + pte_unmap_unlock(start_pte, ptl);
> +
> + /* Install pmd entry */
> + pgtable = pmd_pgtable(pmdval);
> + pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);
> + spin_lock(pml);
> + page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + set_pmd_at(mm, haddr, pmd, pmdval);
> + update_mmu_cache_pmd(vma, haddr, pmd);
> + spin_unlock(pml);
> +
> + result = SCAN_SUCCEED;
> + goto up_write;
> +
> +rollback:
> + spin_lock(pml);
> + pmd_populate(mm, pmd, pmd_pgtable(pmdval));
> + spin_unlock(pml);
> +
> +up_write:
> + mmap_write_unlock(mm);
> +
> +drop_hpage:
> + unlock_page(hpage);
> + put_page(hpage);
> +
> + /* TODO: tracepoints */
> + return result;
> +}
> +
> static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long address, bool *mmap_locked,
> @@ -1251,6 +1442,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> bool writable = false;
> + int exclusive = 0;
> + bool is_hpage = false;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1333,8 +1526,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
> + if (PageAnonExclusive(page))
> + exclusive++;
> +
> page = compound_head(page);
>
> + if (compound_order(page) == HPAGE_PMD_ORDER)
> + is_hpage = true;
> +
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
> @@ -1396,7 +1595,22 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> +
> + if (is_hpage && (exclusive == 0 && !writable)) {
> + int res;
> +
> + res = collapse_pte_mapped_anon_thp(mm, vma, address,
> + mmap_locked, cc);
> + if (res == SCAN_PMD_MAPPED || res == SCAN_SUCCEED) {
> + result = res;
> + goto out;
> + }
> +
> + }
> +
> if (result == SCAN_SUCCEED) {
> + if (!*mmap_locked)
> + mmap_read_lock(mm);
> result = collapse_huge_page(mm, address, referenced,
> unmapped, cc);
> /* collapse_huge_page will return with the mmap_lock released */
--
Thanks,
Yu