Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
	apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
	baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
	haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
	21cnbao@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
Date: Wed, 18 Dec 2024 14:54:15 +0530	[thread overview]
Message-ID: <ca198de7-1ebf-4749-855c-f8358046cd22@arm.com> (raw)
In-Reply-To: <d0e339a0-94c9-42e7-b62c-63fde6cffd71@arm.com>


On 17/12/24 11:45 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try to
>> jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
> Is there are reason you are not modifying MADV_COLLAPSE? It's really just a
> synchonous way to do what khugepaged does asynchonously (isn't it?), so it would
> behave the same way in an ideal world.

Correct, but I started running into return value problems for madvise(). For example,
the return value of hpage_collapse_scan_ptes() will be the return value of the last
mTHP scan. In this case, what do we want madvise() to return? If I collapse the range
to multiple 64K mTHPs, then I should still return failure, because otherwise the caller would
logically assume that MADV_COLLAPSE succeeded so will assume a PMD-hugepage mapped there. But
then the caller ended up collapsing memory...if you return success, then the khugepaged selftest
starts failing. Basically, this will be (kind of?) an ABI change and I really didn't want to sway
the discussion away from khugepaged, so I just kept it simple :)

>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/swapops.h>
>>   #include <linux/shmem_fs.h>
>>   #include <linux/ksm.h>
>> +#include <linux/count_zeros.h>
>>   
>>   #include <asm/tlb.h>
>>   #include <asm/pgalloc.h>
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>   }
>>   
>>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -			      int referenced, int unmapped,
>> +			      int referenced, int unmapped, int order,
>>   			      struct collapse_control *cc)
>>   {
>>   	LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   				   unsigned long address, bool *mmap_locked,
>>   				   struct collapse_control *cc)
>>   {
>> -	pmd_t *pmd;
>> -	pte_t *pte, *_pte;
>> -	int result = SCAN_FAIL, referenced = 0;
>> -	int none_or_zero = 0, shared = 0;
>> -	struct page *page = NULL;
>> +	unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> +	int referenced, shared, none_or_zero, unmapped;
>> +	unsigned long _address, org_address = address;
> nit: Perhaps it's clearer to keep the original address in address and use a
> variable, start, for the starting point of each scan?

Probably...will keep it in mind.

>
>>   	struct folio *folio = NULL;
>> -	unsigned long _address;
>> -	spinlock_t *ptl;
>> -	int node = NUMA_NO_NODE, unmapped = 0;
>> +	struct page *page = NULL;
>> +	int node = NUMA_NO_NODE;
>> +	int result = SCAN_FAIL;
>>   	bool writable = false;
>> +	unsigned long orders;
>> +	pte_t *pte, *_pte;
>> +	spinlock_t *ptl;
>> +	pmd_t *pmd;
>> +	int order;
>>   
>>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>   
>> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
> Perhaps THP_ORDERS_ALL instead of "BIT(PMD_ORDER + 1) - 1"?

Ah yes, THP_ORDERS_ALL_ANON.

>
>> +	orders = thp_vma_suitable_orders(vma, address, orders);
>> +	order = highest_order(orders);
>> +
>> +	/* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> +	if (!cc->is_khugepaged)
>> +		order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> +	max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> +	referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
>> +	/* Check pmd after taking mmap lock */
>>   	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>   	if (result != SCAN_SUCCEED)
>>   		goto out;
>>   
>>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>>   	nodes_clear(cc->alloc_nmask);
>> +
>>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>   	if (!pte) {
>>   		result = SCAN_PMD_NULL;
>>   		goto out;
>>   	}
>>   
>> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +	for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>>   	     _pte++, _address += PAGE_SIZE) {
>>   		pte_t pteval = ptep_get(_pte);
>>   		if (is_swap_pte(pteval)) {
>>   			++unmapped;
>>   			if (!cc->is_khugepaged ||
>> -			    unmapped <= khugepaged_max_ptes_swap) {
>> +			    unmapped <= max_ptes_swap) {
>>   				/*
>>   				 * Always be strict with uffd-wp
>>   				 * enabled swap entries.  Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   			++none_or_zero;
>>   			if (!userfaultfd_armed(vma) &&
>>   			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +			     none_or_zero <= max_ptes_none)) {
>>   				continue;
>>   			} else {
>>   				result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   		if (folio_likely_mapped_shared(folio)) {
>>   			++shared;
>>   			if (cc->is_khugepaged &&
>> -			    shared > khugepaged_max_ptes_shared) {
>> +			    shared > max_ptes_shared) {
>>   				result = SCAN_EXCEED_SHARED_PTE;
>>   				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>   				goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   		result = SCAN_PAGE_RO;
>>   	} else if (cc->is_khugepaged &&
>>   		   (!referenced ||
>> -		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> +		    (unmapped && referenced < (1UL << order) / 2))) {
>>   		result = SCAN_LACK_REFERENCED_PAGE;
>>   	} else {
>>   		result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>   	pte_unmap_unlock(pte, ptl);
>>   	if (result == SCAN_SUCCEED) {
>>   		result = collapse_huge_page(mm, address, referenced,
>> -					    unmapped, cc);
>> +					    unmapped, order, cc);
>>   		/* collapse_huge_page will return with the mmap_lock released */
>>   		*mmap_locked = false;
>> +
>> +		/* Immediately exit on exhaustion of range */
>> +		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> +			goto out;
> Looks like this assumes this function is always asked to scan a full PTE table?
> Does that mean that you can't handle collapse for VMAs that don't span a whole
> PMD entry? I think we will want to support that.

Correct. Yes, we should support that, otherwise khugepaged will scan only large VMAs.
Will have to make a change in khugepaged_scan_mm_slot() for anon case.

>
>> +	}
>> +	if (result != SCAN_SUCCEED) {
>> +
>> +		/* Go to the next order. */
>> +		order = next_order(&orders, order);
>> +		if (order < 2)
> This should be:
> 		if (!orders)
>
> I think the return order is undefined when order is the last order in orders.

The return order is -1, from what I could gather from reading the code.

>
>> +			goto out;
>> +		goto maybe_mmap_lock;
>> +	} else {
>> +		address = _address;
>> +		pte = _pte;
>> +
>> +
>> +		/* Get highest order possible starting from address */
>> +		order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> +		/* This needs to be present in the mask too */
>> +		if (!(orders & (1UL << order)))
>> +			order = next_order(&orders, order);
> Not quite; if the exact order isn't in the bitmap, this will pick out the
> highest order in the bitmap, which may be higher than count_trailing_zeros()
> returned.

Oh okay, nice catch.

>   You could do:
>
> 		order = count_trailing_zeros(address >> PAGE_SHIFT);
> 		orders &= (1UL << order + 1) - 1;
> 		order = next_order(&orders, order);
> 		if (!orders)
> 			goto out;
>
> That will mask out any orders that are bigger than the one returned by
> count_trailing_zeros() then next_order() will return the highest order in the
> remaining set.
>
> But even that doesn't quite work because next_order() is destructive. Once you
> arrive on a higher order address boundary, you want to be able to select a
> higher order from the original orders bitmap. But you have lost them on a
> previous go around the loop.
>
> Perhaps stash orig_orders at the top of the function when you first calculate
> it. Then I think this works (totally untested):
>
> 		order = count_trailing_zeros(address >> PAGE_SHIFT);
> 		orders = orig_orders & (1UL << order + 1) - 1;
> 		order = next_order(&orders, order);
> 		if (!orders)
> 			goto out;
>
> You might want to do something like this for the first go around the loop, but I
> think address is currently always at the start of the PMD on entry, so not
> needed until that restriction is removed.

Will take a look, we just need order <= order derived from trailing zeroes,
and then we need the first enabled order below this in the bitmask, shouldn't
be too complicated.

>
>> +		if (order < 2)
>> +			goto out;
>> +
>> +maybe_mmap_lock:
>> +		if (!(*mmap_locked)) {
>> +			mmap_read_lock(mm);
> Given the lock was already held in read mode on entering this function, then
> released by collapse_huge_page(), is it definitely safe to retake this lock and
> rerun this function? Is it possible that state that was checked before entering
> this function has changed since the lock was released that would now need
> re-checking?

Thanks, what I am missing is a hugepage_vma_revalidate().

>> +			*mmap_locked = true;
>> +		}
>> +		goto scan_pte_range;
>>   	}
>>   out:
>>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,

next prev parent reply	other threads:[~2024-12-18  9:24 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
2024-12-17  4:18   ` Matthew Wilcox
2024-12-17  5:52     ` Dev Jain
2024-12-17  6:43     ` Ryan Roberts
2024-12-17 18:11       ` Zi Yan
2024-12-17 19:12         ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17  2:51   ` Baolin Wang
2024-12-17  6:08     ` Dev Jain
2024-12-17  4:17   ` Matthew Wilcox
2024-12-17  7:09     ` Ryan Roberts
2024-12-17 13:00       ` Zi Yan
2024-12-20 17:41       ` Christoph Lameter (Ampere)
2024-12-20 17:45         ` Ryan Roberts
2024-12-20 18:47           ` Christoph Lameter (Ampere)
2025-01-02 11:21             ` Ryan Roberts
2024-12-17  6:53   ` Ryan Roberts
2024-12-17  9:06     ` Dev Jain
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
2024-12-17  4:21   ` Matthew Wilcox
2024-12-17 16:58   ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
2024-12-17  4:24   ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
2024-12-17  4:32   ` Matthew Wilcox
2024-12-17  6:41     ` Dev Jain
2024-12-17 17:14       ` Ryan Roberts
2024-12-17 17:09   ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
2024-12-17 17:22   ` Ryan Roberts
2024-12-18  8:49     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
2024-12-17 18:15   ` Ryan Roberts
2024-12-18  9:24     ` Dev Jain [this message]
2025-01-06 10:04   ` Usama Arif
2025-01-07  7:17     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
2024-12-17 19:24   ` Ryan Roberts
2024-12-18  9:26     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
2024-12-16 17:06   ` David Hildenbrand
2024-12-16 19:08     ` Yang Shi
2024-12-17 10:07     ` Dev Jain
2024-12-17 10:32       ` David Hildenbrand
2024-12-18  8:35         ` Dev Jain
2025-01-02 10:08           ` Dev Jain
2025-01-02 11:33             ` David Hildenbrand
2025-01-03  8:17               ` Dev Jain
2025-01-02 11:22           ` David Hildenbrand
2024-12-18 15:59     ` Dev Jain
2025-01-06 10:17   ` Usama Arif
2025-01-07  8:12     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
2024-12-18  7:36   ` Ryan Roberts
2024-12-18  9:34     ` Dev Jain
2024-12-19  3:40       ` John Hubbard
2024-12-19  3:51         ` Zi Yan
2024-12-19  7:59         ` Dev Jain
2024-12-19  8:07           ` Dev Jain
2024-12-20 11:57             ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
2024-12-18  9:03   ` Ryan Roberts
2024-12-18  9:50     ` Dev Jain
2024-12-20 11:05       ` Ryan Roberts
2024-12-30  7:09         ` Dev Jain
2024-12-30 16:36           ` Zi Yan
2025-01-02 11:43             ` Ryan Roberts
2025-01-03 10:10               ` Dev Jain
2025-01-03 10:11             ` Dev Jain
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
2025-01-02 21:58   ` Nico Pache
2025-01-03  7:04     ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ca198de7-1ebf-4749-855c-f8358046cd22@arm.com \
    --to=dev.jain@arm.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@kernel.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=haowenchao22@gmail.com \
    --cc=hughd@google.com \
    --cc=ioworker0@gmail.com \
    --cc=jack@suse.cz \
    --cc=jglisse@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=peterx@redhat.com \
    --cc=ryan.roberts@arm.com \
    --cc=srivatsa@csail.mit.edu \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox