Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Ryan Roberts <ryan.roberts@arm.com>
To: Dev Jain <dev.jain@arm.com>, John Hubbard <jhubbard@nvidia.com>,
	akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
	apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
	baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
	haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, 21cnbao@gmail.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
Date: Fri, 20 Dec 2024 11:57:20 +0000	[thread overview]
Message-ID: <aa647830-cf55-48f0-98c2-8230796e35b3@arm.com> (raw)
In-Reply-To: <33488d3c-a176-4779-a9c6-e9cbdd1afc62@arm.com>

On 19/12/2024 08:07, Dev Jain wrote:
> 
> On 19/12/24 1:29 pm, Dev Jain wrote:
>>
>> On 19/12/24 9:10 am, John Hubbard wrote:
>>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>> ...
>>>>> It would be good if you can spell out the desired policy when khugepaged hits
>>>>> partially unmapped large folios and unaligned large folios. I think the simple
>>>>> approach is to always collapse them to fully mapped, aligned folios even if
>>>>> the
>>>>> resulting order is smaller than the original. But I'm not sure that's
>>>>> definitely
>>>>> going to always be the best thing.
>>>>>
>>>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>>>> order of a folio based on having hit one of it's pages says anything about
>>>>> whether the whole of that folio is mapped or not or it's alignment. And
>>>>> it's not
>>>>> clear to me how we would get to a situation where we are scanning for a lower
>>>>> order and find a (fully mapped, aligned) folio of higher order in the first
>>>>> place.
>>>>>
>>>>> Let's assume the desired policy is that khugepaged should always collapse to
>>>>> naturally aligned large folios. If there happens to be an existing aligned
>>>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>>>> of the scan for order-4. At that point, we should just notice that it is
>>>>> already
>>>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>>>> chosen to collapse it into a higher order, but we should definitely not get
>>>>> to a
>>>>> lower order before we notice it.
>>>>>
>>>>> Hmm... I guess if the sysfs thp settings have been changed then things
>>>>> could get
>>>>> spicy... if order-8 was previously enabled and we have an order-8 folio,
>>>>> then it
>>>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>>>> then hits the order-8; what's the expected policy? rework into 2 order-4
>>>>> folios
>>>>> or leave it as as single order-8?
>>>>
>>>> Exactly, sorry, I should have made it clear in the patch description that I am
>>>> handling the following scenario: there is a long running system on which we are
>>>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>>>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This
>>>> should
>>>> be a rare situation in the first place, so I have currently decided to
>>>> ignore the
>>>> folios set up by the previous sysfs setting and only focus on collapsing
>>>> fresh memory.
>>>>
>>>> Thinking again, a sys-admin deciding to downgrade order of folios, should do
>>>> that in
>>>> the hopes of reducing internal fragmentation or increasing swap speed etc,
>>>> so it makes
>>>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>>>
>>> Maybe we should not support it (at runtime) at all. We are trying to build
>>> systems that don't require incredibly detailed sysadmin involvement, and
>>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>>> sysadmin micromanagement", imho.

Agreed that we definitely don't want any new controls for this.

>>
>> Ryan pointed out one thing: what about unaligned, or partially mapped large
>> folios? For the previous sysfs settings, it may happen that we have an unaligned
>> order-8 folio, let us say it got unaligned due to mremap(). Then it is a good
>> idea to start from the order-4 aligned page and start collapsing memory so
>> that we can take advantage of the contig bit. Otherwise if it is a fully-mapped
>> aligned order-8 folio, then we anyways are abusing the contig bit advantage
>> so collapsing is pointless.
> 
> 
> In fact, in the current code, we are collapsing an unaligned PMD-size folio to an
> aligned PMD-mapped folio; we will not see a block mapping in the PMD, and go
> ahead with the scan...so the logic should be, skip the scan if the VAs and PAs are
> aligned.

Here is my stab at what the policy should be. It sounds quite complex, but
actually I think the implementation is quite simple.

memory may initially be mapped partially or fully from large or small folios.
And the mapping may be naturally aligned for the folio order or not.

khugepaged's goal should always to end up with the largest, aligned orders that
are possible. It will only collapse to an enabled order, but it will never
collapse to a smaller or equal order than the source memory's contiguous mapping
_and_ aligment.

So if for example, the first (or last) 64K of a 128K folio is mapped on a 64K VA
boundary, if scanning for order-4 (64K) we would leave that partial mapping
alone. But if either less than 64K was mapped from the 128K folio, or it was
mapped in a way that it was not 64K aligned in VA, then it would be a candidate
for collapse to a new 64K folio.

I think this can all be implemented by remembering if all of the PFNs for the
VA-aligned order that is being scanned are present and congituous and if they
all belong to the same folio. If that is true, and if the first PFN is aligned
on a PA boundary for that order, then skip the collapse and move to the next
block (addr + (PAGE_SIZE << order)).

I think this will give us the properties we want; if we encounter a larger folio
that is fully mapped and aligned, we will leave it alone. If we encounter a mix
of small folios and partially mapped large folios we will collapse. If we
encounter a partial mapping of a larger folio but that partial mapping fills the
block we are scanning in an appropriately aligned manner, we will leave it alone.

Thanks,
Ryan

> 
>>>
>>> Apologies for not having gone through the series in detail yet, but this
>>> point jumped out at me.
>>>
>>> thanks,
>>

next prev parent reply	other threads:[~2024-12-20 11:57 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
2024-12-17  4:18   ` Matthew Wilcox
2024-12-17  5:52     ` Dev Jain
2024-12-17  6:43     ` Ryan Roberts
2024-12-17 18:11       ` Zi Yan
2024-12-17 19:12         ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17  2:51   ` Baolin Wang
2024-12-17  6:08     ` Dev Jain
2024-12-17  4:17   ` Matthew Wilcox
2024-12-17  7:09     ` Ryan Roberts
2024-12-17 13:00       ` Zi Yan
2024-12-20 17:41       ` Christoph Lameter (Ampere)
2024-12-20 17:45         ` Ryan Roberts
2024-12-20 18:47           ` Christoph Lameter (Ampere)
2025-01-02 11:21             ` Ryan Roberts
2024-12-17  6:53   ` Ryan Roberts
2024-12-17  9:06     ` Dev Jain
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
2024-12-17  4:21   ` Matthew Wilcox
2024-12-17 16:58   ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
2024-12-17  4:24   ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
2024-12-17  4:32   ` Matthew Wilcox
2024-12-17  6:41     ` Dev Jain
2024-12-17 17:14       ` Ryan Roberts
2024-12-17 17:09   ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
2024-12-17 17:22   ` Ryan Roberts
2024-12-18  8:49     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
2024-12-17 18:15   ` Ryan Roberts
2024-12-18  9:24     ` Dev Jain
2025-01-06 10:04   ` Usama Arif
2025-01-07  7:17     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
2024-12-17 19:24   ` Ryan Roberts
2024-12-18  9:26     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
2024-12-16 17:06   ` David Hildenbrand
2024-12-16 19:08     ` Yang Shi
2024-12-17 10:07     ` Dev Jain
2024-12-17 10:32       ` David Hildenbrand
2024-12-18  8:35         ` Dev Jain
2025-01-02 10:08           ` Dev Jain
2025-01-02 11:33             ` David Hildenbrand
2025-01-03  8:17               ` Dev Jain
2025-01-02 11:22           ` David Hildenbrand
2024-12-18 15:59     ` Dev Jain
2025-01-06 10:17   ` Usama Arif
2025-01-07  8:12     ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
2024-12-18  7:36   ` Ryan Roberts
2024-12-18  9:34     ` Dev Jain
2024-12-19  3:40       ` John Hubbard
2024-12-19  3:51         ` Zi Yan
2024-12-19  7:59         ` Dev Jain
2024-12-19  8:07           ` Dev Jain
2024-12-20 11:57             ` Ryan Roberts [this message]
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
2024-12-18  9:03   ` Ryan Roberts
2024-12-18  9:50     ` Dev Jain
2024-12-20 11:05       ` Ryan Roberts
2024-12-30  7:09         ` Dev Jain
2024-12-30 16:36           ` Zi Yan
2025-01-02 11:43             ` Ryan Roberts
2025-01-03 10:10               ` Dev Jain
2025-01-03 10:11             ` Dev Jain
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
2025-01-02 21:58   ` Nico Pache
2025-01-03  7:04     ` Dev Jain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aa647830-cf55-48f0-98c2-8230796e35b3@arm.com \
    --to=ryan.roberts@arm.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@kernel.org \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=haowenchao22@gmail.com \
    --cc=hughd@google.com \
    --cc=ioworker0@gmail.com \
    --cc=jack@suse.cz \
    --cc=jglisse@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=peterx@redhat.com \
    --cc=srivatsa@csail.mit.edu \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox