From: David Hildenbrand <david@redhat.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Yin Fengwei <fengwei.yin@intel.com>, Yu Zhao <yuzhao@google.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Yang Shi <shy828301@gmail.com>,
"Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Itaru Kitayama <itaru.kitayama@gmail.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
John Hubbard <jhubbard@nvidia.com>,
David Rientjes <rientjes@google.com>,
Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
Barry Song <21cnbao@gmail.com>,
Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP
Date: Thu, 7 Dec 2023 12:08:08 +0100 [thread overview]
Message-ID: <ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com> (raw)
In-Reply-To: <f2896d7f-183b-48fb-a3aa-d21bf2257043@arm.com>
[...]
>>
>> Nit: the orders = ... order = ... looks like this might deserve a helper
>> function that makes this easier to read.
>
> To be honest, the existing function that I've modified is a bit of a mess.
It's all an ugly mess and I hate it.
It would be cleanest if we'd just have "thp_vma_configured_orders()"
that gives us all configured orders for the given VMA+flags combination.
No passing in of orders; the masking would instead be handled in the caller.
Then, we move that nasty "transhuge_vma_suitable" handling for !in_pf
out of there and handle that in the callers. The comment "Huge fault
does the check in fault handlers. And this check is not suitable for
huge PUD fault handlers." already makes me angry, what a mess.
Then, we'd have a thp_vma_fitting_orders() / thp_vma_is_fitting_order()
function that does the filtering only based on the given address + vma
size/alignment. That's roughly "thp_vma_suitable_orders()".
Finding a good name to combine both could be something like
"thp_vma_possible_orders()".
Would make more sense to me (but again, German guy, so it's probably all
wrong).
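Roughly, as an untested user-space sketch (only the helper names come from the proposal above; the toy struct, the sysfs bitmasks and the constants are simplified stand-ins, and the real "inherit"/global knobs are left out):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12UL
#define PMD_ORDER  9
#define VM_HUGEPAGE 0x1UL

/* Toy stand-in for the VMA fields these checks need. */
struct vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/* Toy sysfs state: bit N set => order N enabled. */
static unsigned long huge_anon_orders_always;
static unsigned long huge_anon_orders_madvise;

/* All orders enabled by configuration for this VMA, ignoring layout. */
static unsigned long thp_vma_configured_orders(const struct vma *vma)
{
	unsigned long mask = huge_anon_orders_always;

	if (vma->vm_flags & VM_HUGEPAGE)
		mask |= huge_anon_orders_madvise;
	return mask;
}

/* Does a naturally aligned order-'order' block around addr fit the VMA? */
static bool thp_vma_is_fitting_order(const struct vma *vma,
				     unsigned long addr, int order)
{
	unsigned long size = 1UL << (PAGE_SHIFT + order);
	unsigned long start = addr & ~(size - 1);

	return start >= vma->vm_start && start + size <= vma->vm_end;
}

static unsigned long thp_vma_fitting_orders(const struct vma *vma,
					    unsigned long addr,
					    unsigned long orders)
{
	unsigned long fitting = 0;
	int order;

	for (order = 1; order <= PMD_ORDER; order++)
		if ((orders & (1UL << order)) &&
		    thp_vma_is_fitting_order(vma, addr, order))
			fitting |= 1UL << order;
	return fitting;
}

/* Combination of both filters. */
static unsigned long thp_vma_possible_orders(const struct vma *vma,
					     unsigned long addr)
{
	return thp_vma_fitting_orders(vma, addr,
				      thp_vma_configured_orders(vma));
}
```

The point being: each helper answers exactly one question, and the caller combines them instead of threading "orders" plus a pile of bools through one function.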
> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
> page fault, because the page fault handlers already do that check themselves. It
> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
> a strict superset of thp_vma_suitable_orders(). Then this can just call
> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
> handlers, so prefer if we leave that for a separate patch set.
>
>>
>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>> helper might be reasonable where that is handled internally.
>
> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
> should be just as efficient? This would go away with the refactoring described above.
Right. Won't win in a beauty contest. Some simple helper might make this
much easier to digest.
>
>>
>> Comment: For order-0 we'll always perform a function call to both
>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>> fast and efficient check if any <PMD THP are even enabled in the system / for
>> this VMA, and in that case just fallback before doing more expensive checks.
>
> thp_vma_allowable_orders() is inline as you mentioned.
>
> I was deliberately trying to keep all the decision logic in one place
> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
> insist, how about this in the header:
>
> static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> 					unsigned long vm_flags, bool smaps,
> 					bool in_pf, bool enforce_sysfs,
> 					unsigned long orders)
> {
> 	/* Optimization to check if required orders are enabled early. */
> 	if (enforce_sysfs && vma_is_anonymous(vma)) {
> 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
> 		if (vm_flags & VM_HUGEPAGE)
> 			mask |= READ_ONCE(huge_anon_orders_madvise);
> 		if (hugepage_global_always() ||
> 		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> 			mask |= READ_ONCE(huge_anon_orders_inherit);
>
> 		orders &= mask;
> 		if (!orders)
> 			return 0;
>
> 		enforce_sysfs = false;
> 	}
>
> 	return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
> 					  enforce_sysfs, orders);
> }
>
> Then the above check can be removed from __thp_vma_allowable_orders() - it will
> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>
Better. I still kind-of hate having to pass in orders here. Such masking
is better done in the caller (see above how it might be done when moving
the transhuge_vma_suitable() check out).
>
>>
>>> +
>>> +	if (!orders)
>>> +		goto fallback;
>>> +
>>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>> +	if (!pte)
>>> +		return ERR_PTR(-EAGAIN);
>>> +
>>> +	order = first_order(orders);
>>> +	while (orders) {
>>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +		vmf->pte = pte + pte_index(addr);
>>> +		if (pte_range_none(vmf->pte, 1 << order))
>>> +			break;
>>
>> Comment: Likely it would make sense to scan only once and determine the "largest
>> none range" around that address, having the largest suitable order in mind.
>
> Yes, that's how I used to do it, but Yu Zhao requested simplifying to this,
> IIRC. Perhaps this is an optimization opportunity for later?
Yes, definitely.
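Just to sketch it (untested user-space model; pte_is_none[] stands in for checking the real PTEs, and returning -1 for an already-used faulting PTE is only a convention of this toy):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the single-scan idea: pte_is_none[] marks which PTEs in the
 * PMD-sized region are none, 'index' is the faulting PTE. Grow a window
 * of known-none PTEs outward from the fault and return the largest
 * order (<= max_order) whose naturally aligned range around the fault
 * is fully contained in that window. Already-scanned PTEs inside the
 * window are never rescanned for lower orders.
 */
static int largest_none_order(const bool *pte_is_none, int index,
			      int max_order)
{
	int lo = index, hi = index;	/* inclusive known-none window */
	int order, i;

	if (!pte_is_none[index])
		return -1;

	for (order = max_order; order > 0; order--) {
		int start = index & ~((1 << order) - 1);
		int end = start + (1 << order) - 1;

		/* Extend the window; stop at the first used PTE. */
		for (i = lo - 1; i >= start && pte_is_none[i]; i--)
			lo = i;
		for (i = hi + 1; i <= end && pte_is_none[i]; i++)
			hi = i;

		if (lo <= start && hi >= end)
			return order;
	}
	return 0;	/* only order-0 (the faulting PTE) is usable */
}
```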
>
>>
>>> +		order = next_order(&orders, order);
>>> +	}
>>> +
>>> +	vmf->pte = NULL;
>>
>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>> variable will make this code look less magical. Unless I am missing something
>> important :)
>
> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
> approach that was suggested by Yu Zhao IIRC). But since I did some refactoring
> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
> will be much clearer just to use a local variable. Will fix.
>
>>
>>> +	pte_unmap(pte);
>>> +
>>> +	gfp = vma_thp_gfp_mask(vma);
>>> +
>>> +	while (orders) {
>>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +		if (folio) {
>>> +			clear_huge_page(&folio->page, addr, 1 << order);
>>> +			return folio;
>>> +		}
>>> +		order = next_order(&orders, order);
>>> +	}
>>> +
>>
>> Question: would it make sense to combine both loops? I suspect memory
>> allocations with pte_offset_map()/kmap are problematic.
>
> They are both operating on separate orders; next_order() is "consuming" an order
> by removing the current one from the orders bitfield and returning the next one.
>
> So the first loop starts at the highest order and keeps checking lower orders
> until one fully fits in the VMA. And the second loop starts at the first order
> that was found to fully fit and loops to lower orders until an allocation is
> successful.
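For reference, in user-space terms that consuming walk is essentially (a model assuming 64-bit unsigned long; the kernel helpers may be implemented differently):

```c
#include <assert.h>

/*
 * 'orders' is a bitfield with bit N set while order N is still a
 * candidate. first_order() picks the largest remaining order;
 * next_order() removes the current one from the set and returns the
 * next largest, or -1 once the set is exhausted.
 */
static int first_order(unsigned long orders)
{
	/* Undefined for orders == 0; callers check the set first. */
	return 63 - __builtin_clzl(orders);
}

static int next_order(unsigned long *orders, int order)
{
	*orders &= ~(1UL << order);
	return *orders ? first_order(*orders) : -1;
}
```

E.g. starting from orders with bits 9, 4 and 3 set, the walk visits 9, 4, 3 and leaves orders == 0, which is what terminates the while (orders) loops in the patch.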
Right, but you know from the first loop which order is applicable (and
will be fed to the second loop) and could just pte_unmap(pte) +
tryalloc. If that fails, remap and try with the next orders.
That would make the code certainly easier to understand. That "orders"
magic of constructing, filtering, walking is confusing :)
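Something along these lines (heavily mocked, untested; every helper below is a stand-in, it's only meant to show the shape of one combined loop):

```c
#include <assert.h>
#include <stdbool.h>

/* Mocks standing in for kernel primitives. */
static int map_count;				/* map/unmap balance    */
static void pte_map(void)   { map_count++; }
static void pte_unmap(void) { map_count--; }

static int alloc_limit;				/* orders above it fail */
static bool try_alloc(int order)     { return order <= alloc_limit; }
static bool range_is_none(int order) { return order <= 6; }

/*
 * One combined loop: find a fitting order while the PTEs are mapped,
 * drop the mapping only around the allocation attempt, and remap to
 * retry with the next smaller candidate if the allocation fails.
 */
static int alloc_best_order(unsigned long orders)
{
	pte_map();
	while (orders) {
		int order = 63 - __builtin_clzl(orders); /* first_order() */

		orders &= ~(1UL << order);		 /* next_order()  */
		if (!range_is_none(order))
			continue;
		pte_unmap();
		if (try_alloc(order))
			return order;
		pte_map();		/* remap, retry smaller order */
	}
	pte_unmap();
	return -1;
}
```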
I might find some time today to see if there is an easy way to cleanup
all what I spelled out above. It really is a mess. But likely that
cleanup could be deferred (but you're touching it, so ... :) ).
--
Cheers,
David / dhildenb