From: John Hubbard <jhubbard@nvidia.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Yin Fengwei <fengwei.yin@intel.com>,
David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Yang Shi <shy828301@gmail.com>,
"Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Itaru Kitayama <itaru.kitayama@gmail.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
David Rientjes <rientjes@google.com>,
Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [RESEND PATCH v7 03/10] mm: thp: Introduce per-size thp sysfs interface
Date: Wed, 29 Nov 2023 11:40:57 -0800
Message-ID: <83744787-6bc5-4958-bdaa-030c9ed8225a@nvidia.com>
In-Reply-To: <0641d171-731c-4c12-83c2-8212c63ffade@arm.com>
On 11/29/23 03:05, Ryan Roberts wrote:
> On 29/11/2023 03:42, John Hubbard wrote:
>> On 11/22/23 08:29, Ryan Roberts wrote:
...
>>> +As well as PMD-sized THP described above, it is also possible to
>>> +configure the system to allocate "small-sized THP" to back anonymous
>>
>> Here's one of the places to change to the new name, which lately is
>> "multi-size THP", or mTHP or m_thp for short. (I've typed "multi-size"
>> instead of "multi-sized", because the 'd' doesn't add significantly to
>> the meaning, and if in doubt, shorter is better.)
>
> I was thinking of some light changes to the start of this paragraph, something like:
>
> Modern kernels support "multi-size THP" (mTHP), which introduces the ability to
> allocate memory in blocks that are bigger than a base page but smaller than
> traditional PMD-size (as described above), in increments of a power-of-2 number
> of pages. mTHP can back anonymous
>
Very nice.
...
>> By default, PMD-sized hugepages have enabled="inherited" and all other
>> hugepage sizes have enabled="never".
>
> That all sounds good; will update.
>
> I wonder if "inherit" is even better than "inherited" though?
Yes, I think so. "inherit" was actually my first idea, and after
thinking about it, I just made it worse by adding the "ed". haha. :)
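To make the per-size interface concrete, here is a quick userspace sketch (illustrative only: 64kB is just an example size, which hugepages-<size>kB directories exist depends on the kernel and architecture, and the bracketed-selection output format shown in the comment mirrors the existing top-level "enabled" file):

#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
			"hugepages-64kB/enabled", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);	/* e.g. "always inherit madvise [never]" */
	fclose(f);
	return 0;
}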
...
>>> Khugepaged controls
>>> -------------------
>>>
>>> +.. note::
>>> + khugepaged currently only searches for opportunities to collapse to
>>> + PMD-sized THP and no attempt is made to collapse to small-sized
>>> + THP.
>>> +
>
> "small-sized THP" used here too; I propose to change it to "... collapse to
> other THP sizes."
OK, looks good.
>
>>> khugepaged runs usually at low frequency so while one may not want to
>>> invoke defrag algorithms synchronously during the page faults, it
>>> should be worth invoking defrag at least in khugepaged. However it's
>>> @@ -282,10 +321,11 @@ force
>>> Need of application restart
>>> ===========================
>>>
>>> -The transparent_hugepage/enabled values and tmpfs mount option only affect
>>> -future behavior. So to make them effective you need to restart any
>>> -application that could have been using hugepages. This also applies to the
>>> -regions registered in khugepaged.
>>> +The transparent_hugepage/enabled and
>>> +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
>>> +option only affect future behavior. So to make them effective you need
>>> +to restart any application that could have been using hugepages. This
>>> +also applies to the regions registered in khugepaged.
>>>
>>> Monitoring usage
>>> ================
>>> @@ -308,6 +348,10 @@ frequently will incur overhead.
>>> There are a number of counters in ``/proc/vmstat`` that may be used to
>>> monitor how successfully the system is providing huge pages for use.
>>>
>>> +.. note::
>>> + Currently the below counters only record events relating to
>>> + PMD-sized THP. Events relating to small-sized THP are not included.
>>
>> Here's another spot to rename to "multi-size THP".
>
> I propose to change it to "Events relating to other THP sizes..."
>
> I'm also going to move this note to just under the "Monitoring Usage" heading
> since its current position doesn't make it clear that it includes
> "AnonHugePages". I'll also make it clear in the text that mentions AnonHugePages
> counter that it "only applies to traditional PMD-sized THP for historical
> reasons" and that "AnonHugePages should have been called AnonHugePmdMapped" as
> David suggested in the other thread.
OK.
Are we entirely dropping the AnonHugePtePages that was there in v6? I'm
looking for some way to monitor this. I may have missed it because I haven't
gone through all of v7 yet.
...
>>> -"THPeligible" indicates whether the mapping is eligible for allocating THP
>>> -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
>>> -It just shows the current status.
>>> +"THPeligible" indicates whether the mapping is eligible for allocating
>>> +naturally aligned THP pages of any currently enabled size. 1 if true, 0
>>> +otherwise. It just shows the current status.
>>
>> "It just shows the current status"...as opposed to what? I'm missing an
>> educational opportunity here--not sure what this means. :)
>
> I have no idea either - it seems superfluous. But that sentence was there
> already. I can remove it if you like?
>
Yes, let's remove it, please.
...
>>> +/*
>>> + * Mask of all large folio orders supported for anonymous THP.
>>> + */
>>> +#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for file THP.
>>> + */
>>> +#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>
>> Is there something about file THP that implies PUD_ORDER? This
>> deserves an explanatory comment, I think.
>
> Sorry, I'm not really sure what you are asking? Today's kernel supports
> PUD-mapping file-backed memory (mostly DAX I believe). I'm not sure what comment
> you want me to add, other than "file-backed memory can support PUD-mapping", but
> that's self-evident from the define, isn't it?
>
Well, it's sort of evident, but confusing to me anyway, because it
combines THP and PUD. Which seems to imply that we have PUD-based THPs
supported, but only for file-backed mappings. Which is weird and needs
some explanation, right?
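To make the bitmask encoding concrete, a minimal sketch (not from the patch; the helper names are invented for illustration) of how such order masks are typically consumed:

#include <linux/bits.h>
#include <linux/types.h>

/* Is a particular folio order present in the mask? */
static inline bool order_is_supported(unsigned long supported, int order)
{
	return supported & BIT(order);
}

/* Restrict a caller's requested orders to those file-backed THP supports. */
static inline unsigned long file_thp_orders(unsigned long requested)
{
	return requested & THP_ORDERS_ALL_FILE;
}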
>>
>>> +
>>> +/*
>>> + * Mask of all large folio orders supported for THP.
>>> + */
>>> +#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
>>> +
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> #define HPAGE_PMD_SHIFT PMD_SHIFT
>>> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
>>> @@ -78,13 +93,18 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>
>>> extern unsigned long transparent_hugepage_flags;
>>>
>>> -#define hugepage_flags_enabled() \
>>> - (transparent_hugepage_flags & \
>>> - ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> -#define hugepage_flags_always() \
>>> - (transparent_hugepage_flags & \
>>> - (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>
>> Are macros actually required? If not, static inlines would be nicer.
>
> I agree static inlines are nicer. Here I'm removing existing macros though, and
Oh, I see that I replied in the wrong part of the email, since that's the
*removal* part. oops. :)
...
>>> -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>>> - unsigned long addr)
>>> -{
>>> - unsigned long haddr;
>>> -
>>> - /* Don't have to check pgoff for anonymous vma */
>>> - if (!vma_is_anonymous(vma)) {
>>> - if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>>> - HPAGE_PMD_NR))
>>> - return false;
>>> +static inline unsigned long transhuge_vma_suitable(struct vm_area_struct *vma,
>>> + unsigned long addr, unsigned long orders)
>>
>> Changing this from a bool to a mask of orders is quite significant, and
>> both the function name and the function-level comment documentation need
>> to also be adjusted. Here, perhaps one of these names would work:
>>
>> transhuge_vma_suitable_orders()
>
> This is my preference, but it's getting a bit long. How about:
>
> thp_vma_suitable_orders()
>
> I could then call the other one:
>
> thp_vma_allowable_orders()
>
> So we have a clearly related pair?
Oh yes, that works nicely.
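For illustration, a hedged sketch of how such a pair could compose in a fault path; the wrapper name and the specific argument values are invented here, but the parameter lists follow the prototypes quoted in this thread:

/*
 * Hypothetical caller: policy first (which orders the sysfs/madvise
 * configuration allows for this VMA), then geometry (which of those
 * fit, naturally aligned, at this address).
 */
static unsigned long fault_candidate_orders(struct vm_area_struct *vma,
					    unsigned long addr)
{
	unsigned long orders;

	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
					  false /* smaps */, true /* in_pf */,
					  true /* enforce_sysfs */,
					  THP_ORDERS_ALL_ANON);
	return thp_vma_suitable_orders(vma, addr, orders);
}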
>
>
>> transhuge_vma_orders()
>>
>>> +{
>>> + int order;
>>> +
>>> + /*
>>> + * Iterate over orders, highest to lowest, removing orders that don't
>>> + * meet alignment requirements from the set. Exit loop at first order
>>> + * that meets requirements, since all lower orders must also meet
>>> + * requirements.
>>
>> Should we add a little note here, to the effect of, "this is unilaterally
>> taking over a certain amount of allocation-like policy, by deciding how
>> to search for folios of a certain size"?
>>
>> Or is this The Only Way To Do It, after all? I know we had some discussion
>> about it, and intuitively it feels reasonable, but wanted to poke at it
>> one last time anyway.
>
> This function isn't trying to implement policy, it's just filtering out orders
> that don't fit and therefore definitely cannot be used. The result is a set of
> orders that could be used, and it's the policy maker's job to decide which one.
> Currently that policy is "biggest one that fits && does not overlap other
> non-none ptes && folio successfully allocated". That's implemented in the next
> patch and would potentially be swapped out in the future for something more
> exotic/optimal.
>
> So I don't think we need any extra comments here.
Ack.
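For context, a simplified sketch (not the patch's exact code, and it ignores the extra pgoff alignment check needed for file-backed VMAs) of the highest-to-lowest filtering described above:

#include <linux/bitops.h>	/* fls_long(), BIT() */
#include <linux/mm.h>		/* struct vm_area_struct, ALIGN_DOWN(), PAGE_SIZE */

/* Highest set order in the mask, or -1 if the mask is empty. */
static inline int highest_order(unsigned long orders)
{
	return fls_long(orders) - 1;
}

/*
 * Walk the candidate orders from highest to lowest and clear any order
 * whose naturally aligned range does not fit the VMA at this address;
 * once one order fits, all lower orders fit as well, so stop there.
 */
static unsigned long filter_suitable_orders(struct vm_area_struct *vma,
					    unsigned long addr,
					    unsigned long orders)
{
	while (orders) {
		int order = highest_order(orders);
		unsigned long size = PAGE_SIZE << order;
		unsigned long haddr = ALIGN_DOWN(addr, size);

		if (haddr >= vma->vm_start && haddr + size <= vma->vm_end)
			break;
		orders &= ~BIT(order);
	}
	return orders;
}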
...
>>> -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>>> - bool smaps, bool in_pf, bool enforce_sysfs)
>>> +#define hugepage_global_enabled() \
>>> + (transparent_hugepage_flags & \
>>> + ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
>>> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
>>> +
>>> +#define hugepage_global_always() \
>>> + (transparent_hugepage_flags & \
>>> + (1<<TRANSPARENT_HUGEPAGE_FLAG))
>>> +
>>
>> Here again, I'd like to request that we avoid macros, as I don't think
>> they are strictly required.
>
> Yeah I agree. I did them this way, because they already existed and I was just
> moving them from the header to here. But I'll change to static inline.
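As a sketch, the static inline form being discussed could look like this (a direct translation of the macros quoted above; the patch's final naming and placement may differ):

static inline bool hugepage_global_enabled(void)
{
	return transparent_hugepage_flags &
	       ((1 << TRANSPARENT_HUGEPAGE_FLAG) |
		(1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
}

static inline bool hugepage_global_always(void)
{
	return transparent_hugepage_flags &
	       (1 << TRANSPARENT_HUGEPAGE_FLAG);
}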
>
>>
>>> +bool hugepage_flags_enabled(void)
>>> {
>>> + /*
>>> + * We cover both the anon and the file-backed case here; we must return
>>> + * true if globally enabled, even when all anon sizes are set to never.
>>> + * So we don't need to look at huge_anon_orders_global.
>>> + */
>>> + return hugepage_global_enabled() ||
>>> + huge_anon_orders_always ||
>>> + huge_anon_orders_madvise;
>>> +}
>>> +
>>> +/**
>>> + * hugepage_vma_check - determine which hugepage orders can be applied to vma
>>> + * @vma: the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long hugepage_vma_check(struct vm_area_struct *vma,
>>> + unsigned long vm_flags, bool smaps, bool in_pf,
>>> + bool enforce_sysfs, unsigned long orders)
>>
>> Here again, a bool return type has been changed to a bitfield. Let's
>> also update the function name. Maybe one of these:
>>
>> hugepage_vma_orders()
>> hugepage_vma_allowable_orders()
>
> thp_vma_allowable_orders()?
>
Even better, yes.
thanks,
--
John Hubbard
NVIDIA