From: Usama Arif <usamaarif642@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>, ziy@nvidia.com
Cc: Andrew Morton <akpm@linux-foundation.org>,
david@redhat.com, linux-mm@kvack.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, riel@surriel.com,
baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
hughd@google.com, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
Date: Mon, 9 Jun 2025 13:12:25 +0100
Message-ID: <b8490586-131b-4ce7-8835-aaa5437e3e97@gmail.com>
In-Reply-To: <8200fd8b-edae-44ab-be47-7dfccab25a24@gmail.com>
> I don't like it either :)
>
Pressed "Ctrl+enter" instead of "enter" by mistake which sent the email prematurely :)
Adding replies to the rest of the comments in this email.
As I mentioned in reply to David now in [1], pageblock_nr_pages is not really
1 << PAGE_BLOCK_ORDER but is 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
THP is enabled.
It needs a better name, but I think the right approach is just to change
pageblock_order as recommended in [2]
[1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
[2] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/
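Roughly the effective definition I mean (just a sketch, not the actual
pageblock-flags.h text, and ignoring the hugetlb-variable case):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* pageblock order is capped at the PMD order when THP is enabled */
	#define pageblock_order	min_t(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
	#else
	#define pageblock_order	PAGE_BLOCK_ORDER
	#endif
	#define pageblock_nr_pages	(1UL << pageblock_order)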
>
>>> +{
>>> + return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
>>> +}
>>> +
>>> static void set_recommended_min_free_kbytes(void)
>>> {
>>> struct zone *zone;
>>> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
>>
>> You provide a 'patchlet' in
>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>
>> That also does:
>>
>> /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
>> - recommended_min = pageblock_nr_pages * nr_zones * 2;
>> + recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
>>
>> So, comment here - this comment is now incorrect: this isn't 2 pageblocks,
>> it's 2 of 'sub-pageblock size, as if pageblocks were dynamically altered by
>> the always/madvise THP size'.
>>
>> Again, this whole thing strikes me as we're doing things at the wrong level
>> of abstraction.
>>
>> And you're definitely now not helping avoid pageblock-sized
>> fragmentation. You're accepting that you need less so... why not reduce
>> pageblock size? :)
>>
Yes, agreed.
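To put rough numbers on the reduction we're talking about, a userspace
back-of-envelope of the math in set_recommended_min_free_kbytes() (assuming
x86-64 4K pages, 4 zones and MIGRATE_PCPTYPES == 3, and comparing pageblock
order 9 with e.g. only 64K mTHP enabled, i.e. order 4):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long nr_zones = 4;	/* assumed */
		const unsigned long pcptypes = 3;	/* MIGRATE_PCPTYPES */
		const int orders[] = { 9, 4 };		/* PMD vs 64K mTHP */

		for (int i = 0; i < 2; i++) {
			unsigned long nr_pages = 1UL << orders[i];
			/* 2 blocks + PCPTYPES^2 blocks, as in the kernel code */
			unsigned long pages = nr_pages * nr_zones * 2 +
					      nr_pages * nr_zones * pcptypes * pcptypes;
			printf("order %d -> %lu kB\n", orders[i],
			       pages << (12 - 10));	/* PAGE_SHIFT - 10 */
		}
		return 0;
	}

which gives ~88 MiB for order 9 vs ~2.75 MiB for order 4 - that is the
scale of the reduction being discussed here.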
>> /*
>> * Make sure that on average at least two pageblocks are almost free
>> * of another type, one for a migratetype to fall back to and a
>>
>> ^ remainder of comment
>>
>>> * second to avoid subsequent fallbacks of other types There are 3
>>> * MIGRATE_TYPES we care about.
>>> */
>>> - recommended_min += pageblock_nr_pages * nr_zones *
>>> + recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
>>
>> This just seems wrong now and contradicts the comment - you're setting
>> minimum pages based on migrate PCP types that operate at pageblock order
>> but without reference to the actual number of page block pages?
>>
>> So the comment is just wrong now? 'make sure there are at least two
>> pageblocks' - well, this isn't what you're doing, is it? So why are we
>> making reference to PCP counts etc.?
>>
>> This seems like we're essentially just tuning these numbers somewhat
>> arbitrarily to reduce them?
>>
>>>
>>> - /* don't ever allow to reserve more than 5% of the lowmem */
>>> - recommended_min = min(recommended_min,
>>> - (unsigned long) nr_free_buffer_pages() / 20);
>>> + /*
>>> + * Don't ever allow to reserve more than 5% of the lowmem.
>>> + * Use a min of 128 pages when all THP orders are set to never.
>>
>> Why? Did you just choose this number out of the blue?
I mentioned this in the previous comment.
>>
>> Previously, on x86-64 with thp -> never on everything and pageblock order 9,
>> wouldn't this be a much higher value?
>>
>> I mean just putting '128' here is not acceptable. It needs to be justified
>> (even if empirically with data to back it) and defined as a named thing.
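Agreed on both counts - the 128 needs justifying with data, and it should be
a named thing. Something along these lines for the next revision (name made
up, value still to be justified, and 128UL rather than 128 so clamp()'s type
checking is happy):

	/* Floor when all THP orders are set to never; value needs justification */
	#define THP_NEVER_MIN_FREE_PAGES	128UL

	recommended_min = clamp(recommended_min, THP_NEVER_MIN_FREE_PAGES,
				(unsigned long) nr_free_buffer_pages() / 20);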
>>
>>
>>> + */
>>> + recommended_min = clamp(recommended_min, 128,
>>> + (unsigned long) nr_free_buffer_pages() / 20);
>>> +
>>> recommended_min <<= (PAGE_SHIFT-10);
>>>
>>> if (recommended_min > min_free_kbytes) {
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0c5fb4ffa03a..8e92678d1175 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -136,10 +136,10 @@ struct shmem_options {
>>> };
>>>
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> -static unsigned long huge_shmem_orders_always __read_mostly;
>>> -static unsigned long huge_shmem_orders_madvise __read_mostly;
>>> -static unsigned long huge_shmem_orders_inherit __read_mostly;
>>> -static unsigned long huge_shmem_orders_within_size __read_mostly;
>>> +unsigned long huge_shmem_orders_always __read_mostly;
>>> +unsigned long huge_shmem_orders_madvise __read_mostly;
>>> +unsigned long huge_shmem_orders_inherit __read_mostly;
>>> +unsigned long huge_shmem_orders_within_size __read_mostly;
>>
>> Again, we really shouldn't need to do this.
Agreed. For the RFC I just did it similarly to the anon ones when I hit the
build error trying to use these, but a much better approach would be to have
a function in shmem that returns the largest allowable shmem THP order.
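Something like the below in mm/shmem.c is the shape I have in mind
(hypothetical name, completely untested), so that nothing outside shmem
needs the huge_shmem_orders_* bitmaps or the SHMEM_HUGE_* values:

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* Highest order any current shmem THP setting could allow, 0 if none */
	unsigned int shmem_highest_allowable_order(void)
	{
		unsigned long orders = READ_ONCE(huge_shmem_orders_always) |
				       READ_ONCE(huge_shmem_orders_madvise) |
				       READ_ONCE(huge_shmem_orders_within_size);

		if (READ_ONCE(shmem_huge) != SHMEM_HUGE_NEVER)
			orders |= READ_ONCE(huge_shmem_orders_inherit);

		return orders ? fls_long(orders) - 1 : 0;
	}
	#endif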
>>
>>> static bool shmem_orders_configured __initdata;
>>> #endif
>>>
>>> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>> return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
>>> }
>>>
>>> -/*
>>> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> - *
>>> - * SHMEM_HUGE_NEVER:
>>> - * disables huge pages for the mount;
>>> - * SHMEM_HUGE_ALWAYS:
>>> - * enables huge pages for the mount;
>>> - * SHMEM_HUGE_WITHIN_SIZE:
>>> - * only allocate huge pages if the page will be fully within i_size,
>>> - * also respect madvise() hints;
>>> - * SHMEM_HUGE_ADVISE:
>>> - * only allocate huge pages if requested with madvise();
>>> - */
>>> -
>>> -#define SHMEM_HUGE_NEVER 0
>>> -#define SHMEM_HUGE_ALWAYS 1
>>> -#define SHMEM_HUGE_WITHIN_SIZE 2
>>> -#define SHMEM_HUGE_ADVISE 3
>>> -
>>
>> Again we really shouldn't need to do this, just provide some function from
>> shmem that gives you what you need.
>>
>>> /*
>>> * Special values.
>>> * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
>>> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> /* ifdef here to avoid bloating shmem.o when not necessary */
>>>
>>> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>
>> Same comment.
>>
>>> static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>>>
>>> /**
>>> --
>>> 2.47.1
>>>
>