linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	david@redhat.com, linux-mm@kvack.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, riel@surriel.com, ziy@nvidia.com,
	laoar.shao@gmail.com, baolin.wang@linux.alibaba.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
Date: Thu, 15 May 2025 15:40:19 +0100	[thread overview]
Message-ID: <c0af0eb2-d10f-4ee3-87dd-c23cca6cfd1a@lucifer.local> (raw)
In-Reply-To: <20250515133519.2779639-2-usamaarif642@gmail.com>

Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
outlandish stuff and needs discussion.

You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
never is completely ignored and overridden. Which I am emphatically not
comfortable with. And you're not saying that you're doing this,
anywhere. Which is wrong.

Also, this patch is quite broken.

I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a
fan of you not mentioning such a completely fundamental change in the
commit mesage.

This patch also breaks VMA merging and the VMA tests...

I really feel this series needs to be an RFC until we can get some
consensus on how to approach this.

On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
> This is set via the new PR_SET_THP_POLICY prctl.

What is?

You're making very major changes here, including adding a new flag to
mm_struct (!!) and the explanation/justification for this is missing.

> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
> which changes the default of new VMAs to be VM_HUGEPAGE. The
> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> to be VM_HUGEPAGE. The policy is inherited during fork+exec.

So you can only set this flag?

>
> This allows systems where the global policy is set to "madvise"
> to effectively have THPs always for the process. In an environment
> where different types of workloads are stacked on the same machine,
> this will allow workloads that benefit from always having hugepages
> to do so, without regressing those that don't.

Again, this explanation really makes no sense at all to me, I don't really
know what you mean, you're not going into what you're doing in this change,
this is just a very unclear commit message.

>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h                       |  3 ++
>  include/linux/mm_types.h                      | 11 +++++++
>  include/uapi/linux/prctl.h                    |  4 +++
>  kernel/fork.c                                 |  1 +
>  kernel/sys.c                                  | 21 ++++++++++++
>  mm/huge_memory.c                              | 32 +++++++++++++++++++
>  mm/vma.c                                      |  2 ++
>  tools/include/uapi/linux/prctl.h              |  4 +++
>  .../trace/beauty/include/uapi/linux/prctl.h   |  4 +++
>  9 files changed, 82 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..e652ad9ddbbd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>  	return orders;
>  }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma);

This is a VMA-specific function but you're putting it in huge_mm.h? Why
can't this be in vma.h or vma.c?

> +void process_vmas_thp_default_huge(struct mm_struct *mm);

'vmas' is redundant here.

> +
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
>  					 unsigned long tva_flags,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..2fe93965e761 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1066,6 +1066,7 @@ struct mm_struct {
>  		mm_context_t context;
>
>  		unsigned long flags; /* Must use atomic bitops to access */
> +		unsigned long flags2;


Ugh, god really??

I really am not a fan of adding flags2 just to add a prctl() feature like
this. This is crazy.

Also this is a TERRIBLE name. I mean, no please PLEASE no.

Do we really have absolutely no choice but to add a new flags field here?

It again doesn't help that you don't mention nor even try to justify this
in the commit message or cover letter.

If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
let's just go make this flags field 64-bit on 32-bit kernels.

I mean - I'm kind of insisting we do that to be honest. Because I really
don't like this.

Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
atomic bitops?...

>
>  #ifdef CONFIG_AIO
>  		spinlock_t			ioctx_lock;
> @@ -1744,6 +1745,11 @@ enum {
>  				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>  				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>
> +#define MMF2_THP_VMA_DEFAULT_HUGE		0

I thought the whole idea was to move away from explicitly refrencing 'THP'
in a future where large folios are implicit and now we're saying 'THP'.

Anyway the 'VMA' is totally redundant here.

> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK		(1 << MMF2_THP_VMA_DEFAULT_HUGE)

Do we really need explicit trivial mask declarations like this?

> +
> +#define MMF2_INIT_MASK		(MMF2_THP_VMA_DEFAULT_HUGE_MASK)

> +
>  static inline unsigned long mmf_init_flags(unsigned long flags)
>  {
>  	if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>  	return flags & MMF_INIT_MASK;
>  }
>
> +static inline unsigned long mmf2_init_flags(unsigned long flags)
> +{
> +	return flags & MMF2_INIT_MASK;
> +}
> +
>  #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9e4616dacd82..6e5f4a8869dc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>
>  	if (current->mm) {
>  		mm->flags = mmf_init_flags(current->mm->flags);
> +		mm->flags2 = mmf2_init_flags(current->mm->flags2);
>  		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>  	} else {
>  		mm->flags = default_dump_filter;
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..1115f258f253 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		mmap_write_unlock(me->mm);
>  		break;
> +	case PR_GET_THP_POLICY:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))

I really don't think we need the !!? Do we?

Shouldn't we lock the mm when we do this no? Can't somebody change this?

> +			error = PR_THP_POLICY_DEFAULT_HUGE;
> +		break;
> +	case PR_SET_THP_POLICY:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (mmap_write_lock_killable(me->mm))
> +			return -EINTR;
> +		switch (arg2) {
> +		case PR_THP_POLICY_DEFAULT_HUGE:
> +			set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> +			process_vmas_thp_default_huge(me->mm);
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		mmap_write_unlock(me->mm);
> +		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
>  		/* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..64f66d5295e8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>  	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>  }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> +		vm_flags_set(vma, VM_HUGEPAGE);
> +}
> +
> +static void vmas_thp_default_huge(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long vm_flags;
> +
> +	VMA_ITERATOR(vmi, mm, 0);

This is a declaration, it should be grouped with declarations...

> +	for_each_vma(vmi, vma) {
> +		vm_flags = vma->vm_flags;
> +		if (vm_flags & VM_NOHUGEPAGE)
> +			continue;

Literally no point in you putting vm_flags as a separate variable here.

So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
is to override global 'never'?

I'm really concerned about this.

> +		vm_flags_set(vma, VM_HUGEPAGE);
> +	}
> +}

Do we have an mmap write lock established here? Can you confirm that? Also
you should add an assert for that here.

> +
> +void process_vmas_thp_default_huge(struct mm_struct *mm)
> +{
> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> +		return;
> +
> +	set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
> +	vmas_thp_default_huge(mm);
> +}
> +
> +
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
>  					 unsigned long tva_flags,
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..101b19c96803 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	if (!vma_is_anonymous(vma))
>  		khugepaged_enter_vma(vma, map->flags);
>  	ksm_add_vma(vma);
> +	vma_set_thp_policy(vma);

You're breaking VMA merging completely by doing this here...

Now I can map one VMA with this policy set, then map another immediately
next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
set in the new VMA on merge attempt.

I realise KSM is just as broken (grr) but this doesn't justify us
completely breaking VMA merging here.

You need to set earlier than this. Then of course a driver might decide to
override this, so maybe then we need to override that.

But then we're getting into realms of changing fundamental VMA code _just
for this feature_.

Again I'm iffy about this. Very.

Also you've broken the VMA userland tests here:

$ cd tools/testing/vma
$ make
...
In file included from vma.c:33:
../../../mm/vma.c: In function ‘__mmap_new_vma’:
../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
 2486 |         vma_set_thp_policy(vma);
      |         ^~~~~~~~~~~~~~~~~~
      |         vma_dup_policy
make: *** [<builtin>: vma.o] Error 1

You need to create stubs accordingly.

>  	*vmap = vma;
>  	return 0;
>
> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	mm->map_count++;
>  	validate_mm(mm);
>  	ksm_add_vma(vma);
> +	vma_set_thp_policy(vma);

You're breaking merging again... This is quite a bad case too as now you'll
have totally fragmented brk VMAs no?

We can't have it implemented this way.

>  out:
>  	perf_event_mmap(vma);
>  	mm->total_vm += len >> PAGE_SHIFT;
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 35791791a879..f5945ebfe3f2 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -328,4 +328,8 @@ struct prctl_mm_map {
>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> --
> 2.47.1
>


  reply	other threads:[~2025-05-15 14:40 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
2025-05-15 14:40   ` Lorenzo Stoakes [this message]
2025-05-15 14:44     ` David Hildenbrand
2025-05-15 14:56       ` Usama Arif
2025-05-15 14:58         ` David Hildenbrand
2025-05-15 15:18           ` Lorenzo Stoakes
2025-05-15 15:45       ` Liam R. Howlett
2025-05-15 15:57         ` David Hildenbrand
2025-05-15 16:38           ` Lorenzo Stoakes
2025-05-15 17:29             ` David Hildenbrand
2025-05-15 18:09               ` Liam R. Howlett
2025-05-15 18:21                 ` Lorenzo Stoakes
2025-05-15 18:42                   ` Zi Yan
2025-05-15 21:04                     ` Lorenzo Stoakes
2025-05-15 18:46                   ` Usama Arif
2025-05-15 19:20                 ` David Hildenbrand
2025-05-15 15:28     ` Usama Arif
2025-05-15 16:06       ` Lorenzo Stoakes
2025-05-15 16:11         ` David Hildenbrand
2025-05-15 18:08           ` Lorenzo Stoakes
2025-05-15 19:12             ` David Hildenbrand
2025-05-15 20:35               ` Lorenzo Stoakes
2025-05-16  7:45                 ` David Hildenbrand
2025-05-16 10:57                   ` Lorenzo Stoakes
2025-05-16 11:24                     ` David Hildenbrand
2025-05-16 12:57                       ` Lorenzo Stoakes
2025-05-16 17:19                         ` Usama Arif
2025-05-16 17:51                           ` Lorenzo Stoakes
2025-05-16 19:34                             ` Usama Arif
2025-05-17 16:20                         ` Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process) SeongJae Park
2025-05-17 18:50                           ` Lorenzo Stoakes
2025-05-17 20:25                             ` SeongJae Park
2025-05-17 19:01                         ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Lorenzo Stoakes
2025-05-15 16:47         ` Usama Arif
2025-05-15 18:36           ` Lorenzo Stoakes
2025-05-15 19:17             ` David Hildenbrand
2025-05-15 20:42               ` Lorenzo Stoakes
2025-05-16  6:12   ` kernel test robot
2025-05-15 13:33 ` [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE " Usama Arif
2025-05-16  8:19   ` kernel test robot
2025-05-15 13:33 ` [PATCH 3/6] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
2025-05-15 13:33 ` [PATCH 4/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE Usama Arif
2025-05-15 13:33 ` [PATCH 5/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
2025-05-15 13:33 ` [PATCH 6/6] docs: transhuge: document process level THP controls Usama Arif
2025-05-15 13:55 ` [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
2025-05-15 14:50   ` Usama Arif
2025-05-15 15:15     ` Lorenzo Stoakes
2025-05-15 15:54       ` Usama Arif
2025-05-15 16:04         ` David Hildenbrand
2025-05-15 16:24         ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c0af0eb2-d10f-4ee3-87dd-c23cca6cfd1a@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=laoar.shao@gmail.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npache@redhat.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usamaarif642@gmail.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox