linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
@ 2024-10-11 10:24 David Hildenbrand
  2024-10-11 10:24 ` [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 10:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, kvm, David Hildenbrand, Andrew Morton, Hugh Dickins,
	Thomas Huth, Matthew Wilcox (Oracle),
	Ryan Roberts, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Kefeng Wang

During testing, it was found that we can get PMD mappings in processes
where THP (and more precisely, PMD mappings) are supposed to be disabled.
While it works as expected for anon+shmem, the pagecache is the problematic
bit.

For s390 KVM this currently means that a VM backed by a file located on
filesystem with large folio support can crash when KVM tries accessing
the problematic page, because the readahead logic might decide to use
a PMD-sized THP and faulting it into the page tables will install a
PMD mapping, something that s390 KVM cannot tolerate.

This might also be a problem with HW that does not support PMD mappings,
but I did not try reproducing it.

Fix it by respecting the ways to disable THPs when deciding whether we
can install a PMD mapping. khugepaged should already be taking care of
not collapsing if THPs are effectively disabled for the hw/process/vma.

An earlier patch was tested by Thomas Huth, this one still needs to
be retested; sending it out already.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>

David Hildenbrand (1):
  mm: don't install PMD mappings when THPs are disabled by the
    hw/process/vma

Kefeng Wang (1):
  mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()

 include/linux/huge_mm.h | 18 ++++++++++++++++++
 mm/huge_memory.c        | 13 +------------
 mm/memory.c             |  9 +++++++++
 mm/shmem.c              |  7 +------
 4 files changed, 29 insertions(+), 18 deletions(-)

-- 
2.46.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
  2024-10-11 10:24 [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
@ 2024-10-11 10:24 ` David Hildenbrand
  2024-10-11 11:21   ` Ryan Roberts
  2024-10-11 10:24 ` [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
  2024-10-11 11:39 ` [PATCH v1 0/2] " Thomas Huth
  2 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 10:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, kvm, David Hildenbrand, Andrew Morton, Hugh Dickins,
	Thomas Huth, Matthew Wilcox (Oracle),
	Ryan Roberts, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Kefeng Wang

From: Kefeng Wang <wangkefeng.wang@huawei.com>

Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by
shmem_allowable_huge_orders() and __thp_vma_allowable_orders().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
[ rename to vma_thp_disabled(), split out thp_disabled_by_hw() ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/huge_mm.h | 18 ++++++++++++++++++
 mm/huge_memory.c        | 13 +------------
 mm/shmem.c              |  7 +------
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 67d0ab3c3bba..ef5b80e48599 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -322,6 +322,24 @@ struct thpsize {
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 
+static inline bool vma_thp_disabled(struct vm_area_struct *vma,
+		unsigned long vm_flags)
+{
+	/*
+	 * Explicitly disabled through madvise or prctl, or some
+	 * architectures may disable THP for some mappings, for
+	 * example, s390 kvm.
+	 */
+	return (vm_flags & VM_NOHUGEPAGE) ||
+	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+}
+
+static inline bool thp_disabled_by_hw(void)
+{
+	/* If the hardware/firmware marked hugepage support disabled. */
+	return transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED);
+}
+
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87b49ecc7b1e..2fb328880b50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -109,18 +109,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
-	 * */
-	if ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return 0;
-	/*
-	 * If the hardware/firmware marked hugepage support disabled.
-	 */
-	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/shmem.c b/mm/shmem.c
index 4f11b5506363..c5adb987b23c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1664,12 +1664,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	loff_t i_size;
 	int order;
 
-	if (vma && ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
-		return 0;
-
-	/* If the hardware/firmware marked hugepage support disabled. */
-	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
 		return 0;
 
 	global_huge = shmem_huge_global_enabled(inode, index, write_end,
-- 
2.46.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 10:24 [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
  2024-10-11 10:24 ` [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() David Hildenbrand
@ 2024-10-11 10:24 ` David Hildenbrand
  2024-10-11 11:29   ` Ryan Roberts
  2024-10-11 11:39 ` [PATCH v1 0/2] " Thomas Huth
  2 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 10:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, kvm, David Hildenbrand, Andrew Morton, Hugh Dickins,
	Thomas Huth, Matthew Wilcox (Oracle),
	Ryan Roberts, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Kefeng Wang, Leo Fu

We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.

This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
starting the VM.

For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.

Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
without X86_FEATURE_PSE.

In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants. For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.

Reported-by: Leo Fu <bfu@redhat.com>
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Cc: Thomas Huth <thuth@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..a2e501489517 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	pmd_t entry;
 	vm_fault_t ret = VM_FAULT_FALLBACK;
 
+	/*
+	 * It is too late to allocate a small folio, we already have a large
+	 * folio in the pagecache: especially s390 KVM cannot tolerate any
+	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
+	 * PMD mappings if THPs are disabled.
+	 */
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+		return ret;
+
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;
 
-- 
2.46.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
  2024-10-11 10:24 ` [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() David Hildenbrand
@ 2024-10-11 11:21   ` Ryan Roberts
  0 siblings, 0 replies; 10+ messages in thread
From: Ryan Roberts @ 2024-10-11 11:21 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins, Thomas Huth,
	Matthew Wilcox (Oracle),
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Kefeng Wang

On 11/10/2024 11:24, David Hildenbrand wrote:
> From: Kefeng Wang <wangkefeng.wang@huawei.com>
> 
> Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by
> shmem_allowable_huge_orders() and __thp_vma_allowable_orders().
> 
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> [ rename to vma_thp_disabled(), split out thp_disabled_by_hw() ]
> Signed-off-by: David Hildenbrand <david@redhat.com>

Looks like a nice tidy up on its own:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  include/linux/huge_mm.h | 18 ++++++++++++++++++
>  mm/huge_memory.c        | 13 +------------
>  mm/shmem.c              |  7 +------
>  3 files changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 67d0ab3c3bba..ef5b80e48599 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -322,6 +322,24 @@ struct thpsize {
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>  
> +static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> +		unsigned long vm_flags)
> +{
> +	/*
> +	 * Explicitly disabled through madvise or prctl, or some
> +	 * architectures may disable THP for some mappings, for
> +	 * example, s390 kvm.
> +	 */
> +	return (vm_flags & VM_NOHUGEPAGE) ||
> +	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> +}
> +
> +static inline bool thp_disabled_by_hw(void)
> +{
> +	/* If the hardware/firmware marked hugepage support disabled. */
> +	return transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED);
> +}
> +
>  unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags);
>  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87b49ecc7b1e..2fb328880b50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -109,18 +109,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	if (!vma->vm_mm)		/* vdso */
>  		return 0;
>  
> -	/*
> -	 * Explicitly disabled through madvise or prctl, or some
> -	 * architectures may disable THP for some mappings, for
> -	 * example, s390 kvm.
> -	 * */
> -	if ((vm_flags & VM_NOHUGEPAGE) ||
> -	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> -		return 0;
> -	/*
> -	 * If the hardware/firmware marked hugepage support disabled.
> -	 */
> -	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
>  		return 0;
>  
>  	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4f11b5506363..c5adb987b23c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1664,12 +1664,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>  	loff_t i_size;
>  	int order;
>  
> -	if (vma && ((vm_flags & VM_NOHUGEPAGE) ||
> -	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
> -		return 0;
> -
> -	/* If the hardware/firmware marked hugepage support disabled. */
> -	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> +	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
>  		return 0;
>  
>  	global_huge = shmem_huge_global_enabled(inode, index, write_end,



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 10:24 ` [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
@ 2024-10-11 11:29   ` Ryan Roberts
  2024-10-11 11:33     ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Ryan Roberts @ 2024-10-11 11:29 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins, Thomas Huth,
	Matthew Wilcox (Oracle),
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Kefeng Wang, Leo Fu

On 11/10/2024 11:24, David Hildenbrand wrote:
> We (or rather, readahead logic :) ) might be allocating a THP in the
> pagecache and then try mapping it into a process that explicitly disabled
> THP: we might end up installing PMD mappings.
> 
> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
> starting the VM.
> 
> For example, starting a VM backed on a file system with large folios
> supported makes the VM crash when the VM tries accessing such a mapping
> using KVM.
> 
> Is it also a problem when the HW disabled THP using
> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
> without X86_FEATURE_PSE.
> 
> In the future, we might be able to do better on s390x and only disallow
> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
> really wants. For now, fix it by essentially performing the same check as
> would be done in __thp_vma_allowable_orders() or in shmem code, where this
> works as expected, and disallow PMD mappings, making us fallback to PTE
> mappings.
> 
> Reported-by: Leo Fu <bfu@redhat.com>
> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")

Will this patch be difficult to backport given it depends on the previous patch
and that doesn't have a Fixes tag?

> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..a2e501489517 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  	pmd_t entry;
>  	vm_fault_t ret = VM_FAULT_FALLBACK;
>  
> +	/*
> +	 * It is too late to allocate a small folio, we already have a large
> +	 * folio in the pagecache: especially s390 KVM cannot tolerate any
> +	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
> +	 * PMD mappings if THPs are disabled.
> +	 */
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
> +		return ret;

Why not just call thp_vma_allowable_orders()?

> +
>  	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>  		return ret;
>  



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 11:29   ` Ryan Roberts
@ 2024-10-11 11:33     ` David Hildenbrand
  2024-10-11 11:36       ` Ryan Roberts
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 11:33 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins, Thomas Huth,
	Matthew Wilcox (Oracle),
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Kefeng Wang, Leo Fu

On 11.10.24 13:29, Ryan Roberts wrote:
> On 11/10/2024 11:24, David Hildenbrand wrote:
>> We (or rather, readahead logic :) ) might be allocating a THP in the
>> pagecache and then try mapping it into a process that explicitly disabled
>> THP: we might end up installing PMD mappings.
>>
>> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
>> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
>> starting the VM.
>>
>> For example, starting a VM backed on a file system with large folios
>> supported makes the VM crash when the VM tries accessing such a mapping
>> using KVM.
>>
>> Is it also a problem when the HW disabled THP using
>> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
>> without X86_FEATURE_PSE.
>>
>> In the future, we might be able to do better on s390x and only disallow
>> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
>> really wants. For now, fix it by essentially performing the same check as
>> would be done in __thp_vma_allowable_orders() or in shmem code, where this
>> works as expected, and disallow PMD mappings, making us fallback to PTE
>> mappings.
>>
>> Reported-by: Leo Fu <bfu@redhat.com>
>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> 
> Will this patch be difficult to backport given it depends on the previous patch
> and that doesn't have a Fixes tag?

"difficult" -- not really. Andrew might want to tag patch #1  with 
"Fixes:" as well, but I can also send simple stable backports that avoid 
patch #1.

(Thinking again, I assume we want to Cc:stable)

> 
>> Cc: Thomas Huth <thuth@redhat.com>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
>> Cc: Janosch Frank <frankja@linux.ibm.com>
>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2366578015ad..a2e501489517 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>   	pmd_t entry;
>>   	vm_fault_t ret = VM_FAULT_FALLBACK;
>>   
>> +	/*
>> +	 * It is too late to allocate a small folio, we already have a large
>> +	 * folio in the pagecache: especially s390 KVM cannot tolerate any
>> +	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
>> +	 * PMD mappings if THPs are disabled.
>> +	 */
>> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
>> +		return ret;
> 
> Why not just call thp_vma_allowable_orders()?

Why call thp_vma_allowable_orders() that does a lot more work that 
doesn't really apply here? :)

I'd say, just like shmem, we handle this separately here.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 11:33     ` David Hildenbrand
@ 2024-10-11 11:36       ` Ryan Roberts
  2024-10-11 11:40         ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Ryan Roberts @ 2024-10-11 11:36 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins, Thomas Huth,
	Matthew Wilcox (Oracle),
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Kefeng Wang, Leo Fu

On 11/10/2024 12:33, David Hildenbrand wrote:
> On 11.10.24 13:29, Ryan Roberts wrote:
>> On 11/10/2024 11:24, David Hildenbrand wrote:
>>> We (or rather, readahead logic :) ) might be allocating a THP in the
>>> pagecache and then try mapping it into a process that explicitly disabled
>>> THP: we might end up installing PMD mappings.
>>>
>>> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
>>> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
>>> starting the VM.
>>>
>>> For example, starting a VM backed on a file system with large folios
>>> supported makes the VM crash when the VM tries accessing such a mapping
>>> using KVM.
>>>
>>> Is it also a problem when the HW disabled THP using
>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
>>> without X86_FEATURE_PSE.
>>>
>>> In the future, we might be able to do better on s390x and only disallow
>>> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
>>> really wants. For now, fix it by essentially performing the same check as
>>> would be done in __thp_vma_allowable_orders() or in shmem code, where this
>>> works as expected, and disallow PMD mappings, making us fallback to PTE
>>> mappings.
>>>
>>> Reported-by: Leo Fu <bfu@redhat.com>
>>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>
>> Will this patch be difficult to backport given it depends on the previous patch
>> and that doesn't have a Fixes tag?
> 
> "difficult" -- not really. Andrew might want to tag patch #1  with "Fixes:" as
> well, but I can also send simple stable backports that avoid patch #1.
> 
> (Thinking again, I assume we want to Cc:stable)
> 
>>
>>> Cc: Thomas Huth <thuth@redhat.com>
>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
>>> Cc: Janosch Frank <frankja@linux.ibm.com>
>>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memory.c | 9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 2366578015ad..a2e501489517 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct
>>> page *page)
>>>       pmd_t entry;
>>>       vm_fault_t ret = VM_FAULT_FALLBACK;
>>>   +    /*
>>> +     * It is too late to allocate a small folio, we already have a large
>>> +     * folio in the pagecache: especially s390 KVM cannot tolerate any
>>> +     * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
>>> +     * PMD mappings if THPs are disabled.
>>> +     */
>>> +    if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
>>> +        return ret;
>>
>> Why not just call thp_vma_allowable_orders()?
> 
> Why call thp_vma_allowable_orders() that does a lot more work that doesn't
> really apply here? :)

Yeah fair enough, I was just thinking it makes the code simpler to keep all the
checks in one place. But no strong opinion.

Either way:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> 
> I'd say, just like shmem, we handle this separately here.
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 10:24 [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
  2024-10-11 10:24 ` [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() David Hildenbrand
  2024-10-11 10:24 ` [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
@ 2024-10-11 11:39 ` Thomas Huth
  2024-10-11 11:43   ` David Hildenbrand
  2 siblings, 1 reply; 10+ messages in thread
From: Thomas Huth @ 2024-10-11 11:39 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Kefeng Wang

On 11/10/2024 12.24, David Hildenbrand wrote:
> During testing, it was found that we can get PMD mappings in processes
> where THP (and more precisely, PMD mappings) are supposed to be disabled.
> While it works as expected for anon+shmem, the pagecache is the problematic
> bit.
> 
> For s390 KVM this currently means that a VM backed by a file located on
> filesystem with large folio support can crash when KVM tries accessing
> the problematic page, because the readahead logic might decide to use
> a PMD-sized THP and faulting it into the page tables will install a
> PMD mapping, something that s390 KVM cannot tolerate.
> 
> This might also be a problem with HW that does not support PMD mappings,
> but I did not try reproducing it.
> 
> Fix it by respecting the ways to disable THPs when deciding whether we
> can install a PMD mapping. khugepaged should already be taking care of
> not collapsing if THPs are effectively disabled for the hw/process/vma.
> 
> An earlier patch was tested by Thomas Huth, this one still needs to
> be retested; sending it out already.

I just finished testing your new version of these patches here, and I can 
confirm that they are fixing the problem that I was facing, so:

Tested-by: Thomas Huth <thuth@redhat.com>

FWIW, the problem can be reproduced by running a KVM guest on a s390x host 
like this:

qemu-system-s390x -accel kvm -nographic -m 4G -d guest_errors \
   -M s390-ccw-virtio,memory-backend=mem-machine_mem \
   -object 
memory-backend-file,size=4294967296,prealloc=true,mem-path=$HOME/myfile,share=true,id=mem-machine_mem

Without the fix, the guest crashes immediatly before being able to execute 
the first instruction. With the fix applied, you can still see the first 
messages of the guest firmware, indicating that the guest started successfully.

Thank you very much for the fix, David!

  Thomas



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 11:36       ` Ryan Roberts
@ 2024-10-11 11:40         ` David Hildenbrand
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 11:40 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins, Thomas Huth,
	Matthew Wilcox (Oracle),
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Kefeng Wang, Leo Fu

On 11.10.24 13:36, Ryan Roberts wrote:
> On 11/10/2024 12:33, David Hildenbrand wrote:
>> On 11.10.24 13:29, Ryan Roberts wrote:
>>> On 11/10/2024 11:24, David Hildenbrand wrote:
>>>> We (or rather, readahead logic :) ) might be allocating a THP in the
>>>> pagecache and then try mapping it into a process that explicitly disabled
>>>> THP: we might end up installing PMD mappings.
>>>>
>>>> This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
>>>> THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
>>>> starting the VM.
>>>>
>>>> For example, starting a VM backed on a file system with large folios
>>>> supported makes the VM crash when the VM tries accessing such a mapping
>>>> using KVM.
>>>>
>>>> Is it also a problem when the HW disabled THP using
>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
>>>> without X86_FEATURE_PSE.
>>>>
>>>> In the future, we might be able to do better on s390x and only disallow
>>>> PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
>>>> really wants. For now, fix it by essentially performing the same check as
>>>> would be done in __thp_vma_allowable_orders() or in shmem code, where this
>>>> works as expected, and disallow PMD mappings, making us fallback to PTE
>>>> mappings.
>>>>
>>>> Reported-by: Leo Fu <bfu@redhat.com>
>>>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>>>
>>> Will this patch be difficult to backport given it depends on the previous patch
>>> and that doesn't have a Fixes tag?
>>
>> "difficult" -- not really. Andrew might want to tag patch #1  with "Fixes:" as
>> well, but I can also send simple stable backports that avoid patch #1.
>>
>> (Thinking again, I assume we want to Cc:stable)
>>
>>>
>>>> Cc: Thomas Huth <thuth@redhat.com>
>>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
>>>> Cc: Janosch Frank <frankja@linux.ibm.com>
>>>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    mm/memory.c | 9 +++++++++
>>>>    1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 2366578015ad..a2e501489517 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4925,6 +4925,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct
>>>> page *page)
>>>>        pmd_t entry;
>>>>        vm_fault_t ret = VM_FAULT_FALLBACK;
>>>>    +    /*
>>>> +     * It is too late to allocate a small folio, we already have a large
>>>> +     * folio in the pagecache: especially s390 KVM cannot tolerate any
>>>> +     * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
>>>> +     * PMD mappings if THPs are disabled.
>>>> +     */
>>>> +    if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
>>>> +        return ret;
>>>
>>> Why not just call thp_vma_allowable_orders()?
>>
>> Why call thp_vma_allowable_orders() that does a lot more work that doesn't
>> really apply here? :)
> 
> Yeah fair enough, I was just thinking it makes the code simpler to keep all the
> checks in one place. But no strong opinion.
> 
> Either way:
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks!

Also, I decided to not use "thp_vma_allowable_orders" because we are 
past the allocation phase (as indicated in the comment) and can really 
just change the way how we map the folio (PMD vs. PTE), not really 
*what* folio to use.

Ideally, in the future we have a different way of just saying "no PMD 
mappings please", decoupling the mapping from the allocation granularity.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  2024-10-11 11:39 ` [PATCH v1 0/2] " Thomas Huth
@ 2024-10-11 11:43   ` David Hildenbrand
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2024-10-11 11:43 UTC (permalink / raw)
  To: Thomas Huth, linux-kernel
  Cc: linux-mm, kvm, Andrew Morton, Hugh Dickins,
	Matthew Wilcox (Oracle),
	Ryan Roberts, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Kefeng Wang

On 11.10.24 13:39, Thomas Huth wrote:
> On 11/10/2024 12.24, David Hildenbrand wrote:
>> During testing, it was found that we can get PMD mappings in processes
>> where THP (and more precisely, PMD mappings) are supposed to be disabled.
>> While it works as expected for anon+shmem, the pagecache is the problematic
>> bit.
>>
>> For s390 KVM this currently means that a VM backed by a file located on
>> filesystem with large folio support can crash when KVM tries accessing
>> the problematic page, because the readahead logic might decide to use
>> a PMD-sized THP and faulting it into the page tables will install a
>> PMD mapping, something that s390 KVM cannot tolerate.
>>
>> This might also be a problem with HW that does not support PMD mappings,
>> but I did not try reproducing it.
>>
>> Fix it by respecting the ways to disable THPs when deciding whether we
>> can install a PMD mapping. khugepaged should already be taking care of
>> not collapsing if THPs are effectively disabled for the hw/process/vma.
>>
>> An earlier patch was tested by Thomas Huth, this one still needs to
>> be retested; sending it out already.
> 
> I just finished testing your new version of these patches here, and I can
> confirm that they are fixing the problem that I was facing, so:
> 
> Tested-by: Thomas Huth <thuth@redhat.com>
> 
> FWIW, the problem can be reproduced by running a KVM guest on a s390x host
> like this:
> 
> qemu-system-s390x -accel kvm -nographic -m 4G -d guest_errors \
>     -M s390-ccw-virtio,memory-backend=mem-machine_mem \
>     -object
> memory-backend-file,size=4294967296,prealloc=true,mem-path=$HOME/myfile,share=true,id=mem-machine_mem
> 
> Without the fix, the guest crashes immediatly before being able to execute
> the first instruction. With the fix applied, you can still see the first
> messages of the guest firmware, indicating that the guest started successfully.
> 
> Thank you very much for the fix, David!

Thanks for the quick test, Thomas!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-10-11 11:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-11 10:24 [PATCH v1 0/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
2024-10-11 10:24 ` [PATCH v1 1/2] mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() David Hildenbrand
2024-10-11 11:21   ` Ryan Roberts
2024-10-11 10:24 ` [PATCH v1 2/2] mm: don't install PMD mappings when THPs are disabled by the hw/process/vma David Hildenbrand
2024-10-11 11:29   ` Ryan Roberts
2024-10-11 11:33     ` David Hildenbrand
2024-10-11 11:36       ` Ryan Roberts
2024-10-11 11:40         ` David Hildenbrand
2024-10-11 11:39 ` [PATCH v1 0/2] " Thomas Huth
2024-10-11 11:43   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox