[PATCH 0/1] Fix zero copy I/O on __get_user

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages
       [not found] <CGME20250507154119uscas1p2a4055d14ab111fdb94a6378789c38d9d@uscas1p2.samsung.com>
@ 2025-05-07 15:41 ` Pantelis Antoniou
       [not found]   ` <CGME20250507154119uscas1p17799fe7589e4f1bd53d2d3dc7f44cb8c@uscas1p1.samsung.com>
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Pantelis Antoniou @ 2025-05-07 15:41 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth

Updates to network filesystems enabled zero copy I/O by using the
netfslib common accessors.

One example of that is the 9p filesystem which is commonly used in qemu
based setups for sharing files with the host.

In our emulation environment we have noticed failing writes when performing
I/O from a userspace mapped DRM GEM buffer object.
The platform does not use VRAM, all graphics memory is regular DRAM memory,
allocated via __get_free_pages

The same write was successful from a heap allocated bounce buffer.

The sequence of events is as follows.

1. A BO (Buffer Object) is created, and it's backing memory is allocated via
   __get_user_pages()

2. Userspace mmaps a BO (Buffer Object) via a mmap call on the opened
   file handle of a DRM driver. The mapping is done via the
   drm_gem_mmap_obj() call.

3. Userspace issues a write to a file copying the contents of the BO.

3a. If the file is located on regular filesystem (like ext4), the write
    completes successfully.

3b. If the file is located on a network filesystem, like 9p the write fails.

The write fails because v9fs_file_write_iter() will call
netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked() which will 
call netfs_extract_user_iter() 

netfs_extract_user_iter() will in turn call iov_iter_extract_pages() which for
a user backed iterator will call iov_iter_extract_user_pages which will call
pin_user_pages_fast() which finally will call __gup_longterm_locked().

__gup_longterm_locked() will call __get_user_pages_locked() which will fail
because the VMA is marked with the VM_IO and VM_PFNMAP flags.

Pantelis Antoniou (1):
  Fix zero copy I/O on __get_user_pages allocated pages

 mm/gup.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
       [not found]   ` <CGME20250507154119uscas1p17799fe7589e4f1bd53d2d3dc7f44cb8c@uscas1p1.samsung.com>
@ 2025-05-07 15:41     ` Pantelis Antoniou
  2025-05-08 15:03       ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Pantelis Antoniou @ 2025-05-07 15:41 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth

Recent updates to net filesystems enabled zero copy operations,
which require getting a user space page pinned.

This does not work for pages that were allocated via __get_user_pages
and then mapped to user-space via remap_pfn_rage.

remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
VM_PFNMAP in particular mark the pages as not having struct_page
associated with them, which is not the case for __get_user_pages()

This in turn makes any attempt to lock a page fail, and breaking
I/O from that address range.

This patch address it by special casing pages in those VMAs and not
calling vm_normal_page() for them.

Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
---
 mm/gup.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..e185c18c0c81 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page,
 	return !userfaultfd_pte_wp(vma, pte);
 }
 
+static struct page *gup_normal_page(struct vm_area_struct *vma,
+		unsigned long address, pte_t pte)
+{
+	unsigned long pfn;
+
+	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
+		pfn = pte_pfn(pte);
+		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+	return vm_normal_page(vma, address, pte);
+}
+
 static struct page *follow_page_pte(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd, unsigned int flags,
 		struct dev_pagemap **pgmap)
@@ -858,7 +872,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
 		goto no_page;
 
-	page = vm_normal_page(vma, address, pte);
+	page = gup_normal_page(vma, address, pte);
+	if (page && (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)))
+		(void)follow_pfn_pte(vma, address, ptep, flags);
 
 	/*
 	 * We only care about anon pages in can_follow_write_pte() and don't
@@ -1130,7 +1146,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	*vma = get_gate_vma(mm);
 	if (!page)
 		goto out;
-	*page = vm_normal_page(*vma, address, entry);
+	*page = gup_normal_page(*vma, address, entry);
 	if (!*page) {
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry)))
 			goto unmap;
@@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	int foreign = (gup_flags & FOLL_REMOTE);
 	bool vma_anon = vma_is_anonymous(vma);
 
-	if (vm_flags & (VM_IO | VM_PFNMAP))
-		return -EFAULT;
 
 	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;
-- 
2.25.1



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-07 15:41 ` [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages Pantelis Antoniou
       [not found]   ` <CGME20250507154119uscas1p17799fe7589e4f1bd53d2d3dc7f44cb8c@uscas1p1.samsung.com>
@ 2025-05-07 21:50   ` Andrew Morton
  2025-05-08  8:24     ` Pantelis Antoniou
  2025-05-07 21:54   ` Andrew Morton
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-05-07 21:50 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: linux-mm, linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth

On Wed, 7 May 2025 10:41:04 -0500 Pantelis Antoniou <p.antoniou@partner.samsung.com> wrote:

> Updates to network filesystems enabled zero copy I/O by using the
> netfslib common accessors.

Updates by whom?  Are all the people who need to know about this being
cc'ed here?

> One example of that is the 9p filesystem which is commonly used in qemu
> based setups for sharing files with the host.
> 
> In our emulation environment we have noticed failing writes when performing
> I/O from a userspace mapped DRM GEM buffer object.
> The platform does not use VRAM, all graphics memory is regular DRAM memory,
> allocated via __get_free_pages

We should identify which kernel version(s) should be patched, please. 
6.16-rc1?  6.15?  -stable?

I often make these decisions but in this case I have far too little
information to be able to do that.

Thanks.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-07 15:41 ` [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages Pantelis Antoniou
       [not found]   ` <CGME20250507154119uscas1p17799fe7589e4f1bd53d2d3dc7f44cb8c@uscas1p1.samsung.com>
  2025-05-07 21:50   ` [PATCH 0/1] " Andrew Morton
@ 2025-05-07 21:54   ` Andrew Morton
  2 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2025-05-07 21:54 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: linux-mm, linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth

On Wed, 7 May 2025 10:41:04 -0500 Pantelis Antoniou <p.antoniou@partner.samsung.com> wrote:

> One example of that is the 9p filesystem which is commonly used in qemu
> based setups for sharing files with the host.
> 
> In our emulation environment we have noticed failing writes when performing
> I/O from a userspace mapped DRM GEM buffer object.
> The platform does not use VRAM, all graphics memory is regular DRAM memory,
> allocated via __get_free_pages
> 
> The same write was successful from a heap allocated bounce buffer.
> 
> The sequence of events is as follows.
>
> ..
>

There's a lot of good stuff in this [0/N], but a single patch "series"
is cumbersome.  Can you please redo this as a standalone patch in which
the changelog is the union of [0/1] and [1/1]?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-07 21:50   ` [PATCH 0/1] " Andrew Morton
@ 2025-05-08  8:24     ` Pantelis Antoniou
  0 siblings, 0 replies; 9+ messages in thread
From: Pantelis Antoniou @ 2025-05-08  8:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth

On Wed, 7 May 2025 14:50:18 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 7 May 2025 10: 41: 04 -0500 Pantelis Antoniou <p. antoniou@
> partner. samsung. com> wrote: > Updates to network filesystems
> enabled zero copy I/O by using the > netfslib common accessors.
> Updates by whom? Are all the people who 
> On Wed, 7 May 2025 10:41:04 -0500 Pantelis Antoniou
> <p.antoniou@partner.samsung.com> wrote:
> 
> > Updates to network filesystems enabled zero copy I/O by using the
> > netfslib common accessors.
> 
> Updates by whom?  Are all the people who need to know about this being
> cc'ed here?
> 

I think the first cover letter contains that information.

> > One example of that is the 9p filesystem which is commonly used in
> > qemu based setups for sharing files with the host.
> > 
> > In our emulation environment we have noticed failing writes when
> > performing I/O from a userspace mapped DRM GEM buffer object.
> > The platform does not use VRAM, all graphics memory is regular DRAM
> > memory, allocated via __get_free_pages
> 
> We should identify which kernel version(s) should be patched, please. 
> 6.16-rc1?  6.15?  -stable?
> 

The first occurance of the bug was on internal kernel tree that was
based on 6.8.

This patch is against 6.15-rc5.

> I often make these decisions but in this case I have far too little
> information to be able to do that.
> 

No worries.

I see that this is picked up for mm unstable as is? Do you want
me to generate a single patch merging the info of the cover letter
and the single patch?

The reason for the split is that I was not sure if you needed to
have all the sordid details included in the applied patch.

FWIW, we also have a buildroot patch that exhibits the problem
in a much simplified way that what the original bug report came about.
I don't think its appropriate content for the list, but I can
share if anyone is curious about it.

> Thanks.

Regards

-- Pantelis


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-07 15:41     ` [PATCH 1/1] " Pantelis Antoniou
@ 2025-05-08 15:03       ` David Hildenbrand
  2025-05-08 15:23         ` Pantelis Antoniou
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-05-08 15:03 UTC (permalink / raw)
  To: Pantelis Antoniou, Andrew Morton, linux-mm
  Cc: linux-kernel, Artem Krupotkin, Charles Briere, Wade Farnsworth, Peter Xu

On 07.05.25 17:41, Pantelis Antoniou wrote:

Hi,

> Recent updates to net filesystems enabled zero copy operations,
> which require getting a user space page pinned.
> 
> This does not work for pages that were allocated via __get_user_pages
> and then mapped to user-space via remap_pfn_rage.

Right. Because the struct page of a VM_PFNMAP *must not be touched*. It 
has to be treated like it doesn't exist.

> 
> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
> VM_PFNMAP in particular mark the pages as not having struct_page
> associated with them, which is not the case for __get_user_pages()
> 
> This in turn makes any attempt to lock a page fail, and breaking
> I/O from that address range.
> 
> This patch address it by special casing pages in those VMAs and not
> calling vm_normal_page() for them.
> 
> Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> ---
>   mm/gup.c | 22 ++++++++++++++++++----
>   1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 84461d384ae2..e185c18c0c81 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>   	return !userfaultfd_pte_wp(vma, pte);
>   }
>   
> +static struct page *gup_normal_page(struct vm_area_struct *vma,
> +		unsigned long address, pte_t pte)
> +{
> +	unsigned long pfn;
> +
> +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> +		pfn = pte_pfn(pte);
> +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
> +			return NULL;
> +		return pfn_to_page(pfn);
> +	}
> +	return vm_normal_page(vma, address, pte);

I enjoy seeing vm_normal_page() checks in GUP code.

I don't enjoy seeing what you added before that :)

If vm_normal_page() tells you "this is not a normal", then we should not 
touch it. There is one exception: the shared zeropage.


So, unfortunately, this is wrong.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-08 15:03       ` David Hildenbrand
@ 2025-05-08 15:23         ` Pantelis Antoniou
  2025-05-08 15:37           ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 15:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, linux-kernel, Artem Krupotkin,
	Charles Briere, Wade Farnsworth, Peter Xu

On Thu, 8 May 2025 17:03:46 +0200
David Hildenbrand <david@redhat.com> wrote:

Hi there,

> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, > Recent updates
> to net filesystems enabled zero copy operations, > which require
> getting a user space page pinned. > > This does not work for pages
> that were allocated via __get_user_pages 
> On 07.05.25 17:41, Pantelis Antoniou wrote:
> 
> Hi,
> 
> > Recent updates to net filesystems enabled zero copy operations,
> > which require getting a user space page pinned.
> > 
> > This does not work for pages that were allocated via
> > __get_user_pages and then mapped to user-space via remap_pfn_rage.
> 
> Right. Because the struct page of a VM_PFNMAP *must not be touched*.
> It has to be treated like it doesn't exist.
> 

Well, that's not exactly the case. For pages mapped to user space via
remap_pfn_range() the VM_PFNMAP bit is set even though the pages do
have a struct page.

The details of how it happens are at the cover page of this patch but
let me paste the relevant bits here.

"In our emulation environment we have noticed failing writes when
performing I/O from a userspace mapped DRM GEM buffer object.
The platform does not use VRAM, all graphics memory is regular DRAM
memory, allocated via __get_free_pages

The same write was successful from a heap allocated bounce buffer.

The sequence of events is as follows.

1. A BO (Buffer Object) is created, and it's backing memory is allocated via
   __get_user_pages()

2. Userspace mmaps a BO (Buffer Object) via a mmap call on the opened
   file handle of a DRM driver. The mapping is done via the
   drm_gem_mmap_obj() call.

3. Userspace issues a write to a file copying the contents of the BO.

3a. If the file is located on regular filesystem (like ext4), the write
    completes successfully.

3b. If the file is located on a network filesystem, like 9p the write fails.

The write fails because v9fs_file_write_iter() will call
netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked() which will 
call netfs_extract_user_iter() 

netfs_extract_user_iter() will in turn call iov_iter_extract_pages() which for
a user backed iterator will call iov_iter_extract_user_pages which will call
pin_user_pages_fast() which finally will call __gup_longterm_locked().

__gup_longterm_locked() will call __get_user_pages_locked() which will fail
because the VMA is marked with the VM_IO and VM_PFNMAP flags."

> > 
> > remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
> > VM_PFNMAP in particular mark the pages as not having struct_page
> > associated with them, which is not the case for __get_user_pages()
> > 
> > This in turn makes any attempt to lock a page fail, and breaking
> > I/O from that address range.
> > 
> > This patch address it by special casing pages in those VMAs and not
> > calling vm_normal_page() for them.
> > 
> > Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> > ---
> >   mm/gup.c | 22 ++++++++++++++++++----
> >   1 file changed, 18 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 84461d384ae2..e185c18c0c81 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t
> > pte, struct page *page, return !userfaultfd_pte_wp(vma, pte);
> >   }
> >   
> > +static struct page *gup_normal_page(struct vm_area_struct *vma,
> > +		unsigned long address, pte_t pte)
> > +{
> > +	unsigned long pfn;
> > +
> > +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> > +		pfn = pte_pfn(pte);
> > +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn >
> > highest_memmap_pfn)
> > +			return NULL;
> > +		return pfn_to_page(pfn);
> > +	}
> > +	return vm_normal_page(vma, address, pte);
> 
> I enjoy seeing vm_normal_page() checks in GUP code.
> 
> I don't enjoy seeing what you added before that :)
> 
> If vm_normal_page() tells you "this is not a normal", then we should
> not touch it. There is one exception: the shared zeropage.
> 
> 
> So, unfortunately, this is wrong.
> 

Well, lets talk about a proper fix then for the previously mentioned
user-space regression.

Regards

-- Pantelis


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-08 15:23         ` Pantelis Antoniou
@ 2025-05-08 15:37           ` David Hildenbrand
  2025-05-08 15:47             ` Pantelis Antoniou
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-05-08 15:37 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Andrew Morton, linux-mm, linux-kernel, Artem Krupotkin,
	Charles Briere, Wade Farnsworth, Peter Xu

On 08.05.25 17:23, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 17:03:46 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
> Hi there,
> 
>> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, > Recent updates
>> to net filesystems enabled zero copy operations, > which require
>> getting a user space page pinned. > > This does not work for pages
>> that were allocated via __get_user_pages
>> On 07.05.25 17:41, Pantelis Antoniou wrote:
>>
>> Hi,
>>
>>> Recent updates to net filesystems enabled zero copy operations,
>>> which require getting a user space page pinned.
>>>
>>> This does not work for pages that were allocated via
>>> __get_user_pages and then mapped to user-space via remap_pfn_rage.
>>
>> Right. Because the struct page of a VM_PFNMAP *must not be touched*.
>> It has to be treated like it doesn't exist.
>>
> 
> Well, that's not exactly the case. For pages mapped to user space via
> remap_pfn_range() the VM_PFNMAP bit is set even though the pages do
> have a struct page.

Yes. And VM_PFNMAP is the big flag that these pages shall not be 
touched. Even if it exists. Even if you say please. :)

See the comment above vm_normal_page():

	"Special" mappings do not wish to be associated with a "struct
	page" (either it doesn't exist, or it exists but they don't want
	to touch it)

VM_MIXEDMAP could maybe be used for that purpose: possibly GUP also has 
to be updated to make use of that. (I was hoping we can get rid of 
VM_MIXEDMAP at some point)


> 
> The details of how it happens are at the cover page of this patch but
> let me paste the relevant bits here.
> 
> "In our emulation environment we have noticed failing writes when
> performing I/O from a userspace mapped DRM GEM buffer object.
> The platform does not use VRAM, all graphics memory is regular DRAM
> memory, allocated via __get_free_pages
> 
> The same write was successful from a heap allocated bounce buffer.
> 
> The sequence of events is as follows.
> 
> 1. A BO (Buffer Object) is created, and it's backing memory is allocated via
>     __get_user_pages()

__get_user_pages() only allocates memory via page faults. Are you sure 
you meant __get_user_pages() here?

> 
> 2. Userspace mmaps a BO (Buffer Object) via a mmap call on the opened
>     file handle of a DRM driver. The mapping is done via the
>     drm_gem_mmap_obj() call.
> 
> 3. Userspace issues a write to a file copying the contents of the BO.
> 
> 3a. If the file is located on regular filesystem (like ext4), the write
>      completes successfully.
> 
> 3b. If the file is located on a network filesystem, like 9p the write fails.
> 
> The write fails because v9fs_file_write_iter() will call
> netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked() which will
> call netfs_extract_user_iter()
> 
> netfs_extract_user_iter() will in turn call iov_iter_extract_pages() which for
> a user backed iterator will call iov_iter_extract_user_pages which will call
> pin_user_pages_fast() which finally will call __gup_longterm_locked().
> 
> __gup_longterm_locked() will call __get_user_pages_locked() which will fail
> because the VMA is marked with the VM_IO and VM_PFNMAP flags."

So, drm_gem_mmap_obj() ends up using remap_pfn_rage()?

I can spot that drm_gem_mmap_obj() has a path where it explicitly sets

	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND |
		     VM_DONTDUMP);

Which is a clear sign to core-MM (incl. GUP) to never mess with the 
mapped pages.

> 
>>>
>>> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
>>> VM_PFNMAP in particular mark the pages as not having struct_page
>>> associated with them, which is not the case for __get_user_pages()
>>>
>>> This in turn makes any attempt to lock a page fail, and breaking
>>> I/O from that address range.
>>>
>>> This patch address it by special casing pages in those VMAs and not
>>> calling vm_normal_page() for them.
>>>
>>> Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
>>> ---
>>>    mm/gup.c | 22 ++++++++++++++++++----
>>>    1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 84461d384ae2..e185c18c0c81 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t
>>> pte, struct page *page, return !userfaultfd_pte_wp(vma, pte);
>>>    }
>>>    
>>> +static struct page *gup_normal_page(struct vm_area_struct *vma,
>>> +		unsigned long address, pte_t pte)
>>> +{
>>> +	unsigned long pfn;
>>> +
>>> +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
>>> +		pfn = pte_pfn(pte);
>>> +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn >
>>> highest_memmap_pfn)
>>> +			return NULL;
>>> +		return pfn_to_page(pfn);
>>> +	}
>>> +	return vm_normal_page(vma, address, pte);
>>
>> I enjoy seeing vm_normal_page() checks in GUP code.
>>
>> I don't enjoy seeing what you added before that :)
>>
>> If vm_normal_page() tells you "this is not a normal", then we should
>> not touch it. There is one exception: the shared zeropage.
>>
>>
>> So, unfortunately, this is wrong.
>>
> 
> Well, lets talk about a proper fix then for the previously mentioned
> user-space regression.

You really have to find out the responsible commit. GUP has been 
behaving like that forever I'm afraid.

And even the VM_PFNMAP was in drm_gem_mmap_obj() already at least in 
2012 if I am not wrong.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
  2025-05-08 15:37           ` David Hildenbrand
@ 2025-05-08 15:47             ` Pantelis Antoniou
  0 siblings, 0 replies; 9+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 15:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, linux-kernel, Artem Krupotkin,
	Charles Briere, Wade Farnsworth, Peter Xu

On Thu, 8 May 2025 17:37:04 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 08. 05. 25 17: 23, Pantelis Antoniou wrote: > On Thu, 8 May 2025
> 17: 03: 46 +0200 > David Hildenbrand <david@ redhat. com> wrote: > >
> Hi there, > >> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, >
> Recent updates 
> On 08.05.25 17:23, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 17:03:46 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > Hi there,
> > 
> >> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, > Recent updates
> >> to net filesystems enabled zero copy operations, > which require
> >> getting a user space page pinned. > > This does not work for pages
> >> that were allocated via __get_user_pages
> >> On 07.05.25 17:41, Pantelis Antoniou wrote:
> >>
> >> Hi,
> >>
> >>> Recent updates to net filesystems enabled zero copy operations,
> >>> which require getting a user space page pinned.
> >>>
> >>> This does not work for pages that were allocated via
> >>> __get_user_pages and then mapped to user-space via remap_pfn_rage.
> >>
> >> Right. Because the struct page of a VM_PFNMAP *must not be
> >> touched*. It has to be treated like it doesn't exist.
> >>
> > 
> > Well, that's not exactly the case. For pages mapped to user space
> > via remap_pfn_range() the VM_PFNMAP bit is set even though the
> > pages do have a struct page.
> 
> Yes. And VM_PFNMAP is the big flag that these pages shall not be 
> touched. Even if it exists. Even if you say please. :)
> 
> See the comment above vm_normal_page():
> 
> 	"Special" mappings do not wish to be associated with a "struct
> 	page" (either it doesn't exist, or it exists but they don't
> want to touch it)
> 
> VM_MIXEDMAP could maybe be used for that purpose: possibly GUP also
> has to be updated to make use of that. (I was hoping we can get rid
> of VM_MIXEDMAP at some point)
> 
> 
> > 
> > The details of how it happens are at the cover page of this patch
> > but let me paste the relevant bits here.
> > 
> > "In our emulation environment we have noticed failing writes when
> > performing I/O from a userspace mapped DRM GEM buffer object.
> > The platform does not use VRAM, all graphics memory is regular DRAM
> > memory, allocated via __get_free_pages
> > 
> > The same write was successful from a heap allocated bounce buffer.
> > 
> > The sequence of events is as follows.
> > 
> > 1. A BO (Buffer Object) is created, and it's backing memory is
> > allocated via __get_user_pages()
> 
> __get_user_pages() only allocates memory via page faults. Are you
> sure you meant __get_user_pages() here?
> 

Oops, yeah, __get_free_pages(). Apologies for the confusion.

> > 
> > 2. Userspace mmaps a BO (Buffer Object) via a mmap call on the
> > opened file handle of a DRM driver. The mapping is done via the
> >     drm_gem_mmap_obj() call.
> > 
> > 3. Userspace issues a write to a file copying the contents of the
> > BO.
> > 
> > 3a. If the file is located on regular filesystem (like ext4), the
> > write completes successfully.
> > 
> > 3b. If the file is located on a network filesystem, like 9p the
> > write fails.
> > 
> > The write fails because v9fs_file_write_iter() will call
> > netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked()
> > which will call netfs_extract_user_iter()
> > 
> > netfs_extract_user_iter() will in turn call
> > iov_iter_extract_pages() which for a user backed iterator will call
> > iov_iter_extract_user_pages which will call pin_user_pages_fast()
> > which finally will call __gup_longterm_locked().
> > 
> > __gup_longterm_locked() will call __get_user_pages_locked() which
> > will fail because the VMA is marked with the VM_IO and VM_PFNMAP
> > flags."
> 
> So, drm_gem_mmap_obj() ends up using remap_pfn_rage()?
> 

Yes.

> I can spot that drm_gem_mmap_obj() has a path where it explicitly sets
> 
> 	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND |
> 		     VM_DONTDUMP);
> 
> Which is a clear sign to core-MM (incl. GUP) to never mess with the 
> mapped pages.
> 

Well, let just say this not quite right for pages that are normally
allocated via __get_free_pages().

DRM has to handle both VRAM and regular system memory maps so maybe it's
playing it safe here.

> > 
> >>>
> >>> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma
> >>> bits. VM_PFNMAP in particular mark the pages as not having
> >>> struct_page associated with them, which is not the case for
> >>> __get_user_pages()
> >>>
> >>> This in turn makes any attempt to lock a page fail, and breaking
> >>> I/O from that address range.
> >>>
> >>> This patch address it by special casing pages in those VMAs and
> >>> not calling vm_normal_page() for them.
> >>>
> >>> Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> >>> ---
> >>>    mm/gup.c | 22 ++++++++++++++++++----
> >>>    1 file changed, 18 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 84461d384ae2..e185c18c0c81 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t
> >>> pte, struct page *page, return !userfaultfd_pte_wp(vma, pte);
> >>>    }
> >>>    
> >>> +static struct page *gup_normal_page(struct vm_area_struct *vma,
> >>> +		unsigned long address, pte_t pte)
> >>> +{
> >>> +	unsigned long pfn;
> >>> +
> >>> +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> >>> +		pfn = pte_pfn(pte);
> >>> +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn >
> >>> highest_memmap_pfn)
> >>> +			return NULL;
> >>> +		return pfn_to_page(pfn);
> >>> +	}
> >>> +	return vm_normal_page(vma, address, pte);
> >>
> >> I enjoy seeing vm_normal_page() checks in GUP code.
> >>
> >> I don't enjoy seeing what you added before that :)
> >>
> >> If vm_normal_page() tells you "this is not a normal", then we
> >> should not touch it. There is one exception: the shared zeropage.
> >>
> >>
> >> So, unfortunately, this is wrong.
> >>
> > 
> > Well, lets talk about a proper fix then for the previously mentioned
> > user-space regression.
> 
> You really have to find out the responsible commit. GUP has been 
> behaving like that forever I'm afraid.
> 
> And even the VM_PFNMAP was in drm_gem_mmap_obj() already at least in 
> 2012 if I am not wrong.
> 

There is no single responsible commit, it was broken forever.
It is just that no-one has ever tried to pin the pages to perform I/O
before now.

Regards

-- Pantelis




^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-05-08 15:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20250507154119uscas1p2a4055d14ab111fdb94a6378789c38d9d@uscas1p2.samsung.com>
2025-05-07 15:41 ` [PATCH 0/1] Fix zero copy I/O on __get_user_pages allocated pages Pantelis Antoniou
     [not found]   ` <CGME20250507154119uscas1p17799fe7589e4f1bd53d2d3dc7f44cb8c@uscas1p1.samsung.com>
2025-05-07 15:41     ` [PATCH 1/1] " Pantelis Antoniou
2025-05-08 15:03       ` David Hildenbrand
2025-05-08 15:23         ` Pantelis Antoniou
2025-05-08 15:37           ` David Hildenbrand
2025-05-08 15:47             ` Pantelis Antoniou
2025-05-07 21:50   ` [PATCH 0/1] " Andrew Morton
2025-05-08  8:24     ` Pantelis Antoniou
2025-05-07 21:54   ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox