[PATCH] perf: map pages in advance

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] perf: map pages in advance
@ 2024-11-28 11:37 Lorenzo Stoakes
  2024-11-28 13:08 ` David Hildenbrand
  2024-11-28 20:47 ` Lorenzo Stoakes
  0 siblings, 2 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	linux-mm, Matthew Wilcox

We are current refactoring struct page to make it smaller, removing
unneeded fields that correctly belong to struct folio.

Two of those fields are page->index and page->mapping. Perf is currently
making use of both of these, so this patch removes this usage as it turns
out it is unnecessary.

Perf establishes its own internally controlled memory-mapped pages using
vm_ops hooks. The first page in the mapping is the read/write user control
page, and the rest of the mapping consists of read-only pages.

The VMA is backed by kernel memory either from the buddy allocator or
vmalloc depending on configuration. It is intended to be mapped read/write,
but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
that it should be mapped read-only.

When a write fault occurs, the provided page_mkwrite() hook,
perf_mmap_fault() (doing double duty handing faults as well) uses the
vmf->pgoff field to determine if this is the first page, allowing for the
desired read/write first page, read-only rest mapping.

For this to work the implementation has to carefully work around faulting
logic. When a page is write-faulted, the fault() hook is called first, then
its page_mkwrite() hook is called (to allow for dirty tracking in file
systems).

On fault we set the folio's mapping in perf_mmap_fault(), this is because
when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
as an indicator that the fault should be retried.

We also set the folio's index so, given the folio is being treated as faux
user memory, it correctly references its offset within the VMA.

This explains why the mapping and index fields are used - but it's not
necessary.

We preallocate pages when perf_mmap() is called for the first time via
rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
needed if the mapping requires it.

This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
and so we can instead simply map all the memory right away here - there's
no point in handling (read) page faults when we don't demand page nor need
to be notified about them (perf does not).

This patch therefore changes this logic to map everything when the mmap()
hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
to provide the required read/write vs. read-only behaviour, which does not
require the previously implemented workarounds.

It makes sense semantically to establish a PFN map too - we are managing
the pages internally and so it is appropriate to mark this as a special
mapping.

There should be no change to actual functionality as a result of this
change.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/events/core.c        | 76 +++++++++++++++++++------------------
 kernel/events/ring_buffer.c | 19 +---------
 2 files changed, 40 insertions(+), 55 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5d4a54f50826..0754b070497f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6284,41 +6284,6 @@ void perf_event_update_userpage(struct perf_event *event)
 }
 EXPORT_SYMBOL_GPL(perf_event_update_userpage);

-static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
-{
-	struct perf_event *event = vmf->vma->vm_file->private_data;
-	struct perf_buffer *rb;
-	vm_fault_t ret = VM_FAULT_SIGBUS;
-
-	if (vmf->flags & FAULT_FLAG_MKWRITE) {
-		if (vmf->pgoff == 0)
-			ret = 0;
-		return ret;
-	}
-
-	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
-	if (!rb)
-		goto unlock;
-
-	if (vmf->pgoff && (vmf->flags & FAULT_FLAG_WRITE))
-		goto unlock;
-
-	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
-	if (!vmf->page)
-		goto unlock;
-
-	get_page(vmf->page);
-	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
-	vmf->page->index   = vmf->pgoff;
-
-	ret = 0;
-unlock:
-	rcu_read_unlock();
-
-	return ret;
-}
-
 static void ring_buffer_attach(struct perf_event *event,
 			       struct perf_buffer *rb)
 {
@@ -6558,13 +6523,47 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	ring_buffer_put(rb); /* could be last */
 }

+static vm_fault_t perf_mmap_pfn_mkwrite(struct vm_fault *vmf)
+{
+	/* The first page is the user control page, others are read-only. */
+	return vmf->pgoff == 0 ? 0 : VM_FAULT_SIGBUS;
+}
+
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
 	.close		= perf_mmap_close, /* non mergeable */
-	.fault		= perf_mmap_fault,
-	.page_mkwrite	= perf_mmap_fault,
+	.pfn_mkwrite	= perf_mmap_pfn_mkwrite,
 };

+static int map_range(struct perf_buffer *rb, struct vm_area_struct *vma)
+{
+	unsigned long nr_pages = vma_pages(vma);
+	int err = 0;
+	unsigned long pgoff;
+
+	for (pgoff = 0; pgoff < nr_pages; pgoff++) {
+		unsigned long va = vma->vm_start + PAGE_SIZE * pgoff;
+		struct page *page = perf_mmap_to_page(rb, pgoff);
+
+		if (page == NULL) {
+			err = -EINVAL;
+			break;
+		}
+
+		/* Map readonly, perf_mmap_pfn_mkwrite() called on write fault. */
+		err = remap_pfn_range(vma, va, page_to_pfn(page), PAGE_SIZE,
+				      vm_get_page_prot(vma->vm_flags & ~VM_SHARED));
+		if (err)
+			break;
+	}
+
+	/* Clear any partial mappings on error. */
+	if (err)
+		zap_page_range_single(vma, vma->vm_start, nr_pages * PAGE_SIZE, NULL);
+
+	return err;
+}
+
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
@@ -6783,6 +6782,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &perf_mmap_vmops;

+	if (!ret)
+		ret = map_range(rb, vma);
+
 	if (event->pmu->event_mapped)
 		event->pmu->event_mapped(event, vma->vm_mm);

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4f46f688d0d4..180509132d4b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -643,7 +643,6 @@ static void rb_free_aux_page(struct perf_buffer *rb, int idx)
 	struct page *page = virt_to_page(rb->aux_pages[idx]);

 	ClearPagePrivate(page);
-	page->mapping = NULL;
 	__free_page(page);
 }

@@ -819,7 +818,6 @@ static void perf_mmap_free_page(void *addr)
 {
 	struct page *page = virt_to_page(addr);

-	page->mapping = NULL;
 	__free_page(page);
 }

@@ -890,28 +888,13 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
 	return vmalloc_to_page((void *)rb->user_page + pgoff * PAGE_SIZE);
 }

-static void perf_mmap_unmark_page(void *addr)
-{
-	struct page *page = vmalloc_to_page(addr);
-
-	page->mapping = NULL;
-}
-
 static void rb_free_work(struct work_struct *work)
 {
 	struct perf_buffer *rb;
-	void *base;
-	int i, nr;

 	rb = container_of(work, struct perf_buffer, work);
-	nr = data_page_nr(rb);
-
-	base = rb->user_page;
-	/* The '<=' counts in the user page. */
-	for (i = 0; i <= nr; i++)
-		perf_mmap_unmark_page(base + (i * PAGE_SIZE));

-	vfree(base);
+	vfree(rb->user_page);
 	kfree(rb);
 }

-- 
2.47.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 11:37 [PATCH] perf: map pages in advance Lorenzo Stoakes
@ 2024-11-28 13:08 ` David Hildenbrand
  2024-11-28 13:20   ` Lorenzo Stoakes
  2024-11-28 20:47 ` Lorenzo Stoakes
  1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-28 13:08 UTC (permalink / raw)
  To: Lorenzo Stoakes, Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	linux-mm, Matthew Wilcox

On 28.11.24 12:37, Lorenzo Stoakes wrote:
> We are current refactoring struct page to make it smaller, removing
> unneeded fields that correctly belong to struct folio.
> 
> Two of those fields are page->index and page->mapping. Perf is currently
> making use of both of these, so this patch removes this usage as it turns
> out it is unnecessary.
> 
> Perf establishes its own internally controlled memory-mapped pages using
> vm_ops hooks. The first page in the mapping is the read/write user control
> page, and the rest of the mapping consists of read-only pages.
> 
> The VMA is backed by kernel memory either from the buddy allocator or
> vmalloc depending on configuration. It is intended to be mapped read/write,
> but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
> that it should be mapped read-only.
> 
> When a write fault occurs, the provided page_mkwrite() hook,
> perf_mmap_fault() (doing double duty handing faults as well) uses the
> vmf->pgoff field to determine if this is the first page, allowing for the
> desired read/write first page, read-only rest mapping.
> 
> For this to work the implementation has to carefully work around faulting
> logic. When a page is write-faulted, the fault() hook is called first, then
> its page_mkwrite() hook is called (to allow for dirty tracking in file
> systems).
> 
> On fault we set the folio's mapping in perf_mmap_fault(), this is because
> when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
> as an indicator that the fault should be retried.
> 
> We also set the folio's index so, given the folio is being treated as faux
> user memory, it correctly references its offset within the VMA.
> 
> This explains why the mapping and index fields are used - but it's not
> necessary.
> 
> We preallocate pages when perf_mmap() is called for the first time via
> rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
> needed if the mapping requires it.
> 
> This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
> and so we can instead simply map all the memory right away here - there's
> no point in handling (read) page faults when we don't demand page nor need
> to be notified about them (perf does not).
> 
> This patch therefore changes this logic to map everything when the mmap()
> hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
> to provide the required read/write vs. read-only behaviour, which does not
> require the previously implemented workarounds.
> 
> It makes sense semantically to establish a PFN map too - we are managing
> the pages internally and so it is appropriate to mark this as a special
> mapping.

It's rather sad seeing more PFNMAP users where PFNMAP is not really 
required (-> this is struct page backed).

Especially having to perform several independent remap_pfn_range() calls 
rather looks like yet another workaround ...

Would we be able to achieve something comparable with vm_insert_pages(), 
to just map them in advance?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 13:08 ` David Hildenbrand
@ 2024-11-28 13:20   ` Lorenzo Stoakes
  2024-11-28 13:37     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 13:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
> On 28.11.24 12:37, Lorenzo Stoakes wrote:
> > We are current refactoring struct page to make it smaller, removing
> > unneeded fields that correctly belong to struct folio.
> >
> > Two of those fields are page->index and page->mapping. Perf is currently
> > making use of both of these, so this patch removes this usage as it turns
> > out it is unnecessary.
> >
> > Perf establishes its own internally controlled memory-mapped pages using
> > vm_ops hooks. The first page in the mapping is the read/write user control
> > page, and the rest of the mapping consists of read-only pages.
> >
> > The VMA is backed by kernel memory either from the buddy allocator or
> > vmalloc depending on configuration. It is intended to be mapped read/write,
> > but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
> > that it should be mapped read-only.
> >
> > When a write fault occurs, the provided page_mkwrite() hook,
> > perf_mmap_fault() (doing double duty handing faults as well) uses the
> > vmf->pgoff field to determine if this is the first page, allowing for the
> > desired read/write first page, read-only rest mapping.
> >
> > For this to work the implementation has to carefully work around faulting
> > logic. When a page is write-faulted, the fault() hook is called first, then
> > its page_mkwrite() hook is called (to allow for dirty tracking in file
> > systems).
> >
> > On fault we set the folio's mapping in perf_mmap_fault(), this is because
> > when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
> > as an indicator that the fault should be retried.
> >
> > We also set the folio's index so, given the folio is being treated as faux
> > user memory, it correctly references its offset within the VMA.
> >
> > This explains why the mapping and index fields are used - but it's not
> > necessary.
> >
> > We preallocate pages when perf_mmap() is called for the first time via
> > rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
> > needed if the mapping requires it.
> >
> > This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
> > and so we can instead simply map all the memory right away here - there's
> > no point in handling (read) page faults when we don't demand page nor need
> > to be notified about them (perf does not).
> >
> > This patch therefore changes this logic to map everything when the mmap()
> > hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
> > to provide the required read/write vs. read-only behaviour, which does not
> > require the previously implemented workarounds.
> >
> > It makes sense semantically to establish a PFN map too - we are managing
> > the pages internally and so it is appropriate to mark this as a special
> > mapping.
>
> It's rather sad seeing more PFNMAP users where PFNMAP is not really required
> (-> this is struct page backed).
>
> Especially having to perform several independent remap_pfn_range() calls
> rather looks like yet another workaround ...
>
> Would we be able to achieve something comparable with vm_insert_pages(), to
> just map them in advance?

Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
friends all refer vma->vm_page_prot which is not yet _correctly_ established at
the point of the f_op->mmap() hook being invoked :)

We set the field in __mmap_new_vma(), _but_ importantly, we defer the
writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
read/write mappings, which is emphatically not what we want - we want them
read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
first page read/write and the rest read-only.

It's this requirement that means this is really the only way to do this as far
as I can tell.

It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
mapping, as the pages reference kernel-allocated memory and are managed by perf,
not put on any LRU, etc.

It sucks to have to loop like this and it feels like a workaround, which makes
me wonder if we need a new interface to better allow this stuff on mmap...

In any case I think this is the most sensible solution currently available that
avoids the pre-existing situation of pretending the pages are folios but
somewhat abusing the interface to allow page_mkwrite() to work correctly by
setting page->index, mapping.

The alternative to this would be to folio-fy, but these are emphatically _not_
folios, that is userland memory managed as userland memory, it's a mapping onto
kernel memory exposed to userspace.

It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
to expose an interface that doesn't assume the VMA is already fully set
up... but I think one for a future series perhaps.

>
> --
> Cheers,
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 13:20   ` Lorenzo Stoakes
@ 2024-11-28 13:37     ` David Hildenbrand
  2024-11-28 14:23       ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-28 13:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 28.11.24 14:20, Lorenzo Stoakes wrote:
> On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
>> On 28.11.24 12:37, Lorenzo Stoakes wrote:
>>> We are current refactoring struct page to make it smaller, removing
>>> unneeded fields that correctly belong to struct folio.
>>>
>>> Two of those fields are page->index and page->mapping. Perf is currently
>>> making use of both of these, so this patch removes this usage as it turns
>>> out it is unnecessary.
>>>
>>> Perf establishes its own internally controlled memory-mapped pages using
>>> vm_ops hooks. The first page in the mapping is the read/write user control
>>> page, and the rest of the mapping consists of read-only pages.
>>>
>>> The VMA is backed by kernel memory either from the buddy allocator or
>>> vmalloc depending on configuration. It is intended to be mapped read/write,
>>> but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
>>> that it should be mapped read-only.
>>>
>>> When a write fault occurs, the provided page_mkwrite() hook,
>>> perf_mmap_fault() (doing double duty handing faults as well) uses the
>>> vmf->pgoff field to determine if this is the first page, allowing for the
>>> desired read/write first page, read-only rest mapping.
>>>
>>> For this to work the implementation has to carefully work around faulting
>>> logic. When a page is write-faulted, the fault() hook is called first, then
>>> its page_mkwrite() hook is called (to allow for dirty tracking in file
>>> systems).
>>>
>>> On fault we set the folio's mapping in perf_mmap_fault(), this is because
>>> when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
>>> as an indicator that the fault should be retried.
>>>
>>> We also set the folio's index so, given the folio is being treated as faux
>>> user memory, it correctly references its offset within the VMA.
>>>
>>> This explains why the mapping and index fields are used - but it's not
>>> necessary.
>>>
>>> We preallocate pages when perf_mmap() is called for the first time via
>>> rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
>>> needed if the mapping requires it.
>>>
>>> This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
>>> and so we can instead simply map all the memory right away here - there's
>>> no point in handling (read) page faults when we don't demand page nor need
>>> to be notified about them (perf does not).
>>>
>>> This patch therefore changes this logic to map everything when the mmap()
>>> hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
>>> to provide the required read/write vs. read-only behaviour, which does not
>>> require the previously implemented workarounds.
>>>
>>> It makes sense semantically to establish a PFN map too - we are managing
>>> the pages internally and so it is appropriate to mark this as a special
>>> mapping.
>>
>> It's rather sad seeing more PFNMAP users where PFNMAP is not really required
>> (-> this is struct page backed).
>>
>> Especially having to perform several independent remap_pfn_range() calls
>> rather looks like yet another workaround ...
>>
>> Would we be able to achieve something comparable with vm_insert_pages(), to
>> just map them in advance?
> 
> Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
> friends all refer vma->vm_page_prot which is not yet _correctly_ established at
> the point of the f_op->mmap() hook being invoked :)

So all you want is a vm_insert_pages() variant where we can pass in the 
vm_page_prot?

Or a way detect internally that it is not setup yet and fallback to 
vm_get_page_prot(vma->vm_flags & ~VM_SHARED)?

Or a way to just remove write permissions?

> 
> We set the field in __mmap_new_vma(), _but_ importantly, we defer the
> writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
> were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
> read/write mappings, which is emphatically not what we want - we want them
> read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
> first page read/write and the rest read-only.
> 
> It's this requirement that means this is really the only way to do this as far
> as I can tell.
> 
> It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
> mapping, as the pages reference kernel-allocated memory and are managed by perf,
> not put on any LRU, etc.
> 
> It sucks to have to loop like this and it feels like a workaround, which makes
> me wonder if we need a new interface to better allow this stuff on mmap...
> 
> In any case I think this is the most sensible solution currently available that
> avoids the pre-existing situation of pretending the pages are folios but
> somewhat abusing the interface to allow page_mkwrite() to work correctly by
> setting page->index, mapping.

Yes, that page->index stuff is nasty.

> 
> The alternative to this would be to folio-fy, but these are emphatically _not_
> folios, that is userland memory managed as userland memory, it's a mapping onto
> kernel memory exposed to userspace.

Yes, we should even move away from folios completely in the future for 
vm_insert_page().

> 
> It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
> to expose an interface that doesn't assume the VMA is already fully set
> up... but I think one for a future series perhaps.

If the solution to your problem is as easy as making vm_insert_pages() 
pass something else than vma->vm_page_prot to insert_pages(), then I 
think we should go for that. Like ... vm_insert_pages_prot().

Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). 
But yes, in an ideal world we'd avoid having temporarily messed up 
vma->vm_page_prot. So we'd then document clearly how 
vm_insert_pages_prot() may be used.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 13:37     ` David Hildenbrand
@ 2024-11-28 14:23       ` Lorenzo Stoakes
  2024-11-29 12:12         ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 14:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Thu, Nov 28, 2024 at 02:37:17PM +0100, David Hildenbrand wrote:
> On 28.11.24 14:20, Lorenzo Stoakes wrote:
> > On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
> > > On 28.11.24 12:37, Lorenzo Stoakes wrote:
[snip]
> > > > It makes sense semantically to establish a PFN map too - we are managing
> > > > the pages internally and so it is appropriate to mark this as a special
> > > > mapping.
> > >
> > > It's rather sad seeing more PFNMAP users where PFNMAP is not really required
> > > (-> this is struct page backed).
> > >
> > > Especially having to perform several independent remap_pfn_range() calls
> > > rather looks like yet another workaround ...
> > >
> > > Would we be able to achieve something comparable with vm_insert_pages(), to
> > > just map them in advance?
> >
> > Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
> > friends all refer vma->vm_page_prot which is not yet _correctly_ established at
> > the point of the f_op->mmap() hook being invoked :)
>
> So all you want is a vm_insert_pages() variant where we can pass in the
> vm_page_prot?

Hmm, looking into the code I don't think VM_MIXEDMAP is correct after all.

We don't want these pages touched at all, we manage them ourselves, and
VM_MIXEDMAP, unless mapping memory mapped I/O pages, will treat them as such.

For instance, vm_insert_page() -> insert_page() -> insert_page_into_pte_locked()
acts as if this is a folio, manipulating the ref count and invoking
folio_add_file_rmap_pte() - which we emphatically do not want.

Since this is a non-CoW mapping (according to vma->vm_flags), VM_PFNMAP even
with !IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) will not touch vma->vm_pgoff
(which we rely on equalling the offset into the range) and all works as it
should.

>
> Or a way detect internally that it is not setup yet and fallback to
> vm_get_page_prot(vma->vm_flags & ~VM_SHARED)?

No, this would be too complicated I think. And we can't know that we will need
to do this necessarily... Better to just expose prot the same way we do for
remap_pfn_range().

>
> Or a way to just remove write permissions?

We can't really do this without bigger changes to the faulting mechanism,
because a shared mapping with VM_WRITE that write-faults but doesn't have a
page_mkwrite() or pfn_mkwrite() hook will end up invoking wp_page_reuse() and be
made r/w again.

>
> >
> > We set the field in __mmap_new_vma(), _but_ importantly, we defer the
> > writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
> > were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
> > read/write mappings, which is emphatically not what we want - we want them
> > read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
> > first page read/write and the rest read-only.
> >
> > It's this requirement that means this is really the only way to do this as far
> > as I can tell.
> >
> > It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
> > mapping, as the pages reference kernel-allocated memory and are managed by perf,
> > not put on any LRU, etc.
> >
> > It sucks to have to loop like this and it feels like a workaround, which makes
> > me wonder if we need a new interface to better allow this stuff on mmap...
> >
> > In any case I think this is the most sensible solution currently available that
> > avoids the pre-existing situation of pretending the pages are folios but
> > somewhat abusing the interface to allow page_mkwrite() to work correctly by
> > setting page->index, mapping.
>
> Yes, that page->index stuff is nasty.

It's the ->mapping that is more of the issue I think, as that _has_ to be set in
the original version, I can't actually see why index _must_ be set, there should
be no case in which rmap is used on the page, so possibly was a mistake, but
both fields are going from struct page so both must be eliminated :)

>
> >
> > The alternative to this would be to folio-fy, but these are emphatically _not_
> > folios, that is userland memory managed as userland memory, it's a mapping onto
> > kernel memory exposed to userspace.
>
> Yes, we should even move away from folios completely in the future for
> vm_insert_page().

Well, isn't VM_MIXEDMAP intended specifically so you can mix normal user pages
that live in the LRU and have an rmap etc. etc. with PFN mappings to I/O mapped
memory? :) so then that's folios + raw PFN's.

>
> >
> > It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
> > to expose an interface that doesn't assume the VMA is already fully set
> > up... but I think one for a future series perhaps.
>
> If the solution to your problem is as easy as making vm_insert_pages() pass
> something else than vma->vm_page_prot to insert_pages(), then I think we
> should go for that. Like ... vm_insert_pages_prot().

Sadly no for reasons above.

>
> Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). But
> yes, in an ideal world we'd avoid having temporarily messed up
> vma->vm_page_prot. So we'd then document clearly how vm_insert_pages_prot()
> may be used.

I think the thing with the delay in setting vma->vm_page_prot properly that is
we have a chicken and egg scenario (oh so often the case in mmap_region()
logic...) in that the mmap hook might change some of these flags which changes
what that function will do...

I was discussing with Liam recently how perhaps we should see how feasible it is
to do away with this hook and replace it with something where drivers specify
which VMA flags they want to set _ahead of time_, since this really is the only
thing they should be changing other than vma->vm_private_data.

Then we could possibly have a hook _only_ for assigning vma->vm_private_data to
allow for any driver-specific init logic and doing mappings, and hey presto we
have made things vastly saner. Could perhaps pass a const struct vm_area_struct
* to make this clear...

But I may be missing some weird corner cases (hey probably am) or being too
optimistic :>)

>
> --
> Cheers,
>
> David / dhildenb
>

I wonder if we need a new interface then for 'pages which we don't want touched
but do have a struct page' that is more expressed by the interface than
remap_pfn_range() expresses.

I mean from the comment around vm_normal_page():

 * "Special" mappings do not wish to be associated with a "struct page" (either
 * it doesn't exist, or it exists but they don't want to touch it). In this
 * case, NULL is returned here. "Normal" mappings do have a struct page.

...

 * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
 * special mapping (even if there are underlying and valid "struct pages").
 * COWed pages of a VM_PFNMAP are always normal.

So there's precedence for us just putting pages we allocate/manage ourselves in
a VM_PFNMAP.

So I guess this interface would be something like:

	int remap_kernel_pages(struct vm_area_struct *vma, unsigned long addr,
			       struct page **pages, unsigned long size,
			       pgprot_t prot);

Certainly this area of the kernel is a bit confusing at any rate...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 11:37 [PATCH] perf: map pages in advance Lorenzo Stoakes
  2024-11-28 13:08 ` David Hildenbrand
@ 2024-11-28 20:47 ` Lorenzo Stoakes
  2024-11-29  8:52   ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	linux-mm, Matthew Wilcox

Peter - not sure whether it's easy for you to make a simple adjustment to this
patch or if you want me to just send a v2, but I have to pop an #ifdef CONFIG_MMU
into the code.

I've indicated inline below where it's needed, but if it's a pain/not
possible for you to do a fixup in your tree I'm happy to just respin, let
me know what works for you.

Sorry about that!

On Thu, Nov 28, 2024 at 11:37:14AM +0000, Lorenzo Stoakes wrote:
> We are current refactoring struct page to make it smaller, removing
> unneeded fields that correctly belong to struct folio.
>
> Two of those fields are page->index and page->mapping. Perf is currently
> making use of both of these, so this patch removes this usage as it turns
> out it is unnecessary.
>
> Perf establishes its own internally controlled memory-mapped pages using
> vm_ops hooks. The first page in the mapping is the read/write user control
> page, and the rest of the mapping consists of read-only pages.
>
> The VMA is backed by kernel memory either from the buddy allocator or
> vmalloc depending on configuration. It is intended to be mapped read/write,
> but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
> that it should be mapped read-only.
>
> When a write fault occurs, the provided page_mkwrite() hook,
> perf_mmap_fault() (doing double duty handing faults as well) uses the
> vmf->pgoff field to determine if this is the first page, allowing for the
> desired read/write first page, read-only rest mapping.
>
> For this to work the implementation has to carefully work around faulting
> logic. When a page is write-faulted, the fault() hook is called first, then
> its page_mkwrite() hook is called (to allow for dirty tracking in file
> systems).
>
> On fault we set the folio's mapping in perf_mmap_fault(), this is because
> when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
> as an indicator that the fault should be retried.
>
> We also set the folio's index so, given the folio is being treated as faux
> user memory, it correctly references its offset within the VMA.
>
> This explains why the mapping and index fields are used - but it's not
> necessary.
>
> We preallocate pages when perf_mmap() is called for the first time via
> rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
> needed if the mapping requires it.
>
> This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
> and so we can instead simply map all the memory right away here - there's
> no point in handling (read) page faults when we don't demand page nor need
> to be notified about them (perf does not).
>
> This patch therefore changes this logic to map everything when the mmap()
> hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
> to provide the required read/write vs. read-only behaviour, which does not
> require the previously implemented workarounds.
>
> It makes sense semantically to establish a PFN map too - we are managing
> the pages internally and so it is appropriate to mark this as a special
> mapping.
>
> There should be no change to actual functionality as a result of this
> change.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  kernel/events/core.c        | 76 +++++++++++++++++++------------------
>  kernel/events/ring_buffer.c | 19 +---------
>  2 files changed, 40 insertions(+), 55 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5d4a54f50826..0754b070497f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6284,41 +6284,6 @@ void perf_event_update_userpage(struct perf_event *event)
>  }
>  EXPORT_SYMBOL_GPL(perf_event_update_userpage);
>
> -static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
> -{
> -	struct perf_event *event = vmf->vma->vm_file->private_data;
> -	struct perf_buffer *rb;
> -	vm_fault_t ret = VM_FAULT_SIGBUS;
> -
> -	if (vmf->flags & FAULT_FLAG_MKWRITE) {
> -		if (vmf->pgoff == 0)
> -			ret = 0;
> -		return ret;
> -	}
> -
> -	rcu_read_lock();
> -	rb = rcu_dereference(event->rb);
> -	if (!rb)
> -		goto unlock;
> -
> -	if (vmf->pgoff && (vmf->flags & FAULT_FLAG_WRITE))
> -		goto unlock;
> -
> -	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
> -	if (!vmf->page)
> -		goto unlock;
> -
> -	get_page(vmf->page);
> -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
> -	vmf->page->index   = vmf->pgoff;
> -
> -	ret = 0;
> -unlock:
> -	rcu_read_unlock();
> -
> -	return ret;
> -}
> -
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct perf_buffer *rb)
>  {
> @@ -6558,13 +6523,47 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  	ring_buffer_put(rb); /* could be last */
>  }
>
> +static vm_fault_t perf_mmap_pfn_mkwrite(struct vm_fault *vmf)
> +{
> +	/* The first page is the user control page, others are read-only. */
> +	return vmf->pgoff == 0 ? 0 : VM_FAULT_SIGBUS;
> +}
> +
>  static const struct vm_operations_struct perf_mmap_vmops = {
>  	.open		= perf_mmap_open,
>  	.close		= perf_mmap_close, /* non mergeable */
> -	.fault		= perf_mmap_fault,
> -	.page_mkwrite	= perf_mmap_fault,
> +	.pfn_mkwrite	= perf_mmap_pfn_mkwrite,
>  };
>
> +static int map_range(struct perf_buffer *rb, struct vm_area_struct *vma)
> +{
> +	unsigned long nr_pages = vma_pages(vma);
> +	int err = 0;
> +	unsigned long pgoff;
> +
> +	for (pgoff = 0; pgoff < nr_pages; pgoff++) {
> +		unsigned long va = vma->vm_start + PAGE_SIZE * pgoff;
> +		struct page *page = perf_mmap_to_page(rb, pgoff);
> +
> +		if (page == NULL) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		/* Map readonly, perf_mmap_pfn_mkwrite() called on write fault. */
> +		err = remap_pfn_range(vma, va, page_to_pfn(page), PAGE_SIZE,
> +				      vm_get_page_prot(vma->vm_flags & ~VM_SHARED));
> +		if (err)
> +			break;
> +	}
> +

Need a:

#ifdef CONFIG_MMU
> +	/* Clear any partial mappings on error. */
> +	if (err)
> +		zap_page_range_single(vma, vma->vm_start, nr_pages * PAGE_SIZE, NULL);
#endif

Here to work around the wonders of nommu :)

> +
> +	return err;
> +}
> +
>  static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct perf_event *event = file->private_data;
> @@ -6783,6 +6782,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
>  	vma->vm_ops = &perf_mmap_vmops;
>
> +	if (!ret)
> +		ret = map_range(rb, vma);
> +
>  	if (event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 4f46f688d0d4..180509132d4b 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -643,7 +643,6 @@ static void rb_free_aux_page(struct perf_buffer *rb, int idx)
>  	struct page *page = virt_to_page(rb->aux_pages[idx]);
>
>  	ClearPagePrivate(page);
> -	page->mapping = NULL;
>  	__free_page(page);
>  }
>
> @@ -819,7 +818,6 @@ static void perf_mmap_free_page(void *addr)
>  {
>  	struct page *page = virt_to_page(addr);
>
> -	page->mapping = NULL;
>  	__free_page(page);
>  }
>
> @@ -890,28 +888,13 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>  	return vmalloc_to_page((void *)rb->user_page + pgoff * PAGE_SIZE);
>  }
>
> -static void perf_mmap_unmark_page(void *addr)
> -{
> -	struct page *page = vmalloc_to_page(addr);
> -
> -	page->mapping = NULL;
> -}
> -
>  static void rb_free_work(struct work_struct *work)
>  {
>  	struct perf_buffer *rb;
> -	void *base;
> -	int i, nr;
>
>  	rb = container_of(work, struct perf_buffer, work);
> -	nr = data_page_nr(rb);
> -
> -	base = rb->user_page;
> -	/* The '<=' counts in the user page. */
> -	for (i = 0; i <= nr; i++)
> -		perf_mmap_unmark_page(base + (i * PAGE_SIZE));
>
> -	vfree(base);
> +	vfree(rb->user_page);
>  	kfree(rb);
>  }
>
> --
> 2.47.0
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 20:47 ` Lorenzo Stoakes
@ 2024-11-29  8:52   ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2024-11-29  8:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	linux-mm, Matthew Wilcox

On Thu, Nov 28, 2024 at 08:47:28PM +0000, Lorenzo Stoakes wrote:
> Peter - not sure whether it's easy for you to make a simple adjustment to this
> patch or if you want me to just send a v2, but I have to pop an #ifdef CONFIG_MMU
> into the code.
> 

> > +static int map_range(struct perf_buffer *rb, struct vm_area_struct *vma)
> > +{
> > +	unsigned long nr_pages = vma_pages(vma);
> > +	int err = 0;
> > +	unsigned long pgoff;
> > +
> > +	for (pgoff = 0; pgoff < nr_pages; pgoff++) {
> > +		unsigned long va = vma->vm_start + PAGE_SIZE * pgoff;
> > +		struct page *page = perf_mmap_to_page(rb, pgoff);
> > +
> > +		if (page == NULL) {
> > +			err = -EINVAL;
> > +			break;
> > +		}
> > +
> > +		/* Map readonly, perf_mmap_pfn_mkwrite() called on write fault. */
> > +		err = remap_pfn_range(vma, va, page_to_pfn(page), PAGE_SIZE,
> > +				      vm_get_page_prot(vma->vm_flags & ~VM_SHARED));
> > +		if (err)
> > +			break;
> > +	}
> > +
> 
> Need a:
> 
> #ifdef CONFIG_MMU
> > +	/* Clear any partial mappings on error. */
> > +	if (err)
> > +		zap_page_range_single(vma, vma->vm_start, nr_pages * PAGE_SIZE, NULL);
> #endif
> 
> Here to work around the wonders of nommu :)

All good, I'll edit the thing.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-28 14:23       ` Lorenzo Stoakes
@ 2024-11-29 12:12         ` David Hildenbrand
  2024-11-29 12:26           ` Peter Zijlstra
  2024-11-29 12:48           ` Lorenzo Stoakes
  0 siblings, 2 replies; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 12:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 28.11.24 15:23, Lorenzo Stoakes wrote:
> On Thu, Nov 28, 2024 at 02:37:17PM +0100, David Hildenbrand wrote:
>> On 28.11.24 14:20, Lorenzo Stoakes wrote:
>>> On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
>>>> On 28.11.24 12:37, Lorenzo Stoakes wrote:
> [snip]
>>>>> It makes sense semantically to establish a PFN map too - we are managing
>>>>> the pages internally and so it is appropriate to mark this as a special
>>>>> mapping.
>>>>
>>>> It's rather sad seeing more PFNMAP users where PFNMAP is not really required
>>>> (-> this is struct page backed).
>>>>
>>>> Especially having to perform several independent remap_pfn_range() calls
>>>> rather looks like yet another workaround ...
>>>>
>>>> Would we be able to achieve something comparable with vm_insert_pages(), to
>>>> just map them in advance?
>>>
>>> Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
>>> friends all refer vma->vm_page_prot which is not yet _correctly_ established at
>>> the point of the f_op->mmap() hook being invoked :)
>>
>> So all you want is a vm_insert_pages() variant where we can pass in the
>> vm_page_prot?
> 
> Hmm, looking into the code I don't think VM_MIXEDMAP is correct after all.
> 
> We don't want these pages touched at all, we manage them ourselves, and
> VM_MIXEDMAP, unless mapping memory mapped I/O pages, will treat them as such.
> 
> For instance, vm_insert_page() -> insert_page() -> insert_page_into_pte_locked()
> acts as if this is a folio, manipulating the ref count and invoking
> folio_add_file_rmap_pte() - which we emphatically do not want.

Right, but that should be independent of what you want to achieve in 
this series, or am I wrong?

vm_insert_page()/vm_insert_pages() is our mechanism to install kernel 
allocations into the page tables. (note "kernel allocations", not 
"kernel memory", which might or might not have "struct pages")

There is the bigger question how we could convert all users to either 
(a) not refcount + mapcount (and we discussed a separate memdesc type 
for that) (b) still refcount (similarly, likely separate memdesc).

But that will be a problem to be solved by all similar drives.

Slapping in a remap_pfn_range() + VM_PFNMAP in now in the absence of 
having solved the bigger problem there sounds quite suboptimal to me. 
remap_pfn_range() saw sufficient abuse already, and the way we hacked in 
VM_PAT handling in there really makes it something we don't want to 
reuse as is when trying to come up with a clean way to map kernel 
allocations. I strongly assume that some of the remap_pfn_range() users 
we currently have do actually deal with kernel allocations as well, and 
likely they should all get converted to something better once we have it.


So, isn't this something to just solve independently of what you are 
actually trying to achieve in this series (page->index and page->mapping)?

[...]


>>>
>>> We set the field in __mmap_new_vma(), _but_ importantly, we defer the
>>> writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
>>> were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
>>> read/write mappings, which is emphatically not what we want - we want them
>>> read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
>>> first page read/write and the rest read-only.
>>>
>>> It's this requirement that means this is really the only way to do this as far
>>> as I can tell.
>>>
>>> It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
>>> mapping, as the pages reference kernel-allocated memory and are managed by perf,
>>> not put on any LRU, etc.
>>>
>>> It sucks to have to loop like this and it feels like a workaround, which makes
>>> me wonder if we need a new interface to better allow this stuff on mmap...
>>>
>>> In any case I think this is the most sensible solution currently available that
>>> avoids the pre-existing situation of pretending the pages are folios but
>>> somewhat abusing the interface to allow page_mkwrite() to work correctly by
>>> setting page->index, mapping.
>>
>> Yes, that page->index stuff is nasty.
> 
> It's the ->mapping that is more of the issue I think, as that _has_ to be set in
> the original version, I can't actually see why index _must_ be set, there should
> be no case in which rmap is used on the page, so possibly was a mistake, but
> both fields are going from struct page so both must be eliminated :)

:) Yes.

>>> The alternative to this would be to folio-fy, but these are emphatically _not_
>>> folios, that is userland memory managed as userland memory, it's a mapping onto
>>> kernel memory exposed to userspace.
>>
>> Yes, we should even move away from folios completely in the future for
>> vm_insert_page().
> 
> Well, isn't VM_MIXEDMAP intended specifically so you can mix normal user pages
> that live in the LRU and have an rmap etc. etc. with PFN mappings to I/O mapped
> memory? :) so then that's folios + raw PFN's.

VM_MIXEDMAP was abused over the years for all kinds of stuff. I consider 
this rather a current "side effect" of using vm_insert_pages() than 
something we'll need in the long term (below).

> 
>>
>>>
>>> It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
>>> to expose an interface that doesn't assume the VMA is already fully set
>>> up... but I think one for a future series perhaps.
>>
>> If the solution to your problem is as easy as making vm_insert_pages() pass
>> something else than vma->vm_page_prot to insert_pages(), then I think we
>> should go for that. Like ... vm_insert_pages_prot().
> 
> Sadly no for reasons above.

Is the reason "refcount+mapcount"? Then it might be a problem better 
tackled separately as raised above. Sorry if I missed another point.

> 
>>
>> Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). But
>> yes, in an ideal world we'd avoid having temporarily messed up
>> vma->vm_page_prot. So we'd then document clearly how vm_insert_pages_prot()
>> may be used.
> 
> I think the thing with the delay in setting vma->vm_page_prot properly that is
> we have a chicken and egg scenario (oh so often the case in mmap_region()
> logic...) in that the mmap hook might change some of these flags which changes
> what that function will do...

Yes, that's ugly.

> 
> I was discussing with Liam recently how perhaps we should see how feasible it is
> to do away with this hook and replace it with something where drivers specify
> which VMA flags they want to set _ahead of time_, since this really is the only
> thing they should be changing other than vma->vm_private_data.

Yes.

> 
> Then we could possibly have a hook _only_ for assigning vma->vm_private_data to
> allow for any driver-specific init logic and doing mappings, and hey presto we
> have made things vastly saner. Could perhaps pass a const struct vm_area_struct
> * to make this clear...
> 
> But I may be missing some weird corner cases (hey probably am) or being too
> optimistic :>)

It's certainly one area we should be cleaning up ...

> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I wonder if we need a new interface then for 'pages which we don't want touched
> but do have a struct page' that is more expressed by the interface than
> remap_pfn_range() expresses.
> 
> I mean from the comment around vm_normal_page():
> 
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
>   * case, NULL is returned here. "Normal" mappings do have a struct page.
> 
> ...
> 
>   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>   * special mapping (even if there are underlying and valid "struct pages").
>   * COWed pages of a VM_PFNMAP are always normal.
> 
> So there's precedence for us just putting pages we allocate/manage ourselves in
> a VM_PFNMAP.
> 
> So I guess this interface would be something like:
> 
> 	int remap_kernel_pages(struct vm_area_struct *vma, unsigned long addr,
> 			       struct page **pages, unsigned long size,
> 			       pgprot_t prot);
> 


Well, I think we simply will want vm_insert_pages_prot() that stops 
treating these things like folios :) . *likely*  we'd want a distinct 
memdesc/type.

We could start that work right now by making some user (iouring, 
ring_buffer) set a new page->_type, and checking that in 
vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the 
refcount and the mapcount.

Because then, we can just make all the relevant drivers set the type, 
refuse in vm_insert_pages_prot() anything that doesn't have the type 
set, and refuse in vm_normal_page() any pages with this memdesc.

Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of 
these things will stop working, I hope that is not a problem.


There is one question is still had for a long time: maybe we *do* want 
to refcount these kernel allocations. When refcounting them, it's 
impossible that we might free them in our driver without having some 
reference lurking somewhere in some page table of a process. I would 
hope that this is being take care of differently. (e.g., VMA lifetime)


But again, I'd hope this is something we can sort out independent of 
this series.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:12         ` David Hildenbrand
@ 2024-11-29 12:26           ` Peter Zijlstra
  2024-11-29 12:45             ` David Hildenbrand
  2024-11-29 12:48           ` Lorenzo Stoakes
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2024-11-29 12:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:

> Well, I think we simply will want vm_insert_pages_prot() that stops treating
> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> 
> We could start that work right now by making some user (iouring,
> ring_buffer) set a new page->_type, and checking that in
> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> and the mapcount.
> 
> Because then, we can just make all the relevant drivers set the type, refuse
> in vm_insert_pages_prot() anything that doesn't have the type set, and
> refuse in vm_normal_page() any pages with this memdesc.
> 
> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> these things will stop working, I hope that is not a problem.

Well... perf-tool likes to call write() upon these pages in order to
write out the data from the mmap() into a file.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:26           ` Peter Zijlstra
@ 2024-11-29 12:45             ` David Hildenbrand
  2024-11-29 12:55               ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lorenzo Stoakes, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 13:26, Peter Zijlstra wrote:
> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> 
>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>
>> We could start that work right now by making some user (iouring,
>> ring_buffer) set a new page->_type, and checking that in
>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>> and the mapcount.
>>
>> Because then, we can just make all the relevant drivers set the type, refuse
>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>> refuse in vm_normal_page() any pages with this memdesc.
>>
>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>> these things will stop working, I hope that is not a problem.
> 
> Well... perf-tool likes to call write() upon these pages in order to
> write out the data from the mmap() into a file.

I think a PFNMAP as used in this patch here will already stop making 
that work ....

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:12         ` David Hildenbrand
  2024-11-29 12:26           ` Peter Zijlstra
@ 2024-11-29 12:48           ` Lorenzo Stoakes
  2024-11-29 13:11             ` David Hildenbrand
  1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 12:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

Will reply inline also but to be clear - we should differentiate between
ongoing discussion about how best to tackle these things going forward
vs. whether _this patch_ is OK :)

I don't think you're objecting to the patch as such, just disappointed
about VM_PFNMAP and wanting to discuss this more generally?

As I say below, we can't use VM_MIXEDMAP as it's broken for our case (we
have to use page->mapping if there's a struct page), so I think _right now_
this is the only sane solution.

On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> On 28.11.24 15:23, Lorenzo Stoakes wrote:
> > On Thu, Nov 28, 2024 at 02:37:17PM +0100, David Hildenbrand wrote:
> > > On 28.11.24 14:20, Lorenzo Stoakes wrote:
> > > > On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
> > > > > On 28.11.24 12:37, Lorenzo Stoakes wrote:
> > [snip]
> > > > > > It makes sense semantically to establish a PFN map too - we are managing
> > > > > > the pages internally and so it is appropriate to mark this as a special
> > > > > > mapping.
> > > > >
> > > > > It's rather sad seeing more PFNMAP users where PFNMAP is not really required
> > > > > (-> this is struct page backed).
> > > > >
> > > > > Especially having to perform several independent remap_pfn_range() calls
> > > > > rather looks like yet another workaround ...
> > > > >
> > > > > Would we be able to achieve something comparable with vm_insert_pages(), to
> > > > > just map them in advance?
> > > >
> > > > Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
> > > > friends all refer vma->vm_page_prot which is not yet _correctly_ established at
> > > > the point of the f_op->mmap() hook being invoked :)
> > >
> > > So all you want is a vm_insert_pages() variant where we can pass in the
> > > vm_page_prot?
> >
> > Hmm, looking into the code I don't think VM_MIXEDMAP is correct after all.
> >
> > We don't want these pages touched at all, we manage them ourselves, and
> > VM_MIXEDMAP, unless mapping memory mapped I/O pages, will treat them as such.
> >
> > For instance, vm_insert_page() -> insert_page() -> insert_page_into_pte_locked()
> > acts as if this is a folio, manipulating the ref count and invoking
> > folio_add_file_rmap_pte() - which we emphatically do not want.
>
> Right, but that should be independent of what you want to achieve in this
> series, or am I wrong?
>

No, we can't implement things this way if we use VM_MIXEDMAP, as
pfn_mkwrite() will no longer work (we have a page now) and page_mkwrite()
will require the same page->mapping hack as we had before, which breaks the
whole thing.

Also, this is memory that makes absolutely no sense being reference/map
counted/placed in the rmap. It was worked around in the previous code by
pinning the pages on fault, but this is ugly and unnecessary.

The whole issue we saw here arose because we tried to treat this memory as
if it were not kernel memory but instead just standard
refcount/mapcount... yes to below about needing a specific way of doing
this with e.g. memdesc :)) previously the pages were being pinned to avoid
presumably migration, compaction etc. but this isn't something we want to
be worrying about...

This change will break GUP for the range, but I don't see anywhere that
needs GUP for these mappings and the memory is now specifically pinned.

So TL;DR - it's VM_PFNMAP or I have to find a completely different way of
solving this problem afaict.

> vm_insert_page()/vm_insert_pages() is our mechanism to install kernel
> allocations into the page tables. (note "kernel allocations", not "kernel
> memory", which might or might not have "struct pages")
>
> There is the bigger question how we could convert all users to either (a)
> not refcount + mapcount (and we discussed a separate memdesc type for that)
> (b) still refcount (similarly, likely separate memdesc).

I think we definitely need the ability to differentiate between:

1. 'I allocated pages, but I want them to be treated like userland memory'

2. (maybe) 'I allocated pages, and own them, pin them for me (maybe) and do
   refcounts/mapcounts, but I will still manage them

3. Yes there are RAM allocations, but do not touch them at all, I am
   managing them privately.

4. This is a pure PFN mapping of memory-mapped I/O pages.

5. Mix of the above?

I mean roughly speaking :)

We also have nuances, like the issue this patch fixes

>
> But that will be a problem to be solved by all similar drives.
>
> Slapping in a remap_pfn_range() + VM_PFNMAP in now in the absence of having
> solved the bigger problem there sounds quite suboptimal to me.

See above, sadly it is the only way to solve the problem.

Note though that perf has a _very specific_ requirement for 1st page r/w,
rest r/o. I think most drivers will not have this requirement.

> remap_pfn_range() saw sufficient abuse already, and the way we hacked in
> VM_PAT handling in there really makes it something we don't want to reuse as
> is when trying to come up with a clean way to map kernel allocations. I
> strongly assume that some of the remap_pfn_range() users we currently have
> do actually deal with kernel allocations as well, and likely they should all
> get converted to something better once we have it.

Yeah I think all this is a bloody mess and is crying out for a much clearer
way to abstract things.

>
>
> So, isn't this something to just solve independently of what you are
> actually trying to achieve in this series (page->index and page->mapping)?

Again, VM_PFNMAP is necessary here for this to work (sadly). This other
discussion we're having is kind of separate though :)

>
> [...]
>
>
> > > >
> > > > We set the field in __mmap_new_vma(), _but_ importantly, we defer the
> > > > writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
> > > > were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
> > > > read/write mappings, which is emphatically not what we want - we want them
> > > > read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
> > > > first page read/write and the rest read-only.
> > > >
> > > > It's this requirement that means this is really the only way to do this as far
> > > > as I can tell.
> > > >
> > > > It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
> > > > mapping, as the pages reference kernel-allocated memory and are managed by perf,
> > > > not put on any LRU, etc.
> > > >
> > > > It sucks to have to loop like this and it feels like a workaround, which makes
> > > > me wonder if we need a new interface to better allow this stuff on mmap...
> > > >
> > > > In any case I think this is the most sensible solution currently available that
> > > > avoids the pre-existing situation of pretending the pages are folios but
> > > > somewhat abusing the interface to allow page_mkwrite() to work correctly by
> > > > setting page->index, mapping.
> > >
> > > Yes, that page->index stuff is nasty.
> >
> > It's the ->mapping that is more of the issue I think, as that _has_ to be set in
> > the original version, I can't actually see why index _must_ be set, there should
> > be no case in which rmap is used on the page, so possibly was a mistake, but
> > both fields are going from struct page so both must be eliminated :)
>
> :) Yes.
>
> > > > The alternative to this would be to folio-fy, but these are emphatically _not_
> > > > folios, that is userland memory managed as userland memory, it's a mapping onto
> > > > kernel memory exposed to userspace.
> > >
> > > Yes, we should even move away from folios completely in the future for
> > > vm_insert_page().
> >
> > Well, isn't VM_MIXEDMAP intended specifically so you can mix normal user pages
> > that live in the LRU and have an rmap etc. etc. with PFN mappings to I/O mapped
> > memory? :) so then that's folios + raw PFN's.
>
> VM_MIXEDMAP was abused over the years for all kinds of stuff. I consider
> this rather a current "side effect" of using vm_insert_pages() than
> something we'll need in the long term (below).
>
> >
> > >
> > > >
> > > > It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
> > > > to expose an interface that doesn't assume the VMA is already fully set
> > > > up... but I think one for a future series perhaps.
> > >
> > > If the solution to your problem is as easy as making vm_insert_pages() pass
> > > something else than vma->vm_page_prot to insert_pages(), then I think we
> > > should go for that. Like ... vm_insert_pages_prot().
> >
> > Sadly no for reasons above.
>
> Is the reason "refcount+mapcount"? Then it might be a problem better tackled
> separately as raised above. Sorry if I missed another point.

See above, vm_normal_page() cannot return a struct page * or otherwise we
end up having to invoke do_page_mkwrite() which interprets a missing
folio->mapping as requiring a retry.

We could hack it to lock the folio in the page_mkwrite() hook but that's
just horrible and worse than using VM_PFNMAP. Plus we're folio-fying by the
back door at that point.

This is not helped by the fact we allocate non-compound higher order pages
(if not using vmalloc) in the perf code :)

We are in an ugly situation, so it's a question of drinking the least
horrible pint of beer here.

>
> >
> > >
> > > Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). But
> > > yes, in an ideal world we'd avoid having temporarily messed up
> > > vma->vm_page_prot. So we'd then document clearly how vm_insert_pages_prot()
> > > may be used.
> >
> > I think the thing with the delay in setting vma->vm_page_prot properly that is
> > we have a chicken and egg scenario (oh so often the case in mmap_region()
> > logic...) in that the mmap hook might change some of these flags which changes
> > what that function will do...
>
> Yes, that's ugly.
>
> >
> > I was discussing with Liam recently how perhaps we should see how feasible it is
> > to do away with this hook and replace it with something where drivers specify
> > which VMA flags they want to set _ahead of time_, since this really is the only
> > thing they should be changing other than vma->vm_private_data.
>
> Yes.
>
> >
> > Then we could possibly have a hook _only_ for assigning vma->vm_private_data to
> > allow for any driver-specific init logic and doing mappings, and hey presto we
> > have made things vastly saner. Could perhaps pass a const struct vm_area_struct
> > * to make this clear...
> >
> > But I may be missing some weird corner cases (hey probably am) or being too
> > optimistic :>)
>
> It's certainly one area we should be cleaning up ...
>
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I wonder if we need a new interface then for 'pages which we don't want touched
> > but do have a struct page' that is more expressed by the interface than
> > remap_pfn_range() expresses.
> >
> > I mean from the comment around vm_normal_page():
> >
> >   * "Special" mappings do not wish to be associated with a "struct page" (either
> >   * it doesn't exist, or it exists but they don't want to touch it). In this
> >   * case, NULL is returned here. "Normal" mappings do have a struct page.
> >
> > ...
> >
> >   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
> >   * special mapping (even if there are underlying and valid "struct pages").
> >   * COWed pages of a VM_PFNMAP are always normal.
> >
> > So there's precedence for us just putting pages we allocate/manage ourselves in
> > a VM_PFNMAP.
> >
> > So I guess this interface would be something like:
> >
> > 	int remap_kernel_pages(struct vm_area_struct *vma, unsigned long addr,
> > 			       struct page **pages, unsigned long size,
> > 			       pgprot_t prot);
> >
>
>
> Well, I think we simply will want vm_insert_pages_prot() that stops treating
> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>
> We could start that work right now by making some user (iouring,
> ring_buffer) set a new page->_type, and checking that in
> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> and the mapcount.
>
> Because then, we can just make all the relevant drivers set the type, refuse
> in vm_insert_pages_prot() anything that doesn't have the type set, and
> refuse in vm_normal_page() any pages with this memdesc.
>
> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> these things will stop working, I hope that is not a problem.
>
>
> There is one question is still had for a long time: maybe we *do* want to
> refcount these kernel allocations. When refcounting them, it's impossible
> that we might free them in our driver without having some reference lurking
> somewhere in some page table of a process. I would hope that this is being
> take care of differently. (e.g., VMA lifetime)
>
>
> But again, I'd hope this is something we can sort out independent of this
> series.

Yes, in a way you wonder if _everything_ should be refcounted and _nothing_
'manually' controlled by drivers which __free_pages() unconditionally at
the end.

This would avoid uaf's and such for still-mapped memory.

Here we're cleaning up on VMA close and doing our own reference counting
there so we're relatively safe...

Anyway I agree with you that we should have something that explicitly
describes what is desired like presumably a memdesc would.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:45             ` David Hildenbrand
@ 2024-11-29 12:55               ` Lorenzo Stoakes
  2024-11-29 12:59                 ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 12:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
> On 29.11.24 13:26, Peter Zijlstra wrote:
> > On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> >
> > > Well, I think we simply will want vm_insert_pages_prot() that stops treating
> > > these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> > >
> > > We could start that work right now by making some user (iouring,
> > > ring_buffer) set a new page->_type, and checking that in
> > > vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> > > and the mapcount.
> > >
> > > Because then, we can just make all the relevant drivers set the type, refuse
> > > in vm_insert_pages_prot() anything that doesn't have the type set, and
> > > refuse in vm_normal_page() any pages with this memdesc.
> > >
> > > Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> > > these things will stop working, I hope that is not a problem.
> >
> > Well... perf-tool likes to call write() upon these pages in order to
> > write out the data from the mmap() into a file.

I'm confused about what you mean, write() using the fd should work fine, how
would they interact with the mmap? I mean be making a silly mistake here

>
> I think a PFNMAP as used in this patch here will already stop making that
> work ....

Actually I think GUP should work fine, except if you explicitly use a version
that expects an array of struct page *'s, because obviously those can't be
provided?

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:55               ` Lorenzo Stoakes
@ 2024-11-29 12:59                 ` David Hildenbrand
  2024-11-29 13:02                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 12:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 13:55, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
>> On 29.11.24 13:26, Peter Zijlstra wrote:
>>> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
>>>
>>>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>>>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>>>
>>>> We could start that work right now by making some user (iouring,
>>>> ring_buffer) set a new page->_type, and checking that in
>>>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>>>> and the mapcount.
>>>>
>>>> Because then, we can just make all the relevant drivers set the type, refuse
>>>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>>>> refuse in vm_normal_page() any pages with this memdesc.
>>>>
>>>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>>>> these things will stop working, I hope that is not a problem.
>>>
>>> Well... perf-tool likes to call write() upon these pages in order to
>>> write out the data from the mmap() into a file.
> 
> I'm confused about what you mean, write() using the fd should work fine, how
> would they interact with the mmap? I mean be making a silly mistake here

write() to file from the mmap()'ed address range to *some* file.

This will GUP the pages you inserted.

GUP does not work on PFNMAP.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:59                 ` David Hildenbrand
@ 2024-11-29 13:02                   ` Lorenzo Stoakes
  2024-11-29 13:12                     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 13:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
> On 29.11.24 13:55, Lorenzo Stoakes wrote:
> > On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
> > > On 29.11.24 13:26, Peter Zijlstra wrote:
> > > > On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> > > >
> > > > > Well, I think we simply will want vm_insert_pages_prot() that stops treating
> > > > > these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> > > > >
> > > > > We could start that work right now by making some user (iouring,
> > > > > ring_buffer) set a new page->_type, and checking that in
> > > > > vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> > > > > and the mapcount.
> > > > >
> > > > > Because then, we can just make all the relevant drivers set the type, refuse
> > > > > in vm_insert_pages_prot() anything that doesn't have the type set, and
> > > > > refuse in vm_normal_page() any pages with this memdesc.
> > > > >
> > > > > Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> > > > > these things will stop working, I hope that is not a problem.
> > > >
> > > > Well... perf-tool likes to call write() upon these pages in order to
> > > > write out the data from the mmap() into a file.
> >
> > I'm confused about what you mean, write() using the fd should work fine, how
> > would they interact with the mmap? I mean be making a silly mistake here
>
> write() to file from the mmap()'ed address range to *some* file.
>

Yeah sorry my brain melted down briefly, for some reason was thinking of read()
writing into the buffer...

> This will GUP the pages you inserted.
>
> GUP does not work on PFNMAP.

Well it _does_ if struct page **pages is set to NULL :)

Anyway let me go try this and see what happens...

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 12:48           ` Lorenzo Stoakes
@ 2024-11-29 13:11             ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 13:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 13:48, Lorenzo Stoakes wrote:
> Will reply inline also but to be clear - we should differentiate between
> ongoing discussion about how best to tackle these things going forward
> vs. whether _this patch_ is OK :)
> 
> I don't think you're objecting to the patch as such, just disappointed
> about VM_PFNMAP and wanting to discuss this more generally?

Well, a bit of both :)

> 
> As I say below, we can't use VM_MIXEDMAP as it's broken for our case (we
> have to use page->mapping if there's a struct page), so I think _right now_
> this is the only sane solution.

I'm not so sure ...

I've also been wondering whether this particular user really wants to 
allocate and work on folios for the time being :/


Putting aside the remainder of what we discuss further below for now, 
I've just been wondering if we could somehow have two mappings with two 
separate ops. The one for the single R/W page, the other for the R/O 
pages ...

... but I'm not sure yet if that would completely help.

... still trying to grasp all the details here, and how to keep GUP working.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:02                   ` Lorenzo Stoakes
@ 2024-11-29 13:12                     ` David Hildenbrand
  2024-11-29 13:19                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 13:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 14:02, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
>> On 29.11.24 13:55, Lorenzo Stoakes wrote:
>>> On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
>>>> On 29.11.24 13:26, Peter Zijlstra wrote:
>>>>> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
>>>>>
>>>>>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>>>>>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>>>>>
>>>>>> We could start that work right now by making some user (iouring,
>>>>>> ring_buffer) set a new page->_type, and checking that in
>>>>>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>>>>>> and the mapcount.
>>>>>>
>>>>>> Because then, we can just make all the relevant drivers set the type, refuse
>>>>>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>>>>>> refuse in vm_normal_page() any pages with this memdesc.
>>>>>>
>>>>>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>>>>>> these things will stop working, I hope that is not a problem.
>>>>>
>>>>> Well... perf-tool likes to call write() upon these pages in order to
>>>>> write out the data from the mmap() into a file.
>>>
>>> I'm confused about what you mean, write() using the fd should work fine, how
>>> would they interact with the mmap? I mean be making a silly mistake here
>>
>> write() to file from the mmap()'ed address range to *some* file.
>>
> 
> Yeah sorry my brain melted down briefly, for some reason was thinking of read()
> writing into the buffer...
> 
>> This will GUP the pages you inserted.
>>
>> GUP does not work on PFNMAP.
> 
> Well it _does_ if struct page **pages is set to NULL :)

Hm? :)

check_vma_flags() unconditionally refuses VM_PFNMAP.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:12                     ` David Hildenbrand
@ 2024-11-29 13:19                       ` Lorenzo Stoakes
  2024-11-29 13:24                         ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
> On 29.11.24 14:02, Lorenzo Stoakes wrote:
> > On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
> > > On 29.11.24 13:55, Lorenzo Stoakes wrote:
> > > > On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
> > > > > On 29.11.24 13:26, Peter Zijlstra wrote:
> > > > > > On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> > > > > >
> > > > > > > Well, I think we simply will want vm_insert_pages_prot() that stops treating
> > > > > > > these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> > > > > > >
> > > > > > > We could start that work right now by making some user (iouring,
> > > > > > > ring_buffer) set a new page->_type, and checking that in
> > > > > > > vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> > > > > > > and the mapcount.
> > > > > > >
> > > > > > > Because then, we can just make all the relevant drivers set the type, refuse
> > > > > > > in vm_insert_pages_prot() anything that doesn't have the type set, and
> > > > > > > refuse in vm_normal_page() any pages with this memdesc.
> > > > > > >
> > > > > > > Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> > > > > > > these things will stop working, I hope that is not a problem.
> > > > > >
> > > > > > Well... perf-tool likes to call write() upon these pages in order to
> > > > > > write out the data from the mmap() into a file.
> > > >
> > > > I'm confused about what you mean, write() using the fd should work fine, how
> > > > would they interact with the mmap? I mean be making a silly mistake here
> > >
> > > write() to file from the mmap()'ed address range to *some* file.
> > >
> >
> > Yeah sorry my brain melted down briefly, for some reason was thinking of read()
> > writing into the buffer...
> >
> > > This will GUP the pages you inserted.
> > >
> > > GUP does not work on PFNMAP.
> >
> > Well it _does_ if struct page **pages is set to NULL :)
>
> Hm? :)
>
> check_vma_flags() unconditionally refuses VM_PFNMAP.

Ha, funny with my name all over git blame there... ok yup missed this, the
vm_normal_page() == NULL stuff must but for mixed map (and those other weird
cases I think you can get0...

Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
using uaccess?

One thing to note is I did run all the perf tests with no issues whatsoever. You
would _think_ this would have come up...

I'm editing some test code to explicitly write() from the buffer anyway to see.

If we can't do pfnmap, and we definitely can't do mixedmap (because it's
basically entirely equivalent in every way to just faulting in the pages as
before and requires the same hacks) then I will have to go back to the drawing
board or somehow change the faulting code.

This really sucks.

I'm not quite sure I even understand why we don't allow GUP used _just for
pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
whatever mapped it will maintain the lifetime.

What a mess...

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:19                       ` Lorenzo Stoakes
@ 2024-11-29 13:24                         ` David Hildenbrand
  2024-11-29 13:38                           ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 13:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 14:19, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
>> On 29.11.24 14:02, Lorenzo Stoakes wrote:
>>> On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
>>>> On 29.11.24 13:55, Lorenzo Stoakes wrote:
>>>>> On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
>>>>>> On 29.11.24 13:26, Peter Zijlstra wrote:
>>>>>>> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>>>>>>>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>>>>>>>
>>>>>>>> We could start that work right now by making some user (iouring,
>>>>>>>> ring_buffer) set a new page->_type, and checking that in
>>>>>>>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>>>>>>>> and the mapcount.
>>>>>>>>
>>>>>>>> Because then, we can just make all the relevant drivers set the type, refuse
>>>>>>>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>>>>>>>> refuse in vm_normal_page() any pages with this memdesc.
>>>>>>>>
>>>>>>>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>>>>>>>> these things will stop working, I hope that is not a problem.
>>>>>>>
>>>>>>> Well... perf-tool likes to call write() upon these pages in order to
>>>>>>> write out the data from the mmap() into a file.
>>>>>
>>>>> I'm confused about what you mean, write() using the fd should work fine, how
>>>>> would they interact with the mmap? I mean be making a silly mistake here
>>>>
>>>> write() to file from the mmap()'ed address range to *some* file.
>>>>
>>>
>>> Yeah sorry my brain melted down briefly, for some reason was thinking of read()
>>> writing into the buffer...
>>>
>>>> This will GUP the pages you inserted.
>>>>
>>>> GUP does not work on PFNMAP.
>>>
>>> Well it _does_ if struct page **pages is set to NULL :)
>>
>> Hm? :)
>>
>> check_vma_flags() unconditionally refuses VM_PFNMAP.
> 
> Ha, funny with my name all over git blame there... ok yup missed this, the
> vm_normal_page() == NULL stuff must but for mixed map (and those other weird
> cases I think you can get0...
> 
> Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
> using uaccess?
> 
> One thing to note is I did run all the perf tests with no issues whatsoever. You
> would _think_ this would have come up...
> 
> I'm editing some test code to explicitly write() from the buffer anyway to see.
> 
> If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> basically entirely equivalent in every way to just faulting in the pages as
> before and requires the same hacks) then I will have to go back to the drawing
> board or somehow change the faulting code.
> 
> This really sucks.
> 
> I'm not quite sure I even understand why we don't allow GUP used _just for
> pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> whatever mapped it will maintain the lifetime.
> 
> What a mess...

Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
raised my refcounting comment earlier) just read recent:

commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Sep 11 17:11:23 2024 -0700

     mm: avoid leaving partial pfn mappings around in error case
     
     As Jann points out, PFN mappings are special, because unlike normal
     memory mappings, there is no lifetime information associated with the
     mapping - it is just a raw mapping of PFNs with no reference counting of
     a 'struct page'.


GUP relies on the refcount. In a PFNMAP you don't have any way to make sure the
driver won't go down, free the page, to have it used by someone else while IO is still
happening to that page.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:24                         ` David Hildenbrand
@ 2024-11-29 13:38                           ` Lorenzo Stoakes
  2024-11-29 13:47                             ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 13:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 02:24:24PM +0100, David Hildenbrand wrote:
> On 29.11.24 14:19, Lorenzo Stoakes wrote:
> > On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
> > > On 29.11.24 14:02, Lorenzo Stoakes wrote:
> > > > On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
> > > > > On 29.11.24 13:55, Lorenzo Stoakes wrote:
> > > > > > On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
> > > > > > > On 29.11.24 13:26, Peter Zijlstra wrote:
> > > > > > > > On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > > Well, I think we simply will want vm_insert_pages_prot() that stops treating
> > > > > > > > > these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> > > > > > > > >
> > > > > > > > > We could start that work right now by making some user (iouring,
> > > > > > > > > ring_buffer) set a new page->_type, and checking that in
> > > > > > > > > vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> > > > > > > > > and the mapcount.
> > > > > > > > >
> > > > > > > > > Because then, we can just make all the relevant drivers set the type, refuse
> > > > > > > > > in vm_insert_pages_prot() anything that doesn't have the type set, and
> > > > > > > > > refuse in vm_normal_page() any pages with this memdesc.
> > > > > > > > >
> > > > > > > > > Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> > > > > > > > > these things will stop working, I hope that is not a problem.
> > > > > > > >
> > > > > > > > Well... perf-tool likes to call write() upon these pages in order to
> > > > > > > > write out the data from the mmap() into a file.
> > > > > >
> > > > > > I'm confused about what you mean, write() using the fd should work fine, how
> > > > > > would they interact with the mmap? I mean be making a silly mistake here
> > > > >
> > > > > write() to file from the mmap()'ed address range to *some* file.
> > > > >
> > > >
> > > > Yeah sorry my brain melted down briefly, for some reason was thinking of read()
> > > > writing into the buffer...
> > > >
> > > > > This will GUP the pages you inserted.
> > > > >
> > > > > GUP does not work on PFNMAP.
> > > >
> > > > Well it _does_ if struct page **pages is set to NULL :)
> > >
> > > Hm? :)
> > >
> > > check_vma_flags() unconditionally refuses VM_PFNMAP.
> >
> > Ha, funny with my name all over git blame there... ok yup missed this, the
> > vm_normal_page() == NULL stuff must but for mixed map (and those other weird
> > cases I think you can get0...
> >
> > Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
> > using uaccess?
> >
> > One thing to note is I did run all the perf tests with no issues whatsoever. You
> > would _think_ this would have come up...
> >
> > I'm editing some test code to explicitly write() from the buffer anyway to see.

I just tested it and write() works fine, it uses uaccess afaict as part of the
lib/iov_iter.c code:

generic_perform_write()
-> copy_folio_from_iter_atomic()
-> copy_page_from_iter_atomic()
-> __copy_from_iter()
-> copy_from_user_iter()
-> raw_copy_from_user()
-> copy_user_generic()
-> [uaccess asm]

> >
> > If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> > basically entirely equivalent in every way to just faulting in the pages as
> > before and requires the same hacks) then I will have to go back to the drawing
> > board or somehow change the faulting code.
> >
> > This really sucks.
> >
> > I'm not quite sure I even understand why we don't allow GUP used _just for
> > pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> > whatever mapped it will maintain the lifetime.
> >
> > What a mess...
>
> Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
> raised my refcounting comment earlier) just read recent:
>
> commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Sep 11 17:11:23 2024 -0700
>
>     mm: avoid leaving partial pfn mappings around in error case
>     As Jann points out, PFN mappings are special, because unlike normal
>     memory mappings, there is no lifetime information associated with the
>     mapping - it is just a raw mapping of PFNs with no reference counting of
>     a 'struct page'.
>

I'm _very_ aware of this, having worked extensively on things around this kind
of issue recently (was a little bit involved with the above one too :), and
explicitly zap on error in this patch to ensure we leave no broken code paths.

I agree it's horrible, but we need to have a way of mapping this memory without
having to 'trick' the faulting mechanism to behave correctly.

At least in perf's case, we're safe, because we ref count in open/close of VMA's.

This is a special case due to the R/W, R/O thing.

In reference to that - you said in another email about mapping one part as a
separate R/W VMA and another as a separate R/O VMA, problem is all of the perf
code is set up with its own reference counting mechanism and it's not allowed to
split/merge etc., so we'd have to totally rework all of that to make that work
and correctly refcount things.

It'd be a big task. I don't think that's a reasonable thing to put effort into
at this time...

Also who knows if there's somebody, somewhere who _relies_ somehow on this being
a single mapping...

>
> GUP relies on the refcount. In a PFNMAP you don't have any way to make sure the
> driver won't go down, free the page, to have it used by someone else while IO is still
> happening to that page.

GUP isn't required here afaict, having eliminated write() as an issue.

I mean there's definitely things we need to fix here, but I think it's out of
scope for this fix.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:38                           ` Lorenzo Stoakes
@ 2024-11-29 13:47                             ` David Hildenbrand
  2024-11-29 13:59                               ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 13:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 14:38, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 02:24:24PM +0100, David Hildenbrand wrote:
>> On 29.11.24 14:19, Lorenzo Stoakes wrote:
>>> On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
>>>> On 29.11.24 14:02, Lorenzo Stoakes wrote:
>>>>> On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
>>>>>> On 29.11.24 13:55, Lorenzo Stoakes wrote:
>>>>>>> On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
>>>>>>>> On 29.11.24 13:26, Peter Zijlstra wrote:
>>>>>>>>> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>>>>>>>>>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>>>>>>>>>
>>>>>>>>>> We could start that work right now by making some user (iouring,
>>>>>>>>>> ring_buffer) set a new page->_type, and checking that in
>>>>>>>>>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>>>>>>>>>> and the mapcount.
>>>>>>>>>>
>>>>>>>>>> Because then, we can just make all the relevant drivers set the type, refuse
>>>>>>>>>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>>>>>>>>>> refuse in vm_normal_page() any pages with this memdesc.
>>>>>>>>>>
>>>>>>>>>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>>>>>>>>>> these things will stop working, I hope that is not a problem.
>>>>>>>>>
>>>>>>>>> Well... perf-tool likes to call write() upon these pages in order to
>>>>>>>>> write out the data from the mmap() into a file.
>>>>>>>
>>>>>>> I'm confused about what you mean, write() using the fd should work fine, how
>>>>>>> would they interact with the mmap? I mean be making a silly mistake here
>>>>>>
>>>>>> write() to file from the mmap()'ed address range to *some* file.
>>>>>>
>>>>>
>>>>> Yeah sorry my brain melted down briefly, for some reason was thinking of read()
>>>>> writing into the buffer...
>>>>>
>>>>>> This will GUP the pages you inserted.
>>>>>>
>>>>>> GUP does not work on PFNMAP.
>>>>>
>>>>> Well it _does_ if struct page **pages is set to NULL :)
>>>>
>>>> Hm? :)
>>>>
>>>> check_vma_flags() unconditionally refuses VM_PFNMAP.
>>>
>>> Ha, funny with my name all over git blame there... ok yup missed this, the
>>> vm_normal_page() == NULL stuff must but for mixed map (and those other weird
>>> cases I think you can get0...
>>>
>>> Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
>>> using uaccess?
>>>
>>> One thing to note is I did run all the perf tests with no issues whatsoever. You
>>> would _think_ this would have come up...
>>>
>>> I'm editing some test code to explicitly write() from the buffer anyway to see.
> 
> I just tested it and write() works fine, it uses uaccess afaict as part of the
> lib/iov_iter.c code:
> 
> generic_perform_write()
> -> copy_folio_from_iter_atomic()
> -> copy_page_from_iter_atomic()
> -> __copy_from_iter()
> -> copy_from_user_iter()
> -> raw_copy_from_user()
> -> copy_user_generic()
> -> [uaccess asm]
> 

Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.

Ptrace access and friends should also no longer work on these pages, 
likely that's tolerable.

>>>
>>> If we can't do pfnmap, and we definitely can't do mixedmap (because it's
>>> basically entirely equivalent in every way to just faulting in the pages as
>>> before and requires the same hacks) then I will have to go back to the drawing
>>> board or somehow change the faulting code.
>>>
>>> This really sucks.
>>>
>>> I'm not quite sure I even understand why we don't allow GUP used _just for
>>> pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
>>> whatever mapped it will maintain the lifetime.
>>>
>>> What a mess...
>>
>> Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
>> raised my refcounting comment earlier) just read recent:
>>
>> commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date:   Wed Sep 11 17:11:23 2024 -0700
>>
>>      mm: avoid leaving partial pfn mappings around in error case
>>      As Jann points out, PFN mappings are special, because unlike normal
>>      memory mappings, there is no lifetime information associated with the
>>      mapping - it is just a raw mapping of PFNs with no reference counting of
>>      a 'struct page'.
>>
> 
> I'm _very_ aware of this, having worked extensively on things around this kind
> of issue recently (was a little bit involved with the above one too :), and
> explicitly zap on error in this patch to ensure we leave no broken code paths.
> 
> I agree it's horrible, but we need to have a way of mapping this memory without
> having to 'trick' the faulting mechanism to behave correctly.

What's completely "surprising" to me is, if there is no page_mkwrite, 
but the VMA is writable, then just map the PTE writable ...

> 
> At least in perf's case, we're safe, because we ref count in open/close of VMA's.
> 
> This is a special case due to the R/W, R/O thing.
> 
> In reference to that - you said in another email about mapping one part as a
> separate R/W VMA and another as a separate R/O VMA, problem is all of the perf
> code is set up with its own reference counting mechanism and it's not allowed to
> split/merge etc., so we'd have to totally rework all of that to make that work
> and correctly refcount things.
> 
> It'd be a big task. I don't think that's a reasonable thing to put effort into
> at this time...
> 
> Also who knows if there's somebody, somewhere who _relies_ somehow on this being
> a single mapping...

The main issue here really is that we allow R/O pages in VM_WRITE VMAs, 
and want to make write faults fail :(

What an absolute mess, yeah, without some more core changes 
vm_insert_pages() cannot be used, unfortunately.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:47                             ` David Hildenbrand
@ 2024-11-29 13:59                               ` Lorenzo Stoakes
  2024-11-29 14:09                                 ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 13:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 02:47:59PM +0100, David Hildenbrand wrote:
> On 29.11.24 14:38, Lorenzo Stoakes wrote:
> > On Fri, Nov 29, 2024 at 02:24:24PM +0100, David Hildenbrand wrote:
> > > On 29.11.24 14:19, Lorenzo Stoakes wrote:
> > > > On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
> > > > > On 29.11.24 14:02, Lorenzo Stoakes wrote:
> > > > > > On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
> > > > > > > On 29.11.24 13:55, Lorenzo Stoakes wrote:
> > > > > > > > On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
> > > > > > > > > On 29.11.24 13:26, Peter Zijlstra wrote:
> > > > > > > > > > On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> > > > > > > > > >
> > > > > > > > > > > Well, I think we simply will want vm_insert_pages_prot() that stops treating
> > > > > > > > > > > these things like folios :) . *likely*  we'd want a distinct memdesc/type.
> > > > > > > > > > >
> > > > > > > > > > > We could start that work right now by making some user (iouring,
> > > > > > > > > > > ring_buffer) set a new page->_type, and checking that in
> > > > > > > > > > > vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> > > > > > > > > > > and the mapcount.
> > > > > > > > > > >
> > > > > > > > > > > Because then, we can just make all the relevant drivers set the type, refuse
> > > > > > > > > > > in vm_insert_pages_prot() anything that doesn't have the type set, and
> > > > > > > > > > > refuse in vm_normal_page() any pages with this memdesc.
> > > > > > > > > > >
> > > > > > > > > > > Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> > > > > > > > > > > these things will stop working, I hope that is not a problem.
> > > > > > > > > >
> > > > > > > > > > Well... perf-tool likes to call write() upon these pages in order to
> > > > > > > > > > write out the data from the mmap() into a file.
> > > > > > > >
> > > > > > > > I'm confused about what you mean, write() using the fd should work fine, how
> > > > > > > > would they interact with the mmap? I mean be making a silly mistake here
> > > > > > >
> > > > > > > write() to file from the mmap()'ed address range to *some* file.
> > > > > > >
> > > > > >
> > > > > > Yeah sorry my brain melted down briefly, for some reason was thinking of read()
> > > > > > writing into the buffer...
> > > > > >
> > > > > > > This will GUP the pages you inserted.
> > > > > > >
> > > > > > > GUP does not work on PFNMAP.
> > > > > >
> > > > > > Well it _does_ if struct page **pages is set to NULL :)
> > > > >
> > > > > Hm? :)
> > > > >
> > > > > check_vma_flags() unconditionally refuses VM_PFNMAP.
> > > >
> > > > Ha, funny with my name all over git blame there... ok yup missed this, the
> > > > vm_normal_page() == NULL stuff must but for mixed map (and those other weird
> > > > cases I think you can get0...
> > > >
> > > > Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
> > > > using uaccess?
> > > >
> > > > One thing to note is I did run all the perf tests with no issues whatsoever. You
> > > > would _think_ this would have come up...
> > > >
> > > > I'm editing some test code to explicitly write() from the buffer anyway to see.
> >
> > I just tested it and write() works fine, it uses uaccess afaict as part of the
> > lib/iov_iter.c code:
> >
> > generic_perform_write()
> > -> copy_folio_from_iter_atomic()
> > -> copy_page_from_iter_atomic()
> > -> __copy_from_iter()
> > -> copy_from_user_iter()
> > -> raw_copy_from_user()
> > -> copy_user_generic()
> > -> [uaccess asm]
> >
>
> Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
>
> Ptrace access and friends should also no longer work on these pages, likely
> that's tolerable.

Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
to be able to ptrace perf counter mappings in another process (it'd be kind of
totally insane to do that anyway since it's a ring buffer that needs special
handing anyway).

>
> > > >
> > > > If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> > > > basically entirely equivalent in every way to just faulting in the pages as
> > > > before and requires the same hacks) then I will have to go back to the drawing
> > > > board or somehow change the faulting code.
> > > >
> > > > This really sucks.
> > > >
> > > > I'm not quite sure I even understand why we don't allow GUP used _just for
> > > > pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> > > > whatever mapped it will maintain the lifetime.
> > > >
> > > > What a mess...
> > >
> > > Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
> > > raised my refcounting comment earlier) just read recent:
> > >
> > > commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
> > > Author: Linus Torvalds <torvalds@linux-foundation.org>
> > > Date:   Wed Sep 11 17:11:23 2024 -0700
> > >
> > >      mm: avoid leaving partial pfn mappings around in error case
> > >      As Jann points out, PFN mappings are special, because unlike normal
> > >      memory mappings, there is no lifetime information associated with the
> > >      mapping - it is just a raw mapping of PFNs with no reference counting of
> > >      a 'struct page'.
> > >
> >
> > I'm _very_ aware of this, having worked extensively on things around this kind
> > of issue recently (was a little bit involved with the above one too :), and
> > explicitly zap on error in this patch to ensure we leave no broken code paths.
> >
> > I agree it's horrible, but we need to have a way of mapping this memory without
> > having to 'trick' the faulting mechanism to behave correctly.
>
> What's completely "surprising" to me is, if there is no page_mkwrite, but
> the VMA is writable, then just map the PTE writable ...

I've had a number of surprises on this journey :)

To make sure I understand what you mean do you mean the whole:

do_wp_page()
-> wp_page_shared()
-> if (page_mkwrite) { ... } else { wp_page_reuse(); }
-> wp_page_reuse()
-> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]

Path?

Yes this surprised me too... so you can't just say 'hey map this read-only' and
get what you want, because the kernel thinks it's somehow spuriously read-only
mapped and just needs correction (probably there are situations where this is
necessary).

It has to be explicitly CoW in VMA flags or implement page_mkwrite() (or if
!vm_normal_page() then pfn_mkwrite().

You aren't saved with !vm_normal_page() mappings either without a pfn_mkwrite()
because:

do_wp_page()
-> wp_pfn_shared()
-> if (pfn_mkwrite) { ... } else { wp_page_reuse() }
etc.

So it's the only game in town...

>
> >
> > At least in perf's case, we're safe, because we ref count in open/close of VMA's.
> >
> > This is a special case due to the R/W, R/O thing.
> >
> > In reference to that - you said in another email about mapping one part as a
> > separate R/W VMA and another as a separate R/O VMA, problem is all of the perf
> > code is set up with its own reference counting mechanism and it's not allowed to
> > split/merge etc., so we'd have to totally rework all of that to make that work
> > and correctly refcount things.
> >
> > It'd be a big task. I don't think that's a reasonable thing to put effort into
> > at this time...
> >
> > Also who knows if there's somebody, somewhere who _relies_ somehow on this being
> > a single mapping...
>
> The main issue here really is that we allow R/O pages in VM_WRITE VMAs, and
> want to make write faults fail :(
>
> What an absolute mess, yeah, without some more core changes
> vm_insert_pages() cannot be used, unfortunately.

Yes indeed, sadly. Maybe an idea for a series, as memdesc's are at the very
least a long way away.

>
> --
> Cheers,
>
> David / dhildenb
>

So overall - given all the above and the fact that the alternatives are _even
worse_ (having to spuriously folio lock if that'll even work, totally
unnecessary and semantically incorrect folio-fication or a complete rework of
mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok
with the patch, all things considered? :)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 13:59                               ` Lorenzo Stoakes
@ 2024-11-29 14:09                                 ` David Hildenbrand
  2024-11-29 14:35                                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 14:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

>>> I just tested it and write() works fine, it uses uaccess afaict as part of the
>>> lib/iov_iter.c code:
>>>
>>> generic_perform_write()
>>> -> copy_folio_from_iter_atomic()
>>> -> copy_page_from_iter_atomic()
>>> -> __copy_from_iter()
>>> -> copy_from_user_iter()
>>> -> raw_copy_from_user()
>>> -> copy_user_generic()
>>> -> [uaccess asm]
>>>
>>
>> Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
>>
>> Ptrace access and friends should also no longer work on these pages, likely
>> that's tolerable.
> 
> Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
> to be able to ptrace perf counter mappings in another process (it'd be kind of
> totally insane to do that anyway since it's a ring buffer that needs special
> handing anyway).

I think so as well. Disallowing GUP has some side effects, like not 
getting these pages included in a coredump etc ... at least I hope 
nobody uses O_DIRECT on them.

> 
>>
>>>>>
>>>>> If we can't do pfnmap, and we definitely can't do mixedmap (because it's
>>>>> basically entirely equivalent in every way to just faulting in the pages as
>>>>> before and requires the same hacks) then I will have to go back to the drawing
>>>>> board or somehow change the faulting code.
>>>>>
>>>>> This really sucks.
>>>>>
>>>>> I'm not quite sure I even understand why we don't allow GUP used _just for
>>>>> pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
>>>>> whatever mapped it will maintain the lifetime.
>>>>>
>>>>> What a mess...
>>>>
>>>> Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
>>>> raised my refcounting comment earlier) just read recent:
>>>>
>>>> commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
>>>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>>>> Date:   Wed Sep 11 17:11:23 2024 -0700
>>>>
>>>>       mm: avoid leaving partial pfn mappings around in error case
>>>>       As Jann points out, PFN mappings are special, because unlike normal
>>>>       memory mappings, there is no lifetime information associated with the
>>>>       mapping - it is just a raw mapping of PFNs with no reference counting of
>>>>       a 'struct page'.
>>>>
>>>
>>> I'm _very_ aware of this, having worked extensively on things around this kind
>>> of issue recently (was a little bit involved with the above one too :), and
>>> explicitly zap on error in this patch to ensure we leave no broken code paths.
>>>
>>> I agree it's horrible, but we need to have a way of mapping this memory without
>>> having to 'trick' the faulting mechanism to behave correctly.
>>
>> What's completely "surprising" to me is, if there is no page_mkwrite, but
>> the VMA is writable, then just map the PTE writable ...
> 
> I've had a number of surprises on this journey :)
> 
> To make sure I understand what you mean do you mean the whole:
> 
> do_wp_page()
> -> wp_page_shared()
> -> if (page_mkwrite) { ... } else { wp_page_reuse(); }
> -> wp_page_reuse()
> -> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]
> 
> Path?

Yes. I can see how it might be relevant (mprotect(PROT_READ) + 
mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default 
to "writable".

[...]

>>
> 
> So overall - given all the above and the fact that the alternatives are _even
> worse_ (having to spuriously folio lock if that'll even work, totally
> unnecessary and semantically incorrect folio-fication or a complete rework of
> mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok
> with the patch, all things considered? :)

It hurts my eyes, and feels like a step into the wrong direction. But I 
don't really see how to do it differently right now.

Can we add a comment/TODO in there that using remap_pfn_range this to 
map a kernel allocation really is suboptimal and we should switch to 
something better once we figured the details out?


BTW, I think the whole reason vm_insert_pages() touches the mapcount is 
because we didn't really have a way to identify during unmap that we 
must not adjust it either. A page->type will be able to sort that out 
(while still allowing to refcount them for now).

Actually, the whole thing is on my todo list, because messing with the 
RMAP with vm_insert_pages() doesn't make any sense (whereby the refcount 
might!). See the TODO I added in __folio_rmap_sanity_checks().

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 14:09                                 ` David Hildenbrand
@ 2024-11-29 14:35                                   ` Lorenzo Stoakes
  2024-11-29 14:48                                     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 14:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Nov 29, 2024 at 03:09:49PM +0100, David Hildenbrand wrote:
> > > > I just tested it and write() works fine, it uses uaccess afaict as part of the
> > > > lib/iov_iter.c code:
> > > >
> > > > generic_perform_write()
> > > > -> copy_folio_from_iter_atomic()
> > > > -> copy_page_from_iter_atomic()
> > > > -> __copy_from_iter()
> > > > -> copy_from_user_iter()
> > > > -> raw_copy_from_user()
> > > > -> copy_user_generic()
> > > > -> [uaccess asm]
> > > >
> > >
> > > Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
> > >
> > > Ptrace access and friends should also no longer work on these pages, likely
> > > that's tolerable.
> >
> > Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
> > to be able to ptrace perf counter mappings in another process (it'd be kind of
> > totally insane to do that anyway since it's a ring buffer that needs special
> > handing anyway).
>
> I think so as well. Disallowing GUP has some side effects, like not getting
> these pages included in a coredump etc ... at least I hope nobody uses
> O_DIRECT on them.

We set VM_DONTDUMP anyway (set by remap_pfn_range() also) so this part won't be
a problem, and I can't see anybody using O_DIRECT on them, sensibly.

>
> >
> > >
> > > > > >
> > > > > > If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> > > > > > basically entirely equivalent in every way to just faulting in the pages as
> > > > > > before and requires the same hacks) then I will have to go back to the drawing
> > > > > > board or somehow change the faulting code.
> > > > > >
> > > > > > This really sucks.
> > > > > >
> > > > > > I'm not quite sure I even understand why we don't allow GUP used _just for
> > > > > > pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> > > > > > whatever mapped it will maintain the lifetime.
> > > > > >
> > > > > > What a mess...
> > > > >
> > > > > Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
> > > > > raised my refcounting comment earlier) just read recent:
> > > > >
> > > > > commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
> > > > > Author: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > Date:   Wed Sep 11 17:11:23 2024 -0700
> > > > >
> > > > >       mm: avoid leaving partial pfn mappings around in error case
> > > > >       As Jann points out, PFN mappings are special, because unlike normal
> > > > >       memory mappings, there is no lifetime information associated with the
> > > > >       mapping - it is just a raw mapping of PFNs with no reference counting of
> > > > >       a 'struct page'.
> > > > >
> > > >
> > > > I'm _very_ aware of this, having worked extensively on things around this kind
> > > > of issue recently (was a little bit involved with the above one too :), and
> > > > explicitly zap on error in this patch to ensure we leave no broken code paths.
> > > >
> > > > I agree it's horrible, but we need to have a way of mapping this memory without
> > > > having to 'trick' the faulting mechanism to behave correctly.
> > >
> > > What's completely "surprising" to me is, if there is no page_mkwrite, but
> > > the VMA is writable, then just map the PTE writable ...
> >
> > I've had a number of surprises on this journey :)
> >
> > To make sure I understand what you mean do you mean the whole:
> >
> > do_wp_page()
> > -> wp_page_shared()
> > -> if (page_mkwrite) { ... } else { wp_page_reuse(); }
> > -> wp_page_reuse()
> > -> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]
> >
> > Path?
>
> Yes. I can see how it might be relevant (mprotect(PROT_READ) +
> mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default to
> "writable".

Yeah, no this whole thing very much resembles a labyrinth with a minotaur at the
end, whose name is VM_PFNMAP :>)

>
> [...]
>
> > >
> >
> > So overall - given all the above and the fact that the alternatives are _even
> > worse_ (having to spuriously folio lock if that'll even work, totally
> > unnecessary and semantically incorrect folio-fication or a complete rework of
> > mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok
> > with the patch, all things considered? :)
>
> It hurts my eyes, and feels like a step into the wrong direction. But I
> don't really see how to do it differently right now.

Yeah I agree totally.

>
> Can we add a comment/TODO in there that using remap_pfn_range this to map a
> kernel allocation really is suboptimal and we should switch to something
> better once we figured the details out?

Sure, let me do a v2 then to make Peter's life easier, can do the nommu fixup in
there too.

>
>
> BTW, I think the whole reason vm_insert_pages() touches the mapcount is
> because we didn't really have a way to identify during unmap that we must
> not adjust it either. A page->type will be able to sort that out (while
> still allowing to refcount them for now).

Yeah. I think not being able to differentiate between different cases is the
problem here...

>
> Actually, the whole thing is on my todo list, because messing with the RMAP
> with vm_insert_pages() doesn't make any sense (whereby the refcount might!).
> See the TODO I added in __folio_rmap_sanity_checks().

How long is your TODO list I wonder? :)) I imagine it requires a huge page to
map at this point..

>
> --
> Cheers,
>
> David / dhildenb
>

Cool will respin and send out shortly with an appropriately 'forgive us father
for we have sinned' comment.

Thanks for taking the time to discuss this! Much appreciated.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 14:35                                   ` Lorenzo Stoakes
@ 2024-11-29 14:48                                     ` David Hildenbrand
  2024-11-29 20:42                                       ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-11-29 14:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, linux-mm, Matthew Wilcox

On 29.11.24 15:35, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 03:09:49PM +0100, David Hildenbrand wrote:
>>>>> I just tested it and write() works fine, it uses uaccess afaict as part of the
>>>>> lib/iov_iter.c code:
>>>>>
>>>>> generic_perform_write()
>>>>> -> copy_folio_from_iter_atomic()
>>>>> -> copy_page_from_iter_atomic()
>>>>> -> __copy_from_iter()
>>>>> -> copy_from_user_iter()
>>>>> -> raw_copy_from_user()
>>>>> -> copy_user_generic()
>>>>> -> [uaccess asm]
>>>>>
>>>>
>>>> Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
>>>>
>>>> Ptrace access and friends should also no longer work on these pages, likely
>>>> that's tolerable.
>>>
>>> Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
>>> to be able to ptrace perf counter mappings in another process (it'd be kind of
>>> totally insane to do that anyway since it's a ring buffer that needs special
>>> handing anyway).
>>
>> I think so as well. Disallowing GUP has some side effects, like not getting
>> these pages included in a coredump etc ... at least I hope nobody uses
>> O_DIRECT on them.
> 
> We set VM_DONTDUMP anyway (set by remap_pfn_range() also) so this part won't be
> a problem, and I can't see anybody using O_DIRECT on them, sensibly.

Ah, even better.

>>
>> Actually, the whole thing is on my todo list, because messing with the RMAP
>> with vm_insert_pages() doesn't make any sense (whereby the refcount might!).
>> See the TODO I added in __folio_rmap_sanity_checks().
> 
> How long is your TODO list I wonder? :)) I imagine it requires a huge page to
> map at this point..

:D

Too long, but as some of these TODOs stand in the memdesc way, 
fortunately other people might be able to give a helping hand at some 
point ;)

I'll play with using a page type for some of these "simple" cases and 
see how hard it will get.

> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Cool will respin and send out shortly with an appropriately 'forgive us father
> for we have sinned' comment.
> 
> Thanks for taking the time to discuss this! Much appreciated.

Thanks for bearing with me. Whenever PFNMAP/MIXEDMAP is involved it gets 
tricky, and as soon as PFNMAP/MIXEDMAP is inevitable, things get ugly. :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] perf: map pages in advance
  2024-11-29 14:48                                     ` David Hildenbrand
@ 2024-11-29 20:42                                       ` Matthew Wilcox
  0 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2024-11-29 20:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, linux-perf-users, linux-kernel, linux-mm

On Fri, Nov 29, 2024 at 03:48:33PM +0100, David Hildenbrand wrote:
> Too long, but as some of these TODOs stand in the memdesc way, fortunately
> other people might be able to give a helping hand at some point ;)

;-)

> I'll play with using a page type for some of these "simple" cases and see
> how hard it will get.

I've deliberately made pagetype incompatible with setting a mapcount ...
hugetlb being the exception because we only use entire_mapcount.

Anyway, I agree we need something better than what we have here.
It needs to be simple for a device driver to allocate memory and
map it into userspace.  I'm willing to allow perf's ringbuffer to be
a bit harder than most because it's trying to do the weird thing of
writable-first-page-rest-read-only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-11-29 20:42 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-28 11:37 [PATCH] perf: map pages in advance Lorenzo Stoakes
2024-11-28 13:08 ` David Hildenbrand
2024-11-28 13:20   ` Lorenzo Stoakes
2024-11-28 13:37     ` David Hildenbrand
2024-11-28 14:23       ` Lorenzo Stoakes
2024-11-29 12:12         ` David Hildenbrand
2024-11-29 12:26           ` Peter Zijlstra
2024-11-29 12:45             ` David Hildenbrand
2024-11-29 12:55               ` Lorenzo Stoakes
2024-11-29 12:59                 ` David Hildenbrand
2024-11-29 13:02                   ` Lorenzo Stoakes
2024-11-29 13:12                     ` David Hildenbrand
2024-11-29 13:19                       ` Lorenzo Stoakes
2024-11-29 13:24                         ` David Hildenbrand
2024-11-29 13:38                           ` Lorenzo Stoakes
2024-11-29 13:47                             ` David Hildenbrand
2024-11-29 13:59                               ` Lorenzo Stoakes
2024-11-29 14:09                                 ` David Hildenbrand
2024-11-29 14:35                                   ` Lorenzo Stoakes
2024-11-29 14:48                                     ` David Hildenbrand
2024-11-29 20:42                                       ` Matthew Wilcox
2024-11-29 12:48           ` Lorenzo Stoakes
2024-11-29 13:11             ` David Hildenbrand
2024-11-28 20:47 ` Lorenzo Stoakes
2024-11-29  8:52   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox