* [PATCH v2] perf: map pages in advance
@ 2024-11-29 15:31 Lorenzo Stoakes
2024-11-29 16:03 ` Matthew Wilcox
2024-12-05 2:16 ` Lai, Yi
0 siblings, 2 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 15:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
linux-mm, Matthew Wilcox, David Hildenbrand
We are currently refactoring struct page to make it smaller, removing
unneeded fields that correctly belong to struct folio.
Two of those fields are page->index and page->mapping. Perf is currently
making use of both of these, so this patch removes that usage, as it turns
out to be unnecessary.
Perf establishes its own internally controlled memory-mapped pages using
vm_ops hooks. The first page in the mapping is the read/write user control
page, and the rest of the mapping consists of read-only pages.
The VMA is backed by kernel memory either from the buddy allocator or
vmalloc depending on configuration. It is intended to be mapped read/write,
but because it has a page_mkwrite() hook, vma_wants_writenotify() indicates
that it should be mapped read-only.
When a write fault occurs, the provided page_mkwrite() hook,
perf_mmap_fault() (doing double duty handling faults as well), uses the
vmf->pgoff field to determine if this is the first page, allowing for the
desired read/write first page, read-only rest mapping.
For this to work, the implementation has to carefully work around faulting
logic. When a page is write-faulted, the fault() hook is called first, then
its page_mkwrite() hook is called (to allow for dirty tracking in file
systems).
On fault, we set the folio's mapping in perf_mmap_fault(); this is because
when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
as an indicator that the fault should be retried.
We also set the folio's index so that, given the folio is being treated as
faux user memory, it correctly references its offset within the VMA.
This explains why the mapping and index fields are used - but their use is
not actually necessary.
We preallocate pages when perf_mmap() is called for the first time via
rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
needed if the mapping requires it.
This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
so we can instead simply map all the memory right away here - there is no
point in handling (read) page faults when we neither demand-page nor need
to be notified about them.
This patch therefore changes this logic to map everything when the mmap()
hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
to provide the required read/write vs. read-only behaviour, which does not
require the previously implemented workarounds.
While it is not ideal to use a VM_PFNMAP here, doing anything else would
require the page_mkwrite() hook to be provided, which requires the
same page->mapping hack this patch seeks to undo.
It will also result in the pages being treated as folios and placed on the
rmap, which really does not make sense for these mappings.
Semantically it makes sense to establish this as some kind of special
mapping, as the pages are managed by perf and are not strictly user pages,
but currently the only means by which we can do so functionally while
maintaining the required R/W and R/O behaviour is a PFN map.
There should be no change to actual functionality as a result of this
change.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
v2:
* nommu fixup.
* Add comment explaining why we are using a VM_PFNMAP as suggested by
David H.
v1:
https://lore.kernel.org/all/20241128113714.492474-1-lorenzo.stoakes@oracle.com/
kernel/events/core.c | 116 ++++++++++++++++++++++++------------
kernel/events/ring_buffer.c | 19 +-----
2 files changed, 80 insertions(+), 55 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5d4a54f50826..1bb5999d9d81 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6284,41 +6284,6 @@ void perf_event_update_userpage(struct perf_event *event)
}
EXPORT_SYMBOL_GPL(perf_event_update_userpage);
-static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
-{
- struct perf_event *event = vmf->vma->vm_file->private_data;
- struct perf_buffer *rb;
- vm_fault_t ret = VM_FAULT_SIGBUS;
-
- if (vmf->flags & FAULT_FLAG_MKWRITE) {
- if (vmf->pgoff == 0)
- ret = 0;
- return ret;
- }
-
- rcu_read_lock();
- rb = rcu_dereference(event->rb);
- if (!rb)
- goto unlock;
-
- if (vmf->pgoff && (vmf->flags & FAULT_FLAG_WRITE))
- goto unlock;
-
- vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
- if (!vmf->page)
- goto unlock;
-
- get_page(vmf->page);
- vmf->page->mapping = vmf->vma->vm_file->f_mapping;
- vmf->page->index = vmf->pgoff;
-
- ret = 0;
-unlock:
- rcu_read_unlock();
-
- return ret;
-}
-
static void ring_buffer_attach(struct perf_event *event,
struct perf_buffer *rb)
{
@@ -6558,13 +6523,87 @@ static void perf_mmap_close(struct vm_area_struct *vma)
ring_buffer_put(rb); /* could be last */
}
+static vm_fault_t perf_mmap_pfn_mkwrite(struct vm_fault *vmf)
+{
+ /* The first page is the user control page, others are read-only. */
+ return vmf->pgoff == 0 ? 0 : VM_FAULT_SIGBUS;
+}
+
static const struct vm_operations_struct perf_mmap_vmops = {
.open = perf_mmap_open,
.close = perf_mmap_close, /* non mergeable */
- .fault = perf_mmap_fault,
- .page_mkwrite = perf_mmap_fault,
+ .pfn_mkwrite = perf_mmap_pfn_mkwrite,
};
+static int map_range(struct perf_buffer *rb, struct vm_area_struct *vma)
+{
+ unsigned long nr_pages = vma_pages(vma);
+ int err = 0;
+ unsigned long pgoff;
+
+ /*
+ * We map this as a VM_PFNMAP VMA.
+ *
+ * This is not ideal as this is designed broadly for mappings of PFNs
+ * referencing memory-mapped I/O ranges or non-system RAM i.e. for which
+ * !pfn_valid(pfn).
+ *
+ * We are mapping kernel-allocated memory (memory we manage ourselves)
+ * which would more ideally be mapped using vm_insert_page() or a
+ * similar mechanism, that is as a VM_MIXEDMAP mapping.
+ *
+ * However this won't work here, because:
+ *
+ * 1. It uses vma->vm_page_prot, but this field has not been completely
+ * setup at the point of the f_op->mmap() hook, so we are unable to
+ * indicate that this should be mapped CoW in order that the
+ * mkwrite() hook can be invoked to make the first page R/W and the
+ * rest R/O as desired.
+ *
+ * 2. Anything other than a VM_PFNMAP of valid PFNs will result in
+ * vm_normal_page() returning a struct page * pointer, which means
+ * vm_ops->page_mkwrite() will be invoked rather than
+ * vm_ops->pfn_mkwrite(), and this means we have to set page->mapping
+ * to work around retry logic in the fault handler, however this
+ * field is no longer allowed to be used within struct page.
+ *
+ * 3. Having a struct page * made available in the fault logic also
+ * means that the page gets put on the rmap and becomes
+ * inappropriately accessible and subject to map and ref counting.
+ *
+ * Ideally we would have a mechanism that could explicitly express our
+ * desires, but this is not currently the case, so we instead use
+ * VM_PFNMAP.
+ *
+ * We manage the lifetime of these mappings with internal refcounts (see
+ * perf_mmap_open() and perf_mmap_close()) so we ensure the lifetime of
+ * this mapping is maintained correctly.
+ */
+ for (pgoff = 0; pgoff < nr_pages; pgoff++) {
+ unsigned long va = vma->vm_start + PAGE_SIZE * pgoff;
+ struct page *page = perf_mmap_to_page(rb, pgoff);
+
+ if (page == NULL) {
+ err = -EINVAL;
+ break;
+ }
+
+ /* Map readonly, perf_mmap_pfn_mkwrite() called on write fault. */
+ err = remap_pfn_range(vma, va, page_to_pfn(page), PAGE_SIZE,
+ vm_get_page_prot(vma->vm_flags & ~VM_SHARED));
+ if (err)
+ break;
+ }
+
+#ifdef CONFIG_MMU
+ /* Clear any partial mappings on error. */
+ if (err)
+ zap_page_range_single(vma, vma->vm_start, nr_pages * PAGE_SIZE, NULL);
+#endif
+
+ return err;
+}
+
static int perf_mmap(struct file *file, struct vm_area_struct *vma)
{
struct perf_event *event = file->private_data;
@@ -6783,6 +6822,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
vma->vm_ops = &perf_mmap_vmops;
+ if (!ret)
+ ret = map_range(rb, vma);
+
if (event->pmu->event_mapped)
event->pmu->event_mapped(event, vma->vm_mm);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4f46f688d0d4..180509132d4b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -643,7 +643,6 @@ static void rb_free_aux_page(struct perf_buffer *rb, int idx)
struct page *page = virt_to_page(rb->aux_pages[idx]);
ClearPagePrivate(page);
- page->mapping = NULL;
__free_page(page);
}
@@ -819,7 +818,6 @@ static void perf_mmap_free_page(void *addr)
{
struct page *page = virt_to_page(addr);
- page->mapping = NULL;
__free_page(page);
}
@@ -890,28 +888,13 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
return vmalloc_to_page((void *)rb->user_page + pgoff * PAGE_SIZE);
}
-static void perf_mmap_unmark_page(void *addr)
-{
- struct page *page = vmalloc_to_page(addr);
-
- page->mapping = NULL;
-}
-
static void rb_free_work(struct work_struct *work)
{
struct perf_buffer *rb;
- void *base;
- int i, nr;
rb = container_of(work, struct perf_buffer, work);
- nr = data_page_nr(rb);
-
- base = rb->user_page;
- /* The '<=' counts in the user page. */
- for (i = 0; i <= nr; i++)
- perf_mmap_unmark_page(base + (i * PAGE_SIZE));
- vfree(base);
+ vfree(rb->user_page);
kfree(rb);
}
--
2.47.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf: map pages in advance
2024-11-29 15:31 [PATCH v2] perf: map pages in advance Lorenzo Stoakes
@ 2024-11-29 16:03 ` Matthew Wilcox
2024-11-29 16:27 ` Lorenzo Stoakes
2024-12-05 2:16 ` Lai, Yi
1 sibling, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2024-11-29 16:03 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
linux-kernel, linux-mm, David Hildenbrand
On Fri, Nov 29, 2024 at 03:31:34PM +0000, Lorenzo Stoakes wrote:
> While it is not ideal to use a VM_PFNMAP here, doing anything else will
> result in the page_mkwrite() hook need to be provided, which requires the
> same page->mapping hack this patch seeks to undo.
Or we could map the first page read-write to begin with ... ?
Or we could implement a page_mkwrite handler, but do the permissions
change in the driver and return VM_FAULT_NOPAGE.
I don't think we should do the second option in this driver, but
mentioning it because we're going to need to have A Talk about fb_defio.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf: map pages in advance
2024-11-29 16:03 ` Matthew Wilcox
@ 2024-11-29 16:27 ` Lorenzo Stoakes
0 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 16:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
linux-kernel, linux-mm, David Hildenbrand
On Fri, Nov 29, 2024 at 04:03:44PM +0000, Matthew Wilcox wrote:
> On Fri, Nov 29, 2024 at 03:31:34PM +0000, Lorenzo Stoakes wrote:
> > While it is not ideal to use a VM_PFNMAP here, doing anything else will
> > result in the page_mkwrite() hook need to be provided, which requires the
> > same page->mapping hack this patch seeks to undo.
>
> Or we could map the first page read-write to begin with ... ?
This would sort of be a rejig, as then you'd just make pfn_mkwrite() always fail
and be doing essentially the same thing, only varying what prot you set on map
(you can't just set the other pages read-only in prot and not provide
pfn_mkwrite(); if you do, the kernel will simply make them writable on write
fault in wp_page_reuse() :)
> Or we could implement a page_mkwrite handler, but do the permissions
> change in the driver and return VM_FAULT_NOPAGE.
Yeah interesting, that's possible, but wouldn't we need to be careful about TLB
etc. in that case since the mapping will have already been established?
I suppose we could zap then set, but that's icky...
Even if you don't pre-map this will be the case as it faults in first before
invoking page_mkwrite()...
>
> I don't think we should do the second option in this driver, but
> mentioning it because we're going to need to have A Talk about fb_defio.
>
Yes I think we're good with the current solution for perf, despite its warty
nature it works (TM).
I fear that this Talk may involve much gnashing of teeth...
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf: map pages in advance
2024-11-29 15:31 [PATCH v2] perf: map pages in advance Lorenzo Stoakes
2024-11-29 16:03 ` Matthew Wilcox
@ 2024-12-05 2:16 ` Lai, Yi
2024-12-05 7:09 ` Lorenzo Stoakes
1 sibling, 1 reply; 6+ messages in thread
From: Lai, Yi @ 2024-12-05 2:16 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
linux-kernel, linux-mm, Matthew Wilcox, David Hildenbrand,
yi1.lai
On Fri, Nov 29, 2024 at 03:31:34PM +0000, Lorenzo Stoakes wrote:
[snip]
Hi Lorenzo Stoakes,
Greetings!
I used Syzkaller and found that there is a general protection fault in perf_mmap_to_page() in linux-next next-20241203.
After bisection and the first bad commit is:
"
eca51ce01d49 perf: Map pages in advance
"
All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241204_084442_perf_mmap_to_page/bzImage_c245a7a79602ccbee780c004c1e4abcda66aec32
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241204_084442_perf_mmap_to_page/c245a7a79602ccbee780c004c1e4abcda66aec32_dmesg.log
"
[ 22.133358] KASAN: null-ptr-deref in range [0x0000000000000178-0x000000000000017f]
[ 22.133907] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.13.0-rc1-next-20241203-c245a7a79602 #1
[ 22.134557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 22.135371] RIP: 0010:perf_mmap_to_page+0x39/0x500
[ 22.135763] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
[ 22.137075] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
[ 22.137465] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[ 22.137980] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
[ 22.138495] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
[ 22.139012] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 22.139530] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
[ 22.140047] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
[ 22.140630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.141052] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
[ 22.141570] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 22.142088] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 22.142606] PKRU: 55555554
[ 22.142815] Call Trace:
[ 22.143005] <TASK>
[ 22.143173] ? show_regs+0x6d/0x80
[ 22.143455] ? die_addr+0x45/0xb0
[ 22.143720] ? exc_general_protection+0x1ae/0x340
[ 22.144102] ? asm_exc_general_protection+0x2b/0x30
[ 22.144486] ? perf_mmap_to_page+0x21/0x500
[ 22.144810] ? perf_mmap_to_page+0x39/0x500
[ 22.145130] ? perf_mmap_to_page+0x21/0x500
[ 22.145448] perf_mmap+0xbd9/0x1ce0
[ 22.145729] __mmap_region+0x10e7/0x25a0
[ 22.146038] ? __pfx___mmap_region+0x10/0x10
[ 22.146376] ? mark_lock.part.0+0xf3/0x17b0
[ 22.146712] ? __pfx_mark_lock.part.0+0x10/0x10
[ 22.147071] ? __kasan_check_read+0x15/0x20
[ 22.147403] ? mark_lock.part.0+0xf3/0x17b0
[ 22.147744] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 22.148162] ? trace_cap_capable+0x78/0x1e0
[ 22.148500] ? cap_capable+0xa4/0x250
[ 22.148792] mmap_region+0x248/0x2f0
[ 22.149086] do_mmap+0xb29/0x12a0
[ 22.149355] ? __pfx_do_mmap+0x10/0x10
[ 22.149651] ? __pfx_down_write_killable+0x10/0x10
[ 22.150027] ? __this_cpu_preempt_check+0x21/0x30
[ 22.150393] vm_mmap_pgoff+0x235/0x3e0
[ 22.150699] ? __pfx_vm_mmap_pgoff+0x10/0x10
[ 22.151037] ? __fget_files+0x1fb/0x3a0
[ 22.151352] ksys_mmap_pgoff+0x3dc/0x520
[ 22.151664] __x64_sys_mmap+0x139/0x1d0
[ 22.151975] x64_sys_call+0x2001/0x2140
[ 22.152283] do_syscall_64+0x6d/0x140
[ 22.152572] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 22.152960] RIP: 0033:0x7f7f57c3ee5d
[ 22.153251] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 22.154593] RSP: 002b:00007ffd805489f8 EFLAGS: 00000212 ORIG_RAX: 0000000000000009
[ 22.155156] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7f57c3ee5d
[ 22.155683] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 0000000020002000
[ 22.156210] RBP: 00007ffd80548a20 R08: 0000000000000003 R09: 0000000000000000
[ 22.156739] R10: 0000000000006053 R11: 0000000000000212 R12: 00007ffd80548b38
[ 22.157263] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007f7f57f77000
[ 22.157799] </TASK>
[ 22.157975] Modules linked in:
[ 22.158322] ---[ end trace 0000000000000000 ]---
[ 22.158694] RIP: 0010:perf_mmap_to_page+0x39/0x500
[ 22.159061] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
[ 22.160388] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
[ 22.160782] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[ 22.161304] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
[ 22.161824] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
[ 22.162344] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 22.162877] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
[ 22.163403] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
[ 22.163988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.164417] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
[ 22.165409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 22.165956] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 22.166918] PKRU: 55555554
"
I hope you find it useful.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost
After logging in to the vm (virtual machine) successfully, you can transfer the
reproduced binary to the vm as below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf: map pages in advance
2024-12-05 2:16 ` Lai, Yi
@ 2024-12-05 7:09 ` Lorenzo Stoakes
2024-12-05 7:48 ` Lorenzo Stoakes
0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes @ 2024-12-05 7:09 UTC (permalink / raw)
To: Lai, Yi
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
linux-kernel, linux-mm, Matthew Wilcox, David Hildenbrand,
yi1.lai
On Thu, Dec 05, 2024 at 10:16:51AM +0800, Lai, Yi wrote:
[snip]
> > --
> > 2.47.1
>
> Hi Lorenzo Stoakes,
>
> Greetings!
>
> I used Syzkaller and found that there is general protection fault in perf_mmap_to_page in linux-next next-20241203.
>
> After bisection and the first bad commit is:
> "
> eca51ce01d49 perf: Map pages in advance
Thanks, I can repro locally.
Will investigate and address as a priority.
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241204_084442_perf_mmap_to_page/bzImage_c245a7a79602ccbee780c004c1e4abcda66aec32
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/241204_084442_perf_mmap_to_page/c245a7a79602ccbee780c004c1e4abcda66aec32_dmesg.log
>
> "
> [ 22.133358] KASAN: null-ptr-deref in range [0x0000000000000178-0x000000000000017f]
> [ 22.133907] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.13.0-rc1-next-20241203-c245a7a79602 #1
> [ 22.134557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 22.135371] RIP: 0010:perf_mmap_to_page+0x39/0x500
> [ 22.135763] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
> [ 22.137075] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
> [ 22.137465] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
> [ 22.137980] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
> [ 22.138495] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
> [ 22.139012] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> [ 22.139530] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
> [ 22.140047] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
> [ 22.140630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 22.141052] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
> [ 22.141570] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 22.142088] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 22.142606] PKRU: 55555554
> [ 22.142815] Call Trace:
> [ 22.143005] <TASK>
> [ 22.143173] ? show_regs+0x6d/0x80
> [ 22.143455] ? die_addr+0x45/0xb0
> [ 22.143720] ? exc_general_protection+0x1ae/0x340
> [ 22.144102] ? asm_exc_general_protection+0x2b/0x30
> [ 22.144486] ? perf_mmap_to_page+0x21/0x500
> [ 22.144810] ? perf_mmap_to_page+0x39/0x500
> [ 22.145130] ? perf_mmap_to_page+0x21/0x500
> [ 22.145448] perf_mmap+0xbd9/0x1ce0
> [ 22.145729] __mmap_region+0x10e7/0x25a0
> [ 22.146038] ? __pfx___mmap_region+0x10/0x10
> [ 22.146376] ? mark_lock.part.0+0xf3/0x17b0
> [ 22.146712] ? __pfx_mark_lock.part.0+0x10/0x10
> [ 22.147071] ? __kasan_check_read+0x15/0x20
> [ 22.147403] ? mark_lock.part.0+0xf3/0x17b0
> [ 22.147744] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> [ 22.148162] ? trace_cap_capable+0x78/0x1e0
> [ 22.148500] ? cap_capable+0xa4/0x250
> [ 22.148792] mmap_region+0x248/0x2f0
> [ 22.149086] do_mmap+0xb29/0x12a0
> [ 22.149355] ? __pfx_do_mmap+0x10/0x10
> [ 22.149651] ? __pfx_down_write_killable+0x10/0x10
> [ 22.150027] ? __this_cpu_preempt_check+0x21/0x30
> [ 22.150393] vm_mmap_pgoff+0x235/0x3e0
> [ 22.150699] ? __pfx_vm_mmap_pgoff+0x10/0x10
> [ 22.151037] ? __fget_files+0x1fb/0x3a0
> [ 22.151352] ksys_mmap_pgoff+0x3dc/0x520
> [ 22.151664] __x64_sys_mmap+0x139/0x1d0
> [ 22.151975] x64_sys_call+0x2001/0x2140
> [ 22.152283] do_syscall_64+0x6d/0x140
> [ 22.152572] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 22.152960] RIP: 0033:0x7f7f57c3ee5d
> [ 22.153251] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [ 22.154593] RSP: 002b:00007ffd805489f8 EFLAGS: 00000212 ORIG_RAX: 0000000000000009
> [ 22.155156] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7f57c3ee5d
> [ 22.155683] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 0000000020002000
> [ 22.156210] RBP: 00007ffd80548a20 R08: 0000000000000003 R09: 0000000000000000
> [ 22.156739] R10: 0000000000006053 R11: 0000000000000212 R12: 00007ffd80548b38
> [ 22.157263] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007f7f57f77000
> [ 22.157799] </TASK>
> [ 22.157975] Modules linked in:
> [ 22.158322] ---[ end trace 0000000000000000 ]---
> [ 22.158694] RIP: 0010:perf_mmap_to_page+0x39/0x500
> [ 22.159061] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
> [ 22.160388] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
> [ 22.160782] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
> [ 22.161304] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
> [ 22.161824] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
> [ 22.162344] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> [ 22.162877] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
> [ 22.163403] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
> [ 22.163988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 22.164417] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
> [ 22.165409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 22.165956] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 22.166918] PKRU: 55555554
> "
>
> I hope you find it useful.
>
> Regards,
> Yi Lai
>
> ---
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
> // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> // You could change the bzImage_xxx as you want
> // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
>
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>
* Re: [PATCH v2] perf: map pages in advance
2024-12-05 7:09 ` Lorenzo Stoakes
@ 2024-12-05 7:48 ` Lorenzo Stoakes
0 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2024-12-05 7:48 UTC (permalink / raw)
To: Lai, Yi
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
linux-kernel, linux-mm, Matthew Wilcox, David Hildenbrand,
yi1.lai
On Thu, Dec 05, 2024 at 07:09:18AM +0000, Lorenzo Stoakes wrote:
> On Thu, Dec 05, 2024 at 10:16:51AM +0800, Lai, Yi wrote:
> [snip]
> > > --
> > > 2.47.1
> >
> > Hi Lorenzo Stoakes,
> >
> > Greetings!
> >
> > I used Syzkaller and found that there is general protection fault in perf_mmap_to_page in linux-next next-20241203.
> >
> > After bisection and the first bad commit is:
> > "
> > eca51ce01d49 perf: Map pages in advance
>
> Thanks, I can repro locally.
>
> Will investigate and address as a priority.
So this is because the 'rb' local variable is being used for something
other than just being a handy alias for event->rb, for some reason...
There is a specific code path where ret == 0 _and_ rb == NULL, which
obviously means we hit a NULL pointer dereference.
Specifically in this case it's because vma->vm_pgoff == 0, so we don't
assign rb, then after the again label it's:
again:
mutex_lock(&event->mmap_mutex);
if (event->rb) {
if (...) {
...
}
if (...) {
...
}
goto unlock; <- ret == 0, rb == NULL
}
I had assumed, clearly mistakenly, that we would not reach the end of the
function with ret == 0 _and_ rb == NULL, since we need rb, or rather
event->rb to do anything.
I think the correct thing to do here, since we assign rb with rb_alloc()
but do an RCU assignment to event->rb for some reason (possibly no longer
necessary, actually), is to just add an rb = event->rb assignment here.
Will send a v3.
Thanks!
>
> > "
> >
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241204_084442_perf_mmap_to_page/bzImage_c245a7a79602ccbee780c004c1e4abcda66aec32
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/241204_084442_perf_mmap_to_page/c245a7a79602ccbee780c004c1e4abcda66aec32_dmesg.log
> >
> > "
> > [ 22.133358] KASAN: null-ptr-deref in range [0x0000000000000178-0x000000000000017f]
> > [ 22.133907] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.13.0-rc1-next-20241203-c245a7a79602 #1
> > [ 22.134557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [ 22.135371] RIP: 0010:perf_mmap_to_page+0x39/0x500
> > [ 22.135763] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
> > [ 22.137075] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
> > [ 22.137465] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
> > [ 22.137980] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
> > [ 22.138495] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
> > [ 22.139012] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> > [ 22.139530] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
> > [ 22.140047] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
> > [ 22.140630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 22.141052] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
> > [ 22.141570] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 22.142088] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [ 22.142606] PKRU: 55555554
> > [ 22.142815] Call Trace:
> > [ 22.143005] <TASK>
> > [ 22.143173] ? show_regs+0x6d/0x80
> > [ 22.143455] ? die_addr+0x45/0xb0
> > [ 22.143720] ? exc_general_protection+0x1ae/0x340
> > [ 22.144102] ? asm_exc_general_protection+0x2b/0x30
> > [ 22.144486] ? perf_mmap_to_page+0x21/0x500
> > [ 22.144810] ? perf_mmap_to_page+0x39/0x500
> > [ 22.145130] ? perf_mmap_to_page+0x21/0x500
> > [ 22.145448] perf_mmap+0xbd9/0x1ce0
> > [ 22.145729] __mmap_region+0x10e7/0x25a0
> > [ 22.146038] ? __pfx___mmap_region+0x10/0x10
> > [ 22.146376] ? mark_lock.part.0+0xf3/0x17b0
> > [ 22.146712] ? __pfx_mark_lock.part.0+0x10/0x10
> > [ 22.147071] ? __kasan_check_read+0x15/0x20
> > [ 22.147403] ? mark_lock.part.0+0xf3/0x17b0
> > [ 22.147744] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
> > [ 22.148162] ? trace_cap_capable+0x78/0x1e0
> > [ 22.148500] ? cap_capable+0xa4/0x250
> > [ 22.148792] mmap_region+0x248/0x2f0
> > [ 22.149086] do_mmap+0xb29/0x12a0
> > [ 22.149355] ? __pfx_do_mmap+0x10/0x10
> > [ 22.149651] ? __pfx_down_write_killable+0x10/0x10
> > [ 22.150027] ? __this_cpu_preempt_check+0x21/0x30
> > [ 22.150393] vm_mmap_pgoff+0x235/0x3e0
> > [ 22.150699] ? __pfx_vm_mmap_pgoff+0x10/0x10
> > [ 22.151037] ? __fget_files+0x1fb/0x3a0
> > [ 22.151352] ksys_mmap_pgoff+0x3dc/0x520
> > [ 22.151664] __x64_sys_mmap+0x139/0x1d0
> > [ 22.151975] x64_sys_call+0x2001/0x2140
> > [ 22.152283] do_syscall_64+0x6d/0x140
> > [ 22.152572] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 22.152960] RIP: 0033:0x7f7f57c3ee5d
> > [ 22.153251] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> > [ 22.154593] RSP: 002b:00007ffd805489f8 EFLAGS: 00000212 ORIG_RAX: 0000000000000009
> > [ 22.155156] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7f57c3ee5d
> > [ 22.155683] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 0000000020002000
> > [ 22.156210] RBP: 00007ffd80548a20 R08: 0000000000000003 R09: 0000000000000000
> > [ 22.156739] R10: 0000000000006053 R11: 0000000000000212 R12: 00007ffd80548b38
> > [ 22.157263] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007f7f57f77000
> > [ 22.157799] </TASK>
> > [ 22.157975] Modules linked in:
> > [ 22.158322] ---[ end trace 0000000000000000 ]---
> > [ 22.158694] RIP: 0010:perf_mmap_to_page+0x39/0x500
> > [ 22.159061] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
> > [ 22.160388] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
> > [ 22.160782] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
> > [ 22.161304] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
> > [ 22.161824] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
> > [ 22.162344] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> > [ 22.162877] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
> > [ 22.163403] FS: 00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
> > [ 22.163988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 22.164417] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
> > [ 22.165409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 22.165956] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [ 22.166918] PKRU: 55555554
> > "
> >
> > I hope you find it useful.
> >
> > Regards,
> > Yi Lai
> >
> > ---
> >
> > If you don't need the following environment to reproduce the problem or if you
> > already have one reproduced environment, please ignore the following information.
> >
> > How to reproduce:
> > git clone https://gitlab.com/xupengfe/repro_vm_env.git
> > cd repro_vm_env
> > tar -xvf repro_vm_env.tar.gz
> > cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
> > // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> > // You could change the bzImage_xxx as you want
> > // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> > You could use below command to log in, there is no password for root.
> > ssh -p 10023 root@localhost
> >
> > After login vm(virtual machine) successfully, you could transfer reproduced
> > binary to the vm by below way, and reproduce the problem in vm:
> > gcc -pthread -o repro repro.c
> > scp -P 10023 repro root@localhost:/root/
> >
> > Get the bzImage for target kernel:
> > Please use target kconfig and copy it to kernel_src/.config
> > make olddefconfig
> > make -jx bzImage //x should equal or less than cpu num your pc has
> >
> > Fill the bzImage file into above start3.sh to load the target kernel in vm.
> >
> >
> > Tips:
> > If you already have qemu-system-x86_64, please ignore below info.
> > If you want to install qemu v7.1.0 version:
> > git clone https://github.com/qemu/qemu.git
> > cd qemu
> > git checkout -f v7.1.0
> > mkdir build
> > cd build
> > yum install -y ninja-build.x86_64
> > yum -y install libslirp-devel.x86_64
> > ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> > make
> > make install
> >
end of thread, other threads:[~2024-12-05 7:48 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-29 15:31 [PATCH v2] perf: map pages in advance Lorenzo Stoakes
2024-11-29 16:03 ` Matthew Wilcox
2024-11-29 16:27 ` Lorenzo Stoakes
2024-12-05 2:16 ` Lai, Yi
2024-12-05 7:09 ` Lorenzo Stoakes
2024-12-05 7:48 ` Lorenzo Stoakes