* [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area
@ 2024-03-05 3:05 Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range Alexei Starovoitov
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-05 3:05 UTC (permalink / raw)
To: bpf
Cc: daniel, andrii, torvalds, brho, hannes, lstoakes, akpm, urezki,
hch, rppt, boris.ostrovsky, sstabellini, jgross, linux-mm,
xen-devel, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
v3 -> v4
- dropped the VM_XEN patch for now; it will be sent as a follow-up.
- fixed the constant, as pointed out by Mike
v2 -> v3
- added Christoph's reviewed-by to patch 1
- capped commit log lines at 75 chars
- factored out common checks in patch 3 into helper
- made vm_area_unmap_pages() return void
There are various users of kernel virtual address space:
vmalloc, vmap, ioremap, xen.
- the vmalloc use case dominates. Such vm areas have the VM_ALLOC flag
and are treated differently by KASAN.
- areas created by the vmap() function should be tagged with VM_MAP
(as the majority of users do).
- ioremap areas are tagged with VM_IOREMAP and, unlike vmalloc/vmap,
the vm area start is aligned to the size of the area.
- there is also a xen usage that is marked as VM_IOREMAP, but unlike
all other VM_IOREMAP users it doesn't call ioremap_page_range().
To clean this up a bit, make ioremap_page_range() check the range and
the VM_IOREMAP flag.
In addition, BPF would like to reserve regions of kernel virtual address
space and populate them lazily, similar to the xen use cases.
For that reason, introduce the VM_SPARSE flag and vm_area_[un]map_pages()
helpers to populate such sparse areas.
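For illustration, the intended usage pattern (condensed from the example
in patch 2; area_size, start, end and pages are placeholders) is:

  area = get_vm_area(area_size, VM_SPARSE);
  ...
  /* populate page-aligned chunks on demand */
  err = vm_area_map_pages(area, start, end, pages);
  ...
  vm_area_unmap_pages(area, start, end);
  ...
  free_vm_area(area);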
In the end, /proc/vmallocinfo will show
"vmalloc"
"vmap"
"ioremap"
"sparse"
categories for the different kinds of address regions.
ioremap and sparse areas will read back as zeros when dumped through
/proc/kcore.
Alexei Starovoitov (2):
mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
include/linux/vmalloc.h | 5 +++
mm/vmalloc.c | 72 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 75 insertions(+), 2 deletions(-)
--
2.43.0
* [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
2024-03-05 3:05 [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area Alexei Starovoitov
@ 2024-03-05 3:05 ` Alexei Starovoitov
[not found] ` <CGME20240308171422eucas1p293895be469655aa618535cf199b0c43a@eucas1p2.samsung.com>
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
2024-03-06 18:30 ` [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area patchwork-bot+netdevbpf
2 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-05 3:05 UTC (permalink / raw)
To: bpf
Cc: daniel, andrii, torvalds, brho, hannes, lstoakes, akpm, urezki,
hch, rppt, boris.ostrovsky, sstabellini, jgross, linux-mm,
xen-devel, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
There are various users of the get_vm_area() + ioremap_page_range() APIs.
Enforce that the vm_area was requested from get_vm_area() with the
VM_IOREMAP type and that the range passed to ioremap_page_range() matches
the created vm_area, to avoid accidentally ioremap-ing into the wrong
address range.
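For reference, a minimal sketch of the calling pattern these checks
expect (size, phys_addr and prot are placeholders for whatever the
caller actually uses):

	struct vm_struct *area;
	unsigned long start, end;

	area = get_vm_area(size, VM_IOREMAP);	/* must carry VM_IOREMAP */
	if (!area)
		return -ENOMEM;

	start = (unsigned long)area->addr;
	end = start + get_vm_area_size(area);	/* must cover the whole area */

	err = ioremap_page_range(start, end, phys_addr, prot);
	if (err)
		free_vm_area(area);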
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/vmalloc.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..f42f98a127d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
{
+ struct vm_struct *area;
int err;
+ area = find_vm_area((void *)addr);
+ if (!area || !(area->flags & VM_IOREMAP)) {
+ WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr);
+ return -EINVAL;
+ }
+ if (addr != (unsigned long)area->addr ||
+ (void *)end != area->addr + get_vm_area_size(area)) {
+ WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n",
+ addr, end, (long)area->addr,
+ (long)area->addr + get_vm_area_size(area));
+ return -ERANGE;
+ }
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
ioremap_max_page_shift);
flush_cache_vmap(addr, end);
--
2.43.0
* [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-05 3:05 [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range Alexei Starovoitov
@ 2024-03-05 3:05 ` Alexei Starovoitov
2024-03-06 14:19 ` Christoph Hellwig
` (2 more replies)
2024-03-06 18:30 ` [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area patchwork-bot+netdevbpf
2 siblings, 3 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-05 3:05 UTC (permalink / raw)
To: bpf
Cc: daniel, andrii, torvalds, brho, hannes, lstoakes, akpm, urezki,
hch, rppt, boris.ostrovsky, sstabellini, jgross, linux-mm,
xen-devel, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
virtual space.
get_vm_area() with an appropriate flag is used to request an area of the
kernel address range. It's used for the vmalloc, vmap, ioremap, and xen
use cases.
- the vmalloc use case dominates the usage. Such vm areas have the VM_ALLOC flag.
- areas created by the vmap() function should be tagged with VM_MAP.
- ioremap areas are tagged with VM_IOREMAP.
BPF would like to extend the vmap API to implement a lazily populated,
sparse, yet contiguous kernel virtual space. Introduce the VM_SPARSE flag
and the vm_area_map_pages(area, start, end, pages) API to map a set
of pages within a given area.
It has the same sanity checks as vmap() does.
It also checks that the area was created by get_vm_area() with the
VM_SPARSE flag, which identifies such areas in /proc/vmallocinfo and
makes them read back as zero pages through /proc/kcore.
The next commits will introduce bpf_arena, which is a sparsely populated
shared memory region between a bpf program and a user space process. It
will map privately-managed pages into a sparse vm area with the
following steps:
// request virtual memory region during bpf prog verification
area = get_vm_area(area_size, VM_SPARSE);
// on demand
vm_area_map_pages(area, kaddr, kend, pages);
vm_area_unmap_pages(area, kaddr, kend);
// after bpf program is detached and unloaded
free_vm_area(area);
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/vmalloc.h | 5 ++++
mm/vmalloc.c | 59 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..0f72c85a377b 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -35,6 +35,7 @@ struct iov_iter; /* in uio.h */
#else
#define VM_DEFER_KMEMLEAK 0
#endif
+#define VM_SPARSE 0x00001000 /* sparse vm_area. not all pages are present. */
/* bits [20..32] reserved for arch specific ioremap internals */
@@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
}
#ifdef CONFIG_MMU
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages);
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end);
void vunmap_range(unsigned long addr, unsigned long end);
static inline void set_vm_flush_reset_perms(void *addr)
{
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f42f98a127d5..e5b8c70950bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
return err;
}
+static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
+ unsigned long end)
+{
+ might_sleep();
+ if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
+ return -EINVAL;
+ if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+ return -EINVAL;
+ if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
+ return -EINVAL;
+ if ((end - start) >> PAGE_SHIFT > totalram_pages())
+ return -E2BIG;
+ if (start < (unsigned long)area->addr ||
+ (void *)end > area->addr + get_vm_area_size(area))
+ return -ERANGE;
+ return 0;
+}
+
+/**
+ * vm_area_map_pages - map pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ * @pages: pages to map (always PAGE_SIZE pages)
+ */
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages)
+{
+ int err;
+
+ err = check_sparse_vm_area(area, start, end);
+ if (err)
+ return err;
+
+ return vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}
+
+/**
+ * vm_area_unmap_pages - unmap pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ */
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end)
+{
+ if (check_sparse_vm_area(area, start, end))
+ return;
+
+ vunmap_range(start, end);
+}
+
int is_vmalloc_or_module_addr(const void *x)
{
/*
@@ -3822,9 +3874,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
if (flags & VMAP_RAM)
copied = vmap_ram_vread_iter(iter, addr, n, flags);
- else if (!(vm && (vm->flags & VM_IOREMAP)))
+ else if (!(vm && (vm->flags & (VM_IOREMAP | VM_SPARSE))))
copied = aligned_vread_iter(iter, addr, n);
- else /* IOREMAP area is treated as memory hole */
+ else /* IOREMAP | SPARSE area is treated as memory hole */
copied = zero_iter(iter, n);
addr += copied;
@@ -4415,6 +4467,9 @@ static int s_show(struct seq_file *m, void *p)
if (v->flags & VM_IOREMAP)
seq_puts(m, " ioremap");
+ if (v->flags & VM_SPARSE)
+ seq_puts(m, " sparse");
+
if (v->flags & VM_ALLOC)
seq_puts(m, " vmalloc");
--
2.43.0
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
@ 2024-03-06 14:19 ` Christoph Hellwig
2024-03-06 17:10 ` Alexei Starovoitov
2024-03-06 21:03 ` Pasha Tatashin
2024-03-06 22:57 ` Pasha Tatashin
2 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-03-06 14:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, daniel, andrii, torvalds, brho, hannes, lstoakes, akpm,
urezki, hch, rppt, boris.ostrovsky, sstabellini, jgross,
linux-mm, xen-devel, kernel-team
I'd still prefer to hide the vm_area, but for now:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 14:19 ` Christoph Hellwig
@ 2024-03-06 17:10 ` Alexei Starovoitov
0 siblings, 0 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-06 17:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Mike Rapoport, Boris Ostrovsky, sstabellini,
Juergen Gross, linux-mm, xen-devel, Kernel Team
On Wed, Mar 6, 2024 at 6:19 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> I'd still prefer to hide the vm_area, but for now:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thank you.
I will think of a way to move get_vm_area() to mm/internal.h and
propose a plan by LSF/MM/BPF in May.
* Re: [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area
2024-03-05 3:05 [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
@ 2024-03-06 18:30 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-03-06 18:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, daniel, andrii, torvalds, brho, hannes, lstoakes, akpm,
urezki, hch, rppt, boris.ostrovsky, sstabellini, jgross,
linux-mm, xen-devel, kernel-team
Hello:
This series was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:
On Mon, 4 Mar 2024 19:05:14 -0800 you wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> v3 -> v4
> - dropped VM_XEN patch for now. It will be in the follow up.
> - fixed constant as pointed out by Mike
>
> v2 -> v3
> - added Christoph's reviewed-by to patch 1
> - cap commit log lines to 75 chars
> - factored out common checks in patch 3 into helper
> - made vm_area_unmap_pages() return void
>
> [...]
Here is the summary with links:
- [v4,bpf-next,1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
https://git.kernel.org/bpf/bpf-next/c/3e49a866c9dc
- [v4,bpf-next,2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
https://git.kernel.org/bpf/bpf-next/c/6b66b3a4ed5e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
2024-03-06 14:19 ` Christoph Hellwig
@ 2024-03-06 21:03 ` Pasha Tatashin
2024-03-06 21:28 ` Alexei Starovoitov
2024-03-06 22:57 ` Pasha Tatashin
2 siblings, 1 reply; 15+ messages in thread
From: Pasha Tatashin @ 2024-03-06 21:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, daniel, andrii, torvalds, brho, hannes, lstoakes, akpm,
urezki, hch, rppt, boris.ostrovsky, sstabellini, jgross,
linux-mm, xen-devel, kernel-team
On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> virtual space.
>
> get_vm_area() with appropriate flag is used to request an area of kernel
> address range. It's used for vmalloc, vmap, ioremap, xen use cases.
> - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
> - the areas created by vmap() function should be tagged with VM_MAP.
> - ioremap areas are tagged with VM_IOREMAP.
>
> BPF would like to extend the vmap API to implement a lazily-populated
> sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
> and vm_area_map_pages(area, start_addr, count, pages) API to map a set
> of pages within a given area.
> It has the same sanity checks as vmap() does.
> It also checks that get_vm_area() was created with VM_SPARSE flag
> which identifies such areas in /proc/vmallocinfo
> and returns zero pages on read through /proc/kcore.
>
> The next commits will introduce bpf_arena which is a sparsely populated
> shared memory region between bpf program and user space process. It will
> map privately-managed pages into a sparse vm area with the following steps:
>
> // request virtual memory region during bpf prog verification
> area = get_vm_area(area_size, VM_SPARSE);
>
> // on demand
> vm_area_map_pages(area, kaddr, kend, pages);
> vm_area_unmap_pages(area, kaddr, kend);
>
> // after bpf program is detached and unloaded
> free_vm_area(area);
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/vmalloc.h | 5 ++++
> mm/vmalloc.c | 59 +++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 62 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..0f72c85a377b 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -35,6 +35,7 @@ struct iov_iter; /* in uio.h */
> #else
> #define VM_DEFER_KMEMLEAK 0
> #endif
> +#define VM_SPARSE 0x00001000 /* sparse vm_area. not all pages are present. */
>
> /* bits [20..32] reserved for arch specific ioremap internals */
>
> @@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
> }
>
> #ifdef CONFIG_MMU
> +int vm_area_map_pages(struct vm_struct *area, unsigned long start,
> + unsigned long end, struct page **pages);
> +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> + unsigned long end);
> void vunmap_range(unsigned long addr, unsigned long end);
> static inline void set_vm_flush_reset_perms(void *addr)
> {
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f42f98a127d5..e5b8c70950bc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
> return err;
> }
>
> +static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
> + unsigned long end)
> +{
> + might_sleep();
This interface, and VM_SPARSE in general, would be useful for
dynamically grown kernel stacks [1]. However, the might_sleep() here
would be a problem. We would need to be able to handle
vm_area_map_pages() from interrupt-disabled context, therefore no
sleeping. The caller would need to guarantee that the page tables are
pre-allocated before the mapping.
Pasha
[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@mail.gmail.com
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 21:03 ` Pasha Tatashin
@ 2024-03-06 21:28 ` Alexei Starovoitov
2024-03-06 21:46 ` Pasha Tatashin
0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-06 21:28 UTC (permalink / raw)
To: Pasha Tatashin
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
On Wed, Mar 6, 2024 at 1:04 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> > virtual space.
> >
> > get_vm_area() with appropriate flag is used to request an area of kernel
> > address range. It's used for vmalloc, vmap, ioremap, xen use cases.
> > - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
> > - the areas created by vmap() function should be tagged with VM_MAP.
> > - ioremap areas are tagged with VM_IOREMAP.
> >
> > BPF would like to extend the vmap API to implement a lazily-populated
> > sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
> > and vm_area_map_pages(area, start_addr, count, pages) API to map a set
> > of pages within a given area.
> > It has the same sanity checks as vmap() does.
> > It also checks that get_vm_area() was created with VM_SPARSE flag
> > which identifies such areas in /proc/vmallocinfo
> > and returns zero pages on read through /proc/kcore.
> >
> > The next commits will introduce bpf_arena which is a sparsely populated
> > shared memory region between bpf program and user space process. It will
> > map privately-managed pages into a sparse vm area with the following steps:
> >
> > // request virtual memory region during bpf prog verification
> > area = get_vm_area(area_size, VM_SPARSE);
> >
> > // on demand
> > vm_area_map_pages(area, kaddr, kend, pages);
> > vm_area_unmap_pages(area, kaddr, kend);
> >
> > // after bpf program is detached and unloaded
> > free_vm_area(area);
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > include/linux/vmalloc.h | 5 ++++
> > mm/vmalloc.c | 59 +++++++++++++++++++++++++++++++++++++++--
> > 2 files changed, 62 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c720be70c8dd..0f72c85a377b 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -35,6 +35,7 @@ struct iov_iter; /* in uio.h */
> > #else
> > #define VM_DEFER_KMEMLEAK 0
> > #endif
> > +#define VM_SPARSE 0x00001000 /* sparse vm_area. not all pages are present. */
> >
> > /* bits [20..32] reserved for arch specific ioremap internals */
> >
> > @@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
> > }
> >
> > #ifdef CONFIG_MMU
> > +int vm_area_map_pages(struct vm_struct *area, unsigned long start,
> > + unsigned long end, struct page **pages);
> > +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> > + unsigned long end);
> > void vunmap_range(unsigned long addr, unsigned long end);
> > static inline void set_vm_flush_reset_perms(void *addr)
> > {
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index f42f98a127d5..e5b8c70950bc 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
> > return err;
> > }
> >
> > +static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
> > + unsigned long end)
> > +{
> > + might_sleep();
>
> This interface and in general VM_SPARSE would be useful for
> dynamically grown kernel stacks [1]. However, the might_sleep() here
> would be a problem. We would need to be able to handle
> vm_area_map_pages() from interrupt disabled context therefore no
> sleeping. The caller would need to guarantee that the page tables are
> pre-allocated before the mapping.
Sounds like we'd need to differentiate two kinds of sparse regions.
One that is really sparse, where page tables are not populated (the bpf
use case), and another where only the pte level might be empty.
Only the latter will be usable for such auto-grow stacks.
Months back I played with this idea:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
that
"Make vmap_pages_range() allocate page tables down to the last (PTE) level."
Essentially pass NULL instead of 'pages' into vmap_pages_range()
and it will populate all levels except the last.
Then the page fault handler can service a fault in the auto-growing stack
area if it has a page stashed in some per-cpu free list.
I suspect this is something you might need for a
"16k stack that is populated on fault",
plus a free list of 3 pages per-cpu,
and set_pte_at() in the pf handler.
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 21:28 ` Alexei Starovoitov
@ 2024-03-06 21:46 ` Pasha Tatashin
2024-03-06 22:12 ` Alexei Starovoitov
0 siblings, 1 reply; 15+ messages in thread
From: Pasha Tatashin @ 2024-03-06 21:46 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
> > This interface and in general VM_SPARSE would be useful for
> > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > would be a problem. We would need to be able to handle
> > vm_area_map_pages() from interrupt disabled context therefore no
> > sleeping. The caller would need to guarantee that the page tables are
> > pre-allocated before the mapping.
>
> Sounds like we'd need to differentiate two kinds of sparse regions.
> One that is really sparse where page tables are not populated (bpf use case)
> and another where only the pte level might be empty.
> Only the latter one will be usable for such auto-grow stacks.
>
> Months back I played with this idea:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> that
> "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> Essentially pass NULL instead of 'pages' into vmap_pages_range()
> and it will populate all levels except the last.
Yes, this is what is needed; however, it can be a little simpler with
kernel stacks:
given that the first page in the vm_area is mapped when the stack is
first allocated, and that the VA range is aligned to 16K, we are
actually guaranteed to have all page table levels down to the pte
pre-allocated during that initial mapping. Therefore, we do not need
to worry about allocating them later during PFs.
> Then the page fault handler can service a fault in auto-growing stack
> area if it has a page stashed in some per-cpu free list.
> I suspect this is something you might need for
> "16k stack that is populated on fault",
> plus a free list of 3 pages per-cpu,
> and set_pte_at() in pf handler.
Yes, what you described is exactly what I am working on: using 3 pages
per-cpu to handle kstack page faults. The only thing that is missing
is that I would like to have the ability to call a non-sleeping
version of vm_area_map_pages().
Pasha
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 21:46 ` Pasha Tatashin
@ 2024-03-06 22:12 ` Alexei Starovoitov
2024-03-06 22:56 ` Pasha Tatashin
0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-06 22:12 UTC (permalink / raw)
To: Pasha Tatashin
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > > This interface and in general VM_SPARSE would be useful for
> > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > would be a problem. We would need to be able to handle
> > > vm_area_map_pages() from interrupt disabled context therefore no
> > > sleeping. The caller would need to guarantee that the page tables are
> > > pre-allocated before the mapping.
> >
> > Sounds like we'd need to differentiate two kinds of sparse regions.
> > One that is really sparse where page tables are not populated (bpf use case)
> > and another where only the pte level might be empty.
> > Only the latter one will be usable for such auto-grow stacks.
> >
> > Months back I played with this idea:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > that
> > "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > and it will populate all levels except the last.
>
> Yes, this is what is needed, however, it can be a little simpler with
> kernel stacks:
> given that the first page in the vm_area is mapped when stack is first
> allocated, and that the VA range is aligned to 16K, we actually are
> guaranteed to have all page table levels down to pte pre-allocated
> during that initial mapping. Therefore, we do not need to worry about
> allocating them later during PFs.
Ahh. Found:
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...
> > Then the page fault handler can service a fault in auto-growing stack
> > area if it has a page stashed in some per-cpu free list.
> > I suspect this is something you might need for
> > "16k stack that is populated on fault",
> > plus a free list of 3 pages per-cpu,
> > and set_pte_at() in pf handler.
>
> Yes, what you described is exactly what I am working on: using 3-pages
> per-cpu to handle kstack page faults. The only thing that is missing
> is that I would like to have the ability to call a non-sleeping
> version of vm_area_map_pages().
vm_area_map_pages() cannot be non-sleepable, since the [start, end)
range dictates whether mid-level allocations and locks are needed.
Instead, in alloc_thread_stack_node() you'd need a flavor
of get_vm_area() that can align the range to THREAD_ALIGN.
Then immediately call the _sleepable_ vm_area_map_pages() to populate
the first page, and later set_pte_at() the other pages on demand
from the fault handler.
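Roughly (get_vm_area_aligned() is a made-up name for such a
THREAD_ALIGN-capable flavor; the rest uses the API from this patch and
error unwinding is trimmed):

	static int alloc_thread_stack_sketch(struct task_struct *tsk, int node)
	{
		struct vm_struct *area;
		struct page *page;
		unsigned long start;

		/* hypothetical get_vm_area() flavor honoring THREAD_ALIGN */
		area = get_vm_area_aligned(THREAD_SIZE, THREAD_ALIGN, VM_SPARSE);
		if (!area)
			return -ENOMEM;

		page = alloc_pages_node(node, THREADINFO_GFP, 0);
		start = (unsigned long)area->addr;

		/* sleepable: also pre-allocates intermediate page tables */
		if (vm_area_map_pages(area, start, start + PAGE_SIZE, &page))
			return -ENOMEM;

		/* remaining pages are installed with set_pte_at() from the
		 * stack fault handler */
		tsk->stack_vm_area = area;
		tsk->stack = area->addr;
		return 0;
	}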
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 22:12 ` Alexei Starovoitov
@ 2024-03-06 22:56 ` Pasha Tatashin
2024-03-06 23:11 ` Alexei Starovoitov
0 siblings, 1 reply; 15+ messages in thread
From: Pasha Tatashin @ 2024-03-06 22:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
On Wed, Mar 6, 2024 at 5:13 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > > > This interface and in general VM_SPARSE would be useful for
> > > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > > would be a problem. We would need to be able to handle
> > > > vm_area_map_pages() from interrupt disabled context therefore no
> > > > sleeping. The caller would need to guarantee that the page tables are
> > > > pre-allocated before the mapping.
> > >
> > > Sounds like we'd need to differentiate two kinds of sparse regions.
> > > One that is really sparse where page tables are not populated (bpf use case)
> > > and another where only the pte level might be empty.
> > > Only the latter one will be usable for such auto-grow stacks.
> > >
> > > Months back I played with this idea:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > > that
> > > "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> > > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > > and it will populate all levels except the last.
> >
> > Yes, this is what is needed, however, it can be a little simpler with
> > kernel stacks:
> > given that the first page in the vm_area is mapped when stack is first
> > allocated, and that the VA range is aligned to 16K, we actually are
> > guaranteed to have all page table levels down to pte pre-allocated
> > during that initial mapping. Therefore, we do not need to worry about
> > allocating them later during PFs.
>
> Ahh. Found:
> stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...
>
> > > Then the page fault handler can service a fault in auto-growing stack
> > > area if it has a page stashed in some per-cpu free list.
> > > I suspect this is something you might need for
> > > "16k stack that is populated on fault",
> > > plus a free list of 3 pages per-cpu,
> > > and set_pte_at() in pf handler.
> >
> > Yes, what you described is exactly what I am working on: using 3-pages
> > per-cpu to handle kstack page faults. The only thing that is missing
> > is that I would like to have the ability to call a non-sleeping
> > version of vm_area_map_pages().
>
> vm_area_map_pages() cannot be non-sleepable, since the [start, end)
> range will dictate whether mid level allocs and locks are needed.
>
> Instead in alloc_thread_stack_node() you'd need a flavor
> of get_vm_area() that can align the range to THREAD_ALIGN.
> Then immediately call _sleepable_ vm_area_map_pages() to populate
> the first page and later set_pte_at() the other pages on demand
> from the fault handler.
We still need to get to the PTE level to use set_pte_at(). So, either
store it in the task_struct for faster PF handling, or add another
non-sleeping vmap function that does something like this:
static void vm_area_set_page_at(unsigned long addr, struct page *page)
{
	pgd_t *pgd = pgd_offset_k(addr);
	p4d_t *p4d = p4d_offset(pgd, addr);
	pud_t *pud = pud_offset(p4d, addr);
	pmd_t *pmd = pmd_offset(pud, addr);
	pte_t *pte = pte_offset_kernel(pmd, addr);

	set_pte_at(&init_mm, addr, pte, mk_pte(page, ...));
}
Pasha
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
2024-03-06 14:19 ` Christoph Hellwig
2024-03-06 21:03 ` Pasha Tatashin
@ 2024-03-06 22:57 ` Pasha Tatashin
2 siblings, 0 replies; 15+ messages in thread
From: Pasha Tatashin @ 2024-03-06 22:57 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, daniel, andrii, torvalds, brho, hannes, lstoakes, akpm,
urezki, hch, rppt, boris.ostrovsky, sstabellini, jgross,
linux-mm, xen-devel, kernel-team
On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> virtual space.
>
> get_vm_area() with appropriate flag is used to request an area of kernel
> address range. It's used for vmalloc, vmap, ioremap, xen use cases.
> - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
> - the areas created by vmap() function should be tagged with VM_MAP.
> - ioremap areas are tagged with VM_IOREMAP.
>
> BPF would like to extend the vmap API to implement a lazily-populated
> sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
> and vm_area_map_pages(area, start_addr, count, pages) API to map a set
> of pages within a given area.
> It has the same sanity checks as vmap() does.
> It also checks that get_vm_area() was created with VM_SPARSE flag
> which identifies such areas in /proc/vmallocinfo
> and returns zero pages on read through /proc/kcore.
>
> The next commits will introduce bpf_arena which is a sparsely populated
> shared memory region between bpf program and user space process. It will
> map privately-managed pages into a sparse vm area with the following steps:
>
> // request virtual memory region during bpf prog verification
> area = get_vm_area(area_size, VM_SPARSE);
>
> // on demand
> vm_area_map_pages(area, kaddr, kend, pages);
> vm_area_unmap_pages(area, kaddr, kend);
>
> // after bpf program is detached and unloaded
> free_vm_area(area);
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
* Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
2024-03-06 22:56 ` Pasha Tatashin
@ 2024-03-06 23:11 ` Alexei Starovoitov
0 siblings, 0 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-06 23:11 UTC (permalink / raw)
To: Pasha Tatashin
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
On Wed, Mar 6, 2024 at 2:57 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Wed, Mar 6, 2024 at 5:13 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > >
> > > > > This interface and in general VM_SPARSE would be useful for
> > > > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > > > would be a problem. We would need to be able to handle
> > > > > vm_area_map_pages() from interrupt disabled context therefore no
> > > > > sleeping. The caller would need to guarantee that the page tables are
> > > > > pre-allocated before the mapping.
> > > >
> > > > Sounds like we'd need to differentiate two kinds of sparse regions.
> > > > One that is really sparse where page tables are not populated (bpf use case)
> > > > and another where only the pte level might be empty.
> > > > Only the latter one will be usable for such auto-grow stacks.
> > > >
> > > > Months back I played with this idea:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > > > that
> > > > "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> > > > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > > > and it will populate all levels except the last.
> > >
> > > Yes, this is what is needed, however, it can be a little simpler with
> > > kernel stacks:
> > > given that the first page in the vm_area is mapped when stack is first
> > > allocated, and that the VA range is aligned to 16K, we actually are
> > > guaranteed to have all page table levels down to pte pre-allocated
> > > during that initial mapping. Therefore, we do not need to worry about
> > > allocating them later during PFs.
> >
> > Ahh. Found:
> > stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...
> >
> > > > Then the page fault handler can service a fault in auto-growing stack
> > > > area if it has a page stashed in some per-cpu free list.
> > > > I suspect this is something you might need for
> > > > "16k stack that is populated on fault",
> > > > plus a free list of 3 pages per-cpu,
> > > > and set_pte_at() in pf handler.
> > >
> > > Yes, what you described is exactly what I am working on: using 3-pages
> > > per-cpu to handle kstack page faults. The only thing that is missing
> > > is that I would like to have the ability to call a non-sleeping
> > > version of vm_area_map_pages().
> >
> > vm_area_map_pages() cannot be non-sleepable, since the [start, end)
> > range will dictate whether mid level allocs and locks are needed.
> >
> > Instead in alloc_thread_stack_node() you'd need a flavor
> > of get_vm_area() that can align the range to THREAD_ALIGN.
> > Then immediately call _sleepable_ vm_area_map_pages() to populate
> > the first page and later set_pte_at() the other pages on demand
> > from the fault handler.
>
> We still need to get to PTE level to use set_pte_at(). So, either
> store it in task_struct for faster PF handling, or add another
> non-sleeping vmap function that will do something like this:
>
> vm_area_set_page_at(addr, page)
> {
> pgd = pgd_offset_k(addr)
> p4d = vunmap_p4d_range(pgd, addr)
> pud = pud_offset(p4d, addr)
> pmd = pmd_offset(pud, addr)
> pte = pte_offset_kernel(pmd, addr)
>
> set_pte_at(init_mm, addr, pte, mk_pte(page...));
> }
Right. There are several flavors of this logic across the tree.
What you're proposing is pretty much a vmalloc_to_page() that
returns the pte even if !pte_present, instead of a page.
x86 does mostly the same in lookup_address(), fwiw.
Good opportunity to clean all this up and share the code.
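A shared helper along those lines might have a signature like this
(the name is just a suggestion):

	/*
	 * Like vmalloc_to_page(), but returns the pte slot even when the
	 * entry is not present, so the caller can set_pte_at() into it.
	 */
	pte_t *vmalloc_to_pte(unsigned long addr);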
* Re: [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
[not found] ` <CGME20240308171422eucas1p293895be469655aa618535cf199b0c43a@eucas1p2.samsung.com>
@ 2024-03-08 17:14 ` Marek Szyprowski
2024-03-08 17:21 ` Alexei Starovoitov
0 siblings, 1 reply; 15+ messages in thread
From: Marek Szyprowski @ 2024-03-08 17:14 UTC (permalink / raw)
To: Alexei Starovoitov, bpf
Cc: daniel, andrii, torvalds, brho, hannes, lstoakes, akpm, urezki,
hch, rppt, boris.ostrovsky, sstabellini, jgross, linux-mm,
xen-devel, kernel-team
On 05.03.2024 04:05, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> There are various users of get_vm_area() + ioremap_page_range() APIs.
> Enforce that get_vm_area() was requested as VM_IOREMAP type and range
> passed to ioremap_page_range() matches created vm_area to avoid
> accidentally ioremap-ing into wrong address range.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
This patch landed in today's linux-next as commit 3e49a866c9dc ("mm:
Enforce VM_IOREMAP flag and range in ioremap_page_range.").
Unfortunately, it triggers the following warning on all my test machines
with PCI bridges. Here is an example reproduced with QEMU and the ARM64
'virt' machine:
pci-host-generic 4010000000.pcie: host bridge /pcie@10000000 ranges:
pci-host-generic 4010000000.pcie: IO 0x003eff0000..0x003effffff -> 0x0000000000
pci-host-generic 4010000000.pcie: MEM 0x0010000000..0x003efeffff -> 0x0010000000
pci-host-generic 4010000000.pcie: MEM 0x8000000000..0xffffffffff -> 0x8000000000
------------[ cut here ]------------
vm_area at addr fffffbfffe800000 is not marked as VM_IOREMAP
WARNING: CPU: 0 PID: 1 at mm/vmalloc.c:315 ioremap_page_range+0x8c/0x174
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc6+ #14694
Hardware name: linux,dummy-virt (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : ioremap_page_range+0x8c/0x174
lr : ioremap_page_range+0x8c/0x174
sp : ffff800083faba10
...
Call trace:
ioremap_page_range+0x8c/0x174
pci_remap_iospace+0x74/0x88
devm_pci_remap_iospace+0x54/0xac
devm_of_pci_bridge_init+0x160/0x1fc
devm_pci_alloc_host_bridge+0xb4/0xd0
pci_host_common_probe+0x44/0x1a0
platform_probe+0x68/0xd8
really_probe+0x148/0x2b4
__driver_probe_device+0x78/0x12c
driver_probe_device+0xdc/0x164
__driver_attach+0x9c/0x1ac
bus_for_each_dev+0x74/0xd4
driver_attach+0x24/0x30
bus_add_driver+0xe4/0x1e8
driver_register+0x60/0x128
__platform_driver_register+0x28/0x34
gen_pci_driver_init+0x1c/0x28
do_one_initcall+0x74/0x2f4
kernel_init_freeable+0x28c/0x4dc
kernel_init+0x24/0x1dc
ret_from_fork+0x10/0x20
irq event stamp: 74360
hardirqs last enabled at (74359): [<ffff80008012cb9c>] console_unlock+0x120/0x12c
hardirqs last disabled at (74360): [<ffff80008122daa0>] el1_dbg+0x24/0x8c
softirqs last enabled at (71258): [<ffff800080010a60>] __do_softirq+0x4a0/0x4e8
softirqs last disabled at (71245): [<ffff8000800169b0>] ____do_softirq+0x10/0x1c
---[ end trace 0000000000000000 ]---
pci-host-generic 4010000000.pcie: error -22: failed to map resource [io 0x0000-0xffff]
pci-host-generic 4010000000.pcie: Memory resource size exceeds max for 32 bits
pci-host-generic 4010000000.pcie: ECAM at [mem 0x4010000000-0x401fffffff] for [bus 00-ff]
pci-host-generic 4010000000.pcie: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [bus 00-ff]
pci_bus 0000:00: root bus resource [mem 0x10000000-0x3efeffff]
pci_bus 0000:00: root bus resource [mem 0x8000000000-0xffffffffff]
pci 0000:00:00.0: [1b36:0008] type 00 class 0x060000 conventional PCI endpoint
It looks like the PCI-related code must be adjusted somehow for this change.
> mm/vmalloc.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d12a17fc0c17..f42f98a127d5 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
> int ioremap_page_range(unsigned long addr, unsigned long end,
> phys_addr_t phys_addr, pgprot_t prot)
> {
> + struct vm_struct *area;
> int err;
>
> + area = find_vm_area((void *)addr);
> + if (!area || !(area->flags & VM_IOREMAP)) {
> + WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr);
> + return -EINVAL;
> + }
> + if (addr != (unsigned long)area->addr ||
> + (void *)end != area->addr + get_vm_area_size(area)) {
> + WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n",
> + addr, end, (long)area->addr,
> + (long)area->addr + get_vm_area_size(area));
> + return -ERANGE;
> + }
> err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
> ioremap_max_page_shift);
> flush_cache_vmap(addr, end);
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
* Re: [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
2024-03-08 17:14 ` Marek Szyprowski
@ 2024-03-08 17:21 ` Alexei Starovoitov
0 siblings, 0 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2024-03-08 17:21 UTC (permalink / raw)
To: Marek Szyprowski
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds,
Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, Mike Rapoport,
Boris Ostrovsky, sstabellini, Juergen Gross, linux-mm, xen-devel,
Kernel Team
On Fri, Mar 8, 2024 at 9:14 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 05.03.2024 04:05, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > There are various users of get_vm_area() + ioremap_page_range() APIs.
> > Enforce that get_vm_area() was requested as VM_IOREMAP type and range
> > passed to ioremap_page_range() matches created vm_area to avoid
> > accidentally ioremap-ing into wrong address range.
> >
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> This patch landed in today's linux-next as commit 3e49a866c9dc ("mm:
> Enforce VM_IOREMAP flag and range in ioremap_page_range.").
> Unfortunately it triggers the following warning on all my test machines
> with PCI bridges. Here is an example reproduced with QEMU and ARM64
> 'virt' machine:
Sorry about the breakage.
Here is the thread where we're discussing the fix:
https://lore.kernel.org/bpf/CAADnVQLP=dxBb+RiMGXoaCEuRrbK387J6B+pfzWKF_F=aRgCPQ@mail.gmail.com/
Thread overview: 15+ messages
2024-03-05 3:05 [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range Alexei Starovoitov
[not found] ` <CGME20240308171422eucas1p293895be469655aa618535cf199b0c43a@eucas1p2.samsung.com>
2024-03-08 17:14 ` Marek Szyprowski
2024-03-08 17:21 ` Alexei Starovoitov
2024-03-05 3:05 ` [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages() Alexei Starovoitov
2024-03-06 14:19 ` Christoph Hellwig
2024-03-06 17:10 ` Alexei Starovoitov
2024-03-06 21:03 ` Pasha Tatashin
2024-03-06 21:28 ` Alexei Starovoitov
2024-03-06 21:46 ` Pasha Tatashin
2024-03-06 22:12 ` Alexei Starovoitov
2024-03-06 22:56 ` Pasha Tatashin
2024-03-06 23:11 ` Alexei Starovoitov
2024-03-06 22:57 ` Pasha Tatashin
2024-03-06 18:30 ` [PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area patchwork-bot+netdevbpf