* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
@ 2025-05-12 9:24 ` Christian Brauner
2025-05-12 11:31 ` Lorenzo Stoakes
2025-05-12 13:37 ` Vlastimil Babka
` (3 subsequent siblings)
4 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-05-12 9:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered to prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() calback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which makes it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> return file->f_op->mmap(file, vma);
> }
>
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
nit: I would prefer if we could rename this to vfs_mmap() and
vfs_mmap_prepare() as this is in line with all the other vfs related
helpers we expose.
> +
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..15808cad2bc1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,30 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> /*
> * This struct describes a virtual memory area. There is one of these
> * per VM-area/task. A VM area is any part of the process virtual memory
> diff --git a/mm/memory.c b/mm/memory.c
> index 68c1d962d0ad..99af83434e7c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> dump_page(page, "bad pte");
> pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
> + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> vma->vm_file,
> vma->vm_ops ? vma->vm_ops->fault : NULL,
> vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
> + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
> mapping ? mapping->a_ops->read_folio : NULL);
> dump_stack();
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 81dd962a1cfc..50f902c08341 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> vm_flags &= ~VM_MAYEXEC;
> }
>
> - if (!file->f_op->mmap)
> + if (!file_has_valid_mmap_hooks(file))
> return -ENODEV;
> if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> return -EINVAL;
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..3f32e04bb6cc 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -17,6 +17,11 @@ struct mmap_state {
> unsigned long pglen;
> unsigned long flags;
> struct file *file;
> + pgprot_t page_prot;
> +
> + /* User-defined fields, perhaps updated by .mmap_prepare(). */
> + const struct vm_operations_struct *vm_ops;
> + void *vm_private_data;
>
> unsigned long charged;
> bool retry_merge;
> @@ -40,6 +45,7 @@ struct mmap_state {
> .pglen = PHYS_PFN(len_), \
> .flags = flags_, \
> .file = file_, \
> + .page_prot = vm_get_page_prot(flags_), \
> }
>
> #define VMG_MMAP_STATE(name, map_, vma_) \
> @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> int error;
>
> vma->vm_file = get_file(map->file);
> +
> + if (!map->file->f_op->mmap)
> + return 0;
> +
> error = mmap_file(vma->vm_file, vma);
> if (error) {
> fput(vma->vm_file);
> @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vm_flags_init(vma, map->flags);
> - vma->vm_page_prot = vm_get_page_prot(map->flags);
> + vma->vm_page_prot = map->page_prot;
>
> if (vma_iter_prealloc(vmi, vma)) {
> error = -ENOMEM;
> @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> vma_set_page_prot(vma);
> }
>
> +/*
> + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
> + * specifies it.
> + *
> + * This is called prior to any merge attempt, and updates whitelisted fields
> + * that are permitted to be updated by the caller.
> + *
> + * All but user-defined fields will be pre-populated with original values.
> + *
> + * Returns 0 on success, or an error code otherwise.
> + */
> +static int call_mmap_prepare(struct mmap_state *map)
> +{
> + int err;
> + struct vm_area_desc desc = {
> + .mm = map->mm,
> + .start = map->addr,
> + .end = map->end,
> +
> + .pgoff = map->pgoff,
> + .file = map->file,
> + .vm_flags = map->flags,
> + .page_prot = map->page_prot,
> + };
> +
> + /* Invoke the hook. */
> + err = __call_mmap_prepare(map->file, &desc);
> + if (err)
> + return err;
> +
> + /* Update fields permitted to be changed. */
> + map->pgoff = desc.pgoff;
> + map->file = desc.file;
> + map->flags = desc.vm_flags;
> + map->page_prot = desc.page_prot;
> + /* User-defined fields. */
> + map->vm_ops = desc.vm_ops;
> + map->vm_private_data = desc.private_data;
> +
> + return 0;
> +}
> +
> +static void set_vma_user_defined_fields(struct vm_area_struct *vma,
> + struct mmap_state *map)
> +{
> + if (map->vm_ops)
> + vma->vm_ops = map->vm_ops;
> + vma->vm_private_data = map->vm_private_data;
> +}
> +
> static unsigned long __mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> int error;
> + bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> VMA_ITERATOR(vmi, mm, addr);
> MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
>
> error = __mmap_prepare(&map, uf);
> + if (!error && have_mmap_prepare)
> + error = call_mmap_prepare(&map);
> if (error)
> goto abort_munmap;
>
> @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> goto unacct_error;
> }
>
> + if (have_mmap_prepare)
> + set_vma_user_defined_fields(vma, &map);
> +
> /* If flags changed, we might be able to merge, so try again. */
> if (map.retry_merge) {
> struct vm_area_struct *merged;
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 198abe66de5a..f6e45e62da3a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -253,8 +253,40 @@ struct mm_struct {
> unsigned long flags; /* Must use atomic bitops to access */
> };
>
> +struct vm_area_struct;
> +
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> +struct file_operations {
> + int (*mmap)(struct file *, struct vm_area_struct *);
> + int (*mmap_prepare)(struct vm_area_desc *);
> +};
> +
> struct file {
> struct address_space *f_mapping;
> + const struct file_operations *f_op;
> };
>
> #define VMA_LOCK_OFFSET 0x40000000
> @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
> vma->__vm_flags &= ~flags;
> }
>
> -static inline int call_mmap(struct file *, struct vm_area_struct *)
> -{
> - return 0;
> -}
> -
> static inline int shmem_zero_setup(struct vm_area_struct *)
> {
> return 0;
> @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> (void)vma;
> }
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> +static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> + return file->f_op->mmap(file, vma);
> +}
> +
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-12 9:24 ` Christian Brauner
@ 2025-05-12 11:31 ` Lorenzo Stoakes
2025-05-13 7:29 ` Christian Brauner
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-12 11:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Mon, May 12, 2025 at 11:24:06AM +0200, Christian Brauner wrote:
> On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
[snip]
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 016b0fe1536e..e2721a1ff13d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
[snip]
> > static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > {
> > + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> > + return -EINVAL;
> > +
> > return file->f_op->mmap(file, vma);
> > }
> >
> > +static inline int __call_mmap_prepare(struct file *file,
> > + struct vm_area_desc *desc)
> > +{
> > + return file->f_op->mmap_prepare(desc);
> > +}
>
> nit: I would prefer if we could rename this to vfs_mmap() and
> vfs_mmap_prepare() as this is in line with all the other vfs related
> helpers we expose.
>
Happy to do it, but:
call_mmap() is already invoked in a bunch of places, so kinda falls outside this
series (+ would touch a bunch of unrelated files), would you mind if I sent that
separately?
Thanks!
[snip]
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-12 11:31 ` Lorenzo Stoakes
@ 2025-05-13 7:29 ` Christian Brauner
0 siblings, 0 replies; 23+ messages in thread
From: Christian Brauner @ 2025-05-13 7:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Mon, May 12, 2025 at 12:31:34PM +0100, Lorenzo Stoakes wrote:
> On Mon, May 12, 2025 at 11:24:06AM +0200, Christian Brauner wrote:
> > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
>
> [snip]
>
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 016b0fe1536e..e2721a1ff13d 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
>
> [snip]
>
> > > static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > > {
> > > + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> > > + return -EINVAL;
> > > +
> > > return file->f_op->mmap(file, vma);
> > > }
> > >
> > > +static inline int __call_mmap_prepare(struct file *file,
> > > + struct vm_area_desc *desc)
> > > +{
> > > + return file->f_op->mmap_prepare(desc);
> > > +}
> >
> > nit: I would prefer if we could rename this to vfs_mmap() and
> > vfs_mmap_prepare() as this is in line with all the other vfs related
> > helpers we expose.
> >
>
> Happy to do it, but:
>
> call_mmap() is already invoked in a bunch of places, so kinda falls outside this
> series (+ would touch a bunch of unrelated files), would you mind if I sent that
> separately?
Sure, that's fine.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
@ 2025-05-12 13:37 ` Vlastimil Babka
2025-05-13 9:01 ` David Hildenbrand
` (2 subsequent siblings)
4 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2025-05-12 13:37 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 5/9/25 14:13, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered to prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() calback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which makes it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
2025-05-12 13:37 ` Vlastimil Babka
@ 2025-05-13 9:01 ` David Hildenbrand
2025-05-13 9:32 ` Lorenzo Stoakes
2025-05-13 13:22 ` Liam R. Howlett
2025-05-14 9:04 ` Pedro Falcato
4 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-13 9:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 09.05.25 14:13, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered to prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() calback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which makes it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
So, if neither is set, it's also an invalid setting, understood.
So we want XOR.
const bool has_mmap = file->f_op->mmap;
const bool has_mmap_prepare = file->f_op->mmap_prepare;
const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
WARN_ON_ONCE(!mutual_exclusive)
return mutual_exclusive;
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 9:01 ` David Hildenbrand
@ 2025-05-13 9:32 ` Lorenzo Stoakes
2025-05-13 13:25 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-13 9:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
> On 09.05.25 14:13, Lorenzo Stoakes wrote:
> > Provide a means by which drivers can specify which fields of those
> > permitted to be changed should be altered to prior to mmap()'ing a
> > range (which may either result from a merge or from mapping an entirely new
> > VMA).
> >
> > Doing so is substantially safer than the existing .mmap() calback which
> > provides unrestricted access to the part-constructed VMA and permits
> > drivers and file systems to do 'creative' things which makes it hard to
> > reason about the state of the VMA after the function returns.
> >
> > The existing .mmap() callback's freedom has caused a great deal of issues,
> > especially in error handling, as unwinding the mmap() state has proven to
> > be non-trivial and caused significant issues in the past, for instance
> > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > error path behaviour").
> >
> > It also necessitates a second attempt at merge once the .mmap() callback
> > has completed, which has caused issues in the past, is awkward, adds
> > overhead and is difficult to reason about.
> >
> > The .mmap_prepare() callback eliminates this requirement, as we can update
> > fields prior to even attempting the first merge. It is safer, as we heavily
> > restrict what can actually be modified, and being invoked very early in the
> > mmap() process, error handling can be performed safely with very little
> > unwinding of state required.
> >
> > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > exclusive, so we permit only one to be invoked at a time.
> >
> > Update vma userland test stubs to account for changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > include/linux/fs.h | 25 ++++++++++++
> > include/linux/mm_types.h | 24 +++++++++++
> > mm/memory.c | 3 +-
> > mm/mmap.c | 2 +-
> > mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> > tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> > 6 files changed, 180 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 016b0fe1536e..e2721a1ff13d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2169,6 +2169,7 @@ struct file_operations {
> > int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> > int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> > unsigned int poll_flags);
> > + int (*mmap_prepare)(struct vm_area_desc *);
> > } __randomize_layout;
> > /* Supports async buffered reads */
> > @@ -2238,11 +2239,35 @@ struct inode_operations {
> > struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> > } ____cacheline_aligned;
> > +/* Did the driver provide valid mmap hook configuration? */
> > +static inline bool file_has_valid_mmap_hooks(struct file *file)
> > +{
> > + bool has_mmap = file->f_op->mmap;
> > + bool has_mmap_prepare = file->f_op->mmap_prepare;
> > +
> > + /* Hooks are mutually exclusive. */
> > + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> > + return false;
> > + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> > + return false;
> > +
> > + return true;
> > +}
>
> So, if neither is set, it's also an invalid setting, understood.
>
> So we want XOR.
>
>
>
> const bool has_mmap = file->f_op->mmap;
> const bool has_mmap_prepare = file->f_op->mmap_prepare;
> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
>
> WARN_ON_ONCE(!mutual_exclusive)
> return mutual_exclusive;
Yeah I did consider xor like this but I've always found it quite confusing
in this kind of context, honestly.
In a way I think it's a bit easier spelt out as it is now. But happy to
change if you feel strongly about it? :)
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 9:32 ` Lorenzo Stoakes
@ 2025-05-13 13:25 ` David Hildenbrand
2025-05-13 15:31 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-13 13:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 13.05.25 11:32, Lorenzo Stoakes wrote:
> On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
>> On 09.05.25 14:13, Lorenzo Stoakes wrote:
>>> Provide a means by which drivers can specify which fields of those
>>> permitted to be changed should be altered to prior to mmap()'ing a
>>> range (which may either result from a merge or from mapping an entirely new
>>> VMA).
>>>
>>> Doing so is substantially safer than the existing .mmap() calback which
>>> provides unrestricted access to the part-constructed VMA and permits
>>> drivers and file systems to do 'creative' things which makes it hard to
>>> reason about the state of the VMA after the function returns.
>>>
>>> The existing .mmap() callback's freedom has caused a great deal of issues,
>>> especially in error handling, as unwinding the mmap() state has proven to
>>> be non-trivial and caused significant issues in the past, for instance
>>> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
>>> error path behaviour").
>>>
>>> It also necessitates a second attempt at merge once the .mmap() callback
>>> has completed, which has caused issues in the past, is awkward, adds
>>> overhead and is difficult to reason about.
>>>
>>> The .mmap_prepare() callback eliminates this requirement, as we can update
>>> fields prior to even attempting the first merge. It is safer, as we heavily
>>> restrict what can actually be modified, and being invoked very early in the
>>> mmap() process, error handling can be performed safely with very little
>>> unwinding of state required.
>>>
>>> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
>>> exclusive, so we permit only one to be invoked at a time.
>>>
>>> Update vma userland test stubs to account for changes.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>> include/linux/fs.h | 25 ++++++++++++
>>> include/linux/mm_types.h | 24 +++++++++++
>>> mm/memory.c | 3 +-
>>> mm/mmap.c | 2 +-
>>> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
>>> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
>>> 6 files changed, 180 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 016b0fe1536e..e2721a1ff13d 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -2169,6 +2169,7 @@ struct file_operations {
>>> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
>>> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>>> unsigned int poll_flags);
>>> + int (*mmap_prepare)(struct vm_area_desc *);
>>> } __randomize_layout;
>>> /* Supports async buffered reads */
>>> @@ -2238,11 +2239,35 @@ struct inode_operations {
>>> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
>>> } ____cacheline_aligned;
>>> +/* Did the driver provide valid mmap hook configuration? */
>>> +static inline bool file_has_valid_mmap_hooks(struct file *file)
>>> +{
>>> + bool has_mmap = file->f_op->mmap;
>>> + bool has_mmap_prepare = file->f_op->mmap_prepare;
>>> +
>>> + /* Hooks are mutually exclusive. */
>>> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
>>> + return false;
>>> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
>>> + return false;
>>> +
>>> + return true;
>>> +}
>>
>> So, if neither is set, it's also an invalid setting, understood.
>>
>> So we want XOR.
>>
>>
>>
>> const bool has_mmap = file->f_op->mmap;
>> const bool has_mmap_prepare = file->f_op->mmap_prepare;
>> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
>>
>> WARN_ON_ONCE(!mutual_exclusive)
>> return mutual_exclusive;
>
> Yeah I did consider xor like this but I've always found it quite confusing
> in this kind of context, honestly.
With the local variable I think it's quite helpful (no need for a
comment :P ).
>
> In a way I think it's a bit easier spelt out as it is now. But happy to
> change if you feel strongly about it? :)
Certainly not strongly! :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 13:25 ` David Hildenbrand
@ 2025-05-13 15:31 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 15:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.05.25 11:32, Lorenzo Stoakes wrote:
> > On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
> >> On 09.05.25 14:13, Lorenzo Stoakes wrote:
> >>> Provide a means by which drivers can specify which fields of those
> >>> permitted to be changed should be altered to prior to mmap()'ing a
> >>> range (which may either result from a merge or from mapping an entirely new
> >>> VMA).
> >>>
> >>> Doing so is substantially safer than the existing .mmap() calback which
> >>> provides unrestricted access to the part-constructed VMA and permits
> >>> drivers and file systems to do 'creative' things which makes it hard to
> >>> reason about the state of the VMA after the function returns.
> >>>
> >>> The existing .mmap() callback's freedom has caused a great deal of issues,
> >>> especially in error handling, as unwinding the mmap() state has proven to
> >>> be non-trivial and caused significant issues in the past, for instance
> >>> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> >>> error path behaviour").
> >>>
> >>> It also necessitates a second attempt at merge once the .mmap() callback
> >>> has completed, which has caused issues in the past, is awkward, adds
> >>> overhead and is difficult to reason about.
> >>>
> >>> The .mmap_prepare() callback eliminates this requirement, as we can update
> >>> fields prior to even attempting the first merge. It is safer, as we heavily
> >>> restrict what can actually be modified, and being invoked very early in the
> >>> mmap() process, error handling can be performed safely with very little
> >>> unwinding of state required.
> >>>
> >>> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> >>> exclusive, so we permit only one to be invoked at a time.
> >>>
> >>> Update vma userland test stubs to account for changes.
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> >>> ---
> >>> include/linux/fs.h | 25 ++++++++++++
> >>> include/linux/mm_types.h | 24 +++++++++++
> >>> mm/memory.c | 3 +-
> >>> mm/mmap.c | 2 +-
> >>> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> >>> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> >>> 6 files changed, 180 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>> index 016b0fe1536e..e2721a1ff13d 100644
> >>> --- a/include/linux/fs.h
> >>> +++ b/include/linux/fs.h
> >>> @@ -2169,6 +2169,7 @@ struct file_operations {
> >>> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> >>> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> >>> unsigned int poll_flags);
> >>> + int (*mmap_prepare)(struct vm_area_desc *);
> >>> } __randomize_layout;
> >>> /* Supports async buffered reads */
> >>> @@ -2238,11 +2239,35 @@ struct inode_operations {
> >>> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> >>> } ____cacheline_aligned;
> >>> +/* Did the driver provide valid mmap hook configuration? */
> >>> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> >>> +{
> >>> + bool has_mmap = file->f_op->mmap;
> >>> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> >>> +
> >>> + /* Hooks are mutually exclusive. */
> >>> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> >>> + return false;
> >>> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> >>> + return false;
> >>> +
> >>> + return true;
> >>> +}
> >>
> >> So, if neither is set, it's also an invalid setting, understood.
> >>
> >> So we want XOR.
> >>
> >>
> >>
> >> const bool has_mmap = file->f_op->mmap;
> >> const bool has_mmap_prepare = file->f_op->mmap_prepare;
> >> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
> >>
> >> WARN_ON_ONCE(!mutual_exclusive)
> >> return mutual_exclusive;
> >
> > Yeah I did consider xor like this but I've always found it quite confusing
> > in this kind of context, honestly.
>
> With the local variable I think it's quite helpful (no need for a
> comment :P ).
>
> >
> > In a way I think it's a bit easier spelt out as it is now. But happy to
> > change if you feel strongly about it? :)
>
> Certainly not strongly! :)
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
` (2 preceding siblings ...)
2025-05-13 9:01 ` David Hildenbrand
@ 2025-05-13 13:22 ` Liam R. Howlett
2025-05-14 9:04 ` Pedro Falcato
4 siblings, 0 replies; 23+ messages in thread
From: Liam R. Howlett @ 2025-05-13 13:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered to prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() calback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which makes it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> return file->f_op->mmap(file, vma);
> }
>
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..15808cad2bc1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,30 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> /*
> * This struct describes a virtual memory area. There is one of these
> * per VM-area/task. A VM area is any part of the process virtual memory
> diff --git a/mm/memory.c b/mm/memory.c
> index 68c1d962d0ad..99af83434e7c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> dump_page(page, "bad pte");
> pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
> + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> vma->vm_file,
> vma->vm_ops ? vma->vm_ops->fault : NULL,
> vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
> + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
> mapping ? mapping->a_ops->read_folio : NULL);
> dump_stack();
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 81dd962a1cfc..50f902c08341 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> vm_flags &= ~VM_MAYEXEC;
> }
>
> - if (!file->f_op->mmap)
> + if (!file_has_valid_mmap_hooks(file))
> return -ENODEV;
> if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> return -EINVAL;
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..3f32e04bb6cc 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -17,6 +17,11 @@ struct mmap_state {
> unsigned long pglen;
> unsigned long flags;
> struct file *file;
> + pgprot_t page_prot;
> +
> + /* User-defined fields, perhaps updated by .mmap_prepare(). */
> + const struct vm_operations_struct *vm_ops;
> + void *vm_private_data;
>
> unsigned long charged;
> bool retry_merge;
> @@ -40,6 +45,7 @@ struct mmap_state {
> .pglen = PHYS_PFN(len_), \
> .flags = flags_, \
> .file = file_, \
> + .page_prot = vm_get_page_prot(flags_), \
> }
>
> #define VMG_MMAP_STATE(name, map_, vma_) \
> @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> int error;
>
> vma->vm_file = get_file(map->file);
> +
> + if (!map->file->f_op->mmap)
> + return 0;
> +
> error = mmap_file(vma->vm_file, vma);
> if (error) {
> fput(vma->vm_file);
> @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vm_flags_init(vma, map->flags);
> - vma->vm_page_prot = vm_get_page_prot(map->flags);
> + vma->vm_page_prot = map->page_prot;
>
> if (vma_iter_prealloc(vmi, vma)) {
> error = -ENOMEM;
> @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> vma_set_page_prot(vma);
> }
>
> +/*
> + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
> + * specifies it.
> + *
> + * This is called prior to any merge attempt, and updates whitelisted fields
> + * that are permitted to be updated by the caller.
> + *
> + * All but user-defined fields will be pre-populated with original values.
> + *
> + * Returns 0 on success, or an error code otherwise.
> + */
> +static int call_mmap_prepare(struct mmap_state *map)
> +{
> + int err;
> + struct vm_area_desc desc = {
> + .mm = map->mm,
> + .start = map->addr,
> + .end = map->end,
> +
> + .pgoff = map->pgoff,
> + .file = map->file,
> + .vm_flags = map->flags,
> + .page_prot = map->page_prot,
> + };
> +
> + /* Invoke the hook. */
> + err = __call_mmap_prepare(map->file, &desc);
> + if (err)
> + return err;
> +
> + /* Update fields permitted to be changed. */
> + map->pgoff = desc.pgoff;
> + map->file = desc.file;
> + map->flags = desc.vm_flags;
> + map->page_prot = desc.page_prot;
> + /* User-defined fields. */
> + map->vm_ops = desc.vm_ops;
> + map->vm_private_data = desc.private_data;
> +
> + return 0;
> +}
> +
> +static void set_vma_user_defined_fields(struct vm_area_struct *vma,
> + struct mmap_state *map)
> +{
> + if (map->vm_ops)
> + vma->vm_ops = map->vm_ops;
> + vma->vm_private_data = map->vm_private_data;
> +}
> +
> static unsigned long __mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> int error;
> + bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> VMA_ITERATOR(vmi, mm, addr);
> MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
>
> error = __mmap_prepare(&map, uf);
> + if (!error && have_mmap_prepare)
> + error = call_mmap_prepare(&map);
> if (error)
> goto abort_munmap;
>
> @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> goto unacct_error;
> }
>
> + if (have_mmap_prepare)
> + set_vma_user_defined_fields(vma, &map);
> +
> /* If flags changed, we might be able to merge, so try again. */
> if (map.retry_merge) {
> struct vm_area_struct *merged;
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 198abe66de5a..f6e45e62da3a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -253,8 +253,40 @@ struct mm_struct {
> unsigned long flags; /* Must use atomic bitops to access */
> };
>
> +struct vm_area_struct;
> +
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> +struct file_operations {
> + int (*mmap)(struct file *, struct vm_area_struct *);
> + int (*mmap_prepare)(struct vm_area_desc *);
> +};
> +
> struct file {
> struct address_space *f_mapping;
> + const struct file_operations *f_op;
> };
>
> #define VMA_LOCK_OFFSET 0x40000000
> @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
> vma->__vm_flags &= ~flags;
> }
>
> -static inline int call_mmap(struct file *, struct vm_area_struct *)
> -{
> - return 0;
> -}
> -
> static inline int shmem_zero_setup(struct vm_area_struct *)
> {
> return 0;
> @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> (void)vma;
> }
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> +static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> + return file->f_op->mmap(file, vma);
> +}
> +
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
` (3 preceding siblings ...)
2025-05-13 13:22 ` Liam R. Howlett
@ 2025-05-14 9:04 ` Pedro Falcato
2025-05-14 9:12 ` Lorenzo Stoakes
4 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2025-05-14 9:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered to prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() calback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which makes it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Neat idea, thanks. This should also help out with the insane proliferation of
vm_flags_set in ->mmap() callbacks all over. Hopefully.
However, could we:
1) Add a small writeup to Documentation/filesystems/vfs.rst for this callback
2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
So things like remap_pfn_range can still be used by drivers' mmap()
implementation.
1) is particularly important so our VFS and driver friends know this is supposed
to be The Way Forward.
--
Pedro
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 9:04 ` Pedro Falcato
@ 2025-05-14 9:12 ` Lorenzo Stoakes
2025-05-14 10:01 ` Pedro Falcato
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-14 9:12 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > Provide a means by which drivers can specify which fields of those
> > permitted to be changed should be altered to prior to mmap()'ing a
> > range (which may either result from a merge or from mapping an entirely new
> > VMA).
> >
> > Doing so is substantially safer than the existing .mmap() calback which
> > provides unrestricted access to the part-constructed VMA and permits
> > drivers and file systems to do 'creative' things which makes it hard to
> > reason about the state of the VMA after the function returns.
> >
> > The existing .mmap() callback's freedom has caused a great deal of issues,
> > especially in error handling, as unwinding the mmap() state has proven to
> > be non-trivial and caused significant issues in the past, for instance
> > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > error path behaviour").
> >
> > It also necessitates a second attempt at merge once the .mmap() callback
> > has completed, which has caused issues in the past, is awkward, adds
> > overhead and is difficult to reason about.
> >
> > The .mmap_prepare() callback eliminates this requirement, as we can update
> > fields prior to even attempting the first merge. It is safer, as we heavily
> > restrict what can actually be modified, and being invoked very early in the
> > mmap() process, error handling can be performed safely with very little
> > unwinding of state required.
> >
> > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > exclusive, so we permit only one to be invoked at a time.
> >
> > Update vma userland test stubs to account for changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
>
> Neat idea, thanks. This should also help out with the insane proliferation of
> vm_flags_set in ->mmap() callbacks all over. Hopefully.
Thanks! :)
>
> However, could we:
>
> 1) Add a small writeup to Documentation/filesystems/vfs.rst for this callback
I looked at this but the problem is the docs there are already hopelessly
out of date (see the kernel version referred to).
But I'll look into how best to do the docs going forward.
>
> 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> So things like remap_pfn_range can still be used by drivers' mmap()
> implementation.
Thanks for raising the remap_pfn_range() case! Yes this is definitely a
thing.
However this proposed callback would totally undermine the purpose of the
change - the idea is to never give a vma because if we do so we lose all of
the advantages here and may as well just leave the mmap in place for
this...
However I do think we'll need a new callback at some point (previously
discussed in thread).
We could perhaps provide the option to _explicitly_ remap for instance. I
would want it to be heavily locked down as to what can happen and to happen
as early as possible.
This is something we can iterate on, as trying to find the ideal scheme
immediately will just lead to inaction, the big advantage with the approach
here is we can be iterative.
We provide this, use it in a scenario which allows us to eliminate merge
retry, and can take it from there :)
So indeed, watch this space basically... I will be highly proactive on this
stuff moving forward.
>
> 1) is particularly important so our VFS and driver friends know this is supposed
> to be The Way Forward.
I think probably the answer is for me to fully update the document to be
bang up to date, right? But that would obviously work best as a follow up
patch :)
Have added to todo, so will follow up on this thanks!
>
> --
> Pedro
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 9:12 ` Lorenzo Stoakes
@ 2025-05-14 10:01 ` Pedro Falcato
2025-05-14 10:14 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2025-05-14 10:01 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote:
> On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > > Provide a means by which drivers can specify which fields of those
> > > permitted to be changed should be altered to prior to mmap()'ing a
> > > range (which may either result from a merge or from mapping an entirely new
> > > VMA).
> > >
> > > Doing so is substantially safer than the existing .mmap() calback which
> > > provides unrestricted access to the part-constructed VMA and permits
> > > drivers and file systems to do 'creative' things which makes it hard to
> > > reason about the state of the VMA after the function returns.
> > >
> > > The existing .mmap() callback's freedom has caused a great deal of issues,
> > > especially in error handling, as unwinding the mmap() state has proven to
> > > be non-trivial and caused significant issues in the past, for instance
> > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > error path behaviour").
> > >
> > > It also necessitates a second attempt at merge once the .mmap() callback
> > > has completed, which has caused issues in the past, is awkward, adds
> > > overhead and is difficult to reason about.
> > >
> > > The .mmap_prepare() callback eliminates this requirement, as we can update
> > > fields prior to even attempting the first merge. It is safer, as we heavily
> > > restrict what can actually be modified, and being invoked very early in the
> > > mmap() process, error handling can be performed safely with very little
> > > unwinding of state required.
> > >
> > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > > exclusive, so we permit only one to be invoked at a time.
> > >
> > > Update vma userland test stubs to account for changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> >
> > Neat idea, thanks. This should also help out with the insane proliferation of
> > vm_flags_set in ->mmap() callbacks all over. Hopefully.
[snip]
> >
> > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> > So things like remap_pfn_range can still be used by drivers' mmap()
> > implementation.
>
> Thanks for raising the remap_pfn_range() case! Yes this is definitely a
> thing.
>
> However this proposed callback would totally undermine the purpose of the
> change - the idea is to never give a vma because if we do so we lose all of
> the advantages here and may as well just leave the mmap in place for
> this...
>
Yes, good point.
> However I do think we'll need a new callback at some point (previously
> discussed in thread).
>
> We could perhaps provide the option to _explicitly_ remap for instance. I
> would want it to be heavily locked down as to what can happen and to happen
> as early as possible.
>
I think we can simply combine various ideas here. Like:
struct vm_area_desc_private {
struct vm_area_desc desc;
struct vm_area_struct *vma;
};
Then, for this "mmap_finish" callback and associated infra:
int (*mmap_finish)(struct vm_area_desc *desc);
int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/)
{
struct vm_area_desc_private *private = container_of(desc, struct vm_area_desc_private, desc);
return remap_pfn_range(private->vma, /*...*/);
}
int random_driver_mmap_finish(struct vm_area_desc *desc)
{
return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size,
desc->page_prot);
}
I think something of the sort would be quite less prone to abuse, and we could
take the time to then even polish up the interface (e.g maybe it would be nicer
if mmap_remap_pfn_range took a vma offset, and not a start address).
Anyway, just brainstorming. This idea came to mind, I think it's quite interesting.
> This is something we can iterate on, as trying to find the ideal scheme
> immediately will just lead to inaction, the big advantage with the approach
> here is we can be iterative.
>
> We provide this, use it in a scenario which allows us to eliminate merge
> retry, and can take it from there :)
Yep, totally.
>
> So indeed, watch this space basically... I will be highly proactive on this
> stuff moving forward.
>
> >
> > 1) is particularly important so our VFS and driver friends know this is supposed
> > to be The Way Forward.
>
> I think probably the answer is for me to fully update the document to be
> bang up to date, right? But that would obviously work best as a follow up
> patch :)
>
You love your big projects :p I had the impression the docs were more or less up to date?
The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area,
and now ->mmap_prepare. And the descriptions are ATM quite useless, just
"called by the mmap(2) system call".
--
Pedro
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 10:01 ` Pedro Falcato
@ 2025-05-14 10:14 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-14 10:14 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 11:01:49AM +0100, Pedro Falcato wrote:
> On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote:
> > On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> > > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > > > Provide a means by which drivers can specify which fields of those
> > > > permitted to be changed should be altered to prior to mmap()'ing a
> > > > range (which may either result from a merge or from mapping an entirely new
> > > > VMA).
> > > >
> > > > Doing so is substantially safer than the existing .mmap() calback which
> > > > provides unrestricted access to the part-constructed VMA and permits
> > > > drivers and file systems to do 'creative' things which makes it hard to
> > > > reason about the state of the VMA after the function returns.
> > > >
> > > > The existing .mmap() callback's freedom has caused a great deal of issues,
> > > > especially in error handling, as unwinding the mmap() state has proven to
> > > > be non-trivial and caused significant issues in the past, for instance
> > > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > > error path behaviour").
> > > >
> > > > It also necessitates a second attempt at merge once the .mmap() callback
> > > > has completed, which has caused issues in the past, is awkward, adds
> > > > overhead and is difficult to reason about.
> > > >
> > > > The .mmap_prepare() callback eliminates this requirement, as we can update
> > > > fields prior to even attempting the first merge. It is safer, as we heavily
> > > > restrict what can actually be modified, and being invoked very early in the
> > > > mmap() process, error handling can be performed safely with very little
> > > > unwinding of state required.
> > > >
> > > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > > > exclusive, so we permit only one to be invoked at a time.
> > > >
> > > > Update vma userland test stubs to account for changes.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > >
> > > Neat idea, thanks. This should also help out with the insane proliferation of
> > > vm_flags_set in ->mmap() callbacks all over. Hopefully.
> [snip]
> > >
> > > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> > > So things like remap_pfn_range can still be used by drivers' mmap()
> > > implementation.
> >
> > Thanks for raising the remap_pfn_range() case! Yes this is definitely a
> > thing.
> >
> > However this proposed callback would totally undermine the purpose of the
> > change - the idea is to never give a vma because if we do so we lose all of
> > the advantages here and may as well just leave the mmap in place for
> > this...
> >
>
> Yes, good point.
>
> > However I do think we'll need a new callback at some point (previously
> > discussed in thread).
> >
> > We could perhaps provide the option to _explicitly_ remap for instance. I
> > would want it to be heavily locked down as to what can happen and to happen
> > as early as possible.
> >
>
> I think we can simply combine various ideas here. Like:
>
> struct vm_area_desc_private {
> struct vm_area_desc desc;
> struct vm_area_struct *vma;
> };
>
> Then, for this "mmap_finish" callback and associated infra:
>
> int (*mmap_finish)(struct vm_area_desc *desc);
>
> int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/)
> {
> struct vm_area_desc_private *private = container_of(desc, struct vm_area_desc_private, desc);
> return remap_pfn_range(private->vma, /*...*/);
> }
>
> int random_driver_mmap_finish(struct vm_area_desc *desc)
> {
> return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size,
> desc->page_prot);
> }
>
>
Yeah that's really nice actually! I like that. Will note down for when we
come to this.
Obviously we need to know whent the finish implies a remap, but we can have
a field for this potentially set in the vm_area_desc to indicate this.
> I think something of the sort would be quite less prone to abuse, and we could
> take the time to then even polish up the interface (e.g maybe it would be nicer
> if mmap_remap_pfn_range took a vma offset, and not a start address).
Right yeah, David definitely wanted some improvements in this area so this
aligns with other intentions/work.
>
> Anyway, just brainstorming. This idea came to mind, I think it's quite interesting.
Nice ideas :) yeah, good chance to undo some sins here.
>
> > This is something we can iterate on, as trying to find the ideal scheme
> > immediately will just lead to inaction, the big advantage with the approach
> > here is we can be iterative.
> >
> > We provide this, use it in a scenario which allows us to eliminate merge
> > retry, and can take it from there :)
>
> Yep, totally.
>
> >
> > So indeed, watch this space basically... I will be highly proactive on this
> > stuff moving forward.
> >
> > >
> > > 1) is particularly important so our VFS and driver friends know this is supposed
> > > to be The Way Forward.
> >
> > I think probably the answer is for me to fully update the document to be
> > bang up to date, right? But that would obviously work best as a follow up
> > patch :)
> >
>
> You love your big projects :p I had the impression the docs were more or less up to date?
> The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area,
> and now ->mmap_prepare. And the descriptions are ATM quite useless, just
> "called by the mmap(2) system call".
Much like leg day, sometimes painful things are worthwhile things :)
As for docs I saw there were some missing uring entries, and it explicitly
says 'as of kernel 4.18'.
So I feel that it really needs to be properly updated and it ends up being
a little out of scope to the series.
Since this series only uses this callback in one, mm-specific, place, I
think it makes sense to attack this on a follow up, alongside updating file
system callbacks etc.
But it's definitely on the todo! :)
Cheers, Lorenzo
>
> --
> Pedro
^ permalink raw reply [flat|nested] 23+ messages in thread