* [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only
@ 2024-11-28 15:06 Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
call_mmap()") (and preceding changes in the same series) it became possible
to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
behaviour") unintentionally undid this logic by moving the
mapping_map_writable() check before the shmem_mmap() hook is invoked,
thereby regressing this change.
This series reworks how we both permit write-sealed mappings being mapped
read-only and disallow mprotect() from undoing the write-seal, fixing this
regression.
We also add a regression test to ensure that we do not accidentally regress
this in future.
Thanks to Julian Orth for reporting this regression.
Note that this will require stable backports to 6.6.y and 6.12.y, I will
send these manually when this lands upstream.
Lorenzo Stoakes (2):
mm: reinstate ability to map write-sealed memfd mappings read-only
selftests/memfd: add test for mapping write-sealed memfd read-only
include/linux/memfd.h | 14 ++++++
include/linux/mm.h | 58 +++++++++++++++-------
mm/memfd.c | 2 +-
mm/mmap.c | 4 ++
tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++
5 files changed, 102 insertions(+), 19 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes @ 2024-11-28 15:06 ` Lorenzo Stoakes 2024-11-28 17:45 ` Jann Horn 2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes 2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2 siblings, 1 reply; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan, Julian Orth, Linus Torvalds, linux-mm, linux-kernel In commit 158978945f31 ("mm: perform the mapping_map_writable() check after call_mmap()") (and preceding changes in the same series) it became possible to mmap() F_SEAL_WRITE sealed memfd mappings read-only. This was previously unnecessarily disallowed, despite the man page documentation indicating that it would be, thereby limiting the usefulness of F_SEAL_WRITE logic. We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE seal (one which disallows future writes to the memfd) to also be used for F_SEAL_WRITE. For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a read-only mapping to disallow mprotect() from overriding the seal - an operation performed by seal_check_write(), invoked from shmem_mmap(), the f_op->mmap() hook used by shmem mappings. By extending this to F_SEAL_WRITE and critically - checking mapping_map_writable() to determine if we may map the memfd AFTER we invoke shmem_mmap() - the desired logic becomes possible. This is because mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will have cleared. Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") unintentionally undid this logic by moving the mapping_map_writable() check before the shmem_mmap() hook is invoked, thereby regressing this change. We reinstate this functionality by moving the check out of shmem_mmap() and instead performing it in do_mmap() at the point at which VMA flags are being determined, which seems in any case to be a more appropriate place in which to make this determination. In order to achieve this we rework memfd seal logic to allow us access to this information using existing logic and eliminate the clearing of VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() instead. Reported-by: Julian Orth <ju.orth@gmail.com> Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- include/linux/memfd.h | 14 +++++++++++ include/linux/mm.h | 58 +++++++++++++++++++++++++++++-------------- mm/memfd.c | 2 +- mm/mmap.c | 4 +++ 4 files changed, 59 insertions(+), 19 deletions(-) diff --git a/include/linux/memfd.h b/include/linux/memfd.h index 3f2cf339ceaf..d437e3070850 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -7,6 +7,7 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx); +unsigned int *memfd_file_seals_ptr(struct file *file); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { @@ -16,6 +17,19 @@ static inline struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) { return ERR_PTR(-EINVAL); } + +static inline unsigned int *memfd_file_seals_ptr(struct file *file) +{ + return NULL; +} #endif +/* Retrieve memfd seals associated with the file, if any. */ +static inline unsigned int memfd_file_seals(struct file *file) +{ + unsigned int *sealsp = memfd_file_seals_ptr(file); + + return sealsp ? *sealsp : 0; +} + #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 2bbf73eb53e7..043689a56e8d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4090,6 +4090,37 @@ void mem_dump_obj(void *object); static inline void mem_dump_obj(void *object) {} #endif +static inline bool is_write_sealed(int seals) +{ + return seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE); +} + +/** + * is_readonly_sealed - Checks whether write-sealed but mapped read-only, + * in which case writes should be disallowing moving + * forwards. + * @seals: the seals to check + * @vm_flags: the VMA flags to check + * + * Returns whether readonly sealed, in which case writess should be disallowed + * going forward. + */ +static inline bool is_readonly_sealed(int seals, vm_flags_t vm_flags) +{ + /* + * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as + * MAP_SHARED and read-only, take care to not allow mprotect to + * revert protections on such mappings. Do this only for shared + * mappings. For private mappings, don't need to mask + * VM_MAYWRITE as we still want them to be COW-writable. + */ + if (is_write_sealed(seals) && + ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_SHARED)) + return true; + + return false; +} + /** * seal_check_write - Check for F_SEAL_WRITE or F_SEAL_FUTURE_WRITE flags and * handle them. @@ -4101,24 +4132,15 @@ static inline void mem_dump_obj(void *object) {} */ static inline int seal_check_write(int seals, struct vm_area_struct *vma) { - if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) { - /* - * New PROT_WRITE and MAP_SHARED mmaps are not allowed when - * write seals are active. - */ - if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE)) - return -EPERM; - - /* - * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as - * MAP_SHARED and read-only, take care to not allow mprotect to - * revert protections on such mappings. Do this only for shared - * mappings. For private mappings, don't need to mask - * VM_MAYWRITE as we still want them to be COW-writable. - */ - if (vma->vm_flags & VM_SHARED) - vm_flags_clear(vma, VM_MAYWRITE); - } + if (!is_write_sealed(seals)) + return 0; + + /* + * New PROT_WRITE and MAP_SHARED mmaps are not allowed when + * write seals are active. + */ + if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE)) + return -EPERM; return 0; } diff --git a/mm/memfd.c b/mm/memfd.c index c17c3ea701a1..35a370d75c9a 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -170,7 +170,7 @@ static int memfd_wait_for_pins(struct address_space *mapping) return error; } -static unsigned int *memfd_file_seals_ptr(struct file *file) +unsigned int *memfd_file_seals_ptr(struct file *file) { if (shmem_file(file)) return &SHMEM_I(file_inode(file))->seals; diff --git a/mm/mmap.c b/mm/mmap.c index 386429f7db5a..b1b2a24ef82e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -47,6 +47,7 @@ #include <linux/oom.h> #include <linux/sched/mm.h> #include <linux/ksm.h> +#include <linux/memfd.h> #include <linux/uaccess.h> #include <asm/cacheflush.h> @@ -368,6 +369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (file) { struct inode *inode = file_inode(file); + unsigned int seals = memfd_file_seals(file); unsigned long flags_mask; if (!file_mmap_ok(file, inode, pgoff, len)) @@ -408,6 +410,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); + else if (is_readonly_sealed(seals, vm_flags)) + vm_flags &= ~VM_MAYWRITE; fallthrough; case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) -- 2.47.0 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes @ 2024-11-28 17:45 ` Jann Horn 2024-11-28 17:58 ` Julian Orth 2024-11-28 18:05 ` Lorenzo Stoakes 0 siblings, 2 replies; 15+ messages in thread From: Jann Horn @ 2024-11-28 17:45 UTC (permalink / raw) To: Lorenzo Stoakes, Julian Orth Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > call_mmap()") (and preceding changes in the same series) it became possible > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > This was previously unnecessarily disallowed, despite the man page > documentation indicating that it would be, thereby limiting the usefulness > of F_SEAL_WRITE logic. > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > seal (one which disallows future writes to the memfd) to also be used for > F_SEAL_WRITE. > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > read-only mapping to disallow mprotect() from overriding the seal - an > operation performed by seal_check_write(), invoked from shmem_mmap(), the > f_op->mmap() hook used by shmem mappings. > > By extending this to F_SEAL_WRITE and critically - checking > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > shmem_mmap() - the desired logic becomes possible. This is because > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > have cleared. > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > behaviour") unintentionally undid this logic by moving the > mapping_map_writable() check before the shmem_mmap() hook is invoked, > thereby regressing this change. > > We reinstate this functionality by moving the check out of shmem_mmap() and > instead performing it in do_mmap() at the point at which VMA flags are > being determined, which seems in any case to be a more appropriate place in > which to make this determination. > > In order to achieve this we rework memfd seal logic to allow us access to > this information using existing logic and eliminate the clearing of > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > instead. If we already check is_readonly_sealed() and strip VM_MAYWRITE in do_mmap(), without holding any kind of lock or counter on the file yet, then this check is clearly racy somehow, right? I think we have a race where we intermittently reject shared-readonly mmap() calls? Like: process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() process 2: adds a F_SEAL_WRITE seal process 1: enters mmap_region(), is_shared_maywrite() is true, mapping_map_writable() fails But even if we fix that, the same scenario would result in F_SEAL_WRITE randomly failing depending on the ordering, so it's not like we can actually do anything particularly sensible if these two operations race. Taking a step back, read-only shared mappings of F_SEAL_WRITE-sealed files are just kind of a bad idea because if someone first creates a read-only shared mapping and *then* tries to apply F_SEAL_WRITE, that won't work because the existing mapping will be VM_MAYWRITE. And the manpage is just misleading on interaction with shared mappings in general, it says "Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping exists" when actually, it more or less fails if any shared mapping exists at all. @Julian Orth: Did you report this regression because this change caused issues with existing userspace code? > Reported-by: Julian Orth <ju.orth@gmail.com> > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 17:45 ` Jann Horn @ 2024-11-28 17:58 ` Julian Orth 2024-11-28 18:20 ` Lorenzo Stoakes 2024-11-28 18:05 ` Lorenzo Stoakes 1 sibling, 1 reply; 15+ messages in thread From: Julian Orth @ 2024-11-28 17:58 UTC (permalink / raw) To: Jann Horn Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel (Re-sending the message below since I forgot to reply-all) On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote: > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > call_mmap()") (and preceding changes in the same series) it became possible > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > This was previously unnecessarily disallowed, despite the man page > > documentation indicating that it would be, thereby limiting the usefulness > > of F_SEAL_WRITE logic. > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > seal (one which disallows future writes to the memfd) to also be used for > > F_SEAL_WRITE. > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > read-only mapping to disallow mprotect() from overriding the seal - an > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > f_op->mmap() hook used by shmem mappings. > > > > By extending this to F_SEAL_WRITE and critically - checking > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > shmem_mmap() - the desired logic becomes possible. This is because > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > have cleared. > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > behaviour") unintentionally undid this logic by moving the > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > thereby regressing this change. > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > instead performing it in do_mmap() at the point at which VMA flags are > > being determined, which seems in any case to be a more appropriate place in > > which to make this determination. > > > > In order to achieve this we rework memfd seal logic to allow us access to > > this information using existing logic and eliminate the clearing of > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > instead. > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > do_mmap(), without holding any kind of lock or counter on the file > yet, then this check is clearly racy somehow, right? I think we have a > race where we intermittently reject shared-readonly mmap() calls? Apropos race, some time ago I reported a way to get a mutable mapping for a write-sealed memfd via a race: https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > Like: > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > process 2: adds a F_SEAL_WRITE seal > process 1: enters mmap_region(), is_shared_maywrite() is true, > mapping_map_writable() fails > > But even if we fix that, the same scenario would result in > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > like we can actually do anything particularly sensible if these two > operations race. Taking a step back, read-only shared mappings of > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > someone first creates a read-only shared mapping and *then* tries to > apply F_SEAL_WRITE, that won't work because the existing mapping will > be VM_MAYWRITE. > > And the manpage is just misleading on interaction with shared mappings > in general, it says "Using the F_ADD_SEALS operation to set the > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > exists" when actually, it more or less fails if any shared mapping > exists at all. > > @Julian Orth: Did you report this regression because this change > caused issues with existing userspace code? I noticed this because it broke one of my testcases. It would also affect production code but making that code work on pre-6.6 kernels is probably a good idea anyway. > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 17:58 ` Julian Orth @ 2024-11-28 18:20 ` Lorenzo Stoakes 2024-11-28 18:27 ` Julian Orth 2024-11-28 18:27 ` Jann Horn 0 siblings, 2 replies; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 18:20 UTC (permalink / raw) To: Julian Orth Cc: Jann Horn, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > (Re-sending the message below since I forgot to reply-all) > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote: > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > call_mmap()") (and preceding changes in the same series) it became possible > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > This was previously unnecessarily disallowed, despite the man page > > > documentation indicating that it would be, thereby limiting the usefulness > > > of F_SEAL_WRITE logic. > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > seal (one which disallows future writes to the memfd) to also be used for > > > F_SEAL_WRITE. > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > f_op->mmap() hook used by shmem mappings. > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > shmem_mmap() - the desired logic becomes possible. This is because > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > have cleared. > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > behaviour") unintentionally undid this logic by moving the > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > thereby regressing this change. > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > instead performing it in do_mmap() at the point at which VMA flags are > > > being determined, which seems in any case to be a more appropriate place in > > > which to make this determination. > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > this information using existing logic and eliminate the clearing of > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > instead. > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > do_mmap(), without holding any kind of lock or counter on the file > > yet, then this check is clearly racy somehow, right? I think we have a > > race where we intermittently reject shared-readonly mmap() calls? > > Apropos race, some time ago I reported a way to get a mutable mapping > for a write-sealed memfd via a race: > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 Kind of hard to read rust code, but it looks like you're intentionally trying to race sealing on the assumption it's atomic when it's not? That doesn't seem like a bug? The intent of sealing memfds is you establish the memfd buffer, then seal it and _only then_ expose it elsewhere. I may be missing something here, however. > > > > > Like: > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > process 2: adds a F_SEAL_WRITE seal > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > mapping_map_writable() fails > > > > But even if we fix that, the same scenario would result in > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > like we can actually do anything particularly sensible if these two > > operations race. Taking a step back, read-only shared mappings of > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > someone first creates a read-only shared mapping and *then* tries to > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > be VM_MAYWRITE. > > > > And the manpage is just misleading on interaction with shared mappings > > in general, it says "Using the F_ADD_SEALS operation to set the > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > exists" when actually, it more or less fails if any shared mapping > > exists at all. > > > > @Julian Orth: Did you report this regression because this change > > caused issues with existing userspace code? > > I noticed this because it broke one of my testcases. It would also > affect production code but making that code work on pre-6.6 kernels is > probably a good idea anyway. Thanks for having that test case! I have added a test here to ensure we do not regress this again. This was a new feature introduced in 6.6, there is no reason to backport it to any earlier kernels if this is what you mean :) It's more a convenience thing like 'hm I can read() this but I can mmap-read this even though the man page says I can'. > > > > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 18:20 ` Lorenzo Stoakes @ 2024-11-28 18:27 ` Julian Orth 2024-11-28 18:27 ` Jann Horn 1 sibling, 0 replies; 15+ messages in thread From: Julian Orth @ 2024-11-28 18:27 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Jann Horn, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 7:20 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > (Re-sending the message below since I forgot to reply-all) > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote: > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > of F_SEAL_WRITE logic. > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > F_SEAL_WRITE. > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > have cleared. > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > behaviour") unintentionally undid this logic by moving the > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > thereby regressing this change. > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > being determined, which seems in any case to be a more appropriate place in > > > > which to make this determination. > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > this information using existing logic and eliminate the clearing of > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > instead. > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > do_mmap(), without holding any kind of lock or counter on the file > > > yet, then this check is clearly racy somehow, right? I think we have a > > > race where we intermittently reject shared-readonly mmap() calls? > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > for a write-sealed memfd via a race: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > Kind of hard to read rust code, but it looks like you're intentionally > trying to race sealing on the assumption it's atomic when it's not? That > doesn't seem like a bug? > > The intent of sealing memfds is you establish the memfd buffer, then seal > it and _only then_ expose it elsewhere. > > I may be missing something here, however. systemd allows the client to send journal logs via a file descriptor. If this file descriptor is a write-sealed, truncate-sealed memfd, systemd uses an optimized code path that relies on the contents of the file being immutable. I could not find any actual way to exploit this nor could I find any other open-source code that relies on write-seals being accurate. But maybe there is some code out there that could be exploited this way. E.g. if it used the contents of the file as an array index, then you could easily get a TOC/TOU exploit with this. > > > > > > > > > Like: > > > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > > process 2: adds a F_SEAL_WRITE seal > > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > > mapping_map_writable() fails > > > > > > But even if we fix that, the same scenario would result in > > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > > like we can actually do anything particularly sensible if these two > > > operations race. Taking a step back, read-only shared mappings of > > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > > someone first creates a read-only shared mapping and *then* tries to > > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > > be VM_MAYWRITE. > > > > > > And the manpage is just misleading on interaction with shared mappings > > > in general, it says "Using the F_ADD_SEALS operation to set the > > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > > exists" when actually, it more or less fails if any shared mapping > > > exists at all. > > > > > > @Julian Orth: Did you report this regression because this change > > > caused issues with existing userspace code? > > > > I noticed this because it broke one of my testcases. It would also > > affect production code but making that code work on pre-6.6 kernels is > > probably a good idea anyway. > > Thanks for having that test case! I have added a test here to ensure we do > not regress this again. > > This was a new feature introduced in 6.6, there is no reason to backport it > to any earlier kernels if this is what you mean :) > > It's more a convenience thing like 'hm I can read() this but I can > mmap-read this even though the man page says I can'. > > > > > > > > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 18:20 ` Lorenzo Stoakes 2024-11-28 18:27 ` Julian Orth @ 2024-11-28 18:27 ` Jann Horn 2024-11-28 18:35 ` Lorenzo Stoakes 1 sibling, 1 reply; 15+ messages in thread From: Jann Horn @ 2024-11-28 18:27 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > (Re-sending the message below since I forgot to reply-all) > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote: > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > of F_SEAL_WRITE logic. > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > F_SEAL_WRITE. > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > have cleared. > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > behaviour") unintentionally undid this logic by moving the > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > thereby regressing this change. > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > being determined, which seems in any case to be a more appropriate place in > > > > which to make this determination. > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > this information using existing logic and eliminate the clearing of > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > instead. > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > do_mmap(), without holding any kind of lock or counter on the file > > > yet, then this check is clearly racy somehow, right? I think we have a > > > race where we intermittently reject shared-readonly mmap() calls? > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > for a write-sealed memfd via a race: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > Kind of hard to read rust code, but it looks like you're intentionally > trying to race sealing on the assumption it's atomic when it's not? That > doesn't seem like a bug? > > The intent of sealing memfds is you establish the memfd buffer, then seal > it and _only then_ expose it elsewhere. > > I may be missing something here, however. AFAIU memfds are supposed to also guarantee *to the recipient* of the shared memfd that the memory inside it won't change anymore, so that the recipient can parse data out of this shared memory buffer without having to worry about the data concurrently changing. udmabuf_create() looks like it indeed breaks that assumption by first calling check_memfd_seals() and then doing udmabuf_pin_folios() without any lock that prevents a seal being added in between those. That's also why we have memfd_wait_for_pins(), which ensures that folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE seal is added. (I believe that's one of the major differences in usecases of F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough for cases where only the creator of the memfd wants to prevent other tasks from writing into it, while F_SEAL_WRITE is suitable for cases where the creator and recipient of the memfd want mutual protection.) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 18:27 ` Jann Horn @ 2024-11-28 18:35 ` Lorenzo Stoakes 0 siblings, 0 replies; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 18:35 UTC (permalink / raw) To: Jann Horn Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 07:27:44PM +0100, Jann Horn wrote: > On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > > (Re-sending the message below since I forgot to reply-all) > > > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > > of F_SEAL_WRITE logic. > > > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > > F_SEAL_WRITE. > > > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > > have cleared. > > > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > > behaviour") unintentionally undid this logic by moving the > > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > > thereby regressing this change. > > > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > > being determined, which seems in any case to be a more appropriate place in > > > > > which to make this determination. > > > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > > this information using existing logic and eliminate the clearing of > > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > > instead. > > > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > > do_mmap(), without holding any kind of lock or counter on the file > > > > yet, then this check is clearly racy somehow, right? I think we have a > > > > race where we intermittently reject shared-readonly mmap() calls? > > > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > > for a write-sealed memfd via a race: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > > > Kind of hard to read rust code, but it looks like you're intentionally > > trying to race sealing on the assumption it's atomic when it's not? That > > doesn't seem like a bug? > > > > The intent of sealing memfds is you establish the memfd buffer, then seal > > it and _only then_ expose it elsewhere. > > > > I may be missing something here, however. > > AFAIU memfds are supposed to also guarantee *to the recipient* of the > shared memfd that the memory inside it won't change anymore, so that > the recipient can parse data out of this shared memory buffer without > having to worry about the data concurrently changing. udmabuf_create() > looks like it indeed breaks that assumption by first calling > check_memfd_seals() and then doing udmabuf_pin_folios() without any > lock that prevents a seal being added in between those. > > That's also why we have memfd_wait_for_pins(), which ensures that > folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE > seal is added. > > (I believe that's one of the major differences in usecases of > F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough > for cases where only the creator of the memfd wants to prevent other > tasks from writing into it, while F_SEAL_WRITE is suitable for cases > where the creator and recipient of the memfd want mutual protection.) Those being stupid can also be in kernel code... :) I mean that just looks like udmabuf is doing something buggy and require a fix which is out of scope for mm but perhaps worth reporting direct to udmabuf maintainers. That makes sense re: F_SEAL_WRITE and further makes the case for making it possible to read-only mmap() these buffers for convenience. You'd also want to add other seals to ensure it's truly immutable. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 17:45 ` Jann Horn 2024-11-28 17:58 ` Julian Orth @ 2024-11-28 18:05 ` Lorenzo Stoakes 2024-11-28 18:18 ` Jann Horn 1 sibling, 1 reply; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 18:05 UTC (permalink / raw) To: Jann Horn Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote: > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > call_mmap()") (and preceding changes in the same series) it became possible > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > This was previously unnecessarily disallowed, despite the man page > > documentation indicating that it would be, thereby limiting the usefulness > > of F_SEAL_WRITE logic. > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > seal (one which disallows future writes to the memfd) to also be used for > > F_SEAL_WRITE. > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > read-only mapping to disallow mprotect() from overriding the seal - an > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > f_op->mmap() hook used by shmem mappings. > > > > By extending this to F_SEAL_WRITE and critically - checking > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > shmem_mmap() - the desired logic becomes possible. This is because > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > have cleared. > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > behaviour") unintentionally undid this logic by moving the > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > thereby regressing this change. > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > instead performing it in do_mmap() at the point at which VMA flags are > > being determined, which seems in any case to be a more appropriate place in > > which to make this determination. > > > > In order to achieve this we rework memfd seal logic to allow us access to > > this information using existing logic and eliminate the clearing of > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > instead. > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > do_mmap(), without holding any kind of lock or counter on the file > yet, then this check is clearly racy somehow, right? I think we have a > race where we intermittently reject shared-readonly mmap() calls? > > Like: > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > process 2: adds a F_SEAL_WRITE seal > process 1: enters mmap_region(), is_shared_maywrite() is true, > mapping_map_writable() fails I don't think this matters? Firstly these would have to be threads unless you are going out of your way to transmit the memfd incompletely set up via a socket or something, and then you'd have to be doing it on the assumption that it wouldn't race? The whole purpose of this change is to _allow read-only mapping *at all*_. Not to avoid silly races that are the product of somebody doing stupid things. > > But even if we fix that, the same scenario would result in > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > like we can actually do anything particularly sensible if these two > operations race. Taking a step back, read-only shared mappings of > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > someone first creates a read-only shared mapping and *then* tries to > apply F_SEAL_WRITE, that won't work because the existing mapping will > be VM_MAYWRITE. I don't think so? If they try to do that, attempting to apply the seal will fail as write will be disallowed. So there's no risk of overriding the seal. The idea is you establish a buffer, write into it, unmap, write-seal, and now you can mmap() it PROT_READ. Obviously it's not sensible (or really probably sensibly feasible) to try to find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the VM_MAYWRITE, so instead we simply disallow that scenario. But it's totally reasonable to be able to mmap() it readonly afterwards. > > And the manpage is just misleading on interaction with shared mappings > in general, it says "Using the F_ADD_SEALS operation to set the > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > exists" when actually, it more or less fails if any shared mapping > exists at all. No, it's when any writable mapping exists after my changes :) but people might not be quite aware of the distinction between VM_MAYWRITE and VM_WRITE. > > @Julian Orth: Did you report this regression because this change > caused issues with existing userspace code? > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> In any case, we are not discussing my original patch in 6.6 that permitted this behaviour, whether you agree or disagree it was sensible, we have regressed user-visible behaviour, this change restores it. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 18:05 ` Lorenzo Stoakes @ 2024-11-28 18:18 ` Jann Horn 2024-11-28 18:27 ` Lorenzo Stoakes 0 siblings, 1 reply; 15+ messages in thread From: Jann Horn @ 2024-11-28 18:18 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 7:05 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote: > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > call_mmap()") (and preceding changes in the same series) it became possible > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > This was previously unnecessarily disallowed, despite the man page > > > documentation indicating that it would be, thereby limiting the usefulness > > > of F_SEAL_WRITE logic. > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > seal (one which disallows future writes to the memfd) to also be used for > > > F_SEAL_WRITE. > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > f_op->mmap() hook used by shmem mappings. > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > shmem_mmap() - the desired logic becomes possible. This is because > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > have cleared. > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > behaviour") unintentionally undid this logic by moving the > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > thereby regressing this change. > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > instead performing it in do_mmap() at the point at which VMA flags are > > > being determined, which seems in any case to be a more appropriate place in > > > which to make this determination. > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > this information using existing logic and eliminate the clearing of > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > instead. > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > do_mmap(), without holding any kind of lock or counter on the file > > yet, then this check is clearly racy somehow, right? I think we have a > > race where we intermittently reject shared-readonly mmap() calls? > > > > Like: > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > process 2: adds a F_SEAL_WRITE seal > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > mapping_map_writable() fails > > I don't think this matters? Firstly these would have to be threads unless you > are going out of your way to transmit the memfd incompletely set up via a socket > or something, and then you'd have to be doing it on the assumption that it > wouldn't race? Ah, I guess that's true. > The whole purpose of this change is to _allow read-only mapping *at all*_. Not > to avoid silly races that are the product of somebody doing stupid things. > > > > > But even if we fix that, the same scenario would result in > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > like we can actually do anything particularly sensible if these two > > operations race. Taking a step back, read-only shared mappings of > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > someone first creates a read-only shared mapping and *then* tries to > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > be VM_MAYWRITE. > > I don't think so? > > If they try to do that, attempting to apply the seal will fail as write will be > disallowed. So there's no risk of overriding the seal. > > The idea is you establish a buffer, write into it, unmap, write-seal, and now > you can mmap() it PROT_READ. > > Obviously it's not sensible (or really probably sensibly feasible) to try to > find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the > VM_MAYWRITE, so instead we simply disallow that scenario. > > But it's totally reasonable to be able to mmap() it readonly afterwards. > > > > > And the manpage is just misleading on interaction with shared mappings > > in general, it says "Using the F_ADD_SEALS operation to set the > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > exists" when actually, it more or less fails if any shared mapping > > exists at all. > > No, it's when any writable mapping exists after my changes :) but people > might not be quite aware of the distinction between VM_MAYWRITE and > VM_WRITE. To clarify, do you read "writable" as "VM_MAYWRITE|VM_SHARED"? > > @Julian Orth: Did you report this regression because this change > > caused issues with existing userspace code? > > > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > In any case, we are not discussing my original patch in 6.6 that permitted > this behaviour, whether you agree or disagree it was sensible, we have > regressed user-visible behaviour, this change restores it. Hm, yeah, you're right. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 18:18 ` Jann Horn @ 2024-11-28 18:27 ` Lorenzo Stoakes 0 siblings, 0 replies; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 18:27 UTC (permalink / raw) To: Jann Horn Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm, linux-kernel On Thu, Nov 28, 2024 at 07:18:20PM +0100, Jann Horn wrote: > On Thu, Nov 28, 2024 at 7:05 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote: > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > of F_SEAL_WRITE logic. > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > F_SEAL_WRITE. > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > have cleared. > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > behaviour") unintentionally undid this logic by moving the > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > thereby regressing this change. > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > being determined, which seems in any case to be a more appropriate place in > > > > which to make this determination. > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > this information using existing logic and eliminate the clearing of > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > instead. > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > do_mmap(), without holding any kind of lock or counter on the file > > > yet, then this check is clearly racy somehow, right? I think we have a > > > race where we intermittently reject shared-readonly mmap() calls? > > > > > > Like: > > > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > > process 2: adds a F_SEAL_WRITE seal > > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > > mapping_map_writable() fails > > > > I don't think this matters? Firstly these would have to be threads unless you > > are going out of your way to transmit the memfd incompletely set up via a socket > > or something, and then you'd have to be doing it on the assumption that it > > wouldn't race? > > Ah, I guess that's true. > > > The whole purpose of this change is to _allow read-only mapping *at all*_. Not > > to avoid silly races that are the product of somebody doing stupid things. > > > > > > > > But even if we fix that, the same scenario would result in > > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > > like we can actually do anything particularly sensible if these two > > > operations race. Taking a step back, read-only shared mappings of > > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > > someone first creates a read-only shared mapping and *then* tries to > > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > > be VM_MAYWRITE. > > > > I don't think so? > > > > If they try to do that, attempting to apply the seal will fail as write will be > > disallowed. So there's no risk of overriding the seal. > > > > The idea is you establish a buffer, write into it, unmap, write-seal, and now > > you can mmap() it PROT_READ. > > > > Obviously it's not sensible (or really probably sensibly feasible) to try to > > find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the > > VM_MAYWRITE, so instead we simply disallow that scenario. > > > > But it's totally reasonable to be able to mmap() it readonly afterwards. > > > > > > > > And the manpage is just misleading on interaction with shared mappings > > > in general, it says "Using the F_ADD_SEALS operation to set the > > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > > exists" when actually, it more or less fails if any shared mapping > > > exists at all. > > > > No, it's when any writable mapping exists after my changes :) but people > > might not be quite aware of the distinction between VM_MAYWRITE and > > VM_WRITE. > > To clarify, do you read "writable" as "VM_MAYWRITE|VM_SHARED"? Yeah. It's a nuanced thing, and I don't _think_ there's any way to mmap without this so yeah in a way you're right, _except_ that I explicitly make this possible. It also protects you from mprotect() later making a F_SEAL_FUTURE_WRITE mapping mmap()'d PROT_READ from being mprotect()'d PROT_READ | PROT_WRITE. I really get the feeling that people felt F_SEAL_WRITE didn't work the way anybody expected, so implemented F_SEAL_FUTURE_WRITE instead, which doesn't use the address_space atomics mapping_map_writable() et al. use (since it doesn't have the 'disallow if anything writably mapped now' semantics). Not being able to mmap() read after applying the F_SEAL_WRITE really limits its usefulness if that's how you want to access it too. So this basically is a convenience thing just to let people be able to do that as well as read()'ing data. > > > > @Julian Orth: Did you report this regression because this change > > > caused issues with existing userspace code? > > > > > > > Reported-by: Julian Orth <ju.orth@gmail.com> > > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/ > > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > > In any case, we are not discussing my original patch in 6.6 that permitted > > this behaviour, whether you agree or disagree it was sensible, we have > > regressed user-visible behaviour, this change restores it. > > Hm, yeah, you're right. Yeah, I mean at the end of the day, unless we explicitly want to just get rid of this feature here and possibly break userspace (probably very few people affected to be honest but as Julian demonstrates at least some), we pretty much have to do this. I think this implementation is a less awful way of doing it than alternatives. BTW we do at least fget() the file before we interact with the file object (I mean mmap would be broken if we didn't), though at no point were we protected from a racing seal being applied. In the end, most users will want F_SEAL_FUTURE_WRITE anyway to be honest :) ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only 2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes @ 2024-11-28 15:06 ` Lorenzo Stoakes 2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2 siblings, 0 replies; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan, Julian Orth, Linus Torvalds, linux-mm, linux-kernel Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only, assert that we are able to do this in a test to ensure that we do not regress this again. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 95af2d78fd31..46027c889e74 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -281,6 +281,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_assert_mmap_read_shared(int fd) +{ + void *p; + + p = mmap(NULL, + mfd_def_size, + PROT_READ, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -979,6 +997,30 @@ static void test_seal_future_write(void) close(fd); } +static void test_seal_write_map_read_shared(void) +{ + int fd; + void *p; + + printf("%s SEAL-WRITE-MAP-READ\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_write_map_read", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_add_seals(fd, F_SEAL_WRITE); + mfd_assert_has_seals(fd, F_SEAL_WRITE); + + p = mfd_assert_mmap_read_shared(fd); + + mfd_assert_read(fd); + mfd_assert_read_shared(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -1587,6 +1629,7 @@ int main(int argc, char **argv) test_seal_write(); test_seal_future_write(); + test_seal_write_map_read_shared(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.47.0 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes 2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes @ 2024-11-29 10:03 ` Lorenzo Stoakes 2024-11-29 10:24 ` Andrew Morton 2 siblings, 1 reply; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-29 10:03 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan, Julian Orth, Linus Torvalds, linux-mm, linux-kernel Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :) Thanks! On Thu, Nov 28, 2024 at 03:06:16PM +0000, Lorenzo Stoakes wrote: > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > call_mmap()") (and preceding changes in the same series) it became possible > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > behaviour") unintentionally undid this logic by moving the > mapping_map_writable() check before the shmem_mmap() hook is invoked, > thereby regressing this change. > > This series reworks how we both permit write-sealed mappings being mapped > read-only and disallow mprotect() from undoing the write-seal, fixing this > regression. > > We also add a regression test to ensure that we do not accidentally regress > this in future. > > Thanks to Julian Orth for reporting this regression. > > Note that this will require stable backports to 6.6.y and 6.12.y, I will > send these manually when this lands upstream. > > Lorenzo Stoakes (2): > mm: reinstate ability to map write-sealed memfd mappings read-only > selftests/memfd: add test for mapping write-sealed memfd read-only > > include/linux/memfd.h | 14 ++++++ > include/linux/mm.h | 58 +++++++++++++++------- > mm/memfd.c | 2 +- > mm/mmap.c | 4 ++ > tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++ > 5 files changed, 102 insertions(+), 19 deletions(-) > > -- > 2.47.0 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes @ 2024-11-29 10:24 ` Andrew Morton 2024-11-29 11:36 ` Lorenzo Stoakes 0 siblings, 1 reply; 15+ messages in thread From: Andrew Morton @ 2024-11-29 10:24 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan, Julian Orth, Linus Torvalds, linux-mm, linux-kernel On Fri, 29 Nov 2024 10:03:51 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :) Jann's feedback didn't sound very ackish? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only 2024-11-29 10:24 ` Andrew Morton @ 2024-11-29 11:36 ` Lorenzo Stoakes 0 siblings, 0 replies; 15+ messages in thread From: Lorenzo Stoakes @ 2024-11-29 11:36 UTC (permalink / raw) To: Andrew Morton Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan, Julian Orth, Linus Torvalds, linux-mm, linux-kernel On Fri, Nov 29, 2024 at 02:24:33AM -0800, Andrew Morton wrote: > On Fri, 29 Nov 2024 10:03:51 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :) > > Jann's feedback didn't sound very ackish? Obviously Jann can indicate whether he feels this needs work or not, but I believe we resolved the issues raised? Mostly concerns around expected behaviour for these operations _in general_ and whether mmap-read'ing memfd write sealed mappings make sense, however it's by the by, we accidentally regressed user-facing behaviour and so should restore it. I also feel I made a case for why it is in fact sane and useful to be able to mmap memfd-write-sealed mappings :) But of course - Jann can indicate whether he is happy with this or whether it needs work, as usual his input is excellent and highly appreciated so Jann - if you have any remaining concerns - please let me know. Thanks! ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-11-29 11:37 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes 2024-11-28 17:45 ` Jann Horn 2024-11-28 17:58 ` Julian Orth 2024-11-28 18:20 ` Lorenzo Stoakes 2024-11-28 18:27 ` Julian Orth 2024-11-28 18:27 ` Jann Horn 2024-11-28 18:35 ` Lorenzo Stoakes 2024-11-28 18:05 ` Lorenzo Stoakes 2024-11-28 18:18 ` Jann Horn 2024-11-28 18:27 ` Lorenzo Stoakes 2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes 2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes 2024-11-29 10:24 ` Andrew Morton 2024-11-29 11:36 ` Lorenzo Stoakes
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox