* [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only
@ 2024-11-28 15:06 Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
call_mmap()") (and preceding changes in the same series) it became possible
to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
behaviour") unintentionally undid this logic by moving the
mapping_map_writable() check before the shmem_mmap() hook is invoked,
thereby regressing this change.
This series reworks how we both permit write-sealed mappings being mapped
read-only and disallow mprotect() from undoing the write-seal, fixing this
regression.
We also add a regression test to ensure that we do not accidentally regress
this in future.
Thanks to Julian Orth for reporting this regression.
Note that this will require stable backports to 6.6.y and 6.12.y, I will
send these manually when this lands upstream.
Lorenzo Stoakes (2):
mm: reinstate ability to map write-sealed memfd mappings read-only
selftests/memfd: add test for mapping write-sealed memfd read-only
include/linux/memfd.h | 14 ++++++
include/linux/mm.h | 58 +++++++++++++++-------
mm/memfd.c | 2 +-
mm/mmap.c | 4 ++
tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++
5 files changed, 102 insertions(+), 19 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
@ 2024-11-28 15:06 ` Lorenzo Stoakes
2024-11-28 17:45 ` Jann Horn
2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes
2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
call_mmap()") (and preceding changes in the same series) it became possible
to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
This was previously unnecessarily disallowed, despite the man page
documentation indicating that it would be, thereby limiting the usefulness
of F_SEAL_WRITE logic.
We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
seal (one which disallows future writes to the memfd) to also be used for
F_SEAL_WRITE.
For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
read-only mapping to disallow mprotect() from overriding the seal - an
operation performed by seal_check_write(), invoked from shmem_mmap(), the
f_op->mmap() hook used by shmem mappings.
By extending this to F_SEAL_WRITE and critically - checking
mapping_map_writable() to determine if we may map the memfd AFTER we invoke
shmem_mmap() - the desired logic becomes possible. This is because
mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
have cleared.
Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
behaviour") unintentionally undid this logic by moving the
mapping_map_writable() check before the shmem_mmap() hook is invoked,
thereby regressing this change.
We reinstate this functionality by moving the check out of shmem_mmap() and
instead performing it in do_mmap() at the point at which VMA flags are
being determined, which seems in any case to be a more appropriate place in
which to make this determination.
In order to achieve this we rework memfd seal logic to allow us access to
this information using existing logic and eliminate the clearing of
VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
instead.
Reported-by: Julian Orth <ju.orth@gmail.com>
Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/memfd.h | 14 +++++++++++
include/linux/mm.h | 58 +++++++++++++++++++++++++++++--------------
mm/memfd.c | 2 +-
mm/mmap.c | 4 +++
4 files changed, 59 insertions(+), 19 deletions(-)
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 3f2cf339ceaf..d437e3070850 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -7,6 +7,7 @@
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx);
+unsigned int *memfd_file_seals_ptr(struct file *file);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
{
@@ -16,6 +17,19 @@ static inline struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
{
return ERR_PTR(-EINVAL);
}
+
+static inline unsigned int *memfd_file_seals_ptr(struct file *file)
+{
+ return NULL;
+}
#endif
+/* Retrieve memfd seals associated with the file, if any. */
+static inline unsigned int memfd_file_seals(struct file *file)
+{
+ unsigned int *sealsp = memfd_file_seals_ptr(file);
+
+ return sealsp ? *sealsp : 0;
+}
+
#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2bbf73eb53e7..043689a56e8d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4090,6 +4090,37 @@ void mem_dump_obj(void *object);
static inline void mem_dump_obj(void *object) {}
#endif
+static inline bool is_write_sealed(int seals)
+{
+ return seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE);
+}
+
+/**
+ * is_readonly_sealed - Checks whether write-sealed but mapped read-only,
+ * in which case writes should be disallowing moving
+ * forwards.
+ * @seals: the seals to check
+ * @vm_flags: the VMA flags to check
+ *
+ * Returns whether readonly sealed, in which case writess should be disallowed
+ * going forward.
+ */
+static inline bool is_readonly_sealed(int seals, vm_flags_t vm_flags)
+{
+ /*
+ * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
+ * MAP_SHARED and read-only, take care to not allow mprotect to
+ * revert protections on such mappings. Do this only for shared
+ * mappings. For private mappings, don't need to mask
+ * VM_MAYWRITE as we still want them to be COW-writable.
+ */
+ if (is_write_sealed(seals) &&
+ ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_SHARED))
+ return true;
+
+ return false;
+}
+
/**
* seal_check_write - Check for F_SEAL_WRITE or F_SEAL_FUTURE_WRITE flags and
* handle them.
@@ -4101,24 +4132,15 @@ static inline void mem_dump_obj(void *object) {}
*/
static inline int seal_check_write(int seals, struct vm_area_struct *vma)
{
- if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
- /*
- * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
- * write seals are active.
- */
- if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
- return -EPERM;
-
- /*
- * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
- * MAP_SHARED and read-only, take care to not allow mprotect to
- * revert protections on such mappings. Do this only for shared
- * mappings. For private mappings, don't need to mask
- * VM_MAYWRITE as we still want them to be COW-writable.
- */
- if (vma->vm_flags & VM_SHARED)
- vm_flags_clear(vma, VM_MAYWRITE);
- }
+ if (!is_write_sealed(seals))
+ return 0;
+
+ /*
+ * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
+ * write seals are active.
+ */
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
+ return -EPERM;
return 0;
}
diff --git a/mm/memfd.c b/mm/memfd.c
index c17c3ea701a1..35a370d75c9a 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -170,7 +170,7 @@ static int memfd_wait_for_pins(struct address_space *mapping)
return error;
}
-static unsigned int *memfd_file_seals_ptr(struct file *file)
+unsigned int *memfd_file_seals_ptr(struct file *file)
{
if (shmem_file(file))
return &SHMEM_I(file_inode(file))->seals;
diff --git a/mm/mmap.c b/mm/mmap.c
index 386429f7db5a..b1b2a24ef82e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
#include <linux/oom.h>
#include <linux/sched/mm.h>
#include <linux/ksm.h>
+#include <linux/memfd.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -368,6 +369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (file) {
struct inode *inode = file_inode(file);
+ unsigned int seals = memfd_file_seals(file);
unsigned long flags_mask;
if (!file_mmap_ok(file, inode, pgoff, len))
@@ -408,6 +410,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
+ else if (is_readonly_sealed(seals, vm_flags))
+ vm_flags &= ~VM_MAYWRITE;
fallthrough;
case MAP_PRIVATE:
if (!(file->f_mode & FMODE_READ))
--
2.47.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only
2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
@ 2024-11-28 15:06 ` Lorenzo Stoakes
2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 15:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only,
assert that we are able to do this in a test to ensure that we do not
regress this again.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 95af2d78fd31..46027c889e74 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -281,6 +281,24 @@ static void *mfd_assert_mmap_shared(int fd)
return p;
}
+static void *mfd_assert_mmap_read_shared(int fd)
+{
+ void *p;
+
+ p = mmap(NULL,
+ mfd_def_size,
+ PROT_READ,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ return p;
+}
+
static void *mfd_assert_mmap_private(int fd)
{
void *p;
@@ -979,6 +997,30 @@ static void test_seal_future_write(void)
close(fd);
}
+static void test_seal_write_map_read_shared(void)
+{
+ int fd;
+ void *p;
+
+ printf("%s SEAL-WRITE-MAP-READ\n", memfd_str);
+
+ fd = mfd_assert_new("kern_memfd_seal_write_map_read",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ mfd_assert_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE);
+
+ p = mfd_assert_mmap_read_shared(fd);
+
+ mfd_assert_read(fd);
+ mfd_assert_read_shared(fd);
+ mfd_fail_write(fd);
+
+ munmap(p, mfd_def_size);
+ close(fd);
+}
+
/*
* Test SEAL_SHRINK
* Test whether SEAL_SHRINK actually prevents shrinking
@@ -1587,6 +1629,7 @@ int main(int argc, char **argv)
test_seal_write();
test_seal_future_write();
+ test_seal_write_map_read_shared();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
--
2.47.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
@ 2024-11-28 17:45 ` Jann Horn
2024-11-28 17:58 ` Julian Orth
2024-11-28 18:05 ` Lorenzo Stoakes
0 siblings, 2 replies; 15+ messages in thread
From: Jann Horn @ 2024-11-28 17:45 UTC (permalink / raw)
To: Lorenzo Stoakes, Julian Orth
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Shuah Khan,
Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> call_mmap()") (and preceding changes in the same series) it became possible
> to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
>
> This was previously unnecessarily disallowed, despite the man page
> documentation indicating that it would be, thereby limiting the usefulness
> of F_SEAL_WRITE logic.
>
> We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> seal (one which disallows future writes to the memfd) to also be used for
> F_SEAL_WRITE.
>
> For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> read-only mapping to disallow mprotect() from overriding the seal - an
> operation performed by seal_check_write(), invoked from shmem_mmap(), the
> f_op->mmap() hook used by shmem mappings.
>
> By extending this to F_SEAL_WRITE and critically - checking
> mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> shmem_mmap() - the desired logic becomes possible. This is because
> mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> have cleared.
>
> Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> behaviour") unintentionally undid this logic by moving the
> mapping_map_writable() check before the shmem_mmap() hook is invoked,
> thereby regressing this change.
>
> We reinstate this functionality by moving the check out of shmem_mmap() and
> instead performing it in do_mmap() at the point at which VMA flags are
> being determined, which seems in any case to be a more appropriate place in
> which to make this determination.
>
> In order to achieve this we rework memfd seal logic to allow us access to
> this information using existing logic and eliminate the clearing of
> VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> instead.
If we already check is_readonly_sealed() and strip VM_MAYWRITE in
do_mmap(), without holding any kind of lock or counter on the file
yet, then this check is clearly racy somehow, right? I think we have a
race where we intermittently reject shared-readonly mmap() calls?
Like:
process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
process 2: adds a F_SEAL_WRITE seal
process 1: enters mmap_region(), is_shared_maywrite() is true,
mapping_map_writable() fails
But even if we fix that, the same scenario would result in
F_SEAL_WRITE randomly failing depending on the ordering, so it's not
like we can actually do anything particularly sensible if these two
operations race. Taking a step back, read-only shared mappings of
F_SEAL_WRITE-sealed files are just kind of a bad idea because if
someone first creates a read-only shared mapping and *then* tries to
apply F_SEAL_WRITE, that won't work because the existing mapping will
be VM_MAYWRITE.
And the manpage is just misleading on interaction with shared mappings
in general, it says "Using the F_ADD_SEALS operation to set the
F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
exists" when actually, it more or less fails if any shared mapping
exists at all.
@Julian Orth: Did you report this regression because this change
caused issues with existing userspace code?
> Reported-by: Julian Orth <ju.orth@gmail.com>
> Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 17:45 ` Jann Horn
@ 2024-11-28 17:58 ` Julian Orth
2024-11-28 18:20 ` Lorenzo Stoakes
2024-11-28 18:05 ` Lorenzo Stoakes
1 sibling, 1 reply; 15+ messages in thread
From: Julian Orth @ 2024-11-28 17:58 UTC (permalink / raw)
To: Jann Horn
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Shuah Khan, Linus Torvalds, linux-mm,
linux-kernel
(Re-sending the message below since I forgot to reply-all)
On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > call_mmap()") (and preceding changes in the same series) it became possible
> > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> >
> > This was previously unnecessarily disallowed, despite the man page
> > documentation indicating that it would be, thereby limiting the usefulness
> > of F_SEAL_WRITE logic.
> >
> > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > seal (one which disallows future writes to the memfd) to also be used for
> > F_SEAL_WRITE.
> >
> > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > read-only mapping to disallow mprotect() from overriding the seal - an
> > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > f_op->mmap() hook used by shmem mappings.
> >
> > By extending this to F_SEAL_WRITE and critically - checking
> > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > shmem_mmap() - the desired logic becomes possible. This is because
> > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > have cleared.
> >
> > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > behaviour") unintentionally undid this logic by moving the
> > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > thereby regressing this change.
> >
> > We reinstate this functionality by moving the check out of shmem_mmap() and
> > instead performing it in do_mmap() at the point at which VMA flags are
> > being determined, which seems in any case to be a more appropriate place in
> > which to make this determination.
> >
> > In order to achieve this we rework memfd seal logic to allow us access to
> > this information using existing logic and eliminate the clearing of
> > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > instead.
>
> If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> do_mmap(), without holding any kind of lock or counter on the file
> yet, then this check is clearly racy somehow, right? I think we have a
> race where we intermittently reject shared-readonly mmap() calls?
Apropos race, some time ago I reported a way to get a mutable mapping
for a write-sealed memfd via a race:
https://bugzilla.kernel.org/show_bug.cgi?id=219106
>
> Like:
>
> process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> process 2: adds a F_SEAL_WRITE seal
> process 1: enters mmap_region(), is_shared_maywrite() is true,
> mapping_map_writable() fails
>
> But even if we fix that, the same scenario would result in
> F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> like we can actually do anything particularly sensible if these two
> operations race. Taking a step back, read-only shared mappings of
> F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> someone first creates a read-only shared mapping and *then* tries to
> apply F_SEAL_WRITE, that won't work because the existing mapping will
> be VM_MAYWRITE.
>
> And the manpage is just misleading on interaction with shared mappings
> in general, it says "Using the F_ADD_SEALS operation to set the
> F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> exists" when actually, it more or less fails if any shared mapping
> exists at all.
>
> @Julian Orth: Did you report this regression because this change
> caused issues with existing userspace code?
I noticed this because it broke one of my testcases. It would also
affect production code but making that code work on pre-6.6 kernels is
probably a good idea anyway.
>
> > Reported-by: Julian Orth <ju.orth@gmail.com>
> > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 17:45 ` Jann Horn
2024-11-28 17:58 ` Julian Orth
@ 2024-11-28 18:05 ` Lorenzo Stoakes
2024-11-28 18:18 ` Jann Horn
1 sibling, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 18:05 UTC (permalink / raw)
To: Jann Horn
Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote:
> On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > call_mmap()") (and preceding changes in the same series) it became possible
> > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> >
> > This was previously unnecessarily disallowed, despite the man page
> > documentation indicating that it would be, thereby limiting the usefulness
> > of F_SEAL_WRITE logic.
> >
> > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > seal (one which disallows future writes to the memfd) to also be used for
> > F_SEAL_WRITE.
> >
> > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > read-only mapping to disallow mprotect() from overriding the seal - an
> > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > f_op->mmap() hook used by shmem mappings.
> >
> > By extending this to F_SEAL_WRITE and critically - checking
> > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > shmem_mmap() - the desired logic becomes possible. This is because
> > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > have cleared.
> >
> > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > behaviour") unintentionally undid this logic by moving the
> > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > thereby regressing this change.
> >
> > We reinstate this functionality by moving the check out of shmem_mmap() and
> > instead performing it in do_mmap() at the point at which VMA flags are
> > being determined, which seems in any case to be a more appropriate place in
> > which to make this determination.
> >
> > In order to achieve this we rework memfd seal logic to allow us access to
> > this information using existing logic and eliminate the clearing of
> > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > instead.
>
> If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> do_mmap(), without holding any kind of lock or counter on the file
> yet, then this check is clearly racy somehow, right? I think we have a
> race where we intermittently reject shared-readonly mmap() calls?
>
> Like:
>
> process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> process 2: adds a F_SEAL_WRITE seal
> process 1: enters mmap_region(), is_shared_maywrite() is true,
> mapping_map_writable() fails
I don't think this matters? Firstly these would have to be threads unless you
are going out of your way to transmit the memfd incompletely set up via a socket
or something, and then you'd have to be doing it on the assumption that it
wouldn't race?
The whole purpose of this change is to _allow read-only mapping *at all*_. Not
to avoid silly races that are the product of somebody doing stupid things.
>
> But even if we fix that, the same scenario would result in
> F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> like we can actually do anything particularly sensible if these two
> operations race. Taking a step back, read-only shared mappings of
> F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> someone first creates a read-only shared mapping and *then* tries to
> apply F_SEAL_WRITE, that won't work because the existing mapping will
> be VM_MAYWRITE.
I don't think so?
If they try to do that, attempting to apply the seal will fail as write will be
disallowed. So there's no risk of overriding the seal.
The idea is you establish a buffer, write into it, unmap, write-seal, and now
you can mmap() it PROT_READ.
Obviously it's not sensible (or really probably sensibly feasible) to try to
find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the
VM_MAYWRITE, so instead we simply disallow that scenario.
But it's totally reasonable to be able to mmap() it readonly afterwards.
>
> And the manpage is just misleading on interaction with shared mappings
> in general, it says "Using the F_ADD_SEALS operation to set the
> F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> exists" when actually, it more or less fails if any shared mapping
> exists at all.
No, it's when any writable mapping exists after my changes :) but people
might not be quite aware of the distinction between VM_MAYWRITE and
VM_WRITE.
>
> @Julian Orth: Did you report this regression because this change
> caused issues with existing userspace code?
>
> > Reported-by: Julian Orth <ju.orth@gmail.com>
> > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
In any case, we are not discussing my original patch in 6.6 that permitted
this behaviour, whether you agree or disagree it was sensible, we have
regressed user-visible behaviour, this change restores it.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 18:05 ` Lorenzo Stoakes
@ 2024-11-28 18:18 ` Jann Horn
2024-11-28 18:27 ` Lorenzo Stoakes
0 siblings, 1 reply; 15+ messages in thread
From: Jann Horn @ 2024-11-28 18:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 7:05 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote:
> > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > call_mmap()") (and preceding changes in the same series) it became possible
> > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > >
> > > This was previously unnecessarily disallowed, despite the man page
> > > documentation indicating that it would be, thereby limiting the usefulness
> > > of F_SEAL_WRITE logic.
> > >
> > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > seal (one which disallows future writes to the memfd) to also be used for
> > > F_SEAL_WRITE.
> > >
> > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > f_op->mmap() hook used by shmem mappings.
> > >
> > > By extending this to F_SEAL_WRITE and critically - checking
> > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > shmem_mmap() - the desired logic becomes possible. This is because
> > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > have cleared.
> > >
> > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > behaviour") unintentionally undid this logic by moving the
> > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > thereby regressing this change.
> > >
> > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > instead performing it in do_mmap() at the point at which VMA flags are
> > > being determined, which seems in any case to be a more appropriate place in
> > > which to make this determination.
> > >
> > > In order to achieve this we rework memfd seal logic to allow us access to
> > > this information using existing logic and eliminate the clearing of
> > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > instead.
> >
> > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > do_mmap(), without holding any kind of lock or counter on the file
> > yet, then this check is clearly racy somehow, right? I think we have a
> > race where we intermittently reject shared-readonly mmap() calls?
> >
> > Like:
> >
> > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> > process 2: adds a F_SEAL_WRITE seal
> > process 1: enters mmap_region(), is_shared_maywrite() is true,
> > mapping_map_writable() fails
>
> I don't think this matters? Firstly these would have to be threads unless you
> are going out of your way to transmit the memfd incompletely set up via a socket
> or something, and then you'd have to be doing it on the assumption that it
> wouldn't race?
Ah, I guess that's true.
> The whole purpose of this change is to _allow read-only mapping *at all*_. Not
> to avoid silly races that are the product of somebody doing stupid things.
>
> >
> > But even if we fix that, the same scenario would result in
> > F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> > like we can actually do anything particularly sensible if these two
> > operations race. Taking a step back, read-only shared mappings of
> > F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> > someone first creates a read-only shared mapping and *then* tries to
> > apply F_SEAL_WRITE, that won't work because the existing mapping will
> > be VM_MAYWRITE.
>
> I don't think so?
>
> If they try to do that, attempting to apply the seal will fail as write will be
> disallowed. So there's no risk of overriding the seal.
>
> The idea is you establish a buffer, write into it, unmap, write-seal, and now
> you can mmap() it PROT_READ.
>
> Obviously it's not sensible (or really probably sensibly feasible) to try to
> find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the
> VM_MAYWRITE, so instead we simply disallow that scenario.
>
> But it's totally reasonable to be able to mmap() it readonly afterwards.
>
> >
> > And the manpage is just misleading on interaction with shared mappings
> > in general, it says "Using the F_ADD_SEALS operation to set the
> > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> > exists" when actually, it more or less fails if any shared mapping
> > exists at all.
>
> No, it's when any writable mapping exists after my changes :) but people
> might not be quite aware of the distinction between VM_MAYWRITE and
> VM_WRITE.
To clarify, do you read "writable" as "VM_MAYWRITE|VM_SHARED"?
> > @Julian Orth: Did you report this regression because this change
> > caused issues with existing userspace code?
> >
> > > Reported-by: Julian Orth <ju.orth@gmail.com>
> > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> In any case, we are not discussing my original patch in 6.6 that permitted
> this behaviour, whether you agree or disagree it was sensible, we have
> regressed user-visible behaviour, this change restores it.
Hm, yeah, you're right.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 17:58 ` Julian Orth
@ 2024-11-28 18:20 ` Lorenzo Stoakes
2024-11-28 18:27 ` Julian Orth
2024-11-28 18:27 ` Jann Horn
0 siblings, 2 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 18:20 UTC (permalink / raw)
To: Julian Orth
Cc: Jann Horn, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote:
> (Re-sending the message below since I forgot to reply-all)
>
> On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > call_mmap()") (and preceding changes in the same series) it became possible
> > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > >
> > > This was previously unnecessarily disallowed, despite the man page
> > > documentation indicating that it would be, thereby limiting the usefulness
> > > of F_SEAL_WRITE logic.
> > >
> > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > seal (one which disallows future writes to the memfd) to also be used for
> > > F_SEAL_WRITE.
> > >
> > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > f_op->mmap() hook used by shmem mappings.
> > >
> > > By extending this to F_SEAL_WRITE and critically - checking
> > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > shmem_mmap() - the desired logic becomes possible. This is because
> > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > have cleared.
> > >
> > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > behaviour") unintentionally undid this logic by moving the
> > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > thereby regressing this change.
> > >
> > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > instead performing it in do_mmap() at the point at which VMA flags are
> > > being determined, which seems in any case to be a more appropriate place in
> > > which to make this determination.
> > >
> > > In order to achieve this we rework memfd seal logic to allow us access to
> > > this information using existing logic and eliminate the clearing of
> > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > instead.
> >
> > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > do_mmap(), without holding any kind of lock or counter on the file
> > yet, then this check is clearly racy somehow, right? I think we have a
> > race where we intermittently reject shared-readonly mmap() calls?
>
> Apropos race, some time ago I reported a way to get a mutable mapping
> for a write-sealed memfd via a race:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=219106
Kind of hard to read rust code, but it looks like you're intentionally
trying to race sealing on the assumption it's atomic when it's not? That
doesn't seem like a bug?
The intent of sealing memfds is you establish the memfd buffer, then seal
it and _only then_ expose it elsewhere.
I may be missing something here, however.
>
> >
> > Like:
> >
> > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> > process 2: adds a F_SEAL_WRITE seal
> > process 1: enters mmap_region(), is_shared_maywrite() is true,
> > mapping_map_writable() fails
> >
> > But even if we fix that, the same scenario would result in
> > F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> > like we can actually do anything particularly sensible if these two
> > operations race. Taking a step back, read-only shared mappings of
> > F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> > someone first creates a read-only shared mapping and *then* tries to
> > apply F_SEAL_WRITE, that won't work because the existing mapping will
> > be VM_MAYWRITE.
> >
> > And the manpage is just misleading on interaction with shared mappings
> > in general, it says "Using the F_ADD_SEALS operation to set the
> > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> > exists" when actually, it more or less fails if any shared mapping
> > exists at all.
> >
> > @Julian Orth: Did you report this regression because this change
> > caused issues with existing userspace code?
>
> I noticed this because it broke one of my testcases. It would also
> affect production code but making that code work on pre-6.6 kernels is
> probably a good idea anyway.
Thanks for having that test case! I have added a test here to ensure we do
not regress this again.
This was a new feature introduced in 6.6, there is no reason to backport it
to any earlier kernels if this is what you mean :)
It's more a convenience thing like 'hm I can read() this but I can
mmap-read this even though the man page says I can'.
>
> >
> > > Reported-by: Julian Orth <ju.orth@gmail.com>
> > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 18:20 ` Lorenzo Stoakes
@ 2024-11-28 18:27 ` Julian Orth
2024-11-28 18:27 ` Jann Horn
1 sibling, 0 replies; 15+ messages in thread
From: Julian Orth @ 2024-11-28 18:27 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Jann Horn, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 7:20 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote:
> > (Re-sending the message below since I forgot to reply-all)
> >
> > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > > call_mmap()") (and preceding changes in the same series) it became possible
> > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > > >
> > > > This was previously unnecessarily disallowed, despite the man page
> > > > documentation indicating that it would be, thereby limiting the usefulness
> > > > of F_SEAL_WRITE logic.
> > > >
> > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > > seal (one which disallows future writes to the memfd) to also be used for
> > > > F_SEAL_WRITE.
> > > >
> > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > > f_op->mmap() hook used by shmem mappings.
> > > >
> > > > By extending this to F_SEAL_WRITE and critically - checking
> > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > > shmem_mmap() - the desired logic becomes possible. This is because
> > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > > have cleared.
> > > >
> > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > > behaviour") unintentionally undid this logic by moving the
> > > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > > thereby regressing this change.
> > > >
> > > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > > instead performing it in do_mmap() at the point at which VMA flags are
> > > > being determined, which seems in any case to be a more appropriate place in
> > > > which to make this determination.
> > > >
> > > > In order to achieve this we rework memfd seal logic to allow us access to
> > > > this information using existing logic and eliminate the clearing of
> > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > > instead.
> > >
> > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > > do_mmap(), without holding any kind of lock or counter on the file
> > > yet, then this check is clearly racy somehow, right? I think we have a
> > > race where we intermittently reject shared-readonly mmap() calls?
> >
> > Apropos race, some time ago I reported a way to get a mutable mapping
> > for a write-sealed memfd via a race:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=219106
>
> Kind of hard to read rust code, but it looks like you're intentionally
> trying to race sealing on the assumption it's atomic when it's not? That
> doesn't seem like a bug?
>
> The intent of sealing memfds is you establish the memfd buffer, then seal
> it and _only then_ expose it elsewhere.
>
> I may be missing something here, however.
systemd allows the client to send journal logs via a file descriptor.
If this file descriptor is a write-sealed, truncate-sealed memfd,
systemd uses an optimized code path that relies on the contents of the
file being immutable.
I could not find any actual way to exploit this nor could I find any
other open-source code that relies on write-seals being accurate. But
maybe there is some code out there that could be exploited this way.
E.g. if it used the contents of the file as an array index, then you
could easily get a TOC/TOU exploit with this.
>
> >
> > >
> > > Like:
> > >
> > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> > > process 2: adds a F_SEAL_WRITE seal
> > > process 1: enters mmap_region(), is_shared_maywrite() is true,
> > > mapping_map_writable() fails
> > >
> > > But even if we fix that, the same scenario would result in
> > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> > > like we can actually do anything particularly sensible if these two
> > > operations race. Taking a step back, read-only shared mappings of
> > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> > > someone first creates a read-only shared mapping and *then* tries to
> > > apply F_SEAL_WRITE, that won't work because the existing mapping will
> > > be VM_MAYWRITE.
> > >
> > > And the manpage is just misleading on interaction with shared mappings
> > > in general, it says "Using the F_ADD_SEALS operation to set the
> > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> > > exists" when actually, it more or less fails if any shared mapping
> > > exists at all.
> > >
> > > @Julian Orth: Did you report this regression because this change
> > > caused issues with existing userspace code?
> >
> > I noticed this because it broke one of my testcases. It would also
> > affect production code but making that code work on pre-6.6 kernels is
> > probably a good idea anyway.
>
> Thanks for having that test case! I have added a test here to ensure we do
> not regress this again.
>
> This was a new feature introduced in 6.6, there is no reason to backport it
> to any earlier kernels if this is what you mean :)
>
> It's more a convenience thing like 'hm I can read() this but I can
> mmap-read this even though the man page says I can'.
>
> >
> > >
> > > > Reported-by: Julian Orth <ju.orth@gmail.com>
> > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 18:18 ` Jann Horn
@ 2024-11-28 18:27 ` Lorenzo Stoakes
0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 18:27 UTC (permalink / raw)
To: Jann Horn
Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 07:18:20PM +0100, Jann Horn wrote:
> On Thu, Nov 28, 2024 at 7:05 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote:
> > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > > call_mmap()") (and preceding changes in the same series) it became possible
> > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > > >
> > > > This was previously unnecessarily disallowed, despite the man page
> > > > documentation indicating that it would be, thereby limiting the usefulness
> > > > of F_SEAL_WRITE logic.
> > > >
> > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > > seal (one which disallows future writes to the memfd) to also be used for
> > > > F_SEAL_WRITE.
> > > >
> > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > > f_op->mmap() hook used by shmem mappings.
> > > >
> > > > By extending this to F_SEAL_WRITE and critically - checking
> > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > > shmem_mmap() - the desired logic becomes possible. This is because
> > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > > have cleared.
> > > >
> > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > > behaviour") unintentionally undid this logic by moving the
> > > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > > thereby regressing this change.
> > > >
> > > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > > instead performing it in do_mmap() at the point at which VMA flags are
> > > > being determined, which seems in any case to be a more appropriate place in
> > > > which to make this determination.
> > > >
> > > > In order to achieve this we rework memfd seal logic to allow us access to
> > > > this information using existing logic and eliminate the clearing of
> > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > > instead.
> > >
> > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > > do_mmap(), without holding any kind of lock or counter on the file
> > > yet, then this check is clearly racy somehow, right? I think we have a
> > > race where we intermittently reject shared-readonly mmap() calls?
> > >
> > > Like:
> > >
> > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed()
> > > process 2: adds a F_SEAL_WRITE seal
> > > process 1: enters mmap_region(), is_shared_maywrite() is true,
> > > mapping_map_writable() fails
> >
> > I don't think this matters? Firstly these would have to be threads unless you
> > are going out of your way to transmit the memfd incompletely set up via a socket
> > or something, and then you'd have to be doing it on the assumption that it
> > wouldn't race?
>
> Ah, I guess that's true.
>
> > The whole purpose of this change is to _allow read-only mapping *at all*_. Not
> > to avoid silly races that are the product of somebody doing stupid things.
> >
> > >
> > > But even if we fix that, the same scenario would result in
> > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not
> > > like we can actually do anything particularly sensible if these two
> > > operations race. Taking a step back, read-only shared mappings of
> > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if
> > > someone first creates a read-only shared mapping and *then* tries to
> > > apply F_SEAL_WRITE, that won't work because the existing mapping will
> > > be VM_MAYWRITE.
> >
> > I don't think so?
> >
> > If they try to do that, attempting to apply the seal will fail as write will be
> > disallowed. So there's no risk of overriding the seal.
> >
> > The idea is you establish a buffer, write into it, unmap, write-seal, and now
> > you can mmap() it PROT_READ.
> >
> > Obviously it's not sensible (or really probably sensibly feasible) to try to
> > find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the
> > VM_MAYWRITE, so instead we simply disallow that scenario.
> >
> > But it's totally reasonable to be able to mmap() it readonly afterwards.
> >
> > >
> > > And the manpage is just misleading on interaction with shared mappings
> > > in general, it says "Using the F_ADD_SEALS operation to set the
> > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping
> > > exists" when actually, it more or less fails if any shared mapping
> > > exists at all.
> >
> > No, it's when any writable mapping exists after my changes :) but people
> > might not be quite aware of the distinction between VM_MAYWRITE and
> > VM_WRITE.
>
> To clarify, do you read "writable" as "VM_MAYWRITE|VM_SHARED"?
Yeah. It's a nuanced thing, and I don't _think_ there's any way to mmap without
this so yeah in a way you're right, _except_ that I explicitly make this
possible.
It also protects you from mprotect() later making a F_SEAL_FUTURE_WRITE mapping
mmap()'d PROT_READ from being mprotect()'d PROT_READ | PROT_WRITE.
I really get the feeling that people felt F_SEAL_WRITE didn't work the way
anybody expected, so implemented F_SEAL_FUTURE_WRITE instead, which doesn't use
the address_space atomics mapping_map_writable() et al. use (since it doesn't
have the 'disallow if anything writably mapped now' semantics).
Not being able to mmap() read after applying the F_SEAL_WRITE really limits its
usefulness if that's how you want to access it too.
So this basically is a convenience thing just to let people be able to do that
as well as read()'ing data.
>
> > > @Julian Orth: Did you report this regression because this change
> > > caused issues with existing userspace code?
> > >
> > > > Reported-by: Julian Orth <ju.orth@gmail.com>
> > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
> > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > In any case, we are not discussing my original patch in 6.6 that permitted
> > this behaviour, whether you agree or disagree it was sensible, we have
> > regressed user-visible behaviour, this change restores it.
>
> Hm, yeah, you're right.
Yeah, I mean at the end of the day, unless we explicitly want to just get rid of
this feature here and possibly break userspace (probably very few people
affected to be honest but as Julian demonstrates at least some), we pretty much
have to do this.
I think this implementation is a less awful way of doing it than alternatives.
BTW we do at least fget() the file before we interact with the file object (I
mean mmap would be broken if we didn't), though at no point were we protected
from a racing seal being applied.
In the end, most users will want F_SEAL_FUTURE_WRITE anyway to be honest :)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 18:20 ` Lorenzo Stoakes
2024-11-28 18:27 ` Julian Orth
@ 2024-11-28 18:27 ` Jann Horn
2024-11-28 18:35 ` Lorenzo Stoakes
1 sibling, 1 reply; 15+ messages in thread
From: Jann Horn @ 2024-11-28 18:27 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote:
> > (Re-sending the message below since I forgot to reply-all)
> >
> > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > > call_mmap()") (and preceding changes in the same series) it became possible
> > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > > >
> > > > This was previously unnecessarily disallowed, despite the man page
> > > > documentation indicating that it would be, thereby limiting the usefulness
> > > > of F_SEAL_WRITE logic.
> > > >
> > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > > seal (one which disallows future writes to the memfd) to also be used for
> > > > F_SEAL_WRITE.
> > > >
> > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > > f_op->mmap() hook used by shmem mappings.
> > > >
> > > > By extending this to F_SEAL_WRITE and critically - checking
> > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > > shmem_mmap() - the desired logic becomes possible. This is because
> > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > > have cleared.
> > > >
> > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > > behaviour") unintentionally undid this logic by moving the
> > > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > > thereby regressing this change.
> > > >
> > > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > > instead performing it in do_mmap() at the point at which VMA flags are
> > > > being determined, which seems in any case to be a more appropriate place in
> > > > which to make this determination.
> > > >
> > > > In order to achieve this we rework memfd seal logic to allow us access to
> > > > this information using existing logic and eliminate the clearing of
> > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > > instead.
> > >
> > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > > do_mmap(), without holding any kind of lock or counter on the file
> > > yet, then this check is clearly racy somehow, right? I think we have a
> > > race where we intermittently reject shared-readonly mmap() calls?
> >
> > Apropos race, some time ago I reported a way to get a mutable mapping
> > for a write-sealed memfd via a race:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=219106
>
> Kind of hard to read rust code, but it looks like you're intentionally
> trying to race sealing on the assumption it's atomic when it's not? That
> doesn't seem like a bug?
>
> The intent of sealing memfds is you establish the memfd buffer, then seal
> it and _only then_ expose it elsewhere.
>
> I may be missing something here, however.
AFAIU memfds are supposed to also guarantee *to the recipient* of the
shared memfd that the memory inside it won't change anymore, so that
the recipient can parse data out of this shared memory buffer without
having to worry about the data concurrently changing. udmabuf_create()
looks like it indeed breaks that assumption by first calling
check_memfd_seals() and then doing udmabuf_pin_folios() without any
lock that prevents a seal being added in between those.
That's also why we have memfd_wait_for_pins(), which ensures that
folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE
seal is added.
(I believe that's one of the major differences in usecases of
F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough
for cases where only the creator of the memfd wants to prevent other
tasks from writing into it, while F_SEAL_WRITE is suitable for cases
where the creator and recipient of the memfd want mutual protection.)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 18:27 ` Jann Horn
@ 2024-11-28 18:35 ` Lorenzo Stoakes
0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-28 18:35 UTC (permalink / raw)
To: Jann Horn
Cc: Julian Orth, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Shuah Khan, Linus Torvalds, linux-mm, linux-kernel
On Thu, Nov 28, 2024 at 07:27:44PM +0100, Jann Horn wrote:
> On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote:
> > > (Re-sending the message below since I forgot to reply-all)
> > >
> > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> > > > > call_mmap()") (and preceding changes in the same series) it became possible
> > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
> > > > >
> > > > > This was previously unnecessarily disallowed, despite the man page
> > > > > documentation indicating that it would be, thereby limiting the usefulness
> > > > > of F_SEAL_WRITE logic.
> > > > >
> > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
> > > > > seal (one which disallows future writes to the memfd) to also be used for
> > > > > F_SEAL_WRITE.
> > > > >
> > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
> > > > > read-only mapping to disallow mprotect() from overriding the seal - an
> > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the
> > > > > f_op->mmap() hook used by shmem mappings.
> > > > >
> > > > > By extending this to F_SEAL_WRITE and critically - checking
> > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke
> > > > > shmem_mmap() - the desired logic becomes possible. This is because
> > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
> > > > > have cleared.
> > > > >
> > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> > > > > behaviour") unintentionally undid this logic by moving the
> > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked,
> > > > > thereby regressing this change.
> > > > >
> > > > > We reinstate this functionality by moving the check out of shmem_mmap() and
> > > > > instead performing it in do_mmap() at the point at which VMA flags are
> > > > > being determined, which seems in any case to be a more appropriate place in
> > > > > which to make this determination.
> > > > >
> > > > > In order to achieve this we rework memfd seal logic to allow us access to
> > > > > this information using existing logic and eliminate the clearing of
> > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
> > > > > instead.
> > > >
> > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in
> > > > do_mmap(), without holding any kind of lock or counter on the file
> > > > yet, then this check is clearly racy somehow, right? I think we have a
> > > > race where we intermittently reject shared-readonly mmap() calls?
> > >
> > > Apropos race, some time ago I reported a way to get a mutable mapping
> > > for a write-sealed memfd via a race:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=219106
> >
> > Kind of hard to read rust code, but it looks like you're intentionally
> > trying to race sealing on the assumption it's atomic when it's not? That
> > doesn't seem like a bug?
> >
> > The intent of sealing memfds is you establish the memfd buffer, then seal
> > it and _only then_ expose it elsewhere.
> >
> > I may be missing something here, however.
>
> AFAIU memfds are supposed to also guarantee *to the recipient* of the
> shared memfd that the memory inside it won't change anymore, so that
> the recipient can parse data out of this shared memory buffer without
> having to worry about the data concurrently changing. udmabuf_create()
> looks like it indeed breaks that assumption by first calling
> check_memfd_seals() and then doing udmabuf_pin_folios() without any
> lock that prevents a seal being added in between those.
>
> That's also why we have memfd_wait_for_pins(), which ensures that
> folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE
> seal is added.
>
> (I believe that's one of the major differences in usecases of
> F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough
> for cases where only the creator of the memfd wants to prevent other
> tasks from writing into it, while F_SEAL_WRITE is suitable for cases
> where the creator and recipient of the memfd want mutual protection.)
Those being stupid can also be in kernel code... :) I mean that just looks
like udmabuf is doing something buggy and require a fix which is out of
scope for mm but perhaps worth reporting direct to udmabuf maintainers.
That makes sense re: F_SEAL_WRITE and further makes the case for making it
possible to read-only mmap() these buffers for convenience.
You'd also want to add other seals to ensure it's truly immutable.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes
@ 2024-11-29 10:03 ` Lorenzo Stoakes
2024-11-29 10:24 ` Andrew Morton
2 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 10:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :)
Thanks!
On Thu, Nov 28, 2024 at 03:06:16PM +0000, Lorenzo Stoakes wrote:
> In commit 158978945f31 ("mm: perform the mapping_map_writable() check after
> call_mmap()") (and preceding changes in the same series) it became possible
> to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
>
> Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
> behaviour") unintentionally undid this logic by moving the
> mapping_map_writable() check before the shmem_mmap() hook is invoked,
> thereby regressing this change.
>
> This series reworks how we both permit write-sealed mappings being mapped
> read-only and disallow mprotect() from undoing the write-seal, fixing this
> regression.
>
> We also add a regression test to ensure that we do not accidentally regress
> this in future.
>
> Thanks to Julian Orth for reporting this regression.
>
> Note that this will require stable backports to 6.6.y and 6.12.y, I will
> send these manually when this lands upstream.
>
> Lorenzo Stoakes (2):
> mm: reinstate ability to map write-sealed memfd mappings read-only
> selftests/memfd: add test for mapping write-sealed memfd read-only
>
> include/linux/memfd.h | 14 ++++++
> include/linux/mm.h | 58 +++++++++++++++-------
> mm/memfd.c | 2 +-
> mm/mmap.c | 4 ++
> tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++
> 5 files changed, 102 insertions(+), 19 deletions(-)
>
> --
> 2.47.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
@ 2024-11-29 10:24 ` Andrew Morton
2024-11-29 11:36 ` Lorenzo Stoakes
0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2024-11-29 10:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
On Fri, 29 Nov 2024 10:03:51 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :)
Jann's feedback didn't sound very ackish?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only
2024-11-29 10:24 ` Andrew Morton
@ 2024-11-29 11:36 ` Lorenzo Stoakes
0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2024-11-29 11:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Shuah Khan,
Julian Orth, Linus Torvalds, linux-mm, linux-kernel
On Fri, Nov 29, 2024 at 02:24:33AM -0800, Andrew Morton wrote:
> On Fri, 29 Nov 2024 10:03:51 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Andrew - This is a hotfix for 6.13, sorry forgot to tag the series as such :)
>
> Jann's feedback didn't sound very ackish?
Obviously Jann can indicate whether he feels this needs work or not, but I
believe we resolved the issues raised?
Mostly concerns around expected behaviour for these operations _in general_
and whether mmap-read'ing memfd write sealed mappings make sense, however
it's by the by, we accidentally regressed user-facing behaviour and so
should restore it.
I also feel I made a case for why it is in fact sane and useful to be able
to mmap memfd-write-sealed mappings :)
But of course - Jann can indicate whether he is happy with this or whether
it needs work, as usual his input is excellent and highly appreciated so
Jann - if you have any remaining concerns - please let me know.
Thanks!
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-11-29 11:37 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-28 15:06 [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 1/2] " Lorenzo Stoakes
2024-11-28 17:45 ` Jann Horn
2024-11-28 17:58 ` Julian Orth
2024-11-28 18:20 ` Lorenzo Stoakes
2024-11-28 18:27 ` Julian Orth
2024-11-28 18:27 ` Jann Horn
2024-11-28 18:35 ` Lorenzo Stoakes
2024-11-28 18:05 ` Lorenzo Stoakes
2024-11-28 18:18 ` Jann Horn
2024-11-28 18:27 ` Lorenzo Stoakes
2024-11-28 15:06 ` [PATCH 2/2] selftests/memfd: add test for mapping write-sealed memfd read-only Lorenzo Stoakes
2024-11-29 10:03 ` [PATCH 0/2] mm: reinstate ability to map write-sealed memfd mappings read-only Lorenzo Stoakes
2024-11-29 10:24 ` Andrew Morton
2024-11-29 11:36 ` Lorenzo Stoakes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox