* [PATCH v8 00/19] fanotify: add pre-content hooks
@ 2024-11-15 15:30 Josef Bacik
2024-11-15 15:30 ` [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge Josef Bacik
` (19 more replies)
0 siblings, 20 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
v7: https://lore.kernel.org/linux-fsdevel/cover.1731433903.git.josef@toxicpanda.com/
v6: https://lore.kernel.org/linux-fsdevel/cover.1731355931.git.josef@toxicpanda.com/
v5: https://lore.kernel.org/linux-fsdevel/cover.1725481503.git.josef@toxicpanda.com/
v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
v7->v8:
- A bunch of work from Amir to clean up the fast path for the common case of no
watches, which cascades through the rest of the series to update the helpers
and the hooks to use the new helpers.
- A patch from Al to get rid of the __FMODE_NONOTIFY flag and clean up the usage
there, thanks Al!
v6->v7:
- As per Linus's suggestion, Amir added the file flag FMODE_NOTIFY_PERM that
will be set at open time if the file has permission related watches (this is
the original malware style permission watches and the new precontent watches).
All of the VFS hooks and the page fault hooks use this flag to determine if
they should generate a notification to allow for a much cheaper check in the
common case.
v5->v6:
- Linus had problems with this and rejected Jan's PR
(https://lore.kernel.org/linux-fsdevel/20240923110348.tbwihs42dxxltabc@quack3/),
so I'm respinning this series to address his concerns. Hopefully this is more
acceptable.
- Change the page fault hooks to happen only in the case where we have to add a
page, not where pages already exist.
- Amir added a hook to truncate.
- We made the flag per SB instead of per fstype, Amir wanted this because of
some potential issues with other file system specific work he's doing.
- Dropped the bcachefs patch, there were some concerns that we were doing
something wrong, and it's not a huge deal to not have this feature for now.
- Unfortunately the xfs write fault path still has to do the page fault hook
before we know if we have a page or not, this is because of the locking that's
done before we get to the part where we know if we have a page already or not,
so that's the path that is still the same from last iteration.
- I've re-validated this series with btrfs, xfs, and ext4 to make sure I didn't
break anything.
v4->v5:
- Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
to respin the series anyway.
- Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
- Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
- Added a patch to disable btrfs defrag on pre-content watched files.
- Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
- Added two fstests (which will be posted separately) to validate everything,
re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
didn't break anything.
v3->v4:
- Trying to send a final version Friday at 5pm before you go on vacation is a
recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
review.
- Reworked the file system helper so its handling of fpin was a little less
silly, per Chinner's suggestion.
- Updated the return values to not OR in VM_FAULT_RETRY, as we have a comment
in filemap_fault that says if VM_FAULT_ERROR is set we won't have
VM_FAULT_RETRY set.
v2->v3:
- Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
Amir's suggestion.
- Reworked the exported helper so the per-filesystem changes are much smaller,
per Amir's suggestion.
- Fixed the screwup for DAX writes per Chinner's suggestion.
- Added Christian's reviewed-by's where appropriate.
v1->v2:
- reworked the page fault logic based on Jan's suggestion and turned it into a
helper.
- Added 3 patches per-fs where we need to call the fsnotify helper from their
->fault handlers.
- Disabled readahead in the case that there's a pre-content watch in place.
- Disabled huge faults when there's a pre-content watch in place (entirely
because it's untested, theoretically it should be straightforward to do).
- Updated the command numbers.
- Addressed the random spelling/grammar mistakes that Jan pointed out.
- Addressed the other random nits from Jan.
--- Original email ---
Hello,
These are the patches for the bare bones pre-content fanotify support. The
majority of this work is Amir's, my contribution to this has solely been around
adding the page fault hooks, testing and validating everything. I'm sending it
because Amir is traveling a bunch, and I touched it last so I'm going to take
all the hate and he can take all the credit.
There is a PoC that I've been using to validate this work, you can find the git
repo here
https://github.com/josefbacik/remote-fetch
This consists of 3 different tools.
1. populate. This just creates all the stub files in the directory from the
source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
recursively create all of the stub files and directories.
2. remote-fetch. This is the actual PoC, you just point it at the source and
destination directory and then you can do whatever. ./remote-fetch ~/linux
~/hsm-linux.
3. mmap-validate. This was to validate the pagefault thing, this is likely what
will be turned into the selftest with remote-fetch. It creates a file and
then you can validate the file matches the right pattern with both normal
reads and mmap. Normally I do something like
./mmap-validate create ~/src/foo
./populate ~/src ~/dst
./remote-fetch ~/src ~/dst
./mmap-validate validate ~/dst/foo
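The create/validate idea behind mmap-validate can be sketched roughly as below. This is not the code from the remote-fetch repository; the pattern function, the file size and every demo_ name are assumptions made for illustration. It writes a known byte pattern to a file and then checks it once through pread() and once through a read-only mapping, which is the path that exercises the page fault hooks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_LEN 8192	/* hypothetical file size, two 4K pages */

/* Hypothetical pattern: the byte value depends only on the file offset. */
static uint8_t demo_pattern(size_t off)
{
	return (uint8_t)(off % 251);
}

/* "create" step: fill the file with the pattern. */
static bool demo_create(int fd)
{
	uint8_t buf[DEMO_LEN];

	assert(fd >= 0);
	for (size_t i = 0; i < DEMO_LEN; i++)
		buf[i] = demo_pattern(i);
	return pwrite(fd, buf, DEMO_LEN, 0) == DEMO_LEN;
}

static bool demo_matches(const uint8_t *data)
{
	for (size_t i = 0; i < DEMO_LEN; i++)
		if (data[i] != demo_pattern(i))
			return false;
	return true;
}

/* "validate" step, normal read path. */
static bool demo_validate_read(int fd)
{
	uint8_t buf[DEMO_LEN];

	return pread(fd, buf, DEMO_LEN, 0) == DEMO_LEN && demo_matches(buf);
}

/* "validate" step, page fault path: touching the mapping faults pages in. */
static bool demo_validate_mmap(int fd)
{
	void *map = mmap(NULL, DEMO_LEN, PROT_READ, MAP_PRIVATE, fd, 0);
	bool ok;

	if (map == MAP_FAILED)
		return false;
	ok = demo_matches(map);
	munmap(map, DEMO_LEN);
	return ok;
}

/* Create a temporary file, then validate it both ways. */
static bool demo_roundtrip(void)
{
	char path[] = "/tmp/mmap-validate-demo-XXXXXX";
	int fd = mkstemp(path);
	bool ok;

	if (fd < 0)
		return false;
	ok = demo_create(fd) && demo_validate_read(fd) && demo_validate_mmap(fd);
	close(fd);
	unlink(path);
	return ok;
}
```

In the real workflow the validate step runs against the stub in the destination tree, so a mismatch there would point at the on-demand fill rather than at the local file system.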
I did a bunch of testing, I also got some performance numbers. I copied a
kernel tree, and then did remote-fetch, and then make -j4
Normal
real 9m49.709s
user 28m11.372s
sys 4m57.304s
HSM
real 10m6.454s
user 29m10.517s
sys 5m2.617s
So the HSM build took ~17 seconds (about 2.8%) longer in wall-clock time. I then
did a make mrproper on both trees
[root@fedora ~]# du -hs /src/linux
1.6G /src/linux
[root@fedora ~]# du -hs dst
125M dst
This mirrors the sort of savings we've seen in production.
Meta has had these patches (minus the page fault patch) deployed in production
for almost a year with our own utility for doing on-demand package fetching.
The savings from this has been pretty significant.
The page-fault hooks are necessary for the last thing we need, which is
on-demand range fetching of executables. Some of our binaries are several gigs
large, so being able to remote fetch them on demand is a huge win for us,
not only in space savings but also in container startup time.
There will be tests for this going into LTP once we're satisfied with the
patches and they're on their way upstream. Thanks,
Josef
Al Viro (1):
fs: get rid of __FMODE_NONOTIFY kludge
Amir Goldstein (12):
fsnotify: opt-in for permission events at file open time
fsnotify: add helper to check if file is actually being watched
fanotify: don't skip extra event info if no info_mode is set
fanotify: rename a misnamed constant
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fsnotify: introduce pre-content permission events
fsnotify: pass optional file access range in pre-content event
fsnotify: generate pre-content permission event on truncate
fanotify: introduce FAN_PRE_ACCESS permission event
fanotify: report file range info with pre-content events
fanotify: allow to set errno in FAN_DENY permission response
fanotify: add a helper to check for pre content events
Josef Bacik (6):
fanotify: disable readahead if we have pre-content watches
mm: don't allow huge faults for files with pre content watches
fsnotify: generate pre-content permission event on page fault
xfs: add pre-content fsnotify hook for write faults
btrfs: disable defrag on pre-content watched files
fs: enable pre-content events on supported file systems
fs/btrfs/ioctl.c | 9 ++
fs/btrfs/super.c | 2 +-
fs/ext4/super.c | 3 +
fs/fcntl.c | 4 +-
fs/notify/fanotify/fanotify.c | 33 +++++--
fs/notify/fanotify/fanotify.h | 15 +++
fs/notify/fanotify/fanotify_user.c | 145 +++++++++++++++++++++++------
fs/notify/fsnotify.c | 56 +++++++++--
fs/open.c | 62 +++++++++---
fs/xfs/xfs_file.c | 4 +
fs/xfs/xfs_super.c | 2 +-
include/linux/fanotify.h | 19 +++-
include/linux/fs.h | 42 +++++++--
include/linux/fsnotify.h | 135 +++++++++++++++++++++++----
include/linux/fsnotify_backend.h | 60 +++++++++++-
include/linux/mm.h | 1 +
include/uapi/asm-generic/fcntl.h | 1 -
include/uapi/linux/fanotify.h | 18 ++++
mm/filemap.c | 90 ++++++++++++++++++
mm/memory.c | 22 +++++
mm/readahead.c | 13 +++
security/selinux/hooks.c | 3 +-
22 files changed, 639 insertions(+), 100 deletions(-)
--
2.43.0
* [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-18 18:14 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time Josef Bacik
` (18 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Al Viro <viro@zeniv.linux.org.uk>
All it takes to get rid of the __FMODE_NONOTIFY kludge is switching
fanotify from anon_inode_getfd() to anon_inode_getfile_fmode() and adding
a dentry_open_nonotify() helper to be used by fanotify on the other path.
That's it - no more weird shit in OPEN_FMODE(), etc.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/linux-fsdevel/20241113043003.GH3387508@ZenIV/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/fcntl.c | 4 ++--
fs/notify/fanotify/fanotify_user.c | 25 ++++++++++++++++---------
fs/open.c | 23 +++++++++++++++++++----
include/linux/fs.h | 6 +++---
include/uapi/asm-generic/fcntl.h | 1 -
5 files changed, 40 insertions(+), 19 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index ac77dd912412..88db23aa864a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1155,10 +1155,10 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
- __FMODE_EXEC | __FMODE_NONOTIFY));
+ __FMODE_EXEC));
fasync_cache = kmem_cache_create("fasync_cache",
sizeof(struct fasync_struct), 0,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 2d85c71717d6..919ff59cb802 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -100,8 +100,7 @@ static void __init fanotify_sysctls_init(void)
*
* Internal and external open flags are stored together in field f_flags of
* struct file. Only external open flags shall be allowed in event_f_flags.
- * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
- * excluded.
+ * Internal flags like FMODE_EXEC shall be excluded.
*/
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
O_ACCMODE | O_APPEND | O_NONBLOCK | \
@@ -258,12 +257,11 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
return client_fd;
/*
- * we need a new file handle for the userspace program so it can read even if it was
- * originally opened O_WRONLY.
+ * We provide an fd for the userspace program, so it could access the
+ * file without generating fanotify events itself.
*/
- new_file = dentry_open(path,
- group->fanotify_data.f_flags | __FMODE_NONOTIFY,
- current_cred());
+ new_file = dentry_open_nonotify(path, group->fanotify_data.f_flags,
+ current_cred());
if (IS_ERR(new_file)) {
put_unused_fd(client_fd);
client_fd = PTR_ERR(new_file);
@@ -1409,6 +1407,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
unsigned int class = flags & FANOTIFY_CLASS_BITS;
unsigned int internal_flags = 0;
+ struct file *file;
pr_debug("%s: flags=%x event_f_flags=%x\n",
__func__, flags, event_f_flags);
@@ -1477,7 +1476,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
(!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
return -EINVAL;
- f_flags = O_RDWR | __FMODE_NONOTIFY;
+ f_flags = O_RDWR;
if (flags & FAN_CLOEXEC)
f_flags |= O_CLOEXEC;
if (flags & FAN_NONBLOCK)
@@ -1555,10 +1554,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
goto out_destroy_group;
}
- fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
+ fd = get_unused_fd_flags(f_flags);
if (fd < 0)
goto out_destroy_group;
+ file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
+ f_flags, FMODE_NONOTIFY);
+ if (IS_ERR(file)) {
+ put_unused_fd(fd);
+ fd = PTR_ERR(file);
+ goto out_destroy_group;
+ }
+ fd_install(fd, file);
return fd;
out_destroy_group:
diff --git a/fs/open.c b/fs/open.c
index e6911101fe71..c3490286092e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1105,6 +1105,23 @@ struct file *dentry_open(const struct path *path, int flags,
}
EXPORT_SYMBOL(dentry_open);
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+ const struct cred *cred)
+{
+ struct file *f = alloc_empty_file(flags, cred);
+ if (!IS_ERR(f)) {
+ int error;
+
+ f->f_mode |= FMODE_NONOTIFY;
+ error = vfs_open(path, f);
+ if (error) {
+ fput(f);
+ f = ERR_PTR(error);
+ }
+ }
+ return f;
+}
+
/**
* dentry_create - Create and open a file
* @path: path to create
@@ -1202,7 +1219,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
inline int build_open_flags(const struct open_how *how, struct open_flags *op)
{
u64 flags = how->flags;
- u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
+ u64 strip = O_CLOEXEC;
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
@@ -1210,9 +1227,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
"struct open_flags doesn't yet handle flags > 32 bits");
/*
- * Strip flags that either shouldn't be set by userspace like
- * FMODE_NONOTIFY or that aren't relevant in determining struct
- * open_flags like O_CLOEXEC.
+ * Strip flags that aren't relevant in determining struct open_flags.
*/
flags &= ~strip;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9c13222362f5..23bd058576b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2750,6 +2750,8 @@ static inline struct file *file_open_root_mnt(struct vfsmount *mnt,
}
struct file *dentry_open(const struct path *path, int flags,
const struct cred *creds);
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+ const struct cred *cred);
struct file *dentry_create(const struct path *path, int flags, umode_t mode,
const struct cred *cred);
struct path *backing_file_user_path(struct file *f);
@@ -3706,11 +3708,9 @@ struct ctl_table;
int __init list_bdev_fs_names(char *buf, size_t size);
#define __FMODE_EXEC ((__force int) FMODE_EXEC)
-#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
-#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
- (flag & __FMODE_NONOTIFY)))
+#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
static inline bool is_sxid(umode_t mode)
{
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 80f37a0d40d7..613475285643 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -6,7 +6,6 @@
/*
* FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x4000000
* These cannot be used by userspace O_* until internal and external open
* flags are split.
* -Eric Paris
--
2.43.0
* [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
2024-11-15 15:30 ` [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-20 15:53 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched Josef Bacik
` (17 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Legacy inotify/fanotify listeners can add watches for events on inode,
parent or mount and expect to get events (e.g. FS_MODIFY) on files that
were already open at the time of setting up the watches.
fanotify permission events are typically used by anti-malware software
that watches the entire mount, and it is not common to have more than
one anti-malware engine installed on a system.
To reduce the overhead of the fsnotify_file_perm() hooks on every file
access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
events only if there were *any* permission event listeners on the
filesystem at the time that the file was opened.
The new semantic is implemented by extending the FMODE_NONOTIFY bit into
two FMODE_NONOTIFY_* bits, which together store a mode that selects which
event types to report.
This is going to apply to the new fanotify pre-content events in order
to reduce the cost of the new pre-content event vfs hooks.
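The two-bit mode described above can be modelled in userspace roughly as follows. The bit positions are taken from the diff below; the DEMO_/demo_ names and the two boolean inputs (which stand in for the fsnotify_sb_has_priority_watchers() checks) are assumptions made for illustration, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as in the patch; the DEMO_ names are hypothetical. */
#define DEMO_NONOTIFY_HSM	(UINT32_C(1) << 25)
#define DEMO_NONOTIFY_PERM	(UINT32_C(1) << 26)
#define DEMO_NONOTIFY		(DEMO_NONOTIFY_HSM | DEMO_NONOTIFY_PERM)

/* Both bits set: no events at all (a file opened by fanotify itself). */
static bool demo_mode_none(uint32_t mode)
{
	return (mode & DEMO_NONOTIFY) == DEMO_NONOTIFY;
}

/* Permission events are wanted unless the PERM bit is set. */
static bool demo_mode_perm(uint32_t mode)
{
	return !(mode & DEMO_NONOTIFY_PERM);
}

/* Pre-content events are wanted only when neither bit is set. */
static bool demo_mode_hsm(uint32_t mode)
{
	return (mode & DEMO_NONOTIFY) == 0;
}

/* Models the open-time decision: sample the watchers once, store the mode. */
static uint32_t demo_set_mode(uint32_t mode, bool perm_watchers,
			      bool pre_content_watchers)
{
	if (demo_mode_none(mode))
		return mode;
	if (!perm_watchers)
		return mode | DEMO_NONOTIFY_PERM;
	if (!pre_content_watchers)
		return mode | DEMO_NONOTIFY_HSM;
	return mode;
}

/* Usage example: the three outcomes for a freshly opened file. */
static void demo_self_check(void)
{
	assert(!demo_mode_perm(demo_set_mode(0, false, false)));
	assert(demo_mode_perm(demo_set_mode(0, true, false)));
	assert(!demo_mode_hsm(demo_set_mode(0, true, false)));
	assert(demo_mode_hsm(demo_set_mode(0, true, true)));
}
```

The point of the encoding is that the hot-path hooks only test bits already stored in the file's mode instead of looking up watchers on every access.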
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/open.c | 8 ++++-
include/linux/fs.h | 35 ++++++++++++++++---
include/linux/fsnotify.h | 72 +++++++++++++++++++++++++++++++---------
3 files changed, 93 insertions(+), 22 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index c3490286092e..1a9483872e1f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -901,7 +901,7 @@ static int do_dentry_open(struct file *f,
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
+ f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_NONOTIFY;
f->f_op = &empty_fops;
return 0;
}
@@ -929,6 +929,12 @@ static int do_dentry_open(struct file *f,
if (error)
goto cleanup_all;
+ /*
+ * Set FMODE_NONOTIFY_* bits according to existing permission watches.
+ * If FMODE_NONOTIFY was already set for an fanotify fd, this doesn't
+ * change anything.
+ */
+ file_set_fsnotify_mode(f);
error = fsnotify_open_perm(f);
if (error)
goto cleanup_all;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23bd058576b1..8e5c783013d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
-/* FMODE_* bit 24 */
-
/* File is embedded in backing_file object */
-#define FMODE_BACKING ((__force fmode_t)(1 << 25))
+#define FMODE_BACKING ((__force fmode_t)(1 << 24))
-/* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
+/* File shouldn't generate fanotify pre-content events */
+#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
+
+/* File shouldn't generate fanotify permission events */
+#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)(1 << 27))
@@ -190,6 +191,30 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
+/*
+ * The two FMODE_NONOTIFY_ bits used together have a special meaning of
+ * not reporting any events at all including non-permission events.
+ * These are the possible values of FMODE_FSNOTIFY(f->f_mode) and their meaning:
+ *
+ * FMODE_NONOTIFY_HSM - suppress only pre-content events.
+ * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
+ * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
+ */
+#define FMODE_FSNOTIFY_MASK \
+ (FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
+#define FMODE_NONOTIFY FMODE_FSNOTIFY_MASK
+#define FMODE_FSNOTIFY(mode) \
+ ((mode) & FMODE_FSNOTIFY_MASK)
+
+#define FMODE_FSNOTIFY_NONE(mode) \
+ (FMODE_FSNOTIFY(mode) == FMODE_NONOTIFY)
+#define FMODE_FSNOTIFY_NORMAL(mode) \
+ (FMODE_FSNOTIFY(mode) == FMODE_NONOTIFY_PERM)
+#define FMODE_FSNOTIFY_PERM(mode) \
+ (!((mode) & FMODE_NONOTIFY_PERM))
+#define FMODE_FSNOTIFY_HSM(mode) \
+ (FMODE_FSNOTIFY(mode) == 0)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..54ec97366d7c 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -108,38 +108,68 @@ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask)
fsnotify_parent(dentry, mask, dentry, FSNOTIFY_EVENT_DENTRY);
}
+static inline int fsnotify_path(const struct path *path, __u32 mask)
+{
+ return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
+}
+
static inline int fsnotify_file(struct file *file, __u32 mask)
{
- const struct path *path;
-
/*
* FMODE_NONOTIFY are fds generated by fanotify itself which should not
* generate new events. We also don't want to generate events for
* FMODE_PATH fds (involves open & close events) as they are just
* handle creation / destruction events and not "real" file events.
*/
- if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
+ if (FMODE_FSNOTIFY_NONE(file->f_mode))
return 0;
- path = &file->f_path;
- /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
- if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
- !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
- FSNOTIFY_PRIO_CONTENT))
- return 0;
-
- return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
+ return fsnotify_path(&file->f_path, mask);
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+/*
+ * At open time we check fsnotify_sb_has_priority_watchers() and set the
+ * FMODE_NONOTIFY_ mode bits accordingly.
+ * Later, fsnotify permission hooks do not check if there are permission event
+ * watches, but that there were permission event watches at open time.
+ */
+static void file_set_fsnotify_mode(struct file *file)
+{
+ struct super_block *sb = file->f_path.dentry->d_sb;
+
+ /* Is it a file opened by fanotify? */
+ if (FMODE_FSNOTIFY_NONE(file->f_mode))
+ return;
+
+ /*
+ * Permission events are a superset of pre-content events, so if there
+ * are no permission event watchers, there are also no pre-content event
+ * watchers and this is implied from the single FMODE_NONOTIFY_PERM bit.
+ */
+ if (likely(!fsnotify_sb_has_priority_watchers(sb,
+ FSNOTIFY_PRIO_CONTENT))) {
+ file->f_mode |= FMODE_NONOTIFY_PERM;
+ return;
+ }
+
+ /*
+ * FMODE_NONOTIFY_HSM bit means there are permission event watchers, but
+ * no pre-content event watchers.
+ */
+ if (likely(!fsnotify_sb_has_priority_watchers(sb,
+ FSNOTIFY_PRIO_PRE_CONTENT))) {
+ file->f_mode |= FMODE_NONOTIFY_HSM;
+ return;
+ }
+}
+
/*
* fsnotify_file_area_perm - permission hook before access to file range
*/
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
- __u32 fsnotify_mask = FS_ACCESS_PERM;
-
/*
* filesystem may be modified in the context of permission events
* (e.g. by HSM filling a file on access), so sb freeze protection
@@ -150,7 +180,10 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
if (!(perm_mask & MAY_READ))
return 0;
- return fsnotify_file(file, fsnotify_mask);
+ if (likely(file->f_mode & FMODE_NONOTIFY_PERM))
+ return 0;
+
+ return fsnotify_path(&file->f_path, FS_ACCESS_PERM);
}
/*
@@ -168,16 +201,23 @@ static inline int fsnotify_open_perm(struct file *file)
{
int ret;
+ if (likely(!FMODE_FSNOTIFY_PERM(file->f_mode)))
+ return 0;
+
if (file->f_flags & __FMODE_EXEC) {
- ret = fsnotify_file(file, FS_OPEN_EXEC_PERM);
+ ret = fsnotify_path(&file->f_path, FS_OPEN_EXEC_PERM);
if (ret)
return ret;
}
- return fsnotify_file(file, FS_OPEN_PERM);
+ return fsnotify_path(&file->f_path, FS_OPEN_PERM);
}
#else
+static inline void file_set_fsnotify_mode(struct file *file)
+{
+}
+
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
--
2.43.0
* [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
2024-11-15 15:30 ` [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge Josef Bacik
2024-11-15 15:30 ` [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-20 16:02 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 04/19] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
` (16 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
So far, we set FMODE_NONOTIFY_ flags at open time if we know that there
are no permission event watchers at all on the filesystem, but lack of
FMODE_NONOTIFY_ flags does not mean that the file is actually watched.
To make the flags more accurate we add a helper that checks if the
file's inode, mount, sb or parent are being watched for a set of events.
This is going to be used for setting FMODE_NONOTIFY_HSM only when the
specific file is actually watched for pre-content events.
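The check the helper performs boils down to a mask intersection, which can be sketched in userspace as below. The struct and the demo_ names are hypothetical stand-ins for the masks the kernel reads from the inode, mount, superblock and parent; only the overall shape follows the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the per-object watch masks. */
struct demo_watch_masks {
	uint32_t inode_mask;
	uint32_t mnt_mask;
	uint32_t sb_mask;
	bool parent_watched;	/* models DCACHE_FSNOTIFY_PARENT_WATCHED */
	uint32_t parent_mask;	/* events the parent watches on its children */
};

/* Return the subset of events_mask that some object actually watches. */
static uint32_t demo_file_object_watched(const struct demo_watch_masks *w,
					 uint32_t events_mask)
{
	uint32_t marks_mask;

	assert(w);
	marks_mask = w->inode_mask | w->mnt_mask | w->sb_mask;

	/* The parent is consulted only when the dentry is flagged as watched. */
	if (w->parent_watched)
		marks_mask |= w->parent_mask;

	return marks_mask & events_mask;
}
```

A zero return means nobody watches the file for the requested events, so a caller could leave the corresponding suppression bit set for it.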
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fsnotify.c | 36 +++++++++++++++++++++++++-------
include/linux/fsnotify_backend.h | 7 +++++++
2 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index f976949d2634..33576a848a9f 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -193,16 +193,38 @@ static bool fsnotify_event_needs_parent(struct inode *inode, __u32 mnt_mask,
return mask & marks_mask;
}
-/* Are there any inode/mount/sb objects that are interested in this event? */
-static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
- __u32 mask)
+/* Are there any inode/mount/sb objects that watch for these events? */
+static inline __u32 fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
+ __u32 events_mask)
{
__u32 marks_mask = READ_ONCE(inode->i_fsnotify_mask) | mnt_mask |
READ_ONCE(inode->i_sb->s_fsnotify_mask);
- return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
+ return events_mask & marks_mask;
}
+/* Are there any inode/mount/sb/parent objects that watch for these events? */
+__u32 fsnotify_file_object_watched(struct file *file, __u32 events_mask)
+{
+ struct dentry *dentry = file->f_path.dentry;
+ struct dentry *parent;
+ __u32 marks_mask, mnt_mask =
+ READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
+
+ marks_mask = fsnotify_object_watched(d_inode(dentry), mnt_mask,
+ events_mask);
+
+ if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)))
+ return marks_mask;
+
+ parent = dget_parent(dentry);
+ marks_mask |= fsnotify_inode_watches_children(d_inode(parent));
+ dput(parent);
+
+ return marks_mask & events_mask;
+}
+EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
+
/*
* Notify this dentry's parent about a child's events with child name info
* if parent is watching or if inode/sb/mount are interested in events with
@@ -221,7 +243,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
struct dentry *parent;
bool parent_watched = dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED;
bool parent_needed, parent_interested;
- __u32 p_mask;
+ __u32 p_mask, test_mask = mask & ALL_FSNOTIFY_EVENTS;
struct inode *p_inode = NULL;
struct name_snapshot name;
struct qstr *file_name = NULL;
@@ -229,7 +251,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
/* Optimize the likely case of nobody watching this path */
if (likely(!parent_watched &&
- !fsnotify_object_watched(inode, mnt_mask, mask)))
+ !fsnotify_object_watched(inode, mnt_mask, test_mask)))
return 0;
parent = NULL;
@@ -248,7 +270,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
* Include parent/name in notification either if some notification
* groups require parent info or the parent is interested in this event.
*/
- parent_interested = mask & p_mask & ALL_FSNOTIFY_EVENTS;
+ parent_interested = p_mask & test_mask;
if (parent_needed || parent_interested) {
/* When notifying parent, child should be passed as data */
WARN_ON_ONCE(inode != fsnotify_data_inode(data, data_type));
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 3ecf7768e577..99d81c3c11d7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -855,8 +855,15 @@ static inline void fsnotify_init_event(struct fsnotify_event *event)
INIT_LIST_HEAD(&event->list);
}
+__u32 fsnotify_file_object_watched(struct file *file, __u32 mask);
+
#else
+static inline __u32 fsnotify_file_object_watched(struct file *file, __u32 mask)
+{
+ return 0;
+}
+
static inline int fsnotify(__u32 mask, const void *data, int data_type,
struct inode *dir, const struct qstr *name,
struct inode *inode, u32 cookie)
--
2.43.0
* [PATCH v8 04/19] fanotify: don't skip extra event info if no info_mode is set
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (2 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 05/19] fanotify: rename a misnamed constant Josef Bacik
` (15 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Previously we would only include optional information if you requested
it via an FAN_ flag at fanotify_init time (FAN_REPORT_FID for example).
However this isn't necessary as the event length is encoded in the
metadata, and if the user doesn't want to consume the information they
don't have to. With the PRE_ACCESS events we will always generate range
information, so drop this check in order to allow this extra
information to be exported without needing to have another flag.
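To illustrate why a listener that ignores the extra information is unaffected, here is a minimal userspace sketch that walks an event buffer with the FAN_EVENT_OK() and FAN_EVENT_NEXT() macros from <sys/fanotify.h>. Those macros advance by event_len, so trailing info records are stepped over without being parsed. The demo_ functions and the 16 bytes of made-up trailing data are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/fanotify.h>

/* Count the events in a buffer as filled by read(2) on a fanotify fd. */
static size_t demo_count_events(const void *buf, size_t len)
{
	const struct fanotify_event_metadata *meta = buf;
	size_t n = 0;

	while (FAN_EVENT_OK(meta, len)) {
		n++;
		/* Advances by event_len, stepping over any info records. */
		meta = FAN_EVENT_NEXT(meta, len);
	}
	return n;
}

/* Build two synthetic events; the first carries 16 bytes of extra info. */
static size_t demo_self_check(void)
{
	const size_t extra = 16;	/* made-up info record size */
	const size_t total = 2 * FAN_EVENT_METADATA_LEN + extra;
	unsigned char *buf = calloc(1, total);
	struct fanotify_event_metadata m = {
		.event_len = FAN_EVENT_METADATA_LEN + extra,
		.vers = FANOTIFY_METADATA_VERSION,
		.metadata_len = FAN_EVENT_METADATA_LEN,
		.fd = FAN_NOFD,
	};
	size_t n;

	assert(buf);
	memcpy(buf, &m, sizeof(m));
	m.event_len = FAN_EVENT_METADATA_LEN;	/* second event: no extra info */
	memcpy(buf + FAN_EVENT_METADATA_LEN + extra, &m, sizeof(m));

	n = demo_count_events(buf, total);
	free(buf);
	return n;
}
```

A listener written this way sees both events even though it never looks at the bytes that follow the first event's metadata.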
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fanotify/fanotify_user.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 919ff59cb802..8fca5ec442e4 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -158,9 +158,6 @@ static size_t fanotify_event_len(unsigned int info_mode,
int fh_len;
int dot_len = 0;
- if (!info_mode)
- return event_len;
-
if (fanotify_is_error_event(event->mask))
event_len += FANOTIFY_ERROR_INFO_LEN;
@@ -754,12 +751,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
buf += FAN_EVENT_METADATA_LEN;
count -= FAN_EVENT_METADATA_LEN;
- if (info_mode) {
- ret = copy_info_records_to_user(event, info, info_mode, pidfd,
- buf, count);
- if (ret < 0)
- goto out_close_fd;
- }
+ ret = copy_info_records_to_user(event, info, info_mode, pidfd,
+ buf, count);
+ if (ret < 0)
+ goto out_close_fd;
if (f)
fd_install(fd, f);
--
2.43.0
* [PATCH v8 05/19] fanotify: rename a misnamed constant
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (3 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 04/19] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 06/19] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
` (14 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
FANOTIFY_PIDFD_INFO_HDR_LEN is not the length of the header.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify_user.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 8fca5ec442e4..456cc3e92c88 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -117,7 +117,7 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
#define FANOTIFY_EVENT_ALIGN 4
#define FANOTIFY_FID_INFO_HDR_LEN \
(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
-#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+#define FANOTIFY_PIDFD_INFO_LEN \
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
@@ -172,14 +172,14 @@ static size_t fanotify_event_len(unsigned int info_mode,
dot_len = 1;
}
- if (info_mode & FAN_REPORT_PIDFD)
- event_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
-
if (fanotify_event_has_object_fh(event)) {
fh_len = fanotify_event_object_fh_len(event);
event_len += fanotify_fid_info_len(fh_len, dot_len);
}
+ if (info_mode & FAN_REPORT_PIDFD)
+ event_len += FANOTIFY_PIDFD_INFO_LEN;
+
return event_len;
}
@@ -501,7 +501,7 @@ static int copy_pidfd_info_to_user(int pidfd,
size_t count)
{
struct fanotify_event_info_pidfd info = { };
- size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+ size_t info_len = FANOTIFY_PIDFD_INFO_LEN;
if (WARN_ON_ONCE(info_len > count))
return -EFAULT;
--
2.43.0
* [PATCH v8 06/19] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (4 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 05/19] fanotify: rename a misnamed constant Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 07/19] fsnotify: introduce pre-content permission events Josef Bacik
` (13 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Avoid reusing this event bit, because we would like to reserve it for a
future FAN_PATH_MODIFY pre-content event.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
include/linux/fsnotify_backend.h | 1 +
include/uapi/linux/fanotify.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 99d81c3c11d7..2dc30cf637aa 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -55,6 +55,7 @@
#define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
#define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
+/* #define FS_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
/*
* Set on inode mark that cares about things that happen to its children.
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 34f221d3a1b9..79072b6894f2 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -25,6 +25,7 @@
#define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
+/* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
--
2.43.0
* [PATCH v8 07/19] fsnotify: introduce pre-content permission events
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (5 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 06/19] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 08/19] fsnotify: pass optional file access range in pre-content event Josef Bacik
` (12 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it is meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.
Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two separate events, so content
scanners can inspect the content filled by the pre-content event handler.
Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is
modified by syscalls such as write() and fallocate().
FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.
The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.
There are some specific requirements from filesystems that could
be used with pre-content events, so add a flag for fs to opt-in
for pre-content events explicitly before they can be used.
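The event ordering described above can be modelled in a few lines of
plain C. This is only a sketch of the decision logic, not kernel code:
the SK_* constants are local stand-ins for the kernel's MAY_* and FS_*
bits, the two booleans stand in for the file mode checks, and the model
ignores the case where the pre-content handler denies the access and
the second event is never generated.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's MAY_* permission bits. */
#define SK_MAY_WRITE	0x02
#define SK_MAY_READ	0x04
#define SK_MAY_ACCESS	0x10

/* Local stand-ins for FS_PRE_ACCESS and FS_ACCESS_PERM. */
#define SK_PRE_ACCESS	0x1
#define SK_ACCESS_PERM	0x2

/*
 * Return the set of permission events generated for one access.
 * perm_watched: the file had permission watchers at open time.
 * hsm_watched:  the file had pre-content watchers at open time.
 */
static unsigned int events_for_access(int perm_mask, bool perm_watched,
				      bool hsm_watched)
{
	unsigned int events = 0;

	if (!(perm_mask & (SK_MAY_READ | SK_MAY_WRITE | SK_MAY_ACCESS)))
		return 0;
	if (!perm_watched)
		return 0;

	/* Any read, write or other access fills content first. */
	if (hsm_watched)
		events |= SK_PRE_ACCESS;

	/* Only reads are followed by the legacy content scanner event. */
	if (perm_mask & SK_MAY_READ)
		events |= SK_ACCESS_PERM;

	return events;
}
```

With both kinds of watchers present, a read yields both events in that
order, while a write yields only the pre-content event.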
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fsnotify.c | 2 +-
include/linux/fs.h | 1 +
include/linux/fsnotify.h | 39 ++++++++++++++++++++++++++++----
include/linux/fsnotify_backend.h | 12 ++++++++--
security/selinux/hooks.c | 3 ++-
5 files changed, 49 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 33576a848a9f..d128cb7dee62 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -649,7 +649,7 @@ static __init int fsnotify_init(void)
{
int ret;
- BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 24);
ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8e5c783013d2..d231f4bc12aa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1256,6 +1256,7 @@ extern int send_sigurg(struct file *file);
#define SB_I_RETIRED 0x00000800 /* superblock shouldn't be reused */
#define SB_I_NOUMASK 0x00001000 /* VFS does not apply umask */
#define SB_I_NOIDMAP 0x00002000 /* No idmapped mounts on this superblock */
+#define SB_I_ALLOW_HSM 0x00004000 /* Allow HSM events on this superblock */
/* Possible states of 'frozen' field */
enum {
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 54ec97366d7c..994d7a322369 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -134,9 +134,10 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
* Later, fsnotify permission hooks do not check if there are permission event
* watches, but that there were permission event watches at open time.
*/
-static void file_set_fsnotify_mode(struct file *file)
+static inline void file_set_fsnotify_mode(struct file *file)
{
struct super_block *sb = file->f_path.dentry->d_sb;
+ struct inode *inode;
/* Is it a file opened by fanotify? */
if (FMODE_FSNOTIFY_NONE(file->f_mode))
@@ -162,6 +163,19 @@ static void file_set_fsnotify_mode(struct file *file)
file->f_mode |= FMODE_NONOTIFY_HSM;
return;
}
+
+ /*
+ * There are pre-content watchers in the filesystem, but are there
+ * pre-content watchers on this specific file?
+ * Pre-content events are only reported for regular files and dirs.
+ */
+ inode = file_inode(file);
+ if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
+ likely(!fsnotify_file_object_watched(file,
+ FSNOTIFY_PRE_CONTENT_EVENTS))) {
+ file->f_mode |= FMODE_NONOTIFY_HSM;
+ return;
+ }
}
/*
@@ -177,12 +191,29 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
*/
lockdep_assert_once(file_write_not_started(file));
+ if (!(perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)))
+ return 0;
+
+ if (likely(!FMODE_FSNOTIFY_PERM(file->f_mode)))
+ return 0;
+
+ /*
+ * read()/write() and other types of access generate pre-content events.
+ */
+ if (unlikely(FMODE_FSNOTIFY_HSM(file->f_mode))) {
+ int ret = fsnotify_path(&file->f_path, FS_PRE_ACCESS);
+
+ if (ret)
+ return ret;
+ }
+
if (!(perm_mask & MAY_READ))
return 0;
- if (likely(file->f_mode & FMODE_NONOTIFY_PERM))
- return 0;
-
+ /*
+ * read() also generates the legacy FS_ACCESS_PERM event, so content
+ * scanners can inspect the content filled by pre-content event.
+ */
return fsnotify_path(&file->f_path, FS_ACCESS_PERM);
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 2dc30cf637aa..33880de72ef3 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -57,6 +57,8 @@
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
/* #define FS_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
+#define FS_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+
/*
* Set on inode mark that cares about things that happen to its children.
* Always set for dnotify and inotify.
@@ -78,8 +80,14 @@
*/
#define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
-#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
- FS_OPEN_EXEC_PERM)
+/* Content events can be used to inspect file content */
+#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
+ FS_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FSNOTIFY_PRE_CONTENT_EVENTS (FS_PRE_ACCESS)
+
+#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
+ FSNOTIFY_PRE_CONTENT_EVENTS)
/*
* This is a list of all events that may get sent to a parent that is watching
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fc926d3cac6e..c6f38705c715 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3404,7 +3404,8 @@ static int selinux_path_notify(const struct path *path, u64 mask,
perm |= FILE__WATCH_WITH_PERM;
/* watches on read-like events need the file:watch_reads permission */
- if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_CLOSE_NOWRITE))
+ if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_PRE_ACCESS |
+ FS_CLOSE_NOWRITE))
perm |= FILE__WATCH_READS;
return path_has_perm(current_cred(), path, perm);
--
2.43.0
* [PATCH v8 08/19] fsnotify: pass optional file access range in pre-content event
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (6 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 07/19] fsnotify: introduce pre-content permission events Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate Josef Bacik
` (11 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
We would like to add file range information to pre-content events.
Pass a struct file_range with offset and length to the event handler
along with the pre-content permission event.
The offset and length are aligned to page size, but we may need to
align them to minimum folio size for filesystems with large block size.
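To make the alignment concrete, here is a small standalone sketch of
the arithmetic, assuming 4 KiB pages. SKETCH_PAGE_SIZE, struct
sketch_range and page_align_range() are names invented for the example;
the kernel uses PAGE_ALIGN_DOWN() and PAGE_ALIGN() with the real page
size.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

struct sketch_range {
	uint64_t pos;
	uint64_t count;
};

/* Round the byte range [pos, pos + count) out to whole pages. */
static struct sketch_range page_align_range(uint64_t pos, uint64_t count)
{
	struct sketch_range r;
	uint64_t end = (pos + count + SKETCH_PAGE_SIZE - 1) &
		       ~(SKETCH_PAGE_SIZE - 1);

	r.pos = pos & ~(SKETCH_PAGE_SIZE - 1);
	r.count = end - r.pos;
	return r;
}
```

For example, a 100 byte access at offset 5000 is reported as offset
4096 with length 4096, and a 200 byte access at offset 4000 crosses a
page boundary and is reported as offset 0 with length 8192.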
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 11 +++++++--
fs/notify/fanotify/fanotify.h | 2 ++
fs/notify/fsnotify.c | 18 ++++++++++++++
include/linux/fsnotify.h | 4 ++--
include/linux/fsnotify_backend.h | 40 ++++++++++++++++++++++++++++++++
5 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 24c7c5df4998..2e6ba94ec405 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -548,9 +548,13 @@ static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
return &pevent->fae;
}
-static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
+static struct fanotify_event *fanotify_alloc_perm_event(const void *data,
+ int data_type,
gfp_t gfp)
{
+ const struct path *path = fsnotify_data_path(data, data_type);
+ const struct file_range *range =
+ fsnotify_data_file_range(data, data_type);
struct fanotify_perm_event *pevent;
pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
@@ -564,6 +568,9 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
pevent->hdr.len = 0;
pevent->state = FAN_EVENT_INIT;
pevent->path = *path;
+ /* NULL ppos means no range info */
+ pevent->ppos = range ? &range->pos : NULL;
+ pevent->count = range ? range->count : 0;
path_get(path);
return &pevent->fae;
@@ -801,7 +808,7 @@ static struct fanotify_event *fanotify_alloc_event(
old_memcg = set_active_memcg(group->memcg);
if (fanotify_is_perm_event(mask)) {
- event = fanotify_alloc_perm_event(path, gfp);
+ event = fanotify_alloc_perm_event(data, data_type, gfp);
} else if (fanotify_is_error_event(mask)) {
event = fanotify_alloc_error_event(group, fsid, data,
data_type, &hash);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index e5ab33cae6a7..93598b7d5952 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -425,6 +425,8 @@ FANOTIFY_PE(struct fanotify_event *event)
struct fanotify_perm_event {
struct fanotify_event fae;
struct path path;
+ const loff_t *ppos; /* optional file range info */
+ size_t count;
u32 response; /* userspace answer to the event */
unsigned short state; /* state of the event */
int fd; /* fd we passed to userspace for this event */
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index d128cb7dee62..538aacf990ca 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -225,6 +225,24 @@ __u32 fsnotify_file_object_watched(struct file *file, __u32 events_mask)
}
EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
+/* Report pre-content event with optional range info */
+int fsnotify_pre_content(const struct path *path, const loff_t *ppos,
+ size_t count)
+{
+ struct file_range range;
+
+ /* Report page aligned range only when pos is known */
+ if (!ppos)
+ return fsnotify_path(path, FS_PRE_ACCESS);
+
+ range.path = path;
+ range.pos = PAGE_ALIGN_DOWN(*ppos);
+ range.count = PAGE_ALIGN(*ppos + count) - range.pos;
+
+ return fsnotify_parent(path->dentry, FS_PRE_ACCESS, &range,
+ FSNOTIFY_EVENT_FILE_RANGE);
+}
+
/*
* Notify this dentry's parent about a child's events with child name info
* if parent is watching or if inode/sb/mount are interested in events with
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 994d7a322369..ce189b4778a5 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -201,7 +201,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
* read()/write() and other types of access generate pre-content events.
*/
if (unlikely(FMODE_FSNOTIFY_HSM(file->f_mode))) {
- int ret = fsnotify_path(&file->f_path, FS_PRE_ACCESS);
+ int ret = fsnotify_pre_content(&file->f_path, ppos, count);
if (ret)
return ret;
@@ -218,7 +218,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
}
/*
- * fsnotify_file_perm - permission hook before file access
+ * fsnotify_file_perm - permission hook before file access (unknown range)
*/
static inline int fsnotify_file_perm(struct file *file, int perm_mask)
{
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 33880de72ef3..89f351193d8f 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -294,6 +294,7 @@ static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
/* When calling fsnotify tell it if the data is a path or inode */
enum fsnotify_data_type {
FSNOTIFY_EVENT_NONE,
+ FSNOTIFY_EVENT_FILE_RANGE,
FSNOTIFY_EVENT_PATH,
FSNOTIFY_EVENT_INODE,
FSNOTIFY_EVENT_DENTRY,
@@ -306,6 +307,17 @@ struct fs_error_report {
struct super_block *sb;
};
+struct file_range {
+ const struct path *path;
+ loff_t pos;
+ size_t count;
+};
+
+static inline const struct path *file_range_path(const struct file_range *range)
+{
+ return range->path;
+}
+
static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
{
switch (data_type) {
@@ -315,6 +327,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
return d_inode(data);
case FSNOTIFY_EVENT_PATH:
return d_inode(((const struct path *)data)->dentry);
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return d_inode(file_range_path(data)->dentry);
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *)data)->inode;
default:
@@ -330,6 +344,8 @@ static inline struct dentry *fsnotify_data_dentry(const void *data, int data_typ
return (struct dentry *)data;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry;
default:
return NULL;
}
@@ -341,6 +357,8 @@ static inline const struct path *fsnotify_data_path(const void *data,
switch (data_type) {
case FSNOTIFY_EVENT_PATH:
return data;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data);
default:
return NULL;
}
@@ -356,6 +374,8 @@ static inline struct super_block *fsnotify_data_sb(const void *data,
return ((struct dentry *)data)->d_sb;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry->d_sb;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry->d_sb;
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *) data)->sb;
default:
@@ -375,6 +395,18 @@ static inline struct fs_error_report *fsnotify_data_error_report(
}
}
+static inline const struct file_range *fsnotify_data_file_range(
+ const void *data,
+ int data_type)
+{
+ switch (data_type) {
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return (struct file_range *)data;
+ default:
+ return NULL;
+ }
+}
+
/*
* Index to merged marks iterator array that correlates to a type of watch.
* The type of watched object can be deduced from the iterator type, but not
@@ -865,6 +897,8 @@ static inline void fsnotify_init_event(struct fsnotify_event *event)
}
__u32 fsnotify_file_object_watched(struct file *file, __u32 mask);
+int fsnotify_pre_content(const struct path *path, const loff_t *ppos,
+ size_t count);
#else
@@ -873,6 +907,12 @@ static inline __u32 fsnotify_file_object_watched(struct file *file, __u32 mask)
return 0;
}
+static inline int fsnotify_pre_content(const struct path *path,
+ const loff_t *ppos, size_t count)
+{
+ return 0;
+}
+
static inline int fsnotify(__u32 mask, const void *data, int data_type,
struct inode *dir, const struct qstr *name,
struct inode *inode, u32 cookie)
--
2.43.0
* [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (7 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 08/19] fsnotify: pass optional file access range in pre-content event Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-20 15:23 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
` (10 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Generate FS_PRE_ACCESS event before truncate, without sb_writers held.
Move the security hooks also before sb_start_write() to conform with
other security hooks (e.g. in write, fallocate).
The event will carry range info for the page surrounding the new size,
to provide an opportunity to fill the content at the end of the file
before truncating to a non-page-aligned size.
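As a worked example of the range reported for truncate, the sketch
below applies the page rounding to a zero-length access at the new
size, assuming 4 KiB pages. The names SKETCH_PAGE_SIZE,
truncate_range_pos() and truncate_range_count() are hypothetical and
exist only for this illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Start of the page that contains the new file size. */
static uint64_t truncate_range_pos(uint64_t length)
{
	return length & ~(SKETCH_PAGE_SIZE - 1);
}

/* Bytes reported: one page for an unaligned size, zero for an aligned one. */
static uint64_t truncate_range_count(uint64_t length)
{
	uint64_t end = (length + SKETCH_PAGE_SIZE - 1) &
		       ~(SKETCH_PAGE_SIZE - 1);

	return end - truncate_range_pos(length);
}
```

Truncating to 10000 bytes reports offset 8192 with length 4096, while
truncating to exactly 8192 bytes reports offset 8192 with length 0,
since there is no partial page to fill.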
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/open.c | 31 +++++++++++++++++++++----------
include/linux/fsnotify.h | 20 ++++++++++++++++++++
2 files changed, 41 insertions(+), 10 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 1a9483872e1f..d11d373dca80 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -81,14 +81,18 @@ long vfs_truncate(const struct path *path, loff_t length)
if (!S_ISREG(inode->i_mode))
return -EINVAL;
- error = mnt_want_write(path->mnt);
- if (error)
- goto out;
-
idmap = mnt_idmap(path->mnt);
error = inode_permission(idmap, inode, MAY_WRITE);
if (error)
- goto mnt_drop_write_and_out;
+ return error;
+
+ error = fsnotify_truncate_perm(path, length);
+ if (error)
+ return error;
+
+ error = mnt_want_write(path->mnt);
+ if (error)
+ return error;
error = -EPERM;
if (IS_APPEND(inode))
@@ -114,7 +118,7 @@ long vfs_truncate(const struct path *path, loff_t length)
put_write_access(inode);
mnt_drop_write_and_out:
mnt_drop_write(path->mnt);
-out:
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_truncate);
@@ -175,11 +179,18 @@ long do_ftruncate(struct file *file, loff_t length, int small)
/* Check IS_APPEND on real upper inode */
if (IS_APPEND(file_inode(file)))
return -EPERM;
- sb_start_write(inode->i_sb);
+
error = security_file_truncate(file);
- if (!error)
- error = do_truncate(file_mnt_idmap(file), dentry, length,
- ATTR_MTIME | ATTR_CTIME, file);
+ if (error)
+ return error;
+
+ error = fsnotify_truncate_perm(&file->f_path, length);
+ if (error)
+ return error;
+
+ sb_start_write(inode->i_sb);
+ error = do_truncate(file_mnt_idmap(file), dentry, length,
+ ATTR_MTIME | ATTR_CTIME, file);
sb_end_write(inode->i_sb);
return error;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index ce189b4778a5..08893429a818 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -217,6 +217,21 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
return fsnotify_path(&file->f_path, FS_ACCESS_PERM);
}
+/*
+ * fsnotify_truncate_perm - permission hook before file truncate
+ */
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+ struct inode *inode = d_inode(path->dentry);
+
+ if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
+ !fsnotify_sb_has_priority_watchers(inode->i_sb,
+ FSNOTIFY_PRIO_PRE_CONTENT))
+ return 0;
+
+ return fsnotify_pre_content(path, &length, 0);
+}
+
/*
* fsnotify_file_perm - permission hook before file access (unknown range)
*/
@@ -255,6 +270,11 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
return 0;
}
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+ return 0;
+}
+
static inline int fsnotify_file_perm(struct file *file, int perm_mask)
{
return 0;
--
2.43.0
* [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (8 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:59 ` Amir Goldstein
2024-11-21 10:44 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 11/19] fanotify: report file range info with pre-content events Josef Bacik
` (9 subsequent siblings)
19 siblings, 2 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Similar to the FAN_ACCESS_PERM permission event, but it is only allowed
with class FAN_CLASS_PRE_CONTENT and only on regular files and dirs.
Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
in the context of the event handler.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill the content of files on first read access.
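For reference, a listener could request the new event roughly as in
the sketch below. It uses only the existing fanotify_init() and
fanotify_mark() calls; the FAN_PRE_ACCESS value is taken from the uapi
hunk in this patch and is defined locally in case the installed headers
predate it. watch_pre_access() is a hypothetical helper, the caller
needs CAP_SYS_ADMIN, and on a kernel or filesystem without pre-content
support the mark is expected to fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/fanotify.h>

#ifndef FAN_PRE_ACCESS
#define FAN_PRE_ACCESS 0x00100000 /* value added by this patch */
#endif

/*
 * Create a pre-content class group and watch one file or directory.
 * Returns the fanotify fd on success or a negative errno value.
 */
static int watch_pre_access(const char *path)
{
	int err;
	/* O_RDWR so the handler can fill content through event->fd. */
	int fd = fanotify_init(FAN_CLASS_PRE_CONTENT | FAN_CLOEXEC, O_RDWR);

	if (fd < 0)
		return -errno;

	if (fanotify_mark(fd, FAN_MARK_ADD, FAN_PRE_ACCESS, AT_FDCWD,
			  path) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```

The same mark on a FAN_CLASS_CONTENT or FAN_CLASS_NOTIF group should be
rejected with EINVAL by the priority check added in do_fanotify_mark().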
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 3 ++-
fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
include/linux/fanotify.h | 14 ++++++++++----
include/uapi/linux/fanotify.h | 2 ++
4 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 2e6ba94ec405..da6c3c1c7edf 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -916,8 +916,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
+ BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
mask = fanotify_group_event_mask(group, iter_info, &match_mask,
mask, data, data_type, dir);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 456cc3e92c88..5ea447e9e5a8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1640,11 +1640,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
unsigned int flags)
{
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
+ bool is_dir = d_is_dir(path->dentry);
/* Strict validation of events in non-dir inode mask with v5.17+ APIs */
bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
(mask & FAN_RENAME) ||
(flags & FAN_MARK_IGNORE);
+ /*
+ * Filesystems need to opt-into pre-content events (a.k.a HSM)
+ * and they are only supported on regular files and directories.
+ */
+ if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
+ if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
+ return -EINVAL;
+ if (!is_dir && !d_is_reg(path->dentry))
+ return -EINVAL;
+ }
+
/*
* Some filesystems such as 'proc' acquire unusual locks when opening
* files. For them fanotify permission events have high chances of
@@ -1677,7 +1689,7 @@ static int fanotify_events_supported(struct fsnotify_group *group,
* but because we always allowed it, error only when using new APIs.
*/
if (strict_dir_events && mark_type == FAN_MARK_INODE &&
- !d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
+ !is_dir && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
return -ENOTDIR;
return 0;
@@ -1778,10 +1790,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
return -EPERM;
/*
- * Permission events require minimum priority FAN_CLASS_CONTENT.
+ * Permission events are not allowed for FAN_CLASS_NOTIF.
+ * Pre-content permission events are not allowed for FAN_CLASS_CONTENT.
*/
if (mask & FANOTIFY_PERM_EVENTS &&
- group->priority < FSNOTIFY_PRIO_CONTENT)
+ group->priority == FSNOTIFY_PRIO_NORMAL)
+ return -EINVAL;
+ else if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
+ group->priority == FSNOTIFY_PRIO_CONTENT)
return -EINVAL;
if (mask & FAN_FS_ERROR &&
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 89ff45bd6f01..c747af064d2c 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -89,6 +89,16 @@
#define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE | \
FAN_RENAME)
+/* Content events can be used to inspect file content */
+#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
+ FAN_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
+
+/* Events that require a permission response from user */
+#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
+ FANOTIFY_PRE_CONTENT_EVENTS)
+
/* Events that can be reported with event->fd */
#define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
@@ -104,10 +114,6 @@
FANOTIFY_INODE_EVENTS | \
FANOTIFY_ERROR_EVENTS)
-/* Events that require a permission response from user */
-#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
- FAN_OPEN_EXEC_PERM)
-
/* Extra flags that may be reported with event or control handling of events */
#define FANOTIFY_EVENT_FLAGS (FAN_EVENT_ON_CHILD | FAN_ONDIR)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 79072b6894f2..7596168c80eb 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -27,6 +27,8 @@
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
/* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
+#define FAN_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
#define FAN_RENAME 0x10000000 /* File was renamed */
--
2.43.0
* [PATCH v8 11/19] fanotify: report file range info with pre-content events
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (9 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 12/19] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
` (8 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
With group class FAN_CLASS_PRE_CONTENT, report offset and length info
along with FAN_PRE_ACCESS pre-content events.
This information is meant to be used by hierarchical storage managers
that want to fill partial content of files on first access to a range.
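A listener could pick the range record out of an event roughly as
sketched below. The record layout and type value are assumptions here:
struct sk_info_range and SK_INFO_TYPE_RANGE are local stand-ins for the
struct fanotify_event_info_range and FAN_EVENT_INFO_TYPE_RANGE that the
uapi hunk of this patch adds, and find_range() is a hypothetical
helper. Only the generic info header from the existing uapi header is
relied upon.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/fanotify.h>

#define SK_INFO_TYPE_RANGE 6 /* assumed value of FAN_EVENT_INFO_TYPE_RANGE */

/* Assumed layout of the range record; verify against the uapi header. */
struct sk_info_range {
	struct fanotify_event_info_header hdr;
	uint32_t pad;
	uint64_t offset;
	uint64_t count;
};

/*
 * Walk the info records that follow the metadata header.
 * Returns 1 and fills offset/count if a range record is found,
 * 0 if there is none, -1 if the event looks malformed.
 */
static int find_range(const struct fanotify_event_metadata *md,
		      uint64_t *offset, uint64_t *count)
{
	const char *p = (const char *)md + md->metadata_len;
	const char *end = (const char *)md + md->event_len;

	if (md->metadata_len > md->event_len)
		return -1;

	while ((size_t)(end - p) >= sizeof(struct fanotify_event_info_header)) {
		struct fanotify_event_info_header hdr;

		memcpy(&hdr, p, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > (size_t)(end - p))
			return -1;

		if (hdr.info_type == SK_INFO_TYPE_RANGE &&
		    hdr.len >= sizeof(struct sk_info_range)) {
			struct sk_info_range r;

			memcpy(&r, p, sizeof(r));
			*offset = r.offset;
			*count = r.count;
			return 1;
		}
		p += hdr.len; /* skip records we do not care about */
	}
	return 0;
}
```

Events generated without a known position, such as those from
fsnotify_file_perm(), carry no range record, so a return value of 0
should be treated as "range unknown" rather than as an error.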
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.h | 8 +++++++
fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 8 +++++++
3 files changed, 54 insertions(+)
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 93598b7d5952..7f06355afa1f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -448,6 +448,14 @@ static inline bool fanotify_is_perm_event(u32 mask)
mask & FANOTIFY_PERM_EVENTS;
}
+static inline bool fanotify_event_has_access_range(struct fanotify_event *event)
+{
+ if (!(event->mask & FANOTIFY_PRE_CONTENT_EVENTS))
+ return false;
+
+ return FANOTIFY_PERM(event)->ppos;
+}
+
static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
{
return container_of(fse, struct fanotify_event, fse);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5ea447e9e5a8..c7938d9e8101 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -121,6 +121,8 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
+#define FANOTIFY_RANGE_INFO_LEN \
+ (sizeof(struct fanotify_event_info_range))
static int fanotify_fid_info_len(int fh_len, int name_len)
{
@@ -180,6 +182,9 @@ static size_t fanotify_event_len(unsigned int info_mode,
if (info_mode & FAN_REPORT_PIDFD)
event_len += FANOTIFY_PIDFD_INFO_LEN;
+ if (fanotify_event_has_access_range(event))
+ event_len += FANOTIFY_RANGE_INFO_LEN;
+
return event_len;
}
@@ -516,6 +521,30 @@ static int copy_pidfd_info_to_user(int pidfd,
return info_len;
}
+static size_t copy_range_info_to_user(struct fanotify_event *event,
+ char __user *buf, int count)
+{
+ struct fanotify_perm_event *pevent = FANOTIFY_PERM(event);
+ struct fanotify_event_info_range info = { };
+ size_t info_len = FANOTIFY_RANGE_INFO_LEN;
+
+ if (WARN_ON_ONCE(info_len > count))
+ return -EFAULT;
+
+ if (WARN_ON_ONCE(!pevent->ppos))
+ return -EINVAL;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_RANGE;
+ info.hdr.len = info_len;
+ info.offset = *(pevent->ppos);
+ info.count = pevent->count;
+
+ if (copy_to_user(buf, &info, info_len))
+ return -EFAULT;
+
+ return info_len;
+}
+
static int copy_info_records_to_user(struct fanotify_event *event,
struct fanotify_info *info,
unsigned int info_mode, int pidfd,
@@ -637,6 +666,15 @@ static int copy_info_records_to_user(struct fanotify_event *event,
total_bytes += ret;
}
+ if (fanotify_event_has_access_range(event)) {
+ ret = copy_range_info_to_user(event, buf, count);
+ if (ret < 0)
+ return ret;
+ buf += ret;
+ count -= ret;
+ total_bytes += ret;
+ }
+
return total_bytes;
}
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 7596168c80eb..0636a9c85dd0 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -146,6 +146,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_DFID 3
#define FAN_EVENT_INFO_TYPE_PIDFD 4
#define FAN_EVENT_INFO_TYPE_ERROR 5
+#define FAN_EVENT_INFO_TYPE_RANGE 6
/* Special info types for FAN_RENAME */
#define FAN_EVENT_INFO_TYPE_OLD_DFID_NAME 10
@@ -192,6 +193,13 @@ struct fanotify_event_info_error {
__u32 error_count;
};
+struct fanotify_event_info_range {
+ struct fanotify_event_info_header hdr;
+ __u32 pad;
+ __u64 offset;
+ __u64 count;
+};
+
/*
* User space may need to record additional information about its decision.
* The extra information type records what kind of information is included.
--
2.43.0
* [PATCH v8 12/19] fanotify: allow to set errno in FAN_DENY permission response
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (10 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 11/19] fanotify: report file range info with pre-content events Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 13/19] fanotify: add a helper to check for pre content events Josef Bacik
` (7 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
With a FAN_DENY response, the user trying to perform the filesystem
operation gets an error with errno set to EPERM.
It is useful for a hierarchical storage management (HSM) service to be able
to deny access for reasons more diverse than EPERM, for example EAGAIN,
if the HSM could retry the operation later.
Allow fanotify groups with priority FAN_CLASS_PRE_CONTENT to respond
to permission events with the response value FAN_DENY_ERRNO(errno),
instead of FAN_DENY, to return a custom error.
Limit custom error values to errors expected on read(2)/write(2) and
open(2) of regular files. This list could be extended in the future.
Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by
writing a response to an fanotify group fd with a value of FAN_NOFD in
the fd field of the response.
The change in fanotify_response is backward compatible, because errno is
written in the high 8 bits of the 32-bit response field and old kernels
reject response values with high bits set.
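As a rough sketch of the encoding described above: the macro values below
are copied from the uapi hunk in this patch, under local SKETCH_ names so
the example builds against older headers. response_errno() and
deny_errno_allowed() are hypothetical helpers that only mirror what the
kernel side does in this version of the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the uapi hunk in this patch, under local names. */
#define SKETCH_FAN_DENY		0x02u
#define SKETCH_ERRNO_BITS	8
#define SKETCH_ERRNO_SHIFT	(32 - SKETCH_ERRNO_BITS)
#define SKETCH_ERRNO_MASK	((1u << SKETCH_ERRNO_BITS) - 1)
#define SKETCH_DENY_ERRNO(err) \
	(SKETCH_FAN_DENY | \
	 ((((uint32_t)(err)) & SKETCH_ERRNO_MASK) << SKETCH_ERRNO_SHIFT))

static_assert(SKETCH_ERRNO_SHIFT == 24, "errno lives in the top byte");

/* Extract the custom errno from a response word; 0 means plain FAN_DENY. */
static uint32_t response_errno(uint32_t response)
{
	return response >> SKETCH_ERRNO_SHIFT;
}

/* Mirrors the allow-list in process_access_response() as posted in v8. */
static int deny_errno_allowed(int err)
{
	switch (err) {
	case 0:
	case EIO:
	case EPERM:
	case EBUSY:
	case ETXTBSY:
	case EAGAIN:
	case ENOSPC:
	case EDQUOT:
		return 1;
	default:
		return 0;
	}
}
```

An HSM daemon would place SKETCH_DENY_ERRNO(EAGAIN) (FAN_DENY_ERRNO with
updated headers) in the response field of struct fanotify_response. The
low bits still carry FAN_DENY, so the decision itself is unchanged.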
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 19 +++++++++++----
fs/notify/fanotify/fanotify.h | 5 ++++
fs/notify/fanotify/fanotify_user.c | 37 ++++++++++++++++++++++++++----
include/linux/fanotify.h | 5 +++-
include/uapi/linux/fanotify.h | 7 ++++++
5 files changed, 62 insertions(+), 11 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index da6c3c1c7edf..e3d04d77caba 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -223,7 +223,8 @@ static int fanotify_get_response(struct fsnotify_group *group,
struct fanotify_perm_event *event,
struct fsnotify_iter_info *iter_info)
{
- int ret;
+ int ret, errno;
+ u32 decision;
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
@@ -256,20 +257,28 @@ static int fanotify_get_response(struct fsnotify_group *group,
goto out;
}
+ decision = event->response &
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
/* userspace responded, convert to something usable */
- switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
ret = 0;
break;
case FAN_DENY:
+ /* Check custom errno from pre-content events */
+ errno = fanotify_get_response_errno(event->response);
+ if (errno) {
+ ret = -errno;
+ break;
+ }
+ fallthrough;
default:
ret = -EPERM;
}
/* Check if the response should be audited */
- if (event->response & FAN_AUDIT)
- audit_fanotify(event->response & ~FAN_AUDIT,
- &event->audit_rule);
+ if (decision & FAN_AUDIT)
+ audit_fanotify(decision & ~FAN_AUDIT, &event->audit_rule);
pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
group, event, ret);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 7f06355afa1f..9e93aba210c9 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -528,3 +528,8 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
return mflags;
}
+
+static inline u32 fanotify_get_response_errno(int res)
+{
+ return res >> FAN_ERRNO_SHIFT;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index c7938d9e8101..28aac467c7e2 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -327,11 +327,14 @@ static int process_access_response(struct fsnotify_group *group,
struct fanotify_perm_event *event;
int fd = response_struct->fd;
u32 response = response_struct->response;
+ u32 decision = response &
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
+ int errno = fanotify_get_response_errno(response);
int ret = info_len;
struct fanotify_response_info_audit_rule friar;
- pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
- group, fd, response, info, info_len);
+ pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n",
+ __func__, group, fd, response, errno, info, info_len);
/*
* make sure the response is valid, if invalid we do nothing and either
* userspace can send a valid response or we will clean it up after the
@@ -340,18 +343,42 @@ static int process_access_response(struct fsnotify_group *group,
if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
return -EINVAL;
- switch (response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
+ if (errno)
+ return -EINVAL;
+ break;
case FAN_DENY:
+ /* Custom errno is supported only for pre-content groups */
+ if (errno && group->priority != FSNOTIFY_PRIO_PRE_CONTENT)
+ return -EINVAL;
+
+ /*
+ * Limit errno to values expected on open(2)/read(2)/write(2)
+ * of regular files.
+ */
+ switch (errno) {
+ case 0:
+ case EIO:
+ case EPERM:
+ case EBUSY:
+ case ETXTBSY:
+ case EAGAIN:
+ case ENOSPC:
+ case EDQUOT:
+ break;
+ default:
+ return -EINVAL;
+ }
break;
default:
return -EINVAL;
}
- if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
+ if ((decision & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
return -EINVAL;
- if (response & FAN_INFO) {
+ if (decision & FAN_INFO) {
ret = process_access_response_info(info, info_len, &friar);
if (ret < 0)
return ret;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index c747af064d2c..d9bb48976b53 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -132,7 +132,10 @@
/* These masks check for invalid bits in permission responses. */
#define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY)
#define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO)
-#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS)
+#define FANOTIFY_RESPONSE_ERRNO (FAN_ERRNO_MASK << FAN_ERRNO_SHIFT)
+#define FANOTIFY_RESPONSE_VALID_MASK \
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \
+ FANOTIFY_RESPONSE_ERRNO)
/* Do not use these old uapi constants internally */
#undef FAN_ALL_CLASS_BITS
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 0636a9c85dd0..bd8167979707 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -235,6 +235,13 @@ struct fanotify_response_info_audit_rule {
/* Legit userspace responses to a _PERM event */
#define FAN_ALLOW 0x01
#define FAN_DENY 0x02
+/* errno other than EPERM can be specified in upper byte of deny response */
+#define FAN_ERRNO_BITS 8
+#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS)
+#define FAN_ERRNO_MASK ((1 << FAN_ERRNO_BITS) - 1)
+#define FAN_DENY_ERRNO(err) \
+ (FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT))
+
#define FAN_AUDIT 0x10 /* Bitmask to create audit record for result */
#define FAN_INFO 0x20 /* Bitmask to indicate additional information */
--
2.43.0
* [PATCH v8 13/19] fanotify: add a helper to check for pre content events
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (11 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 12/19] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-20 15:44 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 14/19] fanotify: disable readahead if we have pre-content watches Josef Bacik
` (6 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
We want to emit events during page fault, and calling into fanotify
could be expensive, so add a helper to allow us to skip calling into
fanotify from page fault. This will also be used to disable readahead
for content watched files which will be handled in a subsequent patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
include/linux/fsnotify.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 08893429a818..d5a0d8648000 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -178,6 +178,11 @@ static inline void file_set_fsnotify_mode(struct file *file)
}
}
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return file && unlikely(FMODE_FSNOTIFY_HSM(file->f_mode));
+}
+
/*
* fsnotify_file_area_perm - permission hook before access to file range
*/
@@ -264,6 +269,11 @@ static inline void file_set_fsnotify_mode(struct file *file)
{
}
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return false;
+}
+
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
--
2.43.0
* [PATCH v8 14/19] fanotify: disable readahead if we have pre-content watches
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (12 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 13/19] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Josef Bacik
` (5 subsequent siblings)
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and insert them into the file
without emitting an fanotify event. To avoid this case, disable
readahead if we have pre-content watches on the file. This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/filemap.c | 12 ++++++++++++
mm/readahead.c | 13 +++++++++++++
2 files changed, 25 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 196779e8e396..68ea596f6905 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3151,6 +3151,14 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
unsigned long vm_flags = vmf->vma->vm_flags;
unsigned int mmap_miss;
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't populate our mapping with 0 filled pages that we
+ * never emitted an event for.
+ */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
@@ -3219,6 +3227,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file *fpin = NULL;
unsigned int mmap_miss;
+ /* See comment in do_sync_mmap_readahead. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
diff --git a/mm/readahead.c b/mm/readahead.c
index 9a807727d809..b42792c20605 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@
#include <linux/blk-cgroup.h>
#include <linux/fadvise.h>
#include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
#include "internal.h"
@@ -544,6 +545,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
unsigned long max_pages, contig_count;
pgoff_t prev_index, miss;
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't find 0 filled pages in cache that we never emitted
+ * events for.
+ */
+ if (fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Even if readahead is disabled, issue this request as readahead
* as we'll need it to satisfy the requested range. The forced
@@ -622,6 +631,10 @@ void page_cache_async_ra(struct readahead_control *ractl,
if (!ra->ra_pages)
return;
+ /* See the comment in page_cache_sync_ra. */
+ if (fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Same bit is used for PG_readahead and PG_reclaim.
*/
--
2.43.0
* [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (13 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 14/19] fanotify: disable readahead if we have pre-content watches Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2025-01-31 19:17 ` [REGRESSION] " Alex Williamson
2024-11-15 15:30 ` [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault Josef Bacik
` (4 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
Nothing stops us from supporting this; we could simply pass
the order into the helper and emit the proper length. However, there are
currently no tests to validate that this works properly, so disable it
until there's a desire to support this along with the appropriate tests.
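To make "emit the proper length" concrete, a small sketch of how the
event length would scale with the fault order. This is not code from the
series: fault_event_len() and the SKETCH_ constants are hypothetical, and
4 KiB base pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

static_assert(SKETCH_PAGE_SIZE == 4096, "sketch assumes 4 KiB pages");

/* Bytes covered by a fault of the given order (0 means a single page). */
static size_t fault_event_len(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}
```

With these assumptions an order-9 fault (the PMD order on x86-64 with
4 KiB pages) would cover 2 MiB rather than the single page reported
today.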
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index bdf77a3ec47b..843ad75a4148 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/fsnotify.h>
#include <trace/events/kmem.h>
@@ -5637,8 +5638,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
if (vma_is_anonymous(vma))
return do_huge_pmd_anonymous_page(vmf);
+ /*
+ * Currently we just emit PAGE_SIZE for our fault events, so don't allow
+ * a huge fault if we have a pre content watch on this file. This would
+ * be trivial to support, but there would need to be tests to ensure
+ * this works properly and those don't exist currently.
+ */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
return VM_FAULT_FALLBACK;
@@ -5648,6 +5658,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
vm_fault_t ret;
@@ -5662,6 +5673,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
}
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
@@ -5681,9 +5695,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
return VM_FAULT_FALLBACK;
+ /* See comment in create_huge_pmd. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5695,12 +5713,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
vm_fault_t ret;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
goto split;
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
--
2.43.0
* [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (14 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-12-08 16:58 ` Klara Modin
2024-11-15 15:30 ` [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults Josef Bacik
` (3 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
on the faulting method.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.
Export a simple helper for file systems that have their own ->fault()
to use, and do the more complicated handling directly in
filemap_fault.
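The permission hook takes a byte offset and a length, while a fault is
described by a page index. A minimal sketch of that conversion, assuming
4 KiB pages; the SKETCH_ constants, struct sketch_range and
fault_event_range() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

static_assert(SKETCH_PAGE_SIZE == 4096, "sketch assumes 4 KiB pages");

struct sketch_range {
	int64_t pos;	/* byte offset into the file, like loff_t */
	size_t count;	/* length in bytes */
};

/* Convert a faulting page index into the single-page byte range to report. */
static struct sketch_range fault_event_range(uint64_t pgoff)
{
	struct sketch_range r;

	/*
	 * pgoff is already 64-bit here. In the kernel pgoff_t is an
	 * unsigned long, so it should be cast to loff_t before shifting.
	 */
	r.pos = (int64_t)(pgoff << SKETCH_PAGE_SHIFT);
	r.count = SKETCH_PAGE_SIZE;
	return r;
}
```

For example, a fault on page index 3 maps to the byte range starting at
12288 with a length of 4096.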
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/mm.h | 1 +
mm/filemap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01c5e7a4489f..90155ef8599a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
+extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
extern unsigned long stack_guard_gap;
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
diff --git a/mm/filemap.c b/mm/filemap.c
index 68ea596f6905..0bf7d645dec5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -47,6 +47,7 @@
#include <linux/splice.h>
#include <linux/rcupdate_wait.h>
#include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -3289,6 +3290,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
return ret;
}
+/**
+ * filemap_fsnotify_fault - maybe emit a pre-content event.
+ * @vmf: struct vm_fault containing details of the fault.
+ * @folio: the folio we're faulting in.
+ *
+ * If we have a pre-content watch on this file we will emit an event for this
+ * range. If we return anything the fault caller should return immediately, we
+ * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
+ * fault again and then the fault handler will run the second time through.
+ *
+ * This is meant to be called with the folio that we will be filling in to make
+ * sure the event is emitted for the correct range.
+ *
+ * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
+ */
+vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
+{
+ struct file *fpin = NULL;
+ int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
+ loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
+ size_t count = PAGE_SIZE;
+ vm_fault_t ret;
+
+ /*
+ * We already did this and now we're retrying with everything locked,
+ * don't emit the event and continue.
+ */
+ if (vmf->flags & FAULT_FLAG_TRIED)
+ return 0;
+
+ /* No watches, we're done. */
+ if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
+ return 0;
+
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+ if (!fpin)
+ return VM_FAULT_SIGBUS;
+
+ ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
+ fput(fpin);
+ if (ret)
+ return VM_FAULT_SIGBUS;
+ return VM_FAULT_RETRY;
+}
+EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
+
/**
* filemap_fault - read in file data for page fault handling
* @vmf: struct vm_fault containing details of the fault
@@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* or because readahead was otherwise unable to retrieve it.
*/
if (unlikely(!folio_test_uptodate(folio))) {
+ /*
+ * If this is a precontent file we can now emit an event to
+ * try and populate the folio.
+ */
+ if (!(vmf->flags & FAULT_FLAG_TRIED) &&
+ fsnotify_file_has_pre_content_watches(file)) {
+ loff_t pos = folio_pos(folio);
+ size_t count = folio_size(folio);
+
+ /* We're NOWAIT, we have to retry. */
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+ folio_unlock(folio);
+ goto out_retry;
+ }
+
+ if (mapping_locked)
+ filemap_invalidate_unlock_shared(mapping);
+ mapping_locked = false;
+
+ folio_unlock(folio);
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+ if (!fpin)
+ goto out_retry;
+
+ error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
+ count);
+ if (error)
+ ret = VM_FAULT_SIGBUS;
+ goto out_retry;
+ }
+
/*
* If the invalidate lock is not held, the folio was in cache
* and uptodate and now it is not. Strange but possible since we
--
2.43.0
* [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (15 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-21 10:22 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 18/19] btrfs: disable defrag on pre-content watched files Josef Bacik
` (2 subsequent siblings)
19 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
xfs has its own handling for write faults, so we need to add the
pre-content fsnotify hook for this case. Reads go through filemap_fault
so they're handled properly there.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/xfs_file.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ca47cae5a40a..4fe89770ecb5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1458,6 +1458,10 @@ xfs_write_fault(
unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
vm_fault_t ret;
+ ret = filemap_fsnotify_fault(vmf);
+ if (unlikely(ret))
+ return ret;
+
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
--
2.43.0
* [PATCH v8 18/19] btrfs: disable defrag on pre-content watched files
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (16 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-15 15:30 ` [PATCH v8 19/19] fs: enable pre-content events on supported file systems Josef Bacik
2024-11-21 11:29 ` [PATCH v8 00/19] fanotify: add pre-content hooks Jan Kara
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
We queue up inodes to be defrag'ed asynchronously, which means we do not
have their original file for readahead. This means that the code to
skip readahead on pre-content watched files will not run, and we could
potentially read in empty pages.
Handle this corner case by disabling defrag on files that are currently
being watched for pre-content events.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/ioctl.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c9302d193187..1e5913f276be 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2635,6 +2635,15 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
goto out;
}
+ /*
+ * Don't allow defrag on pre-content watched files, as it could
+ * populate the page cache with 0's via readahead.
+ */
+ if (fsnotify_file_has_pre_content_watches(file)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
if (argp) {
if (copy_from_user(&range, argp, sizeof(range))) {
ret = -EFAULT;
--
2.43.0
* [PATCH v8 19/19] fs: enable pre-content events on supported file systems
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (17 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 18/19] btrfs: disable defrag on pre-content watched files Josef Bacik
@ 2024-11-15 15:30 ` Josef Bacik
2024-11-21 11:29 ` [PATCH v8 00/19] fanotify: add pre-content hooks Jan Kara
19 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2024-11-15 15:30 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/super.c | 2 +-
fs/ext4/super.c | 3 +++
fs/xfs/xfs_super.c | 2 +-
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97a85d180b61..fe6ecc3f1cab 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -961,7 +961,7 @@ static int btrfs_fill_super(struct super_block *sb,
#endif
sb->s_xattr = btrfs_xattr_handlers;
sb->s_time_gran = 1;
- sb->s_iflags |= SB_I_CGROUPWB;
+ sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
err = super_setup_bdi(sb);
if (err) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b3512d78b55c..13b9d67a4eec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5306,6 +5306,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
/* i_version is always enabled now */
sb->s_flags |= SB_I_VERSION;
+ /* HSM events are allowed by default. */
+ sb->s_iflags |= SB_I_ALLOW_HSM;
+
err = ext4_check_feature_compatibility(sb, es, silent);
if (err)
goto failed_mount;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fda75db739b1..2d1e9db8548d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1713,7 +1713,7 @@ xfs_fs_fill_super(
sb->s_time_max = XFS_LEGACY_TIME_MAX;
}
trace_xfs_inode_timestamp_range(mp, sb->s_time_min, sb->s_time_max);
- sb->s_iflags |= SB_I_CGROUPWB;
+ sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
set_posix_acl_flag(sb);
--
2.43.0
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-15 15:30 ` [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-11-15 15:59 ` Amir Goldstein
2024-11-21 10:44 ` Jan Kara
1 sibling, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2024-11-15 15:59 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri, Nov 15, 2024 at 4:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> From: Amir Goldstein <amir73il@gmail.com>
>
> Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
>
> Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> in the context of the event handler.
>
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill the content of files on first read access.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> fs/notify/fanotify/fanotify.c | 3 ++-
> fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
> include/linux/fanotify.h | 14 ++++++++++----
> include/uapi/linux/fanotify.h | 2 ++
> 4 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 2e6ba94ec405..da6c3c1c7edf 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -916,8 +916,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
> BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
> + BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
>
> - BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
> + BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
>
> mask = fanotify_group_event_mask(group, iter_info, &match_mask,
> mask, data, data_type, dir);
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 456cc3e92c88..5ea447e9e5a8 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1640,11 +1640,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
> unsigned int flags)
> {
> unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
> + bool is_dir = d_is_dir(path->dentry);
> /* Strict validation of events in non-dir inode mask with v5.17+ APIs */
> bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
> (mask & FAN_RENAME) ||
> (flags & FAN_MARK_IGNORE);
>
> + /*
> + * Filesystems need to opt into pre-content events (a.k.a. HSM)
> + * and they are only supported on regular files and directories.
> + */
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> + return -EINVAL;
You missed my latest push of this change.
No worries; for the final version we want:
return -EOPNOTSUPP;
> + if (!is_dir && !d_is_reg(path->dentry))
> + return -EINVAL;
> + }
> +
> /*
> * Some filesystems such as 'proc' acquire unusual locks when opening
> * files. For them fanotify permission events have high chances of
> @@ -1677,7 +1689,7 @@ static int fanotify_events_supported(struct fsnotify_group *group,
> * but because we always allowed it, error only when using new APIs.
> */
> if (strict_dir_events && mark_type == FAN_MARK_INODE &&
> - !d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
> + !is_dir && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
> return -ENOTDIR;
>
> return 0;
> @@ -1778,10 +1790,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> return -EPERM;
>
> /*
> - * Permission events require minimum priority FAN_CLASS_CONTENT.
> + * Permission events are not allowed for FAN_CLASS_NOTIF.
> + * Pre-content permission events are not allowed for FAN_CLASS_CONTENT.
> */
> if (mask & FANOTIFY_PERM_EVENTS &&
> - group->priority < FSNOTIFY_PRIO_CONTENT)
> + group->priority == FSNOTIFY_PRIO_NORMAL)
> + return -EINVAL;
> + else if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
> + group->priority == FSNOTIFY_PRIO_CONTENT)
> return -EINVAL;
>
> if (mask & FAN_FS_ERROR &&
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index 89ff45bd6f01..c747af064d2c 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -89,6 +89,16 @@
> #define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE | \
> FAN_RENAME)
>
> +/* Content events can be used to inspect file content */
> +#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
> + FAN_ACCESS_PERM)
> +/* Pre-content events can be used to fill file content */
> +#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
> +
> +/* Events that require a permission response from user */
> +#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
> + FANOTIFY_PRE_CONTENT_EVENTS)
> +
> /* Events that can be reported with event->fd */
> #define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
>
> @@ -104,10 +114,6 @@
> FANOTIFY_INODE_EVENTS | \
> FANOTIFY_ERROR_EVENTS)
>
> -/* Events that require a permission response from user */
> -#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
> - FAN_OPEN_EXEC_PERM)
> -
> /* Extra flags that may be reported with event or control handling of events */
> #define FANOTIFY_EVENT_FLAGS (FAN_EVENT_ON_CHILD | FAN_ONDIR)
>
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 79072b6894f2..7596168c80eb 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -27,6 +27,8 @@
> #define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
> /* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
>
> +#define FAN_PRE_ACCESS 0x00100000 /* Pre-content access hook */
> +
> #define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
>
> #define FAN_RENAME 0x10000000 /* File was renamed */
> --
> 2.43.0
>
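For anyone following along, the class check in the do_fanotify_mark() hunk
quoted above can be modelled in userspace. This is only a sketch of the
logic as posted: model_check_class() and enum model_prio are made-up names
standing in for the kernel code and the group priority, and the event bit
values are copied from the UAPI header plus the FAN_PRE_ACCESS value added
by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Existing UAPI bits, plus FAN_PRE_ACCESS as defined in this patch. */
#define FAN_OPEN_PERM       0x00010000u
#define FAN_ACCESS_PERM     0x00020000u
#define FAN_OPEN_EXEC_PERM  0x00040000u
#define FAN_PRE_ACCESS      0x00100000u

#define CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
			     FAN_ACCESS_PERM)
#define PRE_CONTENT_EVENTS  (FAN_PRE_ACCESS)
#define PERM_EVENTS         (CONTENT_PERM_EVENTS | PRE_CONTENT_EVENTS)

/* Hypothetical stand-in for group->priority (NOTIF/CONTENT/PRE_CONTENT). */
enum model_prio { PRIO_NORMAL, PRIO_CONTENT, PRIO_PRE_CONTENT };

/* Mirrors the if / else-if pair in the quoted hunk. */
static int model_check_class(enum model_prio prio, uint32_t mask)
{
	assert(prio <= PRIO_PRE_CONTENT);

	/* No permission events at all for FAN_CLASS_NOTIF groups. */
	if ((mask & PERM_EVENTS) && prio == PRIO_NORMAL)
		return -EINVAL;
	/* Pre-content events need more than FAN_CLASS_CONTENT. */
	else if ((mask & PRE_CONTENT_EVENTS) && prio == PRIO_CONTENT)
		return -EINVAL;

	return 0;
}
```

In this model a FAN_CLASS_CONTENT group may still ask for FAN_ACCESS_PERM,
but FAN_PRE_ACCESS is accepted only at the pre-content priority.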
* Re: [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge
2024-11-15 15:30 ` [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge Josef Bacik
@ 2024-11-18 18:14 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-18 18:14 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:14, Josef Bacik wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> All it takes to get rid of the __FMODE_NONOTIFY kludge is switching
> fanotify from anon_inode_getfd() to anon_inode_getfile_fmode() and adding
> a dentry_open_fmode() helper to be used by fanotify on the other path.
^^^ this ended up being dentry_open_nonotify()
> That's it - no more weird shit in OPEN_FMODE(), etc.
>
> Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
> Link: https://lore.kernel.org/linux-fsdevel/20241113043003.GH3387508@ZenIV/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
...
> @@ -3706,11 +3708,9 @@ struct ctl_table;
> int __init list_bdev_fs_names(char *buf, size_t size);
>
> #define __FMODE_EXEC ((__force int) FMODE_EXEC)
> -#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
>
> #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
> -#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
> - (flag & __FMODE_NONOTIFY)))
> +#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
^^^ one more level of parentheses
than necessary now
Otherwise looks good to me. Don't need to resend just because of this, I
can fix this up if there's nothing else.
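As a worked example of what the simplified OPEN_FMODE() computes, here is a
userspace sketch. MODEL_OPEN_FMODE(), model_open_fmode() and the
MODEL_FMODE_* names are made up for illustration; the two mode values are
assumed to match the kernel's FMODE_READ and FMODE_WRITE, and the result
relies on the Linux values of O_RDONLY, O_WRONLY and O_RDWR (0, 1, 2).

```c
#include <assert.h>
#include <fcntl.h>

/* Assumed to match the kernel's FMODE_READ / FMODE_WRITE bits. */
#define MODEL_FMODE_READ  0x1u
#define MODEL_FMODE_WRITE 0x2u

/* OPEN_FMODE() after the patch, without the extra pair of parentheses. */
#define MODEL_OPEN_FMODE(flag) ((unsigned int)(((flag) + 1) & O_ACCMODE))

/* Adding one turns the access mode 0/1/2 into the bit pattern 1/2/3. */
static unsigned int model_open_fmode(int flags)
{
	return MODEL_OPEN_FMODE(flags);
}
```

Flags outside O_ACCMODE (O_CREAT and friends) are masked away, so with the
__FMODE_NONOTIFY term gone only the read/write bits survive the conversion.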
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate
2024-11-15 15:30 ` [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate Josef Bacik
@ 2024-11-20 15:23 ` Jan Kara
2024-11-20 15:57 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-20 15:23 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:22, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> Generate FS_PRE_ACCESS event before truncate, without sb_writers held.
>
> Move the security hooks also before sb_start_write() to conform with
> other security hooks (e.g. in write, fallocate).
>
> The event will have a range info of the page surrounding the new size
> to provide an opportunity to fill the content at the end of file before
> truncating to non-page aligned size.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
I was thinking about this. One small issue is that, just as filesystems
may do RMW of the tail page during truncate, they will do RMW of the
head & tail pages on hole punch or zero range, so we should have some
strategically sprinkled fsnotify_truncate_perm() calls there as well.
That's easy enough to fix.
But there's another problem which I'm more worried about: if we have
a 64k file, the user punches 12k..20k and then reads 0..64k, how does
the HSM daemon in userspace know what data to fill in? Once we have a
modify pre-content event, the daemon can watch it, and since the punch will
send modify for 12k-20k, the daemon knows the local (empty) page cache is
the source of truth. But without a modify event this is just a recipe for
data corruption AFAICT.
So it seems the current setup with only the access pre-content event has a
chance to work reliably only in read-only mode? So we should probably refuse
writeable open if the file is being watched for pre-content events, and
similarly refuse truncate?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 13/19] fanotify: add a helper to check for pre content events
2024-11-15 15:30 ` [PATCH v8 13/19] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-11-20 15:44 ` Jan Kara
2024-11-20 16:43 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-20 15:44 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:26, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> We want to emit events during page fault, and calling into fanotify
> could be expensive, so add a helper to allow us to skip calling into
> fanotify from page fault. This will also be used to disable readahead
> for content watched files which will be handled in a subsequent patch.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> include/linux/fsnotify.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index 08893429a818..d5a0d8648000 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -178,6 +178,11 @@ static inline void file_set_fsnotify_mode(struct file *file)
> }
> }
>
> +static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
> +{
> + return file && unlikely(FMODE_FSNOTIFY_HSM(file->f_mode));
> +}
> +
I was pondering this, and since we are trying to make these quick
checks more explicit, I'll probably drop this helper. Also the 'file &&'
part looks strange (I understand page_cache_[a]sync_ra() need it but I'd
rather handle it explicitly there).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-15 15:30 ` [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time Josef Bacik
@ 2024-11-20 15:53 ` Jan Kara
2024-11-20 16:12 ` Amir Goldstein
2024-11-21 9:45 ` Christian Brauner
0 siblings, 2 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-20 15:53 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:15, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> Legacy inotify/fanotify listeners can add watches for events on inode,
> parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> were already open at the time of setting up the watches.
>
> fanotify permission events are typically used by anti-malware software
> that watches the entire mount, and it is not common to have more than
> one anti-malware engine installed on a system.
>
> To reduce the overhead of the fsnotify_file_perm() hooks on every file
> access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> events only if there were *any* permission event listeners on the
> filesystem at the time that the file was opened.
>
> The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> events types to report.
>
> This is going to apply to the new fanotify pre-content events in order
> to reduce the cost of the new pre-content event vfs hooks.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
FWIW I've ended up somewhat massaging this patch (see below).
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 23bd058576b1..8e5c783013d2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>
> #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
>
> -/* FMODE_* bit 24 */
> -
> /* File is embedded in backing_file object */
> -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
>
> -/* File was opened by fanotify and shouldn't generate fanotify events */
> -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> +/* File shouldn't generate fanotify pre-content events */
> +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
> +
> +/* File shouldn't generate fanotify permission events */
> +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
Firstly, I've kept FMODE_NONOTIFY as a single bit instead of a two-bit
constant. In my life I've seen too many bugs caused by people expecting a
constant to have a single bit set when it actually had more. So I've ended up
with:
+/*
+ * Together with FMODE_NONOTIFY_PERM defines which fsnotify events shouldn't be
+ * generated (see below)
+ */
+#define FMODE_NONOTIFY ((__force fmode_t)(1 << 25))
+
+/*
+ * Together with FMODE_NONOTIFY defines which fsnotify events shouldn't be
+ * generated (see below)
+ */
+#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
and
+/*
+ * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
+ * for a file. These are the possible values of (f->f_mode &
+ * FMODE_FSNOTIFY_MASK) and their meaning:
+ *
+ * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
+ * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
+ * FMODE_NONOTIFY | FMODE_NONOTIFY_PERM - suppress only pre-content events.
+ */
+#define FMODE_FSNOTIFY_MASK \
+ (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)
+
+#define FMODE_FSNOTIFY_NONE(mode) \
+ ((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
+#define FMODE_FSNOTIFY_PERM(mode) \
+ (!(mode & FMODE_NONOTIFY_PERM))
+#define FMODE_FSNOTIFY_HSM(mode) \
+ ((mode & FMODE_FSNOTIFY_MASK) == 0)
Also I've moved file_set_fsnotify_mode() out of line into fsnotify.c. The
function gets quite big and the call is not IMO so expensive to warrant
inlining. Furthermore it saves exporting some fsnotify internals to modules
(in later patches).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate
2024-11-20 15:23 ` Jan Kara
@ 2024-11-20 15:57 ` Amir Goldstein
2024-11-20 16:16 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-20 15:57 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 4:23 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-11-24 10:30:22, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > Generate FS_PRE_ACCESS event before truncate, without sb_writers held.
> >
> > Move the security hooks also before sb_start_write() to conform with
> > other security hooks (e.g. in write, fallocate).
> >
> > The event will have a range info of the page surrounding the new size
> > to provide an opportunity to fill the content at the end of file before
> > truncating to non-page aligned size.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I was thinking about this. One small issue is that similarly as the
> filesystems may do RMW of tail page during truncate, they will do RMW of
> head & tail pages on hole punch or zero range so we should have some
> strategically sprinkled fsnotify_truncate_perm() calls there as well.
> That's easy enough to fix.
fallocate already has fsnotify_file_area_perm() hook.
What is missing?
>
> But there's another problem which I'm more worried about: If we have
> a file 64k large, user punches 12k..20k and then does read for 0..64k, then
> how does HSM daemon in userspace know what data to fill in? When we'll have
> modify pre-content event, daemon can watch it and since punch will send modify
> for 12k-20k, the daemon knows the local (empty) page cache is the source of
> truth. But without modify event this is just a recipe for data corruption
> AFAICT.
>
> So it seems the current setting with access pre-content event has only chance
> to work reliably in read-only mode? So we should probably refuse writeable
> open if file is being watched for pre-content events and similarly refuse
> truncate?
I am confused; I'm not sure I understand the problem.
In the scenario that you specified, punch hole WILL generate a FS_PRE_ACCESS
event for 12k-20k.
When HSM gets a FS_PRE_ACCESS event for 12k-20k it MUST fill the content
and keep to itself that 12k-20k is the source of truth from now on.
The extra FS_PRE_ACCESS event on 0..64k absolutely does not change that.
IOW, a FS_PRE_ACCESS event on 0..64k definitely does NOT mean that
HSM NEEDS to fill content in 0..64k; it just means that it MAY need
to fill content if it hasn't done that for a range before the event.
To reiterate this important point, it is the HSM's responsibility to maintain
the "content filled map" per file in its own way; under no circumstances is
it assumed that fiemap or page cache state has anything to do with the
"content filled map".
The *only* thing that HSM can assume is that if its "content filled map"
is empty for some range, then page cache is NOT yet populated in that
range, and even that relies on how HSM and mount are being initialized
and exposed to users.
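To make the "content filled map" idea concrete, here is a minimal userspace
sketch of the bookkeeping an HSM daemon might do. Everything in it is an
assumption made for illustration: the 4k tracking granularity, the fixed-size
array, and the names struct hsm_filled_map and hsm_handle_pre_access(). A
real daemon would persist the map and fetch data from its backing store
where the comment says so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HSM_BLOCK_SIZE 4096u	/* assumed tracking granularity */
#define HSM_MAX_BLOCKS 64u	/* covers a 256k file in this sketch */

/* Per-file record of which blocks the daemon has already filled. */
struct hsm_filled_map {
	bool filled[HSM_MAX_BLOCKS];
};

/*
 * Handle a pre-access event for [offset, offset + len): fill each block
 * that is not yet recorded as filled, record it, and return the number of
 * blocks that had to be filled.
 */
static size_t hsm_handle_pre_access(struct hsm_filled_map *map,
				    uint64_t offset, uint64_t len)
{
	size_t fetched = 0;
	uint64_t first, last;

	if (len == 0)
		return 0;

	first = offset / HSM_BLOCK_SIZE;
	last = (offset + len - 1) / HSM_BLOCK_SIZE;
	assert(last < HSM_MAX_BLOCKS);

	for (uint64_t b = first; b <= last; b++) {
		if (map->filled[b])
			continue;	/* local copy is already the source of truth */
		/* A real daemon would fetch this block from remote storage here. */
		map->filled[b] = true;
		fetched++;
	}
	return fetched;
}
```

With the example from this thread, the event for 12k-20k fills two blocks,
and the later event for 0..64k fills only the remaining fourteen, leaving
the punched range alone.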
Did I misunderstand your concern?
Thanks,
Amir.
* Re: [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched
2024-11-15 15:30 ` [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched Josef Bacik
@ 2024-11-20 16:02 ` Jan Kara
2024-11-20 16:42 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-20 16:02 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:16, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> So far, we set FMODE_NONOTIFY_ flags at open time if we know that there
> are no permission event watchers at all on the filesystem, but lack of
> FMODE_NONOTIFY_ flags does not mean that the file is actually watched.
>
> To make the flags more accurate we add a helper that checks if the
> file's inode, mount, sb or parent are being watched for a set of events.
>
> This is going to be used for setting FMODE_NONOTIFY_HSM only when the
> specific file is actually watched for pre-content events.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
I did some changes here as well. See below:
> -/* Are there any inode/mount/sb objects that are interested in this event? */
> -static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> - __u32 mask)
> +/* Are there any inode/mount/sb objects that watch for these events? */
> +static inline __u32 fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> + __u32 events_mask)
> {
> __u32 marks_mask = READ_ONCE(inode->i_fsnotify_mask) | mnt_mask |
> READ_ONCE(inode->i_sb->s_fsnotify_mask);
>
> - return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
> + return events_mask & marks_mask;
> }
>
> +/* Are there any inode/mount/sb/parent objects that watch for these events? */
> +__u32 fsnotify_file_object_watched(struct file *file, __u32 events_mask)
> +{
> + struct dentry *dentry = file->f_path.dentry;
> + struct dentry *parent;
> + __u32 marks_mask, mnt_mask =
> + READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
> +
> + marks_mask = fsnotify_object_watched(d_inode(dentry), mnt_mask,
> + events_mask);
> +
> + if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)))
> + return marks_mask;
> +
> + parent = dget_parent(dentry);
> + marks_mask |= fsnotify_inode_watches_children(d_inode(parent));
> + dput(parent);
> +
> + return marks_mask & events_mask;
> +}
> +EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
I find it confusing that fsnotify_object_watched() does not take the parent
into account while fsnotify_file_object_watched() does. Furthermore, the
naming doesn't reflect very well the fact that we are actually returning a mask
of events. I've ended up dropping this helper (it's used in a single place
anyway) and instead doing the same directly in file_set_fsnotify_mode().
@@ -658,6 +660,27 @@ void file_set_fsnotify_mode(struct file *file)
file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
return;
}
+
+ /*
+ * OK, there are some pre-content watchers. Check if anybody can be
+ * watching for pre-content events on *this* file.
+ */
+ mnt_mask = READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
+ if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) &&
+ !fsnotify_object_watched(d_inode(dentry), mnt_mask,
+ FSNOTIFY_PRE_CONTENT_EVENTS))) {
+ file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
+ return;
+ }
+
+ /* Even parent is not watching for pre-content events on this file? */
+ parent = dget_parent(dentry);
+ p_mask = fsnotify_inode_watches_children(d_inode(parent));
+ dput(parent);
+ if (!(p_mask & FSNOTIFY_PRE_CONTENT_EVENTS)) {
+ file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
+ return;
+ }
}
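For readers who want the question in isolation ("can anybody be watching
this file for these events?"), here is a userspace model of what the quoted
fsnotify_file_object_watched() helper from the original patch computes. It
is a sketch only: struct model_file and model_file_watched() are made-up
names, the masks are plain integers instead of the kernel's inode, sb and
mount state, and MODEL_FS_PRE_ACCESS is assumed to be the same bit as
FAN_PRE_ACCESS in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FS_PRE_ACCESS 0x00100000u	/* assumed equal to FAN_PRE_ACCESS */

/* Hypothetical flattened view of the state the kernel helper reads. */
struct model_file {
	uint32_t inode_mask;		/* inode->i_fsnotify_mask */
	uint32_t sb_mask;		/* inode->i_sb->s_fsnotify_mask */
	uint32_t mnt_mask;		/* mount's mnt_fsnotify_mask */
	bool parent_watched;		/* DCACHE_FSNOTIFY_PARENT_WATCHED set */
	uint32_t parent_children_mask;	/* parent's mask of events on children */
};

/* Returns the subset of 'events' that some mark may be watching for. */
static uint32_t model_file_watched(const struct model_file *f, uint32_t events)
{
	uint32_t marks;

	assert(f != NULL);
	marks = f->inode_mask | f->sb_mask | f->mnt_mask;

	/* The parent is only consulted when the dentry is flagged. */
	if (f->parent_watched)
		marks |= f->parent_children_mask;

	return marks & events;
}
```

A zero result for the pre-content events is the case in which the file can
be marked at open time so that the pre-content hooks are skipped later.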
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-20 15:53 ` Jan Kara
@ 2024-11-20 16:12 ` Amir Goldstein
2024-11-21 9:39 ` Jan Kara
2024-11-21 9:45 ` Christian Brauner
1 sibling, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-20 16:12 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 4:53 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-11-24 10:30:15, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > Legacy inotify/fanotify listeners can add watches for events on inode,
> > parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> > were already open at the time of setting up the watches.
> >
> > fanotify permission events are typically used by anti-malware software
> > that watches the entire mount, and it is not common to have more than
> > one anti-malware engine installed on a system.
> >
> > To reduce the overhead of the fsnotify_file_perm() hooks on every file
> > access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> > events only if there were *any* permission event listeners on the
> > filesystem at the time that the file was opened.
> >
> > The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> > two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> > events types to report.
> >
> > This is going to apply to the new fanotify pre-content events in order
> > to reduce the cost of the new pre-content event vfs hooks.
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> FWIW I've ended up somewhat massaging this patch (see below).
>
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23bd058576b1..8e5c783013d2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >
> > #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
> >
> > -/* FMODE_* bit 24 */
> > -
> > /* File is embedded in backing_file object */
> > -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> > +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
> >
> > -/* File was opened by fanotify and shouldn't generate fanotify events */
> > -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> > +/* File shouldn't generate fanotify pre-content events */
> > +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
> > +
> > +/* File shouldn't generate fanotify permission events */
> > +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
>
> Firstly, I've kept FMODE_NONOTIFY to stay a single bit instead of two bit
> constant. I've seen too many bugs caused by people expecting the constant
> has a single bit set when it actually had more in my life. So I've ended up
> with:
>
> +/*
> + * Together with FMODE_NONOTIFY_PERM defines which fsnotify events shouldn't be
> + * generated (see below)
> + */
> +#define FMODE_NONOTIFY ((__force fmode_t)(1 << 25))
> +
> +/*
> + * Together with FMODE_NONOTIFY defines which fsnotify events shouldn't be
> + * generated (see below)
> + */
> +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
>
> and
>
> +/*
> + * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
> + * for a file. These are the possible values of (f->f_mode &
> + * FMODE_FSNOTIFY_MASK) and their meaning:
> + *
> + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> + * FMODE_NONOTIFY | FMODE_NONOTIFY_PERM - suppress only pre-content events.
> + */
> +#define FMODE_FSNOTIFY_MASK \
> + (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)
> +
> +#define FMODE_FSNOTIFY_NONE(mode) \
> + ((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
> +#define FMODE_FSNOTIFY_PERM(mode) \
> + (!(mode & FMODE_NONOTIFY_PERM))
That looks incorrect:
it gives the wrong value for FMODE_NONOTIFY | FMODE_NONOTIFY_PERM.
The masked value should be:
!= FMODE_NONOTIFY_PERM &&
!= FMODE_NONOTIFY
The simplicity of the single bit test for permission events
is why I chose my model, but I understand your reasoning.
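To spell the difference out, here is a small userspace comparison of the
single-bit test as posted and the two-comparison test suggested above. It
is a sketch: the M_* names and both function names are made up, and the bit
positions are simply copied from the quoted proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions as in the quoted proposal. */
#define M_NONOTIFY	(1u << 25)
#define M_NONOTIFY_PERM	(1u << 26)
#define M_MASK		(M_NONOTIFY | M_NONOTIFY_PERM)

/* Single-bit test, as in the posted FMODE_FSNOTIFY_PERM(). */
static bool perm_single_bit(unsigned int mode)
{
	return !(mode & M_NONOTIFY_PERM);
}

/* Two-comparison test on the masked value, as suggested in this reply. */
static bool perm_two_compare(unsigned int mode)
{
	unsigned int m = mode & M_MASK;

	return m != M_NONOTIFY_PERM && m != M_NONOTIFY;
}
```

In this model the two predicates agree whenever FMODE_NONOTIFY is clear and
disagree whenever it is set, which includes the combined value that is
documented as suppressing only pre-content events.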
> +#define FMODE_FSNOTIFY_HSM(mode) \
> + ((mode & FMODE_FSNOTIFY_MASK) == 0)
>
> Also I've moved file_set_fsnotify_mode() out of line into fsnotify.c. The
> function gets quite big and the call is not IMO so expensive to warrant
> inlining. Furthermore it saves exporting some fsnotify internals to modules
> (in later patches).
Sounds good.
Since you wanted to refrain from defining a two-bit constant,
I wonder how you annotated the NONOTIFY_HSM case:
return FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
Thanks,
Amir.
* Re: [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate
2024-11-20 15:57 ` Amir Goldstein
@ 2024-11-20 16:16 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-20 16:16 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed 20-11-24 16:57:30, Amir Goldstein wrote:
> On Wed, Nov 20, 2024 at 4:23 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 15-11-24 10:30:22, Josef Bacik wrote:
> > > From: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Generate FS_PRE_ACCESS event before truncate, without sb_writers held.
> > >
> > > Move the security hooks also before sb_start_write() to conform with
> > > other security hooks (e.g. in write, fallocate).
> > >
> > > The event will have a range info of the page surrounding the new size
> > > to provide an opportunity to fill the content at the end of file before
> > > truncating to non-page aligned size.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I was thinking about this. One small issue is that similarly as the
> > filesystems may do RMW of tail page during truncate, they will do RMW of
> > head & tail pages on hole punch or zero range so we should have some
> > strategically sprinkled fsnotify_truncate_perm() calls there as well.
> > That's easy enough to fix.
>
> fallocate already has fsnotify_file_area_perm() hook.
> What is missing?
Sorry, I missed that in the patch that added it.
> > But there's another problem which I'm more worried about: If we have
> > a file 64k large, user punches 12k..20k and then does read for 0..64k, then
> > how does HSM daemon in userspace know what data to fill in? When we'll have
> > modify pre-content event, daemon can watch it and since punch will send modify
> > for 12k-20k, the daemon knows the local (empty) page cache is the source of
> > truth. But without modify event this is just a recipe for data corruption
> > AFAICT.
> >
> > So it seems the current setting with access pre-content event has only chance
> > to work reliably in read-only mode? So we should probably refuse writeable
> > open if file is being watched for pre-content events and similarly refuse
> > truncate?
>
> I am confused. not sure I understand the problem.
>
> In the scenario that you specified, punch hole WILL generate a FS_PRE_ACCESS
> event for 12k-20k.
>
> When HSM gets a FS_PRE_ACCESS event for 12k-20k it MUST fill the content
> and keep to itself that 12k-20k is the source of truth from now on.
Ah, right. I got confused and didn't realize we'll be sending FS_PRE_ACCESS
for 12k-20k. Thanks for the clarification!
> The extra FS_PRE_ACCESS event on 0..64k absolutely does not change that.
> IOW, a FS_PRE_ACCESS event on 0..64k definitely does NOT mean that
> HSM NEEDS to fill content in 0..64k; it just means that it MAY need
> to fill content if it hasn't done that for a range before the event.
>
> To reiterate this important point, it is the HSM's responsibility to maintain
> the "content filled map" per file in its own way; under no circumstances is
> it assumed that fiemap or page cache state has anything to do with the
> "content filled map".
>
> The *only* thing that HSM can assume is that if its "content filled map"
> is empty for some range, then page cache is NOT yet populated in that
> range, and even that relies on how HSM and mount are being initialized
> and exposed to users.
OK, understood and makes sense.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched
2024-11-20 16:02 ` Jan Kara
@ 2024-11-20 16:42 ` Amir Goldstein
2024-11-21 8:54 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-20 16:42 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-11-24 10:30:16, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > So far, we set FMODE_NONOTIFY_ flags at open time if we know that there
> > are no permission event watchers at all on the filesystem, but lack of
> > FMODE_NONOTIFY_ flags does not mean that the file is actually watched.
> >
> > To make the flags more accurate we add a helper that checks if the
> > file's inode, mount, sb or parent are being watched for a set of events.
> >
> > This is going to be used for setting FMODE_NONOTIFY_HSM only when the
> > specific file is actually watched for pre-content events.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I did some changes here as well. See below:
>
> > -/* Are there any inode/mount/sb objects that are interested in this event? */
> > -static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> > - __u32 mask)
> > +/* Are there any inode/mount/sb objects that watch for these events? */
> > +static inline __u32 fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> > + __u32 events_mask)
> > {
> > __u32 marks_mask = READ_ONCE(inode->i_fsnotify_mask) | mnt_mask |
> > READ_ONCE(inode->i_sb->s_fsnotify_mask);
> >
> > - return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
> > + return events_mask & marks_mask;
> > }
> >
> > +/* Are there any inode/mount/sb/parent objects that watch for these events? */
> > +__u32 fsnotify_file_object_watched(struct file *file, __u32 events_mask)
> > +{
> > + struct dentry *dentry = file->f_path.dentry;
> > + struct dentry *parent;
> > + __u32 marks_mask, mnt_mask =
> > + READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
> > +
> > + marks_mask = fsnotify_object_watched(d_inode(dentry), mnt_mask,
> > + events_mask);
> > +
> > + if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)))
> > + return marks_mask;
> > +
> > + parent = dget_parent(dentry);
> > + marks_mask |= fsnotify_inode_watches_children(d_inode(parent));
> > + dput(parent);
> > +
> > + return marks_mask & events_mask;
> > +}
> > +EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
>
> I find it confusing that fsnotify_object_watched() does not take parent
> into account while fsnotify_file_object_watched() does. Furthermore the
> naming doesn't very well reflect the fact we are actually returning a mask
> of events. I've ended up dropping this helper (it's used in a single place
> anyway) and instead doing the same directly in file_set_fsnotify_mode().
>
> @@ -658,6 +660,27 @@ void file_set_fsnotify_mode(struct file *file)
> file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> return;
> }
> +
> + /*
> + * OK, there are some pre-content watchers. Check if anybody can be
> + * watching for pre-content events on *this* file.
> + */
> + mnt_mask = READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
> + if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) &&
> + !fsnotify_object_watched(d_inode(dentry), mnt_mask,
> + FSNOTIFY_PRE_CONTENT_EVENTS))) {
> + file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> + return;
> + }
> +
> + /* Even parent is not watching for pre-content events on this file? */
> + parent = dget_parent(dentry);
> + p_mask = fsnotify_inode_watches_children(d_inode(parent));
> + dput(parent);
> + if (!(p_mask & FSNOTIFY_PRE_CONTENT_EVENTS)) {
> + file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> + return;
> + }
> }
>
Nice!
Note that I had a "hidden motive" for a future optimization when I changed
the return value of fsnotify_object_watched() to a mask.
I figured that while we are doing the checks above, we can check the mask
ALL_FSNOTIFY_PERM_EVENTS for the same price, and then we get several
answers at once:
1. Is the specific file watched by HSM?
2. Is the specific file watched by open permission events?
3. Is the specific file watched by post-open FAN_ACCESS_PERM?
If the answers are No, No, No, we get some extra optimization
in the (uncommon) use case that there are permission event watchers
on some random inodes in the filesystem.
If the answers are Yes, Yes, No, or No, Yes, No we can return a special
value from file_set_fsnotify_mode() to indicate that permission events
are needed ONLY for fsnotify_open_perm() hook, but not thereafter.
This would implement the semantic change of "respect FAN_ACCESS_PERM
only if it existed at open time" that can save a lot of unneeded cycles in
the very hot read/write path, for example, when watcher only cares about
FAN_OPEN_EXEC_PERM.
I wasn't sure that any of this was worth the effort at this time, but I am
mentioning it in case it gives you ideas for other useful optimizations we
can do with the object's combined marks_mask if we get it for free.
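To make the "several answers for the same price" idea concrete, here is a minimal user-space sketch in C. It is only an illustration under stated assumptions: the EV_* bit values, struct watch_answers, classify_watch() and perm_only_at_open() are hypothetical names invented for this example and are not part of the posted patches or of the kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative event bits; placeholder values, not the kernel's FS_* bits. */
#define EV_OPEN_PERM       (1u << 0)
#define EV_ACCESS_PERM     (1u << 1)
#define EV_OPEN_EXEC_PERM  (1u << 2)
#define EV_PRE_ACCESS      (1u << 3)

/* Hypothetical result type: the three questions from the mail above. */
struct watch_answers {
	bool hsm;          /* 1. watched for pre-content (HSM) events? */
	bool open_perm;    /* 2. watched for open permission events? */
	bool access_perm;  /* 3. watched for post-open access permission? */
};

/*
 * Combine the inode, mount, sb and parent masks once, then answer all
 * three questions from the single combined mask.
 */
static inline struct watch_answers classify_watch(uint32_t inode_mask,
						  uint32_t mnt_mask,
						  uint32_t sb_mask,
						  uint32_t parent_mask)
{
	uint32_t marks_mask = inode_mask | mnt_mask | sb_mask | parent_mask;
	struct watch_answers a = {
		.hsm = (marks_mask & EV_PRE_ACCESS) != 0,
		.open_perm = (marks_mask &
			      (EV_OPEN_PERM | EV_OPEN_EXEC_PERM)) != 0,
		.access_perm = (marks_mask & EV_ACCESS_PERM) != 0,
	};
	return a;
}

/* True when permission events are needed only by the open-time hook. */
static inline bool perm_only_at_open(struct watch_answers a)
{
	return a.open_perm && !a.access_perm;
}
```

With a watcher that only asked for EV_OPEN_EXEC_PERM, perm_only_at_open() is true, which corresponds to the "No, Yes, No" case where the read/write path could skip permission hooks entirely.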
Thanks,
Amir.
* Re: [PATCH v8 13/19] fanotify: add a helper to check for pre content events
2024-11-20 15:44 ` Jan Kara
@ 2024-11-20 16:43 ` Amir Goldstein
0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2024-11-20 16:43 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 4:44 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-11-24 10:30:26, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > We want to emit events during page fault, and calling into fanotify
> > could be expensive, so add a helper to allow us to skip calling into
> > fanotify from page fault. This will also be used to disable readahead
> > for content watched files which will be handled in a subsequent patch.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> > include/linux/fsnotify.h | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> > index 08893429a818..d5a0d8648000 100644
> > --- a/include/linux/fsnotify.h
> > +++ b/include/linux/fsnotify.h
> > @@ -178,6 +178,11 @@ static inline void file_set_fsnotify_mode(struct file *file)
> > }
> > }
> >
> > +static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
> > +{
> > + return file && unlikely(FMODE_FSNOTIFY_HSM(file->f_mode));
> > +}
> > +
>
> I was pondering about this and since we are trying to make these quick
> checks more explicit, I'll probably drop this helper. Also the 'file &&'
> part looks strange (I understand page_cache_[a]sync_ra() need it but I'd
> > rather handle it explicitly there).
Makes sense.
Thanks,
Amir.
* Re: [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched
2024-11-20 16:42 ` Amir Goldstein
@ 2024-11-21 8:54 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-21 8:54 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed 20-11-24 17:42:18, Amir Goldstein wrote:
> On Wed, Nov 20, 2024 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 15-11-24 10:30:16, Josef Bacik wrote:
> > > From: Amir Goldstein <amir73il@gmail.com>
> > >
> > > So far, we set FMODE_NONOTIFY_ flags at open time if we know that there
> > > are no permission event watchers at all on the filesystem, but lack of
> > > FMODE_NONOTIFY_ flags does not mean that the file is actually watched.
> > >
> > > To make the flags more accurate we add a helper that checks if the
> > > file's inode, mount, sb or parent are being watched for a set of events.
> > >
> > > This is going to be used for setting FMODE_NONOTIFY_HSM only when the
> > > specific file is actually watched for pre-content events.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I did some changes here as well. See below:
> >
> > > -/* Are there any inode/mount/sb objects that are interested in this event? */
> > > -static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> > > - __u32 mask)
> > > +/* Are there any inode/mount/sb objects that watch for these events? */
> > > +static inline __u32 fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> > > + __u32 events_mask)
> > > {
> > > __u32 marks_mask = READ_ONCE(inode->i_fsnotify_mask) | mnt_mask |
> > > READ_ONCE(inode->i_sb->s_fsnotify_mask);
> > >
> > > - return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
> > > + return events_mask & marks_mask;
> > > }
> > >
> > > +/* Are there any inode/mount/sb/parent objects that watch for these events? */
> > > +__u32 fsnotify_file_object_watched(struct file *file, __u32 events_mask)
> > > +{
> > > + struct dentry *dentry = file->f_path.dentry;
> > > + struct dentry *parent;
> > > + __u32 marks_mask, mnt_mask =
> > > + READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
> > > +
> > > + marks_mask = fsnotify_object_watched(d_inode(dentry), mnt_mask,
> > > + events_mask);
> > > +
> > > + if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)))
> > > + return marks_mask;
> > > +
> > > + parent = dget_parent(dentry);
> > > + marks_mask |= fsnotify_inode_watches_children(d_inode(parent));
> > > + dput(parent);
> > > +
> > > + return marks_mask & events_mask;
> > > +}
> > > +EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
> >
> > I find it confusing that fsnotify_object_watched() does not take parent
> > into account while fsnotify_file_object_watched() does. Furthermore the
> > naming doesn't very well reflect the fact we are actually returning a mask
> > of events. I've ended up dropping this helper (it's used in a single place
> > anyway) and instead doing the same directly in file_set_fsnotify_mode().
> >
> > @@ -658,6 +660,27 @@ void file_set_fsnotify_mode(struct file *file)
> > file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> > return;
> > }
> > +
> > + /*
> > + * OK, there are some pre-content watchers. Check if anybody can be
> > + * watching for pre-content events on *this* file.
> > + */
> > + mnt_mask = READ_ONCE(real_mount(file->f_path.mnt)->mnt_fsnotify_mask);
> > + if (likely(!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) &&
> > + !fsnotify_object_watched(d_inode(dentry), mnt_mask,
> > + FSNOTIFY_PRE_CONTENT_EVENTS))) {
> > + file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> > + return;
> > + }
> > +
> > + /* Even parent is not watching for pre-content events on this file? */
> > + parent = dget_parent(dentry);
> > + p_mask = fsnotify_inode_watches_children(d_inode(parent));
> > + dput(parent);
> > + if (!(p_mask & FSNOTIFY_PRE_CONTENT_EVENTS)) {
> > + file->f_mode |= FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> > + return;
> > + }
> > }
> >
>
> Nice!
>
> Note that I had a "hidden motive" for future optimization when I changed
> return value of fsnotify_object_watched() to a mask -
>
> I figured that while we are doing the checks above, we can check for the
> same price the mask ALL_FSNOTIFY_PERM_EVENTS
> then we get several answers for the same price:
> 1. Is the specific file watched by HSM?
> 2. Is the specific file watched by open permission events?
> 3. Is the specific file watched by post-open FAN_ACCESS_PERM?
>
> If the answers are No, No, No, we get some extra optimization
> in the (uncommon) use case that there are permission event watchers
> on some random inodes in the filesystem.
>
> If the answers are Yes, Yes, No, or No, Yes, No we can return a special
> value from file_set_fsnotify_mode() to indicate that permission events
> are needed ONLY for fsnotify_open_perm() hook, but not thereafter.
>
> This would implement the semantic change of "respect FAN_ACCESS_PERM
> only if it existed at open time" that can save a lot of unneeded cycles in
> the very hot read/write path, for example, when watcher only cares about
> FAN_OPEN_EXEC_PERM.
>
> I wasn't sure that any of this was worth the effort at this time, but
> just in case this gives you ideas of other useful optimizations we can do
> with the object combined marks_mask if we get it for free.
OK, I'm not opposed to returning the combined mask in principle. Just I'd
pick somewhat different function name and it didn't quite make sense to me
in the context of this series. If we decide to implement the optimizations
you describe above, then I have no problem with tweaking the helpers.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-20 16:12 ` Amir Goldstein
@ 2024-11-21 9:39 ` Jan Kara
2024-11-21 10:09 ` Christian Brauner
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-21 9:39 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed 20-11-24 17:12:21, Amir Goldstein wrote:
> On Wed, Nov 20, 2024 at 4:53 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 15-11-24 10:30:15, Josef Bacik wrote:
> > > From: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Legacy inotify/fanotify listeners can add watches for events on inode,
> > > parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> > > were already open at the time of setting up the watches.
> > >
> > > fanotify permission events are typically used by Anti-malware software,
> > > that is watching the entire mount and it is not common to have more than
> > > one Anti-malware engine installed on a system.
> > >
> > > To reduce the overhead of the fsnotify_file_perm() hooks on every file
> > > access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> > > events only if there were *any* permission event listeners on the
> > > filesystem at the time that the file was opened.
> > >
> > > The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> > > two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> > > events types to report.
> > >
> > > This is going to apply to the new fanotify pre-content events in order
> > > to reduce the cost of the new pre-content event vfs hooks.
> > >
> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > FWIW I've ended up somewhat massaging this patch (see below).
> >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 23bd058576b1..8e5c783013d2 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > >
> > > #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
> > >
> > > -/* FMODE_* bit 24 */
> > > -
> > > /* File is embedded in backing_file object */
> > > -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> > > +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
> > >
> > > -/* File was opened by fanotify and shouldn't generate fanotify events */
> > > -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> > > +/* File shouldn't generate fanotify pre-content events */
> > > +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
> > > +
> > > +/* File shouldn't generate fanotify permission events */
> > > +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
> >
> > Firstly, I've kept FMODE_NONOTIFY to stay a single bit instead of two bit
> > constant. I've seen too many bugs caused by people expecting the constant
> > has a single bit set when it actually had more in my life. So I've ended up
> > with:
> >
> > +/*
> > + * Together with FMODE_NONOTIFY_PERM defines which fsnotify events shouldn't be
> > + * generated (see below)
> > + */
> > +#define FMODE_NONOTIFY ((__force fmode_t)(1 << 25))
> > +
> > +/*
> > + * Together with FMODE_NONOTIFY defines which fsnotify events shouldn't be
> > + * generated (see below)
> > + */
> > +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
> >
> > and
> >
> > +/*
> > + * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
> > + * for a file. These are the possible values of (f->f_mode &
> > + * FMODE_FSNOTIFY_MASK) and their meaning:
> > + *
> > + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> > + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> > + * FMODE_NONOTIFY | FMODE_NONOTIFY_PERM - suppress only pre-content events.
> > + */
> > +#define FMODE_FSNOTIFY_MASK \
> > + (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)
> > +
> > +#define FMODE_FSNOTIFY_NONE(mode) \
> > + ((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
> > +#define FMODE_FSNOTIFY_PERM(mode) \
> > + (!(mode & FMODE_NONOTIFY_PERM))
>
> That looks incorrect -
> It gives the wrong value for FMODE_NONOTIFY | FMODE_NONOTIFY_PERM
>
> should be:
> != FMODE_NONOTIFY_PERM &&
> != FMODE_NONOTIFY
>
> The simplicity of the single bit test for permission events
> is why I chose my model, but I understand your reasoning.
Ah, thanks for catching this! I've fixed this to:
+#define FMODE_FSNOTIFY_PERM(mode) \
+ ((mode & FMODE_FSNOTIFY_MASK) == 0 || \
+ (mode & FMODE_FSNOTIFY_MASK) == (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM))
It is not a single bit test so it ends up being:
0x0000000060180345 <+101>: mov 0x20(%r12),%edx
0x000000006018034a <+106>: and $0x6000000,%edx
0x0000000060180350 <+112>: je 0x6018035a <rw_verify_area+122>
0x0000000060180352 <+114>: cmp $0x6000000,%edx
0x0000000060180358 <+120>: jne 0x6018032e <rw_verify_area+78>
But I guess that's not terrible either.
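For readers who want to check the truth table themselves, the following is a small user-space sketch in C. The fmode_t typedef is a stand-in for the kernel type, FMODE_FSNOTIFY_PERM_V1 is a name made up here for the earlier single-bit version that Amir flagged, and perm_events_enabled() is a hypothetical wrapper; the remaining macros follow the definitions quoted in this mail.

```c
#include <stdbool.h>

typedef unsigned int fmode_t;	/* stand-in for the kernel's fmode_t */

#define FMODE_NONOTIFY		((fmode_t)(1u << 25))
#define FMODE_NONOTIFY_PERM	((fmode_t)(1u << 26))
#define FMODE_FSNOTIFY_MASK	(FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)

/* Only FMODE_NONOTIFY set: suppress all events. */
#define FMODE_FSNOTIFY_NONE(mode) \
	(((mode) & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)

/* Earlier single-bit test (hypothetical name), wrong for two of four states. */
#define FMODE_FSNOTIFY_PERM_V1(mode) \
	(!((mode) & FMODE_NONOTIFY_PERM))

/* Fixed test: permission events are on when no bit or both bits are set. */
#define FMODE_FSNOTIFY_PERM(mode) \
	((((mode) & FMODE_FSNOTIFY_MASK) == 0) || \
	 (((mode) & FMODE_FSNOTIFY_MASK) == \
	  (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)))

/* Pre-content (HSM) events are on only when neither bit is set. */
#define FMODE_FSNOTIFY_HSM(mode) \
	(((mode) & FMODE_FSNOTIFY_MASK) == 0)

/* Hypothetical wrapper, as a hook on the read/write path might use it. */
static inline bool perm_events_enabled(fmode_t f_mode)
{
	return FMODE_FSNOTIFY_PERM(f_mode);
}
```

The two masked comparisons in FMODE_FSNOTIFY_PERM match the and/je/cmp/jne sequence in the disassembly above, with 0x6000000 being bits 25 and 26 together.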
> > +#define FMODE_FSNOTIFY_HSM(mode) \
> > + ((mode & FMODE_FSNOTIFY_MASK) == 0)
> >
> > Also I've moved file_set_fsnotify_mode() out of line into fsnotify.c. The
> > function gets quite big and the call is not IMO so expensive to warrant
> > inlining. Furthermore it saves exporting some fsnotify internals to modules
> > (in later patches).
>
> Sounds good.
> Since you wanted to refrain from defining a two bit constant,
> I wonder how you annotated for NONOTIFY_HSM case
>
> return FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
I'm not sure I understand. What do you mean by "annotated"?
It is not that I object to "two bit constants". FMODE_FSNOTIFY_MASK is a
two-bit constant and a good one. But the name clearly suggests it is not a
single bit constant. When you have all FMODE_FOO and FMODE_BAR things
single bit except for FMODE_BAZ which is multi-bit, then this is IMHO a
recipe for problems and I rather prefer explicitly spelling the
combination out as FMODE_NONOTIFY | FMODE_NONOTIFY_PERM in the few places
that need this instead of hiding it behind some other name.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-20 15:53 ` Jan Kara
2024-11-20 16:12 ` Amir Goldstein
@ 2024-11-21 9:45 ` Christian Brauner
2024-11-21 11:39 ` Amir Goldstein
1 sibling, 1 reply; 69+ messages in thread
From: Christian Brauner @ 2024-11-21 9:45 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, amir73il, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 04:53:09PM +0100, Jan Kara wrote:
> On Fri 15-11-24 10:30:15, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > Legacy inotify/fanotify listeners can add watches for events on inode,
> > parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> > were already open at the time of setting up the watches.
> >
> > fanotify permission events are typically used by Anti-malware software,
> > that is watching the entire mount and it is not common to have more than
> > one Anti-malware engine installed on a system.
> >
> > To reduce the overhead of the fsnotify_file_perm() hooks on every file
> > access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> > events only if there were *any* permission event listeners on the
> > filesystem at the time that the file was opened.
> >
> > The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> > two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> > events types to report.
> >
> > This is going to apply to the new fanotify pre-content events in order
> > to reduce the cost of the new pre-content event vfs hooks.
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> FWIW I've ended up somewhat massaging this patch (see below).
>
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23bd058576b1..8e5c783013d2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >
> > #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
> >
> > -/* FMODE_* bit 24 */
> > -
> > /* File is embedded in backing_file object */
> > -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> > +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
> >
> > -/* File was opened by fanotify and shouldn't generate fanotify events */
> > -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> > +/* File shouldn't generate fanotify pre-content events */
> > +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
> > +
> > +/* File shouldn't generate fanotify permission events */
> > +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
>
> Firstly, I've kept FMODE_NONOTIFY to stay a single bit instead of two bit
> constant. I've seen too many bugs caused by people expecting the constant
> has a single bit set when it actually had more in my life. So I've ended up
> with:
>
> +/*
> + * Together with FMODE_NONOTIFY_PERM defines which fsnotify events shouldn't be
> + * generated (see below)
> + */
> +#define FMODE_NONOTIFY ((__force fmode_t)(1 << 25))
> +
> +/*
> + * Together with FMODE_NONOTIFY defines which fsnotify events shouldn't be
> + * generated (see below)
> + */
> +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
>
> and
>
> +/*
> + * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
> + * for a file. These are the possible values of (f->f_mode &
> + * FMODE_FSNOTIFY_MASK) and their meaning:
> + *
> + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> + * FMODE_NONOTIFY | FMODE_NONOTIFY_PERM - suppress only pre-content events.
> + */
> +#define FMODE_FSNOTIFY_MASK \
> + (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)
This is fine by me. But I want to preemptively caution to please not
spread the disease of further defines based on such multi-bit defines
like fanotify does. I'm specifically worried about stuff like:
#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
FS_OPEN_EXEC_PERM)
#define FS_EVENTS_POSS_ON_CHILD (ALL_FSNOTIFY_PERM_EVENTS | \
FS_ACCESS | FS_MODIFY | FS_ATTRIB | \
FS_CLOSE_WRITE | FS_CLOSE_NOWRITE | \
FS_OPEN | FS_OPEN_EXEC)
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-21 9:39 ` Jan Kara
@ 2024-11-21 10:09 ` Christian Brauner
2024-11-21 11:04 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Christian Brauner @ 2024-11-21 10:09 UTC (permalink / raw)
To: Jan Kara
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
> It is not that I object to "two bit constants". FMODE_FSNOTIFY_MASK is a
> two-bit constant and a good one. But the name clearly suggests it is not a
> single bit constant. When you have all FMODE_FOO and FMODE_BAR things
> single bit except for FMODE_BAZ which is multi-bit, then this is IMHO a
> recipe for problems and I rather prefer explicitly spelling the
> combination out as FMODE_NONOTIFY | FMODE_NONOTIFY_PERM in the few places
> that need this instead of hiding it behind some other name.
Very much agreed!
* Re: [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults
2024-11-15 15:30 ` [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-11-21 10:22 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-21 10:22 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:30, Josef Bacik wrote:
> xfs has its own handling for write faults, so we need to add the
> pre-content fsnotify hook for this case. Reads go through filemap_fault
> so they're handled properly there.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
This was missing proper handling for DAX read faults. What I've ended up
with is:
struct xfs_inode *ip = XFS_I(file_inode(vmf->vma->vm_file));
vm_fault_t ret;
+ ret = filemap_fsnotify_fault(vmf);
+ if (unlikely(ret))
+ return ret;
xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
ret = xfs_dax_fault_locked(vmf, order, false);
xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
@@ -1412,6 +1415,17 @@ xfs_write_fault(
unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
vm_fault_t ret;
+ /*
+ * Usually we get here from ->page_mkwrite callback but in case of DAX
+ * we will get here also for ordinary write fault. Handle HSM
+ * notifications for that case.
+ */
+ if (IS_DAX(inode)) {
+ ret = filemap_fsnotify_fault(vmf);
+ if (unlikely(ret))
+ return ret;
+ }
+
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-15 15:30 ` [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
2024-11-15 15:59 ` Amir Goldstein
@ 2024-11-21 10:44 ` Jan Kara
2024-11-21 14:18 ` Amir Goldstein
1 sibling, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-21 10:44 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:23, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
>
> Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> in the context of the event handler.
>
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill the content of files on first read access.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Here I was wondering about one thing:
> + /*
> + * Filesystems need to opt-into pre-content events (a.k.a HSM)
> + * and they are only supported on regular files and directories.
> + */
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> + return -EINVAL;
> + if (!is_dir && !d_is_reg(path->dentry))
> + return -EINVAL;
> + }
AFAICS, currently no pre-content events are generated for directories. So
perhaps we should refuse directories here as well for now? I'd like to
avoid the mistake of original fanotify which had some events available on
directories but they did nothing and then you have to ponder hard whether
you're going to break userspace if you actually start emitting them...
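As a rough illustration of the stricter check being proposed, here is a user-space sketch in C. It is not the code from the series: check_pre_content_mark(), PRE_CONTENT_EVENTS_SKETCH and the boolean parameters are hypothetical stand-ins for the mask test, the SB_I_ALLOW_HSM test and the d_is_reg() test in the quoted hunk, and the sketch assumes directories would simply be rejected together with other non-regular files.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit standing in for the pre-content event mask. */
#define PRE_CONTENT_EVENTS_SKETCH	0x1u

/*
 * Hypothetical validation of a mark request: pre-content events require
 * a filesystem that opted in, and (in this stricter variant) a regular
 * file as the target. Other masks are not restricted here.
 */
static inline int check_pre_content_mark(uint32_t mask, bool sb_allows_hsm,
					 bool is_regular_file)
{
	if (!(mask & PRE_CONTENT_EVENTS_SKETCH))
		return 0;
	if (!sb_allows_hsm)
		return -EINVAL;
	if (!is_regular_file)
		return -EINVAL;	/* directories refused for now */
	return 0;
}
```

Refusing directories up front keeps the option open to start emitting directory pre-content events later without changing the behavior of marks that userspace was already allowed to set.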
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-21 10:09 ` Christian Brauner
@ 2024-11-21 11:04 ` Amir Goldstein
2024-11-21 11:16 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 11:04 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 21, 2024 at 11:09 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > It is not that I object to "two bit constants". FMODE_FSNOTIFY_MASK is a
> > two-bit constant and a good one. But the name clearly suggests it is not a
> > single bit constant. When you have all FMODE_FOO and FMODE_BAR things
> > single bit except for FMODE_BAZ which is multi-bit, then this is IMHO a
> > recipe for problems and I rather prefer explicitly spelling the
> > combination out as FMODE_NONOTIFY | FMODE_NONOTIFY_PERM in the few places
> > that need this instead of hiding it behind some other name.
>
> Very much agreed!
Yes, I agree as well.
What I meant is that the code that does
return FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
is going to be unclear to the future code reviewer unless there is
a comment above explaining that this is a special flag combination
to specify "suppress only pre-content events".
Thanks,
Amir.
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-21 11:04 ` Amir Goldstein
@ 2024-11-21 11:16 ` Jan Kara
2024-11-21 11:32 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-21 11:16 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Jan Kara, Josef Bacik, kernel-team,
linux-fsdevel, torvalds, viro, linux-xfs, linux-btrfs, linux-mm,
linux-ext4
On Thu 21-11-24 12:04:23, Amir Goldstein wrote:
> On Thu, Nov 21, 2024 at 11:09 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > > It is not that I object to "two bit constants". FMODE_FSNOTIFY_MASK is a
> > > two-bit constant and a good one. But the name clearly suggests it is not a
> > > single bit constant. When you have all FMODE_FOO and FMODE_BAR things
> > > single bit except for FMODE_BAZ which is multi-bit, then this is IMHO a
> > > recipe for problems and I rather prefer explicitly spelling the
> > > combination out as FMODE_NONOTIFY | FMODE_NONOTIFY_PERM in the few places
> > > that need this instead of hiding it behind some other name.
> >
> > Very much agreed!
>
> Yes, I agree as well.
> What I meant is that the code that does
> return FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
>
> is going to be unclear to the future code reviewer unless there is
> a comment above explaining that this is a special flag combination
> to specify "suppress only pre-content events".
So this combination is used in file_set_fsnotify_mode() only (three
occurrences) and there I have:
/*
* If there are permission event watchers but no pre-content event
* watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
*/
at the first occurrence. So hopefully that's enough of an explanation.
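The open-time decision discussed in this subthread can be modelled roughly as follows. This is a simplified user-space sketch in C, not the kernel function: struct open_ctx and fsnotify_mode_for_open() are hypothetical, the fmode_t typedef is a stand-in, and the first two checks (file opened by fanotify itself, no permission watchers on the sb) are my reading of the patch 02 description rather than code quoted in this thread.

```c
#include <stdbool.h>

typedef unsigned int fmode_t;	/* stand-in for the kernel's fmode_t */

#define FMODE_NONOTIFY		((fmode_t)(1u << 25))
#define FMODE_NONOTIFY_PERM	((fmode_t)(1u << 26))
#define FMODE_FSNOTIFY_MASK	(FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)

/* Hypothetical summary of the watcher state seen at open time. */
struct open_ctx {
	bool sb_has_perm_watchers;
	bool sb_has_pre_content_watchers;
	bool file_watched_for_pre_content; /* inode, mount, sb or parent */
};

static inline fmode_t fsnotify_mode_for_open(fmode_t mode,
					     const struct open_ctx *c)
{
	/* Assumed: a file opened by fanotify itself stays untouched. */
	if ((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
		return mode;

	/* Assumed: no permission watchers, suppress permission events. */
	if (!c->sb_has_perm_watchers)
		return mode | FMODE_NONOTIFY_PERM;

	/*
	 * Permission watchers exist, but nobody can be watching for
	 * pre-content events on this file: the special two-bit
	 * combination means "suppress only pre-content events".
	 */
	if (!c->sb_has_pre_content_watchers ||
	    !c->file_watched_for_pre_content)
		return mode | FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;

	/* Neither bit set: all events, including pre-content. */
	return mode;
}
```

The model shows why the combination appears on more than one return path: both "no pre-content watchers on the sb" and "watchers exist but not on this file" end in the same state.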
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 00/19] fanotify: add pre-content hooks
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
` (18 preceding siblings ...)
2024-11-15 15:30 ` [PATCH v8 19/19] fs: enable pre-content events on supported file systems Josef Bacik
@ 2024-11-21 11:29 ` Jan Kara
19 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-11-21 11:29 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 10:30:13, Josef Bacik wrote:
> v7: https://lore.kernel.org/linux-fsdevel/cover.1731433903.git.josef@toxicpanda.com/
> v6: https://lore.kernel.org/linux-fsdevel/cover.1731355931.git.josef@toxicpanda.com/
> v5: https://lore.kernel.org/linux-fsdevel/cover.1725481503.git.josef@toxicpanda.com/
> v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
> v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
OK, I have merged the series with all the changes I've suggested into a topic
branch in my tree:
https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git/log/?h=fsnotify_hsm
I've also added there a patch making sure HSM events are properly generated
on ext4 with DAX. One open question I still have is whether we should
completely refuse pre-content events on directories, but besides that I'm
now happy with the series.
The branch passes LTP tests so we hopefully didn't break some existing
functionality but it would be great to run it through the tests Josef has.
Josef, can you please do that?
After the merge window closes and if the tests pass, I plan to merge the
topic branch into fsnotify branch so that we get some exposure in
linux-next.
Honza
> v7->v8:
> - A bunch of work from Amir to cleanup the fast path for the common case of no
> watches, which cascades through the rest of the series to update the helpers
> and the hooks to use the new helpers.
> - A patch from Al to get rid of the __FMODE_NONOTIFY flag and cleanup the usage
> there, thanks Al!
>
> v6->v7:
> - As per Linus's suggestion, Amir added the file flag FMODE_NOTIFY_PERM that
> will be set at open time if the file has permission related watches (this is
> the original malware style permission watches and the new precontent watches).
> All of the VFS hooks and the page fault hooks use this flag to determine if
> they should generate a notification to allow for a much cheaper check in the
> common case.
>
> v5->v6:
> - Linus had problems with this and rejected Jan's PR
> (https://lore.kernel.org/linux-fsdevel/20240923110348.tbwihs42dxxltabc@quack3/),
> so I'm respinning this series to address his concerns. Hopefully this is more
> acceptable.
> - Change the page fault hooks to happen only in the case where we have to add a
> page, not where there exists pages already.
> - Amir added a hook to truncate.
> - We made the flag per SB instead of per fstype, Amir wanted this because of
> some potential issues with other file system specific work he's doing.
> - Dropped the bcachefs patch, there were some concerns that we were doing
> something wrong, and it's not a huge deal to not have this feature for now.
> - Unfortunately the xfs write fault path still has to do the page fault hook
> before we know if we have a page or not, this is because of the locking that's
> done before we get to the part where we know if we have a page already or not,
> so that's the path that is still the same from last iteration.
> - I've re-validated this series with btrfs, xfs, and ext4 to make sure I didn't
> break anything.
>
> v4->v5:
> - Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
> to respin the series anyway.
> - Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
> - Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
> - Added a patch to disable btrfs defrag on pre-content watched files.
> - Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
> - Added two fstests (which will be posted separately) to validate everything,
> re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
> didn't break anything.
>
> v3->v4:
> - Trying to send a final version Friday at 5pm before you go on vacation is a
> recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
> review.
> - Reworked the file system helper so its handling of fpin was a little less
> silly, per Chinner's suggestion.
> - Updated the return values to not OR in VM_FAULT_RETRY, as we have a comment
> in filemap_fault that says if VM_FAULT_ERROR is set we won't have
> VM_FAULT_RETRY set.
>
> v2->v3:
> - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
> emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
> Amir's suggestion.
> - Reworked the exported helper so the per-filesystem changes are much smaller,
> per Amir's suggestion.
> - Fixed the screwup for DAX writes per Chinner's suggestion.
> - Added Christian's reviewed-by's where appropriate.
>
> v1->v2:
> - reworked the page fault logic based on Jan's suggestion and turned it into a
> helper.
> - Added 3 patches per-fs where we need to call the fsnotify helper from their
> ->fault handlers.
> - Disabled readahead in the case that there's a pre-content watch in place.
> - Disabled huge faults when there's a pre-content watch in place (entirely
> because it's untested, theoretically it should be straightforward to do).
> - Updated the command numbers.
> - Addressed the random spelling/grammar mistakes that Jan pointed out.
> - Addressed the other random nits from Jan.
>
> --- Original email ---
>
> Hello,
>
> These are the patches for the bare bones pre-content fanotify support. The
> majority of this work is Amir's, my contribution to this has solely been around
> adding the page fault hooks, testing and validating everything. I'm sending it
> because Amir is traveling a bunch, and I touched it last so I'm going to take
> all the hate and he can take all the credit.
>
> There is a PoC that I've been using to validate this work, you can find the git
> repo here
>
> https://github.com/josefbacik/remote-fetch
>
> This consists of 3 different tools.
>
> 1. populate. This just creates all the stub files in the directory from the
> source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
> recursively create all of the stub files and directories.
> 2. remote-fetch. This is the actual PoC, you just point it at the source and
> destination directory and then you can do whatever. ./remote-fetch ~/linux
> ~/hsm-linux.
> 3. mmap-validate. This was to validate the pagefault thing, this is likely what
> will be turned into the selftest with remote-fetch. It creates a file and
> then you can validate the file matches the right pattern with both normal
> reads and mmap. Normally I do something like
>
> ./mmap-validate create ~/src/foo
> ./populate ~/src ~/dst
> ./remote-fetch ~/src ~/dst
> ./mmap-validate validate ~/dst/foo
>
> I did a bunch of testing, I also got some performance numbers. I copied a
> kernel tree, and then did remote-fetch, and then make -j4
>
> Normal
> real 9m49.709s
> user 28m11.372s
> sys 4m57.304s
>
> HSM
> real 10m6.454s
> user 29m10.517s
> sys 5m2.617s
>
> So ~17 seconds more to build with HSM. I then did a make mrproper on both trees
> to see the size
>
> [root@fedora ~]# du -hs /src/linux
> 1.6G /src/linux
> [root@fedora ~]# du -hs dst
> 125M dst
>
> This mirrors the sort of savings we've seen in production.
>
> Meta has had these patches (minus the page fault patch) deployed in production
> for almost a year with our own utility for doing on-demand package fetching.
> The savings from this has been pretty significant.
>
> The page-fault hooks are necessary for the last thing we need, which is
> on-demand range fetching of executables. Some of our binaries are several gigs
> large, having the ability to remote fetch them on demand is a huge win for us
> not only with space savings, but with startup time of containers.
>
> There will be tests for this going into LTP once we're satisfied with the
> patches and they're on their way upstream. Thanks,
>
> Josef
>
> Al Viro (1):
> fs: get rid of __FMODE_NONOTIFY kludge
>
> Amir Goldstein (12):
> fsnotify: opt-in for permission events at file open time
> fsnotify: add helper to check if file is actually being watched
> fanotify: don't skip extra event info if no info_mode is set
> fanotify: rename a misnamed constant
> fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
> fsnotify: introduce pre-content permission events
> fsnotify: pass optional file access range in pre-content event
> fsnotify: generate pre-content permission event on truncate
> fanotify: introduce FAN_PRE_ACCESS permission event
> fanotify: report file range info with pre-content events
> fanotify: allow to set errno in FAN_DENY permission response
> fanotify: add a helper to check for pre content events
>
> Josef Bacik (6):
> fanotify: disable readahead if we have pre-content watches
> mm: don't allow huge faults for files with pre content watches
> fsnotify: generate pre-content permission event on page fault
> xfs: add pre-content fsnotify hook for write faults
> btrfs: disable defrag on pre-content watched files
> fs: enable pre-content events on supported file systems
>
> fs/btrfs/ioctl.c | 9 ++
> fs/btrfs/super.c | 2 +-
> fs/ext4/super.c | 3 +
> fs/fcntl.c | 4 +-
> fs/notify/fanotify/fanotify.c | 33 +++++--
> fs/notify/fanotify/fanotify.h | 15 +++
> fs/notify/fanotify/fanotify_user.c | 145 +++++++++++++++++++++++------
> fs/notify/fsnotify.c | 56 +++++++++--
> fs/open.c | 62 +++++++++---
> fs/xfs/xfs_file.c | 4 +
> fs/xfs/xfs_super.c | 2 +-
> include/linux/fanotify.h | 19 +++-
> include/linux/fs.h | 42 +++++++--
> include/linux/fsnotify.h | 135 +++++++++++++++++++++++----
> include/linux/fsnotify_backend.h | 60 +++++++++++-
> include/linux/mm.h | 1 +
> include/uapi/asm-generic/fcntl.h | 1 -
> include/uapi/linux/fanotify.h | 18 ++++
> mm/filemap.c | 90 ++++++++++++++++++
> mm/memory.c | 22 +++++
> mm/readahead.c | 13 +++
> security/selinux/hooks.c | 3 +-
> 22 files changed, 639 insertions(+), 100 deletions(-)
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-21 11:16 ` Jan Kara
@ 2024-11-21 11:32 ` Amir Goldstein
0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 11:32 UTC (permalink / raw)
To: Jan Kara
Cc: Christian Brauner, Josef Bacik, kernel-team, linux-fsdevel,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 21, 2024 at 12:16 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 21-11-24 12:04:23, Amir Goldstein wrote:
> > On Thu, Nov 21, 2024 at 11:09 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > > It is not that I object to "two bit constants". FMODE_FSNOTIFY_MASK is a
> > > > two-bit constant and a good one. But the name clearly suggests it is not a
> > > > single bit constant. When you have all FMODE_FOO and FMODE_BAR things
> > > > single bit except for FMODE_BAZ which is multi-bit, then this is IMHO a
> > > > recipe for problems and I rather prefer explicitly spelling the
> > > > combination out as FMODE_NONOTIFY | FMODE_NONOTIFY_PERM in the few places
> > > > that need this instead of hiding it behind some other name.
> > >
> > > Very much agreed!
> >
> > Yes, I agree as well.
> > What I meant is that the code that does
> > return FMODE_NONOTIFY | FMODE_NONOTIFY_PERM;
> >
> > is going to be unclear to the future code reviewer unless there is
> > a comment above explaining that this is a special flag combination
> > to specify "suppress only pre-content events".
>
> So this combination is used in file_set_fsnotify_mode() only (three
> occurrences) and there I have:
>
> /*
> * If there are permission event watchers but no pre-content event
> * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
> */
>
> at the first occurrence. So hopefully that's enough of an explanation.
>
Yes, that's the comment that I did not see, but assumed it was there ;)
which I wrongly expressed as "I wonder how you annotated".
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
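To illustrate the two-bit encoding discussed in the message above, here is a minimal userspace sketch. The SK_* names and bit values are made up for the example and are not the kernel's FMODE_* definitions. The meaning of the combined value follows the comment quoted in the thread ("permission event watchers but no pre-content event watchers"); the meaning given to each single bit on its own is an assumption.

```c
#include <stdbool.h>

/* Illustrative values only; the real FMODE_* bits live in include/linux/fs.h. */
#define SK_NONOTIFY      0x1u
#define SK_NONOTIFY_PERM 0x2u
#define SK_FSNOTIFY_MASK (SK_NONOTIFY | SK_NONOTIFY_PERM)

/*
 * Assumed meaning of the four states of the two-bit field:
 *   0                             all events may be generated
 *   SK_NONOTIFY                   no events at all
 *   SK_NONOTIFY_PERM              no permission (incl. pre-content) events
 *   SK_NONOTIFY|SK_NONOTIFY_PERM  permission events allowed, pre-content
 *                                 events suppressed
 */
static inline bool sk_perm_events_allowed(unsigned int mode)
{
	unsigned int state = mode & SK_FSNOTIFY_MASK;

	/* The field is compared as a whole; the bits are not tested one by one. */
	return state == 0 || state == SK_FSNOTIFY_MASK;
}

static inline bool sk_pre_content_allowed(unsigned int mode)
{
	return (mode & SK_FSNOTIFY_MASK) == 0;
}
```

The point of the sketch is that the combined value is a third state of its own rather than "both suppressions at once", which is why a comment at the place where it is set helps a later reviewer.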
* Re: [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time
2024-11-21 9:45 ` Christian Brauner
@ 2024-11-21 11:39 ` Amir Goldstein
0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 11:39 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
> This is fine by me. But I want to preemptively caution to please not
> spread the disease of further defines based on such multi-bit defines
> like fanotify does. I'm specifically worried about stuff like:
>
> #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
> FS_OPEN_EXEC_PERM)
>
> #define FS_EVENTS_POSS_ON_CHILD (ALL_FSNOTIFY_PERM_EVENTS | \
> FS_ACCESS | FS_MODIFY | FS_ATTRIB | \
> FS_CLOSE_WRITE | FS_CLOSE_NOWRITE | \
> FS_OPEN | FS_OPEN_EXEC)
What do you mean?
Those are masks for event groups, which we test in many cases.
What is wrong with those defines?
For FMODE_, we do not plan to add any more defines (famous last words).
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-21 10:44 ` Jan Kara
@ 2024-11-21 14:18 ` Amir Goldstein
2024-11-21 16:36 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 14:18 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-11-24 10:30:23, Josef Bacik wrote:
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> > class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
> >
> > Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> > in the context of the event handler.
> >
> > This pre-content event is meant to be used by hierarchical storage
> > managers that want to fill the content of files on first read access.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Here I was wondering about one thing:
>
> > + /*
> > + * Filesystems need to opt-into pre-content events (a.k.a HSM)
> > + * and they are only supported on regular files and directories.
> > + */
> > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> > + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> > + return -EINVAL;
> > + if (!is_dir && !d_is_reg(path->dentry))
> > + return -EINVAL;
> > + }
>
> AFAICS, currently no pre-content events are generated for directories. So
> perhaps we should refuse directories here as well for now? I'd like to
readdir() does emit PRE_ACCESS (without a range) and also always
emitted ACCESS_PERM. my POC is using that PRE_ACCESS to populate
directories on-demand, although the functionality is incomplete without the
"populate on lookup" event.
> avoid the mistake of original fanotify which had some events available on
> directories but they did nothing and then you have to ponder hard whether
> you're going to break userspace if you actually start emitting them...
But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-21 14:18 ` Amir Goldstein
@ 2024-11-21 16:36 ` Jan Kara
2024-11-21 18:31 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-21 16:36 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 15-11-24 10:30:23, Josef Bacik wrote:
> > > From: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> > > class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
> > >
> > > Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> > > in the context of the event handler.
> > >
> > > This pre-content event is meant to be used by hierarchical storage
> > > managers that want to fill the content of files on first read access.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Here I was wondering about one thing:
> >
> > > + /*
> > > + * Filesystems need to opt-into pre-content events (a.k.a HSM)
> > > + * and they are only supported on regular files and directories.
> > > + */
> > > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> > > + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> > > + return -EINVAL;
> > > + if (!is_dir && !d_is_reg(path->dentry))
> > > + return -EINVAL;
> > > + }
> >
> > AFAICS, currently no pre-content events are generated for directories. So
> > perhaps we should refuse directories here as well for now? I'd like to
>
> readdir() does emit PRE_ACCESS (without a range)
Ah, right.
> and also always emitted ACCESS_PERM.
I know that and it's one of those mostly useless events AFAICT.
> my POC is using that PRE_ACCESS to populate
> directories on-demand, although the functionality is incomplete without the
> "populate on lookup" event.
Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
a demo but not really usable in practice because you can get spurious
ENOENT from a lookup.
> > avoid the mistake of original fanotify which had some events available on
> > directories but they did nothing and then you have to ponder hard whether
> > you're going to break userspace if you actually start emitting them...
>
> But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
Well, I'm not so concerned about filtering out uninteresting events. I'm
more concerned about emitting the event now and figuring out later that we
need to emit it in different places or with some other info when actual
production users appear.
But I've realized we must allow pre-content marks to be placed on dirs so
that such marks can be placed on parents watching children. What we'd need
to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-21 16:36 ` Jan Kara
@ 2024-11-21 18:31 ` Amir Goldstein
2024-11-21 18:37 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 18:31 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 15-11-24 10:30:23, Josef Bacik wrote:
> > > > From: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> > > > class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
> > > >
> > > > Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> > > > in the context of the event handler.
> > > >
> > > > This pre-content event is meant to be used by hierarchical storage
> > > > managers that want to fill the content of files on first read access.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Here I was wondering about one thing:
> > >
> > > > + /*
> > > > + * Filesystems need to opt-into pre-content events (a.k.a HSM)
> > > > + * and they are only supported on regular files and directories.
> > > > + */
> > > > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> > > > + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> > > > + return -EINVAL;
> > > > + if (!is_dir && !d_is_reg(path->dentry))
> > > > + return -EINVAL;
> > > > + }
> > >
> > > AFAICS, currently no pre-content events are generated for directories. So
> > > perhaps we should refuse directories here as well for now? I'd like to
> >
> > readdir() does emit PRE_ACCESS (without a range)
>
> Ah, right.
>
> > and also always emitted ACCESS_PERM.
>
> I know that and it's one of those mostly useless events AFAICT.
>
> > my POC is using that PRE_ACCESS to populate
> > directories on-demand, although the functionality is incomplete without the
> > "populate on lookup" event.
>
> Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> a demo but not really usable in practice because you can get spurious
> ENOENT from a lookup.
>
> > > avoid the mistake of original fanotify which had some events available on
> > > directories but they did nothing and then you have to ponder hard whether
> > > you're going to break userspace if you actually start emitting them...
> >
> > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
>
> Well, I'm not so concerned about filtering out uninteresting events. I'm
> more concerned about emitting the event now and figuring out later that we
> need to emit it in different places or with some other info when actual
> production users appear.
>
> But I've realized we must allow pre-content marks to be placed on dirs so
> that such marks can be placed on parents watching children. What we'd need
> to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
Yes, I think that can work well for now.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-21 18:31 ` Amir Goldstein
@ 2024-11-21 18:37 ` Amir Goldstein
2024-11-22 12:42 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-21 18:37 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 21, 2024 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 15-11-24 10:30:23, Josef Bacik wrote:
> > > > > From: Amir Goldstein <amir73il@gmail.com>
> > > > >
> > > > > Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> > > > > class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
> > > > >
> > > > > Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> > > > > in the context of the event handler.
> > > > >
> > > > > This pre-content event is meant to be used by hierarchical storage
> > > > > managers that want to fill the content of files on first read access.
> > > > >
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Here I was wondering about one thing:
> > > >
> > > > > + /*
> > > > > + * Filesystems need to opt-into pre-content events (a.k.a HSM)
> > > > > + * and they are only supported on regular files and directories.
> > > > > + */
> > > > > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> > > > > + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> > > > > + return -EINVAL;
> > > > > + if (!is_dir && !d_is_reg(path->dentry))
> > > > > + return -EINVAL;
> > > > > + }
> > > >
> > > > AFAICS, currently no pre-content events are generated for directories. So
> > > > perhaps we should refuse directories here as well for now? I'd like to
> > >
> > > readdir() does emit PRE_ACCESS (without a range)
> >
> > Ah, right.
> >
> > > and also always emitted ACCESS_PERM.
> >
> > I know that and it's one of those mostly useless events AFAICT.
> >
> > > my POC is using that PRE_ACCESS to populate
> > > directories on-demand, although the functionality is incomplete without the
> > > "populate on lookup" event.
> >
> > Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> > a demo but not really usable in practice because you can get spurious
> > ENOENT from a lookup.
> >
> > > > avoid the mistake of original fanotify which had some events available on
> > > > directories but they did nothing and then you have to ponder hard whether
> > > > you're going to break userspace if you actually start emitting them...
> > >
> > > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
> >
> > Well, I'm not so concerned about filtering out uninteresting events. I'm
> > more concerned about emitting the event now and figuring out later that we
> > need to emit it in different places or with some other info when actual
> > production users appear.
> >
> > But I've realized we must allow pre-content marks to be placed on dirs so
> > that such marks can be placed on parents watching children. What we'd need
> > to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
>
> Yes, I think that can work well for now.
>
Only, a check at API time that both flags are not set is not enough,
because FAN_ONDIR can be set earlier and then FAN_PRE_ACCESS
can be added later and vice versa, so we need to do this in
fanotify_may_update_existing_mark() AFAICT.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-21 18:37 ` Amir Goldstein
@ 2024-11-22 12:42 ` Jan Kara
2024-11-22 13:51 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-22 12:42 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu 21-11-24 19:37:43, Amir Goldstein wrote:
> On Thu, Nov 21, 2024 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > > > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > > > and also always emitted ACCESS_PERM.
> > >
> > > I know that and it's one of those mostly useless events AFAICT.
> > >
> > > > my POC is using that PRE_ACCESS to populate
> > > > directories on-demand, although the functionality is incomplete without the
> > > > "populate on lookup" event.
> > >
> > > Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> > > a demo but not really usable in practice because you can get spurious
> > > ENOENT from a lookup.
> > >
> > > > > avoid the mistake of original fanotify which had some events available on
> > > > > directories but they did nothing and then you have to ponder hard whether
> > > > > you're going to break userspace if you actually start emitting them...
> > > >
> > > > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
> > >
> > > Well, I'm not so concerned about filtering out uninteresting events. I'm
> > > more concerned about emitting the event now and figuring out later that we
> > > need to emit it in different places or with some other info when actual
> > > production users appear.
> > >
> > > But I've realized we must allow pre-content marks to be placed on dirs so
> > > that such marks can be placed on parents watching children. What we'd need
> > > to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
> >
> > Yes, I think that can work well for now.
> >
>
> Only, a check at API time that both flags are not set is not enough,
> because FAN_ONDIR can be set earlier and then FAN_PRE_ACCESS
> can be added later and vice versa, so we need to do this in
> fanotify_may_update_existing_mark() AFAICT.
I have now something like:
@@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
}
static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
- unsigned int fan_flags)
+ __u32 mask, unsigned int fan_flags)
{
/*
* Non evictable mark cannot be downgraded to evictable mark.
@@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
return -EEXIST;
+ /* For now pre-content events are not generated for directories */
+ mask |= fsn_mark->mask;
+ if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
+ return -EEXIST;
+
return 0;
}
So far only compile tested...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-22 12:42 ` Jan Kara
@ 2024-11-22 13:51 ` Amir Goldstein
2024-11-27 12:18 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2024-11-22 13:51 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri, Nov 22, 2024 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 21-11-24 19:37:43, Amir Goldstein wrote:
> > On Thu, Nov 21, 2024 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > > > > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > > > > and also always emitted ACCESS_PERM.
> > > >
> > > > I know that and it's one of those mostly useless events AFAICT.
> > > >
> > > > > my POC is using that PRE_ACCESS to populate
> > > > > directories on-demand, although the functionality is incomplete without the
> > > > > "populate on lookup" event.
> > > >
> > > > Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> > > > a demo but not really usable in practice because you can get spurious
> > > > ENOENT from a lookup.
> > > >
> > > > > > avoid the mistake of original fanotify which had some events available on
> > > > > > directories but they did nothing and then you have to ponder hard whether
> > > > > > you're going to break userspace if you actually start emitting them...
> > > > >
> > > > > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
> > > >
> > > > Well, I'm not so concerned about filtering out uninteresting events. I'm
> > > > more concerned about emitting the event now and figuring out later that we
> > > > need to emit it in different places or with some other info when actual
> > > > production users appear.
> > > >
> > > > But I've realized we must allow pre-content marks to be placed on dirs so
> > > > that such marks can be placed on parents watching children. What we'd need
> > > > to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
> > >
> > > Yes, I think that can work well for now.
> > >
> >
> > Only, a check at API time that both flags are not set is not enough,
> > because FAN_ONDIR can be set earlier and then FAN_PRE_ACCESS
> > can be added later and vice versa, so we need to do this in
> > fanotify_may_update_existing_mark() AFAICT.
>
> I have now something like:
>
> @@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
> }
>
> static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> - unsigned int fan_flags)
> + __u32 mask, unsigned int fan_flags)
> {
> /*
> * Non evictable mark cannot be downgraded to evictable mark.
> @@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> return -EEXIST;
>
> + /* For now pre-content events are not generated for directories */
> + mask |= fsn_mark->mask;
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
> + return -EEXIST;
> +
EEXIST is going to be confusing if there was never any mark.
Either return -EINVAL here or also check this condition on the added mask
itself before calling fanotify_add_mark() and return -EINVAL there.
I prefer two distinct errors, but probably one is also good enough.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 69+ messages in thread
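The "two distinct errors" option from the message above can be sketched in plain userspace C as follows. This is only an illustration of the idea, not the code that was merged. The SK_* constants stand in for FAN_ONDIR and FANOTIFY_PRE_CONTENT_EVENTS with assumed bit values, and sk_may_add_mask() is a hypothetical helper that folds both checks into one place.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in bits for the sketch; the real values are in include/uapi/linux/fanotify.h. */
#define SK_FAN_ONDIR   0x40000000u
#define SK_PRE_CONTENT 0x00100000u /* stands in for FANOTIFY_PRE_CONTENT_EVENTS */

/* Nonzero when a mask asks for pre-content events on directories. */
static inline int sk_conflict(uint32_t mask)
{
	return (mask & SK_PRE_CONTENT) && (mask & SK_FAN_ONDIR);
}

/*
 * Hypothetical helper: 'existing' is the mask already set on the mark
 * (0 when no mark exists yet), 'added' is the mask of the current call.
 */
static inline int sk_may_add_mask(uint32_t existing, uint32_t added)
{
	/* The request is invalid on its own, with or without a prior mark. */
	if (sk_conflict(added))
		return -EINVAL;

	/* Each call was fine alone, but the merged mask is not. */
	if (sk_conflict(existing | added))
		return -EEXIST;

	return 0;
}
```

With this split, a caller that passes both flags in one call gets -EINVAL even when no mark exists, and -EEXIST is reserved for the case where an earlier call already set the other flag on the mark.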
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-22 13:51 ` Amir Goldstein
@ 2024-11-27 12:18 ` Jan Kara
2024-11-27 12:20 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-11-27 12:18 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 22-11-24 14:51:23, Amir Goldstein wrote:
> On Fri, Nov 22, 2024 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 21-11-24 19:37:43, Amir Goldstein wrote:
> > > On Thu, Nov 21, 2024 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > > > > > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > and also always emitted ACCESS_PERM.
> > > > >
> > > > > I know that and it's one of those mostly useless events AFAICT.
> > > > >
> > > > > > my POC is using that PRE_ACCESS to populate
> > > > > > directories on-demand, although the functionality is incomplete without the
> > > > > > "populate on lookup" event.
> > > > >
> > > > > Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> > > > > a demo but not really usable in practice because you can get spurious
> > > > > ENOENT from a lookup.
> > > > >
> > > > > > > avoid the mistake of original fanotify which had some events available on
> > > > > > > directories but they did nothing and then you have to ponder hard whether
> > > > > > > you're going to break userspace if you actually start emitting them...
> > > > > >
> > > > > > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
> > > > >
> > > > > Well, I'm not so concerned about filtering out uninteresting events. I'm
> > > > > more concerned about emitting the event now and figuring out later that we
> > > > > need to emit it in different places or with some other info when actual
> > > > > production users appear.
> > > > >
> > > > > But I've realized we must allow pre-content marks to be placed on dirs so
> > > > > that such marks can be placed on parents watching children. What we'd need
> > > > > to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
> > > >
> > > > Yes, I think that can work well for now.
> > > >
> > >
> > > Only it does not require only check at API time that both flags are not
> > > set, because FAN_ONDIR can be set earlier and then FAN_PRE_ACCESS
> > > can be added later and vice versa, so need to do this in
> > > fanotify_may_update_existing_mark() AFAICT.
> >
> > I have now something like:
> >
> > @@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
> > }
> >
> > static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> > - unsigned int fan_flags)
> > + __u32 mask, unsigned int fan_flags)
> > {
> > /*
> > * Non evictable mark cannot be downgraded to evictable mark.
> > @@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> > fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> > return -EEXIST;
> >
> > + /* For now pre-content events are not generated for directories */
> > + mask |= fsn_mark->mask;
> > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
> > + return -EEXIST;
> > +
>
> EEXIST is going to be confusing if there was never any mark.
> Either return -EINVAL here or also check this condition on the added mask
> itself before calling fanotify_add_mark() and return -EINVAL there.
>
> I prefer two distinct errors, but probably one is also good enough.
That's actually a good point. My previous change allowed setting
FAN_PRE_ACCESS | FAN_ONDIR on a new mark because that doesn't get to
fanotify_may_update_existing_mark(). So I now have:
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 0919ea735f4a..38a46865408e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
 }
 
 static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
-					      unsigned int fan_flags)
+					      __u32 mask, unsigned int fan_flags)
 {
 	/*
 	 * Non evictable mark cannot be downgraded to evictable mark.
@@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
 	    fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
 		return -EEXIST;
 
+	/* For now pre-content events are not generated for directories */
+	mask |= fsn_mark->mask;
+	if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
+		return -EEXIST;
+
 	return 0;
 }
 
@@ -1409,7 +1414,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 	/*
 	 * Check if requested mark flags conflict with an existing mark flags.
 	 */
-	ret = fanotify_may_update_existing_mark(fsn_mark, fan_flags);
+	ret = fanotify_may_update_existing_mark(fsn_mark, mask, fan_flags);
 	if (ret)
 		goto out;
 
@@ -1905,6 +1910,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 	if (mask & FAN_RENAME && !(fid_mode & FAN_REPORT_NAME))
 		goto fput_and_out;
 
+	/* Pre-content events are not currently generated for directories. */
+	if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
+		goto fput_and_out;
+
 	if (mark_cmd == FAN_MARK_FLUSH) {
 		ret = 0;
 		if (mark_type == FAN_MARK_MOUNT)
--
2.35.3
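
For readers following along, here is a minimal userspace sketch of the two
checks discussed in this thread. It is not the kernel code: the sk_* names
and the bit values are hypothetical and chosen only for illustration, and
the -EINVAL return for the new-mark path is an assumption based on Amir's
suggestion rather than something the diff itself shows. The update path
ORs the requested bits into the bits already on the mark, so the forbidden
combination is caught regardless of which flag was set first.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; the real constants are
 * defined in the kernel's fanotify headers and may differ. */
#define SK_FAN_ONDIR		0x40000000u
#define SK_PRE_CONTENT_EVENTS	0x00100000u

/* New-mark path: only the requested mask is known, so check it alone.
 * Returning -EINVAL here is an assumption following Amir's suggestion. */
static int sk_check_new_mask(uint32_t mask)
{
	if ((mask & SK_PRE_CONTENT_EVENTS) && (mask & SK_FAN_ONDIR))
		return -EINVAL;
	return 0;
}

/* Update path: combine the requested bits with the bits already set on
 * the mark, then reject the combination, as in the diff above. */
static int sk_check_update(uint32_t existing_mask, uint32_t mask)
{
	mask |= existing_mask;
	if ((mask & SK_PRE_CONTENT_EVENTS) && (mask & SK_FAN_ONDIR))
		return -EEXIST;
	return 0;
}
```

Note that `a & X && a & Y` in the diff parses as `(a & X) && (a & Y)`
because `&` binds tighter than `&&` in C; the sketch only adds the
parentheses for readability.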
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-27 12:18 ` Jan Kara
@ 2024-11-27 12:20 ` Amir Goldstein
0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2024-11-27 12:20 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, torvalds, viro,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 27, 2024 at 1:18 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 22-11-24 14:51:23, Amir Goldstein wrote:
> > On Fri, Nov 22, 2024 at 1:42 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 21-11-24 19:37:43, Amir Goldstein wrote:
> > > > On Thu, Nov 21, 2024 at 7:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Thu, Nov 21, 2024 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > On Thu 21-11-24 15:18:36, Amir Goldstein wrote:
> > > > > > > On Thu, Nov 21, 2024 at 11:44 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > > and also always emitted ACCESS_PERM.
> > > > > >
> > > > > > I know that and it's one of those mostly useless events AFAICT.
> > > > > >
> > > > > > > my POC is using that PRE_ACCESS to populate
> > > > > > > directories on-demand, although the functionality is incomplete without the
> > > > > > > "populate on lookup" event.
> > > > > >
> > > > > > Exactly. Without "populate on lookup" doing "populate on readdir" is ok for
> > > > > > a demo but not really usable in practice because you can get spurious
> > > > > > ENOENT from a lookup.
> > > > > >
> > > > > > > > avoid the mistake of original fanotify which had some events available on
> > > > > > > > directories but they did nothing and then you have to ponder hard whether
> > > > > > > > you're going to break userspace if you actually start emitting them...
> > > > > > >
> > > > > > > But in any case, the FAN_ONDIR built-in filter is applicable to PRE_ACCESS.
> > > > > >
> > > > > > Well, I'm not so concerned about filtering out uninteresting events. I'm
> > > > > > more concerned about emitting the event now and figuring out later that we
> > > > > > need to emit it in different places or with some other info when actual
> > > > > > production users appear.
> > > > > >
> > > > > > But I've realized we must allow pre-content marks to be placed on dirs so
> > > > > > that such marks can be placed on parents watching children. What we'd need
> > > > > > to forbid is a combination of FAN_ONDIR and FAN_PRE_ACCESS, wouldn't we?
> > > > >
> > > > > Yes, I think that can work well for now.
> > > > >
> > > >
> > > > Only it does not require only check at API time that both flags are not
> > > > set, because FAN_ONDIR can be set earlier and then FAN_PRE_ACCESS
> > > > can be added later and vice versa, so need to do this in
> > > > fanotify_may_update_existing_mark() AFAICT.
> > >
> > > I have now something like:
> > >
> > > @@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
> > > }
> > >
> > > static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> > > - unsigned int fan_flags)
> > > + __u32 mask, unsigned int fan_flags)
> > > {
> > > /*
> > > * Non evictable mark cannot be downgraded to evictable mark.
> > > @@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> > > fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> > > return -EEXIST;
> > >
> > > + /* For now pre-content events are not generated for directories */
> > > + mask |= fsn_mark->mask;
> > > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
> > > + return -EEXIST;
> > > +
> >
> > EEXIST is going to be confusing if there was never any mark.
> > Either return -EINVAL here or also check this condition on the added mask
> > itself before calling fanotify_add_mark() and return -EINVAL there.
> >
> > I prefer two distinct errors, but probably one is also good enough.
>
> That's actually a good point. My previous change allowed setting
> FAN_PRE_ACCESS | FAN_ONDIR on a new mark because that doesn't get to
> fanotify_may_update_existing_mark(). So I now have:
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 0919ea735f4a..38a46865408e 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1356,7 +1356,7 @@ static int fanotify_group_init_error_pool(struct fsnotify_group *group)
> }
>
> static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> - unsigned int fan_flags)
> + __u32 mask, unsigned int fan_flags)
> {
> /*
> * Non evictable mark cannot be downgraded to evictable mark.
> @@ -1383,6 +1383,11 @@ static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
> fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> return -EEXIST;
>
> + /* For now pre-content events are not generated for directories */
> + mask |= fsn_mark->mask;
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
> + return -EEXIST;
> +
> return 0;
> }
>
> @@ -1409,7 +1414,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
> /*
> * Check if requested mark flags conflict with an existing mark flags.
> */
> - ret = fanotify_may_update_existing_mark(fsn_mark, fan_flags);
> + ret = fanotify_may_update_existing_mark(fsn_mark, mask, fan_flags);
> if (ret)
> goto out;
>
> @@ -1905,6 +1910,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> if (mask & FAN_RENAME && !(fid_mode & FAN_REPORT_NAME))
> goto fput_and_out;
>
> + /* Pre-content events are not currently generated for directories. */
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS && mask & FAN_ONDIR)
> + goto fput_and_out;
> +
> if (mark_cmd == FAN_MARK_FLUSH) {
> ret = 0;
> if (mark_type == FAN_MARK_MOUNT)
> --
> 2.35.3
>
Looks good.
Thanks,
Amir.
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-11-15 15:30 ` [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-12-08 16:58 ` Klara Modin
2024-12-09 10:45 ` Aithal, Srikanth
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Klara Modin @ 2024-12-08 16:58 UTC (permalink / raw)
To: Josef Bacik, kernel-team, linux-fsdevel, jack, amir73il, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
[-- Attachment #1: Type: text/plain, Size: 5547 bytes --]
Hi,
On 2024-11-15 16:30, Josef Bacik wrote:
> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> on the faulting method.
>
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill in the file content on first read access.
>
> Export a simple helper that file systems that have their own ->fault()
> will use, and have a more complicated helper to be do fancy things with
> in filemap_fault.
>
This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
interacts poorly with some programs, which hang and are stuck at 100%
sys CPU usage (for example, logrotate and atop with root privileges).
I also retested the new version on Jan Kara's for_next branch and it
behaves the same way.
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> include/linux/mm.h | 1 +
> mm/filemap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 79 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 01c5e7a4489f..90155ef8599a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
> extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> pgoff_t start_pgoff, pgoff_t end_pgoff);
> extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
> +extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
>
> extern unsigned long stack_guard_gap;
> /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 68ea596f6905..0bf7d645dec5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -47,6 +47,7 @@
> #include <linux/splice.h>
> #include <linux/rcupdate_wait.h>
> #include <linux/sched/mm.h>
> +#include <linux/fsnotify.h>
> #include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -3289,6 +3290,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
> return ret;
> }
>
> +/**
> + * filemap_fsnotify_fault - maybe emit a pre-content event.
> + * @vmf: struct vm_fault containing details of the fault.
> + * @folio: the folio we're faulting in.
> + *
> + * If we have a pre-content watch on this file we will emit an event for this
> + * range. If we return anything the fault caller should return immediately, we
> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
> + * fault again and then the fault handler will run the second time through.
> + *
> + * This is meant to be called with the folio that we will be filling in to make
> + * sure the event is emitted for the correct range.
> + *
> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
> + */
> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
The kernel-doc above documents a @folio parameter, but the function only
takes @vmf.
> +{
> + struct file *fpin = NULL;
> + int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
> + loff_t pos = vmf->pgoff >> PAGE_SHIFT;
> + size_t count = PAGE_SIZE;
> + vm_fault_t ret;
> +
> + /*
> + * We already did this and now we're retrying with everything locked,
> + * don't emit the event and continue.
> + */
> + if (vmf->flags & FAULT_FLAG_TRIED)
> + return 0;
> +
> + /* No watches, we're done. */
> + if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
> + return 0;
> +
> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> + if (!fpin)
> + return VM_FAULT_SIGBUS;
> +
> + ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
> + fput(fpin);
> + if (ret)
> + return VM_FAULT_SIGBUS;
> + return VM_FAULT_RETRY;
> +}
> +EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
> +
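
As a reading aid, the retry protocol described in the kernel-doc comment
above can be modelled in a few lines of userspace C. This is a toy model
under stated assumptions, not kernel code: every sk_* name is
hypothetical, has_watch stands in for
fsnotify_file_has_pre_content_watches(), and SK_FLAG_TRIED stands in for
FAULT_FLAG_TRIED. It only shows that the handler finishes on the second
pass when the caller marks the retry, and keeps returning "retry" when
the caller does not. Whether that is what happens in the hang reported
here is not established by this sketch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
enum { SK_FLAG_TRIED = 1 << 0 };
enum sk_result { SK_DONE = 0, SK_RETRY = 1 };

struct sk_fault {
	unsigned int flags;
	bool has_watch;	/* models a pre-content watch on the file */
	int events;	/* number of events emitted so far */
};

/* One pass of the handler, following the documented protocol. */
static enum sk_result sk_fsnotify_fault(struct sk_fault *f)
{
	/* Second pass: the event was already emitted, continue. */
	if (f->flags & SK_FLAG_TRIED)
		return SK_DONE;
	/* No watches, nothing to do. */
	if (!f->has_watch)
		return SK_DONE;
	/* Emit the event and ask the caller to fault again. */
	f->events++;
	return SK_RETRY;
}

/* Caller loop bounded by max_passes. set_tried models whether the caller
 * marks the retry. Returns the number of passes used, or -1 if the
 * handler never finished within max_passes. */
static int sk_run_fault(struct sk_fault *f, bool set_tried, int max_passes)
{
	for (int pass = 1; pass <= max_passes; pass++) {
		if (sk_fsnotify_fault(f) == SK_DONE)
			return pass;
		if (set_tried)
			f->flags |= SK_FLAG_TRIED;
	}
	return -1;
}
```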
> /**
> * filemap_fault - read in file data for page fault handling
> * @vmf: struct vm_fault containing details of the fault
> @@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> * or because readahead was otherwise unable to retrieve it.
> */
> if (unlikely(!folio_test_uptodate(folio))) {
> + /*
> + * If this is a precontent file we have can now emit an event to
> + * try and populate the folio.
> + */
> + if (!(vmf->flags & FAULT_FLAG_TRIED) &&
> + fsnotify_file_has_pre_content_watches(file)) {
> + loff_t pos = folio_pos(folio);
> + size_t count = folio_size(folio);
> +
> + /* We're NOWAIT, we have to retry. */
> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> + folio_unlock(folio);
> + goto out_retry;
> + }
> +
> + if (mapping_locked)
> + filemap_invalidate_unlock_shared(mapping);
> + mapping_locked = false;
> +
> + folio_unlock(folio);
> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
When I look at it with GDB, it seems to get here but then always jumps
to out_retry. The same thing happens each time it reenters, and as far
as I can tell it never progresses beyond this point.
For logrotate, strace stops at "mmap(NULL, 909, PROT_READ,
MAP_PRIVATE|MAP_POPULATE, 3, 0".
For atop, strace stops at "mlockall(MCL_CURRENT|MCL_FUTURE".
If I remove this entire patch snippet everything seems to be normal.
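
For anyone who wants to exercise the same kind of call outside logrotate,
here is a minimal sketch that maps a file with the flags seen in the
strace output above (PROT_READ, MAP_PRIVATE|MAP_POPULATE). The helper
name sk_populate_map is hypothetical. It is an assumption, not something
verified here, that this alone is enough to trigger the hang on an
affected kernel; on an unaffected kernel it should simply return the
mapped length.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file read-only with MAP_PRIVATE | MAP_POPULATE, which
 * asks the kernel to prefault the mapping up front. Returns the mapped
 * length in bytes, or -1 on any error (including an empty file). */
static long sk_populate_map(const char *path)
{
	struct stat st;
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, (size_t)st.st_size, PROT_READ,
		 MAP_PRIVATE | MAP_POPULATE, fd, 0);
	/* The mapping stays valid after the descriptor is closed. */
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	munmap(p, (size_t)st.st_size);
	return (long)st.st_size;
}
```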
> + if (!fpin)
> + goto out_retry;
> +
> + error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
> + count);
> + if (error)
> + ret = VM_FAULT_SIGBUS;
> + goto out_retry;
> + }
> +
> /*
> * If the invalidate lock is not held, the folio was in cache
> * and uptodate and now it is not. Strange but possible since we
Please let me know if there's anything else you need.
Regards,
Klara Modin
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 44308 bytes --]
[-- Attachment #3: gdb-atop-bt.log.gz --]
[-- Type: application/gzip, Size: 792 bytes --]
[-- Attachment #4: gdb-logrotate-bt.log.gz --]
[-- Type: application/gzip, Size: 848 bytes --]
[-- Attachment #5: hang-bisect --]
[-- Type: text/plain, Size: 2547 bytes --]
# bad: [084d3ed79399b308a21bcd0a7f009db6bd57ff38] ext4: enable large folio for regular file
git bisect start 'HEAD'
# status: waiting for good commit(s), bad commit known
# good: [b8f52214c61a5b99a54168145378e91b40d10c90] Merge tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
git bisect good b8f52214c61a5b99a54168145378e91b40d10c90
# bad: [0413a93cd7f4ed0d625f38094746b342b9456e8c] Merge branch 'hwmon-next' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git
git bisect bad 0413a93cd7f4ed0d625f38094746b342b9456e8c
# good: [8e073f624bcaada15339e92bb8b22523324f403f] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/bmc.git
git bisect good 8e073f624bcaada15339e92bb8b22523324f403f
# bad: [2c230f2d1a4cf25414d445c8983bf9ff90fc3a2d] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs.git
git bisect bad 2c230f2d1a4cf25414d445c8983bf9ff90fc3a2d
# good: [9a4bc7289685088172122d5579886db94c384510] bcachefs: Go RW earlier, for normal rw mount
git bisect good 9a4bc7289685088172122d5579886db94c384510
# good: [4fdad3e3cc21679d4d61bb0b796fc86b7568c08e] Merge branch 'for-next-next-v6.13-20241203' into for-next-20241203
git bisect good 4fdad3e3cc21679d4d61bb0b796fc86b7568c08e
# good: [fb92391ef70a46f9f17b766115075dc8c7cef44c] bcachefs: Call bch2_btree_lost_data() on btree read error
git bisect good fb92391ef70a46f9f17b766115075dc8c7cef44c
# bad: [78765a8ed5f98d66d5725b4ecfa32114d88f89fc] Merge fanotify HSM implementation.
git bisect bad 78765a8ed5f98d66d5725b4ecfa32114d88f89fc
# good: [6cc9cb93ac86ada0af3c5f132075b5b6ca7e85dd] fanotify: report file range info with pre-content events
git bisect good 6cc9cb93ac86ada0af3c5f132075b5b6ca7e85dd
# bad: [d8d33c43e1fb472890f7d97d0026d478d2b6fa36] xfs: add pre-content fsnotify hook for DAX faults
git bisect bad d8d33c43e1fb472890f7d97d0026d478d2b6fa36
# good: [9dee1a117266990083bf514038a19017bdc21497] fanotify: disable readahead if we have pre-content watches
git bisect good 9dee1a117266990083bf514038a19017bdc21497
# bad: [0790303ec869d0fd658a548551972b51ced7390c] fsnotify: generate pre-content permission event on page fault
git bisect bad 0790303ec869d0fd658a548551972b51ced7390c
# good: [6199ec7fa1ddb7e9c546ae8e549da904a3afb688] mm: don't allow huge faults for files with pre content watches
git bisect good 6199ec7fa1ddb7e9c546ae8e549da904a3afb688
# first bad commit: [0790303ec869d0fd658a548551972b51ced7390c] fsnotify: generate pre-content permission event on page fault
[-- Attachment #6: strace-atop.log.gz --]
[-- Type: application/gzip, Size: 1846 bytes --]
[-- Attachment #7: strace-logrotate.log.gz --]
[-- Type: application/gzip, Size: 1202 bytes --]
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-08 16:58 ` Klara Modin
@ 2024-12-09 10:45 ` Aithal, Srikanth
2024-12-09 12:34 ` Jan Kara
2024-12-09 12:31 ` Jan Kara
2024-12-10 21:12 ` Randy Dunlap
2 siblings, 1 reply; 69+ messages in thread
From: Aithal, Srikanth @ 2024-12-09 10:45 UTC (permalink / raw)
To: Klara Modin, Josef Bacik, kernel-team, linux-fsdevel, jack,
amir73il, brauner, torvalds, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, Linux-Next Mailing List
On 12/8/2024 10:28 PM, Klara Modin wrote:
> Hi,
>
> On 2024-11-15 16:30, Josef Bacik wrote:
>> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
>> on the faulting method.
>>
>> This pre-content event is meant to be used by hierarchical storage
>> managers that want to fill in the file content on first read access.
>>
>> Export a simple helper that file systems that have their own ->fault()
>> will use, and have a more complicated helper to be do fancy things with
>> in filemap_fault.
>>
>
> This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
> interacts poorly with some programs which hang and are stuck at 100 %
> sys cpu usage (examples of programs are logrotate and atop with root
> privileges).
>
> I also retested the new version on Jan Kara's for_next branch and it
> behaves the same way.
Starting with next-20241206, we are hitting an issue where KVM guests
running kernels > next-20241206 on AMD platforms fail to shut down and
hang forever with the errors below:
[ OK ] Reached target Late Shutdown Services.
[ OK ] Finished System Power Off.
[ OK ] Reached target System Power Off.
[ 128.946271] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Connection refused
[ 198.945362] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 298.945402] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 378.945345] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 488.945402] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 558.945904] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 632.945409] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 738.945403] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[ 848.945342] systemd-journald[93]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
..
..
Bisecting the issue pointed to this patch.
commit 0790303ec869d0fd658a548551972b51ced7390c
Author: Josef Bacik <josef@toxicpanda.com>
Date: Fri Nov 15 10:30:29 2024 -0500
fsnotify: generate pre-content permission event on page fault
The same issue exists with today's linux-next build as well.
Adding the configs below to the guest_config fixes the shutdown hang issue:
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
Regards,
Srikanth Aithal
>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> include/linux/mm.h | 1 +
>> mm/filemap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 79 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 01c5e7a4489f..90155ef8599a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault
>> *vmf);
>> extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>> pgoff_t start_pgoff, pgoff_t end_pgoff);
>> extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
>> +extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
>>
>> extern unsigned long stack_guard_gap;
>> /* Generic expand stack which grows the stack according to
>> GROWS{UP,DOWN} */
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 68ea596f6905..0bf7d645dec5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -47,6 +47,7 @@
>> #include <linux/splice.h>
>> #include <linux/rcupdate_wait.h>
>> #include <linux/sched/mm.h>
>> +#include <linux/fsnotify.h>
>> #include <asm/pgalloc.h>
>> #include <asm/tlbflush.h>
>> #include "internal.h"
>> @@ -3289,6 +3290,52 @@ static vm_fault_t
>> filemap_fault_recheck_pte_none(struct vm_fault *vmf)
>> return ret;
>> }
>>
>> +/**
>> + * filemap_fsnotify_fault - maybe emit a pre-content event.
>> + * @vmf: struct vm_fault containing details of the fault.
>> + * @folio: the folio we're faulting in.
>> + *
>> + * If we have a pre-content watch on this file we will emit an event
>> for this
>> + * range. If we return anything the fault caller should return
>> immediately, we
>> + * will return VM_FAULT_RETRY if we had to emit an event, which will
>> trigger the
>> + * fault again and then the fault handler will run the second time
>> through.
>> + *
>> + * This is meant to be called with the folio that we will be filling
>> in to make
>> + * sure the event is emitted for the correct range.
>> + *
>> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
>> + */
>> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
>
> The parameters mentioned above do not seem to match with the function.
>
>> +{
>> + struct file *fpin = NULL;
>> + int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
>> + loff_t pos = vmf->pgoff >> PAGE_SHIFT;
>> + size_t count = PAGE_SIZE;
>> + vm_fault_t ret;
>> +
>> + /*
>> + * We already did this and now we're retrying with everything
>> locked,
>> + * don't emit the event and continue.
>> + */
>> + if (vmf->flags & FAULT_FLAG_TRIED)
>> + return 0;
>> +
>> + /* No watches, we're done. */
>> + if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
>> + return 0;
>> +
>> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> + if (!fpin)
>> + return VM_FAULT_SIGBUS;
>> +
>> + ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
>> + fput(fpin);
>> + if (ret)
>> + return VM_FAULT_SIGBUS;
>> + return VM_FAULT_RETRY;
>> +}
>> +EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
>> +
>> /**
>> * filemap_fault - read in file data for page fault handling
>> * @vmf: struct vm_fault containing details of the fault
>
>
>
>> @@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>> * or because readahead was otherwise unable to retrieve it.
>> */
>> if (unlikely(!folio_test_uptodate(folio))) {
>> + /*
>> + * If this is a precontent file we have can now emit an event to
>> + * try and populate the folio.
>> + */
>> + if (!(vmf->flags & FAULT_FLAG_TRIED) &&
>> + fsnotify_file_has_pre_content_watches(file)) {
>> + loff_t pos = folio_pos(folio);
>> + size_t count = folio_size(folio);
>> +
>> + /* We're NOWAIT, we have to retry. */
>> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
>> + folio_unlock(folio);
>> + goto out_retry;
>> + }
>> +
>> + if (mapping_locked)
>> + filemap_invalidate_unlock_shared(mapping);
>> + mapping_locked = false;
>> +
>> + folio_unlock(folio);
>> + fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>
> When I look at it with GDB it seems to get here, but then always jumps
> to out_retry, which keeps happening when it reenters, and never seems to
> progress beyond from what I could tell.
>
> For logrotate, strace stops at "mmap(NULL, 909, PROT_READ, MAP_PRIVATE|
> MAP_POPULATE, 3, 0".
> For atop, strace stops at "mlockall(MCL_CURRENT|MCL_FUTURE".
>
> If I remove this entire patch snippet everything seems to be normal.
>
>> + if (!fpin)
>> + goto out_retry;
>> +
>> + error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
>> + count);
>> + if (error)
>> + ret = VM_FAULT_SIGBUS;
>> + goto out_retry;
>> + }
>> +
>> /*
>> * If the invalidate lock is not held, the folio was in cache
>> * and uptodate and now it is not. Strange but possible since we
>
> Please let me know if there's anything else you need.
>
> Regards,
> Klara Modin
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-08 16:58 ` Klara Modin
2024-12-09 10:45 ` Aithal, Srikanth
@ 2024-12-09 12:31 ` Jan Kara
2024-12-09 12:56 ` Klara Modin
2024-12-10 21:12 ` Randy Dunlap
2 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2024-12-09 12:31 UTC (permalink / raw)
To: Klara Modin
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, amir73il, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4
Hello!
On Sun 08-12-24 17:58:42, Klara Modin wrote:
> On 2024-11-15 16:30, Josef Bacik wrote:
> > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> > on the faulting method.
> >
> > This pre-content event is meant to be used by hierarchical storage
> > managers that want to fill in the file content on first read access.
> >
> > Export a simple helper that file systems that have their own ->fault()
> > will use, and have a more complicated helper to be do fancy things with
> > in filemap_fault.
> >
>
> This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
> interacts poorly with some programs which hang and are stuck at 100 % sys
> cpu usage (examples of programs are logrotate and atop with root
> privileges).
>
> I also retested the new version on Jan Kara's for_next branch and it behaves
> the same way.
Thanks for the report! What is your kernel config, please? I've just fixed
a bug reported in [1] which manifested in the same way with
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n.
Can you perhaps test with my for_next branch I've just pushed out? Thanks!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-09 10:45 ` Aithal, Srikanth
@ 2024-12-09 12:34 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-12-09 12:34 UTC (permalink / raw)
To: Aithal, Srikanth
Cc: Klara Modin, Josef Bacik, kernel-team, linux-fsdevel, jack,
amir73il, brauner, torvalds, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, Linux-Next Mailing List
On Mon 09-12-24 16:15:32, Aithal, Srikanth wrote:
> On 12/8/2024 10:28 PM, Klara Modin wrote:
> > Hi,
> >
> > On 2024-11-15 16:30, Josef Bacik wrote:
> > > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> > > on the faulting method.
> > >
> > > This pre-content event is meant to be used by hierarchical storage
> > > managers that want to fill in the file content on first read access.
> > >
> > > Export a simple helper that file systems that have their own ->fault()
> > > will use, and have a more complicated helper to be do fancy things with
> > > in filemap_fault.
> > >
> >
> > This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
> > interacts poorly with some programs which hang and are stuck at 100 %
> > sys cpu usage (examples of programs are logrotate and atop with root
> > privileges).
> >
> > I also retested the new version on Jan Kara's for_next branch and it
> > behaves the same way.
>
> From linux-next20241206 onward we started hitting issues where KVM guests
> running kernel > next20241206 on AMD platforms fails to shutdown, hangs
> forever with below errors:
Thanks for the report! This was discussed in [1] and I've just pushed out a
branch which has this bug fixed.
[1] https://lore.kernel.org/all/20241208152520.3559-1-spasswolf@web.de
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-09 12:31 ` Jan Kara
@ 2024-12-09 12:56 ` Klara Modin
2024-12-09 14:16 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Klara Modin @ 2024-12-09 12:56 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, amir73il, brauner,
torvalds, viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4,
sraithal
Hi,
On 2024-12-09 13:31, Jan Kara wrote:
> Hello!
>
> On Sun 08-12-24 17:58:42, Klara Modin wrote:
>> On 2024-11-15 16:30, Josef Bacik wrote:
>>> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
>>> on the faulting method.
>>>
>>> This pre-content event is meant to be used by hierarchical storage
>>> managers that want to fill in the file content on first read access.
>>>
>>> Export a simple helper that file systems that have their own ->fault()
>>> will use, and have a more complicated helper to be do fancy things with
>>> in filemap_fault.
>>>
>>
>> This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
>> interacts poorly with some programs which hang and are stuck at 100 % sys
>> cpu usage (examples of programs are logrotate and atop with root
>> privileges).
>>
>> I also retested the new version on Jan Kara's for_next branch and it behaves
>> the same way.
>
> Thanks for report! What is your kernel config please? I've just fixed a
> bug reported by [1] which manifested in the same way with
> CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n.
>
> Can you perhaps test with my for_next branch I've just pushed out? Thanks!
>
> Honza
My config was attached, but yes, I have
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n. I tried the tip by Srikanth Aithal
to enable it and that resolved the issue.
Your new for_next branch resolved the
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n case for me.
Thanks,
Klara Modin
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-09 12:56 ` Klara Modin
@ 2024-12-09 14:16 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-12-09 14:16 UTC (permalink / raw)
To: Klara Modin
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, amir73il,
brauner, torvalds, viro, linux-xfs, linux-btrfs, linux-mm,
linux-ext4, sraithal
On Mon 09-12-24 13:56:47, Klara Modin wrote:
> Hi,
>
> On 2024-12-09 13:31, Jan Kara wrote:
> > Hello!
> >
> > On Sun 08-12-24 17:58:42, Klara Modin wrote:
> > > On 2024-11-15 16:30, Josef Bacik wrote:
> > > > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> > > > on the faulting method.
> > > >
> > > > This pre-content event is meant to be used by hierarchical storage
> > > > managers that want to fill in the file content on first read access.
> > > >
> > > > Export a simple helper that file systems that have their own ->fault()
> > > > will use, and have a more complicated helper to do fancy things with
> > > > in filemap_fault.
> > > >
> > >
> > > This patch (0790303ec869d0fd658a548551972b51ced7390c in next-20241206)
> > > interacts poorly with some programs which hang and are stuck at 100 % sys
> > > cpu usage (examples of programs are logrotate and atop with root
> > > privileges).
> > >
> > > I also retested the new version on Jan Kara's for_next branch and it behaves
> > > the same way.
> >
> > Thanks for report! What is your kernel config please? I've just fixed a
> > bug reported by [1] which manifested in the same way with
> > CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n.
> >
> > Can you perhaps test with my for_next branch I've just pushed out? Thanks!
> >
> > Honza
>
> My config was attached, but yes, I have
Ah, sorry, somehow I've missed that.
> CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n. I tried the tip by Srikanth Aithal to
> enable it and that resolved the issue.
>
> Your new for_next branch resolved the CONFIG_FANOTIFY_ACCESS_PERMISSIONS=n
> case for me.
Thanks for testing! Glad to hear the problem is solved.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-08 16:58 ` Klara Modin
2024-12-09 10:45 ` Aithal, Srikanth
2024-12-09 12:31 ` Jan Kara
@ 2024-12-10 21:12 ` Randy Dunlap
2024-12-11 16:30 ` Jan Kara
2 siblings, 1 reply; 69+ messages in thread
From: Randy Dunlap @ 2024-12-10 21:12 UTC (permalink / raw)
To: Klara Modin, Josef Bacik, kernel-team, linux-fsdevel, jack,
amir73il, brauner, torvalds, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4
On 12/8/24 8:58 AM, Klara Modin wrote:
>> +/**
>> + * filemap_fsnotify_fault - maybe emit a pre-content event.
>> + * @vmf: struct vm_fault containing details of the fault.
>> + * @folio: the folio we're faulting in.
>> + *
>> + * If we have a pre-content watch on this file we will emit an event for this
>> + * range. If we return anything the fault caller should return immediately, we
>> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
>> + * fault again and then the fault handler will run the second time through.
>> + *
>> + * This is meant to be called with the folio that we will be filling in to make
>> + * sure the event is emitted for the correct range.
>> + *
>> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
>> + */
>> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
>
> The parameters mentioned above do not seem to match with the function.
which causes a warning:
mm/filemap.c:3289: warning: Excess function parameter 'folio' description in 'filemap_fsnotify_fault'
--
~Randy
* Re: [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault
2024-12-10 21:12 ` Randy Dunlap
@ 2024-12-11 16:30 ` Jan Kara
0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2024-12-11 16:30 UTC (permalink / raw)
To: Randy Dunlap
Cc: Klara Modin, Josef Bacik, kernel-team, linux-fsdevel, jack,
amir73il, brauner, torvalds, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4
On Tue 10-12-24 13:12:01, Randy Dunlap wrote:
> On 12/8/24 8:58 AM, Klara Modin wrote:
> >> +/**
> >> + * filemap_fsnotify_fault - maybe emit a pre-content event.
> >> + * @vmf: struct vm_fault containing details of the fault.
> >> + * @folio: the folio we're faulting in.
> >> + *
> >> + * If we have a pre-content watch on this file we will emit an event for this
> >> + * range. If we return anything the fault caller should return immediately, we
> >> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
> >> + * fault again and then the fault handler will run the second time through.
> >> + *
> >> + * This is meant to be called with the folio that we will be filling in to make
> >> + * sure the event is emitted for the correct range.
> >> + *
> >> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
> >> + */
> >> +vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
> >
> > The parameters mentioned above do not seem to match with the function.
>
>
> which causes a warning:
>
> mm/filemap.c:3289: warning: Excess function parameter 'folio' description in 'filemap_fsnotify_fault'
Thanks, fixed up!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2024-11-15 15:30 ` [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2025-01-31 19:17 ` Alex Williamson
2025-01-31 19:59 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2025-01-31 19:17 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4, Peter Xu,
linux-kernel, kvm
20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
This breaks huge_fault support for PFNMAPs that was recently added in
v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
order mappings. Thanks,
Alex
On Fri, 15 Nov 2024 10:30:28 -0500
Josef Bacik <josef@toxicpanda.com> wrote:
> There's nothing stopping us from supporting this, we could simply pass
> the order into the helper and emit the proper length. However currently
> there's no tests to validate this works properly, so disable it until
> there's a desire to support this along with the appropriate tests.
>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> mm/memory.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bdf77a3ec47b..843ad75a4148 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -78,6 +78,7 @@
> #include <linux/ptrace.h>
> #include <linux/vmalloc.h>
> #include <linux/sched/sysctl.h>
> +#include <linux/fsnotify.h>
>
> #include <trace/events/kmem.h>
>
> @@ -5637,8 +5638,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> if (vma_is_anonymous(vma))
> return do_huge_pmd_anonymous_page(vmf);
> + /*
> + * Currently we just emit PAGE_SIZE for our fault events, so don't allow
> + * a huge fault if we have a pre content watch on this file. This would
> + * be trivial to support, but there would need to be tests to ensure
> + * this works properly and those don't exist currently.
> + */
> + if (fsnotify_file_has_pre_content_watches(file))
> + return VM_FAULT_FALLBACK;
> if (vma->vm_ops->huge_fault)
> return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
> return VM_FAULT_FALLBACK;
> @@ -5648,6 +5658,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> vm_fault_t ret;
>
> @@ -5662,6 +5673,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> }
>
> if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + /* See comment in create_huge_pmd. */
> + if (fsnotify_file_has_pre_content_watches(file))
> + goto split;
> if (vma->vm_ops->huge_fault) {
> ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
> if (!(ret & VM_FAULT_FALLBACK))
> @@ -5681,9 +5695,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> /* No support for anonymous transparent PUD pages yet */
> if (vma_is_anonymous(vma))
> return VM_FAULT_FALLBACK;
> + /* See comment in create_huge_pmd. */
> + if (fsnotify_file_has_pre_content_watches(file))
> + return VM_FAULT_FALLBACK;
> if (vma->vm_ops->huge_fault)
> return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> @@ -5695,12 +5713,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> vm_fault_t ret;
>
> /* No support for anonymous transparent PUD pages yet */
> if (vma_is_anonymous(vma))
> goto split;
> if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + /* See comment in create_huge_pmd. */
> + if (fsnotify_file_has_pre_content_watches(file))
> + goto split;
> if (vma->vm_ops->huge_fault) {
> ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> if (!(ret & VM_FAULT_FALLBACK))
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-01-31 19:17 ` [REGRESSION] " Alex Williamson
@ 2025-01-31 19:59 ` Linus Torvalds
2025-02-01 1:19 ` Peter Xu
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2025-01-31 19:59 UTC (permalink / raw)
To: Alex Williamson
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, amir73il, brauner,
viro, linux-xfs, linux-btrfs, linux-mm, linux-ext4, Peter Xu,
linux-kernel, kvm
On Fri, 31 Jan 2025 at 11:17, Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
>
> This breaks huge_fault support for PFNMAPs that was recently added in
> v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> order mappings.
Surely only for content watches?
Which shouldn't be a valid situation *anyway*.
IOW, there must be some unrelated bug somewhere: either somebody is
allowed to set a pre-content watch on a special device.
That should be disabled by the whole
/*
* If there are permission event watchers but no pre-content event
* watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
*/
thing in file_set_fsnotify_mode() which only allows regular files and
directories to be notified on.
Or, alternatively, that check for huge-fault disabling is just
checking the wrong bits.
Or - quite possibly - I am missing something obvious?
Linus
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-01-31 19:59 ` Linus Torvalds
@ 2025-02-01 1:19 ` Peter Xu
2025-02-01 14:38 ` Christian Brauner
0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2025-02-01 1:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alex Williamson, Josef Bacik, kernel-team, linux-fsdevel, jack,
amir73il, brauner, viro, linux-xfs, linux-btrfs, linux-mm,
linux-ext4, linux-kernel, kvm
On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote:
> On Fri, 31 Jan 2025 at 11:17, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> >
> > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
> >
> > This breaks huge_fault support for PFNMAPs that was recently added in
> > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> > order mappings.
>
> Surely only for content watches?
>
> Which shouldn't be a valid situation *anyway*.
>
> IOW, there must be some unrelated bug somewhere: either somebody is
> allowed to set a pre-content watch on a special device.
>
> That should be disabled by the whole
>
> /*
> * If there are permission event watchers but no pre-content event
> * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
> */
>
> thing in file_set_fsnotify_mode() which only allows regular files and
> directories to be notified on.
>
> Or, alternatively, that check for huge-fault disabling is just
> checking the wrong bits.
>
> Or - quite possibly - I am missing something obvious?
Is it possible that some paths were overlooked when setting up the
fsnotify bits in f_mode? Since the default for those bits is "no bit set",
I think FMODE_FSNOTIFY_HSM() will always return true for any file that was
overlooked.
One thing to mention: the /dev/vfio/* nodes are chardevs, but the PCI BARs
are not mmap()ed from these fds. Whatever is under /dev/vfio/* represents
IOMMU groups rather than the devices themselves.
The app normally needs to first open the IOMMU group fd under /dev/vfio/*,
then use the VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which
will be the mmap() target, instead of the ones under /dev.
I checked: those device fds are allocated by vfio_device_open_file()
within the ioctl, which internally uses anon_inode_getfile(). I don't see
anything in that path that sets the fanotify bits.
Further, I'm not sure whether some callers of alloc_file() can also suffer
from a similar issue. At least the memfd_create() syscall also uses that
API, and it (hopefully?) used to allow THPs for shmem-backed memfds on
aligned mmap()s. I'm not sure whether it will now wrongly take the
FALLBACK path in create_huge_pmd() too, just like vfio's VMAs. I
didn't verify it though, nor have I checked other users yet.
So I wonder whether we should set up the fanotify bits in at least
alloc_file() too (to FMODE_NONOTIFY?).
I'm totally unfamiliar with fanotify, and it's a bit late to try to verify
anything (I cannot quickly find my previous huge pfnmap setup, so setting
that up will also take time). But maybe the above can provide some clues
for others.
Thanks,
--
Peter Xu
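For readers unfamiliar with VFIO, the open sequence described in this
message can be sketched in userspace C roughly as follows. This is only an
illustration and not code from this thread: the function name
vfio_map_bar0() and its parameters are hypothetical, the container setup
steps (VFIO_GROUP_SET_CONTAINER, then VFIO_SET_IOMMU with a type1 IOMMU)
are an assumption about the host and follow the usual group-based flow, and
group viability checks are omitted. The relevant point is that the fd
handed to mmap() is the one returned by VFIO_GROUP_GET_DEVICE_FD, not the
group fd opened under /dev/vfio/*.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: map BAR0 of a VFIO PCI device.
 * group_path (e.g. "/dev/vfio/26") and bdf (e.g. "0000:06:0d.0") are
 * host-specific placeholders. Returns MAP_FAILED on any error.
 */
static inline void *vfio_map_bar0(const char *group_path, const char *bdf,
				  size_t *len)
{
	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	void *map;
	int group, container = -1, device = -1;

	assert(group_path && bdf);

	/* 1. The chardev under /dev/vfio/ is the IOMMU group, not the device. */
	group = open(group_path, O_RDWR);
	if (group < 0)
		return MAP_FAILED;

	/* 2. Attach the group to a container (assumed: type1 IOMMU). */
	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0)
		goto fail;
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		goto fail;
	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
		goto fail;

	/* 3. The device fd comes from an ioctl on the group fd. */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, bdf);
	if (device < 0)
		goto fail;

	/* 4. Look up BAR0 and mmap() it from the device fd. */
	if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
		goto fail;
	if (!(reg.flags & VFIO_REGION_INFO_FLAG_MMAP))
		goto fail;
	map = mmap(NULL, (size_t)reg.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   device, (off_t)reg.offset);
	if (map == MAP_FAILED)
		goto fail;
	if (len)
		*len = (size_t)reg.size;
	return map;	/* fds are intentionally left open in this sketch */

fail:
	if (device >= 0)
		close(device);
	if (container >= 0)
		close(container);
	close(group);
	return MAP_FAILED;
}
```

The struct file behind the fd obtained in step 3 is the one whose f_mode
matters for the huge fault path discussed here.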
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-01 1:19 ` Peter Xu
@ 2025-02-01 14:38 ` Christian Brauner
2025-02-02 0:58 ` Linus Torvalds
0 siblings, 1 reply; 69+ messages in thread
From: Christian Brauner @ 2025-02-01 14:38 UTC (permalink / raw)
To: Peter Xu
Cc: Linus Torvalds, Alex Williamson, Josef Bacik, kernel-team,
linux-fsdevel, jack, amir73il, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, linux-kernel, kvm
On Fri, Jan 31, 2025 at 08:19:22PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote:
> > On Fri, 31 Jan 2025 at 11:17, Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > >
> > > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
> > >
> > > This breaks huge_fault support for PFNMAPs that was recently added in
> > > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD
> > > order mappings.
> >
> > Surely only for content watches?
> >
> > Which shouldn't be a valid situation *anyway*.
> >
> > IOW, there must be some unrelated bug somewhere: either somebody is
> > allowed to set a pre-content watch on a special device.
> >
> > That should be disabled by the whole
> >
> > /*
> > * If there are permission event watchers but no pre-content event
> > * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that.
> > */
> >
> > thing in file_set_fsnotify_mode() which only allows regular files and
> > directories to be notified on.
> >
> > Or, alternatively, that check for huge-fault disabling is just
> > checking the wrong bits.
> >
> > Or - quite possibly - I am missing something obvious?
>
> Is it possible that we have some paths got overlooked in setting up the
> fsnotify bits in f_mode? Meanwhile since the default is "no bit set" on
> those bits, I think it means FMODE_FSNOTIFY_HSM() can always return true on
> those if overlooked..
>
> One thing to mention is, /dev/vfio/* are chardevs, however the PCI bars are
> not mmap()ed from these fds - whatever under /dev/vfio/* represents IOMMU
> groups rather than the device fd itself.
>
> The app normally needs to first open the IOMMU group fd under /dev/vfio/*,
> then using VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which
> will be the mmap() target, instead of the ones under /dev.
Ok, but those "device fds" aren't really device fds in the sense that
they are character fds. They are regular files afaict from:
vfio_device_open_file(struct vfio_device *device)
(Well, it's actually worse as anon_inode_getfile() files don't have any
mode at all but that's beside the point.)?
In any case, I think you're right that such files would (accidentally?)
qualify for content watches afaict. So at least that should probably get
FMODE_NONOTIFY.
>
> I checked, those device fds were allocated from vfio_device_open_file()
> within the ioctl, which internally uses anon_inode_getfile(). I don't see
> anywhere in that path that will set the fanotify bits..
>
> Further, I'm not sure whether some callers of alloc_file() can also suffer
Sidenote, mm/memfd.c should pretty please rename alloc_file() to
memfd_alloc_file() or something. That would be great because
alloc_file() is a local fs/file_table.c helper and grepping for it is
confusing as I first thought someone made alloc_file() available outside
of fs/file_table.c
> from similar issue, because at least memfd_create() syscall also uses the
> API, which (hopefully?) would used to allow THPs for shmem backed memfds on
> aligned mmap()s, but not sure whether it'll also wrongly trigger the
> FALLBACK path similarly in create_huge_pmd() just like vfio's VMAs. I
> didn't verify it though, nor did I yet check more users.
>
> So I wonder whether we should setup the fanotify bits in at least
> alloc_file() too (to FMODE_NONOTIFY?).
>
> I'm totally not familiar with fanotify, and it's a bit late to try verify
> anything (I cannot quickly find my previous huge pfnmap setup, so setup
> those will also take time..). but maybe above can provide some clues for
> others..
>
> Thanks,
>
> --
> Peter Xu
>
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-01 14:38 ` Christian Brauner
@ 2025-02-02 0:58 ` Linus Torvalds
2025-02-02 7:46 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2025-02-02 0:58 UTC (permalink / raw)
To: Christian Brauner
Cc: Peter Xu, Alex Williamson, Josef Bacik, kernel-team,
linux-fsdevel, jack, amir73il, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, linux-kernel, kvm
On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
>
> Ok, but those "device fds" aren't really device fds in the sense that
> they are character fds. They are regular files afaict from:
>
> vfio_device_open_file(struct vfio_device *device)
>
> (Well, it's actually worse as anon_inode_getfile() files don't have any
> mode at all but that's beside the point.)?
>
> In any case, I think you're right that such files would (accidentally?)
> qualify for content watches afaict. So at least that should probably get
> FMODE_NONOTIFY.
Hmm. Can we just make all anon_inodes do that? I don't think you can
sanely have pre-content watches on anon-inodes, since you can't really
have access to them to _set_ the content watch from outside anyway..
In fact, maybe do it in alloc_file_pseudo()?
Amir / Josef?
Linus
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-02 0:58 ` Linus Torvalds
@ 2025-02-02 7:46 ` Amir Goldstein
2025-02-02 10:04 ` Christian Brauner
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2025-02-02 7:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Peter Xu, Alex Williamson, Josef Bacik,
kernel-team, linux-fsdevel, jack, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, linux-kernel, kvm
On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ok, but those "device fds" aren't really device fds in the sense that
> > they are character fds. They are regular files afaict from:
> >
> > vfio_device_open_file(struct vfio_device *device)
> >
> > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > mode at all but that's beside the point.)?
> >
> > In any case, I think you're right that such files would (accidentally?)
> > qualify for content watches afaict. So at least that should probably get
> > FMODE_NONOTIFY.
>
> Hmm. Can we just make all anon_inodes do that? I don't think you can
> sanely have pre-content watches on anon-inodes, since you can't really
> have access to them to _set_ the content watch from outside anyway..
>
> In fact, maybe do it in alloc_file_pseudo()?
>
The problem is that we cannot set FMODE_NONOTIFY:
we tried that once, but it regressed some workloads watching
writes on a pipe fd or something.
Also, the no-pre-content state is a flag combination (to save FMODE_ flags),
which makes things a bit messy.
We could try to initialize f_mode to FMODE_NONOTIFY_PERM
for anon_inode, which opts out of both permission and pre-content
events and leaves the legacy inotify workloads unaffected.
But, then code like this will not do the right thing:
/* We refuse fsnotify events on ptmx, since it's a shared resource */
filp->f_mode |= FMODE_NONOTIFY;
We will need to convert all those to use a helper.
I am traveling today so will be able to look closer tomorrow.
Jan,
What do you think?
Amir.
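To illustrate why a plain "|=" stops doing the right thing with this
encoding, here is a small userspace model (not kernel code). The bit values
are made up and only their relationships matter. The meaning of each
combination follows what is described in this thread: no bit set allows
everything including pre-content events, FMODE_NONOTIFY_PERM alone
suppresses permission and pre-content events, both bits together suppress
only pre-content events, and FMODE_NONOTIFY alone suppresses all events.
The helper name set_fsnotify_mode() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real FMODE_* constants differ. */
#define FMODE_NONOTIFY		0x1u
#define FMODE_NONOTIFY_PERM	0x2u
#define FMODE_FSNOTIFY_MASK	(FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)

/* True if no fsnotify events at all are generated for the file. */
static inline bool fsnotify_none(unsigned int mode)
{
	return (mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY;
}

/* True if permission events may be generated: no bit set, or both set. */
static inline bool fsnotify_perm(unsigned int mode)
{
	unsigned int bits = mode & FMODE_FSNOTIFY_MASK;

	return bits == 0 || bits == FMODE_FSNOTIFY_MASK;
}

/* True if pre-content (HSM) events may be generated: no bit set. */
static inline bool fsnotify_hsm(unsigned int mode)
{
	return (mode & FMODE_FSNOTIFY_MASK) == 0;
}

/* Hypothetical helper: replace the notify bits instead of OR-ing them in. */
static inline unsigned int set_fsnotify_mode(unsigned int mode,
					     unsigned int bits)
{
	assert((bits & ~FMODE_FSNOTIFY_MASK) == 0);
	return (mode & ~FMODE_FSNOTIFY_MASK) | bits;
}
```

In this model, a file whose f_mode starts out as FMODE_NONOTIFY_PERM and
then gets "f_mode |= FMODE_NONOTIFY" ends up with both bits set, which
means "only pre-content events suppressed" rather than "no events", whereas
set_fsnotify_mode(mode, FMODE_NONOTIFY) yields the intended state. A file
on which nobody ever set the bits reports fsnotify_hsm() as true, which
matches the default that Peter pointed out.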
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-02 7:46 ` Amir Goldstein
@ 2025-02-02 10:04 ` Christian Brauner
2025-02-03 12:41 ` Jan Kara
0 siblings, 1 reply; 69+ messages in thread
From: Christian Brauner @ 2025-02-02 10:04 UTC (permalink / raw)
To: Amir Goldstein
Cc: Linus Torvalds, Peter Xu, Alex Williamson, Josef Bacik,
kernel-team, linux-fsdevel, jack, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, linux-kernel, kvm
On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ok, but those "device fds" aren't really device fds in the sense that
> > > they are character fds. They are regular files afaict from:
> > >
> > > vfio_device_open_file(struct vfio_device *device)
> > >
> > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > mode at all but that's beside the point.)?
> > >
> > > In any case, I think you're right that such files would (accidentally?)
> > > qualify for content watches afaict. So at least that should probably get
> > > FMODE_NONOTIFY.
> >
> > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > sanely have pre-content watches on anon-inodes, since you can't really
> > have access to them to _set_ the content watch from outside anyway..
> >
> > In fact, maybe do it in alloc_file_pseudo()?
> >
>
> The problem is that we cannot set FMODE_NONOTIFY -
> we tried that once but it regressed some workloads watching
> write on pipe fd or something.
Ok, that might be true. But I would assume that most users of
alloc_file_pseudo() or the anonymous inode infrastructure will not care
about fanotify events. I would not go for a separate helper. It'd be
nice to keep the number of file allocation functions low.
I'd rather have the subsystems that want it explicitly opt-in to
fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
broken fanotify support for e.g., nsfs already. So make the subsystems
think about whether they actually want to support it.
I would disqualify all anonymous inodes and see what actually does
break. I naively suspect that almost no one uses anonymous inodes +
fanotify. I'd be very surprised.
I'm currently traveling (see you later btw) but from a very cursory
reading I would naively suspect the following:
// Suspects for FMODE_NONOTIFY
drivers/dma-buf/dma-buf.c: file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
drivers/scsi/cxlflash/ocxl_hw.c: file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
fs/anon_inodes.c: file = alloc_file_pseudo(inode, anon_inode_mnt, name,
fs/hugetlbfs/inode.c: file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
kernel/bpf/token.c: file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
block/bdev.c: bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
// Suspects for ~FMODE_NONOTIFY
fs/aio.c: file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
fs/pipe.c: f = alloc_file_pseudo(inode, pipe_mnt, "",
mm/shmem.c: res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
// Unsure:
fs/nfs/nfs4file.c: filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
net/socket.c: file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
>
> and the no-pre-content is a flag combination (to save FMODE_ flags)
> which makes things a bit messy.
>
> We could try to initialize f_mode to FMODE_NONOTIFY_PERM
> for anon_inode, which opts out of both permission and pre-content
> events and leaves the legacy inotify workloads unaffected.
>
> But, then code like this will not do the right thing:
>
> /* We refuse fsnotify events on ptmx, since it's a shared resource */
> filp->f_mode |= FMODE_NONOTIFY;
>
> We will need to convert all those to use a helper.
> I am traveling today so will be able to look closer tomorrow.
>
> Jan,
>
> What do you think?
>
> Amir.
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-02 10:04 ` Christian Brauner
@ 2025-02-03 12:41 ` Jan Kara
2025-02-03 20:39 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2025-02-03 12:41 UTC (permalink / raw)
To: Christian Brauner
Cc: Amir Goldstein, Linus Torvalds, Peter Xu, Alex Williamson,
Josef Bacik, kernel-team, linux-fsdevel, jack, viro, linux-xfs,
linux-btrfs, linux-mm, linux-ext4, linux-kernel, kvm
On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > they are character fds. They are regular files afaict from:
> > > >
> > > > vfio_device_open_file(struct vfio_device *device)
> > > >
> > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > mode at all but that's beside the point.)?
> > > >
> > > > In any case, I think you're right that such files would (accidentally?)
> > > > qualify for content watches afaict. So at least that should probably get
> > > > FMODE_NONOTIFY.
> > >
> > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > sanely have pre-content watches on anon-inodes, since you can't really
> > > have access to them to _set_ the content watch from outside anyway..
> > >
> > > In fact, maybe do it in alloc_file_pseudo()?
> > >
> >
> > The problem is that we cannot set FMODE_NONOTIFY -
> > we tried that once but it regressed some workloads watching
> > write on pipe fd or something.
>
> Ok, that might be true. But I would assume that most users of
> alloc_file_pseudo() or the anonymous inode infrastructure will not care
> about fanotify events. I would not go for a separate helper. It'd be
> nice to keep the number of file allocation functions low.
>
> I'd rather have the subsystems that want it explicitly opt-in to
> fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> broken fanotify support for e.g., nsfs already. So make the subsystems
> think about whether they actually want to support it.
Agreed, that would be a saner default.
> I would disqualify all anonymous inodes and see what actually does
> break. I naively suspect that almost no one uses anonymous inodes +
> fanotify. I'd be very surprised.
>
> I'm currently traveling (see you later btw) but from a very cursory
> reading I would naively suspect the following:
>
> // Suspects for FMODE_NONOTIFY
> drivers/dma-buf/dma-buf.c: file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> drivers/scsi/cxlflash/ocxl_hw.c: file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> fs/anon_inodes.c: file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> fs/hugetlbfs/inode.c: file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> kernel/bpf/token.c: file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> block/bdev.c: bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
>
> // Suspects for ~FMODE_NONOTIFY
> fs/aio.c: file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
This is just a helper file for managing aio context so I don't think any
notification makes sense there (events are not well defined). So I'd say
FMODE_NONOTIFY here as well.
> fs/pipe.c: f = alloc_file_pseudo(inode, pipe_mnt, "",
> mm/shmem.c: res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
This is actually used for stuff like IPC SEM where notification doesn't
make sense. It's also used when mmapping /dev/zero but that struct file
isn't easily accessible to userspace so overall I'd say this should be
FMODE_NONOTIFY as well.
> // Unsure:
> fs/nfs/nfs4file.c: filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
AFAICS this struct file is for copy offload and doesn't leave the kernel.
Hence FMODE_NONOTIFY should be fine.
> net/socket.c: file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
In this case I think we need to be careful. It's a similar case to pipes, so
we should probably use ~FMODE_NONOTIFY here out of pure caution.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
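
The call-site classification in the reply above can be sketched as a small
user-space model: every pseudo file starts with notifications disabled, and
only the call sites treated with caution (pipes and sockets) opt back in.
This is only an illustration of the opt-in idea, not kernel code. The
`model_*` names and the flag value are hypothetical, and the real
alloc_file_pseudo() takes different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's FMODE_NONOTIFY bit; value is illustrative. */
#define MODEL_FMODE_NONOTIFY 0x1u

struct model_file {
    const char *origin;    /* call site that allocated the file */
    unsigned int f_mode;
};

/* Proposed default: every pseudo file starts with notifications disabled. */
static struct model_file model_alloc_file_pseudo(const char *origin)
{
    struct model_file f = { origin, MODEL_FMODE_NONOTIFY };
    return f;
}

/* Explicit opt-in: a subsystem that wants events clears the flag again. */
static void model_enable_notify(struct model_file *f)
{
    f->f_mode &= ~MODEL_FMODE_NONOTIFY;
}

static bool model_generates_events(const struct model_file *f)
{
    return !(f->f_mode & MODEL_FMODE_NONOTIFY);
}

/* Allocate a file the way the classification in the reply suggests. */
static struct model_file model_open_from(const char *origin)
{
    struct model_file f = model_alloc_file_pseudo(origin);

    /* Only pipes and sockets keep notifications, out of caution. */
    if (strcmp(origin, "fs/pipe.c") == 0 || strcmp(origin, "net/socket.c") == 0)
        model_enable_notify(&f);
    return f;
}

/* Checks the model against the classification: aio stays silent, pipes do not. */
static void model_self_check(void)
{
    struct model_file aio = model_open_from("fs/aio.c");
    struct model_file pipe_file = model_open_from("fs/pipe.c");

    assert(!model_generates_events(&aio));
    assert(model_generates_events(&pipe_file));
}
```

Calling model_self_check() from a main() exercises the two extremes of the
list. The point of the design is that a newly added call site is silent
until its subsystem decides otherwise.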
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-03 12:41 ` Jan Kara
@ 2025-02-03 20:39 ` Amir Goldstein
2025-02-03 21:41 ` Alex Williamson
0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2025-02-03 20:39 UTC (permalink / raw)
To: Jan Kara, Alex Williamson
Cc: Christian Brauner, Linus Torvalds, Peter Xu, Josef Bacik,
kernel-team, linux-fsdevel, viro, linux-xfs, linux-btrfs,
linux-mm, linux-ext4, linux-kernel, kvm
On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > they are character fds. They are regular files afaict from:
> > > > >
> > > > > vfio_device_open_file(struct vfio_device *device)
> > > > >
> > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > mode at all but that's beside the point.)?
> > > > >
> > > > > In any case, I think you're right that such files would (accidentally?)
> > > > > qualify for content watches afaict. So at least that should probably get
> > > > > FMODE_NONOTIFY.
> > > >
> > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > have access to them to _set_ the content watch from outside anyway..
> > > >
> > > > In fact, maybe do it in alloc_file_pseudo()?
> > > >
> > >
> > > The problem is that we cannot set FMODE_NONOTIFY -
> > > we tried that once but it regressed some workloads watching
> > > write on pipe fd or something.
> >
> > Ok, that might be true. But I would assume that most users of
> > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > about fanotify events. I would not go for a separate helper. It'd be
> > nice to keep the number of file allocation functions low.
> >
> > I'd rather have the subsystems that want it explicitly opt-in to
> > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > broken fanotify support for e.g., nsfs already. So make the subsystems
> > think about whether they actually want to support it.
>
> Agreed, that would be a saner default.
>
> > I would disqualify all anonymous inodes and see what actually does
> > break. I naively suspect that almost no one uses anonymous inodes +
> > fanotify. I'd be very surprised.
> >
> > I'm currently traveling (see you later btw) but from a very cursory
> > reading I would naively suspect the following:
> >
> > // Suspects for FMODE_NONOTIFY
> > drivers/dma-buf/dma-buf.c: file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > drivers/scsi/cxlflash/ocxl_hw.c: file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > fs/anon_inodes.c: file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > fs/hugetlbfs/inode.c: file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > kernel/bpf/token.c: file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > block/bdev.c: bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> >
> > // Suspects for ~FMODE_NONOTIFY
> > fs/aio.c: file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
>
> This is just a helper file for managing aio context so I don't think any
> notification makes sense there (events are not well defined). So I'd say
> FMODE_NONOTIFY here as well.
>
> > fs/pipe.c: f = alloc_file_pseudo(inode, pipe_mnt, "",
> > mm/shmem.c: res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
>
> This is actually used for stuff like IPC SEM where notification doesn't
> make sense. It's also used when mmapping /dev/zero but that struct file
> isn't easily accessible to userspace so overall I'd say this should be
> FMODE_NONOTIFY as well.

I think there is another code path that the audit missed for getting these
pseudo files not via alloc_file_pseudo():

ipc/shm.c: file = alloc_file_clone(base, f_flags,

which does not copy f_mode as far as I can tell.

>
> > // Unsure:
> > fs/nfs/nfs4file.c: filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
>
> AFAICS this struct file is for copy offload and doesn't leave the kernel.
> Hence FMODE_NONOTIFY should be fine.
>
> > net/socket.c: file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
>
> In this case I think we need to be careful. It's a similar case as pipes so
> probably we should use ~FMODE_NONOTIFY here from pure caution.
>
I tried this approach with patch:
"fsnotify: disable notification by default for all pseudo files"

But I also added another patch:
"fsnotify: disable pre-content and permission events by default"

So that code paths that we missed such as alloc_file_clone()
will not have pre-content events enabled.

Alex,

Can you please try this branch:

https://github.com/amir73il/linux/commits/fsnotify-fixes/

and verify that it fixes your issue.

The branch contains one prep patch:
"fsnotify: use accessor to set FMODE_NONOTIFY_*"
and two independent Fixes patches.

Assuming that it fixes your issue, can you please test each of the
Fixes patches individually, because every one of them should be fixing
the issue independently and every one of them could break something,
so we may end up reverting it later on.

Thanks,
Amir.
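
The reasoning behind carrying two patches can be sketched as two layers of
defaults. This is a simplified user-space model under stated assumptions:
the two bits below are hypothetical stand-ins for FMODE_NONOTIFY and
FMODE_NONOTIFY_PERM, the `model_*` helpers do not exist in the kernel, and
the real f_mode encoding may combine these bits differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bits; the values and exact semantics are illustrative only. */
#define MODEL_NONOTIFY      0x1u  /* no fsnotify events at all */
#define MODEL_NONOTIFY_PERM 0x2u  /* no permission or pre-content events */

struct model_file {
    unsigned int f_mode;
};

/* Second fix: blocking events are off for every new file until something opts in. */
static struct model_file model_alloc_base(void)
{
    struct model_file f = { MODEL_NONOTIFY_PERM };
    return f;
}

/* First fix: pseudo files are additionally silenced completely. */
static struct model_file model_alloc_pseudo(void)
{
    struct model_file f = model_alloc_base();

    f.f_mode |= MODEL_NONOTIFY;
    return f;
}

/* A path the audit missed (such as a clone helper): only the base default applies. */
static struct model_file model_alloc_clone(void)
{
    return model_alloc_base();
}

/* Opening a regular file that has a permission-style watch opts it in. */
static void model_open_watched(struct model_file *f)
{
    f->f_mode &= ~MODEL_NONOTIFY_PERM;
}

static bool model_precontent_possible(const struct model_file *f)
{
    return !(f->f_mode & (MODEL_NONOTIFY | MODEL_NONOTIFY_PERM));
}

/* The missed path is still covered by the base default. */
static void model_self_check(void)
{
    struct model_file clone = model_alloc_clone();

    assert(!model_precontent_possible(&clone));
}
```

In this model a file obtained through the missed path never qualifies for
pre-content events, because the base default already excludes it. Only a
file that goes through model_open_watched() does.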
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-03 20:39 ` Amir Goldstein
@ 2025-02-03 21:41 ` Alex Williamson
2025-02-03 22:04 ` Amir Goldstein
0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2025-02-03 21:41 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Christian Brauner, Linus Torvalds, Peter Xu,
Josef Bacik, kernel-team, linux-fsdevel, viro, linux-xfs,
linux-btrfs, linux-mm, linux-ext4, linux-kernel, kvm
On Mon, 3 Feb 2025 21:39:27 +0100
Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> > > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > > > >
> > > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > > they are character fds. They are regular files afaict from:
> > > > > >
> > > > > > vfio_device_open_file(struct vfio_device *device)
> > > > > >
> > > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > > mode at all but that's beside the point.)?
> > > > > >
> > > > > > > In any case, I think you're right that such files would (accidentally?)
> > > > > > qualify for content watches afaict. So at least that should probably get
> > > > > > FMODE_NONOTIFY.
> > > > >
> > > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > > have access to them to _set_ the content watch from outside anyway..
> > > > >
> > > > > In fact, maybe do it in alloc_file_pseudo()?
> > > > >
> > > >
> > > > The problem is that we cannot set FMODE_NONOTIFY -
> > > > we tried that once but it regressed some workloads watching
> > > > write on pipe fd or something.
> > >
> > > Ok, that might be true. But I would assume that most users of
> > > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > > about fanotify events. I would not go for a separate helper. It'd be
> > > nice to keep the number of file allocation functions low.
> > >
> > > I'd rather have the subsystems that want it explicitly opt-in to
> > > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > > broken fanotify support for e.g., nsfs already. So make the subsystems
> > > think about whether they actually want to support it.
> >
> > Agreed, that would be a saner default.
> >
> > > I would disqualify all anonymous inodes and see what actually does
> > > break. I naively suspect that almost no one uses anonymous inodes +
> > > fanotify. I'd be very surprised.
> > >
> > > I'm currently traveling (see you later btw) but from a very cursory
> > > reading I would naively suspect the following:
> > >
> > > // Suspects for FMODE_NONOTIFY
> > > drivers/dma-buf/dma-buf.c: file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > > drivers/scsi/cxlflash/ocxl_hw.c: file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > > fs/anon_inodes.c: file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > > fs/hugetlbfs/inode.c: file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > > kernel/bpf/token.c: file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > > block/bdev.c: bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> > >
> > > // Suspects for ~FMODE_NONOTIFY
> > > fs/aio.c: file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
> >
> > This is just a helper file for managing aio context so I don't think any
> > notification makes sense there (events are not well defined). So I'd say
> > FMODE_NONOTIFY here as well.
> >
> > > fs/pipe.c: f = alloc_file_pseudo(inode, pipe_mnt, "",
> > > mm/shmem.c: res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> >
> > This is actually used for stuff like IPC SEM where notification doesn't
> > make sense. It's also used when mmapping /dev/zero but that struct file
> > isn't easily accessible to userspace so overall I'd say this should be
> > FMODE_NONOTIFY as well.
>
> I think there is another code path that the audit missed for getting these
> pseudo files not via alloc_file_pseudo():
> ipc/shm.c: file = alloc_file_clone(base, f_flags,
>
> which does not copy f_mode as far as I can tell.
>
> >
> > > // Unsure:
> > > fs/nfs/nfs4file.c: filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
> >
> > AFAICS this struct file is for copy offload and doesn't leave the kernel.
> > Hence FMODE_NONOTIFY should be fine.
> >
> > > net/socket.c: file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
> >
> > In this case I think we need to be careful. It's a similar case as pipes so
> > probably we should use ~FMODE_NONOTIFY here from pure caution.
> >
>
> I tried this approach with patch:
> "fsnotify: disable notification by default for all pseudo files"
>
> But I also added another patch:
> "fsnotify: disable pre-content and permission events by default"
>
> So that code paths that we missed such as alloc_file_clone()
> will not have pre-content events enabled.
>
> Alex,
>
> Can you please try this branch:
>
> https://github.com/amir73il/linux/commits/fsnotify-fixes/
>
> and verify that it fixes your issue.
>
> The branch contains one prep patch:
> "fsnotify: use accessor to set FMODE_NONOTIFY_*"
> and two independent Fixes patches.
>
> Assuming that it fixes your issue, can you please test each of the
> Fixes patches individually, because every one of them should be fixing
> the issue independently and every one of them could break something,
> so we may end up reverting it later on.
Test #1:

fsnotify: disable pre-content and permission events by default
fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #2:

fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #3:

fsnotify: disable pre-content and permission events by default
fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Pass, vfio-pci huge_fault observed

Test #4 (control):

fsnotify: use accessor to set FMODE_NONOTIFY_*

Result: Fail, no vfio-pci huge_fault observed

For any combination of the Fixes patches:

Tested-by: Alex Williamson <alex.williamson@redhat.com>

Thanks!
Alex
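
The four results above are what one would expect if each Fixes patch
independently stops the device file from qualifying for pre-content
watches. A minimal sketch of that expectation follows. It is a user-space
model only: the `model_*` names are hypothetical, and it makes no claim
about how the kernel's fault path is actually written.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the two Fixes patches a test build carries (model only). */
struct model_build {
    bool pseudo_files_silent;   /* "disable notification by default for all pseudo files" */
    bool perm_off_by_default;   /* "disable pre-content and permission events by default" */
};

/*
 * A device file backed by an anonymous inode qualifies for pre-content
 * watches only if neither fix excludes it.
 */
static bool model_device_file_qualifies(struct model_build b)
{
    return !b.pseudo_files_silent && !b.perm_off_by_default;
}

/* Huge faults are refused for files that may carry pre-content watches. */
static bool model_huge_fault_observed(struct model_build b)
{
    return !model_device_file_qualifies(b);
}

/* Encodes the four reported test outcomes as assertions on the model. */
static void model_check_matrix(void)
{
    struct model_build both    = { true,  true  };  /* Test #1 */
    struct model_build pseudo  = { true,  false };  /* Test #2 */
    struct model_build perm    = { false, true  };  /* Test #3 */
    struct model_build control = { false, false };  /* Test #4 */

    assert(model_huge_fault_observed(both));
    assert(model_huge_fault_observed(pseudo));
    assert(model_huge_fault_observed(perm));
    assert(!model_huge_fault_observed(control));
}
```

Calling model_check_matrix() from a main() passes silently when the model
agrees with the reported results.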
* Re: [REGRESSION] Re: [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches
2025-02-03 21:41 ` Alex Williamson
@ 2025-02-03 22:04 ` Amir Goldstein
0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2025-02-03 22:04 UTC (permalink / raw)
To: Alex Williamson
Cc: Jan Kara, Christian Brauner, Linus Torvalds, Peter Xu,
Josef Bacik, kernel-team, linux-fsdevel, viro, linux-xfs,
linux-btrfs, linux-mm, linux-ext4, linux-kernel, kvm
On Mon, Feb 3, 2025 at 10:41 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Mon, 3 Feb 2025 21:39:27 +0100
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > On Mon, Feb 3, 2025 at 1:41 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sun 02-02-25 11:04:02, Christian Brauner wrote:
> > > > On Sun, Feb 02, 2025 at 08:46:21AM +0100, Amir Goldstein wrote:
> > > > > On Sun, Feb 2, 2025 at 1:58 AM Linus Torvalds
> > > > > <torvalds@linux-foundation.org> wrote:
> > > > > >
> > > > > > On Sat, 1 Feb 2025 at 06:38, Christian Brauner <brauner@kernel.org> wrote:
> > > > > > >
> > > > > > > Ok, but those "device fds" aren't really device fds in the sense that
> > > > > > > they are character fds. They are regular files afaict from:
> > > > > > >
> > > > > > > vfio_device_open_file(struct vfio_device *device)
> > > > > > >
> > > > > > > (Well, it's actually worse as anon_inode_getfile() files don't have any
> > > > > > > mode at all but that's beside the point.)?
> > > > > > >
> > > > > > > > In any case, I think you're right that such files would (accidentally?)
> > > > > > > qualify for content watches afaict. So at least that should probably get
> > > > > > > FMODE_NONOTIFY.
> > > > > >
> > > > > > Hmm. Can we just make all anon_inodes do that? I don't think you can
> > > > > > sanely have pre-content watches on anon-inodes, since you can't really
> > > > > > have access to them to _set_ the content watch from outside anyway..
> > > > > >
> > > > > > In fact, maybe do it in alloc_file_pseudo()?
> > > > > >
> > > > >
> > > > > The problem is that we cannot set FMODE_NONOTIFY -
> > > > > we tried that once but it regressed some workloads watching
> > > > > write on pipe fd or something.
> > > >
> > > > Ok, that might be true. But I would assume that most users of
> > > > alloc_file_pseudo() or the anonymous inode infrastructure will not care
> > > > about fanotify events. I would not go for a separate helper. It'd be
> > > > nice to keep the number of file allocation functions low.
> > > >
> > > > I'd rather have the subsystems that want it explicitly opt-in to
> > > > fanotify watches, i.e., remove FMODE_NONOTIFY. Because right now we have
> > > > broken fanotify support for e.g., nsfs already. So make the subsystems
> > > > think about whether they actually want to support it.
> > >
> > > Agreed, that would be a saner default.
> > >
> > > > I would disqualify all anonymous inodes and see what actually does
> > > > break. I naively suspect that almost no one uses anonymous inodes +
> > > > fanotify. I'd be very surprised.
> > > >
> > > > I'm currently traveling (see you later btw) but from a very cursory
> > > > reading I would naively suspect the following:
> > > >
> > > > // Suspects for FMODE_NONOTIFY
> > > > drivers/dma-buf/dma-buf.c: file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > > > drivers/misc/cxl/api.c: file = alloc_file_pseudo(inode, cxl_vfs_mount, name,
> > > > drivers/scsi/cxlflash/ocxl_hw.c: file = alloc_file_pseudo(inode, ocxlflash_vfs_mount, name,
> > > > fs/anon_inodes.c: file = alloc_file_pseudo(inode, anon_inode_mnt, name,
> > > > fs/hugetlbfs/inode.c: file = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > > > kernel/bpf/token.c: file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > > mm/secretmem.c: file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > > > block/bdev.c: bdev_file = alloc_file_pseudo_noaccount(BD_INODE(bdev),
> > > > drivers/tty/pty.c: static int ptmx_open(struct inode *inode, struct file *filp)
> > > >
> > > > // Suspects for ~FMODE_NONOTIFY
> > > > fs/aio.c: file = alloc_file_pseudo(inode, aio_mnt, "[aio]",
> > >
> > > This is just a helper file for managing aio context so I don't think any
> > > notification makes sense there (events are not well defined). So I'd say
> > > FMODE_NONOTIFY here as well.
> > >
> > > > fs/pipe.c: f = alloc_file_pseudo(inode, pipe_mnt, "",
> > > > mm/shmem.c: res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
> > >
> > > This is actually used for stuff like IPC SEM where notification doesn't
> > > make sense. It's also used when mmapping /dev/zero but that struct file
> > > isn't easily accessible to userspace so overall I'd say this should be
> > > FMODE_NONOTIFY as well.
> >
> > I think there is another code path that the audit missed for getting these
> > pseudo files not via alloc_file_pseudo():
> > ipc/shm.c: file = alloc_file_clone(base, f_flags,
> >
> > which does not copy f_mode as far as I can tell.
> >
> > >
> > > > // Unsure:
> > > > fs/nfs/nfs4file.c: filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, O_RDONLY,
> > >
> > > AFAICS this struct file is for copy offload and doesn't leave the kernel.
> > > Hence FMODE_NONOTIFY should be fine.
> > >
> > > > net/socket.c: file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
> > >
> > > In this case I think we need to be careful. It's a similar case as pipes so
> > > probably we should use ~FMODE_NONOTIFY here from pure caution.
> > >
> >
> > I tried this approach with patch:
> > "fsnotify: disable notification by default for all pseudo files"
> >
> > But I also added another patch:
> > "fsnotify: disable pre-content and permission events by default"
> >
> > So that code paths that we missed such as alloc_file_clone()
> > will not have pre-content events enabled.
> >
> > Alex,
> >
> > Can you please try this branch:
> >
> > https://github.com/amir73il/linux/commits/fsnotify-fixes/
> >
> > and verify that it fixes your issue.
> >
> > The branch contains one prep patch:
> > "fsnotify: use accessor to set FMODE_NONOTIFY_*"
> > and two independent Fixes patches.
> >
> > Assuming that it fixes your issue, can you please test each of the
> > Fixes patches individually, because every one of them should be fixing
> > the issue independently and every one of them could break something,
> > so we may end up reverting it later on.
>
> Test #1:
>
> fsnotify: disable pre-content and permission events by default
> fsnotify: disable notification by default for all pseudo files
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #2:
>
> fsnotify: disable notification by default for all pseudo files
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #3:
>
> fsnotify: disable pre-content and permission events by default
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Pass, vfio-pci huge_fault observed
>
> Test #4 (control):
>
> fsnotify: use accessor to set FMODE_NONOTIFY_*
>
> Result: Fail, no vfio-pci huge_fault observed
>
> For any combination of the Fixes patches:
>
> Tested-by: Alex Williamson <alex.williamson@redhat.com>
>
That was fast.
I will post the patches.
Thanks!
Amir.
Thread overview: 69+ messages
2024-11-15 15:30 [PATCH v8 00/19] fanotify: add pre-content hooks Josef Bacik
2024-11-15 15:30 ` [PATCH v8 01/19] fs: get rid of __FMODE_NONOTIFY kludge Josef Bacik
2024-11-18 18:14 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 02/19] fsnotify: opt-in for permission events at file open time Josef Bacik
2024-11-20 15:53 ` Jan Kara
2024-11-20 16:12 ` Amir Goldstein
2024-11-21 9:39 ` Jan Kara
2024-11-21 10:09 ` Christian Brauner
2024-11-21 11:04 ` Amir Goldstein
2024-11-21 11:16 ` Jan Kara
2024-11-21 11:32 ` Amir Goldstein
2024-11-21 9:45 ` Christian Brauner
2024-11-21 11:39 ` Amir Goldstein
2024-11-15 15:30 ` [PATCH v8 03/19] fsnotify: add helper to check if file is actually being watched Josef Bacik
2024-11-20 16:02 ` Jan Kara
2024-11-20 16:42 ` Amir Goldstein
2024-11-21 8:54 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 04/19] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
2024-11-15 15:30 ` [PATCH v8 05/19] fanotify: rename a misnamed constant Josef Bacik
2024-11-15 15:30 ` [PATCH v8 06/19] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
2024-11-15 15:30 ` [PATCH v8 07/19] fsnotify: introduce pre-content permission events Josef Bacik
2024-11-15 15:30 ` [PATCH v8 08/19] fsnotify: pass optional file access range in pre-content event Josef Bacik
2024-11-15 15:30 ` [PATCH v8 09/19] fsnotify: generate pre-content permission event on truncate Josef Bacik
2024-11-20 15:23 ` Jan Kara
2024-11-20 15:57 ` Amir Goldstein
2024-11-20 16:16 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 10/19] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
2024-11-15 15:59 ` Amir Goldstein
2024-11-21 10:44 ` Jan Kara
2024-11-21 14:18 ` Amir Goldstein
2024-11-21 16:36 ` Jan Kara
2024-11-21 18:31 ` Amir Goldstein
2024-11-21 18:37 ` Amir Goldstein
2024-11-22 12:42 ` Jan Kara
2024-11-22 13:51 ` Amir Goldstein
2024-11-27 12:18 ` Jan Kara
2024-11-27 12:20 ` Amir Goldstein
2024-11-15 15:30 ` [PATCH v8 11/19] fanotify: report file range info with pre-content events Josef Bacik
2024-11-15 15:30 ` [PATCH v8 12/19] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
2024-11-15 15:30 ` [PATCH v8 13/19] fanotify: add a helper to check for pre content events Josef Bacik
2024-11-20 15:44 ` Jan Kara
2024-11-20 16:43 ` Amir Goldstein
2024-11-15 15:30 ` [PATCH v8 14/19] fanotify: disable readahead if we have pre-content watches Josef Bacik
2024-11-15 15:30 ` [PATCH v8 15/19] mm: don't allow huge faults for files with pre content watches Josef Bacik
2025-01-31 19:17 ` [REGRESSION] " Alex Williamson
2025-01-31 19:59 ` Linus Torvalds
2025-02-01 1:19 ` Peter Xu
2025-02-01 14:38 ` Christian Brauner
2025-02-02 0:58 ` Linus Torvalds
2025-02-02 7:46 ` Amir Goldstein
2025-02-02 10:04 ` Christian Brauner
2025-02-03 12:41 ` Jan Kara
2025-02-03 20:39 ` Amir Goldstein
2025-02-03 21:41 ` Alex Williamson
2025-02-03 22:04 ` Amir Goldstein
2024-11-15 15:30 ` [PATCH v8 16/19] fsnotify: generate pre-content permission event on page fault Josef Bacik
2024-12-08 16:58 ` Klara Modin
2024-12-09 10:45 ` Aithal, Srikanth
2024-12-09 12:34 ` Jan Kara
2024-12-09 12:31 ` Jan Kara
2024-12-09 12:56 ` Klara Modin
2024-12-09 14:16 ` Jan Kara
2024-12-10 21:12 ` Randy Dunlap
2024-12-11 16:30 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 17/19] xfs: add pre-content fsnotify hook for write faults Josef Bacik
2024-11-21 10:22 ` Jan Kara
2024-11-15 15:30 ` [PATCH v8 18/19] btrfs: disable defrag on pre-content watched files Josef Bacik
2024-11-15 15:30 ` [PATCH v8 19/19] fs: enable pre-content events on supported file systems Josef Bacik
2024-11-21 11:29 ` [PATCH v8 00/19] fanotify: add pre-content hooks Jan Kara