* [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 19:45 ` Linus Torvalds
2024-11-12 17:55 ` [PATCH v7 02/18] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
` (16 subsequent siblings)
17 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Legacy inotify/fanotify listeners can add watches for events on inode,
parent or mount and expect to get events (e.g. FS_MODIFY) on files that
were already open at the time of setting up the watches.
fanotify permission events are typically used by Anti-malware sofware,
that is watching the entire mount and it is not common to have more that
one Anti-malware engine installed on a system.
To reduce the overhead of the fsnotify_file_perm() hooks on every file
access, relax the semantics of the legacy FAN_OPEN_PERM event to generate
events only if there were *any* permission event listeners on the
filesystem at the time that the file was open.
The new semantics, implemented with the opt-in FMODE_NOTIFY_PERM flag
are also going to apply to the new fanotify pre-content event in order
to reduce the cost of the pre-content event vfs hooks.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
include/linux/fs.h | 3 ++-
include/linux/fsnotify.h | 47 ++++++++++++++++++++++++++++------------
2 files changed, 35 insertions(+), 15 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9c13222362f5..9b58e9887e4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,7 +173,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
-/* FMODE_* bit 24 */
+/* File may generate fanotify access permission events */
+#define FMODE_NOTIFY_PERM ((__force fmode_t)(1 << 24))
/* File is embedded in backing_file object */
#define FMODE_BACKING ((__force fmode_t)(1 << 25))
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..f0fd3dcae654 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -108,10 +108,9 @@ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask)
fsnotify_parent(dentry, mask, dentry, FSNOTIFY_EVENT_DENTRY);
}
-static inline int fsnotify_file(struct file *file, __u32 mask)
+/* Should events be generated on this open file regardless of watches? */
+static inline bool fsnotify_file_watchable(struct file *file, __u32 mask)
{
- const struct path *path;
-
/*
* FMODE_NONOTIFY are fds generated by fanotify itself which should not
* generate new events. We also don't want to generate events for
@@ -119,14 +118,37 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
* handle creation / destruction events and not "real" file events.
*/
if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
+ return false;
+
+ /* Permission events require that watches are set before FS_OPEN_PERM */
+ if (mask & ALL_FSNOTIFY_PERM_EVENTS & ~FS_OPEN_PERM &&
+ !(file->f_mode & FMODE_NOTIFY_PERM))
+ return false;
+
+ return true;
+}
+
+static inline int fsnotify_file(struct file *file, __u32 mask)
+{
+ const struct path *path;
+
+ if (!fsnotify_file_watchable(file, mask))
return 0;
path = &file->f_path;
- /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
- if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
- !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
- FSNOTIFY_PRIO_CONTENT))
- return 0;
+ /*
+ * Permission events require group prio >= FSNOTIFY_PRIO_CONTENT.
+ * Unless permission event watchers exist at FS_OPEN_PERM time,
+ * operations on file will not be generating any permission events.
+ */
+ if (mask & ALL_FSNOTIFY_PERM_EVENTS) {
+ if (!fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
+ FSNOTIFY_PRIO_CONTENT))
+ return 0;
+
+ if (mask & FS_OPEN_PERM)
+ file->f_mode |= FMODE_NOTIFY_PERM;
+ }
return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
}
@@ -166,15 +188,12 @@ static inline int fsnotify_file_perm(struct file *file, int perm_mask)
*/
static inline int fsnotify_open_perm(struct file *file)
{
- int ret;
+ int ret = fsnotify_file(file, FS_OPEN_PERM);
- if (file->f_flags & __FMODE_EXEC) {
+ if (!ret && file->f_flags & __FMODE_EXEC)
ret = fsnotify_file(file, FS_OPEN_EXEC_PERM);
- if (ret)
- return ret;
- }
- return fsnotify_file(file, FS_OPEN_PERM);
+ return ret;
}
#else
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time
2024-11-12 17:55 ` [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time Josef Bacik
@ 2024-11-12 19:45 ` Linus Torvalds
2024-11-12 22:37 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-12 19:45 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
>
> @@ -119,14 +118,37 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
> * handle creation / destruction events and not "real" file events.
> */
> if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
> + return false;
> +
> + /* Permission events require that watches are set before FS_OPEN_PERM */
> + if (mask & ALL_FSNOTIFY_PERM_EVENTS & ~FS_OPEN_PERM &&
> + !(file->f_mode & FMODE_NOTIFY_PERM))
> + return false;
This still all looks very strange.
As far as I can tell, there is exactly one user of FS_OPEN_PERM in
'mask', and that's fsnotify_open_perm(). Which is called in exactly
one place: security_file_open(), which is the wrong place to call it
anyway and is the only place where fsnotify is called from the
security layer.
In fact, that looks like an active bug: if you enable FSNOTIFY, but
you *don't* enable CONFIG_SECURITY, the whole fsnotify_open_perm()
will never be called at all.
And I just verified that yes, you can very much generate such a config.
So the whole FS_OPEN_PERM thing looks like a special case, called from
a (broken) special place, and now polluting this "fsnotify_file()"
logic for no actual reason and making it all look unnecessarily messy.
I'd suggest that the whole fsnotify_open_perm() simply be moved to
where it *should* be - in the open path - and not make a bad and
broken attempt at hiding inside the security layer, and not use this
"fsnotify_file()" logic at all.
The open-time logic is different. It shouldn't even attempt - badly -
to look like it's the same thing as some regular file access.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time
2024-11-12 19:45 ` Linus Torvalds
@ 2024-11-12 22:37 ` Amir Goldstein
0 siblings, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-12 22:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 8:46 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > @@ -119,14 +118,37 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
> > * handle creation / destruction events and not "real" file events.
> > */
> > if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
> > + return false;
> > +
> > + /* Permission events require that watches are set before FS_OPEN_PERM */
> > + if (mask & ALL_FSNOTIFY_PERM_EVENTS & ~FS_OPEN_PERM &&
> > + !(file->f_mode & FMODE_NOTIFY_PERM))
> > + return false;
>
> This still all looks very strange.
>
> As far as I can tell, there is exactly one user of FS_OPEN_PERM in
> 'mask', and that's fsnotify_open_perm(). Which is called in exactly
> one place: security_file_open(), which is the wrong place to call it
> anyway and is the only place where fsnotify is called from the
> security layer.
>
> In fact, that looks like an active bug: if you enable FSNOTIFY, but
> you *don't* enable CONFIG_SECURITY, the whole fsnotify_open_perm()
> will never be called at all.
>
> And I just verified that yes, you can very much generate such a config.
>
See: 1cda52f1b461 fsnotify, lsm: Decouple fsnotify from lsm
in linux-next. This patch set is based on the fs-next branch.
> So the whole FS_OPEN_PERM thing looks like a special case, called from
> a (broken) special place, and now polluting this "fsnotify_file()"
> logic for no actual reason and making it all look unnecessarily messy.
>
> I'd suggest that the whole fsnotify_open_perm() simply be moved to
> where it *should* be - in the open path - and not make a bad and
> broken attempt at hiding inside the security layer, and not use this
> "fsnotify_file()" logic at all.
>
> The open-time logic is different. It shouldn't even attempt - badly -
> to look like it's the same thing as some regular file access.
>
OK, we can move setting the FMODE_NOTIFY_PERM to the open path.
I have considered that it may be better to unhide it, but wasn't sure.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v7 02/18] fanotify: don't skip extra event info if no info_mode is set
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
2024-11-12 17:55 ` [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 03/18] fanotify: rename a misnamed constant Josef Bacik
` (15 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Previously we would only include optional information if you requested
it via an FAN_ flag at fanotify_init time (FAN_REPORT_FID for example).
However this isn't necessary as the event length is encoded in the
metadata, and if the user doesn't want to consume the information they
don't have to. With the PRE_ACCESS events we will always generate range
information, so drop this check in order to allow this extra
information to be exported without needing to have another flag.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fanotify/fanotify_user.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 2d85c71717d6..8528c1bfee7d 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -159,9 +159,6 @@ static size_t fanotify_event_len(unsigned int info_mode,
int fh_len;
int dot_len = 0;
- if (!info_mode)
- return event_len;
-
if (fanotify_is_error_event(event->mask))
event_len += FANOTIFY_ERROR_INFO_LEN;
@@ -756,12 +753,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
buf += FAN_EVENT_METADATA_LEN;
count -= FAN_EVENT_METADATA_LEN;
- if (info_mode) {
- ret = copy_info_records_to_user(event, info, info_mode, pidfd,
- buf, count);
- if (ret < 0)
- goto out_close_fd;
- }
+ ret = copy_info_records_to_user(event, info, info_mode, pidfd,
+ buf, count);
+ if (ret < 0)
+ goto out_close_fd;
if (f)
fd_install(fd, f);
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 03/18] fanotify: rename a misnamed constant
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
2024-11-12 17:55 ` [PATCH v7 01/18] fsnotify: opt-in for permission events at file_open_perm() time Josef Bacik
2024-11-12 17:55 ` [PATCH v7 02/18] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 04/18] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
` (14 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
FANOTIFY_PIDFD_INFO_HDR_LEN is not the length of the header.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify_user.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 8528c1bfee7d..9cc4a9ac1515 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -118,7 +118,7 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
#define FANOTIFY_EVENT_ALIGN 4
#define FANOTIFY_FID_INFO_HDR_LEN \
(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
-#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+#define FANOTIFY_PIDFD_INFO_LEN \
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
@@ -173,14 +173,14 @@ static size_t fanotify_event_len(unsigned int info_mode,
dot_len = 1;
}
- if (info_mode & FAN_REPORT_PIDFD)
- event_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
-
if (fanotify_event_has_object_fh(event)) {
fh_len = fanotify_event_object_fh_len(event);
event_len += fanotify_fid_info_len(fh_len, dot_len);
}
+ if (info_mode & FAN_REPORT_PIDFD)
+ event_len += FANOTIFY_PIDFD_INFO_LEN;
+
return event_len;
}
@@ -503,7 +503,7 @@ static int copy_pidfd_info_to_user(int pidfd,
size_t count)
{
struct fanotify_event_info_pidfd info = { };
- size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+ size_t info_len = FANOTIFY_PIDFD_INFO_LEN;
if (WARN_ON_ONCE(info_len > count))
return -EFAULT;
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 04/18] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (2 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 03/18] fanotify: rename a misnamed constant Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 05/18] fsnotify: introduce pre-content permission events Josef Bacik
` (13 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Avoid reusing it, because we would like to reserve it for future
FAN_PATH_MODIFY pre-content event.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
include/linux/fsnotify_backend.h | 1 +
include/uapi/linux/fanotify.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 3ecf7768e577..53d5d0e02943 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -55,6 +55,7 @@
#define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
#define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
+/* #define FS_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
/*
* Set on inode mark that cares about things that happen to its children.
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 34f221d3a1b9..79072b6894f2 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -25,6 +25,7 @@
#define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
+/* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (3 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 04/18] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 20:12 ` Linus Torvalds
2024-11-12 17:55 ` [PATCH v7 06/18] fsnotify: pass optional file access range in pre-content event Josef Bacik
` (12 subsequent siblings)
17 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.
Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, so content
scanners could inspect the content filled by pre-content event handler.
Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is
modified by syscalls as write() and fallocate().
FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.
The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.
There are some specific requirements from filesystems that could
be used with pre-content events, so add a flag for fs to opt-in
for pre-content events explicitly before they can be used.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fsnotify.c | 2 +-
include/linux/fs.h | 1 +
include/linux/fsnotify.h | 37 ++++++++++++++++++++++++++++----
include/linux/fsnotify_backend.h | 12 +++++++++--
security/selinux/hooks.c | 3 ++-
5 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 316eec309299..cab5a1a16e57 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -626,7 +626,7 @@ static __init int fsnotify_init(void)
{
int ret;
- BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 24);
ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9b58e9887e4b..ee0637fcb197 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1232,6 +1232,7 @@ extern int send_sigurg(struct file *file);
#define SB_I_RETIRED 0x00000800 /* superblock shouldn't be reused */
#define SB_I_NOUMASK 0x00001000 /* VFS does not apply umask */
#define SB_I_NOIDMAP 0x00002000 /* No idmapped mounts on this superblock */
+#define SB_I_ALLOW_HSM 0x00004000 /* Allow HSM events on this superblock */
/* Possible states of 'frozen' field */
enum {
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f0fd3dcae654..0f44cd60ac9a 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -154,14 +154,29 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+static inline int fsnotify_pre_content(struct file *file)
+{
+ struct inode *inode = file_inode(file);
+
+ /*
+ * Pre-content events are only reported for regular files and dirs
+ * if there are any pre-content event watchers on this sb.
+ */
+ if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
+ !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
+ !fsnotify_sb_has_priority_watchers(inode->i_sb,
+ FSNOTIFY_PRIO_PRE_CONTENT))
+ return 0;
+
+ return fsnotify_file(file, FS_PRE_ACCESS);
+}
+
/*
- * fsnotify_file_area_perm - permission hook before access to file range
+ * fsnotify_file_area_perm - permission hook before access of file range
*/
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
- __u32 fsnotify_mask = FS_ACCESS_PERM;
-
/*
* filesystem may be modified in the context of permission events
* (e.g. by HSM filling a file on access), so sb freeze protection
@@ -169,10 +184,24 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
*/
lockdep_assert_once(file_write_not_started(file));
+ /*
+ * read()/write and other types of access generate pre-content events.
+ */
+ if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+ int ret = fsnotify_pre_content(file);
+
+ if (ret)
+ return ret;
+ }
+
if (!(perm_mask & MAY_READ))
return 0;
- return fsnotify_file(file, fsnotify_mask);
+ /*
+ * read() also generates the legacy FS_ACCESS_PERM event, so content
+ * scanners can inspect the content filled by pre-content event.
+ */
+ return fsnotify_file(file, FS_ACCESS_PERM);
}
/*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 53d5d0e02943..9bda354b5538 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -57,6 +57,8 @@
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
/* #define FS_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
+#define FS_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+
/*
* Set on inode mark that cares about things that happen to its children.
* Always set for dnotify and inotify.
@@ -78,8 +80,14 @@
*/
#define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
-#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
- FS_OPEN_EXEC_PERM)
+/* Content events can be used to inspect file content */
+#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
+ FS_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FSNOTIFY_PRE_CONTENT_EVENTS (FS_PRE_ACCESS)
+
+#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
+ FSNOTIFY_PRE_CONTENT_EVENTS)
/*
* This is a list of all events that may get sent to a parent that is watching
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fc926d3cac6e..c6f38705c715 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3404,7 +3404,8 @@ static int selinux_path_notify(const struct path *path, u64 mask,
perm |= FILE__WATCH_WITH_PERM;
/* watches on read-like events need the file:watch_reads permission */
- if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_CLOSE_NOWRITE))
+ if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_PRE_ACCESS |
+ FS_CLOSE_NOWRITE))
perm |= FILE__WATCH_READS;
return path_has_perm(current_cred(), path, perm);
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 17:55 ` [PATCH v7 05/18] fsnotify: introduce pre-content permission events Josef Bacik
@ 2024-11-12 20:12 ` Linus Torvalds
2024-11-12 23:06 ` Amir Goldstein
2024-11-13 19:11 ` Amir Goldstein
0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-11-12 20:12 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
>
> #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +static inline int fsnotify_pre_content(struct file *file)
> +{
> + struct inode *inode = file_inode(file);
> +
> + /*
> + * Pre-content events are only reported for regular files and dirs
> + * if there are any pre-content event watchers on this sb.
> + */
> + if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> + !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> + !fsnotify_sb_has_priority_watchers(inode->i_sb,
> + FSNOTIFY_PRIO_PRE_CONTENT))
> + return 0;
> +
> + return fsnotify_file(file, FS_PRE_ACCESS);
> +}
Yeah, no.
None of this should check inode->i_sb->s_iflags at any point.
The "is there a pre-content" thing should check one thing, and one
thing only: that "is this file watched" flag.
The whole indecipherable mess of inline functions that do random
things in <linux/fsnotify.h> needs to be cleaned up, not made even
more indecipherable.
I'm NAKing this whole series until this is all sane and cleaned up,
and I don't want to see a new hacky version being sent out tomorrow
with just another layer of new hacks, with random new inline functions
that call other inline functions and have complex odd conditionals
that make no sense.
Really. If the new hooks don't have that *SINGLE* bit test, they will
not get merged.
And that *SINGLE* bit test had better not be hidden under multiple
layers of odd inline functions.
You DO NOT get to use the same old broken complex function for the new
hooks that then mix these odd helpers.
This whole "add another crazy inline function using another crazy
helper needs to STOP. Later on in the patch series you do
+/*
+ * fsnotify_truncate_perm - permission hook before file truncate
+ */
+static inline int fsnotify_truncate_perm(const struct path *path,
loff_t length)
+{
+ return fsnotify_pre_content(path, &length, 0);
+}
or things like this:
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ if (!(file->f_mode & FMODE_NOTIFY_PERM))
+ return false;
+
+ if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
+ return false;
+
+ return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
+}
and no, NONE of that should be tested at runtime.
I repeat: you should have *ONE* inline function that basically does
static inline bool fsnotify_file_watched(struct file *file)
{
return file && unlikely(file->f_mode & FMODE_NOTIFY_PERM);
}
and absolutely nothing else. If that file is set, the file has
notification events, and you go to an out-of-line slow case. You don't
inline the unlikely cases after that.
And you make sure that you only set that special bit on files and
filesystems that support it. You most definitely don't check for
SB_I_ALLOW_HSM kind of flags at runtime in critical code.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 20:12 ` Linus Torvalds
@ 2024-11-12 23:06 ` Amir Goldstein
2024-11-12 23:48 ` Linus Torvalds
2024-11-13 19:11 ` Amir Goldstein
1 sibling, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-12 23:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 9:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> > +static inline int fsnotify_pre_content(struct file *file)
> > +{
> > + struct inode *inode = file_inode(file);
> > +
> > + /*
> > + * Pre-content events are only reported for regular files and dirs
> > + * if there are any pre-content event watchers on this sb.
> > + */
> > + if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> > + !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> > + !fsnotify_sb_has_priority_watchers(inode->i_sb,
> > + FSNOTIFY_PRIO_PRE_CONTENT))
> > + return 0;
> > +
> > + return fsnotify_file(file, FS_PRE_ACCESS);
> > +}
>
> Yeah, no.
>
> None of this should check inode->i_sb->s_iflags at any point.
>
> The "is there a pre-content" thing should check one thing, and one
> thing only: that "is this file watched" flag.
> The whole indecipherable mess of inline functions that do random
> things in <linux/fsnotify.h> needs to be cleaned up, not made even
> more indecipherable.
>
> I'm NAKing this whole series until this is all sane and cleaned up,
> and I don't want to see a new hacky version being sent out tomorrow
> with just another layer of new hacks, with random new inline functions
> that call other inline functions and have complex odd conditionals
> that make no sense.
>
> Really. If the new hooks don't have that *SINGLE* bit test, they will
> not get merged.
>
> And that *SINGLE* bit test had better not be hidden under multiple
> layers of odd inline functions.
>
> You DO NOT get to use the same old broken complex function for the new
> hooks that then mix these odd helpers.
>
> This whole "add another crazy inline function using another crazy
> helper needs to STOP. Later on in the patch series you do
>
> +/*
> + * fsnotify_truncate_perm - permission hook before file truncate
> + */
> +static inline int fsnotify_truncate_perm(const struct path *path,
> loff_t length)
> +{
> + return fsnotify_pre_content(path, &length, 0);
> +}
>
> or things like this:
>
> +static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
> +{
> + if (!(file->f_mode & FMODE_NOTIFY_PERM))
> + return false;
> +
> + if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
> + return false;
> +
> + return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
> +}
>
> and no, NONE of that should be tested at runtime.
>
> I repeat: you should have *ONE* inline function that basically does
>
> static inline bool fsnotify_file_watched(struct file *file)
> {
> return file && unlikely(file->f_mode & FMODE_NOTIFY_PERM);
> }
>
> and absolutely nothing else. If that file is set, the file has
> notification events, and you go to an out-of-line slow case. You don't
> inline the unlikely cases after that.
>
> And you make sure that you only set that special bit on files and
> filesystems that support it. You most definitely don't check for
> SB_I_ALLOW_HSM kind of flags at runtime in critical code.
I understand your point. It makes sense.
But it requires using another FMODE_HSM flag,
because FMODE_NOTIFY_PERM covers also the legacy
FS_ACCESS_PERM event, which has different semantics
that I consider broken, but it is what it is.
I am fine not optimizing out the legacy FS_ACCESS_PERM event
and just making sure not to add new bad code, if that is what you prefer
and I also am fine with using two FMODE_ flags if that is prefered.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 23:06 ` Amir Goldstein
@ 2024-11-12 23:48 ` Linus Torvalds
2024-11-13 0:05 ` Amir Goldstein
` (3 more replies)
0 siblings, 4 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-11-12 23:48 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
>
> I am fine not optimizing out the legacy FS_ACCESS_PERM event
> and just making sure not to add new bad code, if that is what you prefer
> and I also am fine with using two FMODE_ flags if that is prefered.
So iirc we do have a handful of FMODE flags left. Not many, but I do
think a new one would be fine.
And if we were to run out (and I'm *not* suggesting we do that now!)
we actually have more free bits in "f_flags".
That f_flags set of flags is a mess for other reasons: we expose them
to user space, and we define the bits using octal numbers for random
bad historical reasons, and some architectures specify their own set
or bits, etc etc - nasty.
But if anybody is really worried about running out of f_mode bits, we
could almost certainly turn the existing
unsigned int f_flags;
into a bitfield, and make it be something like
unsigned int f_flags:26, f_special:6;
instead, with the rule being that "f_special" only gets set at open
time and never any other time (to avoid any data races with fcntl()
touching the other 24 bits in the word).
[ Bah. I thought we had 8 unused bits in f_flags, but I went and
looked. sparc uses 0x2000000 for __O_TMPFILE, so we actually only have
6 bits unused in f_flags. No actual good reason for the sparc choice I
think, but it is what it is ]
Anyway, I wouldn't begrudge you a bit if that cleans this fsnotify
mess up and makes it much simpler and clearer. I really think that if
we can do this cleanly, using a bit in f_mode is a good cause.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 23:48 ` Linus Torvalds
@ 2024-11-13 0:05 ` Amir Goldstein
2024-11-13 16:57 ` Linus Torvalds
2024-11-13 0:12 ` Al Viro
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 0:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 12:48 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
>
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.
>
Maybe I could use just this one bit, but together with the existing
FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
highest watching priority:
FMODE_NOTIFY_HSM (pre-content and all the rest)
FMODE_NOTIFY_PERM (permission and async)
FMODE_NOTIFY_NORMAL (only async events)
FMODE_NOTIFY_NONE (no events)
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 0:05 ` Amir Goldstein
@ 2024-11-13 16:57 ` Linus Torvalds
2024-11-13 18:49 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 16:57 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 16:06, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Maybe I could use just this one bit, but together with the existing
> FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
> highest watching priority:
So you'd use two bits, but one of those would re-use the existing
FMODE_NONOTIFY? That sounds perfectly fine to me.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 16:57 ` Linus Torvalds
@ 2024-11-13 18:49 ` Amir Goldstein
2024-11-14 15:01 ` Jan Kara
0 siblings, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 18:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
[-- Attachment #1: Type: text/plain, Size: 1639 bytes --]
On Wed, Nov 13, 2024 at 5:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 16:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Maybe I could use just this one bit, but together with the existing
> > FMODE_NONOTIFY bit, I get 4 modes, which correspond to the
> > highest watching priority:
>
> So you'd use two bits, but one of those would re-use the existing
> FMODE_NONOTIFY? That sounds perfectly fine to me.
>
Yes, exactly, like this:
/*
* The two FMODE_NONOTIFY_ bits used together have a special meaning of
* not reporting any events at all including non-permission events.
* These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
*
* FMODE_NONOTIFY_HSM - suppress only pre-content events.
* FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
* FMODE_NONOTIFY - suppress all (incl. non-permission) events.
*/
#define FMODE_NONOTIFY_MASK \
(FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
#define FMODE_NOTIFY(mode) \
((mode) & FMODE_NONOTIFY_MASK)
Please see attached patch (build and sanity tested) to make sure that
we are on the same page.
Going forward in the patch series, the choice of the NONOTIFY lingo
creates some double negatives, like:
/*
* read()/write() and other types of access generate pre-content events.
*/
if (!likely(file->f_mode & FMODE_NONOTIFY_HSM)) {
int ret = fsnotify_path(&file->f_path);
But it was easier for me to work with NONOTIFY flags.
Thanks,
Amir.
[-- Attachment #2: 0001-fsnotify-opt-in-for-permission-events-at-file-open-t.patch --]
[-- Type: text/x-patch, Size: 8202 bytes --]
From 7a2cd74654a53684d545b96c57c9091e420b3add Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Tue, 12 Nov 2024 13:46:08 +0100
Subject: [PATCH] fsnotify: opt-in for permission events at file open time
Legacy inotify/fanotify listeners can add watches for events on inode,
parent or mount and expect to get events (e.g. FS_MODIFY) on files that
were already open at the time of setting up the watches.
fanotify permission events are typically used by Anti-malware sofware,
that is watching the entire mount and it is not common to have more that
one Anti-malware engine installed on a system.
To reduce the overhead of the fsnotify_file_perm() hooks on every file
access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
events only if there were *any* permission event listeners on the
filesystem at the time that the file was opened.
The new semantic is implemented by extending the FMODE_NONOTIFY bit into
two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
events types to report.
This is going to apply to the new fanotify pre-content events in order
to reduce the cost of the new pre-content event vfs hooks.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/open.c | 8 ++++-
include/linux/fs.h | 26 ++++++++++++---
include/linux/fsnotify.h | 71 +++++++++++++++++++++++++++++++---------
3 files changed, 83 insertions(+), 22 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 226aca8c7909..194c2c8d8cd4 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -901,7 +901,7 @@ static int do_dentry_open(struct file *f,
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
+ f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_NONOTIFY;
f->f_op = &empty_fops;
return 0;
}
@@ -929,6 +929,12 @@ static int do_dentry_open(struct file *f,
if (error)
goto cleanup_all;
+ /*
+ * Set FMODE_NONOTIFY_* bits according to existing permission watches.
+ * If FMODE_NONOTIFY was already set for an fanotify fd, this doesn't
+ * change anything.
+ */
+ f->f_mode |= fsnotify_file_mode(f);
error = fsnotify_open_perm(f);
if (error)
goto cleanup_all;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70359dd669ff..dd583ce7dba8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
-/* FMODE_* bit 24 */
-
/* File is embedded in backing_file object */
-#define FMODE_BACKING ((__force fmode_t)(1 << 25))
+#define FMODE_BACKING ((__force fmode_t)(1 << 24))
+
+/* File shouldn't generate fanotify pre-content events */
+#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
-/* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
+/* File shouldn't generate fanotify permission events */
+#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)(1 << 27))
@@ -190,6 +191,21 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
+/*
+ * The two FMODE_NONOTIFY_ bits used together have a special meaning of
+ * not reporting any events at all including non-permission events.
+ * These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
+ *
+ * FMODE_NONOTIFY_HSM - suppress only pre-content events.
+ * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
+ * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
+ */
+#define FMODE_NONOTIFY_MASK \
+ (FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
+#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
+#define FMODE_NOTIFY(mode) \
+ ((mode) & FMODE_NONOTIFY_MASK)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..9f13c7c19b74 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -108,38 +108,66 @@ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask)
fsnotify_parent(dentry, mask, dentry, FSNOTIFY_EVENT_DENTRY);
}
-static inline int fsnotify_file(struct file *file, __u32 mask)
+static inline int fsnotify_path(const struct path *path, __u32 mask)
{
- const struct path *path;
+ return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
+}
+static inline int fsnotify_file(struct file *file, __u32 mask)
+{
/*
* FMODE_NONOTIFY are fds generated by fanotify itself which should not
* generate new events. We also don't want to generate events for
* FMODE_PATH fds (involves open & close events) as they are just
* handle creation / destruction events and not "real" file events.
*/
- if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
- return 0;
-
- path = &file->f_path;
- /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
- if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
- !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
- FSNOTIFY_PRIO_CONTENT))
+ if (FMODE_NOTIFY(file->f_mode) == FMODE_NONOTIFY)
return 0;
- return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
+ return fsnotify_path(&file->f_path, mask);
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+/*
+ * At open time we check fsnotify_sb_has_priority_watchers() and set the
+ * FMODE_NONOTIFY_ mode bits accordignly.
+ * Later, fsnotify permission hooks do not check if there are permission event
+ * watches, but that there were permission event watches at open time.
+ */
+static inline fmode_t fsnotify_file_mode(struct file *file)
+{
+ struct super_block *sb = file->f_path.dentry->d_sb;
+
+ /* Is it a file opened by fanotify? */
+ if (FMODE_NOTIFY(file->f_mode) == FMODE_NONOTIFY)
+ return FMODE_NONOTIFY;
+
+ /*
+ * Permission events is a super set of pre-content events, so if there
+ * are no permission event watchers, there are also no pre-content event
+ * watchers and this is implied from the single FMODE_NONOTIFY_PERM bit.
+ */
+ if (likely(!fsnotify_sb_has_priority_watchers(sb,
+ FSNOTIFY_PRIO_CONTENT)))
+ return FMODE_NONOTIFY_PERM;
+
+ /*
+ * FMODE_NONOTIFY_HSM bit means there are permission event watchers, but
+ * no pre-content event watchers.
+ */
+ if (likely(!fsnotify_sb_has_priority_watchers(sb,
+ FSNOTIFY_PRIO_PRE_CONTENT)))
+ return FMODE_NONOTIFY_HSM;
+
+ return 0;
+}
+
/*
* fsnotify_file_area_perm - permission hook before access to file range
*/
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
- __u32 fsnotify_mask = FS_ACCESS_PERM;
-
/*
* filesystem may be modified in the context of permission events
* (e.g. by HSM filling a file on access), so sb freeze protection
@@ -150,7 +178,10 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
if (!(perm_mask & MAY_READ))
return 0;
- return fsnotify_file(file, fsnotify_mask);
+ if (likely(file->f_mode & FMODE_NONOTIFY_PERM))
+ return 0;
+
+ return fsnotify_path(&file->f_path, FS_ACCESS_PERM);
}
/*
@@ -168,16 +199,24 @@ static inline int fsnotify_open_perm(struct file *file)
{
int ret;
+ if (likely(file->f_mode & FMODE_NONOTIFY_PERM))
+ return 0;
+
if (file->f_flags & __FMODE_EXEC) {
- ret = fsnotify_file(file, FS_OPEN_EXEC_PERM);
+ ret = fsnotify_path(&file->f_path, FS_OPEN_EXEC_PERM);
if (ret)
return ret;
}
- return fsnotify_file(file, FS_OPEN_PERM);
+ return fsnotify_path(&file->f_path, FS_OPEN_PERM);
}
#else
+static inline fmode_t fsnotify_file_mode(struct file *file)
+{
+ return 0;
+}
+
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
--
2.34.1
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 18:49 ` Amir Goldstein
@ 2024-11-14 15:01 ` Jan Kara
2024-11-14 17:22 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2024-11-14 15:01 UTC (permalink / raw)
To: Amir Goldstein
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed 13-11-24 19:49:31, Amir Goldstein wrote:
> From 7a2cd74654a53684d545b96c57c9091e420b3add Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Tue, 12 Nov 2024 13:46:08 +0100
> Subject: [PATCH] fsnotify: opt-in for permission events at file open time
>
> Legacy inotify/fanotify listeners can add watches for events on inode,
> parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> were already open at the time of setting up the watches.
>
> fanotify permission events are typically used by Anti-malware sofware,
> that is watching the entire mount and it is not common to have more that
> one Anti-malware engine installed on a system.
>
> To reduce the overhead of the fsnotify_file_perm() hooks on every file
> access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> events only if there were *any* permission event listeners on the
> filesystem at the time that the file was opened.
>
> The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> events types to report.
>
> This is going to apply to the new fanotify pre-content events in order
> to reduce the cost of the new pre-content event vfs hooks.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Couple of notes below.
> diff --git a/fs/open.c b/fs/open.c
> index 226aca8c7909..194c2c8d8cd4 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -901,7 +901,7 @@ static int do_dentry_open(struct file *f,
> f->f_sb_err = file_sample_sb_err(f);
>
> if (unlikely(f->f_flags & O_PATH)) {
> - f->f_mode = FMODE_PATH | FMODE_OPENED;
> + f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_NONOTIFY;
> f->f_op = &empty_fops;
> return 0;
> }
> @@ -929,6 +929,12 @@ static int do_dentry_open(struct file *f,
> if (error)
> goto cleanup_all;
>
> + /*
> + * Set FMODE_NONOTIFY_* bits according to existing permission watches.
> + * If FMODE_NONOTIFY was already set for an fanotify fd, this doesn't
> + * change anything.
> + */
> + f->f_mode |= fsnotify_file_mode(f);
Maybe it would be obvious to do this like:
file_set_fsnotify_mode(f);
Because currently this depends on the details of how exactly FMODE_NONOTIFY
is encoded.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 70359dd669ff..dd583ce7dba8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>
> #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
>
> -/* FMODE_* bit 24 */
> -
> /* File is embedded in backing_file object */
> -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
> +
> +/* File shouldn't generate fanotify pre-content events */
> +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
>
> -/* File was opened by fanotify and shouldn't generate fanotify events */
> -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> +/* File shouldn't generate fanotify permission events */
> +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
>
> /* File is capable of returning -EAGAIN if I/O will block */
> #define FMODE_NOWAIT ((__force fmode_t)(1 << 27))
> @@ -190,6 +191,21 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> /* File does not contribute to nr_files count */
> #define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
>
> +/*
> + * The two FMODE_NONOTIFY_ bits used together have a special meaning of
> + * not reporting any events at all including non-permission events.
> + * These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
> + *
> + * FMODE_NONOTIFY_HSM - suppress only pre-content events.
> + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> + */
> +#define FMODE_NONOTIFY_MASK \
> + (FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
> +#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
> +#define FMODE_NOTIFY(mode) \
> + ((mode) & FMODE_NONOTIFY_MASK)
This looks a bit error-prone to me (FMODE_NONOTIFY looks like another FMODE
flag but in fact it is not which is an invitation for subtle bugs) and the
tests below which are sometimes done as (FMODE_NOTIFY(mode) == xxx) and
sometimes as (file->f_mode & xxx) are inconsistent and confusing (unless you
understand what's happening under the hood).
So how about defining macros like FMODE_FSNOTIFY_NORMAL(),
FMODE_FSNOTIFY_CONTENT() and FMODE_FSNOTIFY_PRE_CONTENT() which evaluate to
true if we should be sending normal/content/pre-content events to the file.
With appropriate comments this should make things more obvious.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-14 15:01 ` Jan Kara
@ 2024-11-14 17:22 ` Amir Goldstein
0 siblings, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-14 17:22 UTC (permalink / raw)
To: Jan Kara
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Thu, Nov 14, 2024 at 4:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-11-24 19:49:31, Amir Goldstein wrote:
> > From 7a2cd74654a53684d545b96c57c9091e420b3add Mon Sep 17 00:00:00 2001
> > From: Amir Goldstein <amir73il@gmail.com>
> > Date: Tue, 12 Nov 2024 13:46:08 +0100
> > Subject: [PATCH] fsnotify: opt-in for permission events at file open time
> >
> > Legacy inotify/fanotify listeners can add watches for events on inode,
> > parent or mount and expect to get events (e.g. FS_MODIFY) on files that
> > were already open at the time of setting up the watches.
> >
> > fanotify permission events are typically used by Anti-malware sofware,
> > that is watching the entire mount and it is not common to have more that
> > one Anti-malware engine installed on a system.
> >
> > To reduce the overhead of the fsnotify_file_perm() hooks on every file
> > access, relax the semantics of the legacy FAN_ACCESS_PERM event to generate
> > events only if there were *any* permission event listeners on the
> > filesystem at the time that the file was opened.
> >
> > The new semantic is implemented by extending the FMODE_NONOTIFY bit into
> > two FMODE_NONOTIFY_* bits, that are used to store a mode for which of the
> > events types to report.
> >
> > This is going to apply to the new fanotify pre-content events in order
> > to reduce the cost of the new pre-content event vfs hooks.
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Couple of notes below.
>
> > diff --git a/fs/open.c b/fs/open.c
> > index 226aca8c7909..194c2c8d8cd4 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -901,7 +901,7 @@ static int do_dentry_open(struct file *f,
> > f->f_sb_err = file_sample_sb_err(f);
> >
> > if (unlikely(f->f_flags & O_PATH)) {
> > - f->f_mode = FMODE_PATH | FMODE_OPENED;
> > + f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_NONOTIFY;
> > f->f_op = &empty_fops;
> > return 0;
> > }
> > @@ -929,6 +929,12 @@ static int do_dentry_open(struct file *f,
> > if (error)
> > goto cleanup_all;
> >
> > + /*
> > + * Set FMODE_NONOTIFY_* bits according to existing permission watches.
> > + * If FMODE_NONOTIFY was already set for an fanotify fd, this doesn't
> > + * change anything.
> > + */
> > + f->f_mode |= fsnotify_file_mode(f);
>
> Maybe it would be obvious to do this like:
>
> file_set_fsnotify_mode(f);
>
> Because currently this depends on the details of how exactly FMODE_NONOTIFY
> is encoded.
>
ok. makes sense.
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 70359dd669ff..dd583ce7dba8 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -173,13 +173,14 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >
> > #define FMODE_NOREUSE ((__force fmode_t)(1 << 23))
> >
> > -/* FMODE_* bit 24 */
> > -
> > /* File is embedded in backing_file object */
> > -#define FMODE_BACKING ((__force fmode_t)(1 << 25))
> > +#define FMODE_BACKING ((__force fmode_t)(1 << 24))
> > +
> > +/* File shouldn't generate fanotify pre-content events */
> > +#define FMODE_NONOTIFY_HSM ((__force fmode_t)(1 << 25))
> >
> > -/* File was opened by fanotify and shouldn't generate fanotify events */
> > -#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
> > +/* File shouldn't generate fanotify permission events */
> > +#define FMODE_NONOTIFY_PERM ((__force fmode_t)(1 << 26))
> >
> > /* File is capable of returning -EAGAIN if I/O will block */
> > #define FMODE_NOWAIT ((__force fmode_t)(1 << 27))
> > @@ -190,6 +191,21 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > /* File does not contribute to nr_files count */
> > #define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
> >
> > +/*
> > + * The two FMODE_NONOTIFY_ bits used together have a special meaning of
> > + * not reporting any events at all including non-permission events.
> > + * These are the possible values of FMODE_NOTIFY(f->f_mode) and their meaning:
> > + *
> > + * FMODE_NONOTIFY_HSM - suppress only pre-content events.
> > + * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
> > + * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
> > + */
> > +#define FMODE_NONOTIFY_MASK \
> > + (FMODE_NONOTIFY_HSM | FMODE_NONOTIFY_PERM)
> > +#define FMODE_NONOTIFY FMODE_NONOTIFY_MASK
> > +#define FMODE_NOTIFY(mode) \
> > + ((mode) & FMODE_NONOTIFY_MASK)
>
> This looks a bit error-prone to me (FMODE_NONOTIFY looks like another FMODE
> flag but in fact it is not which is an invitation for subtle bugs) and the
> tests below which are sometimes done as (FMODE_NOTIFY(mode) == xxx) and
> sometimes as (file->f_mode & xxx) are inconsistent and confusing (unless you
> understand what's happening under the hood).
>
> So how about defining macros like FMODE_FSNOTIFY_NORMAL(),
> FMODE_FSNOTIFY_CONTENT() and FMODE_FSNOTIFY_PRE_CONTENT() which evaluate to
> true if we should be sending normal/content/pre-content events to the file.
> With appropriate comments this should make things more obvious.
>
ok, maybe something like this:
#define FMODE_FSNOTIFY_NONE(mode) \
(FMODE_FSNOTIFY(mode) == FMODE_NONOTIFY)
#define FMODE_FSNOTIFY_NORMAL(mode) \
(FMODE_FSNOTIFY(mode) == FMODE_NONOTIFY_PERM)
#define FMODE_FSNOTIFY_PERM(mode) \
(!((mode) & FMODE_NONOTIFY_PERM))
#define FMODE_FSNOTIFY_HSM(mode) \
(FMODE_FSNOTIFY(mode) == 0)
At least keeping the double negatives contained in one place.
And then we have these users in the final code:
static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
{
return file && unlikely(FMODE_FSNOTIFY_HSM(file->f_mode));
}
static inline int fsnotify_open_perm(struct file *file)
{
int ret;
if (likely(!FMODE_FSNOTIFY_PERM(file->f_mode)))
return 0;
...
static inline int fsnotify_file(struct file *file, __u32 mask)
{
if (FMODE_FSNOTIFY_NONE(file->f_mode))
return 0;
...
BTW, I prefer using PERM,HSM instead of the FSNOTIFY_PRIO_
names for brevity, but also because at the moment of this patch
FMODE_NONOTIFY_PERM means "suppress permission events
if there are no listeners with priority >= FSNOTIFY_PRIO_CONTENT
at all on any objects of the filesystem".
It does NOT mean that there ARE permission events watchers on the file's
sb/mnt/inode or parent, but what the users of the flag care about really is
whether the specific file is being watched for permission events.
I was contemplating if we should add the following check at open time
as following patches add for pre-content watchers also for
permission watchers on the specific file:
/*
* Permission events is a super set of pre-content events, so if there
* are no permission event watchers, there are also no pre-content event
* watchers and this is implied from the single FMODE_NONOTIFY_PERM bit.
*/
if (likely(!fsnotify_sb_has_priority_watchers(sb,
FSNOTIFY_PRIO_CONTENT)))
return FMODE_NONOTIFY_PERM;
+ /*
+ * There are content watchers in the filesystem, but are there
+ * permission event watchers on this specific file?
+ */
+ if (likely(!fsnotify_file_object_watched(file,
+ ALL_FSNOTIFY_PERM_EVENTS)))
+ return FMODE_NONOTIFY_PERM;
+
I decided not to stretch the behavior change too much and also since
Anti-malware permission watchers often watch all the mounts of a
filesystem, there is probably little to gain from this extra check.
But we can reconsider this in the future.
WDYT?
In any case, IMO the language of FMODE_FSNOTIFY_PERM() matches
the meaning of the users better and makes the code easier to understand.
FMODE_FSNOTIFY_HSM() is debatable, but at least it is short ;)
Anyway, I will send v2 with your suggestions.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 23:48 ` Linus Torvalds
2024-11-13 0:05 ` Amir Goldstein
@ 2024-11-13 0:12 ` Al Viro
2024-11-13 0:23 ` Linus Torvalds
2024-11-13 10:10 ` Christian Brauner
2024-11-20 11:09 ` Christian Brauner
3 siblings, 1 reply; 51+ messages in thread
From: Al Viro @ 2024-11-13 0:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 03:48:10PM -0800, Linus Torvalds wrote:
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
>
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.
8, 13, 24, 30 and 31.
> But if anybody is really worried about running out of f_mode bits, we
> could almost certainly turn the existing
>
> unsigned int f_flags;
>
> into a bitfield, and make it be something like
>
> unsigned int f_flags:26, f_special:6;
>
> instead, with the rule being that "f_special" only gets set at open
> time and never any other time (to avoid any data races with fcntl()
> touching the other 24 bits in the word).
Ugh... Actually, I would rather mask that on fcntl side (and possibly
moved FMODE_RANDOM/FMODE_NOREUSE over there as well).
Would make for simpler rules for locking - ->f_mode would be never
changed past open, ->f_flags would have all changes under ->f_lock.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 0:12 ` Al Viro
@ 2024-11-13 0:23 ` Linus Torvalds
2024-11-13 0:38 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 0:23 UTC (permalink / raw)
To: Al Viro
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 16:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Ugh... Actually, I would rather mask that on fcntl side (and possibly
> moved FMODE_RANDOM/FMODE_NOREUSE over there as well).
Yeah, that's probably cleaner. I was thinking the bitfield would be a
simpler solution, but we already mask writes to specific bits on the
fcntl side for other reasons *anyway*, so we might as well mask reads
too, and just not expose any kernel-internal bits to user space.
> Would make for simpler rules for locking - ->f_mode would be never
> changed past open, ->f_flags would have all changes under ->f_lock.
Yeah, sounds sane.
That said, just looking at which bits are used in f_flags is a major
PITA. About half the definitions use octal, with the other half using
hex. Lovely.
So I'd rather not touch that mess until we have to.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 0:23 ` Linus Torvalds
@ 2024-11-13 0:38 ` Linus Torvalds
2024-11-13 1:19 ` Al Viro
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 0:38 UTC (permalink / raw)
To: Al Viro
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 16:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 16:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Ugh... Actually, I would rather mask that on fcntl side (and possibly
> > moved FMODE_RANDOM/FMODE_NOREUSE over there as well).
> >
> > Would make for simpler rules for locking - ->f_mode would be never
> > changed past open, ->f_flags would have all changes under ->f_lock.
>
> Yeah, sounds sane.
>
> That said, just looking at which bits are used in f_flags is a major
> PITA. About half the definitions use octal, with the other half using
> hex. Lovely.
>
> So I'd rather not touch that mess until we have to.
Actually, maybe the locking and the octal/hex mess should be
considered a reason to clean this all up early rather than ignore it.
Looking at that locking code in fadvise() just for the f_mode use does
make me think this would be a really good cleanup.
I note that our fcntl code seems buggy as-is, because while it does
use f_lock for assignments (good), it clearly does *not* use them for
reading.
So it looks like you can actually read inconsistent values.
I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
_addition_ to the f_lock use it has.
The f_mode thing with fadvise() smells like the same bug. Just because
the modifications are serialized wrt each other doesn't mean that
readers are then automatically ok.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 0:38 ` Linus Torvalds
@ 2024-11-13 1:19 ` Al Viro
2024-11-13 4:30 ` Al Viro
0 siblings, 1 reply; 51+ messages in thread
From: Al Viro @ 2024-11-13 1:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> Looking at that locking code in fadvise() just for the f_mode use does
> make me think this would be a really good cleanup.
>
> I note that our fcntl code seems buggy as-is, because while it does
> use f_lock for assignments (good), it clearly does *not* use them for
> reading.
>
> So it looks like you can actually read inconsistent values.
>
> I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> _addition_ to the f_lock use it has.
AFAICS, fasync logics is the fishy part - the rest should be sane.
> The f_mode thing with fadvise() smells like the same bug. Just because
> the modifications are serialized wrt each other doesn't mean that
> readers are then automatically ok.
Reads are also under ->f_lock in there, AFAICS...
Another thing in the vicinity is ->f_mode modifications after the calls
of anon_inode_getfile() in several callers - probably ought to switch
those to anon_inode_getfile_fmode(). That had been discussed back in
April when the function got merged, but "convert to using it" followup
series hadn't materialized...
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 1:19 ` Al Viro
@ 2024-11-13 4:30 ` Al Viro
2024-11-13 8:50 ` Amir Goldstein
2024-11-13 14:36 ` Amir Goldstein
0 siblings, 2 replies; 51+ messages in thread
From: Al Viro @ 2024-11-13 4:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > Looking at that locking code in fadvise() just for the f_mode use does
> > make me think this would be a really good cleanup.
> >
> > I note that our fcntl code seems buggy as-is, because while it does
> > use f_lock for assignments (good), it clearly does *not* use them for
> > reading.
> >
> > So it looks like you can actually read inconsistent values.
> >
> > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > _addition_ to the f_lock use it has.
>
> AFAICS, fasync logics is the fishy part - the rest should be sane.
>
> > The f_mode thing with fadvise() smells like the same bug. Just because
> > the modifications are serialized wrt each other doesn't mean that
> > readers are then automatically ok.
>
> Reads are also under ->f_lock in there, AFAICS...
>
> Another thing in the vicinity is ->f_mode modifications after the calls
> of anon_inode_getfile() in several callers - probably ought to switch
> those to anon_inode_getfile_fmode(). That had been discussed back in
> April when the function got merged, but "convert to using it" followup
> series hadn't materialized...
While we are at it, there's is a couple of kludges I really hate -
mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.
E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
anon_inode_getfd() to anon_inode_getfile_fmode() and adding
a dentry_open_nonotify() to be used by fanotify on the other path.
That's it - no more weird shit in OPEN_FMODE(), etc.
For __FMODE_EXEC it might get trickier (nfs is the main consumer),
but I seriously suspect that something like "have path_openat()
check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
right after struct file allocation" would make a good starting
point; yes, it would affect uselib(2), but... I've no idea whether
it wouldn't be the right thing to do; would be hard to test.
Anyway, untested __FMODE_NONOTIFY side of it:
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..ebd1c82bfb6b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
- __FMODE_EXEC | __FMODE_NONOTIFY));
+ __FMODE_EXEC));
fasync_cache = kmem_cache_create("fasync_cache",
sizeof(struct fasync_struct), 0,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9644bc72e457..43fbf29ef03a 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
*
* Internal and external open flags are stored together in field f_flags of
* struct file. Only external open flags shall be allowed in event_f_flags.
- * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
- * excluded.
+ * Internal flags like FMODE_EXEC shall be excluded.
*/
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
O_ACCMODE | O_APPEND | O_NONBLOCK | \
@@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
* we need a new file handle for the userspace program so it can read even if it was
* originally opened O_WRONLY.
*/
- new_file = dentry_open(path,
- group->fanotify_data.f_flags | __FMODE_NONOTIFY,
+ new_file = dentry_open_nonotify(path,
+ group->fanotify_data.f_flags,
current_cred());
if (IS_ERR(new_file)) {
/*
@@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
unsigned int class = flags & FANOTIFY_CLASS_BITS;
unsigned int internal_flags = 0;
+ struct file *file;
pr_debug("%s: flags=%x event_f_flags=%x\n",
__func__, flags, event_f_flags);
@@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
(!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
return -EINVAL;
- f_flags = O_RDWR | __FMODE_NONOTIFY;
+ f_flags = O_RDWR;
if (flags & FAN_CLOEXEC)
f_flags |= O_CLOEXEC;
if (flags & FAN_NONBLOCK)
@@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
goto out_destroy_group;
}
- fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
+ fd = get_unused_fd_flags(flags);
if (fd < 0)
goto out_destroy_group;
+ file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
+ f_flags, FMODE_NONOTIFY);
+ if (IS_ERR(file)) {
+ fd = PTR_ERR(file);
+ put_unused_fd(fd);
+ goto out_destroy_group;
+ }
+ fd_install(fd, file);
return fd;
out_destroy_group:
diff --git a/fs/open.c b/fs/open.c
index acaeb3e25c88..04cb581528ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
}
EXPORT_SYMBOL(dentry_open);
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+ const struct cred *cred)
+{
+ struct file *f = alloc_empty_file(flags, cred);
+ if (!IS_ERR(f)) {
+ int error;
+
+ f->f_mode |= FMODE_NONOTIFY;
+ error = vfs_open(path, f);
+ if (error) {
+ fput(f);
+ f = ERR_PTR(error);
+ }
+ }
+ return f;
+}
+
/**
* dentry_create - Create and open a file
* @path: path to create
@@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
inline int build_open_flags(const struct open_how *how, struct open_flags *op)
{
u64 flags = how->flags;
- u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
+ u64 strip = O_CLOEXEC;
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..18888d601550 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2731,6 +2731,8 @@ struct file *dentry_open(const struct path *path, int flags,
struct file *dentry_create(const struct path *path, int flags, umode_t mode,
const struct cred *cred);
struct path *backing_file_user_path(struct file *f);
+struct file *dentry_open_nonotify(const struct path *path, int flags,
+ const struct cred *creds);
/*
* When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
@@ -3620,11 +3622,9 @@ struct ctl_table;
int __init list_bdev_fs_names(char *buf, size_t size);
#define __FMODE_EXEC ((__force int) FMODE_EXEC)
-#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
-#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
- (flag & __FMODE_NONOTIFY)))
+#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
static inline bool is_sxid(umode_t mode)
{
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 80f37a0d40d7..613475285643 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -6,7 +6,6 @@
/*
* FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x4000000
* These cannot be used by userspace O_* until internal and external open
* flags are split.
* -Eric Paris
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 4:30 ` Al Viro
@ 2024-11-13 8:50 ` Amir Goldstein
2024-11-13 14:36 ` Amir Goldstein
1 sibling, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 8:50 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 5:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> > On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > > Looking at that locking code in fadvise() just for the f_mode use does
> > > make me think this would be a really good cleanup.
> > >
> > > I note that our fcntl code seems buggy as-is, because while it does
> > > use f_lock for assignments (good), it clearly does *not* use them for
> > > reading.
> > >
> > > So it looks like you can actually read inconsistent values.
> > >
> > > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > > _addition_ to the f_lock use it has.
> >
> > AFAICS, fasync logics is the fishy part - the rest should be sane.
> >
> > > The f_mode thing with fadvise() smells like the same bug. Just because
> > > the modifications are serialized wrt each other doesn't mean that
> > > readers are then automatically ok.
> >
> > Reads are also under ->f_lock in there, AFAICS...
> >
> > Another thing in the vicinity is ->f_mode modifications after the calls
> > of anon_inode_getfile() in several callers - probably ought to switch
> > those to anon_inode_getfile_fmode(). That had been discussed back in
> > April when the function got merged, but "convert to using it" followup
> > series hadn't materialized...
>
> While we are at it, there's is a couple of kludges I really hate -
> mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.
>
> E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
> anon_inode_getfd() to anon_inode_getfile_fmode() and adding
> a dentry_open_nonotify() to be used by fanotify on the other path.
> That's it - no more weird shit in OPEN_FMODE(), etc.
>
> For __FMODE_EXEC it might get trickier (nfs is the main consumer),
> but I seriously suspect that something like "have path_openat()
> check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
> right after struct file allocation" would make a good starting
> point; yes, it would affect uselib(2), but... I've no idea whether
> it wouldn't be the right thing to do; would be hard to test.
>
> Anyway, untested __FMODE_NONOTIFY side of it:
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 22dd9dcce7ec..ebd1c82bfb6b 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> - __FMODE_EXEC | __FMODE_NONOTIFY));
> + __FMODE_EXEC));
>
> fasync_cache = kmem_cache_create("fasync_cache",
> sizeof(struct fasync_struct), 0,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9644bc72e457..43fbf29ef03a 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
> *
> * Internal and external open flags are stored together in field f_flags of
> * struct file. Only external open flags shall be allowed in event_f_flags.
> - * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
> - * excluded.
> + * Internal flags like FMODE_EXEC shall be excluded.
> */
> #define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
> O_ACCMODE | O_APPEND | O_NONBLOCK | \
> @@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
> * we need a new file handle for the userspace program so it can read even if it was
> * originally opened O_WRONLY.
> */
> - new_file = dentry_open(path,
> - group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> + new_file = dentry_open_nonotify(path,
> + group->fanotify_data.f_flags,
> current_cred());
> if (IS_ERR(new_file)) {
> /*
> @@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
> unsigned int class = flags & FANOTIFY_CLASS_BITS;
> unsigned int internal_flags = 0;
> + struct file *file;
>
> pr_debug("%s: flags=%x event_f_flags=%x\n",
> __func__, flags, event_f_flags);
> @@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> (!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
> return -EINVAL;
>
> - f_flags = O_RDWR | __FMODE_NONOTIFY;
> + f_flags = O_RDWR;
> if (flags & FAN_CLOEXEC)
> f_flags |= O_CLOEXEC;
> if (flags & FAN_NONBLOCK)
> @@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> goto out_destroy_group;
> }
>
> - fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> + fd = get_unused_fd_flags(flags);
> if (fd < 0)
> goto out_destroy_group;
>
> + file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
> + f_flags, FMODE_NONOTIFY);
> + if (IS_ERR(file)) {
> + fd = PTR_ERR(file);
> + put_unused_fd(fd);
> + goto out_destroy_group;
> + }
> + fd_install(fd, file);
> return fd;
>
> out_destroy_group:
> diff --git a/fs/open.c b/fs/open.c
> index acaeb3e25c88..04cb581528ff 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
> }
> EXPORT_SYMBOL(dentry_open);
>
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> + const struct cred *cred)
> +{
> + struct file *f = alloc_empty_file(flags, cred);
> + if (!IS_ERR(f)) {
> + int error;
> +
> + f->f_mode |= FMODE_NONOTIFY;
> + error = vfs_open(path, f);
> + if (error) {
> + fput(f);
> + f = ERR_PTR(error);
> + }
> + }
> + return f;
> +}
> +
> /**
> * dentry_create - Create and open a file
> * @path: path to create
> @@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
> inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> {
> u64 flags = how->flags;
> - u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
> + u64 strip = O_CLOEXEC;
> int lookup_flags = 0;
> int acc_mode = ACC_MODE(flags);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..18888d601550 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2731,6 +2731,8 @@ struct file *dentry_open(const struct path *path, int flags,
> struct file *dentry_create(const struct path *path, int flags, umode_t mode,
> const struct cred *cred);
> struct path *backing_file_user_path(struct file *f);
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> + const struct cred *creds);
>
> /*
> * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
> @@ -3620,11 +3622,9 @@ struct ctl_table;
> int __init list_bdev_fs_names(char *buf, size_t size);
>
> #define __FMODE_EXEC ((__force int) FMODE_EXEC)
> -#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
>
> #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
> -#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
> - (flag & __FMODE_NONOTIFY)))
> +#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE)))
>
> static inline bool is_sxid(umode_t mode)
> {
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 80f37a0d40d7..613475285643 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -6,7 +6,6 @@
>
> /*
> * FMODE_EXEC is 0x20
> - * FMODE_NONOTIFY is 0x4000000
> * These cannot be used by userspace O_* until internal and external open
> * flags are split.
> * -Eric Paris
Nice. I will take it for a test drive.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 4:30 ` Al Viro
2024-11-13 8:50 ` Amir Goldstein
@ 2024-11-13 14:36 ` Amir Goldstein
2024-11-13 20:31 ` Al Viro
1 sibling, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 14:36 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 5:30 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 13, 2024 at 01:19:54AM +0000, Al Viro wrote:
> > On Tue, Nov 12, 2024 at 04:38:42PM -0800, Linus Torvalds wrote:
> > > Looking at that locking code in fadvise() just for the f_mode use does
> > > make me think this would be a really good cleanup.
> > >
> > > I note that our fcntl code seems buggy as-is, because while it does
> > > use f_lock for assignments (good), it clearly does *not* use them for
> > > reading.
> > >
> > > So it looks like you can actually read inconsistent values.
> > >
> > > I get the feeling that f_flags would want WRITE_ONCE/READ_ONCE in
> > > _addition_ to the f_lock use it has.
> >
> > AFAICS, fasync logics is the fishy part - the rest should be sane.
> >
> > > The f_mode thing with fadvise() smells like the same bug. Just because
> > > the modifications are serialized wrt each other doesn't mean that
> > > readers are then automatically ok.
> >
> > Reads are also under ->f_lock in there, AFAICS...
> >
> > Another thing in the vicinity is ->f_mode modifications after the calls
> > of anon_inode_getfile() in several callers - probably ought to switch
> > those to anon_inode_getfile_fmode(). That had been discussed back in
> > April when the function got merged, but "convert to using it" followup
> > series hadn't materialized...
>
> While we are at it, there's is a couple of kludges I really hate -
> mixing __FMODE_NONOTIFY and __FMODE_EXEC with O_... flags.
>
> E.g. for __FMODE_NONOTIFY all it takes is switching fanotify from
> anon_inode_getfd() to anon_inode_getfile_fmode() and adding
> a dentry_open_nonotify() to be used by fanotify on the other path.
> That's it - no more weird shit in OPEN_FMODE(), etc.
>
> For __FMODE_EXEC it might get trickier (nfs is the main consumer),
> but I seriously suspect that something like "have path_openat()
> check op->acc_mode & MAY_EXEC and set FMODE_EXEC in ->f_mode
> right after struct file allocation" would make a good starting
> point; yes, it would affect uselib(2), but... I've no idea whether
> it wouldn't be the right thing to do; would be hard to test.
>
> Anyway, untested __FMODE_NONOTIFY side of it:
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 22dd9dcce7ec..ebd1c82bfb6b 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1161,10 +1161,10 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> - __FMODE_EXEC | __FMODE_NONOTIFY));
> + __FMODE_EXEC));
>
> fasync_cache = kmem_cache_create("fasync_cache",
> sizeof(struct fasync_struct), 0,
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9644bc72e457..43fbf29ef03a 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -101,8 +101,7 @@ static void __init fanotify_sysctls_init(void)
> *
> * Internal and external open flags are stored together in field f_flags of
> * struct file. Only external open flags shall be allowed in event_f_flags.
> - * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
> - * excluded.
> + * Internal flags like FMODE_EXEC shall be excluded.
> */
> #define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
> O_ACCMODE | O_APPEND | O_NONBLOCK | \
> @@ -262,8 +261,8 @@ static int create_fd(struct fsnotify_group *group, const struct path *path,
> * we need a new file handle for the userspace program so it can read even if it was
> * originally opened O_WRONLY.
> */
> - new_file = dentry_open(path,
> - group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> + new_file = dentry_open_nonotify(path,
> + group->fanotify_data.f_flags,
I would make this a bit more generic helper and the comment above
is truly clueless:
/*
- * we need a new file handle for the userspace program so it
can read even if it was
- * originally opened O_WRONLY.
+ * We provide an fd for the userspace program, so it could access the
+ * file without generating fanotify events itself.
*/
- new_file = dentry_open(path,
- group->fanotify_data.f_flags | __FMODE_NONOTIFY,
- current_cred());
+ new_file = dentry_open_fmode(path, group->fanotify_data.f_flags,
+ FMODE_NONOTIFY, current_cred());
> current_cred());
> if (IS_ERR(new_file)) {
> /*
> @@ -1404,6 +1403,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
> unsigned int class = flags & FANOTIFY_CLASS_BITS;
> unsigned int internal_flags = 0;
> + struct file *file;
>
> pr_debug("%s: flags=%x event_f_flags=%x\n",
> __func__, flags, event_f_flags);
> @@ -1472,7 +1472,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> (!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
> return -EINVAL;
>
> - f_flags = O_RDWR | __FMODE_NONOTIFY;
> + f_flags = O_RDWR;
> if (flags & FAN_CLOEXEC)
> f_flags |= O_CLOEXEC;
> if (flags & FAN_NONBLOCK)
> @@ -1550,10 +1550,18 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> goto out_destroy_group;
> }
>
> - fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> + fd = get_unused_fd_flags(flags);
s/flags/f_flags
> if (fd < 0)
> goto out_destroy_group;
>
> + file = anon_inode_getfile_fmode("[fanotify]", &fanotify_fops, group,
> + f_flags, FMODE_NONOTIFY);
> + if (IS_ERR(file)) {
> + fd = PTR_ERR(file);
> + put_unused_fd(fd);
> + goto out_destroy_group;
> + }
> + fd_install(fd, file);
> return fd;
>
> out_destroy_group:
> diff --git a/fs/open.c b/fs/open.c
> index acaeb3e25c88..04cb581528ff 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1118,6 +1118,23 @@ struct file *dentry_open(const struct path *path, int flags,
> }
> EXPORT_SYMBOL(dentry_open);
>
> +struct file *dentry_open_nonotify(const struct path *path, int flags,
> + const struct cred *cred)
> +{
> + struct file *f = alloc_empty_file(flags, cred);
> + if (!IS_ERR(f)) {
> + int error;
> +
> + f->f_mode |= FMODE_NONOTIFY;
> + error = vfs_open(path, f);
> + if (error) {
> + fput(f);
> + f = ERR_PTR(error);
> + }
> + }
> + return f;
> +}
> +
> /**
> * dentry_create - Create and open a file
> * @path: path to create
> @@ -1215,7 +1232,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
> inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> {
> u64 flags = how->flags;
> - u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
> + u64 strip = O_CLOEXEC;
> int lookup_flags = 0;
> int acc_mode = ACC_MODE(flags);
>
Get rid of another stale comment:
/*
- * Strip flags that either shouldn't be set by userspace like
- * FMODE_NONOTIFY or that aren't relevant in determining struct
- * open_flags like O_CLOEXEC.
+ * Strip flags that aren't relevant in determining struct open_flags.
*/
With these changed, you can add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
With the f_flags typo fixed, this passed LTP sanity tests, but I am
going to test the NONOTIFY functionally some more.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 14:36 ` Amir Goldstein
@ 2024-11-13 20:31 ` Al Viro
0 siblings, 0 replies; 51+ messages in thread
From: Al Viro @ 2024-11-13 20:31 UTC (permalink / raw)
To: Amir Goldstein
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 03:36:33PM +0100, Amir Goldstein wrote:
> I would make this a bit more generic helper and the comment above
> is truly clueless:
>
> /*
> - * we need a new file handle for the userspace program so it
> can read even if it was
> - * originally opened O_WRONLY.
> + * We provide an fd for the userspace program, so it could access the
> + * file without generating fanotify events itself.
> */
> - new_file = dentry_open(path,
> - group->fanotify_data.f_flags | __FMODE_NONOTIFY,
> - current_cred());
> + new_file = dentry_open_fmode(path, group->fanotify_data.f_flags,
> + FMODE_NONOTIFY, current_cred());
Hmm... Not sure I like that, TBH, since that'll lead to temptation to
turn dentry_open() into a wrapper for that thing and I would rather
keep them separate.
> > - fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
> > + fd = get_unused_fd_flags(flags);
>
> s/flags/f_flags
ACK - thanks for catching that one.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 23:48 ` Linus Torvalds
2024-11-13 0:05 ` Amir Goldstein
2024-11-13 0:12 ` Al Viro
@ 2024-11-13 10:10 ` Christian Brauner
2024-11-20 11:09 ` Christian Brauner
3 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2024-11-13 10:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 03:48:10PM -0800, Linus Torvalds wrote:
> On Tue, 12 Nov 2024 at 15:06, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I am fine not optimizing out the legacy FS_ACCESS_PERM event
> > and just making sure not to add new bad code, if that is what you prefer
> > and I also am fine with using two FMODE_ flags if that is prefered.
>
> So iirc we do have a handful of FMODE flags left. Not many, but I do
> think a new one would be fine.
I freed up five bits and commented which bits are available in
include/linux/fs.h.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 23:48 ` Linus Torvalds
` (2 preceding siblings ...)
2024-11-13 10:10 ` Christian Brauner
@ 2024-11-20 11:09 ` Christian Brauner
2024-11-20 11:36 ` Amir Goldstein
3 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2024-11-20 11:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
> But if anybody is really worried about running out of f_mode bits, we
> could almost certainly turn the existing
>
> unsigned int f_flags;
>
> into a bitfield, and make it be something like
>
> unsigned int f_flags:26, f_special:6;
I just saw this now. Two points I would like to keep you to keep mind.
I've already mentiond that I've freed up 5 fmode bits so it's not that
we're in immediate danger of running out. Especially since I added
f_ops_flags which contains all flags that are static, i.e., never change
and can simply live in the file operations struct and aren't that
performance sensitive.
I shrunk struct file to three cachelines. And in fact, we have 8 bytes
to use left since I removed f_version. So it really wouldn't be a
problem to simply add a separate u32 f_special member into struct file
without growing it and still leaving a 4 byte hole if it ever comes to
that.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-20 11:09 ` Christian Brauner
@ 2024-11-20 11:36 ` Amir Goldstein
0 siblings, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-20 11:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 20, 2024 at 12:10 PM Christian Brauner <brauner@kernel.org> wrote:
>
> > But if anybody is really worried about running out of f_mode bits, we
> > could almost certainly turn the existing
> >
> > unsigned int f_flags;
> >
> > into a bitfield, and make it be something like
> >
> > unsigned int f_flags:26, f_special:6;
>
> I just saw this now. Two points I would like to keep you to keep mind.
>
> I've already mentiond that I've freed up 5 fmode bits so it's not that
> we're in immediate danger of running out. Especially since I added
> f_ops_flags which contains all flags that are static, i.e., never change
> and can simply live in the file operations struct and aren't that
> performance sensitive.
>
> I shrunk struct file to three cachelines. And in fact, we have 8 bytes
> to use left since I removed f_version. So it really wouldn't be a
> problem to simply add a separate u32 f_special member into struct file
> without growing it and still leaving a 4 byte hole if it ever comes to
> that.
That's good to know, but for the record, I ended up using just one
extra f_mode bit for fsnotify [1].
Thanks,
Amir.
[1] https://lore.kernel.org/linux-fsdevel/5ea5f8e283d1edb55aa79c35187bfe344056af14.1731684329.git.josef@toxicpanda.com/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-12 20:12 ` Linus Torvalds
2024-11-12 23:06 ` Amir Goldstein
@ 2024-11-13 19:11 ` Amir Goldstein
2024-11-13 21:22 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 19:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 9:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> > +static inline int fsnotify_pre_content(struct file *file)
> > +{
> > + struct inode *inode = file_inode(file);
> > +
> > + /*
> > + * Pre-content events are only reported for regular files and dirs
> > + * if there are any pre-content event watchers on this sb.
> > + */
> > + if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
> > + !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
> > + !fsnotify_sb_has_priority_watchers(inode->i_sb,
> > + FSNOTIFY_PRIO_PRE_CONTENT))
> > + return 0;
> > +
> > + return fsnotify_file(file, FS_PRE_ACCESS);
> > +}
>
> Yeah, no.
>
> None of this should check inode->i_sb->s_iflags at any point.
>
> The "is there a pre-content" thing should check one thing, and one
> thing only: that "is this file watched" flag.
>
> The whole indecipherable mess of inline functions that do random
> things in <linux/fsnotify.h> needs to be cleaned up, not made even
> more indecipherable.
>
> I'm NAKing this whole series until this is all sane and cleaned up,
> and I don't want to see a new hacky version being sent out tomorrow
> with just another layer of new hacks, with random new inline functions
> that call other inline functions and have complex odd conditionals
> that make no sense.
>
> Really. If the new hooks don't have that *SINGLE* bit test, they will
> not get merged.
>
> And that *SINGLE* bit test had better not be hidden under multiple
> layers of odd inline functions.
>
> You DO NOT get to use the same old broken complex function for the new
> hooks that then mix these odd helpers.
Up to here I understand.
>
> This whole "add another crazy inline function using another crazy
> helper needs to STOP. Later on in the patch series you do
>
The patch that I sent did add another convenience helper
fsnotify_path(), but as long as it is not hiding crazy tests,
and does not expand to huge inlined code, I don't see the problem.
Those convenience helpers help me to maintain readability and code
reuse. I do agree that the new hooks that can use the new open-time
check semantics should not expand to huge inlined code.
> +/*
> + * fsnotify_truncate_perm - permission hook before file truncate
> + */
> +static inline int fsnotify_truncate_perm(const struct path *path,
> loff_t length)
> +{
> + return fsnotify_pre_content(path, &length, 0);
> +}
>
This example that you pointed at, I do not understand.
truncate() does not happen on an open file, so I cannot use the
FMODE_NONOTIFY_ test.
This is what I have in my WIP branch:
static inline int fsnotify_file_range(const struct path *path,
const loff_t *ppos, size_t count)
{
struct file_range range;
const void *data;
int data_type;
/* Report page aligned range only when pos is known */
if (ppos) {
range.path = path;
range.pos = PAGE_ALIGN_DOWN(*ppos);
range.count = PAGE_ALIGN(*ppos + count) - range.pos;
data = ⦥
data_type = FSNOTIFY_EVENT_FILE_RANGE;
} else {
data = path;
data_type = FSNOTIFY_EVENT_PATH;
}
return fsnotify_parent(path->dentry, FS_PRE_ACCESS, data, data_type);
}
/*
* fsnotify_truncate_perm - permission hook before file truncate
*/
static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
{
struct inode *inode = d_inode(path->dentry);
/*
* Pre-content events are only reported for regular files and dirs
* if there are any pre-content event watchers on this sb.
*/
if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
!unlikely(fsnotify_sb_has_priority_watchers(inode->i_sb,
FSNOTIFY_PRIO_PRE_CONTENT)))
return 0;
return fsnotify_file_range(path, &length, 0);
}
fsnotify_file_range() does not need to be inlined, but I do want
to reuse the code of fsnotify_file_range() which is also called for
the common file pre-access hook.
So did you mean that the unlikely stuff (i.e. fsnotify_file_range())
should be an indirect call? or something else?
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 19:11 ` Amir Goldstein
@ 2024-11-13 21:22 ` Linus Torvalds
2024-11-13 22:35 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 21:22 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, 13 Nov 2024 at 11:11, Amir Goldstein <amir73il@gmail.com> wrote:
>
> >
> > This whole "add another crazy inline function using another crazy
> > helper needs to STOP. Later on in the patch series you do
> >
>
> The patch that I sent did add another convenience helper
> fsnotify_path(), but as long as it is not hiding crazy tests,
> and does not expand to huge inlined code, I don't see the problem.
So I don't mind adding a new inline function for convenience.
But I do mind the whole "multiple levels of inline functions" model,
and the thing I _particularly_ hate is the "mask is usually constant
so that the effect of the inline function is practically two different
things" as exemplified by "fsnotify_file()" and friends.
At that point, the inline function isn't a helper any more, it's a
hindrance to understanding what the heck is going on.
Basically, as an example: fsnotify_file() is actually two very
different things depending on the "mask" argument, an that argument is
*typically* a constant.
In fact, in fsnotify_file_area_perm() is very much is a constant, but
to make it extra non-obvious, it's a *hidden* constant, using
__u32 fsnotify_mask = FS_ACCESS_PERM;
to hide the fact that it's actually calling fsnotify_file() with that
constant argument.
And in fsnotify_open() it's not exactly a constant, but it's kind of
one: when you actually look at fsnotify_file(), it has that "I do a
different filtering event based on mask", and the two different
constants fsnotify_open() uses are actually the same for that mask.
In other words, that whole "mask" test part of fsnotify_file()
/* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
!fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
FSNOTIFY_PRIO_CONTENT))
return 0;
mess is actually STATICALLY TRUE OR FALSE, but it's made out to be
somehow an "arghumenty" to the function, and it's really obfuscated.
That is the kind of "helper inline" that I don't want to see in the
new paths. Making that conditional more complicated was part of what I
objected to in one of the patches.
> Those convenience helpers help me to maintain readability and code
> reuse.
ABSOLUTELY NOT.
That "convenience helkper" does exactly the opposite. It explicitly
and actively obfuscates when the actual
fsnotify_sb_has_priority_watchers() filtering is done.
That helper is evil.
Just go and look at the actual uses, let's take
fsnotify_file_area_perm() as an example. As mentioned, as an extra
level of obfuscation, that horrid "helper" function tries to hide how
"mask" is constant by doing
__u32 fsnotify_mask = FS_ACCESS_PERM;
and then never modifying it, and then doing
return fsnotify_file(file, fsnotify_mask);
but if you walk through the logic, you now see that ok, that means
that the "mask" conditional fsnotify_file() is actually just
FS_ACCESS_PERM & ALL_FSNOTIFY_PERM_EVENTS
which is always true, so it means that fsnotify_file_area_perm()
unconditionally does that
fsnotify_sb_has_priority_watchers(..)
filitering.
And dammit, you shouldn't have to walk through that pointless "helper"
variable, and that pointless "helper" inline function to see that. It
shouldn't be the case that fsnotify_file() does two completely
different things based on a constant argument.
It would have literally been much clearer to just have two explicitly
different versions of that function, *WITHOUT* some kind of
pseudo-conditional that isn't actually a conditional, and just have
fsnotify_file_area_perm() be very explicit about the fact that it uses
the fsnotify_sb_has_priority_watchers() logic.
IOW, that conditional only makes it harder to see what the actual
rules are. For no good reason.
Look, magically for some reason fsnotify_name() could do the same
thing without this kind of silly obfuscation. It just unconditonally
calls fsnotify_sb_has_watchers() to filter the events. No silly games
with doing two entirely different things based on a random constant
argument.
So this is why I say that any new fsnotify events will be NAK'ed and
not merged by me unless it's all obvious, and unless it all obviously
DOES NOT USE these inline garbage "helper" functions.
The new logic had better be very obviously *only* using the
file->f_mode bits, and just calling out-of-line to do the work. If I
have to walk through several layers of inline functions, and look at
what magic arguments those inline functions get just to see what the
hell they actually do, I'm not going to merge it.
Because I'm really tired of actively obfuscated VFS hooks that use
inline functions to hide what the hell they are doing and whether they
are expensive or not.
Your fsnotify_file_range() uses fsnotify_parent(), which is another of
those "it does two different things" functions that either call
fsnotify() on the dentry, or call __fsnotify_parent() on it if it's an
inode, which means that it's another case of "what does this actually
do" which is pointlessly hard to follow, since clearly for a truncate
event it can't be a directory.
And to make matters worse, fsnotify_truncate_perm() actually checks
truncate events for directories and regular files, when truncates
can't actually happen for anything but regular files in the first
place. So your helper function does a nonsensical cray-cray test that
shouldn't exist.
That makes it another "this is not a helper function, this is just obfuscation".
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 21:22 ` Linus Torvalds
@ 2024-11-13 22:35 ` Amir Goldstein
2024-11-13 23:07 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 22:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 10:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 13 Nov 2024 at 11:11, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > >
> > > This whole "add another crazy inline function using another crazy
> > > helper needs to STOP. Later on in the patch series you do
> > >
> >
> > The patch that I sent did add another convenience helper
> > fsnotify_path(), but as long as it is not hiding crazy tests,
> > and does not expand to huge inlined code, I don't see the problem.
>
> So I don't mind adding a new inline function for convenience.
>
> But I do mind the whole "multiple levels of inline functions" model,
> and the thing I _particularly_ hate is the "mask is usually constant
> so that the effect of the inline function is practically two different
> things" as exemplified by "fsnotify_file()" and friends.
>
> At that point, the inline function isn't a helper any more, it's a
> hindrance to understanding what the heck is going on.
>
> Basically, as an example: fsnotify_file() is actually two very
> different things depending on the "mask" argument, an that argument is
> *typically* a constant.
>
> In fact, in fsnotify_file_area_perm() is very much is a constant, but
> to make it extra non-obvious, it's a *hidden* constant, using
>
> __u32 fsnotify_mask = FS_ACCESS_PERM;
>
> to hide the fact that it's actually calling fsnotify_file() with that
> constant argument.
Yeh, that specific "obfuscation" is a leftover from history.
It is already gone in the patches that we sent.
>
> And in fsnotify_open() it's not exactly a constant, but it's kind of
> one: when you actually look at fsnotify_file(), it has that "I do a
> different filtering event based on mask", and the two different
> constants fsnotify_open() uses are actually the same for that mask.
>
> In other words, that whole "mask" test part of fsnotify_file()
>
> /* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
> if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
> !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
> FSNOTIFY_PRIO_CONTENT))
> return 0;
>
> mess is actually STATICALLY TRUE OR FALSE, but it's made out to be
> somehow an "arghumenty" to the function, and it's really obfuscated.
>
Yeh. I see that problem absolutely.
This is already gone in the patch that I send you today:
- All the old hooks call fsnotify_file() that only checks FMODE_NONOTIFY
and calls fsnotify_path()
- The permission hooks now check FMODE_NONOTIFY_PERM
and call fsnotify_path()
> That is the kind of "helper inline" that I don't want to see in the
> new paths. Making that conditional more complicated was part of what I
> objected to in one of the patches.
>
> > Those convenience helpers help me to maintain readability and code
> > reuse.
>
> ABSOLUTELY NOT.
>
> That "convenience helkper" does exactly the opposite. It explicitly
> and actively obfuscates when the actual
> fsnotify_sb_has_priority_watchers() filtering is done.
>
> That helper is evil.
>
> Just go and look at the actual uses, let's take
> fsnotify_file_area_perm() as an example. As mentioned, as an extra
> level of obfuscation, that horrid "helper" function tries to hide how
> "mask" is constant by doing
>
> __u32 fsnotify_mask = FS_ACCESS_PERM;
>
> and then never modifying it, and then doing
>
> return fsnotify_file(file, fsnotify_mask);
>
> but if you walk through the logic, you now see that ok, that means
> that the "mask" conditional fsnotify_file() is actually just
>
> FS_ACCESS_PERM & ALL_FSNOTIFY_PERM_EVENTS
>
> which is always true, so it means that fsnotify_file_area_perm()
> unconditionally does that
>
> fsnotify_sb_has_priority_watchers(..)
>
> filitering.
>
> And dammit, you shouldn't have to walk through that pointless "helper"
> variable, and that pointless "helper" inline function to see that. It
> shouldn't be the case that fsnotify_file() does two completely
> different things based on a constant argument.
>
ok. that's going to be history soon.
I will send this cleanup patch regardless of the pre-content series.
> It would have literally been much clearer to just have two explicitly
> different versions of that function, *WITHOUT* some kind of
> pseudo-conditional that isn't actually a conditional, and just have
> fsnotify_file_area_perm() be very explicit about the fact that it uses
> the fsnotify_sb_has_priority_watchers() logic.
>
> IOW, that conditional only makes it harder to see what the actual
> rules are. For no good reason.
>
> Look, magically for some reason fsnotify_name() could do the same
> thing without this kind of silly obfuscation. It just unconditonally
> calls fsnotify_sb_has_watchers() to filter the events. No silly games
> with doing two entirely different things based on a random constant
> argument.
>
> So this is why I say that any new fsnotify events will be NAK'ed and
> not merged by me unless it's all obvious, and unless it all obviously
> DOES NOT USE these inline garbage "helper" functions.
>
> The new logic had better be very obviously *only* using the
> file->f_mode bits, and just calling out-of-line to do the work. If I
> have to walk through several layers of inline functions, and look at
> what magic arguments those inline functions get just to see what the
> hell they actually do, I'm not going to merge it.
Sure for new hooks with new check-on-open semantics that is
going to be easy to do. The historic reason for the heavy inlining
is trying to optimize out indirect calls when we do not have the
luxury of using the check-on-open semantics.
>
> Because I'm really tired of actively obfuscated VFS hooks that use
> inline functions to hide what the hell they are doing and whether they
> are expensive or not.
>
> Your fsnotify_file_range() uses fsnotify_parent(), which is another of
> those "it does two different things" functions that either call
> fsnotify() on the dentry, or call __fsnotify_parent() on it if it's an
> inode, which means that it's another case of "what does this actually
> do" which is pointlessly hard to follow, since clearly for a truncate
> event it can't be a directory.
>
> And to make matters worse, fsnotify_truncate_perm() actually checks
> truncate events for directories and regular files, when truncates
> can't actually happen for anything but regular files in the first
> place. So your helper function does a nonsensical cray-cray test that
> shouldn't exist.
Ha, right, that's a stupid copy&paste braino.
Easy to fix.
The simplest thing to do for the new hooks I think is to make
fsnotify_file_range() extern and then you won't need to look beyond it,
because it already comes after the unlikley FMODE_NONOTIFY_ bits check.
Will work on the rest of the series following those guidelines.
Let me know if the patch I sent you has taken a wrong direction.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 05/18] fsnotify: introduce pre-content permission events
2024-11-13 22:35 ` Amir Goldstein
@ 2024-11-13 23:07 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 23:07 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, 13 Nov 2024 at 14:35, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Sure for new hooks with new check-on-open semantics that is
> going to be easy to do. The historic reason for the heavy inlining
> is trying to optimize out indirect calls when we do not have the
> luxury of using the check-on-open semantics.
Right. I'm not asking you to fix the old cases - it would be lovely to
do, but I think that's a different story. The compiler *does* figure
out the oddities, so usually generated code doesn't look horrible, but
it's really hard for a human to understand.
And honestly, code that "the compiler can figure out, but ordinary
humans can't" isn't great code.
And hey, we have tons of "isn't great code". Stuff happens. And the
fsnotify code in particular has this really odd history of
inotify/dnotify/unification and the VFS layer also having been
modified under it and becoming much more complex.
I really wish we could just throw some of the legacy cases away. Oh well.
But because I'm very sensitive to the VFS layer core code, and partly
*because* we have this bad history of horridness here (and
particularly in the security hooks), I just want to make really sure
that the new cases do *not* use the same completely incomprehensible
model with random conditionals that make no sense.
So that's why I then react so strongly to some of this.
Put another way: I'm not expecting the fsnotify_file() and
fsnotify_parent() horror to go away. But I *am* expecting new
interfaces to not use them, and not write new code like that again.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v7 06/18] fsnotify: pass optional file access range in pre-content event
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (4 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 05/18] fsnotify: introduce pre-content permission events Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 07/18] fsnotify: generate pre-content permission event on open Josef Bacik
` (11 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
We would like to add file range information to pre-content events.
Pass a struct file_range with offset and length to event handler
along with pre-content permission event.
The offset and length are aligned to page size, but we may need to
align them to minimum folio size for filesystems with large block size.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 11 +++++++++--
fs/notify/fanotify/fanotify.h | 2 ++
include/linux/fsnotify.h | 28 ++++++++++++++++++++++++----
include/linux/fsnotify_backend.h | 32 ++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 6 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 24c7c5df4998..2e6ba94ec405 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -548,9 +548,13 @@ static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
return &pevent->fae;
}
-static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
+static struct fanotify_event *fanotify_alloc_perm_event(const void *data,
+ int data_type,
gfp_t gfp)
{
+ const struct path *path = fsnotify_data_path(data, data_type);
+ const struct file_range *range =
+ fsnotify_data_file_range(data, data_type);
struct fanotify_perm_event *pevent;
pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
@@ -564,6 +568,9 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
pevent->hdr.len = 0;
pevent->state = FAN_EVENT_INIT;
pevent->path = *path;
+ /* NULL ppos means no range info */
+ pevent->ppos = range ? &range->pos : NULL;
+ pevent->count = range ? range->count : 0;
path_get(path);
return &pevent->fae;
@@ -801,7 +808,7 @@ static struct fanotify_event *fanotify_alloc_event(
old_memcg = set_active_memcg(group->memcg);
if (fanotify_is_perm_event(mask)) {
- event = fanotify_alloc_perm_event(path, gfp);
+ event = fanotify_alloc_perm_event(data, data_type, gfp);
} else if (fanotify_is_error_event(mask)) {
event = fanotify_alloc_error_event(group, fsid, data,
data_type, &hash);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index e5ab33cae6a7..93598b7d5952 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -425,6 +425,8 @@ FANOTIFY_PE(struct fanotify_event *event)
struct fanotify_perm_event {
struct fanotify_event fae;
struct path path;
+ const loff_t *ppos; /* optional file range info */
+ size_t count;
u32 response; /* userspace answer to the event */
unsigned short state; /* state of the event */
int fd; /* fd we passed to userspace for this event */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 0f44cd60ac9a..7110bc2f5aa7 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -154,9 +154,16 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
-static inline int fsnotify_pre_content(struct file *file)
+static inline int fsnotify_pre_content(const struct file *file,
+ const loff_t *ppos, size_t count)
{
struct inode *inode = file_inode(file);
+ struct file_range range;
+ const void *data;
+ int data_type;
+
+ if (!fsnotify_file_watchable(file, FS_PRE_ACCESS))
+ return 0;
/*
* Pre-content events are only reported for regular files and dirs
@@ -168,7 +175,20 @@ static inline int fsnotify_pre_content(struct file *file)
FSNOTIFY_PRIO_PRE_CONTENT))
return 0;
- return fsnotify_file(file, FS_PRE_ACCESS);
+ /* Report page aligned range only when pos is known */
+ if (ppos) {
+ range.path = &file->f_path;
+ range.pos = PAGE_ALIGN_DOWN(*ppos);
+ range.count = PAGE_ALIGN(*ppos + count) - range.pos;
+ data = ⦥
+ data_type = FSNOTIFY_EVENT_FILE_RANGE;
+ } else {
+ data = &file->f_path;
+ data_type = FSNOTIFY_EVENT_PATH;
+ }
+
+ return fsnotify_parent(file->f_path.dentry, FS_PRE_ACCESS,
+ data, data_type);
}
/*
@@ -188,7 +208,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
* read()/write and other types of access generate pre-content events.
*/
if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
- int ret = fsnotify_pre_content(file);
+ int ret = fsnotify_pre_content(file, ppos, count);
if (ret)
return ret;
@@ -205,7 +225,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
}
/*
- * fsnotify_file_perm - permission hook before file access
+ * fsnotify_file_perm - permission hook before file access (unknown range)
*/
static inline int fsnotify_file_perm(struct file *file, int perm_mask)
{
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 9bda354b5538..abd292edb48c 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -294,6 +294,7 @@ static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
/* When calling fsnotify tell it if the data is a path or inode */
enum fsnotify_data_type {
FSNOTIFY_EVENT_NONE,
+ FSNOTIFY_EVENT_FILE_RANGE,
FSNOTIFY_EVENT_PATH,
FSNOTIFY_EVENT_INODE,
FSNOTIFY_EVENT_DENTRY,
@@ -306,6 +307,17 @@ struct fs_error_report {
struct super_block *sb;
};
+struct file_range {
+ const struct path *path;
+ loff_t pos;
+ size_t count;
+};
+
+static inline const struct path *file_range_path(const struct file_range *range)
+{
+ return range->path;
+}
+
static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
{
switch (data_type) {
@@ -315,6 +327,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
return d_inode(data);
case FSNOTIFY_EVENT_PATH:
return d_inode(((const struct path *)data)->dentry);
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return d_inode(file_range_path(data)->dentry);
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *)data)->inode;
default:
@@ -330,6 +344,8 @@ static inline struct dentry *fsnotify_data_dentry(const void *data, int data_typ
return (struct dentry *)data;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry;
default:
return NULL;
}
@@ -341,6 +357,8 @@ static inline const struct path *fsnotify_data_path(const void *data,
switch (data_type) {
case FSNOTIFY_EVENT_PATH:
return data;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data);
default:
return NULL;
}
@@ -356,6 +374,8 @@ static inline struct super_block *fsnotify_data_sb(const void *data,
return ((struct dentry *)data)->d_sb;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry->d_sb;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry->d_sb;
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *) data)->sb;
default:
@@ -375,6 +395,18 @@ static inline struct fs_error_report *fsnotify_data_error_report(
}
}
+static inline const struct file_range *fsnotify_data_file_range(
+ const void *data,
+ int data_type)
+{
+ switch (data_type) {
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return (struct file_range *)data;
+ default:
+ return NULL;
+ }
+}
+
/*
* Index to merged marks iterator array that correlates to a type of watch.
* The type of watched object can be deduced from the iterator type, but not
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 07/18] fsnotify: generate pre-content permission event on open
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (5 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 06/18] fsnotify: pass optional file access range in pre-content event Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 19:54 ` Linus Torvalds
2024-11-12 17:55 ` [PATCH v7 08/18] fsnotify: generate pre-content permission event on truncate Josef Bacik
` (10 subsequent siblings)
17 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Generate pre-content event on open in addition to FS_OPEN_PERM,
but without sb_writers held and after file was truncated
in case file was opened with O_CREAT and/or O_TRUNC.
The event will have a range info of [0..0] to provide an opportunity
to fill entire file content on open.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/namei.c | 10 +++++++++-
include/linux/fsnotify.h | 4 +++-
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 9d30c7aa9aa6..a1a5b10893f6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3836,7 +3836,15 @@ static int do_open(struct nameidata *nd,
}
if (do_truncate)
mnt_drop_write(nd->path.mnt);
- return error;
+ if (error)
+ return error;
+
+ /*
+ * This permission hook is different than fsnotify_open_perm() hook.
+ * This is a pre-content hook that is called without sb_writers held
+ * and after the file was truncated.
+ */
+ return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
}
/**
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 7110bc2f5aa7..2d1c13df112c 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -193,6 +193,8 @@ static inline int fsnotify_pre_content(const struct file *file,
/*
* fsnotify_file_area_perm - permission hook before access of file range
+ *
+ * Called post open with access range [0..0].
*/
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
@@ -207,7 +209,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
/*
* read()/write and other types of access generate pre-content events.
*/
- if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+ if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) {
int ret = fsnotify_pre_content(file, ppos, count);
if (ret)
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 07/18] fsnotify: generate pre-content permission event on open
2024-11-12 17:55 ` [PATCH v7 07/18] fsnotify: generate pre-content permission event on open Josef Bacik
@ 2024-11-12 19:54 ` Linus Torvalds
2024-11-12 23:40 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-12 19:54 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
>
> + /*
> + * This permission hook is different than fsnotify_open_perm() hook.
> + * This is a pre-content hook that is called without sb_writers held
> + * and after the file was truncated.
> + */
> + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> }
I still object to this all.
You can't say "permission denied" after you've already truncated the
file. It's not a sane model. I complained about that earlier, it seems
that complaint was missed in the other complaints.
Also, this whole "This permission hook is different than
fsnotify_open_perm() hook" statement is purely because
fsnotify_open_perm() itself was broken and called from the wrong place
as mentioned in the other email.
Fix *THAT* first, then unify the two places that should *not* be
different into one single "this is the fsnotify_open" code. And that
place explicitly sets that FMODE_NOTIFY_PERM bit, and makes sure that
it does *not* set it for FMODE_NONOTIFY or FMODE_PATH cases.
And then please work on making sure that that isn't called unless
actually required.
The actual real "pre-content permission events" should then ONLY test
the FMODE_NOTIFY_PERM bit. Nothing else. None of this "re-use the
existing fsnotify_file() logic" stuff. Noe extra tests, no extra
logic.
Don't make me jump through filve layers of inline functions that all
test different 'mask' bits, just to verify that the open / read /
write paths don't do something stupid.
IOW, make it straightforward and obvious what you are doing, and make
it very clear that you're not pointlessly testing things like
FMODE_NONOTIFY when the *ONLY* thing that should be tested is whether
FMODE_NOTIFY_PERM is set.
Please.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 07/18] fsnotify: generate pre-content permission event on open
2024-11-12 19:54 ` Linus Torvalds
@ 2024-11-12 23:40 ` Amir Goldstein
2024-11-13 0:58 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-12 23:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 8:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 09:56, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > + /*
> > + * This permission hook is different than fsnotify_open_perm() hook.
> > + * This is a pre-content hook that is called without sb_writers held
> > + * and after the file was truncated.
> > + */
> > + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> > }
>
> I still object to this all.
>
> You can't say "permission denied" after you've already truncated the
> file. It's not a sane model. I complained about that earlier, it seems
> that complaint was missed in the other complaints.
>
Not missed.
I answered here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxg0k4bGz6zOKS+Qt5BjEqDdUhvgG+5pLBPqSCcnQdffig@mail.gmail.com/
Starting with "...I understand why it seems stupid..."
Nevertheless, we can also drop this patch for now,
I don't think the post-open is a must-have hook for HSM.
> Also, this whole "This permission hook is different than
> fsnotify_open_perm() hook" statement is purely because
> fsnotify_open_perm() itself was broken and called from the wrong place
> as mentioned in the other email.
>
You wrote it should be called "in the open path" - that is ambiguous.
pre-content hook must be called without sb_writers held, so current
(in linux-next) location of fsnotify_open_perm() is not good in case of
O_CREATE flag, so I am not sure where a good location is.
Easier is to drop this patch.
> Fix *THAT* first, then unify the two places that should *not* be
> different into one single "this is the fsnotify_open" code. And that
> place explicitly sets that FMODE_NOTIFY_PERM bit, and makes sure that
> it does *not* set it for FMODE_NONOTIFY or FMODE_PATH cases.
>
> And then please work on making sure that that isn't called unless
> actually required.
>
> The actual real "pre-content permission events" should then ONLY test
> the FMODE_NOTIFY_PERM bit. Nothing else. None of this "re-use the
> existing fsnotify_file() logic" stuff. Noe extra tests, no extra
> logic.
>
> Don't make me jump through filve layers of inline functions that all
> test different 'mask' bits, just to verify that the open / read /
> write paths don't do something stupid.
>
> IOW, make it straightforward and obvious what you are doing, and make
> it very clear that you're not pointlessly testing things like
> FMODE_NONOTIFY when the *ONLY* thing that should be tested is whether
> FMODE_NOTIFY_PERM is set.
>
> Please.
Yes. Clear.
Thanks for taking the time to look closer.
and thanks for the feedback,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v7 07/18] fsnotify: generate pre-content permission event on open
2024-11-12 23:40 ` Amir Goldstein
@ 2024-11-13 0:58 ` Linus Torvalds
2024-11-13 10:12 ` Amir Goldstein
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-11-13 0:58 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Tue, 12 Nov 2024 at 15:41, Amir Goldstein <amir73il@gmail.com> wrote:
>
> You wrote it should be called "in the open path" - that is ambiguous.
> pre-content hook must be called without sb_writers held, so current
> (in linux-next) location of fsnotify_open_perm() is not good in case of
> O_CREATE flag, so I am not sure where a good location is.
> Easier is to drop this patch.
Dropping that patch obviously removes my objection.
But since none of the whole "return errors" is valid with a truncate
or a new file creation anyway, isn't the whole thing kind of moot?
I guess do_open() could do it, but only inside a
if (!error && !do_truncate && !(file->f_mode & FMODE_CREATED))
error = fsnotify_opened_old(file);
kind of thing. With a big comment about how this is a pre-read hook,
and not relevant for a new file or a truncate event since then it's
always empty anyway.
But hey, if you don't absolutely need it in the first place, not
having it is *MUCH* preferable.
It sounds like the whole point was to catch reads - not opens. So then
you should catch it at read() time, not at open() time.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 07/18] fsnotify: generate pre-content permission event on open
2024-11-13 0:58 ` Linus Torvalds
@ 2024-11-13 10:12 ` Amir Goldstein
0 siblings, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 10:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Wed, Nov 13, 2024 at 1:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Nov 2024 at 15:41, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > You wrote it should be called "in the open path" - that is ambiguous.
> > pre-content hook must be called without sb_writers held, so current
> > (in linux-next) location of fsnotify_open_perm() is not good in case of
> > O_CREATE flag, so I am not sure where a good location is.
> > Easier is to drop this patch.
>
> Dropping that patch obviously removes my objection.
>
> But since none of the whole "return errors" is valid with a truncate
> or a new file creation anyway, isn't the whole thing kind of moot?
>
Not moot. It is needed for the case that open with O_CREAT
finds an existing file and that file needs to be filled on open
and anyway do_open() is also taking sb_writers for O_RDWR
and O_WRONLY (not 100% sure why) not only for O_CREAT.
Essentially, this means that the legacy FAN_OPEN_PERM event
is not safe to be used by HSM, to fill file content on open.
and while I can document that fact all over the internet, that won't
stop people from using FAN_OPEN_PERM to implement a simple
HSM.
This is (the only) reason that I wanted to have a noticeable new event
at open time that is documented as safe for use by HSM and inviting
HSM developers to use the correct event.
Very possible that this is not a good enough reason.
> I guess do_open() could do it, but only inside a
>
> if (!error && !do_truncate && !(file->f_mode & FMODE_CREATED))
> error = fsnotify_opened_old(file);
>
> kind of thing. With a big comment about how this is a pre-read hook,
> and not relevant for a new file or a truncate event since then it's
> always empty anyway.
Right. That would be good for what I wanted to achieve.
>
> But hey, if you don't absolutely need it in the first place, not
> having it is *MUCH* preferable.
>
> It sounds like the whole point was to catch reads - not opens. So then
> you should catch it at read() time, not at open() time.
Yeh, for sure.
Will drop this patch.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v7 08/18] fsnotify: generate pre-content permission event on truncate
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (6 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 07/18] fsnotify: generate pre-content permission event on open Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
` (9 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Generate FS_PRE_ACCESS event before truncate, without sb_writers held.
Move the security hooks also before sb_start_write() to conform with
other security hooks (e.g. in write, fallocate).
The event will have a range info of the page surrounding the new size
to provide an opportunity to fill the conetnt at the end of file before
truncating to non-page aligned size.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/open.c | 31 +++++++++++++++++++++----------
include/linux/fsnotify.h | 32 ++++++++++++++++++++++----------
2 files changed, 43 insertions(+), 20 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index e6911101fe71..e75456cda440 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -81,14 +81,18 @@ long vfs_truncate(const struct path *path, loff_t length)
if (!S_ISREG(inode->i_mode))
return -EINVAL;
- error = mnt_want_write(path->mnt);
- if (error)
- goto out;
-
idmap = mnt_idmap(path->mnt);
error = inode_permission(idmap, inode, MAY_WRITE);
if (error)
- goto mnt_drop_write_and_out;
+ return error;
+
+ error = fsnotify_truncate_perm(path, length);
+ if (error)
+ return error;
+
+ error = mnt_want_write(path->mnt);
+ if (error)
+ return error;
error = -EPERM;
if (IS_APPEND(inode))
@@ -114,7 +118,7 @@ long vfs_truncate(const struct path *path, loff_t length)
put_write_access(inode);
mnt_drop_write_and_out:
mnt_drop_write(path->mnt);
-out:
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_truncate);
@@ -175,11 +179,18 @@ long do_ftruncate(struct file *file, loff_t length, int small)
/* Check IS_APPEND on real upper inode */
if (IS_APPEND(file_inode(file)))
return -EPERM;
- sb_start_write(inode->i_sb);
+
error = security_file_truncate(file);
- if (!error)
- error = do_truncate(file_mnt_idmap(file), dentry, length,
- ATTR_MTIME | ATTR_CTIME, file);
+ if (error)
+ return error;
+
+ error = fsnotify_truncate_perm(&file->f_path, length);
+ if (error)
+ return error;
+
+ sb_start_write(inode->i_sb);
+ error = do_truncate(file_mnt_idmap(file), dentry, length,
+ ATTR_MTIME | ATTR_CTIME, file);
sb_end_write(inode->i_sb);
return error;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 2d1c13df112c..f1ef072a3b2f 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -154,17 +154,14 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
-static inline int fsnotify_pre_content(const struct file *file,
+static inline int fsnotify_pre_content(const struct path *path,
const loff_t *ppos, size_t count)
{
- struct inode *inode = file_inode(file);
+ struct inode *inode = d_inode(path->dentry);
struct file_range range;
const void *data;
int data_type;
- if (!fsnotify_file_watchable(file, FS_PRE_ACCESS))
- return 0;
-
/*
* Pre-content events are only reported for regular files and dirs
* if there are any pre-content event watchers on this sb.
@@ -177,18 +174,17 @@ static inline int fsnotify_pre_content(const struct file *file,
/* Report page aligned range only when pos is known */
if (ppos) {
- range.path = &file->f_path;
+ range.path = path;
range.pos = PAGE_ALIGN_DOWN(*ppos);
range.count = PAGE_ALIGN(*ppos + count) - range.pos;
data = ⦥
data_type = FSNOTIFY_EVENT_FILE_RANGE;
} else {
- data = &file->f_path;
+ data = path;
data_type = FSNOTIFY_EVENT_PATH;
}
- return fsnotify_parent(file->f_path.dentry, FS_PRE_ACCESS,
- data, data_type);
+ return fsnotify_parent(path->dentry, FS_PRE_ACCESS, data, data_type);
}
/*
@@ -206,11 +202,14 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
*/
lockdep_assert_once(file_write_not_started(file));
+ if (!fsnotify_file_watchable(file, FS_PRE_ACCESS | FS_ACCESS_PERM))
+ return 0;
+
/*
* read()/write and other types of access generate pre-content events.
*/
if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) {
- int ret = fsnotify_pre_content(file, ppos, count);
+ int ret = fsnotify_pre_content(&file->f_path, ppos, count);
if (ret)
return ret;
@@ -226,6 +225,14 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
return fsnotify_file(file, FS_ACCESS_PERM);
}
+/*
+ * fsnotify_truncate_perm - permission hook before file truncate
+ */
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+ return fsnotify_pre_content(path, &length, 0);
+}
+
/*
* fsnotify_file_perm - permission hook before file access (unknown range)
*/
@@ -254,6 +261,11 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
return 0;
}
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+ return 0;
+}
+
static inline int fsnotify_file_perm(struct file *file, int perm_mask)
{
return 0;
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (7 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 08/18] fsnotify: generate pre-content permission event on truncate Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-15 11:28 ` Amir Goldstein
2024-11-12 17:55 ` [PATCH v7 10/18] fanotify: report file range info with pre-content events Josef Bacik
` (8 subsequent siblings)
17 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
in the context of the event handler.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill the content of files on first read access.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 3 ++-
fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
include/linux/fanotify.h | 14 ++++++++++----
include/uapi/linux/fanotify.h | 2 ++
4 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 2e6ba94ec405..da6c3c1c7edf 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -916,8 +916,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
+ BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
mask = fanotify_group_event_mask(group, iter_info, &match_mask,
mask, data, data_type, dir);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9cc4a9ac1515..2ec0cc9c85cf 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1633,11 +1633,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
unsigned int flags)
{
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
+ bool is_dir = d_is_dir(path->dentry);
/* Strict validation of events in non-dir inode mask with v5.17+ APIs */
bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
(mask & FAN_RENAME) ||
(flags & FAN_MARK_IGNORE);
+ /*
+ * Filesystems need to opt-into pre-content evnets (a.k.a HSM)
+ * and they are only supported on regular files and directories.
+ */
+ if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
+ if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
+ return -EINVAL;
+ if (!is_dir && !d_is_reg(path->dentry))
+ return -EINVAL;
+ }
+
/*
* Some filesystems such as 'proc' acquire unusual locks when opening
* files. For them fanotify permission events have high chances of
@@ -1670,7 +1682,7 @@ static int fanotify_events_supported(struct fsnotify_group *group,
* but because we always allowed it, error only when using new APIs.
*/
if (strict_dir_events && mark_type == FAN_MARK_INODE &&
- !d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
+ !is_dir && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
return -ENOTDIR;
return 0;
@@ -1771,10 +1783,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
return -EPERM;
/*
- * Permission events require minimum priority FAN_CLASS_CONTENT.
+ * Permission events are not allowed for FAN_CLASS_NOTIF.
+ * Pre-content permission events are not allowed for FAN_CLASS_CONTENT.
*/
if (mask & FANOTIFY_PERM_EVENTS &&
- group->priority < FSNOTIFY_PRIO_CONTENT)
+ group->priority == FSNOTIFY_PRIO_NORMAL)
+ return -EINVAL;
+ else if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
+ group->priority == FSNOTIFY_PRIO_CONTENT)
return -EINVAL;
if (mask & FAN_FS_ERROR &&
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 89ff45bd6f01..c747af064d2c 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -89,6 +89,16 @@
#define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE | \
FAN_RENAME)
+/* Content events can be used to inspect file content */
+#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
+ FAN_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
+
+/* Events that require a permission response from user */
+#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
+ FANOTIFY_PRE_CONTENT_EVENTS)
+
/* Events that can be reported with event->fd */
#define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
@@ -104,10 +114,6 @@
FANOTIFY_INODE_EVENTS | \
FANOTIFY_ERROR_EVENTS)
-/* Events that require a permission response from user */
-#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
- FAN_OPEN_EXEC_PERM)
-
/* Extra flags that may be reported with event or control handling of events */
#define FANOTIFY_EVENT_FLAGS (FAN_EVENT_ON_CHILD | FAN_ONDIR)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 79072b6894f2..7596168c80eb 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -27,6 +27,8 @@
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
/* #define FAN_DIR_MODIFY 0x00080000 */ /* Deprecated (reserved) */
+#define FAN_PRE_ACCESS 0x00100000 /* Pre-content access hook */
+
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
#define FAN_RENAME 0x10000000 /* File was renamed */
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-12 17:55 ` [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-11-15 11:28 ` Amir Goldstein
2024-11-15 11:47 ` Jan Kara
0 siblings, 1 reply; 51+ messages in thread
From: Amir Goldstein @ 2024-11-15 11:28 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, brauner, torvalds, linux-xfs,
linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 6:56 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> From: Amir Goldstein <amir73il@gmail.com>
>
> Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
>
> Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> in the context of the event handler.
>
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill the content of files on first read access.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> fs/notify/fanotify/fanotify.c | 3 ++-
> fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
> include/linux/fanotify.h | 14 ++++++++++----
> include/uapi/linux/fanotify.h | 2 ++
> 4 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 2e6ba94ec405..da6c3c1c7edf 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -916,8 +916,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
> BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
> + BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
>
> - BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
> + BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
>
> mask = fanotify_group_event_mask(group, iter_info, &match_mask,
> mask, data, data_type, dir);
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9cc4a9ac1515..2ec0cc9c85cf 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1633,11 +1633,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
> unsigned int flags)
> {
> unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
> + bool is_dir = d_is_dir(path->dentry);
> /* Strict validation of events in non-dir inode mask with v5.17+ APIs */
> bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
> (mask & FAN_RENAME) ||
> (flags & FAN_MARK_IGNORE);
>
> + /*
> + * Filesystems need to opt-into pre-content evnets (a.k.a HSM)
> + * and they are only supported on regular files and directories.
> + */
> + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> + return -EINVAL;
Should we make this return -EOPNOTSUPP?
This way the LTP test could report the accurate message
"FAN_PRE_ACCESS not supported in kernel" vs.
"FAN_PRE_ACCESS not supported on XXX filesystem"
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event
2024-11-15 11:28 ` Amir Goldstein
@ 2024-11-15 11:47 ` Jan Kara
0 siblings, 0 replies; 51+ messages in thread
From: Jan Kara @ 2024-11-15 11:47 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
On Fri 15-11-24 12:28:01, Amir Goldstein wrote:
> On Tue, Nov 12, 2024 at 6:56 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > From: Amir Goldstein <amir73il@gmail.com>
> >
> > Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
> > class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
> >
> > Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
> > in the context of the event handler.
> >
> > This pre-content event is meant to be used by hierarchical storage
> > managers that want to fill the content of files on first read access.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> > fs/notify/fanotify/fanotify.c | 3 ++-
> > fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
> > include/linux/fanotify.h | 14 ++++++++++----
> > include/uapi/linux/fanotify.h | 2 ++
> > 4 files changed, 33 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 2e6ba94ec405..da6c3c1c7edf 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -916,8 +916,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> > BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> > BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
> > BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
> > + BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
> >
> > - BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
> > + BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
> >
> > mask = fanotify_group_event_mask(group, iter_info, &match_mask,
> > mask, data, data_type, dir);
> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 9cc4a9ac1515..2ec0cc9c85cf 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -1633,11 +1633,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
> > unsigned int flags)
> > {
> > unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
> > + bool is_dir = d_is_dir(path->dentry);
> > /* Strict validation of events in non-dir inode mask with v5.17+ APIs */
> > bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
> > (mask & FAN_RENAME) ||
> > (flags & FAN_MARK_IGNORE);
> >
> > + /*
> > + * Filesystems need to opt-into pre-content evnets (a.k.a HSM)
> > + * and they are only supported on regular files and directories.
> > + */
> > + if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
> > + if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
> > + return -EINVAL;
>
> Should we make this return -EOPNOTSUPP?
I see no reason not to do that so go ahead.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v7 10/18] fanotify: report file range info with pre-content events
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (8 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 09/18] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 11/18] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
` (7 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
With group class FAN_CLASS_PRE_CONTENT, report offset and length info
along with FAN_PRE_ACCESS pre-content events.
This information is meant to be used by hierarchical storage managers
that want to fill partial content of files on first access to range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.h | 8 +++++++
fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 8 +++++++
3 files changed, 54 insertions(+)
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 93598b7d5952..7f06355afa1f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -448,6 +448,14 @@ static inline bool fanotify_is_perm_event(u32 mask)
mask & FANOTIFY_PERM_EVENTS;
}
+static inline bool fanotify_event_has_access_range(struct fanotify_event *event)
+{
+ if (!(event->mask & FANOTIFY_PRE_CONTENT_EVENTS))
+ return false;
+
+ return FANOTIFY_PERM(event)->ppos;
+}
+
static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
{
return container_of(fse, struct fanotify_event, fse);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 2ec0cc9c85cf..5ab9ad69915a 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -122,6 +122,8 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
+#define FANOTIFY_RANGE_INFO_LEN \
+ (sizeof(struct fanotify_event_info_range))
static int fanotify_fid_info_len(int fh_len, int name_len)
{
@@ -181,6 +183,9 @@ static size_t fanotify_event_len(unsigned int info_mode,
if (info_mode & FAN_REPORT_PIDFD)
event_len += FANOTIFY_PIDFD_INFO_LEN;
+ if (fanotify_event_has_access_range(event))
+ event_len += FANOTIFY_RANGE_INFO_LEN;
+
return event_len;
}
@@ -518,6 +523,30 @@ static int copy_pidfd_info_to_user(int pidfd,
return info_len;
}
+static size_t copy_range_info_to_user(struct fanotify_event *event,
+ char __user *buf, int count)
+{
+ struct fanotify_perm_event *pevent = FANOTIFY_PERM(event);
+ struct fanotify_event_info_range info = { };
+ size_t info_len = FANOTIFY_RANGE_INFO_LEN;
+
+ if (WARN_ON_ONCE(info_len > count))
+ return -EFAULT;
+
+ if (WARN_ON_ONCE(!pevent->ppos))
+ return -EINVAL;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_RANGE;
+ info.hdr.len = info_len;
+ info.offset = *(pevent->ppos);
+ info.count = pevent->count;
+
+ if (copy_to_user(buf, &info, info_len))
+ return -EFAULT;
+
+ return info_len;
+}
+
static int copy_info_records_to_user(struct fanotify_event *event,
struct fanotify_info *info,
unsigned int info_mode, int pidfd,
@@ -639,6 +668,15 @@ static int copy_info_records_to_user(struct fanotify_event *event,
total_bytes += ret;
}
+ if (fanotify_event_has_access_range(event)) {
+ ret = copy_range_info_to_user(event, buf, count);
+ if (ret < 0)
+ return ret;
+ buf += ret;
+ count -= ret;
+ total_bytes += ret;
+ }
+
return total_bytes;
}
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 7596168c80eb..0636a9c85dd0 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -146,6 +146,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_DFID 3
#define FAN_EVENT_INFO_TYPE_PIDFD 4
#define FAN_EVENT_INFO_TYPE_ERROR 5
+#define FAN_EVENT_INFO_TYPE_RANGE 6
/* Special info types for FAN_RENAME */
#define FAN_EVENT_INFO_TYPE_OLD_DFID_NAME 10
@@ -192,6 +193,13 @@ struct fanotify_event_info_error {
__u32 error_count;
};
+struct fanotify_event_info_range {
+ struct fanotify_event_info_header hdr;
+ __u32 pad;
+ __u64 offset;
+ __u64 count;
+};
+
/*
* User space may need to record additional information about its decision.
* The extra information type records what kind of information is included.
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 11/18] fanotify: allow to set errno in FAN_DENY permission response
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (9 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 10/18] fanotify: report file range info with pre-content events Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 12/18] fanotify: add a helper to check for pre content events Josef Bacik
` (6 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
From: Amir Goldstein <amir73il@gmail.com>
With FAN_DENY response, user trying to perform the filesystem operation
gets an error with errno set to EPERM.
It is useful for hierarchical storage management (HSM) service to be able
to deny access for reasons more diverse than EPERM, for example EAGAIN,
if HSM could retry the operation later.
Allow fanotify groups with priority FAN_CLASSS_PRE_CONTENT to responsd
to permission events with the response value FAN_DENY_ERRNO(errno),
instead of FAN_DENY to return a custom error.
Limit custom error values to errors expected on read(2)/write(2) and
open(2) of regular files. This list could be extended in the future.
Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by
writing a response to an fanotify group fd with a value of FAN_NOFD in
the fd field of the response.
The change in fanotify_response is backward compatible, because errno is
written in the high 8 bits of the 32bit response field and old kernels
reject respose value with high bits set.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 19 +++++++++++----
fs/notify/fanotify/fanotify.h | 5 ++++
fs/notify/fanotify/fanotify_user.c | 37 ++++++++++++++++++++++++++----
include/linux/fanotify.h | 5 +++-
include/uapi/linux/fanotify.h | 7 ++++++
5 files changed, 62 insertions(+), 11 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index da6c3c1c7edf..e3d04d77caba 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -223,7 +223,8 @@ static int fanotify_get_response(struct fsnotify_group *group,
struct fanotify_perm_event *event,
struct fsnotify_iter_info *iter_info)
{
- int ret;
+ int ret, errno;
+ u32 decision;
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
@@ -256,20 +257,28 @@ static int fanotify_get_response(struct fsnotify_group *group,
goto out;
}
+ decision = event->response &
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
/* userspace responded, convert to something usable */
- switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
ret = 0;
break;
case FAN_DENY:
+ /* Check custom errno from pre-content events */
+ errno = fanotify_get_response_errno(event->response);
+ if (errno) {
+ ret = -errno;
+ break;
+ }
+ fallthrough;
default:
ret = -EPERM;
}
/* Check if the response should be audited */
- if (event->response & FAN_AUDIT)
- audit_fanotify(event->response & ~FAN_AUDIT,
- &event->audit_rule);
+ if (decision & FAN_AUDIT)
+ audit_fanotify(decision & ~FAN_AUDIT, &event->audit_rule);
pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
group, event, ret);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 7f06355afa1f..9e93aba210c9 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -528,3 +528,8 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
return mflags;
}
+
+static inline u32 fanotify_get_response_errno(int res)
+{
+ return res >> FAN_ERRNO_SHIFT;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5ab9ad69915a..b83fefc8aa2d 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -329,11 +329,14 @@ static int process_access_response(struct fsnotify_group *group,
struct fanotify_perm_event *event;
int fd = response_struct->fd;
u32 response = response_struct->response;
+ u32 decision = response &
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
+ int errno = fanotify_get_response_errno(response);
int ret = info_len;
struct fanotify_response_info_audit_rule friar;
- pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
- group, fd, response, info, info_len);
+ pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n",
+ __func__, group, fd, response, errno, info, info_len);
/*
* make sure the response is valid, if invalid we do nothing and either
* userspace can send a valid response or we will clean it up after the
@@ -342,18 +345,42 @@ static int process_access_response(struct fsnotify_group *group,
if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
return -EINVAL;
- switch (response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
+ if (errno)
+ return -EINVAL;
+ break;
case FAN_DENY:
+ /* Custom errno is supported only for pre-content groups */
+ if (errno && group->priority != FSNOTIFY_PRIO_PRE_CONTENT)
+ return -EINVAL;
+
+ /*
+ * Limit errno to values expected on open(2)/read(2)/write(2)
+ * of regular files.
+ */
+ switch (errno) {
+ case 0:
+ case EIO:
+ case EPERM:
+ case EBUSY:
+ case ETXTBSY:
+ case EAGAIN:
+ case ENOSPC:
+ case EDQUOT:
+ break;
+ default:
+ return -EINVAL;
+ }
break;
default:
return -EINVAL;
}
- if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
+ if ((decision & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
return -EINVAL;
- if (response & FAN_INFO) {
+ if (decision & FAN_INFO) {
ret = process_access_response_info(info, info_len, &friar);
if (ret < 0)
return ret;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index c747af064d2c..d9bb48976b53 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -132,7 +132,10 @@
/* These masks check for invalid bits in permission responses. */
#define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY)
#define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO)
-#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS)
+#define FANOTIFY_RESPONSE_ERRNO (FAN_ERRNO_MASK << FAN_ERRNO_SHIFT)
+#define FANOTIFY_RESPONSE_VALID_MASK \
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \
+ FANOTIFY_RESPONSE_ERRNO)
/* Do not use these old uapi constants internally */
#undef FAN_ALL_CLASS_BITS
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 0636a9c85dd0..bd8167979707 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -235,6 +235,13 @@ struct fanotify_response_info_audit_rule {
/* Legit userspace responses to a _PERM event */
#define FAN_ALLOW 0x01
#define FAN_DENY 0x02
+/* errno other than EPERM can specified in upper byte of deny response */
+#define FAN_ERRNO_BITS 8
+#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS)
+#define FAN_ERRNO_MASK ((1 << FAN_ERRNO_BITS) - 1)
+#define FAN_DENY_ERRNO(err) \
+ (FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT))
+
#define FAN_AUDIT 0x10 /* Bitmask to create audit record for result */
#define FAN_INFO 0x20 /* Bitmask to indicate additional information */
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 12/18] fanotify: add a helper to check for pre content events
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (10 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 11/18] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-13 18:33 ` Amir Goldstein
2024-11-12 17:55 ` [PATCH v7 13/18] fanotify: disable readahead if we have pre-content watches Josef Bacik
` (5 subsequent siblings)
17 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
We want to emit events during page fault, and calling into fanotify
could be expensive, so add a helper to allow us to skip calling into
fanotify from page fault. This will also be used to disable readahead
for content watched files which will be handled in a subsequent patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fsnotify.c | 12 ++++++++++++
include/linux/fsnotify_backend.h | 26 ++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index cab5a1a16e57..17047c44cf91 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -203,6 +203,18 @@ static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
}
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_object_watched(struct file *file, __u32 mask)
+{
+ struct inode *inode = file_inode(file);
+ __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
+
+ return fsnotify_object_watched(inode, mnt_mask, mask);
+}
+EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
+#endif
+
+
/*
* Notify this dentry's parent about a child's events with child name info
* if parent is watching or if inode/sb/mount are interested in events with
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index abd292edb48c..a92d59b66f93 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -896,6 +896,27 @@ static inline void fsnotify_init_event(struct fsnotify_event *event)
INIT_LIST_HEAD(&event->list);
}
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_object_watched(struct file *file, __u32 mask);
+
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ if (!(file->f_mode & FMODE_NOTIFY_PERM))
+ return false;
+
+ if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
+ return false;
+
+ return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
+}
+
+#else
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return false;
+}
+#endif /* CONFIG_FANOTIFY_ACCESS_PERMISSIONS */
+
#else
static inline int fsnotify(__u32 mask, const void *data, int data_type,
@@ -934,6 +955,11 @@ static inline u32 fsnotify_get_cookie(void)
static inline void fsnotify_unmount_inodes(struct super_block *sb)
{}
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return false;
+}
+
#endif /* CONFIG_FSNOTIFY */
#endif /* __KERNEL __ */
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v7 12/18] fanotify: add a helper to check for pre content events
2024-11-12 17:55 ` [PATCH v7 12/18] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-11-13 18:33 ` Amir Goldstein
0 siblings, 0 replies; 51+ messages in thread
From: Amir Goldstein @ 2024-11-13 18:33 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, brauner, torvalds, linux-xfs,
linux-btrfs, linux-mm, linux-ext4
On Tue, Nov 12, 2024 at 6:56 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> We want to emit events during page fault, and calling into fanotify
> could be expensive, so add a helper to allow us to skip calling into
> fanotify from page fault. This will also be used to disable readahead
> for content watched files which will be handled in a subsequent patch.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/notify/fsnotify.c | 12 ++++++++++++
> include/linux/fsnotify_backend.h | 26 ++++++++++++++++++++++++++
> 2 files changed, 38 insertions(+)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index cab5a1a16e57..17047c44cf91 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -203,6 +203,18 @@ static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
> return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
> }
>
> +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +bool fsnotify_file_object_watched(struct file *file, __u32 mask)
> +{
> + struct inode *inode = file_inode(file);
> + __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
> +
> + return fsnotify_object_watched(inode, mnt_mask, mask);
> +}
> +EXPORT_SYMBOL_GPL(fsnotify_file_object_watched);
> +#endif
> +
FYI, I was going to use this helper to set the FMODE_ flags, but
I noticed that it is missing the check for parent watching pre content
events.
The other user of fsnotify_object_watched(), __fsnotify_parent()
explicitly checks the fsnotify_inode_watches_children() mask.
I will need to add this.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v7 13/18] fanotify: disable readahead if we have pre-content watches
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (11 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 12/18] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 14/18] mm: don't allow huge faults for files with pre content watches Josef Bacik
` (4 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and insert them into the file
without emitting an fanotify event. To avoid this case, disable
readahead if we have pre-content watches on the file. This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/filemap.c | 12 ++++++++++++
mm/readahead.c | 13 +++++++++++++
2 files changed, 25 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 196779e8e396..68ea596f6905 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3151,6 +3151,14 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
unsigned long vm_flags = vmf->vma->vm_flags;
unsigned int mmap_miss;
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't populate our mapping with 0 filled pages that we
+ * never emitted an event for.
+ */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
@@ -3219,6 +3227,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file *fpin = NULL;
unsigned int mmap_miss;
+ /* See comment in do_sync_mmap_readahead. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
diff --git a/mm/readahead.c b/mm/readahead.c
index 9a807727d809..277c2061fc82 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@
#include <linux/blk-cgroup.h>
#include <linux/fadvise.h>
#include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
#include "internal.h"
@@ -544,6 +545,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
unsigned long max_pages, contig_count;
pgoff_t prev_index, miss;
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't find 0 filled pages in cache that we never emitted
+ * events for.
+ */
+ if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Even if readahead is disabled, issue this request as readahead
* as we'll need it to satisfy the requested range. The forced
@@ -622,6 +631,10 @@ void page_cache_async_ra(struct readahead_control *ractl,
if (!ra->ra_pages)
return;
+ /* See the comment in page_cache_sync_ra. */
+ if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Same bit is used for PG_readahead and PG_reclaim.
*/
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 14/18] mm: don't allow huge faults for files with pre content watches
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (12 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 13/18] fanotify: disable readahead if we have pre-content watches Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 15/18] fsnotify: generate pre-content permission event on page fault Josef Bacik
` (3 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
There's nothing stopping us from supporting this, we could simply pass
the order into the helper and emit the proper length. However currently
there's no tests to validate this works properly, so disable it until
there's a desire to support this along with the appropriate tests.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index bdf77a3ec47b..dc16a0b171e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/fsnotify.h>
#include <trace/events/kmem.h>
@@ -5637,8 +5638,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
if (vma_is_anonymous(vma))
return do_huge_pmd_anonymous_page(vmf);
+ /*
+ * Currently we just emit PAGE_SIZE for our fault events, so don't allow
+ * a huge fault if we have a pre content watch on this file. This would
+ * be trivial to support, but there would need to be tests to ensure
+ * this works properly and those don't exist currently.
+ */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
return VM_FAULT_FALLBACK;
@@ -5648,6 +5658,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
vm_fault_t ret;
@@ -5662,6 +5673,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
}
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
@@ -5681,9 +5695,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
return VM_FAULT_FALLBACK;
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5695,12 +5713,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
vm_fault_t ret;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
goto split;
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 15/18] fsnotify: generate pre-content permission event on page fault
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (13 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 14/18] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 16/18] xfs: add pre-content fsnotify hook for write faults Josef Bacik
` (2 subsequent siblings)
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
on the faulting method.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.
Export a simple helper that file systems that have their own ->fault()
will use, and have a more complicated helper to be do fancy things with
in filemap_fault.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/mm.h | 1 +
mm/filemap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01c5e7a4489f..90155ef8599a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3406,6 +3406,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
+extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
extern unsigned long stack_guard_gap;
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
diff --git a/mm/filemap.c b/mm/filemap.c
index 68ea596f6905..0bf7d645dec5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -47,6 +47,7 @@
#include <linux/splice.h>
#include <linux/rcupdate_wait.h>
#include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -3289,6 +3290,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
return ret;
}
+/**
+ * filemap_fsnotify_fault - maybe emit a pre-content event.
+ * @vmf: struct vm_fault containing details of the fault.
+ * @folio: the folio we're faulting in.
+ *
+ * If we have a pre-content watch on this file we will emit an event for this
+ * range. If we return anything the fault caller should return immediately, we
+ * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
+ * fault again and then the fault handler will run the second time through.
+ *
+ * This is meant to be called with the folio that we will be filling in to make
+ * sure the event is emitted for the correct range.
+ *
+ * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
+ */
+vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
+{
+ struct file *fpin = NULL;
+ int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
+ loff_t pos = vmf->pgoff >> PAGE_SHIFT;
+ size_t count = PAGE_SIZE;
+ vm_fault_t ret;
+
+ /*
+ * We already did this and now we're retrying with everything locked,
+ * don't emit the event and continue.
+ */
+ if (vmf->flags & FAULT_FLAG_TRIED)
+ return 0;
+
+ /* No watches, we're done. */
+ if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
+ return 0;
+
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+ if (!fpin)
+ return VM_FAULT_SIGBUS;
+
+ ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
+ fput(fpin);
+ if (ret)
+ return VM_FAULT_SIGBUS;
+ return VM_FAULT_RETRY;
+}
+EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
+
/**
* filemap_fault - read in file data for page fault handling
* @vmf: struct vm_fault containing details of the fault
@@ -3392,6 +3439,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* or because readahead was otherwise unable to retrieve it.
*/
if (unlikely(!folio_test_uptodate(folio))) {
+ /*
+ * If this is a precontent file we have can now emit an event to
+ * try and populate the folio.
+ */
+ if (!(vmf->flags & FAULT_FLAG_TRIED) &&
+ fsnotify_file_has_pre_content_watches(file)) {
+ loff_t pos = folio_pos(folio);
+ size_t count = folio_size(folio);
+
+ /* We're NOWAIT, we have to retry. */
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+ folio_unlock(folio);
+ goto out_retry;
+ }
+
+ if (mapping_locked)
+ filemap_invalidate_unlock_shared(mapping);
+ mapping_locked = false;
+
+ folio_unlock(folio);
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+ if (!fpin)
+ goto out_retry;
+
+ error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
+ count);
+ if (error)
+ ret = VM_FAULT_SIGBUS;
+ goto out_retry;
+ }
+
/*
* If the invalidate lock is not held, the folio was in cache
* and uptodate and now it is not. Strange but possible since we
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 16/18] xfs: add pre-content fsnotify hook for write faults
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (14 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 15/18] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 17/18] btrfs: disable defrag on pre-content watched files Josef Bacik
2024-11-12 17:55 ` [PATCH v7 18/18] fs: enable pre-content events on supported file systems Josef Bacik
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
xfs has it's own handling for write faults, so we need to add the
pre-content fsnotify hook for this case. Reads go through filemap_fault
so they're handled properly there.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/xfs_file.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ca47cae5a40a..4fe89770ecb5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1458,6 +1458,10 @@ xfs_write_fault(
unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
vm_fault_t ret;
+ ret = filemap_fsnotify_fault(vmf);
+ if (unlikely(ret))
+ return ret;
+
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 17/18] btrfs: disable defrag on pre-content watched files
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (15 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 16/18] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
2024-11-12 17:55 ` [PATCH v7 18/18] fs: enable pre-content events on supported file systems Josef Bacik
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
We queue up inodes to be defrag'ed asynchronously, which means we do not
have their original file for readahead. This means that the code to
skip readahead on pre-content watched files will not run, and we could
potentially read in empty pages.
Handle this corner case by disabling defrag on files that are currently
being watched for pre-content events.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/ioctl.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c9302d193187..1e5913f276be 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2635,6 +2635,15 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
goto out;
}
+ /*
+ * Don't allow defrag on pre-content watched files, as it could
+ * populate the page cache with 0's via readahead.
+ */
+ if (fsnotify_file_has_pre_content_watches(file)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
if (argp) {
if (copy_from_user(&range, argp, sizeof(range))) {
ret = -EFAULT;
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH v7 18/18] fs: enable pre-content events on supported file systems
2024-11-12 17:55 [PATCH v7 00/18] fanotify: add pre-content hooks Josef Bacik
` (16 preceding siblings ...)
2024-11-12 17:55 ` [PATCH v7 17/18] btrfs: disable defrag on pre-content watched files Josef Bacik
@ 2024-11-12 17:55 ` Josef Bacik
17 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2024-11-12 17:55 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
linux-xfs, linux-btrfs, linux-mm, linux-ext4
Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/super.c | 2 +-
fs/ext4/super.c | 3 +++
fs/xfs/xfs_super.c | 2 +-
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97a85d180b61..fe6ecc3f1cab 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -961,7 +961,7 @@ static int btrfs_fill_super(struct super_block *sb,
#endif
sb->s_xattr = btrfs_xattr_handlers;
sb->s_time_gran = 1;
- sb->s_iflags |= SB_I_CGROUPWB;
+ sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
err = super_setup_bdi(sb);
if (err) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b3512d78b55c..13b9d67a4eec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5306,6 +5306,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
/* i_version is always enabled now */
sb->s_flags |= SB_I_VERSION;
+ /* HSM events are allowed by default. */
+ sb->s_iflags |= SB_I_ALLOW_HSM;
+
err = ext4_check_feature_compatibility(sb, es, silent);
if (err)
goto failed_mount;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fda75db739b1..2d1e9db8548d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1713,7 +1713,7 @@ xfs_fs_fill_super(
sb->s_time_max = XFS_LEGACY_TIME_MAX;
}
trace_xfs_inode_timestamp_range(mp, sb->s_time_min, sb->s_time_max);
- sb->s_iflags |= SB_I_CGROUPWB;
+ sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
set_posix_acl_flag(sb);
--
2.43.0
^ permalink raw reply [flat|nested] 51+ messages in thread