linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 00/17] fanotify: add pre-content hooks
@ 2024-11-11 20:17 Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 01/17] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
                   ` (18 more replies)
  0 siblings, 19 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

v5: https://lore.kernel.org/linux-fsdevel/cover.1725481503.git.josef@toxicpanda.com/
v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/

v5->v6:
- Linus had problems with this and rejected Jan's PR
  (https://lore.kernel.org/linux-fsdevel/20240923110348.tbwihs42dxxltabc@quack3/),
  so I'm respinning this series to address his concerns.  Hopefully this is more
  acceptable.
- Change the page fault hooks to happen only in the case where we have to add a
  page, not where there exists pages already.
- Amir added a hook to truncate.
- We made the flag per SB instead of per fstype, Amir wanted this because of
  some potential issues with other file system specific work he's doing.
- Dropped the bcachefs patch, there were some concerns that we were doing
  something wrong, and it's not a huge deal to not have this feature for now.
- Unfortunately the xfs write fault path still has to do the page fault hook
  before we know if we have a page or not, this is because of the locking that's
  done before we get to the part where we know if we have a page already or not,
  so that's the path that is still the same from last iteration.
- I've re-validated this series with btrfs, xfs, and ext4 to make sure I didn't
  break anything.

v4->v5:
- Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
  to respin the series anyway.
- Renamed the filemap pagefault helper for fsnotify per Christians suggestion.
- Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
- Added a patch to disable btrfs defrag on pre-content watched files.
- Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
- Added two fstests (which will be posted separately) to validate everything,
  re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
  didn't break anything.

v3->v4:
- Trying to send a final verson Friday at 5pm before you go on vacation is a
  recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
  review.
- Reworked the file system helper so it's handling of fpin was a little less
  silly, per Chinner's suggestion.
- Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
  in filemap_fault that says if VM_FAULT_ERROR is set we won't have
  VM_FAULT_RETRY set.

v2->v3:
- Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
  emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
  Amir's suggestion.
- Reworked the exported helper so the per-filesystem changes are much smaller,
  per Amir's suggestion.
- Fixed the screwup for DAX writes per Chinner's suggestion.
- Added Christian's reviewed-by's where appropriate.

v1->v2:
- reworked the page fault logic based on Jan's suggestion and turned it into a
  helper.
- Added 3 patches per-fs where we need to call the fsnotify helper from their
  ->fault handlers.
- Disabled readahead in the case that there's a pre-content watch in place.
- Disabled huge faults when there's a pre-content watch in place (entirely
  because it's untested, theoretically it should be straightforward to do).
- Updated the command numbers.
- Addressed the random spelling/grammer mistakes that Jan pointed out.
- Addressed the other random nits from Jan.

--- Original email ---

Hello,

These are the patches for the bare bones pre-content fanotify support.  The
majority of this work is Amir's, my contribution to this has solely been around
adding the page fault hooks, testing and validating everything.  I'm sending it
because Amir is traveling a bunch, and I touched it last so I'm going to take
all the hate and he can take all the credit.

There is a PoC that I've been using to validate this work, you can find the git
repo here

https://github.com/josefbacik/remote-fetch

This consists of 3 different tools.

1. populate.  This just creates all the stub files in the directory from the
   source directory.  Just run ./populate ~/linux ~/hsm-linux and it'll
   recursively create all of the stub files and directories.
2. remote-fetch.  This is the actual PoC, you just point it at the source and
   destination directory and then you can do whatever.  ./remote-fetch ~/linux
   ~/hsm-linux.
3. mmap-validate.  This was to validate the pagefault thing, this is likely what
   will be turned into the selftest with remote-fetch.  It creates a file and
   then you can validate the file matches the right pattern with both normal
   reads and mmap.  Normally I do something like

   ./mmap-validate create ~/src/foo
   ./populate ~/src ~/dst
   ./rmeote-fetch ~/src ~/dst
   ./mmap-validate validate ~/dst/foo

I did a bunch of testing, I also got some performance numbers.  I copied a
kernel tree, and then did remote-fetch, and then make -j4

Normal
real    9m49.709s
user    28m11.372s
sys     4m57.304s

HSM
real    10m6.454s
user    29m10.517s
sys     5m2.617s

So ~17 seconds more to build with HSM.  I then did a make mrproper on both trees
to see the size

[root@fedora ~]# du -hs /src/linux
1.6G    /src/linux
[root@fedora ~]# du -hs dst
125M    dst

This mirrors the sort of savings we've seen in production.

Meta has had these patches (minus the page fault patch) deployed in production
for almost a year with our own utility for doing on-demand package fetching.
The savings from this has been pretty significant.

The page-fault hooks are necessary for the last thing we need, which is
on-demand range fetching of executables.  Some of our binaries are several gigs
large, having the ability to remote fetch them on demand is a huge win for us
not only with space savings, but with startup time of containers.

There will be tests for this going into LTP once we're satisfied with the
patches and they're on their way upstream.  Thanks,

Josef

Amir Goldstein (9):
  fanotify: rename a misnamed constant
  fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
  fsnotify: introduce pre-content permission events
  fsnotify: pass optional file access range in pre-content event
  fsnotify: generate pre-content permission event on open
  fsnotify: generate pre-content permission event on truncate
  fanotify: introduce FAN_PRE_ACCESS permission event
  fanotify: report file range info with pre-content events
  fanotify: allow to set errno in FAN_DENY permission response

Josef Bacik (8):
  fanotify: don't skip extra event info if no info_mode is set
  fanotify: add a helper to check for pre content events
  fanotify: disable readahead if we have pre-content watches
  mm: don't allow huge faults for files with pre content watches
  fsnotify: generate pre-content permission event on page fault
  xfs: add pre-content fsnotify hook for write faults
  btrfs: disable defrag on pre-content watched files
  fs: enable pre-content events on supported file systems

 fs/btrfs/ioctl.c                   |   9 +++
 fs/btrfs/super.c                   |   5 +-
 fs/ext4/super.c                    |   3 +
 fs/namei.c                         |  10 ++-
 fs/notify/fanotify/fanotify.c      |  33 ++++++--
 fs/notify/fanotify/fanotify.h      |  15 ++++
 fs/notify/fanotify/fanotify_user.c | 120 +++++++++++++++++++++++------
 fs/notify/fsnotify.c               |  18 ++++-
 fs/open.c                          |  31 +++++---
 fs/xfs/xfs_file.c                  |   4 +
 fs/xfs/xfs_super.c                 |   2 +-
 include/linux/fanotify.h           |  19 +++--
 include/linux/fs.h                 |   1 +
 include/linux/fsnotify.h           |  73 ++++++++++++++++--
 include/linux/fsnotify_backend.h   |  59 +++++++++++++-
 include/linux/mm.h                 |   1 +
 include/uapi/linux/fanotify.h      |  18 +++++
 mm/filemap.c                       |  90 ++++++++++++++++++++++
 mm/memory.c                        |  22 ++++++
 mm/readahead.c                     |  13 ++++
 security/selinux/hooks.c           |   3 +-
 21 files changed, 491 insertions(+), 58 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 01/17] fanotify: don't skip extra event info if no info_mode is set
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 02/17] fanotify: rename a misnamed constant Josef Bacik
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

New pre-content events will be path events but they will also carry
additional range information. Remove the optimization to skip checking
whether info structures need to be generated for path events. This
results in no change in generated info structures for existing events.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/notify/fanotify/fanotify_user.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 8e2d43fc6f7c..d4dd34690fc6 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -160,9 +160,6 @@ static size_t fanotify_event_len(unsigned int info_mode,
 	int fh_len;
 	int dot_len = 0;
 
-	if (!info_mode)
-		return event_len;
-
 	if (fanotify_is_error_event(event->mask))
 		event_len += FANOTIFY_ERROR_INFO_LEN;
 
@@ -757,12 +754,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	buf += FAN_EVENT_METADATA_LEN;
 	count -= FAN_EVENT_METADATA_LEN;
 
-	if (info_mode) {
-		ret = copy_info_records_to_user(event, info, info_mode, pidfd,
-						buf, count);
-		if (ret < 0)
-			goto out_close_fd;
-	}
+	ret = copy_info_records_to_user(event, info, info_mode, pidfd,
+					buf, count);
+	if (ret < 0)
+		goto out_close_fd;
 
 	if (f)
 		fd_install(fd, f);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 02/17] fanotify: rename a misnamed constant
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 01/17] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 03/17] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

FANOTIFY_PIDFD_INFO_HDR_LEN is not the length of the header.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index d4dd34690fc6..0ae4cd87e712 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -119,7 +119,7 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
 #define FANOTIFY_EVENT_ALIGN 4
 #define FANOTIFY_FID_INFO_HDR_LEN \
 	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
-#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+#define FANOTIFY_PIDFD_INFO_LEN \
 	sizeof(struct fanotify_event_info_pidfd)
 #define FANOTIFY_ERROR_INFO_LEN \
 	(sizeof(struct fanotify_event_info_error))
@@ -174,14 +174,14 @@ static size_t fanotify_event_len(unsigned int info_mode,
 		dot_len = 1;
 	}
 
-	if (info_mode & FAN_REPORT_PIDFD)
-		event_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
-
 	if (fanotify_event_has_object_fh(event)) {
 		fh_len = fanotify_event_object_fh_len(event);
 		event_len += fanotify_fid_info_len(fh_len, dot_len);
 	}
 
+	if (info_mode & FAN_REPORT_PIDFD)
+		event_len += FANOTIFY_PIDFD_INFO_LEN;
+
 	return event_len;
 }
 
@@ -504,7 +504,7 @@ static int copy_pidfd_info_to_user(int pidfd,
 				   size_t count)
 {
 	struct fanotify_event_info_pidfd info = { };
-	size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+	size_t info_len = FANOTIFY_PIDFD_INFO_LEN;
 
 	if (WARN_ON_ONCE(info_len > count))
 		return -EFAULT;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 03/17] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 01/17] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 02/17] fanotify: rename a misnamed constant Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 04/17] fsnotify: introduce pre-content permission events Josef Bacik
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

Avoid reusing it, because we would like to reserve it for future
FAN_PATH_MODIFY pre-content event.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 include/linux/fsnotify_backend.h | 1 +
 include/uapi/linux/fanotify.h    | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 3ecf7768e577..53d5d0e02943 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -55,6 +55,7 @@
 #define FS_OPEN_PERM		0x00010000	/* open event in an permission hook */
 #define FS_ACCESS_PERM		0x00020000	/* access event in a permissions hook */
 #define FS_OPEN_EXEC_PERM	0x00040000	/* open/exec event in a permission hook */
+/* #define FS_DIR_MODIFY	0x00080000 */	/* Deprecated (reserved) */
 
 /*
  * Set on inode mark that cares about things that happen to its children.
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 34f221d3a1b9..79072b6894f2 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -25,6 +25,7 @@
 #define FAN_OPEN_PERM		0x00010000	/* File open in perm check */
 #define FAN_ACCESS_PERM		0x00020000	/* File accessed in perm check */
 #define FAN_OPEN_EXEC_PERM	0x00040000	/* File open/exec in perm check */
+/* #define FAN_DIR_MODIFY	0x00080000 */	/* Deprecated (reserved) */
 
 #define FAN_EVENT_ON_CHILD	0x08000000	/* Interested in child events */
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 04/17] fsnotify: introduce pre-content permission events
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (2 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 03/17] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 05/17] fsnotify: pass optional file access range in pre-content event Josef Bacik
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.

Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, so content
scanners could inspect the content filled by pre-content event handler.

Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is
modified by syscalls as write() and fallocate().

FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.

The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.

There are some specific requirements from filesystems that could
be used with pre-content events, so add a flag for fs to opt-in
for pre-content events explicitly before they can be used.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fsnotify.c             |  2 +-
 include/linux/fs.h               |  1 +
 include/linux/fsnotify.h         | 37 ++++++++++++++++++++++++++++----
 include/linux/fsnotify_backend.h | 12 +++++++++--
 security/selinux/hooks.c         |  3 ++-
 5 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 82ae8254c068..0696c1771b2a 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -624,7 +624,7 @@ static __init int fsnotify_init(void)
 {
 	int ret;
 
-	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
+	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 24);
 
 	ret = init_srcu_struct(&fsnotify_mark_srcu);
 	if (ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..1b9f74bda43c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1230,6 +1230,7 @@ extern int send_sigurg(struct file *file);
 #define SB_I_RETIRED	0x00000800	/* superblock shouldn't be reused */
 #define SB_I_NOUMASK	0x00001000	/* VFS does not apply umask */
 #define SB_I_NOIDMAP	0x00002000	/* No idmapped mounts on this superblock */
+#define SB_I_ALLOW_HSM	0x00004000	/* Allow HSM events on this superblock */
 
 /* Possible states of 'frozen' field */
 enum {
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..7c641161b281 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -132,14 +132,29 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
 }
 
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+static inline int fsnotify_pre_content(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	/*
+	 * Pre-content events are only reported for regular files and dirs
+	 * if there are any pre-content event watchers on this sb.
+	 */
+	if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
+	    !(inode->i_sb->s_iflags & SB_I_ALLOW_HSM) ||
+	    !fsnotify_sb_has_priority_watchers(inode->i_sb,
+					       FSNOTIFY_PRIO_PRE_CONTENT))
+		return 0;
+
+	return fsnotify_file(file, FS_PRE_ACCESS);
+}
+
 /*
- * fsnotify_file_area_perm - permission hook before access to file range
+ * fsnotify_file_area_perm - permission hook before access of file range
  */
 static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 					  const loff_t *ppos, size_t count)
 {
-	__u32 fsnotify_mask = FS_ACCESS_PERM;
-
 	/*
 	 * filesystem may be modified in the context of permission events
 	 * (e.g. by HSM filling a file on access), so sb freeze protection
@@ -147,10 +162,24 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	 */
 	lockdep_assert_once(file_write_not_started(file));
 
+	/*
+	 * read()/write and other types of access generate pre-content events.
+	 */
+	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+		int ret = fsnotify_pre_content(file);
+
+		if (ret)
+			return ret;
+	}
+
 	if (!(perm_mask & MAY_READ))
 		return 0;
 
-	return fsnotify_file(file, fsnotify_mask);
+	/*
+	 * read() also generates the legacy FS_ACCESS_PERM event, so content
+	 * scanners can inspect the content filled by pre-content event.
+	 */
+	return fsnotify_file(file, FS_ACCESS_PERM);
 }
 
 /*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 53d5d0e02943..9bda354b5538 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -57,6 +57,8 @@
 #define FS_OPEN_EXEC_PERM	0x00040000	/* open/exec event in a permission hook */
 /* #define FS_DIR_MODIFY	0x00080000 */	/* Deprecated (reserved) */
 
+#define FS_PRE_ACCESS		0x00100000	/* Pre-content access hook */
+
 /*
  * Set on inode mark that cares about things that happen to its children.
  * Always set for dnotify and inotify.
@@ -78,8 +80,14 @@
  */
 #define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
 
-#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
-				  FS_OPEN_EXEC_PERM)
+/* Content events can be used to inspect file content */
+#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
+				      FS_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FSNOTIFY_PRE_CONTENT_EVENTS  (FS_PRE_ACCESS)
+
+#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
+				  FSNOTIFY_PRE_CONTENT_EVENTS)
 
 /*
  * This is a list of all events that may get sent to a parent that is watching
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index fc926d3cac6e..c6f38705c715 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3404,7 +3404,8 @@ static int selinux_path_notify(const struct path *path, u64 mask,
 		perm |= FILE__WATCH_WITH_PERM;
 
 	/* watches on read-like events need the file:watch_reads permission */
-	if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_CLOSE_NOWRITE))
+	if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_PRE_ACCESS |
+		    FS_CLOSE_NOWRITE))
 		perm |= FILE__WATCH_READS;
 
 	return path_has_perm(current_cred(), path, perm);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 05/17] fsnotify: pass optional file access range in pre-content event
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (3 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 04/17] fsnotify: introduce pre-content permission events Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 06/17] fsnotify: generate pre-content permission event on open Josef Bacik
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

We would like to add file range information to pre-content events.

Pass a struct file_range with offset and length to event handler
along with pre-content permission event.

The offset and length are aligned to page size, but we may need to
align them to minimum folio size for filesystems with large block size.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c    | 11 +++++++++--
 fs/notify/fanotify/fanotify.h    |  2 ++
 include/linux/fsnotify.h         | 28 ++++++++++++++++++++++++----
 include/linux/fsnotify_backend.h | 32 ++++++++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 224bccaab4cc..c1e4ae221093 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -549,9 +549,13 @@ static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
 	return &pevent->fae;
 }
 
-static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
+static struct fanotify_event *fanotify_alloc_perm_event(const void *data,
+							int data_type,
 							gfp_t gfp)
 {
+	const struct path *path = fsnotify_data_path(data, data_type);
+	const struct file_range *range =
+			    fsnotify_data_file_range(data, data_type);
 	struct fanotify_perm_event *pevent;
 
 	pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
@@ -565,6 +569,9 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
 	pevent->hdr.len = 0;
 	pevent->state = FAN_EVENT_INIT;
 	pevent->path = *path;
+	/* NULL ppos means no range info */
+	pevent->ppos = range ? &range->pos : NULL;
+	pevent->count = range ? range->count : 0;
 	path_get(path);
 
 	return &pevent->fae;
@@ -802,7 +809,7 @@ static struct fanotify_event *fanotify_alloc_event(
 	old_memcg = set_active_memcg(group->memcg);
 
 	if (fanotify_is_perm_event(mask)) {
-		event = fanotify_alloc_perm_event(path, gfp);
+		event = fanotify_alloc_perm_event(data, data_type, gfp);
 	} else if (fanotify_is_error_event(mask)) {
 		event = fanotify_alloc_error_event(group, fsid, data,
 						   data_type, &hash);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index e5ab33cae6a7..93598b7d5952 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -425,6 +425,8 @@ FANOTIFY_PE(struct fanotify_event *event)
 struct fanotify_perm_event {
 	struct fanotify_event fae;
 	struct path path;
+	const loff_t *ppos;		/* optional file range info */
+	size_t count;
 	u32 response;			/* userspace answer to the event */
 	unsigned short state;		/* state of the event */
 	int fd;		/* fd we passed to userspace for this event */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 7c641161b281..22150e5797c5 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -132,9 +132,16 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
 }
 
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
-static inline int fsnotify_pre_content(struct file *file)
+static inline int fsnotify_pre_content(const struct file *file,
+				       const loff_t *ppos, size_t count)
 {
 	struct inode *inode = file_inode(file);
+	struct file_range range;
+	const void *data;
+	int data_type;
+
+	if (file->f_mode & FMODE_NONOTIFY)
+		return 0;
 
 	/*
 	 * Pre-content events are only reported for regular files and dirs
@@ -146,7 +153,20 @@ static inline int fsnotify_pre_content(struct file *file)
 					       FSNOTIFY_PRIO_PRE_CONTENT))
 		return 0;
 
-	return fsnotify_file(file, FS_PRE_ACCESS);
+	/* Report page aligned range only when pos is known */
+	if (ppos) {
+		range.path = &file->f_path;
+		range.pos = PAGE_ALIGN_DOWN(*ppos);
+		range.count = PAGE_ALIGN(*ppos + count) - range.pos;
+		data = &range;
+		data_type = FSNOTIFY_EVENT_FILE_RANGE;
+	} else {
+		data = &file->f_path;
+		data_type = FSNOTIFY_EVENT_PATH;
+	}
+
+	return fsnotify_parent(file->f_path.dentry, FS_PRE_ACCESS,
+			       data, data_type);
 }
 
 /*
@@ -166,7 +186,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	 * read()/write and other types of access generate pre-content events.
 	 */
 	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
-		int ret = fsnotify_pre_content(file);
+		int ret = fsnotify_pre_content(file, ppos, count);
 
 		if (ret)
 			return ret;
@@ -183,7 +203,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 }
 
 /*
- * fsnotify_file_perm - permission hook before file access
+ * fsnotify_file_perm - permission hook before file access (unknown range)
  */
 static inline int fsnotify_file_perm(struct file *file, int perm_mask)
 {
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 9bda354b5538..abd292edb48c 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -294,6 +294,7 @@ static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
 /* When calling fsnotify tell it if the data is a path or inode */
 enum fsnotify_data_type {
 	FSNOTIFY_EVENT_NONE,
+	FSNOTIFY_EVENT_FILE_RANGE,
 	FSNOTIFY_EVENT_PATH,
 	FSNOTIFY_EVENT_INODE,
 	FSNOTIFY_EVENT_DENTRY,
@@ -306,6 +307,17 @@ struct fs_error_report {
 	struct super_block *sb;
 };
 
+struct file_range {
+	const struct path *path;
+	loff_t pos;
+	size_t count;
+};
+
+static inline const struct path *file_range_path(const struct file_range *range)
+{
+	return range->path;
+}
+
 static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
 {
 	switch (data_type) {
@@ -315,6 +327,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
 		return d_inode(data);
 	case FSNOTIFY_EVENT_PATH:
 		return d_inode(((const struct path *)data)->dentry);
+	case FSNOTIFY_EVENT_FILE_RANGE:
+		return d_inode(file_range_path(data)->dentry);
 	case FSNOTIFY_EVENT_ERROR:
 		return ((struct fs_error_report *)data)->inode;
 	default:
@@ -330,6 +344,8 @@ static inline struct dentry *fsnotify_data_dentry(const void *data, int data_typ
 		return (struct dentry *)data;
 	case FSNOTIFY_EVENT_PATH:
 		return ((const struct path *)data)->dentry;
+	case FSNOTIFY_EVENT_FILE_RANGE:
+		return file_range_path(data)->dentry;
 	default:
 		return NULL;
 	}
@@ -341,6 +357,8 @@ static inline const struct path *fsnotify_data_path(const void *data,
 	switch (data_type) {
 	case FSNOTIFY_EVENT_PATH:
 		return data;
+	case FSNOTIFY_EVENT_FILE_RANGE:
+		return file_range_path(data);
 	default:
 		return NULL;
 	}
@@ -356,6 +374,8 @@ static inline struct super_block *fsnotify_data_sb(const void *data,
 		return ((struct dentry *)data)->d_sb;
 	case FSNOTIFY_EVENT_PATH:
 		return ((const struct path *)data)->dentry->d_sb;
+	case FSNOTIFY_EVENT_FILE_RANGE:
+		return file_range_path(data)->dentry->d_sb;
 	case FSNOTIFY_EVENT_ERROR:
 		return ((struct fs_error_report *) data)->sb;
 	default:
@@ -375,6 +395,18 @@ static inline struct fs_error_report *fsnotify_data_error_report(
 	}
 }
 
+static inline const struct file_range *fsnotify_data_file_range(
+							const void *data,
+							int data_type)
+{
+	switch (data_type) {
+	case FSNOTIFY_EVENT_FILE_RANGE:
+		return (struct file_range *)data;
+	default:
+		return NULL;
+	}
+}
+
 /*
  * Index to merged marks iterator array that correlates to a type of watch.
  * The type of watched object can be deduced from the iterator type, but not
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (4 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 05/17] fsnotify: pass optional file access range in pre-content event Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 21:51   ` Linus Torvalds
  2024-11-11 20:17 ` [PATCH v6 07/17] fsnotify: generate pre-content permission event on truncate Josef Bacik
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

Generate pre-content event on open in addition to FS_OPEN_PERM,
but without sb_writers held and after file was truncated
in case file was opened with O_CREAT and/or O_TRUNC.

The event will have a range info of [0..0] to provide an opportunity
to fill entire file content on open.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/namei.c               | 10 +++++++++-
 include/linux/fsnotify.h |  4 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4a4a22a08ac2..b49fb1f80c0c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
 	}
 	if (do_truncate)
 		mnt_drop_write(nd->path.mnt);
-	return error;
+	if (error)
+		return error;
+
+	/*
+	 * This permission hook is different than fsnotify_open_perm() hook.
+	 * This is a pre-content hook that is called without sb_writers held
+	 * and after the file was truncated.
+	 */
+	return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
 }
 
 /**
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 22150e5797c5..1e87a54b88b6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -171,6 +171,8 @@ static inline int fsnotify_pre_content(const struct file *file,
 
 /*
  * fsnotify_file_area_perm - permission hook before access of file range
+ *
+ * Called post open with access range [0..0].
  */
 static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 					  const loff_t *ppos, size_t count)
@@ -185,7 +187,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	/*
 	 * read()/write and other types of access generate pre-content events.
 	 */
-	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) {
 		int ret = fsnotify_pre_content(file, ppos, count);
 
 		if (ret)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 07/17] fsnotify: generate pre-content permission event on truncate
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (5 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 06/17] fsnotify: generate pre-content permission event on open Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 08/17] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

Generate FS_PRE_ACCESS event before truncate, without sb_writers held.

Move the security hooks also before sb_start_write() to conform with
other security hooks (e.g. in write, fallocate).

The event will have a range info of the page surrounding the new size
to provide an opportunity to fill the conetnt at the end of file before
truncating to non-page aligned size.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/open.c                | 31 +++++++++++++++++++++----------
 include/linux/fsnotify.h | 32 ++++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index c822f88d4c1d..51103ba339d0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -81,14 +81,18 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (!S_ISREG(inode->i_mode))
 		return -EINVAL;
 
-	error = mnt_want_write(path->mnt);
-	if (error)
-		goto out;
-
 	idmap = mnt_idmap(path->mnt);
 	error = inode_permission(idmap, inode, MAY_WRITE);
 	if (error)
-		goto mnt_drop_write_and_out;
+		return error;
+
+	error = fsnotify_truncate_perm(path, length);
+	if (error)
+		return error;
+
+	error = mnt_want_write(path->mnt);
+	if (error)
+		return error;
 
 	error = -EPERM;
 	if (IS_APPEND(inode))
@@ -114,7 +118,7 @@ long vfs_truncate(const struct path *path, loff_t length)
 	put_write_access(inode);
 mnt_drop_write_and_out:
 	mnt_drop_write(path->mnt);
-out:
+
 	return error;
 }
 EXPORT_SYMBOL_GPL(vfs_truncate);
@@ -175,11 +179,18 @@ long do_ftruncate(struct file *file, loff_t length, int small)
 	/* Check IS_APPEND on real upper inode */
 	if (IS_APPEND(file_inode(file)))
 		return -EPERM;
-	sb_start_write(inode->i_sb);
+
 	error = security_file_truncate(file);
-	if (!error)
-		error = do_truncate(file_mnt_idmap(file), dentry, length,
-				    ATTR_MTIME | ATTR_CTIME, file);
+	if (error)
+		return error;
+
+	error = fsnotify_truncate_perm(&file->f_path, length);
+	if (error)
+		return error;
+
+	sb_start_write(inode->i_sb);
+	error = do_truncate(file_mnt_idmap(file), dentry, length,
+			    ATTR_MTIME | ATTR_CTIME, file);
 	sb_end_write(inode->i_sb);
 
 	return error;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 1e87a54b88b6..fbcdddb9601a 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -132,17 +132,14 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
 }
 
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
-static inline int fsnotify_pre_content(const struct file *file,
+static inline int fsnotify_pre_content(const struct path *path,
 				       const loff_t *ppos, size_t count)
 {
-	struct inode *inode = file_inode(file);
+	struct inode *inode = d_inode(path->dentry);
 	struct file_range range;
 	const void *data;
 	int data_type;
 
-	if (file->f_mode & FMODE_NONOTIFY)
-		return 0;
-
 	/*
 	 * Pre-content events are only reported for regular files and dirs
 	 * if there are any pre-content event watchers on this sb.
@@ -155,18 +152,17 @@ static inline int fsnotify_pre_content(const struct file *file,
 
 	/* Report page aligned range only when pos is known */
 	if (ppos) {
-		range.path = &file->f_path;
+		range.path = path;
 		range.pos = PAGE_ALIGN_DOWN(*ppos);
 		range.count = PAGE_ALIGN(*ppos + count) - range.pos;
 		data = &range;
 		data_type = FSNOTIFY_EVENT_FILE_RANGE;
 	} else {
-		data = &file->f_path;
+		data = path;
 		data_type = FSNOTIFY_EVENT_PATH;
 	}
 
-	return fsnotify_parent(file->f_path.dentry, FS_PRE_ACCESS,
-			       data, data_type);
+	return fsnotify_parent(path->dentry, FS_PRE_ACCESS, data, data_type);
 }
 
 /*
@@ -184,11 +180,14 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	 */
 	lockdep_assert_once(file_write_not_started(file));
 
+	if (file->f_mode & FMODE_NONOTIFY)
+		return 0;
+
 	/*
 	 * read()/write and other types of access generate pre-content events.
 	 */
 	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) {
-		int ret = fsnotify_pre_content(file, ppos, count);
+		int ret = fsnotify_pre_content(&file->f_path, ppos, count);
 
 		if (ret)
 			return ret;
@@ -204,6 +203,14 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	return fsnotify_file(file, FS_ACCESS_PERM);
 }
 
+/*
+ * fsnotify_truncate_perm - permission hook before file truncate
+ */
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+	return fsnotify_pre_content(path, &length, 0);
+}
+
 /*
  * fsnotify_file_perm - permission hook before file access (unknown range)
  */
@@ -235,6 +242,11 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	return 0;
 }
 
+static inline int fsnotify_truncate_perm(const struct path *path, loff_t length)
+{
+	return 0;
+}
+
 static inline int fsnotify_file_perm(struct file *file, int perm_mask)
 {
 	return 0;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 08/17] fanotify: introduce FAN_PRE_ACCESS permission event
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (6 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 07/17] fsnotify: generate pre-content permission event on truncate Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 09/17] fanotify: report file range info with pre-content events Josef Bacik
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.

Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
in the context of the event handler.

This pre-content event is meant to be used by hierarchical storage
managers that want to fill the content of files on first read access.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      |  3 ++-
 fs/notify/fanotify/fanotify_user.c | 22 +++++++++++++++++++---
 include/linux/fanotify.h           | 14 ++++++++++----
 include/uapi/linux/fanotify.h      |  2 ++
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index c1e4ae221093..5e05410ddb9f 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -917,8 +917,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 	BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
 	BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
 	BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
+	BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
 
-	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
+	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
 
 	mask = fanotify_group_event_mask(group, iter_info, &match_mask,
 					 mask, data, data_type, dir);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 0ae4cd87e712..da9cf09565ce 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1639,11 +1639,23 @@ static int fanotify_events_supported(struct fsnotify_group *group,
 				     unsigned int flags)
 {
 	unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
+	bool is_dir = d_is_dir(path->dentry);
 	/* Strict validation of events in non-dir inode mask with v5.17+ APIs */
 	bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
 				 (mask & FAN_RENAME) ||
 				 (flags & FAN_MARK_IGNORE);
 
+	/*
+	 * Filesystems need to opt-into pre-content evnets (a.k.a HSM)
+	 * and they are only supported on regular files and directories.
+	 */
+	if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
+		if (!(path->mnt->mnt_sb->s_iflags & SB_I_ALLOW_HSM))
+			return -EINVAL;
+		if (!is_dir && !d_is_reg(path->dentry))
+			return -EINVAL;
+	}
+
 	/*
 	 * Some filesystems such as 'proc' acquire unusual locks when opening
 	 * files. For them fanotify permission events have high chances of
@@ -1676,7 +1688,7 @@ static int fanotify_events_supported(struct fsnotify_group *group,
 	 * but because we always allowed it, error only when using new APIs.
 	 */
 	if (strict_dir_events && mark_type == FAN_MARK_INODE &&
-	    !d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
+	    !is_dir && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
 		return -ENOTDIR;
 
 	return 0;
@@ -1780,11 +1792,15 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 		goto fput_and_out;
 
 	/*
-	 * Permission events require minimum priority FAN_CLASS_CONTENT.
+	 * Permission events are not allowed for FAN_CLASS_NOTIF.
+	 * Pre-content permission events are not allowed for FAN_CLASS_CONTENT.
 	 */
 	ret = -EINVAL;
 	if (mask & FANOTIFY_PERM_EVENTS &&
-	    group->priority < FSNOTIFY_PRIO_CONTENT)
+	    group->priority == FSNOTIFY_PRIO_NORMAL)
+		goto fput_and_out;
+	else if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
+		 group->priority == FSNOTIFY_PRIO_CONTENT)
 		goto fput_and_out;
 
 	if (mask & FAN_FS_ERROR &&
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 89ff45bd6f01..c747af064d2c 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -89,6 +89,16 @@
 #define FANOTIFY_DIRENT_EVENTS	(FAN_MOVE | FAN_CREATE | FAN_DELETE | \
 				 FAN_RENAME)
 
+/* Content events can be used to inspect file content */
+#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
+				      FAN_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FANOTIFY_PRE_CONTENT_EVENTS  (FAN_PRE_ACCESS)
+
+/* Events that require a permission response from user */
+#define FANOTIFY_PERM_EVENTS	(FANOTIFY_CONTENT_PERM_EVENTS | \
+				 FANOTIFY_PRE_CONTENT_EVENTS)
+
 /* Events that can be reported with event->fd */
 #define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
 
@@ -104,10 +114,6 @@
 				 FANOTIFY_INODE_EVENTS | \
 				 FANOTIFY_ERROR_EVENTS)
 
-/* Events that require a permission response from user */
-#define FANOTIFY_PERM_EVENTS	(FAN_OPEN_PERM | FAN_ACCESS_PERM | \
-				 FAN_OPEN_EXEC_PERM)
-
 /* Extra flags that may be reported with event or control handling of events */
 #define FANOTIFY_EVENT_FLAGS	(FAN_EVENT_ON_CHILD | FAN_ONDIR)
 
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 79072b6894f2..7596168c80eb 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -27,6 +27,8 @@
 #define FAN_OPEN_EXEC_PERM	0x00040000	/* File open/exec in perm check */
 /* #define FAN_DIR_MODIFY	0x00080000 */	/* Deprecated (reserved) */
 
+#define FAN_PRE_ACCESS		0x00100000	/* Pre-content access hook */
+
 #define FAN_EVENT_ON_CHILD	0x08000000	/* Interested in child events */
 
 #define FAN_RENAME		0x10000000	/* File was renamed */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 09/17] fanotify: report file range info with pre-content events
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (7 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 08/17] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:17 ` [PATCH v6 10/17] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

With group class FAN_CLASS_PRE_CONTENT, report offset and length info
along with FAN_PRE_ACCESS pre-content events.

This information is meant to be used by hierarchical storage managers
that want to fill partial content of files on first access to range.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.h      |  8 +++++++
 fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++++
 include/uapi/linux/fanotify.h      |  8 +++++++
 3 files changed, 54 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 93598b7d5952..7f06355afa1f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -448,6 +448,14 @@ static inline bool fanotify_is_perm_event(u32 mask)
 		mask & FANOTIFY_PERM_EVENTS;
 }
 
+static inline bool fanotify_event_has_access_range(struct fanotify_event *event)
+{
+	if (!(event->mask & FANOTIFY_PRE_CONTENT_EVENTS))
+		return false;
+
+	return FANOTIFY_PERM(event)->ppos;
+}
+
 static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
 {
 	return container_of(fse, struct fanotify_event, fse);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index da9cf09565ce..17402f9e8609 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -123,6 +123,8 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
 	sizeof(struct fanotify_event_info_pidfd)
 #define FANOTIFY_ERROR_INFO_LEN \
 	(sizeof(struct fanotify_event_info_error))
+#define FANOTIFY_RANGE_INFO_LEN \
+	(sizeof(struct fanotify_event_info_range))
 
 static int fanotify_fid_info_len(int fh_len, int name_len)
 {
@@ -182,6 +184,9 @@ static size_t fanotify_event_len(unsigned int info_mode,
 	if (info_mode & FAN_REPORT_PIDFD)
 		event_len += FANOTIFY_PIDFD_INFO_LEN;
 
+	if (fanotify_event_has_access_range(event))
+		event_len += FANOTIFY_RANGE_INFO_LEN;
+
 	return event_len;
 }
 
@@ -519,6 +524,30 @@ static int copy_pidfd_info_to_user(int pidfd,
 	return info_len;
 }
 
+static size_t copy_range_info_to_user(struct fanotify_event *event,
+				      char __user *buf, int count)
+{
+	struct fanotify_perm_event *pevent = FANOTIFY_PERM(event);
+	struct fanotify_event_info_range info = { };
+	size_t info_len = FANOTIFY_RANGE_INFO_LEN;
+
+	if (WARN_ON_ONCE(info_len > count))
+		return -EFAULT;
+
+	if (WARN_ON_ONCE(!pevent->ppos))
+		return -EINVAL;
+
+	info.hdr.info_type = FAN_EVENT_INFO_TYPE_RANGE;
+	info.hdr.len = info_len;
+	info.offset = *(pevent->ppos);
+	info.count = pevent->count;
+
+	if (copy_to_user(buf, &info, info_len))
+		return -EFAULT;
+
+	return info_len;
+}
+
 static int copy_info_records_to_user(struct fanotify_event *event,
 				     struct fanotify_info *info,
 				     unsigned int info_mode, int pidfd,
@@ -640,6 +669,15 @@ static int copy_info_records_to_user(struct fanotify_event *event,
 		total_bytes += ret;
 	}
 
+	if (fanotify_event_has_access_range(event)) {
+		ret = copy_range_info_to_user(event, buf, count);
+		if (ret < 0)
+			return ret;
+		buf += ret;
+		count -= ret;
+		total_bytes += ret;
+	}
+
 	return total_bytes;
 }
 
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 7596168c80eb..0636a9c85dd0 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -146,6 +146,7 @@ struct fanotify_event_metadata {
 #define FAN_EVENT_INFO_TYPE_DFID	3
 #define FAN_EVENT_INFO_TYPE_PIDFD	4
 #define FAN_EVENT_INFO_TYPE_ERROR	5
+#define FAN_EVENT_INFO_TYPE_RANGE	6
 
 /* Special info types for FAN_RENAME */
 #define FAN_EVENT_INFO_TYPE_OLD_DFID_NAME	10
@@ -192,6 +193,13 @@ struct fanotify_event_info_error {
 	__u32 error_count;
 };
 
+struct fanotify_event_info_range {
+	struct fanotify_event_info_header hdr;
+	__u32 pad;
+	__u64 offset;
+	__u64 count;
+};
+
 /*
  * User space may need to record additional information about its decision.
  * The extra information type records what kind of information is included.
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 10/17] fanotify: allow to set errno in FAN_DENY permission response
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (8 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 09/17] fanotify: report file range info with pre-content events Josef Bacik
@ 2024-11-11 20:17 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 11/17] fanotify: add a helper to check for pre content events Josef Bacik
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:17 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

From: Amir Goldstein <amir73il@gmail.com>

With FAN_DENY response, user trying to perform the filesystem operation
gets an error with errno set to EPERM.

It is useful for hierarchical storage management (HSM) service to be able
to deny access for reasons more diverse than EPERM, for example EAGAIN,
if HSM could retry the operation later.

Allow fanotify groups with priority FAN_CLASSS_PRE_CONTENT to responsd
to permission events with the response value FAN_DENY_ERRNO(errno),
instead of FAN_DENY to return a custom error.

Limit custom error values to errors expected on read(2)/write(2) and
open(2) of regular files. This list could be extended in the future.
Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by
writing a response to an fanotify group fd with a value of FAN_NOFD in
the fd field of the response.

The change in fanotify_response is backward compatible, because errno is
written in the high 8 bits of the 32bit response field and old kernels
reject respose value with high bits set.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      | 19 +++++++++++----
 fs/notify/fanotify/fanotify.h      |  5 ++++
 fs/notify/fanotify/fanotify_user.c | 37 ++++++++++++++++++++++++++----
 include/linux/fanotify.h           |  5 +++-
 include/uapi/linux/fanotify.h      |  7 ++++++
 5 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 5e05410ddb9f..17af1b822791 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -224,7 +224,8 @@ static int fanotify_get_response(struct fsnotify_group *group,
 				 struct fanotify_perm_event *event,
 				 struct fsnotify_iter_info *iter_info)
 {
-	int ret;
+	int ret, errno;
+	u32 decision;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
@@ -257,20 +258,28 @@ static int fanotify_get_response(struct fsnotify_group *group,
 		goto out;
 	}
 
+	decision = event->response &
+		(FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
 	/* userspace responded, convert to something usable */
-	switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
+	switch (decision & FANOTIFY_RESPONSE_ACCESS) {
 	case FAN_ALLOW:
 		ret = 0;
 		break;
 	case FAN_DENY:
+		/* Check custom errno from pre-content events */
+		errno = fanotify_get_response_errno(event->response);
+		if (errno) {
+			ret = -errno;
+			break;
+		}
+		fallthrough;
 	default:
 		ret = -EPERM;
 	}
 
 	/* Check if the response should be audited */
-	if (event->response & FAN_AUDIT)
-		audit_fanotify(event->response & ~FAN_AUDIT,
-			       &event->audit_rule);
+	if (decision & FAN_AUDIT)
+		audit_fanotify(decision & ~FAN_AUDIT, &event->audit_rule);
 
 	pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
 		 group, event, ret);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 7f06355afa1f..9e93aba210c9 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -528,3 +528,8 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
 
 	return mflags;
 }
+
+static inline u32 fanotify_get_response_errno(int res)
+{
+	return res >> FAN_ERRNO_SHIFT;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 17402f9e8609..ca8cc2103b5d 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -330,11 +330,14 @@ static int process_access_response(struct fsnotify_group *group,
 	struct fanotify_perm_event *event;
 	int fd = response_struct->fd;
 	u32 response = response_struct->response;
+	u32 decision = response &
+		(FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
+	int errno = fanotify_get_response_errno(response);
 	int ret = info_len;
 	struct fanotify_response_info_audit_rule friar;
 
-	pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
-		 group, fd, response, info, info_len);
+	pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n",
+		 __func__, group, fd, response, errno, info, info_len);
 	/*
 	 * make sure the response is valid, if invalid we do nothing and either
 	 * userspace can send a valid response or we will clean it up after the
@@ -343,18 +346,42 @@ static int process_access_response(struct fsnotify_group *group,
 	if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
 		return -EINVAL;
 
-	switch (response & FANOTIFY_RESPONSE_ACCESS) {
+	switch (decision & FANOTIFY_RESPONSE_ACCESS) {
 	case FAN_ALLOW:
+		if (errno)
+			return -EINVAL;
+		break;
 	case FAN_DENY:
+		/* Custom errno is supported only for pre-content groups */
+		if (errno && group->priority != FSNOTIFY_PRIO_PRE_CONTENT)
+			return -EINVAL;
+
+		/*
+		 * Limit errno to values expected on open(2)/read(2)/write(2)
+		 * of regular files.
+		 */
+		switch (errno) {
+		case 0:
+		case EIO:
+		case EPERM:
+		case EBUSY:
+		case ETXTBSY:
+		case EAGAIN:
+		case ENOSPC:
+		case EDQUOT:
+			break;
+		default:
+			return -EINVAL;
+		}
 		break;
 	default:
 		return -EINVAL;
 	}
 
-	if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
+	if ((decision & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
 		return -EINVAL;
 
-	if (response & FAN_INFO) {
+	if (decision & FAN_INFO) {
 		ret = process_access_response_info(info, info_len, &friar);
 		if (ret < 0)
 			return ret;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index c747af064d2c..d9bb48976b53 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -132,7 +132,10 @@
 /* These masks check for invalid bits in permission responses. */
 #define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY)
 #define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO)
-#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS)
+#define FANOTIFY_RESPONSE_ERRNO	(FAN_ERRNO_MASK << FAN_ERRNO_SHIFT)
+#define FANOTIFY_RESPONSE_VALID_MASK \
+	(FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \
+	 FANOTIFY_RESPONSE_ERRNO)
 
 /* Do not use these old uapi constants internally */
 #undef FAN_ALL_CLASS_BITS
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 0636a9c85dd0..bd8167979707 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -235,6 +235,13 @@ struct fanotify_response_info_audit_rule {
 /* Legit userspace responses to a _PERM event */
 #define FAN_ALLOW	0x01
 #define FAN_DENY	0x02
+/* errno other than EPERM can specified in upper byte of deny response */
+#define FAN_ERRNO_BITS	8
+#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS)
+#define FAN_ERRNO_MASK	((1 << FAN_ERRNO_BITS) - 1)
+#define FAN_DENY_ERRNO(err) \
+	(FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT))
+
 #define FAN_AUDIT	0x10	/* Bitmask to create audit record for result */
 #define FAN_INFO	0x20	/* Bitmask to indicate additional information */
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 11/17] fanotify: add a helper to check for pre content events
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (9 preceding siblings ...)
  2024-11-11 20:17 ` [PATCH v6 10/17] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 12/17] fanotify: disable readahead if we have pre-content watches Josef Bacik
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

We want to emit events during page fault, and calling into fanotify
could be expensive, so add a helper to allow us to skip calling into
fanotify from page fault.  This will also be used to disable readahead
for content watched files which will be handled in a subsequent patch.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/notify/fsnotify.c             | 16 ++++++++++++++++
 include/linux/fsnotify_backend.h | 14 ++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0696c1771b2a..a49c42c6ce01 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -203,6 +203,22 @@ static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
 	return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
 }
 
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	__u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
+
+	if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
+		return false;
+
+	return fsnotify_object_watched(inode, mnt_mask,
+				       FSNOTIFY_PRE_CONTENT_EVENTS);
+}
+EXPORT_SYMBOL_GPL(fsnotify_file_has_pre_content_watches);
+#endif
+
+
 /*
  * Notify this dentry's parent about a child's events with child name info
  * if parent is watching or if inode/sb/mount are interested in events with
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index abd292edb48c..e3a4a8e06fa0 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -896,6 +896,15 @@ static inline void fsnotify_init_event(struct fsnotify_event *event)
 	INIT_LIST_HEAD(&event->list);
 }
 
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_has_pre_content_watches(struct file *file);
+#else
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+	return false;
+}
+#endif /* CONFIG_FANOTIFY_ACCESS_PERMISSIONS */
+
 #else
 
 static inline int fsnotify(__u32 mask, const void *data, int data_type,
@@ -934,6 +943,11 @@ static inline u32 fsnotify_get_cookie(void)
 static inline void fsnotify_unmount_inodes(struct super_block *sb)
 {}
 
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+	return false;
+}
+
 #endif	/* CONFIG_FSNOTIFY */
 
 #endif	/* __KERNEL __ */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 12/17] fanotify: disable readahead if we have pre-content watches
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (10 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 11/17] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 13/17] mm: don't allow huge faults for files with pre content watches Josef Bacik
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and insert them into the file
without emitting an fanotify event.  To avoid this case, disable
readahead if we have pre-content watches on the file.  This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/filemap.c   | 12 ++++++++++++
 mm/readahead.c | 13 +++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 56fa431c52af..fc36a00fa014 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3149,6 +3149,14 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	unsigned long vm_flags = vmf->vma->vm_flags;
 	unsigned int mmap_miss;
 
+	/*
+	 * If we have pre-content watches we need to disable readahead to make
+	 * sure that we don't populate our mapping with 0 filled pages that we
+	 * never emitted an event for.
+	 */
+	if (fsnotify_file_has_pre_content_watches(file))
+		return fpin;
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
 	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
@@ -3217,6 +3225,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	struct file *fpin = NULL;
 	unsigned int mmap_miss;
 
+	/* See comment in do_sync_mmap_readahead. */
+	if (fsnotify_file_has_pre_content_watches(file))
+		return fpin;
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..9fe678cceba8 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
 #include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
 
 #include "internal.h"
 
@@ -544,6 +545,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	unsigned long max_pages, contig_count;
 	pgoff_t prev_index, miss;
 
+	/*
+	 * If we have pre-content watches we need to disable readahead to make
+	 * sure that we don't find 0 filled pages in cache that we never emitted
+	 * events for.
+	 */
+	if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+		return;
+
 	/*
 	 * Even if readahead is disabled, issue this request as readahead
 	 * as we'll need it to satisfy the requested range. The forced
@@ -622,6 +631,10 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	if (!ra->ra_pages)
 		return;
 
+	/* See the comment in page_cache_sync_ra. */
+	if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+		return;
+
 	/*
 	 * Same bit is used for PG_readahead and PG_reclaim.
 	 */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 13/17] mm: don't allow huge faults for files with pre content watches
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (11 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 12/17] fanotify: disable readahead if we have pre-content watches Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 14/17] fsnotify: generate pre-content permission event on page fault Josef Bacik
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

There's nothing stopping us from supporting this, we could simply pass
the order into the helper and emit the proper length.  However currently
there's no tests to validate this works properly, so disable it until
there's a desire to support this along with the appropriate tests.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mm/memory.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index bdf77a3ec47b..dc16a0b171e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/fsnotify.h>
 
 #include <trace/events/kmem.h>
 
@@ -5637,8 +5638,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_anonymous_page(vmf);
+	/*
+	 * Currently we just emit PAGE_SIZE for our fault events, so don't allow
+	 * a huge fault if we have a pre content watch on this file.  This would
+	 * be trivial to support, but there would need to be tests to ensure
+	 * this works properly and those don't exist currently.
+	 */
+	if (file && fsnotify_file_has_pre_content_watches(file))
+		return VM_FAULT_FALLBACK;
 	if (vma->vm_ops->huge_fault)
 		return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
 	return VM_FAULT_FALLBACK;
@@ -5648,6 +5658,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	vm_fault_t ret;
 
@@ -5662,6 +5673,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 	}
 
 	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/* See comment in create_huge_pmd. */
+		if (file && fsnotify_file_has_pre_content_watches(file))
+			goto split;
 		if (vma->vm_ops->huge_fault) {
 			ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
 			if (!(ret & VM_FAULT_FALLBACK))
@@ -5681,9 +5695,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vma))
 		return VM_FAULT_FALLBACK;
+	/* See comment in create_huge_pmd. */
+	if (file && fsnotify_file_has_pre_content_watches(file))
+		return VM_FAULT_FALLBACK;
 	if (vma->vm_ops->huge_fault)
 		return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5695,12 +5713,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
 	vm_fault_t ret;
 
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vma))
 		goto split;
 	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/* See comment in create_huge_pmd. */
+		if (file && fsnotify_file_has_pre_content_watches(file))
+			goto split;
 		if (vma->vm_ops->huge_fault) {
 			ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
 			if (!(ret & VM_FAULT_FALLBACK))
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 14/17] fsnotify: generate pre-content permission event on page fault
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (12 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 13/17] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 15/17] xfs: add pre-content fsnotify hook for write faults Josef Bacik
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
on the faulting method.

This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.

Export a simple helper that file systems that have their own ->fault()
will use, and have a more complicated helper to be do fancy things with
in filemap_fault.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/mm.h |  1 +
 mm/filemap.c       | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61fff5d34ed5..bce9e2f5dfa4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3405,6 +3405,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
 extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
+extern vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf);
 
 extern unsigned long stack_guard_gap;
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
diff --git a/mm/filemap.c b/mm/filemap.c
index fc36a00fa014..aa3c92d605b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -47,6 +47,7 @@
 #include <linux/splice.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -3287,6 +3288,52 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	return ret;
 }
 
+/**
+ * filemap_fsnotify_fault - maybe emit a pre-content event.
+ * @vmf:	struct vm_fault containing details of the fault.
+ * @folio:	the folio we're faulting in.
+ *
+ * If we have a pre-content watch on this file we will emit an event for this
+ * range.  If we return anything the fault caller should return immediately, we
+ * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
+ * fault again and then the fault handler will run the second time through.
+ *
+ * This is meant to be called with the folio that we will be filling in to make
+ * sure the event is emitted for the correct range.
+ *
+ * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
+ */
+vm_fault_t filemap_fsnotify_fault(struct vm_fault *vmf)
+{
+	struct file *fpin = NULL;
+	int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
+	loff_t pos = vmf->pgoff >> PAGE_SHIFT;
+	size_t count = PAGE_SIZE;
+	vm_fault_t ret;
+
+	/*
+	 * We already did this and now we're retrying with everything locked,
+	 * don't emit the event and continue.
+	 */
+	if (vmf->flags & FAULT_FLAG_TRIED)
+		return 0;
+
+	/* No watches, we're done. */
+	if (!fsnotify_file_has_pre_content_watches(vmf->vma->vm_file))
+		return 0;
+
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+	if (!fpin)
+		return VM_FAULT_SIGBUS;
+
+	ret = fsnotify_file_area_perm(fpin, mask, &pos, count);
+	fput(fpin);
+	if (ret)
+		return VM_FAULT_SIGBUS;
+	return VM_FAULT_RETRY;
+}
+EXPORT_SYMBOL_GPL(filemap_fsnotify_fault);
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vmf:	struct vm_fault containing details of the fault
@@ -3390,6 +3437,37 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * or because readahead was otherwise unable to retrieve it.
 	 */
 	if (unlikely(!folio_test_uptodate(folio))) {
+		/*
+		 * If this is a precontent file we have can now emit an event to
+		 * try and populate the folio.
+		 */
+		if (!(vmf->flags & FAULT_FLAG_TRIED) &&
+		    fsnotify_file_has_pre_content_watches(file)) {
+			loff_t pos = folio_pos(folio);
+			size_t count = folio_size(folio);
+
+			/* We're NOWAIT, we have to retry. */
+			if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+				folio_unlock(folio);
+				goto out_retry;
+			}
+
+			if (mapping_locked)
+				filemap_invalidate_unlock_shared(mapping);
+			mapping_locked = false;
+
+			folio_unlock(folio);
+			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+			if (!fpin)
+				goto out_retry;
+
+			error = fsnotify_file_area_perm(fpin, MAY_ACCESS, &pos,
+							count);
+			if (error)
+				ret = VM_FAULT_SIGBUS;
+			goto out_retry;
+		}
+
 		/*
 		 * If the invalidate lock is not held, the folio was in cache
 		 * and uptodate and now it is not. Strange but possible since we
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 15/17] xfs: add pre-content fsnotify hook for write faults
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (13 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 14/17] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 16/17] btrfs: disable defrag on pre-content watched files Josef Bacik
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

xfs has it's own handling for write faults, so we need to add the
pre-content fsnotify hook for this case.  Reads go through filemap_fault
so they're handled properly there.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/xfs/xfs_file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b19916b11fd5..d1966f996c77 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1442,6 +1442,10 @@ xfs_write_fault(
 	unsigned int		lock_mode = XFS_MMAPLOCK_SHARED;
 	vm_fault_t		ret;
 
+	ret = filemap_fsnotify_fault(vmf);
+	if (unlikely(ret))
+		return ret;
+
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vmf->vma->vm_file);
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 16/17] btrfs: disable defrag on pre-content watched files
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (14 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 15/17] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:18 ` [PATCH v6 17/17] fs: enable pre-content events on supported file systems Josef Bacik
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

We queue up inodes to be defrag'ed asynchronously, which means we do not
have their original file for readahead.  This means that the code to
skip readahead on pre-content watched files will not run, and we could
potentially read in empty pages.

Handle this corner case by disabling defrag on files that are currently
being watched for pre-content events.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ioctl.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 226c91fe31a7..9b13df1ea729 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2638,6 +2638,15 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 			goto out;
 		}
 
+		/*
+		 * Don't allow defrag on pre-content watched files, as it could
+		 * populate the page cache with 0's via readahead.
+		 */
+		if (fsnotify_file_has_pre_content_watches(file)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
 		if (argp) {
 			if (copy_from_user(&range, argp, sizeof(range))) {
 				ret = -EFAULT;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6 17/17] fs: enable pre-content events on supported file systems
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (15 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 16/17] btrfs: disable defrag on pre-content watched files Josef Bacik
@ 2024-11-11 20:18 ` Josef Bacik
  2024-11-11 20:27 ` [PATCH v6 00/17] fanotify: add pre-content hooks Amir Goldstein
  2024-11-11 21:55 ` Linus Torvalds
  18 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 20:18 UTC (permalink / raw)
  To: kernel-team, linux-fsdevel, jack, amir73il, brauner, torvalds,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/super.c   | 5 +++--
 fs/ext4/super.c    | 3 +++
 fs/xfs/xfs_super.c | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c64d07134122..9c3877aee9d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -963,7 +963,7 @@ static int btrfs_fill_super(struct super_block *sb,
 #endif
 	sb->s_xattr = btrfs_xattr_handlers;
 	sb->s_time_gran = 1;
-	sb->s_iflags |= SB_I_CGROUPWB;
+	sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
 
 	err = super_setup_bdi(sb);
 	if (err) {
@@ -2191,7 +2191,8 @@ static struct file_system_type btrfs_fs_type = {
 	.init_fs_context	= btrfs_init_fs_context,
 	.parameters		= btrfs_fs_parameters,
 	.kill_sb		= btrfs_kill_super,
-	.fs_flags		= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+	.fs_flags		= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+				  FS_ALLOW_IDMAP,
  };
 
 MODULE_ALIAS_FS("btrfs");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 16a4ce704460..733d71dac09e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5266,6 +5266,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	/* i_version is always enabled now */
 	sb->s_flags |= SB_I_VERSION;
 
+	/* HSM events are allowed by default. */
+	sb->s_iflags |= SB_I_ALLOW_HSM;
+
 	err = ext4_check_feature_compatibility(sb, es, silent);
 	if (err)
 		goto failed_mount;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fbb3a1594c0d..b6cd52f2289d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1713,7 +1713,7 @@ xfs_fs_fill_super(
 		sb->s_time_max = XFS_LEGACY_TIME_MAX;
 	}
 	trace_xfs_inode_timestamp_range(mp, sb->s_time_min, sb->s_time_max);
-	sb->s_iflags |= SB_I_CGROUPWB;
+	sb->s_iflags |= SB_I_CGROUPWB | SB_I_ALLOW_HSM;
 
 	set_posix_acl_flag(sb);
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 00/17] fanotify: add pre-content hooks
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (16 preceding siblings ...)
  2024-11-11 20:18 ` [PATCH v6 17/17] fs: enable pre-content events on supported file systems Josef Bacik
@ 2024-11-11 20:27 ` Amir Goldstein
  2024-11-11 21:55 ` Linus Torvalds
  18 siblings, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2024-11-11 20:27 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, brauner, torvalds, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, Nov 11, 2024 at 9:19 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> v5: https://lore.kernel.org/linux-fsdevel/cover.1725481503.git.josef@toxicpanda.com/
> v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
> v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
>
> v5->v6:
> - Linus had problems with this and rejected Jan's PR
>   (https://lore.kernel.org/linux-fsdevel/20240923110348.tbwihs42dxxltabc@quack3/),
>   so I'm respinning this series to address his concerns.  Hopefully this is more
>   acceptable.
> - Change the page fault hooks to happen only in the case where we have to add a
>   page, not where there exists pages already.
> - Amir added a hook to truncate.
> - We made the flag per SB instead of per fstype, Amir wanted this because of
>   some potential issues with other file system specific work he's doing.

Not me :) it for the upcoming ps > bs patch set for xfs.
It would be easiest to opt-out of this config for HSM to begin with.

> - Dropped the bcachefs patch, there were some concerns that we were doing
>   something wrong, and it's not a huge deal to not have this feature for now.
> - Unfortunately the xfs write fault path still has to do the page fault hook

As Jan corrected me, this is only for the DAX page faults in xfs, so we should
be ok with fsnotify hook called on every fault in this case.

>   before we know if we have a page or not, this is because of the locking that's
>   done before we get to the part where we know if we have a page already or not,
>   so that's the path that is still the same from last iteration.
> - I've re-validated this series with btrfs, xfs, and ext4 to make sure I didn't
>   break anything.

Thanks!
Amir.

>
> v4->v5:
> - Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
>   to respin the series anyway.
> - Renamed the filemap pagefault helper for fsnotify per Christians suggestion.
> - Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
> - Added a patch to disable btrfs defrag on pre-content watched files.
> - Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
> - Added two fstests (which will be posted separately) to validate everything,
>   re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
>   didn't break anything.
>
> v3->v4:
> - Trying to send a final verson Friday at 5pm before you go on vacation is a
>   recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
>   review.
> - Reworked the file system helper so it's handling of fpin was a little less
>   silly, per Chinner's suggestion.
> - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
>   in filemap_fault that says if VM_FAULT_ERROR is set we won't have
>   VM_FAULT_RETRY set.
>
> v2->v3:
> - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
>   emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
>   Amir's suggestion.
> - Reworked the exported helper so the per-filesystem changes are much smaller,
>   per Amir's suggestion.
> - Fixed the screwup for DAX writes per Chinner's suggestion.
> - Added Christian's reviewed-by's where appropriate.
>
> v1->v2:
> - reworked the page fault logic based on Jan's suggestion and turned it into a
>   helper.
> - Added 3 patches per-fs where we need to call the fsnotify helper from their
>   ->fault handlers.
> - Disabled readahead in the case that there's a pre-content watch in place.
> - Disabled huge faults when there's a pre-content watch in place (entirely
>   because it's untested, theoretically it should be straightforward to do).
> - Updated the command numbers.
> - Addressed the random spelling/grammer mistakes that Jan pointed out.
> - Addressed the other random nits from Jan.
>
> --- Original email ---
>
> Hello,
>
> These are the patches for the bare bones pre-content fanotify support.  The
> majority of this work is Amir's, my contribution to this has solely been around
> adding the page fault hooks, testing and validating everything.  I'm sending it
> because Amir is traveling a bunch, and I touched it last so I'm going to take
> all the hate and he can take all the credit.
>
> There is a PoC that I've been using to validate this work, you can find the git
> repo here
>
> https://github.com/josefbacik/remote-fetch
>
> This consists of 3 different tools.
>
> 1. populate.  This just creates all the stub files in the directory from the
>    source directory.  Just run ./populate ~/linux ~/hsm-linux and it'll
>    recursively create all of the stub files and directories.
> 2. remote-fetch.  This is the actual PoC, you just point it at the source and
>    destination directory and then you can do whatever.  ./remote-fetch ~/linux
>    ~/hsm-linux.
> 3. mmap-validate.  This was to validate the pagefault thing, this is likely what
>    will be turned into the selftest with remote-fetch.  It creates a file and
>    then you can validate the file matches the right pattern with both normal
>    reads and mmap.  Normally I do something like
>
>    ./mmap-validate create ~/src/foo
>    ./populate ~/src ~/dst
>    ./rmeote-fetch ~/src ~/dst
>    ./mmap-validate validate ~/dst/foo
>
> I did a bunch of testing, I also got some performance numbers.  I copied a
> kernel tree, and then did remote-fetch, and then make -j4
>
> Normal
> real    9m49.709s
> user    28m11.372s
> sys     4m57.304s
>
> HSM
> real    10m6.454s
> user    29m10.517s
> sys     5m2.617s
>
> So ~17 seconds more to build with HSM.  I then did a make mrproper on both trees
> to see the size
>
> [root@fedora ~]# du -hs /src/linux
> 1.6G    /src/linux
> [root@fedora ~]# du -hs dst
> 125M    dst
>
> This mirrors the sort of savings we've seen in production.
>
> Meta has had these patches (minus the page fault patch) deployed in production
> for almost a year with our own utility for doing on-demand package fetching.
> The savings from this has been pretty significant.
>
> The page-fault hooks are necessary for the last thing we need, which is
> on-demand range fetching of executables.  Some of our binaries are several gigs
> large, having the ability to remote fetch them on demand is a huge win for us
> not only with space savings, but with startup time of containers.
>
> There will be tests for this going into LTP once we're satisfied with the
> patches and they're on their way upstream.  Thanks,
>
> Josef
>
> Amir Goldstein (9):
>   fanotify: rename a misnamed constant
>   fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
>   fsnotify: introduce pre-content permission events
>   fsnotify: pass optional file access range in pre-content event
>   fsnotify: generate pre-content permission event on open
>   fsnotify: generate pre-content permission event on truncate
>   fanotify: introduce FAN_PRE_ACCESS permission event
>   fanotify: report file range info with pre-content events
>   fanotify: allow to set errno in FAN_DENY permission response
>
> Josef Bacik (8):
>   fanotify: don't skip extra event info if no info_mode is set
>   fanotify: add a helper to check for pre content events
>   fanotify: disable readahead if we have pre-content watches
>   mm: don't allow huge faults for files with pre content watches
>   fsnotify: generate pre-content permission event on page fault
>   xfs: add pre-content fsnotify hook for write faults
>   btrfs: disable defrag on pre-content watched files
>   fs: enable pre-content events on supported file systems
>
>  fs/btrfs/ioctl.c                   |   9 +++
>  fs/btrfs/super.c                   |   5 +-
>  fs/ext4/super.c                    |   3 +
>  fs/namei.c                         |  10 ++-
>  fs/notify/fanotify/fanotify.c      |  33 ++++++--
>  fs/notify/fanotify/fanotify.h      |  15 ++++
>  fs/notify/fanotify/fanotify_user.c | 120 +++++++++++++++++++++++------
>  fs/notify/fsnotify.c               |  18 ++++-
>  fs/open.c                          |  31 +++++---
>  fs/xfs/xfs_file.c                  |   4 +
>  fs/xfs/xfs_super.c                 |   2 +-
>  include/linux/fanotify.h           |  19 +++--
>  include/linux/fs.h                 |   1 +
>  include/linux/fsnotify.h           |  73 ++++++++++++++++--
>  include/linux/fsnotify_backend.h   |  59 +++++++++++++-
>  include/linux/mm.h                 |   1 +
>  include/uapi/linux/fanotify.h      |  18 +++++
>  mm/filemap.c                       |  90 ++++++++++++++++++++++
>  mm/memory.c                        |  22 ++++++
>  mm/readahead.c                     |  13 ++++
>  security/selinux/hooks.c           |   3 +-
>  21 files changed, 491 insertions(+), 58 deletions(-)
>
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 20:17 ` [PATCH v6 06/17] fsnotify: generate pre-content permission event on open Josef Bacik
@ 2024-11-11 21:51   ` Linus Torvalds
  2024-11-11 22:46     ` Josef Bacik
  2024-11-11 23:36     ` Amir Goldstein
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-11 21:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
>
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> +       /*
> +        * This permission hook is different than fsnotify_open_perm() hook.
> +        * This is a pre-content hook that is called without sb_writers held
> +        * and after the file was truncated.
> +        */
> +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
>  }

Stop adding sh*t like this to the VFS layer.

Seriously. I spend time and effort looking at profiles, and then
people who do *not* seem to spend the time and effort just willy nilly
add fsnotify and security events and show down basic paths.

I'm going to NAK any new fsnotify and permission hooks unless people
show that they don't add any overhead.

Because I'm really really tired of having to wade through various
permission hooks in the profiles that I can not fix or optimize,
because those hoosk have no sane defined semantics, just "let user
space know".

Yes, right now it's mostly just the security layer. But this really
looks to me like now fsnotify will be the same kind of pain.

And that location is STUPID. Dammit, it is even *documented* to be
stupid. It's a "pre-content" hook that happens after the contents have
already been truncated. WTF? That's no "pre".

I tried to follow the deep chain of inlines to see what actually ends
up happening, and it looks like if the *whole* filesystem has no
notify events at all, the fsnotify_sb_has_watchers() check will make
this mostly go away, except for all the D$ accesses needed just to
check for it.

But even *one* entirely unrelated event will now force every single
open to typically call __fsnotify_parent() (or possibly "just"
fsnotify), because there's no sane "nobody cares about this dentry"
kind of thing.

So effectively this is a new hook that gets called on every single
open call that nobody else cares about than you, and that people have
lived without for three decades.

Stop it, or at least add the code to not do this all completely pointlessly.

Because otherwise I will not take this kind of stuff any more. I just
spent time trying to figure out how to avoid the pointless cache
misses we did for every path component traversal.

So I *really* don't want to see another pointless stupid fsnotify hook
in my profiles.

                Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 00/17] fanotify: add pre-content hooks
  2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
                   ` (17 preceding siblings ...)
  2024-11-11 20:27 ` [PATCH v6 00/17] fanotify: add pre-content hooks Amir Goldstein
@ 2024-11-11 21:55 ` Linus Torvalds
  18 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-11 21:55 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
>
> - Linus had problems with this and rejected Jan's PR
>   (https://lore.kernel.org/linux-fsdevel/20240923110348.tbwihs42dxxltabc@quack3/),
>   so I'm respinning this series to address his concerns.  Hopefully this is more
>   acceptable.

I'm still rejecting this. I spent some time trying to avoid overhead
in the really basic permission code the last couple of weeks, and I
look at this and go "this is adding more overhead".

It all seems to be completely broken too. Doing some permission check
at open() time *aftert* the O_TRUNC has already truncated the file?
No. That's just beyond stupid. That's just terminally broken sh*t.

And that's just the stuff I noticed until I got so fed up that I
stopped reading the patches.

             Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 21:51   ` Linus Torvalds
@ 2024-11-11 22:46     ` Josef Bacik
  2024-11-11 23:22       ` Linus Torvalds
  2024-11-11 23:36     ` Amir Goldstein
  1 sibling, 1 reply; 33+ messages in thread
From: Josef Bacik @ 2024-11-11 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, Nov 11, 2024 at 4:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> > +       /*
> > +        * This permission hook is different than fsnotify_open_perm() hook.
> > +        * This is a pre-content hook that is called without sb_writers held
> > +        * and after the file was truncated.
> > +        */
> > +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> >  }
>
> Stop adding sh*t like this to the VFS layer.
>
> Seriously. I spend time and effort looking at profiles, and then
> people who do *not* seem to spend the time and effort just willy nilly
> add fsnotify and security events and show down basic paths.
>
> I'm going to NAK any new fsnotify and permission hooks unless people
> show that they don't add any overhead.
>
> Because I'm really really tired of having to wade through various
> permission hooks in the profiles that I can not fix or optimize,
> because those hoosk have no sane defined semantics, just "let user
> space know".
>
> Yes, right now it's mostly just the security layer. But this really
> looks to me like now fsnotify will be the same kind of pain.
>
> And that location is STUPID. Dammit, it is even *documented* to be
> stupid. It's a "pre-content" hook that happens after the contents have
> already been truncated. WTF? That's no "pre".
>
> I tried to follow the deep chain of inlines to see what actually ends
> up happening, and it looks like if the *whole* filesystem has no
> notify events at all, the fsnotify_sb_has_watchers() check will make
> this mostly go away, except for all the D$ accesses needed just to
> check for it.
>
> But even *one* entirely unrelated event will now force every single
> open to typically call __fsnotify_parent() (or possibly "just"
> fsnotify), because there's no sane "nobody cares about this dentry"
> kind of thing.
>
> So effectively this is a new hook that gets called on every single
> open call that nobody else cares about than you, and that people have
> lived without for three decades.
>
> Stop it, or at least add the code to not do this all completely pointlessly.
>
> Because otherwise I will not take this kind of stuff any more. I just
> spent time trying to figure out how to avoid the pointless cache
> misses we did for every path component traversal.
>
> So I *really* don't want to see another pointless stupid fsnotify hook
> in my profiles.

Did you see the patch that added the
fsnotify_file_has_pre_content_watches() thing?  It'll look at the
inode/sb/mnt to see if there are any watches and carry on if there's
nothing.  This will be the cheapest thing open/write/whatever does.
If there's no watches, nothing happens.  I'm not entirely sure what
else you want me to do here, other than not add the feature, which I
suppose is a position to take.

The overhead here is purely for people who use HSM.  I'm telling you
that in production, with real world workloads, the overhead is far
outweighed by having to copy bytes around we'll never use.  Your
argument is similar to the argument against adding compression to
btrfs, "but it costs CPU!?", sure, but the benefit of writing
significantly less waaaaay outweighed the CPU cost and was a huge net
positive.

This is going to cost something if it's on, it costs nothing if it's
off.  If you want performance numbers that's reasonable, I'll run
fsperf tomorrow and get you a side by side comparison.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 22:46     ` Josef Bacik
@ 2024-11-11 23:22       ` Linus Torvalds
  2024-11-11 23:39         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-11 23:22 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
>
> Did you see the patch that added the
> fsnotify_file_has_pre_content_watches() thing?

No, because I had gotten to patch 6/11, and it added this open thing,
and there was no such thing in any of the patches before it.

It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.

However, at no point does it look like you actually test it at open
time, so none of this seems to matter.

As far as I can see, even at the end of the series, you will call the
fsnotify hook at open time even if there are no content watches on the
file.

So apparently the fsnotify_file_has_pre_content_watches() is not
called when it should be, and when it *is* called, it's also doing
completely the wrong thing.

Look, for basic operations THAT DON'T CARE, you now added a function
call to fsnotify_file_has_pre_content_watches(), that function call
looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
be done!), and then after that looks at the i_fsnotify_mask.

THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.

This code has been written by somebody who NEVER EVER looked at
profiles. You're following chains of pointers when you never should.

Look, here's a very basic example of the kind of complete mis-design
I'm talking about:

 - we're doing a basic read() on a file that isn't being watched.

 - we want to maybe do read-ahead

 - the code does

        if (fsnotify_file_has_pre_content_watches(file))
                return fpin;

   to say that "don't do read-ahead".

Fine, I understand the concept. But keep in mind that the common case
is presumably that there are no content watches.

And even ignoring the "common case" issue, that's the one you want to
OPTIMIZE for. That's the case that matters for performance, because
clearly if there are content watches, you're going to go into "Go
Slow" mode anyway and not do pre-fetching. So even if content watches
are common on some load, they are clearly not the case you should do
performance optimization for.

With me so far?

So if THAT is the case that matters, then dammit, we shouldn't be
calling a function at all.

And when calling the function, we shouldn't start out with this
completely broken logic:

        struct inode *inode = file_inode(file);
        __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;

        if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
                return false;

that does random crap and looks up some "mount mask" and looks up the
superblock flags.

Why shouldn't we do this?

BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
CONTENT MATCHES!

See why I'm shouting? You're doing insane things, and you're doing
them for all the cases that DO NOT MATTER. You're doing all of this
for the common case that doesn't want to see that kind of mindless
overhead.

You literally check for the "do I even care" *last*, when you finally
do that fsnotify_object_watched() check that looks at the inode. But
by then you have already wasted all that time and effort, and
fsnotify_object_watched() is broken anyway, because it's stupidly
designed to require that mnt_mask that isn't needed if you have
properly marked each object individually.

So what *should* you have?

You should have had a per-file flag saying "Do I need to even call
this crud at all", and have it in a location where you don't need to
look at anything else.

And fsnotify already basically has that flag, except it's mis-designed
too. We have FMODE_NONOTIFY, which is the wrong way around (saying
"don't notify", when that should just be the *default*), and the
fsnotify layer uses it only to mark its own internal files so that it
doesn't get called recursively. So that flag that *looks* sane and is
in the right location is actually doing the wrong thing, because it's
dealing with a rare special case, not the important cases that
actually matter.

So all of this readahead logic - and all of the read and write hooks -
should be behind a simple "oh, this file doesn't have any notification
stuff, so don't bother calling any fsnotify functions".

So I think the pattern should be

    static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
    {
        if (unlikely(file->f_mode & FMODE_NOTIFY))
                return out_of_line_crud(file);
        return false;
    }

so that the *only* thing that gets inlined is "does this file have any
fsnotify stuff at all", and in the common case where that isn't true,
we don't call any fsnotify functions, and we don't start looking at
inodes or superblocks or anything pointless like that.

THAT is the kind of code sequence we should have for unlikely cases.
Not the "call fsnotify code to check for something unlikely". Not
"look at file_inode(file)->i_sb->s_iflags" until it's *required*.

Similarly, the actual open-time code SHOULD NEVER BE CALLED, because
it should have the same kind of simple logic based on some simple flag
either in the dentry or maybe in the inode. So that all those *normal*
dentries and inodes that don't have watches don't get the overhead of
calling into fsnotify code.

Because yes, I do see all those function calls, and all those
unnecessary indirect pointer accesses when I do profiles. And if I see
fsnotify in my profiles when I don't have any notifications enabled,
that really really *annoys* me.

                 Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 21:51   ` Linus Torvalds
  2024-11-11 22:46     ` Josef Bacik
@ 2024-11-11 23:36     ` Amir Goldstein
  1 sibling, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2024-11-11 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4, kernel test robot

On Mon, Nov 11, 2024 at 10:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> > +       /*
> > +        * This permission hook is different than fsnotify_open_perm() hook.
> > +        * This is a pre-content hook that is called without sb_writers held
> > +        * and after the file was truncated.
> > +        */
> > +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> >  }
>
> Stop adding sh*t like this to the VFS layer.
>
> Seriously. I spend time and effort looking at profiles, and then
> people who do *not* seem to spend the time and effort just willy nilly
> add fsnotify and security events and show down basic paths.
>
> I'm going to NAK any new fsnotify and permission hooks unless people
> show that they don't add any overhead.
>

FWIW we did work to eliminate overhead long before posting the new hooks.

This prep work has already been merged back in v6.9:
https://lore.kernel.org/linux-fsdevel/20240317184154.1200192-1-amir73il@gmail.com/

After working closely with Oliver and kernel test robot to eliminate
the overhead
of the pre-content hooks and on the way, also reduce overhead of the existing
fanotify permission hooks that were merged back in the day.

> Because I'm really really tired of having to wade through various
> permission hooks in the profiles that I can not fix or optimize,
> because those hoosk have no sane defined semantics, just "let user
> space know".
>
> Yes, right now it's mostly just the security layer. But this really
> looks to me like now fsnotify will be the same kind of pain.
>
> And that location is STUPID. Dammit, it is even *documented* to be
> stupid. It's a "pre-content" hook that happens after the contents have
> already been truncated. WTF? That's no "pre".
>

Yeh, I understand why it seems stupid and it may have been poorly
documented, but there is actually a good reason for the location of
this hook.

The problem with the existing fsnotify_open_perm() hook is that
with O_CREATE, open_last_lookups() takes sb_writers freeze
protection (regardless if file with really created), so we cannot
safely use this hook to start writing to file and fill its content.

So the important part of the comment is:
"This permission hook is different than fsnotify_open_perm() hook.
 This is a pre-content hook that is called without sb_writers held"

The part that follows "and after the file was truncated", simply
means that in case of O_TRUNC we won't need to fill file content,
because the file will have zero size anyway.

And for the record, it is not "pre-open" it is "pre-content-access", so the
name may be confusing but it is not wrong.

> I tried to follow the deep chain of inlines to see what actually ends
> up happening, and it looks like if the *whole* filesystem has no
> notify events at all, the fsnotify_sb_has_watchers() check will make
> this mostly go away, except for all the D$ accesses needed just to
> check for it.
>
> But even *one* entirely unrelated event will now force every single
> open to typically call __fsnotify_parent() (or possibly "just"
> fsnotify), because there's no sane "nobody cares about this dentry"
> kind of thing.
>
> So effectively this is a new hook that gets called on every single
> open call that nobody else cares about than you, and that people have
> lived without for three decades.
>

That's exactly the reason for the commit
a5e57b4d370c fsnotify: optimize the case of no permission event watchers

It is supposed to reduce overhead to bare minimum for hooks of
events from the class "permission" (FSNOTIFY_PRIO_CONTENT), which
are typically only watched for Anti-malware software, so
fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT)
is typically false on any given fs.

And same for the new hooks for events of class "pre-content".
fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_PRE_CONTENT)
will only be true for fs of dedicated systems that run an HSM service, where
the overhead of the hooks is not going to be a concern.

> Stop it, or at least add the code to not do this all completely pointlessly.
>
> Because otherwise I will not take this kind of stuff any more. I just
> spent time trying to figure out how to avoid the pointless cache
> misses we did for every path component traversal.
>
> So I *really* don't want to see another pointless stupid fsnotify hook
> in my profiles.

I understand your frustration from crawling performance regressions, I
really do.
I am personally very grateful for the services of Oliver and his
kernel test robot
who test my development branches to catch regressions very soon in the
development
process.

We may have missed things along the way and you may yet find more issues
that justify another NAK, or more work, but you should know that a lot of care
was taken to try to avoid inflicting pain on the system.

I have been practically doing vfs prep and cleanup work together with
Josef and Jan and Christian for at least a year before we got to post v1 of the
pre-content patches.

So all I ask in return is that you consider the possibility that this is not
utter garbage and try to work with us to address your concerns.

Pre-content hooks (and HSM) is a very powerful feature that many people
are requesting and I believe that we will be able to deliver this
feature without
crapping all over the VFS.

Thanks,
Amir.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 23:22       ` Linus Torvalds
@ 2024-11-11 23:39         ` Linus Torvalds
  2024-11-11 23:59         ` Amir Goldstein
  2024-11-12 15:24         ` Josef Bacik
  2 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-11 23:39 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, 11 Nov 2024 at 15:22, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.

Side note: I think as filesystem people, you guys are taught to think
"IO is expensive, as long as you can avoid IO, things go fast".

And that's largely true at a filesystem level.

But on the VFS level, the common case is actually "everything is
cached in memory, we're never calling down to the filesystem at all".

And then IO isn't the issue.

So on a VFS level, to a very close approximation, the only thing that
matters is cache misses and mispredicted branches.

(Indirect calls have always had some overhead, and Spectre made it
much worse, so arguably indirect calls have become the third thing
that matters).

So in the VFS layer, we have ridiculous tricks like

        if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
                if (likely(inode->i_op->permission))
                        return inode->i_op->permission(idmap, inode, mask);

                /* This gets set once for the inode lifetime */
                spin_lock(&inode->i_lock);
                inode->i_opflags |= IOP_FASTPERM;
                spin_unlock(&inode->i_lock);
        }
        return generic_permission(idmap, inode, mask);

in do_inode_permission, because it turns out that the IOP_FASTPERM
flag means that we literally don't even need to dereference
inode->i_op->permission (nasty chain of D$ accesses), and we can
*only* look at accesses off the 'inode' pointer.

Is this an extreme example? Yes. But the whole i_opflags kind of thing
does end up mattering, exactly because it keeps the D$ footprint
smaller.

                  Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 23:22       ` Linus Torvalds
  2024-11-11 23:39         ` Linus Torvalds
@ 2024-11-11 23:59         ` Amir Goldstein
  2024-11-12  0:37           ` Linus Torvalds
  2024-11-12 15:24         ` Josef Bacik
  2 siblings, 1 reply; 33+ messages in thread
From: Amir Goldstein @ 2024-11-11 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

On Tue, Nov 12, 2024 at 12:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Did you see the patch that added the
> > fsnotify_file_has_pre_content_watches() thing?
>
> No, because I had gotten to patch 6/11, and it added this open thing,
> and there was no such thing in any of the patches before it.
>
> It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.
>
> However, at no point does it look like you actually test it at open
> time, so none of this seems to matter.
>
> As far as I can see, even at the end of the series, you will call the
> fsnotify hook at open time even if there are no content watches on the
> file.
>
> So apparently the fsnotify_file_has_pre_content_watches() is not
> called when it should be, and when it *is* called, it's also doing
> completely the wrong thing.
>
> Look, for basic operations THAT DON'T CARE, you now added a function
> call to fsnotify_file_has_pre_content_watches(), that function call
> looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
> be done!), and then after that looks at the i_fsnotify_mask.
>
> THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.
>
> This code has been written by somebody who NEVER EVER looked at
> profiles. You're following chains of pointers when you never should.
>
> Look, here's a very basic example of the kind of complete mis-design
> I'm talking about:
>
>  - we're doing a basic read() on a file that isn't being watched.
>
>  - we want to maybe do read-ahead
>
>  - the code does
>
>         if (fsnotify_file_has_pre_content_watches(file))
>                 return fpin;
>
>    to say that "don't do read-ahead".
>
> Fine, I understand the concept. But keep in mind that the common case
> is presumably that there are no content watches.
>
> And even ignoring the "common case" issue, that's the one you want to
> OPTIMIZE for. That's the case that matters for performance, because
> clearly if there are content watches, you're going to go into "Go
> Slow" mode anyway and not do pre-fetching. So even if content watches
> are common on some load, they are clearly not the case you should do
> performance optimization for.
>
> With me so far?
>
> So if THAT is the case that matters, then dammit, we shouldn't be
> calling a function at all.
>
> And when calling the function, we shouldn't start out with this
> completely broken logic:
>
>         struct inode *inode = file_inode(file);
>         __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
>
>         if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
>                 return false;
>
> that does random crap and looks up some "mount mask" and looks up the
> superblock flags.
>
> Why shouldn't we do this?
>
> BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
> CONTENT MATCHES!
>
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.
>
> You literally check for the "do I even care" *last*, when you finally
> do that fsnotify_object_watched() check that looks at the inode. But
> by then you have already wasted all that time and effort, and
> fsnotify_object_watched() is broken anyway, because it's stupidly
> designed to require that mnt_mask that isn't needed if you have
> properly marked each object individually.
>
> So what *should* you have?
>
> You should have had a per-file flag saying "Do I need to even call
> this crud at all", and have it in a location where you don't need to
> look at anything else.
>
> And fsnotify already basically has that flag, except it's mis-designed
> too. We have FMODE_NONOTIFY, which is the wrong way around (saying
> "don't notify", when that should just be the *default*), and the
> fsnotify layer uses it only to mark its own internal files so that it
> doesn't get called recursively. So that flag that *looks* sane and is
> in the right location is actually doing the wrong thing, because it's
> dealing with a rare special case, not the important cases that
> actually matter.
>
> So all of this readahead logic - and all of the read and write hooks -
> should be behind a simple "oh, this file doesn't have any notification
> stuff, so don't bother calling any fsnotify functions".
>
> So I think the pattern should be
>
>     static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
>     {
>         if (unlikely(file->f_mode & FMODE_NOTIFY))
>                 return out_of_line_crud(file);
>         return false;
>     }
>

I think that's a good idea for pre-content events, because it's fine
to say that if the sb/mount was not watched by a pre-content event listener
at the time of file open, then we do not care.

The problem is that legacy inotify/fanotify watches can be added after
file is open,
so that is allegedly why this optimization was not done for fsnotify
hooks in the past.

Thanks,
Amir.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 23:59         ` Amir Goldstein
@ 2024-11-12  0:37           ` Linus Torvalds
  2024-11-12  8:11             ` Amir Goldstein
  2024-11-12 14:28             ` Jan Kara
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-12  0:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
>
> I think that's a good idea for pre-content events, because it's fine
> to say that if the sb/mount was not watched by a pre-content event listener
> at the time of file open, then we do not care.

Right.

> The problem is that legacy inotify/fanotify watches can be added after
> file is open, so that is allegedly why this optimization was not done for
> fsnotify hooks in the past.

So honestly, even if the legacy fsnotify hooks can't look at the file
flag, they could damn well look at an inode flag.

And I'm not even convinced that we couldn't fix them to just look at a
file flag, and say "tough luck, somebody opened that file before you
started watching, you don't get to see what they did".

So even if we don't look at a file->f_mode flag, the lergacy cases
should look at i_fsnotify_mask, and do that *first*.

IOW, not do it like fsnotify_object_watched() does now, which is just
broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
pointlessly, but it also does it much too late - it gets called after
we've already called into the fsnotify() code and have messed up the
I$ etc.

The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
it should be very *literally* pointless. If some bit isn't set in
i_sb->s_fsnotify_mask, then there should be no way to set that bit in
inode->i_fsnotify_mask. So the only time we should access
i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
it gets tested.

But even if that silly and pointless i_sb->s_fsnotify_mask thing is
removed, fsnotify_object_watched() is *still* wrong, because it
requires that mnt_mask as an argument, which means that the caller now
has to look it up - all this entirely pointless work that should never
be done if the bit wasn't set in inode->i_fsnotify_mask.

So I really think fsnotify is doing *everything* wrong.

And I most certainly don't want to add more runtime hooks to
*critical* code like open/read/write.

Right now, many of the fsnotify things are for "metadata", ie for
bigger file creation / removal / move etc. And yes, the "don't do this
if there are no fsnotify watchers AT ALL" does actually end up meaning
that most of the time I never see any of it in profiles, because the
fsnotify_sb_has_watchers() culls out that case.

And while the fsnotify_sb_has_watchers() thing is broken garbage and
does too many indirections and is not testing the right thing, at
least it's inlined and you don't get the function calls.

That doesn't make fsnotify "right", but at least it's not in my face.
I see the sb accesses, and I hate them, but it's usually at least
hidden. Admittedly not as well hidden as it *should* be, since it does
the access tests in the wrong order, but the old fsnotify_open()
doesn't strike me as "terminally broken".

It doesn't have a permission test after the open has already done
things, and it's inlined enough that it isn't actively offensive.

And most of the other fsnotify things have the same pattern - not
great, but not actively offensive.

These new patches make it in my face.

So I do require that the *new* cases at least get it right. The fact
that we have old code that is misdesigned and gets it wrong and should
also be improved isn't an excuse to add *more* badly coded stuff.

And yes, if somebody fixes the old fsnotify stuff to check just the
i_fsnotify_mask in the inline function, and moves all the other silly
checks out-of-line, that would be an improvement. I'd very much
applaud that. But it's a separate thing from adding new hooks.

                Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-12  0:37           ` Linus Torvalds
@ 2024-11-12  8:11             ` Amir Goldstein
  2024-11-12 13:54               ` Jan Kara
  2024-11-12 14:28             ` Jan Kara
  1 sibling, 1 reply; 33+ messages in thread
From: Amir Goldstein @ 2024-11-12  8:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4

On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think that's a good idea for pre-content events, because it's fine
> > to say that if the sb/mount was not watched by a pre-content event listener
> > at the time of file open, then we do not care.
>
> Right.
>
> > The problem is that legacy inotify/fanotify watches can be added after
> > file is open, so that is allegedly why this optimization was not done for
> > fsnotify hooks in the past.
>
> So honestly, even if the legacy fsnotify hooks can't look at the file
> flag, they could damn well look at an inode flag.
>

Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
which is the common way for Anti-malware to set watches on
filesystems, so I am not sure what you are saying.

> And I'm not even convinced that we couldn't fix them to just look at a
> file flag, and say "tough luck, somebody opened that file before you
> started watching, you don't get to see what they did".

That would specifically break tail -f (for inotify) and probably many other
tools, but as long as we also look at the inode flags (i_fsnotify_mask)
and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
then I think we may be able to get away with changing the semantics
for open files on a fanotify mount watch.

Specifically, I would really like to eliminate completely the cost of
FAN_ACCESS_PERM event, which could be gated on file flag, because
this is only for security/Anti-malware and I don't think this event is
practically
useful and it sure does not need to guarantee permission events to mount
watchers on already open files.

>
> So even if we don't look at a file->f_mode flag, the lergacy cases
> should look at i_fsnotify_mask, and do that *first*.
>
> IOW, not do it like fsnotify_object_watched() does now, which is just
> broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
> pointlessly, but it also does it much too late - it gets called after
> we've already called into the fsnotify() code and have messed up the
> I$ etc.
>
> The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
> it should be very *literally* pointless. If some bit isn't set in
> i_sb->s_fsnotify_mask, then there should be no way to set that bit in
> inode->i_fsnotify_mask. So the only time we should access
> i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
> it gets tested.
>

i_fsnotify_mask is the cumulative mask of all inode watchers
s_fsnotify_mask is *not* the cumulative of all i_fsnotify_mask
s_fsnotify_mask is the cumulative mask of all sb watchers
mnt_fsnotify_marks is the cumulative mask of all mount watchers

> But even if that silly and pointless i_sb->s_fsnotify_mask thing is
> removed, fsnotify_object_watched() is *still* wrong, because it
> requires that mnt_mask as an argument, which means that the caller now
> has to look it up - all this entirely pointless work that should never
> be done if the bit wasn't set in inode->i_fsnotify_mask.
>
> So I really think fsnotify is doing *everything* wrong.

Note the difference between fsnotify_sb_has_watchers()
and fsnotify_object_watched().

The former is an early optimization gate that checks if there are
any inode/mount/sb watchers (per class) on the filesystem, regardless
of specific events and specific target inode/file.

We could possibly further optimize fsnotify_sb_has_watchers() to avoid
access to ->s_fsnotify_info by setting sb flag (for each priority class).

The latter checks if any inode/mount/sb are interested in a specific
event on the said object.

In upstream code, fsnotify_object_watched() is always gated behind
fsnotify_sb_has_watchers(), which tries to prevent the indirect call.

The new fsnotify_file_has_pre_content_watches() helper could
have checked fsnotify_sb_has_priority_watchers(sb,
                                               FSNOTIFY_PRIO_CONTENT);
but it is better to gate by file flag as you suggested.


>
> And I most certainly don't want to add more runtime hooks to
> *critical* code like open/read/write.
>
> Right now, many of the fsnotify things are for "metadata", ie for
> bigger file creation / removal / move etc. And yes, the "don't do this
> if there are no fsnotify watchers AT ALL" does actually end up meaning
> that most of the time I never see any of it in profiles, because the
> fsnotify_sb_has_watchers() culls out that case.
>
> And while the fsnotify_sb_has_watchers() thing is broken garbage and
> does too many indirections and is not testing the right thing, at
> least it's inlined and you don't get the function calls.
>
> That doesn't make fsnotify "right", but at least it's not in my face.
> I see the sb accesses, and I hate them, but it's usually at least
> hidden. Admittedly not as well hidden as it *should* be, since it does
> the access tests in the wrong order, but the old fsnotify_open()
> doesn't strike me as "terminally broken".
>
> It doesn't have a permission test after the open has already done
> things, and it's inlined enough that it isn't actively offensive.
>
> And most of the other fsnotify things have the same pattern - not
> great, but not actively offensive.
>
> These new patches make it in my face.
>
> So I do require that the *new* cases at least get it right. The fact
> that we have old code that is misdesigned and gets it wrong and should
> also be improved isn't an excuse to add *more* badly coded stuff.
>
> And yes, if somebody fixes the old fsnotify stuff to check just the
> i_fsnotify_mask in the inline function, and moves all the other silly
> checks out-of-line, that would be an improvement. I'd very much
> applaud that. But it's a separate thing from adding new hooks.

We will do both.

Thanks,
Amir.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-12  8:11             ` Amir Goldstein
@ 2024-11-12 13:54               ` Jan Kara
  2024-11-12 14:42                 ` Amir Goldstein
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2024-11-12 13:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, jack,
	brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4

On Tue 12-11-24 09:11:32, Amir Goldstein wrote:
> On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > I think that's a good idea for pre-content events, because it's fine
> > > to say that if the sb/mount was not watched by a pre-content event listener
> > > at the time of file open, then we do not care.
> >
> > Right.
> >
> > > The problem is that legacy inotify/fanotify watches can be added after
> > > file is open, so that is allegedly why this optimization was not done for
> > > fsnotify hooks in the past.
> >
> > So honestly, even if the legacy fsnotify hooks can't look at the file
> > flag, they could damn well look at an inode flag.
> 
> Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
> which is the common way for Anti-malware to set watches on
> filesystems, so I am not sure what you are saying.
> 
> > And I'm not even convinced that we couldn't fix them to just look at a
> > file flag, and say "tough luck, somebody opened that file before you
> > started watching, you don't get to see what they did".
> 
> That would specifically break tail -f (for inotify) and probably many other
> tools, but as long as we also look at the inode flags (i_fsnotify_mask)
> and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
> then I think we may be able to get away with changing the semantics
> for open files on a fanotify mount watch.

Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark
was placed after file open. There's too much stuff in userspace depending
on this since this behavior dates back to inotify interface sometime in
2010 or so.

> Specifically, I would really like to eliminate completely the cost of
> FAN_ACCESS_PERM event, which could be gated on file flag, because
> this is only for security/Anti-malware and I don't think this event is
> practically
> useful and it sure does not need to guarantee permission events to mount
> watchers on already open files.

For traditional fanotify permission events I agree generating them only if
the mark was placed before open is likely fine but we'll have to try and
see whether something breaks. For the new pre-content events I like the
per-file flag as Linus suggested. That should indeed save us some cache
misses in some fast paths.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-12  0:37           ` Linus Torvalds
  2024-11-12  8:11             ` Amir Goldstein
@ 2024-11-12 14:28             ` Jan Kara
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Kara @ 2024-11-12 14:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Amir Goldstein, Josef Bacik, kernel-team, linux-fsdevel, jack,
	brauner, linux-xfs, linux-btrfs, linux-mm, linux-ext4

On Mon 11-11-24 16:37:30, Linus Torvalds wrote:
> On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think that's a good idea for pre-content events, because it's fine
> > to say that if the sb/mount was not watched by a pre-content event listener
> > at the time of file open, then we do not care.
> 
> Right.
> 
> > The problem is that legacy inotify/fanotify watches can be added after
> > file is open, so that is allegedly why this optimization was not done for
> > fsnotify hooks in the past.
> 
> So honestly, even if the legacy fsnotify hooks can't look at the file
> flag, they could damn well look at an inode flag.
> 
> And I'm not even convinced that we couldn't fix them to just look at a
> file flag, and say "tough luck, somebody opened that file before you
> started watching, you don't get to see what they did".
> 
> So even if we don't look at a file->f_mode flag, the lergacy cases
> should look at i_fsnotify_mask, and do that *first*.
> 
> IOW, not do it like fsnotify_object_watched() does now, which is just
> broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
> pointlessly, but it also does it much too late - it gets called after
> we've already called into the fsnotify() code and have messed up the
> I$ etc.
> 
> The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
> it should be very *literally* pointless. If some bit isn't set in
> i_sb->s_fsnotify_mask, then there should be no way to set that bit in
> inode->i_fsnotify_mask. So the only time we should access
> i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
> it gets tested.

Thanks for explanation but what you write here and below seems to me as if
you are not aware of some features fanotify API provides (for many years).
With fanotify you can place a mark not only on a file / directory but you
can place it also on a mount point or a superblock. In that case any
operation of appropriate type happening through that mount point or on that
superblock should deliver the event to the notification group that placed
the mark. So for example we check inode->i_sb->s_fsnotify_mask on open to
be able to decide whether somebody asked to get notifications about *all*
opens on that superblock and generate appropriate events. These features
are there since day 1 of fanotify (at least for mountpoints, superblocks
were added later) and quite some userspace depends on them for doing
filesystem-wide event monitoring.

We could somehow cache the fact that someone placed a mark on the
superblock in the inode / struct file but I don't see how to keep such
cache consistent with marks that are attached to the superblock without
incurring the very same cache misses we are trying to avoid.

I do like your idea of caching whether somebody wants the notification in
struct file as that way we can avoid the cache misses for the
new pre-content hooks and possibly in hooks for the traditional fanotify
permission events. But I don't see how we could possibly avoid these cache
misses for the classical notification events like FAN_OPEN, FAN_ACCESS,
FAN_MODIFY, ...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-12 13:54               ` Jan Kara
@ 2024-11-12 14:42                 ` Amir Goldstein
  0 siblings, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2024-11-12 14:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Josef Bacik, kernel-team, linux-fsdevel, brauner,
	linux-xfs, linux-btrfs, linux-mm, linux-ext4, kernel test robot

[-- Attachment #1: Type: text/plain, Size: 3298 bytes --]

On Tue, Nov 12, 2024 at 2:55 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 12-11-24 09:11:32, Amir Goldstein wrote:
> > On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > I think that's a good idea for pre-content events, because it's fine
> > > > to say that if the sb/mount was not watched by a pre-content event listener
> > > > at the time of file open, then we do not care.
> > >
> > > Right.
> > >
> > > > The problem is that legacy inotify/fanotify watches can be added after
> > > > file is open, so that is allegedly why this optimization was not done for
> > > > fsnotify hooks in the past.
> > >
> > > So honestly, even if the legacy fsnotify hooks can't look at the file
> > > flag, they could damn well look at an inode flag.
> >
> > Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
> > which is the common way for Anti-malware to set watches on
> > filesystems, so I am not sure what you are saying.
> >
> > > And I'm not even convinced that we couldn't fix them to just look at a
> > > file flag, and say "tough luck, somebody opened that file before you
> > > started watching, you don't get to see what they did".
> >
> > That would specifically break tail -f (for inotify) and probably many other
> > tools, but as long as we also look at the inode flags (i_fsnotify_mask)
> > and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
> > then I think we may be able to get away with changing the semantics
> > for open files on a fanotify mount watch.
>
> Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark
> was placed after file open. There's too much stuff in userspace depending
> on this since this behavior dates back to inotify interface sometime in
> 2010 or so.
>
> > Specifically, I would really like to eliminate completely the cost of
> > FAN_ACCESS_PERM event, which could be gated on file flag, because
> > this is only for security/Anti-malware and I don't think this event is
> > practically
> > useful and it sure does not need to guarantee permission events to mount
> > watchers on already open files.
>
> For traditional fanotify permission events I agree generating them only if
> the mark was placed before open is likely fine but we'll have to try and
> see whether something breaks. For the new pre-content events I like the
> per-file flag as Linus suggested. That should indeed save us some cache
> misses in some fast paths.

FWIW, attached a patch that implements FMODE_NOTIFY_PERM
I have asked Oliver to run his performance tests to see if we can
observe an improvement with existing workloads, but is sure is going
to be useful for pre-content events.

For example, here is what the pre content helper looks like after
I adapted Josef's patches to use the flag:

static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
{
        if (!(file->f_mode & FMODE_NOTIFY_PERM))
                return false;

        if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
                return false;

        return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
}

Thanks,
Amir.

[-- Attachment #2: 0001-fsnotify-opt-in-for-permission-events-at-file_open_p.patch --]
[-- Type: text/x-patch, Size: 4545 bytes --]

From 8c8e9452d153a1918470cbe52a8eb6505c675911 Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Tue, 12 Nov 2024 13:46:08 +0100
Subject: [PATCH] fsnotify: opt-in for permission events at file_open_perm()
 time

Legacy inotify/fanotify listeners can add watches for events on inode,
parent or mount and expect to get events (e.g. FS_MODIFY) on files that
were already open at the time of setting up the watches.

fanotify permission events are typically used by Anti-malware sofware,
that is watching the entire mount and it is not common to have more that
one Anti-malware engine installed on a system.

To reduce the overhead of the fsnotify_file_perm() hooks on every file
access, relax the semantics of the legacy FAN_OPEN_PERM event to generate
events only if there were *any* permission event listeners on the
filesystem at the time that the file was open.

The new semantics, implemented with the opt-in FMODE_NOTIFY_PERM flag
are also going to apply to the new fanotify pre-content event in order
to reduce the cost of the pre-content event vfs hooks.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wj8L=mtcRTi=NECHMGfZQgXOp_uix1YVh04fEmrKaMnXA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 include/linux/fs.h       |  3 ++-
 include/linux/fsnotify.h | 47 ++++++++++++++++++++++++++++------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9c13222362f5..9b58e9887e4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,7 +173,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 
 #define	FMODE_NOREUSE		((__force fmode_t)(1 << 23))
 
-/* FMODE_* bit 24 */
+/* File may generate fanotify access permission events */
+#define FMODE_NOTIFY_PERM	((__force fmode_t)(1 << 24))
 
 /* File is embedded in backing_file object */
 #define FMODE_BACKING		((__force fmode_t)(1 << 25))
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..f0fd3dcae654 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -108,10 +108,9 @@ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask)
 	fsnotify_parent(dentry, mask, dentry, FSNOTIFY_EVENT_DENTRY);
 }
 
-static inline int fsnotify_file(struct file *file, __u32 mask)
+/* Should events be generated on this open file regardless of watches? */
+static inline bool fsnotify_file_watchable(struct file *file, __u32 mask)
 {
-	const struct path *path;
-
 	/*
 	 * FMODE_NONOTIFY are fds generated by fanotify itself which should not
 	 * generate new events. We also don't want to generate events for
@@ -119,14 +118,37 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
 	 * handle creation / destruction events and not "real" file events.
 	 */
 	if (file->f_mode & (FMODE_NONOTIFY | FMODE_PATH))
+		return false;
+
+	/* Permission events require that watches are set before FS_OPEN_PERM */
+	if (mask & ALL_FSNOTIFY_PERM_EVENTS & ~FS_OPEN_PERM &&
+	    !(file->f_mode & FMODE_NOTIFY_PERM))
+		return false;
+
+	return true;
+}
+
+static inline int fsnotify_file(struct file *file, __u32 mask)
+{
+	const struct path *path;
+
+	if (!fsnotify_file_watchable(file, mask))
 		return 0;
 
 	path = &file->f_path;
-	/* Permission events require group prio >= FSNOTIFY_PRIO_CONTENT */
-	if (mask & ALL_FSNOTIFY_PERM_EVENTS &&
-	    !fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
-					       FSNOTIFY_PRIO_CONTENT))
-		return 0;
+	/*
+	 * Permission events require group prio >= FSNOTIFY_PRIO_CONTENT.
+	 * Unless permission event watchers exist at FS_OPEN_PERM time,
+	 * operations on file will not be generating any permission events.
+	 */
+	if (mask & ALL_FSNOTIFY_PERM_EVENTS) {
+		if (!fsnotify_sb_has_priority_watchers(path->dentry->d_sb,
+						       FSNOTIFY_PRIO_CONTENT))
+			return 0;
+
+		if (mask & FS_OPEN_PERM)
+			file->f_mode |= FMODE_NOTIFY_PERM;
+	}
 
 	return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
 }
@@ -166,15 +188,12 @@ static inline int fsnotify_file_perm(struct file *file, int perm_mask)
  */
 static inline int fsnotify_open_perm(struct file *file)
 {
-	int ret;
+	int ret = fsnotify_file(file, FS_OPEN_PERM);
 
-	if (file->f_flags & __FMODE_EXEC) {
+	if (!ret && file->f_flags & __FMODE_EXEC)
 		ret = fsnotify_file(file, FS_OPEN_EXEC_PERM);
-		if (ret)
-			return ret;
-	}
 
-	return fsnotify_file(file, FS_OPEN_PERM);
+	return ret;
 }
 
 #else
-- 
2.34.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-11 23:22       ` Linus Torvalds
  2024-11-11 23:39         ` Linus Torvalds
  2024-11-11 23:59         ` Amir Goldstein
@ 2024-11-12 15:24         ` Josef Bacik
  2024-11-12 17:27           ` Linus Torvalds
  2 siblings, 1 reply; 33+ messages in thread
From: Josef Bacik @ 2024-11-12 15:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Mon, Nov 11, 2024 at 03:22:06PM -0800, Linus Torvalds wrote:
> On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Did you see the patch that added the
> > fsnotify_file_has_pre_content_watches() thing?
> 
> No, because I had gotten to patch 6/11, and it added this open thing,
> and there was no such thing in any of the patches before it.
> 
> It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.
> 
> However, at no point does it look like you actually test it at open
> time, so none of this seems to matter.
> 
> As far as I can see, even at the end of the series, you will call the
> fsnotify hook at open time even if there are no content watches on the
> file.
> 
> So apparently the fsnotify_file_has_pre_content_watches() is not
> called when it should be, and when it *is* called, it's also doing
> completely the wrong thing.
> 
> Look, for basic operations THAT DON'T CARE, you now added a function
> call to fsnotify_file_has_pre_content_watches(), that function call
> looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
> be done!), and then after that looks at the i_fsnotify_mask.
> 
> THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.
> 
> This code has been written by somebody who NEVER EVER looked at
> profiles. You're following chains of pointers when you never should.
> 
> Look, here's a very basic example of the kind of complete mis-design
> I'm talking about:
> 
>  - we're doing a basic read() on a file that isn't being watched.
> 
>  - we want to maybe do read-ahead
> 
>  - the code does
> 
>         if (fsnotify_file_has_pre_content_watches(file))
>                 return fpin;
> 
>    to say that "don't do read-ahead".
> 
> Fine, I understand the concept. But keep in mind that the common case
> is presumably that there are no content watches.
> 
> And even ignoring the "common case" issue, that's the one you want to
> OPTIMIZE for. That's the case that matters for performance, because
> clearly if there are content watches, you're going to go into "Go
> Slow" mode anyway and not do pre-fetching. So even if content watches
> are common on some load, they are clearly not the case you should do
> performance optimization for.
> 
> With me so far?
> 
> So if THAT is the case that matters, then dammit, we shouldn't be
> calling a function at all.
> 
> And when calling the function, we shouldn't start out with this
> completely broken logic:
> 
>         struct inode *inode = file_inode(file);
>         __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
> 
>         if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
>                 return false;
> 
> that does random crap and looks up some "mount mask" and looks up the
> superblock flags.
> 
> Why shouldn't we do this?
> 
> BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
> CONTENT MATCHES!
> 
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.
> 
> You literally check for the "do I even care" *last*, when you finally
> do that fsnotify_object_watched() check that looks at the inode. But
> by then you have already wasted all that time and effort, and
> fsnotify_object_watched() is broken anyway, because it's stupidly
> designed to require that mnt_mask that isn't needed if you have
> properly marked each object individually.
> 
> So what *should* you have?
> 
> You should have had a per-file flag saying "Do I need to even call
> this crud at all", and have it in a location where you don't need to
> look at anything else.
> 
> And fsnotify already basically has that flag, except it's mis-designed
> too. We have FMODE_NONOTIFY, which is the wrong way around (saying
> "don't notify", when that should just be the *default*), and the
> fsnotify layer uses it only to mark its own internal files so that it
> doesn't get called recursively. So that flag that *looks* sane and is
> in the right location is actually doing the wrong thing, because it's
> dealing with a rare special case, not the important cases that
> actually matter.
> 
> So all of this readahead logic - and all of the read and write hooks -
> should be behind a simple "oh, this file doesn't have any notification
> stuff, so don't bother calling any fsnotify functions".
> 
> So I think the pattern should be
> 
>     static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
>     {
>         if (unlikely(file->f_mode & FMODE_NOTIFY))
>                 return out_of_line_crud(file);
>         return false;
>     }
> 
> so that the *only* thing that gets inlined is "does this file have any
> fsnotify stuff at all", and in the common case where that isn't true,
> we don't call any fsnotify functions, and we don't start looking at
> inodes or superblocks or anything pointless like that.
> 
> THAT is the kind of code sequence we should have for unlikely cases.
> Not the "call fsnotify code to check for something unlikely". Not
> "look at file_inode(file)->i_sb->s_iflags" until it's *required*.
> 
> Similarly, the actual open-time code SHOULD NEVER BE CALLED, because
> it should have the same kind of simple logic based on some simple flag
> either in the dentry or maybe in the inode. So that all those *normal*
> dentries and inodes that don't have watches don't get the overhead of
> calling into fsnotify code.
> 
> Because yes, I do see all those function calls, and all those
> unnecessary indirect pointer accesses when I do profiles. And if I see
> fsnotify in my profiles when I don't have any notifications enabled,
> that really really *annoys* me.
> 

There are good suggestions here, and decent analysis, Amir has followed up with
a patch that will improve things, and that's good.

But this was an entirely inappropriate way to communicate your point, even with
people who have been here a while and are used to being yelled at.

Throughout this email you have suggested that myself, Amir, and Jan have never
looked at profiles.  You go so far as to suggest that we have no idea what we're
doing and don't understand common case optimizations.  These are the sort of
comments that are unhelpful and put most people on the defensive, and make them
unwilling to listen to your suggestions and feedback.  These are the sort of
comments that make people work very hard to exit this community.

Are you wrong?  No.  We all get tunnel vision, I know I was deeply focused on
making the page fault case have as little impact as possible, but I definitely
didn't consider the indirect access case.  I don't run with mitigations on, and
frankly I am a file system person, I know if you're here we're going to do IO
and that's going to be the bad part.  That's why we do code review, because
we're all human and we all miss things.

But being dressed down like this for missing a better way, because lets be
honest what I did wasn't earth shatteringly bad, is not helpful.  It is actively
harmful, and it makes me not want to work on this stuff anymore.

We constantly have discussions about how do we bring in new people and how do we
train up new maintainers, without looking at how do we keep maintainers and how
do we keep developers.  If you're losing people like me, gaining new people is
going to be an even taller order.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 06/17] fsnotify: generate pre-content permission event on open
  2024-11-12 15:24         ` Josef Bacik
@ 2024-11-12 17:27           ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2024-11-12 17:27 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
	linux-btrfs, linux-mm, linux-ext4

On Tue, 12 Nov 2024 at 07:24, Josef Bacik <josef@toxicpanda.com> wrote:
>
> But this was an entirely inappropriate way to communicate your point, even with
> people who have been here a while and are used to being yelled at.

Fair enough.

I was probably way too frustrated, because I spent some of the last
few weeks literally trying to remove two cache misses from the
permission checking path.

And those two cache misses were for features that are in POSIX made
worse because of uid translations infrastructure used for containers.

Those are both things that people actually *use* in major ways.
Admittedly the particular optimization was exactly to _not_ bother
with looking up the exact uid details when we could tell early that it
was all unnecessary, but the point was that this was to optimize
something real, something important, and not some special case.

And here's an example of a profile I've been looking at (this is a
real load: the profile is literally my "make allmodconfig" build with
not a lot of actual rebuilding needed):

  1.30%  avc_has_perm_noaudit
  1.24%  selinux_inode_permission
  1.18%  link_path_walk.part.0.constprop.0
  0.99%  btrfs_getattr
  0.82%  clear_page_rep
  0.82%  security_inode_permission
  0.60%  dput
  0.60%  path_lookupat
  0.60%  __check_object_size
  0.54%  strncpy_from_user
  0.51%  inode_permission
  0.50%  generic_permission
  0.47%  kmem_cache_alloc_noprof
  0.47%  step_into
  0.46%  btrfs_permission
  0.45%  cp_new_stat
  0.44%  filename_lookup

and hey, you go "most of those are just around half a percent". Sure.
But add those things up, and you'll see that they add up to about 12%.

And yes, that 12% is 1/8th of the whole load. So it's not some "just
12% of the kernel footprint". It's literally 12% of one of the main
loads I run day-to-day, which is why I care about these paths so
deeply.

And look at the two top offenders. No, they aren't fsnotify. But guess
what they are? They are hooks in the VFS layer that the VFS code
cannot do anything about and cannot optimize as a result. They do
ABSOLUTELY NOTHING on this load, and they add no actual value.

If they had some kind of "I don't have any special security label"
test, the two top offenders (and down that list you can find a third
one) would likely just not exist at all. But they aren't "real" VFS
functions, they are just hooks into a black hole with random semantics
that is set by user-supplied tables, so we don't have that.

So this is why I'm putting my foot down. No more badly coded and badly
designed hooks that the VFS layer can't just optimize away for the
common case.

I can't fix the existing issues. But I can say "no more". If you want
new hooks, they had better have effectively *ZERO* overhead when they
aren't relevant.

And yes, I was too forceful. Sorry. But my foot stays down.

If you want a specialty hook for specialty uses, that hook needs to be
so *fundamentally* low-cost that I simply don't need to worry about
it.

It needs to have a flag that doesn't take a cache miss that says
"Nobody cares", and doesn't call out to random pointless functions
that cause I$ misses either.

I really wish I had made that requirement for the security people all
those years ago. Because 99% of the time, all that stupid selinux
noise is literally worthless and does nothing. But we don't have the
flag that says "nothing to see".

               Linus


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-11-12 17:28 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-11 20:17 [PATCH v6 00/17] fanotify: add pre-content hooks Josef Bacik
2024-11-11 20:17 ` [PATCH v6 01/17] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
2024-11-11 20:17 ` [PATCH v6 02/17] fanotify: rename a misnamed constant Josef Bacik
2024-11-11 20:17 ` [PATCH v6 03/17] fanotify: reserve event bit of deprecated FAN_DIR_MODIFY Josef Bacik
2024-11-11 20:17 ` [PATCH v6 04/17] fsnotify: introduce pre-content permission events Josef Bacik
2024-11-11 20:17 ` [PATCH v6 05/17] fsnotify: pass optional file access range in pre-content event Josef Bacik
2024-11-11 20:17 ` [PATCH v6 06/17] fsnotify: generate pre-content permission event on open Josef Bacik
2024-11-11 21:51   ` Linus Torvalds
2024-11-11 22:46     ` Josef Bacik
2024-11-11 23:22       ` Linus Torvalds
2024-11-11 23:39         ` Linus Torvalds
2024-11-11 23:59         ` Amir Goldstein
2024-11-12  0:37           ` Linus Torvalds
2024-11-12  8:11             ` Amir Goldstein
2024-11-12 13:54               ` Jan Kara
2024-11-12 14:42                 ` Amir Goldstein
2024-11-12 14:28             ` Jan Kara
2024-11-12 15:24         ` Josef Bacik
2024-11-12 17:27           ` Linus Torvalds
2024-11-11 23:36     ` Amir Goldstein
2024-11-11 20:17 ` [PATCH v6 07/17] fsnotify: generate pre-content permission event on truncate Josef Bacik
2024-11-11 20:17 ` [PATCH v6 08/17] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
2024-11-11 20:17 ` [PATCH v6 09/17] fanotify: report file range info with pre-content events Josef Bacik
2024-11-11 20:17 ` [PATCH v6 10/17] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
2024-11-11 20:18 ` [PATCH v6 11/17] fanotify: add a helper to check for pre content events Josef Bacik
2024-11-11 20:18 ` [PATCH v6 12/17] fanotify: disable readahead if we have pre-content watches Josef Bacik
2024-11-11 20:18 ` [PATCH v6 13/17] mm: don't allow huge faults for files with pre content watches Josef Bacik
2024-11-11 20:18 ` [PATCH v6 14/17] fsnotify: generate pre-content permission event on page fault Josef Bacik
2024-11-11 20:18 ` [PATCH v6 15/17] xfs: add pre-content fsnotify hook for write faults Josef Bacik
2024-11-11 20:18 ` [PATCH v6 16/17] btrfs: disable defrag on pre-content watched files Josef Bacik
2024-11-11 20:18 ` [PATCH v6 17/17] fs: enable pre-content events on supported file systems Josef Bacik
2024-11-11 20:27 ` [PATCH v6 00/17] fanotify: add pre-content hooks Amir Goldstein
2024-11-11 21:55 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox