[PATCH v7 0/7] mseal system mappings

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 0/7]  mseal system mappings
@ 2025-02-24 22:52 jeffxu
  2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
                   ` (9 more replies)
  0 siblings, 10 replies; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

This is V7 version, addressing comments from V6, without code logic
change.

--------------------------------------------------

History:
V7:
 - Remove cover letter from the first patch (Liam R. Howlett)
 - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
 - logging and fclose() in selftest (Liam R. Howlett)

V6:
  https://lore.kernel.org/all/20250224174513.3600914-1-jeffxu@google.com/

V5
  https://lore.kernel.org/all/20250212032155.1276806-1-jeffxu@google.com/

V4:
  https://lore.kernel.org/all/20241125202021.3684919-1-jeffxu@google.com/

V3:
  https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/

V2:
  https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/

V1:
  https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/

--------------------------------------------------
As discussed during mseal() upstream process [1], mseal() protects
the VMAs of a given virtual memory range against modifications, such
as the read/write (RW) and no-execute (NX) bits. For complete
descriptions of memory sealing, please see mseal.rst [2].

The mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped.

The system mappings are readonly only, memory sealing can protect
them from ever changing to writable or unmmap/remapped as different
attributes.

System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
are created by the kernel during program initialization, and could
be sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000),
which is outside the mm managed range. This means mprotect, munmap, and
mremap won't work on the vsyscall. Since sealing doesn't enhance
the vsyscall's security, it is skipped in this patch. If we ever seal
the vsyscall, it is probably only for decorative purpose, i.e. showing
the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML(User Mode Linux)
and gVisor, rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define
CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
calls to pass mseal flag. Additionally, architectures must confirm they
do not unmap/remap system mappings during the process lifetime.

In this version, we've improved the handling of system mapping sealing from
previous versions, instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.

Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:
- The PT_LOAD header may report an incorrect length for vdso, (smaller
  than its actual size). The dynamic linker, which relies on PT_LOAD
  information to determine mapping size, would then split and partially
  seal the vdso mapping. Since each architecture has its own vdso/vvar
  code, fixing this in the kernel would require going through each
  archiecture. Our initial goal was to enable sealing readonly mappings,
  e.g. .text, across all architectures, sealing vdso from kernel since
  creation appears to be simpler than sealing vdso at glibc.
- The [vvar] mapping header only contains address information, not length
  information. Similar issues might exist for other special mappings.
- Mappings like uprobe are not covered by the dynamic linker,
  and there is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android,
and other high security systems.

Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
  i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
  Android to ensure the sealing doesn't affect the functionality of
  Chromebook and Android phone.

I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
  normal operations such as browsing the web, open/edit doc are OK.

In addition, Benjamin Berg tested this on UML.

Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]

Jeff Xu (7):
  mseal, system mappings: kernel config and header change
  selftests: x86: test_mremap_vdso: skip if vdso is msealed
  mseal, system mappings: enable x86-64
  mseal, system mappings: enable arm64
  mseal, system mappings: enable uml architecture
  mseal, system mappings: uprobe mapping
  mseal, system mappings: update mseal.rst

 Documentation/userspace-api/mseal.rst         |  7 +++
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/kernel/vdso.c                      | 22 +++++++---
 arch/um/Kconfig                               |  1 +
 arch/x86/Kconfig                              |  1 +
 arch/x86/entry/vdso/vma.c                     | 16 ++++---
 arch/x86/um/vdso/vma.c                        |  6 ++-
 include/linux/mm.h                            | 10 +++++
 init/Kconfig                                  | 18 ++++++++
 kernel/events/uprobes.c                       |  5 ++-
 security/Kconfig                              | 18 ++++++++
 .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
 12 files changed, 132 insertions(+), 16 deletions(-)

-- 
2.48.1.658.g4767266eb4-goog

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  6:05   ` Lorenzo Stoakes
  2025-02-25 15:22   ` Liam R. Howlett
  2025-02-24 22:52 ` [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed jeffxu
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Provide infrastructure to mseal system mappings. Establish
two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP
macro for future patches.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/mm.h | 10 ++++++++++
 init/Kconfig       | 18 ++++++++++++++++++
 security/Kconfig   | 18 ++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb7..8b800941678d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4155,4 +4155,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+
+/*
+ * mseal of userspace process's system mappings.
+ */
+#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
+#define VM_SEALED_SYSMAP	VM_SEALED
+#else
+#define VM_SEALED_SYSMAP	VM_NONE
+#endif
+
 #endif /* _LINUX_MM_H */
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..07435e33f965 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
 config ARCH_HAS_MEMBARRIER_SYNC_CORE
 	bool
 
+config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
+	bool
+	help
+	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  To enable this feature, the architecture needs to update their
+	  special mappings calls to include the sealing flag and confirm
+	  that it doesn't unmap/remap system mappings during the life
+	  time of the process. After the architecture enables this, a
+	  distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
+	  to the feature.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config HAVE_PERF_EVENTS
 	bool
 	help
diff --git a/security/Kconfig b/security/Kconfig
index f10dbf15c294..15a86a952910 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
 
 endchoice
 
+config MSEAL_SYSTEM_MAPPINGS
+	bool "mseal system mappings"
+	depends on 64BIT
+	depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
+	depends on !CHECKPOINT_RESTORE
+	help
+	  Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  Note: CHECKPOINT_RESTORE, UML, gVisor, rr are known to relocate or
+	  unmap system mapping, therefore this config can't be enabled
+	  universally.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
  2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  6:15   ` Lorenzo Stoakes
  2025-02-24 22:52 ` [PATCH v7 3/7] mseal, system mappings: enable x86-64 jeffxu
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu,
	Kees Cook

From: Jeff Xu <jeffxu@chromium.org>

Add code to detect if the vdso is memory sealed, skip the test
if it is.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <kees@kernel.org>
---
 .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c b/tools/testing/selftests/x86/test_mremap_vdso.c
index d53959e03593..94bee6e0c813 100644
--- a/tools/testing/selftests/x86/test_mremap_vdso.c
+++ b/tools/testing/selftests/x86/test_mremap_vdso.c
@@ -14,6 +14,7 @@
 #include <errno.h>
 #include <unistd.h>
 #include <string.h>
+#include <stdbool.h>
 
 #include <sys/mman.h>
 #include <sys/auxv.h>
@@ -55,13 +56,55 @@ static int try_to_remap(void *vdso_addr, unsigned long size)
 
 }
 
+#define VDSO_NAME "[vdso]"
+#define VMFLAGS "VmFlags:"
+#define MSEAL_FLAGS "sl"
+#define MAX_LINE_LEN 512
+
+bool vdso_sealed(FILE *maps)
+{
+	char line[MAX_LINE_LEN];
+	bool has_vdso = false;
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (strstr(line, VDSO_NAME))
+			has_vdso = true;
+
+		if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
+			if (strstr(line, MSEAL_FLAGS))
+				return true;
+
+			return false;
+		}
+	}
+
+	return false;
+}
+
 int main(int argc, char **argv, char **envp)
 {
 	pid_t child;
+	FILE *maps;
 
 	ksft_print_header();
 	ksft_set_plan(1);
 
+	maps = fopen("/proc/self/smaps", "r");
+	if (!maps) {
+		ksft_test_result_skip(
+			"Could not open /proc/self/smaps, errno=%d\n",
+			 errno);
+
+		return 0;
+	}
+
+	if (vdso_sealed(maps)) {
+		ksft_test_result_skip("vdso is sealed\n");
+		return 0;
+	}
+
+	fclose(maps);
+
 	child = fork();
 	if (child == -1)
 		ksft_exit_fail_msg("failed to fork (%d): %m\n", errno);
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
  2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
  2025-02-24 22:52 ` [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  1:03   ` Kees Cook
  2025-02-25  8:08   ` Thomas Weißschuh
  2025-02-24 22:52 ` [PATCH v7 4/7] mseal, system mappings: enable arm64 jeffxu
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
covering the vdso, vvar, vvar_vclock.

Production release testing passes on Android and Chrome OS.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 arch/x86/Kconfig          |  1 +
 arch/x86/entry/vdso/vma.c | 16 ++++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 87198d957e2f..8fa17032ca46 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86_64
 	depends on 64BIT
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
+	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 39e6efc1a9ca..1b1c009f20a8 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long text_start;
+	unsigned long vm_flags;
 	int ret = 0;
 
 	if (mmap_write_lock_killable(mm))
@@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	/*
 	 * MAYWRITE to allow gdb to COW and set breakpoints
 	 */
+	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
+	vm_flags |= VM_SEALED_SYSMAP;
 	vma = _install_special_mapping(mm,
 				       text_start,
 				       image->size,
-				       VM_READ|VM_EXEC|
-				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				       vm_flags,
 				       &vdso_mapping);
 
 	if (IS_ERR(vma)) {
@@ -276,11 +278,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 		goto up_fail;
 	}
 
+	vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
+	vm_flags |= VM_SEALED_SYSMAP;
 	vma = _install_special_mapping(mm,
 				       addr,
 				       (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
-				       VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
-				       VM_PFNMAP,
+				       vm_flags,
 				       &vvar_mapping);
 
 	if (IS_ERR(vma)) {
@@ -289,11 +292,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 		goto up_fail;
 	}
 
+	vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
+	vm_flags |= VM_SEALED_SYSMAP;
 	vma = _install_special_mapping(mm,
 				       addr + (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
 				       VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
-				       VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
-				       VM_PFNMAP,
+				       vm_flags,
 				       &vvar_vclock_mapping);
 
 	if (IS_ERR(vma)) {
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (2 preceding siblings ...)
  2025-02-24 22:52 ` [PATCH v7 3/7] mseal, system mappings: enable x86-64 jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  6:20   ` Lorenzo Stoakes
  2025-02-24 22:52 ` [PATCH v7 5/7] mseal, system mappings: enable uml architecture jeffxu
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
the vdso, vvar, and compat-mode vectors and sigpage mappings.

Production release testing passes on Android and Chrome OS.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 arch/arm64/Kconfig       |  1 +
 arch/arm64/kernel/vdso.c | 22 +++++++++++++++-------
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fcdd0ed3eca8..39202aa9a5af 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEM_ENCRYPT
+	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index e8ed8e5b713b..12e6ab396018 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -183,6 +183,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
 {
 	unsigned long vdso_base, vdso_text_len, vdso_mapping_len;
 	unsigned long gp_flags = 0;
+	unsigned long vm_flags;
 	void *ret;
 
 	BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
@@ -197,8 +198,10 @@ static int __setup_additional_pages(enum vdso_abi abi,
 		goto up_fail;
 	}
 
+	vm_flags = VM_READ|VM_MAYREAD|VM_PFNMAP;
+	vm_flags |= VM_SEALED_SYSMAP;
 	ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
-				       VM_READ|VM_MAYREAD|VM_PFNMAP,
+				       vm_flags,
 				       &vvar_map);
 	if (IS_ERR(ret))
 		goto up_fail;
@@ -208,9 +211,10 @@ static int __setup_additional_pages(enum vdso_abi abi,
 
 	vdso_base += VVAR_NR_PAGES * PAGE_SIZE;
 	mm->context.vdso = (void *)vdso_base;
+	vm_flags = VM_READ|VM_EXEC|gp_flags|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
+	vm_flags |= VM_SEALED_SYSMAP;
 	ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
-				       VM_READ|VM_EXEC|gp_flags|
-				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				       vm_flags,
 				       vdso_info[abi].cm);
 	if (IS_ERR(ret))
 		goto up_fail;
@@ -326,6 +330,7 @@ arch_initcall(aarch32_alloc_vdso_pages);
 static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
 {
 	void *ret;
+	unsigned long vm_flags;
 
 	if (!IS_ENABLED(CONFIG_KUSER_HELPERS))
 		return 0;
@@ -334,9 +339,10 @@ static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
 	 * Avoid VM_MAYWRITE for compatibility with arch/arm/, where it's
 	 * not safe to CoW the page containing the CPU exception vectors.
 	 */
+	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC;
+	vm_flags |= VM_SEALED_SYSMAP;
 	ret = _install_special_mapping(mm, AARCH32_VECTORS_BASE, PAGE_SIZE,
-				       VM_READ | VM_EXEC |
-				       VM_MAYREAD | VM_MAYEXEC,
+				       vm_flags,
 				       &aarch32_vdso_maps[AA32_MAP_VECTORS]);
 
 	return PTR_ERR_OR_ZERO(ret);
@@ -345,6 +351,7 @@ static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
 static int aarch32_sigreturn_setup(struct mm_struct *mm)
 {
 	unsigned long addr;
+	unsigned long vm_flags;
 	void *ret;
 
 	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
@@ -357,9 +364,10 @@ static int aarch32_sigreturn_setup(struct mm_struct *mm)
 	 * VM_MAYWRITE is required to allow gdb to Copy-on-Write and
 	 * set breakpoints.
 	 */
+	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
+	vm_flags |= VM_SEALED_SYSMAP;
 	ret = _install_special_mapping(mm, addr, PAGE_SIZE,
-				       VM_READ | VM_EXEC | VM_MAYREAD |
-				       VM_MAYWRITE | VM_MAYEXEC,
+				       vm_flags,
 				       &aarch32_vdso_maps[AA32_MAP_SIGPAGE]);
 	if (IS_ERR(ret))
 		goto out;
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (3 preceding siblings ...)
  2025-02-24 22:52 ` [PATCH v7 4/7] mseal, system mappings: enable arm64 jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  6:22   ` Lorenzo Stoakes
  2025-02-24 22:52 ` [PATCH v7 6/7] mseal, system mappings: uprobe mapping jeffxu
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu,
	Benjamin Berg

From: Jeff Xu <jeffxu@chromium.org>

Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
the vdso.

Testing passes on UML.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Tested-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/Kconfig        | 1 +
 arch/x86/um/vdso/vma.c | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 18051b1cfce0..eb2d439a5334 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -10,6 +10,7 @@ config UML
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
 	select ARCH_HAS_STRNCPY_FROM_USER
 	select ARCH_HAS_STRNLEN_USER
 	select HAVE_ARCH_AUDITSYSCALL
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index f238f7b33cdd..fdfba858ffc9 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long vm_flags;
 	static struct vm_special_mapping vdso_mapping = {
 		.name = "[vdso]",
 	};
@@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return -EINTR;
 
 	vdso_mapping.pages = vdsop;
+	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
+	vm_flags |= VM_SEALED_SYSMAP;
 	vma = _install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
-		VM_READ|VM_EXEC|
-		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+		vm_flags,
 		&vdso_mapping);
 
 	mmap_write_unlock(mm);
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (4 preceding siblings ...)
  2025-02-24 22:52 ` [PATCH v7 5/7] mseal, system mappings: enable uml architecture jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-25  6:24   ` Lorenzo Stoakes
  2025-02-26 16:26   ` Oleg Nesterov
  2025-02-24 22:52 ` [PATCH v7 7/7] mseal, system mappings: update mseal.rst jeffxu
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Provide support to mseal the uprobe mapping.

Unlike other system mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime. It could be sealed from creation.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 kernel/events/uprobes.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2ca797cbe465..8dcdfa0d306b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1662,6 +1662,7 @@ static const struct vm_special_mapping xol_mapping = {
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
 	struct vm_area_struct *vma;
+	unsigned long vm_flags;
 	int ret;
 
 	if (mmap_write_lock_killable(mm))
@@ -1682,8 +1683,10 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 		}
 	}
 
+	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
+	vm_flags |= VM_SEALED_SYSMAP;
 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
-				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+				vm_flags,
 				&xol_mapping);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH v7 7/7] mseal, system mappings: update mseal.rst
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (5 preceding siblings ...)
  2025-02-24 22:52 ` [PATCH v7 6/7] mseal, system mappings: uprobe mapping jeffxu
@ 2025-02-24 22:52 ` jeffxu
  2025-02-24 23:03 ` [PATCH v7 0/7] mseal system mappings Pedro Falcato
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
  To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
  Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu,
	Kees Cook

From: Jeff Xu <jeffxu@chromium.org>

Update memory sealing documentation to include details about system
mappings.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <kees@kernel.org>
---
 Documentation/userspace-api/mseal.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 41102f74c5e2..10147281bf2d 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -130,6 +130,13 @@ Use cases
 
 - Chrome browser: protect some security sensitive data structures.
 
+- System mappings:
+  If supported by an architecture (via CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS),
+  the CONFIG_MSEAL_SYSTEM_MAPPINGS seals system mappings, e.g. vdso, vvar,
+  uprobes, sigpage, vectors, etc. CHECKPOINT_RESTORE, UML, gVisor, rr are
+  known to relocate or unmap system mapping, therefore this config can't be
+  enabled universally.
+
 When not to use mseal
 =====================
 Applications can apply sealing to any virtual memory region from userspace,
-- 
2.48.1.658.g4767266eb4-goog



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (6 preceding siblings ...)
  2025-02-24 22:52 ` [PATCH v7 7/7] mseal, system mappings: update mseal.rst jeffxu
@ 2025-02-24 23:03 ` Pedro Falcato
  2025-02-24 23:07   ` Jeff Xu
  2025-02-25 10:32 ` Lorenzo Stoakes
  2025-02-25 15:18 ` Lorenzo Stoakes
  9 siblings, 1 reply; 122+ messages in thread
From: Pedro Falcato @ 2025-02-24 23:03 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:52 PM <jeffxu@chromium.org> wrote:
>
> From: Jeff Xu <jeffxu@chromium.org>
>
> This is V7 version, addressing comments from V6, without code logic
> change.
>
> --------------------------------------------------
>
> History:
> V7:
>  - Remove cover letter from the first patch (Liam R. Howlett)
>  - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
>  - logging and fclose() in selftest (Liam R. Howlett)

Jeff, please don't send out new versions of the patchset that quickly.
We were having a discussion on v5, you sent v6 today (acceptable) and
now v7 (while changing barely anything of note). It's hard to track
things this way, and you're just flooding a bunch of mailboxes.

Thanks,
Pedro


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-24 23:03 ` [PATCH v7 0/7] mseal system mappings Pedro Falcato
@ 2025-02-24 23:07   ` Jeff Xu
  2025-02-25  6:09     ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-24 23:07 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 3:03 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Mon, Feb 24, 2025 at 10:52 PM <jeffxu@chromium.org> wrote:
> >
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > This is V7 version, addressing comments from V6, without code logic
> > change.
> >
> > --------------------------------------------------
> >
> > History:
> > V7:
> >  - Remove cover letter from the first patch (Liam R. Howlett)
> >  - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
> >  - logging and fclose() in selftest (Liam R. Howlett)
>
> Jeff, please don't send out new versions of the patchset that quickly.
> We were having a discussion on v5, you sent v6 today (acceptable) and
> now v7 (while changing barely anything of note). It's hard to track
> things this way, and you're just flooding a bunch of mailboxes.
>
Ah, I apologize. Sure.

-Jeff


> Thanks,
> Pedro


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-24 22:52 ` [PATCH v7 3/7] mseal, system mappings: enable x86-64 jeffxu
@ 2025-02-25  1:03   ` Kees Cook
  2025-02-26  0:21     ` Jeff Xu
  2025-02-25  8:08   ` Thomas Weißschuh
  1 sibling, 1 reply; 122+ messages in thread
From: Kees Cook @ 2025-02-25  1:03 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, jannh, torvalds, vbabka, lorenzo.stoakes, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> covering the vdso, vvar, vvar_vclock.
> 
> Production release testing passes on Android and Chrome OS.
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  arch/x86/Kconfig          |  1 +
>  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 87198d957e2f..8fa17032ca46 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
>  	depends on 64BIT
>  	# Options that are inherently 64-bit kernel only:
>  	select ARCH_HAS_GIGANTIC_PAGE
> +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_PER_VMA_LOCK
>  	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 39e6efc1a9ca..1b1c009f20a8 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma;
>  	unsigned long text_start;
> +	unsigned long vm_flags;
>  	int ret = 0;
>  
>  	if (mmap_write_lock_killable(mm))
> @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  	/*
>  	 * MAYWRITE to allow gdb to COW and set breakpoints
>  	 */
> +	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm,
>  				       text_start,
>  				       image->size,
> -				       VM_READ|VM_EXEC|
> -				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +				       vm_flags,
>  				       &vdso_mapping);

I think these (in all patches) were supposed to be reworked without the
"vm_flags" variable addition?

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
@ 2025-02-25  6:05   ` Lorenzo Stoakes
  2025-02-26  1:33     ` Jeff Xu
  2025-02-25 15:22   ` Liam R. Howlett
  1 sibling, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:05 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:52:40PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Provide infrastructure to mseal system mappings. Establish
> two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP
> macro for future patches.
>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  include/linux/mm.h | 10 ++++++++++
>  init/Kconfig       | 18 ++++++++++++++++++
>  security/Kconfig   | 18 ++++++++++++++++++
>  3 files changed, 46 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b1068ddcbb7..8b800941678d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4155,4 +4155,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
>  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
>  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>
> +
> +/*
> + * mseal of userspace process's system mappings.
> + */
> +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> +#define VM_SEALED_SYSMAP	VM_SEALED
> +#else
> +#define VM_SEALED_SYSMAP	VM_NONE
> +#endif

This is much better than the original thanks.

> +
>  #endif /* _LINUX_MM_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b..07435e33f965 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
>  config ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	bool
>
> +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> +	bool
> +	help
> +	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> +
> +	  A 64-bit kernel is required for the memory sealing feature.
> +	  No specific hardware features from the CPU are needed.
> +
> +	  To enable this feature, the architecture needs to update their
> +	  special mappings calls to include the sealing flag and confirm
> +	  that it doesn't unmap/remap system mappings during the life
> +	  time of the process. After the architecture enables this, a
> +	  distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> +	  to the feature.

Architectures also need to be confirmed not to require any form of VDSO
relocation, which as discussed in previous series some arches appear to
need to do. I'd mention that here.

> +
> +	  For complete descriptions of memory sealing, please see
> +	  Documentation/userspace-api/mseal.rst
> +
>  config HAVE_PERF_EVENTS
>  	bool
>  	help
> diff --git a/security/Kconfig b/security/Kconfig
> index f10dbf15c294..15a86a952910 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
>
>  endchoice
>
> +config MSEAL_SYSTEM_MAPPINGS
> +	bool "mseal system mappings"
> +	depends on 64BIT
> +	depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> +	depends on !CHECKPOINT_RESTORE
> +	help
> +	  Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.

Let's be specific here, 'etc.' could mean _anything_. Also you aren't
sealing most of this, let's just list what you are _actually_ sealing
here. Which is AFAIK VDSO only?

You can update this later as time goes on if/when you expand this.

> +
> +	  A 64-bit kernel is required for the memory sealing feature.
> +	  No specific hardware features from the CPU are needed.
> +
> +	  Note: CHECKPOINT_RESTORE, UML, gVisor, rr are known to relocate or
> +	  unmap system mapping, therefore this config can't be enabled
> +	  universally.

Thanks for putting this here, appreciate it!

Could we tweak this though? I'd like to make it crystal clear, so I don't
think 'note' sufficies and this sounds a little too vague.

I think 'warning' is more appropriate here since you're breaking things for
people who might be unaware. And we need to say this -breaks- programs:

WARNING: This feature breaks programs which rely on relocating or
         unmapping system mappings.

         Known broken software at the time of writing includes
         CHECKPOINT_RESTORE, UML, gVisor and rr.

I think this is critical.

> +
> +	  For complete descriptions of memory sealing, please see
> +	  Documentation/userspace-api/mseal.rst
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> --
> 2.48.1.658.g4767266eb4-goog
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-24 23:07   ` Jeff Xu
@ 2025-02-25  6:09     ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:09 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Pedro Falcato, akpm, keescook, jannh, torvalds, vbabka,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 03:07:03PM -0800, Jeff Xu wrote:
> On Mon, Feb 24, 2025 at 3:03 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 10:52 PM <jeffxu@chromium.org> wrote:
> > >
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > This is V7 version, addressing comments from V6, without code logic
> > > change.
> > >
> > > --------------------------------------------------
> > >
> > > History:
> > > V7:
> > >  - Remove cover letter from the first patch (Liam R. Howlett)
> > >  - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
> > >  - logging and fclose() in selftest (Liam R. Howlett)
> >
> > Jeff, please don't send out new versions of the patchset that quickly.
> > We were having a discussion on v5, you sent v6 today (acceptable) and
> > now v7 (while changing barely anything of note). It's hard to track
> > things this way, and you're just flooding a bunch of mailboxes.
> >
> Ah, I apologize. Sure.
>
> -Jeff

Thanks, I am behind the eight ball on this in a post-viral fatigue haze, so I'd
especially appreciate relaxing a bit on series pace here haha.

I mean being reasonable I don't want you to feel you're needing to be told 'ok
send vX' now, but I'd simply suggest - wait until things 'settle down' a bit on
comments + everything's addressed and the 'usual suspects' have commented, then
this is a good time to send next version.

I realise I'm maybe not well placed to say this as I've previouisly been famous
for resending WAY too quick haha, but it's something I've worked on myself so I
guess, I relate...

>
>
> > Thanks,
> > Pedro

Thanks!


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed
  2025-02-24 22:52 ` [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed jeffxu
@ 2025-02-25  6:15   ` Lorenzo Stoakes
  2025-02-25 22:37     ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:15 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport, Kees Cook

On Mon, Feb 24, 2025 at 10:52:41PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Add code to detect if the vdso is memory sealed, skip the test
> if it is.

I feel this is a little succinct of a commit message, but I guess it gets to the
heart of what you're doing here.

Fundamentally I mean it makes sense, but I'm concerned that x86 has a test
-expliictly checking- whether mremap() of VDSO is possible - are there cases
where x86 might want to do this internal to the kernel?

I guess not since this is essentially a userland self test and probably
asserting you can do this in the way rr, etc. do.

>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> Reviewed-by: Kees Cook <kees@kernel.org>

Anyway, this aside, this looks fine, aside from nit below, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
>  1 file changed, 43 insertions(+)
>
> diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c b/tools/testing/selftests/x86/test_mremap_vdso.c
> index d53959e03593..94bee6e0c813 100644
> --- a/tools/testing/selftests/x86/test_mremap_vdso.c
> +++ b/tools/testing/selftests/x86/test_mremap_vdso.c
> @@ -14,6 +14,7 @@
>  #include <errno.h>
>  #include <unistd.h>
>  #include <string.h>
> +#include <stdbool.h>
>
>  #include <sys/mman.h>
>  #include <sys/auxv.h>
> @@ -55,13 +56,55 @@ static int try_to_remap(void *vdso_addr, unsigned long size)
>
>  }
>
> +#define VDSO_NAME "[vdso]"
> +#define VMFLAGS "VmFlags:"
> +#define MSEAL_FLAGS "sl"
> +#define MAX_LINE_LEN 512
> +
> +bool vdso_sealed(FILE *maps)

Should be static?

> +{
> +	char line[MAX_LINE_LEN];
> +	bool has_vdso = false;
> +
> +	while (fgets(line, sizeof(line), maps)) {
> +		if (strstr(line, VDSO_NAME))
> +			has_vdso = true;
> +
> +		if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
> +			if (strstr(line, MSEAL_FLAGS))
> +				return true;
> +
> +			return false;
> +		}
> +	}
> +
> +	return false;
> +}
> +
>  int main(int argc, char **argv, char **envp)
>  {
>  	pid_t child;
> +	FILE *maps;
>
>  	ksft_print_header();
>  	ksft_set_plan(1);
>
> +	maps = fopen("/proc/self/smaps", "r");
> +	if (!maps) {
> +		ksft_test_result_skip(
> +			"Could not open /proc/self/smaps, errno=%d\n",
> +			 errno);
> +
> +		return 0;
> +	}
> +
> +	if (vdso_sealed(maps)) {
> +		ksft_test_result_skip("vdso is sealed\n");
> +		return 0;
> +	}
> +
> +	fclose(maps);
> +
>  	child = fork();
>  	if (child == -1)
>  		ksft_exit_fail_msg("failed to fork (%d): %m\n", errno);
> --
> 2.48.1.658.g4767266eb4-goog
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-24 22:52 ` [PATCH v7 4/7] mseal, system mappings: enable arm64 jeffxu
@ 2025-02-25  6:20   ` Lorenzo Stoakes
  2025-02-25 22:26     ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:20 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> the vdso, vvar, and compat-mode vectors and sigpage mappings.
>
> Production release testing passes on Android and Chrome OS.

This is pretty limited (yes yes I know android is massive etc. but we must
account for all the weird and wonderful arm64 devices out there in context of
upstream :)

Have you looking through all arm64-code relating to vdso, vvar, compat-mode
vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
nervous of that.

At any rate some comment about having checked/confirmed this would be good, arm
concerns me a lot more than x86 on this front.

Thanks however for doing extensive testing android/chrome side! This is of
course, very important for sheer volume (and probably worldwide % of deployed
arm64 devices...)

Just need to dot our i's and cross our t's...

>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  arch/arm64/Kconfig       |  1 +
>  arch/arm64/kernel/vdso.c | 22 +++++++++++++++-------
>  2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fcdd0ed3eca8..39202aa9a5af 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -38,6 +38,7 @@ config ARM64
>  	select ARCH_HAS_KEEPINITRD
>  	select ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	select ARCH_HAS_MEM_ENCRYPT
> +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>  	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
>  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index e8ed8e5b713b..12e6ab396018 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -183,6 +183,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
>  {
>  	unsigned long vdso_base, vdso_text_len, vdso_mapping_len;
>  	unsigned long gp_flags = 0;
> +	unsigned long vm_flags;
>  	void *ret;
>
>  	BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
> @@ -197,8 +198,10 @@ static int __setup_additional_pages(enum vdso_abi abi,
>  		goto up_fail;
>  	}
>
> +	vm_flags = VM_READ|VM_MAYREAD|VM_PFNMAP;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
> -				       VM_READ|VM_MAYREAD|VM_PFNMAP,
> +				       vm_flags,
>  				       &vvar_map);
>  	if (IS_ERR(ret))
>  		goto up_fail;
> @@ -208,9 +211,10 @@ static int __setup_additional_pages(enum vdso_abi abi,
>
>  	vdso_base += VVAR_NR_PAGES * PAGE_SIZE;
>  	mm->context.vdso = (void *)vdso_base;
> +	vm_flags = VM_READ|VM_EXEC|gp_flags|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
> -				       VM_READ|VM_EXEC|gp_flags|
> -				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +				       vm_flags,
>  				       vdso_info[abi].cm);
>  	if (IS_ERR(ret))
>  		goto up_fail;
> @@ -326,6 +330,7 @@ arch_initcall(aarch32_alloc_vdso_pages);
>  static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
>  {
>  	void *ret;
> +	unsigned long vm_flags;
>
>  	if (!IS_ENABLED(CONFIG_KUSER_HELPERS))
>  		return 0;
> @@ -334,9 +339,10 @@ static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
>  	 * Avoid VM_MAYWRITE for compatibility with arch/arm/, where it's
>  	 * not safe to CoW the page containing the CPU exception vectors.
>  	 */
> +	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	ret = _install_special_mapping(mm, AARCH32_VECTORS_BASE, PAGE_SIZE,
> -				       VM_READ | VM_EXEC |
> -				       VM_MAYREAD | VM_MAYEXEC,
> +				       vm_flags,
>  				       &aarch32_vdso_maps[AA32_MAP_VECTORS]);
>
>  	return PTR_ERR_OR_ZERO(ret);
> @@ -345,6 +351,7 @@ static int aarch32_kuser_helpers_setup(struct mm_struct *mm)
>  static int aarch32_sigreturn_setup(struct mm_struct *mm)
>  {
>  	unsigned long addr;
> +	unsigned long vm_flags;
>  	void *ret;
>
>  	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
> @@ -357,9 +364,10 @@ static int aarch32_sigreturn_setup(struct mm_struct *mm)
>  	 * VM_MAYWRITE is required to allow gdb to Copy-on-Write and
>  	 * set breakpoints.
>  	 */
> +	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	ret = _install_special_mapping(mm, addr, PAGE_SIZE,
> -				       VM_READ | VM_EXEC | VM_MAYREAD |
> -				       VM_MAYWRITE | VM_MAYEXEC,
> +				       vm_flags,
>  				       &aarch32_vdso_maps[AA32_MAP_SIGPAGE]);
>  	if (IS_ERR(ret))
>  		goto out;
> --
> 2.48.1.658.g4767266eb4-goog
>

Patch looks fine for purposes of what you're trying to achieve though, just
need to have some calming reassurances about arch :)

Thanks!


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-24 22:52 ` [PATCH v7 5/7] mseal, system mappings: enable uml architecture jeffxu
@ 2025-02-25  6:22   ` Lorenzo Stoakes
  2025-02-25  8:45     ` Berg, Benjamin
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:22 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport, Benjamin Berg

On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> the vdso.
>
> Testing passes on UML.

Maybe expand on this by stating that it has been confirmed by Benjamin (I
_believe_) that UML has no need for problematic relocation so this is known to
be good.

>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> Tested-by: Benjamin Berg <benjamin.berg@intel.com>

Anyway I know UML has at any rate been confirmed to be good to go + patch looks
fine, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/um/Kconfig        | 1 +
>  arch/x86/um/vdso/vma.c | 6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> index 18051b1cfce0..eb2d439a5334 100644
> --- a/arch/um/Kconfig
> +++ b/arch/um/Kconfig
> @@ -10,6 +10,7 @@ config UML
>  	select ARCH_HAS_FORTIFY_SOURCE
>  	select ARCH_HAS_GCOV_PROFILE_ALL
>  	select ARCH_HAS_KCOV
> +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>  	select ARCH_HAS_STRNCPY_FROM_USER
>  	select ARCH_HAS_STRNLEN_USER
>  	select HAVE_ARCH_AUDITSYSCALL
> diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> index f238f7b33cdd..fdfba858ffc9 100644
> --- a/arch/x86/um/vdso/vma.c
> +++ b/arch/x86/um/vdso/vma.c
> @@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>  {
>  	struct vm_area_struct *vma;
>  	struct mm_struct *mm = current->mm;
> +	unsigned long vm_flags;
>  	static struct vm_special_mapping vdso_mapping = {
>  		.name = "[vdso]",
>  	};
> @@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>  		return -EINTR;
>
>  	vdso_mapping.pages = vdsop;
> +	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
> -		VM_READ|VM_EXEC|
> -		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +		vm_flags,
>  		&vdso_mapping);
>
>  	mmap_write_unlock(mm);
> --
> 2.48.1.658.g4767266eb4-goog
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-24 22:52 ` [PATCH v7 6/7] mseal, system mappings: uprobe mapping jeffxu
@ 2025-02-25  6:24   ` Lorenzo Stoakes
  2025-02-26  0:06     ` Jeff Xu
  2025-02-26 16:26   ` Oleg Nesterov
  1 sibling, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25  6:24 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:52:45PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Provide support to mseal the uprobe mapping.
>
> Unlike other system mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime. It could be sealed from creation.
>

I thought we agreed not to enable this for uprobes for now? What testing
have you done to ensure this is functional?

I mean is this literally _all_ uprobe mappings now being sealed?

I'd really like some more assurances on this one. And what are you
mitigating by sealing these? I get VDSO (kinda) but uprobes?

You really need to provide more justification here.

> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  kernel/events/uprobes.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2ca797cbe465..8dcdfa0d306b 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1662,6 +1662,7 @@ static const struct vm_special_mapping xol_mapping = {
>  static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
>  {
>  	struct vm_area_struct *vma;
> +	unsigned long vm_flags;
>  	int ret;
>
>  	if (mmap_write_lock_killable(mm))
> @@ -1682,8 +1683,10 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
>  		}
>  	}
>
> +	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> +				vm_flags,
>  				&xol_mapping);
>  	if (IS_ERR(vma)) {
>  		ret = PTR_ERR(vma);
> --
> 2.48.1.658.g4767266eb4-goog
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-24 22:52 ` [PATCH v7 3/7] mseal, system mappings: enable x86-64 jeffxu
  2025-02-25  1:03   ` Kees Cook
@ 2025-02-25  8:08   ` Thomas Weißschuh
  2025-02-26  0:48     ` Jeff Xu
  1 sibling, 1 reply; 122+ messages in thread
From: Thomas Weißschuh @ 2025-02-25  8:08 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, adobriyan, johannes, pedro.falcato, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

Hi Jeff,

On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> covering the vdso, vvar, vvar_vclock.
> 
> Production release testing passes on Android and Chrome OS.
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  arch/x86/Kconfig          |  1 +
>  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 87198d957e2f..8fa17032ca46 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
>  	depends on 64BIT
>  	# Options that are inherently 64-bit kernel only:
>  	select ARCH_HAS_GIGANTIC_PAGE
> +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_PER_VMA_LOCK
>  	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 39e6efc1a9ca..1b1c009f20a8 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma;
>  	unsigned long text_start;
> +	unsigned long vm_flags;
>  	int ret = 0;
>  
>  	if (mmap_write_lock_killable(mm))
> @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  	/*
>  	 * MAYWRITE to allow gdb to COW and set breakpoints
>  	 */
> +	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm,
>  				       text_start,
>  				       image->size,
> -				       VM_READ|VM_EXEC|
> -				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +				       vm_flags,
>  				       &vdso_mapping);
>  
>  	if (IS_ERR(vma)) {
> @@ -276,11 +278,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  		goto up_fail;
>  	}
>  
> +	vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm,
>  				       addr,
>  				       (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
> -				       VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
> -				       VM_PFNMAP,
> +				       vm_flags,
>  				       &vvar_mapping);

This hunk (and the vvar mapping in the arm64 patch) will conflict with my
"Generic vDSO datapage" series.
That series is already part of the tip tree (branch timers/vdso) and scheduled
for the next merge window.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=timers/vdso

The conflict resolution is fairly easy:
Move the new flag logic into lib/vdso/datastore.c

Depending on the expected mainline timing for this series either mention this
somewhere for the conflict resolution by Linus or rebase your series on top of
tip/timers/vdso.

>  	if (IS_ERR(vma)) {
> @@ -289,11 +292,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>  		goto up_fail;
>  	}
>  
> +	vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
> +	vm_flags |= VM_SEALED_SYSMAP;
>  	vma = _install_special_mapping(mm,
>  				       addr + (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
>  				       VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
> -				       VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
> -				       VM_PFNMAP,
> +				       vm_flags,
>  				       &vvar_vclock_mapping);
>  
>  	if (IS_ERR(vma)) {
> -- 
> 2.48.1.658.g4767266eb4-goog
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25  6:22   ` Lorenzo Stoakes
@ 2025-02-25  8:45     ` Berg, Benjamin
  2025-02-25 10:37       ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Berg, Benjamin @ 2025-02-25  8:45 UTC (permalink / raw)
  To: jeffxu, lorenzo.stoakes
  Cc: Jason, adobriyan, deller, gerg, anna-maria, davem, avagin,
	mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, johannes, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

Hi,

On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> > 
> > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > the vdso.
> > 
> > Testing passes on UML.
> 
> Maybe expand on this by stating that it has been confirmed by Benjamin (I
> _believe_) that UML has no need for problematic relocation so this is known to
> be good.

I may well be misreading this message, but this sounds to me that this
is a misinterpretation. So, just to clarify in case that is needed.

CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
However, the UML kernel is a normal userspace application itself and
for this application to run, the host kernel must have the feature
disabled.

So, UML supports the feature. But it still *cannot* run on a host
machine that has the feature enabled.

Benjamin

> 
> > 
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > Tested-by: Benjamin Berg <benjamin.berg@intel.com>
> 
> Anyway I know UML has at any rate been confirmed to be good to go +
> patch looks
> fine, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> > ---
> >  arch/um/Kconfig        | 1 +
> >  arch/x86/um/vdso/vma.c | 6 ++++--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > index 18051b1cfce0..eb2d439a5334 100644
> > --- a/arch/um/Kconfig
> > +++ b/arch/um/Kconfig
> > @@ -10,6 +10,7 @@ config UML
> >  	select ARCH_HAS_FORTIFY_SOURCE
> >  	select ARCH_HAS_GCOV_PROFILE_ALL
> >  	select ARCH_HAS_KCOV
> > +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> >  	select ARCH_HAS_STRNCPY_FROM_USER
> >  	select ARCH_HAS_STRNLEN_USER
> >  	select HAVE_ARCH_AUDITSYSCALL
> > diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> > index f238f7b33cdd..fdfba858ffc9 100644
> > --- a/arch/x86/um/vdso/vma.c
> > +++ b/arch/x86/um/vdso/vma.c
> > @@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct
> > linux_binprm *bprm, int uses_interp)
> >  {
> >  	struct vm_area_struct *vma;
> >  	struct mm_struct *mm = current->mm;
> > +	unsigned long vm_flags;
> >  	static struct vm_special_mapping vdso_mapping = {
> >  		.name = "[vdso]",
> >  	};
> > @@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct
> > linux_binprm *bprm, int uses_interp)
> >  		return -EINTR;
> > 
> >  	vdso_mapping.pages = vdsop;
> > +	vm_flags =
> > VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > +	vm_flags |= VM_SEALED_SYSMAP;
> >  	vma = _install_special_mapping(mm, um_vdso_addr,
> > PAGE_SIZE,
> > -		VM_READ|VM_EXEC|
> > -		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > +		vm_flags,
> >  		&vdso_mapping);
> > 
> >  	mmap_write_unlock(mm);
> > --
> > 2.48.1.658.g4767266eb4-goog
> > 

Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Sean Fennelly, Jeffrey Schneiderman, Tiffany Doon Silva
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7]  mseal system mappings
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (7 preceding siblings ...)
  2025-02-24 23:03 ` [PATCH v7 0/7] mseal system mappings Pedro Falcato
@ 2025-02-25 10:32 ` Lorenzo Stoakes
  2025-02-26  0:17   ` Jeff Xu
  2025-02-25 15:18 ` Lorenzo Stoakes
  9 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 10:32 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

BTW can we please drop the 'mseal, system mappings' prefixes on this
series, it's really weird and makes it really hard for me to actually read
the individual summary lines for each commit. 'mseal:' will do.

I mean really you could argue it's 'mm: mseal: ...' but I'm not quite
_that_ pedantic :)


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25  8:45     ` Berg, Benjamin
@ 2025-02-25 10:37       ` Lorenzo Stoakes
  2025-02-25 12:24         ` Benjamin Berg
  2025-02-25 15:06         ` Kees Cook
  0 siblings, 2 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 10:37 UTC (permalink / raw)
  To: Berg, Benjamin
  Cc: jeffxu, Jason, adobriyan, deller, gerg, anna-maria, davem,
	avagin, mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, johannes, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> Hi,
>
> On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > > the vdso.
> > >
> > > Testing passes on UML.
> >
> > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> > _believe_) that UML has no need for problematic relocation so this is known to
> > be good.
>
> I may well be misreading this message, but this sounds to me that this
> is a misinterpretation. So, just to clarify in case that is needed.
>
> CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> However, the UML kernel is a normal userspace application itself and
> for this application to run, the host kernel must have the feature
> disabled.
>
> So, UML supports the feature. But it still *cannot* run on a host
> machine that has the feature enabled.

Sigh ok. Apologies if I misunderstood.

Is there any point having this for the 'guest' system? I mean security wise are
we concerned about sealing of system mappings?

I feel like having this here might just add confusion and churn if it's not
useful.

If this is useless for UML guest, let's just drop this patch.

>
> Benjamin
>
> >
> > >
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > Tested-by: Benjamin Berg <benjamin.berg@intel.com>
> >
> > Anyway I know UML has at any rate been confirmed to be good to go +
> > patch looks
> > fine, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

OK guess drop this tag please until we can figure this out, sorry Jeff.

> >
> > > ---
> > >  arch/um/Kconfig        | 1 +
> > >  arch/x86/um/vdso/vma.c | 6 ++++--
> > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > > index 18051b1cfce0..eb2d439a5334 100644
> > > --- a/arch/um/Kconfig
> > > +++ b/arch/um/Kconfig
> > > @@ -10,6 +10,7 @@ config UML
> > >  	select ARCH_HAS_FORTIFY_SOURCE
> > >  	select ARCH_HAS_GCOV_PROFILE_ALL
> > >  	select ARCH_HAS_KCOV
> > > +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > >  	select ARCH_HAS_STRNCPY_FROM_USER
> > >  	select ARCH_HAS_STRNLEN_USER
> > >  	select HAVE_ARCH_AUDITSYSCALL
> > > diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> > > index f238f7b33cdd..fdfba858ffc9 100644
> > > --- a/arch/x86/um/vdso/vma.c
> > > +++ b/arch/x86/um/vdso/vma.c
> > > @@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct
> > > linux_binprm *bprm, int uses_interp)
> > >  {
> > >  	struct vm_area_struct *vma;
> > >  	struct mm_struct *mm = current->mm;
> > > +	unsigned long vm_flags;
> > >  	static struct vm_special_mapping vdso_mapping = {
> > >  		.name = "[vdso]",
> > >  	};
> > > @@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct
> > > linux_binprm *bprm, int uses_interp)
> > >  		return -EINTR;
> > >
> > >  	vdso_mapping.pages = vdsop;
> > > +	vm_flags =
> > > VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > > +	vm_flags |= VM_SEALED_SYSMAP;
> > >  	vma = _install_special_mapping(mm, um_vdso_addr,
> > > PAGE_SIZE,
> > > -		VM_READ|VM_EXEC|
> > > -		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > > +		vm_flags,
> > >  		&vdso_mapping);
> > >
> > >  	mmap_write_unlock(mm);
> > > --
> > > 2.48.1.658.g4767266eb4-goog
> > >
>
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Sean Fennelly, Jeffrey Schneiderman, Tiffany Doon Silva
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 10:37       ` Lorenzo Stoakes
@ 2025-02-25 12:24         ` Benjamin Berg
  2025-02-25 13:41           ` Lorenzo Stoakes
  2025-02-25 15:06         ` Kees Cook
  1 sibling, 1 reply; 122+ messages in thread
From: Benjamin Berg @ 2025-02-25 12:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: jeffxu, Jason, adobriyan, deller, gerg, anna-maria, davem,
	avagin, mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, johannes, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

Hi,

On Tue, 2025-02-25 at 10:37 +0000, Lorenzo Stoakes wrote:
> On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> > Hi,
> > 
> > On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> > > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > 
> > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > > > the vdso.
> > > > 
> > > > Testing passes on UML.
> > > 
> > > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> > > _believe_) that UML has no need for problematic relocation so this is known to
> > > be good.
> > 
> > I may well be misreading this message, but this sounds to me that this
> > is a misinterpretation. So, just to clarify in case that is needed.
> > 
> > CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> > However, the UML kernel is a normal userspace application itself and
> > for this application to run, the host kernel must have the feature
> > disabled.
> > 
> > So, UML supports the feature. But it still *cannot* run on a host
> > machine that has the feature enabled.
> 
> Sigh ok. Apologies if I misunderstood.
> 
> Is there any point having this for the 'guest' system? I mean security wise are
> we concerned about sealing of system mappings?
> 
> I feel like having this here might just add confusion and churn if it's not
> useful.
> 
> If this is useless for UML guest, let's just drop this patch.

I figured it is not a lot of churn and there isn't really any cost to
enabling the feature.

That said, the only possible real-life use case I can see is doing MM
subsystem testing using UML. We certainly do not need the feature to
run our UML based wireless stack and driver tests.

Benjamin

> 
> > 
> > Benjamin
> > 
> > > 
> > > > 
> > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > > Tested-by: Benjamin Berg <benjamin.berg@intel.com>
> > > 
> > > Anyway I know UML has at any rate been confirmed to be good to go +
> > > patch looks
> > > fine, so:
> > > 
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> OK guess drop this tag please until we can figure this out, sorry Jeff.
> 
> > > 
> > > > ---
> > > >  arch/um/Kconfig        | 1 +
> > > >  arch/x86/um/vdso/vma.c | 6 ++++--
> > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > > > index 18051b1cfce0..eb2d439a5334 100644
> > > > --- a/arch/um/Kconfig
> > > > +++ b/arch/um/Kconfig
> > > > @@ -10,6 +10,7 @@ config UML
> > > >  	select ARCH_HAS_FORTIFY_SOURCE
> > > >  	select ARCH_HAS_GCOV_PROFILE_ALL
> > > >  	select ARCH_HAS_KCOV
> > > > +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > >  	select ARCH_HAS_STRNCPY_FROM_USER
> > > >  	select ARCH_HAS_STRNLEN_USER
> > > >  	select HAVE_ARCH_AUDITSYSCALL
> > > > diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> > > > index f238f7b33cdd..fdfba858ffc9 100644
> > > > --- a/arch/x86/um/vdso/vma.c
> > > > +++ b/arch/x86/um/vdso/vma.c
> > > > @@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct
> > > > linux_binprm *bprm, int uses_interp)
> > > >  {
> > > >  	struct vm_area_struct *vma;
> > > >  	struct mm_struct *mm = current->mm;
> > > > +	unsigned long vm_flags;
> > > >  	static struct vm_special_mapping vdso_mapping = {
> > > >  		.name = "[vdso]",
> > > >  	};
> > > > @@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct
> > > > linux_binprm *bprm, int uses_interp)
> > > >  		return -EINTR;
> > > > 
> > > >  	vdso_mapping.pages = vdsop;
> > > > +	vm_flags =
> > > > VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > > > +	vm_flags |= VM_SEALED_SYSMAP;
> > > >  	vma = _install_special_mapping(mm, um_vdso_addr,
> > > > PAGE_SIZE,
> > > > -		VM_READ|VM_EXEC|
> > > > -		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > > > +		vm_flags,
> > > >  		&vdso_mapping);
> > > > 
> > > >  	mmap_write_unlock(mm);
> > > > --
> > > > 2.48.1.658.g4767266eb4-goog
> > > > 
> > 
> > Intel Deutschland GmbH
> > Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
> > Tel: +49 89 99 8853-0, www.intel.de
> > Managing Directors: Sean Fennelly, Jeffrey Schneiderman, Tiffany Doon Silva
> > Chairperson of the Supervisory Board: Nicole Lau
> > Registered Office: Munich
> > Commercial Register: Amtsgericht Muenchen HRB 186928



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 12:24         ` Benjamin Berg
@ 2025-02-25 13:41           ` Lorenzo Stoakes
  2025-02-25 13:59             ` Johannes Berg
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 13:41 UTC (permalink / raw)
  To: Benjamin Berg
  Cc: jeffxu, Jason, adobriyan, deller, gerg, anna-maria, davem,
	avagin, mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, johannes, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

On Tue, Feb 25, 2025 at 01:24:49PM +0100, Benjamin Berg wrote:
> Hi,
>
> On Tue, 2025-02-25 at 10:37 +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> > > Hi,
> > >
> > > On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> > > > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > >
> > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > > > > the vdso.
> > > > >
> > > > > Testing passes on UML.
> > > >
> > > > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> > > > _believe_) that UML has no need for problematic relocation so this is known to
> > > > be good.
> > >
> > > I may well be misreading this message, but this sounds to me that this
> > > is a misinterpretation. So, just to clarify in case that is needed.
> > >
> > > CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> > > However, the UML kernel is a normal userspace application itself and
> > > for this application to run, the host kernel must have the feature
> > > disabled.
> > >
> > > So, UML supports the feature. But it still *cannot* run on a host
> > > machine that has the feature enabled.
> >
> > Sigh ok. Apologies if I misunderstood.
> >
> > Is there any point having this for the 'guest' system? I mean security wise are
> > we concerned about sealing of system mappings?
> >
> > I feel like having this here might just add confusion and churn if it's not
> > useful.
> >
> > If this is useless for UML guest, let's just drop this patch.
>
> I figured it is not a lot of churn and there isn't really any cost to
> enabling the feature.
>
> That said, the only possible real-life use case I can see is doing MM
> subsystem testing using UML. We certainly do not need the feature to
> run our UML based wireless stack and driver tests.

OK ack - my concern is users getting confused about this ironic host
vs. client thing, must disable the security feature in the _actual kernel_
to enable it in the client.

I'm not sure this is really worth it?

I mean I agree this isn't a _huge_ amount added here and I don't want to be
difficult - Jeff, Kees are you really keen on having this? Do you have
specific use cases in mind or was this just a 'because we can':>)

I guess if intent is to slowly add architectures, it's not totally insane
since we kinda know this one is ok so if that's what it is, probably won't
oppose it _too_ badly.

>
> Benjamin
>
> >
> > >
> > > Benjamin
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > > > Tested-by: Benjamin Berg <benjamin.berg@intel.com>
> > > >
> > > > Anyway I know UML has at any rate been confirmed to be good to go +
> > > > patch looks
> > > > fine, so:
> > > >
> > > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > OK guess drop this tag please until we can figure this out, sorry Jeff.

(to be clear this is just temporary while we establish what's up with this
situation! :>)

> >
> > > >
> > > > > ---
> > > > >  arch/um/Kconfig        | 1 +
> > > > >  arch/x86/um/vdso/vma.c | 6 ++++--
> > > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> > > > > index 18051b1cfce0..eb2d439a5334 100644
> > > > > --- a/arch/um/Kconfig
> > > > > +++ b/arch/um/Kconfig
> > > > > @@ -10,6 +10,7 @@ config UML
> > > > >  	select ARCH_HAS_FORTIFY_SOURCE
> > > > >  	select ARCH_HAS_GCOV_PROFILE_ALL
> > > > >  	select ARCH_HAS_KCOV
> > > > > +	select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > > >  	select ARCH_HAS_STRNCPY_FROM_USER
> > > > >  	select ARCH_HAS_STRNLEN_USER
> > > > >  	select HAVE_ARCH_AUDITSYSCALL
> > > > > diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> > > > > index f238f7b33cdd..fdfba858ffc9 100644
> > > > > --- a/arch/x86/um/vdso/vma.c
> > > > > +++ b/arch/x86/um/vdso/vma.c
> > > > > @@ -54,6 +54,7 @@ int arch_setup_additional_pages(struct
> > > > > linux_binprm *bprm, int uses_interp)
> > > > >  {
> > > > >  	struct vm_area_struct *vma;
> > > > >  	struct mm_struct *mm = current->mm;
> > > > > +	unsigned long vm_flags;
> > > > >  	static struct vm_special_mapping vdso_mapping = {
> > > > >  		.name = "[vdso]",
> > > > >  	};
> > > > > @@ -65,9 +66,10 @@ int arch_setup_additional_pages(struct
> > > > > linux_binprm *bprm, int uses_interp)
> > > > >  		return -EINTR;
> > > > >
> > > > >  	vdso_mapping.pages = vdsop;
> > > > > +	vm_flags =
> > > > > VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > > > > +	vm_flags |= VM_SEALED_SYSMAP;
> > > > >  	vma = _install_special_mapping(mm, um_vdso_addr,
> > > > > PAGE_SIZE,
> > > > > -		VM_READ|VM_EXEC|
> > > > > -		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > > > > +		vm_flags,
> > > > >  		&vdso_mapping);
> > > > >
> > > > >  	mmap_write_unlock(mm);
> > > > > --
> > > > > 2.48.1.658.g4767266eb4-goog
> > > > >
> > >
> > > Intel Deutschland GmbH
> > > Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
> > > Tel: +49 89 99 8853-0, www.intel.de
> > > Managing Directors: Sean Fennelly, Jeffrey Schneiderman, Tiffany Doon Silva
> > > Chairperson of the Supervisory Board: Nicole Lau
> > > Registered Office: Munich
> > > Commercial Register: Amtsgericht Muenchen HRB 186928
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 13:41           ` Lorenzo Stoakes
@ 2025-02-25 13:59             ` Johannes Berg
  0 siblings, 0 replies; 122+ messages in thread
From: Johannes Berg @ 2025-02-25 13:59 UTC (permalink / raw)
  To: Lorenzo Stoakes, Benjamin Berg
  Cc: jeffxu, Jason, adobriyan, deller, gerg, anna-maria, davem,
	avagin, mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

On Tue, 2025-02-25 at 13:41 +0000, Lorenzo Stoakes wrote:
> > I figured it is not a lot of churn and there isn't really any cost to
> > enabling the feature.
> > 
> > That said, the only possible real-life use case I can see is doing MM
> > subsystem testing using UML. We certainly do not need the feature to
> > run our UML based wireless stack and driver tests.
> 
> OK ack - my concern is users getting confused about this ironic host
> vs. client thing, must disable the security feature in the _actual kernel_
> to enable it in the client.

Well, s/to enable it in the client/to run the client/, I guess.

I'm still a bit disappointed in the whole thing anyway - if this does
get enabled in e.g. ChromeOS (as it looks like), then it'll mean that
gvisor/rr/UML will never run on chromebooks, which ... I mean yeah who's
going to do that, so it's more of a purist disappointment I guess. Can't
run kunit on a chromebook then, for example. This looks much different
for more general purpose distros too.

I also don't really want to reopen a discussion that was probably had
before, but I did wonder now what the security downsides of having an
opt-out, e.g. a new ELF property, for skipping the sealings would be.
Perhaps, depending on the impact, even making that mean "no system
mappings at all", at least for UML I believe they're not needed in the
first place.

> I'm not sure this is really worth it?
> 
> I mean I agree this isn't a _huge_ amount added here and I don't want to be
> difficult - Jeff, Kees are you really keen on having this? Do you have
> specific use cases in mind or was this just a 'because we can':>)

There's always kunit that can run with UML, but I don't see tests being
added for this feature, in fact the only thing here is _disabling_ a
test. Maybe it should come with tests and then it'd be more interesting
;-)

The commit says "Testing passes on UML" but I'm not sure I see what
testing that might have been, per the cover letter Benjamin did that?

> I guess if intent is to slowly add architectures, it's not totally insane
> since we kinda know this one is ok so if that's what it is, probably won't
> oppose it _too_ badly.

I think it still makes _some_ sense to have it for the testing aspect,
but perhaps might then make sense to split it out of the series to avoid
all the confusion and submit it to UML separately later? Or just leave
it since you can always test with qemu.

johannes

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 10:37       ` Lorenzo Stoakes
  2025-02-25 12:24         ` Benjamin Berg
@ 2025-02-25 15:06         ` Kees Cook
  2025-02-25 15:31           ` Lorenzo Stoakes
  1 sibling, 1 reply; 122+ messages in thread
From: Kees Cook @ 2025-02-25 15:06 UTC (permalink / raw)
  To: Lorenzo Stoakes, Berg, Benjamin
  Cc: jeffxu, Jason, adobriyan, deller, gerg, anna-maria, davem,
	avagin, mhocko, enh, thomas.weissschuh, hch, hca, peterz,
	adhemerval.zanella, linux-kernel, ojeda, jannh, f.fainelli,
	sroettger, ardb, jorgelo, rdunlap, mark.rutland, Liam.Howlett,
	vbabka, mpe, oleg, willy, keescook, peterx, mike.rapoport, mingo,
	rientjes, groeck, linus.walleij, pedro.falcato, ardb, 42.hyeyoo,
	linux-mm, johannes, linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn



On February 25, 2025 2:37:11 AM PST, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
>> Hi,
>>
>> On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
>> > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
>> > > From: Jeff Xu <jeffxu@chromium.org>
>> > >
>> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
>> > > the vdso.
>> > >
>> > > Testing passes on UML.
>> >
>> > Maybe expand on this by stating that it has been confirmed by Benjamin (I
>> > _believe_) that UML has no need for problematic relocation so this is known to
>> > be good.
>>
>> I may well be misreading this message, but this sounds to me that this
>> is a misinterpretation. So, just to clarify in case that is needed.
>>
>> CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
>> However, the UML kernel is a normal userspace application itself and
>> for this application to run, the host kernel must have the feature
>> disabled.
>>
>> So, UML supports the feature. But it still *cannot* run on a host
>> machine that has the feature enabled.
>
>Sigh ok. Apologies if I misunderstood.
>
>Is there any point having this for the 'guest' system? I mean security wise are
>we concerned about sealing of system mappings?

UML guests are used for testing. For example, it's the default target for KUnit's scripts. Having sealing working in the guest seems generally useful to me.

>
>I feel like having this here might just add confusion and churn if it's not
>useful.
>
>If this is useless for UML guest, let's just drop this patch.

But on the flip side, it's certainly not critical to have UML supported. I guess I just don't see a down side to keeping the patch.

-Kees


-- 
Kees Cook


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7]  mseal system mappings
  2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
                   ` (8 preceding siblings ...)
  2025-02-25 10:32 ` Lorenzo Stoakes
@ 2025-02-25 15:18 ` Lorenzo Stoakes
  2025-02-26  0:12   ` Jeff Xu
  9 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 15:18 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

Jeff - looking further in this series, I asked for a couple things for this
series which you've not provided:

1. Some assurance based on code that the kernel-side code doesn't rely on
   VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
   myself, it looks fine for arm64, x86-64. But it might have been nice had
   you done it :) Apologies if you had and I just missed it.

2. Tests - could you please add some tests to assert that mremap() fails
   for VDSO for instance? You've edited an existing test for VDSO in x86 to
   skip the test should this be enabled, but this is not the same as a self
   test. And obviously doesn't cover arm64.

   This should be relatively strightforward right? You already have code
   for finding out whether VDSO is msealed, so just use that to see if you
   skip, then attempt mremap(), mmap() over etc. + assert it fails.

   Ideally these tests would cover all the cases you've changed.

Please do try to ensure you address requests from maintainers to save on
iterations, while I get the desire to shoot out new versions (I've been
guilty of this in the past), it makes life so much easier and will reduce
version count if you try to get everything done in a one go.

Having said the above, we're really not far off this being a viable
series. You just need to address comments here (+ in v6...) + provide some
tests.

Thanks!

On Mon, Feb 24, 2025 at 10:52:39PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> This is V7 version, addressing comments from V6, without code logic
> change.
>
> --------------------------------------------------
>
> History:
> V7:
>  - Remove cover letter from the first patch (Liam R. Howlett)
>  - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
>  - logging and fclose() in selftest (Liam R. Howlett)
>
> V6:
>   https://lore.kernel.org/all/20250224174513.3600914-1-jeffxu@google.com/
>

Nitty, but it's really useful to actually include the history for all of
these.

> V5
>   https://lore.kernel.org/all/20250212032155.1276806-1-jeffxu@google.com/
>
> V4:
>   https://lore.kernel.org/all/20241125202021.3684919-1-jeffxu@google.com/
>
> V3:
>   https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/
>
> V2:
>   https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/
>
> V1:
>   https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/
>
> --------------------------------------------------
> As discussed during mseal() upstream process [1], mseal() protects
> the VMAs of a given virtual memory range against modifications, such
> as the read/write (RW) and no-execute (NX) bits. For complete
> descriptions of memory sealing, please see mseal.rst [2].
>
> The mseal() is useful to mitigate memory corruption issues where a
> corrupted pointer is passed to a memory management system. For
> example, such an attacker primitive can break control-flow integrity
> guarantees since read-only memory that is supposed to be trusted can
> become writable or .text pages can get remapped.
>
> The system mappings are readonly only, memory sealing can protect
> them from ever changing to writable or unmmap/remapped as different
> attributes.
>
> System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
> are created by the kernel during program initialization, and could
> be sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [3]. It could be sealed from creation.
>
> The vsyscall on x86-64 uses a special address (0xffffffffff600000),
> which is outside the mm managed range. This means mprotect, munmap, and
> mremap won't work on the vsyscall. Since sealing doesn't enhance
> the vsyscall's security, it is skipped in this patch. If we ever seal
> the vsyscall, it is probably only for decorative purpose, i.e. showing
> the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the system mappings during restore operations. UML(User Mode Linux)
> and gVisor, rr are also known to change the vdso/vvar mappings.
> Consequently, this feature cannot be universally enabled across all
> systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
>
> To support mseal of system mappings, architectures must define
> CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
> calls to pass mseal flag. Additionally, architectures must confirm they
> do not unmap/remap system mappings during the process lifetime.
>
> In this version, we've improved the handling of system mapping sealing from
> previous versions, instead of modifying the _install_special_mapping
> function itself, which would affect all architectures, we now call
> _install_special_mapping with a sealing flag only within the specific
> architecture that requires it. This targeted approach offers two key
> advantages: 1) It limits the code change's impact to the necessary
> architectures, and 2) It aligns with the software architecture by keeping
> the core memory management within the mm layer, while delegating the
> decision of sealing system mappings to the individual architecture, which
> is particularly relevant since 32-bit architectures never require sealing.
>
> Prior to this patch series, we explored sealing special mappings from
> userspace using glibc's dynamic linker. This approach revealed several
> issues:
> - The PT_LOAD header may report an incorrect length for vdso, (smaller
>   than its actual size). The dynamic linker, which relies on PT_LOAD
>   information to determine mapping size, would then split and partially
>   seal the vdso mapping. Since each architecture has its own vdso/vvar
>   code, fixing this in the kernel would require going through each
>   archiecture. Our initial goal was to enable sealing readonly mappings,
>   e.g. .text, across all architectures, sealing vdso from kernel since
>   creation appears to be simpler than sealing vdso at glibc.
> - The [vvar] mapping header only contains address information, not length
>   information. Similar issues might exist for other special mappings.
> - Mappings like uprobe are not covered by the dynamic linker,
>   and there is no effective solution for them.
>
> This feature's security enhancements will benefit ChromeOS, Android,
> and other high security systems.
>
> Testing:
> This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
> - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
>   i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
> - Passing various automation tests (e.g. pre-checkin) on ChromeOS and
>   Android to ensure the sealing doesn't affect the functionality of
>   Chromebook and Android phone.
>
> I also tested the feature on Ubuntu on x86-64:
> - With config disabled, vdso/vvar is not sealed,
> - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
>   normal operations such as browsing the web, open/edit doc are OK.
>
> In addition, Benjamin Berg tested this on UML.
>
> Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
> Link: Documentation/userspace-api/mseal.rst [2]
> Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
>
>
>
>
> Jeff Xu (7):
>   mseal, system mappings: kernel config and header change
>   selftests: x86: test_mremap_vdso: skip if vdso is msealed
>   mseal, system mappings: enable x86-64
>   mseal, system mappings: enable arm64
>   mseal, system mappings: enable uml architecture
>   mseal, system mappings: uprobe mapping
>   mseal, system mappings: update mseal.rst
>
>  Documentation/userspace-api/mseal.rst         |  7 +++
>  arch/arm64/Kconfig                            |  1 +
>  arch/arm64/kernel/vdso.c                      | 22 +++++++---
>  arch/um/Kconfig                               |  1 +
>  arch/x86/Kconfig                              |  1 +
>  arch/x86/entry/vdso/vma.c                     | 16 ++++---
>  arch/x86/um/vdso/vma.c                        |  6 ++-
>  include/linux/mm.h                            | 10 +++++
>  init/Kconfig                                  | 18 ++++++++
>  kernel/events/uprobes.c                       |  5 ++-
>  security/Kconfig                              | 18 ++++++++
>  .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
>  12 files changed, 132 insertions(+), 16 deletions(-)
>
> --
> 2.48.1.658.g4767266eb4-goog
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
  2025-02-25  6:05   ` Lorenzo Stoakes
@ 2025-02-25 15:22   ` Liam R. Howlett
  2025-02-25 15:37     ` Lorenzo Stoakes
  2025-02-26  0:04     ` Jeff Xu
  1 sibling, 2 replies; 122+ messages in thread
From: Liam R. Howlett @ 2025-02-25 15:22 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

* jeffxu@chromium.org <jeffxu@chromium.org> [250224 17:52]:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Provide infrastructure to mseal system mappings. Establish
> two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP
> macro for future patches.
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  include/linux/mm.h | 10 ++++++++++
>  init/Kconfig       | 18 ++++++++++++++++++
>  security/Kconfig   | 18 ++++++++++++++++++
>  3 files changed, 46 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b1068ddcbb7..8b800941678d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4155,4 +4155,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
>  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
>  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>  
> +
> +/*
> + * mseal of userspace process's system mappings.
> + */
> +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> +#define VM_SEALED_SYSMAP	VM_SEALED
> +#else
> +#define VM_SEALED_SYSMAP	VM_NONE
> +#endif
> +
>  #endif /* _LINUX_MM_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b..07435e33f965 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
>  config ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	bool
>  
> +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS

I think we said ARCH_SUPPORTS_ for software features in v5 of the series
[1].  Can we also make this change please?

...

Thanks,
Liam

[1]. https://lore.kernel.org/all/202502131142.F5EE115C3A@keescook/


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 15:06         ` Kees Cook
@ 2025-02-25 15:31           ` Lorenzo Stoakes
  2025-02-25 18:38             ` Kees Cook
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 15:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Berg, Benjamin, jeffxu, Jason, adobriyan, deller, gerg,
	anna-maria, davem, avagin, mhocko, enh, thomas.weissschuh, hch,
	hca, peterz, adhemerval.zanella, linux-kernel, ojeda, jannh,
	f.fainelli, sroettger, ardb, jorgelo, rdunlap, mark.rutland,
	Liam.Howlett, vbabka, mpe, oleg, willy, keescook, peterx,
	mike.rapoport, mingo, rientjes, groeck, linus.walleij,
	pedro.falcato, ardb, 42.hyeyoo, linux-mm, johannes,
	linux-hardening, torvalds, akpm, dave.hansen,
	aleksandr.mikhalitsyn

On Tue, Feb 25, 2025 at 07:06:13AM -0800, Kees Cook wrote:
>
>
> On February 25, 2025 2:37:11 AM PST, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> >> Hi,
> >>
> >> On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> >> > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> >> > > From: Jeff Xu <jeffxu@chromium.org>
> >> > >
> >> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> >> > > the vdso.
> >> > >
> >> > > Testing passes on UML.
> >> >
> >> > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> >> > _believe_) that UML has no need for problematic relocation so this is known to
> >> > be good.
> >>
> >> I may well be misreading this message, but this sounds to me that this
> >> is a misinterpretation. So, just to clarify in case that is needed.
> >>
> >> CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> >> However, the UML kernel is a normal userspace application itself and
> >> for this application to run, the host kernel must have the feature
> >> disabled.
> >>
> >> So, UML supports the feature. But it still *cannot* run on a host
> >> machine that has the feature enabled.
> >
> >Sigh ok. Apologies if I misunderstood.
> >
> >Is there any point having this for the 'guest' system? I mean security wise are
> >we concerned about sealing of system mappings?
>
> UML guests are used for testing. For example, it's the default target for KUnit's scripts. Having sealing working in the guest seems generally useful to me.
>

'Having sealing working' you mean system sealing? Because mseal works fine
(presumably in UML, not tried myself!)

System msealing lacks any test in this series (I did ask for them...), certainly
no kunit tests, so this seems a bit theoretical? Unless you're talking about the
theoretical interaction of kunit tests and VDSO sealing?

I mean can't we just introduce this at the time if we believe this'd be useful?

Adding stuff for entirely theoretical reasons is generally frowned upon and
isn't that the (at least one) definition of churn?

Can you give a really specific reason why this needs to be added?

For x86-64, arm64 the reason is obvious (albeit one can argue how much security
benefit msealing the VDSO truly gives).

For UML it just isn't, at all.

> >
> >I feel like having this here might just add confusion and churn if it's not
> >useful.
> >
> >If this is useless for UML guest, let's just drop this patch.
>
> But on the flip side, it's certainly not critical to have UML supported. I guess I just don't see a down side to keeping the patch.

Right, presumably you disagree that it might be confusing that to test mseal
_system_ sealing (not mseal itself) by enabling mseal system sealing in UML but
having to disable it in the host?

Because that seems really confusing to me (and why I explicitly cited it as a
reason I'm confused here, even the exchange above is confused).

If there benefit is basically theoretical/why not/nil, and the downside is
confusion + churn (albeit small churn), why not just wait until somebody writes
kunit tests?

If I saw a list of supported arches including UML, but also saw a note about it
not working in UML I'd definitely be confused.

Generally I'm not a fan of adding features mid-way through a series, the
revisions are meant to be refinements of the original, not an evolving thing.

So in general I'd prefer this to be added if + when we need it for something.

>
> -Kees
>
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-25 15:22   ` Liam R. Howlett
@ 2025-02-25 15:37     ` Lorenzo Stoakes
  2025-02-26  0:04     ` Jeff Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 15:37 UTC (permalink / raw)
  To: Liam R. Howlett, jeffxu, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 10:22:56AM -0500, Liam R. Howlett wrote:
> * jeffxu@chromium.org <jeffxu@chromium.org> [250224 17:52]:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide infrastructure to mseal system mappings. Establish
> > two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> > ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP
> > macro for future patches.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  include/linux/mm.h | 10 ++++++++++
> >  init/Kconfig       | 18 ++++++++++++++++++
> >  security/Kconfig   | 18 ++++++++++++++++++
> >  3 files changed, 46 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7b1068ddcbb7..8b800941678d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4155,4 +4155,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> >  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> >  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >
> > +
> > +/*
> > + * mseal of userspace process's system mappings.
> > + */
> > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > +#define VM_SEALED_SYSMAP	VM_SEALED
> > +#else
> > +#define VM_SEALED_SYSMAP	VM_NONE
> > +#endif
> > +
> >  #endif /* _LINUX_MM_H */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b..07435e33f965 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> >  	bool
> >
> > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>
> I think we said ARCH_SUPPORTS_ for software features in v5 of the series
> [1].  Can we also make this change please?

Jeff - I'm going to risk sounding like a typical, grumpy, kernel maintainer
here (sorry! :P) but please, please do ensure that you address all prior
feedback.

It may be worth even, when things settle down here, putting together a list
of things you intend to address in the next round and double-check you've
covered everything off.

Because obviously when this happens once it's probably an oversight, but it
seems you've failed to address at least 4 separate things requested of you
here. It's obivously s going to make this a faaar more drawn out process
for al linvolved.

Again, to be clear, as to allay LWN articles about 'Stoakes urging caution'
or what not and related drama - my intent in saying this is purely that _I
want this series to land_. So I say it merely to help expedite this! :)

Cheers, Lorenzo

>
> ...
>
> Thanks,
> Liam
>
> [1]. https://lore.kernel.org/all/202502131142.F5EE115C3A@keescook/


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 15:31           ` Lorenzo Stoakes
@ 2025-02-25 18:38             ` Kees Cook
  2025-02-26  0:00               ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Kees Cook @ 2025-02-25 18:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Berg, Benjamin, jeffxu, Jason, adobriyan, deller, gerg,
	anna-maria, davem, avagin, mhocko, enh, thomas.weissschuh, hch,
	hca, peterz, adhemerval.zanella, linux-kernel, ojeda, jannh,
	f.fainelli, sroettger, ardb, jorgelo, rdunlap, mark.rutland,
	Liam.Howlett, vbabka, mpe, oleg, willy, peterx, mike.rapoport,
	mingo, rientjes, groeck, linus.walleij, pedro.falcato, ardb,
	42.hyeyoo, linux-mm, johannes, linux-hardening, torvalds, akpm,
	dave.hansen, aleksandr.mikhalitsyn

On Tue, Feb 25, 2025 at 03:31:06PM +0000, Lorenzo Stoakes wrote:
> On Tue, Feb 25, 2025 at 07:06:13AM -0800, Kees Cook wrote:
> >
> >
> > On February 25, 2025 2:37:11 AM PST, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> > >> Hi,
> > >>
> > >> On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> > >> > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > >> > > From: Jeff Xu <jeffxu@chromium.org>
> > >> > >
> > >> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > >> > > the vdso.
> > >> > >
> > >> > > Testing passes on UML.
> > >> >
> > >> > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> > >> > _believe_) that UML has no need for problematic relocation so this is known to
> > >> > be good.
> > >>
> > >> I may well be misreading this message, but this sounds to me that this
> > >> is a misinterpretation. So, just to clarify in case that is needed.
> > >>
> > >> CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> > >> However, the UML kernel is a normal userspace application itself and
> > >> for this application to run, the host kernel must have the feature
> > >> disabled.
> > >>
> > >> So, UML supports the feature. But it still *cannot* run on a host
> > >> machine that has the feature enabled.
> > >
> > >Sigh ok. Apologies if I misunderstood.
> > >
> > >Is there any point having this for the 'guest' system? I mean security wise are
> > >we concerned about sealing of system mappings?
> >
> > UML guests are used for testing. For example, it's the default target for KUnit's scripts. Having sealing working in the guest seems generally useful to me.
> >
> 
> 'Having sealing working' you mean system sealing? Because mseal works fine
> (presumably in UML, not tried myself!)

Sorry, yes, I mean "system mapping msealing".

> 
> System msealing lacks any test in this series (I did ask for them...), certainly
> no kunit tests, so this seems a bit theoretical? Unless you're talking about the
> theoretical interaction of kunit tests and VDSO sealing?

Right, I meant theoretical interaction, but it would be useful for
future KUnit tests of system mapping msealing too.

> I mean can't we just introduce this at the time if we believe this'd be useful?

Perhaps adding it as part of adding some KUnit tests that exercise the
system mapping msealing would be the most sensible.

> Generally I'm not a fan of adding features mid-way through a series, the
> revisions are meant to be refinements of the original, not an evolving thing.
> 
> So in general I'd prefer this to be added if + when we need it for something.

Yup, makes sense. And it may be that KUnit tests need to exercise more
than what UML can support, so even the KUnit idea may be invalid.

Jeff, let's leave off UML for this initial "minimum viable feature"
series, unless there is a strong reason to keep it.

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-25  6:20   ` Lorenzo Stoakes
@ 2025-02-25 22:26     ` Jeff Xu
  2025-02-26  5:25       ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-25 22:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> >
> > Production release testing passes on Android and Chrome OS.
>
> This is pretty limited (yes yes I know android is massive etc. but we must
> account for all the weird and wonderful arm64 devices out there in context of
> upstream :)
>
> Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> nervous of that.
>
Can you please point out where PPC munmap/mremap the vdso ?

Previously, when you mentioned that, I thought you meant user space in
PPC, I didn't realize that you meant that kernel code in PPC.  I
tried, but didn't find anything, hence asking.

Thanks.
-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed
  2025-02-25  6:15   ` Lorenzo Stoakes
@ 2025-02-25 22:37     ` Jeff Xu
  2025-02-26  5:58       ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-25 22:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport, Kees Cook

On Mon, Feb 24, 2025 at 10:15 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Feb 24, 2025 at 10:52:41PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Add code to detect if the vdso is memory sealed, skip the test
> > if it is.
>
> I feel this is a little succinct of a commit message, but I guess it gets to the
> heart of what you're doing here.
>
> Fundamentally I mean it makes sense, but I'm concerned that x86 has a test
> -expliictly checking- whether mremap() of VDSO is possible - are there cases
> where x86 might want to do this internal to the kernel?
>
> I guess not since this is essentially a userland self test and probably
> asserting you can do this in the way rr, etc. do.
>
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > Reviewed-by: Kees Cook <kees@kernel.org>
>
> Anyway, this aside, this looks fine, aside from nit below, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >  .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
> >  1 file changed, 43 insertions(+)
> >
> > diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c b/tools/testing/selftests/x86/test_mremap_vdso.c
> > index d53959e03593..94bee6e0c813 100644
> > --- a/tools/testing/selftests/x86/test_mremap_vdso.c
> > +++ b/tools/testing/selftests/x86/test_mremap_vdso.c
> > @@ -14,6 +14,7 @@
> >  #include <errno.h>
> >  #include <unistd.h>
> >  #include <string.h>
> > +#include <stdbool.h>
> >
> >  #include <sys/mman.h>
> >  #include <sys/auxv.h>
> > @@ -55,13 +56,55 @@ static int try_to_remap(void *vdso_addr, unsigned long size)
> >
> >  }
> >
> > +#define VDSO_NAME "[vdso]"
> > +#define VMFLAGS "VmFlags:"
> > +#define MSEAL_FLAGS "sl"
> > +#define MAX_LINE_LEN 512
> > +
> > +bool vdso_sealed(FILE *maps)
>
> Should be static?
>
sure.

> > +{
> > +     char line[MAX_LINE_LEN];
> > +     bool has_vdso = false;
> > +
> > +     while (fgets(line, sizeof(line), maps)) {
> > +             if (strstr(line, VDSO_NAME))
> > +                     has_vdso = true;
> > +
> > +             if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
> > +                     if (strstr(line, MSEAL_FLAGS))
> > +                             return true;
> > +
> > +                     return false;
> > +             }
> > +     }
> > +
> > +     return false;
> > +}
> > +
> >  int main(int argc, char **argv, char **envp)
> >  {
> >       pid_t child;
> > +     FILE *maps;
> >
> >       ksft_print_header();
> >       ksft_set_plan(1);
> >
> > +     maps = fopen("/proc/self/smaps", "r");
> > +     if (!maps) {
> > +             ksft_test_result_skip(
> > +                     "Could not open /proc/self/smaps, errno=%d\n",
> > +                      errno);
> > +
> > +             return 0;
> > +     }
> > +
> > +     if (vdso_sealed(maps)) {
> > +             ksft_test_result_skip("vdso is sealed\n");
> > +             return 0;
> > +     }
> > +
> > +     fclose(maps);
> > +
> >       child = fork();
> >       if (child == -1)
> >               ksft_exit_fail_msg("failed to fork (%d): %m\n", errno);
> > --
> > 2.48.1.658.g4767266eb4-goog
> >


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 5/7] mseal, system mappings: enable uml architecture
  2025-02-25 18:38             ` Kees Cook
@ 2025-02-26  0:00               ` Jeff Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Lorenzo Stoakes, Berg, Benjamin, Jason, adobriyan, deller, gerg,
	anna-maria, davem, avagin, mhocko, enh, thomas.weissschuh, hch,
	hca, peterz, adhemerval.zanella, linux-kernel, ojeda, jannh,
	f.fainelli, sroettger, ardb, jorgelo, rdunlap, mark.rutland,
	Liam.Howlett, vbabka, mpe, oleg, willy, peterx, mike.rapoport,
	mingo, rientjes, groeck, linus.walleij, pedro.falcato, ardb,
	42.hyeyoo, linux-mm, johannes, linux-hardening, torvalds, akpm,
	dave.hansen, aleksandr.mikhalitsyn

On Tue, Feb 25, 2025 at 10:38 AM Kees Cook <kees@kernel.org> wrote:
>
> On Tue, Feb 25, 2025 at 03:31:06PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 25, 2025 at 07:06:13AM -0800, Kees Cook wrote:
> > >
> > >
> > > On February 25, 2025 2:37:11 AM PST, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > >On Tue, Feb 25, 2025 at 08:45:21AM +0000, Berg, Benjamin wrote:
> > > >> Hi,
> > > >>
> > > >> On Tue, 2025-02-25 at 06:22 +0000, Lorenzo Stoakes wrote:
> > > >> > On Mon, Feb 24, 2025 at 10:52:44PM +0000, jeffxu@chromium.org wrote:
> > > >> > > From: Jeff Xu <jeffxu@chromium.org>
> > > >> > >
> > > >> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on UML, covering
> > > >> > > the vdso.
> > > >> > >
> > > >> > > Testing passes on UML.
> > > >> >
> > > >> > Maybe expand on this by stating that it has been confirmed by Benjamin (I
> > > >> > _believe_) that UML has no need for problematic relocation so this is known to
> > > >> > be good.
> > > >>
> > > >> I may well be misreading this message, but this sounds to me that this
> > > >> is a misinterpretation. So, just to clarify in case that is needed.
> > > >>
> > > >> CONFIG_MSEAL_SYSTEM_MAPPINGS does work fine for the UML kernel.
> > > >> However, the UML kernel is a normal userspace application itself and
> > > >> for this application to run, the host kernel must have the feature
> > > >> disabled.
> > > >>
> > > >> So, UML supports the feature. But it still *cannot* run on a host
> > > >> machine that has the feature enabled.
> > > >
> > > >Sigh ok. Apologies if I misunderstood.
> > > >
> > > >Is there any point having this for the 'guest' system? I mean security wise are
> > > >we concerned about sealing of system mappings?
> > >
> > > UML guests are used for testing. For example, it's the default target for KUnit's scripts. Having sealing working in the guest seems generally useful to me.
> > >
> >
> > 'Having sealing working' you mean system sealing? Because mseal works fine
> > (presumably in UML, not tried myself!)
>
> Sorry, yes, I mean "system mapping msealing".
>
> >
> > System msealing lacks any test in this series (I did ask for them...), certainly
> > no kunit tests, so this seems a bit theoretical? Unless you're talking about the
> > theoretical interaction of kunit tests and VDSO sealing?
>
> Right, I meant theoretical interaction, but it would be useful for
> future KUnit tests of system mapping msealing too.
>
> > I mean can't we just introduce this at the time if we believe this'd be useful?
>
> Perhaps adding it as part of adding some KUnit tests that exercise the
> system mapping msealing would be the most sensible.
>
> > Generally I'm not a fan of adding features mid-way through a series, the
> > revisions are meant to be refinements of the original, not an evolving thing.
> >
> > So in general I'd prefer this to be added if + when we need it for something.
>
> Yup, makes sense. And it may be that KUnit tests need to exercise more
> than what UML can support, so even the KUnit idea may be invalid.
>
> Jeff, let's leave off UML for this initial "minimum viable feature"
> series, unless there is a strong reason to keep it.
>
Sure.
It will be removed unless someone raises a strong reason to keep it.
UML can be added when future KUnit tests need it.

Thanks
-Jeff


> --
> Kees Cook


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-25 15:22   ` Liam R. Howlett
  2025-02-25 15:37     ` Lorenzo Stoakes
@ 2025-02-26  0:04     ` Jeff Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:04 UTC (permalink / raw)
  To: Liam R. Howlett, jeffxu, akpm, keescook, jannh, torvalds, vbabka,
	lorenzo.stoakes, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 7:23 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * jeffxu@chromium.org <jeffxu@chromium.org> [250224 17:52]:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide infrastructure to mseal system mappings. Establish
> > two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> > ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP
> > macro for future patches.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  include/linux/mm.h | 10 ++++++++++
> >  init/Kconfig       | 18 ++++++++++++++++++
> >  security/Kconfig   | 18 ++++++++++++++++++
> >  3 files changed, 46 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7b1068ddcbb7..8b800941678d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4155,4 +4155,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> >  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> >  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >
> > +
> > +/*
> > + * mseal of userspace process's system mappings.
> > + */
> > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > +#define VM_SEALED_SYSMAP     VM_SEALED
> > +#else
> > +#define VM_SEALED_SYSMAP     VM_NONE
> > +#endif
> > +
> >  #endif /* _LINUX_MM_H */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b..07435e33f965 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> >       bool
> >
> > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>
> I think we said ARCH_SUPPORTS_ for software features in v5 of the series
> [1].  Can we also make this change please?
>
Sure.



> ...
>
> Thanks,
> Liam
>
> [1]. https://lore.kernel.org/all/202502131142.F5EE115C3A@keescook/


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-25  6:24   ` Lorenzo Stoakes
@ 2025-02-26  0:06     ` Jeff Xu
  2025-02-26  5:57       ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:06 UTC (permalink / raw)
  To: Lorenzo Stoakes, Oleg Nesterov
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:24 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Feb 24, 2025 at 10:52:45PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide support to mseal the uprobe mapping.
> >
> > Unlike other system mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime. It could be sealed from creation.
> >
>
> I thought we agreed not to enable this for now? What testing
> have you done to ensure this is functional?
>
I honestly don't know much about uprobe. I don't recall that I agree
to ignore that though.

As indicated in the cover letter, it is my understanding that uprobe's
mapping's lifetime are the same as process's lifetime, thus sealable.
[1]
Oleg Nesterov, also cc, seems OK with mseal it in the early version of
this patch [2]

Are there any potential downsides of doing this? If yes, I can remove it.

I'm also looking at Oleg to give more guidance on this :-), or if
there are some functional tests that I need to do for uprobe.


[1] https://lore.kernel.org/all/20241005200741.GA24353@redhat.com/
[2] https://lore.kernel.org/all/20241005200741.GA24353@redhat.com/

> I mean is this literally _all_ uprobe mappings now being sealed?
>
> I'd really like some more assurances on this one. And what are you
> mitigating by sealing these? I get VDSO (kinda) but uprobes?
>
> You really need to provide more justification here.

Sure. In our threat model, we need to seal all r-x, r--, and --x
mappings to prevent them from becoming writable. This applies to all
mappings, regardless of whether they're created by the kernel or
dynamic linker.


> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  kernel/events/uprobes.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 2ca797cbe465..8dcdfa0d306b 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1662,6 +1662,7 @@ static const struct vm_special_mapping xol_mapping = {
> >  static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> >  {
> >       struct vm_area_struct *vma;
> > +     unsigned long vm_flags;
> >       int ret;
> >
> >       if (mmap_write_lock_killable(mm))
> > @@ -1682,8 +1683,10 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> >               }
> >       }
> >
> > +     vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> > +     vm_flags |= VM_SEALED_SYSMAP;
> >       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > -                             VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> > +                             vm_flags,
> >                               &xol_mapping);
> >       if (IS_ERR(vma)) {
> >               ret = PTR_ERR(vma);
> > --
> > 2.48.1.658.g4767266eb4-goog
> >


^ permalink raw reply	[flat|nested] 122+ messages in thread

* (no subject)
  2025-02-25 15:18 ` Lorenzo Stoakes
@ 2025-02-26  0:12   ` Jeff Xu
  2025-02-26  5:42     ` your mail Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 7:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Jeff - looking further in this series, I asked for a couple things for this
> series which you've not provided:
>
> 1. Some assurance based on code that the kernel-side code doesn't rely on
>    VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
>    myself, it looks fine for arm64, x86-64. But it might have been nice had
>    you done it :) Apologies if you had and I just missed it.
>
Thanks for checking this.
Do ppc in kernel code unmap/remap  vdso ?

I am aware that userspace can remap/unmap special mappings, but I
don't know if the kernel will remap/unmap a special mapping. Could you
please point out the code ?


> 2. Tests - could you please add some tests to assert that mremap() fails
>    for VDSO for instance? You've edited an existing test for VDSO in x86 to
>    skip the test should this be enabled, but this is not the same as a self
>    test. And obviously doesn't cover arm64.
>
>    This should be relatively strightforward right? You already have code
>    for finding out whether VDSO is msealed, so just use that to see if you
>    skip, then attempt mremap(), mmap() over etc. + assert it fails.
>
>    Ideally these tests would cover all the cases you've changed.
>
It is not as easy.

The config is disabled by default. And I don't know a way to detect
KCONFIG  from selftest itself. Without this, I can't reasonably
determine the test result.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-25 10:32 ` Lorenzo Stoakes
@ 2025-02-26  0:17   ` Jeff Xu
  2025-02-26  6:00     ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 2:32 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> BTW can we please drop the 'mseal, system mappings' prefixes on this
> series, it's really weird and makes it really hard for me to actually read
> the individual summary lines for each commit. 'mseal:' will do.
>

I am not sure.
I had comments about adding mseal, system mappings, as prefixes, and I
think it is reasonable.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-25  1:03   ` Kees Cook
@ 2025-02-26  0:21     ` Jeff Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:21 UTC (permalink / raw)
  To: Kees Cook
  Cc: akpm, jannh, torvalds, vbabka, lorenzo.stoakes, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 5:03 PM Kees Cook <kees@kernel.org> wrote:
>
> On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> > covering the vdso, vvar, vvar_vclock.
> >
> > Production release testing passes on Android and Chrome OS.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  arch/x86/Kconfig          |  1 +
> >  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 87198d957e2f..8fa17032ca46 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -26,6 +26,7 @@ config X86_64
> >       depends on 64BIT
> >       # Options that are inherently 64-bit kernel only:
> >       select ARCH_HAS_GIGANTIC_PAGE
> > +     select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> >       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> >       select ARCH_SUPPORTS_PER_VMA_LOCK
> >       select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> > index 39e6efc1a9ca..1b1c009f20a8 100644
> > --- a/arch/x86/entry/vdso/vma.c
> > +++ b/arch/x86/entry/vdso/vma.c
> > @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> >       struct mm_struct *mm = current->mm;
> >       struct vm_area_struct *vma;
> >       unsigned long text_start;
> > +     unsigned long vm_flags;
> >       int ret = 0;
> >
> >       if (mmap_write_lock_killable(mm))
> > @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> >       /*
> >        * MAYWRITE to allow gdb to COW and set breakpoints
> >        */
> > +     vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > +     vm_flags |= VM_SEALED_SYSMAP;
> >       vma = _install_special_mapping(mm,
> >                                      text_start,
> >                                      image->size,
> > -                                    VM_READ|VM_EXEC|
> > -                                    VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > +                                    vm_flags,
> >                                      &vdso_mapping);
>
> I think these (in all patches) were supposed to be reworked without the
> "vm_flags" variable addition?
>
OK.

> --
> Kees Cook


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-25  8:08   ` Thomas Weißschuh
@ 2025-02-26  0:48     ` Jeff Xu
  2025-02-26  7:35       ` Thomas Weißschuh
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  0:48 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, adobriyan, johannes, pedro.falcato, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 12:08 AM Thomas Weißschuh
<thomas.weissschuh@linutronix.de> wrote:
>
> Hi Jeff,
>
> On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> > covering the vdso, vvar, vvar_vclock.
> >
> > Production release testing passes on Android and Chrome OS.
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  arch/x86/Kconfig          |  1 +
> >  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 87198d957e2f..8fa17032ca46 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -26,6 +26,7 @@ config X86_64
> >       depends on 64BIT
> >       # Options that are inherently 64-bit kernel only:
> >       select ARCH_HAS_GIGANTIC_PAGE
> > +     select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> >       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> >       select ARCH_SUPPORTS_PER_VMA_LOCK
> >       select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> > index 39e6efc1a9ca..1b1c009f20a8 100644
> > --- a/arch/x86/entry/vdso/vma.c
> > +++ b/arch/x86/entry/vdso/vma.c
> > @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> >       struct mm_struct *mm = current->mm;
> >       struct vm_area_struct *vma;
> >       unsigned long text_start;
> > +     unsigned long vm_flags;
> >       int ret = 0;
> >
> >       if (mmap_write_lock_killable(mm))
> > @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> >       /*
> >        * MAYWRITE to allow gdb to COW and set breakpoints
> >        */
> > +     vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > +     vm_flags |= VM_SEALED_SYSMAP;
> >       vma = _install_special_mapping(mm,
> >                                      text_start,
> >                                      image->size,
> > -                                    VM_READ|VM_EXEC|
> > -                                    VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > +                                    vm_flags,
> >                                      &vdso_mapping);
> >
> >       if (IS_ERR(vma)) {
> > @@ -276,11 +278,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> >               goto up_fail;
> >       }
> >
> > +     vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
> > +     vm_flags |= VM_SEALED_SYSMAP;
> >       vma = _install_special_mapping(mm,
> >                                      addr,
> >                                      (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
> > -                                    VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
> > -                                    VM_PFNMAP,
> > +                                    vm_flags,
> >                                      &vvar_mapping);
>
> This hunk (and the vvar mapping in the arm64 patch) will conflict with my
> "Generic vDSO datapage" series.
> That series is already part of the tip tree (branch timers/vdso) and scheduled
> for the next merge window.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=timers/vdso
>
> The conflict resolution is fairly easy:
> Move the new flag logic into lib/vdso/datastore.c
>
Thank you for bringing this to my attention.

In your change,  it seems lib/vdso/datastore.c implements a
vdso_install_vvar_mapping(), then all the architectures call this
function.

So merging conflict won't be as straightforward.  Maybe a better
approach is that I continue resolving all the comments, based on the
latest main. Then wait for your change to be merged and submit another
version.

Thanks
-Jeff

-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-25  6:05   ` Lorenzo Stoakes
@ 2025-02-26  1:33     ` Jeff Xu
  2025-02-26  6:04       ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26  1:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Mon, Feb 24, 2025 at 10:05 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > +     bool
> > +     help
> > +       Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> > +
> > +       A 64-bit kernel is required for the memory sealing feature.
> > +       No specific hardware features from the CPU are needed.
> > +
> > +       To enable this feature, the architecture needs to update their
> > +       special mappings calls to include the sealing flag and confirm
> > +       that it doesn't unmap/remap system mappings during the life
> > +       time of the process. After the architecture enables this, a
> > +       distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> > +       to the feature.
>
> Architectures also need to be confirmed not to require any form of VDSO
> relocation, which as discussed in previous series some arches appear to
> need to do. I'd mention that here.
>
This might need clarification, the system mapping includes vdso, right
? Why the focus on vdso ?

The sentence  "... it doesn't unmap/remap system mappings during the
lifetime of the process."  already cover what you want here, I think.


> > +
> > +       For complete descriptions of memory sealing, please see
> > +       Documentation/userspace-api/mseal.rst
> > +
> >  config HAVE_PERF_EVENTS
> >       bool
> >       help
> > diff --git a/security/Kconfig b/security/Kconfig
> > index f10dbf15c294..15a86a952910 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
> >
> >  endchoice
> >
> > +config MSEAL_SYSTEM_MAPPINGS
> > +     bool "mseal system mappings"
> > +     depends on 64BIT
> > +     depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > +     depends on !CHECKPOINT_RESTORE
> > +     help
> > +       Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
>
> Let's be specific here, 'etc.' could mean _anything_. Also you aren't
> sealing most of this, let's just list what you are _actually_ sealing
> here. Which is AFAIK VDSO only?
>
I will remove "etc" and list all the mappings.

Those mappings are:
vdso, vvar, vvar_vclock, vectors (arm compact-mode) and sigpage (arm
compact-mode), uprobe.

We seal all system mappings that x86-64 and arm64 have.

> You can update this later as time goes on if/when you expand this.
>
> > +
> > +       A 64-bit kernel is required for the memory sealing feature.
> > +       No specific hardware features from the CPU are needed.
> > +
> > +       Note: CHECKPOINT_RESTORE, UML, gVisor, rr are known to relocate or
> > +       unmap system mapping, therefore this config can't be enabled
> > +       universally.
>
> Thanks for putting this here, appreciate it!
>
> Could we tweak this though? I'd like to make it crystal clear, so I don't
> think 'note' sufficies and this sounds a little too vague.
>
> I think 'warning' is more appropriate here since you're breaking things for
> people who might be unaware. And we need to say this -breaks- programs:
>
> WARNING: This feature breaks programs which rely on relocating or
>          unmapping system mappings.
>
>          Known broken software at the time of writing includes
>          CHECKPOINT_RESTORE, UML, gVisor and rr.
>
> I think this is critical.
>
Sure.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-25 22:26     ` Jeff Xu
@ 2025-02-26  5:25       ` Lorenzo Stoakes
  2025-02-26 17:11         ` Liam R. Howlett
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  5:25 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > >
> > > Production release testing passes on Android and Chrome OS.
> >
> > This is pretty limited (yes yes I know android is massive etc. but we must
> > account for all the weird and wonderful arm64 devices out there in context of
> > upstream :)
> >
> > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > nervous of that.
> >
> Can you please point out where PPC munmap/mremap the vdso ?
>
> Previously, when you mentioned that, I thought you meant user space in
> PPC, I didn't realize that you meant that kernel code in PPC.  I
> tried, but didn't find anything, hence asking.

Jeff, please stick to replying to review. 'Have you looking through all
arm64-code'.

I ended up doing this myself yesterday and found no issues, as with x86-64.

I said I'm _pretty sure_ PPC does this. Liam mentioned something about
it. We can discuss it, and I can find specifics if + when you try to add
this to PPC.

Please try to respect my time...

>
> Thanks.
> -Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2025-02-26  0:12   ` Jeff Xu
@ 2025-02-26  5:42     ` Lorenzo Stoakes
  2025-02-28  0:55       ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  5:42 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 04:12:40PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 7:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Jeff - looking further in this series, I asked for a couple things for this
> > series which you've not provided:
> >
> > 1. Some assurance based on code that the kernel-side code doesn't rely on
> >    VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
> >    myself, it looks fine for arm64, x86-64. But it might have been nice had
> >    you done it :) Apologies if you had and I just missed it.
> >
> Thanks for checking this.
> Do ppc in kernel code unmap/remap  vdso ?
>
> I am aware that userspace can remap/unmap special mappings, but I
> don't know if the kernel will remap/unmap a special mapping. Could you
> please point out the code ?

Again as discussed in other thread, let's leave this until as/when you try
to attack PPC. I am not 100% this is the case, I may be mistaken sure, but
gathered it _might_ be.

>
>
> > 2. Tests - could you please add some tests to assert that mremap() fails
> >    for VDSO for instance? You've edited an existing test for VDSO in x86 to
> >    skip the test should this be enabled, but this is not the same as a self
> >    test. And obviously doesn't cover arm64.
> >
> >    This should be relatively strightforward right? You already have code
> >    for finding out whether VDSO is msealed, so just use that to see if you
> >    skip, then attempt mremap(), mmap() over etc. + assert it fails.
> >
> >    Ideally these tests would cover all the cases you've changed.
> >
> It is not as easy.
>
> The config is disabled by default. And I don't know a way to detect
> KCONFIG  from selftest itself. Without this, I can't reasonably
> determine the test result.

Please in future let's try to get this kind of response at the point of the
request being made :) makes life much easier.

So I think it is easy actually. As I say here (perhaps you missed it?) you
literally already have code you added to the x86 test to prevent test
failure.

So you essentially look for [vdso] or whatever, then you look up to see if
it is sealed, ensure an mremap() fails.

Of course this doesn't check to see if it _should_ be enabled or not. I'm
being nice by not _insisting_ we export a way for you to determine whether
this config option is enabled or not for the sake of a test (since I don't
want to hold up this series). But that'd be nice! Maybe in a
/sys/kernel/security endpoint...

...but I think we can potentially add this later on so we don't hold things
up here/add yet more to the series. I suspect you and Kees alike would
prioritise getting _this series_ in at this point :)

You could, if you wanted to, check to see if /proc/config.gz exists and
zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
that, but you know kinda hacky.

But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
fails here for system mappings for the sake of it. I guess this is, in
effect, only checking mseal-ing works right? But it at least asserts the
existence of the thing, and that it behaves.

Later on, when you add some way of observing that it's enabled or not, you
can extend the test to check this.

Thanks!

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26  0:06     ` Jeff Xu
@ 2025-02-26  5:57       ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  5:57 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Oleg Nesterov, akpm, keescook, jannh, torvalds, vbabka,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 04:06:37PM -0800, Jeff Xu wrote:
> On Mon, Feb 24, 2025 at 10:24 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 10:52:45PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Provide support to mseal the uprobe mapping.
> > >
> > > Unlike other system mappings, the uprobe mapping is not
> > > established during program startup. However, its lifetime is the same
> > > as the process's lifetime. It could be sealed from creation.
> > >
> >
> > I thought we agreed not to enable this for now? What testing
> > have you done to ensure this is functional?
> >
> I honestly don't know much about uprobe. I don't recall that I agree
> to ignore that though.

OK sorry I realise you have done this from an early version of the series,
my mistake! Apologies.

I'm concerned you don't feel you know much about uprobe, but I guess you
defer to Oleg's views here?

If he's confirmed this is ok, then fine.

>
> As indicated in the cover letter, it is my understanding that uprobe's
> mapping's lifetime are the same as process's lifetime, thus sealable.

> [1]
> Oleg Nesterov, also cc, seems OK with mseal it in the early version of
> this patch [2]
>
> Are there any potential downsides of doing this? If yes, I can remove it.
>
> I'm also looking at Oleg to give more guidance on this :-), or if
> there are some functional tests that I need to do for uprobe.

Yeah, apologies, my mistake I forgot that this was from early, I thought it
was scope creep... but I double-checked and yeah, no haha.

>
>
> [1] https://lore.kernel.org/all/20241005200741.GA24353@redhat.com/
> [2] https://lore.kernel.org/all/20241005200741.GA24353@redhat.com/
>
> > I mean is this literally _all_ uprobe mappings now being sealed?
> >
> > I'd really like some more assurances on this one. And what are you
> > mitigating by sealing these? I get VDSO (kinda) but uprobes?
> >
> > You really need to provide more justification here.
>
> Sure. In our threat model, we need to seal all r-x, r--, and --x
> mappings to prevent them from becoming writable. This applies to all
> mappings, regardless of whether they're created by the kernel or
> dynamic linker.

All mappings? :P I mean I guess you mean somehow, all 'system' mappings
right?

I guess you mean that somehow some malicious user could manipulate these
mappings from a sandbox or such using a series of exploits that are maybe
more achievable that arbitrary code execution (rop with syscalls or sth? I
am not a security person - obviously! :)

And then un-sandboxed code could innocently touch and bang.

I mean that to me makes sense and cool, we're good. Something like this in
the doc, just a brief sentence like this for idiots (or perhaps you might
say, idiots when it comes to security :) like me would be great, thanks!

>
>
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > ---
> > >  kernel/events/uprobes.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 2ca797cbe465..8dcdfa0d306b 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -1662,6 +1662,7 @@ static const struct vm_special_mapping xol_mapping = {
> > >  static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> > >  {
> > >       struct vm_area_struct *vma;
> > > +     unsigned long vm_flags;
> > >       int ret;
> > >
> > >       if (mmap_write_lock_killable(mm))
> > > @@ -1682,8 +1683,10 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> > >               }
> > >       }
> > >
> > > +     vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> > > +     vm_flags |= VM_SEALED_SYSMAP;
> > >       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > -                             VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> > > +                             vm_flags,
> > >                               &xol_mapping);
> > >       if (IS_ERR(vma)) {
> > >               ret = PTR_ERR(vma);
> > > --
> > > 2.48.1.658.g4767266eb4-goog
> > >


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed
  2025-02-25 22:37     ` Jeff Xu
@ 2025-02-26  5:58       ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  5:58 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport, Kees Cook

On Tue, Feb 25, 2025 at 02:37:46PM -0800, Jeff Xu wrote:
> On Mon, Feb 24, 2025 at 10:15 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 10:52:41PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Add code to detect if the vdso is memory sealed, skip the test
> > > if it is.
> >
> > I feel this is a little succinct of a commit message, but I guess it gets to the
> > heart of what you're doing here.
> >
> > Fundamentally I mean it makes sense, but I'm concerned that x86 has a test
> > -expliictly checking- whether mremap() of VDSO is possible - are there cases
> > where x86 might want to do this internal to the kernel?
> >
> > I guess not since this is essentially a userland self test and probably
> > asserting you can do this in the way rr, etc. do.
> >
> > >
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > Reviewed-by: Kees Cook <kees@kernel.org>
> >
> > Anyway, this aside, this looks fine, aside from nit below, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >  .../testing/selftests/x86/test_mremap_vdso.c  | 43 +++++++++++++++++++
> > >  1 file changed, 43 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c b/tools/testing/selftests/x86/test_mremap_vdso.c
> > > index d53959e03593..94bee6e0c813 100644
> > > --- a/tools/testing/selftests/x86/test_mremap_vdso.c
> > > +++ b/tools/testing/selftests/x86/test_mremap_vdso.c
> > > @@ -14,6 +14,7 @@
> > >  #include <errno.h>
> > >  #include <unistd.h>
> > >  #include <string.h>
> > > +#include <stdbool.h>
> > >
> > >  #include <sys/mman.h>
> > >  #include <sys/auxv.h>
> > > @@ -55,13 +56,55 @@ static int try_to_remap(void *vdso_addr, unsigned long size)
> > >
> > >  }
> > >
> > > +#define VDSO_NAME "[vdso]"
> > > +#define VMFLAGS "VmFlags:"
> > > +#define MSEAL_FLAGS "sl"
> > > +#define MAX_LINE_LEN 512
> > > +
> > > +bool vdso_sealed(FILE *maps)
> >
> > Should be static?
> >
> sure.

Thanks! :)

>
> > > +{
> > > +     char line[MAX_LINE_LEN];
> > > +     bool has_vdso = false;
> > > +
> > > +     while (fgets(line, sizeof(line), maps)) {
> > > +             if (strstr(line, VDSO_NAME))
> > > +                     has_vdso = true;
> > > +
> > > +             if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
> > > +                     if (strstr(line, MSEAL_FLAGS))
> > > +                             return true;
> > > +
> > > +                     return false;
> > > +             }
> > > +     }
> > > +
> > > +     return false;
> > > +}
> > > +
> > >  int main(int argc, char **argv, char **envp)
> > >  {
> > >       pid_t child;
> > > +     FILE *maps;
> > >
> > >       ksft_print_header();
> > >       ksft_set_plan(1);
> > >
> > > +     maps = fopen("/proc/self/smaps", "r");
> > > +     if (!maps) {
> > > +             ksft_test_result_skip(
> > > +                     "Could not open /proc/self/smaps, errno=%d\n",
> > > +                      errno);
> > > +
> > > +             return 0;
> > > +     }
> > > +
> > > +     if (vdso_sealed(maps)) {
> > > +             ksft_test_result_skip("vdso is sealed\n");
> > > +             return 0;
> > > +     }
> > > +
> > > +     fclose(maps);
> > > +
> > >       child = fork();
> > >       if (child == -1)
> > >               ksft_exit_fail_msg("failed to fork (%d): %m\n", errno);
> > > --
> > > 2.48.1.658.g4767266eb4-goog
> > >


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-26  0:17   ` Jeff Xu
@ 2025-02-26  6:00     ` Lorenzo Stoakes
  2025-02-27 23:43       ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  6:00 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 04:17:01PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 2:32 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > BTW can we please drop the 'mseal, system mappings' prefixes on this
> > series, it's really weird and makes it really hard for me to actually read
> > the individual summary lines for each commit. 'mseal:' will do.
> >
>
> I am not sure.
> I had comments about adding mseal, system mappings, as prefixes, and I
> think it is reasonable.

No it's really horrible (sorry I know it's hyperbolic but it really really looks
horrid to me) , I've never seen prefixes like that before in mm in my life and I
don't think it adds anything.

I also find it MIGHTY confusing.

Please remove this :) you can git log the relevant files to see the conventions
people use. Typically xxx: has something really short and sweet for the 'xxx'.

I realise this is subjective, but it's a small thing and would be helpful for
parsing your series at a glance.

Thanks!


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-26  1:33     ` Jeff Xu
@ 2025-02-26  6:04       ` Lorenzo Stoakes
  2025-02-28  0:04         ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26  6:04 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 05:33:24PM -0800, Jeff Xu wrote:
> On Mon, Feb 24, 2025 at 10:05 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > +     bool
> > > +     help
> > > +       Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> > > +
> > > +       A 64-bit kernel is required for the memory sealing feature.
> > > +       No specific hardware features from the CPU are needed.
> > > +
> > > +       To enable this feature, the architecture needs to update their
> > > +       special mappings calls to include the sealing flag and confirm
> > > +       that it doesn't unmap/remap system mappings during the life
> > > +       time of the process. After the architecture enables this, a
> > > +       distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> > > +       to the feature.
> >
> > Architectures also need to be confirmed not to require any form of VDSO
> > relocation, which as discussed in previous series some arches appear to
> > need to do. I'd mention that here.
> >
> This might need clarification, the system mapping includes vdso, right
> ? Why the focus on vdso ?

My mistake, I thought scope was more limited than this when I first
looked. Please disregard the focus on VDSO here... :)

>
> The sentence  "... it doesn't unmap/remap system mappings during the
> lifetime of the process."  already cover what you want here, I think.
>

Right, I guess it just doesn't quite _emphasise_ it enough for me. Something
like the below would really help bring that out:

	The existing of this flag for an architecture implies that it does not
	require the remapping of these system mappings during process lifetime,
	so sealing these mappings is safe from a kernel perspective.

>
> > > +
> > > +       For complete descriptions of memory sealing, please see
> > > +       Documentation/userspace-api/mseal.rst
> > > +
> > >  config HAVE_PERF_EVENTS
> > >       bool
> > >       help
> > > diff --git a/security/Kconfig b/security/Kconfig
> > > index f10dbf15c294..15a86a952910 100644
> > > --- a/security/Kconfig
> > > +++ b/security/Kconfig
> > > @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
> > >
> > >  endchoice
> > >
> > > +config MSEAL_SYSTEM_MAPPINGS
> > > +     bool "mseal system mappings"
> > > +     depends on 64BIT
> > > +     depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > +     depends on !CHECKPOINT_RESTORE
> > > +     help
> > > +       Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
> >
> > Let's be specific here, 'etc.' could mean _anything_. Also you aren't
> > sealing most of this, let's just list what you are _actually_ sealing
> > here. Which is AFAIK VDSO only?
> >
> I will remove "etc" and list all the mappings.
>
> Those mappings are:
> vdso, vvar, vvar_vclock, vectors (arm compact-mode) and sigpage (arm
> compact-mode), uprobe.
>
> We seal all system mappings that x86-64 and arm64 have.

Perfect!

>
> > You can update this later as time goes on if/when you expand this.
> >
> > > +
> > > +       A 64-bit kernel is required for the memory sealing feature.
> > > +       No specific hardware features from the CPU are needed.
> > > +
> > > +       Note: CHECKPOINT_RESTORE, UML, gVisor, rr are known to relocate or
> > > +       unmap system mapping, therefore this config can't be enabled
> > > +       universally.
> >
> > Thanks for putting this here, appreciate it!
> >
> > Could we tweak this though? I'd like to make it crystal clear, so I don't
> > think 'note' sufficies and this sounds a little too vague.
> >
> > I think 'warning' is more appropriate here since you're breaking things for
> > people who might be unaware. And we need to say this -breaks- programs:
> >
> > WARNING: This feature breaks programs which rely on relocating or
> >          unmapping system mappings.
> >
> >          Known broken software at the time of writing includes
> >          CHECKPOINT_RESTORE, UML, gVisor and rr.
> >
> > I think this is critical.
> >
> Sure.

Perfect, thank you! Much appreciated.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-26  0:48     ` Jeff Xu
@ 2025-02-26  7:35       ` Thomas Weißschuh
  2025-02-27 21:44         ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Thomas Weißschuh @ 2025-02-26  7:35 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, adobriyan, johannes, pedro.falcato, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 04:48:47PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 12:08 AM Thomas Weißschuh
> <thomas.weissschuh@linutronix.de> wrote:
> > On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> > > covering the vdso, vvar, vvar_vclock.
> > >
> > > Production release testing passes on Android and Chrome OS.
> > >
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > ---
> > >  arch/x86/Kconfig          |  1 +
> > >  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
> > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 87198d957e2f..8fa17032ca46 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -26,6 +26,7 @@ config X86_64
> > >       depends on 64BIT
> > >       # Options that are inherently 64-bit kernel only:
> > >       select ARCH_HAS_GIGANTIC_PAGE
> > > +     select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > >       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > >       select ARCH_SUPPORTS_PER_VMA_LOCK
> > >       select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> > > index 39e6efc1a9ca..1b1c009f20a8 100644
> > > --- a/arch/x86/entry/vdso/vma.c
> > > +++ b/arch/x86/entry/vdso/vma.c
> > > @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > >       struct mm_struct *mm = current->mm;
> > >       struct vm_area_struct *vma;
> > >       unsigned long text_start;
> > > +     unsigned long vm_flags;
> > >       int ret = 0;
> > >
> > >       if (mmap_write_lock_killable(mm))
> > > @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > >       /*
> > >        * MAYWRITE to allow gdb to COW and set breakpoints
> > >        */
> > > +     vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > > +     vm_flags |= VM_SEALED_SYSMAP;
> > >       vma = _install_special_mapping(mm,
> > >                                      text_start,
> > >                                      image->size,
> > > -                                    VM_READ|VM_EXEC|
> > > -                                    VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > > +                                    vm_flags,
> > >                                      &vdso_mapping);
> > >
> > >       if (IS_ERR(vma)) {
> > > @@ -276,11 +278,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > >               goto up_fail;
> > >       }
> > >
> > > +     vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
> > > +     vm_flags |= VM_SEALED_SYSMAP;
> > >       vma = _install_special_mapping(mm,
> > >                                      addr,
> > >                                      (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
> > > -                                    VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
> > > -                                    VM_PFNMAP,
> > > +                                    vm_flags,
> > >                                      &vvar_mapping);
> >
> > This hunk (and the vvar mapping in the arm64 patch) will conflict with my
> > "Generic vDSO datapage" series.
> > That series is already part of the tip tree (branch timers/vdso) and scheduled
> > for the next merge window.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=timers/vdso
> >
> > The conflict resolution is fairly easy:
> > Move the new flag logic into lib/vdso/datastore.c
> >
> Thank you for bringing this to my attention.
> 
> In your change,  it seems lib/vdso/datastore.c implements a
> vdso_install_vvar_mapping(), then all the architectures call this
> function.

Correct.

At least all the architectures using the generic vDSO infrastructure,
which are the ones you care about.
Sparc for example has its own implementation.

> So merging conflict won't be as straightforward.

Wouldn't it be enough to unconditionally use VM_SEALED_SYSMAP in
vdso_install_vvar_mapping()?
The symbol is a noop on architectures or configurations where the new
functionality is not available or enabled.

> Maybe a better
> approach is that I continue resolving all the comments, based on the
> latest main. Then wait for your change to be merged and submit another
> version.

That would work, too. As you prefer.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-24 22:52 ` [PATCH v7 6/7] mseal, system mappings: uprobe mapping jeffxu
  2025-02-25  6:24   ` Lorenzo Stoakes
@ 2025-02-26 16:26   ` Oleg Nesterov
  2025-02-26 16:33     ` Oleg Nesterov
  2025-02-26 16:45     ` Lorenzo Stoakes
  1 sibling, 2 replies; 122+ messages in thread
From: Oleg Nesterov @ 2025-02-26 16:26 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On 02/24, jeffxu@chromium.org wrote:
>
> Unlike other system mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime. It could be sealed from creation.

Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
of config options.

ACK,

but can't we do

	#ifdef CONFIG_64BIT
	/* VM is sealed, in vm_flags */
	#define VM_SEALED	_BITUL(63)
+	#else
+	#define VM_SEALED	0
	#endif

and then simply

	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
-				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,

?

But I am fine either way, feel free to ignore.

Oleg.



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 16:26   ` Oleg Nesterov
@ 2025-02-26 16:33     ` Oleg Nesterov
  2025-02-26 16:45     ` Lorenzo Stoakes
  1 sibling, 0 replies; 122+ messages in thread
From: Oleg Nesterov @ 2025-02-26 16:33 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On 02/26, Oleg Nesterov wrote:
>
> On 02/24, jeffxu@chromium.org wrote:
> >
> > Unlike other system mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime. It could be sealed from creation.
>
> Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
> of config options.
>
> ACK,
>
> but can't we do
>
> 	#ifdef CONFIG_64BIT
> 	/* VM is sealed, in vm_flags */
> 	#define VM_SEALED	_BITUL(63)
> +	#else
> +	#define VM_SEALED	0
> 	#endif
>
> and then simply
>
> 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> +				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,
>
> ?
>
> But I am fine either way, feel free to ignore.

Yes, but either way, why your patch adds "unsigned long vm_flags" ?
OK, perhaps it makes sense for readability, but

	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
	vm_flags |= VM_SEALED_SYSMAP;

looks a bit strange, why not

	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;

?

Oleg.



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 16:26   ` Oleg Nesterov
  2025-02-26 16:33     ` Oleg Nesterov
@ 2025-02-26 16:45     ` Lorenzo Stoakes
  2025-02-26 18:01       ` Oleg Nesterov
  1 sibling, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 16:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: jeffxu, akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 05:26:04PM +0100, Oleg Nesterov wrote:
> On 02/24, jeffxu@chromium.org wrote:
> >
> > Unlike other system mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime. It could be sealed from creation.
>
> Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
> of config options.

If you think this ought to be the case generally, then perhaps we should
drop this patch from the commit and just do this separately as a
permanent-on thing, if you are sure this is fine + want it?

An aside - we _definitely_ cannot allow this -system mapping stuff- to be
enabled without a config option, this is emphatic, for the reason that it
breaks userspace and is only known-good on some arches.

A config flag that checks arch gives a big warning saying 'hey this breaks
userspace' means users use it knowing this to be the case.
>
>
> ACK,
>
> but can't we do
>
> 	#ifdef CONFIG_64BIT
> 	/* VM is sealed, in vm_flags */
> 	#define VM_SEALED	_BITUL(63)
> +	#else
> +	#define VM_SEALED	0
> 	#endif

This has been raised a few times. Jeff objects to this (for reasons I don't
agree with, honestly) but it's been implemented in this way from the start
(in order to catch the case of 32-bit systems trying to use mseal but it
not happening).

Anyway I agree, but I'm not going to push this at least in this series.

>
> and then simply
>
> 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> +				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,
>
> ?

Nah you'd have to do:

> 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO
#ifdef CONFIG_64BIT
				VM_SEALED
#endif
				,

Or something similar :)

Beautiful no?

>
> But I am fine either way, feel free to ignore.
>
> Oleg.
>

Ideally We should hear what the security use case is here. I mean I'm less
bothered with it behind a flag, but it seems odd that an attacker would
break out of a sandbox into a process currently being debugged... seems
unlikely.

I'm not sure under what other circumstances this would be a problem. Jeff,
Kees?

Anyway as I said before, I don't overly object to this as-is, as long as
you are ok with it Oleg and can absolutely confirm this will never break
anything, you don't need to remap, unmap (until process teardown), adjust
the VMA in any way etc.?

A quick glance at uprobe code suggests it's fine.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-26  5:25       ` Lorenzo Stoakes
@ 2025-02-26 17:11         ` Liam R. Howlett
  2025-02-26 17:17           ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Liam R. Howlett @ 2025-02-26 17:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jeff Xu, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > From: Jeff Xu <jeffxu@chromium.org>
> > > >
> > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > >
> > > > Production release testing passes on Android and Chrome OS.
> > >
> > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > account for all the weird and wonderful arm64 devices out there in context of
> > > upstream :)
> > >
> > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > nervous of that.
> > >
> > Can you please point out where PPC munmap/mremap the vdso ?
> >
> > Previously, when you mentioned that, I thought you meant user space in
> > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > tried, but didn't find anything, hence asking.
> 
> Jeff, please stick to replying to review. 'Have you looking through all
> arm64-code'.
> 
> I ended up doing this myself yesterday and found no issues, as with x86-64.
> 
> I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> it. We can discuss it, and I can find specifics if + when you try to add
> this to PPC.
> 

PPC allows the vma to be munmapped then detects and falls back to the
slower method, iirc.

They were against the removal of the fallback; other archs also have
this infrastructure.  Really, if we fixed the fallback to work for
all platforms then it would probably also remove the possibility of a
remap over the VDSO being a problem (if it is today, which still isn't
clear?).

Thanks,
Liam


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-26 17:11         ` Liam R. Howlett
@ 2025-02-26 17:17           ` Jeff Xu
  2025-02-26 17:43             ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-26 17:17 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes, Jeff Xu, akpm, keescook, jannh,
	torvalds, vbabka, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato,
	hca, willy, anna-maria, mark.rutland, linus.walleij, Jason,
	deller, rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen,
	mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
	groeck, mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 9:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> > On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > >
> > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > > >
> > > > > Production release testing passes on Android and Chrome OS.
> > > >
> > > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > > account for all the weird and wonderful arm64 devices out there in context of
> > > > upstream :)
> > > >
> > > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > > nervous of that.
> > > >
> > > Can you please point out where PPC munmap/mremap the vdso ?
> > >
> > > Previously, when you mentioned that, I thought you meant user space in
> > > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > > tried, but didn't find anything, hence asking.
> >
> > Jeff, please stick to replying to review. 'Have you looking through all
> > arm64-code'.
> >
> > I ended up doing this myself yesterday and found no issues, as with x86-64.
> >
> > I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> > it. We can discuss it, and I can find specifics if + when you try to add
> > this to PPC.
> >
>
> PPC allows the vma to be munmapped then detects and falls back to the
> slower method, iirc.
>
Is this code in the kernel or userspace?

If PPC doesn't want to create vdso for all its userspace apps, we
could instead "don't create" vdso during the execve call.


> They were against the removal of the fallback; other archs also have
> this infrastructure.  Really, if we fixed the fallback to work for
> all platforms then it would probably also remove the possibility of a
> remap over the VDSO being a problem (if it is today, which still isn't
> clear?).
>
Any past thread/communication about this that I can read ?

Thanks
-Jeff


> Thanks,
> Liam


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-26 17:17           ` Jeff Xu
@ 2025-02-26 17:43             ` Lorenzo Stoakes
  2025-02-26 18:14               ` Lorenzo Stoakes
  2025-02-28  0:48               ` Jeff Xu
  0 siblings, 2 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 17:43 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 09:17:10AM -0800, Jeff Xu wrote:
> On Wed, Feb 26, 2025 at 9:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> > > On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > > > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > >
> > > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > > > >
> > > > > > Production release testing passes on Android and Chrome OS.
> > > > >
> > > > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > > > account for all the weird and wonderful arm64 devices out there in context of
> > > > > upstream :)
> > > > >
> > > > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > > > nervous of that.
> > > > >
> > > > Can you please point out where PPC munmap/mremap the vdso ?
> > > >
> > > > Previously, when you mentioned that, I thought you meant user space in
> > > > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > > > tried, but didn't find anything, hence asking.
> > >
> > > Jeff, please stick to replying to review. 'Have you looking through all
> > > arm64-code'.
> > >
> > > I ended up doing this myself yesterday and found no issues, as with x86-64.
> > >
> > > I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> > > it. We can discuss it, and I can find specifics if + when you try to add
> > > this to PPC.
> > >
> >
> > PPC allows the vma to be munmapped then detects and falls back to the
> > slower method, iirc.
> >
> Is this code in the kernel or userspace?
>
> If PPC doesn't want to create vdso for all its userspace apps, we
> could instead "don't create" vdso during the execve call.
>
>
> > They were against the removal of the fallback; other archs also have
> > this infrastructure.  Really, if we fixed the fallback to work for
> > all platforms then it would probably also remove the possibility of a
> > remap over the VDSO being a problem (if it is today, which still isn't
> > clear?).
> >
> Any past thread/communication about this that I can read ?

Jeff, I'm sure you don't intend to, but I find it quite disrespectful that you
ignored my feedback here (and elsewhere, regarding you ignoring 4 sets of
feedback).

This?

https://elixir.bootlin.com/linux/v6.13.4/source/arch/powerpc/kernel/vdso.c#L236

Was [0] a relevant discussion?

[0]: https://lore.kernel.org/all/lhe2mky6ahlk2jzvvfjyongqiseelyx2uy7sbyuso6jcy3b2dq@7ju6cea62jgk/

>
> Thanks
> -Jeff
>
>
> > Thanks,
> > Liam


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 16:45     ` Lorenzo Stoakes
@ 2025-02-26 18:01       ` Oleg Nesterov
  2025-02-26 18:06         ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Oleg Nesterov @ 2025-02-26 18:01 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: jeffxu, akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On 02/26, Lorenzo Stoakes wrote:
>
> On Wed, Feb 26, 2025 at 05:26:04PM +0100, Oleg Nesterov wrote:
> > On 02/24, jeffxu@chromium.org wrote:
> > >
> > > Unlike other system mappings, the uprobe mapping is not
> > > established during program startup. However, its lifetime is the same
> > > as the process's lifetime. It could be sealed from creation.
> >
> > Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
> > of config options.
>
> If you think this ought to be the case generally, then perhaps we should
> drop this patch from the commit and just do this separately as a
> permanent-on thing, if you are sure this is fine + want it?

See below...

> An aside - we _definitely_ cannot allow this -system mapping stuff- to be
> enabled without a config option,

This is clear.

But as for uprobes in particular I do think that VM_SEALED is always fine.

Do we really want it? I dunno. If a task unmaps its "[uprobes]" vma it
will crash when it hits the uprobes bp next time. Unless the probed insn
can be emulated and it is not ret-probe. Do we really care? Again, I don't
know.

Should this change come as a separate patch? I don't understand why it should
but I am fine either way.

In short. please do what you think is right,  VM_SEALED can't hurt uprobes ;)

> > 	#ifdef CONFIG_64BIT
> > 	/* VM is sealed, in vm_flags */
> > 	#define VM_SEALED	_BITUL(63)
> > +	#else
> > +	#define VM_SEALED	0
> > 	#endif
>
> This has been raised a few times. Jeff objects to this

OK,

> > and then simply
> >
> > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> > +				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,
> >
> > ?
>
> Nah you'd have to do:
>
>
> > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> 				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO
> #ifdef CONFIG_64BIT
> 				VM_SEALED
> #endif
> 				,

Why??? With the proposed change above VM_SEALED == 0 if !CONFIG_64BIT.

Oleg.



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:01       ` Oleg Nesterov
@ 2025-02-26 18:06         ` Lorenzo Stoakes
  2025-02-26 18:19           ` Liam R. Howlett
  2025-02-26 18:20           ` Oleg Nesterov
  0 siblings, 2 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 18:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: jeffxu, akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 07:01:36PM +0100, Oleg Nesterov wrote:
> On 02/26, Lorenzo Stoakes wrote:
> >
> > On Wed, Feb 26, 2025 at 05:26:04PM +0100, Oleg Nesterov wrote:
> > > On 02/24, jeffxu@chromium.org wrote:
> > > >
> > > > Unlike other system mappings, the uprobe mapping is not
> > > > established during program startup. However, its lifetime is the same
> > > > as the process's lifetime. It could be sealed from creation.
> > >
> > > Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
> > > of config options.
> >
> > If you think this ought to be the case generally, then perhaps we should
> > drop this patch from the commit and just do this separately as a
> > permanent-on thing, if you are sure this is fine + want it?
>
> See below...
>
> > An aside - we _definitely_ cannot allow this -system mapping stuff- to be
> > enabled without a config option,
>
> This is clear.
>
> But as for uprobes in particular I do think that VM_SEALED is always fine.
>
> Do we really want it? I dunno. If a task unmaps its "[uprobes]" vma it
> will crash when it hits the uprobes bp next time. Unless the probed insn
> can be emulated and it is not ret-probe. Do we really care? Again, I don't
> know.
>
> Should this change come as a separate patch? I don't understand why it should
> but I am fine either way.
>
> In short. please do what you think is right,  VM_SEALED can't hurt uprobes ;)
>
> > > 	#ifdef CONFIG_64BIT
> > > 	/* VM is sealed, in vm_flags */
> > > 	#define VM_SEALED	_BITUL(63)
> > > +	#else
> > > +	#define VM_SEALED	0
> > > 	#endif
> >
> > This has been raised a few times. Jeff objects to this
>
> OK,
>
> > > and then simply
> > >
> > > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> > > +				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,
> > >
> > > ?
> >
> > Nah you'd have to do:
> >
> >
> > > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > 				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO
> > #ifdef CONFIG_64BIT
> > 				VM_SEALED
> > #endif
> > 				,
>
> Why??? With the proposed change above VM_SEALED == 0 if !CONFIG_64BIT.
>

Like I said, Jeff opposes the change. I disagree with him, and agree with you,
because this is very silly.

But I don't want to hold up this series with that discussion (this is for his
sake...)

> Oleg.
>

Jeff - perhaps drop this and let's return to it in a follow up so this series
isn't held up?

Thanks.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-26 17:43             ` Lorenzo Stoakes
@ 2025-02-26 18:14               ` Lorenzo Stoakes
  2025-02-28  0:48               ` Jeff Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 18:14 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 05:43:22PM +0000, Lorenzo Stoakes wrote:
> On Wed, Feb 26, 2025 at 09:17:10AM -0800, Jeff Xu wrote:
> > On Wed, Feb 26, 2025 at 9:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> > > > On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > > > > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > >
> > > > > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > > >
> > > > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > > > > >
> > > > > > > Production release testing passes on Android and Chrome OS.
> > > > > >
> > > > > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > > > > account for all the weird and wonderful arm64 devices out there in context of
> > > > > > upstream :)
> > > > > >
> > > > > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > > > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > > > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > > > > nervous of that.
> > > > > >
> > > > > Can you please point out where PPC munmap/mremap the vdso ?
> > > > >
> > > > > Previously, when you mentioned that, I thought you meant user space in
> > > > > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > > > > tried, but didn't find anything, hence asking.
> > > >
> > > > Jeff, please stick to replying to review. 'Have you looking through all
> > > > arm64-code'.
> > > >
> > > > I ended up doing this myself yesterday and found no issues, as with x86-64.
> > > >
> > > > I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> > > > it. We can discuss it, and I can find specifics if + when you try to add
> > > > this to PPC.
> > > >
> > >
> > > PPC allows the vma to be munmapped then detects and falls back to the
> > > slower method, iirc.
> > >
> > Is this code in the kernel or userspace?
> >
> > If PPC doesn't want to create vdso for all its userspace apps, we
> > could instead "don't create" vdso during the execve call.
> >
> >
> > > They were against the removal of the fallback; other archs also have
> > > this infrastructure.  Really, if we fixed the fallback to work for
> > > all platforms then it would probably also remove the possibility of a
> > > remap over the VDSO being a problem (if it is today, which still isn't
> > > clear?).
> > >
> > Any past thread/communication about this that I can read ?
>
> Jeff, I'm sure you don't intend to, but I find it quite disrespectful that you
> ignored my feedback here (and elsewhere, regarding you ignoring 4 sets of
> feedback).

Apologies, I meant to reword this to sound less harsh but somebody phoned
me and I hit send...

What I mean to say is I think you _do_ ack what I've said here, but you
think it's not useful to reply because there's not really a converastion to
be had.

Whereas I'm saying it'd be useful to ack :)

Sorry I did not mean for this to sound quite so 'full on'.

>
> This?
>
> https://elixir.bootlin.com/linux/v6.13.4/source/arch/powerpc/kernel/vdso.c#L236
>
> Was [0] a relevant discussion?
>
> [0]: https://lore.kernel.org/all/lhe2mky6ahlk2jzvvfjyongqiseelyx2uy7sbyuso6jcy3b2dq@7ju6cea62jgk/

I did in the end go and check this, so hopefully this is useful at least.

But again, I really think we should hold off on PPC stuff until we come to
it and focus on getting this series into mergeable state.

Am doing my best to try to get you there ASAP as there's been a lot of
delays here.

>
> >
> > Thanks
> > -Jeff
> >
> >
> > > Thanks,
> > > Liam

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:06         ` Lorenzo Stoakes
@ 2025-02-26 18:19           ` Liam R. Howlett
  2025-02-26 18:20           ` Oleg Nesterov
  1 sibling, 0 replies; 122+ messages in thread
From: Liam R. Howlett @ 2025-02-26 18:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Oleg Nesterov, jeffxu, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 13:06]:
> On Wed, Feb 26, 2025 at 07:01:36PM +0100, Oleg Nesterov wrote:
> > On 02/26, Lorenzo Stoakes wrote:
> > >
> > > On Wed, Feb 26, 2025 at 05:26:04PM +0100, Oleg Nesterov wrote:
> > > > On 02/24, jeffxu@chromium.org wrote:
> > > > >
> > > > > Unlike other system mappings, the uprobe mapping is not
> > > > > established during program startup. However, its lifetime is the same
> > > > > as the process's lifetime. It could be sealed from creation.
> > > >
> > > > Agreed, VM_SEALED should be always for the "[uprobes]" vma, regardless
> > > > of config options.
> > >
> > > If you think this ought to be the case generally, then perhaps we should
> > > drop this patch from the commit and just do this separately as a
> > > permanent-on thing, if you are sure this is fine + want it?
> >
> > See below...
> >
> > > An aside - we _definitely_ cannot allow this -system mapping stuff- to be
> > > enabled without a config option,
> >
> > This is clear.
> >
> > But as for uprobes in particular I do think that VM_SEALED is always fine.
> >
> > Do we really want it? I dunno. If a task unmaps its "[uprobes]" vma it
> > will crash when it hits the uprobes bp next time. Unless the probed insn
> > can be emulated and it is not ret-probe. Do we really care? Again, I don't
> > know.
> >
> > Should this change come as a separate patch? I don't understand why it should
> > but I am fine either way.
> >
> > In short. please do what you think is right,  VM_SEALED can't hurt uprobes ;)
> >
> > > > 	#ifdef CONFIG_64BIT
> > > > 	/* VM is sealed, in vm_flags */
> > > > 	#define VM_SEALED	_BITUL(63)
> > > > +	#else
> > > > +	#define VM_SEALED	0

nit, we have VM_NONE for this (it's also 0, so no real difference)

> > > > 	#endif
> > >
> > > This has been raised a few times. Jeff objects to this
> >
> > OK,
> >
> > > > and then simply
> > > >
> > > > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > > -				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> > > > +				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED,
> > > >
> > > > ?
> > >
> > > Nah you'd have to do:
> > >
> > >
> > > > 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > 				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO
> > > #ifdef CONFIG_64BIT
> > > 				VM_SEALED
> > > #endif
> > > 				,
> >
> > Why??? With the proposed change above VM_SEALED == 0 if !CONFIG_64BIT.
> >
> 
> Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> because this is very silly.

Discussion here [1].

> 
> But I don't want to hold up this series with that discussion (this is for his
> sake...)
> 
> > Oleg.
> >
> 
> Jeff - perhaps drop this and let's return to it in a follow up so this series
> isn't held up?
> 

...

Thanks,
Liam

[1]. https://lore.kernel.org/all/CABi2SkVKhjShryG0K-NSjjBvGs0UOVsY-6MQVOuQCkfuph5K8Q@mail.gmail.com/


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:06         ` Lorenzo Stoakes
  2025-02-26 18:19           ` Liam R. Howlett
@ 2025-02-26 18:20           ` Oleg Nesterov
  2025-02-26 18:25             ` Lorenzo Stoakes
  2025-02-27 21:48             ` Jeff Xu
  1 sibling, 2 replies; 122+ messages in thread
From: Oleg Nesterov @ 2025-02-26 18:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: jeffxu, akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On 02/26, Lorenzo Stoakes wrote:
>
> Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> because this is very silly.
>
> But I don't want to hold up this series with that discussion (this is for his
> sake...)

Neither me, so lets go with VM_SEALED_SYSMAP.

My only objection is that

	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
	vm_flags |= VM_SEALED_SYSMAP;

looks unnecessarily confusing to me,

	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;

or just

	vma = _install_special_mapping(...,
				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP,
				...

looks more readable. But this is cosmetic/subjective, so I won't argue/insist.

> Jeff - perhaps drop this and let's return to it in a follow up so this series
> isn't held up?

Up to you and Jeff.

But this patch looks "natural" to me in this series.

Oleg.



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:20           ` Oleg Nesterov
@ 2025-02-26 18:25             ` Lorenzo Stoakes
  2025-02-27 23:38               ` Jeff Xu
  2025-02-27 21:48             ` Jeff Xu
  1 sibling, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 18:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: jeffxu, akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 07:20:50PM +0100, Oleg Nesterov wrote:
> On 02/26, Lorenzo Stoakes wrote:
> >
> > Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> > because this is very silly.
> >
> > But I don't want to hold up this series with that discussion (this is for his
> > sake...)
>
> Neither me, so lets go with VM_SEALED_SYSMAP.
>
> My only objection is that
>
> 	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> 	vm_flags |= VM_SEALED_SYSMAP;
>
> looks unnecessarily confusing to me,
>
> 	vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;
>
> or just
>
> 	vma = _install_special_mapping(...,
> 				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP,
> 				...
>
> looks more readable. But this is cosmetic/subjective, so I won't argue/insist.

Agreed. This would be good.

>
> > Jeff - perhaps drop this and let's return to it in a follow up so this series
> > isn't held up?
>
> Up to you and Jeff.
>
> But this patch looks "natural" to me in this series.

OK, I mean in that case I'm ok with it as-is, as you confirms there's no
issue, I've looked at the code and there's no issue.

It was only if we wanted to try the VM_SEALED thing, i.e. _always_ seal
then it'd do better outside of the series as there'd be a discussion about
maybe changing this CONFIG_64BIT thing yada yada.

>
> Oleg.
>

Jeff - in that case, do NOT drop this one :P but do please look at the
above style nit.

Let's keep things moving... :)


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 3/7] mseal, system mappings: enable x86-64
  2025-02-26  7:35       ` Thomas Weißschuh
@ 2025-02-27 21:44         ` Jeff Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Jeff Xu @ 2025-02-27 21:44 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
	Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin,
	linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
	ojeda, adobriyan, johannes, pedro.falcato, hca, willy,
	anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
	davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
	mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
	aleksandr.mikhalitsyn, mike.rapoport

Hi Thomas

On Tue, Feb 25, 2025 at 11:35 PM Thomas Weißschuh
<thomas.weissschuh@linutronix.de> wrote:
>
> On Tue, Feb 25, 2025 at 04:48:47PM -0800, Jeff Xu wrote:
> > On Tue, Feb 25, 2025 at 12:08 AM Thomas Weißschuh
> > <thomas.weissschuh@linutronix.de> wrote:
> > > On Mon, Feb 24, 2025 at 10:52:42PM +0000, jeffxu@chromium.org wrote:
> > > > From: Jeff Xu <jeffxu@chromium.org>
> > > >
> > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64,
> > > > covering the vdso, vvar, vvar_vclock.
> > > >
> > > > Production release testing passes on Android and Chrome OS.
> > > >
> > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > > ---
> > > >  arch/x86/Kconfig          |  1 +
> > > >  arch/x86/entry/vdso/vma.c | 16 ++++++++++------
> > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index 87198d957e2f..8fa17032ca46 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -26,6 +26,7 @@ config X86_64
> > > >       depends on 64BIT
> > > >       # Options that are inherently 64-bit kernel only:
> > > >       select ARCH_HAS_GIGANTIC_PAGE
> > > > +     select ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > >       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > > >       select ARCH_SUPPORTS_PER_VMA_LOCK
> > > >       select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > > diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> > > > index 39e6efc1a9ca..1b1c009f20a8 100644
> > > > --- a/arch/x86/entry/vdso/vma.c
> > > > +++ b/arch/x86/entry/vdso/vma.c
> > > > @@ -247,6 +247,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > > >       struct mm_struct *mm = current->mm;
> > > >       struct vm_area_struct *vma;
> > > >       unsigned long text_start;
> > > > +     unsigned long vm_flags;
> > > >       int ret = 0;
> > > >
> > > >       if (mmap_write_lock_killable(mm))
> > > > @@ -264,11 +265,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > > >       /*
> > > >        * MAYWRITE to allow gdb to COW and set breakpoints
> > > >        */
> > > > +     vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> > > > +     vm_flags |= VM_SEALED_SYSMAP;
> > > >       vma = _install_special_mapping(mm,
> > > >                                      text_start,
> > > >                                      image->size,
> > > > -                                    VM_READ|VM_EXEC|
> > > > -                                    VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > > > +                                    vm_flags,
> > > >                                      &vdso_mapping);
> > > >
> > > >       if (IS_ERR(vma)) {
> > > > @@ -276,11 +278,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
> > > >               goto up_fail;
> > > >       }
> > > >
> > > > +     vm_flags = VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|VM_PFNMAP;
> > > > +     vm_flags |= VM_SEALED_SYSMAP;
> > > >       vma = _install_special_mapping(mm,
> > > >                                      addr,
> > > >                                      (__VVAR_PAGES - VDSO_NR_VCLOCK_PAGES) * PAGE_SIZE,
> > > > -                                    VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
> > > > -                                    VM_PFNMAP,
> > > > +                                    vm_flags,
> > > >                                      &vvar_mapping);
> > >
> > > This hunk (and the vvar mapping in the arm64 patch) will conflict with my
> > > "Generic vDSO datapage" series.
> > > That series is already part of the tip tree (branch timers/vdso) and scheduled
> > > for the next merge window.
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=timers/vdso
> > >
> > > The conflict resolution is fairly easy:
> > > Move the new flag logic into lib/vdso/datastore.c
> > >
> > Thank you for bringing this to my attention.
> >
> > In your change,  it seems lib/vdso/datastore.c implements a
> > vdso_install_vvar_mapping(), then all the architectures call this
> > function.
>
> Correct.
>
> At least all the architectures using the generic vDSO infrastructure,
> which are the ones you care about.
> Sparc for example has its own implementation.
>
> > So merging conflict won't be as straightforward.
>
> Wouldn't it be enough to unconditionally use VM_SEALED_SYSMAP in
> vdso_install_vvar_mapping()?
> The symbol is a noop on architectures or configurations where the new
> functionality is not available or enabled.
>
Yes. That will work.

> > Maybe a better
> > approach is that I continue resolving all the comments, based on the
> > latest main. Then wait for your change to be merged and submit another
> > version.
>
> That would work, too. As you prefer.
Great !

Thanks
-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:20           ` Oleg Nesterov
  2025-02-26 18:25             ` Lorenzo Stoakes
@ 2025-02-27 21:48             ` Jeff Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Jeff Xu @ 2025-02-27 21:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lorenzo Stoakes, akpm, keescook, jannh, torvalds, vbabka,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

Hi Oleg

On Wed, Feb 26, 2025 at 10:21 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/26, Lorenzo Stoakes wrote:
> >
> > Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> > because this is very silly.
> >
> > But I don't want to hold up this series with that discussion (this is for his
> > sake...)
>
> Neither me, so lets go with VM_SEALED_SYSMAP.
>
> My only objection is that
>
>         vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
>         vm_flags |= VM_SEALED_SYSMAP;
>
> looks unnecessarily confusing to me,
>
>         vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;
>
> or just
>
>         vma = _install_special_mapping(...,
>                                 VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP,
>
Sure
                                 ...
>
> looks more readable. But this is cosmetic/subjective, so I won't argue/insist.
>
> > Jeff - perhaps drop this and let's return to it in a follow up so this series
> > isn't held up?
>
> Up to you and Jeff.
>
> But this patch looks "natural" to me in this series.
>
Thanks for sharing your insights on uprobe! Your expertise is invaluable.

-Jeff

> Oleg.
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-26 18:25             ` Lorenzo Stoakes
@ 2025-02-27 23:38               ` Jeff Xu
  2025-02-28 10:39                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-27 23:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Oleg Nesterov, akpm, keescook, jannh, torvalds, vbabka,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 10:25 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 26, 2025 at 07:20:50PM +0100, Oleg Nesterov wrote:
> > On 02/26, Lorenzo Stoakes wrote:
> > >
> > > Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> > > because this is very silly.
> > >
> > > But I don't want to hold up this series with that discussion (this is for his
> > > sake...)
> >
> > Neither me, so lets go with VM_SEALED_SYSMAP.
> >
> > My only objection is that
> >
> >       vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> >       vm_flags |= VM_SEALED_SYSMAP;
> >
> > looks unnecessarily confusing to me,
> >
> >       vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;
> >
> > or just
> >
> >       vma = _install_special_mapping(...,
> >                               VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP,
> >                               ...
> >
> > looks more readable. But this is cosmetic/subjective, so I won't argue/insist.
>
> Agreed. This would be good.
>
> >
> > > Jeff - perhaps drop this and let's return to it in a follow up so this series
> > > isn't held up?
> >
> > Up to you and Jeff.
> >
> > But this patch looks "natural" to me in this series.
>
> OK, I mean in that case I'm ok with it as-is, as you confirms there's no
> issue, I've looked at the code and there's no issue.
>
> It was only if we wanted to try the VM_SEALED thing, i.e. _always_ seal
> then it'd do better outside of the series as there'd be a discussion about
> maybe changing this CONFIG_64BIT thing yada yada.
>
> >
> > Oleg.
> >
>
> Jeff - in that case, do NOT drop this one :P but do please look at the
> above style nit.
>
Ok.


> Let's keep things moving... :)


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-26  6:00     ` Lorenzo Stoakes
@ 2025-02-27 23:43       ` Jeff Xu
  2025-02-28 10:32         ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-27 23:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 10:01 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Feb 25, 2025 at 04:17:01PM -0800, Jeff Xu wrote:
> > On Tue, Feb 25, 2025 at 2:32 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > BTW can we please drop the 'mseal, system mappings' prefixes on this
> > > series, it's really weird and makes it really hard for me to actually read
> > > the individual summary lines for each commit. 'mseal:' will do.
> > >
> >
> > I am not sure.
> > I had comments about adding mseal, system mappings, as prefixes, and I
> > think it is reasonable.
>
> No it's really horrible (sorry I know it's hyperbolic but it really really looks
> horrid to me) , I've never seen prefixes like that before in mm in my life and I
> don't think it adds anything.
>
> I also find it MIGHTY confusing.
>
> Please remove this :) you can git log the relevant files to see the conventions
> people use. Typically xxx: has something really short and sweet for the 'xxx'.
>
> I realise this is subjective, but it's a small thing and would be helpful for
> parsing your series at a glance.
>
I need a prefix to group the patches, especially when the first patch
is just a config change. "mseal" won't work because this patch isn't
about core business logic of mseal.

How about "mseal sysmap:"?

-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-26  6:04       ` Lorenzo Stoakes
@ 2025-02-28  0:04         ` Jeff Xu
  2025-02-28 10:32           ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-28  0:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Tue, Feb 25, 2025 at 10:04 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Feb 25, 2025 at 05:33:24PM -0800, Jeff Xu wrote:
> > On Mon, Feb 24, 2025 at 10:05 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > > +     bool
> > > > +     help
> > > > +       Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> > > > +
> > > > +       A 64-bit kernel is required for the memory sealing feature.
> > > > +       No specific hardware features from the CPU are needed.
> > > > +
> > > > +       To enable this feature, the architecture needs to update their
> > > > +       special mappings calls to include the sealing flag and confirm
> > > > +       that it doesn't unmap/remap system mappings during the life
> > > > +       time of the process. After the architecture enables this, a
> > > > +       distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> > > > +       to the feature.
> > >
> > > Architectures also need to be confirmed not to require any form of VDSO
> > > relocation, which as discussed in previous series some arches appear to
> > > need to do. I'd mention that here.
> > >
> > This might need clarification, the system mapping includes vdso, right
> > ? Why the focus on vdso ?
>
> My mistake, I thought scope was more limited than this when I first
> looked. Please disregard the focus on VDSO here... :)
>
> >
> > The sentence  "... it doesn't unmap/remap system mappings during the
> > lifetime of the process."  already cover what you want here, I think.
> >
>
> Right, I guess it just doesn't quite _emphasise_ it enough for me. Something
> like the below would really help bring that out:
>
>         The existing of this flag for an architecture implies that it does not
>         require the remapping of these system mappings during process lifetime,
>         so sealing these mappings is safe from a kernel perspective.
>
I'm not sure I get the difference, but I can add it,  is below OK ?

To enable this feature, the architecture needs to update their
special mappings calls to include the sealing flag and confirm
that it doesn't unmap/remap system mappings during the life
time of the process. The existence of this flag for an architecture
implies that it does not require the remapping of these system
mappings during process lifetime, so sealing these mappings is
safe from a kernel perspective. After the architecture enables this,
a distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to
manage access to the feature.

Thanks
-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-26 17:43             ` Lorenzo Stoakes
  2025-02-26 18:14               ` Lorenzo Stoakes
@ 2025-02-28  0:48               ` Jeff Xu
  2025-02-28 10:31                 ` Lorenzo Stoakes
  1 sibling, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-28  0:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Wed, Feb 26, 2025 at 9:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 26, 2025 at 09:17:10AM -0800, Jeff Xu wrote:
> > On Wed, Feb 26, 2025 at 9:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> > > > On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > > > > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > >
> > > > > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > > >
> > > > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > > > > >
> > > > > > > Production release testing passes on Android and Chrome OS.
> > > > > >
> > > > > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > > > > account for all the weird and wonderful arm64 devices out there in context of
> > > > > > upstream :)
> > > > > >
> > > > > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > > > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > > > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > > > > nervous of that.
> > > > > >
> > > > > Can you please point out where PPC munmap/mremap the vdso ?
> > > > >
> > > > > Previously, when you mentioned that, I thought you meant user space in
> > > > > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > > > > tried, but didn't find anything, hence asking.
> > > >
> > > > Jeff, please stick to replying to review. 'Have you looking through all
> > > > arm64-code'.
> > > >
I checked the kernel code and couldn't find any instances of kernel
unmap/remap system mapping in any architecture. But I could be wrong,
so I've also included developers from different architectures since
V1, and hoping to get some insight.

> > > > I ended up doing this myself yesterday and found no issues, as with x86-64.
> > > >
Thank you for double checking.

> > > > I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> > > > it. We can discuss it, and I can find specifics if + when you try to add
> > > > this to PPC.
> > > >
> > >
> > > PPC allows the vma to be munmapped then detects and falls back to the
> > > slower method, iirc.
> > >
> > Is this code in the kernel or userspace?
> >
> > If PPC doesn't want to create vdso for all its userspace apps, we
> > could instead "don't create" vdso during the execve call.
> >
> >
> > > They were against the removal of the fallback; other archs also have
> > > this infrastructure.  Really, if we fixed the fallback to work for
> > > all platforms then it would probably also remove the possibility of a
> > > remap over the VDSO being a problem (if it is today, which still isn't
> > > clear?).
> > >
> > Any past thread/communication about this that I can read ?
>
> Jeff, I'm sure you don't intend to, but I find it quite disrespectful that you
> ignored my feedback here (and elsewhere, regarding you ignoring 4 sets of
> feedback).
>
I'm just interested in the details :),  If we know why PPC needs to
unmap/remap vdso, then there are additional data points to consider,
when we develop pre-process level control for this feature. But sure,
we can postpone this.

> This?
>
> https://elixir.bootlin.com/linux/v6.13.4/source/arch/powerpc/kernel/vdso.c#L236
>
OK, you meant the failed case ? i.e. when install_special_mappings
failed ? That is a case that I haven't considered. It looks like error
handling, and I was expecting the install_special_mappings to never
fail, maybe I'm wrong here for PPC.

> Was [0] a relevant discussion?
>
Sorry, I'm kind of lost.  This link doesn't give a reason why PPC
needs to be unmap. If it is due to CRIU or  other user space apps,
that is not an architecture dependency, maybe a distribution
dependency.

Anyway, we can postpone this discussion for PPC, I don't mean to make
you spend more time responding to me. Please feel free to ignore this
one.

Thanks.
-Jeff


> [0]: https://lore.kernel.org/all/lhe2mky6ahlk2jzvvfjyongqiseelyx2uy7sbyuso6jcy3b2dq@7ju6cea62jgk/
>
> >
> > Thanks
> > -Jeff
> >
> >
> > > Thanks,
> > > Liam


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2025-02-26  5:42     ` your mail Lorenzo Stoakes
@ 2025-02-28  0:55       ` Jeff Xu
  2025-02-28  9:35         ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-28  0:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
	Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
	Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
	linux-hardening, linux-mm, Jorge Lucangeli Obes,
	Stephen Röttger, hch, ojeda, Thomas Weißschuh,
	adobriyan, johannes, Pedro Falcato, hca, Matthew Wilcox,
	anna-maria, mark.rutland, linus.walleij, Jason, Helge Deller,
	Randy Dunlap, davem, Peter Xu, f.fainelli, gerg, dave.hansen,
	Ingo Molnar, ardb, mhocko, 42.hyeyoo, peterz, ardb,
	Elliott Hughes, rientjes, Guenter Roeck, Michael Ellerman,
	Alexander Mikhalitsyn, Mike Rapoport

Hi Lorenzo,

On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
>
> > > 2. Tests - could you please add some tests to assert that mremap() fails
> > >    for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > >    skip the test should this be enabled, but this is not the same as a self
> > >    test. And obviously doesn't cover arm64.
> > >
> > >    This should be relatively strightforward right? You already have code
> > >    for finding out whether VDSO is msealed, so just use that to see if you
> > >    skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > >
> > >    Ideally these tests would cover all the cases you've changed.
> > >
> > It is not as easy.
> >
> > The config is disabled by default. And I don't know a way to detect
> > KCONFIG  from selftest itself. Without this, I can't reasonably
> > determine the test result.
>
> Please in future let's try to get this kind of response at the point of the
> request being made :) makes life much easier.
>
There might be miscommunication ?
This version is the first time you ask about adding a self test.

IIRC, you had comments about providing the details of what tests I did, in V4.
As a follow-up, I added a test-info section in the cover letter of V5

Though I have thought about adding a selftest since the beginning,
there are two problems:
1> This config is by default disabled,
2> No good pattern to check KCONFIG from selftest.

>
> So I think it is easy actually. As I say here (perhaps you missed it?) you
> literally already have code you added to the x86 test to prevent test
> failure.
>
> So you essentially look for [vdso] or whatever, then you look up to see if
> it is sealed, ensure an mremap() fails.
>
This suggestion doesn't test the core change of this series, which is
to enable mseal for vdso.

When the vdso is marked with "sl", mremap() will fail, that's part of
the mseal() business logic and belongs in mseal_test. The mseal_test
already covers the mseal, and this series doesn't change mseal.

As for the "sl", as I responded in the "refactor mseal_test" [1] , it
will be part of the verifying step:

[1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/

> Of course this doesn't check to see if it _should_ be enabled or not. I'm
> being nice by not _insisting_ we export a way for you to determine whether
> this config option is enabled or not for the sake of a test (since I don't
> want to hold up this series).
>
> But that'd be nice! Maybe in a
> /sys/kernel/security endpoint...
>
>
> ...but I think we can potentially add this later on so we don't hold things
> up here/add yet more to the series. I suspect you and Kees alike would
> prioritise getting _this series_ in at this point :)
>
> You could, if you wanted to, check to see if /proc/config.gz exists and
> zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> that, but you know kinda hacky.

Ya, it is hacky. None of the existing selftest uses this pattern, and
I'm not sure /proc/config.gz is enabled in the default kernel config.

One option is to have ChromeOS or Android to maintain an out-of-tree
test, since the configuration will be enabled there.

Option two is to create a new path:
tools/testing/selftests/sealsysmap. Then, add
CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
this path. This seems to be the preferred way by selftest, but we need
a new dir for that.

Option three is to add a self-test when we have a process-level opt-in
solution. This would allow the test to deterministically know whether
the vdso should be sealed or not.

>
> But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> fails here for system mappings for the sake of it. I guess this is, in
> effect, only checking mseal-ing works right? But it at least asserts the
> existence of the thing, and that it behaves.
>
> Later on, when you add some way of observing that it's enabled or not, you
> can extend the test to check this.

I think it is best that we don't add a test that doesn't actually
check the code change. Do you think one of the above three options
works ? maybe the second option (with a new path) ?

In any case, I think the risk is low, and the code changes are quite
simple, and fully tested.

Thanks.
-Jeff.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2025-02-28  0:55       ` Jeff Xu
@ 2025-02-28  9:35         ` Lorenzo Stoakes
  2025-02-28 17:24           ` Jeff Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28  9:35 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
	Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
	Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
	linux-hardening, linux-mm, Jorge Lucangeli Obes,
	Stephen Röttger, hch, ojeda, Thomas Weißschuh,
	adobriyan, johannes, Pedro Falcato, hca, Matthew Wilcox,
	anna-maria, mark.rutland, linus.walleij, Jason, Helge Deller,
	Randy Dunlap, davem, Peter Xu, f.fainelli, gerg, dave.hansen,
	Ingo Molnar, ardb, mhocko, 42.hyeyoo, peterz, ardb,
	Elliott Hughes, rientjes, Guenter Roeck, Michael Ellerman,
	Alexander Mikhalitsyn, Mike Rapoport

On Thu, Feb 27, 2025 at 04:55:06PM -0800, Jeff Xu wrote:
> Hi Lorenzo,
>
> On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
> >
> > > > 2. Tests - could you please add some tests to assert that mremap() fails
> > > >    for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > > >    skip the test should this be enabled, but this is not the same as a self
> > > >    test. And obviously doesn't cover arm64.
> > > >
> > > >    This should be relatively strightforward right? You already have code
> > > >    for finding out whether VDSO is msealed, so just use that to see if you
> > > >    skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > > >
> > > >    Ideally these tests would cover all the cases you've changed.
> > > >
> > > It is not as easy.
> > >
> > > The config is disabled by default. And I don't know a way to detect
> > > KCONFIG  from selftest itself. Without this, I can't reasonably
> > > determine the test result.
> >
> > Please in future let's try to get this kind of response at the point of the
> > request being made :) makes life much easier.
> >
> There might be miscommunication ?
> This version is the first time you ask about adding a self test.

Yeah I thought I'd been clear but this might _very well_ have been me not
having been, so apologies if so.

I mean 'make sure it's tested' is an overloaded term right? So fact you've
tested on android, chromeos, etc. implies 'tested', but also self
tests/kunit/whatever.

>
> IIRC, you had comments about providing the details of what tests I did, in V4.
> As a follow-up, I added a test-info section in the cover letter of V5

Thanks. Appreciate that! And this really does point towards a
miscommunication on my part, will try to be super explicit in future.

>
> Though I have thought about adding a selftest since the beginning,
> there are two problems:
> 1> This config is by default disabled,
> 2> No good pattern to check KCONFIG from selftest.

Yeah agreed, it's a TOTAL pain.

I wish we had a better way of doing this. Maybe a self-volunteering
exercise (*goes to write on writeboard :P*)

>
> >
> > So I think it is easy actually. As I say here (perhaps you missed it?) you
> > literally already have code you added to the x86 test to prevent test
> > failure.
> >
> > So you essentially look for [vdso] or whatever, then you look up to see if
> > it is sealed, ensure an mremap() fails.
> >
> This suggestion doesn't test the core change of this series, which is
> to enable mseal for vdso.

Right, and thinking about it, what does this test? Just that mseal works
right?

It's sort of implicit that, if a VMA is sealed, the seal should work (or
rather, tested in mseal tests themselves, rather than mseal system
settings).

>
> When the vdso is marked with "sl", mremap() will fail, that's part of
> the mseal() business logic and belongs in mseal_test. The mseal_test
> already covers the mseal, and this series doesn't change mseal.
>
> As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> will be part of the verifying step:
>
> [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/

OK cool thanks. I will look at this when I can, I'm just snowed under
pre-LSF and have been sick so backlog, backlog as discussed off-list. But
it's not been forgotten (whiteboard etc. etc.)

>
> > Of course this doesn't check to see if it _should_ be enabled or not. I'm
> > being nice by not _insisting_ we export a way for you to determine whether
> > this config option is enabled or not for the sake of a test (since I don't
> > want to hold up this series).
> >
> > But that'd be nice! Maybe in a
> > /sys/kernel/security endpoint...
> >
> >
> > ...but I think we can potentially add this later on so we don't hold things
> > up here/add yet more to the series. I suspect you and Kees alike would
> > prioritise getting _this series_ in at this point :)
> >
> > You could, if you wanted to, check to see if /proc/config.gz exists and
> > zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> > that, but you know kinda hacky.
>
> Ya, it is hacky. None of the existing selftest uses this pattern, and
> I'm not sure /proc/config.gz is enabled in the default kernel config.

Yeah and I'm not sure I even like my hacky suggestion here, I mean let's be
honest, it sucks.

>
> One option is to have ChromeOS or Android to maintain an out-of-tree
> test, since the configuration will be enabled there.

Nah haha, though of course you can do what you want. Really want something
upstream.

>
> Option two is to create a new path:
> tools/testing/selftests/sealsysmap. Then, add
> CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> this path. This seems to be the preferred way by selftest, but we need
> a new dir for that.

OK I like option 2 let's do this.

But let's call it mseal_system_mappings (yes I"m being nitty again :) I
really want to try to _explicitly_ say 'mseal' because we have other forms
of sealing.

Not your fault, but we overload terms like _crazy_ in mm and need to be
cautious as not to confuse vs. e.g. memfd seals.


>
> Option three is to add a self-test when we have a process-level opt-in
> solution. This would allow the test to deterministically know whether
> the vdso should be sealed or not.

Yeah one for future.

>
> >
> > But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> > fails here for system mappings for the sake of it. I guess this is, in
> > effect, only checking mseal-ing works right? But it at least asserts the
> > existence of the thing, and that it behaves.
> >
> > Later on, when you add some way of observing that it's enabled or not, you
> > can extend the test to check this.
>
> I think it is best that we don't add a test that doesn't actually
> check the code change. Do you think one of the above three options
> works ? maybe the second option (with a new path) ?

Yeah I actually agree on reflection. And yes agreed option 2 is great,
thanks!

>
> In any case, I think the risk is low, and the code changes are quite
> simple, and fully tested.

Yeah indeed, but I'd really like _something_ if possible. If option 2 is
relatively quick let's get that sorted out!

>
> Thanks.
> -Jeff.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 4/7] mseal, system mappings: enable arm64
  2025-02-28  0:48               ` Jeff Xu
@ 2025-02-28 10:31                 ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 10:31 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, akpm, keescook, jannh, torvalds, vbabka,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Thu, Feb 27, 2025 at 04:48:23PM -0800, Jeff Xu wrote:
> On Wed, Feb 26, 2025 at 9:43 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Feb 26, 2025 at 09:17:10AM -0800, Jeff Xu wrote:
> > > On Wed, Feb 26, 2025 at 9:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250226 00:26]:
> > > > > On Tue, Feb 25, 2025 at 02:26:50PM -0800, Jeff Xu wrote:
> > > > > > On Mon, Feb 24, 2025 at 10:20 PM Lorenzo Stoakes
> > > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > >
> > > > > > > On Mon, Feb 24, 2025 at 10:52:43PM +0000, jeffxu@chromium.org wrote:
> > > > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > > > >
> > > > > > > > Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering
> > > > > > > > the vdso, vvar, and compat-mode vectors and sigpage mappings.
> > > > > > > >
> > > > > > > > Production release testing passes on Android and Chrome OS.
> > > > > > >
> > > > > > > This is pretty limited (yes yes I know android is massive etc. but we must
> > > > > > > account for all the weird and wonderful arm64 devices out there in context of
> > > > > > > upstream :)
> > > > > > >
> > > > > > > Have you looking through all arm64-code relating to vdso, vvar, compat-mode
> > > > > > > vectors, sigpage mapping and ensured nothing kernel-side relies upon relocation?
> > > > > > > Some arches actually seem to want to do this. Pretty sure PPC does... so a bit
> > > > > > > nervous of that.
> > > > > > >
> > > > > > Can you please point out where PPC munmap/mremap the vdso ?
> > > > > >
> > > > > > Previously, when you mentioned that, I thought you meant user space in
> > > > > > PPC, I didn't realize that you meant that kernel code in PPC.  I
> > > > > > tried, but didn't find anything, hence asking.
> > > > >
> > > > > Jeff, please stick to replying to review. 'Have you looking through all
> > > > > arm64-code'.
> > > > >
> I checked the kernel code and couldn't find any instances of kernel
> unmap/remap system mapping in any architecture. But I could be wrong,
> so I've also included developers from different architectures since
> V1, and hoping to get some insight.

Thanks, yeah me also. Perhaps you said somewhere but I missed, apologies if
so. Might be worth adding explicitly to commit messages, though I think you
allude to it actually.

>
> > > > > I ended up doing this myself yesterday and found no issues, as with x86-64.
> > > > >
> Thank you for double checking.

No problem, in the went was fairly quick.

>
> > > > > I said I'm _pretty sure_ PPC does this. Liam mentioned something about
> > > > > it. We can discuss it, and I can find specifics if + when you try to add
> > > > > this to PPC.
> > > > >
> > > >
> > > > PPC allows the vma to be munmapped then detects and falls back to the
> > > > slower method, iirc.
> > > >
> > > Is this code in the kernel or userspace?
> > >
> > > If PPC doesn't want to create vdso for all its userspace apps, we
> > > could instead "don't create" vdso during the execve call.
> > >
> > >
> > > > They were against the removal of the fallback; other archs also have
> > > > this infrastructure.  Really, if we fixed the fallback to work for
> > > > all platforms then it would probably also remove the possibility of a
> > > > remap over the VDSO being a problem (if it is today, which still isn't
> > > > clear?).
> > > >
> > > Any past thread/communication about this that I can read ?
> >
> > Jeff, I'm sure you don't intend to, but I find it quite disrespectful that you
> > ignored my feedback here (and elsewhere, regarding you ignoring 4 sets of
> > feedback).
> >
> I'm just interested in the details :),  If we know why PPC needs to
> unmap/remap vdso, then there are additional data points to consider,
> when we develop pre-process level control for this feature. But sure,
> we can postpone this.
>
> > This?
> >
> > https://elixir.bootlin.com/linux/v6.13.4/source/arch/powerpc/kernel/vdso.c#L236
> >
> OK, you meant the failed case ? i.e. when install_special_mappings
> failed ? That is a case that I haven't considered. It looks like error
> handling, and I was expecting the install_special_mappings to never
> fail, maybe I'm wrong here for PPC.
>
> > Was [0] a relevant discussion?
> >
> Sorry, I'm kind of lost.  This link doesn't give a reason why PPC
> needs to be unmap. If it is due to CRIU or  other user space apps,
> that is not an architecture dependency, maybe a distribution
> dependency.

Yeah I actually think the web might be somewhat tangled here, I don't know
but...

>
> Anyway, we can postpone this discussion for PPC, I don't mean to make
> you spend more time responding to me. Please feel free to ignore this
> one.

...agreed, let's postpone this to there, at least hopefully these links provide
some basis for further research! :)

>
> Thanks.
> -Jeff
>
>
> > [0]: https://lore.kernel.org/all/lhe2mky6ahlk2jzvvfjyongqiseelyx2uy7sbyuso6jcy3b2dq@7ju6cea62jgk/
> >
> > >
> > > Thanks
> > > -Jeff
> > >
> > >
> > > > Thanks,
> > > > Liam

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 1/7] mseal, system mappings: kernel config and header change
  2025-02-28  0:04         ` Jeff Xu
@ 2025-02-28 10:32           ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 10:32 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Thu, Feb 27, 2025 at 04:04:03PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 10:04 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Feb 25, 2025 at 05:33:24PM -0800, Jeff Xu wrote:
> > > On Mon, Feb 24, 2025 at 10:05 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > > > +     bool
> > > > > +     help
> > > > > +       Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> > > > > +
> > > > > +       A 64-bit kernel is required for the memory sealing feature.
> > > > > +       No specific hardware features from the CPU are needed.
> > > > > +
> > > > > +       To enable this feature, the architecture needs to update their
> > > > > +       special mappings calls to include the sealing flag and confirm
> > > > > +       that it doesn't unmap/remap system mappings during the life
> > > > > +       time of the process. After the architecture enables this, a
> > > > > +       distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> > > > > +       to the feature.
> > > >
> > > > Architectures also need to be confirmed not to require any form of VDSO
> > > > relocation, which as discussed in previous series some arches appear to
> > > > need to do. I'd mention that here.
> > > >
> > > This might need clarification, the system mapping includes vdso, right
> > > ? Why the focus on vdso ?
> >
> > My mistake, I thought scope was more limited than this when I first
> > looked. Please disregard the focus on VDSO here... :)
> >
> > >
> > > The sentence  "... it doesn't unmap/remap system mappings during the
> > > lifetime of the process."  already cover what you want here, I think.
> > >
> >
> > Right, I guess it just doesn't quite _emphasise_ it enough for me. Something
> > like the below would really help bring that out:
> >
> >         The existing of this flag for an architecture implies that it does not
> >         require the remapping of these system mappings during process lifetime,
> >         so sealing these mappings is safe from a kernel perspective.
> >
> I'm not sure I get the difference, but I can add it,  is below OK ?
>
> To enable this feature, the architecture needs to update their
> special mappings calls to include the sealing flag and confirm
> that it doesn't unmap/remap system mappings during the life
> time of the process. The existence of this flag for an architecture
> implies that it does not require the remapping of these system
> mappings during process lifetime, so sealing these mappings is
> safe from a kernel perspective. After the architecture enables this,
> a distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to
> manage access to the feature.

Sounds great, cheers!

>
> Thanks
> -Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 0/7] mseal system mappings
  2025-02-27 23:43       ` Jeff Xu
@ 2025-02-28 10:32         ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 10:32 UTC (permalink / raw)
  To: Jeff Xu
  Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
	adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Thu, Feb 27, 2025 at 03:43:07PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 10:01 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Feb 25, 2025 at 04:17:01PM -0800, Jeff Xu wrote:
> > > On Tue, Feb 25, 2025 at 2:32 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > BTW can we please drop the 'mseal, system mappings' prefixes on this
> > > > series, it's really weird and makes it really hard for me to actually read
> > > > the individual summary lines for each commit. 'mseal:' will do.
> > > >
> > >
> > > I am not sure.
> > > I had comments about adding mseal, system mappings, as prefixes, and I
> > > think it is reasonable.
> >
> > No it's really horrible (sorry I know it's hyperbolic but it really really looks
> > horrid to me) , I've never seen prefixes like that before in mm in my life and I
> > don't think it adds anything.
> >
> > I also find it MIGHTY confusing.
> >
> > Please remove this :) you can git log the relevant files to see the conventions
> > people use. Typically xxx: has something really short and sweet for the 'xxx'.
> >
> > I realise this is subjective, but it's a small thing and would be helpful for
> > parsing your series at a glance.
> >
> I need a prefix to group the patches, especially when the first patch
> is just a config change. "mseal" won't work because this patch isn't
> about core business logic of mseal.
>
> How about "mseal sysmap:"?

OK let's go with that as a compromise. This isn't a huge big deal.

>
> -Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH v7 6/7] mseal, system mappings: uprobe mapping
  2025-02-27 23:38               ` Jeff Xu
@ 2025-02-28 10:39                 ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 10:39 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Oleg Nesterov, akpm, keescook, jannh, torvalds, vbabka,
	Liam.Howlett, adhemerval.zanella, avagin, benjamin, linux-kernel,
	linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
	thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
	willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
	rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
	ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
	mpe, aleksandr.mikhalitsyn, mike.rapoport

On Thu, Feb 27, 2025 at 03:38:47PM -0800, Jeff Xu wrote:
> On Wed, Feb 26, 2025 at 10:25 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Feb 26, 2025 at 07:20:50PM +0100, Oleg Nesterov wrote:
> > > On 02/26, Lorenzo Stoakes wrote:
> > > >
> > > > Like I said, Jeff opposes the change. I disagree with him, and agree with you,
> > > > because this is very silly.
> > > >
> > > > But I don't want to hold up this series with that discussion (this is for his
> > > > sake...)
> > >
> > > Neither me, so lets go with VM_SEALED_SYSMAP.
> > >
> > > My only objection is that
> > >
> > >       vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO;
> > >       vm_flags |= VM_SEALED_SYSMAP;
> > >
> > > looks unnecessarily confusing to me,
> > >
> > >       vm_flags = VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP;
> > >
> > > or just
> > >
> > >       vma = _install_special_mapping(...,
> > >                               VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED_SYSMAP,
> > >                               ...
> > >
> > > looks more readable. But this is cosmetic/subjective, so I won't argue/insist.
> >
> > Agreed. This would be good.
> >
> > >
> > > > Jeff - perhaps drop this and let's return to it in a follow up so this series
> > > > isn't held up?
> > >
> > > Up to you and Jeff.
> > >
> > > But this patch looks "natural" to me in this series.
> >
> > OK, I mean in that case I'm ok with it as-is, as you confirms there's no
> > issue, I've looked at the code and there's no issue.
> >
> > It was only if we wanted to try the VM_SEALED thing, i.e. _always_ seal
> > then it'd do better outside of the series as there'd be a discussion about
> > maybe changing this CONFIG_64BIT thing yada yada.
> >
> > >
> > > Oleg.
> > >
> >
> > Jeff - in that case, do NOT drop this one :P but do please look at the
> > above style nit.
> >
> Ok.

Thanks :)

>
>
> > Let's keep things moving... :)


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2025-02-28  9:35         ` Lorenzo Stoakes
@ 2025-02-28 17:24           ` Jeff Xu
  2025-02-28 17:30             ` Lorenzo Stoakes
  0 siblings, 1 reply; 122+ messages in thread
From: Jeff Xu @ 2025-02-28 17:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
	Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
	Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
	linux-hardening, linux-mm, Jorge Lucangeli Obes,
	Stephen Röttger, hch, ojeda, Thomas Weißschuh,
	adobriyan, johannes, Pedro Falcato, hca, Matthew Wilcox,
	anna-maria, mark.rutland, linus.walleij, Jason, Helge Deller,
	Randy Dunlap, davem, Peter Xu, f.fainelli, gerg, dave.hansen,
	Ingo Molnar, ardb, mhocko, 42.hyeyoo, peterz, ardb,
	Elliott Hughes, rientjes, Guenter Roeck, Michael Ellerman,
	Alexander Mikhalitsyn, Mike Rapoport

On Fri, Feb 28, 2025 at 1:35 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Feb 27, 2025 at 04:55:06PM -0800, Jeff Xu wrote:
> > Hi Lorenzo,
> >
> > On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
> > >
> > > > > 2. Tests - could you please add some tests to assert that mremap() fails
> > > > >    for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > > > >    skip the test should this be enabled, but this is not the same as a self
> > > > >    test. And obviously doesn't cover arm64.
> > > > >
> > > > >    This should be relatively strightforward right? You already have code
> > > > >    for finding out whether VDSO is msealed, so just use that to see if you
> > > > >    skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > > > >
> > > > >    Ideally these tests would cover all the cases you've changed.
> > > > >
> > > > It is not as easy.
> > > >
> > > > The config is disabled by default. And I don't know a way to detect
> > > > KCONFIG  from selftest itself. Without this, I can't reasonably
> > > > determine the test result.
> > >
> > > Please in future let's try to get this kind of response at the point of the
> > > request being made :) makes life much easier.
> > >
> > There might be miscommunication ?
> > This version is the first time you ask about adding a self test.
>
> Yeah I thought I'd been clear but this might _very well_ have been me not
> having been, so apologies if so.
>
> I mean 'make sure it's tested' is an overloaded term right? So fact you've
> tested on android, chromeos, etc. implies 'tested', but also self
> tests/kunit/whatever.
>
> >
> > IIRC, you had comments about providing the details of what tests I did, in V4.
> > As a follow-up, I added a test-info section in the cover letter of V5
>
> Thanks. Appreciate that! And this really does point towards a
> miscommunication on my part, will try to be super explicit in future.
>
> >
> > Though I have thought about adding a selftest since the beginning,
> > there are two problems:
> > 1> This config is by default disabled,
> > 2> No good pattern to check KCONFIG from selftest.
>
> Yeah agreed, it's a TOTAL pain.
>
> I wish we had a better way of doing this. Maybe a self-volunteering
> exercise (*goes to write on writeboard :P*)
>
> >
> > >
> > > So I think it is easy actually. As I say here (perhaps you missed it?) you
> > > literally already have code you added to the x86 test to prevent test
> > > failure.
> > >
> > > So you essentially look for [vdso] or whatever, then you look up to see if
> > > it is sealed, ensure an mremap() fails.
> > >
> > This suggestion doesn't test the core change of this series, which is
> > to enable mseal for vdso.
>
> Right, and thinking about it, what does this test? Just that mseal works
> right?
>
> It's sort of implicit that, if a VMA is sealed, the seal should work (or
> rather, tested in mseal tests themselves, rather than mseal system
> settings).
>
> >
> > When the vdso is marked with "sl", mremap() will fail, that's part of
> > the mseal() business logic and belongs in mseal_test. The mseal_test
> > already covers the mseal, and this series doesn't change mseal.
> >
> > As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> > will be part of the verifying step:
> >
> > [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
>
> OK cool thanks. I will look at this when I can, I'm just snowed under
> pre-LSF and have been sick so backlog, backlog as discussed off-list. But
> it's not been forgotten (whiteboard etc. etc.)
>
Ya, no worry about that review, please take time to recover, I will wait.
And appreciate your time and take priority for reviewing this series.

> >
> > > Of course this doesn't check to see if it _should_ be enabled or not. I'm
> > > being nice by not _insisting_ we export a way for you to determine whether
> > > this config option is enabled or not for the sake of a test (since I don't
> > > want to hold up this series).
> > >
> > > But that'd be nice! Maybe in a
> > > /sys/kernel/security endpoint...
> > >
> > >
> > > ...but I think we can potentially add this later on so we don't hold things
> > > up here/add yet more to the series. I suspect you and Kees alike would
> > > prioritise getting _this series_ in at this point :)
> > >
> > > You could, if you wanted to, check to see if /proc/config.gz exists and
> > > zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> > > that, but you know kinda hacky.
> >
> > Ya, it is hacky. None of the existing selftest uses this pattern, and
> > I'm not sure /proc/config.gz is enabled in the default kernel config.
>
> Yeah and I'm not sure I even like my hacky suggestion here, I mean let's be
> honest, it sucks.
>
> >
> > One option is to have ChromeOS or Android to maintain an out-of-tree
> > test, since the configuration will be enabled there.
>
> Nah haha, though of course you can do what you want. Really want something
> upstream.
>
> >
> > Option two is to create a new path:
> > tools/testing/selftests/sealsysmap. Then, add
> > CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> > this path. This seems to be the preferred way by selftest, but we need
> > a new dir for that.
>
> OK I like option 2 let's do this.
>
> But let's call it mseal_system_mappings (yes I"m being nitty again :) I
> really want to try to _explicitly_ say 'mseal' because we have other forms
> of sealing.
>
Sure.

If long path names aren't a problem, I will use mseal_system_mappings,
otherwise mseal_sysmap.

> Not your fault, but we overload terms like _crazy_ in mm and need to be
> cautious as not to confuse vs. e.g. memfd seals.
>
>
> >
> > Option three is to add a self-test when we have a process-level opt-in
> > solution. This would allow the test to deterministically know whether
> > the vdso should be sealed or not.
>
> Yeah one for future.
>
> >
> > >
> > > But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> > > fails here for system mappings for the sake of it. I guess this is, in
> > > effect, only checking mseal-ing works right? But it at least asserts the
> > > existence of the thing, and that it behaves.
> > >
> > > Later on, when you add some way of observing that it's enabled or not, you
> > > can extend the test to check this.
> >
> > I think it is best that we don't add a test that doesn't actually
> > check the code change. Do you think one of the above three options
> > works ? maybe the second option (with a new path) ?
>
> Yeah I actually agree on reflection. And yes agreed option 2 is great,
> thanks!
>
> >
> > In any case, I think the risk is low, and the code changes are quite
> > simple, and fully tested.
>
> Yeah indeed, but I'd really like _something_ if possible. If option 2 is
> relatively quick let's get that sorted out!
>
Great ! I will work on option 2.

Thanks
-Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2025-02-28 17:24           ` Jeff Xu
@ 2025-02-28 17:30             ` Lorenzo Stoakes
  0 siblings, 0 replies; 122+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 17:30 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
	Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
	Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
	linux-hardening, linux-mm, Jorge Lucangeli Obes,
	Stephen Röttger, hch, ojeda, Thomas Weißschuh,
	adobriyan, johannes, Pedro Falcato, hca, Matthew Wilcox,
	anna-maria, mark.rutland, linus.walleij, Jason, Helge Deller,
	Randy Dunlap, davem, Peter Xu, f.fainelli, gerg, dave.hansen,
	Ingo Molnar, ardb, mhocko, 42.hyeyoo, peterz, ardb,
	Elliott Hughes, rientjes, Guenter Roeck, Michael Ellerman,
	Alexander Mikhalitsyn, Mike Rapoport

On Fri, Feb 28, 2025 at 09:24:11AM -0800, Jeff Xu wrote:

[snip]

> > >
> > > When the vdso is marked with "sl", mremap() will fail, that's part of
> > > the mseal() business logic and belongs in mseal_test. The mseal_test
> > > already covers the mseal, and this series doesn't change mseal.
> > >
> > > As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> > > will be part of the verifying step:
> > >
> > > [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
> >
> > OK cool thanks. I will look at this when I can, I'm just snowed under
> > pre-LSF and have been sick so backlog, backlog as discussed off-list. But
> > it's not been forgotten (whiteboard etc. etc.)
> >
> Ya, no worry about that review, please take time to recover, I will wait.
> And appreciate your time and take priority for reviewing this series.

Thanks :)

[snip]

> > >
> > > Option two is to create a new path:
> > > tools/testing/selftests/sealsysmap. Then, add
> > > CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> > > this path. This seems to be the preferred way by selftest, but we need
> > > a new dir for that.
> >
> > OK I like option 2 let's do this.
> >
> > But let's call it mseal_system_mappings (yes I"m being nitty again :) I
> > really want to try to _explicitly_ say 'mseal' because we have other forms
> > of sealing.
> >
> Sure.

Thanks!

>
> If long path names aren't a problem, I will use mseal_system_mappings,
> otherwise mseal_sysmap.

I think long form is fine.

[snip]

> > >
> > > In any case, I think the risk is low, and the code changes are quite
> > > simple, and fully tested.
> >
> > Yeah indeed, but I'd really like _something_ if possible. If option 2 is
> > relatively quick let's get that sorted out!
> >
> Great ! I will work on option 2.

Perfect cheers!

>
> Thanks
> -Jeff


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2023-05-16 22:47   ` Thomas Gleixner
@ 2023-05-23 13:46     ` Liam R. Howlett
  0 siblings, 0 replies; 122+ messages in thread
From: Liam R. Howlett @ 2023-05-23 13:46 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni

* Thomas Gleixner <tglx@linutronix.de> [230516 18:48]:
> On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> > * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
> >> The documentation of mt_next() claims that it starts the search at the
> >> provided index. That's incorrect as it starts the search after the provided
> >> index.
> >> 
> >> The documentation of mt_find() is slightly confusing. "Handles locking" is
> >> not really helpful as it does not explain how the "locking" works.
> >
> > More locking notes can be found in Documentation/core-api/maple_tree.rst
> > which lists mt_find() under the "Takes RCU read lock" list.  I'm okay
> > with duplicating the comment of taking the RCU read lock in here.
> 
> Without a reference to the actual locking documentation such comments
> are not super helpful.

Noted.  A reference to the larger document should probably be added.
Thanks.

> 
> >> Fix similar issues for mt_find_after() and mt_prev().
> >> 
> >> Remove the completely confusing and pointless "Note: Will not return the
> >> zero entry." comment from mt_for_each() and document @__index correctly.
> >
> > The zero entry concept is an advanced API concept which allows you to
> > store something that cannot be seen by the mt_* family of users, so it
> > will not be returned and, instead, it will return NULL.  Think of it as
> > a reservation for an entry that isn't fully initialized.  Perhaps it
> > should read "Will not return the XA_ZERO_ENTRY" ?
> >>
> >> - *
> >> - * Note: Will not return the zero entry.
> >
> > This function "will not return the zero entry", meaning it will return
> > NULL if xa_is_zero(entry).
> 
> If I understand correctly, this translates to:
> 
>   This iterator skips entries, which have been reserved for future use
>   but have not yet been fully initialized.
> 
> Right?

Well, that's one use of using the XA_ZERO_ENTRY, but it's really up to
the user to decide why they are adding something that returns NULL in a
specific range for the not-advanced API.  It might be worth adding the
XA_ZERO_ENTRY in here, since that's the only special value right now?

> 
> >> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
> >>   * mt_find() - Search from the start up until an entry is found.
> >>   * @mt: The maple tree
> >>   * @index: Pointer which contains the start location of the search
> >> - * @max: The maximum value to check
> >> + * @max: The maximum value of the search range
> >> + *
> >> + * Takes RCU read lock internally to protect the search, which does not
> >> + * protect the returned pointer after dropping RCU read lock.
> >>   *
> >> - * Handles locking.  @index will be incremented to one beyond the range.
> >> + * In case that an entry is found @index contains the index of the found
> >> + * entry plus one, so it can be used as iterator index to find the next
> >> + * entry.
> >
> > What about:
> > "In case that an entry is found @index contains the last index of the
> > found entry plus one"
> 
> Still confusing to the casual reader like me :)
> 
>     "In case that an entry is found @index is updated to point to the next
>      possible entry independent whether the found entry is occupying a
>      single index or a range if indices."
> 
> Hmm?

That makes sense to me.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2023-05-15 19:27 ` your mail Liam R. Howlett
  2023-05-15 21:16   ` Thomas Gleixner
@ 2023-05-16 22:47   ` Thomas Gleixner
  2023-05-23 13:46     ` Liam R. Howlett
  1 sibling, 1 reply; 122+ messages in thread
From: Thomas Gleixner @ 2023-05-16 22:47 UTC (permalink / raw)
  To: Liam R. Howlett; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni

On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
>> The documentation of mt_next() claims that it starts the search at the
>> provided index. That's incorrect as it starts the search after the provided
>> index.
>> 
>> The documentation of mt_find() is slightly confusing. "Handles locking" is
>> not really helpful as it does not explain how the "locking" works.
>
> More locking notes can be found in Documentation/core-api/maple_tree.rst
> which lists mt_find() under the "Takes RCU read lock" list.  I'm okay
> with duplicating the comment of taking the RCU read lock in here.

Without a reference to the actual locking documentation such comments
are not super helpful.

>> Fix similar issues for mt_find_after() and mt_prev().
>> 
>> Remove the completely confusing and pointless "Note: Will not return the
>> zero entry." comment from mt_for_each() and document @__index correctly.
>
> The zero entry concept is an advanced API concept which allows you to
> store something that cannot be seen by the mt_* family of users, so it
> will not be returned and, instead, it will return NULL.  Think of it as
> a reservation for an entry that isn't fully initialized.  Perhaps it
> should read "Will not return the XA_ZERO_ENTRY" ?
>>
>> - *
>> - * Note: Will not return the zero entry.
>
> This function "will not return the zero entry", meaning it will return
> NULL if xa_is_zero(entry).

If I understand correctly, this translates to:

  This iterator skips entries, which have been reserved for future use
  but have not yet been fully initialized.

Right?

>> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
>>   * mt_find() - Search from the start up until an entry is found.
>>   * @mt: The maple tree
>>   * @index: Pointer which contains the start location of the search
>> - * @max: The maximum value to check
>> + * @max: The maximum value of the search range
>> + *
>> + * Takes RCU read lock internally to protect the search, which does not
>> + * protect the returned pointer after dropping RCU read lock.
>>   *
>> - * Handles locking.  @index will be incremented to one beyond the range.
>> + * In case that an entry is found @index contains the index of the found
>> + * entry plus one, so it can be used as iterator index to find the next
>> + * entry.
>
> What about:
> "In case that an entry is found @index contains the last index of the
> found entry plus one"

Still confusing to the casual reader like me :)

    "In case that an entry is found @index is updated to point to the next
     possible entry independent whether the found entry is occupying a
     single index or a range if indices."

Hmm?

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2023-05-15 19:27 ` your mail Liam R. Howlett
@ 2023-05-15 21:16   ` Thomas Gleixner
  2023-05-16 22:47   ` Thomas Gleixner
  1 sibling, 0 replies; 122+ messages in thread
From: Thomas Gleixner @ 2023-05-15 21:16 UTC (permalink / raw)
  To: Liam R. Howlett; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni

Liam!

On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
>>Also the
>> documentation of index talks about a range, while in reality the index
>> is updated on a succesful search to the index of the found entry plus one.
>
> This is a range based tree, so the index is incremented beyond the last
> entry which would return the entry.  That is, if you search for 5 and
> there is an entry at 4-100, the index would be 101 after the search -
> or, one beyond the range.  If you have single entries at a specific
> index, then index would be equal to last and it would be one beyond the
> index you found - but only because index == last in this case.

Thanks for the explanation

>> 
>> Fix similar issues for mt_find_after() and mt_prev().
>> 
>> Remove the completely confusing and pointless "Note: Will not return the
>> zero entry." comment from mt_for_each() and document @__index correctly.
>
> The zero entry concept is an advanced API concept which allows you to
> store something that cannot be seen by the mt_* family of users, so it
> will not be returned and, instead, it will return NULL.  Think of it as
> a reservation for an entry that isn't fully initialized.  Perhaps it
> should read "Will not return the XA_ZERO_ENTRY" ?

That makes actually sense.

>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -659,10 +659,8 @@ void *mt_next(struct maple_tree *mt, uns
>>   * mt_for_each - Iterate over each entry starting at index until max.
>>   * @__tree: The Maple Tree
>>   * @__entry: The current entry
>> - * @__index: The index to update to track the location in the tree
>> + * @__index: The index to start the search from. Subsequently used as iterator.
>>   * @__max: The maximum limit for @index
>> - *
>> - * Note: Will not return the zero entry.
>
> This function "will not return the zero entry", meaning it will return
> NULL if xa_is_zero(entry).

Ack.

>> + * Takes RCU read lock internally to protect the search, which does not
>> + * protect the returned pointer after dropping RCU read lock.
>>   *
>> - * Handles locking.  @index will be incremented to one beyond the range.
>> + * In case that an entry is found @index contains the index of the found
>> + * entry plus one, so it can be used as iterator index to find the next
>> + * entry.
>
> What about:
> "In case that an entry is found @index contains the last index of the
> found entry plus one"

Something like that, yes.

Let me try again.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2023-05-10 19:01 [PATCH] maple_tree: Fix a few documentation issues, Thomas Gleixner
@ 2023-05-15 19:27 ` Liam R. Howlett
  2023-05-15 21:16   ` Thomas Gleixner
  2023-05-16 22:47   ` Thomas Gleixner
  0 siblings, 2 replies; 122+ messages in thread
From: Liam R. Howlett @ 2023-05-15 19:27 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni

* Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
> The documentation of mt_next() claims that it starts the search at the
> provided index. That's incorrect as it starts the search after the provided
> index.
> 
> The documentation of mt_find() is slightly confusing. "Handles locking" is
> not really helpful as it does not explain how the "locking" works.

More locking notes can be found in Documentation/core-api/maple_tree.rst
which lists mt_find() under the "Takes RCU read lock" list.  I'm okay
with duplicating the comment of taking the RCU read lock in here.

>Also the
> documentation of index talks about a range, while in reality the index
> is updated on a succesful search to the index of the found entry plus one.

This is a range based tree, so the index is incremented beyond the last
entry which would return the entry.  That is, if you search for 5 and
there is an entry at 4-100, the index would be 101 after the search -
or, one beyond the range.  If you have single entries at a specific
index, then index would be equal to last and it would be one beyond the
index you found - but only because index == last in this case.

> 
> Fix similar issues for mt_find_after() and mt_prev().
> 
> Remove the completely confusing and pointless "Note: Will not return the
> zero entry." comment from mt_for_each() and document @__index correctly.

The zero entry concept is an advanced API concept which allows you to
store something that cannot be seen by the mt_* family of users, so it
will not be returned and, instead, it will return NULL.  Think of it as
a reservation for an entry that isn't fully initialized.  Perhaps it
should read "Will not return the XA_ZERO_ENTRY" ?

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/maple_tree.h |    4 +---
>  lib/maple_tree.c           |   23 ++++++++++++++++++-----
>  2 files changed, 19 insertions(+), 8 deletions(-)
> 
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -659,10 +659,8 @@ void *mt_next(struct maple_tree *mt, uns
>   * mt_for_each - Iterate over each entry starting at index until max.
>   * @__tree: The Maple Tree
>   * @__entry: The current entry
> - * @__index: The index to update to track the location in the tree
> + * @__index: The index to start the search from. Subsequently used as iterator.
>   * @__max: The maximum limit for @index
> - *
> - * Note: Will not return the zero entry.

This function "will not return the zero entry", meaning it will return
NULL if xa_is_zero(entry).

>   */
>  #define mt_for_each(__tree, __entry, __index, __max) \
>  	for (__entry = mt_find(__tree, &(__index), __max); \
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5947,7 +5947,10 @@ EXPORT_SYMBOL_GPL(mas_next);
>   * @index: The start index
>   * @max: The maximum index to check
>   *
> - * Return: The entry at @index or higher, or %NULL if nothing is found.
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
> + *
> + * Return: The entry higher than @index or %NULL if nothing is found.
>   */
>  void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
>  {
> @@ -6012,7 +6015,10 @@ EXPORT_SYMBOL_GPL(mas_prev);
>   * @index: The start index
>   * @min: The minimum index to check
>   *
> - * Return: The entry at @index or lower, or %NULL if nothing is found.
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
> + *
> + * Return: The entry before @index or %NULL if nothing is found.
>   */
>  void *mt_prev(struct maple_tree *mt, unsigned long index, unsigned long min)
>  {
> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
>   * mt_find() - Search from the start up until an entry is found.
>   * @mt: The maple tree
>   * @index: Pointer which contains the start location of the search
> - * @max: The maximum value to check
> + * @max: The maximum value of the search range
> + *
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
>   *
> - * Handles locking.  @index will be incremented to one beyond the range.
> + * In case that an entry is found @index contains the index of the found
> + * entry plus one, so it can be used as iterator index to find the next
> + * entry.

What about:
"In case that an entry is found @index contains the last index of the
found entry plus one"

>   *
>   * Return: The entry at or after the @index or %NULL
>   */
> @@ -6548,7 +6559,9 @@ EXPORT_SYMBOL(mt_find);
>   * @index: Pointer which contains the start location of the search
>   * @max: The maximum value to check
>   *
> - * Handles locking, detects wrapping on index == 0
> + * Same as mt_find() except that it checks @index for 0 before
> + * searching. If @index == 0, the search is aborted. This covers a wrap
> + * around of @index to 0 in an iterator loop.
>   *
>   * Return: The entry at or after the @index or %NULL
>   */


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
       [not found] <20190225201635.4648-1-hannes@cmpxchg.org>
@ 2019-02-26 23:49 ` Roman Gushchin
  0 siblings, 0 replies; 122+ messages in thread
From: Roman Gushchin @ 2019-02-26 23:49 UTC (permalink / raw)
  To: up, the, LRU, counts, tracking
  Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel, Kernel Team

On Mon, Feb 25, 2019 at 03:16:29PM -0500, Johannes Weiner wrote:
> [resending, rebased on top of latest mmots]
> 
> The memcg LRU stats usage is currently a bit messy. Memcg has private
> per-zone counters because reclaim needs zone granularity sometimes,
> but we also have plenty of users that need to awkwardly sum them up to
> node or memcg granularity. Meanwhile the canonical per-memcg vmstats
> do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
> 
> This series enables LRU count tracking in the per-memcg vmstats array
> such that lruvec_page_state() and memcg_page_state() work on the enum
> node_stat_item items for the LRU counters. Then it converts all the
> callers that don't specifically need per-zone numbers over to that.

The updated version looks very good* to me!
Please, feel free to use:
Reviewed-by: Roman Gushchin <guro@fb.com>

Looking through the patchset, I have a feeling that we're sometimes
gathering too much data. Perhaps we don't need the whole set
of counters to be per-cpu on both memcg- and memcg-per-node levels.
Merging them can save quite a lot of space. Anyway, it's a separate
topic.

* except "to" and "subject" of the cover letter

Thanks!


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-27  2:08                       ` Joonsoo Kim
@ 2017-04-27 15:10                         ` Michal Hocko
  0 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2017-04-27 15:10 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Thu 27-04-17 11:08:38, Joonsoo Kim wrote:
> On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > 
> > > > > > You are trying to change a semantic of something that has a well defined
> > > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > > > you are proposing is conflating two different things together.
> > > > > 
> > > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > > It would be original semantic of pfn_valid().
> > > > > 
> > > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > > zone information,"
> > > > 
> > > > I do not see any guarantee about the zone information anywhere. In fact
> > > > this is not true with the original implementation as I've tried to
> > > > explain already. We do have new pages associated with a zone but that
> > > > association might change during the online phase. So you cannot really
> > > > rely on that information until the page is online. There is no real
> > > > change in that regards after my rework.
> > > 
> > > I know that what you did doesn't change thing much. What I try to say
> > > is that previous implementation related to pfn_valid() in hotplug is
> > > wrong. Please do not assume that hotplug implementation is correct and
> > > other pfn_valid() users are incorrect. There is no design document so
> > > I'm not sure which one is correct but assumption that pfn_valid() user
> > > can access whole the struct page information makes much sense to me.
> > 
> > Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> > still need pfn_valid to work for those pfns. Really, pfn_valid has a
> 
> It's really contrary example to your insist. They requires not only
> struct page but also other information, especially, the zone index.
> They checks zone idx to know whether this page is for ZONE_DEVICE or not.

Yes and they guarantee this association is true. Without memory onlining
though. This memory is never online for anybody who is asking.

[...]

> I think that I did my best to explain my reasoning. It seems that we
> cannot agree with each other so it's better for some others to express
> their opinion to this problem. I will stop this discussion from now
> on.

I _do_ appreciate your feedback and if the general consensus is to
modify pfn_valid I can go that direction but my gut feeling tells me
that conflating "existing struct page" test and "fully online and
initialized" one is a wrong thing to do.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-26  9:19                     ` Michal Hocko
@ 2017-04-27  2:08                       ` Joonsoo Kim
  2017-04-27 15:10                         ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-27  2:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > [...]
> > > 
> > > > > You are trying to change a semantic of something that has a well defined
> > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > > you are proposing is conflating two different things together.
> > > > 
> > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > It would be original semantic of pfn_valid().
> > > > 
> > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > zone information,"
> > > 
> > > I do not see any guarantee about the zone information anywhere. In fact
> > > this is not true with the original implementation as I've tried to
> > > explain already. We do have new pages associated with a zone but that
> > > association might change during the online phase. So you cannot really
> > > rely on that information until the page is online. There is no real
> > > change in that regards after my rework.
> > 
> > I know that what you did doesn't change thing much. What I try to say
> > is that previous implementation related to pfn_valid() in hotplug is
> > wrong. Please do not assume that hotplug implementation is correct and
> > other pfn_valid() users are incorrect. There is no design document so
> > I'm not sure which one is correct but assumption that pfn_valid() user
> > can access whole the struct page information makes much sense to me.
> 
> Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> still need pfn_valid to work for those pfns. Really, pfn_valid has a

It's really contrary example to your insist. They requires not only
struct page but also other information, especially, the zone index.
They checks zone idx to know whether this page is for ZONE_DEVICE or not.

So, pfn_valid() for ZONE_DEVICE pages assume that struct page has all
the valid information. It's perfectly matched with my suggestion.
Online isn't important issue here. What the important point is the condition
that pfn_valid() return true. pfn_valid() for ZONE_DEVICE returns true after
arch_add_memory() since all the struct page information is fixed there.

If zone of hotplugged memory cannot be fixed at this moment, you can
defef it until all the information is fixed (onlining). That
seems to be better semantic of pfn_valid() to me.

> different meaning than you would like it to have. Who knows how many
> others like that are lurking there. I feel much more comfortable to go
> and hunt already broken code and fix it rathert than break something
> unexpectedly.

I think that I did my best to explain my reasoning. It seems that we
cannot agree with each other so it's better for some others to express
their opinion to this problem. I will stop this discussion from now
on.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-25  2:50                   ` Joonsoo Kim
@ 2017-04-26  9:19                     ` Michal Hocko
  2017-04-27  2:08                       ` Joonsoo Kim
  0 siblings, 1 reply; 122+ messages in thread
From: Michal Hocko @ 2017-04-26  9:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Tue 25-04-17 11:50:45, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> > On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > Which pfn walkers you have in mind?
> > > > > > > 
> > > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > > using pfn_valid().
> > > > > > 
> > > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > > case where you do not really care about holes. It just checks the page
> > > > > > count which is a valid information under any circumstances.
> > > > > 
> > > > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > > > even if PageReserved() is set?
> > > > 
> > > > I do not know about any user which would manipulate page map count for
> > > > referenced pages. The core MM code doesn't.
> > > 
> > > That's weird that we can get *map* count without PageReserved() check,
> > > but we cannot get zone information.
> > > Zone information is more static information than map count.
> > 
> > As I've already pointed out the rework of the hotplug code is mainly
> > about postponing the zone initialization from the physical hot add to
> > the logical onlining. The zone is really not clear until that moment.
> >  
> > > It should be defined/documented in this time that what information in
> > > the struct page is valid even if PageReserved() is set. And then, we
> > > need to fix all the things based on this design decision.
> > 
> > Where would you suggest documenting this? We do have
> > Documentation/memory-hotplug.txt but it is not really specific about
> > struct page.
> 
> pfn_valid() in include/linux/mmzone.h looks proper place.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c412e6a3a1e9..443258fcac93 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,10 +1288,14 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
 /*
  * pfn_valid() is meant to be able to tell if a given PFN has valid memmap
- * associated with it or not. In FLATMEM, it is expected that holes always
- * have valid memmap as long as there is valid PFNs either side of the hole.
- * In SPARSEMEM, it is assumed that a valid section has a memmap for the
- * entire section.
+ * associated with it or not. This means that a struct page exists for this
+ * pfn. The caller cannot assume the page is fully initialized though.
+ * pfn_to_online_page() should be used to make sure the struct page is fully
+ * initialized.
+ *
+ * In FLATMEM, it is expected that holes always have valid memmap as long as
+ * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed
+ * that a valid section has a memmap for the entire section.
  *
  * However, an ARM, and maybe other embedded architectures in the future
  * free memmap backing holes to save memory on the assumption the memmap is

> > [...]
> > 
> > > > You are trying to change a semantic of something that has a well defined
> > > > meaning. I disagree that we should change it. It might sound like a
> > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > you are proposing is conflating two different things together.
> > > 
> > > I don't think that *I* try to change the semantic of pfn_valid().
> > > It would be original semantic of pfn_valid().
> > > 
> > > "If pfn_valid() returns true, we can get proper struct page and the
> > > zone information,"
> > 
> > I do not see any guarantee about the zone information anywhere. In fact
> > this is not true with the original implementation as I've tried to
> > explain already. We do have new pages associated with a zone but that
> > association might change during the online phase. So you cannot really
> > rely on that information until the page is online. There is no real
> > change in that regards after my rework.
> 
> I know that what you did doesn't change thing much. What I try to say
> is that previous implementation related to pfn_valid() in hotplug is
> wrong. Please do not assume that hotplug implementation is correct and
> other pfn_valid() users are incorrect. There is no design document so
> I'm not sure which one is correct but assumption that pfn_valid() user
> can access whole the struct page information makes much sense to me.

Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
still need pfn_valid to work for those pfns. Really, pfn_valid has a
different meaning than you would like it to have. Who knows how many
others like that are lurking there. I feel much more comfortable to go
and hunt already broken code and fix it rathert than break something
unexpectedly.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-24  7:53                 ` Michal Hocko
@ 2017-04-25  2:50                   ` Joonsoo Kim
  2017-04-26  9:19                     ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-25  2:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > Which pfn walkers you have in mind?
> > > > > > 
> > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > using pfn_valid().
> > > > > 
> > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > case where you do not really care about holes. It just checks the page
> > > > > count which is a valid information under any circumstances.
> > > > 
> > > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > > even if PageReserved() is set?
> > > 
> > > I do not know about any user which would manipulate page map count for
> > > referenced pages. The core MM code doesn't.
> > 
> > That's weird that we can get *map* count without PageReserved() check,
> > but we cannot get zone information.
> > Zone information is more static information than map count.
> 
> As I've already pointed out the rework of the hotplug code is mainly
> about postponing the zone initialization from the physical hot add to
> the logical onlining. The zone is really not clear until that moment.
>  
> > It should be defined/documented in this time that what information in
> > the struct page is valid even if PageReserved() is set. And then, we
> > need to fix all the things based on this design decision.
> 
> Where would you suggest documenting this? We do have
> Documentation/memory-hotplug.txt but it is not really specific about
> struct page.

pfn_valid() in include/linux/mmzone.h looks proper place.

> 
> [...]
> 
> > > You are trying to change a semantic of something that has a well defined
> > > meaning. I disagree that we should change it. It might sound like a
> > > simpler thing to do because pfn walkers will have to be checked but what
> > > you are proposing is conflating two different things together.
> > 
> > I don't think that *I* try to change the semantic of pfn_valid().
> > It would be original semantic of pfn_valid().
> > 
> > "If pfn_valid() returns true, we can get proper struct page and the
> > zone information,"
> 
> I do not see any guarantee about the zone information anywhere. In fact
> this is not true with the original implementation as I've tried to
> explain already. We do have new pages associated with a zone but that
> association might change during the online phase. So you cannot really
> rely on that information until the page is online. There is no real
> change in that regards after my rework.

I know that what you did doesn't change thing much. What I try to say
is that previous implementation related to pfn_valid() in hotplug is
wrong. Please do not assume that hotplug implementation is correct and
other pfn_valid() users are incorrect. There is no design document so
I'm not sure which one is correct but assumption that pfn_valid() user
can access whole the struct page information makes much sense to me.
So, I hope that please fix hotplug implementation rather than
modifying each pfn_valid() users.

> 
> [...]
> > > So please do not conflate those two different concepts together. I
> > > believe that the most prominent pfn walkers should be covered now and
> > > others can be evaluated later.
> > 
> > Even if original pfn_valid()'s semantic is not the one that I mentioned,
> > I think that suggested semantic from me is better.
> > Only hotplug code need to be changed and others doesn't need to be changed.
> > There is no overhead for others. What's the problem about this approach?
> 
> That this would require to check _every_ single pfn_valid user in the
> kernel. That is beyond my time capacity and not really necessary because
> the current code already suffers from the same/similar class of
> problems.

I think that all the pfn_valid() user doesn't consider hole case.
Unlike your expectation, if your way is taken, it requires to check
_every_ pfn_valid() users.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-24  1:44               ` Joonsoo Kim
@ 2017-04-24  7:53                 ` Michal Hocko
  2017-04-25  2:50                   ` Joonsoo Kim
  0 siblings, 1 reply; 122+ messages in thread
From: Michal Hocko @ 2017-04-24  7:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Which pfn walkers you have in mind?
> > > > > 
> > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > using pfn_valid().
> > > > 
> > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > case where you do not really care about holes. It just checks the page
> > > > count which is a valid information under any circumstances.
> > > 
> > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > even if PageReserved() is set?
> > 
> > I do not know about any user which would manipulate page map count for
> > referenced pages. The core MM code doesn't.
> 
> That's weird that we can get *map* count without PageReserved() check,
> but we cannot get zone information.
> Zone information is more static information than map count.

As I've already pointed out the rework of the hotplug code is mainly
about postponing the zone initialization from the physical hot add to
the logical onlining. The zone is really not clear until that moment.
 
> It should be defined/documented in this time that what information in
> the struct page is valid even if PageReserved() is set. And then, we
> need to fix all the things based on this design decision.

Where would you suggest documenting this? We do have
Documentation/memory-hotplug.txt but it is not really specific about
struct page.

[...]

> > You are trying to change a semantic of something that has a well defined
> > meaning. I disagree that we should change it. It might sound like a
> > simpler thing to do because pfn walkers will have to be checked but what
> > you are proposing is conflating two different things together.
> 
> I don't think that *I* try to change the semantic of pfn_valid().
> It would be original semantic of pfn_valid().
> 
> "If pfn_valid() returns true, we can get proper struct page and the
> zone information,"

I do not see any guarantee about the zone information anywhere. In fact
this is not true with the original implementation as I've tried to
explain already. We do have new pages associated with a zone but that
association might change during the online phase. So you cannot really
rely on that information until the page is online. There is no real
change in that regards after my rework.

[...]
> > So please do not conflate those two different concepts together. I
> > believe that the most prominent pfn walkers should be covered now and
> > others can be evaluated later.
> 
> Even if original pfn_valid()'s semantic is not the one that I mentioned,
> I think that suggested semantic from me is better.
> Only hotplug code need to be changed and others doesn't need to be changed.
> There is no overhead for others. What's the problem about this approach?

That this would require to check _every_ single pfn_valid user in the
kernel. That is beyond my time capacity and not really necessary because
the current code already suffers from the same/similar class of
problems.
 
> And, I'm not sure that you covered the most prominent pfn walkers.
> Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.

I probably haven't (and will send a patch to fix this one - thanks for
pointing to it) but the point is they those are broken already and they
can be fixed in follow up patches. If you change pfn_valid you might
break an existing code in an unexpected ways.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-21  7:16             ` Michal Hocko
@ 2017-04-24  1:44               ` Joonsoo Kim
  2017-04-24  7:53                 ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-24  1:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Which pfn walkers you have in mind?
> > > > 
> > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > using pfn_valid().
> > > 
> > > Yeah, I've checked that one and in fact this is a good example of the
> > > case where you do not really care about holes. It just checks the page
> > > count which is a valid information under any circumstances.
> > 
> > I don't think so. First, it checks the page *map* count. Is it still valid
> > even if PageReserved() is set?
> 
> I do not know about any user which would manipulate page map count for
> referenced pages. The core MM code doesn't.

That's weird that we can get *map* count without PageReserved() check,
but we cannot get zone information.
Zone information is more static information than map count.

It should be defined/documented in this time that what information in
the struct page is valid even if PageReserved() is set. And then, we
need to fix all the things based on this design decision.

> 
> > What I'd like to ask in this example is
> > that what information is valid if PageReserved() is set. Is there any
> > design document on this? I think that we need to define/document it first.
> 
> NO, it is not AFAIK.
> 
> [...]
> > > OK, fair enough. I did't consider memblock allocations. I will rethink
> > > this patch but there are essentially 3 options
> > > 	- use a different criterion for the offline holes dection. I
> > > 	  have just realized we might do it by storing the online
> > > 	  information into the mem sections
> > > 	- drop this patch
> > > 	- move the PageReferenced check down the chain into
> > > 	  isolate_freepages_block resp. isolate_migratepages_block
> > > 
> > > I would prefer 3 over 2 over 1. I definitely want to make this more
> > > robust so 1 is preferable long term but I do not want this to be a
> > > roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > I like #1 among of above options and I already see your patch for #1.
> > It's much better than your first attempt but I'm still not happy due
> > to the semantic of pfn_valid().
> 
> You are trying to change a semantic of something that has a well defined
> meaning. I disagree that we should change it. It might sound like a
> simpler thing to do because pfn walkers will have to be checked but what
> you are proposing is conflating two different things together.

I don't think that *I* try to change the semantic of pfn_valid().
It would be original semantic of pfn_valid().

"If pfn_valid() returns true, we can get proper struct page and the
zone information,"

That situation is now being changed by your patch *hotplug rework*.

"Even if pfn_valid() returns true, we cannot get the zone information
without PageReserved() check, since *zone is determined during
onlining* and pfn_valid() return true after adding the memory."

> 
> > > [..]
> > > > Let me clarify my desire(?) for this issue.
> > > > 
> > > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > > without checking PageResereved().
> > > 
> > > This is no longer true after my rework. Pages are associated with the
> > > zone during _onlining_ rather than when they are physically hotpluged.
> > 
> > If your rework make information valid during _onlining_, my
> > suggestion is making pfn_valid() return false until onlining.
> > 
> > Caller of pfn_valid() expects that they can get valid information from
> > the struct page. There is no reason to access the struct page if they
> > can't get valid information from it. So, passing pfn_valid() should
> > guarantee that, at least, some kind of information is valid.
> > 
> > If pfn_valid() doesn't guarantee it, most of the pfn walker should
> > check PageResereved() to make sure that validity of information from
> > the struct page.
> 
> This is true only for those walkers which really depend on the full
> initialization. This is not the case for all of them. I do not see any
> reason to introduce another _pfn_valid to just check whether there is a
> struct page...

It's really confusing concept that only some information is valid for
*not* fully initialized struct page. Even, there is no document that
what information is valid for this half-initialized struct page.

Better design would be that we regard that every information is
invalid for half-initialized struct page. In this case, it's natural
to make pfn_valid() returns false for this half-initialized struct page.

>  
> So please do not conflate those two different concepts together. I
> believe that the most prominent pfn walkers should be covered now and
> others can be evaluated later.

Even if original pfn_valid()'s semantic is not the one that I mentioned,
I think that suggested semantic from me is better.
Only hotplug code need to be changed and others doesn't need to be changed.
There is no overhead for others. What's the problem about this approach?

And, I'm not sure that you covered the most prominent pfn walkers.
Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.
As you admitted, additional check approach is really error-prone and
this example shows that.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-21  4:38           ` Joonsoo Kim
@ 2017-04-21  7:16             ` Michal Hocko
  2017-04-24  1:44               ` Joonsoo Kim
  0 siblings, 1 reply; 122+ messages in thread
From: Michal Hocko @ 2017-04-21  7:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > [...]
> > > > Which pfn walkers you have in mind?
> > > 
> > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > using pfn_valid().
> > 
> > Yeah, I've checked that one and in fact this is a good example of the
> > case where you do not really care about holes. It just checks the page
> > count which is a valid information under any circumstances.
> 
> I don't think so. First, it checks the page *map* count. Is it still valid
> even if PageReserved() is set?

I do not know about any user which would manipulate page map count for
referenced pages. The core MM code doesn't.

> What I'd like to ask in this example is
> that what information is valid if PageReserved() is set. Is there any
> design document on this? I think that we need to define/document it first.

NO, it is not AFAIK.

[...]
> > OK, fair enough. I did't consider memblock allocations. I will rethink
> > this patch but there are essentially 3 options
> > 	- use a different criterion for the offline holes dection. I
> > 	  have just realized we might do it by storing the online
> > 	  information into the mem sections
> > 	- drop this patch
> > 	- move the PageReferenced check down the chain into
> > 	  isolate_freepages_block resp. isolate_migratepages_block
> > 
> > I would prefer 3 over 2 over 1. I definitely want to make this more
> > robust so 1 is preferable long term but I do not want this to be a
> > roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> I like #1 among of above options and I already see your patch for #1.
> It's much better than your first attempt but I'm still not happy due
> to the semantic of pfn_valid().

You are trying to change a semantic of something that has a well defined
meaning. I disagree that we should change it. It might sound like a
simpler thing to do because pfn walkers will have to be checked but what
you are proposing is conflating two different things together.

> > [..]
> > > Let me clarify my desire(?) for this issue.
> > > 
> > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > without checking PageResereved().
> > 
> > This is no longer true after my rework. Pages are associated with the
> > zone during _onlining_ rather than when they are physically hotpluged.
> 
> If your rework make information valid during _onlining_, my
> suggestion is making pfn_valid() return false until onlining.
> 
> Caller of pfn_valid() expects that they can get valid information from
> the struct page. There is no reason to access the struct page if they
> can't get valid information from it. So, passing pfn_valid() should
> guarantee that, at least, some kind of information is valid.
> 
> If pfn_valid() doesn't guarantee it, most of the pfn walker should
> check PageResereved() to make sure that validity of information from
> the struct page.

This is true only for those walkers which really depend on the full
initialization. This is not the case for all of them. I do not see any
reason to introduce another _pfn_valid to just check whether there is a
struct page...
 
So please do not conflate those two different concepts together. I
believe that the most prominent pfn walkers should be covered now and
others can be evaluated later.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-20  7:28         ` Michal Hocko
  2017-04-20  8:49           ` Michal Hocko
@ 2017-04-21  4:38           ` Joonsoo Kim
  2017-04-21  7:16             ` Michal Hocko
  1 sibling, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-21  4:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> [...]
> > > Which pfn walkers you have in mind?
> > 
> > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > using pfn_valid().
> 
> Yeah, I've checked that one and in fact this is a good example of the
> case where you do not really care about holes. It just checks the page
> count which is a valid information under any circumstances.

I don't think so. First, it checks the page *map* count. Is it still valid
even if PageReserved() is set? What I'd like to ask in this example is
that what information is valid if PageReserved() is set. Is there any
design document on this? I think that we need to define/document it first.

And, I hope that all the information in flags field is valid in all
cases if pfn_valid() return true. By the design.

This makes all the exsiting pfn walkers happy since we don't need an
additional check for PageReserved().

> 
> > > > The other problem I found is that your change will makes some
> > > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > > by memblock API is also marked as PageResereved. If we consider this as
> > > > a hole, we will set such a zone as non-contiguous.
> > > 
> > > Why would that be a problem? We shouldn't touch those pages anyway?
> > 
> > Skipping those pages in compaction are valid so no problem in this
> > case.
> > 
> > The problem I mentioned above is that adding PageReserved() check in
> > __pageblock_pfn_to_page() invalidates optimization by
> > set_zone_contiguous(). In compaction, we need to get a valid struct
> > page and it requires a lot of work. There is performance problem
> > report due to this so set_zone_contiguous() optimization is added. It
> > checks if the zone is contiguous or not in boot time. If zone is
> > determined as contiguous, we can easily get a valid struct page in
> > runtime without expensive checks.
> 
> OK, I see. I've had some vague understading and the clarification helps.
> 
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
> 	- use a different criterion for the offline holes dection. I
> 	  have just realized we might do it by storing the online
> 	  information into the mem sections
> 	- drop this patch
> 	- move the PageReferenced check down the chain into
> 	  isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

I like #1 among of above options and I already see your patch for #1.
It's much better than your first attempt but I'm still not happy due
to the semantic of pfn_valid().

> [..]
> > Let me clarify my desire(?) for this issue.
> > 
> > 1. If pfn_valid() returns true, struct page has valid information, at
> > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > without checking PageResereved().
> 
> This is no longer true after my rework. Pages are associated with the
> zone during _onlining_ rather than when they are physically hotpluged.

If your rework make information valid during _onlining_, my
suggestion is making pfn_valid() return false until onlining.

Caller of pfn_valid() expects that they can get valid information from
the struct page. There is no reason to access the struct page if they
can't get valid information from it. So, passing pfn_valid() should
guarantee that, at least, some kind of information is valid.

If pfn_valid() doesn't guarantee it, most of the pfn walker should
check PageResereved() to make sure that validity of information from
the struct page.

> Basically only the nid is set properly. Strictly speaking this is the
> case also without my rework because the zone might change during online
> phase so you cannot assume it is correct even now. It just happens that
> it more or less works just fine.
>
> > 2. pfn_valid() for offlined holes returns false. This can be easily
> > (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> > guess that there is no reason that pfn_valid() returns true for
> > offlined holes. If there is, please let me know.
> 
> There is some code which really expects that pfn_valid returns true iff
> there is a struct page and it doesn't care about the online status.
> E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
> do though is to add pfn_to_online_page which would do the proper check.
> I have already sent [1]. As noted above we can (ab)use the remaining bit
> in SECTION_MAP_MASK to detect offline pages more robustly.

Some pfn_valid() caller in hotplug code look wrong. They want to check
section's validity rather than pfn's validity. Others want to access
the struct page so they fit for my assumption (?) for pfn_valid().
Therefore, we can change that pfn_valid() return false until online.

> > 3. We don't need to check PageReserved() in most of pfn walkers in
> > order to check offline holes.
> 
> We still have to distinguish those who care about offline pages from
> those who do not care about it.

Hotplug code can distinguish those by another way by using new section
mask as you did in a new patch. If someone excluding hotplug code do
care about offline pages, it would be just for optimization rather
than correteness. I think that it's okay.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-20 11:56             ` Vlastimil Babka
@ 2017-04-20 12:13               ` Michal Hocko
  0 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2017-04-20 12:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, linux-mm, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Thu 20-04-17 13:56:34, Vlastimil Babka wrote:
> On 04/20/2017 10:49 AM, Michal Hocko wrote:
> > On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> >> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > [...]
> >>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> >>> woule make that zone->contiguous usually returns false since memory
> >>> used by memblock API is marked as PageReserved() and your patch regard
> >>> it as a hole. It invalidates set_zone_contiguous() optimization and I
> >>> worry about it.
> >>
> >> OK, fair enough. I did't consider memblock allocations. I will rethink
> >> this patch but there are essentially 3 options
> >> 	- use a different criterion for the offline holes dection. I
> >> 	  have just realized we might do it by storing the online
> >> 	  information into the mem sections
> >> 	- drop this patch
> >> 	- move the PageReferenced check down the chain into
> >> 	  isolate_freepages_block resp. isolate_migratepages_block
> >>
> >> I would prefer 3 over 2 over 1. I definitely want to make this more
> >> robust so 1 is preferable long term but I do not want this to be a
> >> roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > So I've played with all three options just to see how the outcome would
> > look like and it turned out that going with 1 will be easiest in the
> > end. What do you think about the following? It should be free of any 
> > false positives. I have only compile tested it yet.
> 
> That looks fine, can't say immediately if fully correct. I think you'll
> need to bump SECTION_NID_SHIFT as well and make sure things still fit?
> Otherwise looks like nobody needed a new section bit since 2005, so we
> should be fine.

You are absolutely right. Thanks for spotting this! I have folded this
in

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 611ff869fa4d..c412e6a3a1e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,7 +1166,7 @@ extern unsigned long usemap_size(void);
 #define SECTION_IS_ONLINE	(1UL<<2)
 #define SECTION_MAP_LAST_BIT	(1UL<<3)
 #define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT	2
+#define SECTION_NID_SHIFT	3
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-20  8:49           ` Michal Hocko
@ 2017-04-20 11:56             ` Vlastimil Babka
  2017-04-20 12:13               ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Vlastimil Babka @ 2017-04-20 11:56 UTC (permalink / raw)
  To: Michal Hocko, Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On 04/20/2017 10:49 AM, Michal Hocko wrote:
> On Thu 20-04-17 09:28:20, Michal Hocko wrote:
>> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> [...]
>>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
>>> woule make that zone->contiguous usually returns false since memory
>>> used by memblock API is marked as PageReserved() and your patch regard
>>> it as a hole. It invalidates set_zone_contiguous() optimization and I
>>> worry about it.
>>
>> OK, fair enough. I did't consider memblock allocations. I will rethink
>> this patch but there are essentially 3 options
>> 	- use a different criterion for the offline holes dection. I
>> 	  have just realized we might do it by storing the online
>> 	  information into the mem sections
>> 	- drop this patch
>> 	- move the PageReferenced check down the chain into
>> 	  isolate_freepages_block resp. isolate_migratepages_block
>>
>> I would prefer 3 over 2 over 1. I definitely want to make this more
>> robust so 1 is preferable long term but I do not want this to be a
>> roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> So I've played with all three options just to see how the outcome would
> look like and it turned out that going with 1 will be easiest in the
> end. What do you think about the following? It should be free of any 
> false positives. I have only compile tested it yet.

That looks fine, can't say immediately if fully correct. I think you'll
need to bump SECTION_NID_SHIFT as well and make sure things still fit?
Otherwise looks like nobody needed a new section bit since 2005, so we
should be fine.

> ---
> From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 13 Apr 2017 10:28:45 +0200
> Subject: [PATCH] mm: consider zone which is not fully populated to have holes
> 
> __pageblock_pfn_to_page has two users currently, set_zone_contiguous
> which checks whether the given zone contains holes and
> pageblock_pfn_to_page which then carefully returns a first valid
> page from the given pfn range for the given zone. This doesn't handle
> zones which are not fully populated though. Memory pageblocks can be
> offlined or might not have been onlined yet. In such a case the zone
> should be considered to have holes otherwise pfn walkers can touch
> and play with offline pages.
> 
> Current callers of pageblock_pfn_to_page in compaction seem to work
> properly right now because they only isolate PageBuddy
> (isolate_freepages_block) or PageLRU resp. __PageMovable
> (isolate_migratepages_block) which will be always false for these pages.
> It would be safer to skip these pages altogether, though.
> 
> In order to do this patch adds a new memory section state
> (SECTION_IS_ONLINE) which is set in memory_present (during boot
> time) or in online_pages_range during the memory hotplug. Similarly
> offline_mem_sections clears the bit and it is called when the memory
> range is offlined.
> 
> pfn_to_online_page helper is then added which check the mem section and
> only returns a page if it is onlined already.
> 
> Use the new helper in __pageblock_pfn_to_page and skip the whole page
> block in such a case.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/memory_hotplug.h | 21 ++++++++++++++++++++
>  include/linux/mmzone.h         | 20 ++++++++++++++++++-
>  mm/memory_hotplug.c            |  3 +++
>  mm/page_alloc.c                |  5 ++++-
>  mm/sparse.c                    | 45 +++++++++++++++++++++++++++++++++++++++++-
>  5 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 3c8cf86201c3..fc1c873504eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -14,6 +14,19 @@ struct memory_block;
>  struct resource;
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +#define pfn_to_online_page(pfn)				\
> +({							\
> +	struct page *___page = NULL;			\
> +							\
> +	if (online_section_nr(pfn_to_section_nr(pfn)))	\
> +		___page = pfn_to_page(pfn);		\
> +	___page;					\
> +})
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>  
>  #else /* ! CONFIG_MEMORY_HOTPLUG */
> +#define pfn_to_online_page(pfn)			\
> +({						\
> +	struct page *___page = NULL;		\
> +	if (pfn_valid(pfn))			\
> +		___page = pfn_to_page(pfn);	\
> +	___page;				\
> + })
> +
>  /*
>   * Stub functions for when hotplug is off
>   */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0fc121bbf4ff..cad16ac080f5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1143,7 +1143,8 @@ extern unsigned long usemap_size(void);
>   */
>  #define	SECTION_MARKED_PRESENT	(1UL<<0)
>  #define SECTION_HAS_MEM_MAP	(1UL<<1)
> -#define SECTION_MAP_LAST_BIT	(1UL<<2)
> +#define SECTION_IS_ONLINE	(1UL<<2)
> +#define SECTION_MAP_LAST_BIT	(1UL<<3)
>  #define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT-1))
>  #define SECTION_NID_SHIFT	2
>  
> @@ -1174,6 +1175,23 @@ static inline int valid_section_nr(unsigned long nr)
>  	return valid_section(__nr_to_section(nr));
>  }
>  
> +static inline int online_section(struct mem_section *section)
> +{
> +	return (section && (section->section_mem_map & SECTION_IS_ONLINE));
> +}
> +
> +static inline int online_section_nr(unsigned long nr)
> +{
> +	return online_section(__nr_to_section(nr));
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
> +#endif
> +#endif
> +
>  static inline struct mem_section *__pfn_to_section(unsigned long pfn)
>  {
>  	return __nr_to_section(pfn_to_section_nr(pfn));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index caa58338d121..98f565c279bf 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -929,6 +929,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
>  	unsigned long i;
>  	unsigned long onlined_pages = *(unsigned long *)arg;
>  	struct page *page;
> +
> +	online_mem_sections(start_pfn, start_pfn + nr_pages);
> +
>  	if (PageReserved(pfn_to_page(start_pfn)))
>  		for (i = 0; i < nr_pages; i++) {
>  			page = pfn_to_page(start_pfn + i);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d72d29a6ece..fa752de84eef 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1353,7 +1353,9 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>  	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
>  		return NULL;
>  
> -	start_page = pfn_to_page(start_pfn);
> +	start_page = pfn_to_online_page(start_pfn);
> +	if (!start_page)
> +		return NULL;
>  
>  	if (page_zone(start_page) != zone)
>  		return NULL;
> @@ -7686,6 +7688,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>  			break;
>  	if (pfn == end_pfn)
>  		return;
> +	offline_mem_sections(pfn, end_pfn);
>  	zone = page_zone(pfn_to_page(pfn));
>  	spin_lock_irqsave(&zone->lock, flags);
>  	pfn = start_pfn;
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 6903c8fc3085..79017f90d8fc 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -185,7 +185,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
>  		ms = __nr_to_section(section);
>  		if (!ms->section_mem_map)
>  			ms->section_mem_map = sparse_encode_early_nid(nid) |
> -							SECTION_MARKED_PRESENT;
> +							SECTION_MARKED_PRESENT |
> +							SECTION_IS_ONLINE;
>  	}
>  }
>  
> @@ -590,6 +591,48 @@ void __init sparse_init(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/* Mark all memory sections within the pfn range as online */
> +void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned long pfn;
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> +		unsigned long section_nr = pfn_to_section_nr(start_pfn);
> +		struct mem_section *ms;
> +
> +		/* onlining code should never touch invalid ranges */
> +		if (WARN_ON(!valid_section_nr(section_nr)))
> +			continue;
> +
> +		ms = __nr_to_section(section_nr);
> +		ms->section_mem_map |= SECTION_IS_ONLINE;
> +	}
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/* Mark all memory sections within the pfn range as online */
> +void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned long pfn;
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> +		unsigned long section_nr = pfn_to_section_nr(start_pfn);
> +		struct mem_section *ms;
> +
> +		/*
> +		 * TODO this needs some double checking. Offlining code makes
> +		 * sure to check pfn_valid but those checks might be just bogus
> +		 */
> +		if (WARN_ON(!valid_section_nr(section_nr)))
> +			continue;
> +
> +		ms = __nr_to_section(section_nr);
> +		ms->section_mem_map &= ~SECTION_IS_ONLINE;
> +	}
> +}
> +#endif
> +
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
>  {
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-20  7:28         ` Michal Hocko
@ 2017-04-20  8:49           ` Michal Hocko
  2017-04-20 11:56             ` Vlastimil Babka
  2017-04-21  4:38           ` Joonsoo Kim
  1 sibling, 1 reply; 122+ messages in thread
From: Michal Hocko @ 2017-04-20  8:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
[...]
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
> 	- use a different criterion for the offline holes dection. I
> 	  have just realized we might do it by storing the online
> 	  information into the mem sections
> 	- drop this patch
> 	- move the PageReferenced check down the chain into
> 	  isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

So I've played with all three options just to see how the outcome would
look like and it turned out that going with 1 will be easiest in the
end. What do you think about the following? It should be free of any 
false positives. I have only compile tested it yet.
---

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-20  1:27       ` Joonsoo Kim
@ 2017-04-20  7:28         ` Michal Hocko
  2017-04-20  8:49           ` Michal Hocko
  2017-04-21  4:38           ` Joonsoo Kim
  0 siblings, 2 replies; 122+ messages in thread
From: Michal Hocko @ 2017-04-20  7:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
[...]
> > Which pfn walkers you have in mind?
> 
> For example, kpagecount_read() in fs/proc/page.c. I searched it by
> using pfn_valid().

Yeah, I've checked that one and in fact this is a good example of the
case where you do not really care about holes. It just checks the page
count which is a valid information under any circumstances.

> > > The other problem I found is that your change will makes some
> > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > by memblock API is also marked as PageResereved. If we consider this as
> > > a hole, we will set such a zone as non-contiguous.
> > 
> > Why would that be a problem? We shouldn't touch those pages anyway?
> 
> Skipping those pages in compaction are valid so no problem in this
> case.
> 
> The problem I mentioned above is that adding PageReserved() check in
> __pageblock_pfn_to_page() invalidates optimization by
> set_zone_contiguous(). In compaction, we need to get a valid struct
> page and it requires a lot of work. There is performance problem
> report due to this so set_zone_contiguous() optimization is added. It
> checks if the zone is contiguous or not in boot time. If zone is
> determined as contiguous, we can easily get a valid struct page in
> runtime without expensive checks.

OK, I see. I've had some vague understading and the clarification helps.

> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> woule make that zone->contiguous usually returns false since memory
> used by memblock API is marked as PageReserved() and your patch regard
> it as a hole. It invalidates set_zone_contiguous() optimization and I
> worry about it.

OK, fair enough. I did't consider memblock allocations. I will rethink
this patch but there are essentially 3 options
	- use a different criterion for the offline holes dection. I
	  have just realized we might do it by storing the online
	  information into the mem sections
	- drop this patch
	- move the PageReferenced check down the chain into
	  isolate_freepages_block resp. isolate_migratepages_block

I would prefer 3 over 2 over 1. I definitely want to make this more
robust so 1 is preferable long term but I do not want this to be a
roadblock to the rest of the rework. Does that sound acceptable to you?
 
[..]
> Let me clarify my desire(?) for this issue.
> 
> 1. If pfn_valid() returns true, struct page has valid information, at
> least, in flags (zone id, node id, flags, etc...). So, we can use them
> without checking PageResereved().

This is no longer true after my rework. Pages are associated with the
zone during _onlining_ rather than when they are physically hotpluged.
Basically only the nid is set properly. Strictly speaking this is the
case also without my rework because the zone might change during online
phase so you cannot assume it is correct even now. It just happens that
it more or less works just fine.

> 2. pfn_valid() for offlined holes returns false. This can be easily
> (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> guess that there is no reason that pfn_valid() returns true for
> offlined holes. If there is, please let me know.

There is some code which really expects that pfn_valid returns true iff
there is a struct page and it doesn't care about the online status.
E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
do though is to add pfn_to_online_page which would do the proper check.
I have already sent [1]. As noted above we can (ab)use the remaining bit
in SECTION_MAP_MASK to detect offline pages more robustly.

> 3. We don't need to check PageReserved() in most of pfn walkers in
> order to check offline holes.

We still have to distinguish those who care about offline pages from
those who do not care about it.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-17  8:15     ` Michal Hocko
@ 2017-04-20  1:27       ` Joonsoo Kim
  2017-04-20  7:28         ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-20  1:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> > On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > > Hi,
> > > here I 3 more preparatory patches which I meant to send on Thursday but
> > > forgot... After more thinking about pfn walkers I have realized that
> > > the current code doesn't check offline holes in zones. From a quick
> > > review that doesn't seem to be a problem currently. Pfn walkers can race
> > > with memory offlining and with the original hotplug impementation those
> > > offline pages can change the zone but I wasn't able to find any serious
> > > problem other than small confusion. The new hotplug code, will not have
> > > any valid zone, though so those code paths should check PageReserved
> > > to rule offline holes. I hope I have addressed all of them in these 3
> > > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > > me.
> > 
> > Hello, Michal.
> > 
> > s/Jonsoo/Joonsoo. :)
> 
> ups, sorry about that.
> 
> > I'm not sure that it's a good idea to add PageResereved() check in pfn
> > walkers. First, this makes struct page validity check as two steps,
> > pfn_valid() and then PageResereved().
> 
> Yes, those are two separate checkes because semantically they are
> different. Not all pfn walkers do care about the online status.

If offlined page has no valid information, reading information
about offlined pages are just wrong. So, all pfn walkers that reads
information about the page should do care about it.

I guess that many callers for pfn_valid() is in this category.

> 
> > If we should not use struct page
> > in this case, it's better to pfn_valid() returns false rather than
> > adding a separate check. Anyway, we need to fix more places (all pfn
> > walker?) if we want to check validity by two steps.
> 
> Which pfn walkers you have in mind?

For example, kpagecount_read() in fs/proc/page.c. I searched it by
using pfn_valid().

> > The other problem I found is that your change will makes some
> > contiguous zones to be considered as non-contiguous. Memory allocated
> > by memblock API is also marked as PageResereved. If we consider this as
> > a hole, we will set such a zone as non-contiguous.
> 
> Why would that be a problem? We shouldn't touch those pages anyway?

Skipping those pages in compaction are valid so no problem in this
case.

The problem I mentioned above is that adding PageReserved() check in
__pageblock_pfn_to_page() invalidates optimization by
set_zone_contiguous(). In compaction, we need to get a valid struct
page and it requires a lot of work. There is performance problem
report due to this so set_zone_contiguous() optimization is added. It
checks if the zone is contiguous or not in boot time. If zone is
determined as contiguous, we can easily get a valid struct page in
runtime without expensive checks.

Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
woule make that zone->contiguous usually returns false since memory
used by memblock API is marked as PageReserved() and your patch regard
it as a hole. It invalidates set_zone_contiguous() optimization and I
worry about it.

>  
> > And, I guess that it's not enough to check PageResereved() in
> > pageblock_pfn_to_page() in order to skip these pages in compaction. If
> > holes are in the middle of the pageblock, pageblock_pfn_to_page()
> > cannot catch it and compaction will use struct page for this hole.
> 
> Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
> current implementation anyway. So the implementation won't be any worse
> than with the current code. On the other hand offline holes will always
> fill the whole pageblock (assuming those are not spanning multiple
> memblocks).
>  
> > Therefore, I think that making pfn_valid() return false for not
> > onlined memory is a better solution for this problem. I don't know the
> > implementation detail for hotplug and I don't see your recent change
> > but we may defer memmap initialization until the zone is determined.
> > It will make pfn_valid() return false for un-initialized range.
> 
> I am not really sure. pfn_valid is used in many context and its only
> purpose is to tell whether pfn_to_page will return a valid struct page
> AFAIU.
> 
> I agree that having more checks is more error prone and we can add a
> helper pfn_to_valid_page or something similar but I believe we can do
> that on top of the current hotplug rework. This would require a non
> trivial amount of changes and I believe that a lacking check for the
> offline holes is not critical - we would (ab)use the lowest zone which
> is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

I'm not objecting your hotplug rework. In fact, I don't know the
relationship between this work and hotplug rework. I'm agreeing
with checking offline holes but I don't like the design and
implementation about it.

Let me clarify my desire(?) for this issue.

1. If pfn_valid() returns true, struct page has valid information, at
least, in flags (zone id, node id, flags, etc...). So, we can use them
without checking PageResereved().

2. pfn_valid() for offlined holes returns false. This can be easily
(?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
guess that there is no reason that pfn_valid() returns true for
offlined holes. If there is, please let me know.

3. We don't need to check PageReserved() in most of pfn walkers in
order to check offline holes.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-17  5:47   ` your mail Joonsoo Kim
@ 2017-04-17  8:15     ` Michal Hocko
  2017-04-20  1:27       ` Joonsoo Kim
  0 siblings, 1 reply; 122+ messages in thread
From: Michal Hocko @ 2017-04-17  8:15 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > Hi,
> > here I 3 more preparatory patches which I meant to send on Thursday but
> > forgot... After more thinking about pfn walkers I have realized that
> > the current code doesn't check offline holes in zones. From a quick
> > review that doesn't seem to be a problem currently. Pfn walkers can race
> > with memory offlining and with the original hotplug impementation those
> > offline pages can change the zone but I wasn't able to find any serious
> > problem other than small confusion. The new hotplug code, will not have
> > any valid zone, though so those code paths should check PageReserved
> > to rule offline holes. I hope I have addressed all of them in these 3
> > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > me.
> 
> Hello, Michal.
> 
> s/Jonsoo/Joonsoo. :)

ups, sorry about that.

> I'm not sure that it's a good idea to add PageResereved() check in pfn
> walkers. First, this makes struct page validity check as two steps,
> pfn_valid() and then PageResereved().

Yes, those are two separate checkes because semantically they are
different. Not all pfn walkers do care about the online status.

> If we should not use struct page
> in this case, it's better to pfn_valid() returns false rather than
> adding a separate check. Anyway, we need to fix more places (all pfn
> walker?) if we want to check validity by two steps.

Which pfn walkers you have in mind?

> The other problem I found is that your change will makes some
> contiguous zones to be considered as non-contiguous. Memory allocated
> by memblock API is also marked as PageResereved. If we consider this as
> a hole, we will set such a zone as non-contiguous.

Why would that be a problem? We shouldn't touch those pages anyway?
 
> And, I guess that it's not enough to check PageResereved() in
> pageblock_pfn_to_page() in order to skip these pages in compaction. If
> holes are in the middle of the pageblock, pageblock_pfn_to_page()
> cannot catch it and compaction will use struct page for this hole.

Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
current implementation anyway. So the implementation won't be any worse
than with the current code. On the other hand offline holes will always
fill the whole pageblock (assuming those are not spanning multiple
memblocks).
 
> Therefore, I think that making pfn_valid() return false for not
> onlined memory is a better solution for this problem. I don't know the
> implementation detail for hotplug and I don't see your recent change
> but we may defer memmap initialization until the zone is determined.
> It will make pfn_valid() return false for un-initialized range.

I am not really sure. pfn_valid is used in many context and its only
purpose is to tell whether pfn_to_page will return a valid struct page
AFAIU.

I agree that having more checks is more error prone and we can add a
helper pfn_to_valid_page or something similar but I believe we can do
that on top of the current hotplug rework. This would require a non
trivial amount of changes and I believe that a lacking check for the
offline holes is not critical - we would (ab)use the lowest zone which
is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2017-04-15 12:17 ` Michal Hocko
@ 2017-04-17  5:47   ` Joonsoo Kim
  2017-04-17  8:15     ` Michal Hocko
  0 siblings, 1 reply; 122+ messages in thread
From: Joonsoo Kim @ 2017-04-17  5:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML

On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> Hi,
> here I 3 more preparatory patches which I meant to send on Thursday but
> forgot... After more thinking about pfn walkers I have realized that
> the current code doesn't check offline holes in zones. From a quick
> review that doesn't seem to be a problem currently. Pfn walkers can race
> with memory offlining and with the original hotplug impementation those
> offline pages can change the zone but I wasn't able to find any serious
> problem other than small confusion. The new hotplug code, will not have
> any valid zone, though so those code paths should check PageReserved
> to rule offline holes. I hope I have addressed all of them in these 3
> patches. I would appreciate if Vlastimil and Jonsoo double check after
> me.

Hello, Michal.

s/Jonsoo/Joonsoo. :)

I'm not sure that it's a good idea to add PageResereved() check in pfn
walkers. First, this makes struct page validity check as two steps,
pfn_valid() and then PageResereved(). If we should not use struct page
in this case, it's better to pfn_valid() returns false rather than
adding a separate check. Anyway, we need to fix more places (all pfn
walker?) if we want to check validity by two steps.

The other problem I found is that your change will makes some
contiguous zones to be considered as non-contiguous. Memory allocated
by memblock API is also marked as PageResereved. If we consider this as
a hole, we will set such a zone as non-contiguous.

And, I guess that it's not enough to check PageResereved() in
pageblock_pfn_to_page() in order to skip these pages in compaction. If
holes are in the middle of the pageblock, pageblock_pfn_to_page()
cannot catch it and compaction will use struct page for this hole.

Therefore, I think that making pfn_valid() return false for not
onlined memory is a better solution for this problem. I don't know the
implementation detail for hotplug and I don't see your recent change
but we may defer memmap initialization until the zone is determined.
It will make pfn_valid() return false for un-initialized range.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2012-10-04 16:50 Andrea Arcangeli
@ 2012-10-04 18:17 ` Christoph Lameter
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2012-10-04 18:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney

On Thu, 4 Oct 2012, Andrea Arcangeli wrote:

> So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
> (AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
> also wouldn't be available on 32bit archs but the latter isn't a
> problem).

Nope it should depend on page struct alignment. Other kernel subsystems
may be depeding on page struct alignment in the future (and some other
arches may already have that requirement)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2010-06-22 13:54   ` Jan Kara
@ 2010-06-22 14:12     ` Wu Fengguang
  0 siblings, 0 replies; 122+ messages in thread
From: Wu Fengguang @ 2010-06-22 14:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Tue, Jun 22, 2010 at 09:54:58PM +0800, Jan Kara wrote:
> On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > >   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > >   spend too much time tagging when we write only a few pages in the end
> > >   because of nr_to_write. If we tag only say nr_to_write pages, we may
> > >   not have enough pages tagged because some pages are written out by
> > >   someone else and so we would have to restart and tagging would become
> > 
> > This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> > writeback triggered by sync(). write_cache_pages() already ignored
> > nr_to_write for WB_SYNC_ALL.
>   We could do that but frankly, I'm not very fond of adding more special
> cases to writeback code than strictly necessary...

So do me. However for this case we only need to broaden the special case test:

                        if (nr_to_write > 0) {
                                nr_to_write--;
                                if (nr_to_write == 0 &&
-                                   wbc->sync_mode == WB_SYNC_NONE) {
+                                   !wbc->for_sync) {

> > >   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > >   writeback if we can get rid of nr_to_write. But that's a story for
> > >   a different patch set.
> > 
> > Besides introducing overheads, it will be a policy change in which the
> > system loses control to somehow "throttle" writeback of huge files.
>   Yes, but if we guarantee we cannot livelock on a single file, do we care?
> Memory management does not care because it's getting rid of dirty pages
> which is what it wants. User might care but actually writing out files in
> the order they were dirtied (i.e., the order user written them) is quite
> natural so it's not a "surprising" behavior. And I don't think we can
> assume that data in those small files are more valuable than data in the
> large file and thus should be written earlier...

It could be a surprising behavior when, there is a 4GB dirty file and
100 small dirty files. The user may expect the 100 small dirty files
be synced to disk after 30s. However without nr_to_write, they could
be delayed by the 4GB file for 40 more seconds. Now if something goes
wrong in between and the small dirty files happen to include some
.c/.tex/.txt/... files. 

Small files are more likely your precious documents that are typed in
word-by-word and perhaps the most important data you want to protect.
Naturally you'll want them enjoy more priority than large files.

>   With the overhead you are right that tagging is more expensive than
> checking nr_to_write limit...

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2010-06-22  2:59 ` Wu Fengguang
@ 2010-06-22 13:54   ` Jan Kara
  2010-06-22 14:12     ` Wu Fengguang
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kara @ 2010-06-22 13:54 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> >   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> >   spend too much time tagging when we write only a few pages in the end
> >   because of nr_to_write. If we tag only say nr_to_write pages, we may
> >   not have enough pages tagged because some pages are written out by
> >   someone else and so we would have to restart and tagging would become
> 
> This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> writeback triggered by sync(). write_cache_pages() already ignored
> nr_to_write for WB_SYNC_ALL.
  We could do that but frankly, I'm not very fond of adding more special
cases to writeback code than strictly necessary...

> >   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> >   writeback if we can get rid of nr_to_write. But that's a story for
> >   a different patch set.
> 
> Besides introducing overheads, it will be a policy change in which the
> system loses control to somehow "throttle" writeback of huge files.
  Yes, but if we guarantee we cannot livelock on a single file, do we care?
Memory management does not care because it's getting rid of dirty pages
which is what it wants. User might care but actually writing out files in
the order they were dirtied (i.e., the order user written them) is quite
natural so it's not a "surprising" behavior. And I don't think we can
assume that data in those small files are more valuable than data in the
large file and thus should be written earlier...
  With the overhead you are right that tagging is more expensive than
checking nr_to_write limit...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2010-06-16 16:33 Jan Kara
  2010-06-16 22:15 ` your mail Dave Chinner
@ 2010-06-22  2:59 ` Wu Fengguang
  2010-06-22 13:54   ` Jan Kara
  1 sibling, 1 reply; 122+ messages in thread
From: Wu Fengguang @ 2010-06-22  2:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
>   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
>   spend too much time tagging when we write only a few pages in the end
>   because of nr_to_write. If we tag only say nr_to_write pages, we may
>   not have enough pages tagged because some pages are written out by
>   someone else and so we would have to restart and tagging would become

This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
writeback triggered by sync(). write_cache_pages() already ignored
nr_to_write for WB_SYNC_ALL.

>   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
>   writeback if we can get rid of nr_to_write. But that's a story for
>   a different patch set.

Besides introducing overheads, it will be a policy change in which the
system loses control to somehow "throttle" writeback of huge files.

So it may be safer to enlarge nr_to_write instead of canceling it totally.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2010-06-16 16:33 Jan Kara
@ 2010-06-16 22:15 ` Dave Chinner
  2010-06-22  2:59 ` Wu Fengguang
  1 sibling, 0 replies; 122+ messages in thread
From: Dave Chinner @ 2010-06-16 22:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
>   Hello,
> 
>   here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
>   I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
>   If there are no more concerns, can these patches get merged?

Has it been run through xfstests? I'd suggest doing that at least
with XFS as there are several significant sync sanity tests for XFS
in the suite...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
       [not found] <1131.86.55.168.2.1170690089.squirrel@mail.thinknet.ro>
@ 2007-02-05 12:36 ` Joerg Roedel
  0 siblings, 0 replies; 122+ messages in thread
From: Joerg Roedel @ 2007-02-05 12:36 UTC (permalink / raw)
  To: logic; +Cc: linux-kernel, linux-mm

On Mon, Feb 05, 2007 at 05:41:29PM +0200, logic@thinknet.ro wrote:
> Good morning,
> 
> I am experiencing a bug i think. I am running a 2.6.19.2 kernel on a 3Ghz
> Intel with HT activated, 1 gb ram, and noname motherboard. Here is the
> output of the hang:

Hmm, this seems to be the same issue as in [1] and [2]. A page that is
assumed to belong to the slab but is not longer marked as a slab page.
Could this be a bug in the memory management?

Joerg

[1] http://lkml.org/lkml/2007/2/4/77
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=406477

-- 
Joerg Roedel
Operating System Research Center
AMD Saxony LLC & Co. KG

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-25 23:10           ` Larry McVoy
@ 2003-01-26  8:12             ` David S. Miller
  0 siblings, 0 replies; 122+ messages in thread
From: David S. Miller @ 2003-01-26  8:12 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Eric W. Biederman, Jason Papadopoulos, linux-kernel, linux-mm

On Sat, 2003-01-25 at 15:10, Larry McVoy wrote:
> All good page coloring implementation do exactly that.  The starting
> index into the page buckets is based on process id.

I think everyone interested in learning more about this
topic should go read the following papers, they were very
helpful when I was fiddling around in this area.

These papers, in turn, reference several others which are
good reads as well.

1) W. L. Lynch, B. K. Bray, and M. J. Flynn. "The effect of page
   allocation on caches". In Micro-25 Conference Proceedings, pages
   222-225, December 1992. 

2) W. Lynch and M. Flynn. "Cache improvements through colored page
   allocation". ACM Transactions on Computer Systems, 1993. Submitted
   for review, 1992. 

3) William L. Lynch. "The Interaction of Virtual Memory and Cache
   Memory". PhD thesis, Stanford University, October
   1993. CSL-TR-93-587.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-25 17:47         ` Eric W. Biederman
@ 2003-01-25 23:10           ` Larry McVoy
  2003-01-26  8:12             ` David S. Miller
  0 siblings, 1 reply; 122+ messages in thread
From: Larry McVoy @ 2003-01-25 23:10 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Larry McVoy, Jason Papadopoulos, linux-kernel, linux-mm

> I am wondering if there is any point in biasing page addresses in between
> processes so that processes are less likely to have a cache conflict.
> i.e.  process 1 address 0 %16K == 0, process 2 address 0 %16K == 4K 

All good page coloring implementation do exactly that.  The starting
index into the page buckets is based on process id.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-25  2:26       ` Larry McVoy
@ 2003-01-25 17:47         ` Eric W. Biederman
  2003-01-25 23:10           ` Larry McVoy
  0 siblings, 1 reply; 122+ messages in thread
From: Eric W. Biederman @ 2003-01-25 17:47 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Jason Papadopoulos, linux-kernel, linux-mm

Larry McVoy <lm@bitmover.com> writes:

> > For the record, I finally got to try my own page coloring patch on a 1GHz
> > Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
> > own number crunching benchmarks and a kernel compile don't show any benefit 
> > at all, and lmbench is completely unchanged except for the mmap latency, 
> > which is slightly worse. Hardly a compelling case for PCs!
> 
> If it works correctly then the variability in lat_ctx should go away.
> Try this
> 
> 	for p in 2 4 8 12 16 24 32 64
> 	do	for size in 0 2 4 8 16
> 		do	for i in 1 2 3 4 5 6 7 8 9 0
> 			do	lat_ctx -s$size $p
> 			done
> 		done
> 	done
> 
> on both the with and without kernel.  The page coloring should make the 
> numbers rock steady, without it, they will bounce a lot.

On the same kind of vein I have seen some tremendous variability in the
stream benchmark.  Under linux I have gotten it to very as much
as a 100MB/sec by running updatedb, between runs.  In one case
it ran faster with updatedb running in the background.

But at the same time streams tends to be very steady if you have a quiet
machine and run it several times in a row repeatedly because it gets
allocated essentially the same memory every run.

So I do no the variables of cache contention do have effect on some
real programs.  I have not yet tracked it down to see if cache coloring
could be a benefit.  I suspect the buddy allocator actually comes
quite close most of the time, and tricks like allocating multiple pages
at once could improve that even more with very little effort, while reducing
page fault miss times.

I am wondering if there is any point in biasing page addresses in between
processes so that processes are less likely to have a cache conflict.
i.e.  process 1 address 0 %16K == 0, process 2 address 0 %16K == 4K 

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  6:06   ` John Alvord
@ 2003-01-25  2:29     ` Jason Papadopoulos
  2003-01-25  2:26       ` Larry McVoy
  0 siblings, 1 reply; 122+ messages in thread
From: Jason Papadopoulos @ 2003-01-25  2:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm

At 10:06 PM 1/23/03 -0800, John Alvord wrote:

>The big challenge in Linux is that several serious attempts to add
>page coloring have foundered on the shoals of "no benefit found". It
>may be that the typical hardware Linux runs on just doesn't experience
>the problem very much.

Another strike against page coloring is that it gives tremendous benefits
when caches are large and not very associative, but if both of these are
not present the benefits are much smaller. In the case of latter-day PCs,
neither of these is the case: the caches are very small and at least 8-way
set associative.

For the record, I finally got to try my own page coloring patch on a 1GHz
Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
own number crunching benchmarks and a kernel compile don't show any benefit 
at all, and lmbench is completely unchanged except for the mmap latency, 
which is slightly worse. Hardly a compelling case for PCs!

Oh well. At least now I'll be able to port to 2.5 :)

jasonp
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-25  2:29     ` Jason Papadopoulos
@ 2003-01-25  2:26       ` Larry McVoy
  2003-01-25 17:47         ` Eric W. Biederman
  0 siblings, 1 reply; 122+ messages in thread
From: Larry McVoy @ 2003-01-25  2:26 UTC (permalink / raw)
  To: Jason Papadopoulos; +Cc: linux-kernel, linux-mm

> For the record, I finally got to try my own page coloring patch on a 1GHz
> Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
> own number crunching benchmarks and a kernel compile don't show any benefit 
> at all, and lmbench is completely unchanged except for the mmap latency, 
> which is slightly worse. Hardly a compelling case for PCs!

If it works correctly then the variability in lat_ctx should go away.
Try this

	for p in 2 4 8 12 16 24 32 64
	do	for size in 0 2 4 8 16
		do	for i in 1 2 3 4 5 6 7 8 9 0
			do	lat_ctx -s$size $p
			done
		done
	done

on both the with and without kernel.  The page coloring should make the 
numbers rock steady, without it, they will bounce a lot.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24 19:14         ` David Lang
@ 2003-01-24 19:40           ` Maciej W. Rozycki
  0 siblings, 0 replies; 122+ messages in thread
From: Maciej W. Rozycki @ 2003-01-24 19:40 UTC (permalink / raw)
  To: David Lang; +Cc: Anoop J., linux-mm, linux-kernel

On Fri, 24 Jan 2003, David Lang wrote:

> the cache never sees the virtual addresses, it operated excclusivly on the
> physical addresses so the problem of aliasing never comes up.

 It depends on the implementation.

> virtual to physical addres mapping is all resolved before anything hits
> the cache.

 It depends on the processor.

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  9:49       ` Anoop J.
@ 2003-01-24 19:14         ` David Lang
  2003-01-24 19:40           ` Maciej W. Rozycki
  0 siblings, 1 reply; 122+ messages in thread
From: David Lang @ 2003-01-24 19:14 UTC (permalink / raw)
  To: Anoop J.; +Cc: linux-mm, linux-kernel

the cache never sees the virtual addresses, it operated excclusivly on the
physical addresses so the problem of aliasing never comes up.

virtual to physical addres mapping is all resolved before anything hits
the cache.

David Lang

On Fri, 24 Jan 2003, Anoop J. wrote:

> Date: Fri, 24 Jan 2003 15:19:16 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: david.lang@digitalinsight.com
> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
> Subject: Re: your mail
>
> ok i shall put it in another way
> since virtual indexing is a representation of the virtual memory,
> it is possible for more multiple virtual addresses to represent the same
> physical address.So the problem of aliasing occurs in the cache.Does page
> coloring guarantee a unique mapping of physical address.If so how is the
> maping from virtual to physical address
>
>
>
> Thanks
>
>
>
> > I think this is a case of the same tuerm being used for two different
> > purposes. I don't know the use you are refering to.
> >
> > David Lang
> >
> >
> >
>
>
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  8:48     ` David Lang
@ 2003-01-24  9:49       ` Anoop J.
  2003-01-24 19:14         ` David Lang
  0 siblings, 1 reply; 122+ messages in thread
From: Anoop J. @ 2003-01-24  9:49 UTC (permalink / raw)
  To: david.lang; +Cc: linux-mm, linux-kernel

ok i shall put it in another way
since virtual indexing is a representation of the virtual memory,
it is possible for more multiple virtual addresses to represent the same
physical address.So the problem of aliasing occurs in the cache.Does page
coloring guarantee a unique mapping of physical address.If so how is the
maping from virtual to physical address



Thanks



> I think this is a case of the same tuerm being used for two different
> purposes. I don't know the use you are refering to.
>
> David Lang
>
>
>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  6:28 ` David Lang
@ 2003-01-24  8:51   ` Anoop J.
  2003-01-24  8:48     ` David Lang
  0 siblings, 1 reply; 122+ messages in thread
From: Anoop J. @ 2003-01-24  8:51 UTC (permalink / raw)
  To: david.lang; +Cc: linux-mm, linux-kernel

I read that the data coherency problem due to virtual indexing is avoided
through page coloring and it has also got the speed of physical indexing
can u just elaborate on how this is possible?


Thanks




> implementing a fully associative cache eliminates the need for page
> coloring, but it has to be implemented in hardware. if you don't have
> fully associative caches in your hardware page coloring helps avoid the
> worst case memory allocations.
>
> from what I have seen on the attempts to implement it the problem is
> that the calculations needed to do page colored allocations end up
> costing enough that they end up with a net loss compared to the old
> method.
>
> David Lang
>
>
>  On Fri, 24 Jan 2003, Anoop J.
> wrote:
>
>> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
>> From: Anoop J. <cs99001@nitc.ac.in>
>> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
>>
>>
>> How is this different from a fully associative cache .Would be better
>> if u could deal it based on the address bits used
>>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  8:51   ` Anoop J.
@ 2003-01-24  8:48     ` David Lang
  2003-01-24  9:49       ` Anoop J.
  0 siblings, 1 reply; 122+ messages in thread
From: David Lang @ 2003-01-24  8:48 UTC (permalink / raw)
  To: Anoop J.; +Cc: linux-mm, linux-kernel

I think this is a case of the same tuerm being used for two different
purposes. I don't know the use you are refering to.

David Lang


On Fri, 24 Jan 2003, Anoop J. wrote:

> I read that the data coherency problem due to virtual indexing is avoided
> through page coloring and it has also got the speed of physical indexing
> can u just elaborate on how this is possible?
>
>
> Thanks
>
>
>
>
> > implementing a fully associative cache eliminates the need for page
> > coloring, but it has to be implemented in hardware. if you don't have
> > fully associative caches in your hardware page coloring helps avoid the
> > worst case memory allocations.
> >
> > from what I have seen on the attempts to implement it the problem is
> > that the calculations needed to do page colored allocations end up
> > costing enough that they end up with a net loss compared to the old
> > method.
> >
> > David Lang
> >
> >
> >  On Fri, 24 Jan 2003, Anoop J.
> > wrote:
> >
> >> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
> >> From: Anoop J. <cs99001@nitc.ac.in>
> >> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
> >>
> >>
> >> How is this different from a fully associative cache .Would be better
> >> if u could deal it based on the address bits used
> >>
>
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  5:54 Anoop J.
@ 2003-01-24  6:28 ` David Lang
  2003-01-24  8:51   ` Anoop J.
  0 siblings, 1 reply; 122+ messages in thread
From: David Lang @ 2003-01-24  6:28 UTC (permalink / raw)
  To: Anoop J.; +Cc: linux-mm, linux-kernel

implementing a fully associative cache eliminates the need for page
coloring, but it has to be implemented in hardware. if you don't have
fully associative caches in your hardware page coloring helps avoid the
worst case memory allocations.

from what I have seen on the attempts to implement it the problem is that
the calculations needed to do page colored allocations end up costing
enough that they end up with a net loss compared to the old method.

David Lang


 On Fri, 24 Jan 2003, Anoop J.
wrote:

> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
>
>
> How is this different from a fully associative cache .Would be better if u
> could deal it based on the address bits used
>
> Thanks
>
> David Lang wrote:
>
> >The idea of page coloring is based on the fact that common implementations
> >of caching can't put any page in memory in any line in the cache (such an
> >implementation is possible, but is more expensive to do so is not commonly
> >done)
> >
> >With this implementation it means that if your program happens to use
> >memory that cannot be mapped to half of the cache lines then effectivly
> >the CPU cache is half it's rated size for your program. the next time your
> >program runs it may get a more favorable memory allocation and be able to
> >use all of the cache and therefor run faster.
> >
> >Page coloring is an attampt to take this into account when allocating
> >memory to programs so that every program gets to use all of the cache.
> >
> >David Lang
> >
> >
> > On Fri, 24 Jan 2003, Anoop J. wrote:
> >
> >>Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
> >>From: Anoop J. <cs99001@nitc.ac.in>
> >>To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
> >>
> >>
> >>How does page coloring work. Iwant its mechanism not the implementation.
> >>I went through some pages of W.L.Lynch's paper on cache and VM. Still not
> >>able to grasp it .
> >>
> >>
> >>Thanks in advance
> >>
> >>
> >>
> >>-
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> >>
> >
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  5:11 ` your mail David Lang
@ 2003-01-24  6:06   ` John Alvord
  2003-01-25  2:29     ` Jason Papadopoulos
  0 siblings, 1 reply; 122+ messages in thread
From: John Alvord @ 2003-01-24  6:06 UTC (permalink / raw)
  To: David Lang; +Cc: Anoop J., linux-kernel, linux-mm

The big challenge in Linux is that several serious attempts to add
page coloring have foundered on the shoals of "no benefit found". It
may be that the typical hardware Linux runs on just doesn't experience
the problem very much.

john


On Thu, 23 Jan 2003 21:11:10 -0800 (PST), David Lang
<david.lang@digitalinsight.com> wrote:

>The idea of page coloring is based on the fact that common implementations
>of caching can't put any page in memory in any line in the cache (such an
>implementation is possible, but is more expensive to do so is not commonly
>done)
>
>With this implementation it means that if your program happens to use
>memory that cannot be mapped to half of the cache lines then effectivly
>the CPU cache is half it's rated size for your program. the next time your
>program runs it may get a more favorable memory allocation and be able to
>use all of the cache and therefor run faster.
>
>Page coloring is an attampt to take this into account when allocating
>memory to programs so that every program gets to use all of the cache.
>
>David Lang
>
>
> On Fri, 24 Jan 2003, Anoop J. wrote:
>
>> Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
>> From: Anoop J. <cs99001@nitc.ac.in>
>> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>>
>>
>> How does page coloring work. Iwant its mechanism not the implementation.
>> I went through some pages of W.L.Lynch's paper on cache and VM. Still not
>> able to grasp it .
>>
>>
>> Thanks in advance
>>
>>
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
@ 2003-01-24  5:54 Anoop J.
  2003-01-24  6:28 ` David Lang
  0 siblings, 1 reply; 122+ messages in thread
From: Anoop J. @ 2003-01-24  5:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel

How is this different from a fully associative cache .Would be better if u
could deal it based on the address bits used

Thanks

David Lang wrote:

>The idea of page coloring is based on the fact that common implementations
>of caching can't put any page in memory in any line in the cache (such an
>implementation is possible, but is more expensive to do so is not commonly
>done)
>
>With this implementation it means that if your program happens to use
>memory that cannot be mapped to half of the cache lines then effectivly
>the CPU cache is half it's rated size for your program. the next time your
>program runs it may get a more favorable memory allocation and be able to
>use all of the cache and therefor run faster.
>
>Page coloring is an attampt to take this into account when allocating
>memory to programs so that every program gets to use all of the cache.
>
>David Lang
>
>
> On Fri, 24 Jan 2003, Anoop J. wrote:
>
>>Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
>>From: Anoop J. <cs99001@nitc.ac.in>
>>To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>>
>>
>>How does page coloring work. Iwant its mechanism not the implementation.
>>I went through some pages of W.L.Lynch's paper on cache and VM. Still not
>>able to grasp it .
>>
>>
>>Thanks in advance
>>
>>
>>
>>-
>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>the body of a message to majordomo@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>Please read the FAQ at  http://www.tux.org/lkml/
>>
>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2003-01-24  5:08 (unknown), Anoop J.
@ 2003-01-24  5:11 ` David Lang
  2003-01-24  6:06   ` John Alvord
  0 siblings, 1 reply; 122+ messages in thread
From: David Lang @ 2003-01-24  5:11 UTC (permalink / raw)
  To: Anoop J.; +Cc: linux-kernel, linux-mm

The idea of page coloring is based on the fact that common implementations
of caching can't put any page in memory in any line in the cache (such an
implementation is possible, but is more expensive to do so is not commonly
done)

With this implementation it means that if your program happens to use
memory that cannot be mapped to half of the cache lines then effectivly
the CPU cache is half it's rated size for your program. the next time your
program runs it may get a more favorable memory allocation and be able to
use all of the cache and therefor run faster.

Page coloring is an attampt to take this into account when allocating
memory to programs so that every program gets to use all of the cache.

David Lang


 On Fri, 24 Jan 2003, Anoop J. wrote:

> Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>
>
> How does page coloring work. Iwant its mechanism not the implementation.
> I went through some pages of W.L.Lynch's paper on cache and VM. Still not
> able to grasp it .
>
>
> Thanks in advance
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2002-04-21 14:54 raciel
@ 2002-04-21 19:12 ` William Lee Irwin III
  0 siblings, 0 replies; 122+ messages in thread
From: William Lee Irwin III @ 2002-04-21 19:12 UTC (permalink / raw)
  To: raciel; +Cc: linux-mm

On Sun, Apr 21, 2002 at 02:54:00PM +0000, raciel wrote:
> Hello all :)
> 	I have been trying to understand the rmap patch from Rik Van Riel, but
> i dont undertand what the rmap patch do. Somebody can explain me or know
> a good site where i can get documentation?
> Regards Raciel 

The largest piece of functionality is additional accounting that enables
the kernel to find all users of a given page. This is known as physical
scanning, and other OS's call similar functionality "pmap", "ptov" (for
physical to virtual), and "HAT" (Hardware Address Translation), though
the interfaces are just as different as the names.

There are two very important translation mechanisms:
	(1) physical page to page table entry
	(2) 3rd-level pagetable to address space (mm_struct)

(1) is accomplished by using a singly linked list of page table entry
	addresses attached to a per-physical-page structure (struct page).

(2) is accomplished by reusing one of the fields of struct page for the
	3rd-level pagetable to hold the pointer to the mm_struct.

To clarify how a physical page is handled, the page replacement
algorithm might decide that a given physical page is targeted for
eviction. It then calls try_to_unmap(), which traverses the list of
3rd-level pagetable entries, and then rounds their addresses to
obtain the physical page occupied by the 3rd-level pagetable, then
it finds the struct page for that physical page, and then it tries
to obtain locks on the address space's pagetable lock and when it
does, it just removes the thing from the pagetable, otherwise it
returns an error code saying "try again later" (SWAP_AGAIN).

All this requires is

(1) putting an entry onto the per-page list when entering a page
	into pagetables

(2) pulling things off the per-page list when removing a page
	 from pagetables

(3) marking a 3rd-level pagetable's struct page with the mm_struct

Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2002-01-02 14:20 mehul radheshyam choube
@ 2002-01-03 16:40 ` Rik van Riel
  0 siblings, 0 replies; 122+ messages in thread
From: Rik van Riel @ 2002-01-03 16:40 UTC (permalink / raw)
  To: mehul radheshyam choube; +Cc: linux-mm

On 2 Jan 2002, mehul radheshyam choube wrote:

>    i would like to do some developement work for linux-mm
>    please guide me .

This all depends on what kind of project you would be
interested in. ;)

If you want to play with new VM features, you could
start by testing the reverse mapping patch and finding
bugs, maybe looking at how things work or even at
implementing some items from the TODO list ;)

http://surriel.com/patches/
http://linuxvm.bkbits.net/

kind regards,

Rik
-- 
Shortwave goes a long way:  irc.starchat.net  #swl

http://www.surriel.com/		http://distro.conectiva.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2001-08-04 11:10 Mahmoud Taghizadeh
@ 2001-08-04 13:18 ` Francois Romieu
  0 siblings, 0 replies; 122+ messages in thread
From: Francois Romieu @ 2001-08-04 13:18 UTC (permalink / raw)
  To: m_taghi; +Cc: linux-mm

The Sat, Aug 04, 2001 at 03:40:43PM +0430, Mahmoud Taghizadeh wrote :
[...]
> is there any linux distribution without memory managment unit?
> I mean, the paging is disabled and memory managemnt is done only
> by segmentation.

ELKS ?

-- 
Ueimor
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2001-06-08  1:36 jnn
@ 2001-06-08 13:16 ` Ralf Baechle
  0 siblings, 0 replies; 122+ messages in thread
From: Ralf Baechle @ 2001-06-08 13:16 UTC (permalink / raw)
  To: jnn; +Cc: linux-mm

On Fri, Jun 08, 2001 at 09:36:48AM +0800, jnn wrote:

>      somebody PLS tell me where can I find some documention about the mechanism of the cach flushing.

Right there in your kernel source, in Documentation/cachetlb.txt.  It
probably still leaves alot of questions open.

  Ralf
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2000-09-04 12:01 Sahil
@ 2000-09-04 15:35 ` Rik van Riel
  0 siblings, 0 replies; 122+ messages in thread
From: Rik van Riel @ 2000-09-04 15:35 UTC (permalink / raw)
  To: Sahil; +Cc: linux-mm

On Mon, 4 Sep 2000, Sahil wrote:

> I am a newbie to kernel programming.
> can anyone suggest some good readings for the same?

http://kernelnewbies.org/
http://www.linux.eu.org/Linux-MM/

cheers,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  2000-03-28  8:19 pnilesh
@ 2000-03-28 13:26 ` Stephen C. Tweedie
  0 siblings, 0 replies; 122+ messages in thread
From: Stephen C. Tweedie @ 2000-03-28 13:26 UTC (permalink / raw)
  To: pnilesh; +Cc: Kanoj Sarcar, linux-mm, Stephen Tweedie

Hi,

On Tue, Mar 28, 2000 at 01:49:04PM +0530, pnilesh@in.ibm.com wrote:
> 
> No, if both processes have faulted in the page into their ptes, it will
> be 2.

3.  The page cache counts as a reference.

> One more thing if the process ocurrs a page fault on text page it calls
> file_no_page()
> From what you said in this case it should increment the page count but in
> this function no where I could see the page count getting incremented.

It is done implicitly when filemap_nopage() looks for the page in the
page cache: __find_page() increments the reference count of any page
it finds before returning.

> The David Rusling book says when reducing page cache and buffer cache the
> page table entries are not modified and the pages can be dropped directly.

Yes, but it checks the page reference count to make sure it is legal to
do so first.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
@ 2000-03-28  8:19 pnilesh
  2000-03-28 13:26 ` Stephen C. Tweedie
  0 siblings, 1 reply; 122+ messages in thread
From: pnilesh @ 2000-03-28  8:19 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: linux-mm

No, if both processes have faulted in the page into their ptes, it will
be 2. The page count is normally the number of references from user
ptes, plus any long/short term holds kernel code establishes on the
page.

I was confused as Maurice Bach increases region reference count when any
region say text is shared among more than one processes, and not the page
reference count.

One more thing if the process ocurrs a page fault on text page it calls
file_no_page()
>From what you said in this case it should increment the page count but in
this function no where I could see the page count getting incremented.

>
> Q    When a page of a file is in page hash queue, does this page have
page
> table entry in any process ?

Possibly, if the file is mmaped into some other process.

> Q     Can this be discarded right away , if the need arises?
>
At the minimum, you need to write modified contents back to disk, if
the file page has not already been discarded.

The David Rusling book says when reducing page cache and buffer cache the
page table entries are not modified and the pages can be dropped directly.

Kanoj

> Nilesh Patel
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  1998-02-25 22:48 ` your mail Linus Torvalds
@ 1998-02-25 23:26   ` Rik van Riel
  0 siblings, 0 replies; 122+ messages in thread
From: Rik van Riel @ 1998-02-25 23:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, linux-mm

On Wed, 25 Feb 1998, Linus Torvalds wrote:

> > We could simply prohibit the VM subsystem from swapping
> > out pages which have been allocated less than one second
> > ago, this way the movement of pages becomes 'slower', and
> > thrashing might get somewhat less.
> 
> The reason I dislike rate-based things is that it is _so_ hard to give any
> kind of guarantees at all on the sanity of the thing. You can tweak the
> rates, but they don't have any logic to them, and most importantly they
> are very hard to make self-tweaking. 

Yup, I don't like 'em either, but I proposed it anyways
in case anyone of you might like it :-)
Personally, I like balancing code too, if possible with
some balancing on the balancing code as well...

OK, we go with balancing code.

Rik.
+-----------------------------+------------------------------+
| For Linux mm-patches, go to | "I'm busy managing memory.." |
| my homepage (via LinuxHQ).  | H.H.vanRiel@fys.ruu.nl       |
| ...submissions welcome...   | http://www.fys.ruu.nl/~riel/ |
+-----------------------------+------------------------------+

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: your mail
  1998-02-25 22:15 Rik van Riel
@ 1998-02-25 22:48 ` Linus Torvalds
  1998-02-25 23:26   ` Rik van Riel
  0 siblings, 1 reply; 122+ messages in thread
From: Linus Torvalds @ 1998-02-25 22:48 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, linux-mm

On Wed, 25 Feb 1998, Rik van Riel wrote:
> 
> I've just come up with a very simple idea to limit
> thrashing, and I'm asking you if you want it implemented
> (there's some cost involved :-( ).
> 
> We could simply prohibit the VM subsystem from swapping
> out pages which have been allocated less than one second
> ago, this way the movement of pages becomes 'slower', and
> thrashing might get somewhat less.

I'm _really_ nervous about "rate-based" limiting. Personally I think that
it only makes sense for things like networking, and even there it had
better be done by hardware. 

The reason I dislike rate-based things is that it is _so_ hard to give any
kind of guarantees at all on the sanity of the thing. You can tweak the
rates, but they don't have any logic to them, and most importantly they
are very hard to make self-tweaking. 

I tend to prefer a "balancing" approach to these problems: the important
part about balancing is that while it may not have some specific
well-defined behaviour that you can point your finger to ("will not page
out the same page that it paged in within 5 seconds"), the basic approach
is to write something that doesn't have any hard rules but that TENDS
towards some certain goal. 

That way you get algorithms that you can be fairly confident work well in
the normal cases (because you test those normal cases), and because there
are no hard rules you also don't get strange "edges" when something
unexpected happens: performance may well degrade badly, but it degrades
_softly_ rather than in quantisized jumps.

			Linus

^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2025-02-28 17:30 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
2025-02-24 22:52 ` [PATCH v7 1/7] mseal, system mappings: kernel config and header change jeffxu
2025-02-25  6:05   ` Lorenzo Stoakes
2025-02-26  1:33     ` Jeff Xu
2025-02-26  6:04       ` Lorenzo Stoakes
2025-02-28  0:04         ` Jeff Xu
2025-02-28 10:32           ` Lorenzo Stoakes
2025-02-25 15:22   ` Liam R. Howlett
2025-02-25 15:37     ` Lorenzo Stoakes
2025-02-26  0:04     ` Jeff Xu
2025-02-24 22:52 ` [PATCH v7 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed jeffxu
2025-02-25  6:15   ` Lorenzo Stoakes
2025-02-25 22:37     ` Jeff Xu
2025-02-26  5:58       ` Lorenzo Stoakes
2025-02-24 22:52 ` [PATCH v7 3/7] mseal, system mappings: enable x86-64 jeffxu
2025-02-25  1:03   ` Kees Cook
2025-02-26  0:21     ` Jeff Xu
2025-02-25  8:08   ` Thomas Weißschuh
2025-02-26  0:48     ` Jeff Xu
2025-02-26  7:35       ` Thomas Weißschuh
2025-02-27 21:44         ` Jeff Xu
2025-02-24 22:52 ` [PATCH v7 4/7] mseal, system mappings: enable arm64 jeffxu
2025-02-25  6:20   ` Lorenzo Stoakes
2025-02-25 22:26     ` Jeff Xu
2025-02-26  5:25       ` Lorenzo Stoakes
2025-02-26 17:11         ` Liam R. Howlett
2025-02-26 17:17           ` Jeff Xu
2025-02-26 17:43             ` Lorenzo Stoakes
2025-02-26 18:14               ` Lorenzo Stoakes
2025-02-28  0:48               ` Jeff Xu
2025-02-28 10:31                 ` Lorenzo Stoakes
2025-02-24 22:52 ` [PATCH v7 5/7] mseal, system mappings: enable uml architecture jeffxu
2025-02-25  6:22   ` Lorenzo Stoakes
2025-02-25  8:45     ` Berg, Benjamin
2025-02-25 10:37       ` Lorenzo Stoakes
2025-02-25 12:24         ` Benjamin Berg
2025-02-25 13:41           ` Lorenzo Stoakes
2025-02-25 13:59             ` Johannes Berg
2025-02-25 15:06         ` Kees Cook
2025-02-25 15:31           ` Lorenzo Stoakes
2025-02-25 18:38             ` Kees Cook
2025-02-26  0:00               ` Jeff Xu
2025-02-24 22:52 ` [PATCH v7 6/7] mseal, system mappings: uprobe mapping jeffxu
2025-02-25  6:24   ` Lorenzo Stoakes
2025-02-26  0:06     ` Jeff Xu
2025-02-26  5:57       ` Lorenzo Stoakes
2025-02-26 16:26   ` Oleg Nesterov
2025-02-26 16:33     ` Oleg Nesterov
2025-02-26 16:45     ` Lorenzo Stoakes
2025-02-26 18:01       ` Oleg Nesterov
2025-02-26 18:06         ` Lorenzo Stoakes
2025-02-26 18:19           ` Liam R. Howlett
2025-02-26 18:20           ` Oleg Nesterov
2025-02-26 18:25             ` Lorenzo Stoakes
2025-02-27 23:38               ` Jeff Xu
2025-02-28 10:39                 ` Lorenzo Stoakes
2025-02-27 21:48             ` Jeff Xu
2025-02-24 22:52 ` [PATCH v7 7/7] mseal, system mappings: update mseal.rst jeffxu
2025-02-24 23:03 ` [PATCH v7 0/7] mseal system mappings Pedro Falcato
2025-02-24 23:07   ` Jeff Xu
2025-02-25  6:09     ` Lorenzo Stoakes
2025-02-25 10:32 ` Lorenzo Stoakes
2025-02-26  0:17   ` Jeff Xu
2025-02-26  6:00     ` Lorenzo Stoakes
2025-02-27 23:43       ` Jeff Xu
2025-02-28 10:32         ` Lorenzo Stoakes
2025-02-25 15:18 ` Lorenzo Stoakes
2025-02-26  0:12   ` Jeff Xu
2025-02-26  5:42     ` your mail Lorenzo Stoakes
2025-02-28  0:55       ` Jeff Xu
2025-02-28  9:35         ` Lorenzo Stoakes
2025-02-28 17:24           ` Jeff Xu
2025-02-28 17:30             ` Lorenzo Stoakes
  -- strict thread matches above, loose matches on Subject: below --
2023-05-10 19:01 [PATCH] maple_tree: Fix a few documentation issues, Thomas Gleixner
2023-05-15 19:27 ` your mail Liam R. Howlett
2023-05-15 21:16   ` Thomas Gleixner
2023-05-16 22:47   ` Thomas Gleixner
2023-05-23 13:46     ` Liam R. Howlett
     [not found] <20190225201635.4648-1-hannes@cmpxchg.org>
2019-02-26 23:49 ` Roman Gushchin
2017-04-10 11:03 [PATCH -v2 0/9] mm: make movable onlining suck less Michal Hocko
2017-04-15 12:17 ` Michal Hocko
2017-04-17  5:47   ` your mail Joonsoo Kim
2017-04-17  8:15     ` Michal Hocko
2017-04-20  1:27       ` Joonsoo Kim
2017-04-20  7:28         ` Michal Hocko
2017-04-20  8:49           ` Michal Hocko
2017-04-20 11:56             ` Vlastimil Babka
2017-04-20 12:13               ` Michal Hocko
2017-04-21  4:38           ` Joonsoo Kim
2017-04-21  7:16             ` Michal Hocko
2017-04-24  1:44               ` Joonsoo Kim
2017-04-24  7:53                 ` Michal Hocko
2017-04-25  2:50                   ` Joonsoo Kim
2017-04-26  9:19                     ` Michal Hocko
2017-04-27  2:08                       ` Joonsoo Kim
2017-04-27 15:10                         ` Michal Hocko
2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
2010-06-16 16:33 Jan Kara
2010-06-16 22:15 ` your mail Dave Chinner
2010-06-22  2:59 ` Wu Fengguang
2010-06-22 13:54   ` Jan Kara
2010-06-22 14:12     ` Wu Fengguang
     [not found] <1131.86.55.168.2.1170690089.squirrel@mail.thinknet.ro>
2007-02-05 12:36 ` Joerg Roedel
2003-01-24  5:54 Anoop J.
2003-01-24  6:28 ` David Lang
2003-01-24  8:51   ` Anoop J.
2003-01-24  8:48     ` David Lang
2003-01-24  9:49       ` Anoop J.
2003-01-24 19:14         ` David Lang
2003-01-24 19:40           ` Maciej W. Rozycki
2003-01-24  5:08 (unknown), Anoop J.
2003-01-24  5:11 ` your mail David Lang
2003-01-24  6:06   ` John Alvord
2003-01-25  2:29     ` Jason Papadopoulos
2003-01-25  2:26       ` Larry McVoy
2003-01-25 17:47         ` Eric W. Biederman
2003-01-25 23:10           ` Larry McVoy
2003-01-26  8:12             ` David S. Miller
2002-04-21 14:54 raciel
2002-04-21 19:12 ` your mail William Lee Irwin III
2002-01-02 14:20 mehul radheshyam choube
2002-01-03 16:40 ` your mail Rik van Riel
2001-08-04 11:10 Mahmoud Taghizadeh
2001-08-04 13:18 ` your mail Francois Romieu
2001-06-08  1:36 jnn
2001-06-08 13:16 ` your mail Ralf Baechle
2000-09-04 12:01 Sahil
2000-09-04 15:35 ` your mail Rik van Riel
2000-03-28  8:19 pnilesh
2000-03-28 13:26 ` Stephen C. Tweedie
1998-02-25 22:15 Rik van Riel
1998-02-25 22:48 ` your mail Linus Torvalds
1998-02-25 23:26   ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox