* [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
@ 2023-12-01 9:46 Yafang Shao
2023-12-01 9:46 ` [PATCH v3 1/7] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
` (6 more replies)
0 siblings, 7 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
Background
==========
In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks despite the system having ample free memory.
This anomaly is traced back to tasks within a container using mbind(2) to
bind memory to a specific NUMA node. When the allocated memory on this node
is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
indiscriminately kills tasks.
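For readers unfamiliar with the trigger, the binding described above can be
reproduced from userspace roughly as follows. This is a minimal sketch under
stated assumptions, not code from the affected workloads: the helpers
node_to_mask() and bind_region_to_node() are hypothetical names, EX_MPOL_BIND
mirrors the uapi value of MPOL_BIND, and the raw syscall is used so that
libnuma is not required. It only handles nodes that fit in a single
unsigned long.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors MPOL_BIND from <linux/mempolicy.h>; redefined here only to keep
 * the sketch self-contained (assumption: the uapi value is 2). */
enum { EX_MPOL_BIND = 2 };

#define EX_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: build a single-word nodemask with only @node set. */
static unsigned long node_to_mask(unsigned int node)
{
	assert(node < EX_BITS_PER_LONG);
	return 1UL << node;
}

/* Hypothetical helper: bind [addr, addr + len) to @node with MPOL_BIND.
 * Returns 0 on success, or -1 with errno set by the kernel. */
static long bind_region_to_node(void *addr, size_t len, unsigned int node)
{
	unsigned long mask = node_to_mask(node);

	return syscall(__NR_mbind, addr, len, (unsigned long)EX_MPOL_BIND,
		       &mask, (unsigned long)EX_BITS_PER_LONG, 0UL);
}
```

A task that calls bind_region_to_node(addr, len, 0) for its allocations is
confined to node 0, which is the situation in which the OOM-killer can fire
while other nodes still have free memory.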
The Challenge
=============
In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.
Currently, users possess the ability to autonomously bind their memory to
specific nodes without explicit agreement or authorization from our end.
It's imperative that we establish a method to prevent this behavior.
Proposed Solution
=================
- Capability
Currently, any task can perform MPOL_BIND without specific capabilities.
Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
may have unintended consequences. Capabilities, being broad, might grant
unnecessary privileges. We should explore alternatives to prevent
unexpected side effects.
- LSM
Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
to disable MPOL_BIND. This approach is more flexible and allows for
fine-grained control without unintended consequences. A sample LSM BPF
program is included, demonstrating practical implementation in a
production environment.
- seccomp
seccomp is relatively heavyweight, making it less suitable for
enabling in our production environment:
- Both kubelet and containers need adaptation to support it.
- Dynamically altering security policies for individual containers
without interrupting their operations isn't straightforward.
Future Considerations
=====================
In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
prioritize selecting a victim that has allocated memory on the same NUMA
node. My search on lore turned up a proposal[0] related to this
matter, although consensus seems elusive at this point. Nevertheless,
delving into this specific topic is beyond the scope of the current
patchset.
[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
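To illustrate the idea only: the sketch below prefers the largest consumer on
the constrained node and falls back to the largest consumer overall when no
candidate has pages there. It is not the algorithm from [0] and not the
kernel's oom_badness() logic; struct ex_task, ex_pick_victim() and the
per-node page counters are hypothetical, and oom_score_adj is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_NODES 4

/* Hypothetical per-task accounting: pages resident on each NUMA node. */
struct ex_task {
	int pid;
	unsigned long pages_on_node[EX_MAX_NODES];
};

static unsigned long ex_total_pages(const struct ex_task *t)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < EX_MAX_NODES; i++)
		sum += t->pages_on_node[i];
	return sum;
}

/* Return the index of the chosen victim, or -1 if there are no tasks. */
static int ex_pick_victim(const struct ex_task *tasks, size_t n,
			  unsigned int node)
{
	unsigned long best_pages = 0;
	int best = -1;

	assert(node < EX_MAX_NODES);

	/* First pass: largest consumer on the constrained node. */
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].pages_on_node[node] > best_pages) {
			best_pages = tasks[i].pages_on_node[node];
			best = (int)i;
		}
	}
	if (best >= 0)
		return best;

	/* Fallback: nobody uses that node, so take the largest task overall. */
	for (size_t i = 0; i < n; i++) {
		unsigned long total = ex_total_pages(&tasks[i]);

		if (best < 0 || total > best_pages) {
			best_pages = total;
			best = (int)i;
		}
	}
	return best;
}

/* Sample data: task 0 is the biggest overall but has nothing on node 1. */
static const struct ex_task ex_sample[] = {
	{ 100, { 900, 0, 0, 0 } },
	{ 200, { 10, 50, 0, 0 } },
	{ 300, { 0, 20, 0, 0 } },
};
```

With ex_sample, an OOM constrained to node 1 selects pid 200 rather than the
globally largest pid 100, which is the behaviour argued for above.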
Changes:
- RFC v2 -> v3:
- Add MPOL_F_NUMA_BALANCING man-page (Ying)
- Fix bpf selftests error reported by bot+bpf-ci
- RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
- Refine the commit log to avoid being misleading
- Use one common lsm hook instead and add comment for it
- Add selinux implementation
- Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/
Yafang Shao (6):
mm, doc: Add doc for MPOL_F_NUMA_BALANCING
mm: mempolicy: Revise comment regarding mempolicy mode flags
mm, security: Fix missed security_task_movememory()
mm, security: Add lsm hook for memory policy adjustment
security: selinux: Implement set_mempolicy hook
selftests/bpf: Add selftests for set_mempolicy with a lsm prog
.../admin-guide/mm/numa_memory_policy.rst | 27 +++++++
include/linux/lsm_hook_defs.h | 3 +
include/linux/security.h | 9 +++
include/uapi/linux/mempolicy.h | 2 +-
mm/mempolicy.c | 22 ++++-
security/security.c | 13 +++
security/selinux/hooks.c | 8 ++
security/selinux/include/classmap.h | 2 +-
.../selftests/bpf/prog_tests/set_mempolicy.c | 81 +++++++++++++++++++
.../selftests/bpf/progs/test_set_mempolicy.c | 28 +++++++
10 files changed, 192 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
NOT kernel:
Yafang Shao (1):
NOT kernel/man2/mbind.2: Add mode flag MPOL_F_NUMA_BALANCING
--
2.30.1 (Apple Git-130)
* [PATCH v3 1/7] mm, doc: Add doc for MPOL_F_NUMA_BALANCING
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 9:46 ` [PATCH v3 2/7] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
The MPOL_F_NUMA_BALANCING documentation was inadvertently omitted from the
initial commit bda420b98505 ("numa balancing: migrate on fault among
multiple bound nodes").
Let's ensure its inclusion.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
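For reference, the flag documented here could be used from userspace roughly
as sketched below. This is an illustration under assumptions, not part of the
patch: bind_mode() and bind_task_with_balancing() are hypothetical helpers,
the EX_* constants mirror the uapi values, and the fallback on EINVAL assumes
the error came from a kernel that lacks the flag (it can also be returned for
other invalid arguments, in which case the retry fails the same way).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirror <linux/mempolicy.h>; redefined to keep the sketch self-contained. */
enum {
	EX_MPOL_BIND = 2,
	EX_MPOL_F_NUMA_BALANCING = 1 << 13,
};

/* Hypothetical helper: compose the mode argument for set_mempolicy(2). */
static int bind_mode(int want_balancing)
{
	return want_balancing ? (EX_MPOL_BIND | EX_MPOL_F_NUMA_BALANCING)
			      : EX_MPOL_BIND;
}

/* Hypothetical helper: bind the calling task to @nodemask, asking for NUMA
 * balancing first and retrying without the flag if the kernel says EINVAL.
 * Returns 0 on success, or -1 with errno set. */
static long bind_task_with_balancing(unsigned long nodemask)
{
	unsigned long maxnode = sizeof(nodemask) * CHAR_BIT;
	long ret;

	assert(nodemask != 0); /* MPOL_BIND needs a non-empty nodemask */

	ret = syscall(__NR_set_mempolicy, (long)bind_mode(1), &nodemask,
		      maxnode);
	if (ret == -1 && errno == EINVAL)
		ret = syscall(__NR_set_mempolicy, (long)bind_mode(0),
			      &nodemask, maxnode);
	return ret;
}
```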
.../admin-guide/mm/numa_memory_policy.rst | 27 +++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..19071b71979c 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -332,6 +332,33 @@ MPOL_F_RELATIVE_NODES
MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation).
+MPOL_F_NUMA_BALANCING (since Linux 5.12)
+ When operating in MPOL_BIND mode, enables NUMA balancing for tasks,
+ contingent upon kernel support. This feature optimizes page
+ placement within the confines of the specified memory binding
+ policy. The addition of the MPOL_F_NUMA_BALANCING flag augments the
+ control mechanism for NUMA balancing:
+
+ - The sysctl knob numa_balancing governs global activation or
+ deactivation of NUMA balancing.
+
+ - Even if sysctl numa_balancing is enabled, NUMA balancing remains
+ disabled by default for memory areas or applications utilizing
+ explicit memory policies.
+
+ - The MPOL_F_NUMA_BALANCING flag facilitates NUMA balancing
+ activation for applications employing explicit memory policies
+ (MPOL_BIND).
+
+ This flag enables various optimizations for page placement through
+ NUMA balancing. For instance, when an application's memory is bound
+ to multiple nodes (MPOL_BIND), the hint page fault handler attempts
+ to migrate accessed pages to reduce cross-node access if the
+ accessing node aligns with the policy nodemask.
+
+ If the flag isn't supported by the kernel, or is used with mode
+ other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
+
Memory Policy Reference Counting
================================
--
2.30.1 (Apple Git-130)
* [PATCH v3 2/7] mm: mempolicy: Revise comment regarding mempolicy mode flags
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-12-01 9:46 ` [PATCH v3 1/7] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 9:46 ` [PATCH v3 3/7] mm, security: Fix missed security_task_movememory() Yafang Shao
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao,
Eric Dumazet
MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, and MPOL_F_NUMA_BALANCING are
mode flags applicable to both set_mempolicy(2) and mbind(2) system calls.
It's worth noting that MPOL_F_NUMA_BALANCING was initially introduced in
commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") exclusively for set_mempolicy(2). However, it was later made a
shared flag for both set_mempolicy(2) and mbind(2) following
commit 6d2aec9e123b ("mm/mempolicy: do not allow illegal
MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()").
This revised version aims to clarify the details regarding the mode flags.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
---
include/uapi/linux/mempolicy.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..afed4a45f5b9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -26,7 +26,7 @@ enum {
MPOL_MAX, /* always last member of enum */
};
-/* Flags for set_mempolicy */
+/* Flags for set_mempolicy() or mbind() */
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
#define MPOL_F_NUMA_BALANCING (1 << 13) /* Optimize with NUMA balancing if possible */
--
2.30.1 (Apple Git-130)
* [PATCH v3 3/7] mm, security: Fix missed security_task_movememory()
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-12-01 9:46 ` [PATCH v3 1/7] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
2023-12-01 9:46 ` [PATCH v3 2/7] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 20:50 ` Serge E. Hallyn
2023-12-01 9:46 ` [PATCH v3 4/7] mm, security: Add lsm hook for memory policy adjustment Yafang Shao
` (3 subsequent siblings)
6 siblings, 1 reply; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
Considering that MPOL_F_NUMA_BALANCING or mbind(2) using either
MPOL_MF_MOVE or MPOL_MF_MOVE_ALL are capable of memory movement, it's
essential to include security_task_movememory() to cover this
functionality as well. It was identified during a code review.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
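As a reading aid for the sanitize_mpol_flags() hunk below, here is a userspace
model of the lines visible in it: the combined mode argument is split into a
mode and its mode flags, and MPOL_F_NUMA_BALANCING is rejected unless the mode
is MPOL_BIND. It is a sketch only; struct ex_split and
model_sanitize_mpol_flags() are hypothetical names, the EX_* constants mirror
the uapi values, and the kernel's other checks and internal flags are left out.

```c
#include <assert.h>
#include <errno.h>

/* Mirror <linux/mempolicy.h>; redefined to keep the sketch self-contained. */
enum { EX_MPOL_BIND = 2 };
#define EX_MPOL_F_STATIC_NODES   (1 << 15)
#define EX_MPOL_F_RELATIVE_NODES (1 << 14)
#define EX_MPOL_F_NUMA_BALANCING (1 << 13)
#define EX_MPOL_MODE_FLAGS \
	(EX_MPOL_F_STATIC_NODES | EX_MPOL_F_RELATIVE_NODES | \
	 EX_MPOL_F_NUMA_BALANCING)

/* Hypothetical result type: error code plus the separated mode and flags. */
struct ex_split {
	int err;
	int mode;
	unsigned short flags;
};

static struct ex_split model_sanitize_mpol_flags(int mode_arg)
{
	struct ex_split s;

	/* Split the user-supplied word into mode flags and bare mode. */
	s.flags = (unsigned short)(mode_arg & EX_MPOL_MODE_FLAGS);
	s.mode = mode_arg & ~EX_MPOL_MODE_FLAGS;
	s.err = 0;

	/* NUMA balancing is only valid together with MPOL_BIND. This patch
	 * would add the security_task_movememory(current) call right after
	 * this check succeeds. */
	if ((s.flags & EX_MPOL_F_NUMA_BALANCING) && s.mode != EX_MPOL_BIND)
		s.err = -EINVAL;

	assert((s.mode & EX_MPOL_MODE_FLAGS) == 0);
	return s;
}
```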
mm/mempolicy.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..1eafe81d782e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1259,8 +1259,15 @@ static long do_mbind(unsigned long start, unsigned long len,
if (!new)
flags |= MPOL_MF_DISCONTIG_OK;
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+ err = security_task_movememory(current);
+ if (err) {
+ mpol_put(new);
+ return err;
+ }
lru_cache_disable();
+ }
+
{
NODEMASK_SCRATCH(scratch);
if (scratch) {
@@ -1450,6 +1457,8 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
{
+ int err;
+
*flags = *mode & MPOL_MODE_FLAGS;
*mode &= ~MPOL_MODE_FLAGS;
@@ -1460,6 +1469,9 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
if (*flags & MPOL_F_NUMA_BALANCING) {
if (*mode != MPOL_BIND)
return -EINVAL;
+ err = security_task_movememory(current);
+ if (err)
+ return err;
*flags |= (MPOL_F_MOF | MPOL_F_MORON);
}
return 0;
--
2.30.1 (Apple Git-130)
* [PATCH v3 4/7] mm, security: Add lsm hook for memory policy adjustment
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (2 preceding siblings ...)
2023-12-01 9:46 ` [PATCH v3 3/7] mm, security: Fix missed security_task_movememory() Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 9:46 ` [PATCH v3 5/7] security: selinux: Implement set_mempolicy hook Yafang Shao
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.
At present, users have the capability to bind their memory to a specific
node without explicit agreement or authorization from us. Consequently, a
new LSM hook is introduced to mitigate this. This implementation allows us
to exercise fine-grained control over memory policy adjustments within our
container environment.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
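To make the intended semantics concrete, the sketch below models how the new
hook is consulted: each registered implementation is called in order and the
first non-zero return value denies the request, which is how call_int_hook()
behaves for int hooks with a default of 0. Everything prefixed with ex_ is
hypothetical userspace code for illustration, and the nodemask is simplified
to a plain unsigned long pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mirror <linux/mempolicy.h>; redefined to keep the sketch self-contained. */
enum { EX_MPOL_DEFAULT = 0, EX_MPOL_BIND = 2 };

/* Hypothetical stand-in for one LSM's set_mempolicy implementation. */
typedef int (*ex_mempolicy_hook)(unsigned long mode,
				 unsigned short mode_flags,
				 const unsigned long *nmask,
				 unsigned int flags);

static int ex_allow_all(unsigned long mode, unsigned short mode_flags,
			const unsigned long *nmask, unsigned int flags)
{
	(void)mode; (void)mode_flags; (void)nmask; (void)flags;
	return 0;
}

/* Policy from the cover letter: refuse MPOL_BIND, allow everything else. */
static int ex_deny_bind(unsigned long mode, unsigned short mode_flags,
			const unsigned long *nmask, unsigned int flags)
{
	(void)mode_flags; (void)nmask; (void)flags;
	return mode == EX_MPOL_BIND ? -EPERM : 0;
}

/* Model of security_set_mempolicy(): the first non-zero result wins. */
static int ex_security_set_mempolicy(const ex_mempolicy_hook *hooks, size_t n,
				     unsigned long mode,
				     unsigned short mode_flags,
				     const unsigned long *nmask,
				     unsigned int flags)
{
	for (size_t i = 0; i < n; i++) {
		int rc;

		assert(hooks[i] != NULL);
		rc = hooks[i](mode, mode_flags, nmask, flags);
		if (rc)
			return rc;
	}
	return 0;
}

static const ex_mempolicy_hook ex_hooks[] = { ex_allow_all, ex_deny_bind };
```

Because kernel_mbind() and kernel_set_mempolicy() return the hook's error
before reaching do_mbind() or do_set_mempolicy(), a denial leaves the task's
existing policy untouched.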
include/linux/lsm_hook_defs.h | 3 +++
include/linux/security.h | 9 +++++++++
mm/mempolicy.c | 8 ++++++++
security/security.c | 13 +++++++++++++
4 files changed, 33 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index ff217a5ce552..558012719f98 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -419,3 +419,6 @@ LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
LSM_HOOK(int, 0, uring_sqpoll, void)
LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
#endif /* CONFIG_IO_URING */
+
+LSM_HOOK(int, 0, set_mempolicy, unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
diff --git a/include/linux/security.h b/include/linux/security.h
index 1d1df326c881..cc4a19a0888c 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -484,6 +484,8 @@ int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
int security_locked_down(enum lockdown_reason what);
+int security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags);
#else /* CONFIG_SECURITY */
static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1395,6 +1397,13 @@ static inline int security_locked_down(enum lockdown_reason what)
{
return 0;
}
+
+static inline int
+security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1eafe81d782e..9a260dd24a4b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1495,6 +1495,10 @@ static long kernel_mbind(unsigned long start, unsigned long len,
if (err)
return err;
+ err = security_set_mempolicy(lmode, mode_flags, &nodes, flags);
+ if (err)
+ return err;
+
return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
}
@@ -1589,6 +1593,10 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
if (err)
return err;
+ err = security_set_mempolicy(lmode, mode_flags, &nodes, 0);
+ if (err)
+ return err;
+
return do_set_mempolicy(lmode, mode_flags, &nodes);
}
diff --git a/security/security.c b/security/security.c
index dcb3e7014f9b..685ad7993753 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5337,3 +5337,16 @@ int security_uring_cmd(struct io_uring_cmd *ioucmd)
return call_int_hook(uring_cmd, 0, ioucmd);
}
#endif /* CONFIG_IO_URING */
+
+/**
+ * security_set_mempolicy() - Check if memory policy can be adjusted
+ * @mode: The memory policy mode to be set
+ * @mode_flags: optional mode flags
+ * @nmask: nodemask to which the mode applies
+ * @flags: mbind(2) flags (MPOL_MF_*); 0 for set_mempolicy(2)
+ */
+int security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return call_int_hook(set_mempolicy, 0, mode, mode_flags, nmask, flags);
+}
--
2.30.1 (Apple Git-130)
* [PATCH v3 5/7] security: selinux: Implement set_mempolicy hook
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (3 preceding siblings ...)
2023-12-01 9:46 ` [PATCH v3 4/7] mm, security: Add lsm hook for memory policy adjustment Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 9:46 ` [PATCH v3 6/7] selftests/bpf: Add selftests for set_mempolicy with a lsm prog Yafang Shao
2023-12-01 9:46 ` [PATCH v3 7/7] NOT kernel/man2/mbind.2: Add mode flag MPOL_F_NUMA_BALANCING Yafang Shao
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
Add a SELinux access control for the newly introduced set_mempolicy lsm
hook. A new permission "setmempolicy" is defined under the "process" class
for it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
security/selinux/hooks.c | 8 ++++++++
security/selinux/include/classmap.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index feda711c6b7b..1528d4dcfa03 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4238,6 +4238,13 @@ static int selinux_userns_create(const struct cred *cred)
USER_NAMESPACE__CREATE, NULL);
}
+static int selinux_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return avc_has_perm(current_sid(), task_sid_obj(current), SECCLASS_PROCESS,
+ PROCESS__SETMEMPOLICY, NULL);
+}
+
/* Returns error only if unable to parse addresses */
static int selinux_parse_skb_ipv4(struct sk_buff *skb,
struct common_audit_data *ad, u8 *proto)
@@ -7072,6 +7079,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
LSM_HOOK_INIT(task_kill, selinux_task_kill),
LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
LSM_HOOK_INIT(userns_create, selinux_userns_create),
+ LSM_HOOK_INIT(set_mempolicy, selinux_set_mempolicy),
LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
LSM_HOOK_INIT(ipc_getsecid, selinux_ipc_getsecid),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index a3c380775d41..c280d92a409f 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -51,7 +51,7 @@ const struct security_class_mapping secclass_map[] = {
"getattr", "setexec", "setfscreate", "noatsecure", "siginh",
"setrlimit", "rlimitinh", "dyntransition", "setcurrent",
"execmem", "execstack", "execheap", "setkeycreate",
- "setsockcreate", "getrlimit", NULL } },
+ "setsockcreate", "getrlimit", "setmempolicy", NULL } },
{ "process2",
{ "nnp_transition", "nosuid_transition", NULL } },
{ "system",
--
2.30.1 (Apple Git-130)
* [PATCH v3 6/7] selftests/bpf: Add selftests for set_mempolicy with a lsm prog
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (4 preceding siblings ...)
2023-12-01 9:46 ` [PATCH v3 5/7] security: selinux: Implement set_mempolicy hook Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
2023-12-01 9:46 ` [PATCH v3 7/7] NOT kernel/man2/mbind.2: Add mode flag MPOL_F_NUMA_BALANCING Yafang Shao
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
The result is as follows:
#261/1 set_mempolicy/MPOL_BIND_with_lsm:OK
#261/2 set_mempolicy/MPOL_DEFAULT_with_lsm:OK
#261/3 set_mempolicy/MPOL_BIND_without_lsm:OK
#261/4 set_mempolicy/MPOL_DEFAULT_without_lsm:OK
#261 set_mempolicy:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../selftests/bpf/prog_tests/set_mempolicy.c | 81 +++++++++++++++++++
.../selftests/bpf/progs/test_set_mempolicy.c | 28 +++++++
2 files changed, 109 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
diff --git a/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c b/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
new file mode 100644
index 000000000000..6d115ecedb10
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <linux/mempolicy.h>
+#include <test_progs.h>
+#include "test_set_mempolicy.skel.h"
+
+#define SIZE 4096
+
+static void mempolicy_bind(bool success)
+{
+ unsigned long mask = 1;
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ /* -lnuma is required by mbind(2), so use __NR_mbind to avoid the dependency. */
+ err = syscall(__NR_mbind, addr, SIZE, MPOL_BIND, &mask, sizeof(mask), 0);
+ if (success)
+ ASSERT_OK(err, "mbind_success");
+ else
+ ASSERT_ERR(err, "mbind_fail");
+
+ munmap(addr, SIZE);
+}
+
+static void mempolicy_default(void)
+{
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ err = syscall(__NR_mbind, addr, SIZE, MPOL_DEFAULT, NULL, 0, 0);
+ ASSERT_OK(err, "mbind_success");
+
+ munmap(addr, SIZE);
+}
+
+void test_set_mempolicy(void)
+{
+ struct test_set_mempolicy *skel;
+ int err;
+
+ skel = test_set_mempolicy__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ return;
+
+ skel->bss->target_pid = getpid();
+
+ err = test_set_mempolicy__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ /* Attach LSM prog first */
+ err = test_set_mempolicy__attach(skel);
+ if (!ASSERT_OK(err, "attach"))
+ goto destroy;
+
+ /* syscall to adjust memory policy */
+ if (test__start_subtest("MPOL_BIND_with_lsm"))
+ mempolicy_bind(false);
+ if (test__start_subtest("MPOL_DEFAULT_with_lsm"))
+ mempolicy_default();
+
+destroy:
+ test_set_mempolicy__destroy(skel);
+
+ if (test__start_subtest("MPOL_BIND_without_lsm"))
+ mempolicy_bind(true);
+ if (test__start_subtest("MPOL_DEFAULT_without_lsm"))
+ mempolicy_default();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_set_mempolicy.c b/tools/testing/selftests/bpf/progs/test_set_mempolicy.c
new file mode 100644
index 000000000000..b5356d5fcb8b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_set_mempolicy.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+int target_pid;
+
+static int mem_policy_adjustment(u64 mode)
+{
+ struct task_struct *task = bpf_get_current_task_btf();
+
+ if (task->pid != target_pid)
+ return 0;
+
+ if (mode != MPOL_BIND)
+ return 0;
+ return -1;
+}
+
+SEC("lsm/set_mempolicy")
+int BPF_PROG(setmempolicy, u64 mode, u16 mode_flags, nodemask_t *nmask, u32 flags)
+{
+ return mem_policy_adjustment(mode);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.30.1 (Apple Git-130)
* [PATCH v3 7/7] NOT kernel/man2/mbind.2: Add mode flag MPOL_F_NUMA_BALANCING
2023-12-01 9:46 [PATCH v3 0/7] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (5 preceding siblings ...)
2023-12-01 9:46 ` [PATCH v3 6/7] selftests/bpf: Add selftests for set_mempolicy with a lsm prog Yafang Shao
@ 2023-12-01 9:46 ` Yafang Shao
6 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-01 9:46 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao,
Alejandro Colomar, Michael Kerrisk
In Linux Kernel 5.12, a new mode flag, MPOL_F_NUMA_BALANCING, was
added to set_mempolicy() to optimize the page placement among the
NUMA nodes with the NUMA balancing mechanism even if the memory of
the applications is bound with MPOL_BIND.
In Linux Kernel 5.15, this mode flag was extended to mbind(2). Let's
also document it in the mbind(2) man-page. The text is copied from the
set_mempolicy(2) man-page with subtle modifications.
Related kernel commits:
bda420b985054a3badafef23807c4b4fa38a3dff
6d2aec9e123bb9c49cb5c7fc654f25f81e688e8c
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
man2/mbind.2 | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/man2/mbind.2 b/man2/mbind.2
index ba1b81ae9..dac784389 100644
--- a/man2/mbind.2
+++ b/man2/mbind.2
@@ -142,6 +142,23 @@ The supported
.I "mode flags"
are:
.TP
+.BR MPOL_F_NUMA_BALANCING " (since Linux 5.15)"
+.\" commit bda420b985054a3badafef23807c4b4fa38a3dff
+.\" commit 6d2aec9e123bb9c49cb5c7fc654f25f81e688e8c
+When
+.I mode
+is
+.BR MPOL_BIND ,
+enable the kernel NUMA balancing for the task if it is supported by the kernel.
+If the flag isn't supported by the kernel, or is used with
+.I mode
+other than
+.BR MPOL_BIND ,
+\-1 is returned and
+.I errno
+is set to
+.BR EINVAL .
+.TP
.BR MPOL_F_STATIC_NODES " (since Linux-2.6.26)"
A nonempty
.I nodemask
--
2.39.3
* Re: [PATCH v3 3/7] mm, security: Fix missed security_task_movememory()
2023-12-01 9:46 ` [PATCH v3 3/7] mm, security: Fix missed security_task_movememory() Yafang Shao
@ 2023-12-01 20:50 ` Serge E. Hallyn
2023-12-03 2:57 ` Yafang Shao
0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2023-12-01 20:50 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, paul, jmorris, serge, omosnace, mhocko, ying.huang,
linux-mm, linux-security-module, bpf, ligang.bdlg
On Fri, Dec 01, 2023 at 09:46:32AM +0000, Yafang Shao wrote:
> Considering that MPOL_F_NUMA_BALANCING or mbind(2) using either
> MPOL_MF_MOVE or MPOL_MF_MOVE_ALL are capable of memory movement, it's
> essential to include security_task_movememory() to cover this
> functionality as well. It was identified during a code review.
Hm - this doesn't have any bad side effects for you when using selinux?
The selinux_task_movememory() hook checks for PROCESS__SETSCHED privs.
The two existing security_task_movememory() calls are in cases where we
expect the caller to be affecting another task identified by pid, so
that makes sense. Is an MPOL_MF_MOVE to move your own pages actually
analogous to that?
Much like the concern you mentioned in your intro about requiring
CAP_SYS_NICE and thereby expanding its use, it seems that here you
will be regressing some mbind users unless the granting of PROCESS__SETSCHED
is widened.
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> mm/mempolicy.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..1eafe81d782e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1259,8 +1259,15 @@ static long do_mbind(unsigned long start, unsigned long len,
> if (!new)
> flags |= MPOL_MF_DISCONTIG_OK;
>
> - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
MPOL_MF_MOVE_ALL already has a CAP_SYS_NICE check. Does that
suffice for that one?
> + err = security_task_movememory(current);
> + if (err) {
> + mpol_put(new);
> + return err;
> + }
> lru_cache_disable();
> + }
> +
> {
> NODEMASK_SCRATCH(scratch);
> if (scratch) {
> @@ -1450,6 +1457,8 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
> /* Basic parameter sanity check used by both mbind() and set_mempolicy() */
> static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> {
> + int err;
> +
> *flags = *mode & MPOL_MODE_FLAGS;
> *mode &= ~MPOL_MODE_FLAGS;
>
> @@ -1460,6 +1469,9 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> if (*flags & MPOL_F_NUMA_BALANCING) {
> if (*mode != MPOL_BIND)
> return -EINVAL;
> + err = security_task_movememory(current);
> + if (err)
> + return err;
> *flags |= (MPOL_F_MOF | MPOL_F_MORON);
> }
> return 0;
> --
> 2.30.1 (Apple Git-130)
* Re: [PATCH v3 3/7] mm, security: Fix missed security_task_movememory()
2023-12-01 20:50 ` Serge E. Hallyn
@ 2023-12-03 2:57 ` Yafang Shao
0 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2023-12-03 2:57 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: akpm, paul, jmorris, omosnace, mhocko, ying.huang, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Sat, Dec 2, 2023 at 4:50 AM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Fri, Dec 01, 2023 at 09:46:32AM +0000, Yafang Shao wrote:
> > Considering that MPOL_F_NUMA_BALANCING or mbind(2) using either
> > MPOL_MF_MOVE or MPOL_MF_MOVE_ALL are capable of memory movement, it's
> > essential to include security_task_movememory() to cover this
> > functionality as well. It was identified during a code review.
>
> Hm - this doesn't have any bad side effects for you when using selinux?
> The selinux_task_movememory() hook checks for PROCESS__SETSCHED privs.
> The two existing security_task_movememory() calls are in cases where we
> expect the caller to be affecting another task identified by pid, so
> that makes sense. Is an MPOL_MF_MOVE to move your own pages actually
> analogous to that?
>
> Much like the concern you mentioned in your intro about requiring
> CAP_SYS_NICE and thereby expanding its use, it seems that here you
> will be regressing some mbind users unless the granting of PROCESS__SETSCHED
> is widened.
Ah, it appears that this change might lead to regression. I overlooked
its association with the PROCESS__SETSCHED privilege. I'll exclude
this patch from the upcoming version.
Thanks for your review.
--
Regards
Yafang