* [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
@ 2023-12-14 12:50 Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 1/5] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
Background
==========
In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks despite having ample free memory.
This anomaly is traced back to tasks within a container using mbind(2) to
bind memory to a specific NUMA node. When the allocated memory on this node
is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
indiscriminately kills tasks.
The Challenge
=============
In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.
Currently, users possess the ability to autonomously bind their memory to
specific nodes without explicit agreement or authorization from our end.
It's imperative that we establish a method to prevent this behavior.
Proposed Solution
=================
- Capability
Currently, any task can perform MPOL_BIND without specific capabilities.
Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
may have unintended consequences. Capabilities, being broad, might grant
unnecessary privileges. We should explore alternatives to prevent
unexpected side effects.
- LSM
Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
to disable MPOL_BIND. This approach is more flexible and allows for
fine-grained control without unintended consequences. A sample LSM BPF
program is included, demonstrating practical implementation in a
production environment.
- seccomp
seccomp is relatively heavyweight, making it less suitable for
enabling in our production environment:
- Both kubelet and containers need adaptation to support it.
- Dynamically altering security policies for individual containers
without interrupting their operations isn't straightforward.
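For comparison, the seccomp route could look roughly like the sketch
below. It is an illustration only and not part of this series. The names
deny_mpol_bind_filter() and install_deny_mpol_bind() are hypothetical,
the SKETCH_* constants restate the uapi values locally, the filter omits
the usual architecture check, and it reads only the low 32 bits of the
mode argument (little-endian layout assumed). It covers mbind(2) only;
set_mempolicy(2) would need a similar rule on args[0].

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Local copies of the uapi values (assumption: unchanged from mempolicy.h). */
#define SKETCH_MPOL_BIND	2
#define SKETCH_MPOL_MODE_FLAGS	((1 << 15) | (1 << 14) | (1 << 13))

/* Hypothetical helper: copy the filter into @out and return its length. */
size_t deny_mpol_bind_filter(struct sock_filter *out, size_t cap)
{
	const struct sock_filter prog[] = {
		/* [0] Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* [1] Not mbind(2): jump to the ALLOW at [6]. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mbind, 0, 4),
		/* [2] Load the low word of the third argument (mode). */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[2])),
		/* [3] Strip the optional mode flags. */
		BPF_STMT(BPF_ALU | BPF_AND | BPF_K,
			 ~(unsigned int)SKETCH_MPOL_MODE_FLAGS),
		/* [4] MPOL_BIND falls through to [5], anything else goes to [6]. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_MPOL_BIND, 0, 1),
		/* [5] Deny with EPERM. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		/* [6] Allow everything else. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	size_t n = sizeof(prog) / sizeof(prog[0]);

	assert(cap >= n);
	memcpy(out, prog, sizeof(prog));
	return n;
}

/* Hypothetical helper: install the filter on the calling task. */
int install_deny_mpol_bind(void)
{
	struct sock_filter insns[8];
	struct sock_fprog fprog = { .filter = insns };

	fprog.len = (unsigned short)deny_mpol_bind_filter(insns, 8);
	/* Required for unprivileged tasks before installing a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -errno;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog))
		return -errno;
	return 0;
}
```

Such a filter is inherited across fork and exec and cannot be removed
from a running task, which is the root of the second drawback listed
above.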
Future Considerations
=====================
In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
prioritize selecting a victim that has allocated memory on the same NUMA
node. Searching lore led me to a proposal[0] related to this
matter, although consensus seems elusive at this point. Nevertheless,
delving into this specific topic is beyond the scope of the current
patchset.
[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
Changes:
- v4 -> v5:
- Revise the commit log in patch #5. (KP)
- v3 -> v4: https://lwn.net/Articles/954126/
- Drop the changes around security_task_movememory (Serge)
- RFC v2 -> v3: https://lwn.net/Articles/953526/
- Add MPOL_F_NUMA_BALANCING man-page (Ying)
- Fix bpf selftests error reported by bot+bpf-ci
- RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
- Refine the commit log to avoid misleading readers
- Use one common lsm hook instead and add comment for it
- Add selinux implementation
- Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/
Yafang Shao (5):
mm, doc: Add doc for MPOL_F_NUMA_BALANCING
mm: mempolicy: Revise comment regarding mempolicy mode flags
mm, security: Add lsm hook for memory policy adjustment
security: selinux: Implement set_mempolicy hook
selftests/bpf: Add selftests for set_mempolicy with a lsm prog
.../admin-guide/mm/numa_memory_policy.rst | 27 +++++++
include/linux/lsm_hook_defs.h | 3 +
include/linux/security.h | 9 +++
include/uapi/linux/mempolicy.h | 2 +-
mm/mempolicy.c | 8 +++
security/security.c | 13 ++++
security/selinux/hooks.c | 8 +++
security/selinux/include/classmap.h | 2 +-
.../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
.../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
10 files changed, 182 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
--
1.8.3.1
* [PATCH v5 bpf-next 1/5] mm, doc: Add doc for MPOL_F_NUMA_BALANCING
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
@ 2023-12-14 12:50 ` Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 2/5] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
The MPOL_F_NUMA_BALANCING documentation was inadvertently omitted from the
initial commit bda420b98505 ("numa balancing: migrate on fault among
multiple bound nodes").
Let's ensure its inclusion.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
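For readers who want to try the flag, a minimal userspace sketch
follows. It is not part of the patch. The helper names are hypothetical,
the SKETCH_* constants restate the uapi values locally, and the raw
syscall is used to avoid a libnuma dependency. On a kernel that does not
know the flag, the documented result is EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Local copies of the uapi values (assumption: unchanged from mempolicy.h). */
#define SKETCH_MPOL_DEFAULT		0
#define SKETCH_MPOL_BIND		2
#define SKETCH_MPOL_F_NUMA_BALANCING	(1 << 13)

/*
 * Bind the calling task to the nodes set in @nodemask and opt in to
 * NUMA balancing. Returns 0 on success or a negative errno value.
 */
int bind_with_numa_balancing(unsigned long nodemask)
{
	int mode = SKETCH_MPOL_BIND | SKETCH_MPOL_F_NUMA_BALANCING;
	long ret;

	assert(nodemask != 0);	/* MPOL_BIND needs at least one node */
	ret = syscall(__NR_set_mempolicy, mode, &nodemask,
		      8 * sizeof(nodemask));
	return ret ? -errno : 0;
}

/* Return the calling task to the default policy. */
int reset_mempolicy(void)
{
	return syscall(__NR_set_mempolicy, SKETCH_MPOL_DEFAULT, NULL, 0) ?
		-errno : 0;
}
```

Calling bind_with_numa_balancing(1UL) requests node 0 only. The result
depends on the machine: it may fail on kernels built without NUMA
support or inside sandboxes that filter set_mempolicy(2).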
.../admin-guide/mm/numa_memory_policy.rst | 27 ++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa..19071b71 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -332,6 +332,33 @@ MPOL_F_RELATIVE_NODES
MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation).
+MPOL_F_NUMA_BALANCING (since Linux 5.12)
+ When operating in MPOL_BIND mode, enables NUMA balancing for tasks,
+ contingent upon kernel support. This feature optimizes page
+ placement within the confines of the specified memory binding
+ policy. The addition of the MPOL_F_NUMA_BALANCING flag augments the
+ control mechanism for NUMA balancing:
+
+ - The sysctl knob numa_balancing governs global activation or
+ deactivation of NUMA balancing.
+
+ - Even if sysctl numa_balancing is enabled, NUMA balancing remains
+ disabled by default for memory areas or applications utilizing
+ explicit memory policies.
+
+ - The MPOL_F_NUMA_BALANCING flag facilitates NUMA balancing
+ activation for applications employing explicit memory policies
+ (MPOL_BIND).
+
+ This flag enables various optimizations for page placement through
+ NUMA balancing. For instance, when an application's memory is bound
+ to multiple nodes (MPOL_BIND), the hint page fault handler attempts
+ to migrate accessed pages to reduce cross-node access if the
+ accessing node aligns with the policy nodemask.
+
+ If the flag isn't supported by the kernel, or is used with a mode
+ other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
+
Memory Policy Reference Counting
================================
--
1.8.3.1
* [PATCH v5 bpf-next 2/5] mm: mempolicy: Revise comment regarding mempolicy mode flags
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 1/5] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
@ 2023-12-14 12:50 ` Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 3/5] mm, security: Add lsm hook for memory policy adjustment Yafang Shao
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao,
Eric Dumazet
MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, and MPOL_F_NUMA_BALANCING are
mode flags applicable to both set_mempolicy(2) and mbind(2) system calls.
It's worth noting that MPOL_F_NUMA_BALANCING was initially introduced in
commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") exclusively for set_mempolicy(2). However, it was later made a
shared flag for both set_mempolicy(2) and mbind(2) following
commit 6d2aec9e123b ("mm/mempolicy: do not allow illegal
MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()").
This revised version aims to clarify the details regarding the mode flags.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
---
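Since the mode flags travel in the same argument as the mode for both
syscalls, the split can be modelled in userspace as below. This is a
sketch of the rule stated in the documentation patch (the
MPOL_F_NUMA_BALANCING flag is only valid with MPOL_BIND), not a copy of
the kernel code. The struct and function names are hypothetical and the
SKETCH_* constants restate the uapi values locally.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local copies of the uapi values (assumption: unchanged from mempolicy.h). */
#define SKETCH_MPOL_BIND		2
#define SKETCH_MPOL_F_STATIC_NODES	(1 << 15)
#define SKETCH_MPOL_F_RELATIVE_NODES	(1 << 14)
#define SKETCH_MPOL_F_NUMA_BALANCING	(1 << 13)
#define SKETCH_MPOL_MODE_FLAGS \
	(SKETCH_MPOL_F_STATIC_NODES | SKETCH_MPOL_F_RELATIVE_NODES | \
	 SKETCH_MPOL_F_NUMA_BALANCING)

struct split_mode {
	int mode;		/* policy mode with the flag bits removed */
	unsigned short flags;	/* optional mode flags only */
};

/*
 * Split a raw mode argument, as passed to set_mempolicy(2) or mbind(2),
 * into mode and mode flags. Returns 0, or -EINVAL when
 * MPOL_F_NUMA_BALANCING is combined with a mode other than MPOL_BIND.
 */
int split_mempolicy_mode(int raw, struct split_mode *out)
{
	assert(out != NULL);
	out->flags = (unsigned short)(raw & SKETCH_MPOL_MODE_FLAGS);
	out->mode = raw & ~SKETCH_MPOL_MODE_FLAGS;

	if ((out->flags & SKETCH_MPOL_F_NUMA_BALANCING) &&
	    out->mode != SKETCH_MPOL_BIND)
		return -EINVAL;
	return 0;
}
```

With this model, MPOL_BIND | MPOL_F_NUMA_BALANCING is accepted for either
syscall, while MPOL_PREFERRED | MPOL_F_NUMA_BALANCING is rejected.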
include/uapi/linux/mempolicy.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7..afed4a4 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -26,7 +26,7 @@ enum {
MPOL_MAX, /* always last member of enum */
};
-/* Flags for set_mempolicy */
+/* Flags for set_mempolicy() or mbind() */
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
#define MPOL_F_NUMA_BALANCING (1 << 13) /* Optimize with NUMA balancing if possible */
--
1.8.3.1
* [PATCH v5 bpf-next 3/5] mm, security: Add lsm hook for memory policy adjustment
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 1/5] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 2/5] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
@ 2023-12-14 12:50 ` Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 4/5] security: selinux: Implement set_mempolicy hook Yafang Shao
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.
At present, users have the capability to bind their memory to a specific
node without explicit agreement or authorization from us. Consequently, a
new LSM hook is introduced to mitigate this. This implementation allows us
to exercise fine-grained control over memory policy adjustments within our
container environment.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
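As a reading aid, the way an int hook with a default of 0 resolves can
be modelled in plain C: registered callbacks run in order and the first
non-zero return value wins. This is a simplified userspace model under
that assumption, not the kernel implementation. All names below are
hypothetical and the nodemask argument is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MPOL_BIND 2	/* local copy of the uapi value */

/* Stand-in for one LSM's set_mempolicy callback. */
typedef int (*mempolicy_hook_t)(unsigned long mode,
				unsigned short mode_flags,
				unsigned int flags);

/* Run the callbacks in order; stop at the first one that objects. */
int call_mempolicy_hooks(const mempolicy_hook_t *hooks, size_t nr,
			 unsigned long mode, unsigned short mode_flags,
			 unsigned int flags)
{
	int rc = 0;	/* default result when no callback objects */
	size_t i;

	assert(hooks != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		rc = hooks[i](mode, mode_flags, flags);
		if (rc != 0)
			break;
	}
	return rc;
}

/* Example callback that never objects. */
int hook_allow_all(unsigned long mode, unsigned short mode_flags,
		   unsigned int flags)
{
	(void)mode;
	(void)mode_flags;
	(void)flags;
	return 0;
}

/* Example callback that rejects MPOL_BIND only. */
int hook_deny_bind(unsigned long mode, unsigned short mode_flags,
		   unsigned int flags)
{
	(void)mode_flags;
	(void)flags;
	return mode == SKETCH_MPOL_BIND ? -EPERM : 0;
}
```

With both example callbacks registered, an MPOL_BIND request is rejected
with -EPERM and every other mode passes. With none registered, the
default of 0 lets the syscall proceed.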
include/linux/lsm_hook_defs.h | 3 +++
include/linux/security.h | 9 +++++++++
mm/mempolicy.c | 8 ++++++++
security/security.c | 13 +++++++++++++
4 files changed, 33 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index ff217a5..5580127 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -419,3 +419,6 @@
LSM_HOOK(int, 0, uring_sqpoll, void)
LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
#endif /* CONFIG_IO_URING */
+
+LSM_HOOK(int, 0, set_mempolicy, unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
diff --git a/include/linux/security.h b/include/linux/security.h
index 1d1df326..cc4a19a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -484,6 +484,8 @@ int security_setprocattr(const char *lsm, const char *name, void *value,
int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
int security_locked_down(enum lockdown_reason what);
+int security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags);
#else /* CONFIG_SECURITY */
static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1395,6 +1397,13 @@ static inline int security_locked_down(enum lockdown_reason what)
{
return 0;
}
+
+static inline int
+security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590e..9535d9e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1483,6 +1483,10 @@ static long kernel_mbind(unsigned long start, unsigned long len,
if (err)
return err;
+ err = security_set_mempolicy(lmode, mode_flags, &nodes, flags);
+ if (err)
+ return err;
+
return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
}
@@ -1577,6 +1581,10 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
if (err)
return err;
+ err = security_set_mempolicy(lmode, mode_flags, &nodes, 0);
+ if (err)
+ return err;
+
return do_set_mempolicy(lmode, mode_flags, &nodes);
}
diff --git a/security/security.c b/security/security.c
index dcb3e70..685ad79 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5337,3 +5337,16 @@ int security_uring_cmd(struct io_uring_cmd *ioucmd)
return call_int_hook(uring_cmd, 0, ioucmd);
}
#endif /* CONFIG_IO_URING */
+
+/**
+ * security_set_mempolicy() - Check if memory policy can be adjusted
+ * @mode: The memory policy mode to be set
+ * @mode_flags: optional mode flags
+ * @nmask: nodemask to which the mode applies
+ * @flags: mode flags for mbind(2) only
+ */
+int security_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return call_int_hook(set_mempolicy, 0, mode, mode_flags, nmask, flags);
+}
--
1.8.3.1
* [PATCH v5 bpf-next 4/5] security: selinux: Implement set_mempolicy hook
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (2 preceding siblings ...)
2023-12-14 12:50 ` [PATCH v5 bpf-next 3/5] mm, security: Add lsm hook for memory policy adjustment Yafang Shao
@ 2023-12-14 12:50 ` Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add selftests for set_mempolicy with a lsm prog Yafang Shao
2023-12-23 0:16 ` [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Paul Moore
5 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
Add a SELinux access control for the newly introduced set_mempolicy lsm
hook. A new permission "setmempolicy" is defined under the "process" class
for it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
security/selinux/hooks.c | 8 ++++++++
security/selinux/include/classmap.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index feda711..1528d4d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4238,6 +4238,13 @@ static int selinux_userns_create(const struct cred *cred)
USER_NAMESPACE__CREATE, NULL);
}
+static int selinux_set_mempolicy(unsigned long mode, unsigned short mode_flags,
+ nodemask_t *nmask, unsigned int flags)
+{
+ return avc_has_perm(current_sid(), task_sid_obj(current), SECCLASS_PROCESS,
+ PROCESS__SETMEMPOLICY, NULL);
+}
+
/* Returns error only if unable to parse addresses */
static int selinux_parse_skb_ipv4(struct sk_buff *skb,
struct common_audit_data *ad, u8 *proto)
@@ -7072,6 +7079,7 @@ static int selinux_uring_cmd(struct io_uring_cmd *ioucmd)
LSM_HOOK_INIT(task_kill, selinux_task_kill),
LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
LSM_HOOK_INIT(userns_create, selinux_userns_create),
+ LSM_HOOK_INIT(set_mempolicy, selinux_set_mempolicy),
LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
LSM_HOOK_INIT(ipc_getsecid, selinux_ipc_getsecid),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index a3c3807..c280d92 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -51,7 +51,7 @@
"getattr", "setexec", "setfscreate", "noatsecure", "siginh",
"setrlimit", "rlimitinh", "dyntransition", "setcurrent",
"execmem", "execstack", "execheap", "setkeycreate",
- "setsockcreate", "getrlimit", NULL } },
+ "setsockcreate", "getrlimit", "setmempolicy", NULL } },
{ "process2",
{ "nnp_transition", "nosuid_transition", NULL } },
{ "system",
--
1.8.3.1
* [PATCH v5 bpf-next 5/5] selftests/bpf: Add selftests for set_mempolicy with a lsm prog
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (3 preceding siblings ...)
2023-12-14 12:50 ` [PATCH v5 bpf-next 4/5] security: selinux: Implement set_mempolicy hook Yafang Shao
@ 2023-12-14 12:50 ` Yafang Shao
2023-12-23 0:16 ` [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Paul Moore
5 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2023-12-14 12:50 UTC (permalink / raw)
To: akpm, paul, jmorris, serge, omosnace, casey, kpsingh, mhocko, ying.huang
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao
The LSM prog in this selftest denies the use of mbind(2) with the mode
MPOL_BIND and permits other modes.
Consequently:
- Absent the LSM prog, mbind(2) should invariably succeed regardless of
the mode
#263/1 set_mempolicy/MPOL_BIND_without_lsm:OK
#263/2 set_mempolicy/MPOL_DEFAULT_without_lsm:OK
- With the LSM prog
- mbind(2) with the mode MPOL_BIND should result in failure
#263/3 set_mempolicy/MPOL_BIND_with_lsm:OK
- mbind(2) with the mode MPOL_DEFAULT should succeed
#263/4 set_mempolicy/MPOL_DEFAULT_with_lsm:OK
- Summary
#263 set_mempolicy:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
.../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
2 files changed, 112 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
diff --git a/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c b/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
new file mode 100644
index 0000000..4d3fe1d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <linux/mempolicy.h>
+#include <test_progs.h>
+#include "test_set_mempolicy.skel.h"
+
+#define SIZE 4096
+
+static void mempolicy_bind(bool success)
+{
+ unsigned long mask = 1;
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ /* -lnuma is required by mbind(2), so use __NR_mbind to avoid the dependency. */
+ err = syscall(__NR_mbind, addr, SIZE, MPOL_BIND, &mask, sizeof(mask), 0);
+ if (success)
+ ASSERT_OK(err, "mbind_success");
+ else
+ ASSERT_ERR(err, "mbind_fail");
+
+ munmap(addr, SIZE);
+}
+
+static void mempolicy_default(void)
+{
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ err = syscall(__NR_mbind, addr, SIZE, MPOL_DEFAULT, NULL, 0, 0);
+ ASSERT_OK(err, "mbind_success");
+
+ munmap(addr, SIZE);
+}
+
+void test_set_mempolicy(void)
+{
+ struct test_set_mempolicy *skel;
+ int err;
+
+ skel = test_set_mempolicy__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ return;
+
+ skel->bss->target_pid = getpid();
+
+ err = test_set_mempolicy__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ /* Without LSM, mbind(2) should succeed regardless of the mode. */
+ if (test__start_subtest("MPOL_BIND_without_lsm"))
+ mempolicy_bind(true);
+ if (test__start_subtest("MPOL_DEFAULT_without_lsm"))
+ mempolicy_default();
+
+ /* Attach the LSM prog, which denies MPOL_BIND. */
+ err = test_set_mempolicy__attach(skel);
+ if (!ASSERT_OK(err, "attach"))
+ goto destroy;
+
+ /* MPOL_BIND should fail. */
+ if (test__start_subtest("MPOL_BIND_with_lsm"))
+ mempolicy_bind(false);
+
+ /* MPOL_DEFAULT should succeed. */
+ if (test__start_subtest("MPOL_DEFAULT_with_lsm"))
+ mempolicy_default();
+
+destroy:
+ test_set_mempolicy__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_set_mempolicy.c b/tools/testing/selftests/bpf/progs/test_set_mempolicy.c
new file mode 100644
index 0000000..b5356d5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_set_mempolicy.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+int target_pid;
+
+static int mem_policy_adjustment(u64 mode)
+{
+ struct task_struct *task = bpf_get_current_task_btf();
+
+ if (task->pid != target_pid)
+ return 0;
+
+ if (mode != MPOL_BIND)
+ return 0;
+ return -1;
+}
+
+SEC("lsm/set_mempolicy")
+int BPF_PROG(setmempolicy, u64 mode, u16 mode_flags, nodemask_t *nmask, u32 flags)
+{
+ return mem_policy_adjustment(mode);
+}
+
+char _license[] SEC("license") = "GPL";
--
1.8.3.1
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (4 preceding siblings ...)
2023-12-14 12:50 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add selftests for set_mempolicy with a lsm prog Yafang Shao
@ 2023-12-23 0:16 ` Paul Moore
2023-12-24 3:35 ` Yafang Shao
5 siblings, 1 reply; 14+ messages in thread
From: Paul Moore @ 2023-12-23 0:16 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, jmorris, serge, omosnace, casey, kpsingh, mhocko,
ying.huang, linux-mm, linux-security-module, bpf, ligang.bdlg
In your original patchset there was a lot of good discussion about
ways to solve, or mitigate, this problem using existing mechanisms;
while you disputed many (all?) of those suggestions, I felt that they
still had merit over your objections. I also don't believe the
SELinux implementation of the set_mempolicy hook fits with the
existing SELinux philosophy of access control via type enforcement;
outside of some checks on executable memory and low memory ranges,
SELinux doesn't currently enforce policy on memory ranges like this;
SELinux focuses more on tasks being able to access data/resources on
the system.
My current opinion is that you should pursue some of the mitigations
that have already been mentioned, including seccomp and/or a better
NUMA workload configuration. I would also encourage you to pursue the
OOM improvement you briefly described. All of those seem like better
options than this new LSM/SELinux hook.
--
paul-moore.com
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-12-23 0:16 ` [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Paul Moore
@ 2023-12-24 3:35 ` Yafang Shao
2023-12-24 19:44 ` Paul Moore
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2023-12-24 3:35 UTC (permalink / raw)
To: Paul Moore, Kees Cook, luto, wad
Cc: akpm, jmorris, serge, omosnace, casey, kpsingh, mhocko,
ying.huang, linux-mm, linux-security-module, bpf, ligang.bdlg
On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> In your original patchset there was a lot of good discussion about
> ways to solve, or mitigate, this problem using existing mechanisms;
> while you disputed many (all?) of those suggestions, I felt that they
> still had merit over your objections.
JFYI. The initial patchset presents three suggestions:
- Disabling CONFIG_NUMA, proposed by Michal:
By default, tasks on a server allocate memory from their local
memory node initially. Disabling CONFIG_NUMA could potentially lead to
a performance hit.
- Adjusting NUMA workload configuration, also from Michal:
This adjustment has been successfully implemented on some dedicated
clusters, as mentioned in the commit log. However, applying this
change universally across a large fleet of servers might result in
significant wastage of physical memory.
- Implementing seccomp, suggested by Ondrej and Casey:
As indicated in the commit log, altering the security policy
dynamically without interrupting a running container isn't
straightforward. Implementing seccomp requires the introduction of an
eBPF-based seccomp, which constitutes a substantial change.
[ The seccomp maintainer has been added to this mail thread for
further discussion. ]
> I also don't believe the
> SELinux implementation of the set_mempolicy hook fits with the
> existing SELinux philosophy of access control via type enforcement;
> outside of some checks on executable memory and low memory ranges,
> SELinux doesn't currently enforce policy on memory ranges like this,
> SELinux focuses more on tasks being able to access data/resources on
> the system.
>
> My current opinion is that you should pursue some of the mitigations
> that have already been mentioned, including seccomp and/or a better
> NUMA workload configuration. I would also encourage you to pursue the
> OOM improvement you briefly described. All of those seem like better
> options than this new LSM/SELinux hook.
Using the OOM solution should not be our primary approach. Whenever
possible, we should prioritize alternative solutions to prevent
encountering the OOM situation.
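For reference, the OOM enhancement mentioned in the cover letter
(preferring a victim that has memory on the constrained NUMA node) can be
pictured with a toy model. This is only a sketch under stated assumptions:
the task records, the rss_by_node field and pick_victim() are hypothetical
names for illustration and do not correspond to kernel code or to the
proposal in [0].

```python
def pick_victim(tasks, node):
    """Toy victim selection for an OOM constrained to one NUMA node.

    tasks: list of dicts with hypothetical keys
           'name', 'oom_score', 'rss_by_node' (node id -> pages).
    Prefers tasks that hold memory on `node`; falls back to all tasks.
    """
    on_node = [t for t in tasks if t["rss_by_node"].get(node, 0) > 0]
    pool = on_node or tasks
    # Rank by memory held on the exhausted node first, then by oom_score.
    return max(pool, key=lambda t: (t["rss_by_node"].get(node, 0),
                                    t["oom_score"]))


tasks = [
    {"name": "big-but-elsewhere", "oom_score": 900, "rss_by_node": {1: 5000}},
    {"name": "bound-to-node0", "oom_score": 100, "rss_by_node": {0: 3000}},
]
victim = pick_victim(tasks, node=0)
print(victim["name"])
```

Killing the task with the highest global oom_score would free nothing on
node 0 in this model, which is the behaviour the cover letter describes as
indiscriminate.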
--
Regards
Yafang
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-12-24 3:35 ` Yafang Shao
@ 2023-12-24 19:44 ` Paul Moore
2023-12-25 3:12 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Paul Moore @ 2023-12-24 19:44 UTC (permalink / raw)
To: Yafang Shao
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks.
> > >
> > > The Challenge
> > > =============
> > >
> > > In a containerized environment, independent memory binding by a user can
> > > lead to unexpected system issues or disrupt tasks being run by other users
> > > on the same server. If a user genuinely requires memory binding, we will
> > > allocate dedicated servers to them by leveraging kubelet deployment.
> > >
> > > Currently, users possess the ability to autonomously bind their memory to
> > > specific nodes without explicit agreement or authorization from our end.
> > > It's imperative that we establish a method to prevent this behavior.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > - Capability
> > > Currently, any task can perform MPOL_BIND without specific capabilities.
> > > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > > may have unintended consequences. Capabilities, being broad, might grant
> > > unnecessary privileges. We should explore alternatives to prevent
> > > unexpected side effects.
> > >
> > > - LSM
> > > Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > > to disable MPOL_BIND. This approach is more flexible and allows for
> > > fine-grained control without unintended consequences. A sample LSM BPF
> > > program is included, demonstrating practical implementation in a
> > > production environment.
> > >
> > > - seccomp
> > > seccomp is relatively heavyweight, making it less suitable for
> > > enabling in our production environment:
> > > - Both kubelet and containers need adaptation to support it.
> > > - Dynamically altering security policies for individual containers
> > > without interrupting their operations isn't straightforward.
> > >
> > > Future Considerations
> > > =====================
> > >
> > > In addition, there's room for enhancement in the OOM-killer for cases
> > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > node. My exploration on the lore led me to a proposal[0] related to this
> > > matter, although consensus seems elusive at this point. Nevertheless,
> > > delving into this specific topic is beyond the scope of the current
> > > patchset.
> > >
> > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> > >
> > > Changes:
> > > - v4 -> v5:
> > > - Revise the commit log in patch #5. (KP)
> > > - v3 -> v4: https://lwn.net/Articles/954126/
> > > - Drop the changes around security_task_movememory (Serge)
> > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > > - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > > - Fix bpf selftests error reported by bot+bpf-ci
> > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > > - Refine the commit log to avoid misleading
> > > - Use one common lsm hook instead and add comment for it
> > > - Add selinux implementation
> > > - Other improvements in mempolicy
> > > - RFC v1: https://lwn.net/Articles/951188/
> > >
> > > Yafang Shao (5):
> > > mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > > mm: mempolicy: Revise comment regarding mempolicy mode flags
> > > mm, security: Add lsm hook for memory policy adjustment
> > > security: selinux: Implement set_mempolicy hook
> > > selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > >
> > > .../admin-guide/mm/numa_memory_policy.rst | 27 +++++++
> > > include/linux/lsm_hook_defs.h | 3 +
> > > include/linux/security.h | 9 +++
> > > include/uapi/linux/mempolicy.h | 2 +-
> > > mm/mempolicy.c | 8 +++
> > > security/security.c | 13 ++++
> > > security/selinux/hooks.c | 8 +++
> > > security/selinux/include/classmap.h | 2 +-
> > > .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> > > .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> > > 10 files changed, 182 insertions(+), 2 deletions(-)
> > > create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > > create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> >
> > In your original patchset there was a lot of good discussion about
> > ways to solve, or mitigate, this problem using existing mechanisms;
> > while you disputed many (all?) of those suggestions, I felt that they
> > still had merit over your objections.
>
> JFYI. The initial patchset presents three suggestions:
> - Disabling CONFIG_NUMA, proposed by Michal:
> By default, tasks on a server allocate memory from their local
> memory node initially. Disabling CONFIG_NUMA could potentially lead to
> a performance hit.
>
> - Adjusting NUMA workload configuration, also from Michal:
> This adjustment has been successfully implemented on some dedicated
> clusters, as mentioned in the commit log. However, applying this
> change universally across a large fleet of servers might result in
> significant wastage of physical memory.
>
> - Implementing seccomp, suggested by Ondrej and Casey:
> As indicated in the commit log, altering the security policy
> dynamically without interrupting a running container isn't
> straightforward. Implementing seccomp requires the introduction of an
> eBPF-based seccomp, which constitutes a substantial change.
> [ The seccomp maintainer has been added to this mail thread for
> further discussion. ]
The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
number of sandboxing tools designed to make this easier to use,
including systemd, and if you need to augment your existing
application there are libraries available to make this easier.
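To make the cBPF point concrete, here is a minimal sketch of what such a
filter looks like for this use case: it returns EPERM for mbind(2) and
set_mempolicy(2) and allows everything else. Assumptions: the syscall
numbers are the x86_64 ones, and run_filter() is a hypothetical toy
interpreter that handles only the three opcodes used here; it exists to
show the control flow and is not how the kernel evaluates filters.

```python
import struct

# Classic BPF opcode pieces (values from linux/bpf_common.h).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_JEQ, BPF_K = 0x00, 0x20, 0x10, 0x00

# seccomp return actions (linux/seccomp.h) and x86_64 specifics.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
NR_MBIND, NR_SET_MEMPOLICY = 237, 238  # assumption: x86_64 numbering
EPERM = 1

LD_ABS_W = BPF_LD | BPF_W | BPF_ABS
JEQ_K = BPF_JMP | BPF_JEQ | BPF_K
RET_K = BPF_RET | BPF_K

# Each instruction is (code, jt, jf, k), mirroring struct sock_filter.
FILTER = [
    (LD_ABS_W, 0, 0, 4),                          # A = seccomp_data.arch
    (JEQ_K, 1, 0, AUDIT_ARCH_X86_64),             # expected arch? skip kill
    (RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
    (LD_ABS_W, 0, 0, 0),                          # A = seccomp_data.nr
    (JEQ_K, 1, 0, NR_MBIND),                      # mbind -> deny
    (JEQ_K, 0, 1, NR_SET_MEMPOLICY),              # set_mempolicy -> deny
    (RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),
    (RET_K, 0, 0, SECCOMP_RET_ALLOW),
]


def encode(prog):
    """Pack the program as an array of 8-byte sock_filter entries."""
    return b"".join(struct.pack("<HBBI", *insn) for insn in prog)


def pack_seccomp_data(nr, arch=AUDIT_ARCH_X86_64, args=(0,) * 6):
    """Layout: int nr; u32 arch; u64 instruction_pointer; u64 args[6]."""
    return struct.pack("<iIQ6Q", nr, arch, 0, *args)


def run_filter(prog, data):
    """Toy evaluator for the three opcodes used in FILTER."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        if code == LD_ABS_W:
            acc = struct.unpack_from("<I", data, k)[0]
            pc += 1
        elif code == JEQ_K:
            pc += 1 + (jt if acc == k else jf)
        elif code == RET_K:
            return k
        else:
            raise ValueError("opcode not handled by this sketch")


print(hex(run_filter(FILTER, pack_seccomp_data(NR_MBIND))))
```

In practice the filter would be generated by a library or a container
runtime rather than written by hand; the sketch only shows that the
decision is a plain comparison on the syscall number.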
> > I also don't believe the
> > SELinux implementation of the set_mempolicy hook fits with the
> > existing SELinux philosophy of access control via type enforcement;
> > outside of some checks on executable memory and low memory ranges,
> > SELinux doesn't currently enforce policy on memory ranges like this,
> > SELinux focuses more on tasks being able to access data/resources on
> > the system.
> >
> > My current opinion is that you should pursue some of the mitigations
> > that have already been mentioned, including seccomp and/or a better
> > NUMA workload configuration. I would also encourage you to pursue the
> > OOM improvement you briefly described. All of those seem like better
> > options than this new LSM/SELinux hook.
>
> Using the OOM solution should not be our primary approach. Whenever
> possible, we should prioritize alternative solutions to prevent
> encountering the OOM situation.
It's a good thing that there exist other options.
--
paul-moore.com
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-12-24 19:44 ` Paul Moore
@ 2023-12-25 3:12 ` Yafang Shao
2024-01-10 6:06 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2023-12-25 3:12 UTC (permalink / raw)
To: Paul Moore
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Mon, Dec 25, 2023 at 3:44 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks.
> > > >
> > > > The Challenge
> > > > =============
> > > >
> > > > In a containerized environment, independent memory binding by a user can
> > > > lead to unexpected system issues or disrupt tasks being run by other users
> > > > on the same server. If a user genuinely requires memory binding, we will
> > > > allocate dedicated servers to them by leveraging kubelet deployment.
> > > >
> > > > Currently, users possess the ability to autonomously bind their memory to
> > > > specific nodes without explicit agreement or authorization from our end.
> > > > It's imperative that we establish a method to prevent this behavior.
> > > >
> > > > Proposed Solution
> > > > =================
> > > >
> > > > - Capability
> > > > Currently, any task can perform MPOL_BIND without specific capabilities.
> > > > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > > > may have unintended consequences. Capabilities, being broad, might grant
> > > > unnecessary privileges. We should explore alternatives to prevent
> > > > unexpected side effects.
> > > >
> > > > - LSM
> > > > Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > > > to disable MPOL_BIND. This approach is more flexible and allows for
> > > > fine-grained control without unintended consequences. A sample LSM BPF
> > > > program is included, demonstrating practical implementation in a
> > > > production environment.
> > > >
> > > > - seccomp
> > > > seccomp is relatively heavyweight, making it less suitable for
> > > > enabling in our production environment:
> > > > - Both kubelet and containers need adaptation to support it.
> > > > - Dynamically altering security policies for individual containers
> > > > without interrupting their operations isn't straightforward.
> > > >
> > > > Future Considerations
> > > > =====================
> > > >
> > > > In addition, there's room for enhancement in the OOM-killer for cases
> > > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > > node. My exploration on the lore led me to a proposal[0] related to this
> > > > matter, although consensus seems elusive at this point. Nevertheless,
> > > > delving into this specific topic is beyond the scope of the current
> > > > patchset.
> > > >
> > > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> > > >
> > > > Changes:
> > > > - v4 -> v5:
> > > > - Revise the commit log in patch #5. (KP)
> > > > - v3 -> v4: https://lwn.net/Articles/954126/
> > > > - Drop the changes around security_task_movememory (Serge)
> > > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > > > - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > > > - Fix bpf selftests error reported by bot+bpf-ci
> > > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > > > - Refine the commit log to avoid misleading
> > > > - Use one common lsm hook instead and add comment for it
> > > > - Add selinux implementation
> > > > - Other improvements in mempolicy
> > > > - RFC v1: https://lwn.net/Articles/951188/
> > > >
> > > > Yafang Shao (5):
> > > > mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > > > mm: mempolicy: Revise comment regarding mempolicy mode flags
> > > > mm, security: Add lsm hook for memory policy adjustment
> > > > security: selinux: Implement set_mempolicy hook
> > > > selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > > >
> > > > .../admin-guide/mm/numa_memory_policy.rst | 27 +++++++
> > > > include/linux/lsm_hook_defs.h | 3 +
> > > > include/linux/security.h | 9 +++
> > > > include/uapi/linux/mempolicy.h | 2 +-
> > > > mm/mempolicy.c | 8 +++
> > > > security/security.c | 13 ++++
> > > > security/selinux/hooks.c | 8 +++
> > > > security/selinux/include/classmap.h | 2 +-
> > > > .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> > > > .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> > > > 10 files changed, 182 insertions(+), 2 deletions(-)
> > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > > > create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> > >
> > > In your original patchset there was a lot of good discussion about
> > > ways to solve, or mitigate, this problem using existing mechanisms;
> > > while you disputed many (all?) of those suggestions, I felt that they
> > > still had merit over your objections.
> >
> > JFYI. The initial patchset presents three suggestions:
> > - Disabling CONFIG_NUMA, proposed by Michal:
> > By default, tasks on a server allocate memory from their local
> > memory node initially. Disabling CONFIG_NUMA could potentially lead to
> > a performance hit.
> >
> > - Adjusting NUMA workload configuration, also from Michal:
> > This adjustment has been successfully implemented on some dedicated
> > clusters, as mentioned in the commit log. However, applying this
> > change universally across a large fleet of servers might result in
> > significant wastage of physical memory.
> >
> > - Implementing seccomp, suggested by Ondrej and Casey:
> > As indicated in the commit log, altering the security policy
> > dynamically without interrupting a running container isn't
> > straightforward. Implementing seccomp requires the introduction of an
> > eBPF-based seccomp, which constitutes a substantial change.
> > [ The seccomp maintainer has been added to this mail thread for
> > further discussion. ]
>
> The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
> number of sandboxing tools designed to make this easier to use,
> including systemd, and if you need to augment your existing
> application there are libraries available to make this easier.
Let's delve into how cBPF-based seccomp operates with runc [0] - our
application:
1. Create a seccomp filter in /path/to/seccomp/profile.json.
2. Initiate a container with this filter rule using
docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
However, modifying or removing the seccomp filter mandates stopping
the running container and repeating the aforementioned steps. This
interruption isn't desirable for us.
The inability to dynamically alter the seccomp filter with cBPF arises
from the kernel lacking a method to unload the seccomp once attached
to a task. In other words, cBPF-based seccomp cannot dynamically
attach and detach from tasks. Please correct me if my understanding is
incorrect.
[0]. https://docs.docker.com/engine/security/seccomp/
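For illustration, the profile in step 1 could look roughly like the one
generated below. This is a sketch, not the profile we run: it assumes the
Docker/OCI JSON layout ("defaultAction", "syscalls", "args" with "index",
"value", "valueTwo", "op"), assumes MPOL_BIND is 2 as in
include/uapi/linux/mempolicy.h, and uses a low-byte mask (MODE_MASK, my
own choice) so that mode flags OR'ed into the mode argument do not bypass
the check. build_profile() is a hypothetical helper name.

```python
import json

MPOL_BIND = 2      # policy mode value from the uapi mempolicy header
MODE_MASK = 0xFF   # assumption: compare only the low byte of the mode


def deny_bind(syscall, mode_arg_index):
    """Rule: fail `syscall` when (mode & MODE_MASK) == MPOL_BIND."""
    return {
        "names": [syscall],
        "action": "SCMP_ACT_ERRNO",
        "args": [{
            "index": mode_arg_index,
            "value": MODE_MASK,       # mask for SCMP_CMP_MASKED_EQ
            "valueTwo": MPOL_BIND,    # value compared after masking
            "op": "SCMP_CMP_MASKED_EQ",
        }],
    }


def build_profile():
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            deny_bind("set_mempolicy", 0),  # set_mempolicy(mode, ...)
            deny_bind("mbind", 2),          # mbind(addr, len, mode, ...)
        ],
    }


profile_json = json.dumps(build_profile(), indent=2)
print(profile_json)
```

The output would be saved as /path/to/seccomp/profile.json. Since the
profile is only read when the container starts, changing it later still
means restarting the container, which is the limitation described above.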
>
> > > I also don't believe the
> > > SELinux implementation of the set_mempolicy hook fits with the
> > > existing SELinux philosophy of access control via type enforcement;
> > > outside of some checks on executable memory and low memory ranges,
> > > SELinux doesn't currently enforce policy on memory ranges like this,
> > > SELinux focuses more on tasks being able to access data/resources on
> > > the system.
> > >
> > > My current opinion is that you should pursue some of the mitigations
> > > that have already been mentioned, including seccomp and/or a better
> > > NUMA workload configuration. I would also encourage you to pursue the
> > > OOM improvement you briefly described. All of those seem like better
> > > options than this new LSM/SELinux hook.
> >
> > Using the OOM solution should not be our primary approach. Whenever
> > possible, we should prioritize alternative solutions to prevent
> > encountering the OOM situation.
>
> It's a good thing that there exist other options.
Absolutely, let's explore alternative options beforehand.
--
Regards
Yafang
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-12-25 3:12 ` Yafang Shao
@ 2024-01-10 6:06 ` Yafang Shao
2024-01-10 14:28 ` Paul Moore
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-01-10 6:06 UTC (permalink / raw)
To: Paul Moore
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Mon, Dec 25, 2023 at 11:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Dec 25, 2023 at 3:44 AM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Background
> > > > > ==========
> > > > >
> > > > > In our containerized environment, we've identified unexpected OOM events
> > > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > > indiscriminately kills tasks.
> > > > >
> > > > > The Challenge
> > > > > =============
> > > > >
> > > > > In a containerized environment, independent memory binding by a user can
> > > > > lead to unexpected system issues or disrupt tasks being run by other users
> > > > > on the same server. If a user genuinely requires memory binding, we will
> > > > > allocate dedicated servers to them by leveraging kubelet deployment.
> > > > >
> > > > > Currently, users possess the ability to autonomously bind their memory to
> > > > > specific nodes without explicit agreement or authorization from our end.
> > > > > It's imperative that we establish a method to prevent this behavior.
> > > > >
> > > > > Proposed Solution
> > > > > =================
> > > > >
> > > > > - Capability
> > > > > Currently, any task can perform MPOL_BIND without specific capabilities.
> > > > > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > > > > may have unintended consequences. Capabilities, being broad, might grant
> > > > > unnecessary privileges. We should explore alternatives to prevent
> > > > > unexpected side effects.
> > > > >
> > > > > - LSM
> > > > > Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > > > > to disable MPOL_BIND. This approach is more flexible and allows for
> > > > > fine-grained control without unintended consequences. A sample LSM BPF
> > > > > program is included, demonstrating practical implementation in a
> > > > > production environment.
> > > > >
> > > > > - seccomp
> > > > > seccomp is relatively heavyweight, making it less suitable for
> > > > > enabling in our production environment:
> > > > > - Both kubelet and containers need adaptation to support it.
> > > > > - Dynamically altering security policies for individual containers
> > > > > without interrupting their operations isn't straightforward.
> > > > >
> > > > > Future Considerations
> > > > > =====================
> > > > >
> > > > > In addition, there's room for enhancement in the OOM-killer for cases
> > > > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > > > node. My exploration on the lore led me to a proposal[0] related to this
> > > > > matter, although consensus seems elusive at this point. Nevertheless,
> > > > > delving into this specific topic is beyond the scope of the current
> > > > > patchset.
> > > > >
> > > > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> > > > >
> > > > > Changes:
> > > > > - v4 -> v5:
> > > > > - Revise the commit log in patch #5. (KP)
> > > > > - v3 -> v4: https://lwn.net/Articles/954126/
> > > > > - Drop the changes around security_task_movememory (Serge)
> > > > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > > > > - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > > > > - Fix bpf selftests error reported by bot+bpf-ci
> > > > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > > > > - Refine the commit log to avoid misleading
> > > > > - Use one common lsm hook instead and add comment for it
> > > > > - Add selinux implementation
> > > > > - Other improvements in mempolicy
> > > > > - RFC v1: https://lwn.net/Articles/951188/
> > > > >
> > > > > Yafang Shao (5):
> > > > > mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > > > > mm: mempolicy: Revise comment regarding mempolicy mode flags
> > > > > mm, security: Add lsm hook for memory policy adjustment
> > > > > security: selinux: Implement set_mempolicy hook
> > > > > selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > > > >
> > > > > .../admin-guide/mm/numa_memory_policy.rst | 27 +++++++
> > > > > include/linux/lsm_hook_defs.h | 3 +
> > > > > include/linux/security.h | 9 +++
> > > > > include/uapi/linux/mempolicy.h | 2 +-
> > > > > mm/mempolicy.c | 8 +++
> > > > > security/security.c | 13 ++++
> > > > > security/selinux/hooks.c | 8 +++
> > > > > security/selinux/include/classmap.h | 2 +-
> > > > > .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> > > > > .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> > > > > 10 files changed, 182 insertions(+), 2 deletions(-)
> > > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > > > > create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> > > >
> > > > In your original patchset there was a lot of good discussion about
> > > > ways to solve, or mitigate, this problem using existing mechanisms;
> > > > while you disputed many (all?) of those suggestions, I felt that they
> > > > still had merit over your objections.
> > >
> > > JFYI. The initial patchset presents three suggestions:
> > > - Disabling CONFIG_NUMA, proposed by Michal:
> > > By default, tasks on a server allocate memory from their local
> > > memory node initially. Disabling CONFIG_NUMA could potentially lead to
> > > a performance hit.
> > >
> > > - Adjusting NUMA workload configuration, also from Michal:
> > > This adjustment has been successfully implemented on some dedicated
> > > clusters, as mentioned in the commit log. However, applying this
> > > change universally across a large fleet of servers might result in
> > > significant wastage of physical memory.
> > >
> > > - Implementing seccomp, suggested by Ondrej and Casey:
> > > As indicated in the commit log, altering the security policy
> > > dynamically without interrupting a running container isn't
> > > straightforward. Implementing seccomp requires the introduction of an
> > > eBPF-based seccomp, which constitutes a substantial change.
> > > [ The seccomp maintainer has been added to this mail thread for
> > > further discussion. ]
> >
> > > The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
> > number of sandboxing tools designed to make this easier to use,
> > including systemd, and if you need to augment your existing
> > application there are libraries available to make this easier.
>
> Let's delve into how cBPF-based seccomp operates with runc [0] - our
> application:
>
> 1. Create a seccomp filter in /path/to/seccomp/profile.json.
> 2. Initiate a container with this filter rule using
> docker run --rm \
> -it \
> --security-opt seccomp=/path/to/seccomp/profile.json \
> hello-world
>
> However, modifying or removing the seccomp filter mandates stopping
> the running container and repeating the aforementioned steps. This
> interruption isn't desirable for us.
>
> The inability to dynamically alter the seccomp filter with cBPF arises
> from the kernel lacking a method to unload the seccomp once attached
> to a task. In other words, cBPF-based seccomp cannot dynamically
> attach and detach from tasks. Please correct me if my understanding is
> incorrect.
>
> [0]. https://docs.docker.com/engine/security/seccomp/
Paul,
Do you have any additional comments or further suggestions?
--
Regards
Yafang
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2024-01-10 6:06 ` Yafang Shao
@ 2024-01-10 14:28 ` Paul Moore
2024-01-10 15:56 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Paul Moore @ 2024-01-10 14:28 UTC (permalink / raw)
To: Yafang Shao
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> Paul,
>
> Do you have any additional comments or further suggestions?
No, I'm still comfortable with my original comments and stand by them.
--
paul-moore.com
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2024-01-10 14:28 ` Paul Moore
@ 2024-01-10 15:56 ` Yafang Shao
2024-01-10 16:14 ` Paul Moore
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-01-10 15:56 UTC (permalink / raw)
To: Paul Moore
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Wed, Jan 10, 2024 at 10:28 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > Paul,
> >
> > Do you have any additional comments or further suggestions?
>
> No, I'm still comfortable with my original comments and stand by them.
I understand your perspective, but it seems I have to propose an
eBPF-based seccomp in the next step.
--
Regards
Yafang
* Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2024-01-10 15:56 ` Yafang Shao
@ 2024-01-10 16:14 ` Paul Moore
0 siblings, 0 replies; 14+ messages in thread
From: Paul Moore @ 2024-01-10 16:14 UTC (permalink / raw)
To: Yafang Shao
Cc: Kees Cook, luto, wad, akpm, jmorris, serge, omosnace, casey,
kpsingh, mhocko, ying.huang, linux-mm, linux-security-module,
bpf, ligang.bdlg
On Wed, Jan 10, 2024 at 10:56 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> On Wed, Jan 10, 2024 at 10:28 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Wed, Jan 10, 2024 at 1:07 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > Paul,
> > >
> > > Do you have any additional comments or further suggestions?
> >
> > No, I'm still comfortable with my original comments and stand by them.
>
> I understand your perspective, but it seems I have to propose an
> eBPF-based seccomp in the next step.
You likely already know this, but just in case, eBPF-based seccomp has
been proposed many times in the past and has been rejected. I don't
want to dissuade you from doing so again, but I suspect that this use
case will not be compelling enough to be successful.
--
paul-moore.com
Thread overview: 14+ messages
2023-12-14 12:50 [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 1/5] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 2/5] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 3/5] mm, security: Add lsm hook for memory policy adjustment Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 4/5] security: selinux: Implement set_mempolicy hook Yafang Shao
2023-12-14 12:50 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add selftests for set_mempolicy with a lsm prog Yafang Shao
2023-12-23 0:16 ` [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Paul Moore
2023-12-24 3:35 ` Yafang Shao
2023-12-24 19:44 ` Paul Moore
2023-12-25 3:12 ` Yafang Shao
2024-01-10 6:06 ` Yafang Shao
2024-01-10 14:28 ` Paul Moore
2024-01-10 15:56 ` Yafang Shao
2024-01-10 16:14 ` Paul Moore