linux-mm.kvack.org archive mirror
* [RFC PATCH v2 0/6] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
@ 2023-11-22 14:15 Yafang Shao
  2023-11-22 14:15 ` [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yafang Shao @ 2023-11-22 14:15 UTC (permalink / raw)
  To: akpm, paul, jmorris, serge, omosnace, mhocko
  Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao

Background
==========

In our containerized environment, we've observed unexpected OOM events
where the OOM-killer terminates tasks even though the system has ample
free memory. We traced this to tasks within a container using mbind(2) to
bind memory to a specific NUMA node. When memory on that node is
exhausted, the OOM-killer picks victims by oom_score alone, regardless of
whether they have any memory on that node.
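
For illustration, here is a minimal C sketch of the kind of call described
above: an unprivileged task binding a range to one node with mbind(2) and
MPOL_BIND. The helper names single_node_mask() and bind_range_to_node() are
made up for this example, and the nodemask is limited to one word for
brevity. The "+ 1" on maxnode reflects my understanding of how the kernel
counts that argument and should be checked against mbind(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>

/* Number of bits in one nodemask word. */
#define BITS_PER_ULONG (8 * sizeof(unsigned long))

static_assert(MPOL_BIND == 2, "uapi value of MPOL_BIND");

/*
 * Hypothetical helper: build a one-word nodemask with only `node` set.
 * Returns 0 if the node does not fit in a single word.
 */
unsigned long single_node_mask(unsigned int node)
{
	if (node >= BITS_PER_ULONG)
		return 0;
	return 1UL << node;
}

/*
 * Hypothetical helper: bind [addr, addr + len) to `node` with MPOL_BIND.
 * The call needs no capability today, which is the behaviour this series
 * wants to make controllable. Returns 0 on success, -1 with errno set.
 */
long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
	unsigned long mask = single_node_mask(node);

	if (!mask) {
		errno = EINVAL;
		return -1;
	}
	/* Assumption: maxnode is one more than the number of bits read. */
	return syscall(SYS_mbind, addr, len, (unsigned long)MPOL_BIND, &mask,
		       (unsigned long)(BITS_PER_ULONG + 1), 0UL);
}
```

Once every task in a container has done this for the same node, allocations
in the bound ranges can only be satisfied from that node, which is how the
node can run out while the rest of the system still has free memory.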

The Challenge
=============

In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.

Currently, users possess the ability to autonomously bind their memory to
specific nodes without explicit agreement or authorization from our end.
It's imperative that we establish a method to prevent this behavior.

Proposed Solutions
==================

- Introduce Capability to Disable MPOL_BIND
  Currently, any task can perform MPOL_BIND without specific capabilities.
  Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
  may have unintended consequences. Capabilities, being broad, might grant
  unnecessary privileges. We should explore alternatives to prevent
  unexpected side effects.

- Use LSM BPF to Disable MPOL_BIND
  Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
  set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
  flexible and allows for fine-grained control without unintended
  consequences. A sample LSM BPF program is included, demonstrating
  practical implementation in a production environment.

- seccomp
  seccomp is relatively heavyweight, making it less suitable for
  enabling in our production environment:
  - Both kubelet and containers need adaptation to support it.
  - Dynamically altering security policies for individual containers
    without interrupting their operations isn't straightforward.
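
To make the LSM option above concrete, here is a plain userspace C model of
the decision such a hook could take. It is only a sketch: the patches that
add the hook are not part of this excerpt, so the real hook name and
arguments are not shown here, and example_mempolicy_check() and
EXAMPLE_MODE_FLAGS are hypothetical names. The model assumes the hook sees
the mode exactly as userspace passed it, with mode flags still OR-ed in.

```c
#include <assert.h>
#include <errno.h>
#include <linux/mempolicy.h>

/* Fallback for older uapi headers; value as quoted in patch 2/6. */
#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13)
#endif

/* All mode flags that may be OR-ed into the mode argument. */
#define EXAMPLE_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

static_assert((MPOL_BIND & EXAMPLE_MODE_FLAGS) == 0,
	      "mode values and mode flags must not overlap");

/*
 * Hypothetical policy: deny MPOL_BIND, allow every other mode.
 * Returns 0 to allow or -EPERM to deny, following the LSM convention.
 */
int example_mempolicy_check(unsigned long raw_mode)
{
	/* Strip the flags so that MPOL_BIND | MPOL_F_* is still caught. */
	unsigned long mode = raw_mode & ~(unsigned long)EXAMPLE_MODE_FLAGS;

	return mode == (unsigned long)MPOL_BIND ? -EPERM : 0;
}
```

Masking the flags first matters: comparing the raw mode against MPOL_BIND
would let a caller slip past the check simply by adding a mode flag.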

Future Considerations
=====================

In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
prioritize selecting a victim that has allocated memory on the same NUMA
node. Searching lore turned up a proposal[0] on this topic, but it does
not appear to have reached consensus. In any case, that work is beyond
the scope of the current patchset.

[0] https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/


Changes:
- RFC v1 -> RFC v2:
  - Refine the commit log to avoid misleading readers
  - Use one common lsm hook instead and add a comment for it
  - Add selinux implementation
  - Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/

Yafang Shao (6):
  mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  mm: mempolicy: Revise comment regarding mempolicy mode flags
  mm, security: Fix missed security_task_movememory() in mbind(2)
  mm, security: Add lsm hook for memory policy adjustment
  security: selinux: Implement set_mempolicy hook
  selftests/bpf: Add selftests for set_mempolicy with a lsm prog

 .../admin-guide/mm/numa_memory_policy.rst     | 27 +++++++
 include/linux/lsm_hook_defs.h                 |  3 +
 include/linux/security.h                      |  9 +++
 include/uapi/linux/mempolicy.h                |  2 +-
 mm/mempolicy.c                                | 17 +++-
 security/security.c                           | 13 +++
 security/selinux/hooks.c                      |  8 ++
 security/selinux/include/classmap.h           |  2 +-
 tools/testing/selftests/bpf/Makefile          |  2 +-
 .../selftests/bpf/prog_tests/set_mempolicy.c  | 79 +++++++++++++++++++
 .../selftests/bpf/progs/test_set_mempolicy.c  | 29 +++++++
 11 files changed, 187 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c

-- 
2.30.1 (Apple Git-130)




* [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  2023-11-22 14:15 [RFC PATCH v2 0/6] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
@ 2023-11-22 14:15 ` Yafang Shao
  2023-11-23  6:37   ` Huang, Ying
  2023-11-22 14:15 ` [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
  2023-11-22 14:15 ` [RFC PATCH v2 3/6] mm, security: Fix missed security_task_movememory() in mbind(2) Yafang Shao
  2 siblings, 1 reply; 7+ messages in thread
From: Yafang Shao @ 2023-11-22 14:15 UTC (permalink / raw)
  To: akpm, paul, jmorris, serge, omosnace, mhocko
  Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao,
	Huang, Ying

The documentation for MPOL_F_NUMA_BALANCING was inadvertently omitted from
the initial commit bda420b98505 ("numa balancing: migrate on fault among
multiple bound nodes").

Let's ensure its inclusion.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 27 +++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..19071b71979c 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -332,6 +332,33 @@ MPOL_F_RELATIVE_NODES
 	MPOL_PREFERRED policies that were created with an empty nodemask
 	(local allocation).
 
+MPOL_F_NUMA_BALANCING (since Linux 5.12)
+        When operating in MPOL_BIND mode, enables NUMA balancing for tasks,
+        contingent upon kernel support. This feature optimizes page
+        placement within the confines of the specified memory binding
+        policy. The addition of the MPOL_F_NUMA_BALANCING flag augments the
+        control mechanism for NUMA balancing:
+
+        - The sysctl knob numa_balancing governs global activation or
+          deactivation of NUMA balancing.
+
+        - Even if sysctl numa_balancing is enabled, NUMA balancing remains
+          disabled by default for memory areas or applications utilizing
+          explicit memory policies.
+
+        - The MPOL_F_NUMA_BALANCING flag facilitates NUMA balancing
+          activation for applications employing explicit memory policies
+          (MPOL_BIND).
+
+        This flag enables various optimizations for page placement through
+        NUMA balancing. For instance, when an application's memory is bound
+        to multiple nodes (MPOL_BIND), the hint page fault handler attempts
+        to migrate accessed pages to reduce cross-node access if the
+        accessing node aligns with the policy nodemask.
+
+        If the flag isn't supported by the kernel, or is used with a mode
+        other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
+
 Memory Policy Reference Counting
 ================================
 
-- 
2.30.1 (Apple Git-130)
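
As a usage illustration for the documentation added above, the following C
sketch shows a task opting in to NUMA balancing under MPOL_BIND, together
with a userspace model of the documented EINVAL rule. The function names
numa_balancing_mode_ok() and bind_task_with_balancing() are hypothetical,
the nodemask is a single word for brevity, and the "+ 1" on maxnode is an
assumption to check against set_mempolicy(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>

/* Fallback for uapi headers older than Linux 5.12. */
#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13)
#endif

static_assert(MPOL_F_NUMA_BALANCING == (1 << 13),
	      "value from include/uapi/linux/mempolicy.h");

/*
 * Userspace model of the documented rule: the flag is only accepted
 * together with MPOL_BIND. Returns 0 if acceptable, -EINVAL otherwise.
 */
int numa_balancing_mode_ok(int mode, int flags)
{
	if ((flags & MPOL_F_NUMA_BALANCING) && mode != MPOL_BIND)
		return -EINVAL;
	return 0;
}

/*
 * Bind the calling task to the nodes in `nodemask` and ask for NUMA
 * balancing within that set. Returns 0 on success, -1 with errno set
 * (EINVAL if the kernel does not know the flag).
 */
long bind_task_with_balancing(unsigned long nodemask)
{
	unsigned long mode = MPOL_BIND | MPOL_F_NUMA_BALANCING;

	return syscall(SYS_set_mempolicy, mode, &nodemask,
		       (unsigned long)(8 * sizeof(nodemask) + 1));
}
```

A caller that gets EINVAL from bind_task_with_balancing() could retry with
plain MPOL_BIND to stay compatible with kernels that predate the flag.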




* [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags
  2023-11-22 14:15 [RFC PATCH v2 0/6] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
  2023-11-22 14:15 ` [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
@ 2023-11-22 14:15 ` Yafang Shao
  2023-11-23  6:30   ` Huang, Ying
  2023-11-22 14:15 ` [RFC PATCH v2 3/6] mm, security: Fix missed security_task_movememory() in mbind(2) Yafang Shao
  2 siblings, 1 reply; 7+ messages in thread
From: Yafang Shao @ 2023-11-22 14:15 UTC (permalink / raw)
  To: akpm, paul, jmorris, serge, omosnace, mhocko
  Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao,
	Eric Dumazet, Huang, Ying

MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, and MPOL_F_NUMA_BALANCING are
mode flags applicable to both set_mempolicy(2) and mbind(2) system calls.
It's worth noting that MPOL_F_NUMA_BALANCING was initially introduced in
commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") exclusively for set_mempolicy(2). However, it was later made a
shared flag for both set_mempolicy(2) and mbind(2) following
commit 6d2aec9e123b ("mm/mempolicy: do not allow illegal
MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()").

This revised version aims to clarify the details regarding the mode flags.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
---
 include/uapi/linux/mempolicy.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..afed4a45f5b9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -26,7 +26,7 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
-/* Flags for set_mempolicy */
+/* Flags for set_mempolicy() or mbind() */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
 #define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */
-- 
2.30.1 (Apple Git-130)




* [RFC PATCH v2 3/6] mm, security: Fix missed security_task_movememory() in mbind(2)
  2023-11-22 14:15 [RFC PATCH v2 0/6] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
  2023-11-22 14:15 ` [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
  2023-11-22 14:15 ` [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
@ 2023-11-22 14:15 ` Yafang Shao
  2 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2023-11-22 14:15 UTC (permalink / raw)
  To: akpm, paul, jmorris, serge, omosnace, mhocko
  Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, Yafang Shao

Since mbind(2) with either MPOL_MF_MOVE or MPOL_MF_MOVE_ALL can move
memory, it must call security_task_movememory() to cover this case as
well. The omission was found during code review.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/mempolicy.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..ded2e0e62e24 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1259,8 +1259,15 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (!new)
 		flags |= MPOL_MF_DISCONTIG_OK;
 
-	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+		err = security_task_movememory(current);
+		if (err) {
+			mpol_put(new);
+			return err;
+		}
 		lru_cache_disable();
+	}
+
 	{
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
-- 
2.30.1 (Apple Git-130)
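
The ordering in the hunk above can be modelled in userspace C as follows.
This is only a sketch of the control flow, not kernel code: mbind_prepare(),
movememory_hook_t and the two example hooks are hypothetical stand-ins, and
the boolean merely records whether the step that corresponds to
lru_cache_disable() would have been reached.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/mempolicy.h>

/* Stand-in for an LSM decision: 0 to allow, negative errno to deny. */
typedef int (*movememory_hook_t)(void);

int example_hook_allow(void) { return 0; }
int example_hook_deny(void) { return -EPERM; }

/*
 * Model of the patched prologue: when the flags request page movement,
 * consult the hook first and return before any side effect happens.
 * *lru_disabled reports whether the LRU-cache step was reached.
 */
int mbind_prepare(unsigned int flags, movememory_hook_t hook,
		  bool *lru_disabled)
{
	assert(hook != NULL && lru_disabled != NULL);
	*lru_disabled = false;

	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
		int err = hook();

		/* The real code also drops the new policy reference here. */
		if (err)
			return err;
		*lru_disabled = true;
	}
	return 0;
}
```

Calls that do not request movement never reach the hook, so a policy that
denies page movement does not affect plain policy changes.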




* Re: [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags
  2023-11-22 14:15 ` [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags Yafang Shao
@ 2023-11-23  6:30   ` Huang, Ying
  2023-11-23 12:21     ` Yafang Shao
  0 siblings, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2023-11-23  6:30 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, paul, jmorris, serge, omosnace, mhocko, linux-mm,
	linux-security-module, bpf, ligang.bdlg, Eric Dumazet

Yafang Shao <laoar.shao@gmail.com> writes:

> MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, and MPOL_F_NUMA_BALANCING are
> mode flags applicable to both set_mempolicy(2) and mbind(2) system calls.
> It's worth noting that MPOL_F_NUMA_BALANCING was initially introduced in
> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
> nodes") exclusively for set_mempolicy(2). However, it was later made a
> shared flag for both set_mempolicy(2) and mbind(2) following
> commit 6d2aec9e123b ("mm/mempolicy: do not allow illegal
> MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()").
>
> This revised version aims to clarify the details regarding the mode flags.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>

Thanks for fixing this.

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Also, please revise the manpage for mbind(), as we have done for
set_mempolicy():

https://lore.kernel.org/all/20210120061235.148637-3-ying.huang@intel.com/

--
Best Regards,
Huang, Ying

> ---
>  include/uapi/linux/mempolicy.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..afed4a45f5b9 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -26,7 +26,7 @@ enum {
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> -/* Flags for set_mempolicy */
> +/* Flags for set_mempolicy() or mbind() */
>  #define MPOL_F_STATIC_NODES	(1 << 15)
>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
>  #define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */



* Re: [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  2023-11-22 14:15 ` [RFC PATCH v2 1/6] mm, doc: Add doc for MPOL_F_NUMA_BALANCING Yafang Shao
@ 2023-11-23  6:37   ` Huang, Ying
  0 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2023-11-23  6:37 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, paul, jmorris, serge, omosnace, mhocko, linux-mm,
	linux-security-module, bpf, ligang.bdlg

Yafang Shao <laoar.shao@gmail.com> writes:

> The documentation for MPOL_F_NUMA_BALANCING was inadvertently omitted from
> the initial commit bda420b98505 ("numa balancing: migrate on fault among
> multiple bound nodes").
>
> Let's ensure its inclusion.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  .../admin-guide/mm/numa_memory_policy.rst     | 27 +++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..19071b71979c 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -332,6 +332,33 @@ MPOL_F_RELATIVE_NODES
>  	MPOL_PREFERRED policies that were created with an empty nodemask
>  	(local allocation).
>  
> +MPOL_F_NUMA_BALANCING (since Linux 5.12)
> +        When operating in MPOL_BIND mode, enables NUMA balancing for tasks,
> +        contingent upon kernel support. This feature optimizes page
> +        placement within the confines of the specified memory binding
> +        policy. The addition of the MPOL_F_NUMA_BALANCING flag augments the
> +        control mechanism for NUMA balancing:
> +
> +        - The sysctl knob numa_balancing governs global activation or
> +          deactivation of NUMA balancing.
> +
> +        - Even if sysctl numa_balancing is enabled, NUMA balancing remains
> +          disabled by default for memory areas or applications utilizing
> +          explicit memory policies.
> +
> +        - The MPOL_F_NUMA_BALANCING flag facilitates NUMA balancing
> +          activation for applications employing explicit memory policies
> +          (MPOL_BIND).
> +
> +        This flag enables various optimizations for page placement through
> +        NUMA balancing. For instance, when an application's memory is bound
> +        to multiple nodes (MPOL_BIND), the hint page fault handler attempts
> +        to migrate accessed pages to reduce cross-node access if the
> +        accessing node aligns with the policy nodemask.
> +
> +        If the flag isn't supported by the kernel, or is used with a mode
> +        other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
> +
>  Memory Policy Reference Counting
>  ================================

--
Best Regards,
Huang, Ying



* Re: [RFC PATCH v2 2/6] mm: mempolicy: Revise comment regarding mempolicy mode flags
  2023-11-23  6:30   ` Huang, Ying
@ 2023-11-23 12:21     ` Yafang Shao
  0 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2023-11-23 12:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, paul, jmorris, serge, omosnace, mhocko, linux-mm,
	linux-security-module, bpf, ligang.bdlg, Eric Dumazet

On Thu, Nov 23, 2023 at 2:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, and MPOL_F_NUMA_BALANCING are
> > mode flags applicable to both set_mempolicy(2) and mbind(2) system calls.
> > It's worth noting that MPOL_F_NUMA_BALANCING was initially introduced in
> > commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
> > nodes") exclusively for set_mempolicy(2). However, it was later made a
> > shared flag for both set_mempolicy(2) and mbind(2) following
> > commit 6d2aec9e123b ("mm/mempolicy: do not allow illegal
> > MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()").
> >
> > This revised version aims to clarify the details regarding the mode flags.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: "Huang, Ying" <ying.huang@intel.com>
>
> Thanks for fixing this.
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>
> And, please revise the manpage for mbind() too.  As we have done for
> set_mempolicy(),
>
> https://lore.kernel.org/all/20210120061235.148637-3-ying.huang@intel.com/

Thanks for your review. Will do.

-- 
Regards
Yafang


