From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 24 Dec 2023 11:35:21 +0800
Subject: Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Paul Moore, Kees Cook, luto@amacapital.net, wad@chromium.org
Cc: akpm@linux-foundation.org, jmorris@namei.org, serge@hallyn.com,
    omosnace@redhat.com, casey@schaufler-ca.com, kpsingh@kernel.org,
    mhocko@suse.com, ying.huang@intel.com, linux-mm@kvack.org,
    linux-security-module@vger.kernel.org, bpf@vger.kernel.org,
    ligang.bdlg@bytedance.com
References: <20231214125033.4158-1-laoar.shao@gmail.com>

On Sat, Dec 23, 2023 at 8:16 AM Paul Moore wrote:
>
> On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao wrote:
> >
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM
> > events where the OOM-killer terminates tasks despite ample free
> > memory. This anomaly is traced back to tasks within a container
> > using mbind(2) to bind memory to a specific NUMA node. When the
> > allocated memory on this node is exhausted, the OOM-killer,
> > prioritizing tasks based on oom_score, indiscriminately kills
> > tasks.
> >
> > The Challenge
> > =============
> >
> > In a containerized environment, independent memory binding by one
> > user can lead to unexpected system issues or disrupt tasks run by
> > other users on the same server. If a user genuinely requires
> > memory binding, we will allocate dedicated servers to them by
> > leveraging kubelet deployment.
> >
> > Currently, users can autonomously bind their memory to specific
> > nodes without explicit agreement or authorization from our end.
> > It's imperative that we establish a method to prevent this
> > behavior.
> >
> > Proposed Solution
> > =================
> >
> > - Capability
> >   Currently, any task can perform MPOL_BIND without specific
> >   capabilities. Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could
> >   be an option, but capabilities are broad and might grant
> >   unnecessary privileges, so we should explore alternatives to
> >   prevent unexpected side effects.
> >
> > - LSM
> >   Introduce LSM hooks for syscalls such as mbind(2) and
> >   set_mempolicy(2) to disable MPOL_BIND. This approach is more
> >   flexible and allows for fine-grained control without unintended
> >   consequences. A sample LSM BPF program is included,
> >   demonstrating practical implementation in a production
> >   environment.
> >
> > - seccomp
> >   seccomp is relatively heavyweight, making it less suitable for
> >   enabling in our production environment:
> >   - Both kubelet and containers need adaptation to support it.
> >   - Dynamically altering security policies for individual
> >     containers without interrupting their operation isn't
> >     straightforward.
> >
> > Future Considerations
> > =====================
> >
> > In addition, there's room for enhancement in the OOM-killer for
> > cases involving CONSTRAINT_MEMORY_POLICY. It would be more
> > beneficial to prioritize selecting a victim that has allocated
> > memory on the same NUMA node. My exploration of lore led me to a
> > proposal[0] related to this matter, although consensus seems
> > elusive at this point. Nevertheless, delving into this specific
> > topic is beyond the scope of the current patchset.
> >
> > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> >
> > Changes:
> > - v4 -> v5:
> >   - Revise the commit log in patch #5. (KP)
> > - v3 -> v4: https://lwn.net/Articles/954126/
> >   - Drop the changes around security_task_movememory (Serge)
> > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> >   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> >   - Fix bpf selftests error reported by bot+bpf-ci
> > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> >   - Refine the commit log to avoid misleading wording
> >   - Use one common lsm hook instead and add a comment for it
> >   - Add selinux implementation
> >   - Other improvements in mempolicy
> > - RFC v1: https://lwn.net/Articles/951188/
> >
> > Yafang Shao (5):
> >   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> >   mm: mempolicy: Revise comment regarding mempolicy mode flags
> >   mm, security: Add lsm hook for memory policy adjustment
> >   security: selinux: Implement set_mempolicy hook
> >   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> >
> >  .../admin-guide/mm/numa_memory_policy.rst    | 27 +++++++
> >  include/linux/lsm_hook_defs.h                |  3 +
> >  include/linux/security.h                     |  9 +++
> >  include/uapi/linux/mempolicy.h               |  2 +-
> >  mm/mempolicy.c                               |  8 +++
> >  security/security.c                          | 13 ++++
> >  security/selinux/hooks.c                     |  8 +++
> >  security/selinux/include/classmap.h          |  2 +-
> >  .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> >  .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> >  10 files changed, 182 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
>
> In your original patchset there was a lot of good discussion about
> ways to solve, or mitigate, this problem using existing mechanisms;
> while you disputed many (all?) of those suggestions, I felt that
> they still had merit despite your objections.

JFYI, the discussion around the initial patchset produced three
suggestions:

- Disabling CONFIG_NUMA, proposed by Michal:
  By default, tasks on a server allocate memory from their local
  memory node first, so disabling CONFIG_NUMA could lead to a
  performance hit.

- Adjusting the NUMA workload configuration, also from Michal:
  This adjustment has been implemented successfully on some dedicated
  clusters, as mentioned in the commit log. However, applying it
  universally across a large fleet of servers might waste a
  significant amount of physical memory.

- Implementing seccomp, suggested by Ondrej and Casey:
  As indicated in the commit log, altering the security policy
  dynamically without interrupting a running container isn't
  straightforward. Supporting that with seccomp would require
  introducing an eBPF-based seccomp, which constitutes a substantial
  change. (A sketch of a static seccomp filter follows this list.)

[ The seccomp maintainer has been added to this mail thread for
  further discussion. ]
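For illustration only (this is not code from the patchset), a classic
static seccomp-BPF filter covering the two mempolicy syscalls could
look like the sketch below. Once installed, such a filter cannot be
revised for an already-running container, which is exactly the
limitation that pushes toward an eBPF-based seccomp:

/*
 * Minimal sketch: reject set_mempolicy(2) and mbind(2) with EPERM
 * via a static seccomp-BPF filter. A production filter should also
 * validate seccomp_data->arch before trusting the syscall number.
 */
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

static int deny_mempolicy_syscalls(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Jump to the EPERM return for the two syscalls. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_set_mempolicy, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mbind, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Mandatory unless the installing task has CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

The filter must be set up by the container runtime before the
workload starts, which is the kubelet/container adaptation issue
noted in the cover letter above.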
> I also don't believe the SELinux implementation of the
> set_mempolicy hook fits with the existing SELinux philosophy of
> access control via type enforcement; outside of some checks on
> executable memory and low memory ranges, SELinux doesn't currently
> enforce policy on memory ranges like this. SELinux focuses more on
> tasks being able to access data/resources on the system.
>
> My current opinion is that you should pursue some of the
> mitigations that have already been mentioned, including seccomp
> and/or a better NUMA workload configuration. I would also encourage
> you to pursue the OOM improvement you briefly described. All of
> those seem like better options than this new LSM/SELinux hook.

The OOM improvement should not be our primary approach. Whenever
possible, we should prioritize solutions that prevent the OOM
situation from arising in the first place.
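To make the preferred alternative concrete for reviewers, below is a
minimal sketch of the kind of LSM BPF program this series enables.
The attach point follows the set_mempolicy hook added in patch #3,
but the hook arguments shown here are an illustrative assumption
rather than a copy of the patch (see the selftest in patch #5 for the
real program):

/* Sketch only: deny MPOL_BIND through the proposed LSM hook. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MPOL_BIND 2	/* include/uapi/linux/mempolicy.h */
#define EPERM     1	/* include/uapi/asm-generic/errno-base.h */

char _license[] SEC("license") = "GPL";

/* Assumed argument list: the requested policy mode comes first. */
SEC("lsm/set_mempolicy")
int BPF_PROG(deny_mpol_bind, unsigned long mode)
{
	/* Reject strict NUMA binding for every confined task. */
	if (mode == MPOL_BIND)
		return -EPERM;
	return 0;
}

Unlike a seccomp filter, such a program can be attached and detached
while the containers keep running, which is why we consider it the
better fit for our environment.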
-- 
Regards
Yafang