From: Paul Moore <paul@paul-moore.com>
Date: Sun, 24 Dec 2023 14:44:46 -0500
Subject: Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao
Cc: Kees Cook, luto@amacapital.net, wad@chromium.org, akpm@linux-foundation.org,
    jmorris@namei.org, serge@hallyn.com, omosnace@redhat.com, casey@schaufler-ca.com,
    kpsingh@kernel.org, mhocko@suse.com, ying.huang@intel.com, linux-mm@kvack.org,
    linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
References: <20231214125033.4158-1-laoar.shao@gmail.com>
On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao wrote:
> On Sat, Dec 23, 2023 at 8:16 AM Paul Moore wrote:
> > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao wrote:
> > >
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this
> > > node is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks.
> > >
> > > The Challenge
> > > =============
> > >
> > > In a containerized environment, independent memory binding by a user can
> > > lead to unexpected system issues or disrupt tasks being run by other users
> > > on the same server. If a user genuinely requires memory binding, we will
> > > allocate dedicated servers to them by leveraging kubelet deployment.
> > >
> > > Currently, users can autonomously bind their memory to specific nodes
> > > without explicit agreement or authorization from our end. It's imperative
> > > that we establish a method to prevent this behavior.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > - Capability
> > >   Currently, any task can perform MPOL_BIND without specific capabilities.
> > >   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > >   may have unintended consequences. Capabilities, being broad, might grant
> > >   unnecessary privileges. We should explore alternatives to prevent
> > >   unexpected side effects.
> > >
> > > - LSM
> > >   Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > >   to disable MPOL_BIND. This approach is more flexible and allows for
> > >   fine-grained control without unintended consequences. A sample LSM BPF
> > >   program is included, demonstrating practical implementation in a
> > >   production environment (a sketch follows this list).
> > >
> > > - seccomp
> > >   seccomp is relatively heavyweight, making it less suitable for
> > >   enabling in our production environment:
> > >   - Both kubelet and containers need adaptation to support it.
> > >   - Dynamically altering security policies for individual containers
> > >     without interrupting their operations isn't straightforward.
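[ For illustration only: a minimal sketch of what such an LSM BPF program
  might look like, assuming the new hook is exposed to BPF LSM as
  "lsm/set_mempolicy" and receives the requested policy mode as its first
  argument; the actual hook name and argument layout are whatever the
  patches below define. ]

// Hypothetical LSM BPF sketch: reject MPOL_BIND requests made through
// set_mempolicy(2). The attach point and argument layout are assumptions
// for this illustration, not taken from the patch series.
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MPOL_BIND 2	/* value from include/uapi/linux/mempolicy.h */

char _license[] SEC("license") = "GPL";

SEC("lsm/set_mempolicy")
int BPF_PROG(deny_mpol_bind, unsigned short mode)
{
	/* Deny MPOL_BIND; leave every other memory policy mode alone. */
	if (mode == MPOL_BIND)
		return -EPERM;
	return 0;
}

[ A program along these lines could be attached per workload and swapped
  at runtime without restarting containers, which is the flexibility the
  cover letter argues for. ]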
> > >
> > > Future Considerations
> > > =====================
> > >
> > > In addition, there's room for enhancement in the OOM-killer for cases
> > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > node. My exploration on lore led me to a proposal[0] related to this
> > > matter, although consensus seems elusive at this point. Nevertheless,
> > > delving into this specific topic is beyond the scope of the current
> > > patchset.
> > >
> > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> > >
> > > Changes:
> > > - v4 -> v5:
> > >   - Revise the commit log in patch #5. (KP)
> > > - v3 -> v4: https://lwn.net/Articles/954126/
> > >   - Drop the changes around security_task_movememory (Serge)
> > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > >   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > >   - Fix bpf selftests error reported by bot+bpf-ci
> > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > >   - Refine the commit log to avoid being misleading
> > >   - Use one common lsm hook instead and add a comment for it
> > >   - Add selinux implementation
> > >   - Other improvements in mempolicy
> > > - RFC v1: https://lwn.net/Articles/951188/
> > >
> > > Yafang Shao (5):
> > >   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > >   mm: mempolicy: Revise comment regarding mempolicy mode flags
> > >   mm, security: Add lsm hook for memory policy adjustment
> > >   security: selinux: Implement set_mempolicy hook
> > >   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > >
> > >  .../admin-guide/mm/numa_memory_policy.rst    | 27 +++++++
> > >  include/linux/lsm_hook_defs.h                |  3 +
> > >  include/linux/security.h                     |  9 +++
> > >  include/uapi/linux/mempolicy.h               |  2 +-
> > >  mm/mempolicy.c                               |  8 +++
> > >  security/security.c                          | 13 ++++
> > >  security/selinux/hooks.c                     |  8 +++
> > >  security/selinux/include/classmap.h          |  2 +-
> > >  .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> > >  .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> > >  10 files changed, 182 insertions(+), 2 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> >
> > In your original patchset there was a lot of good discussion about
> > ways to solve, or mitigate, this problem using existing mechanisms;
> > while you disputed many (all?) of those suggestions, I felt that they
> > still had merit over your objections.
>
> JFYI, the discussion of the initial patchset produced three suggestions:
>
> - Disabling CONFIG_NUMA, proposed by Michal:
>   By default, tasks on a server initially allocate memory from their
>   local memory node, so disabling CONFIG_NUMA could lead to a
>   performance hit.
>
> - Adjusting the NUMA workload configuration, also from Michal:
>   This adjustment has been successfully implemented on some dedicated
>   clusters, as mentioned in the commit log. However, applying this
>   change universally across a large fleet of servers might waste a
>   significant amount of physical memory.
>
> - Implementing seccomp, suggested by Ondrej and Casey:
>   As indicated in the commit log, altering the security policy
>   dynamically without interrupting a running container isn't
>   straightforward. Implementing seccomp would require introducing an
>   eBPF-based seccomp, which is a substantial change.
>   [ The seccomp maintainer has been added to this mail thread for
>     further discussion. ]

The seccomp filter runs cBPF (classic BPF), not eBPF; there are a number
of sandboxing tools designed to make seccomp easier to use, including
systemd, and if you need to augment your existing application there are
libraries available to help with that.
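[ A purely illustrative sketch of the library route, using libseccomp:
  fail mbind(2) and set_mempolicy(2) with EPERM before launching the
  workload. Whether a blanket denial of both syscalls is acceptable for a
  given container is an assumption to verify, not a recommendation. ]

/*
 * Sketch: deny mbind(2) and set_mempolicy(2) with EPERM via libseccomp,
 * then exec the target program. Build with -lseccomp.
 */
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	scmp_filter_ctx ctx;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* Allow everything by default; only the memory-policy syscalls fail. */
	ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (!ctx)
		return 1;

	if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mbind), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(set_mempolicy), 0) ||
	    seccomp_load(ctx)) {
		seccomp_release(ctx);
		return 1;
	}
	seccomp_release(ctx);

	/* The loaded filter persists across execve(). */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

[ The same deny-list can be expressed declaratively in a systemd unit with
  something like "SystemCallFilter=~mbind set_mempolicy" plus
  "SystemCallErrorNumber=EPERM", which is closer to the sandboxing-tool
  suggestion above. ]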
> > I also don't believe the SELinux implementation of the set_mempolicy
> > hook fits with the existing SELinux philosophy of access control via
> > type enforcement; outside of some checks on executable memory and low
> > memory ranges, SELinux doesn't currently enforce policy on memory
> > ranges like this. SELinux focuses more on tasks being able to access
> > data/resources on the system.
> >
> > My current opinion is that you should pursue some of the mitigations
> > that have already been mentioned, including seccomp and/or a better
> > NUMA workload configuration. I would also encourage you to pursue the
> > OOM improvement you briefly described. All of those seem like better
> > options than this new LSM/SELinux hook.
>
> Using the OOM solution should not be our primary approach. Whenever
> possible, we should prioritize alternative solutions that avoid
> reaching the OOM situation in the first place.

It's a good thing that there exist other options.

--
paul-moore.com