From: Paul Moore <paul@paul-moore.com>
Date: Sun, 24 Dec 2023 14:44:46 -0500
Subject: Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao
Cc: Kees Cook, luto@amacapital.net, wad@chromium.org, akpm@linux-foundation.org,
    jmorris@namei.org, serge@hallyn.com, omosnace@redhat.com, casey@schaufler-ca.com,
    kpsingh@kernel.org, mhocko@suse.com, ying.huang@intel.com, linux-mm@kvack.org,
    linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
References: <20231214125033.4158-1-laoar.shao@gmail.com>
On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao wrote:
> On Sat, Dec 23, 2023 at 8:16 AM Paul Moore wrote:
> > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao wrote:
> > >
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this
> > > node is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks.
> > >
> > > The Challenge
> > > =============
> > >
> > > In a containerized environment, independent memory binding by a user can
> > > lead to unexpected system issues or disrupt tasks being run by other users
> > > on the same server. If a user genuinely requires memory binding, we will
> > > allocate dedicated servers to them by leveraging kubelet deployment.
> > >
> > > Currently, users can autonomously bind their memory to specific nodes
> > > without explicit agreement or authorization from our end. It's imperative
> > > that we establish a method to prevent this behavior.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > - Capability
> > >   Currently, any task can perform MPOL_BIND without specific capabilities.
> > >   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > >   may have unintended consequences. Capabilities, being broad, might grant
> > >   unnecessary privileges. We should explore alternatives to prevent
> > >   unexpected side effects.
> > >
> > > - LSM
> > >   Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > >   to disable MPOL_BIND. This approach is more flexible and allows for
> > >   fine-grained control without unintended consequences. A sample LSM BPF
> > >   program is included, demonstrating practical implementation in a
> > >   production environment (a sketch follows this list).
> > >
> > > - seccomp
> > >   seccomp is relatively heavyweight, making it less suitable for
> > >   enabling in our production environment:
> > >   - Both kubelet and containers need adaptation to support it.
> > >   - Dynamically altering security policies for individual containers
> > >     without interrupting their operations isn't straightforward.
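[ For illustration only: a minimal sketch of what such an LSM BPF program
  might look like, assuming the new hook is exposed to BPF LSM as
  "lsm/set_mempolicy" and receives the requested policy mode as its first
  argument; the actual hook name and argument layout are whatever the
  patches below define. ]

// Hypothetical LSM BPF sketch: reject MPOL_BIND requests made through
// set_mempolicy(2). The attach point and argument layout are assumptions
// for this illustration, not taken from the patch series.
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MPOL_BIND 2	/* value from include/uapi/linux/mempolicy.h */

char _license[] SEC("license") = "GPL";

SEC("lsm/set_mempolicy")
int BPF_PROG(deny_mpol_bind, unsigned short mode)
{
	/* Deny MPOL_BIND; leave every other memory policy mode alone. */
	if (mode == MPOL_BIND)
		return -EPERM;
	return 0;
}

[ A program along these lines could be attached per workload and swapped
  at runtime without restarting containers, which is the flexibility the
  cover letter argues for. ]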
> > >
> > > Future Considerations
> > > =====================
> > >
> > > In addition, there's room for enhancement in the OOM-killer for cases
> > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > node. My exploration on lore led me to a proposal[0] related to this
> > > matter, although consensus seems elusive at this point. Nevertheless,
> > > delving into this specific topic is beyond the scope of the current
> > > patchset.
> > >
> > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> > >
> > > Changes:
> > > - v4 -> v5:
> > >   - Revise the commit log in patch #5. (KP)
> > > - v3 -> v4: https://lwn.net/Articles/954126/
> > >   - Drop the changes around security_task_movememory (Serge)
> > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > >   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > >   - Fix bpf selftests error reported by bot+bpf-ci
> > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > >   - Refine the commit log to avoid being misleading
> > >   - Use one common lsm hook instead and add a comment for it
> > >   - Add selinux implementation
> > >   - Other improvements in mempolicy
> > > - RFC v1: https://lwn.net/Articles/951188/
> > >
> > > Yafang Shao (5):
> > >   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > >   mm: mempolicy: Revise comment regarding mempolicy mode flags
> > >   mm, security: Add lsm hook for memory policy adjustment
> > >   security: selinux: Implement set_mempolicy hook
> > >   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > >
> > >  .../admin-guide/mm/numa_memory_policy.rst    | 27 +++++++
> > >  include/linux/lsm_hook_defs.h                |  3 +
> > >  include/linux/security.h                     |  9 +++
> > >  include/uapi/linux/mempolicy.h               |  2 +-
> > >  mm/mempolicy.c                               |  8 +++
> > >  security/security.c                          | 13 ++++
> > >  security/selinux/hooks.c                     |  8 +++
> > >  security/selinux/include/classmap.h          |  2 +-
> > >  .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> > >  .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> > >  10 files changed, 182 insertions(+), 2 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> >
> > In your original patchset there was a lot of good discussion about
> > ways to solve, or mitigate, this problem using existing mechanisms;
> > while you disputed many (all?) of those suggestions, I felt that they
> > still had merit over your objections.
>
> JFYI, the discussion of the initial patchset produced three suggestions:
>
> - Disabling CONFIG_NUMA, proposed by Michal:
>   By default, tasks on a server initially allocate memory from their
>   local memory node, so disabling CONFIG_NUMA could lead to a
>   performance hit.
>
> - Adjusting the NUMA workload configuration, also from Michal:
>   This adjustment has been successfully implemented on some dedicated
>   clusters, as mentioned in the commit log. However, applying this
>   change universally across a large fleet of servers might waste a
>   significant amount of physical memory.
>
> - Implementing seccomp, suggested by Ondrej and Casey:
>   As indicated in the commit log, altering the security policy
>   dynamically without interrupting a running container isn't
>   straightforward. Implementing seccomp would require introducing an
>   eBPF-based seccomp, which is a substantial change.
>   [ The seccomp maintainer has been added to this mail thread for
>     further discussion. ]

The seccomp filter runs cBPF (classic BPF), not eBPF; there are a number
of sandboxing tools designed to make seccomp easier to use, including
systemd, and if you need to augment your existing application there are
libraries available to help with that.
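[ A purely illustrative sketch of the library route, using libseccomp:
  fail mbind(2) and set_mempolicy(2) with EPERM before launching the
  workload. Whether a blanket denial of both syscalls is acceptable for a
  given container is an assumption to verify, not a recommendation. ]

/*
 * Sketch: deny mbind(2) and set_mempolicy(2) with EPERM via libseccomp,
 * then exec the target program. Build with -lseccomp.
 */
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	scmp_filter_ctx ctx;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* Allow everything by default; only the memory-policy syscalls fail. */
	ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (!ctx)
		return 1;

	if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mbind), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(set_mempolicy), 0) ||
	    seccomp_load(ctx)) {
		seccomp_release(ctx);
		return 1;
	}
	seccomp_release(ctx);

	/* The loaded filter persists across execve(). */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

[ The same deny-list can be expressed declaratively in a systemd unit with
  something like "SystemCallFilter=~mbind set_mempolicy" plus
  "SystemCallErrorNumber=EPERM", which is closer to the sandboxing-tool
  suggestion above. ]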
> > I also don't believe the SELinux implementation of the set_mempolicy
> > hook fits with the existing SELinux philosophy of access control via
> > type enforcement; outside of some checks on executable memory and low
> > memory ranges, SELinux doesn't currently enforce policy on memory
> > ranges like this. SELinux focuses more on tasks being able to access
> > data/resources on the system.
> >
> > My current opinion is that you should pursue some of the mitigations
> > that have already been mentioned, including seccomp and/or a better
> > NUMA workload configuration. I would also encourage you to pursue the
> > OOM improvement you briefly described. All of those seem like better
> > options than this new LSM/SELinux hook.
>
> Using the OOM solution should not be our primary approach. Whenever
> possible, we should prioritize alternative solutions that avoid
> reaching the OOM situation in the first place.

It's a good thing that there exist other options.

--
paul-moore.com