From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 24 Dec 2023 11:35:21 +0800
Subject: Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Paul Moore, Kees Cook, luto@amacapital.net, wad@chromium.org
Cc: akpm@linux-foundation.org, jmorris@namei.org, serge@hallyn.com,
    omosnace@redhat.com, casey@schaufler-ca.com, kpsingh@kernel.org,
    mhocko@suse.com, ying.huang@intel.com, linux-mm@kvack.org,
    linux-security-module@vger.kernel.org, bpf@vger.kernel.org,
    ligang.bdlg@bytedance.com
References: <20231214125033.4158-1-laoar.shao@gmail.com>

On Sat, Dec 23, 2023 at 8:16 AM Paul Moore wrote:
>
> On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao wrote:
> >
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM
> > events where the OOM-killer terminates tasks despite ample free
> > memory. This anomaly is traced back to tasks within a container
> > using mbind(2) to bind memory to a specific NUMA node. When the
> > allocated memory on this node is exhausted, the OOM-killer,
> > prioritizing tasks based on oom_score, indiscriminately kills
> > tasks.
> >
> > The Challenge
> > =============
> >
> > In a containerized environment, independent memory binding by one
> > user can lead to unexpected system issues or disrupt tasks run by
> > other users on the same server. If a user genuinely requires
> > memory binding, we will allocate dedicated servers to them by
> > leveraging kubelet deployment.
> >
> > Currently, users can autonomously bind their memory to specific
> > nodes without explicit agreement or authorization from our end.
> > It's imperative that we establish a method to prevent this
> > behavior.
> >
> > Proposed Solution
> > =================
> >
> > - Capability
> >   Currently, any task can perform MPOL_BIND without specific
> >   capabilities. Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could
> >   be an option, but capabilities are broad and might grant
> >   unnecessary privileges, so we should explore alternatives to
> >   prevent unexpected side effects.
> >
> > - LSM
> >   Introduce LSM hooks for syscalls such as mbind(2) and
> >   set_mempolicy(2) to disable MPOL_BIND. This approach is more
> >   flexible and allows for fine-grained control without unintended
> >   consequences. A sample LSM BPF program is included,
> >   demonstrating practical implementation in a production
> >   environment.
> >
> > - seccomp
> >   seccomp is relatively heavyweight, making it less suitable for
> >   enabling in our production environment:
> >   - Both kubelet and containers need adaptation to support it.
> >   - Dynamically altering security policies for individual
> >     containers without interrupting their operation isn't
> >     straightforward.
> >
> > Future Considerations
> > =====================
> >
> > In addition, there's room for enhancement in the OOM-killer for
> > cases involving CONSTRAINT_MEMORY_POLICY. It would be more
> > beneficial to prioritize selecting a victim that has allocated
> > memory on the same NUMA node. My exploration of lore led me to a
> > proposal[0] related to this matter, although consensus seems
> > elusive at this point. Nevertheless, delving into this specific
> > topic is beyond the scope of the current patchset.
> >
> > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
> >
> > Changes:
> > - v4 -> v5:
> >   - Revise the commit log in patch #5. (KP)
> > - v3 -> v4: https://lwn.net/Articles/954126/
> >   - Drop the changes around security_task_movememory (Serge)
> > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> >   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> >   - Fix bpf selftests error reported by bot+bpf-ci
> > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> >   - Refine the commit log to avoid misleading wording
> >   - Use one common lsm hook instead and add a comment for it
> >   - Add selinux implementation
> >   - Other improvements in mempolicy
> > - RFC v1: https://lwn.net/Articles/951188/
> >
> > Yafang Shao (5):
> >   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> >   mm: mempolicy: Revise comment regarding mempolicy mode flags
> >   mm, security: Add lsm hook for memory policy adjustment
> >   security: selinux: Implement set_mempolicy hook
> >   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> >
> >  .../admin-guide/mm/numa_memory_policy.rst    | 27 +++++++
> >  include/linux/lsm_hook_defs.h                |  3 +
> >  include/linux/security.h                     |  9 +++
> >  include/uapi/linux/mempolicy.h               |  2 +-
> >  mm/mempolicy.c                               |  8 +++
> >  security/security.c                          | 13 ++++
> >  security/selinux/hooks.c                     |  8 +++
> >  security/selinux/include/classmap.h          |  2 +-
> >  .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
> >  .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
> >  10 files changed, 182 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
>
> In your original patchset there was a lot of good discussion about
> ways to solve, or mitigate, this problem using existing mechanisms;
> while you disputed many (all?) of those suggestions, I felt that
> they still had merit despite your objections.

JFYI, the discussion around the initial patchset produced three
suggestions:

- Disabling CONFIG_NUMA, proposed by Michal:
  By default, tasks on a server allocate memory from their local
  memory node first, so disabling CONFIG_NUMA could lead to a
  performance hit.

- Adjusting the NUMA workload configuration, also from Michal:
  This adjustment has been implemented successfully on some dedicated
  clusters, as mentioned in the commit log. However, applying it
  universally across a large fleet of servers might waste a
  significant amount of physical memory.

- Implementing seccomp, suggested by Ondrej and Casey:
  As indicated in the commit log, altering the security policy
  dynamically without interrupting a running container isn't
  straightforward. Supporting that with seccomp would require
  introducing an eBPF-based seccomp, which constitutes a substantial
  change. (A sketch of a static seccomp filter follows this list.)

[ The seccomp maintainer has been added to this mail thread for
  further discussion. ]
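For illustration only (this is not code from the patchset), a classic
static seccomp-BPF filter covering the two mempolicy syscalls could
look like the sketch below. Once installed, such a filter cannot be
revised for an already-running container, which is exactly the
limitation that pushes toward an eBPF-based seccomp:

/*
 * Minimal sketch: reject set_mempolicy(2) and mbind(2) with EPERM
 * via a static seccomp-BPF filter. A production filter should also
 * validate seccomp_data->arch before trusting the syscall number.
 */
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

static int deny_mempolicy_syscalls(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Jump to the EPERM return for the two syscalls. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_set_mempolicy, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mbind, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Mandatory unless the installing task has CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

The filter must be set up by the container runtime before the
workload starts, which is the kubelet/container adaptation issue
noted in the cover letter above.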
> I also don't believe the SELinux implementation of the
> set_mempolicy hook fits with the existing SELinux philosophy of
> access control via type enforcement; outside of some checks on
> executable memory and low memory ranges, SELinux doesn't currently
> enforce policy on memory ranges like this. SELinux focuses more on
> tasks being able to access data/resources on the system.
>
> My current opinion is that you should pursue some of the
> mitigations that have already been mentioned, including seccomp
> and/or a better NUMA workload configuration. I would also encourage
> you to pursue the OOM improvement you briefly described. All of
> those seem like better options than this new LSM/SELinux hook.

The OOM improvement should not be our primary approach. Whenever
possible, we should prioritize solutions that prevent the OOM
situation from arising in the first place.
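To make the preferred alternative concrete for reviewers, below is a
minimal sketch of the kind of LSM BPF program this series enables.
The attach point follows the set_mempolicy hook added in patch #3,
but the hook arguments shown here are an illustrative assumption
rather than a copy of the patch (see the selftest in patch #5 for the
real program):

/* Sketch only: deny MPOL_BIND through the proposed LSM hook. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MPOL_BIND 2	/* include/uapi/linux/mempolicy.h */
#define EPERM     1	/* include/uapi/asm-generic/errno-base.h */

char _license[] SEC("license") = "GPL";

/* Assumed argument list: the requested policy mode comes first. */
SEC("lsm/set_mempolicy")
int BPF_PROG(deny_mpol_bind, unsigned long mode)
{
	/* Reject strict NUMA binding for every confined task. */
	if (mode == MPOL_BIND)
		return -EPERM;
	return 0;
}

Unlike a seccomp filter, such a program can be attached and detached
while the containers keep running, which is why we consider it the
better fit for our environment.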
-- 
Regards
Yafang