From: Paul Moore
Date: Fri, 22 Dec 2023 19:16:42 -0500
Subject: Re: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao
Cc: akpm@linux-foundation.org, jmorris@namei.org, serge@hallyn.com, omosnace@redhat.com, casey@schaufler-ca.com, kpsingh@kernel.org, mhocko@suse.com, ying.huang@intel.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
In-Reply-To: <20231214125033.4158-1-laoar.shao@gmail.com>

On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao wrote:
>
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node. When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks.
>
> The Challenge
> =============
>
> In a containerized environment, independent memory binding by a user can
> lead to unexpected system issues or disrupt tasks being run by other users
> on the same server. If a user genuinely requires memory binding, we will
> allocate dedicated servers to them by leveraging kubelet deployment.
>
> Currently, users possess the ability to autonomously bind their memory to
> specific nodes without explicit agreement or authorization from our end.
> It's imperative that we establish a method to prevent this behavior.
>
> Proposed Solution
> =================
>
> - Capability
>   Currently, any task can perform MPOL_BIND without specific capabilities.
>   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
>   may have unintended consequences. Capabilities, being broad, might grant
>   unnecessary privileges. We should explore alternatives to prevent
>   unexpected side effects.
>
> - LSM
>   Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
>   to disable MPOL_BIND. This approach is more flexible and allows for
>   fine-grained control without unintended consequences. A sample LSM BPF
>   program is included, demonstrating practical implementation in a
>   production environment.
>
> - seccomp
>   seccomp is relatively heavyweight, making it less suitable for
>   enabling in our production environment:
>   - Both kubelet and containers need adaptation to support it.
>   - Dynamically altering security policies for individual containers
>     without interrupting their operations isn't straightforward.
>
> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node.
> My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
>
> Changes:
> - v4 -> v5:
>   - Revise the commit log in patch #5. (KP)
> - v3 -> v4: https://lwn.net/Articles/954126/
>   - Drop the changes around security_task_movememory (Serge)
> - RFC v2 -> v3: https://lwn.net/Articles/953526/
>   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
>   - Fix bpf selftests error reported by bot+bpf-ci
> - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
>   - Refine the commit log to avoid being misleading
>   - Use one common lsm hook instead and add comment for it
>   - Add selinux implementation
>   - Other improvements in mempolicy
> - RFC v1: https://lwn.net/Articles/951188/
>
> Yafang Shao (5):
>   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
>   mm: mempolicy: Revise comment regarding mempolicy mode flags
>   mm, security: Add lsm hook for memory policy adjustment
>   security: selinux: Implement set_mempolicy hook
>   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
>
>  .../admin-guide/mm/numa_memory_policy.rst     | 27 +++++++
>  include/linux/lsm_hook_defs.h                 |  3 +
>  include/linux/security.h                      |  9 +++
>  include/uapi/linux/mempolicy.h                |  2 +-
>  mm/mempolicy.c                                |  8 +++
>  security/security.c                           | 13 ++++
>  security/selinux/hooks.c                      |  8 +++
>  security/selinux/include/classmap.h           |  2 +-
>  .../selftests/bpf/prog_tests/set_mempolicy.c  | 84 ++++++++++++++++++++++
>  .../selftests/bpf/progs/test_set_mempolicy.c  | 28 ++++++++
>  10 files changed, 182 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c

In your original patchset there was a lot of good discussion about
ways to solve, or mitigate, this problem using existing mechanisms;
while you disputed many (all?) of those suggestions, I felt that they
still had merit over your objections.

I also don't believe the SELinux implementation of the set_mempolicy
hook fits with the existing SELinux philosophy of access control via
type enforcement. Outside of some checks on executable memory and low
memory ranges, SELinux doesn't currently enforce policy on memory
ranges like this; it focuses more on tasks being able to access
data/resources on the system.

My current opinion is that you should pursue some of the mitigations
that have already been mentioned, including seccomp and/or a better
NUMA workload configuration. I would also encourage you to pursue the
OOM improvement you briefly described. All of those seem like better
options than this new LSM/SELinux hook.

-- 
paul-moore.com
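
For illustration only, and not taken from this thread: the seccomp mitigation mentioned above could look roughly like the sketch below. It is a classic-BPF seccomp filter that fails set_mempolicy(2) and mbind(2) with EPERM when the requested mode, with the optional mode flags masked off, is MPOL_BIND. All sk_*/SK_* names are hypothetical. The MPOL values are copied from <linux/mempolicy.h>, only x86_64 and aarch64 are handled, and sk_dry_run() is merely a userspace interpreter for the few opcodes used, so the logic can be checked without installing the filter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Values copied from <linux/mempolicy.h> so the sketch does not need that header. */
#define SK_MPOL_BIND       2u
#define SK_MPOL_MODE_FLAGS ((1u << 15) | (1u << 14) | (1u << 13))

#if defined(__x86_64__)
#define SK_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SK_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

#define SK_DENY (SECCOMP_RET_ERRNO | EPERM)
/* Offset of the low 32 bits of syscall argument n (little-endian assumed). */
#define SK_ARG(n) ((uint32_t)(offsetof(struct seccomp_data, args) + 8 * (n)))
#define SK_LEN (sizeof(sk_filter) / sizeof(sk_filter[0]))

static const struct sock_filter sk_filter[] = {
	/* Kill anything using a foreign syscall ABI. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SK_ARCH, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Dispatch on the syscall number; unrelated syscalls are allowed. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_set_mempolicy, 3, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mbind, 0, 5),
	/* mbind(addr, len, mode, ...): mode is argument 2. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SK_ARG(2)),
	BPF_JUMP(BPF_JMP | BPF_JA, 1, 0, 0),
	/* set_mempolicy(mode, ...): mode is argument 0. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SK_ARG(0)),
	/* Strip the optional MPOL_F_* bits, then compare with MPOL_BIND. */
	BPF_STMT(BPF_ALU | BPF_AND | BPF_K, ~SK_MPOL_MODE_FLAGS),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SK_MPOL_BIND, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SK_DENY),
};

/* Userspace dry run: interprets only the opcodes that sk_filter uses. */
uint32_t sk_dry_run(const struct seccomp_data *sd)
{
	uint32_t acc = 0;

	for (size_t pc = 0; pc < SK_LEN; pc++) {
		const struct sock_filter *f = &sk_filter[pc];

		switch (f->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			memcpy(&acc, (const char *)sd + f->k, sizeof(acc));
			break;
		case BPF_ALU | BPF_AND | BPF_K:
			acc &= f->k;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (acc == f->k) ? f->jt : f->jf;
			break;
		case BPF_JMP | BPF_JA:
			pc += f->k;
			break;
		case BPF_RET | BPF_K:
			return f->k;
		default:
			assert(!"opcode not handled by this dry run");
			return SECCOMP_RET_KILL_PROCESS;
		}
	}
	return SECCOMP_RET_KILL_PROCESS;
}

/* Installs the filter on the calling thread; it is inherited across fork/exec. */
int sk_install(void)
{
	struct sock_fprog prog = {
		.len = (unsigned short)SK_LEN,
		.filter = (struct sock_filter *)sk_filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A container runtime or launcher would call something like sk_install() before exec'ing the workload. A production filter would also have to deal with the x32 ABI on x86_64, which this sketch lets through, and generating the program with a library such as libseccomp is usually less error-prone than hand-written jump offsets.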