From: Ondrej Mosnacek
Date: Mon, 13 Nov 2023 09:50:02 +0100
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao wrote:
>
> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> >
> > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > Is there some reason why you can't fix the callers of mbind(2)?
> > This looks like an user space configuration error rather than a
> > system security issue.
>
> It appears my initial description may have caused confusion. In this
> scenario, the caller is an unprivileged user lacking any capabilities.
> While a privileged user, such as root, experiencing this issue might
> indicate a user space configuration error, the concerning aspect is
> the potential for an unprivileged user to disrupt the system easily.
> If this is perceived as a misconfiguration, the question arises: What
> is the correct configuration to prevent an unprivileged user from
> utilizing mbind(2)?"
>
> >
> > >
> > > The selected victim might not have allocated memory on the same NUMA node,
> > > rendering the killing ineffective. This patch aims to address this by
> > > disabling MPOL_BIND in container environments.
> > >
> > > In the container environment, our aim is to consolidate memory resource
> > > control under the management of kubelet. If users express a preference for
> > > binding their memory to a specific NUMA node, we encourage the adoption of
> > > a standardized approach. Specifically, we recommend configuring this memory
> > > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > > than individual users setting it autonomously.
> > > This centralized approach
> > > ensures that NUMA nodes are globally managed through kubelet, promoting
> > > consistency and facilitating streamlined administration of memory resources
> > > across the entire containerized environment.
> >
> > Changing system behavior for a single use case doesn't seem prudent.
> > You're introducing a bunch of kernel code to avoid fixing a broken
> > user space configuration.
>
> Currently, there is no mechanism in place to proactively prevent an
> unprivileged user from utilizing mbind(2). The approach adopted is to
> monitor mbind(2) through a BPF program and trigger an alert if its
> usage is detected. However, beyond this monitoring, the only recourse
> is to verbally communicate with the user, advising against the use of
> mbind(2). As a result, users will question why mbind(2) isn't outright
> prohibited in the first place.

Is there a reason why you can't use syscall filtering via seccomp(2)?
AFAIK, all the mainstream container tooling already has support for
specifying seccomp filters for containers.

-- 
Ondrej Mosnacek
Senior Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
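
To illustrate the seccomp suggestion, here is a minimal sketch (not taken from this thread) that generates a container seccomp profile in the JSON layout used by Docker and the OCI runtime spec. It assumes a default-allow policy with only mbind denied; production profiles are usually allow-lists, so in practice the rule would be merged into an existing profile. The helper name build_seccomp_profile and the choice of denied syscalls are hypothetical.

```python
import json


def build_seccomp_profile(denied_syscalls):
    """Build an OCI/Docker-style seccomp profile as a dict.

    Assumption for illustration: every syscall is allowed by default and
    only the listed ones are rejected with an error return.
    """
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                # Syscalls the container must not use, e.g. mbind(2).
                "names": sorted(denied_syscalls),
                # Fail the call with an errno instead of killing the task.
                "action": "SCMP_ACT_ERRNO",
            }
        ],
    }


profile = build_seccomp_profile({"mbind"})
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

With Docker, a file holding this JSON could be passed as `--security-opt seccomp=<file>`; Kubernetes can reference a node-local profile through `securityContext.seccompProfile` with type `Localhost`. Whether set_mempolicy(2) should be denied as well depends on the policy being enforced.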