From: Yafang Shao
Date: Mon, 13 Nov 2023 11:15:06 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Casey Schaufler
Cc: akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>
In-Reply-To: <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
>
> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM events
> > where the OOM-killer terminates tasks despite having ample free memory.
> > This anomaly is traced back to tasks within a container using mbind(2) to
> > bind memory to a specific NUMA node. When the allocated memory on this node
> > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > indiscriminately kills tasks. This becomes more critical with guaranteed
> > tasks (oom_score_adj: -998) aggravating the issue.
>
> Is there some reason why you can't fix the callers of mbind(2)?
> This looks like an user space configuration error rather than a
> system security issue.

It appears my initial description may have caused confusion. In this
scenario, the caller is an unprivileged user lacking any capabilities.
If a privileged user such as root ran into this issue, it could
indicate a user space configuration error. The concern is that an
unprivileged user can disrupt the system this easily. If this is
considered a misconfiguration, then what is the correct configuration
to prevent an unprivileged user from calling mbind(2)?

>
> >
> > The selected victim might not have allocated memory on the same NUMA node,
> > rendering the killing ineffective. This patch aims to address this by
> > disabling MPOL_BIND in container environments.
> >
> > In the container environment, our aim is to consolidate memory resource
> > control under the management of kubelet. If users express a preference for
> > binding their memory to a specific NUMA node, we encourage the adoption of
> > a standardized approach. Specifically, we recommend configuring this memory
> > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > than individual users setting it autonomously. This centralized approach
> > ensures that NUMA nodes are globally managed through kubelet, promoting
> > consistency and facilitating streamlined administration of memory resources
> > across the entire containerized environment.
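
To make the cpuset.mems point above concrete: the file holds a node
list in the kernel's list format (for example "0-1,3"), while mbind(2)
takes a bitmask with bit n set for node n. The following is only a
minimal userspace sketch of comparing the two. The helper names
(parse_node_list, nodes_to_mask, binding_allowed) are hypothetical,
only plain "a-b" ranges and single IDs are handled, and it is not a
description of how the kernel or kubelet performs this check.

```python
def parse_node_list(text):
    """Expand a list-format string such as "0-1,3" into a set of node IDs.

    Assumption: only single IDs and plain "a-b" ranges appear in the input.
    """
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes


def nodes_to_mask(nodes):
    """Build a bitmask with bit n set for each node n (the nodemask form)."""
    mask = 0
    for n in nodes:
        mask |= 1 << n
    return mask


def binding_allowed(requested, cpuset_mems):
    """Return True if every requested node is listed in cpuset_mems."""
    return set(requested) <= parse_node_list(cpuset_mems)
```

With cpuset.mems set to "0-1,3", binding_allowed([1], "0-1,3") holds,
while a request for node 2 falls outside the list.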
>
> Changing system behavior for a single use case doesn't seem prudent.
> You're introducing a bunch of kernel code to avoid fixing a broken
> user space configuration.

Currently, there is no mechanism to proactively prevent an
unprivileged user from calling mbind(2). Our current approach is to
monitor mbind(2) with a BPF program and raise an alert when its use is
detected. Beyond this monitoring, the only recourse is to talk to the
user and advise against using mbind(2). As a result, users question
why mbind(2) isn't prohibited outright in the first place.

-- 
Regards
Yafang