From: Yafang Shao
Date: Wed, 15 Nov 2023 17:33:51 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Michal Hocko
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko wrote:
>
> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
> > >
> > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
> > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > >>>>> Background
> > > >>>>> ==========
> > > >>>>>
> > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > >>>> This looks like a user space configuration error rather than a
> > > >>>> system security issue.
> > > >>> It appears my initial description may have caused confusion. In this
> > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > >>> While a privileged user, such as root, experiencing this issue might
> > > >>> indicate a user space configuration error, the concerning aspect is
> > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > >>> is the correct configuration to prevent an unprivileged user from
> > > >>> utilizing mbind(2)?
> > > >> How is this any different than a non NUMA (mbind) situation?
> > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > scenario, if all containers opt to bind their memory exclusively to
> > > > specific nodes, it will result in significant memory wastage.
> > >
> > > That still sounds like you've misconfigured your containers such
> > > that they expect to get more memory than is available, and that
> > > they have more control over it than they really do.
> >
> > And again: What configuration method is suitable to limit user control
> > over memory policy adjustments, besides the heavyweight seccomp
> > approach?
>
> This really depends on the workloads. What is the reason mbind is used
> in the first place?

It can improve their performance.

> Is it acceptable to partition the system so that
> there is a numa node reserved for NUMA aware workloads?

As highlighted in the commit log, our preference is to configure this
memory policy through kubelet using cpuset.mems in the cpuset
controller, rather than allowing individual users to set it
independently.

> If not, have you
> considered (already proposed numa=off)?

The challenge at hand isn't solely about whether users should bind to
a memory node or the deployment of workloads. What we're genuinely
dealing with is the fact that users can bind to a specific node
without our explicit agreement or authorization.

-- 
Regards
Yafang
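
For illustration only (this is not part of the RFC series and not how the
patches work): a minimal sketch of how an administrator-side tool might
interpret a cpuset.mems value and check whether a requested node set stays
within it. The helper names parse_node_list and bind_allowed are
hypothetical, and the sketch assumes the usual kernel list format such as
"0-1,3".

```python
def parse_node_list(text):
    """Parse a kernel-style list such as "0-1,3" into a set of node IDs."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            # An empty cpuset.mems value yields an empty set.
            continue
        lo, _, hi = part.partition("-")
        # A single ID like "3" has no upper bound, so reuse the lower one.
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes


def bind_allowed(requested_nodes, cpuset_mems):
    """Return True if every requested node appears in the cpuset.mems list."""
    return set(requested_nodes) <= parse_node_list(cpuset_mems)
```

With cpuset.mems set to "0-1", a request for node 1 would pass this check
and a request for node 2 would not. A check of this kind only captures the
set of nodes an administrator has granted; it does not by itself address a
user narrowing its own policy to a single granted node, which is the
concern raised above.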