From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 16 Nov 2023 10:22:12 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Michal Hocko
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Thu, Nov 16, 2023 at 1:00 AM Michal Hocko wrote:
>
> On Wed 15-11-23 17:33:51, Yafang Shao wrote:
> > On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko wrote:
> > >
> > > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
> > > > >
> > > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
> > > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > > >>>>> Background
> > > > > >>>>> ==========
> > > > > >>>>>
> > > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > > >>>> This looks like an user space configuration error rather than a
> > > > > >>>> system security issue.
> > > > > >>> It appears my initial description may have caused confusion. In this
> > > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > > >>> utilizing mbind(2)?"
> > > > > >> How is this any different than a non NUMA (mbind) situation?
> > > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > > specific nodes, it will result in significant memory wastage.
> > > > >
> > > > > That still sounds like you've misconfigured your containers such
> > > > > that they expect to get more memory than is available, and that
> > > > > they have more control over it than they really do.
> > > >
> > > > And again: What configuration method is suitable to limit user control
> > > > over memory policy adjustments, besides the heavyweight seccomp
> > > > approach?
> > >
> > > This really depends on the workloads. What is the reason mbind is used
> > > in the first place?
> >
> > It can improve their performance.
> >
> > > Is it acceptable to partition the system so that
> > > there is a numa node reserved for NUMA aware workloads?
> >
> > As highlighted in the commit log, our preference is to configure this
> > memory policy through kubelet using cpuset.mems in the cpuset
> > controller, rather than allowing individual users to set it
> > independently.
>
> OK, I have missed that part.
>
> > > If not, have you
> > > considered (already proposed numa=off)?
> >
> > The challenge at hand isn't solely about whether users should bind to
> > a memory node or the deployment of workloads. What we're genuinely
> > dealing with is the fact that users can bind to a specific node
> > without our explicit agreement or authorization.
>
> mbind outside of the cpuset shouldn't be possible (policy_nodemask). So
> if you are configuring cpusets already then mbind should add much to a
> problem.
> I can see how you can have problems when you do not have any
> NUMA partitioning in place because mixing NUMA aware and unaware
> workloads doesn't really work out well when the memory is short on
> supply.

Right, we're trying to move NUMA aware workloads to dedicated servers.

-- 
Regards
Yafang
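[For readers following along: the cpuset.mems confinement discussed in this thread looks roughly like the fragment below. This is a hypothetical cgroup v2 sketch of what kubelet arranges through the cpuset controller; the group name is illustrative and the commands require root. Inside such a group, policy_nodemask() restricts mbind(2) to the nodes in cpuset.mems, which is the behavior Michal refers to above.]

```shell
# Enable the cpuset controller for child groups, then create a group
# for the pod and pin its allocations to NUMA node 0.
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/pod-example
echo 0 > /sys/fs/cgroup/pod-example/cpuset.mems

# Move the workload into the group; from here on, mbind(2) to any
# node outside cpuset.mems is not effective.
echo $$ > /sys/fs/cgroup/pod-example/cgroup.procs
```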