Date: Wed, 15 Nov 2023 18:00:20 +0100
From: Michal Hocko <mhocko@suse.com>
To: Yafang Shao
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>
On Wed 15-11-23 17:33:51, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko wrote:
> >
> > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
> > > >
> > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
> > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > >>>>> Background
> > > > >>>>> ==========
> > > > >>>>>
> > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > >>>>> where the OOM killer terminates tasks despite ample free memory being
> > > > >>>>> available elsewhere. This anomaly traces back to tasks within a container
> > > > >>>>> using mbind(2) to bind memory to a specific NUMA node. When the allocated
> > > > >>>>> memory on this node is exhausted, the OOM killer, prioritizing tasks
> > > > >>>>> based on oom_score, indiscriminately kills tasks. This becomes more
> > > > >>>>> critical with guaranteed tasks (oom_score_adj: -998), which aggravates
> > > > >>>>> the issue.
> > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > >>>> This looks like a user space configuration error rather than a
> > > > >>>> system security issue.
> > > > >>> It appears my initial description may have caused confusion. In this
> > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > >>> If this is perceived as a misconfiguration, the question arises: what
> > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > >>> utilizing mbind(2)?
> > > > >> How is this any different from a non-NUMA (mbind) situation?
> > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > specific nodes, it will result in significant memory wastage.
> > > >
> > > > That still sounds like you've misconfigured your containers such
> > > > that they expect to get more memory than is available, and that
> > > > they have more control over it than they really do.
> > >
> > > And again: what configuration method is suitable to limit user control
> > > over memory policy adjustments, besides the heavyweight seccomp
> > > approach?
> >
> > This really depends on the workloads. What is the reason mbind is used
> > in the first place?
>
> It can improve their performance.
>
> > Is it acceptable to partition the system so that
> > there is a NUMA node reserved for NUMA-aware workloads?
>
> As highlighted in the commit log, our preference is to configure this
> memory policy through kubelet using cpuset.mems in the cpuset
> controller, rather than allowing individual users to set it
> independently.

OK, I have missed that part.

> > If not, have you
> > considered the already proposed numa=off?
>
> The challenge at hand isn't solely about whether users should bind to
> a memory node or the deployment of workloads. What we're genuinely
> dealing with is the fact that users can bind to a specific node
> without our explicit agreement or authorization.

mbind outside of the cpuset shouldn't be possible (policy_nodemask). So
if you are configuring cpusets already then mbind shouldn't add much to
the problem. I can see how you can have problems when you do not have
any NUMA partitioning in place, because mixing NUMA-aware and unaware
workloads doesn't really work out well when memory is in short supply.
-- 
Michal Hocko
SUSE Labs
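As an aside, the cpuset partitioning referred to above could be sketched as follows on cgroup v2 (a config illustration only: the cgroup names and node numbers are assumptions for a two-node machine, and writing these files requires root):

```shell
# Reserve node 1 for NUMA-aware workloads; confine everything else to node 0.
mkdir -p /sys/fs/cgroup/general /sys/fs/cgroup/numa-aware
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo 0 > /sys/fs/cgroup/general/cpuset.mems      # mbind() here can only pick node 0
echo 1 > /sys/fs/cgroup/numa-aware/cpuset.mems   # dedicated node for bound workloads
```

Because an mbind(2) nodemask is intersected with the calling task's cpuset (policy_nodemask), tasks placed in the "general" group cannot bind memory outside node 0 regardless of the mask they pass.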