Date: Wed, 15 Nov 2023 09:45:43 +0100
From: Michal Hocko
To: Yafang Shao
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com,
	jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org,
	linux-security-module@vger.kernel.org, bpf@vger.kernel.org,
	ligang.bdlg@bytedance.com
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over
	memory policy adjustments with lsm bpf
References: <20231112073424.4216-1-laoar.shao@gmail.com>
	<188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
> >
> > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
> > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > >>>>> Background
> > >>>>> ==========
> > >>>>>
> > >>>>> In our containerized environment, we've identified unexpected OOM events
> > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > >>>> This looks like an user space configuration error rather than a
> > >>>> system security issue.
> > >>> It appears my initial description may have caused confusion. In this
> > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > >>> While a privileged user, such as root, experiencing this issue might
> > >>> indicate a user space configuration error, the concerning aspect is
> > >>> the potential for an unprivileged user to disrupt the system easily.
> > >>> If this is perceived as a misconfiguration, the question arises: What
> > >>> is the correct configuration to prevent an unprivileged user from
> > >>> utilizing mbind(2)?"
> > >> How is this any different than a non NUMA (mbind) situation?
> > > In a UMA system, each gigabyte of memory carries the same cost.
> > > Conversely, in a NUMA architecture, opting to confine processes within
> > > a specific NUMA node incurs additional costs. In the worst-case
> > > scenario, if all containers opt to bind their memory exclusively to
> > > specific nodes, it will result in significant memory wastage.
> >
> > That still sounds like you've misconfigured your containers such
> > that they expect to get more memory than is available, and that
> > they have more control over it than they really do.
>
> And again: What configuration method is suitable to limit user control
> over memory policy adjustments, besides the heavyweight seccomp
> approach?

This really depends on the workloads. What is the reason mbind is used
in the first place? Is it acceptable to partition the system so that
there is a NUMA node reserved for NUMA-aware workloads? If not, have you
considered the already proposed numa=off?
-- 
Michal Hocko
SUSE Labs
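
For readers following the thread, the situation in the quoted background (an
OOM on one node while the system as a whole still has free memory) can be
illustrated with a toy model. This is only a sketch under stated assumptions:
ToyNumaSystem, NodeOOM and the per-node sizes are hypothetical names and
numbers made up for illustration, and nothing here calls the kernel or
mbind(2). The "strict" flag loosely mirrors a strict binding (as with
MPOL_BIND), while the non-strict path mirrors a preference that may fall back
to another node.

```python
from dataclasses import dataclass, field


class NodeOOM(Exception):
    """Raised when an allocation cannot be satisfied (toy stand-in for an OOM)."""


@dataclass
class ToyNumaSystem:
    # Free memory per NUMA node in GiB. Hypothetical numbers, not from the thread.
    free_gib: dict = field(default_factory=lambda: {0: 4, 1: 4})

    def total_free(self):
        """Free memory summed over all nodes."""
        return sum(self.free_gib.values())

    def alloc(self, size_gib, node, strict):
        """Allocate from `node`; use another node only when `strict` is False."""
        if self.free_gib[node] >= size_gib:
            self.free_gib[node] -= size_gib
            return node
        if strict:
            # A strict binding may not spill over, even if other nodes are idle.
            raise NodeOOM(
                f"node {node} exhausted, {self.total_free()} GiB free elsewhere"
            )
        for other, free in self.free_gib.items():
            if other != node and free >= size_gib:
                self.free_gib[other] -= size_gib
                return other
        raise NodeOOM("no node can satisfy the request")


if __name__ == "__main__":
    system = ToyNumaSystem()
    system.alloc(4, node=0, strict=True)      # a bound task fills node 0
    try:
        system.alloc(1, node=0, strict=True)  # the next bound request fails
    except NodeOOM as exc:
        print("bound allocation failed:", exc)
    # Without the strict binding, the same request is served from node 1.
    print("fallback node:", system.alloc(1, node=0, strict=False))
```

In this model the strict request fails although half of the memory is still
free, which is the wastage concern raised above. Partitioning the system so
that one node is reserved for NUMA-aware workloads keeps that effect confined
to the reserved node.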