From: Yafang Shao
Date: Tue, 14 Nov 2023 10:30:53 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Ondrej Mosnacek
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Mon, Nov 13, 2023 at 4:50 PM Ondrej Mosnacek wrote:
>
> On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao wrote:
> >
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like an user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?"
> >
> > >
> > > >
> > > > The selected victim might not have allocated memory on the same NUMA node,
> > > > rendering the killing ineffective. This patch aims to address this by
> > > > disabling MPOL_BIND in container environments.
> > > >
> > > > In the container environment, our aim is to consolidate memory resource
> > > > control under the management of kubelet. If users express a preference for
> > > > binding their memory to a specific NUMA node, we encourage the adoption of
> > > > a standardized approach. Specifically, we recommend configuring this memory
> > > > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > > > than individual users setting it autonomously.
> > > > This centralized approach
> > > > ensures that NUMA nodes are globally managed through kubelet, promoting
> > > > consistency and facilitating streamlined administration of memory resources
> > > > across the entire containerized environment.
> > >
> > > Changing system behavior for a single use case doesn't seem prudent.
> > > You're introducing a bunch of kernel code to avoid fixing a broken
> > > user space configuration.
> >
> > Currently, there is no mechanism in place to proactively prevent an
> > unprivileged user from utilizing mbind(2). The approach adopted is to
> > monitor mbind(2) through a BPF program and trigger an alert if its
> > usage is detected. However, beyond this monitoring, the only recourse
> > is to verbally communicate with the user, advising against the use of
> > mbind(2). As a result, users will question why mbind(2) isn't outright
> > prohibited in the first place.
>
> Is there a reason why you can't use syscall filtering via seccomp(2)?
> AFAIK, all the mainstream container tooling already has support for
> specifying seccomp filters for containers.

seccomp is relatively heavyweight, making it less suitable for
enabling in our production environment. In contrast, LSM offers a more
lightweight and flexible alternative. Moreover, the act of binding to
a specific NUMA node appears akin to a privileged operation,
warranting the consideration of a dedicated LSM hook.

-- 
Regards
Yafang