From: Yafang Shao
Date: Wed, 15 Nov 2023 22:26:36 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Michal Hocko
Cc: Casey Schaufler, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao wrote:
>
> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko wrote:
> >
> > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
> > > >
> > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
> > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
> > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > >>>>> Background
> > > > >>>>> ==========
> > > > >>>>>
> > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > >>>> This looks like an user space configuration error rather than a
> > > > >>>> system security issue.
> > > > >>> It appears my initial description may have caused confusion. In this
> > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > >>> utilizing mbind(2)?
> > > > >> How is this any different than a non NUMA (mbind) situation?
> > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > specific nodes, it will result in significant memory wastage.
> > > >
> > > > That still sounds like you've misconfigured your containers such
> > > > that they expect to get more memory than is available, and that
> > > > they have more control over it than they really do.
> > >
> > > And again: What configuration method is suitable to limit user control
> > > over memory policy adjustments, besides the heavyweight seccomp
> > > approach?
> >
> > This really depends on the workloads. What is the reason mbind is used
> > in the first place?
>
> It can improve their performance.
>
> > Is it acceptable to partition the system so that
> > there is a numa node reserved for NUMA aware workloads?
>
> As highlighted in the commit log, our preference is to configure this
> memory policy through kubelet using cpuset.mems in the cpuset
> controller, rather than allowing individual users to set it
> independently.
>
> > If not, have you
> > considered (already proposed numa=off)?
>
> The challenge at hand isn't solely about whether users should bind to
> a memory node or the deployment of workloads. What we're genuinely
> dealing with is the fact that users can bind to a specific node
> without our explicit agreement or authorization.

BTW, the same principle should also apply to sched_setaffinity(2).
While there's already a security_task_setscheduler() in place, it's
undeniable that we should also consider adding a
security_set_mempolicy() for consistency.

--
Regards
Yafang