From: Paul Moore
Date: Sun, 12 Nov 2023 15:32:45 -0500
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao
Cc: akpm@linux-foundation.org, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com
In-Reply-To: <20231112073424.4216-1-laoar.shao@gmail.com>
References: <20231112073424.4216-1-laoar.shao@gmail.com>

On Sun, Nov 12, 2023 at 2:35 AM Yafang Shao wrote:
>
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node. When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks. This becomes more critical with guaranteed
> tasks (oom_score_adj: -998) aggravating the issue.
>
> The selected victim might not have allocated memory on the same NUMA node,
> rendering the killing ineffective. This patch aims to address this by
> disabling MPOL_BIND in container environments.
>
> In the container environment, our aim is to consolidate memory resource
> control under the management of kubelet. If users express a preference for
> binding their memory to a specific NUMA node, we encourage the adoption of
> a standardized approach. Specifically, we recommend configuring this memory
> policy through kubelet using cpuset.mems in the cpuset controller, rather
> than individual users setting it autonomously. This centralized approach
> ensures that NUMA nodes are globally managed through kubelet, promoting
> consistency and facilitating streamlined administration of memory resources
> across the entire containerized environment.
>
> Proposed Solutions
> ==================
>
> - Introduce Capability to Disable MPOL_BIND
>   Currently, any task can perform MPOL_BIND without specific capabilities.
>   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
>   may have unintended consequences. Capabilities, being broad, might grant
>   unnecessary privileges. We should explore alternatives to prevent
>   unexpected side effects.
>
> - Use LSM BPF to Disable MPOL_BIND
>   Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
>   set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
>   flexible and allows for fine-grained control without unintended
>   consequences.
>   A sample LSM BPF program is included, demonstrating
>   practical implementation in a production environment.

Without looking at the patchset in any detail yet, I wanted to point
out that we do have some documented guidelines for adding new LSM
hooks:

https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines

I just learned that there are provisions for adding this to the
MAINTAINERS file; I'll be doing that shortly. My apologies for not
having it in there sooner.

> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node. My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/

-- 
paul-moore.com