From: Yafang Shao
Date: Mon, 13 Nov 2023 11:17:51 +0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Paul Moore
Cc: akpm@linux-foundation.org, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com
References: <20231112073424.4216-1-laoar.shao@gmail.com>

On Mon, Nov 13, 2023 at 4:32 AM Paul Moore wrote:
>
> On Sun, Nov 12, 2023 at 2:35 AM Yafang Shao wrote:
> >
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM events
> > where the OOM-killer terminates tasks despite having ample free memory.
> > This anomaly is traced back to tasks within a container using mbind(2) to
> > bind memory to a specific NUMA node. When the allocated memory on this node
> > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > indiscriminately kills tasks. This becomes more critical with guaranteed
> > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > The selected victim might not have allocated memory on the same NUMA node,
> > rendering the killing ineffective. This patch aims to address this by
> > disabling MPOL_BIND in container environments.
> >
> > In the container environment, our aim is to consolidate memory resource
> > control under the management of kubelet. If users express a preference for
> > binding their memory to a specific NUMA node, we encourage the adoption of
> > a standardized approach. Specifically, we recommend configuring this memory
> > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > than individual users setting it autonomously. This centralized approach
> > ensures that NUMA nodes are globally managed through kubelet, promoting
> > consistency and facilitating streamlined administration of memory resources
> > across the entire containerized environment.
> >
> > Proposed Solutions
> > ==================
> >
> > - Introduce Capability to Disable MPOL_BIND
> >   Currently, any task can perform MPOL_BIND without specific capabilities.
> >   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> >   may have unintended consequences. Capabilities, being broad, might grant
> >   unnecessary privileges. We should explore alternatives to prevent
> >   unexpected side effects.
> >
> > - Use LSM BPF to Disable MPOL_BIND
> >   Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
> >   set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
> >   flexible and allows for fine-grained control without unintended
> >   consequences. A sample LSM BPF program is included, demonstrating
> >   practical implementation in a production environment.
>
> Without looking at the patchset in any detail yet, I wanted to point
> out that we do have some documented guidelines for adding new LSM
> hooks:
>
> https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines
>
> I just learned that there are provisions for adding this to the
> MAINTAINERS file, I'll be doing that shortly. My apologies for not
> having it in there sooner.

Thanks for the information. I will study the guidelines carefully.

-- 
Regards
Yafang
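
The second proposal in the quoted cover letter, denying MPOL_BIND from an
LSM hook, comes down to a small decision function. The sketch below models
that decision in plain Python rather than BPF, since the sample program from
the series is not reproduced in this message. The function name
mempolicy_hook and the "restricted" argument are hypothetical; the real hook
signature is whatever the patchset defines. The numeric constants are the
ones from include/uapi/linux/mempolicy.h. One detail worth showing is that
the mode passed to mbind(2) or set_mempolicy(2) may carry optional flag bits,
so they have to be masked off before comparing against MPOL_BIND.

```python
import errno

# Policy modes and mode flags as defined in include/uapi/linux/mempolicy.h.
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL = range(5)
MPOL_F_NUMA_BALANCING = 1 << 13
MPOL_F_RELATIVE_NODES = 1 << 14
MPOL_F_STATIC_NODES = 1 << 15
MPOL_MODE_FLAGS = (MPOL_F_STATIC_NODES
                   | MPOL_F_RELATIVE_NODES
                   | MPOL_F_NUMA_BALANCING)


def mempolicy_hook(mode: int, restricted: bool) -> int:
    """Hypothetical hook body: return 0 to allow, -EPERM to deny.

    `restricted` stands in for whatever test the policy author uses to
    identify tasks that must not bind memory (an assumption, for example
    membership in a particular cgroup).
    """
    base_mode = mode & ~MPOL_MODE_FLAGS  # strip the optional flag bits
    if restricted and base_mode == MPOL_BIND:
        return -errno.EPERM
    return 0
```

With this model, a restricted task asking for MPOL_BIND (with or without
MPOL_F_STATIC_NODES) is refused, while MPOL_INTERLEAVE, or MPOL_BIND from an
unrestricted task, is allowed.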