From: Casey Schaufler
Date: Sun, 12 Nov 2023 08:45:34 -0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
To: Yafang Shao, akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com
Cc: linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com, Casey Schaufler
Message-ID: <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>
In-Reply-To: <20231112073424.4216-1-laoar.shao@gmail.com>
References: <20231112073424.4216-1-laoar.shao@gmail.com>

On 11/11/2023 11:34 PM, Yafang Shao wrote:
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node.
> When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks. This becomes more critical with guaranteed
> tasks (oom_score_adj: -998) aggravating the issue.

Is there some reason why you can't fix the callers of mbind(2)?
This looks like a user space configuration error rather than a
system security issue.

>
> The selected victim might not have allocated memory on the same NUMA node,
> rendering the killing ineffective. This patch aims to address this by
> disabling MPOL_BIND in container environments.
>
> In the container environment, our aim is to consolidate memory resource
> control under the management of kubelet. If users express a preference for
> binding their memory to a specific NUMA node, we encourage the adoption of
> a standardized approach. Specifically, we recommend configuring this memory
> policy through kubelet using cpuset.mems in the cpuset controller, rather
> than individual users setting it autonomously. This centralized approach
> ensures that NUMA nodes are globally managed through kubelet, promoting
> consistency and facilitating streamlined administration of memory resources
> across the entire containerized environment.

Changing system behavior for a single use case doesn't seem prudent.
You're introducing a bunch of kernel code to avoid fixing a broken
user space configuration.

>
> Proposed Solutions
> ==================
>
> - Introduce Capability to Disable MPOL_BIND
>   Currently, any task can perform MPOL_BIND without specific capabilities.
>   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
>   may have unintended consequences. Capabilities, being broad, might grant
>   unnecessary privileges. We should explore alternatives to prevent
>   unexpected side effects.
>
> - Use LSM BPF to Disable MPOL_BIND
>   Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
>   set_mempolicy_home_node(2) to disable MPOL_BIND.
>   This approach is more flexible and allows for fine-grained control
>   without unintended consequences. A sample LSM BPF program is included,
>   demonstrating practical implementation in a production environment.
>
> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node. My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
>
> Yafang Shao (4):
>   mm, security: Add lsm hook for mbind(2)
>   mm, security: Add lsm hook for set_mempolicy(2)
>   mm, security: Add lsm hook for set_mempolicy_home_node(2)
>   selftests/bpf: Add selftests for mbind(2) with lsm prog
>
>  include/linux/lsm_hook_defs.h                      |  8 +++
>  include/linux/security.h                           | 26 +++++++
>  mm/mempolicy.c                                     | 13 ++++
>  security/security.c                                | 19 ++++++
>  tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
>  6 files changed, 174 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
>
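
For reference, the decision the quoted cover letter describes (refuse
MPOL_BIND, let every other mode through) can be modeled in plain user-space
C. This is only a sketch under stated assumptions: deny_mpol_bind() is a
hypothetical name and is not a hook from the series, and the constants are
copied from include/uapi/linux/mempolicy.h so the block builds without
kernel headers. It says nothing about how the actual hooks or the sample
BPF program are written.

```c
#include <errno.h>

/* Values copied from include/uapi/linux/mempolicy.h. */
#define MPOL_BIND              2
#define MPOL_INTERLEAVE        3
#define MPOL_F_NUMA_BALANCING  (1 << 13)
#define MPOL_F_RELATIVE_NODES  (1 << 14)
#define MPOL_F_STATIC_NODES    (1 << 15)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

/*
 * Hypothetical policy check, not taken from the patch series.
 * The mode argument of mbind(2) and set_mempolicy(2) may carry optional
 * MPOL_F_* flags in its upper bits, so those are masked off before the
 * base mode is compared.
 * Returns 0 to allow the request and -EPERM to deny it.
 */
static int deny_mpol_bind(unsigned long mode)
{
	unsigned long base_mode = mode & ~(unsigned long)MPOL_MODE_FLAGS;

	return base_mode == MPOL_BIND ? -EPERM : 0;
}
```

With this model, deny_mpol_bind(MPOL_BIND | MPOL_F_STATIC_NODES) is refused
while deny_mpol_bind(MPOL_INTERLEAVE) is allowed. The same effect for a
whole container is what cpuset.mems already provides from the
administrator's side.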