Message-ID: <22994ba0-18eb-4f9d-a399-abde52ffdc38@schaufler-ca.com>
Date: Wed, 15 Nov 2023 09:09:06 -0800
Subject: Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
From: Casey Schaufler
To: Yafang Shao, Michal Hocko
Cc: akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org, serge@hallyn.com, linux-mm@kvack.org, linux-security-module@vger.kernel.org, bpf@vger.kernel.org, ligang.bdlg@bytedance.com, Casey Schaufler
References: <20231112073424.4216-1-laoar.shao@gmail.com> <188dc90e-864f-4681-88a5-87401c655878@schaufler-ca.com>

On 11/15/2023 6:26 AM, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao wrote:
>> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko wrote:
>>> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
>>>> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler wrote:
>>>>> On 11/14/2023 3:59 AM, Yafang Shao wrote:
>>>>>> On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko wrote:
>>>>>>> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
>>>>>>>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler wrote:
>>>>>>>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>>>>>>>> Background
>>>>>>>>>> ==========
>>>>>>>>>>
>>>>>>>>>> In our containerized environment, we've identified unexpected OOM events
>>>>>>>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>>>>>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>>>>>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>>>>>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>>>>>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>>>>>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>>>>>>>> Is there some reason why you can't fix the callers of mbind(2)?
>>>>>>>>> This looks like a user space configuration error rather than a
>>>>>>>>> system security issue.
>>>>>>>> It appears my initial description may have caused confusion. In this
>>>>>>>> scenario, the caller is an unprivileged user lacking any capabilities.
>>>>>>>> While a privileged user, such as root, experiencing this issue might
>>>>>>>> indicate a user space configuration error, the concerning aspect is
>>>>>>>> the potential for an unprivileged user to disrupt the system easily.
>>>>>>>> If this is perceived as a misconfiguration, the question arises: What
>>>>>>>> is the correct configuration to prevent an unprivileged user from
>>>>>>>> utilizing mbind(2)?
>>>>>>> How is this any different than a non NUMA (mbind) situation?
>>>>>> In a UMA system, each gigabyte of memory carries the same cost.
>>>>>> Conversely, in a NUMA architecture, opting to confine processes within
>>>>>> a specific NUMA node incurs additional costs. In the worst-case
>>>>>> scenario, if all containers opt to bind their memory exclusively to
>>>>>> specific nodes, it will result in significant memory wastage.
>>>>> That still sounds like you've misconfigured your containers such
>>>>> that they expect to get more memory than is available, and that
>>>>> they have more control over it than they really do.
>>>> And again: What configuration method is suitable to limit user control
>>>> over memory policy adjustments, besides the heavyweight seccomp
>>>> approach?

What makes seccomp "heavyweight"? The overhead? The infrastructure required?

>>> This really depends on the workloads. What is the reason mbind is used
>>> in the first place?
>> It can improve their performance.

How much? You've already demonstrated that using mbind can degrade their
performance.

>>
>>> Is it acceptable to partition the system so that
>>> there is a numa node reserved for NUMA aware workloads?
>> As highlighted in the commit log, our preference is to configure this
>> memory policy through kubelet using cpuset.mems in the cpuset
>> controller, rather than allowing individual users to set it
>> independently.
>>
>>> If not, have you
>>> considered (already proposed numa=off)?
>> The challenge at hand isn't solely about whether users should bind to
>> a memory node or the deployment of workloads. What we're genuinely
>> dealing with is the fact that users can bind to a specific node
>> without our explicit agreement or authorization.
> BTW, the same principle should also apply to sched_setaffinity(2).
> While there's already a security_task_setscheduler() in place, it's
> undeniable that we should also consider adding a
> security_set_mempolicy() for consistency.

"A foolish consistency is the hobgoblin of little minds"
- Ralph Waldo Emerson
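
For readers following the thread, the failure mode in the quoted Background section (a node-bound allocation failing while another node still has free memory) can be illustrated with a toy model. This is only a sketch under stated assumptions: the two 64 GiB nodes, the request sizes, and the `Node` and `allocate` helpers are invented for illustration and are not kernel interfaces, and real mbind(2) behaviour (reclaim, policy modes, the OOM killer's victim selection) is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One NUMA node with a fixed capacity in GiB (hypothetical numbers)."""
    capacity: int
    used: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.used


def allocate(nodes, size, bound_to=None):
    """Try to allocate `size` GiB from the toy nodes.

    With `bound_to` set (standing in for a strict, mbind-like binding),
    only that node may satisfy the request. Without it, the first node
    with enough room is used. Returns the index of the node used, or
    None when the request cannot be met, which stands in for an OOM
    event in this model.
    """
    candidates = [bound_to] if bound_to is not None else range(len(nodes))
    for idx in candidates:
        if nodes[idx].free >= size:
            nodes[idx].used += size
            return idx
    return None


nodes = [Node(capacity=64), Node(capacity=64)]

# Two tasks bind strictly to node 0 and fill it completely.
allocate(nodes, 40, bound_to=0)
allocate(nodes, 24, bound_to=0)

# A third request bound to node 0 fails even though node 1 is untouched.
result = allocate(nodes, 8, bound_to=0)
total_free = sum(n.free for n in nodes)
print(result, total_free)  # None 64

# The same request without a binding is satisfied from node 1.
unbound = allocate(nodes, 8)
print(unbound)  # 1
```

In this model the bound request fails with 64 GiB still free system-wide, which is the "OOM despite ample free memory" situation the Background describes. Whether that is a configuration problem or something needing a new hook is exactly what the thread is debating.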