From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D234C5320E for ; Sun, 18 Aug 2024 07:25:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BF7726B044A; Sun, 18 Aug 2024 03:25:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B80816B044B; Sun, 18 Aug 2024 03:25:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9D2F96B044C; Sun, 18 Aug 2024 03:25:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 7CEB16B044A for ; Sun, 18 Aug 2024 03:25:55 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 02325160A9C for ; Sun, 18 Aug 2024 07:25:54 +0000 (UTC) X-FDA: 82464531870.16.F52BE36 Received: from mail-vs1-f41.google.com (mail-vs1-f41.google.com [209.85.217.41]) by imf24.hostedemail.com (Postfix) with ESMTP id 30E2418001A for ; Sun, 18 Aug 2024 07:25:53 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Hffsq6Ud; spf=pass (imf24.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.217.41 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723965852; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=SXtP4/WEv92yt6xLcDqe/+8SUllrIBXiXRv3KjbK2ZE=; b=y9uTnTU2jFj3Z/l3tswD8SbpvpOYGNJuUoKaqldhaywrN1lb/BjlW/ajnkYOWOurNAxE8T o6+ou/IdXOtRYlEnYq6+Toj2VdrpAc5eZ1g9ydvELGPu1dBUy/HEIQP/1CD8CiN3gRUng7 Z61mdWELgOSuJ0pbUKDSmq4RncZTyeo= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Hffsq6Ud; spf=pass (imf24.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.217.41 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723965852; a=rsa-sha256; cv=none; b=wHXMMoeMKyQwr3nJwaxoJxtSdHrz3a5VYJALbJzE1Ga942/Utpe5bzHrHKXkwIMqofWpz4 a3LKf7OfXKEY7fVnjy3AOJ1vs6TxRxNu3lCn5DDAX3HnXfg9FKD2mhvh0tDqXWIp404w/K zjn7waEcjHBUwXCkTGI8wwV7I2yNMkw= Received: by mail-vs1-f41.google.com with SMTP id ada2fe7eead31-49299adf8adso1245139137.2 for ; Sun, 18 Aug 2024 00:25:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1723965952; x=1724570752; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=SXtP4/WEv92yt6xLcDqe/+8SUllrIBXiXRv3KjbK2ZE=; b=Hffsq6UdEpH4inb5Fzif82duwfeXH11NSGUwzyafnsVhTmPhqYojFMy0IvkiHI3CZo Z/SgW1LUp5mV2bIb3MISWhjiDa/7RVpkSMWwcgSIkD30+mFUrYjFcX5ttPHG5sSNjTVz Xtwjq3AQLWbijza+GqKM+jlj2Z3QXP9/i9/8BRxAzMQyIQ1GL6eB+kvoQ53z+r2N93ez KXuyKb8a9ZrbaY/pgZdsvBbLGB0jRzgOnor+2WBo+KmMNd/ShERG6nbDe67nFgo8Z+Io abxplKg8sExX09MB4z7uwt9PFOuckw0JBbcSVmxJZaYfp1369aVMqyQEcOziml3yILaS +f3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723965952; x=1724570752; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=SXtP4/WEv92yt6xLcDqe/+8SUllrIBXiXRv3KjbK2ZE=; b=e9HeJEw5718MQ8f0SQIdvvDm6TwSgul9Qfna3QAz/fpdIoSiHUEDWwsC9sQudgidG0 iX4mxusgD+DjBYEXyjcuijR8zoWIUU2GGviP92YywbgP3rPfWR/ytFvm3kjohIQndE4B TI9k6iZc7C1zUQmxGi7kIprsi3DKdNcB2KAQEI5jiNU/pRYHBLeVqdLMW68Jf2vKb72Y LcLlE66q50GXfPA6qmmjQtfQPJz2z7jYoK0CazZ/n6IkYAGsUuk4qjtjATje9XQQcWXx XCrKPw5CvH2WrfDtRv0VkcSIsZamyBgjrpCUKRtJEWX14RgYUns3X1DEvLq6x9O3RPuC uOdw== X-Forwarded-Encrypted: i=1; AJvYcCWONcAk9PI+kZS/0sxfbaVQ4p7xdNTiZGTJDIxZ0/ymbaPt5juoVBlC5mOkhN1h33skDypU4fbIwjSYCgz0AU3XiGU= X-Gm-Message-State: AOJu0YyiLFNNAVkXXBFJNwljGPUmX1dfcZKBQKajL2ThDsdVUdnq5Fwf NNYwb/3DpRm+Ed6ahYnegnAhhfC2dWgG5bvO+z81S0WGDIXcf6MutQfYpOxTb/GFI1ln7+fWfrY zb2VKGJrQQH+/hYvI982S9RE5atOsy50M X-Google-Smtp-Source: AGHT+IEfLK7kifIa+xsC79U/+rehka1BNayB+GioKQk7e69mcx/NsQvfkvbjBPjAC0Su+YovhxbLKk7WOQvGnWTGw3g= X-Received: by 2002:a05:6102:32d1:b0:48f:e4b0:81b4 with SMTP id ada2fe7eead31-497798c3793mr9946848137.7.1723965952235; Sun, 18 Aug 2024 00:25:52 -0700 (PDT) MIME-Version: 1.0 References: <20240817062449.21164-1-21cnbao@gmail.com> <20240817062449.21164-5-21cnbao@gmail.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Sun, 18 Aug 2024 19:25:40 +1200 Message-ID: Subject: Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL To: Yafang Shao Cc: akpm@linux-foundation.org, linux-mm@kvack.org, 42.hyeyoo@gmail.com, cl@linux.com, hailong.liu@oppo.com, hch@infradead.org, iamjoonsoo.kim@lge.com, mhocko@suse.com, penberg@kernel.org, rientjes@google.com, roman.gushchin@linux.dev, torvalds@linux-foundation.org, urezki@gmail.com, v-songbaohua@oppo.com, vbabka@suse.cz, virtualization@lists.linux.dev, Lorenzo Stoakes , Kees Cook , =?UTF-8?Q?Eugenio_P=C3=A9rez?= , Jason Wang , Maxime Coquelin , "Michael S. Tsirkin" , Xuan Zhuo Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 30E2418001A X-Stat-Signature: 8q9hmr14xwnfhfspaojnhdzecrjj1d7u X-Rspam-User: X-HE-Tag: 1723965953-276008 X-HE-Meta: U2FsdGVkX18PzSTlsh4PugThZ6HgGLUlrGfNmokv4WnDk86AuwcgEq4doFFENy4SeD4EJLBn3qDv+72dq/GIgSSoZ7sEyrdp9Qp9V7pWdggVprgTMrfIaAj1mVNRmwSn9SiPxsDUdl6tjeVOY8ZzenzEzy5HVNQ218yTjMiaxNHdsVPNxGiVfx6KURFcbYU1YJ27X3gl56LSlfDquu1MUvTY436RS1F2/uTcPKWZqOqPwAk7tJyLV/NoGcKURohasAWDVExg9G9WDE1WQZk9PBLGwUBCZBzhNtz2yMyE4FlB12M9fv/qT5OSYepy+aFUCnAVmTDN0/8nLfRNWMyzv7U8mn6CEvC6UwLE9HjF6lpsle53M/dwdgYGYC5YZvbf3NB13lxnx3Vf3tpROf9cGcfyuEKBmv3MU89YQ9IGl7yLbJ4P+DOOJ3X/CNEnIqJFpjjLu6NpzX9SZv7NexikWXGt7/RqG4gDawvbMkpE5YuOxRN61ovvrXmwzl60LNqn4nDDm0qiR8SmI9fuGlMd4gmbIjHt7tqxgT+Wo+Mbq/M9qpjCl1+fH7bjpTqW0D5A0hGvGw3BVTAGjPEiEWfaLc4a3PYChiLSorubuAqQH3M22e/BV6ujEJyb1q8Q3An5CY0+vrNPrNenZZ9eXwiUk03eQ2DB6agB2kpluOMo50SqxKq6oceDAhK52kkVkUvBdumGa6WX4I6hNH9yQixDFNY8tN4kRKbzt6/jibYft5V3q4oDODUCanbdv6Rmrv3CK/I1MvMAxRaDoYXrbAPpAmjEEwW94Im2DY2S7D+iyJYf51Kv8PSsfnc2PsK17hRmvA9Batbeho/w/8OXMdegnrgI8of9StAbf8x8mVPnButPoQTJ/trfWjsAoEhCfPDvwMwPp/CfdSDu+bezUNgz0pafV8vpovshjmMThZUQN/YwksN+EcdDWcsjzg4wQEEr3EAHEOzb8lWdShnexDz J354JwV6 SM3IBiRZ79aAC5wbWgPyuEV9+QcmAPzUjMvPm9iJYno8zcWqliyF38jCHzMKPOOIUndO8/iF09emHkFivxJflHQ/8MZdC31P8VrQDwM9MRiZ3yPTA6NLR56i3ERSXCW3DPmqUI0ynushUAZIK9fOdQwWI3GvthtpB1bPhdzbM6qZzN5BfJGA72dkFKldHofRVLYPvyQwMlidqgm+fkqkLwP5QKmK5FRuAE1N3OK/LN8Ki0WhayQU5ILFnO6BVjURbkkz0SVfFX9WigTgBsxjvgf1uZMV4k8T4NFQ8tQRQhS48TGDxv/FM1k/mjTKlimMHhAPammOmaUM5Z/3IHvI/S0TsdzBi+v4HAaWIOGilPHbLpqekwmbY8m/4ENsHFDFcA0wuN1U68PdrB8Kq3XM5jZZgAKKL/cmd/9iwPUf7snrbf3pSyobXT3a8jfPwSbpWYzon9iiYCXpGGs+kaBoa24xVqkHTp7VmPx+jxJdis4A9EfEZFl8TZfmdKKNENbpH3M8Sc6IcYU93Hs/LPUxizEeZad2V9lp2vL8wRaEy1SR3vm/YlrHDoBwxgrDWOd4xCWzR63z13oEHld3TUVyi2s6XlJG1gZRt+SMVLuWRTCW10pZ8+xfmj5RF6a4hXR6j8sIgwYThougbbkSzrV4PG39oEBqTojJZTGOvg8DuFQ9fRvJNePjev1PNqjqSfKwSjac7eI4oDSBsqMfyZLVLRcC5ehyPsxmJxfoAd4WG2Gkgii6QyKdcMXKGj1OWlDAHHWNCyKZ+ovN4L/7Mtarq5/JhTM4aQ0mTVTqHuanX613RPQ+hson/3GWWue9yxzeMTNdEPSjQyvtYEASBMhhP/VwIHB+rBYTJvSmr9IylqlvxfzAvKT2Kcx/E34xkY6QQ7grV5SBFWPqeluymcQKadAIwGdGzrkaVDIJAj54VxzgVzVqP6XmvCI600Yslm2f0WxDMbkX3s6pqUsatbDK850b7PMhA 9uz9EYHB 2g+d5dAt/KzkoHW4ny7kGg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sun, Aug 18, 2024 at 7:07=E2=80=AFPM Yafang Shao = wrote: > > On Sun, Aug 18, 2024 at 2:45=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr= ote: > > > > On Sun, Aug 18, 2024 at 6:27=E2=80=AFPM Barry Song <21cnbao@gmail.com> = wrote: > > > > > > On Sun, Aug 18, 2024 at 5:51=E2=80=AFPM Yafang Shao wrote: > > > > > > > > On Sun, Aug 18, 2024 at 11:48=E2=80=AFAM Barry Song <21cnbao@gmail.= com> wrote: > > > > > > > > > > On Sun, Aug 18, 2024 at 2:55=E2=80=AFPM Yafang Shao wrote: > > > > > > > > > > > > On Sat, Aug 17, 2024 at 2:25=E2=80=AFPM Barry Song <21cnbao@gma= il.com> wrote: > > > > > > > > > > > > > > From: Barry Song > > > > > > > > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they m= ight > > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. Th= is kind of > > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.= If we > > > > > > > attempt and still fail to allocate memory for these users, we= have two > > > > > > > choices: > > > > > > > > > > > > > > 1. We could busy-loop and hope that some other direct rec= lamation or > > > > > > > kswapd rescues the current process. However, this is unre= liable > > > > > > > and could ultimately lead to hard or soft lockups, > > > > > > > > > > > > That can occur even if we set both __GFP_NOFAIL and > > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is r= elated > > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flaw= ed > > > > > > design of __GFP_NOFAIL itself. > > > > > > > > > > the point of GFP_NOFAIL is that it won't fail and its user won't = check > > > > > the return value. without direct_reclamation, it is sometimes imp= ossible. > > > > > but with direct reclamation, users constantly wait and finally th= ey can > > > > > > > > So, what exactly is the difference between 'constantly waiting' and > > > > 'busy looping'? Could you please clarify? Also, why can't we > > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set? > > > > > > I list two options in changelog > > > 1: busy loop 2. bug_on. I am actually fine with either one. either on= e is > > > better than the existing code. but returning null in the current code > > > is definitely wrong. > > > > > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclama= tion > > > legal. so it is a bit suspicious going in the wrong direction. > > > > > > busy-loop is that you are not reclaiming memory you are not sleeping. > > > cpu is constantly working and busy, so it might result in a lockup, e= ither > > > soft lockup or hard lockup. > > Thanks for the clarification. > That can be avoided by a simple cond_resched() if the hard lockup or > softlockup is the main issue ;) Is your point to legitimize the combination of gfp_nofail and non-lockable? I don't think you can simply fix the lockup. A direct example is single-core, and preemptible kernel. you are calling __gfp_nofail without direct reclamation in a spinlock, cond_resched() will do nothing. it doesn't have to be a single core, for example on phones, cpu hotplug is used to save power. Sometimes, phones unplug lots of cpus to save power. and even we are multiple cores, we have cpuset cgroups things which can prevent async oom from running on other cores. > > > > > > > with direct_reclamation, wait is the case you can sleep. it is not ho= lding > > > cpu, not a busy loop. in rare case, users might end in endless wait, > > > but it matches the doc of __GFP_NOFAIL, never return till memory > > > is gotten (the current code is implemented in this way unless users > > > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.) > > > > > > > and the essential difference between "w/ and w/o direct_reclaim": with > > direct reclaim, the user is actively reclaiming memory to rescue itself > > by all kinds of possible ways(compact, oom, reclamation), while without > > direct reclamation, it can do nothing and just loop (busy-loop). > > It can wake up kswapd, which can then reclaim memory. If kswapd can't > keep up, the system is likely under heavy memory pressure. In such a > case, it makes little difference whether __GFP_DIRECT_RECLAIM is set > or not. For reference, see the old issue: > https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com= /. > > I believe the core issue persists, and the design of __GFP_NOFAIL > exacerbates it. > > By the way, I believe we could trigger an asynchronous OOM kill in the > case without direct reclaim to avoid busy looping. This async oom is sometimes impossible with the same reason as the above example. > > > > > > note, long-term we won't expose __GFP_NOFAIL any more. we > > > will only expose GFP_NOFAIL which enforces Blockable. I am > > > quite busy on other issues, so this won't happen in a short time. > > > > > > > > > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it. > > > > > it is absolutely clearly documented. > > > > > > > > > > > > > > > > > > which might not > > > > > > > be well supported by some architectures. > > > > > > > > > > > > > > 2. We could use BUG_ON to trigger a reliable system crash= , avoiding > > > > > > > exposing NULL dereference. > > > > > > > > > > > > > > Neither option is ideal, but both are improvements over the e= xisting code. > > > > > > > This patch selects the second option because, with the introd= uction of > > > > > > > scoped API and GFP_NOFAIL=E2=80=94capable of enforcing direct= reclamation for > > > > > > > nofail users(which is in my plan), non-blockable nofail alloc= ations will > > > > > > > no longer be possible. > > > > > > > > > > > > > > Signed-off-by: Barry Song > > > > > > > Cc: Michal Hocko > > > > > > > Cc: Uladzislau Rezki (Sony) > > > > > > > Cc: Christoph Hellwig > > > > > > > Cc: Lorenzo Stoakes > > > > > > > Cc: Christoph Lameter > > > > > > > Cc: Pekka Enberg > > > > > > > Cc: David Rientjes > > > > > > > Cc: Joonsoo Kim > > > > > > > Cc: Vlastimil Babka > > > > > > > Cc: Roman Gushchin > > > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > > > > > > Cc: Linus Torvalds > > > > > > > Cc: Kees Cook > > > > > > > Cc: "Eugenio P=C3=A9rez" > > > > > > > Cc: Hailong.Liu > > > > > > > Cc: Jason Wang > > > > > > > Cc: Maxime Coquelin > > > > > > > Cc: "Michael S. Tsirkin" > > > > > > > Cc: Xuan Zhuo > > > > > > > --- > > > > > > > mm/page_alloc.c | 10 +++++----- > > > > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644 > > > > > > > --- a/mm/page_alloc.c > > > > > > > +++ b/mm/page_alloc.c > > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask= , unsigned int order, > > > > > > > */ > > > > > > > if (gfp_mask & __GFP_NOFAIL) { > > > > > > > /* > > > > > > > - * All existing users of the __GFP_NOFAIL are= blockable, so warn > > > > > > > - * of any new users that actually require GFP= _NOWAIT > > > > > > > + * All existing users of the __GFP_NOFAIL are= blockable > > > > > > > + * otherwise we introduce a busy loop with in= side the page > > > > > > > + * allocator from non-sleepable contexts > > > > > > > */ > > > > > > > - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp= _mask)) > > > > > > > - goto fail; > > > > > > > + BUG_ON(!can_direct_reclaim); > > > > > > > > > > > > I'm not in favor of using BUG_ON() here, as many call sites alr= eady > > > > > > handle the return value of __GFP_NOFAIL. > > > > > > > > > > > > > > > > it is not correct to handle the return value of __GFP_NOFAIL. > > > > > if you check the ret, don't use __GFP_NOFAIL. > > > > > > > > If so, you have many code changes to make in the linux kernel ;) > > > > > > > > > > Please list those code using __GFP_NOFAIL and check the result > > > might fail, we should get them fixed. This is insane. NOFAIL means > > > no fail. > > You can find some instances with grep commands, but there's no > reliable way to capture them all with a single command. Here are a few > examples: > > // drivers/infiniband/hw/cxgb4/mem.c > skb =3D alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL); > if (!skb) > return -ENOMEM; > Thanks for pointing out this wrong code. This is absolutely wrong. with GFP_KERNEL | __GFP_NOFAIL, mm-core is doing endless retry till direct reclamation gives memory to itself. there is no possibility to return NULL. the only possibility is __GFP_NOFAIL | GFP_ATOMIC(NOWAIT), mm-core is returning NULL incorrectly. actually mm-core just ignores this __GFP_NOFAIL in this case. with GFP_FAIL enforcing direct_reclamation done, the wrong case __GFP_NOFAIL | GFP_ATOMIC(NOWAIT) will no longer exist. > // fs/xfs/libxfs/xfs_dir2.c > args =3D kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL); > if (!args) > return -ENOMEM; > > > -- > Regards > Yafang