From: Yafang Shao
Date: Mon, 2 Sep 2024 11:02:50 +0800
Subject: Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc allocations
To: Dave Chinner
Cc: Kent Overstreet, Michal Hocko, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Chinner
References: <20240828140638.3204253-1-kent.overstreet@linux.dev>

On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner wrote:
>
> On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote:
> > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner wrote:
> > >
> > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > > > according to call path and allocation size, you still need to be
> > > > checking for failure - in the same way that you shouldn't be using
> > > > BUG_ON() if you cannot prove that the condition won't occur in real world
> > > > usage.
> > >
> > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > > now. This was the default Irix kernel allocator behaviour (it had a
> > > forwards progress guarantee and would never fail allocation unless
> > > told it could do so). We've been using the same "guaranteed not to
> > > fail" semantics on Linux since the original port started 25 years
> > > ago via open-coded loops.
> > >
> > > IOWs, __GFP_NOFAIL semantics have been production tested for a
> > > couple of decades on Linux via XFS, and nobody here can argue that
> > > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> > > as it is used by XFS is reliable and lives up to the "will not fail"
> > > guarantee that it is supposed to have.
> > >
> > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing
> > >
> > >         do {
> > >                 p = kmalloc(size);
> > >         } while (!p);
> > >
> > > so that they blocked until memory allocation succeeded. The call
> > > sites do not check for failure, because -failure never occurs-.
> > >
> > > The MM devs want to have visibility of these allocations - they may
> > > not like them, but having __GFP_NOFAIL means it's trivial to audit
> > > all the allocations that use these semantics. IOWs, __GFP_NOFAIL
> > > was created with an explicit guarantee that it -will not fail- for
> > > normal allocation contexts so it could replace all the open-coded
> > > will-not-fail allocation loops.
> > >
> > > Given this guarantee, we recently removed these historic allocation
> > > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> > > allocation call sites. There's nearly a hundred memory allocation
> > > locations in XFS that are tagged with __GFP_NOFAIL.
> > >
> > > If we're now going to have the "will not fail" guarantee taken away
> > > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> > > it be used anywhere else that a "will not fail" guarantee is
> > > required.
> > >
> > > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > > can fail due to external scoped memory allocation contexts. This
> > > will force us to revert all __GFP_NOFAIL allocations back to
> > > open-coded will-not-fail loops.
> > >
> > > This is not a step forwards for anyone.
> >
> > Hello Dave,
> >
> > I've noticed that XFS has increasingly replaced kmem_alloc() with
> > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> > occurrences. In kmem_alloc(), there's an explicit
> > memalloc_retry_wait() to throttle the allocator under heavy memory
> > pressure, which aligns with your filesystem design. However, using
> > __GFP_NOFAIL removes this throttling mechanism, potentially causing
> > issues when the system is under heavy memory load. I'm concerned that
> > this shift might not be a beneficial trend.
>
> AIUI, the memory allocation looping has back-offs already built in
> to it when memory reserves are exhausted and/or reclaim is
> congested.
>
> e.g:
>
> get_page_from_freelist()
>   (zone below watermark)
>   node_reclaim()
>     __node_reclaim()
>       shrink_node()
>         reclaim_throttle()

It applies to all kinds of allocations.

>
> And the call to reclaim_throttle() will do the equivalent of
> memalloc_retry_wait() (a 2ms sleep).

I'm wondering if we should take special action for __GFP_NOFAIL, as
it currently results only in an endless loop with no intervention.

>
> > We have been using XFS for our big data servers for years, and it has
> > consistently performed well with older kernels like 4.19.y.
> > However, after upgrading all our servers from 4.19.y to 6.1.y over the
> > past two years, we have frequently encountered livelock issues caused
> > by memory exhaustion. To mitigate this, we've had to limit the RSS of
> > applications, which isn't an ideal solution and represents a worrying
> > trend.
>
> If userspace uses all of memory all the time, then the best the
> kernel can do is slowly limp along. Preventing userspace from
> overcommitting memory to the point of OOM is the only way to avoid
> these "userspace wants more memory than the machine physically
> has" sorts of issues. i.e. this is not a problem that the kernel
> code can solve short of randomly killing userspace applications...

We expect an OOM event, but it never occurs, which is a problem.

--
Regards
Yafang