From: Joanne Koong
Date: Wed, 30 Oct 2024 09:04:59 -0700
Subject: Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
To: Bernd Schubert
Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel@vger.kernel.org,
    josef@toxicpanda.com, hannes@cmpxchg.org, linux-mm@kvack.org,
    kernel-team@meta.com
References: <20241014182228.1941246-1-joannelkoong@gmail.com>
    <3e4ff496-f2ed-42ef-9f1a-405f32aa1c8c@linux.alibaba.com>
    <0c3e6a4c-b04e-4af7-ae85-a69180d25744@fastmail.fm>
In-Reply-To: <0c3e6a4c-b04e-4af7-ae85-a69180d25744@fastmail.fm>

On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert wrote:
>
> On 10/28/24 22:58, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong wrote:
> >>
> >>> Same here, I need to look some more into the compaction / page
> >>> migration paths. I'm planning to do this early next week and will
> >>> report back with what I find.
> >>>
> >>
> >> These are my notes so far:
> >>
> >> * We hit the folio_wait_writeback() path when callers call
> >> migrate_pages() with mode MIGRATE_SYNC
> >> ... -> migrate_pages() -> migrate_pages_sync() ->
> >> migrate_pages_batch() -> migrate_folio_unmap() ->
> >> folio_wait_writeback()
> >>
> >> * These are the places where we call migrate_pages():
> >> 1) demote_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 2) __damon_pa_migrate_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 3) migrate_misplaced_folio()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 4) do_move_pages_to_node()
> >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >> this path is only invoked by the move_pages() syscall. It's fine to
> >> wait on writeback for the move_pages() syscall since the user would
> >> have to deliberately invoke this on the fuse server for this to apply
> >> to the server's fuse folios
> >>
> >> 5) migrate_to_node()
> >> Can ignore this for the same reason as in 4. This path is only invoked
> >> by the migrate_pages() syscall.
> >>
> >> 6) do_mbind()
> >> Can ignore this for the same reason as 4 and 5. This path is only
> >> invoked by the mbind() syscall.
> >>
> >> 7) soft_offline_in_use_page()
> >> Can skip soft offlining fuse folios (eg folios with the
> >> AS_NO_WRITEBACK_WAIT mapping flag set).
> >> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >> My understanding of soft offlining pages is that it's a mitigation
> >> strategy for handling pages that are experiencing errors but are not
> >> yet completely unusable, and its main purpose is to prevent future
> >> issues. It seems fine to skip this for fuse folios.
> >>
> >> 8) do_migrate_range()
> >> 9) compact_zone()
> >> 10) migrate_longterm_unpinnable_folios()
> >> 11) __alloc_contig_migrate_range()
> >>
> >> 8 to 11 need more investigation / thinking about. I don't see a good
> >> way around these tbh. I think we have to operate under the assumption
> >> that the fuse server running is malicious or benevolently but
> >> incorrectly written and could possibly never complete writeback. So we
> >> definitely can't wait on these but it also doesn't seem like we can
> >> skip waiting on these, especially for the case where the server uses
> >> spliced pages, nor does it seem like we can just fail these with
> >> -EBUSY or something.
>
> I see some code paths with -EAGAIN in migration. Could you explain why
> we can't just fail migration for fuse write-back pages?
>

My understanding (and please correct me here Shakeel if I'm wrong) is
that failing migration could block system optimizations: if an
unprivileged malicious fuse server never replies to the writeback
request, then migration of those pages never makes progress. In the
best case scenario, -EAGAIN could be used because the server might
just be slow in serving the writeback, but I think we also need to
account for servers that never complete the writeback. For
__alloc_contig_migrate_range() for example, my understanding is that
this is used to migrate pages so that more physically contiguous
ranges of memory are freed up. If fuse writeback blocks that, then
that hurts system health overall.

> >>
> >
> > I'm still not seeing a good way around this.
> >
> > What about this then? We add a new fuse sysctl called something like
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> > admin sets this, then it opts into optimizing writeback to be as fast
> > as possible (eg skipping the page copies) and if the server doesn't
> > fulfill the writeback by the set timeout value, then the connection is
> > aborted.
> >
> > Alternatively, we could also repurpose
> > /proc/sys/fs/fuse/max_request_timeout from the request timeout
> > patchset [1] but I like the additional flexibility and explicitness
> > having the "writeback_optimization_timeout" sysctl gives.
> >
> > Any thoughts on this?
> >
>
> I'm a bit worried that we might lock up the system until time out is
> reached - not ideal. Especially as timeouts are in minutes now. But
> even a slightly stuttering video system would not be great. I think we
> should give users/admin the choice then, if they prefer slow page
> copies or fast, but possibly shortly unresponsive system.
>

I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
would be in seconds, where the sys admin would probably set something
more reasonable like 5 seconds or so. If this sysctl value is set,
then servers that want writebacks to be fast can opt into it at mount
time (and by doing so agree that they will service writeback requests
within the timeout or their connection will be aborted).

Thanks,
Joanne

>
> Thanks,
> Bernd
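
For illustration only, the opt-in timeout policy discussed above could be
sketched in userspace C roughly as below. This is not code from the patchset
or from fs/fuse: every name here (struct wb_policy, struct wb_conn,
struct wb_request, wb_fast_path_enabled(), wb_request_expired(),
wb_conn_check()) is hypothetical, and it assumes the sysctl is expressed in
seconds with 0 meaning "not set".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed sysctl value; 0 means "not set". */
struct wb_policy {
	uint64_t timeout_secs;
};

/* Hypothetical per-connection state. */
struct wb_conn {
	bool opted_in;	/* server asked for fast writeback at mount time */
	bool aborted;	/* connection was torn down after a timeout */
};

/* Hypothetical in-flight writeback request. */
struct wb_request {
	uint64_t start_secs;	/* when the request was sent to the server */
	bool completed;		/* server has replied */
};

/* Fast writeback applies only if the admin set the timeout AND the
 * server opted in, and the connection is still alive. */
static bool wb_fast_path_enabled(const struct wb_policy *p,
				 const struct wb_conn *c)
{
	return p->timeout_secs != 0 && c->opted_in && !c->aborted;
}

/* A request is expired once it has been outstanding for at least the
 * configured timeout. Completed requests never expire. */
static bool wb_request_expired(const struct wb_policy *p,
			       const struct wb_request *r,
			       uint64_t now_secs)
{
	if (p->timeout_secs == 0 || r->completed)
		return false;
	if (now_secs < r->start_secs)	/* guard against unsigned wraparound */
		return false;
	return now_secs - r->start_secs >= p->timeout_secs;
}

/* Abort the connection if any outstanding writeback request has expired.
 * Returns true if this call aborted the connection. */
static bool wb_conn_check(const struct wb_policy *p, struct wb_conn *c,
			  const struct wb_request *reqs, size_t n,
			  uint64_t now_secs)
{
	if (!wb_fast_path_enabled(p, c))
		return false;

	for (size_t i = 0; i < n; i++) {
		if (wb_request_expired(p, &reqs[i], now_secs)) {
			c->aborted = true;
			return true;
		}
	}
	return false;
}
```

With a 5 second timeout, a request outstanding for 4 seconds leaves the
connection alone, while one outstanding for 5 seconds or more aborts it. A
server that did not opt in is never checked, since it would stay on the
slower page-copy path.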