From: Joanne Koong
Date: Wed, 30 Oct 2024 10:35:47 -0700
Subject: Re: [PATCH v2 2/2] fuse: remove tmp folio for writebacks and internal rb tree
To: Bernd Schubert
Cc: Jingbo Xu, Miklos Szeredi, Shakeel Butt, linux-fsdevel@vger.kernel.org,
 josef@toxicpanda.com, hannes@cmpxchg.org, linux-mm@kvack.org,
 kernel-team@meta.com
References: <20241014182228.1941246-1-joannelkoong@gmail.com>
 <3e4ff496-f2ed-42ef-9f1a-405f32aa1c8c@linux.alibaba.com>
 <0c3e6a4c-b04e-4af7-ae85-a69180d25744@fastmail.fm>
 <023c4bab-0eb6-45c5-9a42-d8fda0abec02@fastmail.fm>

On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert wrote:
>
>
>
> On 10/30/24 18:02, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert wrote:
> >>
> >> On 10/30/24 17:04, Joanne Koong wrote:
> >>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert wrote:
> >>>>
> >>>> On 10/28/24 22:58, Joanne Koong wrote:
> >>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong wrote:
> >>>>>>
> >>>>>>> Same here, I need to
> >>>>>>> look some more into the compaction / page
> >>>>>>> migration paths. I'm planning to do this early next week and will
> >>>>>>> report back with what I find.
> >>>>>>>
> >>>>>>
> >>>>>> These are my notes so far:
> >>>>>>
> >>>>>> * We hit the folio_wait_writeback() path when callers call
> >>>>>> migrate_pages() with mode MIGRATE_SYNC
> >>>>>> ... -> migrate_pages() -> migrate_pages_sync() ->
> >>>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
> >>>>>> folio_wait_writeback()
> >>>>>>
> >>>>>> * These are the places where we call migrate_pages():
> >>>>>> 1) demote_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 2) __damon_pa_migrate_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 3) migrate_misplaced_folio()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 4) do_move_pages_to_node()
> >>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >>>>>> this path is only invoked by the move_pages() syscall. It's fine to
> >>>>>> wait on writeback for the move_pages() syscall since the user would
> >>>>>> have to deliberately invoke this on the fuse server for this to apply
> >>>>>> to the server's fuse folios
> >>>>>>
> >>>>>> 5) migrate_to_node()
> >>>>>> Can ignore this for the same reason as in 4. This path is only invoked
> >>>>>> by the migrate_pages() syscall.
> >>>>>>
> >>>>>> 6) do_mbind()
> >>>>>> Can ignore this for the same reason as 4 and 5. This path is only
> >>>>>> invoked by the mbind() syscall.
> >>>>>>
> >>>>>> 7) soft_offline_in_use_page()
> >>>>>> Can skip soft offlining fuse folios (eg folios with the
> >>>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
> >>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >>>>>> -> migrate_pages().
> >>>>>> soft_offline_page() only invokes this for in-use
> >>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >>>>>> My understanding of soft offlining pages is that it's a mitigation
> >>>>>> strategy for handling pages that are experiencing errors but are not
> >>>>>> yet completely unusable, and its main purpose is to prevent future
> >>>>>> issues. It seems fine to skip this for fuse folios.
> >>>>>>
> >>>>>> 8) do_migrate_range()
> >>>>>> 9) compact_zone()
> >>>>>> 10) migrate_longterm_unpinnable_folios()
> >>>>>> 11) __alloc_contig_migrate_range()
> >>>>>>
> >>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
> >>>>>> way around these tbh. I think we have to operate under the assumption
> >>>>>> that the fuse server running is malicious or benevolently but
> >>>>>> incorrectly written and could possibly never complete writeback. So we
> >>>>>> definitely can't wait on these but it also doesn't seem like we can
> >>>>>> skip waiting on these, especially for the case where the server uses
> >>>>>> spliced pages, nor does it seem like we can just fail these with
> >>>>>> -EBUSY or something.
> >>>>
> >>>> I see some code paths with -EAGAIN in migration. Could you explain why
> >>>> we can't just fail migration for fuse write-back pages?
> >>>>
> >>
> >> Hi Joanne,
> >>
> >> thanks a lot for your quick reply (especially as my reviews come in very
> >> late).
> >>
> >
> > Thanks for your comments/reviews, Bernd! I always appreciate them.
> >
> >>>
> >>> My understanding (and please correct me here Shakeel if I'm wrong) is
> >>> that this could block system optimizations, especially since if an
> >>> unprivileged malicious fuse server never replies to the writeback
> >>> request, then this completely stalls progress.
> >>> In the best case
> >>> scenario, -EAGAIN could be used because the server might just be slow
> >>> in serving the writeback, but I think we need to also account for
> >>> servers that never complete the writeback. For
> >>> __alloc_contig_migrate_range() for example, my understanding is that
> >>> this is used to migrate pages so that there are more physically
> >>> contiguous ranges of memory freed up. If fuse writeback blocks that,
> >>> then that hurts system health overall.
> >>
> >> Hmm, I wonder what is worse - tmp page copies or missing compaction.
> >> Especially if we expect a low range of in-writeback pages/folios.
> >> One could argue that an evil user might spawn many fuse server
> >> processes to work around the default low fuse write-back limits, but
> >> does that make any difference with tmp pages? And these cannot be
> >> compacted either?
> >
> > My understanding (and Shakeel please jump in here if this isn't right)
> > is that tmp pages can be migrated/compacted. I think it's only pages
> > marked as under writeback that are considered to be non-movable.
> >
> >>
> >> And with timeouts that would be so far totally uncritical, I
> >> think.
> >>
> >>
> >> You also mentioned
> >>
> >>> especially for the case where the server uses spliced pages
> >>
> >> could you provide more details for that?
> >>
> >
> > For the page migration / compaction paths, I don't think we can do the
> > workaround we could do for sync where we skip waiting on writeback for
> > fuse folios and continue on with the operation, because the migration
> > / compaction paths operate on the pages. For the splice case, we
> > assign the page to the pipebuffer (fuse_ref_page()), so if the
> > migration/compaction happens on the page before the server has read
> > this page from the pipebuffer, it'll be incorrect data or maybe crash
> > the kernel.
> >
> >>
> >>
> >>>
> >>>>>>
> >>>>>
> >>>>> I'm still not seeing a good way around this.
> >>>>>
> >>>>> What about this then?
> >>>>> We add a new fuse sysctl called something like
> >>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> >>>>> admin sets this, then it opts into optimizing writeback to be as fast
> >>>>> as possible (eg skipping the page copies) and if the server doesn't
> >>>>> fulfill the writeback by the set timeout value, then the connection is
> >>>>> aborted.
> >>>>>
> >>>>> Alternatively, we could also repurpose
> >>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> >>>>> patchset [1] but I like the additional flexibility and explicitness
> >>>>> having the "writeback_optimization_timeout" sysctl gives.
> >>>>>
> >>>>> Any thoughts on this?
> >>>>
> >>>>
> >>>> I'm a bit worried that we might lock up the system until time out is
> >>>> reached - not ideal. Especially as timeouts are in minutes now. But
> >>>> even a slightly stuttering video system not be great. I think we
> >>>> should give users/admin the choice then, if they prefer slow page
> >>>> copies or fast, but possibly shortly unresponsive system.
> >>>>
> >>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> >>> would be in seconds, where the sys admin would probably set something
> >>> more reasonable like 5 seconds or so.
> >>> If this syctl value is set, then servers who want writebacks to be
> >>> fast can opt into it at mount time (and by doing so agree that they
> >>> will service writeback requests by the timeout or their connection
> >>> will be aborted).
> >>
> >>
> >> I think your current patch set has it in minutes? (Should be easy
> >> enough to change that.) Though I'm more worried about the impact
> >> of _frequent_ timeout scanning through the different fuse lists
> >> on performance, than about missing compaction for folios that are
> >> currently in write-back.
>
> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> I.e. I don't understand what the difference between tmp page and
> write-back page for migration.
>

That's a great question! I have no idea how compaction works for pages
being used in splice.

Shakeel, do you know the answer to this?


Thanks,
Joanne

>
> >>
> >
> > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
> > would be a separate thing from the
> > "/proc/sys/fs/fuse/max_request_timeout". The
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
> > for writeback requests. I was thinking implementation-wise, for
> > writebacks we could just have a timer associated with each request
> > (instead of having to grab locks with the fuse lists), since they
> > won't be super common.
>
> Ah, thank you! I had missed that this is another variable. Issue
> with too short timeouts would probably be network hick-up that
> would immediately kill fuse-server. I.e. if it just the missing
> page compaction/migration, maybe larger time outs would be
> acceptable.
>
>
> Thanks,
> Bernd