Date: Thu, 3 Apr 2025 11:35:23 +0200
From: Christian Brauner
To: Bernd Schubert
Cc: David Hildenbrand, Jingbo Xu, Joanne Koong, miklos@szeredi.hu,
 linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev,
 josef@toxicpanda.com, linux-mm@kvack.org, kernel-team@meta.com,
 Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko, Keith Busch
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under
 writeback with AS_WRITEBACK_INDETERMINATE mappings
Message-ID: <20250403-option-holztisch-de5d88079f59@brauner>
References: <20241122232359.429647-1-joannelkoong@gmail.com>
 <20241122232359.429647-5-joannelkoong@gmail.com>
 <1036199a-3145-464b-8bbb-13726be86f46@linux.alibaba.com>
 <1577c4be-c6ee-4bc6-bb9c-d0a6d553b156@redhat.com>

On Thu, Apr 03, 2025 at 11:25:17AM +0200, Bernd Schubert wrote:
>
>
> On 4/3/25 11:18, David Hildenbrand wrote:
> > On 03.04.25 05:31, Jingbo Xu wrote:
> >>
> >>
> >> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand
> >>> wrote:
> >>>>
> >>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the
> >>>>> folio if
> >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag
> >>>>> set on its
> >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the
> >>>>> mapping, the
> >>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>> waits may get stuck.
> >>>>>
> >>>>> Signed-off-by: Joanne Koong
> >>>>> Reviewed-by: Shakeel Butt
> >>>>> ---
> >>>>>    mm/migrate.c | 5 ++++-
> >>>>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>>> index df91248755e4..fe73284e5246 100644
> >>>>> --- a/mm/migrate.c
> >>>>> +++ b/mm/migrate.c
> >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
> >>>>>                 */
> >>>>>                switch (mode) {
> >>>>>                case MIGRATE_SYNC:
> >>>>> -                     break;
> >>>>> +                     if (!src->mapping ||
> >>>>> +                         !mapping_writeback_indeterminate(src->mapping))
> >>>>> +                             break;
> >>>>> +                     fallthrough;
> >>>>>                default:
> >>>>>                        rc = -EBUSY;
> >>>>>                        goto out;
> >>>>
> >>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>> poisoning... ?!
> >>>>
> >>>> That sounds very bad.
> >>>
> >>> I took a closer look at the migration code and the FUSE code. In the
> >>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >>> mode folio lock holds will block migration until that folio is
> >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>
> >>>          if (!folio_trylock(src)) {
> >>>                  if (mode == MIGRATE_ASYNC)
> >>>                          goto out;
> >>>
> >>>                  if (current->flags & PF_MEMALLOC)
> >>>                          goto out;
> >>>
> >>>                  if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>                          goto out;
> >>>
> >>>                  folio_lock(src);
> >>>          }
> >>>
> >
> > Right, I raised that also in my LSF/MM talk: waiting for readahead
> > currently implies waiting for the folio lock (there is no separate
> > readahead flag like there would be for writeback).
> >
> > The more I look into this and fuse, the more I realize that what fuse
> > does is just completely broken right now.
> >
> >>> If this is all that is needed for a malicious FUSE server to block
> >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>> mappings are skipped in migration. A malicious server has easier and
> >>> more powerful ways of blocking migration in FUSE than trying to do it
> >>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>> even get far enough to hit writeback - a write triggers
> >>> aops->write_begin() and a malicious server would deliberately hang
> >>> forever while the folio is locked in write_begin().
> >>
> >> Indeed it seems possible. A malicious FUSE server may already be
> >> capable of blocking the synchronous migration in this way.
> >
> > Yes, I think the conclusion is that we should advise people against
> > using unprivileged FUSE if they care about any features that rely on
> > page migration or page reclaim.
> >
> >>
> >>
> >>>
> >>> I looked into whether we could eradicate all the places in FUSE where
> >>> we may hold the folio lock for an indeterminate amount of time,
> >>> because if that is possible, then we should not add this writeback way
> >>> for a malicious fuse server to affect migration. But I don't think we
> >>> can, for example taking one case, the folio lock needs to be held as
> >>> we read in the folio from the server when servicing page faults, else
> >>> the page cache would contain stale data if there was a concurrent
> >>> write that happened just before, which would lead to data corruption
> >>> in the filesystem. Imo, we need a more encompassing solution for all
> >>> these cases if we're serious about preventing FUSE from blocking
> >>> migration, which probably looks like a globally enforced default
> >>> timeout of some sort or an mm solution for mitigating the blast radius
> >>> of how much memory can be blocked from migration, but that is outside
> >>> the scope of this patchset and is its own standalone topic.
> >
> > I'm still skeptical about timeouts: we can only get it wrong.
> >
> > I think a proper solution is making these pages movable, which does seem
> > feasible if (a) splice is not involved and (b) we can find a way to not
> > hold the folio lock forever e.g., in the readahead case.
> >
> > Maybe readahead would have to be handled more similar to writeback
> > (e.g., having a separate flag, or using a combination of e.g.,
> > writeback+uptodate flag, not sure)
> >
> > In both cases (readahead+writeback), we'd want to call into the FS to
> > migrate a folio that is under readahead/writeback. In case of fuse
> > without splice, a migration might be doable, and as discussed, splice
> > might just be avoided.
>
> My personal take here is that we should move away from splice.
> Keith (or colleague) is working on ZC with io-uring anyway, so
> maybe this is good timing. We should just ensure that the new approach
> doesn't have the same issue.

splice is problematic in a lot of other ways too. It's easy to abuse it
for weird userspace hangs since it clings onto the pipe_lock() and no
one wants to do the invasive surgery to wean it off of that.

So +1 on avoiding splice.
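
For anyone reading along, the MIGRATE_SYNC hunk quoted earlier in this
thread can be modelled in userspace roughly as below. This is a minimal
sketch, not kernel code: the enum, the struct, and the names
writeback_decision() and writeback_decision_selfcheck() are hypothetical
stand-ins invented for illustration, and only the control flow of the
switch is taken from the quoted diff. It assumes the folio is already
known to be under writeback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's migration modes (illustrative only). */
enum sketch_mode { SKETCH_ASYNC, SKETCH_SYNC_LIGHT, SKETCH_SYNC };

/* Hypothetical stand-in for a mapping: one flag, nothing else. */
struct sketch_mapping {
	bool writeback_indeterminate; /* models AS_WRITEBACK_INDETERMINATE */
};

/*
 * Decision for a folio that is under writeback, following the patched
 * switch: returns 0 when migration would carry on towards waiting for
 * writeback, -EBUSY when the folio is skipped.
 */
static int writeback_decision(enum sketch_mode mode,
			      const struct sketch_mapping *mapping)
{
	switch (mode) {
	case SKETCH_SYNC:
		/* No mapping, or a mapping with bounded writeback: proceed. */
		if (!mapping || !mapping->writeback_indeterminate)
			return 0;
		/* fall through */
	default:
		return -EBUSY;
	}
}

/* Exercises the cases discussed in the thread. */
static void writeback_decision_selfcheck(void)
{
	struct sketch_mapping bounded = { .writeback_indeterminate = false };
	struct sketch_mapping fuse_like = { .writeback_indeterminate = true };

	assert(writeback_decision(SKETCH_SYNC, NULL) == 0);
	assert(writeback_decision(SKETCH_SYNC, &bounded) == 0);
	assert(writeback_decision(SKETCH_SYNC, &fuse_like) == -EBUSY);
	assert(writeback_decision(SKETCH_ASYNC, &bounded) == -EBUSY);
	assert(writeback_decision(SKETCH_SYNC_LIGHT, &bounded) == -EBUSY);
}
```

Calling writeback_decision_selfcheck() from a main() runs the cases. Note
that the sketch only covers the writeback check; the folio lock path that
Joanne quotes is taken earlier and is not modelled here.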