From: Joanne Koong
Date: Thu, 3 Apr 2025 15:04:19 -0700
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
To: David Hildenbrand
Cc: Jingbo Xu, miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev, josef@toxicpanda.com, bernd.schubert@fastmail.fm, linux-mm@kvack.org, kernel-team@meta.com, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko
References: <20241122232359.429647-1-joannelkoong@gmail.com> <20241122232359.429647-5-joannelkoong@gmail.com> <1036199a-3145-464b-8bbb-13726be86f46@linux.alibaba.com> <1577c4be-c6ee-4bc6-bb9c-d0a6d553b156@redhat.com> <075209ac-c659-485e-a220-83d4afed8a94@redhat.com>
In-Reply-To: <075209ac-c659-485e-a220-83d4afed8a94@redhat.com>

On Thu, Apr 3, 2025 at 1:44 PM David Hildenbrand wrote:
>
> On 03.04.25 21:09, Joanne Koong wrote:
> > On Thu, Apr 3, 2025
> > at 2:18 AM David Hildenbrand wrote:
> >>
> >> On 03.04.25 05:31, Jingbo Xu wrote:
> >>>
> >>>
> >>> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand wrote:
> >>>>>
> >>>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>>> waits may get stuck.
> >>>>>>
> >>>>>> Signed-off-by: Joanne Koong
> >>>>>> Reviewed-by: Shakeel Butt
> >>>>>> ---
> >>>>>> mm/migrate.c | 5 ++++-
> >>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>>> poisoning... ?!
> >>>>>
> >>>>> That sounds very bad.
> >>>>
> >>>> I took a closer look at the migration code and the FUSE code. In the
> >>>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >>>> mode folio lock holds will block migration until that folio is
> >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>>
> >>>>         if (!folio_trylock(src)) {
> >>>>                 if (mode == MIGRATE_ASYNC)
> >>>>                         goto out;
> >>>>
> >>>>                 if (current->flags & PF_MEMALLOC)
> >>>>                         goto out;
> >>>>
> >>>>                 if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>>                         goto out;
> >>>>
> >>>>                 folio_lock(src);
> >>>>         }
> >>>>
> >>
> >> Right, I raised that also in my LSF/MM talk: waiting for readahead
> >> currently implies waiting for the folio lock (there is no separate
> >> readahead flag like there would be for writeback).
> >>
> >> The more I look into this and fuse, the more I realize that what fuse
> >> does is just completely broken right now.
> >>
> >>>> If this is all that is needed for a malicious FUSE server to block
> >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>>> mappings are skipped in migration. A malicious server has easier and
> >>>> more powerful ways of blocking migration in FUSE than trying to do it
> >>>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>>> even get far enough to hit writeback - a write triggers
> >>>> aops->write_begin() and a malicious server would deliberately hang
> >>>> forever while the folio is locked in write_begin().
> >>>
> >>> Indeed it seems possible. A malicious FUSE server may already be
> >>> capable of blocking the synchronous migration in this way.
> >>
> >> Yes, I think the conclusion is that we should advise people against
> >> using unprivileged FUSE if they care about any features that rely on
> >> page migration or page reclaim.
> >>
> >>>
> >>>
> >>>>
> >>>> I looked into whether we could eradicate all the places in FUSE where
> >>>> we may hold the folio lock for an indeterminate amount of time,
> >>>> because if that is possible, then we should not add this writeback way
> >>>> for a malicious fuse server to affect migration. But I don't think we
> >>>> can, for example taking one case, the folio lock needs to be held as
> >>>> we read in the folio from the server when servicing page faults, else
> >>>> the page cache would contain stale data if there was a concurrent
> >>>> write that happened just before, which would lead to data corruption
> >>>> in the filesystem. Imo, we need a more encompassing solution for all
> >>>> these cases if we're serious about preventing FUSE from blocking
> >>>> migration, which probably looks like a globally enforced default
> >>>> timeout of some sort or an mm solution for mitigating the blast radius
> >>>> of how much memory can be blocked from migration, but that is outside
> >>>> the scope of this patchset and is its own standalone topic.
> >>
> >> I'm still skeptical about timeouts: we can only get it wrong.
> >>
> >> I think a proper solution is making these pages movable, which does seem
> >> feasible if (a) splice is not involved and (b) we can find a way to not
> >> hold the folio lock forever e.g., in the readahead case.
> >>
> >> Maybe readahead would have to be handled more similar to writeback
> >> (e.g., having a separate flag, or using a combination of e.g.,
> >> writeback+uptodate flag, not sure)
> >>
> >> In both cases (readahead+writeback), we'd want to call into the FS to
> >> migrate a folio that is under readahead/writeback. In case of fuse
> >> without splice, a migration might be doable, and as discussed, splice
> >> might just be avoided.
> >>
> >>>>
> >>>> I don't see how this patch has any additional negative impact on
> >>>> memory migration for the case of malicious servers that the server
> >>>> can't already (and more easily) do. In fact, this patchset if anything
> >>>> helps memory given that malicious servers now can't also trigger page
> >>>> allocations for temp pages that would never get freed.
> >>>>
> >>>
> >>> If that's true, maybe we could drop this patch out of this patchset? So
> >>> that both before and after this patchset, synchronous migration could be
> >>> blocked by a malicious FUSE server, while the usability of contiguous
> >>> memory (CMA) won't be affected.
> >>
> >> I had exactly the same thought: if we can block forever on the folio
> >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> >> completely broken.
> >
> > I will resubmit this patchset and drop this patch.
> >
> > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
> > cgroupv1 reclaim scenarios:
> > a) sync: sync waits on writeback so if we don't skip waiting on
> > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
> > servers could make syncs hang.
> > (There's no actual effect on sync
> > behavior though with temp pages because even without temp pages, we
> > return even though the data hasn't actually been synced to disk by the
> > server yet)
>
> Just curious: Are we sure there are no other cases where a malicious
> userspace could make some other folio_lock() hang forever either way?
>

Unfortunately, there's an awful case where kswapd may get blocked
waiting for the folio lock. We encountered this in prod last week from
a well-intentioned but incorrectly written FUSE server that got stuck.
The stack trace was:

366 kswapd0 D
folio_wait_bit_common.llvm.15141953522965195141
truncate_inode_pages_range
fuse_evict_inode
evict
_dentry_kill
shrink_dentry_list
prune_dcache_sb
super_cache_scan
do_shrink_slab
shrink_slab
kswapd
kthread
ret_from_fork
ret_from_fork_asm

which was narrowed down to the __filemap_get_folio(..., FGP_LOCK, ...)
call in truncate_inode_pages_range().

I'm working on a fix for this for kswapd and planning to also do a
broader audit for other places where we might get tripped up from fuse
forever holding a folio lock. I'm going to look more into the
long-term fuse fix too - the first step will be documenting all the
places currently where a lock may be forever held.

> IOW, just like for migration, isn't this just solving one part of the
> whole problem we are facing?

For sync, I didn't see any folio lock acquires anywhere but I just
noticed that fuse's .sync_fs() implementation will block until a
server replies, so yes a malicious server could still hold up sync
regardless of temp pages or not. I'll drop the sync patch too in v7.

Thanks,
Joanne

>
> >
> > b) cgroupv1 reclaim: a correctly written fuse server can fall into
> > this deadlock in one very specific scenario (eg if it's using legacy
> > cgroupv1 and reclaim encounters a folio that already has the reclaim
> > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
> > set), where the deadlock is triggered by:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in direct reclaim
>
> Yes, that sounds reasonable.
>
> --
> Cheers,
>
> David / dhildenb
>
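
For reference, the folio-lock decision in the migrate_folio_unmap()
snippet quoted earlier in this thread can be modelled in plain
userspace C. This is only a sketch of the quoted branch order, not
kernel code: the MODEL_* modes, struct folio_model, enum lock_outcome
and lock_for_migration() are hypothetical stand-ins invented for
illustration, and the "someone else holds the lock" state is reduced
to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migrate modes (not kernel API). */
enum model_mode {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/* What the quoted branch would do with the source folio. */
enum lock_outcome {
	LOCK_TAKEN,   /* folio_trylock() succeeded */
	LOCK_SKIPPED, /* "goto out": migration of this folio is abandoned */
	LOCK_BLOCKS,  /* folio_lock(): sleeps until the holder unlocks */
};

/* Hypothetical reduction of a folio to the two bits the snippet reads. */
struct folio_model {
	bool locked_by_other; /* e.g. held while waiting on a FUSE server */
	bool uptodate;
};

/* Mirrors the order of the checks in the quoted snippet. */
enum lock_outcome lock_for_migration(const struct folio_model *src,
				     enum model_mode mode, bool pf_memalloc)
{
	if (!src->locked_by_other)
		return LOCK_TAKEN;

	if (mode == MODEL_MIGRATE_ASYNC)
		return LOCK_SKIPPED;

	if (pf_memalloc)
		return LOCK_SKIPPED;

	if (mode == MODEL_MIGRATE_SYNC_LIGHT && !src->uptodate)
		return LOCK_SKIPPED;

	/* Every remaining case waits for however long the holder takes. */
	return LOCK_BLOCKS;
}
```

In this model, a folio whose lock is held by someone else yields
LOCK_BLOCKS for MODEL_MIGRATE_SYNC whenever pf_memalloc is false,
which is the case discussed above where a server that never replies
can stall synchronous migration regardless of writeback state.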