From: Joanne Koong
Date: Wed, 13 Nov 2024 11:11:03 -0800
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
To: Jingbo Xu
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev, josef@toxicpanda.com, linux-mm@kvack.org, bernd.schubert@fastmail.fm, kernel-team@meta.com
References: <20241107235614.3637221-1-joannelkoong@gmail.com> <20241107235614.3637221-7-joannelkoong@gmail.com> <9c0dbdac-0aed-467c-86c7-5b9a9f96d89d@linux.alibaba.com> <0f585a7c-678b-492a-9492-358f21e57291@linux.alibaba.com>
In-Reply-To: <0f585a7c-678b-492a-9492-358f21e57291@linux.alibaba.com>

On Mon, Nov 11, 2024 at 6:31 PM Jingbo Xu wrote:
>
>
>
> On 11/12/24 5:30 AM, Joanne Koong wrote:
> > On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu wrote:
> >>
> >> Hi, Joanne and Miklos,
> >>
> >> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>> Currently, we allocate and copy data to a temporary folio when
> >>> handling writeback in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>>   that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>>   direct reclaim
> >>>
> >>> To work around this, we allocate a temporary folio and copy over the
> >>> original folio to the temporary folio so that writeback can be
> >>> immediately cleared on the original folio. This additionally requires us
> >>> to maintain an internal rb tree to keep track of writeback state on the
> >>> temporary folios.
> >>>
> >>> A recent change prevents reclaim logic from waiting on writeback for
> >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>> will prevent FUSE folios from running into the reclaim deadlock described
> >>> above) and removes the temporary folio + extra copying and the internal
> >>> rb tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>>         bs =  1k          4k            1M
> >>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>> % diff        -3%          23%         45%
> >>>
> >>> Signed-off-by: Joanne Koong
> >>
> >>
> >> IIUC this patch seems to break commit
> >> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
> >> page rewrites").
> >>
> >
> > Interesting! My understanding was that we only needed that commit
> > because we were clearing writeback on the original folio before
> > writeback had actually finished.
> >
> > Now that folio writeback state is accounted for normally (eg through
> > writeback being set/cleared on the original folio), does the
> > folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
> > this?
>
> Yes, after inspecting the writeback logic more, it seems that the second
> writeback won't be initiated if the first one has not completed yet, see
>
> ```
> a_ops->writepages
>   write_cache_pages
>     writeback_iter
>       writeback_get_folio
>         folio_prepare_writeback
>           if folio_test_writeback(folio):
>             folio_wait_writeback(folio)
> ```
>
> and thus it won't be an issue to remove the auxiliary list ;)
>

Awesome, thanks for double-checking!

> >
> >>> -     /*
> >>> -      * Being under writeback is unlikely but possible.  For example direct
> >>> -      * read to an mmaped fuse file will set the page dirty twice; once when
> >>> -      * the pages are faulted with get_user_pages(), and then after the read
> >>> -      * completed.
> >>> -      */
> >>
> >> In short, the target scenario is like:
> >>
> >> ```
> >> # open a fuse file and mmap
> >> fd1 = open("fuse-file-path", ...)
> >> uaddr = mmap(fd1, ...)
> >>
> >> # DIRECT read to the mmaped fuse file
> >> fd2 = open("ext4-file-path", O_DIRECT, ...)
> >> read(fd2, uaddr, ...)
> >>     # get_user_pages() of uaddr, and triggers faultin
> >>     # a_ops->dirty_folio() <--- mark PG_dirty
> >>
> >>     # when DIRECT IO completed:
> >>     # a_ops->dirty_folio() <--- mark PG_dirty
> >
> > If you have the direct io function call stack at hand, could you point
> > me to the function where the direct io completion marks this folio as
> > dirty?
>
>
> FYI The full call stack is like:
>
> ```
> # DIRECT read(2) to the mmaped fuse file
> read(fd2, uaddr1, ...)
>   f_ops->read_iter()
>     (iomap-based ) iomap_dio_rw
>       # for READ && user_backed_iter(iter):
>         dio->flags |= IOMAP_DIO_DIRTY
>       iomap_dio_iter
>         iomap_dio_bio_iter
>           # add user or kernel pages to a bio
>           bio_iov_iter_get_pages
>             ...
>             pin_user_pages_fast(..., FOLL_WRITE, ...)
>               # find corresponding vma of dest buffer (fuse page cache)
>               # search page table (pte) to find corresponding page
>               # if not fault yet, trigger explicit faultin:
>               faultin_page(..., FOLL_WRITE, ...)
>                 handle_mm_fault(..., FAULT_FLAG_WRITE)
>                   handle_pte_fault
>                     do_wp_page
>                       (vma->vm_flags & VM_SHARED) wp_page_shared
>                         ...
>                         fault_dirty_shared_page
>                           folio_mark_dirty
>                             a_ops->dirty_folio(), i.e., filemap_dirty_folio()
>                               # set PG_dirty
>                               folio_test_set_dirty(folio)
>                               # set PAGECACHE_TAG_DIRTY
>                               __folio_mark_dirty
>
>
>       # if dio->flags & IOMAP_DIO_DIRTY:
>       bio_set_pages_dirty
>         (for each dest page) folio_mark_dirty
>           a_ops->dirty_folio(), i.e., filemap_dirty_folio()
>             # set PG_dirty
>             folio_test_set_dirty(folio)
>             # set PAGECACHE_TAG_DIRTY
>             __folio_mark_dirty
> ```
>

Thanks for this info, Jingbo.

>
> >
> >> ```
> >>
> >> The auxiliary write request list was introduced to fix this.
> >>
> >> I'm not sure if there's an alternative other than the auxiliary list to
> >> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
> >> that the same folio won't get dirtied when the writeback has not
> >> completed yet?
> >>
> >
> > I'm curious how other filesystems solve for this - this seems like a
> > generic situation other filesystems would run into as well.
> >
>
> As mentioned above, the writeback path will prevent the duplicate
> writeback request on the same page when the first writeback IO has not
> completed yet.
>
> Sorry for the noise...
>
> --
> Thanks,
> Jingbo
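
For anyone who wants to exercise the mmap + DIRECT read scenario quoted above from userspace, here is a rough sketch in C. It only mirrors the quoted steps (open and mmap one file, then read(2) from a second file straight into the mapping). The helper name read_into_shared_mapping() and its parameters are hypothetical and not part of the patch, and whether a given kernel actually dirties the folio twice on this path is not something the sketch checks.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): map `len` bytes of map_path
 * with MAP_SHARED, then read(2) `len` bytes from src_path directly into
 * that mapping. src_flags is OR-ed into the open flags of src_path, so a
 * caller can pass O_DIRECT to get the DIRECT read from the scenario.
 * Returns 0 if the full length was read, -1 otherwise.
 */
int read_into_shared_mapping(const char *map_path, const char *src_path,
                             size_t len, int src_flags)
{
    struct stat st;
    void *uaddr;
    ssize_t n;
    int fd1, fd2;
    int ret = -1;

    /* fd1: the file whose page cache is the destination buffer */
    fd1 = open(map_path, O_RDWR);
    if (fd1 < 0)
        return -1;

    /* Grow the file if needed so the whole mapping is backed by it. */
    if (fstat(fd1, &st) < 0)
        goto out_close_map;
    if ((size_t)st.st_size < len && ftruncate(fd1, (off_t)len) < 0)
        goto out_close_map;

    uaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
    if (uaddr == MAP_FAILED)
        goto out_close_map;

    /* fd2: the source file, read straight into the shared mapping */
    fd2 = open(src_path, O_RDONLY | src_flags);
    if (fd2 < 0)
        goto out_unmap;

    n = read(fd2, uaddr, len);
    if (n == (ssize_t)len)
        ret = 0;

    close(fd2);
out_unmap:
    munmap(uaddr, len);
out_close_map:
    close(fd1);
    return ret;
}
```

To match the quoted scenario, map_path would be a file on the FUSE mount, src_path a file on a filesystem that supports direct IO, and src_flags would be O_DIRECT (on glibc that needs _GNU_SOURCE defined before the includes, and len should be a multiple of the block size). Passing 0 for src_flags gives a plain buffered read, which is handy for checking the helper itself.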