From: "Raphael S. Carvalho"
Date: Mon, 10 Feb 2025 15:20:05 -0300
Subject: Re: Possible regression with buffered writes + NOWAIT behavior, under memory pressure
To: linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, Dave Chinner, hch@lst.de, Avi Kivity

On Mon, Feb 10, 2025 at 3:12 PM Raphael S. Carvalho wrote:
>
> While running the ScyllaDB test suite, which uses io_uring + buffered
> writes + XFS, the system was spuriously returning ENOMEM, despite
> there being plenty of available memory to be reclaimed from the page
> cache. FWIW, I am running: 6.12.9-100.fc40.x86_64
>
> Tracing showed io_uring_complete failing the request with ENOMEM:
> # cat /sys/kernel/debug/tracing/trace | grep "result -12" -B 100 | grep "0000000065b91cd1"
>     reactor-1-707139 [000] ..... 46737.358518: io_uring_submit_req: ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, opcode WRITE, flags 0x200000, sq_thread 0
>     reactor-1-707139 [000] ..... 46737.358526: io_uring_file_get: ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, fd 45
>     reactor-1-707139 [000] ...1. 46737.358560: io_uring_complete: ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, result -12, cflags 0x0 extra1 0 extra2 0
>
> That puzzled me.
>
> retsnoop pointed to iomap_get_folio:
>
> 00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 (reactor-1/combined_tests):
>
>                    entry_SYSCALL_64_after_hwframe+0x76
>                    do_syscall_64+0x82
>                    __do_sys_io_uring_enter+0x265
>                    io_submit_sqes+0x209
>                    io_issue_sqe+0x5b
>                    io_write+0xdd
>                    xfs_file_buffered_write+0x84
>                    iomap_file_buffered_write+0x1a6
>     32us [-ENOMEM] iomap_write_begin+0x408
>                    iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
>                    pos=0 len=4096 foliop=0xffffb32c296b7b80
> !    4us [-ENOMEM] iomap_get_folio
>                    iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
>                    pos=0 len=4096
>
> Another trace shows iomap_file_buffered_write with ki_flags 2359304,
> which translates into (IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT).
> And flags 33 in iomap_get_folio means IOMAP_NOWAIT is set, which makes
> sense since the iomap code translates IOCB_NOWAIT into IOMAP_NOWAIT
> when XFS performs the buffered write through the iomap subsystem:
>
> fs/iomap/buffered-io.c- if (iocb->ki_flags & IOCB_NOWAIT)
> fs/iomap/buffered-io.c:         iter.flags |= IOMAP_NOWAIT;
>
> We know io_uring works by first attempting the write with IOCB_NOWAIT,
> and if it fails with EAGAIN, it falls back to a worker thread without
> the NOWAIT semantics.
>
> iomap_get_folio(), once called with IOMAP_NOWAIT, will request the
> allocation to follow GFP_NOWAIT behavior, so the allocation can
> potentially fail under pressure.
>
> Coming across 'iomap: Add async buffered write support', I see Darrick wrote:
>
> "FGP_NOWAIT can cause __filemap_get_folio to return a NULL folio, which
> makes iomap_write_begin return -ENOMEM. If nothing has been written
> yet, won't that cause the ENOMEM to escape to userspace? Why do we want
> that instead of EAGAIN?"
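
As a sanity check on the flag values quoted above, here is a minimal userspace sketch that decodes them. The bit values are assumptions: they are what I read in include/linux/fs.h and include/linux/iomap.h around v6.12, copied by hand, and has_all() and print_ki_flags() are hypothetical helpers written only for this mail.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values, as I read them in include/linux/fs.h around v6.12. */
#define IOCB_NOWAIT      0x00000008u  /* shares its bit with RWF_NOWAIT */
#define IOCB_WRITE       (1u << 18)
#define IOCB_ALLOC_CACHE (1u << 21)

/* Assumed values, as I read them in include/linux/iomap.h around v6.12. */
#define IOMAP_WRITE      (1u << 0)
#define IOMAP_NOWAIT     (1u << 5)

/* The two numbers seen in the traces, rebuilt from the assumed bits. */
static_assert((IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT) == 2359304u,
	      "ki_flags from the trace");
static_assert((IOMAP_WRITE | IOMAP_NOWAIT) == 33u,
	      "iter.flags from the trace");

/* Hypothetical helper: returns 1 when every bit of mask is set in flags. */
int has_all(unsigned int flags, unsigned int mask)
{
	return (flags & mask) == mask;
}

/* Hypothetical helper: prints which of the three kiocb bits are set. */
void print_ki_flags(unsigned int ki_flags)
{
	printf("ki_flags %u (0x%x):%s%s%s\n", ki_flags, ki_flags,
	       has_all(ki_flags, IOCB_WRITE) ? " IOCB_WRITE" : "",
	       has_all(ki_flags, IOCB_ALLOC_CACHE) ? " IOCB_ALLOC_CACHE" : "",
	       has_all(ki_flags, IOCB_NOWAIT) ? " IOCB_NOWAIT" : "");
}
```

Calling print_ki_flags(2359304) from a small main() lists the bits that are set. The flags are combined with bitwise OR, which is why the static_assert lines use `|`.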
>
> In the patch 'mm: return an ERR_PTR from __filemap_get_folio', I see
> the following changes:
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -468,19 +468,12 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
>  {
>         unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> -       struct folio *folio;
>
>         if (iter->flags & IOMAP_NOWAIT)
>                 fgp |= FGP_NOWAIT;
>
> -       folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> +       return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>                         fgp, mapping_gfp_mask(iter->inode->i_mapping));
> -       if (folio)
> -               return folio;
> -
> -       if (iter->flags & IOMAP_NOWAIT)
> -               return ERR_PTR(-EAGAIN);
> -       return ERR_PTR(-ENOMEM);
>  }
>
> This leads me to believe we have a regression in this area after that
> patch, since iomap_get_folio() no longer returns EAGAIN with
> IOMAP_NOWAIT if __filemap_get_folio() failed to get a folio. Now it
> returns ENOMEM unconditionally.
>
> Since we pushed the error picking decision to __filemap_get_folio, I
> think it makes sense for us to patch it such that it returns EAGAIN if
> allocation failed (under pressure) because IOMAP_NOWAIT was requested
> by its caller and the allocation is not allowed to block waiting for
> the reclaimer to do its thing.
>
> A possible way to fix it is this one-liner, but I am not well versed
> in this area, so someone may end up suggesting a better fix:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 804d7365680c..9e698a619545 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>                 do {
>                         gfp_t alloc_gfp = gfp;
>
> -                       err = -ENOMEM;
> +                       err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN;

Sorry, I actually meant this:
+                       err = (fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM;
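
To make the intended behavior concrete, here is a minimal userspace model of the corrected error selection and of the io_uring fallback described earlier in this mail. It is only a sketch, not kernel code: alloc_failure_errno() and model_io_uring_write() are hypothetical names, and the FGP_NOWAIT value is an assumption copied by hand from include/linux/pagemap.h (the exact bit does not matter for the model).

```c
#include <errno.h>

/* Assumed value, as I read it in include/linux/pagemap.h. */
#define FGP_NOWAIT 0x00000020u

/*
 * Hypothetical model of the corrected one-liner: the error reported
 * when folio allocation fails, depending on whether the caller asked
 * for NOWAIT semantics.
 */
int alloc_failure_errno(unsigned int fgp_flags)
{
	return (fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM;
}

/*
 * Hypothetical model of the io_uring issue path: the first attempt is
 * made with NOWAIT. Only -EAGAIN causes a retry from a worker thread
 * without NOWAIT; any other result completes the request as-is.
 */
int model_io_uring_write(int nowait_result, int blocking_result)
{
	if (nowait_result == -EAGAIN)
		return blocking_result; /* retried without NOWAIT */
	return nowait_result;           /* reported to userspace */
}
```

In this model, a NOWAIT attempt that fails with -ENOMEM is reported straight to userspace, which matches the result -12 in the trace. With the corrected ternary the same failure becomes -EAGAIN, so the request is retried in blocking mode, where the allocation is allowed to wait for reclaim.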