From: Mateusz Guzik
Date: Wed, 27 Nov 2024 07:19:59 +0100
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
To: Bharata B Rao
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com,
    willy@infradead.org, vbabka@suse.cz, david@redhat.com,
    akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk,
    viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    joshdon@google.com, clm@meta.com
References: <20241127054737.33351-1-bharata@amd.com>

On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik wrote:
>
> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao wrote:
> >
> > Recently we discussed the scalability issues while running large
> > instances of FIO with buffered IO option on NVME block devices here:
> >
> > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> >
> > One of the suggestions Chris Mason gave
> > (during private discussions) was
> > to enable large folios in block buffered IO path as that could
> > improve the scalability problems and improve the lock contention
> > scenarios.
> >
>
> I have no basis to comment on the idea.
>
> However, it is pretty apparent whatever the situation it is being
> heavily disfigured by lock contention in blkdev_llseek:
>
> > perf-lock contention output
> > ---------------------------
> > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > mix it looks like this:
> >
> > perf-lock contention default
> >  contended   total wait     max wait     avg wait         type   caller
> >
> > 1337359017      64.69 h    769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
> >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
> >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
> >                         0xffffffff8f39e88f  up_write+0x4f
> >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >    2665573      64.38 h       1.98 s     86.95 ms      rwsem:W   blkdev_llseek+0x31
> >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
> >                         0xffffffff903f18fb  down_write+0x5b
> >                         0xffffffff8f9d5971  blkdev_llseek+0x31
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >                         0xffffffff903dce5e  do_syscall_64+0x7e
> >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>
> Admittedly I'm not familiar with this code, but at a quick glance the
> lock can be just straight up removed here?
>
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
>         struct inode *bd_inode = bdev_file_inode(file);
>         loff_t retval;
>
>         inode_lock(bd_inode);
>         retval = fixed_size_llseek(file, offset, whence,
>                                    i_size_read(bd_inode));
>         inode_unlock(bd_inode);
>         return retval;
> }
>
> At best it stabilizes the size for the duration of the call. Sounds
> like it helps nothing since if the size can change, the file offset
> will still be altered as if there was no locking?
>
> Suppose this cannot be avoided to grab the size for whatever reason.
>
> While the above fio invocation did not work for me, I ran some crapper
> which I had in my shell history and according to strace:
> [pid 271829] lseek(7, 0, SEEK_SET) = 0
> [pid 271829] lseek(7, 0, SEEK_SET) = 0
> [pid 271830] lseek(7, 0, SEEK_SET) = 0
>
> ... the lseeks just rewind to the beginning, *definitely* not needing
> to know the size. One would have to check but this is most likely the
> case in your test as well.
>
> And for that there is 0 need to grab the size, and consequently the inode lock.
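
To make the point about SEEK_SET concrete, here is a minimal userspace
sketch of a fixed-size llseek. It is an illustration under assumptions,
not the kernel's fixed_size_llseek: struct model_file and model_llseek
are made-up names, the bounds check against [0, size] is assumed from a
reading of the generic helpers, and overflow handling is left out. It
only shows which whence values use the size to compute the target.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for the file position; not a kernel structure. */
struct model_file {
	long long pos;
};

/*
 * Simplified model of a fixed-size llseek (assumed semantics).
 * Returns the new position, or -EINVAL if the whence is unsupported
 * or the target falls outside [0, size].
 */
long long model_llseek(struct model_file *f, long long offset,
		       int whence, long long size)
{
	long long target;

	switch (whence) {
	case SEEK_SET:
		/* Target comes from the argument alone. */
		target = offset;
		break;
	case SEEK_CUR:
		/* Target comes from the current position. */
		target = f->pos + offset;
		break;
	case SEEK_END:
		/* The only case here that needs the size to find the target. */
		target = size + offset;
		break;
	default:
		return -EINVAL;
	}

	/* Range check; this is where the size matters for SET and CUR. */
	if (target < 0 || target > size)
		return -EINVAL;

	f->pos = target;
	return target;
}
```

With this model, model_llseek(&f, 0, SEEK_SET, size) returns 0 for any
non-negative size, so a rewind gives the same result no matter what the
size was at the time of the call. Whether the real function can safely
drop the lock still has to be checked against the actual kernel code.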

That is to say, at a bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct inode *bd_inode = bdev_file_inode(file);
-	loff_t retval;

-	inode_lock(bd_inode);
-	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-	inode_unlock(bd_inode);
-	return retval;
+	return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }

 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).

-- 
Mateusz Guzik