Date: Wed, 27 Nov 2024 13:02:35 +0100
From: Jan Kara
To: Mateusz Guzik
Cc: Bharata B Rao, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com,
	willy@infradead.org, vbabka@suse.cz, david@redhat.com,
	akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Message-ID: <20241127120235.ejpvpks3fosbzbkr@quack3>
References: <20241127054737.33351-1-bharata@amd.com>

On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik wrote:
> >
> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao wrote:
> > >
> > > Recently we discussed the scalability issues while running large
> > > instances of FIO with buffered IO option on NVME block devices here:
> > >
> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> > >
> > > One of the suggestions Chris Mason gave (during private discussions) was
> > > to enable large folios in block buffered IO path as that could
> > > improve the scalability problems and improve the lock contention
> > > scenarios.
> > >
> > I have no basis to comment on the idea.
> >
> > However, it is pretty apparent whatever the situation it is being
> > heavily disfigured by lock contention in blkdev_llseek:
> >
> > > perf-lock contention output
> > > ---------------------------
> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > > mix it looks like this:
> > >
> > > perf-lock contention default
> > >  contended   total wait     max wait     avg wait         type   caller
> > >
> > > 1337359017      64.69 h    769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
> > >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
> > >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
> > >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
> > >                         0xffffffff8f39e88f  up_write+0x4f
> > >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
> > >                         0xffffffff8f703322  ksys_lseek+0x72
> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> > >    2665573      64.38 h       1.98 s     86.95 ms      rwsem:W   blkdev_llseek+0x31
> > >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
> > >                         0xffffffff903f18fb  down_write+0x5b
> > >                         0xffffffff8f9d5971  blkdev_llseek+0x31
> > >                         0xffffffff8f703322  ksys_lseek+0x72
> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> > >                         0xffffffff903dce5e  do_syscall_64+0x7e
> > >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
> >
> > Admittedly I'm not familiar with this code, but at a quick glance the
> > lock can be just straight up removed here?
> >
> > static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> > {
> > 	struct inode *bd_inode = bdev_file_inode(file);
> > 	loff_t retval;
> >
> > 	inode_lock(bd_inode);
> > 	retval = fixed_size_llseek(file, offset, whence,
> > 				   i_size_read(bd_inode));
> > 	inode_unlock(bd_inode);
> > 	return retval;
> > }
> >
> > At best it stabilizes the size for the duration of the call. Sounds
> > like it helps nothing since if the size can change, the file offset
> > will still be altered as if there was no locking?
> >
> > Suppose this cannot be avoided to grab the size for whatever reason.
> >
> > While the above fio invocation did not work for me, I ran some crapper
> > which I had in my shell history and according to strace:
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
> >
> > ... the lseeks just rewind to the beginning, *definitely* not needing
> > to know the size. One would have to check but this is most likely the
> > case in your test as well.
> >
> > And for that there is 0 need to grab the size, and consequently the inode lock.
>
> That is to say bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:

Yeah, I've noticed this in the locking profiles as well and I agree
bd_inode locking seems unnecessary here. Even some filesystems (e.g.
ext4) get away without using inode lock in their llseek handler...
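
A rough userspace model may make the point clearer. This is only a sketch
under assumptions, not the kernel code: `struct model_file` and
`model_llseek()` are hypothetical names, and the bounds check is a
simplified reading of what fixed_size_llseek() does with the size it is
handed. The size is consumed once as a plain value, so holding a lock
around the call cannot keep the result valid after the function returns.

```c
#include <errno.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for struct file: only the position matters here. */
struct model_file {
	long long pos;
};

/*
 * Userspace model of a fixed-size llseek. "size" is a snapshot taken by the
 * caller, in the same way the patched blkdev_llseek() passes
 * i_size_read(bd_inode). Returns the new position or a negative errno value.
 * Simplification: no overflow checks on the offset arithmetic.
 */
static long long model_llseek(struct model_file *f, long long offset,
			      int whence, long long size)
{
	switch (whence) {
	case SEEK_SET:
		break;			/* absolute offset, size not needed yet */
	case SEEK_CUR:
		offset += f->pos;	/* relative to the current position */
		break;
	case SEEK_END:
		offset += size;		/* relative to the size snapshot */
		break;
	default:
		return -EINVAL;		/* other whence values not modelled */
	}

	/* The snapshot is only used as an upper bound for the new position. */
	if (offset < 0 || offset > size)
		return -EINVAL;

	f->pos = offset;
	return offset;
}
```

With a 4096-byte size snapshot, model_llseek(&f, 0, SEEK_SET, 4096) rewinds
to 0 no matter what happens to the real size afterwards, which is the case
the strace output above shows.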

								Honza

> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
>  static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>  {
>  	struct inode *bd_inode = bdev_file_inode(file);
> -	loff_t retval;
>
> -	inode_lock(bd_inode);
> -	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> -	inode_unlock(bd_inode);
> -	return retval;
> +	return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>  }
>
>  static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>
> To be aborted if it blows up (but I don't see why it would).
>
> --
> Mateusz Guzik

-- 
Jan Kara
SUSE Labs, CR
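
For the before/after comparison, something along these lines could serve
as a starting point. It is a hypothetical userspace helper
(`time_lseek_rewinds()` is a made-up name), not the fio job from the
original report: it only times the lseek(fd, 0, SEEK_SET) rewinds visible
in the strace output. Contention would only show up when many processes or
threads run it against the same block device at once.

```c
#define _POSIX_C_SOURCE 200809L

#include <time.h>
#include <unistd.h>

/*
 * Time "iters" rewinds of fd with lseek(fd, 0, SEEK_SET).
 * Returns the elapsed time in nanoseconds, or -1 if any call fails.
 */
static long long time_lseek_rewinds(int fd, long iters)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;

	for (long i = 0; i < iters; i++) {
		/* A successful rewind returns the new offset, which is 0. */
		if (lseek(fd, 0, SEEK_SET) != 0)
			return -1;
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;

	return (end.tv_sec - start.tv_sec) * 1000000000LL +
	       (end.tv_nsec - start.tv_nsec);
}
```

Opening the device under test read-only and calling
time_lseek_rewinds(fd, iters) from several workers in parallel, once on an
unpatched and once on a patched kernel, should give comparable numbers.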