From: Ritesh Harjani (IBM)
To: Jan Kara, Mateusz Guzik
Cc: Bharata B Rao, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com,
    willy@infradead.org, vbabka@suse.cz, david@redhat.com,
    akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk,
    viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Thu, 28 Nov 2024 11:10:35 +0530
Message-ID: <87plmf3oh8.fsf@gmail.com>
In-Reply-To: <20241127120235.ejpvpks3fosbzbkr@quack3>
References: <20241127054737.33351-1-bharata@amd.com>
    <20241127120235.ejpvpks3fosbzbkr@quack3>

Jan Kara writes:

> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik wrote:
>> >
>> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao wrote:
>> > >
>> > > Recently we discussed the scalability issues while running large
>> > > instances of FIO with buffered IO option on NVME block devices here:
>> > >
>> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>> > >
>> > > One of the suggestions Chris Mason gave (during private discussions) was
>> > > to enable large folios in block buffered IO path as that could
>> > > improve the scalability problems and improve the lock contention
>> > > scenarios.
>> > >
>> >
>> > I have no basis to comment on the idea.
>> >
>> > However, it is pretty apparent whatever the situation it is being
>> > heavily disfigured by lock contention in blkdev_llseek:
>> >
>> > > perf-lock contention output
>> > > ---------------------------
>> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>> > > mix it looks like this:
>> > >
>> > > perf-lock contention default
>> > >  contended   total wait     max wait     avg wait         type   caller
>> > >
>> > > 1337359017      64.69 h    769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
>> > >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>> > >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>> > >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>> > >                         0xffffffff8f39e88f  up_write+0x4f
>> > >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >    2665573      64.38 h       1.98 s     86.95 ms      rwsem:W   blkdev_llseek+0x31
>> > >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>> > >                         0xffffffff903f18fb  down_write+0x5b
>> > >                         0xffffffff8f9d5971  blkdev_llseek+0x31
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >                         0xffffffff903dce5e  do_syscall_64+0x7e
>> > >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>> >
>> > Admittedly I'm not familiar with this code, but at a quick glance the
>> > lock can be just straight up removed here?
>> >
>> > static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> > {
>> >         struct inode *bd_inode = bdev_file_inode(file);
>> >         loff_t retval;
>> >
>> >         inode_lock(bd_inode);
>> >         retval = fixed_size_llseek(file, offset, whence,
>> >                                    i_size_read(bd_inode));
>> >         inode_unlock(bd_inode);
>> >         return retval;
>> > }
>> >
>> > At best it stabilizes the size for the duration of the call. Sounds
>> > like it helps nothing since if the size can change, the file offset
>> > will still be altered as if there was no locking?
>> >
>> > Suppose this cannot be avoided to grab the size for whatever reason.
>> >
>> > While the above fio invocation did not work for me, I ran some crapper
>> > which I had in my shell history and according to strace:
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
>> >
>> > ... the lseeks just rewind to the beginning, *definitely* not needing
>> > to know the size. One would have to check but this is most likely the
>> > case in your test as well.
>> >
>> > And for that there is 0 need to grab the size, and consequently the inode lock.
>>
>> That is to say bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
>

Right, we don't need an inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for 32bit SMP case, due to
use of seqcounts.
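
To illustrate the read/write asymmetry, here is a minimal userspace sketch
of the seqcount idea, not the kernel implementation. It assumes the 32-bit
SMP situation, where a 64-bit size cannot be loaded or stored in one step.
All names (demo_inode, demo_inode_init, demo_i_size_write,
demo_i_size_read) are hypothetical, and C11 atomics stand in for the
kernel's seqcount helpers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode on a 32-bit SMP machine: the 64-bit
 * size is kept as two 32-bit halves guarded by a sequence counter. */
struct demo_inode {
	atomic_uint seq;           /* even: stable, odd: write in progress */
	_Atomic uint32_t size_lo;
	_Atomic uint32_t size_hi;
};

static void demo_inode_init(struct demo_inode *inode)
{
	atomic_init(&inode->seq, 0);
	atomic_init(&inode->size_lo, 0);
	atomic_init(&inode->size_hi, 0);
}

/* Writer side. Callers must serialize among themselves (e.g. by holding a
 * lock): two unserialized writers could both read the same counter value,
 * lose an increment, and leave the counter odd so readers spin forever. */
static void demo_i_size_write(struct demo_inode *inode, uint64_t size)
{
	unsigned int s = atomic_load_explicit(&inode->seq, memory_order_relaxed);

	/* Mark the write as in progress (counter becomes odd). */
	atomic_store_explicit(&inode->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&inode->size_lo, (uint32_t)size,
			      memory_order_relaxed);
	atomic_store_explicit(&inode->size_hi, (uint32_t)(size >> 32),
			      memory_order_relaxed);

	/* Publish the new value (counter becomes even again). */
	atomic_store_explicit(&inode->seq, s + 2, memory_order_release);
}

/* Reader side. Takes no lock: it retries until it observes the same even
 * counter value before and after reading both halves. */
static uint64_t demo_i_size_read(struct demo_inode *inode)
{
	unsigned int begin, end;
	uint32_t lo, hi;

	do {
		begin = atomic_load_explicit(&inode->seq, memory_order_acquire);
		lo = atomic_load_explicit(&inode->size_lo, memory_order_relaxed);
		hi = atomic_load_explicit(&inode->size_hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&inode->seq, memory_order_relaxed);
	} while ((begin & 1) || begin != end);

	return ((uint64_t)hi << 32) | lo;
}
```

In this sketch the reader never blocks a writer and needs no lock of its
own, which is the sense in which i_size_read() can go without
inode_lock(). The writer, by contrast, relies on its callers to exclude
each other.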
I guess it would be good to maybe add this in Documentation too rather
than this info just hanging on top of i_size_write()?

References
===========
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/

-ritesh