From: Mateusz Guzik
Date: Wed, 27 Nov 2024 07:13:54 +0100
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
To: Bharata B Rao
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, nikunj@amd.com,
 willy@infradead.org, vbabka@suse.cz, david@redhat.com,
 akpm@linux-foundation.org, yuzhao@google.com, axboe@kernel.dk,
 viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
 joshdon@google.com, clm@meta.com
References: <20241127054737.33351-1-bharata@amd.com>
In-Reply-To: <20241127054737.33351-1-bharata@amd.com>

On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao wrote:
>
> Recently we discussed the scalability issues while running large
> instances of FIO with buffered IO option on NVME block devices here:
>
> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>
> One of the suggestions Chris Mason gave (during private discussions) was
> to enable large folios in block buffered IO path as that could
> improve the scalability problems and improve the lock contention
> scenarios.
>

I have no basis to comment on the idea.

However, it is pretty apparent that, whatever the situation, it is being
heavily disfigured by lock contention in blkdev_llseek:

> perf-lock contention output
> ---------------------------
> The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> mix it looks like this:
>
> perf-lock contention default
>  contended  total wait    max wait    avg wait        type  caller
>
> 1337359017     64.69 h   769.04 us   174.14 us    spinlock  rwsem_wake.isra.0+0x42
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>               0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>               0xffffffff8f39e88f  up_write+0x4f
>               0xffffffff8f9d598e  blkdev_llseek+0x4e
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>    2665573     64.38 h      1.98 s    86.95 ms     rwsem:W  blkdev_llseek+0x31
>               0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>               0xffffffff903f18fb  down_write+0x5b
>               0xffffffff8f9d5971  blkdev_llseek+0x31
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>               0xffffffff903dce5e  do_syscall_64+0x7e
>               0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76

Admittedly I'm not familiar with this code, but at a quick glance the
lock can be just straight up removed here?

  static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
  {
          struct inode *bd_inode = bdev_file_inode(file);
          loff_t retval;

          inode_lock(bd_inode);
          retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
          inode_unlock(bd_inode);
          return retval;
  }

At best it stabilizes the size for the duration of the call.
Sounds like it helps nothing since if the size can change, the file
offset will still be altered as if there was no locking?

Suppose grabbing the size cannot be avoided for whatever reason.

While the above fio invocation did not work for me, I ran some crapper
which I had in my shell history and according to strace:
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271830] lseek(7, 0, SEEK_SET)      = 0

... the lseeks just rewind to the beginning, *definitely* not needing
to know the size. One would have to check but this is most likely the
case in your test as well.

And for that there is 0 need to grab the size, and consequently the inode lock.

>  134057198     14.27 h    35.93 ms   383.14 us    spinlock  clear_shadow_entries+0x57
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f5e7967  clear_shadow_entries+0x57
>               0xffffffff8f5e90e3  mapping_try_invalidate+0x163
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>               0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>               0xffffffff8f9faea1  blkdev_ioctl+0xc1
>   33351524      1.76 h    35.86 ms   190.43 us    spinlock  __remove_mapping+0x5d
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f5ec71d  __remove_mapping+0x5d
>               0xffffffff8f5f9be6  remove_mapping+0x16
>               0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
>               0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>    9448820     14.96 m     1.54 ms    95.01 us    spinlock  folio_lruvec_lock_irqsave+0x64
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>               0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>               0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>               0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>               0xffffffff8f5e7593  deactivate_file_folio+0x43
>               0xffffffff8f5e90b7  mapping_try_invalidate+0x137
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>    1488531     11.07 m     1.07 ms   446.39 us    spinlock  try_to_free_buffers+0x56
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f768c76  try_to_free_buffers+0x56
>               0xffffffff8f5cf647  filemap_release_folio+0x87
>               0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
>               0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>    2556868      6.78 m   474.72 us   159.07 us    spinlock  blkdev_llseek+0x31
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>               0xffffffff903f14c4  rwsem_down_write_slowpath+0x274
>               0xffffffff903f18fb  down_write+0x5b
>               0xffffffff8f9d5971  blkdev_llseek+0x31
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>    2512627      3.75 m   450.96 us    89.55 us    spinlock  blkdev_llseek+0x31
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>               0xffffffff903f12f0  rwsem_down_write_slowpath+0xa0
>               0xffffffff903f18fb  down_write+0x5b
>               0xffffffff8f9d5971  blkdev_llseek+0x31
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>     908184      1.52 m   439.58 us   100.58 us    spinlock  blkdev_llseek+0x31
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>               0xffffffff903f1367  rwsem_down_write_slowpath+0x117
>               0xffffffff903f18fb  down_write+0x5b
>               0xffffffff8f9d5971  blkdev_llseek+0x31
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>        134      1.48 m      1.22 s   663.88 ms       mutex  bdev_release+0x69
>               0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>               0xffffffff903ef863  __mutex_lock_slowpath+0x13
>               0xffffffff903ef8bb  mutex_lock+0x3b
>               0xffffffff8f9d5249  bdev_release+0x69
>               0xffffffff8f9d5921  blkdev_release+0x11
>               0xffffffff8f7089f3  __fput+0xe3
>               0xffffffff8f708c9b  __fput_sync+0x1b
>               0xffffffff8f6fe8ed  __x64_sys_close+0x3d
>
>
> perf-lock contention patched
>  contended  total wait    max wait    avg wait        type  caller
>
>    1153627     40.15 h     48.67 s   125.30 ms     rwsem:W  blkdev_llseek+0x31
>               0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>               0xffffffff903f18fb  down_write+0x5b
>               0xffffffff8f9d5971  blkdev_llseek+0x31
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>               0xffffffff903dce5e  do_syscall_64+0x7e
>               0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>  276512439     39.19 h    46.90 ms   510.22 us    spinlock  clear_shadow_entries+0x57
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f5e7967  clear_shadow_entries+0x57
>               0xffffffff8f5e90e3  mapping_try_invalidate+0x163
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>               0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>               0xffffffff8f9faea1  blkdev_ioctl+0xc1
>  763119320     26.37 h   887.44 us   124.38 us    spinlock  rwsem_wake.isra.0+0x42
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>               0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>               0xffffffff8f39e88f  up_write+0x4f
>               0xffffffff8f9d598e  blkdev_llseek+0x4e
>               0xffffffff8f703322  ksys_lseek+0x72
>               0xffffffff8f7033a8  __x64_sys_lseek+0x18
>               0xffffffff8f20b983  x64_sys_call+0x1fb3
>   33263910      2.87 h    29.43 ms   310.56 us    spinlock  __remove_mapping+0x5d
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f5ec71d  __remove_mapping+0x5d
>               0xffffffff8f5f9be6  remove_mapping+0x16
>               0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
>               0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>   58671816      2.50 h   519.68 us   153.45 us    spinlock  folio_lruvec_lock_irqsave+0x64
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>               0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>               0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>               0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>               0xffffffff8f5e7593  deactivate_file_folio+0x43
>               0xffffffff8f5e90b7  mapping_try_invalidate+0x137
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>        284     22.33 m      5.35 s      4.72 s       mutex  bdev_release+0x69
>               0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>               0xffffffff903ef863  __mutex_lock_slowpath+0x13
>               0xffffffff903ef8bb  mutex_lock+0x3b
>               0xffffffff8f9d5249  bdev_release+0x69
>               0xffffffff8f9d5921  blkdev_release+0x11
>               0xffffffff8f7089f3  __fput+0xe3
>               0xffffffff8f708c9b  __fput_sync+0x1b
>               0xffffffff8f6fe8ed  __x64_sys_close+0x3d
>    2181469     21.38 m     1.15 ms   587.98 us    spinlock  try_to_free_buffers+0x56
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f768c76  try_to_free_buffers+0x56
>               0xffffffff8f5cf647  filemap_release_folio+0x87
>               0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
>               0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>               0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>               0xffffffff8f9d3872  invalidate_bdev+0x42
>     454398      4.22 m    37.54 ms   557.13 us    spinlock  __remove_mapping+0x5d
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f5c7f  _raw_spin_lock+0x3f
>               0xffffffff8f5ec71d  __remove_mapping+0x5d
>               0xffffffff8f5f4f04  shrink_folio_list+0xbc4
>               0xffffffff8f5f5a6b  evict_folios+0x34b
>               0xffffffff8f5f772f  try_to_shrink_lruvec+0x20f
>               0xffffffff8f5f79ef  shrink_one+0x10f
>               0xffffffff8f5fb975  shrink_node+0xb45
>        773      3.53 m      2.60 s   273.76 ms       mutex  __lru_add_drain_all+0x3a
>               0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>               0xffffffff903ef863  __mutex_lock_slowpath+0x13
>               0xffffffff903ef8bb  mutex_lock+0x3b
>               0xffffffff8f5e3d7a  __lru_add_drain_all+0x3a
>               0xffffffff8f5e77a0  lru_add_drain_all+0x10
>               0xffffffff8f9d3861  invalidate_bdev+0x31
>               0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>               0xffffffff8f9faea1  blkdev_ioctl+0xc1
>    1997851      3.09 m   651.65 us    92.83 us    spinlock  folio_lruvec_lock_irqsave+0x64
>               0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>               0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>               0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>               0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>               0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>               0xffffffff8f5e5ae4  folio_add_lru+0x54
>               0xffffffff8f5d075d  filemap_add_folio+0xcd
>               0xffffffff8f5e30c0  page_cache_ra_order+0x220
>
> Observations from perf-lock contention
> --------------------------------------
> - Significant reduction of contention for inode_lock (inode->i_rwsem)
>   from blkdev_llseek() path.
> - Significant increase in contention for inode->i_lock from invalidate
>   and remove_mapping paths.
> - Significant increase in contention for lruvec spinlock from
>   deactivate_file_folio path.
>
> Request comments on the above and I am specifically looking for inputs
> on these:
>
> - Lock contention results and usefulness of large folios in bringing
>   down the contention in this specific case.
> - If enabling large folios in block buffered IO path is a feasible
>   approach, inputs on doing this cleanly and correctly.
>
> Bharata B Rao (1):
>   block/ioctl: Add an ioctl to enable large folios for block buffered IO
>     path
>
>  block/ioctl.c           | 8 ++++++++
>  include/uapi/linux/fs.h | 2 ++
>  2 files changed, 10 insertions(+)
>
> --
> 2.34.1
>

-- 
Mateusz Guzik
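
The point about the rewinding lseeks can be sketched with a small
userspace model in C. This is not the kernel's fixed_size_llseek():
struct model_file, model_llseek() and model_seek_is_size_independent()
are hypothetical names made up for illustration, and the bounds check
(reject targets below 0 or above the size) is an assumption about the
intended semantics. It only shows that a SEEK_SET to offset 0 gives the
same result for any non-negative size, while SEEK_END really depends on
the size.

```c
#include <errno.h>      /* EINVAL */
#include <stdio.h>      /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <sys/types.h>  /* off_t */

/* Hypothetical userspace stand-in for an open file; not a kernel type. */
struct model_file {
	off_t pos;	/* current file offset */
};

/*
 * Simplified model of a seek on a fixed-size object.
 * `size` is a snapshot of the object size, assumed to be >= 0.
 * Returns the new offset, or -EINVAL if the request is out of bounds.
 * No overflow handling: this is an illustration only.
 */
off_t model_llseek(struct model_file *f, off_t offset, int whence, off_t size)
{
	off_t target;

	switch (whence) {
	case SEEK_SET:
		target = offset;		/* size not used to compute target */
		break;
	case SEEK_CUR:
		target = f->pos + offset;	/* size not used to compute target */
		break;
	case SEEK_END:
		target = size + offset;		/* the only case built from size */
		break;
	default:
		return -EINVAL;
	}

	/* Assumed bounds check: the size only acts as an upper limit here. */
	if (target < 0 || target > size)
		return -EINVAL;

	f->pos = target;
	return target;
}

/*
 * A rewind (SEEK_SET to 0) lands on 0 for every non-negative size, so in
 * this model its outcome cannot depend on the size snapshot at all.
 */
int model_seek_is_size_independent(off_t offset, int whence)
{
	return whence == SEEK_SET && offset == 0;
}
```

In this model an lseek(fd, 0, SEEK_SET) maps to
model_llseek(&f, 0, SEEK_SET, size), which returns 0 whatever size is
passed in. Whether the real code can skip reading the size (and the
lock) for that case would still have to be checked against the kernel
source.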