* [RFC PATCH 0/1] Large folios in block buffered IO path
@ 2024-11-27 5:47 Bharata B Rao
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
0 siblings, 2 replies; 24+ messages in thread
From: Bharata B Rao @ 2024-11-27 5:47 UTC (permalink / raw)
To: linux-block, linux-kernel, linux-fsdevel, linux-mm
Cc: nikunj, willy, vbabka, david, akpm, yuzhao, mjguzik, axboe, viro,
brauner, jack, joshdon, clm, Bharata B Rao
Recently we discussed the scalability issues seen when running a large
number of FIO instances with buffered IO on NVMe block devices here:
https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
One of the suggestions Chris Mason gave (during private discussions) was
to enable large folios in the block buffered IO path, as that could
alleviate the scalability problems and reduce the lock contention.
This is an attempt to check the feasibility and potential benefit of the
same. To keep the changes to a minimum, and to test this non-disruptively
for only the required block device, I have added an ioctl that sets large
folios support on the block device mapping. I understand that this is not
the right way to do this, but it is just an experiment to evaluate the
potential benefit.
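For reference, below is a minimal sketch of how userspace could issue the
new ioctl on a device before starting buffered IO (the actual FIO
modification is not part of this series, and the device path is only an
example):

/*
 * Minimal userspace sketch: enable large folios on a raw block device.
 * BLKSETLFOLIO is the ioctl introduced in patch 1/1; the device path
 * below is just an example.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef BLKSETLFOLIO
#define BLKSETLFOLIO _IO(0x12, 129)	/* from patch 1/1 */
#endif

int main(void)
{
	int fd = open("/dev/nvme2n1p1", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKSETLFOLIO) < 0)
		perror("ioctl(BLKSETLFOLIO)");
	close(fd);
	return 0;
}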
Experimental setup
------------------
2-node AMD EPYC Zen5 based server with 512G memory in each node.
Disk layout for FIO:
nvme2n1 259:12 0 3.5T 0 disk
├─nvme2n1p1 259:13 0 894.3G 0 part
├─nvme2n1p2 259:14 0 894.3G 0 part
├─nvme2n1p3 259:15 0 894.3G 0 part
└─nvme2n1p4 259:16 0 894.1G 0 part
Four parallel instances of FIO are run on the above 4 partitions with
the following options:
-filename=/dev/nvme2n1p[1,2,3,4] -direct=0 -thread -size=800G -rw=rw -rwmixwrite=[10,30,50] --norandommap --randrepeat=0 -ioengine=sync -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting
Results
-------
default: Unmodified kernel and FIO.
patched: Kernel with the BLKSETLFOLIO ioctl (introduced in this patchset)
and FIO modified to issue that ioctl.
In the below table, r is READ bw and w is WRITE bw reported by FIO.
default patched
ro (w/o -rw=rw option)
Instance 1 r=12.3GiB/s r=39.4GiB/s
Instance 2 r=12.2GiB/s r=39.1GiB/s
Instance 3 r=16.3GiB/s r=37.1GiB/s
Instance 4 r=14.9GiB/s r=42.9GiB/s
rwmixwrite=10%
Instance 1 r=27.5GiB/s,w=3125MiB/s r=75.9GiB/s,w=8636MiB/s
Instance 2 r=25.5GiB/s,w=2898MiB/s r=87.6GiB/s,w=9967MiB/s
Instance 3 r=25.7GiB/s,w=2922MiB/s r=78.3GiB/s,w=8904MiB/s
Instance 4 r=27.5GiB/s,w=3134MiB/s r=73.5GiB/s,w=8365MiB/s
rwmixwrite=30%
Instance 1 r=55.7GiB/s,w=23.9GiB/s r=59.2GiB/s,w=25.4GiB/s
Instance 2 r=38.5GiB/s,w=16.5GiB/s r=57.6GiB/s,w=24.7GiB/s
Instance 3 r=37.5GiB/s,w=16.1GiB/s r=59.5GiB/s,w=25.5GiB/s
Instance 4 r=37.4GiB/s,w=16.0GiB/s r=63.3GiB/s,w=27.1GiB/s
rwmixwrite=50%
Instance 1 r=37.1GiB/s,w=37.1GiB/s r=40.7GiB/s,w=40.7GiB/s
Instance 2 r=37.6GiB/s,w=37.6GiB/s r=45.9GiB/s,w=45.9GiB/s
Instance 3 r=35.1GiB/s,w=35.1GiB/s r=49.2GiB/s,w=49.2GiB/s
Instance 4 r=43.6GiB/s,w=43.6GiB/s r=41.2GiB/s,w=41.2GiB/s
Summary of FIO throughput
-------------------------
- Significant increase (~3x) in bandwidth for the ro case.
- Significant increase (~3x) in bandwidth for rwmixwrite=10%.
- Good gains (~1.15x to 1.5x) for rwmixwrite=30% and 50%.
perf-lock contention output
---------------------------
The lock contention data doesn't look all that conclusive, but for the 30%
rwmixwrite case it looks like this:
perf-lock contention default
contended total wait max wait avg wait type caller
1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
0xffffffff8f39e88f up_write+0x4f
0xffffffff8f9d598e blkdev_llseek+0x4e
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
0xffffffff903dce5e do_syscall_64+0x7e
0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
134057198 14.27 h 35.93 ms 383.14 us spinlock clear_shadow_entries+0x57
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5e7967 clear_shadow_entries+0x57
0xffffffff8f5e90e3 mapping_try_invalidate+0x163
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
33351524 1.76 h 35.86 ms 190.43 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f9be6 remove_mapping+0x16
0xffffffff8f5e8f5b mapping_evict_folio+0x7b
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
9448820 14.96 m 1.54 ms 95.01 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e7593 deactivate_file_folio+0x43
0xffffffff8f5e90b7 mapping_try_invalidate+0x137
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
1488531 11.07 m 1.07 ms 446.39 us spinlock try_to_free_buffers+0x56
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f768c76 try_to_free_buffers+0x56
0xffffffff8f5cf647 filemap_release_folio+0x87
0xffffffff8f5e8f4c mapping_evict_folio+0x6c
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
2556868 6.78 m 474.72 us 159.07 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f14c4 rwsem_down_write_slowpath+0x274
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
2512627 3.75 m 450.96 us 89.55 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f12f0 rwsem_down_write_slowpath+0xa0
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
908184 1.52 m 439.58 us 100.58 us spinlock blkdev_llseek+0x31
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5d01 _raw_spin_lock_irq+0x51
0xffffffff903f1367 rwsem_down_write_slowpath+0x117
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
134 1.48 m 1.22 s 663.88 ms mutex bdev_release+0x69
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f9d5249 bdev_release+0x69
0xffffffff8f9d5921 blkdev_release+0x11
0xffffffff8f7089f3 __fput+0xe3
0xffffffff8f708c9b __fput_sync+0x1b
0xffffffff8f6fe8ed __x64_sys_close+0x3d
perf-lock contention patched
contended total wait max wait avg wait type caller
1153627 40.15 h 48.67 s 125.30 ms rwsem:W blkdev_llseek+0x31
0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
0xffffffff903f18fb down_write+0x5b
0xffffffff8f9d5971 blkdev_llseek+0x31
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
0xffffffff903dce5e do_syscall_64+0x7e
0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
276512439 39.19 h 46.90 ms 510.22 us spinlock clear_shadow_entries+0x57
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5e7967 clear_shadow_entries+0x57
0xffffffff8f5e90e3 mapping_try_invalidate+0x163
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
763119320 26.37 h 887.44 us 124.38 us spinlock rwsem_wake.isra.0+0x42
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
0xffffffff8f39e88f up_write+0x4f
0xffffffff8f9d598e blkdev_llseek+0x4e
0xffffffff8f703322 ksys_lseek+0x72
0xffffffff8f7033a8 __x64_sys_lseek+0x18
0xffffffff8f20b983 x64_sys_call+0x1fb3
33263910 2.87 h 29.43 ms 310.56 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f9be6 remove_mapping+0x16
0xffffffff8f5e8f5b mapping_evict_folio+0x7b
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
58671816 2.50 h 519.68 us 153.45 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e7593 deactivate_file_folio+0x43
0xffffffff8f5e90b7 mapping_try_invalidate+0x137
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
284 22.33 m 5.35 s 4.72 s mutex bdev_release+0x69
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f9d5249 bdev_release+0x69
0xffffffff8f9d5921 blkdev_release+0x11
0xffffffff8f7089f3 __fput+0xe3
0xffffffff8f708c9b __fput_sync+0x1b
0xffffffff8f6fe8ed __x64_sys_close+0x3d
2181469 21.38 m 1.15 ms 587.98 us spinlock try_to_free_buffers+0x56
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f768c76 try_to_free_buffers+0x56
0xffffffff8f5cf647 filemap_release_folio+0x87
0xffffffff8f5e8f4c mapping_evict_folio+0x6c
0xffffffff8f5e9068 mapping_try_invalidate+0xe8
0xffffffff8f5e9160 invalidate_mapping_pages+0x10
0xffffffff8f9d3872 invalidate_bdev+0x42
454398 4.22 m 37.54 ms 557.13 us spinlock __remove_mapping+0x5d
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f5c7f _raw_spin_lock+0x3f
0xffffffff8f5ec71d __remove_mapping+0x5d
0xffffffff8f5f4f04 shrink_folio_list+0xbc4
0xffffffff8f5f5a6b evict_folios+0x34b
0xffffffff8f5f772f try_to_shrink_lruvec+0x20f
0xffffffff8f5f79ef shrink_one+0x10f
0xffffffff8f5fb975 shrink_node+0xb45
773 3.53 m 2.60 s 273.76 ms mutex __lru_add_drain_all+0x3a
0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
0xffffffff903ef863 __mutex_lock_slowpath+0x13
0xffffffff903ef8bb mutex_lock+0x3b
0xffffffff8f5e3d7a __lru_add_drain_all+0x3a
0xffffffff8f5e77a0 lru_add_drain_all+0x10
0xffffffff8f9d3861 invalidate_bdev+0x31
0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
0xffffffff8f9faea1 blkdev_ioctl+0xc1
1997851 3.09 m 651.65 us 92.83 us spinlock folio_lruvec_lock_irqsave+0x64
0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
0xffffffff8f5e587c folio_batch_move_lru+0x5c
0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
0xffffffff8f5e5ae4 folio_add_lru+0x54
0xffffffff8f5d075d filemap_add_folio+0xcd
0xffffffff8f5e30c0 page_cache_ra_order+0x220
Observations from perf-lock contention
--------------------------------------
- Significant reduction in contention for the inode lock (inode->i_rwsem)
from the blkdev_llseek() path.
- Significant increase in contention for inode->i_lock from the invalidate
and remove_mapping paths.
- Significant increase in contention for the lruvec spinlock from the
deactivate_file_folio path.
Requesting comments on the above; I am specifically looking for inputs
on these:
- The lock contention results and the usefulness of large folios in
bringing down the contention in this specific case.
- If enabling large folios in the block buffered IO path is a feasible
approach, inputs on doing this cleanly and correctly.
Bharata B Rao (1):
block/ioctl: Add an ioctl to enable large folios for block buffered IO
path
block/ioctl.c | 8 ++++++++
include/uapi/linux/fs.h | 2 ++
2 files changed, 10 insertions(+)
--
2.34.1
* [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for block buffered IO path
2024-11-27 5:47 [RFC PATCH 0/1] Large folios in block buffered IO path Bharata B Rao
@ 2024-11-27 5:47 ` Bharata B Rao
2024-11-27 6:26 ` Christoph Hellwig
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
1 sibling, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-11-27 5:47 UTC (permalink / raw)
To: linux-block, linux-kernel, linux-fsdevel, linux-mm
Cc: nikunj, willy, vbabka, david, akpm, yuzhao, mjguzik, axboe, viro,
brauner, jack, joshdon, clm, Bharata B Rao
In order to experiment with using large folios for block device read/write
operations, expose an ioctl that userspace can selectively use on raw
block devices.
For the write path, this forces the iomap layer to provision large
folios (via iomap_file_buffered_write()).
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
block/ioctl.c | 8 ++++++++
include/uapi/linux/fs.h | 2 ++
2 files changed, 10 insertions(+)
diff --git a/block/ioctl.c b/block/ioctl.c
index 6554b728bae6..6af26a08ef34 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -548,6 +548,12 @@ static int blkdev_bszset(struct file *file, blk_mode_t mode,
return ret;
}
+static int blkdev_set_large_folio(struct block_device *bdev)
+{
+ mapping_set_large_folios(bdev->bd_mapping);
+ return 0;
+}
+
/*
* Common commands that are handled the same way on native and compat
* user space. Note the separate arg/argp parameters that are needed
@@ -632,6 +638,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blkdev_pr_preempt(bdev, mode, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, mode, argp);
+ case BLKSETLFOLIO:
+ return blkdev_set_large_folio(bdev);
default:
return -ENOIOCTLCMD;
}
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..5c8a326b68a1 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -203,6 +203,8 @@ struct fsxattr {
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKSETLFOLIO _IO(0x12, 129)
+
/*
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
--
2.34.1
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 5:47 [RFC PATCH 0/1] Large folios in block buffered IO path Bharata B Rao
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
@ 2024-11-27 6:13 ` Mateusz Guzik
2024-11-27 6:19 ` Mateusz Guzik
1 sibling, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-27 6:13 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Recently we discussed the scalability issues while running large
> instances of FIO with buffered IO option on NVME block devices here:
>
> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>
> One of the suggestions Chris Mason gave (during private discussions) was
> to enable large folios in block buffered IO path as that could
> improve the scalability problems and improve the lock contention
> scenarios.
>
I have no basis to comment on the idea.
However, it is pretty apparent whatever the situation it is being
heavily disfigured by lock contention in blkdev_llseek:
> perf-lock contention output
> ---------------------------
> The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> mix it looks like this:
>
> perf-lock contention default
> contended total wait max wait avg wait type caller
>
> 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> 0xffffffff8f39e88f up_write+0x4f
> 0xffffffff8f9d598e blkdev_llseek+0x4e
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
> 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> 0xffffffff903f18fb down_write+0x5b
> 0xffffffff8f9d5971 blkdev_llseek+0x31
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 0xffffffff903dce5e do_syscall_64+0x7e
> 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
Admittedly I'm not familiar with this code, but at a quick glance the
lock can be just straight up removed here?
534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
535 {
536 │ struct inode *bd_inode = bdev_file_inode(file);
537 │ loff_t retval;
538 │
539 │ inode_lock(bd_inode);
540 │ retval = fixed_size_llseek(file, offset, whence,
i_size_read(bd_inode));
541 │ inode_unlock(bd_inode);
542 │ return retval;
543 }
At best it stabilizes the size for the duration of the call. Sounds
like it helps nothing since if the size can change, the file offset
will still be altered as if there was no locking?
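For reference, a rough sketch of mainline fixed_size_llseek() (paraphrased
from fs/read_write.c, not copied from this thread): the size is only used
as the clamp handed down to generic_file_llseek_size(), so the lock merely
stabilizes that snapshot:

/* Rough sketch of fixed_size_llseek(); the size passed in only bounds
 * the offset via generic_file_llseek_size().
 */
loff_t fixed_size_llseek(struct file *file, loff_t offset, int whence,
			 loff_t size)
{
	switch (whence) {
	case SEEK_SET: case SEEK_CUR: case SEEK_END:
		return generic_file_llseek_size(file, offset, whence,
						size, size);
	default:
		return -EINVAL;
	}
}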
Suppose this cannot be avoided to grab the size for whatever reason.
While the above fio invocation did not work for me, I ran some crapper
which I had in my shell history and according to strace:
[pid 271829] lseek(7, 0, SEEK_SET) = 0
[pid 271829] lseek(7, 0, SEEK_SET) = 0
[pid 271830] lseek(7, 0, SEEK_SET) = 0
... the lseeks just rewind to the beginning, *definitely* not needing
to know the size. One would have to check but this is most likely the
case in your test as well.
And for that there is 0 need to grab the size, and consequently the inode lock.
> 134057198 14.27 h 35.93 ms 383.14 us spinlock clear_shadow_entries+0x57
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f5e7967 clear_shadow_entries+0x57
> 0xffffffff8f5e90e3 mapping_try_invalidate+0x163
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
> 0xffffffff8f9faea1 blkdev_ioctl+0xc1
> 33351524 1.76 h 35.86 ms 190.43 us spinlock __remove_mapping+0x5d
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f5ec71d __remove_mapping+0x5d
> 0xffffffff8f5f9be6 remove_mapping+0x16
> 0xffffffff8f5e8f5b mapping_evict_folio+0x7b
> 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 9448820 14.96 m 1.54 ms 95.01 us spinlock folio_lruvec_lock_irqsave+0x64
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
> 0xffffffff8f5e587c folio_batch_move_lru+0x5c
> 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
> 0xffffffff8f5e7593 deactivate_file_folio+0x43
> 0xffffffff8f5e90b7 mapping_try_invalidate+0x137
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 1488531 11.07 m 1.07 ms 446.39 us spinlock try_to_free_buffers+0x56
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f768c76 try_to_free_buffers+0x56
> 0xffffffff8f5cf647 filemap_release_folio+0x87
> 0xffffffff8f5e8f4c mapping_evict_folio+0x6c
> 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 2556868 6.78 m 474.72 us 159.07 us spinlock blkdev_llseek+0x31
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
> 0xffffffff903f14c4 rwsem_down_write_slowpath+0x274
> 0xffffffff903f18fb down_write+0x5b
> 0xffffffff8f9d5971 blkdev_llseek+0x31
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 2512627 3.75 m 450.96 us 89.55 us spinlock blkdev_llseek+0x31
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
> 0xffffffff903f12f0 rwsem_down_write_slowpath+0xa0
> 0xffffffff903f18fb down_write+0x5b
> 0xffffffff8f9d5971 blkdev_llseek+0x31
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 908184 1.52 m 439.58 us 100.58 us spinlock blkdev_llseek+0x31
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5d01 _raw_spin_lock_irq+0x51
> 0xffffffff903f1367 rwsem_down_write_slowpath+0x117
> 0xffffffff903f18fb down_write+0x5b
> 0xffffffff8f9d5971 blkdev_llseek+0x31
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 134 1.48 m 1.22 s 663.88 ms mutex bdev_release+0x69
> 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
> 0xffffffff903ef863 __mutex_lock_slowpath+0x13
> 0xffffffff903ef8bb mutex_lock+0x3b
> 0xffffffff8f9d5249 bdev_release+0x69
> 0xffffffff8f9d5921 blkdev_release+0x11
> 0xffffffff8f7089f3 __fput+0xe3
> 0xffffffff8f708c9b __fput_sync+0x1b
> 0xffffffff8f6fe8ed __x64_sys_close+0x3d
>
>
> perf-lock contention patched
> contended total wait max wait avg wait type caller
>
> 1153627 40.15 h 48.67 s 125.30 ms rwsem:W blkdev_llseek+0x31
> 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> 0xffffffff903f18fb down_write+0x5b
> 0xffffffff8f9d5971 blkdev_llseek+0x31
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 0xffffffff903dce5e do_syscall_64+0x7e
> 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
> 276512439 39.19 h 46.90 ms 510.22 us spinlock clear_shadow_entries+0x57
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f5e7967 clear_shadow_entries+0x57
> 0xffffffff8f5e90e3 mapping_try_invalidate+0x163
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
> 0xffffffff8f9faea1 blkdev_ioctl+0xc1
> 763119320 26.37 h 887.44 us 124.38 us spinlock rwsem_wake.isra.0+0x42
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> 0xffffffff8f39e88f up_write+0x4f
> 0xffffffff8f9d598e blkdev_llseek+0x4e
> 0xffffffff8f703322 ksys_lseek+0x72
> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> 0xffffffff8f20b983 x64_sys_call+0x1fb3
> 33263910 2.87 h 29.43 ms 310.56 us spinlock __remove_mapping+0x5d
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f5ec71d __remove_mapping+0x5d
> 0xffffffff8f5f9be6 remove_mapping+0x16
> 0xffffffff8f5e8f5b mapping_evict_folio+0x7b
> 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 58671816 2.50 h 519.68 us 153.45 us spinlock folio_lruvec_lock_irqsave+0x64
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
> 0xffffffff8f5e587c folio_batch_move_lru+0x5c
> 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
> 0xffffffff8f5e7593 deactivate_file_folio+0x43
> 0xffffffff8f5e90b7 mapping_try_invalidate+0x137
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 284 22.33 m 5.35 s 4.72 s mutex bdev_release+0x69
> 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
> 0xffffffff903ef863 __mutex_lock_slowpath+0x13
> 0xffffffff903ef8bb mutex_lock+0x3b
> 0xffffffff8f9d5249 bdev_release+0x69
> 0xffffffff8f9d5921 blkdev_release+0x11
> 0xffffffff8f7089f3 __fput+0xe3
> 0xffffffff8f708c9b __fput_sync+0x1b
> 0xffffffff8f6fe8ed __x64_sys_close+0x3d
> 2181469 21.38 m 1.15 ms 587.98 us spinlock try_to_free_buffers+0x56
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f768c76 try_to_free_buffers+0x56
> 0xffffffff8f5cf647 filemap_release_folio+0x87
> 0xffffffff8f5e8f4c mapping_evict_folio+0x6c
> 0xffffffff8f5e9068 mapping_try_invalidate+0xe8
> 0xffffffff8f5e9160 invalidate_mapping_pages+0x10
> 0xffffffff8f9d3872 invalidate_bdev+0x42
> 454398 4.22 m 37.54 ms 557.13 us spinlock __remove_mapping+0x5d
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f5c7f _raw_spin_lock+0x3f
> 0xffffffff8f5ec71d __remove_mapping+0x5d
> 0xffffffff8f5f4f04 shrink_folio_list+0xbc4
> 0xffffffff8f5f5a6b evict_folios+0x34b
> 0xffffffff8f5f772f try_to_shrink_lruvec+0x20f
> 0xffffffff8f5f79ef shrink_one+0x10f
> 0xffffffff8f5fb975 shrink_node+0xb45
> 773 3.53 m 2.60 s 273.76 ms mutex __lru_add_drain_all+0x3a
> 0xffffffff903ef1de __mutex_lock.constprop.0+0x17e
> 0xffffffff903ef863 __mutex_lock_slowpath+0x13
> 0xffffffff903ef8bb mutex_lock+0x3b
> 0xffffffff8f5e3d7a __lru_add_drain_all+0x3a
> 0xffffffff8f5e77a0 lru_add_drain_all+0x10
> 0xffffffff8f9d3861 invalidate_bdev+0x31
> 0xffffffff8f9fac3e blkdev_common_ioctl+0x9ae
> 0xffffffff8f9faea1 blkdev_ioctl+0xc1
> 1997851 3.09 m 651.65 us 92.83 us spinlock folio_lruvec_lock_irqsave+0x64
> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> 0xffffffff8f6e3ed4 folio_lruvec_lock_irqsave+0x64
> 0xffffffff8f5e587c folio_batch_move_lru+0x5c
> 0xffffffff8f5e5a41 __folio_batch_add_and_move+0xd1
> 0xffffffff8f5e5ae4 folio_add_lru+0x54
> 0xffffffff8f5d075d filemap_add_folio+0xcd
> 0xffffffff8f5e30c0 page_cache_ra_order+0x220
>
> Observations from perf-lock contention
> --------------------------------------
> - Significant reduction of contention for inode_lock (inode->i_rwsem)
> from blkdev_llseek() path.
> - Significant increase in contention for inode->i_lock from invalidate
> and remove_mapping paths.
> - Significant increase in contention for lruvec spinlock from
> deactivate_file_folio path.
>
> Request comments on the above and I am specifically looking for inputs
> on these:
>
> - Lock contention results and usefulness of large folios in bringing
> down the contention in this specific case.
> - If enabling large folios in block buffered IO path is a feasible
> approach, inputs on doing this cleanly and correctly.
>
> Bharata B Rao (1):
> block/ioctl: Add an ioctl to enable large folios for block buffered IO
> path
>
> block/ioctl.c | 8 ++++++++
> include/uapi/linux/fs.h | 2 ++
> 2 files changed, 10 insertions(+)
>
> --
> 2.34.1
>
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
@ 2024-11-27 6:19 ` Mateusz Guzik
2024-11-27 12:02 ` Jan Kara
2024-11-27 12:18 ` Bharata B Rao
0 siblings, 2 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-27 6:19 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> >
> > Recently we discussed the scalability issues while running large
> > instances of FIO with buffered IO option on NVME block devices here:
> >
> > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> >
> > One of the suggestions Chris Mason gave (during private discussions) was
> > to enable large folios in block buffered IO path as that could
> > improve the scalability problems and improve the lock contention
> > scenarios.
> >
>
> I have no basis to comment on the idea.
>
> However, it is pretty apparent whatever the situation it is being
> heavily disfigured by lock contention in blkdev_llseek:
>
> > perf-lock contention output
> > ---------------------------
> > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > mix it looks like this:
> >
> > perf-lock contention default
> > contended total wait max wait avg wait type caller
> >
> > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
> > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> > 0xffffffff8f39e88f up_write+0x4f
> > 0xffffffff8f9d598e blkdev_llseek+0x4e
> > 0xffffffff8f703322 ksys_lseek+0x72
> > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
> > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> > 0xffffffff903f18fb down_write+0x5b
> > 0xffffffff8f9d5971 blkdev_llseek+0x31
> > 0xffffffff8f703322 ksys_lseek+0x72
> > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > 0xffffffff903dce5e do_syscall_64+0x7e
> > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
>
> Admittedly I'm not familiar with this code, but at a quick glance the
> lock can be just straight up removed here?
>
> 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> 535 {
> 536 │ struct inode *bd_inode = bdev_file_inode(file);
> 537 │ loff_t retval;
> 538 │
> 539 │ inode_lock(bd_inode);
> 540 │ retval = fixed_size_llseek(file, offset, whence,
> i_size_read(bd_inode));
> 541 │ inode_unlock(bd_inode);
> 542 │ return retval;
> 543 }
>
> At best it stabilizes the size for the duration of the call. Sounds
> like it helps nothing since if the size can change, the file offset
> will still be altered as if there was no locking?
>
> Suppose this cannot be avoided to grab the size for whatever reason.
>
> While the above fio invocation did not work for me, I ran some crapper
> which I had in my shell history and according to strace:
> [pid 271829] lseek(7, 0, SEEK_SET) = 0
> [pid 271829] lseek(7, 0, SEEK_SET) = 0
> [pid 271830] lseek(7, 0, SEEK_SET) = 0
>
> ... the lseeks just rewind to the beginning, *definitely* not needing
> to know the size. One would have to check but this is most likely the
> case in your test as well.
>
> And for that there is 0 need to grab the size, and consequently the inode lock.
That is to say bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:
diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
{
struct inode *bd_inode = bdev_file_inode(file);
- loff_t retval;
- inode_lock(bd_inode);
- retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
- inode_unlock(bd_inode);
- return retval;
+ return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
}
static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
To be aborted if it blows up (but I don't see why it would).
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for block buffered IO path
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
@ 2024-11-27 6:26 ` Christoph Hellwig
2024-11-27 10:37 ` Bharata B Rao
0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-11-27 6:26 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, mjguzik, axboe, viro,
brauner, jack, joshdon, clm
On Wed, Nov 27, 2024 at 11:17:37AM +0530, Bharata B Rao wrote:
> In order to experiment using large folios for block devices read/write
> operations, expose an ioctl that userspace can selectively use on the
> raw block devices.
>
> For the write path, this forces iomap layer to provision large
> folios (via iomap_file_buffered_write()).
Well, unless CONFIG_BUFFER_HEAD is disabled, the block device uses
the buffer head based write path, which currently doesn't fully
support large folios (although there is a series out to do so on
fsdevel right now), so I don't think this will fully work.
But the more important problem, and the reason why we don't use
the non-buffer_head path by default is that the block device mapping
is reused by a lot of file systems, which are not aware of large
folios, and will get utterly confused. So if we want to do anything
smart on the block device mapping, we'll have to ensure we're back
to state compatible with these file systems before calling into
their mount code, and stick to the old code while file systems are
mounted.
Of course the real question is: why do you care about buffered
I/O performance on the block device node?
* Re: [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for block buffered IO path
2024-11-27 6:26 ` Christoph Hellwig
@ 2024-11-27 10:37 ` Bharata B Rao
2024-11-28 5:43 ` Christoph Hellwig
0 siblings, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-11-27 10:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, mjguzik, axboe, viro,
brauner, jack, joshdon, clm
On 27-Nov-24 11:56 AM, Christoph Hellwig wrote:
> On Wed, Nov 27, 2024 at 11:17:37AM +0530, Bharata B Rao wrote:
>> In order to experiment using large folios for block devices read/write
>> operations, expose an ioctl that userspace can selectively use on the
>> raw block devices.
>>
>> For the write path, this forces iomap layer to provision large
>> folios (via iomap_file_buffered_write()).
>
> Well, unless CONFIG_BUFFER_HEAD is disabled, the block device uses
> the buffer head based write path, which currently doesn't fully
> support large folios (although there is series out to do so on
> fsdevel right now), so I don't think this will fully work.
I believe you are referring to the patchset that enables bs > ps for
block devices -
https://lore.kernel.org/linux-fsdevel/20241113094727.1497722-1-mcgrof@kernel.org/
With the above patchset, the block device can use the buffer head based
write path without disabling CONFIG_BUFFER_HEAD, and that is a
pre-requisite for the buffered IO path in the block layer
(blkdev_buffered_write()) to correctly/fully use large folios. Did I get
that right?
>
> But the more important problem, and the reason why we don't use
> the non-buffer_head path by default is that the block device mapping
> is reused by a lot of file systems, which are not aware of large
> folios, and will get utterly confused. So if we want to do anything
> smart on the block device mapping, we'll have to ensure we're back
> to state compatible with these file systems before calling into
> their mount code, and stick to the old code while file systems are
> mounted.
In fact I was trying to see if it is possible to advertise large folio
support in the bdev mapping only for those block devices which don't have
a filesystem mounted on them. But apparently it was not so straightforward
and my initial attempt at this resulted in filesystem corruption. Hence I
resorted to the current ioctl approach as a way to showcase the problem
and the potential benefit.
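Just to illustrate the direction (this is not the code that was actually
tried; the bd_holder check below is only an assumed, likely racy heuristic
for "nothing has claimed this device"), the gating would look roughly like:

/*
 * Hypothetical sketch only: advertise large folio support just for block
 * devices that currently have no holder (e.g. no filesystem mounted on
 * them). Checking bd_holder like this without further locking is racy
 * and is shown only to illustrate the idea.
 */
static int blkdev_set_large_folio(struct block_device *bdev)
{
	if (bdev->bd_holder)
		return -EBUSY;

	mapping_set_large_folios(bdev->bd_mapping);
	return 0;
}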
>
> Of course the real question is: why do you care about buffered
> I/O performance on the block device node?
>
Various combinations of FIO options (direct/buffered IO, block sizes,
read-write ratios, etc.) were part of a customer test/regression suite,
and we found this particular case of FIO with buffered IO on NVMe block
devices to have a lot of scalability issues. Hence we are checking if
there are ways to mitigate those.
Thanks for your reply.
Regards,
Bharata.
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 6:19 ` Mateusz Guzik
@ 2024-11-27 12:02 ` Jan Kara
2024-11-27 12:13 ` Christian Brauner
2024-11-28 5:40 ` Ritesh Harjani
2024-11-27 12:18 ` Bharata B Rao
1 sibling, 2 replies; 24+ messages in thread
From: Jan Kara @ 2024-11-27 12:02 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Bharata B Rao, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, willy, vbabka, david, akpm, yuzhao, axboe,
viro, brauner, jack, joshdon, clm
On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> > >
> > > Recently we discussed the scalability issues while running large
> > > instances of FIO with buffered IO option on NVME block devices here:
> > >
> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> > >
> > > One of the suggestions Chris Mason gave (during private discussions) was
> > > to enable large folios in block buffered IO path as that could
> > > improve the scalability problems and improve the lock contention
> > > scenarios.
> > >
> >
> > I have no basis to comment on the idea.
> >
> > However, it is pretty apparent whatever the situation it is being
> > heavily disfigured by lock contention in blkdev_llseek:
> >
> > > perf-lock contention output
> > > ---------------------------
> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > > mix it looks like this:
> > >
> > > perf-lock contention default
> > > contended total wait max wait avg wait type caller
> > >
> > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
> > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> > > 0xffffffff8f39e88f up_write+0x4f
> > > 0xffffffff8f9d598e blkdev_llseek+0x4e
> > > 0xffffffff8f703322 ksys_lseek+0x72
> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
> > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> > > 0xffffffff903f18fb down_write+0x5b
> > > 0xffffffff8f9d5971 blkdev_llseek+0x31
> > > 0xffffffff8f703322 ksys_lseek+0x72
> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > 0xffffffff903dce5e do_syscall_64+0x7e
> > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
> >
> > Admittedly I'm not familiar with this code, but at a quick glance the
> > lock can be just straight up removed here?
> >
> > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> > 535 {
> > 536 │ struct inode *bd_inode = bdev_file_inode(file);
> > 537 │ loff_t retval;
> > 538 │
> > 539 │ inode_lock(bd_inode);
> > 540 │ retval = fixed_size_llseek(file, offset, whence,
> > i_size_read(bd_inode));
> > 541 │ inode_unlock(bd_inode);
> > 542 │ return retval;
> > 543 }
> >
> > At best it stabilizes the size for the duration of the call. Sounds
> > like it helps nothing since if the size can change, the file offset
> > will still be altered as if there was no locking?
> >
> > Suppose this cannot be avoided to grab the size for whatever reason.
> >
> > While the above fio invocation did not work for me, I ran some crapper
> > which I had in my shell history and according to strace:
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
> >
> > ... the lseeks just rewind to the beginning, *definitely* not needing
> > to know the size. One would have to check but this is most likely the
> > case in your test as well.
> >
> > And for that there is 0 need to grab the size, and consequently the inode lock.
>
> That is to say bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:
Yeah, I've noticed this in the locking profiles as well and I agree the
bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
get away without using the inode lock in their llseek handlers...
Honza
> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
> struct inode *bd_inode = bdev_file_inode(file);
> - loff_t retval;
>
> - inode_lock(bd_inode);
> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> - inode_unlock(bd_inode);
> - return retval;
> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> }
>
> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>
> To be aborted if it blows up (but I don't see why it would).
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 12:02 ` Jan Kara
@ 2024-11-27 12:13 ` Christian Brauner
2024-11-28 5:40 ` Ritesh Harjani
1 sibling, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-11-27 12:13 UTC (permalink / raw)
To: Jan Kara
Cc: Mateusz Guzik, Bharata B Rao, linux-block, linux-kernel,
linux-fsdevel, linux-mm, nikunj, willy, vbabka, david, akpm,
yuzhao, axboe, viro, joshdon, clm
On Wed, Nov 27, 2024 at 01:02:35PM +0100, Jan Kara wrote:
> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> > On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> > > >
> > > > Recently we discussed the scalability issues while running large
> > > > instances of FIO with buffered IO option on NVME block devices here:
> > > >
> > > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> > > >
> > > > One of the suggestions Chris Mason gave (during private discussions) was
> > > > to enable large folios in block buffered IO path as that could
> > > > improve the scalability problems and improve the lock contention
> > > > scenarios.
> > > >
> > >
> > > I have no basis to comment on the idea.
> > >
> > > However, it is pretty apparent whatever the situation it is being
> > > heavily disfigured by lock contention in blkdev_llseek:
> > >
> > > > perf-lock contention output
> > > > ---------------------------
> > > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > > > mix it looks like this:
> > > >
> > > > perf-lock contention default
> > > > contended total wait max wait avg wait type caller
> > > >
> > > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
> > > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> > > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> > > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> > > > 0xffffffff8f39e88f up_write+0x4f
> > > > 0xffffffff8f9d598e blkdev_llseek+0x4e
> > > > 0xffffffff8f703322 ksys_lseek+0x72
> > > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
> > > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> > > > 0xffffffff903f18fb down_write+0x5b
> > > > 0xffffffff8f9d5971 blkdev_llseek+0x31
> > > > 0xffffffff8f703322 ksys_lseek+0x72
> > > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > > 0xffffffff903dce5e do_syscall_64+0x7e
> > > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
> > >
> > > Admittedly I'm not familiar with this code, but at a quick glance the
> > > lock can be just straight up removed here?
> > >
> > > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> > > 535 {
> > > 536 │ struct inode *bd_inode = bdev_file_inode(file);
> > > 537 │ loff_t retval;
> > > 538 │
> > > 539 │ inode_lock(bd_inode);
> > > 540 │ retval = fixed_size_llseek(file, offset, whence,
> > > i_size_read(bd_inode));
> > > 541 │ inode_unlock(bd_inode);
> > > 542 │ return retval;
> > > 543 }
> > >
> > > At best it stabilizes the size for the duration of the call. Sounds
> > > like it helps nothing since if the size can change, the file offset
> > > will still be altered as if there was no locking?
> > >
> > > Suppose this cannot be avoided to grab the size for whatever reason.
> > >
> > > While the above fio invocation did not work for me, I ran some crapper
> > > which I had in my shell history and according to strace:
> > > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > > [pid 271830] lseek(7, 0, SEEK_SET) = 0
> > >
> > > ... the lseeks just rewind to the beginning, *definitely* not needing
> > > to know the size. One would have to check but this is most likely the
> > > case in your test as well.
> > >
> > > And for that there is 0 need to grab the size, and consequently the inode lock.
> >
> > That is to say bare minimum this needs to be benchmarked before/after
> > with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
nod. This should be removed.
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 6:19 ` Mateusz Guzik
2024-11-27 12:02 ` Jan Kara
@ 2024-11-27 12:18 ` Bharata B Rao
2024-11-27 12:28 ` Mateusz Guzik
1 sibling, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-11-27 12:18 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>>>
>>> Recently we discussed the scalability issues while running large
>>> instances of FIO with buffered IO option on NVME block devices here:
>>>
>>> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>>>
>>> One of the suggestions Chris Mason gave (during private discussions) was
>>> to enable large folios in block buffered IO path as that could
>>> improve the scalability problems and improve the lock contention
>>> scenarios.
>>>
>>
>> I have no basis to comment on the idea.
>>
>> However, it is pretty apparent whatever the situation it is being
>> heavily disfigured by lock contention in blkdev_llseek:
>>
>>> perf-lock contention output
>>> ---------------------------
>>> The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>>> mix it looks like this:
>>>
>>> perf-lock contention default
>>> contended total wait max wait avg wait type caller
>>>
>>> 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
>>> 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>>> 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>>> 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>>> 0xffffffff8f39e88f up_write+0x4f
>>> 0xffffffff8f9d598e blkdev_llseek+0x4e
>>> 0xffffffff8f703322 ksys_lseek+0x72
>>> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>>> 0xffffffff8f20b983 x64_sys_call+0x1fb3
>>> 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
>>> 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>>> 0xffffffff903f18fb down_write+0x5b
>>> 0xffffffff8f9d5971 blkdev_llseek+0x31
>>> 0xffffffff8f703322 ksys_lseek+0x72
>>> 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>>> 0xffffffff8f20b983 x64_sys_call+0x1fb3
>>> 0xffffffff903dce5e do_syscall_64+0x7e
>>> 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
>>
>> Admittedly I'm not familiar with this code, but at a quick glance the
>> lock can be just straight up removed here?
>>
>> 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> 535 {
>> 536 │ struct inode *bd_inode = bdev_file_inode(file);
>> 537 │ loff_t retval;
>> 538 │
>> 539 │ inode_lock(bd_inode);
>> 540 │ retval = fixed_size_llseek(file, offset, whence,
>> i_size_read(bd_inode));
>> 541 │ inode_unlock(bd_inode);
>> 542 │ return retval;
>> 543 }
>>
>> At best it stabilizes the size for the duration of the call. Sounds
>> like it helps nothing since if the size can change, the file offset
>> will still be altered as if there was no locking?
>>
>> Suppose this cannot be avoided to grab the size for whatever reason.
>>
>> While the above fio invocation did not work for me, I ran some crapper
>> which I had in my shell history and according to strace:
>> [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> [pid 271830] lseek(7, 0, SEEK_SET) = 0
>>
>> ... the lseeks just rewind to the beginning, *definitely* not needing
>> to know the size. One would have to check but this is most likely the
>> case in your test as well.
>>
>> And for that there is 0 need to grab the size, and consequently the inode lock.
Here is the complete FIO cmdline I am using:
fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=64k
-numjobs=1 -runtime=3600 --time_based -group_reporting -name=mytest
And that results in lseek patterns like these:
lseek(6, 0, SEEK_SET) = 0
lseek(6, 131072, SEEK_SET) = 131072
lseek(6, 65536, SEEK_SET) = 65536
lseek(6, 196608, SEEK_SET) = 196608
lseek(6, 131072, SEEK_SET) = 131072
lseek(6, 393216, SEEK_SET) = 393216
lseek(6, 196608, SEEK_SET) = 196608
lseek(6, 458752, SEEK_SET) = 458752
lseek(6, 262144, SEEK_SET) = 262144
lseek(6, 1114112, SEEK_SET) = 1114112
The lseeks are interspersed with read and write calls.
>
> That is to say bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:
>
> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
> struct inode *bd_inode = bdev_file_inode(file);
> - loff_t retval;
>
> - inode_lock(bd_inode);
> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> - inode_unlock(bd_inode);
> - return retval;
> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> }
>
> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>
> To be aborted if it blows up (but I don't see why it would).
Thanks for this fix, will try and get back with results.
Regards,
Bharata.
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 12:18 ` Bharata B Rao
@ 2024-11-27 12:28 ` Mateusz Guzik
2024-11-28 4:01 ` Bharata B Rao
0 siblings, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-27 12:28 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> > That is to say bare minimum this needs to be benchmarked before/after
> > with the lock removed from the picture, like so:
> >
> > diff --git a/block/fops.c b/block/fops.c
> > index 2d01c9007681..7f9e9e2f9081 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
> > static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> > {
> > struct inode *bd_inode = bdev_file_inode(file);
> > - loff_t retval;
> >
> > - inode_lock(bd_inode);
> > - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> > - inode_unlock(bd_inode);
> > - return retval;
> > + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> > }
> >
> > static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> >
> > To be aborted if it blows up (but I don't see why it would).
>
> Thanks for this fix, will try and get back with results.
>
Please make sure to have results just with this change, no messing
with folio sizes so that I have something for the patch submission.
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 12:28 ` Mateusz Guzik
@ 2024-11-28 4:01 ` Bharata B Rao
2024-11-28 4:22 ` Matthew Wilcox
2024-11-28 4:22 ` Mateusz Guzik
0 siblings, 2 replies; 24+ messages in thread
From: Bharata B Rao @ 2024-11-28 4:01 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
>>> That is to say bare minimum this needs to be benchmarked before/after
>>> with the lock removed from the picture, like so:
>>>
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 2d01c9007681..7f9e9e2f9081 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
>>> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>>> {
>>> struct inode *bd_inode = bdev_file_inode(file);
>>> - loff_t retval;
>>>
>>> - inode_lock(bd_inode);
>>> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>> - inode_unlock(bd_inode);
>>> - return retval;
>>> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>> }
>>>
>>> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>>>
>>> To be aborted if it blows up (but I don't see why it would).
>>
>> Thanks for this fix, will try and get back with results.
>>
>
> Please make sure to have results just with this change, no messing
> with folio sizes so that I have something for the patch submission.
The contention with inode_lock is gone after your above changes. The new
top 10 contention data looks like this now:
contended total wait max wait avg wait type caller
2441494015 172.15 h 1.72 ms 253.83 us spinlock folio_wait_bit_common+0xd5
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
0xffffffffacdd1905 folio_wait_bit_common+0xd5
0xffffffffacdd2d0a filemap_get_pages+0x68a
0xffffffffacdd2e73 filemap_read+0x103
0xffffffffad1d67ba blkdev_read_iter+0x6a
0xffffffffacf06937 vfs_read+0x297
0xffffffffacf07653 ksys_read+0x73
25269947 1.58 h 1.72 ms 225.44 us spinlock folio_wake_bit+0x62
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
0xffffffffacdcf322 folio_wake_bit+0x62
0xffffffffacdd2ca7 filemap_get_pages+0x627
0xffffffffacdd2e73 filemap_read+0x103
0xffffffffad1d67ba blkdev_read_iter+0x6a
0xffffffffacf06937 vfs_read+0x297
0xffffffffacf07653 ksys_read+0x73
44757761 1.05 h 1.55 ms 84.41 us spinlock folio_wake_bit+0x62
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
0xffffffffacdcf322 folio_wake_bit+0x62
0xffffffffacdcf7bc folio_end_read+0x2c
0xffffffffacf6d4cf mpage_read_end_io+0x6f
0xffffffffad1d8abb bio_endio+0x12b
0xffffffffad1f07bd blk_mq_end_request_batch+0x12d
0xffffffffc05e4e9b nvme_pci_complete_batch+0xbb
66419830 53.00 m 685.29 us 47.88 us spinlock __filemap_add_folio+0x14c
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
0xffffffffacdd037c __filemap_add_folio+0x14c
0xffffffffacdd072c filemap_add_folio+0x9c
0xffffffffacde2a4c page_cache_ra_unbounded+0x12c
0xffffffffacde31e0 page_cache_ra_order+0x340
0xffffffffacde33b8 page_cache_sync_ra+0x148
0xffffffffacdd27c5 filemap_get_pages+0x145
532 45.51 m 5.49 s 5.13 s mutex bdev_release+0x69
0xffffffffadbef1de __mutex_lock.constprop.0+0x17e
0xffffffffadbef863 __mutex_lock_slowpath+0x13
0xffffffffadbef8bb mutex_lock+0x3b
0xffffffffad1d5249 bdev_release+0x69
0xffffffffad1d5921 blkdev_release+0x11
0xffffffffacf089f3 __fput+0xe3
0xffffffffacf08c9b __fput_sync+0x1b
0xffffffffacefe8ed __x64_sys_close+0x3d
264989621 45.26 m 486.17 us 10.25 us spinlock raw_spin_rq_lock_nested+0x1d
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf5c7f _raw_spin_lock+0x3f
0xffffffffacb4f22d raw_spin_rq_lock_nested+0x1d
0xffffffffacb5b601 _raw_spin_rq_lock_irqsave+0x21
0xffffffffacb6f81b sched_balance_rq+0x60b
0xffffffffacb70784 sched_balance_newidle+0x1f4
0xffffffffacb70ae4 pick_next_task_fair+0x54
0xffffffffacb4e583 __pick_next_task+0x43
30026842 26.05 m 652.85 us 52.05 us spinlock __folio_mark_dirty+0x2b
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
0xffffffffacde189b __folio_mark_dirty+0x2b
0xffffffffacf67503 mark_buffer_dirty+0x53
0xffffffffacf676e2 __block_commit_write+0x82
0xffffffffacf685fc block_write_end+0x3c
0xffffffffacfb577e iomap_write_end+0x13e
0xffffffffacfb674c iomap_file_buffered_write+0x29c
5085820 7.07 m 1.24 ms 83.42 us spinlock folio_wake_bit+0x62
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
0xffffffffacdcf322 folio_wake_bit+0x62
0xffffffffacdcf7bc folio_end_read+0x2c
0xffffffffacf6d4cf mpage_read_end_io+0x6f
0xffffffffad1d8abb bio_endio+0x12b
0xffffffffad1ee8ca blk_update_request+0x1aa
0xffffffffad1ef7a4 blk_mq_end_request+0x24
8027370 1.90 m 76.21 us 14.23 us spinlock tick_do_update_jiffies64+0x29
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf5c7f _raw_spin_lock+0x3f
0xffffffffacc21c69 tick_do_update_jiffies64+0x29
0xffffffffacc21e8c tick_nohz_handler+0x13c
0xffffffffacc0a0cf __hrtimer_run_queues+0x10f
0xffffffffacc0afe8 hrtimer_interrupt+0xf8
0xffffffffacaa8f36 __sysvec_apic_timer_interrupt+0x56
0xffffffffadbe3953 sysvec_apic_timer_interrupt+0x93
4344269 1.08 m 600.67 us 14.90 us spinlock __filemap_add_folio+0x14c
0xffffffffadbf60a3 native_queued_spin_lock_slowpath+0x1f3
0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
0xffffffffacdd037c __filemap_add_folio+0x14c
0xffffffffacdd072c filemap_add_folio+0x9c
0xffffffffacde2a4c page_cache_ra_unbounded+0x12c
0xffffffffacde31e0 page_cache_ra_order+0x340
0xffffffffacde3615 page_cache_async_ra+0x165
0xffffffffacdd2b9c filemap_get_pages+0x51c
However a point of concern is that FIO bandwidth comes down drastically
after the change.
default inode_lock-fix
rw=30%
Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
Regards,
Bharata.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:01 ` Bharata B Rao
@ 2024-11-28 4:22 ` Matthew Wilcox
2024-11-28 4:37 ` Bharata B Rao
2024-11-28 4:22 ` Mateusz Guzik
1 sibling, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-11-28 4:22 UTC (permalink / raw)
To: Bharata B Rao
Cc: Mateusz Guzik, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
> However a point of concern is that FIO bandwidth comes down drastically
> after the change.
>
> default inode_lock-fix
> rw=30%
> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
Something this dramatic usually only happens when you enable a debugging
option. Can you recheck that you're running both A and B with the same
debugging options both compiled in, and enabled?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:01 ` Bharata B Rao
2024-11-28 4:22 ` Matthew Wilcox
@ 2024-11-28 4:22 ` Mateusz Guzik
2024-11-28 4:31 ` Mateusz Guzik
2024-11-28 4:43 ` Matthew Wilcox
1 sibling, 2 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-28 4:22 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@amd.com> wrote:
>
> The contention with inode_lock is gone after your above changes. The new
> top 10 contention data looks like this now:
>
> contended total wait max wait avg wait type caller
>
> 2441494015 172.15 h 1.72 ms 253.83 us spinlock
> folio_wait_bit_common+0xd5
> 0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
> 0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
> 0xffffffffacdd1905 folio_wait_bit_common+0xd5
> 0xffffffffacdd2d0a filemap_get_pages+0x68a
> 0xffffffffacdd2e73 filemap_read+0x103
> 0xffffffffad1d67ba blkdev_read_iter+0x6a
> 0xffffffffacf06937 vfs_read+0x297
> 0xffffffffacf07653 ksys_read+0x73
> 25269947 1.58 h 1.72 ms 225.44 us spinlock
> folio_wake_bit+0x62
> 0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
> 0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> 0xffffffffacdcf322 folio_wake_bit+0x62
> 0xffffffffacdd2ca7 filemap_get_pages+0x627
> 0xffffffffacdd2e73 filemap_read+0x103
> 0xffffffffad1d67ba blkdev_read_iter+0x6a
> 0xffffffffacf06937 vfs_read+0x297
> 0xffffffffacf07653 ksys_read+0x73
> 44757761 1.05 h 1.55 ms 84.41 us spinlock
> folio_wake_bit+0x62
> 0xffffffffadbf60a3
> native_queued_spin_lock_slowpath+0x1f3
> 0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> 0xffffffffacdcf322 folio_wake_bit+0x62
> 0xffffffffacdcf7bc folio_end_read+0x2c
> 0xffffffffacf6d4cf mpage_read_end_io+0x6f
> 0xffffffffad1d8abb bio_endio+0x12b
> 0xffffffffad1f07bd blk_mq_end_request_batch+0x12d
> 0xffffffffc05e4e9b nvme_pci_complete_batch+0xbb
[snip]
> However a point of concern is that FIO bandwidth comes down drastically
> after the change.
>
Nicely put :)
> default inode_lock-fix
> rw=30%
> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
>
This means that the folio waiting stuff has poor scalability, but
without digging into it I have no idea what can be done. The easy way
out would be to speculatively spin before buggering off, but one would
have to check what happens in real workloads -- presumably the lock
owner can be off cpu for a long time (I presume there is no way to
store the owner).
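To make "speculatively spin" concrete, here is a rough sketch (purely
illustrative, not proposed code) of a spin-then-sleep hint layered on the
existing folio lock bit; the helper name and the spin bound are made up:

/*
 * Purely illustrative: spin briefly on the lock bit in the hope that the
 * holder drops it soon, and only fall back to the hashed wait queue (and
 * its bucket spinlock) if it does not.  The spin bound is arbitrary and
 * would need tuning against holders that stay off cpu for a long time.
 */
static bool folio_lock_spin_hint(struct folio *folio)
{
	int spins = 100;

	while (spins--) {
		if (!folio_test_locked(folio))
			return true;	/* caller can retry a trylock */
		cpu_relax();
	}
	return false;			/* give up, queue on folio_wait_table */
}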
The now-removed lock is an rwsem, which behaves better when contested,
and it was pulling contention away from the folios, artificially *helping*
performance by having the folio bottleneck exercised less.
The right thing to do in the long run is still to whack the llseek
lock acquire, but in the light of the above it can probably wait for
better times.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:22 ` Mateusz Guzik
@ 2024-11-28 4:31 ` Mateusz Guzik
2024-12-02 9:37 ` Bharata B Rao
2024-11-28 4:43 ` Matthew Wilcox
1 sibling, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-28 4:31 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@amd.com> wrote:
> >
> > The contention with inode_lock is gone after your above changes. The new
> > top 10 contention data looks like this now:
> >
> > contended total wait max wait avg wait type caller
> >
> > 2441494015 172.15 h 1.72 ms 253.83 us spinlock
> > folio_wait_bit_common+0xd5
> > 0xffffffffadbf60a3
> > native_queued_spin_lock_slowpath+0x1f3
> > 0xffffffffadbf5d01 _raw_spin_lock_irq+0x51
> > 0xffffffffacdd1905 folio_wait_bit_common+0xd5
> > 0xffffffffacdd2d0a filemap_get_pages+0x68a
> > 0xffffffffacdd2e73 filemap_read+0x103
> > 0xffffffffad1d67ba blkdev_read_iter+0x6a
> > 0xffffffffacf06937 vfs_read+0x297
> > 0xffffffffacf07653 ksys_read+0x73
> > 25269947 1.58 h 1.72 ms 225.44 us spinlock
> > folio_wake_bit+0x62
> > 0xffffffffadbf60a3
> > native_queued_spin_lock_slowpath+0x1f3
> > 0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> > 0xffffffffacdcf322 folio_wake_bit+0x62
> > 0xffffffffacdd2ca7 filemap_get_pages+0x627
> > 0xffffffffacdd2e73 filemap_read+0x103
> > 0xffffffffad1d67ba blkdev_read_iter+0x6a
> > 0xffffffffacf06937 vfs_read+0x297
> > 0xffffffffacf07653 ksys_read+0x73
> > 44757761 1.05 h 1.55 ms 84.41 us spinlock
> > folio_wake_bit+0x62
> > 0xffffffffadbf60a3
> > native_queued_spin_lock_slowpath+0x1f3
> > 0xffffffffadbf537c _raw_spin_lock_irqsave+0x5c
> > 0xffffffffacdcf322 folio_wake_bit+0x62
> > 0xffffffffacdcf7bc folio_end_read+0x2c
> > 0xffffffffacf6d4cf mpage_read_end_io+0x6f
> > 0xffffffffad1d8abb bio_endio+0x12b
> > 0xffffffffad1f07bd blk_mq_end_request_batch+0x12d
> > 0xffffffffc05e4e9b nvme_pci_complete_batch+0xbb
> [snip]
> > However a point of concern is that FIO bandwidth comes down drastically
> > after the change.
> >
>
> Nicely put :)
>
> > default inode_lock-fix
> > rw=30%
> > Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
> > Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
> > Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
> > Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
> >
>
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
>
> The now-removed lock uses rwsems which behave better when contested
> and was pulling contention away from folios, artificially *helping*
> performance by having the folio bottleneck be exercised less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in the light of the above it can probably wait for
> better times.
Willy mentioned the folio wait queue hash table could be grown; you
can find it in mm/filemap.c:
1062 #define PAGE_WAIT_TABLE_BITS 8
1063 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
1064 static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE]
__cacheline_aligned;
1065
1066 static wait_queue_head_t *folio_waitqueue(struct folio *folio)
1067 {
1068 │ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
1069 }
Can you collect off cpu time? offcputime-bpfcc -K > /tmp/out
On debian this ships with the bpfcc-tools package.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:22 ` Matthew Wilcox
@ 2024-11-28 4:37 ` Bharata B Rao
2024-11-28 11:23 ` Bharata B Rao
0 siblings, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-11-28 4:37 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mateusz Guzik, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
>> However a point of concern is that FIO bandwidth comes down drastically
>> after the change.
>>
>> default inode_lock-fix
>> rw=30%
>> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
>> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
>> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
>> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
>
> Something this dramatic usually only happens when you enable a debugging
> option. Can you recheck that you're running both A and B with the same
> debugging options both compiled in, and enabled?
It is the same kernel tree with and w/o Mateusz's inode_lock changes to
block/fops.c, and the config remains the same for both builds.
Let me get a run for both the base and patched cases w/o perf lock
contention to check if that makes a difference.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:22 ` Mateusz Guzik
2024-11-28 4:31 ` Mateusz Guzik
@ 2024-11-28 4:43 ` Matthew Wilcox
1 sibling, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-11-28 4:43 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Bharata B Rao, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On Thu, Nov 28, 2024 at 05:22:41AM +0100, Mateusz Guzik wrote:
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
Actually the easy way is to change:
#define PAGE_WAIT_TABLE_BITS 8
to a larger number.
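Concretely that is a one-line change in mm/filemap.c; an illustrative
bump (the exact value is something to experiment with):

-#define PAGE_WAIT_TABLE_BITS 8		/* 256 wait queues today */
+#define PAGE_WAIT_TABLE_BITS 14	/* 16384 wait queues */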
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
So ...
- There's no space in struct folio to put a rwsem.
- But we want to be able to sleep waiting for a folio to (eg) do I/O.
This is the solution we have. For the read case, there are three
important bits in folio->flags to pay attention to:
- PG_locked. This is held during the read.
- PG_uptodate. This is set if the read succeeded.
- PG_waiters. This is set if anyone is waiting for PG_locked [*]
The first thread comes along, allocates a folio, locks it, inserts
it into the mapping.
The second thread comes along, finds the folio, sees it's !uptodate,
sets the waiter bit, adds itself to the waitqueue.
The third thread, ditto.
The read completes. In interrupt or maybe softirq context, the
BIO completion sets the uptodate bit, clears the locked bit and tests
the waiter bit. Since the waiter bit is set, it walks the waitqueue
looking for waiters which match the locked bit and folio (see
folio_wake_bit()).
So there's not _much_ of a thundering herd problem here. Most likely
the waitqueue is just too damn long with a lot of threads waiting for
I/O.
[*] oversimplification; don't worry about it.
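To tie that back to the profiles above, here is a heavily simplified
sketch of the wait side (not the actual mm/filemap.c code -- the real
folio_wait_bit_common() also manages PG_waiters and installs a custom
wake function).  The point is that every sleeper for a given folio
funnels into one of the 256 shared folio_wait_table buckets, and both
prepare_to_wait() and the wakeup take that bucket's spinlock -- the one
showing up in the contention data:

static void folio_wait_locked_sketch(struct folio *folio)
{
	wait_queue_head_t *q = folio_waitqueue(folio);	/* hashed bucket */
	DEFINE_WAIT_BIT(wait, &folio->flags, PG_locked);

	for (;;) {
		/* takes q->lock, shared by every folio hashing here */
		prepare_to_wait(q, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
		if (!folio_test_locked(folio))
			break;		/* read finished, PG_locked cleared */
		io_schedule();		/* sleep until folio_wake_bit() runs */
	}
	finish_wait(q, &wait.wq_entry);
}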
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-27 12:02 ` Jan Kara
2024-11-27 12:13 ` Christian Brauner
@ 2024-11-28 5:40 ` Ritesh Harjani
1 sibling, 0 replies; 24+ messages in thread
From: Ritesh Harjani @ 2024-11-28 5:40 UTC (permalink / raw)
To: Jan Kara, Mateusz Guzik
Cc: Bharata B Rao, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, willy, vbabka, david, akpm, yuzhao, axboe,
viro, brauner, jack, joshdon, clm
Jan Kara <jack@suse.cz> writes:
> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >
>> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>> > >
>> > > Recently we discussed the scalability issues while running large
>> > > instances of FIO with buffered IO option on NVME block devices here:
>> > >
>> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>> > >
>> > > One of the suggestions Chris Mason gave (during private discussions) was
>> > > to enable large folios in block buffered IO path as that could
>> > > improve the scalability problems and improve the lock contention
>> > > scenarios.
>> > >
>> >
>> > I have no basis to comment on the idea.
>> >
>> > However, it is pretty apparent whatever the situation it is being
>> > heavily disfigured by lock contention in blkdev_llseek:
>> >
>> > > perf-lock contention output
>> > > ---------------------------
>> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>> > > mix it looks like this:
>> > >
>> > > perf-lock contention default
>> > > contended total wait max wait avg wait type caller
>> > >
>> > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
>> > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>> > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>> > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>> > > 0xffffffff8f39e88f up_write+0x4f
>> > > 0xffffffff8f9d598e blkdev_llseek+0x4e
>> > > 0xffffffff8f703322 ksys_lseek+0x72
>> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
>> > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
>> > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>> > > 0xffffffff903f18fb down_write+0x5b
>> > > 0xffffffff8f9d5971 blkdev_llseek+0x31
>> > > 0xffffffff8f703322 ksys_lseek+0x72
>> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
>> > > 0xffffffff903dce5e do_syscall_64+0x7e
>> > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
>> >
>> > Admittedly I'm not familiar with this code, but at a quick glance the
>> > lock can be just straight up removed here?
>> >
>> > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> > 535 {
>> > 536 │ struct inode *bd_inode = bdev_file_inode(file);
>> > 537 │ loff_t retval;
>> > 538 │
>> > 539 │ inode_lock(bd_inode);
>> > 540 │ retval = fixed_size_llseek(file, offset, whence,
>> > i_size_read(bd_inode));
>> > 541 │ inode_unlock(bd_inode);
>> > 542 │ return retval;
>> > 543 }
>> >
>> > At best it stabilizes the size for the duration of the call. Sounds
>> > like it helps nothing since if the size can change, the file offset
>> > will still be altered as if there was no locking?
>> >
>> > Suppose this cannot be avoided to grab the size for whatever reason.
>> >
>> > While the above fio invocation did not work for me, I ran some crapper
>> > which I had in my shell history and according to strace:
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
>> >
>> > ... the lseeks just rewind to the beginning, *definitely* not needing
>> > to know the size. One would have to check but this is most likely the
>> > case in your test as well.
>> >
>> > And for that there is 0 need to grab the size, and consequently the inode lock.
>>
>> That is to say bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
>
Right, we don't need inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for the 32-bit SMP case,
due to its use of seqcounts.
I guess it would be good to add this to the Documentation too, rather
than having this info just hang in the comment on top of i_size_write()?
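For anyone skimming, this is roughly the 32-bit SMP variant behind those
helpers (trimmed from include/linux/fs.h [2]): the reader is lockless and
simply retries if it races with an update, while the writer's seqcount
section relies on callers serializing writers externally:

static inline loff_t i_size_read(const struct inode *inode)
{
	loff_t i_size;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&inode->i_size_seqcount);
		i_size = inode->i_size;
	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
	return i_size;
}

static inline void i_size_write(struct inode *inode, loff_t i_size)
{
	/* writer vs writer serialization is the caller's job */
	preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
	preempt_enable();
}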
References
===========
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/
-ritesh
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for block buffered IO path
2024-11-27 10:37 ` Bharata B Rao
@ 2024-11-28 5:43 ` Christoph Hellwig
0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-11-28 5:43 UTC (permalink / raw)
To: Bharata B Rao
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, willy, vbabka, david, akpm, yuzhao, mjguzik,
axboe, viro, brauner, jack, joshdon, clm
On Wed, Nov 27, 2024 at 04:07:02PM +0530, Bharata B Rao wrote:
> I believe you are referring to the patchset that enables bs > ps for block
> devices - https://lore.kernel.org/linux-fsdevel/20241113094727.1497722-1-mcgrof@kernel.org/
I actually thought of:
https://www.spinics.net/lists/linux-ext4/msg98151.html
but yes, the one you pointed to is more relevant.
> In fact I was trying to see if it is possible to advertise large folio
> support in bdev mapping only for those block devices which don't have FS
> mounted on them. But apparently it was not so straight forward and my
> initial attempt at this resulted in FS corruption. Hence I resorted to the
> current ioctl approach as a way to showcase the problem and the potential
> benefit.
Well, if you use the ioctl and then later mount a file system, you'll
still see the same corruption.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:37 ` Bharata B Rao
@ 2024-11-28 11:23 ` Bharata B Rao
2024-11-28 23:31 ` Mateusz Guzik
0 siblings, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-11-28 11:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mateusz Guzik, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On 28-Nov-24 10:07 AM, Bharata B Rao wrote:
> On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
>> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
>>> However a point of concern is that FIO bandwidth comes down drastically
>>> after the change.
>>>
>>> default inode_lock-fix
>>> rw=30%
>>> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
>>> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
>>> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
>>> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
>>
>> Something this dramatic usually only happens when you enable a debugging
>> option. Can you recheck that you're running both A and B with the same
>> debugging options both compiled in, and enabled?
>
> It is the same kernel tree with and w/o Mateusz's inode_lock changes to
> block/fops.c. I see the config remains same for both the builds.
>
> Let me get a run for both base and patched case w/o running perf lock
> contention to check if that makes a difference.
Without perf lock contention
default inode_lock-fix
rw=30%
Instance 1 r=54.6GiB/s,w=23.4GiB/s r=11.4GiB/s,w=4992MiB/s
Instance 2 r=52.7GiB/s,w=22.6GiB/s r=11.4GiB/s,w=4981MiB/s
Instance 3 r=53.3GiB/s,w=22.8GiB/s r=12.7GiB/s,w=5575MiB/s
Instance 4 r=37.7GiB/s,w=16.2GiB/s r=10.4GiB/s,w=4581MiB/s
>
> Regards,
> Bharata.
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 11:23 ` Bharata B Rao
@ 2024-11-28 23:31 ` Mateusz Guzik
2024-11-29 10:32 ` Bharata B Rao
0 siblings, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-11-28 23:31 UTC (permalink / raw)
To: Bharata B Rao
Cc: Matthew Wilcox, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On Thu, Nov 28, 2024 at 12:24 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 28-Nov-24 10:07 AM, Bharata B Rao wrote:
> > On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
> >> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
> >>> However a point of concern is that FIO bandwidth comes down drastically
> >>> after the change.
> >>>
> >>> default inode_lock-fix
> >>> rw=30%
> >>> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
> >>> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
> >>> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
> >>> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
> >>
> >> Something this dramatic usually only happens when you enable a debugging
> >> option. Can you recheck that you're running both A and B with the same
> >> debugging options both compiled in, and enabled?
> >
> > It is the same kernel tree with and w/o Mateusz's inode_lock changes to
> > block/fops.c. I see the config remains same for both the builds.
> >
> > Let me get a run for both base and patched case w/o running perf lock
> > contention to check if that makes a difference.
>
> Without perf lock contention
>
> default inode_lock-fix
> rw=30%
> Instance 1 r=54.6GiB/s,w=23.4GiB/s r=11.4GiB/s,w=4992MiB/s
> Instance 2 r=52.7GiB/s,w=22.6GiB/s r=11.4GiB/s,w=4981MiB/s
> Instance 3 r=53.3GiB/s,w=22.8GiB/s r=12.7GiB/s,w=5575MiB/s
> Instance 4 r=37.7GiB/s,w=16.2GiB/s r=10.4GiB/s,w=4581MiB/s
>
Per my other e-mail, can you follow Willy's suggestion and increase the hash?
Best case scenario this takes care of it, and then some heuristic can
be added for how to autosize the thing.
If someone feels like microoptimizing, I also note there is magic infra
to have the size hotpatched into the generated asm instead of being
read at runtime (see the dentry cache as an example user).
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 23:31 ` Mateusz Guzik
@ 2024-11-29 10:32 ` Bharata B Rao
0 siblings, 0 replies; 24+ messages in thread
From: Bharata B Rao @ 2024-11-29 10:32 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Matthew Wilcox, linux-block, linux-kernel, linux-fsdevel,
linux-mm, nikunj, vbabka, david, akpm, yuzhao, axboe, viro,
brauner, jack, joshdon, clm
On 29-Nov-24 5:01 AM, Mateusz Guzik wrote:
> On Thu, Nov 28, 2024 at 12:24 PM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 28-Nov-24 10:07 AM, Bharata B Rao wrote:
>>> On 28-Nov-24 9:52 AM, Matthew Wilcox wrote:
>>>> On Thu, Nov 28, 2024 at 09:31:50AM +0530, Bharata B Rao wrote:
>>>>> However a point of concern is that FIO bandwidth comes down drastically
>>>>> after the change.
>>>>>
>>>>> default inode_lock-fix
>>>>> rw=30%
>>>>> Instance 1 r=55.7GiB/s,w=23.9GiB/s r=9616MiB/s,w=4121MiB/s
>>>>> Instance 2 r=38.5GiB/s,w=16.5GiB/s r=8482MiB/s,w=3635MiB/s
>>>>> Instance 3 r=37.5GiB/s,w=16.1GiB/s r=8609MiB/s,w=3690MiB/s
>>>>> Instance 4 r=37.4GiB/s,w=16.0GiB/s r=8486MiB/s,w=3637MiB/s
>>>>
>>>> Something this dramatic usually only happens when you enable a debugging
>>>> option. Can you recheck that you're running both A and B with the same
>>>> debugging options both compiled in, and enabled?
>>>
>>> It is the same kernel tree with and w/o Mateusz's inode_lock changes to
>>> block/fops.c. I see the config remains same for both the builds.
>>>
>>> Let me get a run for both base and patched case w/o running perf lock
>>> contention to check if that makes a difference.
>>
>> Without perf lock contention
>>
>> default inode_lock-fix
>> rw=30%
>> Instance 1 r=54.6GiB/s,w=23.4GiB/s r=11.4GiB/s,w=4992MiB/s
>> Instance 2 r=52.7GiB/s,w=22.6GiB/s r=11.4GiB/s,w=4981MiB/s
>> Instance 3 r=53.3GiB/s,w=22.8GiB/s r=12.7GiB/s,w=5575MiB/s
>> Instance 4 r=37.7GiB/s,w=16.2GiB/s r=10.4GiB/s,w=4581MiB/s
>>
>
> per my other e-mail can you follow willy's suggestion and increase the hash?
With Mateusz's inode_lock fix and PAGE_WAIT_TABLE_BITS values of 10, 14,
16 and 20 (r is FIO READ bw and w is WRITE bw for each instance):

                 10            14            16            20
rw=30%
Instance 1   r=11.3GiB/s   r=14.2GiB/s   r=14.8GiB/s   r=14.9GiB/s
             w=4965MiB/s   w=6225MiB/s   w=6487MiB/s   w=6552MiB/s
Instance 2   r=12.3GiB/s   r=10.4GiB/s   r=10.9GiB/s   r=11.0GiB/s
             w=5389MiB/s   w=4548MiB/s   w=4770MiB/s   w=4815MiB/s
Instance 3   r=11.1GiB/s   r=12.3GiB/s   r=11.2GiB/s   r=13.5GiB/s
             w=4864MiB/s   w=5410MiB/s   w=4923MiB/s   w=5927MiB/s
Instance 4   r=12.3GiB/s   r=13.7GiB/s   r=13.0GiB/s   r=11.4GiB/s
             w=5404MiB/s   w=6004MiB/s   w=5689MiB/s   w=5007MiB/s

The number of hash buckets doesn't seem to matter all that much in this case.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-11-28 4:31 ` Mateusz Guzik
@ 2024-12-02 9:37 ` Bharata B Rao
2024-12-02 10:08 ` Mateusz Guzik
0 siblings, 1 reply; 24+ messages in thread
From: Bharata B Rao @ 2024-12-02 9:37 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On 28-Nov-24 10:01 AM, Mateusz Guzik wrote:
> WIlly mentioned the folio wait queue hash table could be grown, you
> can find it in mm/filemap.c:
> 1062 #define PAGE_WAIT_TABLE_BITS 8
> 1063 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
> 1064 static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE]
> __cacheline_aligned;
> 1065
> 1066 static wait_queue_head_t *folio_waitqueue(struct folio *folio)
> 1067 {
> 1068 │ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
> 1069 }
>
> Can you collect off cpu time? offcputime-bpfcc -K > /tmp/out
Flamegraph for "perf record --off-cpu -F 99 -a -g --all-kernel
--kernel-callchains -- sleep 120" is attached.
Off-cpu samples were collected for 120s at around the 45th minute of the
FIO benchmark, which runs for 1 hr. This run was with the kernel that
had your inode_lock fix but no changes to PAGE_WAIT_TABLE_BITS.
Hopefully this captures a representative sample of the scalability
issue with the folio lock.
Regards,
Bharata.
[-- Attachment #2: perf-inode-fix.svg --]
[-- Type: image/svg+xml, Size: 243647 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-12-02 9:37 ` Bharata B Rao
@ 2024-12-02 10:08 ` Mateusz Guzik
2024-12-03 5:01 ` Bharata B Rao
0 siblings, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-12-02 10:08 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On Mon, Dec 2, 2024 at 10:37 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 28-Nov-24 10:01 AM, Mateusz Guzik wrote:
>
> > WIlly mentioned the folio wait queue hash table could be grown, you
> > can find it in mm/filemap.c:
> > 1062 #define PAGE_WAIT_TABLE_BITS 8
> > 1063 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
> > 1064 static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE]
> > __cacheline_aligned;
> > 1065
> > 1066 static wait_queue_head_t *folio_waitqueue(struct folio *folio)
> > 1067 {
> > 1068 │ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
> > 1069 }
> >
> > Can you collect off cpu time? offcputime-bpfcc -K > /tmp/out
>
> Flamegraph for "perf record --off-cpu -F 99 -a -g --all-kernel
> --kernel-callchains -- sleep 120" is attached.
>
> Off-cpu samples were collected for 120s at around 45th minute run of the
> FIO benchmark that actually runs for 1hr. This run was with kernel that
> had your inode_lock fix but no changes to PAGE_WAIT_TABLE_BITS.
>
> Hopefully this captures the representative sample of the scalability
> issue with folio lock.
>
I'm not familiar with the off-cpu option; fwiw, it does not look like any
of that time got graphed. The thing that I know to work is
offcputime-bpfcc.
Regardless, per your own graph over half the *on* cpu time is spent
spinning on the folio hash table locks.
If bumping the size does not resolve the problem, the contention most
likely shifts to something else, so what we need is profiling data from
that state.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/1] Large folios in block buffered IO path
2024-12-02 10:08 ` Mateusz Guzik
@ 2024-12-03 5:01 ` Bharata B Rao
0 siblings, 0 replies; 24+ messages in thread
From: Bharata B Rao @ 2024-12-03 5:01 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-block, linux-kernel, linux-fsdevel, linux-mm, nikunj,
willy, vbabka, david, akpm, yuzhao, axboe, viro, brauner, jack,
joshdon, clm
On 02-Dec-24 3:38 PM, Mateusz Guzik wrote:
> On Mon, Dec 2, 2024 at 10:37 AM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 28-Nov-24 10:01 AM, Mateusz Guzik wrote:
>>
>>> WIlly mentioned the folio wait queue hash table could be grown, you
>>> can find it in mm/filemap.c:
>>> 1062 #define PAGE_WAIT_TABLE_BITS 8
>>> 1063 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
>>> 1064 static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE]
>>> __cacheline_aligned;
>>> 1065
>>> 1066 static wait_queue_head_t *folio_waitqueue(struct folio *folio)
>>> 1067 {
>>> 1068 │ return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
>>> 1069 }
>>>
>>> Can you collect off cpu time? offcputime-bpfcc -K > /tmp/out
>>
>> Flamegraph for "perf record --off-cpu -F 99 -a -g --all-kernel
>> --kernel-callchains -- sleep 120" is attached.
>>
>> Off-cpu samples were collected for 120s at around 45th minute run of the
>> FIO benchmark that actually runs for 1hr. This run was with kernel that
>> had your inode_lock fix but no changes to PAGE_WAIT_TABLE_BITS.
>>
>> Hopefully this captures the representative sample of the scalability
>> issue with folio lock.
Here is the data from the offcputime-bpfcc -K run with the inode_lock fix
and no change to PAGE_WAIT_TABLE_BITS. This data was captured for the
entire duration of the FIO run (1 hr). Since the data is huge, I am
pasting only a few relevant entries.
The first entry in the offcputime records
finish_task_switch.isra.0
schedule
irqentry_exit_to_user_mode
irqentry_exit
sysvec_reschedule_ipi
asm_sysvec_reschedule_ipi
- fio (33790)
2
There are thousands of entries for the FIO read and write paths; I
have shown only the first and last entries for each here.
First entry for FIO read path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
filemap_get_pages
filemap_read
blkdev_read_iter
vfs_read
ksys_read
__x64_sys_read
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34143)
3381769535
Last entry for FIO read path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
filemap_get_pages
filemap_read
blkdev_read_iter
vfs_read
ksys_read
__x64_sys_read
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34171)
3516224519
First entry for FIO write path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
__filemap_get_folio
iomap_get_folio
iomap_write_begin
iomap_file_buffered_write
blkdev_write_iter
vfs_write
ksys_write
__x64_sys_write
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (33842)
48900
Last entry for FIO write path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
__filemap_get_folio
iomap_get_folio
iomap_write_begin
iomap_file_buffered_write
blkdev_write_iter
vfs_write
ksys_write
__x64_sys_write
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34187)
1815993
The last entry in the offcputime records
finish_task_switch.isra.0
schedule
futex_wait_queue
__futex_wait
futex_wait
do_futex
__x64_sys_futex
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- multipathd (6308)
3698877753
Regards,
Bharata.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-12-03 5:02 UTC | newest]
Thread overview: 24+ messages
2024-11-27 5:47 [RFC PATCH 0/1] Large folios in block buffered IO path Bharata B Rao
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27 6:26 ` Christoph Hellwig
2024-11-27 10:37 ` Bharata B Rao
2024-11-28 5:43 ` Christoph Hellwig
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
2024-11-27 6:19 ` Mateusz Guzik
2024-11-27 12:02 ` Jan Kara
2024-11-27 12:13 ` Christian Brauner
2024-11-28 5:40 ` Ritesh Harjani
2024-11-27 12:18 ` Bharata B Rao
2024-11-27 12:28 ` Mateusz Guzik
2024-11-28 4:01 ` Bharata B Rao
2024-11-28 4:22 ` Matthew Wilcox
2024-11-28 4:37 ` Bharata B Rao
2024-11-28 11:23 ` Bharata B Rao
2024-11-28 23:31 ` Mateusz Guzik
2024-11-29 10:32 ` Bharata B Rao
2024-11-28 4:22 ` Mateusz Guzik
2024-11-28 4:31 ` Mateusz Guzik
2024-12-02 9:37 ` Bharata B Rao
2024-12-02 10:08 ` Mateusz Guzik
2024-12-03 5:01 ` Bharata B Rao
2024-11-28 4:43 ` Matthew Wilcox