* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  [not found] <aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm>
@ 2023-04-13 21:33 ` Bernd Schubert
  2023-04-14  4:07 ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread

From: Bernd Schubert @ 2023-04-13 21:33 UTC (permalink / raw)
To: Matthew Wilcox (Oracle)
Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm

Sorry, forgot to add Andrew and linux-mm to CC.

On 4/13/23 23:27, Bernd Schubert wrote:
> Hello,
>
> I found a weird mmap read behavior while benchmarking the fuse-uring
> patches. I have not verified it yet, but it does not look fuse-specific.
> Basically, I started to check because the fio results were much lower
> than expected (better with the new code, though).
>
> fio command line:
> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
>     --directory=/scratch/dest/ --rw=read multi-file.fio
>
> bernd@squeeze1 test2>cat multi-file.fio
> [global]
> group_reporting
> bs=1M
> runtime=300
>
> [test]
>
> This sequential fio job sets POSIX_MADV_SEQUENTIAL and then does memcpy
> starting at offset 0 in 1MB steps (verified with additional logging in
> fio's engines/mmap.c).
>
> Additional logging in fuse_readahead() gives:
>
> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=0
> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=255
> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=254
> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=253
> ... <count is always 1 page>
> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=64
> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=511
> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=510
> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=509
> ... <count is always 1 page>
>
> Logging in do_sync_mmap_readahead():
>
> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
> ...
> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>
> So we can see from fuse that it starts to read at page index 0 and wants
> 64 pages (which is actually double the bdi read_ahead_kb), then skips
> indexes 64...254 and goes straight to index 255. For the mmapped memcpy
> those pages are then missing, so it walks back in 1-page steps to fetch
> them.
>
> A workaround here is to set read_ahead_kb in the bdi to a larger value;
> another (untested) workaround might be to increase the read-ahead window.
> Either of these two only seems to work around the index order above.
>
> I understand that read-ahead gets limited by the bdi value (although it
> is exceeded above), but why does it go back in 1-page steps? My
> expectation would have been
>
> index=0  count=32   (128kB read-ahead)
> index=32 count=32
> index=64 count=32
> ...
>
> This is with plain 6.2 + fuse-uring patches.
>
> Thanks,
> Bernd

^ permalink raw reply	[flat|nested] 7+ messages in thread
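For readers not familiar with fio's mmap engine, the access pattern under
discussion is roughly the following. This is a simplified sketch with error
handling omitted, not fio's actual engines/mmap.c code; names and structure
are illustrative only.

  /* Sketch: sequential 1 MiB reads from a mapped file. */
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  static void sequential_mmap_read(int fd, size_t file_size)
  {
          size_t bs = 1024 * 1024;                  /* bs=1M */
          char *map = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
          char *buf = malloc(bs);

          posix_madvise(map, file_size, POSIX_MADV_SEQUENTIAL);

          /* Each memcpy() faults in the mapped pages it touches,
           * which is what drives the readahead decisions discussed above. */
          for (size_t off = 0; off + bs <= file_size; off += bs)
                  memcpy(buf, map + off, bs);

          free(buf);
          munmap(map, file_size);
  }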
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead 2023-04-13 21:33 ` sequential 1MB mmap read ends in 1 page sync read-ahead Bernd Schubert @ 2023-04-14 4:07 ` Matthew Wilcox 2023-04-14 15:05 ` Bernd Schubert 0 siblings, 1 reply; 7+ messages in thread From: Matthew Wilcox @ 2023-04-14 4:07 UTC (permalink / raw) To: Bernd Schubert; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote: > Sorry, forgot to add Andrew and linux-mm into CC. > > On 4/13/23 23:27, Bernd Schubert wrote: > > Hello, > > > > I found a weird mmap read behavior while benchmarking the fuse-uring > > patches. > > I did not verify yet, but it does not look fuse specific. > > Basically, I started to check because fio results were much lower > > than expected (better with the new code, though) > > > > fio cmd line: > > fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse > > --directory=/scratch/dest/ --rw=read multi-file.fio > > > > > > bernd@squeeze1 test2>cat multi-file.fio > > [global] > > group_reporting > > bs=1M > > runtime=300 > > > > [test] > > > > This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy > > beginning at offset 0 in 1MB steps (verified with additional > > logging in fios engines/mmap.c). > > > > And additional log in fuse_readahead() gives > > > > [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 > > index=0 > > [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 > > index=255 > > [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 > > index=254 > > [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 > > index=253 > > ... <count is always 1 page> > > [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 > > index=64 > > [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 > > index=511 > > [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 > > index=510 > > [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 > > index=509 > > ... <count is always 1 page> > > > > Logging in do_sync_mmap_readahead() > > > > [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 > > ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 > > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 > > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 > > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > ... > > [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 > > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 > > ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 > > ras_prev_pos=-1 > > > > > > So we can see from fuse that it starts to read at page index 0, wants > > 64 pages (which is actually the double of bdi read_ahead_kb), then > > skips index 64...254) and immediately goes to index 255. For the mmaped > > memcpy pages are missing and then it goes back in 1 page steps to get > > these. 
> >
> > A workaround here is to set read_ahead_kb in the bdi to a larger
> > value, another workaround might be (untested) to increase the read-ahead
> > window. Either of these two seem to be workarounds for the index order
> > above.
> >
> > I understand that read-ahead gets limited by the bdi value (although
> > exceeded above), but why does it go back in 1 page steps? My expectation
> > would have been
> >
> > index=0 count=32 (128kb read-head)
> > index=32 count=32
> > index=64 count=32

What I see with XFS is:

 fio-27518  [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
 fio-27518  [005] .....   276.565035: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x19868 ofs=16384 order=2
 fio-27518  [005] .....   276.565036: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x237fc ofs=32768 order=2
 fio-27518  [005] .....   276.565038: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x27518 ofs=49152 order=2
 fio-27518  [005] .....   276.565039: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14c7c ofs=65536 order=2
 fio-27518  [005] .....   276.565040: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14338 ofs=81920 order=2
 fio-27518  [005] .....   276.565041: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x201fc ofs=98304 order=2
 fio-27518  [005] .....   276.565042: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1fb98 ofs=114688 order=2
 fio-27518  [005] .....   276.565044: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14510 ofs=131072 order=2
 fio-27518  [005] .....   276.565045: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1e88c ofs=147456 order=2
 fio-27518  [005] .....   276.565046: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x26f00 ofs=163840 order=2

...

             dev 8:32 ino 44 pfn=0x14f30 ofs=262144 order=4
 fio-27518  [005] .....   276.567718: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x145a0 ofs=327680 order=4
 fio-27518  [005] .....   276.567720: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15730 ofs=393216 order=4
 fio-27518  [005] .....   276.567722: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e30 ofs=458752 order=4
 fio-27518  [005] .....   276.567942: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x18b00 ofs=524288 order=6
 fio-27518  [005] .....   276.569982: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6

... it then gets "stuck" at order-6, which is expected for a 256kB
readahead window.

This is produced by:

echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
less /sys/kernel/tracing/trace

^ permalink raw reply	[flat|nested] 7+ messages in thread
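For reference, the order field in these trace lines is the folio order,
i.e. log2 of the number of base pages in the page-cache allocation.
Assuming 4 KiB pages:

  order=0 ->   1 page  =   4 KiB
  order=2 ->   4 pages =  16 KiB
  order=6 ->  64 pages = 256 KiB  (one full 256 kB readahead window)
  order=9 -> 512 pages =   2 MiB

so a healthy sequential ramp-up shows growing orders, while a stream of
order=0 entries means single-page reads.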
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead 2023-04-14 4:07 ` Matthew Wilcox @ 2023-04-14 15:05 ` Bernd Schubert 2023-04-14 19:52 ` Bernd Schubert 0 siblings, 1 reply; 7+ messages in thread From: Bernd Schubert @ 2023-04-14 15:05 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm On 4/14/23 06:07, Matthew Wilcox wrote: > On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote: >> Sorry, forgot to add Andrew and linux-mm into CC. >> >> On 4/13/23 23:27, Bernd Schubert wrote: >>> Hello, >>> >>> I found a weird mmap read behavior while benchmarking the fuse-uring >>> patches. >>> I did not verify yet, but it does not look fuse specific. >>> Basically, I started to check because fio results were much lower >>> than expected (better with the new code, though) >>> >>> fio cmd line: >>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse >>> --directory=/scratch/dest/ --rw=read multi-file.fio >>> >>> >>> bernd@squeeze1 test2>cat multi-file.fio >>> [global] >>> group_reporting >>> bs=1M >>> runtime=300 >>> >>> [test] >>> >>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy >>> beginning at offset 0 in 1MB steps (verified with additional >>> logging in fios engines/mmap.c). >>> >>> And additional log in fuse_readahead() gives >>> >>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>> index=0 >>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>> index=255 >>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>> index=254 >>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>> index=253 >>> ... <count is always 1 page> >>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>> index=64 >>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>> index=511 >>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>> index=510 >>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>> index=509 >>> ... <count is always 1 page> >>> >>> Logging in do_sync_mmap_readahead() >>> >>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 >>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 >>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 >>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 >>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> ... >>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 >>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 >>> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 >>> ras_prev_pos=-1 >>> >>> >>> So we can see from fuse that it starts to read at page index 0, wants >>> 64 pages (which is actually the double of bdi read_ahead_kb), then >>> skips index 64...254) and immediately goes to index 255. For the mmaped >>> memcpy pages are missing and then it goes back in 1 page steps to get >>> these. 
>>> >>> A workaround here is to set read_ahead_kb in the bdi to a larger >>> value, another workaround might be (untested) to increase the read-ahead >>> window. Either of these two seem to be workarounds for the index order >>> above. >>> >>> I understand that read-ahead gets limited by the bdi value (although >>> exceeded above), but why does it go back in 1 page steps? My expectation >>> would have been >>> >>> index=0 count=32 (128kb read-head) >>> index=32 count=32 >>> index=64 count=32 > > What I see with XFS is: > > fio-27518 [005] ..... 276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2 > fio-27518 [005] ..... 276.565035: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x19868 ofs=16384 order=2 > fio-27518 [005] ..... 276.565036: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x237fc ofs=32768 order=2 > fio-27518 [005] ..... 276.565038: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x27518 ofs=49152 order=2 > fio-27518 [005] ..... 276.565039: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14c7c ofs=65536 order=2 > fio-27518 [005] ..... 276.565040: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14338 ofs=81920 order=2 > fio-27518 [005] ..... 276.565041: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x201fc ofs=98304 order=2 > fio-27518 [005] ..... 276.565042: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1fb98 ofs=114688 order=2 > fio-27518 [005] ..... 276.565044: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14510 ofs=131072 order=2 > fio-27518 [005] ..... 276.565045: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1e88c ofs=147456 order=2 > fio-27518 [005] ..... 276.565046: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x26f00 ofs=163840 order=2 > > ... > > dev 8:32 ino 44 pfn=0x14f30 ofs=262144 order=4 > fio-27518 [005] ..... 276.567718: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x145a0 ofs=327680 order=4 > fio-27518 [005] ..... 276.567720: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15730 ofs=393216 order=4 > fio-27518 [005] ..... 276.567722: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e30 ofs=458752 order=4 > fio-27518 [005] ..... 276.567942: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x18b00 ofs=524288 order=6 > fio-27518 [005] ..... 276.569982: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6 > > ... it then gets "stuck" at order-6, which is expected for a 256kB > readahead window. > > This is produced by: > > echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable > fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio > echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable > less /sys/kernel/tracing/trace > Thanks for looking at it Matthew! I see the same as on fuse on xfs - same output as I initially posted (except the fuse logging, of course). With tracing and reduced file size to 2M squeeze1:~# grep fio /sys/kernel/tracing/trace fio-3459 [018] ..... 65055.425435: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6b0 ofs=0 order=2 fio-3459 [018] ..... 65055.425456: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6ac ofs=16384 order=2 fio-3459 [018] ..... 65055.425464: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a8 ofs=32768 order=2 fio-3459 [018] ..... 65055.425472: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a4 ofs=49152 order=2 fio-3459 [018] ..... 
65055.425480: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a0 ofs=65536 order=2 fio-3459 [018] ..... 65055.425489: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c69c ofs=81920 order=2 fio-3459 [018] ..... 65055.425497: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c698 ofs=98304 order=2 fio-3459 [018] ..... 65055.425505: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c694 ofs=114688 order=2 fio-3459 [018] ..... 65055.425513: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c690 ofs=131072 order=2 fio-3459 [018] ..... 65055.425521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c68c ofs=147456 order=2 fio-3459 [018] ..... 65055.425529: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c688 ofs=163840 order=2 fio-3459 [018] ..... 65055.425538: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c684 ofs=180224 order=2 fio-3459 [018] ..... 65055.425545: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c680 ofs=196608 order=2 fio-3459 [018] ..... 65055.425553: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c83c ofs=212992 order=2 fio-3459 [018] ..... 65055.425561: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c838 ofs=229376 order=2 fio-3459 [018] ..... 65055.425569: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c834 ofs=245760 order=2 fio-3459 [011] ..... 65055.436500: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb8 ofs=1044480 order=0 fio-3459 [011] ..... 65055.436530: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb9 ofs=1048576 order=0 fio-3459 [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0 And then it is stuck at order=0 fio-3459 [002] ..... 65060.353420: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12afc2 ofs=1306624 order=0 Not any different with increase file size to 10M. squeeze1:~# cat /proc/self/mountinfo |grep "252:16" 88 28 252:16 / /scratch rw,relatime shared:44 - xfs /dev/vdb rw,attr2,inode64,logbufs=8,logbsize=32k,noquota squeeze1:~# cat /sys/class/bdi/252\:16/read_ahead_kb 128 With increased RA to 1024kb fio-3568 [013] ..... 66481.269207: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12b8a0 ofs=0 order=2 fio-3568 [013] ..... 66481.269224: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d5dc ofs=16384 order=2 fio-3568 [013] ..... 66481.269233: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d8dc ofs=32768 order=2 fio-3568 [013] ..... 66481.269242: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12aae4 ofs=49152 order=2 fio-3568 [013] ..... 66481.269252: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12abb0 ofs=65536 order=2 ... fio-3568 [013] ..... 66481.276736: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12da00 ofs=2097152 order=4 fio-3568 [013] ..... 66481.276758: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d4f0 ofs=2162688 order=4 fio-3568 [013] ..... 66481.276785: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d910 ofs=2228224 order=4 fio-3568 [013] ..... 66481.276812: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c3c0 ofs=2293760 order=4 fio-3568 [013] ..... 66481.278920: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d340 ofs=4194304 order=6 fio-3568 [013] ..... 66481.279033: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12db80 ofs=4456448 order=6 fio-3568 [013] ..... 66481.279183: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dbc0 ofs=4718592 order=6 fio-3568 [013] ..... 
66481.279302: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e000 ofs=4980736 order=6 fio-3568 [013] ..... 66481.279422: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e040 ofs=5242880 order=6 fio-3568 [013] ..... 66481.279521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e080 ofs=5505024 order=6 fio-3568 [013] ..... 66481.279938: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e0c0 ofs=5767168 order=6 fio-3568 [013] ..... 66481.280038: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e100 ofs=6029312 order=6 fio-3568 [013] ..... 66481.282445: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e200 ofs=6291456 order=8 fio-3568 [013] ..... 66481.283986: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e300 ofs=7340032 order=8 fio-3568 [013] ..... 66481.288688: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dc00 ofs=8388608 order=9 fio-3568 [013] ..... 66481.291731: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12de00 ofs=10485760 order=9 Thanks, Bernd ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead 2023-04-14 15:05 ` Bernd Schubert @ 2023-04-14 19:52 ` Bernd Schubert 2023-04-14 20:47 ` Bernd Schubert 0 siblings, 1 reply; 7+ messages in thread From: Bernd Schubert @ 2023-04-14 19:52 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm On 4/14/23 17:05, Bernd Schubert wrote: > > > On 4/14/23 06:07, Matthew Wilcox wrote: >> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote: >>> Sorry, forgot to add Andrew and linux-mm into CC. >>> >>> On 4/13/23 23:27, Bernd Schubert wrote: >>>> Hello, >>>> >>>> I found a weird mmap read behavior while benchmarking the fuse-uring >>>> patches. >>>> I did not verify yet, but it does not look fuse specific. >>>> Basically, I started to check because fio results were much lower >>>> than expected (better with the new code, though) >>>> >>>> fio cmd line: >>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse >>>> --directory=/scratch/dest/ --rw=read multi-file.fio >>>> >>>> >>>> bernd@squeeze1 test2>cat multi-file.fio >>>> [global] >>>> group_reporting >>>> bs=1M >>>> runtime=300 >>>> >>>> [test] >>>> >>>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy >>>> beginning at offset 0 in 1MB steps (verified with additional >>>> logging in fios engines/mmap.c). >>>> >>>> And additional log in fuse_readahead() gives >>>> >>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>> index=0 >>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>> index=255 >>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>> index=254 >>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>> index=253 >>>> ... <count is always 1 page> >>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>> index=64 >>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>> index=511 >>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>> index=510 >>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>> index=509 >>>> ... <count is always 1 page> >>>> >>>> Logging in do_sync_mmap_readahead() >>>> >>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 >>>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 >>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 >>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 >>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> ... >>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 >>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 >>>> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 >>>> ras_prev_pos=-1 >>>> >>>> >>>> So we can see from fuse that it starts to read at page index 0, wants >>>> 64 pages (which is actually the double of bdi read_ahead_kb), then >>>> skips index 64...254) and immediately goes to index 255. 
>>>> For the mmaped memcpy pages are missing and then it goes back in
>>>> 1 page steps to get these.
>>>>
>>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>>> value, another workaround might be (untested) to increase the
>>>> read-ahead window. Either of these two seem to be workarounds for the
>>>> index order above.
>>>>
>>>> I understand that read-ahead gets limited by the bdi value (although
>>>> exceeded above), but why does it go back in 1 page steps? My
>>>> expectation would have been
>>>>
>>>> index=0 count=32 (128kb read-head)
>>>> index=32 count=32
>>>> index=64 count=32
>>
>> What I see with XFS is:
>>
>> fio-27518  [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
>
> ...
>
>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6
>>
>> ... it then gets "stuck" at order-6, which is expected for a 256kB
>> readahead window.
>>
>> This is produced by:
>>
>> echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
>> echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>> less /sys/kernel/tracing/trace
>>
>
> Thanks for looking at it Matthew!
>
> I see the same as on fuse on xfs - same output as I initially
> posted (except the fuse logging, of course).
>
> With tracing and reduced file size to 2M
>

...

> fio-3459  [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0
>
> And then it is stuck at order=0
>

When I reduce bs to 4K I get similar results to yours:

fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/scratch/source/ --rw=read multi-file.fio --bs=4K --group_reporting

(so bs set to 4K)

and a basically empty job file

bernd@squeeze1 test2>cat multi-file.fio
[global]

[test]

Up to bs=512K it works fine; 1M (and, for that matter, already 768K)
introduces the order=0 issue.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead 2023-04-14 19:52 ` Bernd Schubert @ 2023-04-14 20:47 ` Bernd Schubert 2023-04-14 22:02 ` Matthew Wilcox 0 siblings, 1 reply; 7+ messages in thread From: Bernd Schubert @ 2023-04-14 20:47 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm On 4/14/23 21:52, Bernd Schubert wrote: > > > On 4/14/23 17:05, Bernd Schubert wrote: >> >> >> On 4/14/23 06:07, Matthew Wilcox wrote: >>> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote: >>>> Sorry, forgot to add Andrew and linux-mm into CC. >>>> >>>> On 4/13/23 23:27, Bernd Schubert wrote: >>>>> Hello, >>>>> >>>>> I found a weird mmap read behavior while benchmarking the fuse-uring >>>>> patches. >>>>> I did not verify yet, but it does not look fuse specific. >>>>> Basically, I started to check because fio results were much lower >>>>> than expected (better with the new code, though) >>>>> >>>>> fio cmd line: >>>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse >>>>> --directory=/scratch/dest/ --rw=read multi-file.fio >>>>> >>>>> >>>>> bernd@squeeze1 test2>cat multi-file.fio >>>>> [global] >>>>> group_reporting >>>>> bs=1M >>>>> runtime=300 >>>>> >>>>> [test] >>>>> >>>>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy >>>>> beginning at offset 0 in 1MB steps (verified with additional >>>>> logging in fios engines/mmap.c). >>>>> >>>>> And additional log in fuse_readahead() gives >>>>> >>>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>>> index=0 >>>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>>> index=255 >>>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>>> index=254 >>>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>>> index=253 >>>>> ... <count is always 1 page> >>>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>>> index=64 >>>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 >>>>> index=511 >>>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>>> index=510 >>>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 >>>>> index=509 >>>>> ... <count is always 1 page> >>>>> >>>>> Logging in do_sync_mmap_readahead() >>>>> >>>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 >>>>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 >>>>> ras_prev_pos=-1 >>>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 >>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>>> ras_prev_pos=-1 >>>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 >>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>>> ras_prev_pos=-1 >>>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 >>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 >>>>> ras_prev_pos=-1 >>>>> ... 
>>>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>>
>>>>> So we can see from fuse that it starts to read at page index 0, wants
>>>>> 64 pages (which is actually the double of bdi read_ahead_kb), then
>>>>> skips index 64...254 and immediately goes to index 255. For the
>>>>> mmaped memcpy pages are missing and then it goes back in 1 page steps
>>>>> to get these.
>>>>>
>>>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>>>> value, another workaround might be (untested) to increase the
>>>>> read-ahead window. Either of these two seem to be workarounds for the
>>>>> index order above.
>>>>>
>>>>> I understand that read-ahead gets limited by the bdi value (although
>>>>> exceeded above), but why does it go back in 1 page steps? My
>>>>> expectation would have been
>>>>>
>>>>> index=0 count=32 (128kb read-head)
>>>>> index=32 count=32
>>>>> index=64 count=32
>>>
>>> What I see with XFS is:
>>>
>>> fio-27518  [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
>
> ...
>
>>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6
>>>
>>> ... it then gets "stuck" at order-6, which is expected for a 256kB
>>> readahead window.
>>>
>>> This is produced by:
>>>
>>> echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
>>> echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> less /sys/kernel/tracing/trace
>>>
>>
>> Thanks for looking at it Matthew!
>>
>> I see the same as on fuse on xfs - same output as I initially
>> posted (except the fuse logging, of course).
>>
>> With tracing and reduced file size to 2M
>>
>
> ...
>
>> fio-3459  [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0
>>
>> And then it is stuck at order=0
>>
>
> When I reduce bs to 4K I get similar results as you
>
> fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse
> --directory=/scratch/source/ --rw=read multi-file.fio --bs=4K
> --group_reporting
>
> (so bs set to 4K)
>
> and a basically empty job file
>
> bernd@squeeze1 test2>cat multi-file.fio
> [global]
>
> [test]
>
> Up to bs=512K it works fine, 1M (and for what it matters
> already 768K) introduce the order=0 issue.

Hmm, I replaced memcpy with a dumb version that copies byte by byte -
problem gone. Is it possible that the optimized memcpy causes a kind of
random memory access and confuses mm / readahead?
And somehow your memcpy or system is not doing that?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 7+ messages in thread
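For reference, the "dumb version" mentioned above would look roughly like
the following - a sketch of a strictly forward, low-to-high copy, not the
exact replacement that was used:

  /* Naive forward copy: touches source pages strictly in ascending order. */
  #include <stddef.h>

  static void dumb_memcpy(void *dst, const void *src, size_t n)
  {
          char *d = dst;
          const char *s = src;
          size_t i;

          for (i = 0; i < n; i++)
                  d[i] = s[i];
  }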
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 20:47 ` Bernd Schubert
@ 2023-04-14 22:02 ` Matthew Wilcox
  2023-04-17 15:10 ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread

From: Matthew Wilcox @ 2023-04-14 22:02 UTC (permalink / raw)
To: Bernd Schubert; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm

On Fri, Apr 14, 2023 at 10:47:39PM +0200, Bernd Schubert wrote:
> > Up to bs=512K it works fine, 1M (and for what it matters
> > already 768K) introduce the order=0 issue.
>
> Hmm, I replaced memcpy with dumb version, that copies byte by byte - problem
> gone. Is it possible that the optimized memcpy causes caused kind of random
> memory access and confuses mm / readahead?
> And somehow your memcpy or system is not doing that?

Oh, that would make sense!  If the memcpy() works high-to-low, then
you'd see exactly the behaviour you're reporting.  Whereas low-to-high
results in the behaviour I'm seeing.

Hm.  I don't know what to do about that.  Maybe a "sufficiently large"
memcpy should call posix_madvise(src, n, POSIX_MADV_WILLNEED)

^ permalink raw reply	[flat|nested] 7+ messages in thread
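A rough sketch of that idea - purely illustrative: the threshold, the
function name and the alignment handling below are made-up details, not an
actual libc interface or a concrete proposal:

  /* Sketch: hint the kernel before a large copy from an mmap'ed source. */
  #include <stdint.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define LARGE_COPY_BYTES (256 * 1024)   /* arbitrary threshold */

  static void *memcpy_willneed(void *dst, const void *src, size_t n)
  {
          if (n >= LARGE_COPY_BYTES) {
                  /* posix_madvise() wants a page-aligned start address. */
                  uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
                  uintptr_t start = (uintptr_t)src & ~(page - 1);

                  /* Purely advisory; a failure can be ignored here. */
                  posix_madvise((void *)start, n + ((uintptr_t)src - start),
                                POSIX_MADV_WILLNEED);
          }
          return memcpy(dst, src, n);
  }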
* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 22:02 ` Matthew Wilcox
@ 2023-04-17 15:10 ` Bernd Schubert
  0 siblings, 0 replies; 7+ messages in thread

From: Bernd Schubert @ 2023-04-17 15:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm, linux-doc

On 4/15/23 00:02, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 10:47:39PM +0200, Bernd Schubert wrote:
>>> Up to bs=512K it works fine, 1M (and for what it matters
>>> already 768K) introduce the order=0 issue.
>>
>> Hmm, I replaced memcpy with dumb version, that copies byte by byte - problem
>> gone. Is it possible that the optimized memcpy causes caused kind of random
>> memory access and confuses mm / readahead?
>> And somehow your memcpy or system is not doing that?
>
> Oh, that would make sense!  If the memcpy() works high-to-low, then
> you'd see exactly the behaviour you're reporting.  Whereas low-to-high
> results in the behaviour I'm seeing.

In my case it is not exactly high-to-low; it is more low, then high, then
from high back down to low. The issue goes away with a sufficiently large
RA size, and readahead behaves better when POSIX_MADV_SEQUENTIAL is not
set. This certainly depends on the memcpy implementation and the CPU (I
have tested on AVX and AVX2 systems).

> Hm.  I don't know what to do about that.  Maybe a "sufficiently large"
> memcpy should call posix_madvise(src, n, POSIX_MADV_WILLNEED)

What would speak against having readahead ignore POSIX_MADV_SEQUENTIAL
when it detects that the access pattern breaks that contract?
But then I don't know how much mmap/memcpy is actually used for large
copies or memcmp - maybe we should just document the issue in the mmap
man page for now? Especially that POSIX_MADV_SEQUENTIAL should be
avoided?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 7+ messages in thread