* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
       [not found] <aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm>
@ 2023-04-13 21:33 ` Bernd Schubert
  2023-04-14  4:07   ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2023-04-13 21:33 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm

Sorry, forgot to add Andrew and linux-mm into CC.

On 4/13/23 23:27, Bernd Schubert wrote:
> Hello,
> 
> I found a weird mmap read behavior while benchmarking the fuse-uring
> patches. I did not verify it yet, but it does not look fuse-specific.
> Basically, I started to check because fio results were much lower
> than expected (better with the new code, though).
> 
> fio cmd line:
> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse 
> --directory=/scratch/dest/ --rw=read multi-file.fio
> 
> 
> bernd@squeeze1 test2>cat multi-file.fio
> [global]
> group_reporting
> bs=1M
> runtime=300
> 
> [test]
> 
> This sequential fio job sets POSIX_MADV_SEQUENTIAL and then does memcpy
> beginning at offset 0 in 1MB steps (verified with additional
> logging in fio's engines/mmap.c).
> 
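> A rough sketch of what the fio mmap engine effectively does here
> (illustration only, not the actual fio code; fd, fsize, bs and buf are
> assumed to be set up elsewhere, error handling omitted):
> 
>   #include <string.h>
>   #include <sys/mman.h>
> 
>   static void mmap_seq_read(int fd, size_t fsize, size_t bs, char *buf)
>   {
>           /* map the file read-only and hint sequential access */
>           char *p = mmap(NULL, fsize, PROT_READ, MAP_SHARED, fd, 0);
> 
>           posix_madvise(p, fsize, POSIX_MADV_SEQUENTIAL);
> 
>           /* copy forward in bs-sized steps (bs = 1MB in the job above) */
>           for (size_t off = 0; off < fsize; off += bs)
>                   memcpy(buf, p + off, bs);
> 
>           munmap(p, fsize);
>   }
> 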
> Additional logging in fuse_readahead() gives:
> 
> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 
> index=0
> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 
> index=255
> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 
> index=254
> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 
> index=253
> ... <count is always 1 page>
> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 
> index=64
> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 
> index=511
> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 
> index=510
> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 
> index=509
> ... <count is always 1 page>
> 
> Logging in do_sync_mmap_readahead()
> 
> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 
> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 
> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 
> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 
> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> ...
> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 
> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 
> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 
> ras_prev_pos=-1
> 
> 
> So we can see from fuse that it starts to read at page index 0, wants
> 64 pages (which is actually double the bdi read_ahead_kb), then skips
> indices 64...254 and immediately jumps to index 255. For the mmapped
> memcpy, pages are missing, and it then goes back in 1-page steps to
> get them.
> 
> A workaround here is to set read_ahead_kb in the bdi to a larger value;
> another workaround might be (untested) to increase the read-ahead
> window. Either of these only works around the index order above.
> 
> I understand that read-ahead gets limited by the bdi value (although
> it is exceeded above), but why does it go back in 1-page steps? My
> expectation would have been:
> 
> index=0  count=32 (128kb read-ahead)
> index=32 count=32
> index=64 count=32
> ...
> 
> 
> This is with plain 6.2 + fuse-uring patches.
> 
> Thanks,
> Bernd



* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-13 21:33 ` sequential 1MB mmap read ends in 1 page sync read-ahead Bernd Schubert
@ 2023-04-14  4:07   ` Matthew Wilcox
  2023-04-14 15:05     ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2023-04-14  4:07 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm

On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
> Sorry, forgot to add Andrew and linux-mm into CC.
> 
> On 4/13/23 23:27, Bernd Schubert wrote:
> > Hello,
> > 
> > I found a weird mmap read behavior while benchmarking the fuse-uring
> > patches.
> > I did not verify yet, but it does not look fuse specific.
> > Basically, I started to check because fio results were much lower
> > than expected (better with the new code, though)
> > 
> > fio cmd line:
> > fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
> > --directory=/scratch/dest/ --rw=read multi-file.fio
> > 
> > 
> > bernd@squeeze1 test2>cat multi-file.fio
> > [global]
> > group_reporting
> > bs=1M
> > runtime=300
> > 
> > [test]
> > 
> > This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy
> > beginning at offset 0 in 1MB steps (verified with additional
> > logging in fios engines/mmap.c).
> > 
> > And additional log in fuse_readahead() gives
> > 
> > [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
> > index=0
> > [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
> > index=255
> > [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
> > index=254
> > [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
> > index=253
> > ... <count is always 1 page>
> > [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
> > index=64
> > [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
> > index=511
> > [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
> > index=510
> > [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
> > index=509
> > ... <count is always 1 page>
> > 
> > Logging in do_sync_mmap_readahead()
> > 
> > [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
> > ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
> > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
> > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
> > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > ...
> > [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
> > ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
> > ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
> > ras_prev_pos=-1
> > 
> > 
> > So we can see from fuse that it starts to read at page index 0, wants
> > 64 pages (which is actually the double of bdi read_ahead_kb), then
> > skips index 64...254) and immediately goes to index 255. For the mmaped
> > memcpy pages are missing and then it goes back in 1 page steps to get
> > these.
> > 
> > A workaround here is to set read_ahead_kb in the bdi to a larger
> > value, another workaround might be (untested) to increase the read-ahead
> > window. Either of these two seem to be workarounds for the index order
> > above.
> > 
> > I understand that read-ahead gets limited by the bdi value (although
> > exceeded above), but why does it go back in 1 page steps? My expectation
> > would have been
> > 
> > index=0  count=32 (128kb read-head)
> > index=32 count=32
> > index=64 count=32

What I see with XFS is:

             fio-27518   [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
             fio-27518   [005] .....   276.565035: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x19868 ofs=16384 order=2
             fio-27518   [005] .....   276.565036: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x237fc ofs=32768 order=2
             fio-27518   [005] .....   276.565038: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x27518 ofs=49152 order=2
             fio-27518   [005] .....   276.565039: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14c7c ofs=65536 order=2
             fio-27518   [005] .....   276.565040: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14338 ofs=81920 order=2
             fio-27518   [005] .....   276.565041: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x201fc ofs=98304 order=2
             fio-27518   [005] .....   276.565042: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1fb98 ofs=114688 order=2
             fio-27518   [005] .....   276.565044: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14510 ofs=131072 order=2
             fio-27518   [005] .....   276.565045: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1e88c ofs=147456 order=2
             fio-27518   [005] .....   276.565046: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x26f00 ofs=163840 order=2

...

 dev 8:32 ino 44 pfn=0x14f30 ofs=262144 order=4
             fio-27518   [005] .....   276.567718: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x145a0 ofs=327680 order=4
             fio-27518   [005] .....   276.567720: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15730 ofs=393216 order=4
             fio-27518   [005] .....   276.567722: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e30 ofs=458752 order=4
             fio-27518   [005] .....   276.567942: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x18b00 ofs=524288 order=6
             fio-27518   [005] .....   276.569982: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6

... it then gets "stuck" at order-6, which is expected for a 256kB
readahead window (an order-6 folio is 2^6 = 64 pages, i.e. 256kB with
4kB pages, matching the 256kB steps between the offsets above).

This is produced by:

echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
less /sys/kernel/tracing/trace




* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14  4:07   ` Matthew Wilcox
@ 2023-04-14 15:05     ` Bernd Schubert
  2023-04-14 19:52       ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2023-04-14 15:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm



On 4/14/23 06:07, Matthew Wilcox wrote:
> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
>> Sorry, forgot to add Andrew and linux-mm into CC.
>>
>> On 4/13/23 23:27, Bernd Schubert wrote:
>>> Hello,
>>>
>>> I found a weird mmap read behavior while benchmarking the fuse-uring
>>> patches.
>>> I did not verify yet, but it does not look fuse specific.
>>> Basically, I started to check because fio results were much lower
>>> than expected (better with the new code, though)
>>>
>>> fio cmd line:
>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
>>> --directory=/scratch/dest/ --rw=read multi-file.fio
>>>
>>>
>>> bernd@squeeze1 test2>cat multi-file.fio
>>> [global]
>>> group_reporting
>>> bs=1M
>>> runtime=300
>>>
>>> [test]
>>>
>>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy
>>> beginning at offset 0 in 1MB steps (verified with additional
>>> logging in fios engines/mmap.c).
>>>
>>> And additional log in fuse_readahead() gives
>>>
>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>> index=0
>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>> index=255
>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>> index=254
>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>> index=253
>>> ... <count is always 1 page>
>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>> index=64
>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>> index=511
>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>> index=510
>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>> index=509
>>> ... <count is always 1 page>
>>>
>>> Logging in do_sync_mmap_readahead()
>>>
>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
>>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>> ...
>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
>>> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
>>> ras_prev_pos=-1
>>>
>>>
>>> So we can see from fuse that it starts to read at page index 0, wants
>>> 64 pages (which is actually the double of bdi read_ahead_kb), then
>>> skips index 64...254) and immediately goes to index 255. For the mmaped
>>> memcpy pages are missing and then it goes back in 1 page steps to get
>>> these.
>>>
>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>> value, another workaround might be (untested) to increase the read-ahead
>>> window. Either of these two seem to be workarounds for the index order
>>> above.
>>>
>>> I understand that read-ahead gets limited by the bdi value (although
>>> exceeded above), but why does it go back in 1 page steps? My expectation
>>> would have been
>>>
>>> index=0  count=32 (128kb read-head)
>>> index=32 count=32
>>> index=64 count=32
> 
> What I see with XFS is:
> 
>               fio-27518   [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
>               fio-27518   [005] .....   276.565035: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x19868 ofs=16384 order=2
>               fio-27518   [005] .....   276.565036: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x237fc ofs=32768 order=2
>               fio-27518   [005] .....   276.565038: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x27518 ofs=49152 order=2
>               fio-27518   [005] .....   276.565039: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14c7c ofs=65536 order=2
>               fio-27518   [005] .....   276.565040: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14338 ofs=81920 order=2
>               fio-27518   [005] .....   276.565041: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x201fc ofs=98304 order=2
>               fio-27518   [005] .....   276.565042: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1fb98 ofs=114688 order=2
>               fio-27518   [005] .....   276.565044: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14510 ofs=131072 order=2
>               fio-27518   [005] .....   276.565045: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1e88c ofs=147456 order=2
>               fio-27518   [005] .....   276.565046: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x26f00 ofs=163840 order=2
> 
> ...
> 
>   dev 8:32 ino 44 pfn=0x14f30 ofs=262144 order=4
>               fio-27518   [005] .....   276.567718: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x145a0 ofs=327680 order=4
>               fio-27518   [005] .....   276.567720: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15730 ofs=393216 order=4
>               fio-27518   [005] .....   276.567722: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e30 ofs=458752 order=4
>               fio-27518   [005] .....   276.567942: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x18b00 ofs=524288 order=6
>               fio-27518   [005] .....   276.569982: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6
> 
> ... it then gets "stuck" at order-6, which is expected for a 256kB
> readahead window.
> 
> This is produced by:
> 
> echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
> echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
> less /sys/kernel/tracing/trace
> 

Thanks for looking at it, Matthew!

I see the same on xfs as on fuse - the same output as I initially
posted (except for the fuse logging, of course).

With tracing enabled and the file size reduced to 2M:

squeeze1:~# grep fio /sys/kernel/tracing/trace
              fio-3459    [018] ..... 65055.425435: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6b0 ofs=0 order=2
              fio-3459    [018] ..... 65055.425456: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6ac ofs=16384 order=2
              fio-3459    [018] ..... 65055.425464: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a8 ofs=32768 order=2
              fio-3459    [018] ..... 65055.425472: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a4 ofs=49152 order=2
              fio-3459    [018] ..... 65055.425480: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a0 ofs=65536 order=2
              fio-3459    [018] ..... 65055.425489: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c69c ofs=81920 order=2
              fio-3459    [018] ..... 65055.425497: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c698 ofs=98304 order=2
              fio-3459    [018] ..... 65055.425505: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c694 ofs=114688 order=2
              fio-3459    [018] ..... 65055.425513: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c690 ofs=131072 order=2
              fio-3459    [018] ..... 65055.425521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c68c ofs=147456 order=2
              fio-3459    [018] ..... 65055.425529: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c688 ofs=163840 order=2
              fio-3459    [018] ..... 65055.425538: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c684 ofs=180224 order=2
              fio-3459    [018] ..... 65055.425545: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c680 ofs=196608 order=2
              fio-3459    [018] ..... 65055.425553: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c83c ofs=212992 order=2
              fio-3459    [018] ..... 65055.425561: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c838 ofs=229376 order=2
              fio-3459    [018] ..... 65055.425569: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c834 ofs=245760 order=2
              fio-3459    [011] ..... 65055.436500: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb8 ofs=1044480 order=0
              fio-3459    [011] ..... 65055.436530: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb9 ofs=1048576 order=0
              fio-3459    [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0

And then it stays stuck at order=0:

              fio-3459    [002] ..... 65060.353420: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12afc2 ofs=1306624 order=0

It is no different with the file size increased to 10M.

squeeze1:~# cat /proc/self/mountinfo |grep "252:16"
88 28 252:16 / /scratch rw,relatime shared:44 - xfs /dev/vdb rw,attr2,inode64,logbufs=8,logbsize=32k,noquota


squeeze1:~# cat /sys/class/bdi/252\:16/read_ahead_kb
128



With read_ahead_kb increased to 1024 (1MB read-ahead):
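(set e.g. via: echo 1024 > /sys/class/bdi/252\:16/read_ahead_kb)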

              fio-3568    [013] ..... 66481.269207: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12b8a0 ofs=0 order=2
              fio-3568    [013] ..... 66481.269224: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d5dc ofs=16384 order=2
              fio-3568    [013] ..... 66481.269233: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d8dc ofs=32768 order=2
              fio-3568    [013] ..... 66481.269242: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12aae4 ofs=49152 order=2
              fio-3568    [013] ..... 66481.269252: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12abb0 ofs=65536 order=2
...
              fio-3568    [013] ..... 66481.276736: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12da00 ofs=2097152 order=4
              fio-3568    [013] ..... 66481.276758: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d4f0 ofs=2162688 order=4
              fio-3568    [013] ..... 66481.276785: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d910 ofs=2228224 order=4
              fio-3568    [013] ..... 66481.276812: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c3c0 ofs=2293760 order=4
              fio-3568    [013] ..... 66481.278920: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d340 ofs=4194304 order=6
              fio-3568    [013] ..... 66481.279033: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12db80 ofs=4456448 order=6
              fio-3568    [013] ..... 66481.279183: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dbc0 ofs=4718592 order=6
              fio-3568    [013] ..... 66481.279302: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e000 ofs=4980736 order=6
              fio-3568    [013] ..... 66481.279422: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e040 ofs=5242880 order=6
              fio-3568    [013] ..... 66481.279521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e080 ofs=5505024 order=6
              fio-3568    [013] ..... 66481.279938: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e0c0 ofs=5767168 order=6
              fio-3568    [013] ..... 66481.280038: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e100 ofs=6029312 order=6
              fio-3568    [013] ..... 66481.282445: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e200 ofs=6291456 order=8
              fio-3568    [013] ..... 66481.283986: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e300 ofs=7340032 order=8
              fio-3568    [013] ..... 66481.288688: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dc00 ofs=8388608 order=9
              fio-3568    [013] ..... 66481.291731: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12de00 ofs=10485760 order=9


Thanks,
Bernd



* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 15:05     ` Bernd Schubert
@ 2023-04-14 19:52       ` Bernd Schubert
  2023-04-14 20:47         ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2023-04-14 19:52 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm



On 4/14/23 17:05, Bernd Schubert wrote:
> 
> 
> On 4/14/23 06:07, Matthew Wilcox wrote:
>> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
>>> Sorry, forgot to add Andrew and linux-mm into CC.
>>>
>>> On 4/13/23 23:27, Bernd Schubert wrote:
>>>> Hello,
>>>>
>>>> I found a weird mmap read behavior while benchmarking the fuse-uring
>>>> patches.
>>>> I did not verify yet, but it does not look fuse specific.
>>>> Basically, I started to check because fio results were much lower
>>>> than expected (better with the new code, though)
>>>>
>>>> fio cmd line:
>>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
>>>> --directory=/scratch/dest/ --rw=read multi-file.fio
>>>>
>>>>
>>>> bernd@squeeze1 test2>cat multi-file.fio
>>>> [global]
>>>> group_reporting
>>>> bs=1M
>>>> runtime=300
>>>>
>>>> [test]
>>>>
>>>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy
>>>> beginning at offset 0 in 1MB steps (verified with additional
>>>> logging in fios engines/mmap.c).
>>>>
>>>> And additional log in fuse_readahead() gives
>>>>
>>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>> index=0
>>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>> index=255
>>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>> index=254
>>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>> index=253
>>>> ... <count is always 1 page>
>>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>> index=64
>>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>> index=511
>>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>> index=510
>>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>> index=509
>>>> ... <count is always 1 page>
>>>>
>>>> Logging in do_sync_mmap_readahead()
>>>>
>>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
>>>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>> ...
>>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
>>>> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
>>>> ras_prev_pos=-1
>>>>
>>>>
>>>> So we can see from fuse that it starts to read at page index 0, wants
>>>> 64 pages (which is actually the double of bdi read_ahead_kb), then
>>>> skips index 64...254) and immediately goes to index 255. For the mmaped
>>>> memcpy pages are missing and then it goes back in 1 page steps to get
>>>> these.
>>>>
>>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>>> value, another workaround might be (untested) to increase the 
>>>> read-ahead
>>>> window. Either of these two seem to be workarounds for the index order
>>>> above.
>>>>
>>>> I understand that read-ahead gets limited by the bdi value (although
>>>> exceeded above), but why does it go back in 1 page steps? My 
>>>> expectation
>>>> would have been
>>>>
>>>> index=0  count=32 (128kb read-head)
>>>> index=32 count=32
>>>> index=64 count=32
>>
>> What I see with XFS is:
>>
>>               fio-27518   [005] .....   276.565025: 
>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2

...

>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 
>> order=6
>>
>> ... it then gets "stuck" at order-6, which is expected for a 256kB
>> readahead window.
>>
>> This is produced by:
>>
>> echo 1 
>> >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse 
>> --directory=/mnt/scratch/ --rw=read multi-file.fio
>> echo 0 
>> >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>> less /sys/kernel/tracing/trace
>>
> 
> Thanks for looking at it Matthew!
> 
> I see the same as on fuse on xfs - same output as I initially
> posted (except the fuse logging, of course).
> 
> With tracing and reduced file size to 2M
> 

...

>               fio-3459    [011] ..... 65055.436534: 
> mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 
> order=0
> 
> And then it is stuck at order=0
> 

When I reduce bs to 4K I get results similar to yours:

fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/scratch/source/ --rw=read multi-file.fio --bs=4K --group_reporting

(so bs set to 4K)

and a basically empty job file

bernd@squeeze1 test2>cat multi-file.fio
[global]

[test]



Up to bs=512K it works fine; 1M (and, for that matter, already 768K)
introduces the order=0 issue.


Thanks,
Bernd




* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 19:52       ` Bernd Schubert
@ 2023-04-14 20:47         ` Bernd Schubert
  2023-04-14 22:02           ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2023-04-14 20:47 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm



On 4/14/23 21:52, Bernd Schubert wrote:
> 
> 
> On 4/14/23 17:05, Bernd Schubert wrote:
>>
>>
>> On 4/14/23 06:07, Matthew Wilcox wrote:
>>> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
>>>> Sorry, forgot to add Andrew and linux-mm into CC.
>>>>
>>>> On 4/13/23 23:27, Bernd Schubert wrote:
>>>>> Hello,
>>>>>
>>>>> I found a weird mmap read behavior while benchmarking the fuse-uring
>>>>> patches.
>>>>> I did not verify yet, but it does not look fuse specific.
>>>>> Basically, I started to check because fio results were much lower
>>>>> than expected (better with the new code, though)
>>>>>
>>>>> fio cmd line:
>>>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
>>>>> --directory=/scratch/dest/ --rw=read multi-file.fio
>>>>>
>>>>>
>>>>> bernd@squeeze1 test2>cat multi-file.fio
>>>>> [global]
>>>>> group_reporting
>>>>> bs=1M
>>>>> runtime=300
>>>>>
>>>>> [test]
>>>>>
>>>>> This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy
>>>>> beginning at offset 0 in 1MB steps (verified with additional
>>>>> logging in fios engines/mmap.c).
>>>>>
>>>>> And additional log in fuse_readahead() gives
>>>>>
>>>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>>> index=0
>>>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>>> index=255
>>>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>>> index=254
>>>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>>> index=253
>>>>> ... <count is always 1 page>
>>>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>>> index=64
>>>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
>>>>> index=511
>>>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>>> index=510
>>>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
>>>>> index=509
>>>>> ... <count is always 1 page>
>>>>>
>>>>> Logging in do_sync_mmap_readahead()
>>>>>
>>>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
>>>>> ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
>>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
>>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
>>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>> ...
>>>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
>>>>> ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
>>>>> ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
>>>>> ras_prev_pos=-1
>>>>>
>>>>>
>>>>> So we can see from fuse that it starts to read at page index 0, wants
>>>>> 64 pages (which is actually the double of bdi read_ahead_kb), then
>>>>> skips index 64...254) and immediately goes to index 255. For the 
>>>>> mmaped
>>>>> memcpy pages are missing and then it goes back in 1 page steps to get
>>>>> these.
>>>>>
>>>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>>>> value, another workaround might be (untested) to increase the 
>>>>> read-ahead
>>>>> window. Either of these two seem to be workarounds for the index order
>>>>> above.
>>>>>
>>>>> I understand that read-ahead gets limited by the bdi value (although
>>>>> exceeded above), but why does it go back in 1 page steps? My 
>>>>> expectation
>>>>> would have been
>>>>>
>>>>> index=0  count=32 (128kb read-head)
>>>>> index=32 count=32
>>>>> index=64 count=32
>>>
>>> What I see with XFS is:
>>>
>>>               fio-27518   [005] .....   276.565025: 
>>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
> 
> ...
> 
>>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 
>>> order=6
>>>
>>> ... it then gets "stuck" at order-6, which is expected for a 256kB
>>> readahead window.
>>>
>>> This is produced by:
>>>
>>> echo 1 
>>> >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> fio --size=1G --numjobs=1 --ioengine=mmap 
>>> --output-format=normal,terse --directory=/mnt/scratch/ --rw=read 
>>> multi-file.fio
>>> echo 0 
>>> >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> less /sys/kernel/tracing/trace
>>>
>>
>> Thanks for looking at it Matthew!
>>
>> I see the same as on fuse on xfs - same output as I initially
>> posted (except the fuse logging, of course).
>>
>> With tracing and reduced file size to 2M
>>
> 
> ...
> 
>>               fio-3459    [011] ..... 65055.436534: 
>> mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 
>> ofs=1052672 order=0
>>
>> And then it is stuck at order=0
>>
> 
> When I reduce bs to 4K I get similar results as you
> 
> fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse 
> --directory=/scratch/source/ --rw=read multi-file.fio --bs=4K 
> --group_reporting
> 
> (so bs set to 4K)
> 
> and a basically empty job file
> 
> bernd@squeeze1 test2>cat multi-file.fio
> [global]
> 
> [test]
> 
> 
> 
> Up to bs=512K it works fine, 1M (and for what it matters
> already 768K) introduce the order=0 issue.

Hmm, I replaced memcpy with a dumb version that copies byte by byte -
problem gone. Is it possible that the optimized memcpy causes some kind
of random-looking memory access pattern and confuses mm / readahead?
And somehow your memcpy or system is not doing that?
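
(For reference, the "dumb" version is essentially just a strictly
forward, byte-by-byte loop - roughly like the sketch below, the helper
I actually ran may differ in detail:)

  #include <stddef.h>

  static void *dumb_memcpy(void *dst, const void *src, size_t n)
  {
          /* strictly forward, one byte at a time - no vectorized or
           * backward-walking access pattern */
          char *d = dst;
          const char *s = src;

          while (n--)
                  *d++ = *s++;
          return dst;
  }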


Thanks,
Bernd





* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 20:47         ` Bernd Schubert
@ 2023-04-14 22:02           ` Matthew Wilcox
  2023-04-17 15:10             ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2023-04-14 22:02 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm

On Fri, Apr 14, 2023 at 10:47:39PM +0200, Bernd Schubert wrote:
> > Up to bs=512K it works fine, 1M (and for what it matters
> > already 768K) introduce the order=0 issue.
> 
> Hmm, I replaced memcpy with dumb version, that copies byte by byte - problem
> gone. Is it possible that the optimized memcpy causes caused kind of random
> memory access and confuses mm / readahead?
> And somehow your memcpy or system is not doing that?

Oh, that would make sense!  If the memcpy() works high-to-low, then
you'd see exactly the behaviour you're reporting.  Whereas low-to-high
results in the behaviour I'm seeing.

Hm.  I don't know what to do about that.  Maybe a "sufficiently large"
memcpy should call posix_madvise(src, n, POSIX_MADV_WILLNEED).
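
Something along those lines, purely as a sketch of the idea (not an
actual libc change - the 1MB threshold here is made up, and
posix_madvise() needs a page-aligned start address, hence the rounding):

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define LARGE_COPY_THRESHOLD  (1UL << 20)     /* arbitrary: 1MB */

  static void *memcpy_willneed(void *dst, const void *src, size_t n)
  {
          if (n >= LARGE_COPY_THRESHOLD) {
                  long page = sysconf(_SC_PAGESIZE);
                  uintptr_t start = (uintptr_t)src & ~((uintptr_t)page - 1);
                  size_t len = (uintptr_t)src + n - start;

                  /* best-effort hint to read the whole source range
                   * ahead, independent of the direction the copy loop
                   * later walks in */
                  posix_madvise((void *)start, len, POSIX_MADV_WILLNEED);
          }
          return memcpy(dst, src, n);
  }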



* Re: sequential 1MB mmap read ends in 1 page sync read-ahead
  2023-04-14 22:02           ` Matthew Wilcox
@ 2023-04-17 15:10             ` Bernd Schubert
  0 siblings, 0 replies; 7+ messages in thread
From: Bernd Schubert @ 2023-04-17 15:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Miklos Szeredi, Andrew Morton, linux-mm, linux-doc



On 4/15/23 00:02, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 10:47:39PM +0200, Bernd Schubert wrote:
>>> Up to bs=512K it works fine, 1M (and for what it matters
>>> already 768K) introduce the order=0 issue.
>>
>> Hmm, I replaced memcpy with dumb version, that copies byte by byte - problem
>> gone. Is it possible that the optimized memcpy causes caused kind of random
>> memory access and confuses mm / readahead?
>> And somehow your memcpy or system is not doing that?
> 
> Oh, that would make sense!  If the memcpy() works high-to-low, then
> you'd see exactly the behaviour you're reporting.  Whereas low-to-high
> results in the behaviour I'm seeing.

In my case it is not exactly high-to-low; it is more like low, then high,
then from high to low. The issue goes away with a sufficiently large RA
size, and RA behaves better when POSIX_MADV_SEQUENTIAL is not set.
It certainly depends on the memcpy implementation and the CPU (I have
tested on AVX and AVX2 systems).

> 
> Hm.  I don't know what to do about that.  Maybe a "sufficiently large"
> memcpy should call posix_madvise(src, n, POSIX_MADV_WILLNEED)


What would speak against ignoring POSIX_MADV_SEQUENTIAL for RA when it
detects that something breaks the contract?

But then I don't know how widely mmap is actually used for large memcpy
or memcmp - maybe we should just document the issue in the mmap man page
for now? Especially that POSIX_MADV_SEQUENTIAL should be avoided?



Thanks,
Bernd


