From: Bernd Schubert <bernd.schubert@fastmail.fm>
To: Matthew Wilcox <willy@infradead.org>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org
Subject: Re: sequential 1MB mmap read ends in 1 page sync read-ahead
Date: Fri, 14 Apr 2023 22:47:39 +0200	[thread overview]
Message-ID: <c59e54f9-6eae-3c35-bce8-ac03af84b3ed@fastmail.fm> (raw)
In-Reply-To: <b8afbfba-a58d-807d-1bbc-3be4b5b08710@fastmail.fm>



On 4/14/23 21:52, Bernd Schubert wrote:
> 
> 
> On 4/14/23 17:05, Bernd Schubert wrote:
>>
>>
>> On 4/14/23 06:07, Matthew Wilcox wrote:
>>> On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
>>>> Sorry, forgot to add Andrew and linux-mm into CC.
>>>>
>>>> On 4/13/23 23:27, Bernd Schubert wrote:
>>>>> Hello,
>>>>>
>>>>> I found a weird mmap read behavior while benchmarking the fuse-uring
>>>>> patches. I have not verified this yet, but it does not look fuse
>>>>> specific. Basically, I started to check because fio results were much
>>>>> lower than expected (better with the new code, though).
>>>>>
>>>>> fio cmd line:
>>>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
>>>>> --directory=/scratch/dest/ --rw=read multi-file.fio
>>>>>
>>>>>
>>>>> bernd@squeeze1 test2>cat multi-file.fio
>>>>> [global]
>>>>> group_reporting
>>>>> bs=1M
>>>>> runtime=300
>>>>>
>>>>> [test]
>>>>>
>>>>> This sequential fio job sets POSIX_MADV_SEQUENTIAL and then does memcpy
>>>>> beginning at offset 0 in 1MB steps (verified with additional
>>>>> logging in fio's engines/mmap.c).
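
For reference, the access pattern boils down to roughly the following. This
is only a minimal user-space sketch of what the mmap engine ends up doing,
not fio's actual code; the file path is made up:

/*
 * Minimal sketch of the access pattern; NOT fio's actual mmap engine code.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/scratch/dest/test.0.0", O_RDONLY); /* hypothetical file */
        struct stat st;
        char *map, *buf;
        off_t off;
        const size_t bs = 1 << 20; /* bs=1M */

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        /* fio's mmap engine advises sequential access for --rw=read */
        posix_madvise(map, st.st_size, POSIX_MADV_SEQUENTIAL);

        buf = malloc(bs);
        for (off = 0; off + (off_t)bs <= st.st_size; off += bs)
                memcpy(buf, map + off, bs); /* sequential 1MB copies */

        free(buf);
        munmap(map, st.st_size);
        close(fd);
        return 0;
}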
>>>>>
>>>>> Additional logging in fuse_readahead() gives:
>>>>>
>>>>> [ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=0
>>>>> [ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=255
>>>>> [ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=254
>>>>> [ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=253
>>>>> ... <count is always 1 page>
>>>>> [ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=64
>>>>> [ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64 index=511
>>>>> [ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=510
>>>>> [ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1 index=509
>>>>> ... <count is always 1 page>
>>>>>
>>>>> Logging in do_sync_mmap_readahead()
>>>>>
>>>>> [ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0 ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> [ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> [ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> [ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> ...
>>>>> [ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0 ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>> [ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0 ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0 ras_prev_pos=-1
>>>>>
>>>>>
>>>>> So we can see from fuse that it starts to read at page index 0 and wants
>>>>> 64 pages (which is actually double the bdi read_ahead_kb), then skips
>>>>> indexes 64...254 and immediately goes to index 255. For the mmapped
>>>>> memcpy those pages are missing, and it then goes back in 1-page steps
>>>>> to get them.
>>>>>
>>>>> A workaround here is to set read_ahead_kb in the bdi to a larger
>>>>> value; another (untested) workaround might be to increase the
>>>>> read-ahead window. Either of these seems to be just a workaround for
>>>>> the index order above.
>>>>>
>>>>> I understand that read-ahead gets limited by the bdi value (although
>>>>> it is exceeded above), but why does it go back in 1-page steps? My
>>>>> expectation would have been
>>>>>
>>>>> index=0  count=32 (128kB read-ahead)
>>>>> index=32 count=32
>>>>> index=64 count=32
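
(For the arithmetic behind that expectation: with 4KiB pages, 32 pages per
chunk is 32 * 4KiB = 128KiB, i.e. exactly read_ahead_kb here, while the 64
pages that actually get requested are 256KiB, double that value.)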
>>>
>>> What I see with XFS is:
>>>
>>>   fio-27518   [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
> 
> ...
> 
>>> mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6
>>>
>>> ... it then gets "stuck" at order-6, which is expected for a 256kB
>>> readahead window.
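
(For reference: an order-6 folio is 2^6 = 64 pages, i.e. 256KiB with 4KiB
pages, so it matches the 256kB readahead window.)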
>>>
>>> This is produced by:
>>>
>>> echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
>>> echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
>>> less /sys/kernel/tracing/trace
>>>
>>
>> Thanks for looking at it, Matthew!
>>
>> I see the same on xfs as on fuse - the same output as I initially
>> posted (except the fuse logging, of course).
>>
>> With tracing enabled and the file size reduced to 2M:
>>
> 
> ...
> 
>>   fio-3459    [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0
>>
>> And then it is stuck at order=0
>>
> 
> When I reduce bs to 4K I get similar results to yours:
>
> fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/scratch/source/ --rw=read multi-file.fio --bs=4K --group_reporting
>
> (so bs set to 4K)
> 
> and a basically empty job file
> 
> bernd@squeeze1 test2>cat multi-file.fio
> [global]
> 
> [test]
> 
> 
> 
> Up to bs=512K it works fine; 1M (and in fact already 768K)
> introduces the order=0 issue.

Hmm, I replaced memcpy with a dumb version that copies byte by byte -
problem gone. Is it possible that the optimized memcpy causes some kind
of random-looking memory access pattern and confuses mm / readahead?
And somehow your memcpy or system is not doing that?
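
For reference, the replacement looks roughly like this (a sketch only; the
name dumb_copy is just illustrative, not the exact code I ran):

/*
 * Naive forward byte-by-byte copy used in place of memcpy() for the test.
 */
#include <stddef.h>

static void dumb_copy(char *dst, const char *src, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                dst[i] = src[i]; /* faults pages strictly in ascending order */
}

The guess being that such a loop faults the mapping strictly in ascending
order, while an optimized memcpy might touch the end of the range first
(tail handling, wide or overlapping loads), which to the readahead code
could look like backwards access.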


Thanks,
Bernd




