linux-mm.kvack.org archive mirror
From: Yang Shi <yang@os.amperecomputing.com>
To: Pedro Falcato <pedro.falcato@gmail.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam.Howlett@oracle.com, vbabka@suse.cz, jannh@google.com,
	oliver.sang@intel.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: vma: skip anonymous vma when inserting vma to file rmap tree
Date: Fri, 7 Mar 2025 09:51:33 -0800	[thread overview]
Message-ID: <687513f8-2ed9-479d-a6f8-986cebd562e8@os.amperecomputing.com> (raw)
In-Reply-To: <CAKbZUD3Gk8Qb4zznpCszXHzfAO82=rkTOb0_z6yVU0CXWAMoSA@mail.gmail.com>



On 3/7/25 5:35 AM, Pedro Falcato wrote:
> On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
>>> LKP reported an 800% performance improvement for the small-allocs
>>> benchmark from vm-scalability [1] with the patch ("/dev/zero: make
>>> private mapping full anonymous mapping") [2], but that patch was
>>> nack'ed since it changes the output of smaps somewhat.
>> Yeah sorry about that, but unfortunately something we really do have to
>> think about (among other things, the VMA edge cases are always the source
>> of weirdness...)
>>
>>> The profiling shows that one of the major sources of the
>>> performance improvement is reduced contention on i_mmap_rwsem.
>> Great work tracking that down! Sorry I lost track of the other thread.
>>
>>> The small-allocs benchmark creates a lot of 40K memory maps by
>>> mmap'ing private /dev/zero, then triggers page faults on the
>>> mappings.  When a private mapping of /dev/zero is created, an
>>> anonymous VMA is created, but it has a valid vm_file.  The kernel
>>> basically assumes anonymous VMAs have NULL vm_file; for example,
>>> mmap inserts a VMA into the file rmap tree if vm_file is not NULL.
>>> So the private /dev/zero mapping gets inserted into the file rmap
>>> tree, which results in contention on i_mmap_rwsem.  But it is
>>> actually an anonymous VMA, so inserting it into the file rmap tree
>>> is pointless.
>> Ughhhh god haha.
>>
>>> Skip anonymous VMAs in this case.  An over-400% performance
>>> improvement was reported [3].
>> That's insane. Amazing work.
>>
> Ok, so the real question (to Yang) is: who are these /dev/zero users
> that require an insane degree of scalability, and why didn't they
> switch to regular MAP_ANONYMOUS? Are they in the room with us?

I wish I knew. Who knows what applications use /dev/zero. But mmap'ing 
private /dev/zero is definitely an established way to create anonymous 
mappings, so we can't rule out that it is still used.

>
>>> It is not on par with the 800% improvement from the original patch.
>>> This is because the page fault handler needs to access some members
>>> of struct file when vm_file is not NULL, for example f_mode and
>>> f_mapping.  They are in the same cacheline as the file refcount.
>>> When mmap'ing a file, the refcount is inc'ed and dec'ed, which
>>> causes a bad cacheline false sharing problem.  Further debugging
>>> showed that checking whether the VMA is anonymous alleviates the
>>> problem, but I'm not sure that is the best way to handle it; maybe
>>> we should consider shuffling the layout of struct file.
>> Interesting, I guess you'll take a look at this also?
> ... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
> it's something like libc.so.6 at an insane rate of execs/second.
>
> This seems like a patch in search of a problem and I really don't see
> why we should wart up the mmap code otherwise. Not that I have a huge
> problem with this patch, which is somewhat simple and obvious.
> It'd be great if there was a real workload driving this rather than
> useless synthetic benchmarks.

Inserting an anonymous VMA into the file rmap tree is definitely not 
expected by other parts of the kernel. It is broken behavior (or at 
least a special case) IMHO. Making it behave correctly, so that the 
rest of the kernel sees no surprises (and one less special case), is 
worthwhile even if we disregard the potential performance gain.

Thanks,
Yang

>




Thread overview: 7+ messages
2025-03-06 21:49 Yang Shi
2025-03-07 13:12 ` Lorenzo Stoakes
2025-03-07 13:35   ` Pedro Falcato
2025-03-07 13:41     ` Lorenzo Stoakes
2025-03-07 17:58       ` Yang Shi
2025-03-07 17:51     ` Yang Shi [this message]
2025-03-07 17:34   ` Yang Shi
