From: Yang Shi <yang.shi@linux.alibaba.com>
To: mhocko@kernel.org, willy@infradead.org,
	ldufour@linux.vnet.ibm.com, kirill@shutemov.name,
	akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC v5 0/2] mm: zap pages with read mmap_sem in munmap for large mapping
Date: Mon, 23 Jul 2018 15:00:05 -0700
Message-ID: <d0535691-4abb-b53a-64cf-1ee0ea59fd22@linux.alibaba.com>
In-Reply-To: <1531956101-8526-1-git-send-email-yang.shi@linux.alibaba.com>

Hi folks,


Any comments on this version?


Thanks,

Yang


On 7/18/18 4:21 PM, Yang Shi wrote:
> Background:
> Recently, when we ran some vm scalability tests on machines with large memory,
> we ran into a couple of mmap_sem scalability issues when unmapping a large
> memory space; please refer to https://lkml.org/lkml/2017/12/14/733 and
> https://lkml.org/lkml/2018/2/20/576.
>
>
> History:
> Then akpm suggested unmapping a large mapping section by section and dropping
> mmap_sem after each section to mitigate the issue (see https://lkml.org/lkml/2018/3/6/784).
>
> The v1 patch series was submitted to the mailing list per Andrew's suggestion
> (see https://lkml.org/lkml/2018/3/20/786), and I received a lot of great
> feedback and suggestions.
>
> Then this topic was discussed at the LSF/MM summit 2018. At the summit, Michal
> Hocko suggested (as he also did in the v1 review) trying a "two phases"
> approach: zap pages with the read mmap_sem held, then do the cleanup with the
> write mmap_sem held (for discussion details, see https://lwn.net/Articles/753269/).
>
>
> Approach:
> Zapping pages is the most time-consuming part. According to the suggestion
> from Michal Hocko [1], zapping pages can be done while holding the read
> mmap_sem, like what MADV_DONTNEED does; then re-acquire the write mmap_sem to
> clean up the vmas.
>
> But we can't call MADV_DONTNEED directly, since it has two major drawbacks:
>    * A page fault that wins the race in the middle of munmap sees an
>      unexpected state: it may return a zero page instead of the original
>      content or a SIGSEGV.
>    * It can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings,
>      which akpm considered a showstopper.
>
> But some parts still need the write mmap_sem, for example vma splitting. So
> the design is as follows (a simplified sketch in code follows the outline):
>          acquire write mmap_sem
>          lookup vmas (find and split vmas)
>          detach vmas
>          deal with special mappings
>          downgrade_write
>
>          zap pages
>          free page tables
>          release mmap_sem
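>
> For illustration, here is a minimal C sketch of that flow, assuming the
> two-phase locking described above. The locking calls (down_write(),
> downgrade_write(), up_read()) are real kernel primitives, but the helper
> names are made up for the sketch and are not the functions in the patch:
>
>          /* Illustrative sketch only -- not the actual patch. */
>          #include <linux/mm_types.h>   /* struct mm_struct, vm_area_struct */
>          #include <linux/rwsem.h>      /* down_write(), downgrade_write(), up_read() */
>
>          static int munmap_zap_rlock_sketch(struct mm_struct *mm,
>                                             unsigned long start, unsigned long len)
>          {
>                  struct vm_area_struct *vma;
>
>                  /* Phase 1: exclusive lock for manipulating the vma tree. */
>                  down_write(&mm->mmap_sem);
>                  vma = find_and_split_vmas(mm, start, len);  /* hypothetical */
>                  detach_vmas(mm, vma, start, len);           /* hypothetical */
>                  handle_special_mappings(vma);               /* hypothetical */
>                  downgrade_write(&mm->mmap_sem);
>
>                  /* Phase 2: the expensive work with only the read lock held. */
>                  zap_detached_pages(vma, start, len);        /* hypothetical */
>                  free_detached_page_tables(vma, start, len); /* hypothetical */
>                  up_read(&mm->mmap_sem);
>                  return 0;
>          }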
>
> VM events that take the read mmap_sem may still come in during page zapping,
> but since the vmas have been detached beforehand, they (page fault, gup, etc.)
> will not be able to find a valid vma and will just return SIGSEGV or -EFAULT
> as expected.
>
> If a vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP set, or has uprobes, it is
> considered a special mapping. Special mappings are dealt with before zapping
> pages, with the write mmap_sem held; basically, this just updates vm_flags.
>
> And, since special mappings are also manipulated by unmap_single_vma(), which
> is called by unmap_vmas() with the read mmap_sem held in this case, a new
> parameter called "skip_flags" is added to unmap_region(), unmap_vmas() and
> unmap_single_vma() to prevent vm_flags from being updated in the read critical
> section. If it is true, unmapping of those special mappings is just skipped.
> Currently, the only caller which passes true for this parameter is us, as
> sketched below.
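>
> A simplified sketch of how such a flag could gate the special-mapping handling
> inside the unmap path (the signature and helper names below are made up for
> illustration and do not match the actual unmap_single_vma()):
>
>          /* Illustrative sketch only -- not the actual kernel code. */
>          static void unmap_single_vma_sketch(struct vm_area_struct *vma,
>                                              unsigned long start, unsigned long end,
>                                              bool skip_flags)
>          {
>                  if (!skip_flags) {
>                          /*
>                           * Ordinary callers still do the VM_LOCKED | VM_HUGETLB |
>                           * VM_PFNMAP and uprobe handling here, which may touch
>                           * vm_flags.
>                           */
>                          handle_special_mapping(vma, start, end);  /* hypothetical */
>                  }
>
>                  /* With skip_flags true (the munmap path above), only the plain
>                     page zapping is done under the read mmap_sem. */
>                  zap_pages(vma, start, end);                       /* hypothetical */
>          }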
>
> With this approach we don't have to re-acquire the mmap_sem to clean up the
> vmas, which avoids a race window during which the address space might be
> changed.
>
> And, since the lock acquire/release cost is kept to a minimum, almost the
> same as before, the optimization can be extended to mappings of any size
> without incurring a significant penalty for small mappings.
>
> For the time being, this is only done in the munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact
> for stability reasons.
>
> Changelog:
> v4 -> v5:
> * Detach vmas before zapping pages so that we don't have to use VM_DEAD to
>    mark a vma that is being unmapped, since the vmas have already been detached
>    from the rbtree when pages are zapped. Per Kirill
> * Eliminate VM_DEAD stuff
> * With this change we don't have to re-acquire the write mmap_sem to do the
>    cleanup, so we can eliminate a potential race window
> * Eliminate the PUD_SIZE check, and extend this optimization to all sizes
>
> v3 -> v4:
> * Extend check_stable_address_space to check VM_DEAD as Michal suggested
> * Deal with vm_flags update of VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe
>    mappings with exclusive lock held. The actual unmapping is still done with read
>    mmap_sem to solve akpm's concern
> * Clean up vmas by calling do_munmap, rather than carrying the vmas over, to
>    prevent race conditions, as Kirill suggested
> * Extracted more common code
> * Addressed some code cleanup comments from akpm
> * Dropped uprobe and arch specific code, now all the changes are mm only
> * Still keep the PUD_SIZE threshold; if everyone thinks it is better to extend
>    to all or smaller sizes, it will be removed
> * Make this optimization 64 bit only explicitly per akpm's suggestion
>
> v2 -> v3:
> * Refactor the do_munmap code to extract the common part per Peter's suggestion
> * Introduced VM_DEAD flag per Michal's suggestion. Just handled VM_DEAD in
>    x86's page fault handler for the time being. Other architectures will be covered
>    once the patch series is reviewed
> * Now look up the vma (find and split) and set the VM_DEAD flag with the write
>    mmap_sem held, then zap the mapping with the read mmap_sem held, then clean
>    up pgtables and vmas with the write mmap_sem held, per Peter's suggestion
>
> v1 -> v2:
> * Re-implemented the code per the discussion on LSFMM summit
>
>
> Regression and performance data:
> Ran the below regression tests with the threshold manually set to 4K in the code:
>    * Full LTP
>    * Trinity (munmap/all vm syscalls)
>    * Stress-ng: mmap/mmapfork/mmapfixed/mmapaddr/mmapmany/vm
>    * mm-tests: kernbench, phpbench, sysbench-mariadb, will-it-scale
>    * vm-scalability
>
> With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
> seconds to the microsecond level.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |  vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us |    unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us |  }
>
> Here the execution time of unmap_region() is used to evaluate the time spent
> holding the read mmap_sem; the remaining time is spent holding the exclusive
> lock.
>
> Yang Shi (2):
>        mm: refactor do_munmap() to extract the common part
>        mm: mmap: zap pages with read mmap_sem in munmap
>
>   include/linux/mm.h |   2 +-
>   mm/memory.c        |  35 +++++++++++++-----
>   mm/mmap.c          | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------
>   3 files changed, 199 insertions(+), 57 deletions(-)
