Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

linux-mm.kvack.org archive mirror

From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com,
	anshuman.khandual@arm.com,  ardb@kernel.org,
	catalin.marinas@arm.com, david@redhat.com,  dvyukov@google.com,
	glider@google.com, james.morse@arm.com,  jhubbard@nvidia.com,
	linux-arm-kernel@lists.infradead.org,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mark.rutland@arm.com,  maz@kernel.org, oliver.upton@linux.dev,
	ryabinin.a.a@gmail.com,  suzuki.poulose@arm.com,
	vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com,
	 will@kernel.org, willy@infradead.org, yuzenghui@huawei.com,
	yuzhao@google.com,  ziy@nvidia.com
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
Date: Thu, 30 Nov 2023 13:34:13 +1300
Message-ID: <CAGsJ_4wXawhY5OsjjZUkgaBwXeFeBbcZa5NcsWETgXty=iogJg@mail.gmail.com>
In-Reply-To: <b485e908-770b-4cd9-8e45-b9107a581ae8@arm.com>

On Thu, Nov 30, 2023 at 3:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/11/2023 13:09, Barry Song wrote:
> > On Thu, Nov 30, 2023 at 1:29 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 19:00, Barry Song wrote:
> >>> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 28/11/2023 00:11, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 05:54, Barry Song wrote:
> >>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>>>> +              pte_t *dst_pte, pte_t *src_pte,
> >>>>>>>> +              unsigned long addr, unsigned long end,
> >>>>>>>> +              int *rss, struct folio **prealloc)
> >>>>>>>>  {
> >>>>>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>>>>      unsigned long vm_flags = src_vma->vm_flags;
> >>>>>>>>      pte_t pte = ptep_get(src_pte);
> >>>>>>>>      struct page *page;
> >>>>>>>>      struct folio *folio;
> >>>>>>>> +    int nr = 1;
> >>>>>>>> +    bool anon;
> >>>>>>>> +    bool any_dirty = pte_dirty(pte);
> >>>>>>>> +    int i;
> >>>>>>>>
> >>>>>>>>      page = vm_normal_page(src_vma, addr, pte);
> >>>>>>>> -    if (page)
> >>>>>>>> +    if (page) {
> >>>>>>>>              folio = page_folio(page);
> >>>>>>>> -    if (page && folio_test_anon(folio)) {
> >>>>>>>> -            /*
> >>>>>>>> -             * If this page may have been pinned by the parent process,
> >>>>>>>> -             * copy the page immediately for the child so that we'll always
> >>>>>>>> -             * guarantee the pinned page won't be randomly replaced in the
> >>>>>>>> -             * future.
> >>>>>>>> -             */
> >>>>>>>> -            folio_get(folio);
> >>>>>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>>>>>> -                    /* Page may be pinned, we have to copy. */
> >>>>>>>> -                    folio_put(folio);
> >>>>>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>>>>>> -                                             addr, rss, prealloc, page);
> >>>>>>>> +            anon = folio_test_anon(folio);
> >>>>>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>>>>>> +                                            end, pte, &any_dirty);
> >>>>>>>
> >>>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>>>>>> do madvise(addr + 4KB * 5, DONTNEED);
> >>>>>>
> >>>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >>>>>> page0~page4 and page6~15?
> >>>>>>
> >>>>>>>
> >>>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>>>>>
> >>>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >>>>>> not how it's intended to work. The function is scanning forwards from the current
> >>>>>> pte until it finds the first pte that does not fit in the batch - either because
> >>>>>> it maps a PFN that is not contiguous, or because the permissions are different
> >>>>>> (although this is being relaxed a bit; see conversation with DavidH against this
> >>>>>> same patch).
> >>>>>>
> >>>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >>>>>> (page0~page4) then the next time through the loop we will go through the
> >>>>>> !present path and process the single swap marker. Then the 3rd time through the
> >>>>>> loop folio_nr_pages_cont_mapped() will return 10.
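
A minimal userspace sketch of the forward scan described above, only to make
the 5 / 1 / 10 split concrete. It is not the kernel implementation: batch_len(),
fill_example() and PFN_NONE are hypothetical names, and the real function also
compares permissions, which this model ignores.

```c
#define PFN_NONE (-1L)   /* assumed marker for a non-present entry */

/*
 * Hypothetical model of the forward scan: count how many entries, starting
 * at index 'start', are present and map consecutive PFNs.
 */
static int batch_len(const long *pfns, int nr_entries, int start)
{
	int n = 0;

	if (start >= nr_entries || pfns[start] == PFN_NONE)
		return 0;

	/* Stop at the first entry whose PFN does not continue the run. */
	while (start + n < nr_entries &&
	       pfns[start + n] == pfns[start] + n)
		n++;

	return n;
}

/* Build the example from the thread: 16 contiguous pages, entry 5 zapped. */
static void fill_example(long *pfns)
{
	for (int i = 0; i < 16; i++)
		pfns[i] = 1000 + i;
	pfns[5] = PFN_NONE;
}
```

With the example table, scanning from entry 0 gives a batch of 5, entry 5 gives
0 (it would take the !present path), and scanning from entry 6 gives 10.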
> >>>>>
> >>>>> one case we have met by running hundreds of real phones is as below,
> >>>>>
> >>>>>
> >>>>> static int
> >>>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> >>>>>                unsigned long end)
> >>>>> {
> >>>>>         ...
> >>>>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> >>>>>         if (!dst_pte) {
> >>>>>                 ret = -ENOMEM;
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> >>>>>         if (!src_pte) {
> >>>>>                 pte_unmap_unlock(dst_pte, dst_ptl);
> >>>>>                 /* ret == 0 */
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> >>>>>         orig_src_pte = src_pte;
> >>>>>         orig_dst_pte = dst_pte;
> >>>>>         arch_enter_lazy_mmu_mode();
> >>>>>
> >>>>>         do {
> >>>>>                 /*
> >>>>>                  * We are holding two locks at this point - either of them
> >>>>>                  * could generate latencies in another task on another CPU.
> >>>>>                  */
> >>>>>                 if (progress >= 32) {
> >>>>>                         progress = 0;
> >>>>>                         if (need_resched() ||
> >>>>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                                 break;
> >>>>>                 }
> >>>>>                 ptent = ptep_get(src_pte);
> >>>>>                 if (pte_none(ptent)) {
> >>>>>                         progress++;
> >>>>>                         continue;
> >>>>>                 }
> >>>>>
> >>>>> the above iteration can break when progress >= 32. for example, if all
> >>>>> PTEs are none at the beginning, we break when progress >= 32, and we might
> >>>>> break at the 8th pte of 16 PTEs which might become CONTPTE after we
> >>>>> release the PTL.
> >>>>>
> >>>>> since we are releasing the PTLs, next time we get the PTL, those pte_none()
> >>>>> entries might have become pte_cont(). then are you going to copy CONTPTE
> >>>>> from the 8th pte, and thus immediately break the hardware's rule that
> >>>>> CONTPTEs must be consistent?
> >>>>>
> >>>>> pte0 - pte_none
> >>>>> pte1 - pte_none
> >>>>> ...
> >>>>> pte7 - pte_none
> >>>>>
> >>>>> pte8 - pte_cont
> >>>>> ...
> >>>>> pte15 - pte_cont
> >>>>>
> >>>>> so we made a modification to avoid a break in the middle of PTEs which
> >>>>> can potentially become CONTPTE.
> >>>>> do {
> >>>>>         /*
> >>>>>          * We are holding two locks at this point - either of them
> >>>>>          * could generate latencies in another task on another CPU.
> >>>>>          */
> >>>>>         if (progress >= 32) {
> >>>>>                 progress = 0;
> >>>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>>>                 /*
> >>>>>                  * XXX: don't release ptl at an unligned address as cont_pte
> >>>>>                  * might form while ptl is released, this causes double-map
> >>>>>                  */
> >>>>>                 if (!vma_is_chp_anonymous(src_vma) ||
> >>>>>                     (vma_is_chp_anonymous(src_vma) &&
> >>>>>                      IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
> >>>>> #endif
> >>>>>                 if (need_resched() ||
> >>>>>                     spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                         break;
> >>>>>         }
> >>>>>
> >>>>> We could only reproduce the above issue by running thousands of phones.
> >>>>>
> >>>>> Does your code survive from this problem?
> >>>>
> >>>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> >>>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> >>>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> >>>> will see some pte_none() entries followed by some pte_present() entries. And
> >>>> when calling set_ptes() on the child, the arch code will evaluate the current
> >>>> state of the pgtable along with the new set_ptes() request and determine where
> >>>> it should insert the CONT_PTE bit.
> >>>
> >>> yep, i have read it very carefully and think your code is safe here. The
> >>> only problem is that the code can randomly unfold the parent process'
> >>> CONTPTE while setting wrprotect in the middle of a large folio, when it
> >>> should actually keep the CONT bit, as all PTEs would still be consistent
> >>> if we set wrprotect from the 1st PTE.
> >>>
> >>> while A forks B, progress >= 32 might interrupt in the middle of a new
> >>> CONTPTE folio which is forming. as we have to set wrprotect on parent A,
> >>> this parent immediately loses the CONT bit. this is sad, but i can't find
> >>> a good way to resolve it unless CONT is exposed to mm-core. any idea on
> >>> this?
> >>
> >> No this is not the case; copy_present_ptes() will copy as many ptes as are
> >> physically contiguous and belong to the same folio (which usually means "the
> >> whole folio" - the only time it doesn't is when we hit the end of the vma). We
> >> will then return to the main loop and move forwards by the number of ptes that
> >> were serviced, including:
> >
> > I have probably failed to describe my question. i'd like to give a
> > concrete example:
> >
> > 1. process A forks B
> > 2. At the beginning, address~address+64KB has pte_none PTEs
> > 3. we scan to the 5th pte at address + 5 * 4KB, progress becomes 32, we
> >    break and release the PTLs
> > 4. another page fault in process A gets the PTL and sets
> >    address~address+64KB to pte_cont
> > 5. we get the PTL again and arrive at the 5th pte
> > 6. we set wrprotect on ptes 5,6,7...15; in this case, we have to unfold
> >    parent A, and child B also gets unfolded PTEs unless our loop can go
> >    back to the 0th pte.
> >
> > technically, A should be able to keep CONTPTE, but because of the implementation
> > of the code, it can't. That is the sadness. but it is obviously not your fault.
> >
> > no worries. This does not happen very often. but i just want to make a note
> > here, maybe someday we can get back to address it.
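
To make step 6 concrete, here is a small userspace model of why resuming at the
5th pte forces an unfold. It is only a sketch under a simplifying assumption:
a 16-entry block may keep the contiguous hint only while every entry has the
same permissions, reduced here to one write-protect flag per entry.
wrprotect_range(), block_is_uniform() and CONT_PTES are hypothetical names, not
kernel APIs.

```c
#include <stdbool.h>

#define CONT_PTES 16   /* assumed: 16 x 4KB entries per contiguous block */

/* Hypothetical model: write-protect entries [from, to) of one block. */
static void wrprotect_range(bool *wrprotected, int from, int to)
{
	for (int i = from; i < to; i++)
		wrprotected[i] = true;
}

/*
 * A block can only keep the contiguous hint if every entry carries the same
 * permissions; here that is reduced to a single write-protect flag.
 */
static bool block_is_uniform(const bool *wrprotected)
{
	for (int i = 1; i < CONT_PTES; i++)
		if (wrprotected[i] != wrprotected[0])
			return false;
	return true;
}
```

Write-protecting entries 5..15 alone leaves the block non-uniform, so the hint
has to go; write-protecting 0..15 in one pass keeps it uniform.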
>
> Ahh, I understand the situation now, sorry for being slow!
>
> I expect this to be a very rare situation anyway since (4) suggests process A
> has another thread, and forking is not encouraged for multithreaded programs. In
> fact the fork man page says:
>
>   After a fork() in a multithreaded program, the child can safely call only
>   async-signal-safe functions (see signal-safety(7)) until such time as it calls
>   execve(2).
>
> So in this case, we are about to completely repaint the child's address space
> with execve() anyway.
>
> So it's just the racing parent that loses the CONT_PTE bit. I expect this to be
> extremely rare. I'm not sure there is much we can do to solve it though, because
> unlike with your solution, we have to cater for multiple sizes so there is no
> obvious border until we get to PMD size and I'm guessing that's going to be a
> problem for latency spikes.

right. i don't think this can be a big problem. the background is that we have
a way to constantly detect and report unexpected unfold events, so we run
hundreds of phones in the lab and collect data to find out whether we have any
potential problems. we record each unexpected unfold and its reason in a proc
file, and we monitor that data to look for potential bugs. this ">=32" break
and unfold was found this way, so we simply addressed it by disallowing the
break at an unaligned address.
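
As a rough userspace sketch of that rule (not our actual kernel code, which is
quoted further down): the helper name may_break_at() and the 64KB block size
are assumptions for illustration, and 'want_break' stands in for the
need_resched()/spin_needbreak() checks.

```c
#include <stdbool.h>

#define PAGE_SIZE      4096UL
#define CONT_PTE_SIZE  (16 * PAGE_SIZE)   /* assumed: 16 x 4KB = 64KB block */

/*
 * Hypothetical helper: may the copy loop drop the PTLs at 'addr'?
 * 'cont_vma' says whether the VMA can hold cont-pte folios, and
 * 'want_break' models a pending reschedule or lock contention.
 */
static bool may_break_at(unsigned long addr, bool cont_vma, bool want_break)
{
	if (!want_break)
		return false;

	/* For a cont-pte VMA, only yield on a block boundary. */
	if (cont_vma && (addr % CONT_PTE_SIZE) != 0)
		return false;

	return true;
}
```

so at the 8th pte of a block the loop keeps the locks, and it may only yield
once it reaches the next 64KB boundary. a VMA that cannot hold cont-pte folios
is unaffected.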

This is never an issue which can stop your patchset, but I was still glad to
share our observations with you and would like to hear if you have any ideas :-)

>
>
> >
> >>
> >> progress += 8 * ret;
> >>
> >> That might go above 32, so we will flash the lock. But we haven't done that in
> >> the middle of a large folio. So the contpte-ness should be preserved.
> >>
> >>>
> >>> Our code[1] resolves this by only breaking at the aligned address
> >>>
> >>> if (progress >= 32) {
> >>>         progress = 0;
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>         /*
> >>>          * XXX: don't release ptl at an unligned address as cont_pte
> >>>          * might form while ptl is released, this causes double-map
> >>>          */
> >>>         if (!vma_is_chp_anonymous(src_vma) ||
> >>>             (vma_is_chp_anonymous(src_vma) &&
> >>>              IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
> >>> #endif
> >>>         if (need_resched() ||
> >>>             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>                 break;
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180

Thanks
Barry
