Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com,
	anshuman.khandual@arm.com,  ardb@kernel.org,
	catalin.marinas@arm.com, david@redhat.com,  dvyukov@google.com,
	glider@google.com, james.morse@arm.com,  jhubbard@nvidia.com,
	linux-arm-kernel@lists.infradead.org,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mark.rutland@arm.com,  maz@kernel.org, oliver.upton@linux.dev,
	ryabinin.a.a@gmail.com,  suzuki.poulose@arm.com,
	vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com,
	 will@kernel.org, willy@infradead.org, yuzenghui@huawei.com,
	yuzhao@google.com,  ziy@nvidia.com
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
Date: Tue, 28 Nov 2023 13:11:19 +1300
Message-ID: <CAGsJ_4z_ftxvG-EcTe=X+Te8fNSShhVHHPvbEgAa1rQXgO5XCA@mail.gmail.com>
In-Reply-To: <755343a1-ce94-4d38-8317-0925e2dae3bc@arm.com>

On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> +              pte_t *dst_pte, pte_t *src_pte,
> >> +              unsigned long addr, unsigned long end,
> >> +              int *rss, struct folio **prealloc)
> >>  {
> >>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>      unsigned long vm_flags = src_vma->vm_flags;
> >>      pte_t pte = ptep_get(src_pte);
> >>      struct page *page;
> >>      struct folio *folio;
> >> +    int nr = 1;
> >> +    bool anon;
> >> +    bool any_dirty = pte_dirty(pte);
> >> +    int i;
> >>
> >>      page = vm_normal_page(src_vma, addr, pte);
> >> -    if (page)
> >> +    if (page) {
> >>              folio = page_folio(page);
> >> -    if (page && folio_test_anon(folio)) {
> >> -            /*
> >> -             * If this page may have been pinned by the parent process,
> >> -             * copy the page immediately for the child so that we'll always
> >> -             * guarantee the pinned page won't be randomly replaced in the
> >> -             * future.
> >> -             */
> >> -            folio_get(folio);
> >> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> -                    /* Page may be pinned, we have to copy. */
> >> -                    folio_put(folio);
> >> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> -                                             addr, rss, prealloc, page);
> >> +            anon = folio_test_anon(folio);
> >> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> +                                            end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > do madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> >
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how it's intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see conversation with DavidH against this
> same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> (page0~page4) then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.
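
To make the scan described above concrete, here is a minimal userspace sketch.
It is not the kernel function: struct model_pte, model_batch_len() and
model_fill() are hypothetical names, a PTE is reduced to a present bit plus a
PFN, and permission checks are left out. It assumes a 16-page folio whose
entry at index 5 has been cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PTE: just a present bit and the PFN it maps. */
struct model_pte {
	bool present;
	unsigned long pfn;
};

/*
 * Count how many entries starting at ptes[start] form one batch: every
 * entry present, each mapping the PFN right after the previous one.
 * The scan stops at the first entry that does not fit, or at end.
 * Returns 0 if ptes[start] itself is not present.
 */
static size_t model_batch_len(const struct model_pte *ptes, size_t start,
			      size_t end)
{
	size_t i;

	if (start >= end || !ptes[start].present)
		return 0;

	for (i = start + 1; i < end; i++) {
		if (!ptes[i].present || ptes[i].pfn != ptes[i - 1].pfn + 1)
			break;
	}
	return i - start;
}

/* Fill 16 entries mapping PFNs base..base+15, leaving index hole cleared. */
static void model_fill(struct model_pte *ptes, unsigned long base, size_t hole)
{
	assert(hole < 16);
	for (size_t i = 0; i < 16; i++) {
		ptes[i].present = (i != hole);
		ptes[i].pfn = base + i;
	}
}
```

In this model, after model_fill(ptes, 100, 5), scanning from index 0 gives a
batch of 5, index 5 is not present so it is handled on its own, and scanning
from index 6 gives a batch of 10.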

One case we have hit while running hundreds of real phones is as below:


static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
               pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
               unsigned long end)
{
        ...
        dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
        if (!dst_pte) {
                ret = -ENOMEM;
                goto out;
        }
        src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
        if (!src_pte) {
                pte_unmap_unlock(dst_pte, dst_ptl);
                /* ret == 0 */
                goto out;
        }
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
        orig_src_pte = src_pte;
        orig_dst_pte = dst_pte;
        arch_enter_lazy_mmu_mode();

        do {
                /*
                 * We are holding two locks at this point - either of them
                 * could generate latencies in another task on another CPU.
                 */
                if (progress >= 32) {
                        progress = 0;
                        if (need_resched() ||
                            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                                break;
                }
                ptent = ptep_get(src_pte);
                if (pte_none(ptent)) {
                        progress++;
                        continue;
                }

The loop above can break once progress >= 32. For example, if all the PTEs at
the start of the range are none, we can break at the 8th PTE of a 16-PTE block,
and that block might become CONTPTE after we release the PTL.

Since we are releasing the PTLs, by the time we take them again those
pte_none() entries might have become pte_cont(). Are you then going to copy the
CONTPTE block starting from the 8th PTE, and so immediately break the hardware
rule that CONTPTEs must be consistent?

pte0 - pte_none
pte1 - pte_none
...
pte7 - pte_none

pte8 - pte_cont
...
pte15 - pte_cont
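
A minimal userspace sketch of how the loop can stop in the middle of a 16-PTE
block. This only mirrors the progress accounting for pte_none() entries; the
names model_break_index() and model_mid_block() are hypothetical, the
resched_pending flag stands in for need_resched() || spin_needbreak(...), and
it assumes the walk starts 8 PTEs into a block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CONT_PTES 16	/* PTEs per contiguous block (assumed) */
#define MODEL_BREAK_AT  32	/* progress threshold used by the loop */

/*
 * Walk all-none PTEs from index start up to end, counting progress the
 * way the loop does for pte_none() entries. Returns the index at which
 * the walk stops and the locks would be dropped, or end if it never
 * breaks.
 */
static size_t model_break_index(size_t start, size_t end, bool resched_pending)
{
	int progress = 0;
	size_t idx;

	assert(start <= end);
	for (idx = start; idx < end; idx++) {
		if (progress >= MODEL_BREAK_AT) {
			progress = 0;
			if (resched_pending)
				break;
		}
		progress++;	/* pte_none(): count it and move on */
	}
	return idx;
}

/* True if idx is inside a 16-PTE block rather than at its first entry. */
static bool model_mid_block(size_t idx)
{
	return idx % MODEL_CONT_PTES != 0;
}
```

With start = 8 and a reschedule pending, the walk stops at index 40, which is
again the 8th entry of a 16-PTE block, so the locks are dropped mid-block.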

So we made a modification to avoid breaking in the middle of a run of PTEs
that can potentially become CONTPTE:
        do {
                /*
                 * We are holding two locks at this point - either of them
                 * could generate latencies in another task on another CPU.
                 */
                if (progress >= 32) {
                        progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
                        /*
                         * XXX: don't release ptl at an unaligned address as
                         * cont_pte might form while ptl is released, this
                         * causes double-map
                         */
                        if (!vma_is_chp_anonymous(src_vma) ||
                            (vma_is_chp_anonymous(src_vma) &&
                             IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
                        if (need_resched() ||
                            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                                break;
                }
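
The guard above can be read as a single predicate. Below is a minimal
userspace sketch of it; model_may_drop_ptl() and the MODEL_* macros are
hypothetical stand-ins for vma_is_chp_anonymous(), IS_ALIGNED() and
HPAGE_CONT_PTE_SIZE, and the 64KB block size is an assumption (16 x 4KB
pages).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CONT_PTE_SIZE (16UL * 4096UL)	/* assumed: 16 x 4KB = 64KB */
#define MODEL_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* The mask-based alignment test only works for power-of-two sizes. */
static_assert((MODEL_CONT_PTE_SIZE & (MODEL_CONT_PTE_SIZE - 1)) == 0,
	      "block size must be a power of two");

/*
 * May the loop drop the page table locks at addr?
 * Non-CHP VMAs: always. CHP anonymous VMAs: only on a block boundary,
 * so the walk never resumes in the middle of a potential CONTPTE block.
 * This is "!A || (A && B)" from the guard, which reduces to "!A || B".
 */
static bool model_may_drop_ptl(bool chp_anon_vma, unsigned long addr)
{
	return !chp_anon_vma || MODEL_IS_ALIGNED(addr, MODEL_CONT_PTE_SIZE);
}
```

For example, a CHP anonymous VMA at addr = 0x8000 (32KB into a block) may not
break there, while addr = 0x10000 is on a 64KB boundary and may.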

We could only reproduce the above issue by running thousands of phones.

Does your code survive this problem?

>
> Thanks,
> Ryan
>
> >
> > but the current code is copying page0~page14, right? unless we immediately
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> +            for (i = 0; i < nr; i++, page++) {
> >> +                    if (anon) {
> >> +                            /*
> >> +                             * If this page may have been pinned by the
> >> +                             * parent process, copy the page immediately for
> >> +                             * the child so that we'll always guarantee the
> >> +                             * pinned page won't be randomly replaced in the
> >> +                             * future.
> >> +                             */
> >> +                            if (unlikely(page_try_dup_anon_rmap(
> >> +                                            page, false, src_vma))) {
> >> +                                    if (i != 0)
> >> +                                            break;
> >> +                                    /* Page may be pinned, we have to copy. */
> >> +                                    return copy_present_page(
> >> +                                            dst_vma, src_vma, dst_pte,
> >> +                                            src_pte, addr, rss, prealloc,
> >> +                                            page);
> >> +                            }
> >> +                            rss[MM_ANONPAGES]++;
> >> +                            VM_BUG_ON(PageAnonExclusive(page));
> >> +                    } else {
> >> +                            page_dup_file_rmap(page, false);
> >> +                            rss[mm_counter_file(page)]++;
> >> +                    }
> >

Thanks
Barry
