From: Chih-En Lin <shiyn.lin@gmail.com>
To: Nadav Amit <namit@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Christophe Leroy <christophe.leroy@csgroup.eu>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Luis Chamberlain <mcgrof@kernel.org>,
Kees Cook <keescook@chromium.org>,
Iurii Zaikin <yzaikin@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
William Kucharski <william.kucharski@oracle.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Peter Xu <peterx@redhat.com>,
Suren Baghdasaryan <surenb@google.com>,
Arnd Bergmann <arnd@arndb.de>,
Tong Tiangen <tongtiangen@huawei.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Li kunyu <kunyu@nfschina.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Minchan Kim <minchan@kernel.org>, Yang Shi <shy828301@gmail.com>,
Song Liu <song@kernel.org>, Miaohe Lin <linmiaohe@huawei.com>,
Thomas Gleixner <tglx@linutronix.de>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Andy Lutomirski <luto@kernel.org>,
Fenghua Yu <fenghua.yu@intel.com>,
Dinglan Peng <peng301@purdue.edu>,
Pedro Fonseca <pfonseca@purdue.edu>,
Jim Huang <jserv@ccns.ncku.edu.tw>,
Huichun Feng <foxhoundsk.tw@gmail.com>
Subject: Re: [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions
Date: Wed, 28 Sep 2022 03:00:12 +0800 [thread overview]
Message-ID: <YzNIPGu+bGEWZXlM@strix-laptop> (raw)
In-Reply-To: <8041503C-D087-434D-A026-90BFD0CAF209@vmware.com>
On Tue, Sep 27, 2022 at 05:51:19PM +0000, Nadav Amit wrote:
> > +static void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> > + pmd_t *pmdp, unsigned long addr,
> > + unsigned long end, bool inc_dec)
> > +{
> > + int rss[NR_MM_COUNTERS];
> > + spinlock_t *ptl;
> > + pte_t *orig_ptep, *ptep;
> > + struct page *page;
> > +
> > + init_rss_vec(rss);
> > +
> > + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > + orig_ptep = ptep;
> > + arch_enter_lazy_mmu_mode();
> > + do {
> > + if (pte_none(*ptep))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, *ptep);
> > + if (page) {
> > + if (inc_dec)
> > + rss[mm_counter(page)]++;
> > + else
> > + rss[mm_counter(page)]--;
> > + }
> > + } while (ptep++, addr += PAGE_SIZE, addr != end);
> > + arch_leave_lazy_mmu_mode();
> > + pte_unmap_unlock(orig_ptep, ptl);
>
> It seems to me very fragile to separate the accounting from the actual
> operation. I do not see copying of the pages here, so why is the RSS
> updated?
Since it may have a situation like one process that doesn't do the
accounting during COW, and it would want to reuse the table. So, we
need to restore the states.
On the other hand, it will have a situation like unmap the COWed table
and wanting to remove the states.
>
> > + add_mm_rss_vec(mm, rss);
> > +}
> > +
> > /*
> > * This function is called to print an error when a bad pte
> > * is found. For example, we might have a PFN-mapped pte in
> > @@ -2817,6 +2848,68 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
> > }
> > EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
> >
> > +/**
> > + * cow_pte_fallback - reuse the shared PTE table
> > + * @vma: vma that coever the shared PTE table
> > + * @pmd: pmd index maps to the shared PTE table
> > + * @addr: the address trigger the break COW,
> > + *
> > + * Reuse the shared (COW) PTE table when the refcount is equal to one.
> > + * @addr needs to be in the range of the shared PTE table that @vma and
> > + * @pmd mapped to it.
> > + *
> > + * COW PTE fallback to normal PTE:
> > + * - two state here
> > + * - After break child : [parent, rss=1, ref=1, write=NO , owner=parent]
> > + * to [parent, rss=1, ref=1, write=YES, owner=NULL ]
> > + * - After break parent: [child , rss=0, ref=1, write=NO , owner=NULL ]
> > + * to [child , rss=1, ref=1, write=YES, owner=NULL ]
> > + */
> > +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> > + unsigned long addr)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct vm_area_struct *prev = vma->vm_prev;
> > + struct vm_area_struct *next = vma->vm_next;
> > + unsigned long start, end;
> > + pmd_t new;
> > +
> > + VM_WARN_ON(pmd_write(*pmd));
> > +
> > + start = addr & PMD_MASK;
> > + end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > + /*
> > + * If pmd is not owner, it needs to increase the rss.
> > + * Since only the owner has the RSS state for the COW PTE.
> > + */
> > + if (!cow_pte_owner_is_same(pmd, pmd)) {
> > + /* The part of address range is covered by previous. */
> > + if (start < vma->vm_start && prev && start < prev->vm_end) {
> > + cow_pte_rss(mm, prev, pmd,
> > + start, prev->vm_end, true /* inc */);
> > + start = vma->vm_start;
> > + }
> > + /* The part of address range is covered by next. */
> > + if (end > vma->vm_end && next && end > next->vm_start) {
> > + cow_pte_rss(mm, next, pmd,
> > + next->vm_start, end, true /* inc */);
> > + end = vma->vm_end;
> > + }
> > + cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
> > +
> > + mm_inc_nr_ptes(mm);
> > + /* Memory barrier here is the same as pmd_install(). */
> > + smp_wmb();
> > + pmd_populate(mm, pmd, pmd_page(*pmd));
> > + }
> > +
> > + /* Reuse the pte page */
> > + set_cow_pte_owner(pmd, NULL);
> > + new = pmd_mkwrite(*pmd);
> > + set_pmd_at(mm, addr, pmd, new);
> > +}
>
> Again, kind of hard to understand such a function without a context
> (caller). For instance, is there any lock that prevents
> cow_pte_owner_is_same() from racing with change of the owner?
>
It is called by the refcount operation and the break COW handler
when the refcount is 1.
Also, It uses synchronization primitives (in set_cow_pte_owner() and
cow_pte_owner_is_same()) to prevent the race.
> I would expect to see first patches that always copy the PTEs without
> reusing the PTEs and only then a PTE reuse logic as an optimization.
>
I will restructure all the commits to make the logic clear.
Thanks,
Chih-En Lin
next prev parent reply other threads:[~2022-09-27 19:00 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
2022-09-27 17:23 ` Nadav Amit
2022-09-27 17:36 ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
2022-09-27 17:27 ` Nadav Amit
2022-09-27 18:05 ` Chih-En Lin
2022-09-27 21:22 ` John Hubbard
2022-09-28 8:36 ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table Chih-En Lin
2022-09-27 17:30 ` Nadav Amit
2022-09-27 18:23 ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions Chih-En Lin
2022-09-27 17:51 ` Nadav Amit
2022-09-27 19:00 ` Chih-En Lin [this message]
2022-09-27 16:29 ` [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table Chih-En Lin
2022-09-27 17:59 ` Nadav Amit
2022-09-27 19:07 ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Chih-En Lin
2022-09-27 18:15 ` Nadav Amit
2022-09-27 19:23 ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Chih-En Lin
2022-09-27 18:38 ` Nadav Amit
2022-09-27 19:53 ` Chih-En Lin
2022-09-27 21:26 ` John Hubbard
2022-09-28 8:52 ` Chih-En Lin
2022-09-28 14:03 ` David Hildenbrand
2022-09-29 13:38 ` Chih-En Lin
2022-09-29 13:49 ` Chih-En Lin
2022-09-29 17:24 ` David Hildenbrand
2022-09-29 18:29 ` Chih-En Lin
2022-09-29 18:38 ` David Hildenbrand
2022-09-29 18:57 ` Chih-En Lin
2022-09-29 19:00 ` David Hildenbrand
2022-09-29 18:40 ` Nadav Amit
2022-09-29 19:02 ` Chih-En Lin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YzNIPGu+bGEWZXlM@strix-laptop \
--to=shiyn.lin@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=anshuman.khandual@arm.com \
--cc=arnd@arndb.de \
--cc=bigeasy@linutronix.de \
--cc=christophe.leroy@csgroup.eu \
--cc=david@redhat.com \
--cc=fenghua.yu@intel.com \
--cc=foxhoundsk.tw@gmail.com \
--cc=jserv@ccns.ncku.edu.tw \
--cc=keescook@chromium.org \
--cc=kirill.shutemov@linux.intel.com \
--cc=kunyu@nfschina.com \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mcgrof@kernel.org \
--cc=minchan@kernel.org \
--cc=namit@vmware.com \
--cc=pasha.tatashin@soleen.com \
--cc=peng301@purdue.edu \
--cc=peterx@redhat.com \
--cc=pfonseca@purdue.edu \
--cc=shy828301@gmail.com \
--cc=song@kernel.org \
--cc=surenb@google.com \
--cc=tglx@linutronix.de \
--cc=tongtiangen@huawei.com \
--cc=vbabka@suse.cz \
--cc=william.kucharski@oracle.com \
--cc=willy@infradead.org \
--cc=yzaikin@google.com \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox