linux-mm.kvack.org archive mirror
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Chih-En Lin <shiyn.lin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	 David Hildenbrand <david@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	 Christophe Leroy <christophe.leroy@csgroup.eu>,
	John Hubbard <jhubbard@nvidia.com>,
	 Nadav Amit <namit@vmware.com>, Barry Song <baohua@kernel.org>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	 Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	 Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>,  Namhyung Kim <namhyung@kernel.org>,
	Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
	 Vlastimil Babka <vbabka@suse.cz>,
	"Zach O'Keefe" <zokeefe@google.com>,
	Yun Zhou <yun.zhou@windriver.com>,
	 Hugh Dickins <hughd@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Yu Zhao <yuzhao@google.com>,  Juergen Gross <jgross@suse.com>,
	Tong Tiangen <tongtiangen@huawei.com>,
	 Liu Shixin <liushixin2@huawei.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	 Li kunyu <kunyu@nfschina.com>, Minchan Kim <minchan@kernel.org>,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Gautam Menghani <gautammenghani201@gmail.com>,
	 Catalin Marinas <catalin.marinas@arm.com>,
	Mark Brown <broonie@kernel.org>,  Will Deacon <will@kernel.org>,
	Vincenzo Frascino <Vincenzo.Frascino@arm.com>,
	 Thomas Gleixner <tglx@linutronix.de>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	 Andy Lutomirski <luto@kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Fenghua Yu <fenghua.yu@intel.com>,
	 Andrei Vagin <avagin@gmail.com>, Barret Rhoden <brho@google.com>,
	Michal Hocko <mhocko@suse.com>,
	 "Jason A. Donenfeld" <Jason@zx2c4.com>,
	Alexey Gladkov <legion@kernel.org>,
	linux-kernel@vger.kernel.org,  linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org,  linux-trace-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org,
	 Dinglan Peng <peng301@purdue.edu>,
	Pedro Fonseca <pfonseca@purdue.edu>,
	 Jim Huang <jserv@ccns.ncku.edu.tw>,
	Huichun Feng <foxhoundsk.tw@gmail.com>
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
Date: Thu, 9 Feb 2023 13:15:56 -0500
Message-ID: <CA+CK2bBt0Gujv9BdhghVkbFRirAxCYXbpH-nquccPsKGnGwOBQ@mail.gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>

On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@gmail.com> wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
>   s390 and powerpc32, don't support the PMD entry and PTE table
>   operations.
> - Fix unmatch type of break_cow_pte_range() in
>   migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
>   clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
>   tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
>   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
>   handler, we return -EMLINK to let the GUP handles the page fault
>   (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
>   page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
>   probably passed the COW selftest.
>
>         # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>         not ok 33 No leak from parent into child
>         # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
>         not ok 44 No leak from parent into child
>         # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
>         not ok 55 No leak from child into parent
>         # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>         not ok 66 No leak from child into parent
>
>         Bail out! 4 out of 147 tests failed
>         # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
>   See the more information about anon cow hugetlb tests:
>     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
>   page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
>   device), recover all the entries that are already handled by COW PTE
>   fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
>   follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
>   The benchmarks of v3 is worse than RFC v2, most of the cases are
>   similar to the normal fork, but there still have an use case
>   (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
>   MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds
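
To make the comparison between the runs above easier to read, here is a small sketch that averages each set of samples and reports the relative change. The helper names mean() and pct_change() are made up for this illustration; the inputs are the latencies listed above, converted to microseconds.

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
double mean(const double *v, size_t n)
{
        double sum = 0.0;

        for (size_t i = 0; i < n; i++)
                sum += v[i];
        return n ? sum / (double)n : 0.0;
}

/* Relative change of `other` versus `base`, in percent. */
double pct_change(double base, double other)
{
        return (other - base) / base * 100.0;
}
```

Fed with the first set of numbers (4416, 4382, 4442 vs. 4524, 4764, 4547 microseconds), this gives means of roughly 4413 and 4612 microseconds, i.e. the COW kernel is about 4.5% slower. For the AMD EPYC runs the means are roughly 3929 and 3986 microseconds, about 1.4% slower. With only three samples per configuration these differences may well be within run-to-run noise.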

Given that the page table for the child is not copied, I was expecting
the performance to be better with the COW kernel, and also not to
depend on the size of the parent.
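
For a sense of how much page table a normal fork has to copy here, the following sketch estimates the number of last-level PTE tables behind a mapping. It assumes x86-64 with 4 KiB base pages and 8-byte entries (512 entries per table); those constants and the helper names are assumptions for this illustration, not something stated in the thread.

```c
#include <stdint.h>

/*
 * Assumptions (not from the thread): x86-64, 4 KiB base pages,
 * 8-byte page table entries, so 512 entries per PTE table.
 */
#define PAGE_SZ         4096ULL
#define PTE_ENTRY_SZ    8ULL
#define PTES_PER_TABLE  (PAGE_SZ / PTE_ENTRY_SZ)

/* Number of last-level PTE tables needed to map `bytes` of memory. */
uint64_t pte_tables_for(uint64_t bytes)
{
        uint64_t pages = (bytes + PAGE_SZ - 1) / PAGE_SZ;

        return (pages + PTES_PER_TABLE - 1) / PTES_PER_TABLE;
}

/* Memory consumed by those PTE tables (one page per table). */
uint64_t pte_table_bytes_for(uint64_t bytes)
{
        return pte_tables_for(bytes) * PAGE_SZ;
}
```

Under these assumptions, 1 GiB of touched memory needs 512 PTE tables (2 MiB), and the 32 GiB mapping used by the test program needs 16384 PTE tables (64 MiB). Upper-level tables are ignored, since they are small by comparison.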

Test program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>

#define USEC    1000000
#define GIG     (1ul << 30)
#define NGIG    32
#define SIZE    (NGIG * GIG)
#define NPROC   16

int main(void)
{
        long page_size = getpagesize();
        struct timeval start, end;
        unsigned long off;
        long duration;
        char *p;
        int i;

        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        /* Disable THP so that each page is mapped by its own PTE */
        if (madvise(p, SIZE, MADV_NOHUGEPAGE))
                perror("madvise");

        /* Touch every page */
        for (off = 0; off < SIZE; off += page_size)
                p[off] = 0;

        gettimeofday(&start, NULL);
        for (i = 0; i < NPROC; i++) {
                pid_t pid = fork();

                if (pid < 0) {
                        perror("fork");
                        exit(1);
                }
                if (pid == 0) {
                        /* Child: stay alive while the parent keeps forking */
                        sleep(30);
                        exit(0);
                }
        }
        gettimeofday(&end, NULL);
        /* Normalize per process and per gigabyte (in microseconds) */
        duration = ((end.tv_sec - start.tv_sec) * USEC
                + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
        printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
                duration / USEC, duration % USEC);
        return 0;
}
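
As a possible variation on the program above, the sketch below times a single fork() with CLOCK_MONOTONIC, which is not affected by wall-clock adjustments the way gettimeofday() can be. The helper names elapsed_usec() and fork_latency_usec() are made up for this illustration. Unlike the test program, the child exits immediately and is reaped by the parent, so this is only a sketch of the timing part, not a replacement for the measurement in this mail.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Microseconds elapsed between two CLOCK_MONOTONIC samples. */
long elapsed_usec(struct timespec a, struct timespec b)
{
        return (b.tv_sec - a.tv_sec) * 1000000L +
               (b.tv_nsec - a.tv_nsec) / 1000L;
}

/*
 * Time one fork() as seen by the parent.
 * Returns the latency in microseconds, or -1 if fork() failed.
 */
long fork_latency_usec(void)
{
        struct timespec start, end;
        pid_t pid;

        clock_gettime(CLOCK_MONOTONIC, &start);
        pid = fork();
        if (pid == 0)
                _exit(0);       /* child: nothing to do */
        clock_gettime(CLOCK_MONOTONIC, &end);

        if (pid < 0)
                return -1;
        waitpid(pid, NULL, 0);  /* reap the child */
        return elapsed_usec(start, end);
}
```

Calling fork_latency_usec() after the "touch every page" loop, and dividing the result by the mapping size in gigabytes, would give a per-gigabyte figure comparable in spirit to the one printed above.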


Thread overview: 37+ messages
2023-02-07  3:51 Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 10/14] mm/userfaultfd: Support COW PTE Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 11/14] mm/migrate_device: " Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call Chih-En Lin
2023-02-09 18:15 ` Pasha Tatashin [this message]
2023-02-10  2:17   ` [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
2023-02-10 16:21     ` Pasha Tatashin
2023-02-10 17:20       ` Chih-En Lin
2023-02-10 19:02         ` Chih-En Lin
2023-02-14  9:58         ` David Hildenbrand
2023-02-14 13:07           ` Pasha Tatashin
2023-02-14 13:17             ` David Hildenbrand
2023-02-14 15:59           ` Chih-En Lin
2023-02-14 16:30             ` Pasha Tatashin
2023-02-14 18:41               ` Chih-En Lin
2023-02-14 18:52                 ` Pasha Tatashin
2023-02-14 19:17                   ` Chih-En Lin
2023-02-14 16:58             ` David Hildenbrand
2023-02-14 17:03               ` David Hildenbrand
2023-02-14 17:56                 ` Chih-En Lin
2023-02-14 17:54               ` Chih-En Lin
2023-02-14 17:59                 ` David Hildenbrand
2023-02-14 19:06                   ` Chih-En Lin
2023-02-14 17:23           ` Yang Shi
2023-02-14 17:39             ` David Hildenbrand
2023-02-14 18:25               ` Yang Shi
