From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
From: Pasha Tatashin
Date: Thu, 9 Feb 2023 13:15:56 -0500
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
To: Chih-En Lin
Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	"Matthew Wilcox (Oracle)", Christophe Leroy, John Hubbard,
	Nadav Amit, Barry Song, Steven Rostedt, Masami Hiramatsu,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Yang Shi, Peter Xu, Vlastimil Babka, "Zach O'Keefe", Yun Zhou,
	Hugh Dickins, Suren Baghdasaryan, Yu Zhao, Juergen Gross,
	Tong Tiangen, Liu Shixin, Anshuman Khandual, Li kunyu,
	Minchan Kim, Miaohe Lin, Gautam Menghani, Catalin Marinas,
	Mark Brown, Will Deacon, Vincenzo Frascino, Thomas Gleixner,
	"Eric W. Biederman", Andy Lutomirski, Sebastian Andrzej Siewior,
	"Liam R. Howlett", Fenghua Yu, Andrei Vagin, Barret Rhoden,
	Michal Hocko, "Jason A. Donenfeld", Alexey Gladkov,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, Dinglan Peng, Pedro Fonseca,
	Jim Huang, Huichun Feng

On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
>   s390 and powerpc32, don't support the PMD entry and PTE table
>   operations.
> - Fix unmatch type of break_cow_pte_range() in
>   migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
>   clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
>   tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
>   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
>   handler, we return -EMLINK to let the GUP handles the page fault
>   (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
>   page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
>   probably passed the COW selftest.
>
>     # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>     not ok 33 No leak from parent into child
>     # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
>     not ok 44 No leak from parent into child
>     # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
>     not ok 55 No leak from child into parent
>     # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>     not ok 66 No leak from child into parent
>
>     Bail out! 4 out of 147 tests failed
>     # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
>   See the more information about anon cow hugetlb tests:
>     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
>   page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
>   device), recover all the entries that are already handled by COW PTE
>   fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
>   follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
>   The benchmarks of v3 is worse than RFC v2, most of the cases are
>   similar to the normal fork, but there still have an use case
>   (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
>   MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds

Given that the page table for the child is not copied, I was expecting
the performance to be better with the COW kernel, and also not to
depend on the size of the parent.
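
For a sense of how much last-level page table a regular fork() has to
copy, here is a back-of-the-envelope sketch. It assumes x86-64 with
4 KiB base pages and 8-byte PTEs (so 512 PTEs per table), a fully
populated and table-aligned mapping, and no huge pages. The helper names
are made up for illustration and do not come from the patch series.

```c
#include <assert.h>
#include <limits.h>

/* Assumptions, not taken from the series: 4 KiB pages, 8-byte PTEs. */
#define PAGE_SIZE_BYTES	4096ull
#define PTE_SIZE_BYTES	8ull
#define PTES_PER_TABLE	(PAGE_SIZE_BYTES / PTE_SIZE_BYTES)	/* 512 */

/*
 * Number of last-level PTE tables needed to map `bytes` of fully
 * populated, table-aligned memory. Upper levels are ignored.
 */
unsigned long long pte_tables_for(unsigned long long bytes)
{
	unsigned long long pages;

	/* The round-up below must not overflow. */
	assert(bytes <= ULLONG_MAX - (PAGE_SIZE_BYTES - 1));

	pages = (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	return (pages + PTES_PER_TABLE - 1) / PTES_PER_TABLE;
}

/* Memory occupied by those tables; each table is one page. */
unsigned long long pte_table_bytes_for(unsigned long long bytes)
{
	return pte_tables_for(bytes) * PAGE_SIZE_BYTES;
}
```

Under these assumptions, pte_table_bytes_for(1ull << 30) gives 2 MiB
(512 tables) for 1 GiB of mapped memory, and 64 MiB for a 32 GiB
mapping.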
Test program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define USEC	1000000
#define GIG	(1ul << 30)
#define NGIG	32
#define SIZE	(NGIG * GIG)
#define NPROC	16

int main(void)
{
	int page_size = getpagesize();
	struct timeval start, end;
	long duration, i;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	madvise(p, SIZE, MADV_NOHUGEPAGE);

	/* Touch every page */
	for (i = 0; i < SIZE; i += page_size)
		p[i] = 0;

	gettimeofday(&start, NULL);
	for (i = 0; i < NPROC; i++) {
		int pid = fork();

		if (pid == 0) {
			sleep(30);
			exit(0);
		}
	}
	gettimeofday(&end, NULL);
	/* Normalize per proc and per gig */
	duration = ((end.tv_sec - start.tv_sec) * USEC
		+ (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
	printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
		duration / USEC, duration % USEC);

	return 0;
}
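
To look at the second expectation on its own (latency not depending on
the size of the parent), one option is to time a single fork() at
several mapping sizes and compare the raw numbers instead of the
per-gigabyte average. The following is only a sketch of that idea:
fork_latency_ns() is a hypothetical helper, not part of the program
above, and unlike that program it lets the child exit immediately and
reaps it, so forks do not accumulate. It uses clock_gettime() with
CLOCK_MONOTONIC rather than gettimeofday().

```c
#define _GNU_SOURCE	/* MAP_ANONYMOUS, MADV_NOHUGEPAGE */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Time one fork() in a process that has `bytes` of touched anonymous
 * memory mapped. Returns the parent-side latency in nanoseconds, or -1
 * if mmap() or fork() fails.
 */
long long fork_latency_ns(size_t bytes)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct timespec start, end;
	pid_t pid;
	char *p;
	size_t i;

	assert(bytes > 0 && page_size > 0);

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Advisory only: ask for base pages, ignore the result. */
	madvise(p, bytes, MADV_NOHUGEPAGE);

	/* Touch every page so the page tables are populated. */
	for (i = 0; i < bytes; i += (size_t)page_size)
		p[i] = 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid = fork();
	if (pid == 0)
		_exit(0);	/* child: leave without touching memory */
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (pid < 0) {
		munmap(p, bytes);
		return -1;
	}

	waitpid(pid, NULL, 0);
	munmap(p, bytes);

	return (end.tv_sec - start.tv_sec) * 1000000000LL
		+ (end.tv_nsec - start.tv_nsec);
}
```

Calling it with, say, 1 GiB, 2 GiB, 4 GiB and so on and printing the
results side by side would show whether the latency grows with the
mapping size on each kernel. It only measures the parent side of
fork(); any cost deferred to the first write fault is not captured.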