MIME-Version: 1.0
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
 <62c44d12-933d-ee66-ef50-467cd8d30a58@redhat.com>
In-Reply-To: <62c44d12-933d-ee66-ef50-467cd8d30a58@redhat.com>
From: Pasha Tatashin
Date: Tue, 14 Feb 2023 08:07:54 -0500
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
To: David Hildenbrand
Cc: Chih-En Lin, Andrew Morton, Qi Zheng, "Matthew Wilcox (Oracle)",
 Christophe Leroy, John Hubbard, Nadav Amit, Barry Song, Steven Rostedt,
 Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Yang Shi,
 Peter Xu, Vlastimil Babka, "Zach O'Keefe", Yun Zhou, Hugh Dickins,
 Suren Baghdasaryan, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
 Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin, Gautam Menghani,
 Catalin Marinas, Mark Brown, Will Deacon, Vincenzo Frascino,
 Thomas Gleixner, "Eric W. Biederman", Andy Lutomirski,
 Sebastian Andrzej Siewior, "Liam R. Howlett", Fenghua Yu, Andrei Vagin,
 Barret Rhoden, Michal Hocko, "Jason A.
Donenfeld" , Alexey Gladkov , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Content-Type: text/plain; charset="UTF-8" X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: hyp8w5936baqhpzgy97tb8urfgff1a98 X-Rspamd-Queue-Id: 95B2F40020 X-HE-Tag: 1676380112-929086 X-HE-Meta: U2FsdGVkX19T1NXWclmEnHXj4oR5WmF+T0V4UeBfax66MLKLgPHgJYw/8zrsS2Z/zMBj0//iNx4+k+WIazxTFPw8OSuQaBhPBb6Ui4yTRrROxWnQ5R7QEyEWpAWSQ3+vAN6HKlvuNLaz61fKFuB5Dk3dD/JTnLzG2PV3v0eDN5DEe2fnQ7AWZ/KBGgiiEp/naPo5KWOF4/yi5tYkHYYKoFNwWaJk11C26mDo0NWlc2c76CPNsEiogf26nhFABgdLpLotmVnu4GhJlyq6w9N8wrwKTm2xsoQ9L0rFS5J7IKRRLPMgshh5XS5zHVFUIPqpWCosm0n0nTzVdZpir2iMtVEPrGqot4qErGv7QH0w28smPwie1L23F3/M/0hN4daZlCrk/3COfGurBdKZvATKTq71l19a3t7lJ5lq3HN9W3l4Vn6GkAVqLJXOD6igTzDBSeuIyZYfsXCCW8kXQJqSdenF/hkQF9YNrtjwAKN+fi7VArMOamYZNfGi/p5EHZLCZCmXcyyoRQWvUku+ZGdJrWedfMoOmwXVYaO7Z8n4y1uUxH6n7mH0irdBZJbtkCXr22XWgya9+P0HPxSZchM63Fmws8h93Xhj6hdocEF/XQ60pXTlXtF7Kuq+m7bGNwydO3Ej6rZFJCYgUHSqWq3477PA9yZc4fQKtDy3QULTBcG0YRJBW7IvXbPSzQIuOOsGDU0eSI98unw1j/TiZAi65JVED99C7zffxbw+U72F9kT07qI08RxpCOsPnemfwCNsbd/90hvpCFXksCERPKTDqnI71I0eIa4pHjKkiMSp6THUQlBCUM1eF12oiiHvQMO7phgpfjnIc1J8QaJb6t3sAiCTeplHJ187Irk1Gf+kcyqghqj8E1m8ThCyvn4yKbe3EEcFHKNGuKqofXSBUevln2u4Lx2vtcpIA81Jb6F7+113y6ZONy4ZbVBGxaxLpndAn9WqtUEBDecYieZioSo zOsowTe/ Qlq1rL9WAdQxHWqEN6fapMU8Vyx0VXfX7xmyt9EvoEr9uRgMvIn9j7/18fEwCJvqRiZsjmLUXVhUBWSEt2B7FaHl0wgYfhp+DQRFX7mGU92I2gJlj5abvOpIUa6BuO/yxVOT0qeECSb3NB0fW0RSpsmV5SH0Qig/o4mMaCPlj1BPHbrMn61Hcft7i0Hn60wjcatW7DoqIIEq5l29Py5JGHtWqapectV7w6R9MhEJ/y6+FpqGbG92jjbvT4OkKnzjhbBSM6eDMFk9Aw4SFbWkV/V7xWvzL6KhLAPLcF6RFkFWvlin0my2MpyEnxQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Feb 14, 2023 at 4:58 AM David Hildenbrand wrote: > > On 10.02.23 18:20, Chih-En Lin wrote: > > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote: > >>>>> Currently, copy-on-write is only used for the mapped memory; the child > >>>>> process still needs to copy the entire page table from the parent > >>>>> process during forking. The parent process might take a lot of time and > >>>>> memory to copy the page table when the parent has a big page table > >>>>> allocated. For example, the memory usage of a process after forking with > >>>>> 1 GB mapped memory is as follows: > >>>> > >>>> For some reason, I was not able to reproduce performance improvements > >>>> with a simple fork() performance measurement program. 
> >>>> The results that I saw are the following:
> >>>>
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.004416 seconds
> >>>> Fork latency per gigabyte: 0.004382 seconds
> >>>> Fork latency per gigabyte: 0.004442 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004524 seconds
> >>>> Fork latency per gigabyte: 0.004764 seconds
> >>>> Fork latency per gigabyte: 0.004547 seconds
> >>>>
> >>>> AMD EPYC 7B12 64-Core Processor
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.003923 seconds
> >>>> Fork latency per gigabyte: 0.003909 seconds
> >>>> Fork latency per gigabyte: 0.003955 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004221 seconds
> >>>> Fork latency per gigabyte: 0.003882 seconds
> >>>> Fork latency per gigabyte: 0.003854 seconds
> >>>>
> >>>> Given that the page table for the child is not copied, I was expecting
> >>>> the performance to be better with the COW kernel, and also not to
> >>>> depend on the size of the parent.
> >>>
> >>> Yes, the child won't duplicate the page table, but fork will still
> >>> traverse all the page table entries to do the accounting.
> >>> And, since this patch extends COW to the PTE table level, it is no
> >>> longer grained at the mapped page (page table entry) level, so we have
> >>> to guarantee that all the mapped pages in such a page table are
> >>> available to do COW mapping.
> >>> This kind of checking also costs some time.
> >>> As a result, because of the accounting and the checking, the COW PTE
> >>> fork still depends on the size of the parent, so the improvement might
> >>> not be significant.
> >>
> >> The current version of the series does not provide any performance
> >> improvements for fork(). I would recommend removing claims from the
> >> cover letter about better fork() performance, as this may be
> >> misleading for those looking for a way to speed up forking. In my
> >
> > From v3 to v4, I changed the implementation of the COW fork() part to do
> > the accounting and checking. At that time, I also removed most of the
> > descriptions about the better fork() performance. Maybe it's not enough
> > and is still somewhat misleading. I will fix this in the next version.
> > Thanks.
> >
> >> case, I was looking to speed up Redis OSS, which relies on fork() to
> >> create consistent snapshots for driving replicas/backups. The O(N)
> >> per-page operation causes fork() to be slow, so I was hoping that this
> >> series, which does not duplicate the VA during fork(), would make the
> >> operation much quicker.
> >
> > Indeed, at first, I tried to avoid the O(N) per-page operation by
> > deferring the accounting and the swap stuff to the page fault. But,
> > as I mentioned, it's not suitable for the mainline.
> >
> > Honestly, for improving fork(), I have an idea to skip the per-page
> > operation without breaking the logic. However, this would introduce a
> > complicated mechanism and may add overhead to other features. It
> > might not be worth it. It's hard to strike a balance between an
> > over-complicated mechanism with (probably) better performance, and
> > data consistency with the page status. So, I would focus on the safe
> > and stable approach first.
>
> Yes, it is most probably possible, but complexity, robustness and
> maintainability have to be considered as well.
>
> Thanks for implementing this approach (only deduplication without other
> optimizations) and evaluating it accordingly.
> It's certainly "cleaner", such that we only have to mess with unsharing
> and not with other accounting/pinning/mapcount thingies. But it also
> highlights how intrusive even this basic deduplication approach already
> is -- and that most benefits of the original approach require even more
> complexity on top.
>
> I am not quite sure if the benefit is worth the price (I am not the one
> to decide, and I would like to hear other opinions).
>
> My quick thoughts after skimming over the core parts of this series:
>
> (1) Forgetting to break COW on a PTE in some pgtable walker feels quite
>     likely (meaning that it might be fairly error-prone), and forgetting
>     to break COW on a PTE table means accidentally modifying the shared
>     table.
> (2) break_cow_pte() can fail, which means that we can fail some
>     operations (possibly silently halfway through) now. For example,
>     looking at your change_pte_range() change, I suspect it's wrong.
> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>     double-checking: we temporarily clear the PMD, to reset it
>     afterwards. I am not sure if that is correct. For example, what
>     stops another page fault stumbling over that pmd_none() and
>     allocating an empty page table? Maybe there are some locking details
>     missing, or they are very subtle such that we had better document
>     them. I recall that THP played quite some tricks to make such cases
>     work ...
>
> >
> >>> Actually, at the RFC v1 and v2, we proposed the version of skipping
> >>> those works, and we got a significant improvement. You can see the
> >>> numbers in the RFC v2 cover letter [1]:
> >>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> >>> for normal fork"
> >>
> >> I suspect the 93% improvement (when the mapcount was not updated) was
> >> only for VAs with 4K pages. With 2M mappings this series did not
> >> provide any benefit, is this correct?
> >
> > Yes. In this case, the COW PTE performance is similar to the normal
> > fork().
> >
> The thing with THP is that, during fork(), we always allocate a backup
> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
> we'd have to eventually fail some operations we don't want to fail --
> similar to the case where break_cow_pte() could fail now due to -ENOMEM
> although we really don't want to fail (e.g., change_pte_range()).
>
> I always considered that wasteful, because in many scenarios, we'll
> never ever split a THP and possibly waste memory.

Yes, it does sound wasteful for a pretty rare corner case that combines
splitting a THP in a process and not having enough memory to allocate
PTE page tables.

> Optimizing that for THP (e.g., don't always allocate the backup PTE
> table, have some global allocation backup pool for splits + refill when
> close-to-empty) might provide similar fork() improvements, both in speed
> and memory consumption when it comes to anonymous memory.

This sounds like a reasonable way to optimize the fork performance for
processes with large RSS, which in most cases would have 2M THP mappings.

When you say global pool, do you mean per machine, per cgroup, or per
process?

Pasha

>
> --
> Thanks,
>
> David / dhildenb
>
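
P.S. The fork() measurement program behind the "Fork latency per gigabyte"
numbers quoted above is not included in the thread. Purely as an
illustration of the kind of measurement being discussed, a minimal
user-space sketch could look like the following; it assumes an anonymous
mapping that is touched before forking (so the parent's page tables are
fully populated) and a child that exits immediately, and it is not the
actual benchmark used for the numbers above:

	/*
	 * Sketch: measure fork() latency per gigabyte of pre-touched
	 * anonymous memory. Build with: gcc -O2 fork_lat.c -o fork_lat
	 * Run a few times (optionally passing a size in GiB) to get
	 * several samples.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	#define GIGABYTE (1UL << 30)

	static double now_sec(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(int argc, char **argv)
	{
		size_t gb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
		size_t len = gb * GIGABYTE;
		double start, end;
		char *mem;
		pid_t pid;

		mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mem == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Touch every page so the parent's page tables are populated. */
		memset(mem, 1, len);

		start = now_sec();
		pid = fork();
		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0) {
			/* Child: exit right away; only fork() itself is timed. */
			_exit(0);
		}
		end = now_sec();
		waitpid(pid, NULL, 0);

		printf("Fork latency per gigabyte: %f seconds\n",
		       (end - start) / (double)gb);
		return 0;
	}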