Date: Sat, 11 Feb 2023 01:20:10 +0800
From: Chih-En Lin
To: Pasha Tatashin
Cc: Andrew Morton, Qi Zheng, David Hildenbrand, "Matthew Wilcox (Oracle)",
 Christophe Leroy, John Hubbard, Nadav Amit, Barry Song, Steven Rostedt,
 Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, Yang Shi,
 Peter Xu, Vlastimil Babka, Zach O'Keefe, Yun Zhou, Hugh Dickins,
 Suren Baghdasaryan, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
 Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin, Gautam Menghani,
 Catalin Marinas, Mark Brown, Will Deacon, Vincenzo Frascino,
 Thomas Gleixner, "Eric W. Biederman", Andy Lutomirski,
 Sebastian Andrzej Siewior, "Liam R. Howlett", Fenghua Yu, Andrei Vagin,
 Barret Rhoden, Michal Hocko, "Jason A. Donenfeld", Alexey Gladkov,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
 linux-perf-users@vger.kernel.org, Dinglan Peng, Pedro Fonseca, Jim Huang,
 Huichun Feng
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> > > > Currently, copy-on-write is only used for the mapped memory; the child
> > > > process still needs to copy the entire page table from the parent
> > > > process during forking. The parent process might take a lot of time and
> > > > memory to copy the page table when the parent has a big page table
> > > > allocated. For example, the memory usage of a process after forking with
> > > > 1 GB mapped memory is as follows:
> > >
> > > For some reason, I was not able to reproduce performance improvements
> > > with a simple fork() performance measurement program. The results that
> > > I saw are the following:
> > >
> > > Base:
> > > Fork latency per gigabyte: 0.004416 seconds
> > > Fork latency per gigabyte: 0.004382 seconds
> > > Fork latency per gigabyte: 0.004442 seconds
> > > COW kernel:
> > > Fork latency per gigabyte: 0.004524 seconds
> > > Fork latency per gigabyte: 0.004764 seconds
> > > Fork latency per gigabyte: 0.004547 seconds
> > >
> > > AMD EPYC 7B12 64-Core Processor
> > > Base:
> > > Fork latency per gigabyte: 0.003923 seconds
> > > Fork latency per gigabyte: 0.003909 seconds
> > > Fork latency per gigabyte: 0.003955 seconds
> > > COW kernel:
> > > Fork latency per gigabyte: 0.004221 seconds
> > > Fork latency per gigabyte: 0.003882 seconds
> > > Fork latency per gigabyte: 0.003854 seconds
> > >
> > > Given, that page table for child is not copied, I was expecting the
> > > performance to be better with COW kernel, and also not to depend on
> > > the size of the parent.
> >
> > Yes, the child won't duplicate the page table, but fork will still
> > traverse all the page table entries to do the accounting.
> > And, since this patch extends the COW to the PTE table level, it's not
> > the mapped page (page table entry) grained anymore, so we have to
> > guarantee that all the mapped pages are available to do COW mapping in
> > such a page table.
> > This kind of checking also costs some time.
> > As a result, because of the accounting and the checking, the COW PTE fork
> > still depends on the size of the parent so the improvement might not
> > be significant.
>
> The current version of the series does not provide any performance
> improvements for fork(). I would recommend removing claims from the
> cover letter about better fork() performance, as this may be
> misleading for those looking for a way to speed up forking. In my

From v3 to v4, I changed the implementation of the COW fork() part to do
the accounting and checking.
At the time, I also removed most of the descriptions of better fork()
performance. Apparently that was not enough, and some misleading claims
remain. I will fix this in the next version. Thanks.

> case, I was looking to speed up Redis OSS, which relies on fork() to
> create consistent snapshots for driving replicates/backups. The O(N)
> per-page operation causes fork() to be slow, so I was hoping that this
> series, which does not duplicate the VA during fork(), would make the
> operation much quicker.

Indeed, at first, I tried to avoid the O(N) per-page operation by
deferring the accounting and the swap handling to the page fault. But, as
I mentioned, that is not suitable for the mainline.

Honestly, for improving fork(), I have an idea for skipping the per-page
operation without breaking the logic. However, it would introduce a
complicated mechanism and may add overhead for other features. It
might not be worth it.

It's hard to strike a balance between an over-complicated mechanism with
(probably) better performance and keeping the data consistent with the
page status. So, I would focus on a safe and stable approach first.

> > Actually, at the RFC v1 and v2, we proposed the version of skipping
> > those works, and we got a significant improvement. You can see the
> > number from RFC v2 cover letter [1]:
> > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> > for normal fork"
>
> I suspect the 93% improvement (when the mapcount was not updated) was
> only for VAs with 4K pages. With 2M mappings this series did not
> provide any benefit is this correct?

Yes. In this case, the COW PTE performance is similar to the normal
fork().

> >
> > However, it might break the existing logic of the refcount/mapcount of
> > the page and destabilize the system.
>
> This makes sense.

;)

Thanks,
Chih-En Lin
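
For anyone who wants to repeat the comparison discussed above, here is a
minimal sketch of a fork() latency measurement that prints results in the
same "Fork latency per gigabyte" form. It is not the program Pasha used.
The function name fork_latency_per_gib and the 128 MiB default size are
assumptions made for illustration, and it is written in Python only for
brevity. It maps private anonymous memory, touches every page so that each
one is mapped, and times a single fork() call. If transparent huge pages
are enabled, the memory may be mapped differently and the numbers will
change accordingly.

```python
import mmap
import os
import time

PAGE_SIZE = mmap.PAGESIZE
GIB = 1 << 30
DEFAULT_SIZE = 128 << 20  # Hypothetical default; larger sizes reduce noise.


def fork_latency_per_gib(size_bytes):
    """Return the time spent in one fork(), scaled to one GiB of mapped memory."""
    # Private anonymous mapping, so fork() has to process its page table entries.
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        # Touch every page so that each one is actually mapped before forking.
        for offset in range(0, size_bytes, PAGE_SIZE):
            buf[offset] = 1

        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # Child: exit immediately without touching memory.
        elapsed = time.perf_counter() - start
        os.waitpid(pid, 0)  # Reap the child so no zombie is left behind.
    finally:
        buf.close()
    return elapsed * GIB / size_bytes


if __name__ == "__main__":
    for _ in range(3):
        latency = fork_latency_per_gib(DEFAULT_SIZE)
        print(f"Fork latency per gigabyte: {latency:.6f} seconds")
```

Running it with the same mapping size on a base kernel and on a kernel
with the series applied gives numbers that can be compared directly. Since
the timer also covers the return path of fork() in the parent, small
mappings will be dominated by fixed overhead.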