From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 069DAC433FE for ; Thu, 29 Sep 2022 18:57:50 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6B9C48D0003; Thu, 29 Sep 2022 14:57:49 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 669798D0001; Thu, 29 Sep 2022 14:57:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4E2078D0003; Thu, 29 Sep 2022 14:57:49 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 3B7D28D0001 for ; Thu, 29 Sep 2022 14:57:49 -0400 (EDT) Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 123CA81379 for ; Thu, 29 Sep 2022 18:57:49 +0000 (UTC) X-FDA: 79966032258.04.A83B974 Received: from mail-pg1-f175.google.com (mail-pg1-f175.google.com [209.85.215.175]) by imf21.hostedemail.com (Postfix) with ESMTP id B48CE1C0008 for ; Thu, 29 Sep 2022 18:57:48 +0000 (UTC) Received: by mail-pg1-f175.google.com with SMTP id r62so2213845pgr.12 for ; Thu, 29 Sep 2022 11:57:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date; bh=MPKXDe/4Z1BXklCL+wTKgd8YdLgG1d0/bq9BWTsLDnI=; b=kh6RN9IZGAGzL52O8pWBf25xA+9pL1xEdR03q9YULZIhO/rEtg8tL5d6dksYy7Q1SL T+xyCgZPah/rNxBuGlOS8wqlsz2uJk5Bks3lhbLnnnZgTvG3Y5ogWmwreX7y+OWBNcTM GMy4hPHoLjOXRldnCyrEXSmajHIDDsxd8pj/i4bG7YZt+pxqeEr6Tweg6wuIMmmPGjic ES6gtYME4Wq1EAl5dbvJWMjtIlE3j8NgagtkrPBdpp39JrE5Jolc/UeT5N1kJBQBLHMU 6e+P7kPCNWxgzC33J6a39E1E0Hj8EbnZCp4p9PbMOW6+yiMK29H1fUs5F/vdxj9BkSNI dBGw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date; bh=MPKXDe/4Z1BXklCL+wTKgd8YdLgG1d0/bq9BWTsLDnI=; b=jFFDE3lT9cSOaR3VTgTfp7+JUcwnFJHfi8y1G9xM+VQQdLUQPjmBD/TCLcdhmB9Lwz sNrZst/A5tFzZ0KvRBCZS1qMmUxqE4aQQ9wEvEgB443+26sZODhAIdpi6euSNWu1MmKj XEB2qa/hYi2Jdwk9OVSk4sLgxUWilkw1x9KtYtBGP5eCXzYh2SIt9zb3IeSadT9M0Kps Rqv05o8lQhbcmoriXud4GgosPRLqsFS/0FAbqj4goZecgKfeYfGO5+UN+zSTnSWjkWhx xvrepB8HiLiwejhJwDSB6vCYDT7AxYVvX9sQ1AoBZ2XspPkq5pvWGA05bJSZp/x8w/Sd I+Bw== X-Gm-Message-State: ACrzQf0UFWCPBvij9yl7dXbXS9HhB74vbopOGwm1pGFGri4/t9GI/CSI bOeMNxcTdHkkRbgQS7X7ms4= X-Google-Smtp-Source: AMsMyM7KahqHAdmUB7UdN83AxqwnKIAauwCWsOlH7de+r6DvfN5kPmhprCdhtVBTIWO2ORp+pLASJg== X-Received: by 2002:a63:2bcc:0:b0:439:36bb:c036 with SMTP id r195-20020a632bcc000000b0043936bbc036mr4060428pgr.447.1664477867656; Thu, 29 Sep 2022 11:57:47 -0700 (PDT) Received: from strix-laptop (2001-b011-20e0-1b9a-f5f9-665b-0715-9cc1.dynamic-ip6.hinet.net. [2001:b011:20e0:1b9a:f5f9:665b:715:9cc1]) by smtp.gmail.com with ESMTPSA id f14-20020a170902684e00b00178b06fea7asm225050pln.148.2022.09.29.11.57.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 29 Sep 2022 11:57:47 -0700 (PDT) Date: Fri, 30 Sep 2022 02:57:40 +0800 From: Chih-En Lin To: David Hildenbrand Cc: Nadav Amit , Andrew Morton , Qi Zheng , Matthew Wilcox , Christophe Leroy , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A . Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Message-ID: References: <20220927162957.270460-1-shiyn.lin@gmail.com> <20220927162957.270460-10-shiyn.lin@gmail.com> <3D21021E-490F-4FE0-9C75-BB3A46A66A26@vmware.com> <39c5ef18-1138-c879-2c6d-c013c79fa335@redhat.com> <834c258d-4c0e-1753-3608-8a7e28c14d07@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <834c258d-4c0e-1753-3608-8a7e28c14d07@redhat.com> ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=kh6RN9IZ; spf=pass (imf21.hostedemail.com: domain of shiyn.lin@gmail.com designates 209.85.215.175 as permitted sender) smtp.mailfrom=shiyn.lin@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1664477868; a=rsa-sha256; cv=none; b=TF436a12OYks+aDY9h5WVEIa+ErGXJ9Ep/7OhcF6Qvq/64ySRMIg47BqdEsO4aZk26UDuh 9J3zgChrIHdTH0MS2KF7VWDoBonRUxVMwJv15ti94CCeD5GdmC+qeJNg5xMmvmsMyT24DD vmVsrjQNZ59rslMKsiTdqLBsvCJ1igs= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1664477868; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=MPKXDe/4Z1BXklCL+wTKgd8YdLgG1d0/bq9BWTsLDnI=; b=od4QdCm7iAuvt+6c6um3kzMTmH8Zug2STRmsCJIxmGpn3IEgeTbgFWnHeBZZKXFeFXz6uA 9WPITKXrQ5IiVXUrDxlSMEJdX5oOj5TjRNqNHsmKqegDeb4Q1LNpIb3d6VRRTWbx3AIvIf WujmCJ+L0kFQOTd/a8YSa+vtLIKebQs= X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: B48CE1C0008 X-Rspam-User: Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=kh6RN9IZ; spf=pass (imf21.hostedemail.com: domain of shiyn.lin@gmail.com designates 209.85.215.175 as permitted sender) smtp.mailfrom=shiyn.lin@gmail.com; dmarc=pass (policy=none) header.from=gmail.com X-Stat-Signature: wxgamkemmp56qa5j7waexrj9yc8nedx4 X-HE-Tag: 1664477868-359477 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote: > On 29.09.22 20:29, Chih-En Lin wrote: > > On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote: > > > > > IMHO, a relaxed form that focuses on only the memory consumption reduction > > > > > could *possibly* be accepted upstream if it's not too invasive or complex. > > > > > During fork(), we'd do exactly what we used to do to PTEs (increment > > > > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O, > > > > > duplicate swap entries; all while holding the page table lock), however, > > > > > sharing the prepared page table with the child process using COW after we > > > > > prepared it. > > > > > > > > > > Any (most once we want to *optimize* rmap handling) modification attempts > > > > > require breaking COW -- copying the page table for the faulting process. But > > > > > at that point, the PTEs are already write-protected and properly accounted > > > > > (refcount/mapcount/PageAnonExclusive). > > > > > > > > > > Doing it that way might not require any questionable GUP hacks and swapping, > > > > > MMU notifiers etc. "might just work as expected" because the accounting > > > > > remains unchanged" -- we simply de-duplicate the page table itself we'd have > > > > > after fork and any modification attempts simply replace the mapped copy. > > > > > > > > Agree. > > > > However for GUP hacks, if we want to do the COW to page table, we still > > > > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to > > > > check whether the PTE table is available or not before we do the COW to > > > > the table). Otherwise, it will be more complicated since it might need > > > > to handle situations like while preparing the COW work, it just figuring > > > > out that it needs to duplicate the whole table and roll back (recover > > > > the state and copy it to new table). Hopefully, I'm not wrong here. > > > > > > The nice thing is that GUP itself *usually* doesn't modify page tables. One > > > corner case is follow_pfn_pte(). All other modifications should happen in > > > the actual fault handler that has to deal with such kind of unsharing either > > > way when modifying the PTE. > > > > > > If the pages are already in a COW-ed pagetable in the desired "shared" state > > > (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such > > > pages will just work as expected and we shouldn't be surprised by another > > > set of GUP+COW CVEs. > > > > > > We'd really only deduplicate the page table and not play other tricks with > > > the actual page table content that differ from the existing way of handling > > > fork(). > > > > > > I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when > > > not modifying the page table. I think we only need "we have to unshare this > > > page table now" in follow_pfn_pte() and inside the fault handling when GUP > > > triggers a fault. > > > > > > I hope my assumption is correct, or am I missing something? > > > > > > > My consideration is when we pinned the page and did the COW to make the > > page table be shared. It might not allow mapping the pinned page to R/O) > > into both processes. > > > > So, if the fork is working on the shared state, it needs to recover the > > table and copy to a new one since that pinned page will need to copy > > immediately. We can hold the shared state after occurring such a > > situation. So we still need some trick to let the fork() know which page > > table already has the pinned page (or such page won't let us share) > > before going to duplicate. > > > > Am I wrong here? > > I think you might be overthinking this. Let's keep it simple: > > 1) Handle pinned anon pages just as I described below, falling back to the > "slow" path of page table copying. > > 2) Once we passed that stage, you can be sure that the COW-ed page table > cannot have actually pinned anon pages. All anon pages in such a page table > have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in > pinning these pages anymore, because it will only pin exclusive anon pages! > > 3) If anybody wants to take a R/O pin on a shared anon page that is mapped > into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead > of pinning the page. This has to break COW on the page table and properly > map an exclusive anon page into it, breaking COW. > > Do you see a problem with that? > > > > > After that, since we handled the accounting in fork(), we don't need > > ownership (pmd_t pointer) anymore. We have to find another way to mark > > the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is > > stored at that space.) > > > > > > > > > > > But devil is in the detail (page table lock, TLB flushing). > > > > > > > > Sure, it might be an overhead in the page fault and needs to be handled > > > > carefully. ;) > > > > > > > > > "will make fork() even have more overhead" is not a good excuse for such > > > > > complexity/hacks -- sure, it will make your benchmark results look better in > > > > > comparison ;) > > > > > > > > ;);) > > > > I think that, even if we do the accounting with the COW page table, it > > > > still has a little bit improve. > > > > > > :) > > > > > > My gut feeling is that this is true. While we have to do a pass over the > > > parent page table during fork and wrprotect all PTEs etc., we don't have to > > > duplicate the page table content and allocate/free memory for that. > > > > > > One interesting case is when we cannot share an anon page with the child > > > process because it maybe pinned -- and we have to copy it via > > > copy_present_page(). In that case, the page table between the parent and the > > > child would differ and we'd not be able to share the page table. > > > > That is what I want to say above. > > The case might happen in the middle of the shared page table progress. > > It might cost more overhead to recover it. Therefore, if GUP wants to > > pin the mapped page we can mark the PTE table first, so fork() won't > > waste time doing the work for sharing. > > Having pinned pages is a corner case for most apps. No need to worry about > optimizing this corner case for now. > > I see what you are trying to optimize, but I don't think this is needed in a > first version, and probably never is needed. > > > Any attempts to mark page tables in a certain way from GUP > (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast > (get_user_pages_fast) can race with pretty much anything, even with > concurrent fork. I suspect your current code might be really racy in that > regard. I see. Now, I know why optimizing that corner case is not worth it. Thank you for explaining that. Thanks, Chih-En Lin