Message-ID: <3654e74b-8145-33bb-1eb7-fb5e2ffd2fba@redhat.com>
Date: Thu, 29 Sep 2022 21:00:05 +0200
Subject: Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
From: David Hildenbrand
Organization: Red Hat
To: Chih-En Lin
Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox, Christophe Leroy,
 "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", Luis Chamberlain,
 Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
 "Kirill A . Shutemov", Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
 Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual, Minchan Kim,
 Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
 Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu, Dinglan Peng,
 Pedro Fonseca, Jim Huang, Huichun Feng
References: <20220927162957.270460-1-shiyn.lin@gmail.com>
 <20220927162957.270460-10-shiyn.lin@gmail.com>
 <3D21021E-490F-4FE0-9C75-BB3A46A66A26@vmware.com>
 <39c5ef18-1138-c879-2c6d-c013c79fa335@redhat.com>
 <834c258d-4c0e-1753-3608-8a7e28c14d07@redhat.com>

On 29.09.22 20:57, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
>> On 29.09.22 20:29, Chih-En Lin wrote:
>>> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>>>> duplicate swap entries; all while holding the page table lock), however,
>>>>>> sharing the prepared page table with the child process using COW after we
>>>>>> prepared it.
>>>>>>
>>>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>>>> require breaking COW -- copying the page table for the faulting process. But
>>>>>> at that point, the PTEs are already write-protected and properly accounted
>>>>>> (refcount/mapcount/PageAnonExclusive).
>>>>>>
>>>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>>>> remains unchanged -- we simply de-duplicate the page table itself we'd have
>>>>>> after fork and any modification attempts simply replace the mapped copy.
>>>>>
>>>>> Agree.
>>>>> However for GUP hacks, if we want to do the COW to page table, we still
>>>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>>>> check whether the PTE table is available or not before we do the COW to
>>>>> the table). Otherwise, it will be more complicated since it might need
>>>>> to handle situations where, while preparing the COW work, it finds
>>>>> out that it needs to duplicate the whole table and roll back (recover
>>>>> the state and copy it to a new table). Hopefully, I'm not wrong here.
>>>>
>>>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>>>> corner case is follow_pfn_pte(). All other modifications should happen in
>>>> the actual fault handler that has to deal with this kind of unsharing either
>>>> way when modifying the PTE.
>>>>
>>>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>>>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>>>> pages will just work as expected and we shouldn't be surprised by another
>>>> set of GUP+COW CVEs.
>>>>
>>>> We'd really only deduplicate the page table and not play other tricks with
>>>> the actual page table content that differ from the existing way of handling
>>>> fork().
>>>>
>>>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>>>> not modifying the page table. I think we only need "we have to unshare this
>>>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>>>> triggers a fault.
>>>>
>>>> I hope my assumption is correct, or am I missing something?
>>>>
>>>
>>> My concern is the case where we have pinned a page and then do the COW
>>> to share the page table. It might not be allowed to map the pinned page
>>> R/O into both processes.
>>>
>>> So, if the fork is working on the shared state, it needs to recover the
>>> table and copy to a new one since that pinned page will need to be copied
>>> immediately. We can hold the shared state after such a situation
>>> occurs. So we still need some trick to let the fork() know which page
>>> table already has the pinned page (or such page won't let us share)
>>> before going to duplicate.
>>>
>>> Am I wrong here?
>>
>> I think you might be overthinking this. Let's keep it simple:
>>
>> 1) Handle pinned anon pages just as I described below, falling back to the
>> "slow" path of page table copying.
>>
>> 2) Once we passed that stage, you can be sure that the COW-ed page table
>> cannot have actually pinned anon pages. All anon pages in such a page table
>> have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
>> pinning these pages anymore, because it will only pin exclusive anon pages!
>>
>> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
>> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
>> of pinning the page. This has to break COW on the page table and properly
>> map an exclusive anon page into it, breaking COW.
>>
>> Do you see a problem with that?
>>
>>>
>>> After that, since we handled the accounting in fork(), we don't need
>>> ownership (pmd_t pointer) anymore. We have to find another way to mark
>>> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
>>> stored at that space.)
>>>
>>>>>
>>>>>> But the devil is in the detail (page table lock, TLB flushing).
>>>>>
>>>>> Sure, it might be an overhead in the page fault and needs to be handled
>>>>> carefully. ;)
>>>>>
>>>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>>>> comparison ;)
>>>>>
>>>>> ;);)
>>>>> I think that, even if we do the accounting with the COW page table, it
>>>>> still gives a small improvement.
>>>>
>>>> :)
>>>>
>>>> My gut feeling is that this is true. While we have to do a pass over the
>>>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>>>> duplicate the page table content and allocate/free memory for that.
>>>>
>>>> One interesting case is when we cannot share an anon page with the child
>>>> process because it may be pinned -- and we have to copy it via
>>>> copy_present_page(). In that case, the page table between the parent and the
>>>> child would differ and we'd not be able to share the page table.
>>>
>>> That is what I wanted to say above.
>>> The case might happen in the middle of the page table sharing process.
>>> It might cost more overhead to recover it. Therefore, if GUP wants to
>>> pin the mapped page we can mark the PTE table first, so fork() won't
>>> waste time doing the work for sharing.
>>
>> Having pinned pages is a corner case for most apps. No need to worry about
>> optimizing this corner case for now.
>>
>> I see what you are trying to optimize, but I don't think this is needed in a
>> first version, and probably never is needed.
>>
>>
>> Any attempt to mark page tables in a certain way from GUP
>> (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
>> (get_user_pages_fast) can race with pretty much anything, even with
>> concurrent fork. I suspect your current code might be really racy in that
>> regard.
>
> I see.
> Now, I know why optimizing that corner case is not worth it.
> Thank you for explaining that.

Falling back after already processing some PTEs requires some care,
though. I guess it's not too hard to get it right -- it might be harder
to get it "clean". But we can talk about that detail later.

-- 
Thanks,

David / dhildenb
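
Illustration for readers of this thread: the three-step scheme quoted above
(fall back to copying when a pinned anon page is found, otherwise share the
prepared table, and turn a later R/O pin on a shared anon page into an unshare
fault) can be sketched as a toy user-space model in plain C. This is only a
sketch under stated assumptions, not kernel code: every toy_* name, the
8-entry table size and the boolean flags are hypothetical stand-ins for
PageAnonExclusive, the "maybe pinned" check and FAULT_FLAG_UNSHARE, and
locking, TLB flushing and the GUP-fast races mentioned above are left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTRS_PER_TABLE 8	/* toy size, chosen only for illustration */

/* Toy stand-in for an anonymous page; NOT the kernel's struct page. */
struct toy_page {
	int mapcount;
	int refcount;
	bool anon_exclusive;	/* models PageAnonExclusive */
	bool maybe_pinned;	/* models "page may be pinned via GUP" */
};

/* Toy PTE: a NULL page means "not present". */
struct toy_pte {
	struct toy_page *page;
	bool writable;
};

struct toy_pte_table {
	struct toy_pte pte[TOY_PTRS_PER_TABLE];
	int sharers;		/* > 1 means the table itself is COW-shared */
};

enum toy_fork_result { TOY_TABLE_SHARED, TOY_FALLBACK_COPY };
enum toy_pin_result { TOY_PINNED, TOY_NEED_UNSHARE_FAULT };

/*
 * Fork-time pass over one table: account and write-protect each present
 * entry as a regular fork would, then share the table instead of copying
 * it. An exclusive page that may be pinned cannot be shared, so the pass
 * stops and asks the caller to use the slow copy path.
 */
enum toy_fork_result toy_fork_share_table(struct toy_pte_table *t)
{
	for (size_t i = 0; i < TOY_PTRS_PER_TABLE; i++) {
		struct toy_page *p = t->pte[i].page;

		if (!p)
			continue;
		if (p->anon_exclusive && p->maybe_pinned) {
			/*
			 * Entries before i are already accounted and
			 * write-protected; the slow path has to continue
			 * from that state rather than start over.
			 */
			return TOY_FALLBACK_COPY;
		}
		p->mapcount++;
		p->refcount++;
		p->anon_exclusive = false;	/* now "maybe shared" */
		t->pte[i].writable = false;	/* map R/O */
	}
	t->sharers++;
	return TOY_TABLE_SHARED;
}

/*
 * R/O pin attempt: only exclusive anon pages are pinned. For a shared
 * page the caller has to take an unshare fault first, which in this
 * model would also break COW on the page table.
 */
enum toy_pin_result toy_try_pin_ro(const struct toy_pte_table *t, size_t idx)
{
	struct toy_page *p = t->pte[idx].page;

	assert(p != NULL);
	if (!p->anon_exclusive)
		return TOY_NEED_UNSHARE_FAULT;
	p->refcount++;
	p->maybe_pinned = true;
	return TOY_PINNED;
}
```

In this model, a table that contains no pinned exclusive page ends up shared
with all its pages marked "maybe shared", and any later toy_try_pin_ro() on
it reports that an unshare fault is needed. A table with a pinned exclusive
page stops partway, which is the partially processed state the fallback has
to handle with care.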