From: David Hildenbrand
Organization: Red Hat
To: Peter Xu, Ivan Teterevkov
Cc: Tiberiu Georgescu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Axel Rasmussen, Nadav Amit, Jerome Glisse, Kirill A. Shutemov, Jason Gunthorpe, Alistair Popple, Andrew Morton, Andrea Arcangeli, Matthew Wilcox, Mike Kravetz, Hugh Dickins, Miaohe Lin, Mike Rapoport, Carl Waldspurger [C], Florian Schmidt, ovzxemul@gmail.com
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
Date: Thu, 22 Jul 2021 08:27:07 +0200
Message-ID: <3a316327-0971-6c30-ca23-a2f9d580f97d@redhat.com>
References: <20210715201422.211004-1-peterx@redhat.com> <20210715201651.212134-1-peterx@redhat.com> <5c3c84ee-02f6-a2af-13b8-5dcf70676641@redhat.com>

On 22.07.21 00:57, Peter Xu wrote:
> On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
>> Hi, Ivan,
>>
>> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
>>> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
>>>> On 21.07.21 16:38, Ivan Teterevkov wrote:
>>>>> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>>>>>> I'm also curious what the real use of accurate PM_SWAP accounting
>>>>>> would be. To me, the current implementation may not provide an
>>>>>> accurate value, but it should be good enough for most cases;
>>>>>> however, I'm not sure whether that's also true for your use case.
>>>>>
>>>>> We want the PM_SWAP bit implemented (for shared memory in the
>>>>> pagemap interface) to enhance live migration for the fraction of
>>>>> guest VMs that have their pages swapped out to the host swap. Once
>>>>> those pages are paged in and transferred over the network, we want
>>>>> to release them with madvise(MADV_PAGEOUT) and preserve the working
>>>>> set of the guest VMs, to reduce thrashing of the host swap.
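
For concreteness, that userspace flow would be roughly the following
sketch (the PM_* bits are the ones documented in
Documentation/admin-guide/mm/pagemap.rst; the helper names are made up,
and fd setup and error handling are omitted):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PM_SWAP    (1ULL << 62)	/* page is swapped out */
#define PM_PRESENT (1ULL << 63)	/* page is present in RAM */

/* Was this guest page swapped out on the host? Reads one 64-bit
 * entry from /proc/<pid>/pagemap for the page containing addr. */
static int page_in_host_swap(int pagemap_fd, uintptr_t addr, long psize)
{
	uint64_t ent;
	off_t off = (off_t)(addr / psize) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
		return -1;
	return !!(ent & PM_SWAP);
}

/* Once the page was faulted in and sent over the network, push it
 * back out so the host swap working set is preserved.
 * MADV_PAGEOUT needs Linux >= 5.4. */
static void release_transferred_page(void *addr, long psize)
{
	madvise(addr, psize, MADV_PAGEOUT);
}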
>>>>
>>>> There are 3 possibilities, I think (swap is just another variant of
>>>> the page cache):
>>>>
>>>> 1) The page is not in the page cache, e.g., it resides on disk or in
>>>>    a swap file. pte_none().
>>>> 2) The page is in the page cache and is not mapped into the page
>>>>    table. pte_none().
>>>> 3) The page is in the page cache and mapped into the page table.
>>>>    !pte_none().
>>>>
>>>> Do I understand correctly that you want to identify 1) and indicate
>>>> it via PM_SWAP?
>>>
>>> Yes, and I also want to outline the context so we're on the same page.
>>>
>>> This series introduces support for userfaultfd-wp for shared memory,
>>> because once a shared page is swapped out, its PTE is cleared. Upon
>>> retrieval from a swap file, there's no way to "recover" the
>>> _PAGE_SWP_UFFD_WP flag, because unlike private memory it's not kept
>>> in the PTE or elsewhere.
>>>
>>> We came across the same issue with PM_SWAP in the pagemap interface,
>>> but fortunately there's a place we can query: the i_pages field of
>>> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
>>> we do it similarly to what shmem_fault() does when it handles #PF.
>>>
>>> Now, in the context of this series, we were exploring whether it
>>> makes any practical sense to introduce more brand-new flags in the
>>> special PTE to populate the pagemap flags "on the spot" from the
>>> given PTE.
>>>
>>> However, I can't see how (and why) to achieve that specifically for
>>> PM_SWAP even with an extra bit: the XArray is precisely what we need
>>> for the live migration use case. Another flag, PM_SOFT_DIRTY, suffers
>>> the same problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series,
>>> but we don't need it at the moment.
>>>
>>> Hope that clarification makes sense?
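
Concretely, the lookup Ivan describes boils down to something like the
sketch below (hand-waving locking details; not the exact patch code).
xa_load() returns either a struct page pointer or an XArray value
entry, and for shmem a value entry is a swap entry, i.e. the page is
swapped out:

static bool shmem_offset_is_swapped(struct vm_area_struct *vma,
				    unsigned long addr)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);

	/* shmem keeps swap entries in i_pages, as shmem_fault() expects */
	return xa_is_value(xa_load(&mapping->i_pages, pgoff));
}

The overhead Peter worries about below is exactly this xa_load() being
issued for every pte_none() entry, swapped out or not.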
>>
>> Yes, it helps, thanks.
>>
>> So I can understand now how that patch came about initially: even if
>> it may not work for PM_SOFT_DIRTY, it does seem to work for PM_SWAP.
>>
>> However, I have a concern that I also raised in the other thread: I
>> think there'll be an extra and meaningless xa_load() for all the real
>> pte_none()s that aren't swapped out but simply never had a page behind
>> them in the first place. That happens much more frequently when the
>> memory observed by pagemap is mapped in a huge chunk and sparsely
>> populated.
>>
>> With the old code we'd simply skip those ptes, but now I have no idea
>> how much overhead an xa_load() brings.

Let's benchmark it then. I feel like we really shouldn't be storing
data unnecessarily in page tables if it is readily available somewhere
else, because ...

>>
>> Btw, I think there's a way to implement such an idea similar to the
>> swap special uffd-wp pte: during page reclaim of shmem pages, instead
>> of putting a none pte there, we could set one bit in the none pte
>> showing that this pte is swapped out. When the page is faulted back
>> in, we just drop that bit.
>>
>> That bit could also be scanned by the pagemap code to know that this
>> page was swapped out. That should be much lighter than xa_load(), and
>> it identifies a swapped-out page immediately from a real none pte just
>> by reading the value.

... we are optimizing a corner-case feature (pagemap) by affecting
other parts of the system. Just imagine:

1. Forking: we will always have to copy the whole page tables for shmem
   instead of optimizing.
2. New shmem mappings: we will always have to sync back that bit from
   the page cache.

And these are just the things that immediately come to mind. There is
certainly more (e.g., page table reclaim [1]).

>>
>> Do you think this would work?
>
> Btw, I think that's what Tiberiu used to mention, but I think I just
> changed my mind.. Sorry to have brought such confusion.
>
> So what I think now is: we can set it (instead of zeroing the pte)
> right when unmapping the pte during page reclaim. Code-wise, that can
> be a special flag (maybe, TTU_PAGEOUT?) passed over to try_to_unmap()
> from shrink_page_list() to differentiate it from other try_to_unmap()s.
>
> I think that bit can also be dropped correctly, e.g., when punching a
> hole in the file; then rmap_walk() can find and drop the marker (I
> used to suspect the uffd-wp bit could get left-overs, but on second
> thought, similarly, it seems it won't: as long as hole punching and
> vma unmapping are always able to scan those marker ptes, it seems all
> right to drop them correctly).
>
> But those are my wild thoughts; I could have missed something too.
>

Adding to that: Peter, can you enlighten me how uffd-wp on shmem,
combined with the uffd-wp bit in page tables, is supposed to work in
general when talking about multiple processes?

Shmem means any process can modify any memory. To be able to properly
catch writes to such memory, the only way I can see it working is:

1. All processes register uffd-wp on the shmem VMA.
2. All processes arm uffd-wp by setting the same uffd-wp bits in their
   page tables for the affected shmem.
3. All processes synchronize, sending each other uffd-wp events when
   they receive one.

This is quite ... suboptimal, I have to say. It is really the only way
I can imagine uffd-wp working reliably. Is there any obvious way to
make this work that I am missing?

But then, all page tables are already supposed to contain the uffd-wp
bit. Which makes me think that we could actually get rid of the uffd-wp
bit in the page table for pte_none() entries and instead store this
information somewhere else (in the page cache?) for all entries
combined.

That simplification would result in:

1. All processes register uffd-wp on the shmem VMA.
2. One process wp-protects via the page cache (we can update all PTEs
   in other processes).
3. All processes synchronize, sending each other uffd-wp events when
   they receive one.

The semantics of uffd-wp on shmem would be different from what we have
so far ... which would be just fine, as we never had uffd-wp on shared
memory.

In an ideal world, 1. and 3. wouldn't be required, and all registered
uffd listeners would be notified when any process writes to the memory.

Sure, for single-user shmem it would work just like !shmem, but then
maybe that user really shouldn't be using shmem. But maybe I am missing
something important :)

> Thanks,
>

[1] https://lkml.kernel.org/r/20210718043034.76431-1-zhengqi.arch@bytedance.com

-- 
Thanks,

David / dhildenb