From: David Hildenbrand
Organization: Red Hat
To: Peter Xu, Ivan Teterevkov
Cc: Tiberiu Georgescu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Axel Rasmussen, Nadav Amit, Jerome Glisse, Kirill A. Shutemov, Jason Gunthorpe, Alistair Popple, Andrew Morton, Andrea Arcangeli, Matthew Wilcox, Mike Kravetz, Hugh Dickins, Miaohe Lin, Mike Rapoport, Carl Waldspurger [C], Florian Schmidt, ovzxemul@gmail.com
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
Date: Thu, 22 Jul 2021 08:27:07 +0200
Message-ID: <3a316327-0971-6c30-ca23-a2f9d580f97d@redhat.com>
References: <20210715201422.211004-1-peterx@redhat.com> <20210715201651.212134-1-peterx@redhat.com> <5c3c84ee-02f6-a2af-13b8-5dcf70676641@redhat.com>

On 22.07.21 00:57, Peter Xu wrote:
> On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
>> Hi, Ivan,
>>
>> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
>>> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
>>>> On 21.07.21 16:38, Ivan Teterevkov wrote:
>>>>> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>>>>>> I'm also curious what the real use of accurate PM_SWAP accounting
>>>>>> would be. To me, the current implementation may not provide an
>>>>>> accurate value, but it should be good enough for most cases;
>>>>>> however, I'm not sure whether that's also true for your use case.
>>>>>
>>>>> We want the PM_SWAP bit implemented (for shared memory in the
>>>>> pagemap interface) to enhance live migration for the fraction of
>>>>> guest VMs that have their pages swapped out to the host swap. Once
>>>>> those pages are paged in and transferred over the network, we want
>>>>> to release them with madvise(MADV_PAGEOUT) and preserve the working
>>>>> set of the guest VMs, to reduce thrashing of the host swap.
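
For concreteness, that userspace flow would be roughly the following
sketch (the PM_* bits are the ones documented in
Documentation/admin-guide/mm/pagemap.rst; the helper names are made up,
and fd setup and error handling are omitted):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PM_SWAP    (1ULL << 62)	/* page is swapped out */
#define PM_PRESENT (1ULL << 63)	/* page is present in RAM */

/* Was this guest page swapped out on the host? Reads one 64-bit
 * entry from /proc/<pid>/pagemap for the page containing addr. */
static int page_in_host_swap(int pagemap_fd, uintptr_t addr, long psize)
{
	uint64_t ent;
	off_t off = (off_t)(addr / psize) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
		return -1;
	return !!(ent & PM_SWAP);
}

/* Once the page was faulted in and sent over the network, push it
 * back out so the host swap working set is preserved.
 * MADV_PAGEOUT needs Linux >= 5.4. */
static void release_transferred_page(void *addr, long psize)
{
	madvise(addr, psize, MADV_PAGEOUT);
}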
>>>>
>>>> There are 3 possibilities, I think (swap is just another variant of
>>>> the page cache):
>>>>
>>>> 1) The page is not in the page cache, e.g., it resides on disk or in
>>>>    a swap file. pte_none().
>>>> 2) The page is in the page cache and is not mapped into the page
>>>>    table. pte_none().
>>>> 3) The page is in the page cache and mapped into the page table.
>>>>    !pte_none().
>>>>
>>>> Do I understand correctly that you want to identify 1) and indicate
>>>> it via PM_SWAP?
>>>
>>> Yes, and I also want to outline the context so we're on the same page.
>>>
>>> This series introduces support for userfaultfd-wp for shared memory,
>>> because once a shared page is swapped out, its PTE is cleared. Upon
>>> retrieval from a swap file, there's no way to "recover" the
>>> _PAGE_SWP_UFFD_WP flag, because unlike private memory it's not kept
>>> in the PTE or elsewhere.
>>>
>>> We came across the same issue with PM_SWAP in the pagemap interface,
>>> but fortunately there's a place we can query: the i_pages field of
>>> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
>>> we do it similarly to what shmem_fault() does when it handles #PF.
>>>
>>> Now, in the context of this series, we were exploring whether it
>>> makes any practical sense to introduce more brand-new flags in the
>>> special PTE to populate the pagemap flags "on the spot" from the
>>> given PTE.
>>>
>>> However, I can't see how (and why) to achieve that specifically for
>>> PM_SWAP even with an extra bit: the XArray is precisely what we need
>>> for the live migration use case. Another flag, PM_SOFT_DIRTY, suffers
>>> the same problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series,
>>> but we don't need it at the moment.
>>>
>>> Hope that clarification makes sense?
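
Concretely, the lookup Ivan describes boils down to something like the
sketch below (hand-waving locking details; not the exact patch code).
xa_load() returns either a struct page pointer or an XArray value
entry, and for shmem a value entry is a swap entry, i.e. the page is
swapped out:

static bool shmem_offset_is_swapped(struct vm_area_struct *vma,
				    unsigned long addr)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);

	/* shmem keeps swap entries in i_pages, as shmem_fault() expects */
	return xa_is_value(xa_load(&mapping->i_pages, pgoff));
}

The overhead Peter worries about below is exactly this xa_load() being
issued for every pte_none() entry, swapped out or not.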
>>
>> Yes, it helps, thanks.
>>
>> So I can understand now how that patch came about initially: even if
>> it may not work for PM_SOFT_DIRTY, it does seem to work for PM_SWAP.
>>
>> However, I have a concern that I also raised in the other thread: I
>> think there'll be an extra and meaningless xa_load() for all the real
>> pte_none()s that aren't swapped out but simply never had a page behind
>> them in the first place. That happens much more frequently when the
>> memory observed by pagemap is mapped in a huge chunk and sparsely
>> populated.
>>
>> With the old code we'd simply skip those ptes, but now I have no idea
>> how much overhead an xa_load() brings.

Let's benchmark it then. I feel like we really shouldn't be storing
data unnecessarily in page tables if it is readily available somewhere
else, because ...

>>
>> Btw, I think there's a way to implement such an idea similar to the
>> swap special uffd-wp pte: during page reclaim of shmem pages, instead
>> of putting a none pte there, we could set one bit in the none pte
>> showing that this pte is swapped out. When the page is faulted back
>> in, we just drop that bit.
>>
>> That bit could also be scanned by the pagemap code to know that this
>> page was swapped out. That should be much lighter than xa_load(), and
>> it identifies a swapped-out page immediately from a real none pte just
>> by reading the value.

... we are optimizing a corner-case feature (pagemap) by affecting
other parts of the system. Just imagine:

1. Forking: we will always have to copy the whole page tables for shmem
   instead of optimizing.
2. New shmem mappings: we will always have to sync back that bit from
   the page cache.

And these are just the things that immediately come to mind. There is
certainly more (e.g., page table reclaim [1]).

>>
>> Do you think this would work?
>
> Btw, I think that's what Tiberiu used to mention, but I think I just
> changed my mind.. Sorry to have brought such confusion.
>
> So what I think now is: we can set it (instead of zeroing the pte)
> right when unmapping the pte during page reclaim. Code-wise, that can
> be a special flag (maybe, TTU_PAGEOUT?) passed over to try_to_unmap()
> from shrink_page_list() to differentiate it from other try_to_unmap()s.
>
> I think that bit can also be dropped correctly, e.g., when punching a
> hole in the file; then rmap_walk() can find and drop the marker (I
> used to suspect the uffd-wp bit could get left-overs, but on second
> thought, similarly, it seems it won't: as long as hole punching and
> vma unmapping are always able to scan those marker ptes, it seems all
> right to drop them correctly).
>
> But those are my wild thoughts; I could have missed something too.
>

Adding to that: Peter, can you enlighten me how uffd-wp on shmem,
combined with the uffd-wp bit in page tables, is supposed to work in
general when talking about multiple processes?

Shmem means any process can modify any memory. To be able to properly
catch writes to such memory, the only way I can see it working is:

1. All processes register uffd-wp on the shmem VMA.
2. All processes arm uffd-wp by setting the same uffd-wp bits in their
   page tables for the affected shmem.
3. All processes synchronize, sending each other uffd-wp events when
   they receive one.

This is quite ... suboptimal, I have to say. It is really the only way
I can imagine uffd-wp working reliably. Is there any obvious way to
make this work that I am missing?

But then, all page tables are already supposed to contain the uffd-wp
bit. Which makes me think that we could actually get rid of the uffd-wp
bit in the page table for pte_none() entries and instead store this
information somewhere else (in the page cache?) for all entries
combined.

That simplification would result in:

1. All processes register uffd-wp on the shmem VMA.
2. One process wp-protects via the page cache (we can update all PTEs
   in other processes).
3. All processes synchronize, sending each other uffd-wp events when
   they receive one.

The semantics of uffd-wp on shmem would be different from what we have
so far ... which would be just fine, as we never had uffd-wp on shared
memory.

In an ideal world, 1. and 3. wouldn't be required, and all registered
uffd listeners would be notified when any process writes to the memory.

Sure, for single-user shmem it would work just like !shmem, but then
maybe that user really shouldn't be using shmem. But maybe I am missing
something important :)

> Thanks,
>

[1] https://lkml.kernel.org/r/20210718043034.76431-1-zhengqi.arch@bytedance.com

-- 
Thanks,

David / dhildenb