From: David Hildenbrand
Organization: Red Hat
To: Peter Xu
Cc: Tiberiu A Georgescu, akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
 christian.brauner@ubuntu.com, ebiederm@xmission.com, adobriyan@gmail.com,
 songmuchun@bytedance.com, axboe@kernel.dk, vincenzo.frascino@arm.com,
 catalin.marinas@arm.com, peterz@infradead.org, chinwen.chang@mediatek.com,
 linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, ivan.teterevkov@nutanix.com,
 florian.schmidt@nutanix.com, carl.waldspurger@nutanix.com,
 jonathan.davies@nutanix.com
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Date: Wed, 11 Aug 2021 20:41:32 +0200
Message-ID: <253e7067-1c62-19bd-d395-d5c0495610d7@redhat.com>
References: <20210730160826.63785-1-tiberiu.georgescu@nutanix.com>
 <839e82f7-2c54-d1ef-8371-0a332a4cb447@redhat.com>
 <0beb1386-d670-aab1-6291-5c3cb0d661e0@redhat.com>

On 11.08.21 20:25, Peter Xu wrote:
> On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
>> On 04.08.21 21:17, Peter Xu wrote:
>>> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>>>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>>>> any state information regarding shared memory in per-process page tables: it
>>>> just doesn't make too much sense.
>>>>
>>>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>>>> belongs to the shared file ("did *someone* write to this page", "is
>>>> *someone* interested into changes to that page", "is there something"). I
>>>> know, that screams for a completely different design in respect to these
>>>> features.
>>>>
>>>> I guess we start learning the hard way that shared memory is just different
>>>> and requires different interfaces than per-process page table interfaces we
>>>> have (pagemap, userfaultfd).
>>>>
>>>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>>>> such stuff per an actual fd/memfd and not via process page tables is
>>>> actually the right and clean approach. There are certainly many issues to
>>>> solve, but conceptually to me it feels more natural to have these shared
>>>> memory features not mangled into process page tables.
>>>
>>> Yes, we can explore all the possibilities, I'm totally fine with it.
>>>
>>> I just want to say I still don't think when there's page cache then we must put
>>> all the page-relevant things into the page cache.
>>
>> [sorry for the late reply]
>>
>> Right, but for the case of shared, swapped out pages, the information is
>> already there, in the page cache :)
>>
>>>
>>> They're shared by processes, but process can still have its own way to describe
>>> the relationship to that page in the cache, to me it's as simple as "we allow
>>> process A to write to page cache P", while "we don't allow process B to write
>>> to the same page" like the write bit.
>>
>> The issue I'm having with uffd-wp as it was proposed for shared memory is that
>> there is hardly a sane use case where we would *want* it to work that way.
>>
>> A UFFD-WP flag in a page table for shared memory means "please notify once
>> this process modifies the shared memory (via page tables, not via any other
>> fd modification)". Do we have an example application where these semantics
>> make sense and don't over-complicate the whole approach? I don't know any,
>> thus I'm asking dumb questions :)
>>
>>
>> For background snapshots in QEMU the flow would currently be like this,
>> assuming all processes have the shared guest memory mapped.
>>
>> 1. Background snapshot preparation: QEMU requests all processes
>>    to uffd-wp the range
>> a) All processes register a uffd handler on guest RAM
>
> To be explicit: not a handler; just register with uffd-wp and pass over the fd
> to the main process.

Good point.

>
>> b) All processes fault in all guest memory (essentially populating all
>>    memory): with a uffd-WP extension we might be able to get rid of
>>    that, I remember you were working on that.
>> c) All processes uffd-WP the range to set the bit in their page table
>>
>> 2. Background snapshot runs:
>> a) A process either receives a UFFD-WP event and forwards it to QEMU or
>>    QEMU polls all other processes for UFFD events.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU triggers all processes to un-protect the page and wake up any
>>    waiters.
>>    All processes clear the uffd-WP bit in their page tables.
>>
>> 3. Background snapshot completes:
>> a) All processes unregister the uffd handler
>>
>>
>> Now imagine something like this:
>>
>> 1. Background snapshot preparation:
>> a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
>>    to guest memory.
>> b) QEMU uffd-wp's the whole file
>>
>> 2. Background snapshot runs:
>> a) QEMU receives a UFFD-WP event.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU un-protects the page and wakes up any waiters.
>>
>> 3. Background snapshot completes:
>> a) QEMU unregisters the uffd handler
>>
>>
>> Wouldn't that be much nicer and much easier to handle? Yes, it is much
>> harder to implement because such an infrastructure does not exist yet, and
>> it most probably wouldn't be called uffd anymore, because we are dealing
>> with file access. But this way, it would actually be super easy to use the
>> feature across multiple processes and eventually to even catch other file
>> modifications.
>
> I can totally understand how you see this. We've discussed about that, isn't
> it? About the ideal worlds. :)

Well, let's dream big :)

>
> It would be great if this can work out, I hope so. So far I'm not that
> ambitious, and as I said, I don't know whether there will be other concerns
> when it goes into the page cache layer, and when it's a behavior of multiple
> processes where one of them can rule others without others being notified of it.
>
> Even if we want to go that way, I think we should first come up with some way
> to describe the domains that one uffd-wp registered file should behave upon.
> It shouldn't be "any process touching this file".
>
> One quick example in my mind is when a malicious process wants to stop another
> daemon process, it'll be easier as long as the malicious process can delete a
> file that the daemon used to read/write, replace it with a shmem with uffd-wp
> registered (or maybe just a regular file on file systems, if your proposal will
> naturally work on them). The problem is, is it really "legal" to be able to
> stop the daemon running like that?

Good question, I'd imagine e.g., file sealing could forbid uffd (or
however it is called) registration on a file, and there would have to be
a way to reject files that have uffd registered. But it's certainly a
valid concern - and it raises the question of *what* we actually want to
apply such a concept to. Random files? Random memfds? Most probably not.
Special memfds created with an ALLOW_UFFD flag? Sounds like a good idea.

>
> I also don't know the initial concept when uffd is designed and why it's
> designed at pte level. Avoid vma manipulation should be a major factor, but I
> can't say I understand all of them. Not sure whether Andrea has any input here.

AFAIU originally a) avoid signal handler madness and b) avoid VMA
modifications and c) avoid taking the mmap lock in write (well, that
didn't work out completely for uffd-wp for now IIRC).

>
> That's why I think current uffd can still make sense with per-process concepts
> and keep it that way. When register uffd-wp yes we need to do that for
> multiple processes, but it also means each process is fully aware that this is
> happening so it's kind of verified that this is wanted behavior for that
> process. It'll happen with less "surprises", and smells safer.
>
> I don't think that will not work out. It may require all the processes to
> support uffd-wp apis and cooperate, but that's so far how it should work for me
> in a safe and self-contained way.
> Say, every process should be aware of what's
> going to happen on blocked page faults.

That's a valid concern, although I wonder if it can just be handled via
specially marked memfds ("this memfd might get a uffd handler registered
later").

>>
>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>> when applied to shmem. But I'm happy to learn more.
>
> Me too, I'm more than glad to know whether the page cache idea could be
> welcomed or am I just wrong about it. Before I understand more things around
> this, so far I still think the per-process based and fd-based solution of uffd
> still makes sense.

I'd be curious about applications where the per-process approach would
actually solve something a per-fd approach couldn't solve. Maybe there
are some that I just can't envision.

(using shmem for a single process only isn't a use case I consider
important :) )

-- 
Thanks,

David / dhildenb
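
For illustration only, the two background-snapshot flows compared in this
message can be put side by side in a small toy model. This is a sketch under
stated assumptions, not kernel or QEMU code: the class names, the process
names and the "ops" counter are all hypothetical, and the per-file variant
models an interface that, as noted above, does not exist yet. It only shows
where the write-protect state lives in each flow and how many control
requests a single snapshotted page costs.

```python
from dataclasses import dataclass, field


@dataclass
class PerProcessWP:
    """Flow 1 (hypothetical model): protection lives in each process's page tables."""
    procs: list
    npages: int
    wp: dict = field(default_factory=dict)  # process name -> protected pages
    ops: int = 0                            # control requests issued so far

    def protect_all(self):
        for p in self.procs:                # every process has to cooperate
            self.wp[p] = set(range(self.npages))
            self.ops += 1

    def mapped_write_faults(self, proc, page):
        return page in self.wp[proc]

    def fd_write_faults(self, page):
        return False                        # a write via the fd bypasses page tables

    def unprotect(self, page):
        for p in self.procs:                # again one request per process
            self.wp[p].discard(page)
            self.ops += 1


@dataclass
class PerFileWP:
    """Flow 2 (hypothetical model): protection is state of the memfd itself."""
    npages: int
    wp: set = field(default_factory=set)
    ops: int = 0

    def protect_all(self):
        self.wp = set(range(self.npages))
        self.ops += 1

    def mapped_write_faults(self, proc, page):
        return page in self.wp              # same answer for every process

    def fd_write_faults(self, page):
        return page in self.wp              # file-level state sees fd writes too

    def unprotect(self, page):
        self.wp.discard(page)
        self.ops += 1


def snapshot_one_page(model, writer, page):
    """Protect everything, take one write fault, then unprotect that page."""
    model.protect_all()
    faulted = model.mapped_write_faults(writer, page)
    if faulted:
        model.unprotect(page)               # done after copying the page out
    return faulted, model.ops


procs = ["qemu", "helper-a", "helper-b"]    # hypothetical process names
print(snapshot_one_page(PerProcessWP(procs, npages=4), "helper-a", 2))
print(snapshot_one_page(PerFileWP(npages=4), "helper-a", 2))
```

With three processes, the per-process model issues six control requests for
one snapshotted page (three to protect, three to unprotect) and the per-file
model issues two. The per-file model also reports a fault for a write made
through the fd, which the per-process model cannot see.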