From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8ac234a3-9dc3-3ebf-309a-b4a6bcb72d2d@suse.cz>
Date: Tue, 24 Jan 2023 08:51:43 +0100
MIME-Version: 1.0
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
 create restricted user memory
To: Sean Christopherson, "Huang, Kai"
Cc: "chao.p.peng@linux.intel.com", "tglx@linutronix.de",
 "linux-arch@vger.kernel.org", "kvm@vger.kernel.org",
 "jmattson@google.com", "Lutomirski, Andy", "ak@linux.intel.com",
 "kirill.shutemov@linux.intel.com", "Hocko, Michal", "tabba@google.com",
 "qemu-devel@nongnu.org", "david@redhat.com", "michael.roth@amd.com",
 "corbet@lwn.net", "linux-kernel@vger.kernel.org", "dhildenb@redhat.com",
 "bfields@fieldses.org", "linux-fsdevel@vger.kernel.org", "x86@kernel.org",
 "bp@alien8.de", "ddutile@redhat.com", "rppt@kernel.org",
 "shuah@kernel.org", "vkuznets@redhat.com", "naoya.horiguchi@nec.com",
 "linux-api@vger.kernel.org", "qperret@google.com", "arnd@arndb.de",
 "pbonzini@redhat.com", "Annapurve, Vishal", "mail@maciej.szmigiero.name",
 "wanpengli@tencent.com", "yu.c.zhang@linux.intel.com", "hughd@google.com",
 "aarcange@redhat.com", "mingo@redhat.com", "hpa@zytor.com",
 "Nakajima, Jun", "jlayton@kernel.org", "joro@8bytes.org",
 "linux-mm@kvack.org", "Wang, Wei W", "steven.price@arm.com",
 "linux-doc@vger.kernel.org", "Hansen, Dave", "akpm@linux-foundation.org",
 "linmiaohe@huawei.com"
References: <20221202061347.1070246-2-chao.p.peng@linux.intel.com>
 <5c6e2e516f19b0a030eae9bf073d555c57ca1f21.camel@intel.com>
 <20221219075313.GB1691829@chaop.bj.intel.com>
 <20221220072228.GA1724933@chaop.bj.intel.com>
 <126046ce506df070d57e6fe5ab9c92cdaf4cf9b7.camel@intel.com>
 <20221221133905.GA1766136@chaop.bj.intel.com>
 <010a330c-a4d5-9c1a-3212-f9107d1c5f4e@suse.cz>
 <0959c72ec635688f4b6c1b516815f79f52543b31.camel@intel.com>
From: Vlastimil Babka
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 1/24/23 00:38, Sean Christopherson wrote:
> On Mon, Jan 23, 2023, Huang, Kai wrote:
>> On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote:
>>> On 12/22/22 01:37, Huang, Kai wrote:
>>>>>> I argue that this page pinning (or page migration prevention) is not
>>>>>> tied to where the page comes from, instead related to how the page will
>>>>>> be used. Whether the page is restrictedmem backed or GUP() backed, once
>>>>>> it's used by current version of TDX then the page pinning is needed. So
>>>>>> such page migration prevention is really TDX thing, even not KVM generic
>>>>>> thing (that's why I think we don't need change the existing logic of
>>>>>> kvm_release_pfn_clean()).
>>>>>>
>>>> This essentially boils down to who "owns" page migration handling, and sadly,
>>>> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
>>>> migration by itself -- it's just a passive receiver.
>>>>
>>>> For normal pages, page migration is totally done by the core-kernel (i.e. it
>>>> unmaps page from VMA, allocates a new page, and uses migrate_page() or
>>>> a_ops->migrate_page() to actually migrate the page).
>>>> In the sense of TDX, conceptually it should be done in the same way. The more
>>>> important thing is: yes KVM can use get_page() to prevent page migration, but
>>>> when KVM wants to support it, KVM cannot just remove get_page(), as the
>>>> core-kernel will still just do migrate_page() which won't work for TDX (given
>>>> restricted_memfd doesn't have a_ops->migrate_page() implemented).
>>>>
>>>> So I think the restricted_memfd filesystem should own page migration handling,
>>>> (i.e. by implementing a_ops->migrate_page() to either just reject page migration
>>>> or somehow support it).
>>>
>>> While this thread seems to be settled on refcounts already,
>>>
>>
>> I am not sure but will let Sean/Paolo to decide.
>
> My preference is whatever is most performant without being hideous :-)
>
>>> just wanted
>>> to point out that it wouldn't be ideal to prevent migrations by
>>> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
>>> by memory compaction) by isolating the pages for migration and then
>>> releasing them after the callback rejects it (at least we wouldn't waste
>>> time creating and undoing migration entries in the userspace page tables
>>> as there's no mmap). Elevated refcount on the other hand is detected
>>> very early in compaction so no isolation is attempted, so from that
>>> aspect it's optimal.
>>
>> I am probably missing something,
>
> Heh, me too, I could have sworn that using refcounts was the least efficient way
> to block migration.

Well, I admit that due to my experience with it, I mostly consider migration
from the memory compaction POV, which is a significant user of migration on
random pages that's not requested by userspace actions on specific ranges.

And compaction has in isolate_migratepages_block():

	/*
	 * Migration will fail if an anonymous page is pinned in memory,
	 * so avoid taking lru_lock and isolating it unnecessarily in an
	 * admittedly racy check.
	 */
	mapping = page_mapping(page);
	if (!mapping && page_count(page) > page_mapcount(page))
		goto isolate_fail;

so that prevents migration of pages with elevated refcount very early, before
they are even isolated, so before migrate_pages() is called.

But it's true there are other sources of "random pages migration" - numa
balancing, demotion in lieu of reclaim... and I'm not sure if all of them have
such an early check too.

Anyway, whatever is decided to be a better way than elevated refcounts would
ideally be checked before isolation as well, as that's the most efficient way.

>> but IIUC the checking of refcount happens at very last stage of page migration too