Date: Fri, 23 Dec 2022 16:24:06 +0800
From: Chao Peng
To: Sean Christopherson
Cc: "Huang, Kai", "tglx@linutronix.de", "linux-arch@vger.kernel.org",
 "kvm@vger.kernel.org", "jmattson@google.com", "Lutomirski, Andy",
 "ak@linux.intel.com", "kirill.shutemov@linux.intel.com", "Hocko, Michal",
 "qemu-devel@nongnu.org", "tabba@google.com", "david@redhat.com",
 "michael.roth@amd.com", "corbet@lwn.net", "bfields@fieldses.org",
 "dhildenb@redhat.com", "linux-kernel@vger.kernel.org",
 "linux-fsdevel@vger.kernel.org", "x86@kernel.org", "bp@alien8.de",
 "linux-api@vger.kernel.org", "rppt@kernel.org", "shuah@kernel.org",
 "vkuznets@redhat.com", "vbabka@suse.cz", "mail@maciej.szmigiero.name",
 "ddutile@redhat.com", "qperret@google.com", "arnd@arndb.de",
 "pbonzini@redhat.com", "vannapurve@google.com", "naoya.horiguchi@nec.com",
 "wanpengli@tencent.com", "yu.c.zhang@linux.intel.com", "hughd@google.com",
 "aarcange@redhat.com", "mingo@redhat.com", "hpa@zytor.com", "Nakajima, Jun",
 "jlayton@kernel.org", "joro@8bytes.org", "linux-mm@kvack.org", "Wang, Wei W",
 "steven.price@arm.com", "linux-doc@vger.kernel.org", "Hansen, Dave",
 "akpm@linux-foundation.org", "linmiaohe@huawei.com"
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
Message-ID: <20221223082406.GB1829090@chaop.bj.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
 <20221202061347.1070246-2-chao.p.peng@linux.intel.com>
 <5c6e2e516f19b0a030eae9bf073d555c57ca1f21.camel@intel.com>
 <20221219075313.GB1691829@chaop.bj.intel.com>
 <20221220072228.GA1724933@chaop.bj.intel.com>
 <126046ce506df070d57e6fe5ab9c92cdaf4cf9b7.camel@intel.com>
 <20221221133905.GA1766136@chaop.bj.intel.com>

On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> >
> > That's true.
> > Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> >
> > >
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
>
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API. It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable. Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
>
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
>
> I agree that whether or not migration is supported should be controllable by the
> user, but I strongly disagree on punting refcount management to KVM (or TDX).
> The whole point of restricted mem is to support technologies like TDX and SNP,
> accommodating their special needs for things like page migration should be part of
> the API, not some footnote in the documentation.

I never doubted that page migration should be part of the restrictedmem API,
but that's not part of the initial implementation, as we all agreed? Until that
API is introduced, we need to find a solution to prevent page migration for
TDX. Other than refcount management, do we have any other workable solution?

>
> It's not difficult to let the user communicate support for page migration, e.g.
> if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
> to signal support (or lack thereof) for page migration. NULL == no migration,
> non-NULL == migration allowed.

I know.

>
> We know that supporting page migration in TDX and SNP is possible, and we know
> that page migration will require a dedicated API since the backing store can't
> memcpy() the page. I don't see any reason to ignore that eventuality.

No, I'm not ignoring it. This is only about short-term page migration
prevention before that dedicated API is introduced.

>
> But again, unless I'm missing something, that's a future problem because restricted
> mem doesn't yet support page migration regardless of the downstream user.

It's true that page migration support itself is a future problem, but page
migration prevention is not, since TDX pages need to be pinned until page
migration is supported.

Thanks,
Chao
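
For illustration only, the two ideas debated in this message can be modeled in user space roughly as follows. This is a sketch under stated assumptions, not kernel code and not the TDX or restrictedmem implementation: every `model_*` name is hypothetical, the `migrate` member is an invented placeholder for the hook suggested for restrictedmem_notifier_ops, and migration is reduced to the simplified rule that a page with more references than expected cannot be moved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; none of these types are the kernel's. */
struct model_page {
    int refcount;      /* references currently held on the page */
    bool mapped_sept;  /* true while a modeled secure EPT entry maps it */
};

/* Hypothetical stand-ins for taking and dropping a page reference. */
static void model_get_page(struct model_page *p)
{
    p->refcount++;
}

static void model_put_page(struct model_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Short-term approach: hold an extra reference for as long as the
 * modeled secure EPT entry exists. */
static void model_sept_populate(struct model_page *p)
{
    model_get_page(p);
    p->mapped_sept = true;
}

static void model_sept_drop(struct model_page *p)
{
    p->mapped_sept = false;
    model_put_page(p);
}

/* Simplified migration rule (an assumption of this model): the page may
 * move only if nobody beyond the expected holders has a reference. */
static bool model_can_migrate(const struct model_page *p, int expected_refs)
{
    return p->refcount == expected_refs;
}

/* Longer-term idea from the thread: an optional hook in the notifier
 * ops. The member name and signature here are hypothetical. */
struct model_notifier_ops {
    int (*migrate)(struct model_page *from, struct model_page *to);
};

/* NULL == no migration, non-NULL == migration allowed. */
static bool model_user_supports_migration(const struct model_notifier_ops *ops)
{
    return ops != NULL && ops->migrate != NULL;
}
```

In this model, a page that starts with one reference is migratable, stops being migratable after model_sept_populate(), and becomes migratable again after model_sept_drop(). A user that leaves the hypothetical `migrate` hook NULL is reported as not supporting migration.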