Date: Wed, 21 Dec 2022 21:39:05 +0800
From: Chao Peng <chao.p.peng@linux.intel.com>
To: "Huang, Kai"
Cc: "tglx@linutronix.de", "linux-arch@vger.kernel.org", "kvm@vger.kernel.org",
	"jmattson@google.com", "Lutomirski, Andy", "ak@linux.intel.com",
	"kirill.shutemov@linux.intel.com", "Hocko, Michal", "qemu-devel@nongnu.org",
	"tabba@google.com", "david@redhat.com", "michael.roth@amd.com",
	"corbet@lwn.net", "bfields@fieldses.org", "dhildenb@redhat.com",
	"linux-kernel@vger.kernel.org", "linux-fsdevel@vger.kernel.org",
	"x86@kernel.org", "bp@alien8.de", "linux-api@vger.kernel.org",
	"rppt@kernel.org", "shuah@kernel.org", "vkuznets@redhat.com",
	"vbabka@suse.cz", "mail@maciej.szmigiero.name", "ddutile@redhat.com",
	"qperret@google.com", "arnd@arndb.de", "pbonzini@redhat.com",
	"vannapurve@google.com", "naoya.horiguchi@nec.com",
	"Christopherson,, Sean", "wanpengli@tencent.com",
	"yu.c.zhang@linux.intel.com", "hughd@google.com", "aarcange@redhat.com",
	"mingo@redhat.com", "hpa@zytor.com", "Nakajima, Jun",
	"jlayton@kernel.org", "joro@8bytes.org", "linux-mm@kvack.org",
	"Wang, Wei W", "steven.price@arm.com", "linux-doc@vger.kernel.org",
	"Hansen, Dave", "akpm@linux-foundation.org", "linmiaohe@huawei.com"
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
	create restricted user memory
Message-ID: <20221221133905.GA1766136@chaop.bj.intel.com>
Reply-To: Chao Peng <chao.p.peng@linux.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
	<20221202061347.1070246-2-chao.p.peng@linux.intel.com>
	<5c6e2e516f19b0a030eae9bf073d555c57ca1f21.camel@intel.com>
	<20221219075313.GB1691829@chaop.bj.intel.com>
	<20221220072228.GA1724933@chaop.bj.intel.com>
	<126046ce506df070d57e6fe5ab9c92cdaf4cf9b7.camel@intel.com>
In-Reply-To: <126046ce506df070d57e6fe5ab9c92cdaf4cf9b7.camel@intel.com>

On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * These pages are currently unmovable so don't place them into movable
> > > > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > +	 */
> > > > > > +	mapping = memfd->f_mapping;
> > > > > > +	mapping_set_unevictable(mapping);
> > > > > > +	mapping_set_gfp_mask(mapping,
> > > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > >
> > > > > But, IIUC removing the __GFP_MOVABLE flag here only makes page allocation
> > > > > come from non-movable zones; it doesn't necessarily prevent the page from
> > > > > being migrated.  My first glance is you need to implement either
> > > > > a_ops->migrate_folio() or just get_page() after faulting in the page to
> > > > > prevent that.
> > > >
> > > > The current API restrictedmem_get_page() already does this: after the
> > > > caller calls it, it holds a reference to the page. The caller then
> > > > decides when to call put_page() appropriately.
> > >
> > > I tried to dig up some history. Perhaps I am missing something, but it seems
> > > Kirill said in v9 that this code doesn't prevent page migration, and we need
> > > to increase the page refcount in restrictedmem_get_page():
> > >
> > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > >
> > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > identical to the one in v9 (except that v10 uses 'folio' instead of 'page')?
> >
> > restrictedmem_get_page() has increased the page refcount since several
> > versions ago, so no change is needed in v10. You probably missed my reply:
> >
> > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
>
> But for the non-restricted-mem case, it is correct for KVM to decrease the
> page's refcount after setting up the mapping in the secondary MMU; otherwise
> the page will be pinned by KVM for a normal VM (since KVM uses GUP to get the
> page).

That's true. It is actually even true for the restrictedmem case: most likely
we will still need the kvm_release_pfn_clean() in the KVM generic code. On one
hand, other restrictedmem users like pKVM may not require page pinning at all.
On the other hand, see below.

>
> So what we are expecting is: if the page comes from restricted mem, then KVM
> cannot decrease the refcount; otherwise, for a normal page obtained via GUP,
> KVM should.

I would argue that this page pinning (or page migration prevention) is not
tied to where the page comes from, but rather to how the page will be used.
Whether the page is restrictedmem-backed or GUP()-backed, once it is used by
the current version of TDX, the page pinning is needed. So such page migration
prevention is really a TDX thing, not even a KVM-generic thing (which is why I
think we don't need to change the existing logic of kvm_release_pfn_clean()).
Wouldn't it be better to let the TDX code (or whoever requires the pinning)
increase/decrease the refcount when it populates/drops the secure EPT entries?
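
In other words, something like the sketch below. To be clear, the function
names and signatures here are illustrative assumptions for the sake of
discussion, not the actual TDX tree code:

	#include <linux/kvm_host.h>
	#include <linux/mm.h>

	/*
	 * Illustrative sketch only: pin the page for exactly as long as it
	 * is mapped into the secure EPT, since TDX cannot migrate guest
	 * private pages today. Names and signatures are assumptions.
	 */
	static int tdx_sept_map_private_page(struct kvm *kvm, gfn_t gfn,
					     kvm_pfn_t pfn)
	{
		/* Pin before asking the TDX module to install the S-EPT entry. */
		get_page(pfn_to_page(pfn));

		/* ... SEAMCALL to populate the secure EPT entry (elided) ... */
		return 0;
	}

	static void tdx_sept_unmap_private_page(struct kvm *kvm, gfn_t gfn,
						kvm_pfn_t pfn)
	{
		/* ... SEAMCALL to drop the secure EPT entry (elided) ... */

		/* Unpin once the page is no longer reachable via the S-EPT. */
		put_page(pfn_to_page(pfn));
	}

With that, kvm_release_pfn_clean() in the generic fault path could keep its
current behaviour for both the restrictedmem and the GUP case.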
This is exactly what the current TDX code does:

get_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217

put_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334

Thanks,
Chao

> >
> > The current solution is clear: unless we have a better approach, we will
> > let the restrictedmem user (KVM in this case) hold the refcount to
> > prevent page migration.
> >
>
> OK. Will leave to others :)
>
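
For reference, the refcount behaviour of restrictedmem_get_page() discussed
above is roughly the following. This is a sketch reconstructed from the v10
discussion, not a verbatim copy of the series; struct restrictedmem_data and
the exact details are assumptions:

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/shmem_fs.h>

	/* Assumed layout of the per-file private data in this sketch. */
	struct restrictedmem_data {
		struct file *memfd;	/* shmem backing file */
	};

	/*
	 * shmem_get_folio() returns the folio with an elevated refcount, and
	 * that reference is handed to the caller through *pagep, so the page
	 * stays pinned (and hence unmovable) until the caller does put_page().
	 */
	int restrictedmem_get_page(struct file *file, pgoff_t offset,
				   struct page **pagep, int *order)
	{
		struct restrictedmem_data *data = file->f_mapping->private_data;
		struct folio *folio;
		int ret;

		/* Faults the page in; the folio reference is kept for the caller. */
		ret = shmem_get_folio(file_inode(data->memfd), offset, &folio,
				      SGP_WRITE);
		if (ret)
			return ret;

		*pagep = folio_file_page(folio, offset);
		if (order)
			*order = folio_order(folio);

		folio_mark_uptodate(folio);
		folio_unlock(folio);
		return 0;
	}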