From: "Huang, Ying"
To: "Huang, Kai"
Cc: "kvm@vger.kernel.org", "Hansen, Dave", "Luck, Tony",
 "bagasdotme@gmail.com", "ak@linux.intel.com", "Wysocki, Rafael J",
 "kirill.shutemov@linux.intel.com", "Christopherson,, Sean",
 "Chatre, Reinette", "pbonzini@redhat.com", "tglx@linutronix.de",
 "Yamahata, Isaku", "linux-kernel@vger.kernel.org", "linux-mm@kvack.org",
 "peterz@infradead.org", "imammedo@redhat.com", "Gao, Chao", "Brown, Len",
 "Shahar, Sagi", "sathyanarayanan.kuppuswamy@linux.intel.com",
 "Williams, Dan J"
Subject: Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
References: <8aab33a7db7a408beb403950e21f693b0b0f1f2b.1670566861.git.kai.huang@intel.com>
 <16f23950-2a27-29de-c0b4-e5f2d927c8b4@intel.com>
 <13096e4e39801806270c3a6a641102a8151aa5fc.camel@intel.com>
 <871qo0y3i3.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 12 Jan 2023 09:59:14 +0800
In-Reply-To: (Kai Huang's message of "Thu, 12 Jan 2023 09:18:58 +0800")
Message-ID: <87wn5swm19.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Huang, Kai" writes:

> On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
>> "Huang, Kai" writes:
>>
>> > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> > > On 1/10/23 04:09, Huang, Kai wrote:
>> > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > > > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
>> > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > > > the "TDX-capable" nodes.
>> > > > > > >
>> > > > > > > > Why does it need to be enhanced? What's the problem?
>> > > > > >
>> > > > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > > > to the page allocator.
>> > > > > >
>> > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > > > utilize all memory.
>> > > > > >
>> > > > > > But probably it is not necessarily to call out in the changelog?
>> > > > >
>> > > > > Let's say that we add this TDX-compatible-node ABI in the future. What
>> > > > > will old code do that doesn't know about this ABI?
>> > > >
>> > > > Right. The old app will break w/o knowing the new ABI. One resolution, I
>> > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and
>> > > > "non-TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
>> > > > TDX guest memory from those "TDX-capable" nodes.
>> > >
>> > > That doesn't actually hide all of the behavior from users. Let's say
>> > > they do:
>> > >
>> > >     numactl --membind=6 qemu-kvm ...
>> > >
>> > > In other words, take all of this guest's memory and put it on node 6.
>> > > There lots of free memory on node 6 which is TDX-*IN*compatible. Then,
>> > > they make it a TDX guest:
>> > >
>> > >     numactl --membind=6 qemu-kvm -tdx ...
>> > >
>> > > What happens? Does the kernel silently ignore the --membind=6? Or does
>> > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> > > memory on node 6.
>> > >
>> > > In other words, I don't think the kernel can just enforce this
>> > > internally and hide it from userspace.
>> >
>> > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
>> > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
>> > the memory will always be allocated from those "TDX-capable" nodes. KVM can
>> > refuse to create the TDX guest if it found task's mempolicy doesn't have any
>> > "TDX-capable" node, and print out a clear message to the userspace.
>> >
>> > But I am new to the core-mm, so I might have some misunderstanding.
>>
>> KVM here means in-kernel KVM module? If so, KVM can only output some
>> message in dmesg. Which isn't very good for users to digest. It's
>> better for the user space QEMU to detect whether current configuration
>> is usable and respond to users, via GUI, or syslog, etc.
>
> I am not against this. For instance, maybe we can add some dedicated error code
> and let KVM return it to Qemu, but I don't want to speak for KVM guys. We can
> discuss this more when we have patches actually sent out to the community.

Error code is a kind of ABI too. :-)

Best Regards,
Huang, Ying
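
For readers following the thread, the "AND the TDX-capable node mask into
the task's mempolicy, refuse if nothing is left" idea quoted above can be
modelled in a few lines of plain C. This is only a userspace sketch under
stated assumptions, not kernel or KVM code: the type node_mask_t, the
helper tdx_restrict_policy(), and the choice of -EINVAL are all
hypothetical and do not come from the patch series.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model: bit N set means NUMA node N is in the set.
 * This is NOT the kernel's nodemask_t.
 */
typedef uint64_t node_mask_t;

/*
 * Hypothetical helper, not an existing kernel or KVM function.
 *
 * Intersect the nodes the task's memory policy allows with the nodes
 * assumed to be TDX-capable. On success, store the intersection in
 * *allowed and return 0. If the intersection is empty, leave *allowed
 * untouched and return a negative errno so the caller can refuse to
 * create the guest.
 */
static int tdx_restrict_policy(node_mask_t policy_nodes,
                               node_mask_t tdx_nodes,
                               node_mask_t *allowed)
{
    node_mask_t both = policy_nodes & tdx_nodes;

    if (both == 0)
        return -EINVAL; /* placeholder; picking a code is itself an ABI choice */

    *allowed = both;
    return 0;
}
```

Assuming nodes 0-3 are TDX-capable (mask 0x0F), a policy bound only to
node 6 is refused, which corresponds to the "numactl --membind=6" case.
A policy covering nodes 2 and 6 is narrowed to node 2 alone, which is
exactly the partial "silently ignore --membind" behavior questioned in
the quoted discussion.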