From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying"
To: "Huang, Kai"
Cc: "kvm@vger.kernel.org", "Hansen, Dave", "linux-kernel@vger.kernel.org",
 "Luck, Tony", "bagasdotme@gmail.com", "ak@linux.intel.com",
 "Wysocki, Rafael J", "kirill.shutemov@linux.intel.com",
 "Christopherson,, Sean", "Chatre, Reinette", "pbonzini@redhat.com",
 "linux-mm@kvack.org", "Yamahata, Isaku", "tglx@linutronix.de",
 "peterz@infradead.org", "imammedo@redhat.com", "Gao, Chao", "Brown, Len",
 "Shahar, Sagi", "sathyanarayanan.kuppuswamy@linux.intel.com",
 "Williams, Dan J"
Subject: Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when
 initializing TDX module as TDX memory
References: <8aab33a7db7a408beb403950e21f693b0b0f1f2b.1670566861.git.kai.huang@intel.com>
 <16f23950-2a27-29de-c0b4-e5f2d927c8b4@intel.com>
 <13096e4e39801806270c3a6a641102a8151aa5fc.camel@intel.com>
Date: Thu, 12 Jan 2023 08:56:36 +0800
In-Reply-To: <13096e4e39801806270c3a6a641102a8151aa5fc.camel@intel.com>
 (Kai Huang's message of "Wed, 11 Jan 2023 18:00:07 +0800")
Message-ID: <871qo0y3i3.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: owner-linux-mm@kvack.org

"Huang, Kai" writes:

> On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> On 1/10/23 04:09, Huang, Kai wrote:
>> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
>> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > the "TDX-capable" nodes.
>> > > > > >
>> > > > > > Why does it need to be enhanced? What's the problem?
>> > > >
>> > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > to the page allocator.
>> > > >
>> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > utilize all memory.
>> > > >
>> > > > But probably it is not necessarily to call out in the changelog?
>> > >
>> > > Let's say that we add this TDX-compatible-node ABI in the future. What
>> > > will old code do that doesn't know about this ABI?
>> >
>> > Right. The old app will break w/o knowing the new ABI. One resolution, I
>> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and
>> > "non-TDX-capable" nodes in the kernel, and let kernel to enforce always
>> > allocating TDX guest memory from those "TDX-capable" nodes.
>>
>> That doesn't actually hide all of the behavior from users. Let's say
>> they do:
>>
>>     numactl --membind=6 qemu-kvm ...
>>
>> In other words, take all of this guest's memory and put it on node 6.
>> There lots of free memory on node 6 which is TDX-*IN*compatible. Then,
>> they make it a TDX guest:
>>
>>     numactl --membind=6 qemu-kvm -tdx ...
>>
>> What happens? Does the kernel silently ignore the --membind=6? Or does
>> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> memory on node 6.
>>
>> In other words, I don't think the kernel can just enforce this
>> internally and hide it from userspace.
>
> IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> the memory will always be allocated from those "TDX-capable" nodes. KVM can
> refuse to create the TDX guest if it found task's mempolicy doesn't have any
> "TDX-capable" node, and print out a clear message to the userspace.
>
> But I am new to the core-mm, so I might have some misunderstanding.

Does KVM here mean the in-kernel KVM module?  If so, KVM can only output
a message to dmesg, which isn't very easy for users to digest.
It's better for user-space QEMU to detect whether the current
configuration is usable and report that to the user, via GUI, syslog,
etc.

Best Regards,
Huang, Ying
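
The node-mask check discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the patch series: `node_mask`, `tdx_effective_nodes()` and `tdx_check_membind()` are hypothetical names, a plain 64-bit bitmap stands in for the kernel's real nodemask type, and the set of "TDX-capable" nodes is assumed to be known to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: one bit per NUMA node (at most 64 nodes in this sketch). */
typedef uint64_t node_mask;

/*
 * Hypothetical helper: AND the task's membind policy with the set of
 * TDX-capable nodes, as suggested in the quoted message.
 */
static node_mask tdx_effective_nodes(node_mask policy, node_mask tdx_capable)
{
    return policy & tdx_capable;
}

/*
 * Hypothetical pre-check that a user-space launcher could run before
 * creating a TDX guest.  Returns 0 if at least one TDX-capable node is
 * left in the policy; otherwise returns -1 and fills in a message that
 * can be shown to the user instead of being buried in dmesg.
 */
static int tdx_check_membind(node_mask policy, node_mask tdx_capable,
                             char *err, size_t errlen)
{
    if (tdx_effective_nodes(policy, tdx_capable) == 0) {
        snprintf(err, errlen,
                 "memory policy contains no TDX-capable NUMA node");
        return -1;
    }
    return 0;
}
```

With nodes 0 and 1 assumed TDX-capable, a policy of only node 6 (the `--membind=6` case above) is rejected with a readable message, while a policy of nodes 1 and 6 is narrowed to node 1.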