From: Yang Shi
Date: Fri, 4 Oct 2024 08:54:49 -0700
Subject: Re: [PATCH] tdx, memory hotplug: Check whole hot-adding memory range for TDX
To: David Hildenbrand
Cc: Dan Williams, "Huang, Ying", Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, "Kirill A . Shutemov", x86@kernel.org, Andrew Morton,
 Oscar Salvador, linux-coco@lists.linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Kai Huang, "H. Peter Anvin", Andy Lutomirski
In-Reply-To: <22d1cfbf-b195-4343-b87b-493cb3d2843b@redhat.com>

On Fri, Oct 4, 2024 at 3:21 AM David Hildenbrand wrote:
>
> On 04.10.24 05:15, Dan Williams wrote:
> > Yang Shi wrote:
> >> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams wrote:
> >>>
> >>> Yang Shi wrote:
> >>>> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying wrote:
> >>>>>
> >>>>> Hi, David,
> >>>>>
> >>>>> Thanks a lot for comments!
> >>>>>
> >>>>> David Hildenbrand writes:
> >>>>>
> >>>>>> On 30.09.24 07:51, Huang Ying wrote:
> >>>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> >>>>>>> hot-added must be checked for compatibility by TDX. This is currently
> >>>>>>> implemented through memory hotplug notifiers for each memory_block.
> >>>>>>> If a memory range which isn't TDX compatible is hot-added, for
> >>>>>>> example, some CXL memory, the command line as follows,
> >>>>>>>   $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> >>>>>>> will report something like,
> >>>>>>>   bash: echo: write error: Operation not permitted
> >>>>>>> If pr_debug() is enabled, the error message like below will be shown
> >>>>>>> in the kernel log,
> >>>>>>>   online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> >>>>>>> Both are too general to root cause the problem. This will confuse
> >>>>>>> users. One solution is to print some error messages in the TDX memory
> >>>>>>> hotplug notifier. However, memory hotplug notifiers are called for
> >>>>>>> each memory block, so this may lead to a large volume of messages in
> >>>>>>> the kernel log if a large number of memory blocks are onlined with a
> >>>>>>> script or automatically. For example, the typical size of memory
> >>>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> >>>>>>> will be logged.
> >>>>>>
> >>>>>> ratelimiting would likely help here a lot, but I agree that it is
> >>>>>> suboptimal.
> >>>>>>
> >>>>>>> Therefore, in this patch, the whole hot-adding memory range is
> >>>>>>> checked for TDX compatibility through a newly added architecture
> >>>>>>> specific function (arch_check_hotplug_memory_range()).
> >>>>>>> If rejected, the memory
> >>>>>>> hot-adding will be aborted with a proper kernel log message. Which
> >>>>>>> looks like something as below,
> >>>>>>>   virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> >>>>>>>   for TDX compatibility.
> >>>>>>>
> >>>>>>> The target use case is to support CXL memory on TDX enabled systems.
> >>>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
> >>>>>>> range hot-adding will be rejected. While the CXL memory can still be
> >>>>>>> used via devdax interface.
> >>>>>>
> >>>>>> I'm curious, why can that memory be used through devdax but not
> >>>>>> through the buddy? I'm probably missing something important :)
> >>>>>
> >>>>> Because only TDX compatible memory can be used for TDX guest. The buddy
> >>>>> is used to allocate memory for TDX guest. While devdax will not be used
> >>>>> for that.
> >>>>
> >>>> Sorry for chiming in late. I think CXL also faces a similar problem
> >>>> on platforms with MTE (memory tagging extension on ARM64). AFAIK,
> >>>> we can't have MTE on CXL, so CXL has to stay as a dax device if MTE is
> >>>> enabled.
> >>>>
> >>>> We would need a similar mechanism to prevent users from hot-adding
> >>>> CXL memory if MTE is on. But unlike TDX, I don't think we have a
> >>>> simple way to tell whether the pfn belongs to CXL or not. Please
> >>>> correct me if I'm wrong. I'm wondering whether we can find a more
> >>>> common way to tell memory hotplug to not hot-add some region. For
> >>>> example, a special flag in struct resource, off the top of my head.
> >>>>
> >>>> No solid idea yet, I'm definitely seeking some advice.
> >>>
> >>> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> >>> is enabled in the CPU and then ask the CXL subsystem if the address range is
> >>> backed by a topology that supports MTE?
> >>
> >> Kernel can tell whether MTE is really enabled.
> >> For the CXL part, IIUC
> >> that relies on the CXL subsystem being able to tell whether that range
> >> can support MTE or not, right? Or the CXL subsystem tells us whether the
> >> range is a CXL memory range or not, then we can just refuse MTE for all
> >> CXL regions for now. Does CXL support this now?
> >
> > So the CXL specification has section:
> >
> >     8.2.4.31 CXL Extended Metadata Capability Register
> >
> > ...that indicates if the device supports "Extended Metadata" (EMD).
> > However, the CXL specification does not talk about how a given host
> > uses the extended metadata capabilities of a device. That detail would
> > need to come from an ARM platform specification.
> >
> > Currently CXL subsystem does nothing with this since there has been no
> > need to date, but I would expect someone from the ARM side to plumb this
> > detection into the CXL subsystem.
> >
> >>> However, why would it be ok to access CXL memory without MTE via devdax,
> >>> but not as online page allocator memory?
> >>
> >> CXL memory can be onlined as system ram as long as MTE is not enabled.
> >> It can only be used as a devdax device if MTE is enabled.
> >
> > Do you mean the kernel only manages MTE for kernel pages, but with user
> > mapped memory the application will need to implicitly know that
> > memory-tagging is not available?
> >
> > I worry about applications that might not know that their heap is coming
> > from a userspace memory allocator backed by device-dax rather than the
> > kernel.
>
> I recall that MTE is requested by user space via mprotect(). If we end
> up with memory that is not taggable, we would have to fail the
> operation, which is not desirable.
>
> This is what we want to avoid, so if MTE is enabled, all memory in the
> buddy should be taggable.

Yes, the buddy memory has to be taggable if MTE is enabled. And not
only mprotect(), but also mmap() and malloc() (glibc compiled with MTE
support) can allocate mappings with MTE.
And MTE mappings are currently only allowed for anonymous memory and
tmpfs.

>
> >
> >>> If the goal is to simply deny any and all non-MTE supported CXL region
> >>> from attaching then that could probably be handled as a modification to
> >>> the "cxl_acpi" driver to deny region creation unless it supports
> >>> everything the CPU expects from "memory".
> >>
> >> I'm not quite familiar with the details in CXL driver. What did you
> >> mean by "deny region creation"? As long as the CXL memory still can be
> >> used as devdax device, it should be fine.
> >
> > Meaning that the CXL subsystem knows how to, for a given address range, figure
> > out the members and geometry of the CXL devices that contribute to that
> > range (CXL region). It would be straightforward to add EMD to that
> > enumeration and flag the CXL region as not online-capable if the CPU has
> > MTE enabled but no EMD capability.
>
> If it's really just CXL memory we are worrying about, we could pass a
> flag to add_memory_driver_managed(), and pass that to our callback here.
>
> Not sure if that is the most reliable way of handling it :) What about
> other ways of hotplugging memory besides CXL? Are we sure, they are/will
> be providing taggable memory?

AFAIK, I don't think they are, or at least some of them are not. So
this should not be CXL specific.

>
> --
> Cheers,
>
> David / dhildenb
>
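
For reference, the message-count argument from the quoted patch description can be modelled in a few lines of Python. This is only a rough sketch and not kernel code: the function names, the address values and the list of "compatible" ranges are all made up for illustration, and it assumes every block of a rejected range would be rejected individually by a per-block notifier.

```python
MIB = 1 << 20
GIB = 1 << 30
BLOCK_BYTES = 128 * MIB  # typical x86_64 memory block size, per the patch description


def memory_block_count(range_bytes, block_bytes=BLOCK_BYTES):
    """Number of memory blocks needed to cover a hot-added range (rounded up)."""
    return -(-range_bytes // block_bytes)


def range_is_compatible(start, size, compatible_ranges):
    """Hypothetical whole-range check: True only if [start, start + size)
    lies entirely inside one of the (lo, hi) compatible ranges."""
    end = start + size
    return any(lo <= start and end <= hi for lo, hi in compatible_ranges)


def rejection_messages(start, size, compatible_ranges, per_block):
    """Count the rejection messages logged for one hot-add request."""
    if range_is_compatible(start, size, compatible_ranges):
        return 0
    # Per-block notifiers complain once per block; a whole-range check once.
    return memory_block_count(size) if per_block else 1


if __name__ == "__main__":
    cxl_start, cxl_size = 0x100_0000_0000, 64 * GIB  # made-up address
    compatible = [(0, 0x10_0000_0000)]               # made-up compatible range
    print(rejection_messages(cxl_start, cxl_size, compatible, per_block=True))   # 512
    print(rejection_messages(cxl_start, cxl_size, compatible, per_block=False))  # 1
```

The same shape would apply to the MTE case discussed above if "compatible" is read as "taggable", whatever mechanism ends up providing that information.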