References: <20220425163451.3818838-1-juew@google.com> <5a7f00e3-311d-b5ca-4249-7f50f8712559@intel.com>
In-Reply-To: <5a7f00e3-311d-b5ca-4249-7f50f8712559@intel.com>
From: Jue Wang
Date: Tue, 26 Apr 2022 12:23:12 -0700
Subject: Re: [RFC] Expose a memory poison detector ioctl to user space.
To: Dave Hansen
Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
 Mina Almasry, linux-mm@kvack.org, Sean Christopherson

On Tue, Apr 26, 2022 at 11:18 AM Dave Hansen wrote:
>
> On 4/26/22 10:57, Jue Wang wrote:
> > On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen wrote:
> >> From your description, you have me mostly convinced that this is
> >> something that needs to get fixed. The hardware patrol scrubber(s)
> >> address the same basic problem, but don't seem to be flexible to your
> >> specific needs.
> >>
> >> But, have hardware vendors been receptive at all to making the patrol
> >> scrubbers more tunable?
> >
> > We have discussed the use case in detail with Intel. There are
> > improvements in progress to address some of the issues like the
> > signaling to avoid broadcasted MCEs. But fundamentally, the needed
> > throughput is not quite compatible with the patrol scrubber's design
> > purpose and arch.
>
> This would be great material to cover in the changelog in some more
> detail.
>
> >> On 4/25/22 09:34, Jue Wang wrote:
> >>> /* Could stop and return after the 1st poison is detected */
> >>> #define MCESCAN_IOCTL_SCAN 0
> >>>
> >>> struct SysramRegion {
> >>>   /* input */
> >>>   uint64_t first_byte; /* first page-aligned physical address to scan */
> >>>   uint64_t length; /* page-aligned length of memory region to scan */
> >>>   /* output */
> >>>   uint32_t poisoned; /* 1 - a poisoned page is found, 0 - otherwise */
> >>>   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> >>> }
> >>
> >> So, the ioctl() caller has to know the physical address layout of the
> >> system?
> >
> > This info is available from /proc/iomem and /proc/zoneinfo already
> > supported / exposed by the kernel.
>
> I don't think they are good enough.
>
> Think of a TDX guest. It can't touch "unaccepted" memory. But, that
> information is not present in /proc/iomem. In a TDX host (not upstream
> yet), it can't touch any guest memory. That's also not in /proc/iomem.

We will follow up on these topics wrt the interactions with TDX/SEV-SNP.

>
> What if you're in a normal (non-TDX) guest and some of the physical
> address space has been ballooned away?

Accessing memory that has been ballooned away will cause extra EPT
violations and have the memory faulted in on the host side, which is
transparent to the guest.

>
> What does the kernel do when userspace asks it to poke a non-"System
> RAM" address?

I expect the kernel should reject the request with -EINVAL.

>
> >> While this is a good start at a conversation, I think you might want to
> >> back up a bit. You alluded to a few requirements that you have, like:
> >>
> >> * Adjustable detector resource use based on system utilization
> >> * Adjustable scan rate to ensure issues are found at a deterministic
> >>   rate
> >> * Detector must be able to find errors in allocated, in-use memory
> >>
> >> What about SEV-SNP or TDX private memory? It might be unmapped *and*
> >> limited in how it can be accessed. For instance, TDX hosts can't
> >> practically read guest memory. SEV-SNP hosts have special page mapping
> >> requirements; the host can't create arbitrary mappings with arbitrary
> >> mapping sizes. What would this ioctl() do if asked to scan a TDX guest
> >> private page?
> >
> > Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> > is where we would like more feedback and more experts to weigh in.
> >
> > Is reading private memory via kernel's direct mapping benign for
> > SEV-SNP and TDX?
>
> No. It causes machine checks for TDX.
>
> For SEV-SNP, I think reads of private memory read ciphertext. I'm not
> sure how benign it is or if it has any cache coherency implications.
>
> > Otherwise this feature should be defined as mutually exclusive with
> > incompatible features.
>
> Just as an exercise, I'd suggest going and asking some of your
> colleagues about this. Surely, you're asking for this functionality
> because Google wants to use it, and use it *widely*. What would your
> colleagues think if this wasn't available at all on systems that use or
> might use TDX?
>
> For upstream, making features mutually exclusive is a deal breaker
> unless it's absolutely necessary.

Ack, we will follow up within Google.

Just curious, what would Intel recommend to make proactive poison
detection work on TDX / SEV-SNP?

>
> > Even in that case, I believe SEV-SNP or TDX may still benefit from
> > _reactive_ memory poison recovery if the MCE handling and
> > CONFIG_MEMORY_FAILURE still function the same on uncorrectable error
> > raised #MC.
>
> If I remember right, the blast radius for machine checks on systems
> using TDX is substantially bigger than without TDX. I think there are
> quite a few more cases that are non-recoverable, like poison detected in
> TDX metadata. TDX systems have a *stronger* requirement to proactively
> find issues than non-TDX systems.
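
To make the request-validation point above concrete, here is a minimal
userspace-side sketch, not the RFC's implementation. It assumes 4 KiB
pages, parses top-level "System RAM" lines in the /proc/iomem format, and
rejects a scan request that is not page-aligned or not fully contained in
one such range, mirroring the -EINVAL behavior expected of the kernel. The
names ram_range, parse_system_ram_line and validate_scan_request are
hypothetical. As noted in the thread, /proc/iomem does not describe TDX
"unaccepted" or guest-private memory, so a check like this is necessary
but not sufficient.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Hypothetical helper type: one "System RAM" range, end inclusive. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Parse one /proc/iomem-style line, e.g. "00100000-7fffffff : System RAM".
 * Returns 1 and fills *out for a top-level System RAM entry, 0 otherwise.
 * Note: unprivileged readers of /proc/iomem see zeroed addresses.
 */
static int parse_system_ram_line(const char *line, struct ram_range *out)
{
	unsigned long long start, end;
	char name[64];

	assert(line != NULL && out != NULL);

	/* Nested resources (e.g. "Kernel code") are indented; skip them. */
	if (line[0] == ' ')
		return 0;
	if (sscanf(line, "%llx-%llx : %63[^\n]", &start, &end, name) != 3)
		return 0;
	if (strcmp(name, "System RAM") != 0)
		return 0;

	out->start = start;
	out->end = end;
	return 1;
}

/*
 * Check a scan request against known System RAM ranges.
 * Returns 0 if acceptable, -EINVAL otherwise.
 */
static int validate_scan_request(uint64_t first_byte, uint64_t length,
				 const struct ram_range *ranges, size_t n)
{
	uint64_t last;
	size_t i;

	if (length == 0 ||
	    first_byte % SKETCH_PAGE_SIZE != 0 ||
	    length % SKETCH_PAGE_SIZE != 0)
		return -EINVAL;

	last = first_byte + (length - 1);
	if (last < first_byte) /* address arithmetic wrapped around */
		return -EINVAL;

	for (i = 0; i < n; i++)
		if (first_byte >= ranges[i].start && last <= ranges[i].end)
			return 0;

	return -EINVAL; /* not fully inside any System RAM range */
}
```

A caller would feed each line of /proc/iomem through
parse_system_ram_line(), collect the resulting ranges, and call
validate_scan_request() before filling first_byte and length in the
ioctl argument.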