From: Jue Wang
Date: Tue, 26 Apr 2022 11:02:38 -0700
Subject: Re: [RFC] Expose a memory poison detector ioctl to user space.
References: <20220425163451.3818838-1-juew@google.com>
To: Dave Hansen
Cc: Naoya Horiguchi, Tony Luck, Dave Hansen, Jiaqi Yan, Greg Thelen,
    Mina Almasry, linux-mm@kvack.org, Sean Christopherson

On Tue, Apr 26, 2022 at 10:57 AM Jue Wang wrote:
>
> Hi Dave,
>
> Thanks for the reply, some comments inline.
>
> On Tue, Apr 26, 2022 at 8:40 AM Dave Hansen wrote:
> >
> > From your description, you have me mostly convinced that this is
> > something that needs to get fixed. The hardware patrol scrubber(s)
> > address the same basic problem, but don't seem to be flexible to your
> > specific needs.
> >
> > But, have hardware vendors been receptive at all to making the patrol
> > scrubbers more tunable?
>
> We have discussed the use case in detail with Intel. There are
> improvements in progress to address some of the issues, like the
> signaling to avoid broadcasted MCEs. But fundamentally, the needed
> throughput is not quite compatible with the patrol scrubber's design
> purpose and architecture.
>
> It's unclear at what generation of hardware this need may get
> addressed. So for now, we are looking at software-assisted approaches
> that make use of the _whole_ CPU.
> >
> > On 4/25/22 09:34, Jue Wang wrote:
> > > /* Could stop and return after the 1st poison is detected */
> > > #define MCESCAN_IOCTL_SCAN 0
> > >
> > > struct SysramRegion {
> > >   /* input */
> > >   uint64_t first_byte;   /* first page-aligned physical address to scan */
> > >   uint64_t length;       /* page-aligned length of memory region to scan */
> > >   /* output */
> > >   uint32_t poisoned;     /* 1 - a poisoned page is found, 0 - otherwise */
> > >   uint32_t poisoned_pfn; /* PFN of the 1st detected poisoned page */
> > > };
> >
> > So, the ioctl() caller has to know the physical address layout of the
> > system?
>
> This info is available from /proc/iomem and /proc/zoneinfo, which are
> already supported / exposed by the kernel.
>
> >
> > While this is a good start at a conversation, I think you might want to
> > back up a bit. You alluded to a few requirements that you have, like:
> >
> >  * Adjustable detector resource use based on system utilization
> >  * Adjustable scan rate to ensure issues are found at a deterministic
> >    rate
> >  * Detector must be able to find errors in allocated, in-use memory
> >
> > What about SEV-SNP or TDX private memory? It might be unmapped *and*
> > limited in how it can be accessed. For instance, TDX hosts can't
> > practically read guest memory. SEV-SNP hosts have special page mapping
> > requirements; the host can't create arbitrary mappings with arbitrary
> > mapping sizes. What would this ioctl() do if asked to scan a TDX guest
> > private page?
> >
>
> Thanks for raising the UPM case for SEV-SNP / TDX private memory. This
> is where we would like more feedback and more experts to weigh in.
>
> Is reading private memory via the kernel's direct mapping benign for
> SEV-SNP and TDX? If true, could this be a way to let SEV-SNP and TDX
> use cases benefit from this work while the user space / hypervisor
> mapping is still removed?
>
> Otherwise this feature should be defined as mutually exclusive with
> incompatible features.
> Even in that case, I believe SEV-SNP or TDX may
> still benefit from _reactive_ memory poison recovery if the MCE
> handling and CONFIG_MEMORY_FAILURE still function the same on
> an #MC raised by an uncorrectable error.
> >
> > Is doing it from userspace a strict requirement?

Not necessarily an absolute requirement. We just found that there are
lots of policy and integration elements in user space that cannot be
avoided: what to scan, how fast to scan, when to back off given the
host's anticipated workload or a special customer request, and what to
do with the detected errors in terms of monitoring, telemetry, machine
repair automation, scheduling systems, etc.

> >
> > Would the detector just read memory?

Yes, a read transaction is sufficient to signal #MC on uncorrectable
cachelines.

> >
> > Are there any other physical addresses which are RAM but should not have
> > the detector used on them?

In theory, if some physical address ranges are never or very rarely
accessed, they can be exempted.
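
To make the /proc/iomem point above concrete, here is a minimal
user-space sketch of how a caller might derive page-aligned
first_byte/length pairs from the top-level "System RAM" entries. It is
only an illustration under stated assumptions: struct scan_range,
parse_system_ram_line(), collect_system_ram() and the fixed 4 KiB page
size are made up for this example and are not part of the RFC. Note
that /proc/iomem reports zeroed addresses to unprivileged readers; the
sketch skips such entries because they yield an empty range.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 4 KiB base pages; a real tool could use sysconf(_SC_PAGESIZE). */
#define SCAN_PAGE_SIZE 4096ULL

/* Hypothetical holder for the two input fields of the proposed struct. */
struct scan_range {
	uint64_t first_byte;	/* page-aligned start address */
	uint64_t length;	/* page-aligned length in bytes */
};

/*
 * Parse one /proc/iomem line such as "00100000-bffdffff : System RAM".
 * Returns 1 and fills *out for a non-empty top-level System RAM entry,
 * 0 for anything else.
 */
int parse_system_ram_line(const char *line, struct scan_range *out)
{
	uint64_t start, end, first, limit;
	int consumed = 0;

	/* Nested resources (e.g. "  ... : Kernel code") are indented; skip. */
	if (line[0] == ' ' || line[0] == '\t')
		return 0;
	if (sscanf(line, "%" SCNx64 "-%" SCNx64 " : %n",
		   &start, &end, &consumed) != 2)
		return 0;
	if (consumed == 0 || strncmp(line + consumed, "System RAM", 10) != 0)
		return 0;
	if (end < start || end == UINT64_MAX)
		return 0;

	/* The end address is inclusive: round start up, end + 1 down. */
	first = (start + SCAN_PAGE_SIZE - 1) & ~(SCAN_PAGE_SIZE - 1);
	limit = (end + 1) & ~(SCAN_PAGE_SIZE - 1);
	if (limit <= first)
		return 0;	/* empty, e.g. zeroed addresses */

	out->first_byte = first;
	out->length = limit - first;
	return 1;
}

/* Collect up to max System RAM ranges from an open /proc/iomem stream. */
size_t collect_system_ram(FILE *fp, struct scan_range *ranges, size_t max)
{
	char line[256];
	size_t n = 0;

	while (n < max && fgets(line, sizeof(line), fp))
		if (parse_system_ram_line(line, &ranges[n]))
			n++;
	return n;
}
```

A caller would open /proc/iomem with sufficient privilege, pass the
stream to collect_system_ram(), and then copy each range into the input
fields of whatever request structure the final interface defines.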