From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20221103155029.2451105-1-jiaqiyan@google.com> <20221109052908.GB527418@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <20221109052908.GB527418@hori.linux.bs1.fc.nec.co.jp>
From: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu, 10 Nov 2022 12:23:24 -0800
Message-ID: <CACw3F52dbbzLDDa9g17Ka9C5+U4=y_7Qe2=GHT4wbhb_jo346g@mail.gmail.com>
Subject: Re: [RFC] Kernel Support of Memory Error Detection.
To: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?= <naoya.horiguchi@nec.com>
Cc: "tony.luck@intel.com" <tony.luck@intel.com>, 
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>, "david@redhat.com" <david@redhat.com>, 
	"erdemaktas@google.com" <erdemaktas@google.com>, "pgonda@google.com" <pgonda@google.com>, 
	"rientjes@google.com" <rientjes@google.com>, "duenwen@google.com" <duenwen@google.com>, 
	"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>, 
	"mike.malvestuto@intel.com" <mike.malvestuto@intel.com>, "gthelen@google.com" <gthelen@google.com>, 
	"linux-mm@kvack.org" <linux-mm@kvack.org>, "jthoughton@google.com" <jthoughton@google.com>, 
	"Ghannam, Yazen" <Yazen.Ghannam@amd.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Nov 8, 2022 at 9:29 PM HORIGUCHI NAOYA(堀口　直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Thu, Nov 03, 2022 at 03:50:29PM +0000, Jiaqi Yan wrote:
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> >
> > Problem Statement
> > =================
> > The ever increasing DRAM size and cost has brought the memory subsystem
> > reliability to the forefront of large fleet owners’ concern. Memory
> > errors are one of the top hardware failures that cause server and
> > workload crashes. Simply deploying extra-reliable DRAM hardware to a
> > large-scale computing fleet adds significant cost, e.g., 10% extra cost
> > on DRAM can amount to hundreds of millions of dollars.
> >
> > Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> > during an execution context (the kernel mechanisms are MCE handler +
> > CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> > effective in keeping systems resilient from memory errors. However,
> > reactive memory poison recovery has several major drawbacks:
> > - It requires software systems that access poisoned memory to
> >   be specifically designed and implemented to recover from memory errors.
> >   Uncorrectable (UC) errors are random, which may happen outside of the
> >   enlightened address spaces or execution contexts. The added error
> >   recovery capability comes at the cost of added complexity and often
> >   impossible to enlighten in 3rd party software.
> > - In a virtualized environment, the injected MCEs introduce the same
> >   challenge to the guest.
> > - It only covers MCEs raised by CPU accesses, but the scope of memory
> >   error issue is far beyond that. For example, PCIe devices (e.g. NIC and
> >   GPU) accessing poisoned memory cause host crashes on
> >   certain machine configs.
> >
> > We want to upstream a patch set that proactively scans the memory DIMMs
> > at a configurable rate to detect UC memory errors, and attempts to
> > recover the detected memory errors. We call it proactive MPR, which
> > provides three benefits to tackle the memory error problem:
> > - Proactively scanning memory DIMMs reduces the chance of a correctable
> >   error becoming uncorrectable.
> > - Once detected, UC errors caught in unallocated memory pages are
> >   isolated and prevented from being allocated to an application or the OS.
> > - The probability of software/hardware products encountering memory
> >   errors is reduced, as they are only exposed to memory errors developed
> >   over a window of T, where T stands for the period of scrubbing the
> >   entire memory space. Any memory errors that occurred more than T ago
> >   should have resulted in custom recovery actions. For example, in a cloud
> >   environment VMs can be live migrated to another healthy host.
> >
> > Some CPU vendors [2, 3] provide hardware patrol scrubber (HPS) to
> > prevent the build up of memory errors. In comparison software memory
> > error detector (SW) has pros and cons:
> > - SW supports adaptive scanning, i.e. speeds up/down scanning, turns
> >   on/off scanning, and yields its own CPU cycles and memory bandwidth.
> >   All of these can happen on-the-fly based on the system workload status
> >   or administrator’s choice. HPS doesn’t have all these flexibilities.
> >   Its patrol speed is usually only configurable at boot time, and it is
> >   not able to consider system state. (Note: HPS is a memory controller
> >   feature and usually doesn’t consume CPU time).
> > - SW can expose controls to scan by memory types, while HPS always scans
> >   full system memory. For example, an administrator can use SW to only
> >   scan hugetlb memory on the system.
> > - SW can scan memory at a finer granularity, for example, having different
> >   scan rate per node, or entirely disabled on some node. HPS, however,
> >   currently only supports per host scanning.
> > - SW can make scan statistics (e.g. X bytes has been scanned for the
> >   last Y seconds and Z memory errors are found) easily visible to
> >   datacenter administrators, who can schedule maintenance (e.g. migrating
> >   running jobs before repairing DIMMs) accordingly.
>
> I think that exposing memory error info in the system to userspace is
> useful independent of the new scanner.

Agreed. An interface that exposes memory error info is useful on its own.
Without the scanner, though, such an interface would probably only have
data after a poisoned page is accessed and the error is recovered.
With the scanner running on a machine and detecting memory errors
(generating data), the interface becomes more meaningful because it has
more data to expose.
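
To illustrate the point, here is a rough userspace model in Python. It is
only a sketch: MemoryErrorStats and its fields are made-up names, not an
existing or proposed kernel interface. It assumes errors come from two
sources, a recovered access ("access") or the scanner ("scanner").

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryErrorStats:
    """Hypothetical per-host record; not an existing kernel interface."""
    bytes_scanned: int = 0
    errors: Counter = field(default_factory=Counter)

    def record_scan(self, nbytes: int) -> None:
        # Only the proactive scanner advances this counter.
        self.bytes_scanned += nbytes

    def record_error(self, source: str) -> None:
        # "access": found reactively when poisoned memory was accessed.
        # "scanner": found proactively by the scanner.
        if source not in ("access", "scanner"):
            raise ValueError(f"unknown error source: {source}")
        self.errors[source] += 1

    def total_errors(self) -> int:
        return sum(self.errors.values())


# Without the scanner, only recovered accesses produce data.
reactive_only = MemoryErrorStats()
reactive_only.record_error("access")

# With the scanner running, the same interface has more to report.
with_scanner = MemoryErrorStats()
with_scanner.record_error("access")
with_scanner.record_scan(4096 * 1024)
with_scanner.record_error("scanner")
```

In this model the reactive-only host reports a single error and zero bytes
scanned, while the host with the scanner also reports scan progress and the
errors found before any access touched them.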

>
> > - SW’s functionality is consistent across hardware platforms. HPS’s
> >   functionality varies from vendor to vendor. For example, some vendors
> >   support shorter scrubbing periods than others, and some vendors may not
> >   support memory scrubbing at all.
> > - HPS usually doesn’t consume CPU cores but does consume memory
> >   controller cycles and memory bandwidth. SW consumes both CPU cycles
> >   and memory bandwidth, but is only a problem if administrators opt into
> >   the scanning after weighing the cost benefit.
> > - As CPU cores are not consumed by HPS, there won’t be any cache impact.
> >   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
> >   architectures [5] to minimize cache impact (in case of prefetchnta,
> >   completely avoiding L1/L2 cache impact).
> >
> > Solution Proposals
> > ==================
> >
> > What to Scan
> > ============
> > The initial RFC proposed to scan the **entire system memory**, which
> > raised the question of what memory is scannable (i.e. memory accessible
> > from kernel direct mapping). We attempt to address this question by
> > breaking down the memory types as follows:
> > - Static memory types: memory that either stays scannable or unscannable.
> >   Well defined examples are hugetlb vs regular memory, node-local memory
> >   vs far memory (e.g. CXL or PMEM). While most static memory types are
> >   scannable, administrators could disable scanning far memory to avoid
> >   messing with the promotion and demotion logic in memory tiering
> >   solutions. (The implementation will allow administrators to disable
> >   scanning on scannable memory).
>
> I think that another viewpoint of how we prioritize memory type to scan
> is kernel vs userspace memory. Current hwpoison mechanism does little to
> recover from errors in kernel pages (slab, reserved), so there seems
> little benefit to detect such errors proactively and beforehand.  If the
> resource for scanning is limited, the user might think of focusing on
> scanning userspace memory.

I definitely agree that scanning userspace memory is important, but I want
to argue that scanning kernel memory is also necessary.
- Memory error found in userspace => (almost) never causes a panic.
- Memory error found in kernel space:
  - In allocated pages => little recovery is possible, so no better than
    not scanning.
  - In free pages => the page is taken off the buddy allocator to prevent
    future use, which is better than not scanning.
(The scanner is going to access the memory without reading its content,
and properly fix up the kernel access to a memory error using EXTABLE.)
Overall, proactively scanning kernel memory improves things.
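
The cases above can be written down as a small lookup table. This is an
illustrative Python model only, not kernel code: PageKind, OUTCOMES and
scanning_helps are made-up names, and marking the userspace case as a win
assumes the existing memory failure recovery path handles it.

```python
from enum import Enum


class PageKind(Enum):
    USERSPACE = "userspace"
    KERNEL_ALLOCATED = "kernel, allocated"
    KERNEL_FREE = "kernel, free"


# (expected handling, whether scanning beats not scanning), per the cases
# listed above. The userspace entry is an assumption, see the note above.
OUTCOMES = {
    PageKind.USERSPACE: ("recover; (almost) never panics", True),
    PageKind.KERNEL_ALLOCATED: ("little recovery possible", False),
    PageKind.KERNEL_FREE: ("take page off the buddy allocator", True),
}


def scanning_helps(kinds):
    """Return True if scanning improves the outcome for any given page kind."""
    return any(OUTCOMES[kind][1] for kind in kinds)


kernel_only = [PageKind.KERNEL_ALLOCATED, PageKind.KERNEL_FREE]
```

Here scanning_helps(kernel_only) is true because of the free-page case,
which is the argument for not limiting the scanner to userspace memory.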

>
> Thanks,
> Naoya Horiguchi