From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20200623201745.GG21350@casper.infradead.org> <20200623220412.GA21232@agluck-desk2.amr.corp.intel.com>
 <20200623221741.GH21350@casper.infradead.org> <20200623222658.GA21817@agluck-desk2.amr.corp.intel.com>
 <20200623224027.GI21350@casper.infradead.org> <20200624000124.GH7625@magnolia>
 <20200624121000.GM21350@casper.infradead.org>
In-Reply-To: <20200624121000.GM21350@casper.infradead.org>
From: Dan Williams <dan.j.williams@intel.com>
Date: Wed, 24 Jun 2020 16:21:24 -0700
Message-ID: <CAPcyv4joCu00OXV9da3eoQVqM_FTwELQa6=YdwXjZCtyxy13bg@mail.gmail.com>
Subject: Re: [RFC] Make the memory failure blast radius more precise
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>, "Luck, Tony" <tony.luck@intel.com>, 
	Borislav Petkov <bp@alien8.de>, Naoya Horiguchi <naoya.horiguchi@nec.com>, linux-edac@vger.kernel.org, 
	Linux MM <linux-mm@kvack.org>, linux-nvdimm <linux-nvdimm@lists.01.org>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed, Jun 24, 2020 at 5:10 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jun 23, 2020 at 05:01:24PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 23, 2020 at 11:40:27PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 23, 2020 at 03:26:58PM -0700, Luck, Tony wrote:
> > > > On Tue, Jun 23, 2020 at 11:17:41PM +0100, Matthew Wilcox wrote:
> > > > > It might also be nice to have an madvise() MADV_ZERO option so the
> > > > > application doesn't have to look up the fd associated with that memory
> > > > > range, but we haven't floated that idea with the customer yet; I just
> > > > > thought of it now.
> > > >
> > > > So the conversation between OS and kernel goes like this?
> > > >
> > > > 1) machine check
> > > > > 2) Kernel unmaps the 4K page surrounding the poison and sends
> > > >    SIGBUS to the application to say that one cache line is gone
> > > > 3) App says madvise(MADV_ZERO, that cache line)
> > > > 4) Kernel says ... "oh, you know how to deal with this" and allocates
> > > >    a new page, copying the 63 good cache lines from the old page and
> > > >    zeroing the missing one. New page is mapped to user.
> > >
> > > That could be one way of implementing it.  My understanding is that
> > > pmem devices will reallocate bad cachelines on writes, so a better
> > > implementation would be:
> > >
> > > 1) Kernel receives machine check
> > > 2) Kernel sends SIGBUS to the application
> > > 3) App send madvise(MADV_ZERO, addr, 1 << granularity)
> > > 4) Kernel does special writes to ensure the cacheline is zeroed
> > > 5) App does whatever it needs to recover (reconstructs the data or marks
> > > it as gone)
> >
> > Frankly, I've wondered why the filesystem shouldn't just be in charge of
> > all this--
> >
> > 1. kernel receives machine check
> > 2. kernel tattles to xfs
> > 3. xfs looks up which file(s) own the pmem range
> > 4. xfs zeroes the region, clears the poison, and sets AS_EIO on the
> >    files
>
> ... machine reboots, app restarts, gets no notification anything is wrong,
> treats zeroed region as good data, launches nuclear missiles.

Isn't AS_EIO stored persistently in the file block allocation map?
Even if it isn't today, that is part of the proposal: the filesystem
maintains a list of poison that is coordinated with the pmem driver.

>
> > 5. xfs sends SIGBUS to any programs that had those files mapped to tell
> >    them "Your data is gone, we've stabilized the storage you had
> >    mapped."
> > 6. app does whatever it needs to recover
> >
> > Apps shouldn't have to do this punch-and-reallocate dance, seeing as
> > they don't currently do that for SCSI disks and the like.
>
> The SCSI disk retains the error until the sector is rewritten.
> I'm not entirely sure whether you're trying to draw an analogy with
> error-in-page-cache or error-on-storage-medium.
>
> error-on-medium needs to persist until the app takes an affirmative step
> to clear it.  I presume XFS does not write zeroes to sectors with
> errors on SCSI disks ...

SCSI does not have an async mechanism to retrieve a list of poisoned
blocks from the hardware (that I know of), pmem does. I really think
we should not glom pmem error handling semantics onto the same
infrastructure that handles volatile / replaceable pages. When the
filesystem is enabled to get involved, it should impose a different
model than generic memory error handling, especially because generic
memory-error handling has no chance to solve the reflink problem.
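
To make the reflink point concrete, here is a toy sketch in Python (not
kernel code). It assumes the filesystem can map a physical range back
to every file extent that references it. Extent, files_hit_by_poison
and the file names are made up for illustration; the point is only
that one poisoned cache line can land in several files at different
offsets, which a per-page, per-mapping model cannot express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical record: a file extent backed by a physical pmem range."""
    path: str
    file_off: int  # byte offset within the file
    phys_off: int  # byte offset on the pmem device
    length: int    # extent length in bytes


def files_hit_by_poison(extents, poison_start, poison_len):
    """Return (path, file_offset, length) for each extent overlapping the poison."""
    poison_end = poison_start + poison_len
    hits = []
    for e in extents:
        lo = max(e.phys_off, poison_start)
        hi = min(e.phys_off + e.length, poison_end)
        if lo < hi:  # non-empty overlap
            hits.append((e.path, e.file_off + (lo - e.phys_off), hi - lo))
    return hits


# a.db and b.db share the same physical blocks, as after a reflink copy.
extents = [
    Extent("a.db", 0, 1 << 20, 8192),
    Extent("b.db", 4096, 1 << 20, 8192),
    Extent("c.log", 0, 4 << 20, 4096),
]

# One poisoned 64-byte cache line, 4096 bytes into the shared range.
hits = files_hit_by_poison(extents, (1 << 20) + 4096, 64)
```

With these made-up extents, hits names both a.db (file offset 4096) and
b.db (file offset 8192), while c.log is untouched.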

If an application wants to survive poison consumption, signals seem
only sufficient for interrupting an application that needs to take
immediate action because one of its instructions was prevented from
making forward progress. The interface for enumerating the extent of
errors for DAX goes beyond what siginfo can reasonably convey; that
piece is where the filesystem can be called to discover which file
extents are impacted by poison.
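
For reference, what siginfo does convey on a memory-error SIGBUS is a
single address (si_addr) plus si_addr_lsb, the least significant valid
bit of that address, which bounds the corrupted region. A rough sketch
of the arithmetic in Python (poisoned_range is a name I made up, and
the addresses are arbitrary examples):

```python
def poisoned_range(si_addr, si_addr_lsb):
    """Return (start, size) of the region described by one SIGBUS siginfo.

    si_addr_lsb is the log2 of the corruption granularity, so the region
    is the naturally aligned 2**si_addr_lsb block containing si_addr.
    """
    size = 1 << si_addr_lsb
    start = si_addr & ~(size - 1)
    return start, size


# A cache-line-sized report (lsb = 6) versus a page-sized one (lsb = 12).
line = poisoned_range(0x7F0000001234, 6)
page = poisoned_range(0x7F0000001234, 12)
```

That is one aligned range in one address space per signal. It says
nothing about other poisoned ranges in the same file, or about other
files that share the same blocks.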

I like Darrick's idea that the kernel stabilizes the storage by
default, and that the repair mechanism is just a write(2). I assume
"stabilize" means making sure that the file offset is permanently
recorded as poisoned until the next write(2), and that read(2) and
mmap(2) return errors in the meantime so no more machine checks are
triggered.
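
A toy model of the semantics I have in mind, in Python rather than
kernel code. StabilizedFile and its methods are hypothetical, and it
tracks poison per byte only to keep the sketch short; a real
implementation would track ranges at cache-line or sector granularity
and persist them.

```python
import errno


class StabilizedFile:
    """Hypothetical model: poison is sticky until overwritten."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.poison = set()  # poisoned byte offsets

    def mark_poison(self, off, length):
        """Record a poisoned range, e.g. on notification from the pmem driver."""
        self.poison.update(range(off, off + length))

    def read(self, off, length):
        """Fail with EIO instead of touching poisoned media."""
        if any(o in self.poison for o in range(off, off + length)):
            raise OSError(errno.EIO, "read overlaps poisoned range")
        return bytes(self.data[off:off + length])

    def write(self, off, buf):
        """A write is the repair: it replaces the data and clears the poison."""
        self.data[off:off + len(buf)] = buf
        self.poison.difference_update(range(off, off + len(buf)))


f = StabilizedFile(b"x" * 256)
f.mark_poison(64, 64)

try:
    f.read(64, 8)
    read_failed = False
except OSError as e:
    read_failed = e.errno == errno.EIO

f.write(64, b"\0" * 64)        # application repairs the range
repaired = f.read(64, 8)       # succeeds again after the write
```

In this model the error survives until the application takes the
affirmative step of writing, which is the property Matthew asked for
above, without the application having to punch and reallocate.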