From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Houghton <jthoughton@google.com>
Date: Fri, 9 Jun 2023 13:20:19 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
To: Dan Williams
Cc: Mike Kravetz, David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Peter Xu,
 Yosry Ahmed, linux-mm@kvack.org, Michal Hocko, Matthew Wilcox,
 David Rientjes, Axel Rasmussen, lsf-pc@lists.linux-foundation.org,
 Jiaqi Yan, jane.chu@oracle.com
In-Reply-To: <64829e26edbc6_1433ac29475@dwillia2-xfh.jf.intel.com.notmuch>
References: <20230602172723.GA3941@monkey>
 <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
 <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
 <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>
 <20230607220651.GC4122@monkey>
 <64824e07ba371_142af829493@dwillia2-xfh.jf.intel.com.notmuch>
 <20230608223543.GB88798@monkey>
 <64829e26edbc6_1433ac29475@dwillia2-xfh.jf.intel.com.notmuch>
Content-Type: text/plain; charset="UTF-8"

On Thu, Jun 8, 2023 at 8:36 PM Dan Williams wrote:
>
> [ add Jane ]
>
> Mike Kravetz wrote:
> > On 06/08/23 14:54, Dan Williams wrote:
> > > Mike Kravetz wrote:
> > > > On 06/07/23 10:13, David Hildenbrand wrote:
> > > [..]
> > > > I am struggling with how to support existing hugetlb users that are running
> > > > into issues like memory errors on hugetlb pages today. And, yes, that is a
> > > > source of real customer issues. They are not really happy with the current
> > > > design that a single error will take out a 1G page, and their VM or
> > > > application. Moving to THP is not likely as they really want a pre-allocated
> > > > pool of 1G pages. I just don't have a good answer for them.
> > >
> > > Is it the reporting interface, or the fact that the page gets offlined
> > > too quickly?
> >
> > Somewhat both.
> >
> > Reporting says the error starts at the beginning of the huge page with
> > length of huge page size. So, the actual error is not really isolated. In
> > a way, this is 'desired' since hugetlb pages are treated as a single page.
>
> On x86 the error reporting is always by cacheline, but it's the
> memory-failure code that turns that into a SIGBUS with the sigaction
> info indicating failure relative to the page-size. That interface has
> been awkward for PMEM as well, as Jane can attest.
>
> > Once a page is marked with poison, we prevent subsequent faults of the page.
>
> That makes sense.
>
> > Since a hugetlb page is treated as a single page, the 'good data' can
> > not be accessed as there is no way to fault in smaller pieces (4K pages)
> > of the page. Jiaqi Yan actually put together patches to 'read' the good
> > 4K pages within the hugetlb page [1], but we will not always have a file
> > handle.
>
> That mitigation is also a problem for device-dax, which makes hard
> guarantees that mappings will always be aligned, mainly to keep the
> driver simple.
> >
> > [1] https://lore.kernel.org/linux-mm/20230517160948.811355-1-jiaqiyan@google.com/
> >
> > > I.e. if the 1GB page was unmapped from userspace per usual
> > > memory-failure, but the application had an opportunity to record what
> > > got clobbered on a smaller granularity and then ask the kernel to repair
> > > the page, would that relieve some pain?
> >
> > Sounds interesting.
> >
> > > Where repair is atomically
> > > writing a full cacheline of zeroes,
> >
> > Excuse my hardware ignorance ... In this case, I assume writing zeroes
> > will repair the error on the original memory? This would then result
> > in data loss/zeroed, BUT the memory could be accessed without error.
> > So, the original 1G page could be used by the application (with data
> > missing of course).
>
> Yes, but it depends.
> Sometimes poison is a permanent error and no amount
> of writing to it can correct the error, sometimes it is transient, like a
> high-energy particle flipping a bit in the cell, and sometimes it is
> deposited from outside the memory controller, as in the case when a
> poisoned dirty cacheline gets written back.
>
> The majority of the time, outside catastrophic loss of a whole rank,
> it's only 64 bytes at a time that have gone bad.
>
> > > or copying around the poison to a
> > > new page and returning the old one to be broken down, with only the
> > > single 4K page with the error quarantined.
> >
> > I suppose we could do that within the kernel, however user space would
> > have the ability to do this IF it could access the good 4K pages. That
> > is essentially what we do with THP pages by splitting and just marking a
> > single 4K page with poison. That is the functionality proposed by HGM.
> >
> > It seems like asking the kernel to 'repair the page' would be a new
> > hugetlb-specific interface. Or, could there be other users?
>
> I think there are other users for this.
>
> Jane worked on DAX_RECOVERY_WRITE support, which is a way for a DIRECT_IO
> write on a DAX file (guaranteed to be page aligned) to plumb an
> operation to the pmem driver to repair a location that is not mmap'able
> due to hardware poison.
>
> However, that's fsdax specific. It would be nice to be able to have
> SIGBUS handlers that can ask the kernel to overwrite the cacheline and
> restore access to the rest of the page. It seems unfortunate to live
> with throwing away 1GB - 64 bytes of capacity at the first sign of
> trouble.
>
> The nice thing about hugetlb compared to pmem is that you do not need to
> repair in place, in case the error is permanent. Conceivably the kernel
> could allocate a new page, perform the copy of the good bits on behalf
> of the application, and let the page be mapped again. If that copy
> encounters poison, rinse and repeat until it succeeds or the application
> says, "you know what, I think it's dead, thanks anyway".

I'm not sure if this is compatible with what we need for VMs. We can't
overwrite/zero guest memory unless the guest were somehow enlightened,
which we can't guarantee. We can't allow the guest to keep triggering
memory errors -- i.e., we have to unmap the memory at least from the EPT
(ideally by unmapping it from the userspace page tables).

So, we could:

1. Do what HGM does and have the kernel unmap the 4K page in the
   userspace page tables.
2. On the fly, change the VMA for our hugepage to not be HugeTLB anymore,
   and re-map all the good 4K pages.
3. Tell userspace that it must change its mapping from HugeTLB to
   something else, and move the good 4K pages into the new mapping.

(2) feels like more complexity than (1). If a user created a MAP_HUGETLB
mapping and now it isn't HugeTLB, that feels wrong.

(3) isn't possible today, but with Jiaqi's improvement to hugetlbfs
read() it becomes possible. We'll need an extra 1G of memory while we
are doing this copying/recovery, and it isn't transparent at all.

(3) is additionally painful when considering live migration. We have to
keep the 4K page unmapped after the migration (to keep it poisoned from
the guest's perspective), but the page is no longer *actually* poisoned
on the host. To get the memory we need to back our fake-poisoned pages
with tmpfs, we would need to free our 1G page. Getting that page back
later isn't trivial.
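For (3), the userspace side would look roughly like the sketch below.
This is only an illustration, not code from any of the series discussed
here: it assumes Jiaqi's hugetlbfs read() support is in place and that a
read overlapping the poisoned 4K region fails with EIO (the exact errno
is a guess), and it uses the existing memory-failure SIGBUS reporting
(BUS_MCEERR_AO/AR, si_addr, si_addr_lsb) to learn what was hit:

/*
 * Rough sketch only: salvage the readable 4K chunks of a poisoned
 * hugetlb page into ordinary (e.g. tmpfs-backed) memory.  Assumes
 * hugetlbfs read() works on a poisoned huge page and that reads
 * overlapping the poisoned region fail with EIO (the errno is a guess).
 */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SZ	4096UL

static void *bad_addr;	/* recorded for the recovery path */
static size_t bad_len;

/* Installed with sigaction(SIGBUS, ...) and SA_SIGINFO. */
static void sigbus_handler(int sig, siginfo_t *si, void *uc)
{
	(void)sig;
	(void)uc;
	if (si->si_code == BUS_MCEERR_AO || si->si_code == BUS_MCEERR_AR) {
		bad_addr = si->si_addr;
		/* Today this reports the whole huge page, e.g. 1G. */
		bad_len = 1UL << si->si_addr_lsb;
	}
}

/*
 * Copy whatever is still readable out of the poisoned huge page backing
 * 'hugetlb_fd' at 'offset' into 'dst'; chunks that hit poison are left
 * zeroed so the guest-visible layout is preserved.
 */
static int salvage(int hugetlb_fd, off_t offset, char *dst, size_t hugepage_sz)
{
	size_t off;

	for (off = 0; off < hugepage_sz; off += CHUNK_SZ) {
		ssize_t n = pread(hugetlb_fd, dst + off, CHUNK_SZ, offset + off);

		if (n == (ssize_t)CHUNK_SZ)
			continue;
		if (n < 0 && errno == EIO) {	/* assumed: poisoned chunk */
			memset(dst + off, 0, CHUNK_SZ);
			continue;
		}
		return -1;			/* short read or other error */
	}
	return 0;
}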
So (1) still seems like the most natural solution, and the question
becomes: how exactly do we implement 4K unmapping? And that brings us
back to the main question of how HGM should be implemented in general.

> It's something that has been on the "when there is time" pile, but maybe
> instead of making hugetlb more complicated this effort goes to make
> memory-failure more capable.

I like this line of thinking, but as I see it right now, we still need
something like HGM -- maybe I'm wrong. :)

- James