From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu, 22 Jun 2023 17:45:32 -0700
Subject: Re: [PATCH v1 1/3] mm/hwpoison: find subpage in hugetlb HWPOISON list
To: Mike Kravetz
Cc: Naoya Horiguchi, HORIGUCHI NAOYA(堀口 直也), songmuchun@bytedance.com,
 shy828301@gmail.com, linmiaohe@huawei.com, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, duenwen@google.com,
 axelrasmussen@google.com, jthoughton@google.com
In-Reply-To: <20230620223909.GB3567@monkey>
References: <20230523024305.GA920098@hori.linux.bs1.fc.nec.co.jp>
 <20230612041901.GA3083591@ik1-406-35019.vs.sakura.ne.jp>
 <20230616233447.GB7371@monkey>
 <20230617225927.GA3540@monkey>
 <20230619082330.GA1612447@ik1-406-35019.vs.sakura.ne.jp>
 <20230620180533.GA3567@monkey>
 <20230620223909.GB3567@monkey>
On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz wrote:
>
> On 06/20/23 11:05, Mike Kravetz wrote:
> > On 06/19/23 17:23, Naoya Horiguchi wrote:
> > >
> > > Considering this issue as one specific to memory error handling, checking
> > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to
> > > detect the race.  Then, an idea like the below diff (not tested) can make
> > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait
> > > for the allocation of vmemmap pages to complete.
> > >
> > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> > >         int ret = 2;    /* fallback to normal page handling */
> > >         bool count_increased = false;
> > >
> > > -       if (!folio_test_hugetlb(folio))
> > > +       if (!folio_test_hugetlb(folio)) {
> > > +               if (folio_test_hugetlb_vmemmap_optimized(folio))
> > > +                       ret = -EBUSY;
> >
> > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in
> > the folio->private field.
> >
> > In the case where the folio is a non-hugetlb folio, the folio->private field
> > could be any arbitrary value.  As such, the test for vmemmap_optimized may
> > return a false positive.  We could end up retrying for an arbitrarily
> > long time.
> >
> > I am looking at how to restructure the code which removes and frees
> > hugetlb pages so that folio_test_hugetlb() would remain true until
> > vmemmap pages are allocated.  The easiest way to do this would introduce
> > another hugetlb lock/unlock cycle in the page freeing path.  This would
> > undo some of the speedups in the series:
> > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab
> >
>
> Perhaps something like this?  Minimal testing.

Thanks for putting up a fix, Mike!

>
> From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz
> Date: Tue, 20 Jun 2023 14:48:39 -0700
> Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap
>
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
>   lists, get a ref on the page and remove the hugetlb destructor.  This
>   all must be done under the hugetlb lock.  After this call, the page
>   can be treated as a normal compound page or a collection of base
>   size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
>   needed and the free routine of the underlying allocator is called
>   on the resulting page.  We cannot hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps.  In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages.  It will then try to SetPageHWPoison(page) on the page with an error.
> If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap.  Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present.  This saves a
> lock/unlock cycle.  Otherwise, the destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count.  This is not
> a normal state.  The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation.  This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a bunch of pages.
>
> Fixes: ???
> Signed-off-by: Mike Kravetz
> ---
>  mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 51 insertions(+), 24 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d76574425da3..f7f64470aee0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
>                                                 unsigned int order) { }
>  #endif
>
> +static inline void __clear_hugetlb_destructor(struct hstate *h,
> +                                               struct folio *folio)
> +{
> +       lockdep_assert_held(&hugetlb_lock);
> +
> +       /*
> +        * Very subtle
> +        *
> +        * For non-gigantic pages set the destructor to the normal compound
> +        * page dtor.  This is needed in case someone takes an additional
> +        * temporary ref to the page, and freeing is delayed until they drop
> +        * their reference.
> +        *
> +        * For gigantic pages set the destructor to the null dtor.  This
> +        * destructor will never be called.  Before freeing the gigantic
> +        * page destroy_compound_gigantic_folio will turn the folio into a
> +        * simple group of pages.  After this the destructor does not
> +        * apply.
> +        *
> +        */
> +       if (hstate_is_gigantic(h))
> +               folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> +       else
> +               folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> +}
> +
>  /*
> - * Remove hugetlb folio from lists, and update dtor so that the folio appears
> - * as just a compound page.
> + * Remove hugetlb folio from lists.
> + * If vmemmap exists for the folio, update dtor so that the folio appears
> + * as just a compound page.  Otherwise, wait until after allocating vmemmap
> + * to update dtor.
>   *
>   * A reference is held on the folio, except in the case of demote.
>   *
> @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
>         }
>
>         /*
> -        * Very subtle
> -        *
> -        * For non-gigantic pages set the destructor to the normal compound
> -        * page dtor.  This is needed in case someone takes an additional
> -        * temporary ref to the page, and freeing is delayed until they drop
> -        * their reference.
> -        *
> -        * For gigantic pages set the destructor to the null dtor.  This
> -        * destructor will never be called.  Before freeing the gigantic
> -        * page destroy_compound_gigantic_folio will turn the folio into a
> -        * simple group of pages.  After this the destructor does not
> -        * apply.
> -        *
> -        * This handles the case where more than one ref is held when and
> -        * after update_and_free_hugetlb_folio is called.
> -        *
> -        * In the case of demote we do not ref count the page as it will soon
> -        * be turned into a page of smaller size.
> +        * We can only clear the hugetlb destructor after allocating vmemmap
> +        * pages.  Otherwise, someone (memory error handling) may try to write
> +        * to tail struct pages.
> +        */
> +       if (!folio_test_hugetlb_vmemmap_optimized(folio))
> +               __clear_hugetlb_destructor(h, folio);
> +
> +       /*
> +        * In the case of demote we do not ref count the page as it will soon
> +        * be turned into a page of smaller size.
>          */
>         if (!demote)
>                 folio_ref_unfreeze(folio, 1);
> -       if (hstate_is_gigantic(h))
> -               folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> -       else
> -               folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
>
>         h->nr_huge_pages--;
>         h->nr_huge_pages_node[nid]--;
> @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  {
>         int i;
>         struct page *subpage;
> +       bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);

Can this test on vmemmap_optimized still tell us if we should
__clear_hugetlb_destructor?  From my reading:
1. If a hugetlb folio is still vmemmap optimized in
   __remove_hugetlb_folio, __remove_hugetlb_folio won't
   __clear_hugetlb_destructor.
2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear
   HPG_vmemmap_optimized if it succeeds.
3. Now when dissolve_free_huge_page gets into
   __update_and_free_hugetlb_folio, we will see clear_dtor is false and
   __clear_hugetlb_destructor won't be called.

Or maybe I misunderstood, and what you really want to do is never call
__clear_hugetlb_destructor so that folio_test_hugetlb is always true?

>
>         if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>                 return;
> @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>         if (unlikely(folio_test_hwpoison(folio)))
>                 folio_clear_hugetlb_hwpoison(folio);
>
> +       /*
> +        * If vmemmap pages were allocated above, then we need to clear the
> +        * hugetlb destructor under the hugetlb lock.
> +        */
> +       if (clear_dtor) {
> +               spin_lock_irq(&hugetlb_lock);
> +               __clear_hugetlb_destructor(h, folio);
> +               spin_unlock_irq(&hugetlb_lock);
> +       }
> +
>         for (i = 0; i < pages_per_huge_page(h); i++) {
>                 subpage = folio_page(folio, i);
>                 subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> --
> 2.41.0
>
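To make the ordering in points 1-3 above concrete, here is a minimal
userspace model of that path.  This is only a sketch: the struct and
function names mirror the kernel's, but they are stand-ins for the real
folio API, not kernel code.

/*
 * Minimal userspace model of steps 1-3 above -- NOT kernel code.  The
 * struct and helpers model HPG_vmemmap_optimized (a bit in
 * folio->private) and the hugetlb destructor, to show why clear_dtor
 * ends up false in the dissolve path.
 */
#include <stdbool.h>
#include <stdio.h>

struct folio {
        bool vmemmap_optimized;         /* models HPG_vmemmap_optimized */
        bool hugetlb_dtor;              /* models the hugetlb destructor */
};

/* Step 1: the flag is still set, so the destructor is intentionally kept. */
static void model_remove_hugetlb_folio(struct folio *f)
{
        if (!f->vmemmap_optimized)
                f->hugetlb_dtor = false;        /* not taken in this scenario */
}

/* Step 2: dissolve_free_huge_page() restores vmemmap, clearing the flag. */
static int model_hugetlb_vmemmap_restore(struct folio *f)
{
        f->vmemmap_optimized = false;
        return 0;                               /* assume allocation succeeded */
}

/* Step 3: clear_dtor samples the flag only after it was already cleared. */
static void model_update_and_free_hugetlb_folio(struct folio *f)
{
        bool clear_dtor = f->vmemmap_optimized; /* false by now */

        if (clear_dtor)
                f->hugetlb_dtor = false;        /* never reached */
        printf("clear_dtor=%d, dtor still set=%d\n",
               clear_dtor, f->hugetlb_dtor);
}

int main(void)
{
        struct folio f = { .vmemmap_optimized = true, .hugetlb_dtor = true };

        model_remove_hugetlb_folio(&f);         /* step 1 */
        model_hugetlb_vmemmap_restore(&f);      /* step 2 */
        model_update_and_free_hugetlb_folio(&f);/* step 3 */
        return 0;
}

Compiled with plain cc, this prints "clear_dtor=0, dtor still set=1",
i.e. under this reading the destructor is never cleared on the dissolve
path, which is what my question above is about.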