From: Jiaqi Yan <jiaqiyan@google.com>
Date: Mon, 9 Mar 2026 08:47:12 -0700
Subject: Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
To: Miaohe Lin
Cc: nao.horiguchi@gmail.com, tony.luck@intel.com, wangkefeng.wang@huawei.com,
    willy@infradead.org, akpm@linux-foundation.org, osalvador@suse.de,
    rientjes@google.com,
    duenwen@google.com, jthoughton@google.com, jgg@nvidia.com,
    ankita@nvidia.com, peterx@redhat.com, sidhartha.kumar@oracle.com,
    ziy@nvidia.com, david@redhat.com, dave.hansen@linux.intel.com,
    muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, william.roche@oracle.com,
    harry.yoo@oracle.com, jane.chu@oracle.com

On Mon, Mar 9, 2026 at 12:41 AM Miaohe Lin wrote:
>
> On 2026/3/9 12:53, Jiaqi Yan wrote:
> > On Mon, Feb 23, 2026 at 11:30 PM Miaohe Lin wrote:
> >>
> >> On 2026/2/13 13:01, Jiaqi Yan wrote:
> >>> On Mon, Feb 9, 2026 at 11:31 PM Miaohe Lin wrote:
> >>>>
> >>>> On 2026/2/10 12:47, Jiaqi Yan wrote:
> >>>>> On Mon, Feb 9, 2026 at 3:54 AM Miaohe Lin wrote:
> >>>>>>
> >>>>>> On 2026/2/4 3:23, Jiaqi Yan wrote:
> >>>>>>> Sometimes immediately hard offlining a large chunk of contiguous
> >>>>>>> memory having uncorrected memory errors (UE) may not be the best
> >>>>>>> option. Cloud providers usually serve capacity- and
> >>>>>>> performance-critical guest memory with 1G HugeTLB hugepages, as
> >>>>>>> this significantly reduces the overhead associated with managing
> >>>>>>> page tables and TLB misses. However, in today's HugeTLB system,
> >>>>>>> once a byte of memory in a hugepage is hardware corrupted, the
> >>>>>>> kernel discards the whole hugepage, including the healthy
> >>>>>>> portion. Customer workloads running in the VM can hardly recover
> >>>>>>> from such a great loss of memory.
> >>>>>>
> >>>>>> Thanks for your patch. Some questions below.
> >>>>>>
> >>>>>>>
> >>>>>>> Therefore keeping or discarding a large chunk of contiguous
> >>>>>>> memory owned by userspace (particularly one serving guest memory)
> >>>>>>> due to a recoverable UE may better be controlled by the userspace
> >>>>>>> process that owns the memory, e.g. the VMM in the Cloud
> >>>>>>> environment.
> >>>>>>>
> >>>>>>> Introduce a memfd-based userspace memory failure recovery (MFR)
> >>>>>>> policy, MFD_MF_KEEP_UE_MAPPED. It is possible to support other
> >>>>>>> memfds, but the current implementation only covers HugeTLB.
> >>>>>>>
> >>>>>>> For a hugepage associated with a MFD_MF_KEEP_UE_MAPPED enabled
> >>>>>>> memfd, whenever it runs into a new UE:
> >>>>>>>
> >>>>>>> * MFR defers hard offline operations, i.e., unmapping and
> >>>>>>
> >>>>>> So the folio can't be unpoisoned until the hugetlb folio becomes
> >>>>>> free?
> >>>>>
> >>>>> Are you asking, from a testing perspective, whether we are still
> >>>>> able to clean up injected test errors via unpoison_memory() with
> >>>>> MFD_MF_KEEP_UE_MAPPED?
> >>>>>
> >>>>> If so, unpoison_memory() can't turn the HWPoison hugetlb page back
> >>>>> into a normal hugetlb page, as MFD_MF_KEEP_UE_MAPPED automatically
> >>>>> dissolves
> >>>>
> >>>> We might lose some testability, but that should be an acceptable
> >>>> compromise.
> >>>
> >>> To clarify, looking at unpoison_memory(), it seems unpoison should
> >>> still work if called before truncation or before the memfd is
> >>> closed.
> >>>
> >>> What I wanted to say is: for my test hugetlb-mfr.c, since I really
> >>> want to test the cleanup code (dissolving a free hugepage having
> >>> multiple errors) after truncation or memfd close, we can only
> >>> unpoison the raw pages rejected by the buddy allocator.
> >>>
> >>>>
> >>>>> it. unpoison_memory(pfn) can probably still turn the HWPoison raw
> >>>>> page back into a normal one, but you have already lost the hugetlb
> >>>>> page.
> >>>>>
> >>>>>>
> >>>>>>>   dissolving. MFR still sets the HWPoison flag, holds a refcount
> >>>>>>>   for every raw HWPoison page, records them in a list, and sends
> >>>>>>>   SIGBUS to the consuming thread, but si_addr_lsb is reduced to
> >>>>>>>   PAGE_SHIFT. If userspace is able to handle the SIGBUS, the
> >>>>>>>   HWPoison hugepage remains accessible via the mapping created
> >>>>>>>   with that memfd.
> >>>>>>>
> >>>>>>> * If the memory was not faulted in yet, the fault handler also
> >>>>>>>   allows faulting in the HWPoison folio.
> >>>>>>>
> >>>>>>> For a MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
> >>>>>>> when the userspace process truncates its hugepages:
> >>>>>>>
> >>>>>>> * When the HugeTLB in-memory file system removes the filemap's
> >>>>>>>   folios one by one, it asks MFR to deal with HWPoison folios
> >>>>>>>   on the fly, implemented by filemap_offline_hwpoison_folio().
> >>>>>>>
> >>>>>>> * MFR drops the refcounts being held for the raw HWPoison
> >>>>>>>   pages within the folio. Now that the HWPoison folio becomes
> >>>>>>>   free, MFR dissolves it into a set of raw pages. The healthy
> >>>>>>>   pages are recycled into the buddy allocator, while the HWPoison
> >>>>>>>   ones are prevented from re-allocation.
> >>>>>>>
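(Aside, to make the intended userspace contract concrete: a VMM would opt
in and consume the finer-grained SIGBUS roughly as sketched below. This is
an untested illustration, not part of the series; the
MFD_MF_KEEP_UE_MAPPED value is a placeholder for whatever the final patch
assigns, and the recovery action in the handler is entirely VMM-specific.)

  #include <linux/memfd.h>        /* MFD_HUGETLB, MFD_HUGE_1GB */
  #include <signal.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef MFD_MF_KEEP_UE_MAPPED
  #define MFD_MF_KEEP_UE_MAPPED 0x0020U   /* placeholder value */
  #endif

  /* raw syscall wrapper, as in tools/testing/selftests/memfd */
  static int my_memfd_create(const char *name, unsigned int flags)
  {
          return syscall(SYS_memfd_create, name, flags);
  }

  static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
  {
          /*
           * With MFD_MF_KEEP_UE_MAPPED, si_addr_lsb is PAGE_SHIFT, so
           * only the one small page around info->si_addr is lost, not
           * the whole 1G hugepage. A VMM would repair or discard just
           * that guest page here and keep using the rest.
           */
  }

  int main(void)
  {
          struct sigaction sa = {
                  .sa_sigaction = sigbus_handler,
                  .sa_flags = SA_SIGINFO,
          };
          int fd;

          sigaction(SIGBUS, &sa, NULL);
          fd = my_memfd_create("guest-ram", MFD_HUGETLB | MFD_HUGE_1GB |
                                            MFD_MF_KEEP_UE_MAPPED);
          if (fd < 0)
                  return 1;
          /* mmap() the fd and hand the region to the guest as usual. */
          return 0;
  }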
> >>>>>> ...
> >>>>>>
> >>>>>>>
> >>>>>>> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> >>>>>>> +{
> >>>>>>> +	int ret;
> >>>>>>> +	struct llist_node *head;
> >>>>>>> +	struct raw_hwp_page *curr, *next;
> >>>>>>> +
> >>>>>>> +	/*
> >>>>>>> +	 * Since folio is still in the folio_batch, drop the refcount
> >>>>>>> +	 * elevated by filemap_get_folios.
> >>>>>>> +	 */
> >>>>>>> +	folio_put_refs(folio, 1);
> >>>>>>> +	head = llist_del_all(raw_hwp_list_head(folio));
> >>>>>>
> >>>>>> We might race with get_huge_page_for_hwpoison()? llist_add() might
> >>>>>> be called by folio_set_hugetlb_hwpoison() just after
> >>>>>> llist_del_all()?
> >>>>>
> >>>>> Oh, when there is a new UE while we are releasing the folio here,
> >>>>> right?
> >>>>
> >>>> Right.
> >>>>
> >>>>> In that case, would mutex_lock(&mf_mutex) eliminate the potential
> >>>>> race?
> >>>>
> >>>> IMO spin_lock_irq(&hugetlb_lock) might be better.
> >>>
> >>> Looks like I don't need any lock, given the correction below.
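(For the archive, in case this race ever resurfaces: the shape of the fix
I had in mind was roughly the untested sketch below. mf_mutex is the mutex
memory_failure() itself serializes on; it is currently static to
mm/memory-failure.c, so this assumes the drain runs there, or that the
mutex is otherwise made reachable.)

  	/*
  	 * Hold mf_mutex across the drain so that a concurrent
  	 * folio_set_hugetlb_hwpoison() cannot llist_add() a new
  	 * raw_hwp_page entry between llist_del_all() and the final
  	 * folio_put(); otherwise that entry, and the refcount held
  	 * for it, would be leaked.
  	 */
  	mutex_lock(&mf_mutex);
  	head = llist_del_all(raw_hwp_list_head(folio));
  	/* ... drop the per-entry refcounts and dissolve the folio ... */
  	mutex_unlock(&mf_mutex);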
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> +
> >>>>>>> +	/*
> >>>>>>> +	 * Release refcounts held by try_memory_failure_hugetlb, one per
> >>>>>>> +	 * HWPoison-ed page in the raw hwp list.
> >>>>>>> +	 *
> >>>>>>> +	 * Set HWPoison flag on each page so that free_has_hwpoisoned()
> >>>>>>> +	 * can exclude them during dissolve_free_hugetlb_folio().
> >>>>>>> +	 */
> >>>>>>> +	llist_for_each_entry_safe(curr, next, head, node) {
> >>>>>>> +		folio_put(folio);
> >>>>>>
> >>>>>> The hugetlb folio refcnt will only be increased once even if it
> >>>>>> contains multiple UE sub-pages. See __get_huge_page_for_hwpoison()
> >>>>>> for details. So folio_put() might be called more times than
> >>>>>> folio_try_get() in __get_huge_page_for_hwpoison().
> >>>>>
> >>>>> The changes in folio_set_hugetlb_hwpoison() should make
> >>>>> __get_huge_page_for_hwpoison() not take the "out" path, which
> >>>>> decreases the elevated refcount on the folio. IOW, every time a new
> >>>>> UE happens, we handle the hugetlb page as if it were an in-use
> >>>>> hugetlb page.
> >>>>
> >>>> See the code snippet below (comments [1] and [2]):
> >>>>
> >>>> int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> >>>> 				 bool *migratable_cleared)
> >>>> {
> >>>> 	struct page *page = pfn_to_page(pfn);
> >>>> 	struct folio *folio = page_folio(page);
> >>>> 	int ret = 2;	/* fallback to normal page handling */
> >>>> 	bool count_increased = false;
> >>>>
> >>>> 	if (!folio_test_hugetlb(folio))
> >>>> 		goto out;
> >>>>
> >>>> 	if (flags & MF_COUNT_INCREASED) {
> >>>> 		ret = 1;
> >>>> 		count_increased = true;
> >>>> 	} else if (folio_test_hugetlb_freed(folio)) {
> >>>> 		ret = 0;
> >>>> 	} else if (folio_test_hugetlb_migratable(folio)) {
> >>>>
> >>>> 		^^^^ *hugetlb_migratable is checked before trying to get
> >>>> 		the folio refcnt* [1]
> >>>>
> >>>> 		ret = folio_try_get(folio);
> >>>> 		if (ret)
> >>>> 			count_increased = true;
> >>>> 	} else {
> >>>> 		ret = -EBUSY;
> >>>> 		if (!(flags & MF_NO_RETRY))
> >>>> 			goto out;
> >>>> 	}
> >>>>
> >>>> 	if (folio_set_hugetlb_hwpoison(folio, page)) {
> >>>> 		ret = -EHWPOISON;
> >>>> 		goto out;
> >>>> 	}
> >>>>
> >>>> 	/*
> >>>> 	 * Clearing hugetlb_migratable for hwpoisoned hugepages to
> >>>> 	 * prevent them from being migrated by memory hotremove.
> >>>> 	 */
> >>>> 	if (count_increased && folio_test_hugetlb_migratable(folio)) {
> >>>> 		folio_clear_hugetlb_migratable(folio);
> >>>>
> >>>> 		^^^^^ *hugetlb_migratable is cleared when first seeing
> >>>> 		the folio* [2]
> >>>>
> >>>> 		*migratable_cleared = true;
> >>>> 	}
> >>>>
> >>>> Or am I missing something?
> >>>
> >>> Thanks for your explanation! You are absolutely right. It turns out
> >>> the extra refcount I saw (while running hugetlb-mfr.c) on the folio
> >>> at the moment of filemap_offline_hwpoison_folio_hugetlb() is actually
> >>> due to MF_COUNT_INCREASED during MADV_HWPOISON.
> >>> In the past I used to think it was the effect of folio_try_get() in
> >>> __get_huge_page_for_hwpoison(), and that was wrong. Now I see two
> >>> cases:
> >>> - MADV_HWPOISON: instead of __get_huge_page_for_hwpoison(),
> >>> madvise_inject_error() is the one that increments the hugepage
> >>> refcount for every error injected. Different from other cases,
> >>> MFD_MF_KEEP_UE_MAPPED keeps the hugepage an in-use page after
> >>> memory_failure(MF_COUNT_INCREASED), so I think madvise_inject_error()
> >>> should decrement it in the MFD_MF_KEEP_UE_MAPPED case.
> >>> - In the real world: as you pointed out, MF always increments the
> >>> hugepage refcount just once in __get_huge_page_for_hwpoison(), even
> >>> if it runs into multiple errors. When
> >>
> >> This might not always hold true. When MF occurs while the hugetlb
> >> folio is under isolation (hugetlb_migratable is cleared and an extra
> >> folio refcnt is held by the isolating code in that case),
> >> __get_huge_page_for_hwpoison() won't get an extra folio refcnt.
> >>
> >>> filemap_offline_hwpoison_folio_hugetlb() drops the refcount elevated
> >>> by filemap_get_folios(), it only needs to decrement again if
> >>> folio_ref_dec_and_test() returns false. I tested something like
> >>> below:
> >>>
> >>> 	/* drop the refcount elevated by filemap_get_folios. */
> >>> 	folio_put(folio);
> >>> 	if (folio_ref_count(folio))
> >>> 		folio_put(folio);
> >>> 	/* now refcount should be zero. */
> >>> 	ret = dissolve_free_hugetlb_folio(folio);
> >>
> >> So I think the above code might drop the folio refcnt held by the
> >> isolating code.
> >
> > Hi Miaohe, thanks for raising the concern. Given the two things below:
> > - both folio_isolate_hugetlb() and get_huge_page_for_hwpoison() are
> > guarded by hugetlb_lock.
> > - hugetlb_update_hwpoison() only folio_test_set_hwpoison() for a
> > non-isolated folio, after folio_try_get() succeeds.
> >
> > as long as folio_test_set_hwpoison() is true here, this refcount
> > should never come from folio_isolate_hugetlb(). What do you think?
> >
>
> Let's think about the scenario below, where
> __get_huge_page_for_hwpoison() encounters an isolated hugetlb folio:
>
> int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> 				 bool *migratable_cleared)
> {
> 	struct page *page = pfn_to_page(pfn);
> 	struct folio *folio = page_folio(page);
> 	bool count_increased = false;
> 	int ret, rc;
>
> 	if (!folio_test_hugetlb(folio)) {
> 		ret = MF_HUGETLB_NON_HUGEPAGE;
> 		goto out;
> 	} else if (flags & MF_COUNT_INCREASED) {
> 		ret = MF_HUGETLB_IN_USED;
> 		count_increased = true;
> 	} else if (folio_test_hugetlb_freed(folio)) {
> 		ret = MF_HUGETLB_FREED;
> 	} else if (folio_test_hugetlb_migratable(folio)) {
>
> 		^^^^ *Since hugetlb_migratable is cleared for the isolated
> 		hugetlb folio ...*
>
> 		if (folio_try_get(folio)) {
> 			ret = MF_HUGETLB_IN_USED;
> 			count_increased = true;
> 		} else {
> 			ret = MF_HUGETLB_FREED;
> 		}
> 	} else {
>
> 		^^^^ *... code will reach here without an extra refcnt
> 		taken*
>
> 		ret = MF_HUGETLB_RETRY;
> 		if (!(flags & MF_NO_RETRY))
> 			goto out;
> 	}
>
> 	*Code will reach here after the retry*

You are right, thanks for pointing that out. Let me think more about how
to handle this.
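One direction I am considering, just to put the idea on record: make each
raw_hwp_page entry remember whether MF actually pinned the folio for it,
and only drop refcounts for those entries. An untested sketch, where
ref_taken is a hypothetical new field that __get_huge_page_for_hwpoison()
would set from count_increased when the entry is recorded:

  struct raw_hwp_page {
  	struct llist_node node;
  	struct page *page;
  	bool ref_taken;	/* hypothetical: did MF take a folio ref? */
  };

  	llist_for_each_entry_safe(curr, next, head, node) {
  		SetPageHWPoison(curr->page);
  		/* skip entries recorded while the folio was isolated */
  		if (curr->ref_taken)
  			folio_put(folio);
  	}

That would keep the folio_put()s balanced with the references actually
taken, even in the isolation scenario you describe.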
> 	rc = hugetlb_update_hwpoison(folio, page);
> 	if (rc >= MF_HUGETLB_FOLIO_PRE_POISONED) {
> 		ret = rc;
> 		goto out;
> 	}
>
> So hugetlb_update_hwpoison() will be called even for a folio under
> isolation, without folio_try_get(). Or am I missing something?

Just a random question: if MF never increments a hugepage's refcount,
what does the folio_put() in me_huge_page() (when mapping is NULL) do?
Is it dropping a ref taken by something other than MF?

>
> Thanks.
> .