From: Jiaqi Yan <jiaqiyan@google.com>
Date: Mon, 27 Oct 2025 21:17:31 -0700
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
To: Harry Yoo, William Roche
Cc: Ackerley Tng, jgg@nvidia.com, akpm@linux-foundation.org, ankita@nvidia.com, dave.hansen@linux.intel.com, david@redhat.com, duenwen@google.com, jane.chu@oracle.com, jthoughton@google.com, linmiaohe@huawei.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, muchun.song@linux.dev, nao.horiguchi@gmail.com, osalvador@suse.de, peterx@redhat.com, rientjes@google.com, sidhartha.kumar@oracle.com, tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org
References: <20250118231549.1652825-1-jiaqiyan@google.com> <20250919155832.1084091-1-william.roche@oracle.com>
On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo wrote:
>
> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> > >
> > > From: William Roche
> > >
> > > Hello,
> > >
> > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > error is very important, and the possibility described here could be a good
> > > candidate to address this issue.
> >
> > Thanks for expressing interest, William, and sorry for getting back to
> > you so late.
> >
> > >
> > > So I would like to provide my feedback after testing this code with the
> > > introduction of persistent errors in the address space: my tests used a VM
> > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > test program provided with this project. But instead of injecting the errors
> > > with madvise calls from this program, I get the guest physical address of a
> > > location and inject the error from the hypervisor into the VM, so that any
> > > subsequent access to the location is prevented directly at the hypervisor
> > > level.
> >
> > This is exactly what the VMM should do: when it owns or manages the VM
> > memory with MFD_MF_KEEP_UE_MAPPED, it is the VMM's responsibility to
> > isolate the guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > such memory accesses.
> >
> > >
> > > Using this framework, I realized that the code provided here has a problem:
> > > when the error impacts a large folio, the release of this folio doesn't isolate
> > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > a known poisoned page to get_page_from_freelist().
> >
> > Just curious, how exactly can you repro this leaking of a known poisoned
> > page? It may help me debug my patch.
> >
> > >
> > > This revealed some mm limitations, as I would have expected that the
> > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; it is very
> > likely I was just lucky.
> >
> > >
> > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > PageBuddy(page) is true first.
> >
> > Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
> > shouldn't make the call to take_page_off_buddy(page) depend on whether
> > PageBuddy(page) is true. take_page_off_buddy itself checks PageBuddy on
> > the page_head at each page order. So maybe a known poisoned page is
> > somehow not taken off the buddy allocator because of this?
>
> Maybe it's the case where the poisoned page is merged into a larger page,
> and the PGTY_buddy flag is set on the buddy of the poisoned page, so
> PageBuddy() returns false?:
>
> [ free page A ][ free page B (poisoned) ]
>
> When these two are merged, we set PGTY_buddy on page A but not on B.

Thanks Harry! It is indeed this case.
I validated this by adding some debug prints in take_page_off_buddy:

[  193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
[  193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
[  193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
[  193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
[  193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
[  193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
[  193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
[  193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
[  193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
[  193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
[  193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
[  193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1

In this case, the page at 0x2800200 is hwpoisoned, and its buddy page is 0x2800000 with order 10.

> But even after fixing that we need to fix the race condition.

What exactly is the race condition you are referring to?

> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
> >
> > > But according to me it leaves a (small) race condition where a new page
> > > allocation could get a poisoned sub-page between the dissolve phase and the
> > > attempt to remove it from the buddy allocator.
> > >
> > > I do have the impression that a correct behavior (isolating an impacted
> > > sub-page and remapping the valid memory content) using large pages is
> > > currently only achieved with Transparent Huge Pages.
> > > If performance requires using hugetlb pages, then maybe we could accept
> > > losing a huge page after a memory-error-impacted MFD_MF_KEEP_UE_MAPPED memfd
> > > segment is released? If that can easily avoid some other corruption.
> > >
> > > I'm very interested in finding an appropriate way to deal with memory errors on
> > > hugetlbfs pages, and willing to help build a valid solution. This project
> > > showed a real possibility to do so, even in cases where pinned memory is used,
> > > with VFIO for example.
> > >
> > > I would really be interested in knowing your feedback about this project, and
> > > if another solution is considered more adapted to deal with errors on hugetlbfs
> > > pages, please let us know.
> >
> > There is also another possible path if the VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use that splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split the 1G page into 2M and 4k chunks, then unmap only the 4k poisoned
> > page. We still lose the 1G TLB reach, so the VM may be subject to some
> > performance sacrifice.
> >
> > [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> I want to take a closer look at the actual patches but either way sounds
> good to me.
>
> By the way, please Cc me in future revisions :)

For sure!

> Thanks!
>
> --
> Cheers,
> Harry / Hyeonggon