From: Jiaqi Yan <jiaqiyan@google.com>
Date: Wed, 28 Aug 2024 09:23:17 -0700
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gavin Shan, Catalin Marinas, x86@kernel.org, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Sean Christopherson, Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson
References: <20240826204353.2228736-1-peterx@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Wed, Aug 28, 2024 at 7:41 AM Peter Xu wrote:
>
> On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > On Tue, Aug 27, 2024 at 3:57 PM Peter Xu wrote:
> > >
> > > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > > Hi Peter,
> > >
> > > Hi, Jiaqi,
> > >
> > > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> > >
> > > Yes it can, but it'll invoke split_huge_pud(), which by default routes to
> > > removal of the whole pud right now (currently this only covers either DAX
> > > mappings or huge pfnmaps; it won't for anonymous memory if that comes, for
> > > example).
> > >
> > > In that case it'll rely on the driver providing proper fault() /
> > > huge_fault() to refault things back at smaller sizes later when accessed
> > > again.
> >
> > I see, so the driver needs to drive the recovery process, and the code
> > needs to be in the driver.
> >
> > But it seems to me the recovery process will be more or less the same
> > across different drivers? In that case, does it make sense for
> > memory_failure to do the common work for all drivers?
> >
> > Instead of removing the whole pud, can the driver or memory_failure do
> > something similar to a non-struct-page version of split_huge_page, so
> > the driver doesn't need to re-fault good pages back?
>
> I think we can, it's just that we don't yet have a valid use case.
>
> DAX is definitely fault-able.
>
> While for the new huge pfnmap, currently vfio is the only user, and vfio
> only needs to either zap all or map all.  In that case there's no real
> need for what you described yet.  Meanwhile it's also faultable, so
> if / when needed it should hopefully still do the work properly.
>
> I believe it's not a usual requirement for most of the other drivers
> either, as most of them don't even support fault() afaiu. remap_pfn_range()
> can start to use huge mappings, however I'd expect they're mostly not ready
> for random tearing down of any MMIO mappings.
>
> It sounds doable to me when there's a need for what you're describing,
> but I don't think I know the use case well yet.
>
> > > > poisoned in the 1G mapping, it would be great if the mapping could be
> > > > split into 2M mappings + 4k mappings, so only the single poisoned PFN
> > > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > > hugetlb's memory failure handling.)
> > >
> > > Note that we're only talking about MMIO mappings here, in which case the
> > > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > > apply, afaiu.
> >
> > Yes, there won't be any struct page. Ankit proposed this patchset* for
> > handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> > driver adopts your change via the new remap_pfn_range (installing PMD/PUD
> > instead of PTE), and memory_failure_pfn still calls
> > unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> > 0), can it somehow just work with no re-fault needed?
> >
> > * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
>
> I see now, interesting.. Thanks for the link.
>
> In the case of nvgpu usage, one way is to do as you said; we can
> enhance the pmd/pud split for pfnmap, but maybe that's overkill.

Yeah, just wanted to poke to see if splitting pmd/pud is some low-hanging fruit.

>
> I saw that the nvgpu driver will need a fault() anyway to detect poisoned
> PFNs, so it's also feasible that the new nvgrace_gpu_vfio_pci_fault(),
> when it supports huge pfnmaps, will need to detect whether the whole
> faulting range contains any poisoned PFNs, then return FALLBACK if so
> (rather than VM_FAULT_HWPOISON).
>
> E.g., when 4K of 2M is poisoned, we'll erase the 2M completely.  When an
> access happens, as long as the accessed 4K is not on top of the poisoned
> 4k, huge_fault() should still detect that a 4k range is poisoned, so
> it won't inject a pmd but will return FALLBACK; then fault() will see
> that the accessed 4k range is not poisoned and install a pte.

Thanks for illustrating the re-fault flow again. I think this should
work well for drivers (having large MMIO sizes) that care about memory
errors. We can put the pmd/pud split idea on the backlog and see if it
is needed in the future.
>
> Thanks,
>
> --
> Peter Xu
>