From: James Houghton
Date: Wed, 6 Mar 2024 15:24:04 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
To: Peter Xu
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Muchun Song

On Thu, Feb 29, 2024 at 7:11 PM Peter Xu wrote:
>
> Hey, James,
>
> On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> > No matter what, we'll need to add (more) PUD support into the main mm,
> > so we could start with that, though it won't be easy. Then we would
> > need at least...
> >
> > (1) ...a filesystem that implements huge_fault for PUDs
> >
> > It's not inconceivable to add support for this in shmem (where 1G
> > pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> > This could be done in hugetlbfs, but then you'd have to make sure that
> > the huge_fault implementation stays compatible with everything else in
> > hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> > you could create hugetlbfs-v2. I'm honestly not sure which of these is
> > the least difficult -- probably the shmem route?
>
> IMHO hugetlb fault path can be the last to tackle; there seem to have other
> lower hanging fruits that are good candidates for such unifications works.
>
> For example, what if we can reduce customized hugetlb paths from 20 -> 2,
> where the customized fault() will be 1 out of the 2? To further reduce
> that 2 paths we may need a new file system, but if it's good enough maybe
> we don't need v2, at least not for someone looking for a cleanup: that is
> more suitable who can properly define the new interface first, and it can
> be much more work than an unification effort, also orthogonal in some way.

This is a fine approach to take. At the same time, I think the
separate fault path is the most important difference between hugetlb
and main mm, so if we're doing a bunch of work to unify hugetlb with
mm (like, 20 -> 2 special paths), it'd be kind of a shame not to go
all the way. But I'm not exactly doing the work here. :)

(The other huge piece that I'd want unified is the huge_pte
architecture-specific functions, that's probably #2 on my list.)

> >
> > (2) ...a mapcount (+refcount) system that works for PUD mappings.
> >
> > This discussion has progressed a lot since I last thought about it;
> > I'll let the experts figure this one out[1].
>
> I hope there will be an solid answer there.
>
> Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
> underneath. I still think it's a good plan, which may not apply to mTHP
> but could be perfectly efficient & simple to hugetlb. The complexity lies
> in elsewhere other than the counting itself but I had a feeling it's still
> a workable solution.
> >
> > Anyway, I'm oversimplifying things, and it's been a while since I've
> > thought hard about this, so please take this all with a grain of salt.
> > The main motivating use-case for HGM (to allow for post-copy live
> > migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> > other ways[2].
>
> Do you know how far David went in that direction? When there will be a
> prototype? Would it easily work with MISSING faults (not MINOR)?

A prototype will come eventually. :)

It's valid for a user to use KVM-based demand paging with userfaultfd,
MISSING or MINOR. For MISSING, you could do:
- Upon getting a KVM fault, KVM_RUN will exit to userspace.
- Fetch the page, install it with UFFDIO_COPY, then mark the page as
  present with KVM.

KVM-based demand paging is redundant with userfaultfd in this case though.

With minor faults, the equivalent approach would be:
- Map memory twice. Register one with userfaultfd. The other ("alias
  mapping") will be used to install memory.
- Use the userfaultfd-registered mapping to build the KVM memslots.
- Upon getting a KVM fault, KVM_RUN will exit to userspace.
- Fetch the page, install it by copying it into the alias mapping, then
  UFFDIO_CONTINUE the KVM mapping, then mark the page as present with
  KVM.

We can be a little more efficient with MINOR faults, provided we're
confident that KVM-based demand paging works properly:
- Map memory twice. Register one with userfaultfd.
- Give KVM the alias mapping, so we won't get userfaults on it. All
  other components get the userfaultfd-registered mapping.
- KVM_RUN exits to userspace.
- Fetch the page, install it in the pagecache. Mark it as present with
  KVM.
- If other components get userfaults, fetch the page (if it still needs
  to be fetched), then UFFDIO_CONTINUE to unblock them.

Now userfaultfd and KVM-based demand paging are no longer redundant.
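
To make the "map memory twice" step in the MINOR flows concrete, here is a
minimal userspace sketch in Python. It only shows the aliasing itself: two
shared mappings of the same backing file, where a page written through the
alias becomes visible through the other mapping. It does not register
anything with userfaultfd and does not talk to KVM; in a real VMM the
primary mapping would be registered for minor faults and the install would
be followed by UFFDIO_CONTINUE. The helper names (make_double_mapping,
install_page, read_page) are hypothetical, and a temporary file stands in
for whatever memfd or hugetlbfs file the VMM would actually use.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def make_double_mapping(npages):
    """Map the same file-backed memory twice and return (primary, alias).

    Assumption: 'primary' stands in for the userfaultfd-registered mapping
    and 'alias' for the mapping used to install page contents.
    """
    size = npages * PAGE
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)  # sparse file; untouched pages read as zeros
        # The mmap module maps MAP_SHARED by default on Unix, so both
        # objects are views of the same page cache pages.
        primary = mmap.mmap(backing.fileno(), size)
        alias = mmap.mmap(backing.fileno(), size)
    # The mappings stay valid after the file object is closed.
    return primary, alias


def install_page(alias, index, data):
    """Copy one page of fetched contents in through the alias mapping."""
    if len(data) != PAGE:
        raise ValueError("data must be exactly one page")
    start = index * PAGE
    alias[start:start + PAGE] = data


def read_page(primary, index):
    """Read one page back through the primary mapping."""
    start = index * PAGE
    return primary[start:start + PAGE]
```

Installing through the alias never touches the primary mapping's page
tables, which is why the userfaultfd-registered side can still take a
minor fault on a page whose contents are already in the pagecache.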
Furthermore, if a user can guarantee that all other components are
able to properly participate in migration without userfaultfd (i.e.,
they are explicitly aware of demand paging), then the need for
userfaultfd is removed.

This is just like KVM's own dirty logging vs. userfaultfd-wp.

>
> I will be more than happy to see whatever solution come up from kernel that
> will resolve that pain for VMs first. It's unfortunate KVM will has its
> own solution for hugetlb small mappings, but I also understand there's more
> than one demand to that besides hugetlb on 1G (even though I'm not 100%
> sure of that demand when I think it again today: is it a worry that the
> pgtable pages will take a lot of space when trapping minor-faults? I
> haven't yet got time to revisit David's proposal there in the past two
> months; nor do I think I fully digested the details back then).

In my view, the main motivating factor is that userfaultfd is
inherently incompatible with guest_memfd. We talked a bit about the
potential to do a file-based userfaultfd, but it's very unclear how
that would work.

But a KVM-based demand paging system would be able to help with:
- post-copy for HugeTLB pages
- reducing unnecessary work/overhead in mm (both minor faults and
  missing faults).

The "unnecessary" work/overhead:
- shattered mm page tables as well as shattered EPT, whereas with a
  KVM-based solution, only the EPT is shattered.
- must collapse both mm page tables and EPT at the end of post-copy,
  instead of only the EPT
- mm page tables are mapped during post-copy, when they could be
  completely present to begin with

You could make collapsing as efficient as possible (like, if possible,
have an mmu_notifier_collapse() instead of using
invalidate_start/end, so that KVM can do the fastest possible
invalidations), but we're fundamentally doing more work with
userfaultfd.

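
As a rough illustration of the "shattered page tables" cost, the sketch
below does the arithmetic for the worst case, where every 1G region ends
up fully mapped at 4K. It assumes x86-64-style geometry (512 entries per
table, one 4 KiB page per table) for both the mm page tables and the EPT;
that geometry and the function names are assumptions for this example,
not something measured in this thread. The "hierarchies" parameter is 2
when both mm page tables and EPT are shattered (the userfaultfd case) and
1 when only the EPT is (the KVM-based case).

```python
ENTRIES_PER_TABLE = 512  # assumed: 9 address bits per level
TABLE_BYTES = 4096       # assumed: each page-table page is one 4 KiB page
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def extra_tables_per_gib(granule):
    """Page-table pages needed below one PUD-level entry to map 1 GiB."""
    if granule == GIB:
        return 0                      # one PUD-level leaf, nothing below it
    if granule == 2 * MIB:
        return 1                      # one PMD-level table of 512 leaves
    if granule == 4 * KIB:
        return 1 + ENTRIES_PER_TABLE  # one PMD table plus 512 PTE tables
    raise ValueError("unsupported granule")


def shatter_overhead_bytes(guest_gib, granule, hierarchies):
    """Worst-case bytes of extra page tables for a fully shattered guest."""
    return (guest_gib * extra_tables_per_gib(granule)
            * TABLE_BYTES * hierarchies)
```

Under these assumptions a fully shattered 1G region costs 513 extra table
pages, roughly 2 MiB, per hierarchy, so shattering both mm page tables
and EPT doubles the table memory compared to shattering the EPT alone.
The collapse work at the end of post-copy scales the same way.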
> The answer to above could also help me to prioritize my work, e.g., hugetlb
> unification is probably something we should do regardless, at least for the
> sake of a healthy mm code base. I have plan to move HGM or whatever it
> will be called to upstream if necessary, but it can also depends on how
> fast the other project goes, as personally I don't yet worry on hugetlb
> hwpoison yet (at least QEMU's hwpoison handling is still pretty much
> broken.. which is pretty unfortunate), but maybe any serious cloud provide
> still should care.

My hope with the unification is that HGM almost becomes a byproduct of
that effort. :)

The hwpoison case (in my case) is also solved with a KVM-based demand
paging system: we can use it to prevent access to the page, but
instead of demand-fetching, we inject poison. (We need HugeTLB to keep
mapping the page though.)