From: James Houghton <jthoughton@google.com>
Date: Thu, 9 Mar 2023 10:05:12 -0800
Subject: Re: [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs
To: Peter Xu
Cc: Mike Kravetz, Hugh Dickins, Muchun Song, "Matthew Wilcox (Oracle)",
 Andrew Morton, "Kirill A. Shutemov", David Hildenbrand, David Rientjes,
 Axel Rasmussen, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20230306230004.1387007-1-jthoughton@google.com>

On Wed, Mar 8, 2023 at 2:10 PM Peter Xu wrote:
>
> On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > case, merge HugeTLB's mapcount scheme with THP's.
> >
> > The first patch of this series comes from the HugeTLB high-granularity
> > mapping series[1], though with some updates, as the original version
> > was buggy[2] and incomplete.
> >
> > I am sending this change as part of this smaller series in hopes that it
> > can be more thoroughly scrutinized.
> >
> > I haven't run any THP performance tests with this series applied.
> > HugeTLB pages don't currently support being mapped with
> > `compound=false`, but this mapcount scheme will make collapsing
> > compound=false mappings in HugeTLB pages quite slow. This can be
> > optimized with future patches (likely by taking advantage of HugeTLB's
> > alignment guarantees).
> >
> > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > the use of each subpage's mapcount. If this series is applied, Matthew's
> > new scheme will automatically apply to HugeTLB pages.
>
> Is this the plan?
>
> I may have not followed closely on the latest development of Matthew's
> idea. The thing is if the design requires ptes being installed / removed
> at the same time for the whole folio, then it may not work directly for HGM
> if HGM wants to support at least postcopy, iiuc, because if we install the
> whole folio ptes at the same time it seems to beat the whole purpose of
> having HGM..

My understanding is that it doesn't *require* all the PTEs in a folio to
be mapped at the same time. I don't see how it possibly could, given
that UFFDIO_CONTINUE exists (which can already create PTE-mapped THPs
today).

It would be faster to populate all the PTEs at the same time: you would
only need to traverse the page table once for the entire group to see
whether you should be incrementing the mapcount.

Though, with respect to unmapping, if the PTEs aren't all unmapped at
the same time, you could end up with a case where the mapcount is still
incremented but nothing is actually mapped. I'm not really sure what
should be done there, but this problem applies to PTE-mapped THPs the
same way that it applies to HGM-mapped HugeTLB pages.

> The patch (especially patch 1) looks good. So it's a pure question just to
> make sure we're on the same page. IIUC your other mapcount proposal may
> work, but it still needs to be able to take care of ptes in less-than-folio
> sizes whatever it'll look like at last.

By my "other mapcount proposal", I assume you mean using the
PAGE_SPECIAL bit to track whether the mapcount has been incremented or
not. It really only serves as an optimization for Matthew's scheme (see
[2] below for some more thoughts), and it doesn't have to apply only to
HugeTLB.

I originally thought[1] that Matthew's scheme would be really painful
for postcopy with HGM without this optimization, but it's actually not
so bad. Let's assume the worst case: we're UFFDIO_CONTINUEing from the
end of the page to the beginning, like in [1]:

First CONTINUE: pvmw finds an empty PUD, so it quickly returns false.
Second CONTINUE: pvmw finds 511 empty PMDs, then 511 empty PTEs, then a
present PTE (from the first CONTINUE).
Third CONTINUE: pvmw finds 511 empty PMDs, then 510 empty PTEs.
...
514th CONTINUE: pvmw finds 510 empty PMDs, then 511 empty PTEs.

So it will be slow, but it won't have to check 262k empty PTEs per
CONTINUE (though you could make that possible with MADV_DONTNEED). Even
an HGM implementation that only allows PTE-mapping of HugeTLB pages
should behave just like this.
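
(For concreteness, the arithmetic behind those counts, assuming x86-64
with 4KiB base pages and a 1GiB HugeTLB page. This is just a standalone
userspace illustration, not kernel code, and the variable names are made
up:)

  #include <stdio.h>

  int main(void)
  {
          /* x86-64 with 4KiB base pages: 512 entries per table level. */
          const unsigned long ptrs_per_pte = 512;
          const unsigned long ptrs_per_pmd = 512;

          /* 4KiB PTEs needed to map one 1GiB hugepage: 512 * 512 = 262,144. */
          printf("PTEs per 1GiB hugepage: %lu\n",
                 ptrs_per_pte * ptrs_per_pmd);

          /*
           * Worst case for a single CONTINUE in the backwards-fill
           * scenario above: skip ~511 empty PMDs plus ~511 empty PTEs,
           * i.e. roughly 1k entries instead of 262k.
           */
          printf("Worst-case entries scanned per CONTINUE: %lu\n",
                 (ptrs_per_pmd - 1) + (ptrs_per_pte - 1));
          return 0;
  }
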
> A trivial comment on patch 2 since we're at it: does "a future plan on some
> arch to support 512GB huge page" justify itself? It would be better
> justified, IMHO, when that support is added (and decided to use HGM)?

That's fine with me. I'm happy to drop that patch.

> What I feel like is missing (rather than patch 2 itself) is some guard to
> make sure thp mapcountings will not be abused with new hugetlb sizes
> coming.
>
> How about another BUG_ON() squashed into patch 1 (probably somewhere in
> page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> COMPOUND_MAPPED / 2?

Sure, I can add that (a rough sketch is appended at the end of this
mail). Thanks, Peter!

- James

[1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/

[2]: Some details on what the optimization might look like.

An excerpt of Matthew's scheme would look something like this:

  /* If we're mapping < folio_nr_pages(folio) worth of PTEs. */
  if (!folio_has_ptes(folio, vma))
          atomic_inc(&folio->_mapcount);

where folio_has_ptes() is defined like:

  if (!page_vma_mapped_walk(...))
          return false;
  page_vma_mapped_walk_done(...);
  return true;

You might be able to optimize folio_has_ptes() with a block like this
at the beginning:

  if (folio_is_naturally_aligned(folio, vma)) {
          /* Optimization for naturally aligned folios. */
          if (folio_test_hugetlb(folio)) {
                  /*
                   * Check the hstate-level PTE, and do a similar check
                   * to the one below.
                   */
          }
          /* For naturally aligned THPs: */
          pmdp = mm_find_pmd(...);  /* or just pass it in */
          pmd = READ_ONCE(*pmdp);
          BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
          if (pmd_special(pmd))
                  return true;
          /* We already hold the PTL for the PTE. */
          ptl = pmd_lock(mm, pmdp);
          /* Test and set pmd_special. */
          pmd_unlock(ptl);
          return if_we_set_pmd_special;
  }

(pmd_special() doesn't currently exist.)

If HugeTLB walking code can be merged with generic mm, then HugeTLB
wouldn't have a special case at all here.
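
As mentioned above, a minimal sketch of the guard Peter suggested. The
helper name and exact placement are hypothetical, and I'm reading
"folio_size()" as the number of subpages (folio_nr_pages()) here:

  /*
   * Hypothetical helper; could be squashed into patch 1 and called from
   * page_add_file_rmap() / page_add_anon_rmap().
   */
  static inline void folio_mapcount_size_check(struct folio *folio)
  {
          /*
           * Keep the number of subpages well below COMPOUND_MAPPED so
           * that per-subpage mapcounts can never be confused with the
           * COMPOUND_MAPPED marker in the subpage mapcount counter.
           */
          VM_BUG_ON_FOLIO(folio_nr_pages(folio) >= COMPOUND_MAPPED / 2,
                          folio);
  }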