Date: Thu, 9 Mar 2023 14:29:47 -0500
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, Hugh Dickins, Muchun Song, "Matthew Wilcox (Oracle)",
    Andrew Morton, "Kirill A . Shutemov", David Hildenbrand, David Rientjes,
    Axel Rasmussen, Jiaqi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs
References: <20230306230004.1387007-1-jthoughton@google.com>

On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote:
> On Wed, Mar 8, 2023 at 2:10 PM Peter Xu wrote:
> >
> > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > > case, merge HugeTLB's mapcount scheme with THP's.
> > >
> > > The first patch of this series comes from the HugeTLB high-granularity
> > > mapping series[1], though with some updates, as the original version
> > > was buggy[2] and incomplete.
> > >
> > > I am sending this change as part of this smaller series in hopes that it
> > > can be more thoroughly scrutinized.
> > >
> > > I haven't run any THP performance tests with this series applied.
> > > HugeTLB pages don't currently support being mapped with
> > > `compound=false`, but this mapcount scheme will make collapsing
> > > compound=false mappings in HugeTLB pages quite slow. This can be
> > > optimized with future patches (likely by taking advantage of HugeTLB's
> > > alignment guarantees).
> > >
> > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > > the use of each subpage's mapcount. If this series is applied, Matthew's
> > > new scheme will automatically apply to HugeTLB pages.
> >
> > Is this the plan?
> >
> > I may not have followed the latest development of Matthew's idea
> > closely.  The thing is, if the design requires ptes being installed /
> > removed at the same time for the whole folio, then it may not work
> > directly for HGM if HGM wants to support at least postcopy, iiuc,
> > because installing the whole folio's ptes at the same time seems to
> > defeat the whole purpose of having HGM.
>
> My understanding is that it doesn't *require* all the PTEs in a folio
> to be mapped at the same time. I don't see how it possibly could,
> given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
> THPs today). It would be faster to populate all the PTEs at the same
> time (you would only need to traverse the page table once for the
> entire group to see if you should be incrementing mapcount).
>
> Though, with respect to unmapping, if PTEs aren't all unmapped at the
> same time, then you could end up with a case where mapcount is still
> incremented but nothing is really mapped. I'm not really sure what
> should be done there, but this problem applies to PTE-mapped THPs the
> same way that it applies to HGMed HugeTLB pages.
>
> > The patch (especially patch 1) looks good.  So it's a pure question just to
> > make sure we're on the same page.  IIUC your other mapcount proposal may
> > work, but it still needs to be able to take care of ptes in less-than-folio
> > sizes, whatever it ends up looking like.
>
> By my "other mapcount proposal", I assume you mean the "using the
> PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
> really only serves as an optimization for Matthew's scheme (see below
> [2] for some more thoughts), and it doesn't have to only apply to
> HugeTLB.
>
> I originally thought[1] that Matthew's scheme would be really painful
> for postcopy for HGM without this optimization, but it's actually not
> so bad.
> Let's assume the worst case, that we're UFFDIO_CONTINUEing
> from the end to the beginning, like in [1]:
>
> First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
> Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
> then finds a present PTE (from the first CONTINUE).
> Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
> ...
> 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
>
> So it'll be slow, but it won't have to check 262k empty PTEs per
> CONTINUE (though you could make this possible with MADV_DONTNEED).
> Even with an HGM implementation that only allows PTE-mapping of
> HugeTLB pages, it should still behave just like this, too.
>
> > A trivial comment on patch 2 since we're at it: does "a future plan on some
> > arch to support 512GB huge page" justify itself?  It would be better
> > justified, IMHO, when that support is added (and decided to use HGM)?
>
> That's fine with me. I'm happy to drop that patch.
>
> > What I feel like is missing (rather than patch 2 itself) is some guard to
> > make sure thp mapcountings will not be abused with new hugetlb sizes
> > coming.
> >
> > How about another BUG_ON() squashed into patch 1 (probably somewhere in
> > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> > COMPOUND_MAPPED / 2?
>
> Sure, I can add that.
>
> Thanks, Peter!
>
> - James
>
> [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/
>
> [2]: Some details on what the optimization might look like:
>
> So an excerpt of Matthew's scheme would look something like this:
>
>     /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
>     if (!folio_has_ptes(folio, vma))
>         atomic_inc(&folio->_mapcount);
>
> where folio_has_ptes() is defined like:
>
>     if (!page_vma_mapped_walk(...))
>         return false;
>     page_vma_mapped_walk_done(...);
>     return true;
>
> You might be able to optimize folio_has_ptes() with a block like this
> at the beginning:
>
>     if (folio_is_naturally_aligned(folio, vma)) {
>         /* optimization for naturally-aligned folios. */
>         if (folio_test_hugetlb(folio)) {
>             /* check hstate-level PTE, and do a similar check as below. */
>         }
>         /* for naturally-aligned THPs: */
>         pmdp = mm_find_pmd(...); /* or just pass it in. */
>         pmd = READ_ONCE(*pmdp);
>         BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
>         if (pmd_special(pmd))
>             return true;
>         /* we already hold the PTL for the PTE. */
>         ptl = pmd_lock(mm, pmdp);
>         /* test and set pmd_special */
>         pmd_unlock(ptl);
>         return if_we_set_pmd_special;
>     }
>
> (pmd_special() doesn't currently exist.) If HugeTLB walking code can
> be merged with generic mm, then HugeTLB wouldn't have a special case
> at all here.

I see what you mean now, thanks.  That looks fine.  I just suspect the
pte_special trick will still be needed if this starts to apply to HGM, as
it still does not seem to suit a large folio size perfectly.  The
MADV_DONTNEED worst case of having it loop over ~folio_size() worth of
none ptes is still possible.

-- 
Peter Xu
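
For reference, the walk costs discussed above can be reproduced with a
small user-space model.  This is only a sketch, not kernel code: it assumes
x86-64 with 4 KiB base pages (512 entries per table) and a single 1 GiB
HugeTLB page, and struct model, struct cost, walk(), cost_before_continue()
and cost_after_dontneed() are hypothetical names made up for illustration.
walk() is a simplified stand-in for page_vma_mapped_walk() that only counts
the empty entries it has to skip before it finds a present PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PTRS 512 /* entries per page table; assumes x86-64 with 4 KiB pages */

/* Toy model of the page tables under one 1 GiB PUD entry (hypothetical). */
struct model {
	bool pud_present;
	bool pmd_present[PTRS];
	bool pte_present[PTRS][PTRS];
};

/* What a walk had to skip, and whether it found a present PTE. */
struct cost {
	unsigned long empty_pmds;
	unsigned long empty_ptes;
	bool found;
};

/* Simplified stand-in for page_vma_mapped_walk(): stop at first present PTE. */
static struct cost walk(const struct model *m)
{
	struct cost c = { 0, 0, false };

	if (!m->pud_present)
		return c; /* empty PUD: return quickly */

	for (int i = 0; i < PTRS; i++) {
		if (!m->pmd_present[i]) {
			c.empty_pmds++;
			continue;
		}
		for (int j = 0; j < PTRS; j++) {
			if (m->pte_present[i][j]) {
				c.found = true;
				return c;
			}
			c.empty_ptes++;
		}
	}
	return c;
}

/* Install one PTE, populating the upper levels as needed. */
static void map_pte(struct model *m, int pmd_idx, int pte_idx)
{
	m->pud_present = true;
	m->pmd_present[pmd_idx] = true;
	m->pte_present[pmd_idx][pte_idx] = true;
}

/*
 * Cost of the walk done for the n-th CONTINUE (1-based) when the page is
 * being filled from the end to the beginning, i.e. after n - 1 PTEs have
 * already been installed.
 */
static struct cost cost_before_continue(int n)
{
	static struct model m;

	assert(n >= 1 && n <= PTRS * PTRS + 1);
	memset(&m, 0, sizeof(m));
	for (int k = 0; k < n - 1; k++) {
		int idx = PTRS * PTRS - 1 - k;

		map_pte(&m, idx / PTRS, idx % PTRS);
	}
	return walk(&m);
}

/*
 * MADV_DONTNEED-style worst case as modelled here: every PTE table is
 * still allocated, but every PTE in it is none.
 */
static struct cost cost_after_dontneed(void)
{
	static struct model m;

	memset(&m, 0, sizeof(m));
	m.pud_present = true;
	for (int i = 0; i < PTRS; i++)
		m.pmd_present[i] = true;
	return walk(&m);
}
```

With this model, cost_before_continue(2) reports 511 empty PMDs and 511
empty PTEs, and cost_before_continue(514) reports 510 empty PMDs and 511
empty PTEs, as in the list quoted above, while cost_after_dontneed() has to
scan all 512 * 512 = 262144 none PTEs without finding anything.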