Date: Wed, 8 Feb 2023 11:16:02 -0500
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, "Dr. David Alan Gilbert",
	"Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang,
	Miaohe Lin, Yang Shi, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:13 PM Peter Xu wrote:
> >
> > James,
> >
> > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > Here is the result: [1] (sorry it took a little while heh). The
> > Thanks.  From what I can tell, that number shows that it'll be great we
> > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > Matthew for generic folio.
>
> Do you think the RFC v1 way is better than doing the THP-like way
> *with the additional MMU notifier*?

What's the additional MMU notifier you're referring to?

>
> >
> > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> >
> > Any more information on why it's horrible? :)
>
> I figured the code would speak for itself, heh. It's quite complicated.
>
> I really didn't like:
> 1. The 'inc' business in copy_hugetlb_page_range.
> 2. How/where I call put_page()/folio_put() to keep the refcount and
>    mapcount synced up.
> 3. Having to check the page cache in UFFDIO_CONTINUE.

I think the complexity is one thing which I'm fine with so far.  However,
when I think again about the things behind that complexity, I noticed there
may be at least one flaw that may not be trivial to work around.

It's about truncation.
The problem is that we now use the pgtable entry to represent the mapcount,
but that pgtable entry cannot be zapped easily unless the vma is unmapped or
collapsed.

It means e.g. truncate_inode_folio() may stop working for hugetlb (of
course, with the page lock held).  The mappings will be removed for real,
but the mapcount for HGM will not be, because unmap_mapping_folio() only
zaps the pgtable leaves, not the entries that we use to account for
mapcounts.

So the kernel may see weird things, like mapcount>0 after
truncate_inode_folio() has finished completely.

For HGM to do the right thing, we may want to also remove the non-leaf
entries when truncating or doing similar things like a rmap walk to drop
any mappings for a page/folio.  That's not doable for now, though, because
the locks that truncate_inode_folio() holds are weaker than what we need to
free the pgtable non-leaf entries - we'll need the mmap write lock for
that, the same as when we unmap or collapse.

Matthew's design doesn't have this issue if the ptes need to be populated,
because the mapcount is still with the leaves; that's not the case for us
here.

If that's the case, _maybe_ we still need to start with the stupid but
working approach of subpage mapcounts.

[...]

> > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > I haven't figured out how we can apply Matthews's approach to HGM
> > > > right now, but there probably is a way. (If we left the mapcount
> > > > increment bits in the same place, we couldn't just check the
> > > > hstate-level PTE; it would have already been made present.)
> >
> > I'm just worried that (1) this may add yet another dependency to your work
> > which is still during discussion phase, and (2) whether the folio approach
> > is easily applicable here, e.g., we may not want to populate all the ptes
> > for hugetlb HGMs by default.
>
> That's true. I definitely don't want to wait for this either.
> It seems like Matthew's approach won't work very well for us -- when
> doing a lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking
> all the PTEs to see if any of them are mapped would get really slow.

I think it'll be a common problem for userfaultfd when that comes; e.g.,
userfaultfd is by design PAGE_SIZE based so far.  It needs page-size
granularity on pgtable manipulations, unless we extend the userfaultfd
protocol to support folios, iiuc.

>
> >
> > > >
> > > > We could:
> > > > - use the THP-like way and tolerate ~1 second collapses
> > >
> > > Another thought here. We don't necessarily *need* to collapse the page
> > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > we aren't punching any holes, and we aren't changing permission bits.
> > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > the page tables *after* we finished collapsing, then it would be ok
> > > for hugetlb_collapse() to be slow.
> >
> > That's a great point!  It'll definitely apply to either approach.
> >
> > >
> > > If this MMU notifier is something that makes sense, it probably
> > > applies to MADV_COLLAPSE for THPs as well.
> >
> > THPs are definitely different, mmu notifiers should be required there,
> > afaict.  Isn't that what the current code does?
> >
> > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
>
> Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> properly make THPs. Thanks. But it would apply if we were only
> collapsing PTE-mapped THPs, I think?

Yes, it applies I think.  And if I'm not wrong it's also doing so. :)  See
collapse_pte_mapped_thp().  For anon, though, we always allocate a new
page, hence it's not applicable.

-- 
Peter Xu
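
A toy model may make the truncation concern in this message more concrete.
This is only a sketch, not kernel code and not anything posted in the
thread: ToyFolio, ToyHstateEntry, map_leaf(), zap_leaves() and
free_nonleaf() are hypothetical names invented for illustration.  It
assumes, as the message describes, that the mapcount is accounted on the
non-leaf (hstate-level) entry, and that a truncation-style zap clears only
the leaves.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFolio:
    """Hypothetical stand-in for a hugetlb folio; only tracks a mapcount."""
    mapcount: int = 0


@dataclass
class ToyHstateEntry:
    """Hypothetical stand-in for the non-leaf (hstate-level) pgtable entry."""
    present: bool = False
    leaves: set = field(default_factory=set)  # indices of mapped leaf ptes


def map_leaf(folio, entry, index):
    """Map one leaf; the mapcount is taken once, when the non-leaf entry
    first becomes present (the assumption under discussion)."""
    if not entry.present:
        entry.present = True
        folio.mapcount += 1
    entry.leaves.add(index)


def zap_leaves(entry):
    """Model a zap that clears the leaves only and keeps the non-leaf entry,
    which is what the message says happens on truncation."""
    entry.leaves.clear()


def free_nonleaf(folio, entry):
    """Model unmap/collapse: the non-leaf entry is freed, so the mapcount
    it was accounting for can finally be dropped."""
    if entry.present:
        entry.present = False
        entry.leaves.clear()
        folio.mapcount -= 1


if __name__ == "__main__":
    folio, entry = ToyFolio(), ToyHstateEntry()
    for i in range(4):
        map_leaf(folio, entry, i)
    zap_leaves(entry)
    # No leaves remain, yet the mapcount is still held by the non-leaf entry.
    print(folio.mapcount, len(entry.leaves))  # prints "1 0"
    free_nonleaf(folio, entry)
    print(folio.mapcount)  # prints "0"
```

In this model the mapcount only returns to zero once the non-leaf entry is
freed, which corresponds to the unmap/collapse path that needs the mmap
write lock.  A scheme that accounts the mapcount per leaf would drop it in
zap_leaves() instead.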