From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Jun 2023 14:58:43 -0400
From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: David Hildenbrand <david@redhat.com>, Michal Hocko <mhocko@suse.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	David Rientjes <rientjes@google.com>, linux-mm@kvack.org,
	John Hubbard <jhubbard@nvidia.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>
Subject: Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM
 Convergence on Wednesday
Message-ID: <ZItfYxrTNL4Mu/lo@x1n>
References: <c5afdf35-a5fa-03e2-348d-cf1d990fc389@google.com>
 <20230614230458.GB3559@monkey>
 <ZIrGKKpFTKpxCUN1@dhcp22.suse.cz>
 <141b7088-684b-32dc-efe4-03713d38ae28@redhat.com>
 <CADrL8HVgFLb5NWGSpEg3GPMsOFv_U+upHmOYtgZnjmi6=p+zeA@mail.gmail.com>
MIME-Version: 1.0
In-Reply-To: <CADrL8HVgFLb5NWGSpEg3GPMsOFv_U+upHmOYtgZnjmi6=p+zeA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Jun 15, 2023 at 10:24:24AM -0700, James Houghton wrote:
> On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 15.06.23 10:04, Michal Hocko wrote:
> > > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > >> On 06/12/23 18:59, David Rientjes wrote:
> > >>> This week's topic will be a technical brainstorming session on HugeTLB
> > >>> convergence with the core MM.  This has been discussed most recently in
> > >>> this thread:
> > >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> > >>
> > >> Thank you David for putting this session together!  And, thanks to everyone
> > >> who participated.
> > >>
> > >> Following up on linux-mm with most active participants on Cc (sorry if I
> > >> missed someone).   If it makes more sense to continue the above thread,
> > >> please move there.
> > >>
> > >> Even though everyone knows that hugetlb is special cased throughout the
> > >> core mm, it came to a head with the proposed introduction of HGM.  TBH,
> > >> few people in the core mm community paid much attention to HGM when first
> > >> introduced.  A LSF/MM session was then dedicated to the discussion of
> > >> HGM with the outcome being the suggestion to create a new filesystem/driver
> > >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> > >> One thing that was not emphasized at LSF/MM is that there are existing
> > >> hugetlb users experiencing major issues that could be addressed with HGM:
> > >> specifically the issues of memory errors and live migration.  That was
> > >> the starting point for recent discussion in the above thread.
> > >>
> > >> I may be wrong, but it appeared the direction of that thread was to
> > >> first try and unify some of the hugetlb and core mm code.  Eliminate
> > >> some of the special casing.  If hugetlb was less of a special case, then
> > >> perhaps HGM would be more acceptable.  That is the impression I (perhaps
> > >> incorrectly) had going into today's session.
> > >
> > > My impression from the discussion yesterday was that the level of
> > > unification would need to be really large and time consuming in order to
> > > be useful for the HGM patchset to be in a more maintainable form. The
> > > final outcome is quite hard to predict at this stage.
> > >
> > >> During today's session, we often discussed what would/could be introduced
> > >> in a hugetlb v2.  The idea is that this would be the ideal place for HGM.
> > >> However, people also made the comparisons to cgroup v1 - v2.  Such a
> > >> redesign provides the needed 'clean slate' to do things right, but it
> > >> does little for existing users who would be unwilling to quickly move off
> > >> existing hugetlb.
> > >>
> > >> We did spend a good chunk of time on hugetlb/core mm unification and
> > >> removing special casing.  In some (most) of these cases, the benefit of
> > >> removing special cases from core mm would result in adding more code to
> > >> hugetlb.  For example: proper type'ing so that hugetlb does not treat
> > >> all page table entries as PTEs.  Again, I may be wrong but I think
> > >> people were OK with adding more code (and even complexity) to hugetlb
> > >> if it eliminated special casing in the core mm.  But, there did not
> > >> seem to be a clear concensus especially with the thought that we may
> > >> need to double hugetlb code to get types right.
> > >
> > > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > > over complicated in its current form already. Regression are not really
> > > seldom when code is added which is a signal we are hitting maintenance
> > > cost walls. This doesn't mean further development is impossible of
> > > course but it is increasingly more costly AFAICS.
> > >
> > >> Unless I missed something, there was no clear direction at the end of this
> > >> session.  I was hoping that we could come up with a plan to address the
> > >> issues facing today's hugetlb users.  IMO, there seems to be two options:
> > >> 1) Start work on hugetlb v2 with the intention that customers will need
> > >>     to move to this to address their issues.
> > >> 2) Incorporate functionality like HGM into existing hugetlb.
> > >
> >
> > I fully agree with all that Michal said.
> >
> > I'm just going to add that I don't see why anyone would look into a
> > hugetlbv2 if we're going to use the motivation of "help existing users"
> > to make hugetlb ever-more complicated and special. "existing users" her
> > even meaning "people use hugetlb for backing VMs. Now they want to get
> > postcopy working with less latency." -- which I consider partially a new
> > use case.
> >
> > So working on adding HGM and concurrently starting a hugetlbv2? I don't
> > think that will happen if we decide on adding HGM and proceeding with
> > that reasoning about existing users.
> >
> > As expressed yesterday, I don't see a fast an clean way to make hugetlb
> > significantly less special (thanks Willy for the list of odd cases).
> >
> > Sure, we can talk about adding pte_t safety, but I don't really see a
> > way forward to unify page table walking code that way -- there are still
> > the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> > anybody wants to work on that, why not.
> >
> > Having that said, like Michal, I acknowledge that it is Mikes call
> > regarding the hugetlb code. I, for my part, will push back on any added
> > core-mm complexity that adds more special casing for hugetlb. Maybe
> > there are easy ways to integrate it nicely and that is not really a concern.
> 
> HGM is mostly contained in the already-existing HugeTLB special cases.
> HGM doesn't really *add* special cases, it just makes the HugeTLB
> special cases more complicated.

Indeed, maybe we shouldn't count all the changes in the HGM series as
"adding complexity to hugetlb".  Some of the changes may still be needed
even if / when there is a hugetlbv2.

IMHO the goal should be to try our best to reduce the changes that do still
count, i.e. those needed only because of hugetlb's special cases within v1.
Personally, if anyone can show that they tried their best at that and made
progress in getting us closer to a "converged" state, I'll be fine with
merging HGM within v1, provided the special-case code is also reduced to a
minimum.

> 
> There are a few small ways that HGM touches non-hugetlb code:
> 1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here[2]

This seems totally benign to me, as it switches hugetlb to use thp
mapcounts.  I'd say this does not make hugetlb more special; instead it
moves it slightly towards convergence.

If we want v2, we can design a better way to do mapcount, maybe not only
for hugetlb but also for thp.  But it seems that can always be done on top,
after first merging the two major kinds of large folios.

> 2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]

This is needed no matter what; I'd say it should not be counted as
"over-complicating" hugetlb.

> 3. A small non-hugetlb changes to page_vma_mapped_walk (provide pte_order)[5]

A few lines of complexity, maybe not a big issue.

> 4. A small special case in try_to_unmap_one and try_to_migrate_one (to
> check the head page for page flags)[6]

Seems to be an extra special case due to the different handling of
hwpoisoned large pages.  Not sure whether it can be worked out on the
memory failure side so that the behavior is merged with thp.

> 5. smaps stats[7]

Also seems benign.  Even if hugetlb merges with generic mm, someone could
still propose statistics for hugetlb mappings smaller than the huge page
size, and that'll be it.  Not directly relevant to core mm, IMHO.

> 
> [1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/
> [2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/
> [3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/
> [4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/
> [5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/
> [6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/
> [7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/

I haven't read HGM for a long time, but afair a few hundred LOC are in the
pgtable walking changes, which should probably be counted as "adding
complexity" if we say hugetlb will one day converge with core mm.  That's
one of the (many) issues Matthew listed in his slides yesterday that
hugetlb may need to look at for convergence.

Does that mean this might be a good spot to pay some more attention to?  I
know this goes back to the very early stage, when we were discussing the
best way to walk hugetlb pgtables given that we can map 4K within a 2M
page, but I think it is slightly different now: at least we're clearer that
we want to merge that with core mm.

I think it opens the possibility of mostly deprecating huge_pte_offset().
James/Mike/anyone, have any of you looked into that area?  Would the above
make any sense at all?
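
To make the question a bit more concrete, below is a minimal userspace
sketch of a walk that reports the level at which it found a mapping, so
that one walker covers both a 2M leaf and 4K entries under the same slot.
This is only a toy model under my own assumptions, not kernel code and not
what the HGM series does: every name in it (toy_walk_addr(), struct
toy_pmd, struct toy_walk, ...) is hypothetical, and locking, PMD sharing
and cont-PTE are ignored completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy constants that mirror the common x86-64 4K/2M layout. */
#define TOY_PAGE_SHIFT   12
#define TOY_PMD_SHIFT    21
#define TOY_PTRS_PER_PTE 512

enum toy_kind { TOY_NONE, TOY_LEAF, TOY_TABLE };

struct toy_pte {
	int present;
	uint64_t pfn;
};

struct toy_pmd {
	enum toy_kind kind;
	uint64_t pfn;          /* first pfn of the 2M page, for TOY_LEAF */
	struct toy_pte *table; /* 512 entries, for TOY_TABLE */
};

/* Result of a walk: what was found, and at which level. */
struct toy_walk {
	int found;
	unsigned int shift;    /* TOY_PMD_SHIFT or TOY_PAGE_SHIFT */
	uint64_t pfn;          /* pfn of the 4K page backing addr */
};

/*
 * Walk a flat array of PMD-level entries.  The caller does not need to
 * know in advance whether addr is mapped by a 2M leaf or by a 4K entry
 * below it; the level is reported back in ->shift.
 */
struct toy_walk toy_walk_addr(const struct toy_pmd *pmds, size_t npmds,
			      uint64_t addr)
{
	struct toy_walk w = { 0, 0, 0 };
	uint64_t pmd_idx = addr >> TOY_PMD_SHIFT;
	uint64_t pte_idx = (addr >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_PTE - 1);
	const struct toy_pmd *pmd;

	if (pmd_idx >= npmds)
		return w;

	pmd = &pmds[pmd_idx];
	switch (pmd->kind) {
	case TOY_LEAF:
		/* Whole 2M range mapped by one entry. */
		w.found = 1;
		w.shift = TOY_PMD_SHIFT;
		w.pfn = pmd->pfn + pte_idx;
		break;
	case TOY_TABLE:
		/* Same 2M slot, but mapped at 4K granularity. */
		if (pmd->table[pte_idx].present) {
			w.found = 1;
			w.shift = TOY_PAGE_SHIFT;
			w.pfn = pmd->table[pte_idx].pfn;
		}
		break;
	case TOY_NONE:
	default:
		break;
	}
	return w;
}
```

A caller would check ->found and then use ->shift to decide how much of
the address range the returned entry covers, instead of asking a
hugetlb-only helper for a pte at a size it has to know beforehand.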

Thanks,

-- 
Peter Xu