From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Muchun Song <muchun.song@linux.dev>,
SeongJae Park <sj@kernel.org>, Miaohe Lin <linmiaohe@huawei.com>,
Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Christophe Leroy <christophe.leroy@csgroup.eu>,
Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH 00/45] hugetlb pagewalk unification
Date: Wed, 10 Jul 2024 05:52:43 +0200 [thread overview]
Message-ID: <9d5980e3-72e6-4848-b1ac-83ffab8522c4@redhat.com> (raw)
In-Reply-To: <Zoug1swoTOqNUPJo@localhost.localdomain>
On 08.07.24 10:18, Oscar Salvador wrote:
> On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
>> My thinking was if "remove hugetlb_entry" cannot wait for "remove
>> page_walk", because we found a reasonable way to do it better and convert
>> the individual users. Maybe it can't.
>>
>> I've not given up hope that we can end up with something better and clearer
>> than the current page_walk API :)
>
> Hi David,
>
Hi!
> I agree that the current page_walk might be a bit convoluted, and that the
> indirect functions approach is a bit of a hassle.
> Having said that, let me clarify something.
>
> Although this patchset touches the page_walk API wrt. getting rid of
> hugetlb special casing all over the place, my goal was not as focused on
> the page_walk as it was on the hugetlb code to gain hability to be
> interpreted on PUD/PMD level.
I understand that. And it would all be easier+more straight forward if
we wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that
works similar, but different to "ordinary" cont-pte for thp.
I'm sure you stumbled over the set_huge_pte_at() on arm64 for example.
If we, at one point *don't* use these hugetlb functions right now to
modify hugetlb entries, we might be in trouble.
That's why I think we should maybe invest our time and effort in having
a new pagewalker that will just batch such things naturally, and users
that can operate on that naturally. For example: a hugetlb
cont-pte-mapped folio will just naturally be reported as a "fully mapped
folio", just like a THP would be if mapped in a compatible way.
Yes, this requires more work, but as raised in some patches here,
working on individual PTEs/PMDs for hugetlb is problematic.
You have to batch every operation, to essentially teach ordinary code to
do what the hugetlb_* special code would have done on cont-pte/cont-pmd
things.
(as a side note, cont-pte/cont-pmd should primarily be a hint from arch
code on how many entries we can batch, like we do in folio_pte_batch();
point is that we want to batch also on architectures where we don't have
such bits, and prepare for architectures that implement various sizes of
batching; IMHO, having cont-pte/cont-pmd checks in common code is likely
the wrong approach. Again, folio_pte_batch() is where we tackled the
problem differently from the THP perspective)
>
> One of the things, among other things, that helped in creating this
> mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
> __always__ operates on ptes, which means that we cannot rely on the mm
> core to do the right thing, and we need a bunch of hugetlb-pte functions
> that knows about their thing, so we lean on that.
>
> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.
> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.
>
> Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
> means we need to add a fairly amount of code. e.g: for operating
> hugepages on PUD level, code for markers on PUD/PMD level for
> uffd/poison stuff (only dealt
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost all we did for PMD-* stuff we need it for PUD as well,
> and that will be around when THP gains support for PUD if it ever does
> (I guess that in a few years if memory capacity keeps increasing).
>
> E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
> is_hwpoison_entry
>
> Yes, it is a hassle to have more code around, but IMO, this new code
> will help us in 1) move away from __always__ operate on ptes 2) ease
> integrate hugetlb code into mm core.
>
> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us in have hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.
>
> If that gets accomplished, I think that a fair amount of code that lives
> in hugetlb.c can be deleted/converted as less special casing will be needed.
>
> I might be wrong and maybe I will hit a brick wall, but hopefully not.
I have an idea for a better page table walker API that would try
batching most entries (under one PTL), and walkers can just register for
the types they want. Hoping I will find some time to at least scetch the
user interface soon.
That doesn't mean that this should block your work, but the
cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I
don't particularly like where this is going.
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2024-07-10 3:52 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-04 4:30 Oscar Salvador
2024-07-04 4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
2024-07-04 4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
2024-07-04 15:02 ` Peter Xu
2024-07-04 4:30 ` [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront Oscar Salvador
2024-07-04 4:30 ` [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf Oscar Salvador
2024-07-04 4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
2024-07-04 15:41 ` David Hildenbrand
2024-07-05 16:56 ` kernel test robot
2024-07-04 4:30 ` [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs Oscar Salvador
2024-07-04 4:30 ` [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04 4:30 ` [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped " Oscar Salvador
2024-07-04 4:30 ` [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud Oscar Salvador
2024-07-04 4:30 ` [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04 4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
2024-07-04 10:30 ` David Hildenbrand
2024-07-04 4:30 ` [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
2024-07-05 15:48 ` kernel test robot
2024-07-05 15:48 ` kernel test robot
2024-07-04 4:31 ` [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04 4:31 ` [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for " Oscar Salvador
2024-07-04 4:31 ` [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish Oscar Salvador
2024-07-04 4:31 ` [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04 4:31 ` [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages Oscar Salvador
2024-07-04 4:31 ` [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04 4:31 ` [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
2024-07-04 11:03 ` David Hildenbrand
2024-07-04 4:31 ` [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle " Oscar Salvador
2024-07-04 4:31 ` [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped " Oscar Salvador
2024-07-04 4:31 ` [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge Oscar Salvador
2024-07-04 4:31 ` [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm Oscar Salvador
2024-07-04 4:31 ` [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas Oscar Salvador
2024-07-04 4:31 ` [PATCH 37/45] mm/lock: Make mlock_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 38/45] mm/madvise: Make swapin_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 40/45] mm/madvise: Make madvise_free_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk " Oscar Salvador
2024-07-04 4:31 ` [PATCH 45/45] mm: Delete all hugetlb_entry entries Oscar Salvador
2024-07-04 10:13 ` [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
2024-07-04 10:44 ` David Hildenbrand
2024-07-04 14:30 ` Peter Xu
2024-07-04 15:23 ` David Hildenbrand
2024-07-04 16:43 ` Peter Xu
2024-07-08 8:18 ` Oscar Salvador
2024-07-08 14:28 ` Jason Gunthorpe
2024-07-10 3:52 ` David Hildenbrand [this message]
2024-07-10 11:26 ` Oscar Salvador
2024-07-11 0:15 ` David Hildenbrand
2024-07-11 4:48 ` Oscar Salvador
2024-07-11 4:53 ` David Hildenbrand
2024-07-08 14:35 ` Jason Gunthorpe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9d5980e3-72e6-4848-b1ac-83ffab8522c4@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=christophe.leroy@csgroup.eu \
--cc=jgg@nvidia.com \
--cc=linmiaohe@huawei.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=osalvador@suse.de \
--cc=peterx@redhat.com \
--cc=sj@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox