linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Muchun Song <muchun.song@linux.dev>,
	SeongJae Park <sj@kernel.org>, Miaohe Lin <linmiaohe@huawei.com>,
	Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH 00/45] hugetlb pagewalk unification
Date: Thu, 4 Jul 2024 17:23:30 +0200
Message-ID: <84d4e799-90da-487e-adba-6174096283b5@redhat.com>
In-Reply-To: <Zoax9nwi5qmgTQR4@x1n>

On 04.07.24 16:30, Peter Xu wrote:
> Hey, David,
> 

Hi!

> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
>> There are roughly two categories of page table walkers we have:
>>
>> 1) We actually only want to walk present folios (to be precise, page
>>     ranges of folios). We should look into moving away from the
>>     page walker API where possible, and have something better that
>>     directly gives us the folio (page ranges). Any PTE batching would be
>>     done internally.
>>
>> 2) We want to deal with non-present folios as well (swp entries and all
>>     kinds of other stuff). We should maybe implement our custom page
>>     table walker and move away from walk_page_range(). We are not walking
>>     "pages" after all but everything else included :)
>>
>> Then, there is a subset of 1) where we only want to walk to a single address
>> (a single folio). I'm working on that right now to get rid of follow_page()
>> and some (IIRC 3: KSM and DAMON) walk_page_range() users. Hugetlb will still
>> remain a bit special, but I'm afraid we cannot hide that completely.
> 
> Maybe you are talking about the generic concept of "page table walker", not
> walk_page_range() explicitly?
> 
> I'd agree if it's about the generic concept. For example, follow_page()
> definitely is tailored for getting the page/folio. But just to mention,
> Oscar's series only works on the page_walk API itself. From what I see so
> far, most of the walk_page API users aren't described above: most of them,
> if not all, do not fall into category 1) at all. They either need to
> fetch something from the pgtable where having the folio isn't enough, or
> modify the pgtable for different reasons.

Right, but having 1) does not imply that we won't have access to
the page table entry in an abstracted form; the folio is simply the
primary source of information that these users care about. 2) is an
extension of 1) that also walks and exposes all (or most) other page table
entries in some form, which is certainly harder to get right.
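A minimal userspace sketch of what a category 1) interface could look like, assuming a flat toy array in place of a real PTE table. None of the `toy_*` names exist in the kernel; they are hypothetical and only illustrate the two properties described above: PTE batching happens inside the walker, and the caller still gets abstracted access to the first entry of each folio range.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT kernel code: a flat array stands in for one PTE table. */
enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_SWAP };

struct toy_pte {
	enum toy_pte_kind kind;
	int folio;          /* folio id, valid if PRESENT */
	int folio_idx;      /* page index inside the folio, valid if PRESENT */
	unsigned long swp;  /* swap entry value, valid if SWAP */
};

/* What a category 1) user sees: a page range of one folio plus the first entry. */
struct toy_folio_range {
	int folio;
	int first_idx;              /* first folio page mapped by this range */
	size_t nr_pages;            /* consecutive PTEs batched together */
	size_t start;               /* table index of the first PTE */
	const struct toy_pte *pte;  /* abstracted access to the entry itself */
};

typedef void (*toy_folio_cb)(const struct toy_folio_range *r, void *arg);

/*
 * Hypothetical walker: skips everything that is not present and batches
 * consecutive PTEs mapping consecutive pages of the same folio, so the
 * caller never does PTE batching itself.
 */
static void toy_walk_present_folios(const struct toy_pte *ptes, size_t n,
				    toy_folio_cb cb, void *arg)
{
	size_t i = 0;

	while (i < n) {
		if (ptes[i].kind != TOY_PTE_PRESENT) {
			i++;
			continue;
		}
		struct toy_folio_range r = {
			.folio = ptes[i].folio,
			.first_idx = ptes[i].folio_idx,
			.nr_pages = 1,
			.start = i,
			.pte = &ptes[i],
		};
		while (i + r.nr_pages < n) {
			const struct toy_pte *next = &ptes[i + r.nr_pages];

			if (next->kind != TOY_PTE_PRESENT || next->folio != r.folio ||
			    next->folio_idx != r.first_idx + (int)r.nr_pages)
				break;
			r.nr_pages++;
		}
		cb(&r, arg);
		i += r.nr_pages;
	}
}

/* Usage: a present-folios-only user that just counts what it sees. */
struct toy_stats {
	size_t ranges;
	size_t pages;
};

static void toy_count_cb(const struct toy_folio_range *r, void *arg)
{
	struct toy_stats *s = arg;

	s->ranges++;
	s->pages += r->nr_pages;
}

/* Folio 7 pages 0-2, a swap entry, a hole, folio 9 page 3, folio 7 page 3. */
static const struct toy_pte toy_demo[] = {
	{ TOY_PTE_PRESENT, 7, 0, 0 }, { TOY_PTE_PRESENT, 7, 1, 0 },
	{ TOY_PTE_PRESENT, 7, 2, 0 }, { TOY_PTE_SWAP, 0, 0, 42 },
	{ TOY_PTE_NONE, 0, 0, 0 },    { TOY_PTE_PRESENT, 9, 3, 0 },
	{ TOY_PTE_PRESENT, 7, 3, 0 },
};
```

Walking `toy_demo` with `toy_count_cb` reports three ranges covering five pages: the swap entry and the hole never reach the callback, and folio 7 page 3 forms its own range because it is not adjacent to pages 0-2.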

Taking a look at some examples:

* madvise_cold_or_pageout_pte_range() only cares about present folios.
* madvise_free_pte_range() only cares about present folios.
* break_ksm_ops() only cares about present folios.
* mlock_walk_ops() only cares about present folios.
* damon_mkold_ops() only cares about present folios.
* damon_young_ops() only cares about present folios.

There are certainly other page_walk API users that are more involved and
need to do way more magic; those fall into category 2). In particular,
things like swapin_walk_ops(), hmm_walk_ops() and most of
fs/proc/task_mmu.c. There are likely plenty of them.
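For contrast, a minimal sketch of a category 2) walker under the same kind of toy assumptions: every entry is visited, non-present ones (swap, migration) go to their own hook, and runs of empty slots are reported once as a hole. `toy_walk_all()` and `struct toy_walk_ops` are hypothetical and are not kernel API; the point is only that the walker now has to classify and expose far more than present folios.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT kernel code: one table of entries of various kinds. */
enum toy_kind { TOY_NONE, TOY_PRESENT, TOY_SWAP, TOY_MIGRATION };

struct toy_entry {
	enum toy_kind kind;
	unsigned long val;  /* pfn if PRESENT, swap/migration value otherwise */
};

/* Hypothetical ops: one hook per entry class; NULL hooks are skipped. */
struct toy_walk_ops {
	int (*present)(size_t idx, unsigned long pfn, void *arg);
	int (*nonpresent)(size_t idx, const struct toy_entry *e, void *arg);
	int (*hole)(size_t start, size_t end, void *arg);
};

/*
 * Visits every entry, including non-present ones. Runs of empty slots are
 * reported once as [start, end). A nonzero hook return stops the walk and
 * is passed back to the caller.
 */
static int toy_walk_all(const struct toy_entry *tbl, size_t n,
			const struct toy_walk_ops *ops, void *arg)
{
	size_t i = 0;
	int ret = 0;

	while (i < n && !ret) {
		const struct toy_entry *e = &tbl[i];

		if (e->kind == TOY_NONE) {
			size_t end = i + 1;

			while (end < n && tbl[end].kind == TOY_NONE)
				end++;
			if (ops->hole)
				ret = ops->hole(i, end, arg);
			i = end;
			continue;
		}
		if (e->kind == TOY_PRESENT) {
			if (ops->present)
				ret = ops->present(i, e->val, arg);
		} else if (ops->nonpresent) {
			ret = ops->nonpresent(i, e, arg);
		}
		i++;
	}
	return ret;
}

/* Usage: a swapin-like user that only cares about swap entries. */
static int toy_count_swap(size_t idx, const struct toy_entry *e, void *arg)
{
	(void)idx;
	if (e->kind == TOY_SWAP)
		(*(size_t *)arg)++;
	return 0;
}

static const struct toy_entry toy_demo[] = {
	{ TOY_PRESENT, 100 }, { TOY_SWAP, 5 },      { TOY_NONE, 0 },
	{ TOY_NONE, 0 },      { TOY_MIGRATION, 7 }, { TOY_SWAP, 6 },
};
```

Running `toy_walk_all()` over `toy_demo` with only `.nonpresent = toy_count_swap` set counts two swap entries: the migration entry reaches the same hook but is filtered out by the callback, while the present entry and the two-slot hole are ignored because their hooks are NULL.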


Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there 
even is left in using walk_page_range() :)

> 
> A generic pgtable walker still seems wanted at some point, but it may be
> too involved to be introduced together with this "remove hugetlb_entry"
> effort.

My thinking was whether "remove hugetlb_entry" could wait for "remove
page_walk", in case we find a reasonable way to do it better and
convert the individual users. Maybe it can't.

I've not given up hope that we can end up with something better and 
clearer than the current page_walk API :)

> 
> To me, that future work is not yet about "get the folio, ignore the
> pgtable", but about how to abstract different layers of pgtables, so the
> caller may get a generic concept of "one pgtable entry" with the level/size
> information attached, and process it at a single place / hook, and
> hopefully even work with a device pgtable, as long as it's a radix tree.

To me 2) is an extension of 1). My thinking is that we can start with 1)
without having to care about all the details of 2). Whether we have to make
it so generic that we can walk any page table layout out there in this
world, I'm not so sure.
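A rough sketch of the "one pgtable entry with level/size information attached, processed at a single hook" idea quoted above, assuming a toy three-level radix tree with four entries per table and leaves allowed at any level. `toy_walk()`, `toy_entry_hook` and the table layout are hypothetical; locking, PTE batching and arch-specific entry formats are deliberately not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy radix tree, NOT kernel code: 3 levels, 4 entries per table. */
#define TOY_BITS      2
#define TOY_ENTRIES   (1u << TOY_BITS)
#define TOY_LEVELS    3
#define TOY_PAGE_SIZE 4096UL

enum toy_type { TOY_EMPTY, TOY_TABLE, TOY_LEAF };

struct toy_table;

struct toy_ent {
	enum toy_type type;
	struct toy_table *next;  /* valid if TOY_TABLE */
	unsigned long pfn;       /* valid if TOY_LEAF */
};

struct toy_table {
	struct toy_ent e[TOY_ENTRIES];
};

/* The single hook: one entry, together with its level and mapped size. */
typedef void (*toy_entry_hook)(unsigned long addr, int level,
			       unsigned long size, const struct toy_ent *ent,
			       void *arg);

/* Bytes covered by one entry at @level (level 0 maps a single page). */
static unsigned long toy_level_size(int level)
{
	return TOY_PAGE_SIZE << (TOY_BITS * level);
}

static void toy_walk_level(const struct toy_table *t, int level,
			   unsigned long base, toy_entry_hook hook, void *arg)
{
	unsigned long size = toy_level_size(level);

	for (unsigned int i = 0; i < TOY_ENTRIES; i++) {
		const struct toy_ent *ent = &t->e[i];
		unsigned long addr = base + i * size;

		if (ent->type == TOY_TABLE && level > 0)
			toy_walk_level(ent->next, level - 1, addr, hook, arg);
		else if (ent->type == TOY_LEAF)
			hook(addr, level, size, ent, arg);
		/* Empty slots, and malformed table entries at level 0, are skipped. */
	}
}

/* Hypothetical entry point: the caller never sees per-level callbacks. */
static void toy_walk(const struct toy_table *top, toy_entry_hook hook, void *arg)
{
	toy_walk_level(top, TOY_LEVELS - 1, 0, hook, arg);
}

/* Usage: one hook that accounts mapped bytes and leaves per level. */
struct toy_acc {
	unsigned long bytes;
	unsigned int per_level[TOY_LEVELS];
};

static void toy_acc_hook(unsigned long addr, int level, unsigned long size,
			 const struct toy_ent *ent, void *arg)
{
	struct toy_acc *acc = arg;

	(void)addr;
	(void)ent;
	acc->bytes += size;
	acc->per_level[level]++;
}
```

The design choice illustrated is that a leaf at any level reaches the same hook, and the level/size pair carries what separate per-level callbacks such as pmd_entry/pud_entry would otherwise encode.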

-- 
Cheers,

David / dhildenb




Thread overview: 66+ messages
2024-07-04  4:30 Oscar Salvador
2024-07-04  4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
2024-07-04  4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
2024-07-04 15:02   ` Peter Xu
2024-07-04  4:30 ` [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront Oscar Salvador
2024-07-04  4:30 ` [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf Oscar Salvador
2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
2024-07-04 15:41   ` David Hildenbrand
2024-07-05 16:56   ` kernel test robot
2024-07-04  4:30 ` [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs Oscar Salvador
2024-07-04  4:30 ` [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04  4:30 ` [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped " Oscar Salvador
2024-07-04  4:30 ` [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud Oscar Salvador
2024-07-04  4:30 ` [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
2024-07-04 10:30   ` David Hildenbrand
2024-07-04  4:30 ` [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
2024-07-05 15:48   ` kernel test robot
2024-07-05 15:48   ` kernel test robot
2024-07-04  4:31 ` [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for " Oscar Salvador
2024-07-04  4:31 ` [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish Oscar Salvador
2024-07-04  4:31 ` [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages Oscar Salvador
2024-07-04  4:31 ` [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
2024-07-04 11:03   ` David Hildenbrand
2024-07-04  4:31 ` [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge Oscar Salvador
2024-07-04  4:31 ` [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm Oscar Salvador
2024-07-04  4:31 ` [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 37/45] mm/lock: Make mlock_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 38/45] mm/madvise: Make swapin_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 40/45] mm/madvise: Make madvise_free_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 45/45] mm: Delete all hugetlb_entry entries Oscar Salvador
2024-07-04 10:13 ` [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
2024-07-04 10:44 ` David Hildenbrand
2024-07-04 14:30   ` Peter Xu
2024-07-04 15:23     ` David Hildenbrand [this message]
2024-07-04 16:43       ` Peter Xu
2024-07-08  8:18       ` Oscar Salvador
2024-07-08 14:28         ` Jason Gunthorpe
2024-07-10  3:52         ` David Hildenbrand
2024-07-10 11:26           ` Oscar Salvador
2024-07-11  0:15             ` David Hildenbrand
2024-07-11  4:48               ` Oscar Salvador
2024-07-11  4:53                 ` David Hildenbrand
2024-07-08 14:35     ` Jason Gunthorpe