From: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jason Gunthorpe <jgg@nvidia.com>, Peter Xu <peterx@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
Date: Wed, 22 May 2024 09:08:36 +0000	[thread overview]
Message-ID: <13c7f673-2975-4e45-8247-6168e70e3d3a@csgroup.eu> (raw)
In-Reply-To: <ZkdpP-6znZS5Fvz2@localhost.localdomain>



On 17/05/2024 at 16:27, Oscar Salvador wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>> This series reimplements huge pages without hugepd on powerpc 8xx.
>>
>> Unlike most architectures, powerpc 8xx HW requires a two-level
>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>> is not feasible as such.
>>
>> Possible sizes are 4k, 16k, 512k and 8M.
>>
>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>> must point to a single level-2 page table. Until now that was done using
>> hugepd. This series changes it to use standard page tables where the
>> entry is replicated 1024 times in each of the two pagetables referred to
>> by the two associated PMD entries for that 8M page.
>>
>> At the moment, each helper has to check whether the hugepage ptep is a
>> PTE or a PMD in order to know whether it is an 8M page or a smaller
>> size. I hope this can be handled by core-mm in the future.
>>
>> There are probably several ways to implement stuff, so feedback is
>> very welcome.
> 
> 
> Hi Christophe,
> 
> I have been looking into this because I am interested in the ongoing work of
> the hugetlb unification, but my knowledge of ppc pagetables is close to zero,
> so be prepared for some stupid questions.
> 
> First, let me have a clear picture of the current situation:
> 
> power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operates on 2-level
> pagetables. Wiki [1] mentions PGD + PTE; here you seem to be referring to
> them as PMD + PTE though.
> 
> And we can have 1024 PGDs, each of which covers 4MB, so we can cover a total
> of 4GB.
> 
> Looking at the page table diagram for power8xx, it seems power8xx also has
> some sort of CONTIG_PTE (same as arm64 does)? So we can have contig_ptes
> representing bigger page sizes?
> I also guess that although power8xx supports all these different sizes, only one
> of them can be active at any time, right?

I don't know what you mean by "active at any time". In a running system 
with PAGE_SIZE defined as 4k, you can at any time have some hugepages of 
size 16K, some of 512K and some of 8M.

> 
> It also seems that this whole hugepd thing is only used when we are using 8MB
> PAGE_SIZE, right?

Today yes. In the past it was also used for 512K pages, until commit 
b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")

> And when that is active, we do have 2 PGDs (4MB each) pointing to the same 8MB
> hugepd.
> E.g:
>                          ____________
>   [PGD#0] ------------> |            |
>                         | 8MB hugepd |
>   [PGD#1] ------------> |____________|
> 
> What you want to do with this work is to get rid of that hugepd abstraction
> because it is something power8xx/hugetlb specific and cannot be represented
> with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE).

It is more than 8xx: hugepd is also used on e500 and book3s/64. But for 
sure it is specific to powerpc and it would help to get rid of it 
completely.
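
For the layout question: a hugepd_t itself is just a wrapped unsigned 
long, roughly like this (simplified from arch/powerpc/include/asm/page.h; 
the exact tag and size encoding differs between 8xx, e500 and book3s/64):

typedef struct { unsigned long pd; } hugepd_t;

#define hpd_val(hpd)    ((hpd).pd)

/*
 * The value points to a page table covering the whole huge page, with
 * some bits abused to tag the entry as a hugepd and to encode the page
 * size, which is why walkers must test is_hugepd() before descending
 * as if it were a regular page table pointer.
 */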

> I did not check, but I guess we cannot walk the hugepd thing with a normal
> page table walker, or can we? (how special is a hugepd? can you describe its
> internal layout?)

It depends on what you mean by "normal". For instance 
walk_page_range_novma() handles hugepd.
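
Something like this is expected to work today (untested sketch; the core 
walker in mm/pagewalk.c knows how to descend into hugepd entries and 
feeds the PTEs to .pte_entry like any others):

#include <linux/mm.h>
#include <linux/pagewalk.h>

static int show_pte(pte_t *pte, unsigned long addr, unsigned long next,
                    struct mm_walk *walk)
{
        pr_info("pte @ %lx = %lx\n", addr, (unsigned long)pte_val(*pte));
        return 0;
}

static const struct mm_walk_ops show_ops = {
        .pte_entry      = show_pte,
};

static void dump_range(struct mm_struct *mm, unsigned long start,
                       unsigned long end)
{
        /* the novma walker wants the mmap lock in write mode for a user mm */
        mmap_write_lock(mm);
        walk_page_range_novma(mm, start, end, &show_ops, NULL, NULL);
        mmap_write_unlock(mm);
}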

> 
> So, what you propose is something like the following?
> 
>   [PGD#X] -----------> [PTE#0]
>           -----------> [PTE..#1023]
>   [PGD#Y] -----------> [PTE#0]
>           -----------> [PTE..#1023]
> 
> so an 8MB hugepage will be covered by PGD#X and PGD#Y using contiguous PTEs.
> 
> The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is
> just me). It says that a Level2 table (aka PTE) covers 4KB chunks regardless
> of the pagesize, but either I read that wrong or something else is going on.

What it means is that when PAGE_SIZE is 16k, the pte_t is a table of 4 
longs, see 
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/include/asm/pgtable-types.h#L11
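
i.e. something like this (quoted from memory, the link above is 
authoritative):

#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
#else
typedef struct { pte_basic_t pte; } pte_t;
#endif

That way generic code keeps manipulating a single pte_t per 16k page 
while the HW sees 4 consecutive 4k entries.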

> Because on 16K page size, they show that each pte covers a 16KB memory chunk.
> But that would mean 16KB << 10 = 16MB per PGD entry, which is not really the
> case, so what is the deal there? Or is it just that we always use 4KB PTEs,
> and use contiguous PTEs for bigger sizes?

In a way yes, we cheat the HW by defining a PTE as a table of 4 u32 
values to behave like a cont-PTE.
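
Conceptually it amounts to this (grossly simplified sketch, not the real 
helper):

/*
 * Writing one 16k PTE on 8xx: the value is replicated in the 4
 * consecutive u32 cells, so whichever cell the HW tablewalk picks
 * for a given 4k subpage, it finds the same PTE.
 */
static inline void set_pte_cells(pte_basic_t *entry, pte_basic_t val)
{
        int i;

        for (i = 0; i < 4; i++)
                entry[i] = val;
}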

> 
> Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right?
> 
> Peter mentioned that we should have something like:
> 
>                            X       X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
> 
> where the PMD and PTE would be the ones we use for representing the 2-level
> page table, and PGD, P4D and PUD would just be dummies.
> 
> But is not the convention to always have at least PGD and PTE, and have
> anything in between as optional? E.g:
> 
>    X       ?       ?       ?       X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
> 

That's also my understanding, hence the discussion with Peter. The 
confusing part is that there is a mix between PGD and PMD: for instance, 
to populate a PGD entry you use pmd_populate(), and to clear a PGD entry 
you use pmd_clear(); pgd_clear() only exists when you have P4Ds.
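
To illustrate the folding (untested sketch):

#include <linux/mm.h>
#include <linux/pgtable.h>

static pte_t *find_pte(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd = pgd_offset(mm, addr);
        p4d_t *p4d = p4d_offset(pgd, addr);     /* no-op, P4D is folded */
        pud_t *pud = pud_offset(p4d, addr);     /* no-op, PUD is folded */
        pmd_t *pmd = pmd_offset(pud, addr);     /* still the same PGD entry */

        if (pmd_none(*pmd))
                return NULL;
        return pte_offset_kernel(pmd, addr);
}

On a 2-level platform like the 8xx the p4d/pud/pmd steps are identity 
transforms, so the "pmd" the helpers act on is topologically the PGD 
entry, hence pmd_populate()/pmd_clear() being the ones that exist.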

> I mean, are page table walkers ready to deal with non-PGD? I thought they were
> not.

Yes, that's how it is today, but Peter was thinking of the future.

> 
> Also, in patch#1, you mentioned:
> 
> "At the time being, for 512k pages the flag is kept in the PTE and inserted in
> the PMD entry at TLB miss exception".
> 
> Can you point out where to look for that in the code?

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L219

and

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L277

> 
> Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()?
> Is it the size of the current mapping we are establishing?
> I see that you only make a distinction when the mapping size is 8MB.
> So the PMD will have _PMD_PAGE_8MB, and it works out that all 1024 PTEs below
> are contiguous, representing a 4MB chunk?
> 
> I will start looking deeper into this series on Monday, but just wanted to
> have a better insight into what is going on.
> 
> PS: I think we could make the changelog of the cover letter a bit fatter and
> cover some details in there, e.g: layout of page tables for different page
> sizes, layout of hugepd, expected layout after the work, etc.
> I think it would help in reviewing this series.

I prefer keeping details in the individual patches and keeping the cover 
letter for a high-level summary and administrative information, because 
information in the cover letter is usually lost in the long term.

> 
> Thanks!
> 
> [1] https://github.com/linuxppc/wiki/wiki/Huge-pages
> 
> 
