linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Jason Miu <jasonmiu@google.com>
Cc: Alexander Graf <graf@amazon.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Baoquan He <bhe@redhat.com>,
	Changyuan Lyu <changyuanl@google.com>,
	David Matlack <dmatlack@google.com>,
	David Rientjes <rientjes@google.com>,
	Mike Rapoport <rppt@kernel.org>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Pratyush Yadav <pratyush@kernel.org>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
Date: Thu, 9 Oct 2025 14:52:05 -0300	[thread overview]
Message-ID: <20251009175205.GB3899236@nvidia.com> (raw)
In-Reply-To: <CAHN2nPKjg6=0QTcSoptxvQW9MpP0YwGUTx_gQDBxgCtSzxY5Pg@mail.gmail.com>

> > > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> > >  #define PROP_SUB_FDT "fdt"
> >
> > I'de really like to see all of these sorts of definitions in some
> > structured ABI header not open coded all over the place..
> 
> Do you think `include/linux/kexec_handover.h` is the appropriate
> place, or would you prefer a new, dedicated ABI header (e.g., in
> `include/uapi/linux/`) for all KHO-related FDT constants?

I would avoid uapi, but maybe Pasha has some
idea.

 include/linux/live_update/abi/ ?

> Agreed. Will change `u64` according to Pasha's comment. And we use
> explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
> kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
> believe this, along with the `u64` type, makes it clear that the table
> stores physical addresses.

Well, the macros were intended to automate this and avoid mistakes
from opencoding.. Just keep using them?

> > > + */
> > > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
> >
> > pa is phys_addr_t in the kernel, never unsigned long.
> >
> > If you want to make it all dynamic then this should be phys_addr_t
> 
> Should this also be `u64`, or we stay with `phys_addr_t` for all page
> addresses?

you should use phys_addr_t for everything that originates from a
phys_addr_t, and u64 for all the ABI

> > > +{
> > > +     unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
> >
> > And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> > may not exist.
> >
> > Use an enum ORDER_0_LG2 maybe
> 
> I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
> PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
> to understand, what do you think? This constant, along with others,
> will be placed in the enum.

Sure, though I prefer LG2 to BIT_POS

BIT_POS to me implies it is being used as  bit wise operation, while
log2 is a mathematical concept

  X_lg2 = ilog2(X)  &&  X == 1 << X_lg2

> > > +                             kho_radix_tree_walk_callback_t cb)
> > > +{
> > > +     int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > > +     struct kho_radix_tree *next_tree;
> > > +     unsigned long encoded, i;
> > > +     int err = 0;
> > >
> > > +     if (level == 1) {
> > > +             encoded = offset;
> > > +             return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > > +                                           encoded, cb);
> >
> > Better to do this in the caller  a few lines below
> 
> But the caller is in a different tree level? Should we only walk the
> bitmaps at the lowest level?

I mean just have the caller do

if (level-1 ==0)
   kho_radix_walk_bitmaps()
else
   ..

Avoids a function call

> > > +     for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > > +             if (root->table[i]) {
> > > +                     encoded = offset << level_shift | i;
> >
> > This doesn't seem right..
> >
> > The argument to the walker should be the starting encoded of the table
> > it is about to walk.
> >
> > Since everything always starts at 0 it should always be
> >   start | (i << level_shift)
> >
> > ?
> 
> You're right that this line might not be immediately intuitive. The
> var `level_shift` (which is constant value 9 here) is applied to the
> *accumulated* `offset` from the parent level. Let's consider an
> example of a preserved page at physical address `0x1000`, which
> encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
> for page 1).

Oh, weird, too weird maybe. I'd just keep all the values as fully
shifted, level_shift should be adjusted to have the full shift for
this level. Easier to understand.

Also, I think the order bits might have become a bit confused, I think
I explained it wrong.

My idea was to try to share the radix levels to save space eg if we
have like this patch does:

  Order phys
  00001 abcd
  00010 0bcd
  00100 00cd
  01000 000d

Then we don't get too much page level sharing, the middle ends up with
0 indexes in tables that cannot be shared.

What I was going for was to push all the shared pages to the left

  00001 abcd
  00000 1bcd
  00000 01cd
  00000 001d

Here the first radix level has index 0 or 1 and is fully shared. So eg
Order 4 and 5 will have all the same 0 index table levels. This also
reduces the max height of the tree because only +1 bit is needed to
store order.

Jason


  reply	other threads:[~2025-10-09 17:52 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-01  1:19 [PATCH v1 0/3] Make KHO Stateless Jason Miu
2025-10-01  1:19 ` [PATCH v1 1/3] kho: Adopt KHO radix tree data structures Jason Miu
2025-10-02  4:29   ` kernel test robot
2025-10-06 14:14   ` Jason Gunthorpe
2025-10-06 17:26     ` Pasha Tatashin
2025-10-06 22:50       ` Jason Gunthorpe
2025-10-09  2:07     ` Jason Miu
2025-10-09 17:52       ` Jason Gunthorpe [this message]
2025-10-22  0:59         ` Jason Miu
2025-10-01  1:19 ` [PATCH v1 2/3] memblock: Remove KHO notifier usage Jason Miu
2025-10-01 16:35   ` kernel test robot
2025-10-01  1:19 ` [PATCH v1 3/3] kho: Remove notifier system infrastructure Jason Miu
2025-10-01 18:07   ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251009175205.GB3899236@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=changyuanl@google.com \
    --cc=dmatlack@google.com \
    --cc=graf@amazon.com \
    --cc=jasonmiu@google.com \
    --cc=kexec@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=pratyush@kernel.org \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox