From: Jason Gunthorpe <jgg@nvidia.com>
To: Jason Miu <jasonmiu@google.com>
Cc: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
Date: Thu, 9 Oct 2025 14:52:05 -0300 [thread overview]
Message-ID: <20251009175205.GB3899236@nvidia.com> (raw)
In-Reply-To: <CAHN2nPKjg6=0QTcSoptxvQW9MpP0YwGUTx_gQDBxgCtSzxY5Pg@mail.gmail.com>
> > > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> > > #define PROP_SUB_FDT "fdt"
> >
> > I'de really like to see all of these sorts of definitions in some
> > structured ABI header not open coded all over the place..
>
> Do you think `include/linux/kexec_handover.h` is the appropriate
> place, or would you prefer a new, dedicated ABI header (e.g., in
> `include/uapi/linux/`) for all KHO-related FDT constants?
I would avoid uapi, but maybe Pasha has some
idea.
include/linux/live_update/abi/ ?
> Agreed. Will change `u64` according to Pasha's comment. And we use
> explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
> kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
> believe this, along with the `u64` type, makes it clear that the table
> stores physical addresses.
Well, the macros were intended to automate this and avoid mistakes
from opencoding.. Just keep using them?
> > > + */
> > > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
> >
> > pa is phys_addr_t in the kernel, never unsigned long.
> >
> > If you want to make it all dynamic then this should be phys_addr_t
>
> Should this also be `u64`, or we stay with `phys_addr_t` for all page
> addresses?
you should use phys_addr_t for everything that originates from a
phys_addr_t, and u64 for all the ABI
> > > +{
> > > + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
> >
> > And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> > may not exist.
> >
> > Use an enum ORDER_0_LG2 maybe
>
> I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
> PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
> to understand, what do you think? This constant, along with others,
> will be placed in the enum.
Sure, though I prefer LG2 to BIT_POS
BIT_POS to me implies it is being used as bit wise operation, while
log2 is a mathematical concept
X_lg2 = ilog2(X) && X == 1 << X_lg2
> > > + kho_radix_tree_walk_callback_t cb)
> > > +{
> > > + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > > + struct kho_radix_tree *next_tree;
> > > + unsigned long encoded, i;
> > > + int err = 0;
> > >
> > > + if (level == 1) {
> > > + encoded = offset;
> > > + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > > + encoded, cb);
> >
> > Better to do this in the caller a few lines below
>
> But the caller is in a different tree level? Should we only walk the
> bitmaps at the lowest level?
I mean just have the caller do
if (level-1 ==0)
kho_radix_walk_bitmaps()
else
..
Avoids a function call
> > > + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > > + if (root->table[i]) {
> > > + encoded = offset << level_shift | i;
> >
> > This doesn't seem right..
> >
> > The argument to the walker should be the starting encoded of the table
> > it is about to walk.
> >
> > Since everything always starts at 0 it should always be
> > start | (i << level_shift)
> >
> > ?
>
> You're right that this line might not be immediately intuitive. The
> var `level_shift` (which is constant value 9 here) is applied to the
> *accumulated* `offset` from the parent level. Let's consider an
> example of a preserved page at physical address `0x1000`, which
> encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
> for page 1).
Oh, weird, too weird maybe. I'd just keep all the values as fully
shifted, level_shift should be adjusted to have the full shift for
this level. Easier to understand.
Also, I think the order bits might have become a bit confused, I think
I explained it wrong.
My idea was to try to share the radix levels to save space eg if we
have like this patch does:
Order phys
00001 abcd
00010 0bcd
00100 00cd
01000 000d
Then we don't get too much page level sharing, the middle ends up with
0 indexes in tables that cannot be shared.
What I was going for was to push all the shared pages to the left
00001 abcd
00000 1bcd
00000 01cd
00000 001d
Here the first radix level has index 0 or 1 and is fully shared. So eg
Order 4 and 5 will have all the same 0 index table levels. This also
reduces the max height of the tree because only +1 bit is needed to
store order.
Jason
next prev parent reply other threads:[~2025-10-09 17:52 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-01 1:19 [PATCH v1 0/3] Make KHO Stateless Jason Miu
2025-10-01 1:19 ` [PATCH v1 1/3] kho: Adopt KHO radix tree data structures Jason Miu
2025-10-02 4:29 ` kernel test robot
2025-10-06 14:14 ` Jason Gunthorpe
2025-10-06 17:26 ` Pasha Tatashin
2025-10-06 22:50 ` Jason Gunthorpe
2025-10-09 2:07 ` Jason Miu
2025-10-09 17:52 ` Jason Gunthorpe [this message]
2025-10-22 0:59 ` Jason Miu
2025-10-01 1:19 ` [PATCH v1 2/3] memblock: Remove KHO notifier usage Jason Miu
2025-10-01 16:35 ` kernel test robot
2025-10-01 1:19 ` [PATCH v1 3/3] kho: Remove notifier system infrastructure Jason Miu
2025-10-01 18:07 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251009175205.GB3899236@nvidia.com \
--to=jgg@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=changyuanl@google.com \
--cc=dmatlack@google.com \
--cc=graf@amazon.com \
--cc=jasonmiu@google.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pasha.tatashin@soleen.com \
--cc=pratyush@kernel.org \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox