From: Jason Miu <jasonmiu@google.com>
To: Mike Rapoport <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v4 1/2] kho: Adopt radix tree for preserved memory tracking
Date: Mon, 12 Jan 2026 21:33:10 -0800 [thread overview]
Message-ID: <CAHN2nPL=wnqjici92Vj5Ub5w=xXN8CDfOVjOTUOWs4VeJC107Q@mail.gmail.com> (raw)
In-Reply-To: <aWTJ2v9J-Aa6ykBf@kernel.org>
On Mon, Jan 12, 2026 at 2:16 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Jan 08, 2026 at 04:11:26PM -0800, Jason Miu wrote:
> > Introduce a radix tree implementation for tracking preserved memory
> > pages and switch the KHO memory tracking mechanism to use it. This
> > lays the groundwork for a stateless KHO implementation that eliminates
> > the need for serialization and the associated "finalize" state.
> >
...
>
> ...
>
> > +/**
> > + * DOC: KHO persistent memory tracker
> > + *
> > + * KHO tracks preserved memory using a radix tree data structure. Each node of
> > + * the tree is exactly a single page. The leaf nodes are bitmaps where each set
> > + * bit is a preserved page of any order. The intermediate nodes are tables of
> > + * physical addresses that point to a lower level node.
> > + *
> > + * The tree hierarchy is shown below::
> > + *
> > + * root
> > + * +-------------------+
> > + * | Level 5 | (struct kho_radix_node)
> > + * +-------------------+
> > + * |
> > + * v
> > + * +-------------------+
> > + * | Level 4 | (struct kho_radix_node)
> > + * +-------------------+
> > + * |
> > + * | ... (intermediate levels)
> > + * |
> > + * v
> > + * +-------------------+
> > + * | Level 0 | (struct kho_radix_leaf)
> > + * +-------------------+
> > + *
> > + * The tree is traversed using a key that encodes the page's physical address
> > + * (pa) and its order into a single unsigned long value. The encoded key value
> > + * is composed of two parts: the 'order bit' in the upper part and the 'page
> > + * offset' in the lower part.::
> > + *
> > + * +------------+-----------------------------+--------------------------+
> > + * | Page Order | Order Bit | Page Offset |
> > + * +------------+-----------------------------+--------------------------+
> > + * | 0 | ...000100 ... (at bit 52) | pa >> (PAGE_SHIFT + 0) |
> > + * | 1 | ...000010 ... (at bit 51) | pa >> (PAGE_SHIFT + 1) |
> > + * | 2 | ...000001 ... (at bit 50) | pa >> (PAGE_SHIFT + 2) |
> > + * | ... | ... | ... |
> > + * +------------+-----------------------------+--------------------------+
> > + *
> > + * Page Offset:
>
> To me "page offset" reads as offset from somewhere and here it's rather pfn
> on steroids :)
> Also in many places in the kernel "page offset" refers to the offset inside a
> page.
>
> Can't say I can think of a better name, but it feels that it should express
> that this is an address more explicitly.
>
I updated this to "Shifted Physical Address," borrowing the idea from
Jason Gunthorpe. (Thank you =)
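For concreteness, the key encoding described in the table above can be sketched as a pair of standalone helpers. This is an approximation of the patch's kho_radix_encode_key()/kho_radix_decode_key() for illustration, not the exact kernel code; it assumes a 64-bit build with 4KB pages (PAGE_SHIFT = 12):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Encode a physical address and page order into a single key.
 * The order bit lands at position (64 - PAGE_SHIFT - order); the
 * shifted physical address occupies the bits below it.
 */
static unsigned long kho_radix_encode_key(uint64_t pa, unsigned int order)
{
	unsigned int shift = PAGE_SHIFT + order;

	return (1UL << (64 - shift)) | (pa >> shift);
}

/* Reverse the encoding: the order bit is always the key's highest set bit. */
static uint64_t kho_radix_decode_key(unsigned long key, unsigned int *order)
{
	unsigned int pos = 63 - __builtin_clzl(key);

	*order = (64 - PAGE_SHIFT) - pos;
	return (key & ~(1UL << pos)) << (PAGE_SHIFT + *order);
}
```

So an order-0 page at pa 0x1000 encodes to (1UL << 52) | 1, matching the first row of the table; decoding recovers the original (pa, order) pair because the order bit always sits strictly above the shifted physical address.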
> > + * The 'page offset' is the physical address normalized for its order. It
> > + * effectively represents the page offset for the given order.
> > + *
> > + * Order Bit:
> > + * The 'order bit' encodes the page order by setting a single bit at a
> > + * specific position. The position of this bit itself represents the order.
> > + *
> > + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> > + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> > + * offset occupies bits [0-51]. For order 0, the order bit is set at
> > + * position 52.
> > + *
> > + * The following diagram illustrates how the encoded key value is split into
> > + * indices for the tree levels, with PAGE_SIZE of 4KB::
> > + *
> > + * 63:60 59:51 50:42 41:33 32:24 23:15 14:0
> > + * +---------+--------+--------+--------+--------+--------+-----------------+
> > + * | 0 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 | Lv 0 (bitmap) |
> > + * +---------+--------+--------+--------+--------+--------+-----------------+
> > + *
> > + * The radix tree stores pages of all sizes (orders) in a single 6-level
>
> "sizes" can be misleading here because "all sizes" can mean non power-of-2
> sizes as well. I'd just use "pages of all orders".
>
> > + * hierarchy. It efficiently shares lower table levels, especially due to
>
> Don't we share the higher levels? Also using "tree" instead of "table"
> seems clearer to me.
>
Yes, updated.
> > + * common zero top address bits, allowing a single, efficient algorithm to
> > + * manage all pages. This bitmap approach also offers memory efficiency; for
> > + * example, a 512KB bitmap can cover a 16GB memory range for 0-order pages with
> > + * PAGE_SIZE = 4KB.
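As a quick sanity check on the figure quoted above (a 512KB bitmap covering 16GB of order-0 pages): a 512KB bitmap holds 2^22 bits, and at one bit per 4KB page that spans 2^34 bytes = 16GB. Nothing kernel-specific here, just the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of memory covered by a bitmap of @bitmap_bytes, one bit per page. */
static uint64_t kho_bitmap_coverage(uint64_t bitmap_bytes,
				    unsigned int page_shift)
{
	return bitmap_bytes * 8 * (1ULL << page_shift);
}
```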
> > + *
> > + * The data structures defined here are part of the KHO ABI. Any modification
> > + * to these structures that breaks backward compatibility must be accompanied by
> > + * an update to the "compatible" string. This ensures that a newer kernel can
> > + * correctly interpret the data passed by an older kernel.
> > + */
> > +
> > +/*
> > + * Defines constants for the KHO radix tree structure, used to track preserved
> > + * memory. These constants govern the indexing, sizing, and depth of the tree.
> > + */
> > +enum kho_radix_consts {
> > + /*
> > + * The bit position of the order bit (and also the length of the
> > + * page offset) for an order-0 page.
> > + */
> > + KHO_ORDER_0_LOG2 = 64 - PAGE_SHIFT,
> > +
> > + /* Size of the table in kho_radix_node, in log2 */
> > + KHO_TABLE_SIZE_LOG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> > +
> > + /* Number of bits in the kho_radix_leaf bitmap, in log2 */
> > + KHO_BITMAP_SIZE_LOG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
> > +
> > + /*
> > + * The total tree depth is the number of intermediate levels
> > + * and 1 bitmap level.
> > + */
> > + KHO_TREE_MAX_DEPTH =
> > + DIV_ROUND_UP(KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2,
> > + KHO_TABLE_SIZE_LOG2) + 1,
>
> Extra tab in indentation of DIV_ROUND_UP would make it more readable IMHO.
>
> > +};
> > +
> > +struct kho_radix_node {
> > + u64 table[1 << KHO_TABLE_SIZE_LOG2];
> > +};
> > +
> > +struct kho_radix_leaf {
> > + DECLARE_BITMAP(bitmap, 1 << KHO_BITMAP_SIZE_LOG2);
> > +};
> > +
> > #endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */
> > diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h
> > new file mode 100644
> > index 000000000000..8f03dd226dd9
> > --- /dev/null
> > +++ b/include/linux/kho_radix_tree.h
> > @@ -0,0 +1,72 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _LINUX_KHO_ABI_RADIX_TREE_H
> > +#define _LINUX_KHO_ABI_RADIX_TREE_H
>
> I misinterpreted the file name during v3 review, no need for _ABI here :)
>
> > +#include <linux/err.h>
> > +#include <linux/errno.h>
> > +#include <linux/mutex_types.h>
> > +#include <linux/types.h>
>
> ...
>
> > +#ifdef CONFIG_KEXEC_HANDOVER
> > +
> > +int kho_radix_add_page(struct kho_radix_tree *tree, unsigned long pfn,
> > + unsigned int order);
> > +
> > +void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
> > + unsigned int order);
> > +
> > +int kho_radix_walk_tree(struct kho_radix_tree *tree, unsigned int level,
>
> I think 'level' and 'start' should not be a part of public API. We don't
> want to expose walking the tree from arbitrary place.
>
Sure, updated the function implementation as well.
> > + unsigned long start, kho_radix_tree_walk_callback_t cb);
> > +
> > +#else /* #ifdef CONFIG_KEXEC_HANDOVER */
>
> ...
>
> > +/**
> > + * kho_radix_decode_key - Decodes a radix key back into a physical address and order.
> > + * @key: The unsigned long key to decode.
> > + * @order: An output parameter, a pointer to an unsigned int where the decoded
> > + * page order will be stored.
> > + *
> > + * This function reverses the encoding performed by kho_radix_encode_key(),
> > + * extracting the original physical address and page order from a given key.
> > + *
> > + * Return: The decoded physical address.
> > + */
> > +static phys_addr_t kho_radix_decode_key(unsigned long key,
> > + unsigned int *order)
>
> Nit: *order can be on the same line as key
>
> > {
>
> ...
>
> > +static unsigned long kho_radix_get_index(unsigned long key,
> > + unsigned int level)
> > +{
> > + int s;
> > +
> > + if (level == 0)
> > + return kho_radix_get_bitmap_index(key);
>
> I'd drop this and use get_bitmap_index() explicitly for level-0.
> Maybe also rename _get_index() to _get_table_index().
>
The function name is updated, and callers now call
kho_radix_get_bitmap_index() explicitly for the bitmap level.
> > +
> > + s = ((level - 1) * KHO_TABLE_SIZE_LOG2) + KHO_BITMAP_SIZE_LOG2;
> > + return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2);
> > +}
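Plugging in PAGE_SHIFT = 12 and sizeof(phys_addr_t) = 8, the constants and the shift computation in the quoted code can be checked numerically. This is a standalone sketch mirroring the patch's helpers (using the _get_table_index name discussed above), not the kernel build itself:

```c
#include <assert.h>

#define PAGE_SHIFT		12
#define KHO_ORDER_0_LOG2	(64 - PAGE_SHIFT)	/* 52 */
#define KHO_TABLE_SIZE_LOG2	9	/* log2(4096 / 8) entries per node */
#define KHO_BITMAP_SIZE_LOG2	15	/* log2(4096 * 8) bits per leaf */

/* ceil((52 - 15) / 9) intermediate levels + 1 bitmap level = 6 */
#define KHO_TREE_MAX_DEPTH \
	(((KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2 + KHO_TABLE_SIZE_LOG2 - 1) / \
	  KHO_TABLE_SIZE_LOG2) + 1)

/* Index into the table at an intermediate level (level >= 1). */
static unsigned long kho_radix_get_table_index(unsigned long key,
					       unsigned int level)
{
	int s = ((level - 1) * KHO_TABLE_SIZE_LOG2) + KHO_BITMAP_SIZE_LOG2;

	return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2);
}

/* Index into the level-0 bitmap: the low KHO_BITMAP_SIZE_LOG2 bits. */
static unsigned long kho_radix_get_bitmap_index(unsigned long key)
{
	return key % (1UL << KHO_BITMAP_SIZE_LOG2);
}
```

With these values, level 1 consumes key bits 15-23 and level 5 bits 51-59, which matches the key-splitting diagram earlier in the DOC comment, and the depth works out to 6 levels.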
>
> ...
>
> > +void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn,
> > + unsigned int order)
> > +{
> > + unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order);
> > + struct kho_radix_node *node = tree->root;
> > + struct kho_radix_leaf *leaf;
> > + unsigned int i, idx;
> > +
> > + if (WARN_ON_ONCE(!tree->root))
> > + return;
> > +
> > + might_sleep();
> > +
> > + guard(mutex)(&tree->lock);
> > +
> > + /* Go from high levels to low levels */
> > + for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) {
> > + idx = kho_radix_get_index(key, i);
> > +
> > + /*
> > + * Attempting to delete a page that has not been preserved,
> > + * return with a warning.
> > + */
> > + if (WARN_ON(!node->table[idx]))
> > + return;
> > +
> > + if (node->table[idx])
>
> There's already WARN_ON(!node->table) and return, no need for another if
> here.
>
> > + node = phys_to_virt((phys_addr_t)node->table[idx]);
> > + }
> > +
> > + /* Handle the leaf level bitmap (level 0) */
> > + leaf = (struct kho_radix_leaf *)node;
> > + idx = kho_radix_get_index(key, 0);
> > + __clear_bit(idx, leaf->bitmap);
>
> I think I already mentioned it in earlier reviews, but I don't remember any
> response.
>
> How do we approach freeing empty bitmaps and intermediate nodes?
> If we do a few preserve/uppreserve cycles for memory that can be allocated
> and freed in between we might get many unused bitmaps.
>
> My view is that we should free the empty bitmaps, maybe asynchronously.
> The intermediate nodes probably don't take that much memory to bother with
> them.
>
I think preserving and then unpreserving memory is not a common use
case. I suggest keeping the current, simpler implementation instead of
doing the cleanup; we can address this memory usage optimization
later. If we do implement it, I would clean up all empty bitmaps
and intermediate nodes, to avoid leaving incomplete tree traversal paths.
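If we ever do that cleanup, a post-order prune could work roughly as below. This is only a design sketch over a simplified in-memory model (plain pointers standing in for the physical-address table entries, and a boolean standing in for "the leaf bitmap has no set bits" — both assumptions for illustration, not the patch's actual structures or locking):

```c
#include <stdbool.h>
#include <stdlib.h>

#define FANOUT 512

struct node {
	bool is_leaf;
	bool leaf_empty;	/* stand-in for "bitmap has no set bits" */
	struct node *child[FANOUT];
};

/*
 * Recursively free empty leaves and any intermediate node left with no
 * children. Returns true if @n was freed, so the parent can clear its slot.
 */
static bool prune(struct node *n)
{
	bool any_child = false;

	if (n->is_leaf) {
		if (n->leaf_empty) {
			free(n);
			return true;
		}
		return false;
	}

	for (int i = 0; i < FANOUT; i++) {
		if (!n->child[i])
			continue;
		if (prune(n->child[i]))
			n->child[i] = NULL;
		else
			any_child = true;
	}

	if (!any_child) {
		free(n);
		return true;
	}
	return false;
}
```

The point of returning "freed" up the recursion is exactly the incomplete-path concern: a parent never keeps a stale entry pointing at a freed child, so a later traversal either finds a valid subtree or an empty slot, never a dangling one.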
> > +}
> > +EXPORT_SYMBOL_GPL(kho_radix_del_page);
>
> ...
>
> > +/**
> > + * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each preserved page.
> > + * @tree: A pointer to the KHO radix tree to walk.
> > + * @level: The starting level for the walk (typically KHO_TREE_MAX_DEPTH - 1).
> > + * @start: The initial key prefix for the walk (typically 0).
> > + * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be
> > + * invoked for each preserved page found in the tree. The callback receives
> > + * the physical address and order of the preserved page.
> > + *
> > + * This function walks the radix tree, searching from the specified top level
> > + * (@level) down to the lowest level (level 0). For each preserved page found,
> > + * it invokes the provided callback, passing the page's physical address and
> > + * order.
> > + *
> > + * Return: 0 if the walk completed the specified tree, or the non-zero return
> > + * value from the callback that stopped the walk.
> > + */
> > +int kho_radix_walk_tree(struct kho_radix_tree *tree, unsigned int level,
> > + unsigned long start, kho_radix_tree_walk_callback_t cb)
> > +{
> > + if (WARN_ON_ONCE(!tree->root))
> > + return -EINVAL;
> > +
> > + guard(mutex)(&tree->lock);
> > +
> > + return __kho_radix_walk_tree(tree->root, level, start, cb);
>
> Like I said, let's make it
>
> return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb);
>
> and drop level and start parameters.
>
> > +}
> > +EXPORT_SYMBOL_GPL(kho_radix_walk_tree);
>
> ...
>
> > @@ -260,11 +388,20 @@ static struct page *kho_restore_page(phys_addr_t phys, bool is_folio)
> >
> > /* Clear private to make sure later restores on this page error out. */
> > page->private = 0;
> > + /* Head page gets refcount of 1. */
> > + set_page_count(page, 1);
> >
> > - if (is_folio)
> > - kho_init_folio(page, info.order);
> > - else
> > - kho_init_pages(page, nr_pages);
> > + /*
> > + * For higher order folios, tail pages get a page count of zero.
> > + * For physically contiguous order-0 pages every pages gets a page
> > + * count of 1
> > + */
> > + ref_cnt = is_folio ? 0 : 1;
> > + for (unsigned int i = 1; i < nr_pages; i++)
> > + set_page_count(page + i, ref_cnt);
> > +
> > + if (is_folio && info.order)
> > + prep_compound_page(page, info.order);
>
> It looks like this reverts latest Pratyush's changes to kho_restore_page(),
> please rebase on a newer version of mm.git/linux-next.git.
>
Yes, synced again.
> >
> > adjust_managed_page_count(page, nr_pages);
> > return page;
>
> ...
>
> > +static int __init kho_radix_memblock_reserve(phys_addr_t phys,
> > + unsigned int order)
>
> I don't think _radix should be mentioned in the callback at all.
> How about
>
> kho_preserved_memory_reserve()?
>
> > {
>
> ...
>
> > + /* Reserve the memory preserved in KHO radix tree in memblock */
>
> /* Reserve memory preserved by KHO in memblock */
>
> What data structure is used to track the preserved memory is not important
> here.
>
> > + memblock_reserve(phys, sz);
> > + memblock_reserved_mark_noinit(phys, sz);
> > + info.magic = KHO_PAGE_MAGIC;
> > + info.order = order;
> > + page->private = info.page_private;
> >
> > return 0;
> > }
>
> ...
>
> > static __init int kho_out_fdt_setup(void)
> > {
> > + struct kho_radix_tree *tree = &kho_out.radix_tree;
> > void *root = kho_out.fdt;
> > - u64 empty_mem_map = 0;
> > + u64 preserved_mem_tree_pa;
> > int err;
> >
> > err = fdt_create(root, PAGE_SIZE);
> > err |= fdt_finish_reservemap(root);
> > err |= fdt_begin_node(root, "");
> > err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE);
> > - err |= fdt_property(root, KHO_FDT_MEMORY_MAP_PROP_NAME, &empty_mem_map,
> > - sizeof(empty_mem_map));
> > +
> > + scoped_guard(mutex, &tree->lock) {
>
> This runs exactly one time on boot and there's no place in the code that
> can concurrently change tree->root.
>
Updated, and I understand the idea. Personally I think always taking
the mutex when accessing the tree keeps the logic symmetric, but this
change is certainly more optimal. =)
> > + preserved_mem_tree_pa = (u64)virt_to_phys(tree->root);
>
> I don't think we should cast phys_addr_t to u64.
>
> > + }
> > +
> > + err |= fdt_property(root, KHO_FDT_MEMORY_MAP_PROP_NAME,
> > + &preserved_mem_tree_pa,
> > + sizeof(preserved_mem_tree_pa));
> > +
> > err |= fdt_end_node(root);
> > err |= fdt_finish(root);
> >
> > @@ -1332,16 +1329,26 @@ static __init int kho_out_fdt_setup(void)
> >
> > static __init int kho_init(void)
> > {
> > + struct kho_radix_tree *tree = &kho_out.radix_tree;
> > const void *fdt = kho_get_fdt();
> > int err = 0;
> >
> > if (!kho_enable)
> > return 0;
> >
> > + scoped_guard(mutex, &tree->lock) {
>
> No need for lock here. If anything tries to access the tree before the root
> is allocated we are anyway doomed.
>
> > + tree->root = (struct kho_radix_node *)
> > + kzalloc(PAGE_SIZE, GFP_KERNEL);
>
> No need for casting from void *.
>
> > + if (!tree->root) {
> > + err = -ENOMEM;
> > + goto err_free_scratch;
> > + }
> > + }
>
> --
> Sincerely yours,
> Mike.
Thanks again, sending out the v5 patches.
--
Jason Miu
Thread overview:
2026-01-09 0:11 [PATCH v4 0/2] Make KHO Stateless Jason Miu
2026-01-09 0:11 ` [PATCH v4 1/2] kho: Adopt radix tree for preserved memory tracking Jason Miu
2026-01-12 10:15 ` Mike Rapoport
2026-01-12 14:39 ` Jason Gunthorpe
2026-01-13 5:33 ` Jason Miu [this message]
2026-01-09 0:11 ` [PATCH v4 2/2] kho: Remove finalize state and clients Jason Miu
2026-01-12 10:25 ` Mike Rapoport