From: Jason Gunthorpe <jgg@nvidia.com>
To: Jason Miu <jasonmiu@google.com>
Cc: Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>,
Changyuan Lyu <changyuanl@google.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Mike Rapoport <rppt@kernel.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Pratyush Yadav <pratyush@kernel.org>,
kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v2 1/3] kho: Adopt KHO radix tree data structures
Date: Thu, 23 Oct 2025 14:43:17 -0300 [thread overview]
Message-ID: <20251023174317.GO262900@nvidia.com> (raw)
In-Reply-To: <20251020100306.2709352-2-jasonmiu@google.com>
On Mon, Oct 20, 2025 at 03:03:04AM -0700, Jason Miu wrote:
> - * Keep track of memory that is to be preserved across KHO.
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * | Level 5 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * v
> + * +-------------------+
> + * | Level 4 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * | ... (intermediate levels)
> + * |
> + * v
> + * +-------------------+
> + * | Level 0 | (struct kho_bitmap_table)
> + * +-------------------+
> *
> - * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.
I think it is valuable to preserve this justification for
bitmaps. There was a lot of debate over bitmaps vs ranges and this is
the advantage of bitmaps. It is a bit subtle..
> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + * <-- Higher Bits ------------------------------------ Lower Bits -->
> + * +--------------------------+-----------------------------------------+
> + * | Order Bits | Page Offset |
> + * +--------------------------+-----------------------------------------+
> + * | ... 0 0 1 0 0 ... | pa >> (PAGE_SHIFT + order) |
> + * +--------------------------+-----------------------------------------+
> + * ^
> + * |
> + * This single '1' bit's position
> + * uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51,
> + * i.e. shifting right, without conflicting with the page offset bits.
This description is correct, but the diagram is misleading. Maybe like this:
* |0| ... 0 0 0 1 | pa >> (PAGE_SHIFT + 0) |
* |1| ... 0 0 0 0 1 | pa >> (PAGE_SHIFT + 1) |
* |2| ... 0 0 0 0 0 1 | pa >> (PAGE_SHIFT + 2) |
[..]
> + *
> + * This design stores pages of all sizes (orders) in a single 6-level table. It
> + * efficiently shares lower table levels, especially due to common zero top
> + * address bits, allowing a single, efficient algorithm to manage all pages.
> + */
> +enum kho_radix_consts {
> + /* The bit position of a 0-order page, only supports 64bits system */
> + ORDER_0_LG2 = 64 - PAGE_SHIFT,
> + /* Bit number used for each level of tables */
> + TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> + /* Bit number used for lowest level of bitmaps */
> + BITMAP_SIZE_LG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
"bit number" is a bit confusing when using a lg2 terms..
/* Size of the table in kho_radix_tree, in lg2 */
TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t))
/* Number of bits in the kho_bitmap_table, in lg2 */
BITMAP_SIZE_LG2 = const_ilog2(BITS_PER_BYTE * PAGE_SIZE)
Then use the constants in the structs:
struct kho_radix_tree {
phys_addr_t table[1 << TABLE_SIZE_LG2];
};
struct kho_bitmap_table {
unsigned long bitmaps[(1 << BITMAP_SIZE_LG2) / BITS_PER_LONG];
};
> -struct khoser_mem_chunk;
> +static inline phys_addr_t kho_radix_tree_desc(struct kho_radix_tree *va)
> +{
> + return (phys_addr_t)virt_to_phys(va);
> +}
virt_to_phys already returns phys_addr_t ?
> +static inline struct kho_radix_tree *kho_radix_tree(phys_addr_t desc)
> +{
> + return (struct kho_radix_tree *)(phys_to_virt(desc));
> +}
phys_to_virt returns void *, no need for a cast
> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb,
> + unsigned long offset)
> +{
> + if (!bit_tlb ||
> + offset >= PAGE_SIZE * BITS_PER_BYTE)
> + return -EINVAL;
WARN_ON ? These are assertions aren't they?
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, unsigned int level,
> + unsigned long start,
> + kho_radix_tree_walk_callback_t cb)
> +{
> + struct kho_radix_tree *next_tree;
> + struct kho_bitmap_table *bitmap_table;
> + unsigned long encoded, i;
> + unsigned int level_shift;
> + int err = 0;
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
> + if (root->table[i]) {
if (!root->table[i])
continue
Remove a level of indentation here
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> + unsigned long pa = PFN_PHYS(pfn);
> +
> + might_sleep();
> +
> + return kho_radix_preserve_page(pa, order);
might_sleep should be in kho_radix_preserve_page() ? The might should
be placed around if statements that might avoid a sleep, and that is
not in this function.
> static void __init kho_mem_deserialize(const void *fdt)
> {
> + struct kho_radix_tree *tree_root;
> const phys_addr_t *mem;
> int len;
>
> + /* Retrieve the KHO radix tree from passed-in FDT. */
> + mem = fdt_getprop(fdt, 0, PROP_PRESERVED_PAGE_RADIX_TREE, &len);
>
> if (!mem || len != sizeof(*mem)) {
> + pr_err("failed to get preserved KHO memory tree\n");
> return;
> }
>
> + tree_root = *mem ?
> + (struct kho_radix_tree *)phys_to_virt(*mem) :
> + NULL;
>
> + if (!tree_root)
> + return;
Seems weird?
if (!*mem)
return;
tree_root = phys_to_virt(*mem)
kho_radix_walk_trees(tree_root, TREE_MAX_DEPTH - 1, 0,
kho_radix_walk_trees_memblock_callback);
This is prettty good now, IMHO
Jason
next prev parent reply other threads:[~2025-10-23 17:43 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-20 10:03 [PATCH v2 0/3] Make KHO Stateless Jason Miu
2025-10-20 10:03 ` [PATCH v2 1/3] kho: Adopt KHO radix tree data structures Jason Miu
2025-10-22 15:51 ` David Matlack
2025-10-23 0:51 ` Jason Miu
2025-10-23 17:43 ` Jason Gunthorpe [this message]
2025-10-29 1:28 ` Jason Miu
2025-10-23 23:45 ` Jason Gunthorpe
2025-10-24 1:23 ` Pasha Tatashin
2025-10-24 11:43 ` Jason Gunthorpe
2025-10-27 11:53 ` Mike Rapoport
2025-10-29 2:08 ` Jason Miu
2025-10-20 10:03 ` [PATCH v2 2/3] memblock: kho: Remove KHO notifier usage Jason Miu
2025-10-20 10:03 ` [PATCH v2 3/3] kho: Remove notifier system infrastructure Jason Miu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251023174317.GO262900@nvidia.com \
--to=jgg@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=changyuanl@google.com \
--cc=dmatlack@google.com \
--cc=graf@amazon.com \
--cc=jasonmiu@google.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pasha.tatashin@soleen.com \
--cc=pratyush@kernel.org \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox