From: Sourav Panda <souravpanda@google.com>
To: David Rientjes <rientjes@google.com>
Cc: corbet@lwn.net, gregkh@linuxfoundation.org, rafael@kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
mike.kravetz@oracle.com, muchun.song@linux.dev, rppt@kernel.org,
david@redhat.com, rdunlap@infradead.org,
chenlinxuan@uniontech.com, yang.yang29@zte.com.cn,
tomas.mudrunka@gmail.com, bhelgaas@google.com,
ivan@cloudflare.com, pasha.tatashin@soleen.com,
yosryahmed@google.com, hannes@cmpxchg.org,
kirill.shutemov@linux.intel.com, wangkefeng.wang@huawei.com,
adobriyan@gmail.com, Vlastimil Babka <vbabka@suse.cz>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
surenb@google.com, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
weixugc@google.com
Subject: Re: [PATCH v11] mm: report per-page metadata information
Date: Tue, 30 Apr 2024 10:45:26 -0700 [thread overview]
Message-ID: <CANruzcQfhRgst5ho2fUaTH7JA0ftK=8RbKmoKigbGjC386ie5Q@mail.gmail.com> (raw)
In-Reply-To: <03225c8a-63ff-61c5-fedf-0fe8e9f1767d@google.com>
[-- Attachment #1: Type: text/plain, Size: 17713 bytes --]
On Sun, Apr 28, 2024 at 1:28 AM David Rientjes <rientjes@google.com> wrote:
> On Sat, 27 Apr 2024, Sourav Panda wrote:
>
> > Adds a global Memmap field to /proc/meminfo. This information can
> > be used by users to see how much memory is being used by per-page
> > metadata, which can vary depending on build configuration, machine
> > architecture, and system use.
> >
> > Accounting per-page metadata allocated by boot-allocator:
> > /proc/vmstat:nr_memmap_boot * PAGE_SIZE
> >
> > Accounting per-page metadata allocated by buddy-allocator:
> > /proc/vmstat:nr_memmap * PAGE_SIZE
> >
>
> So three things are actually being added: a new field in /proc/meminfo and
> two new fields in /proc/vmstat.
>
> For the /proc/vmstat entries, are these also available in
> /sys/devices/system/node/nodeX/vmstat?
>
> > Accounting total Perpage metadata allocated on the machine:
> > (/proc/vmstat:nr_memmap_boot + /proc/vmstat:nr_memmap) * PAGE_SIZE
> >
>
> Makes sense :) Does this mean that the new /proc/meminfo field is
> redundant, though?
>
Thanks David for the review. Yep, it is redundant and this was also
indicated by Wei before.
In v12, I shall restore the changes to fs/proc/meminfo.c
and Documentation/filesystems/proc.rst
>
> > Utility for userspace:
> >
> > Application Optimization: Depending on the kernel version and command
> > line options, the kernel would relinquish a different number of pages
> > (that contain struct pages) when a hugetlb page is reserved (e.g., 0, 6
> > or 7 for a 2MB hugepage). This patch allows the userspace application
> > to know the exact savings achieved through page metadata deallocation
> > without dealing with the intricacies of the kernel.
> >
>
> ... and for future memdesc optimizations :) But this is implementation
> details, it feels like the main goal of this is the observability part
> below. I'd just focus on that use cases.
>
> > Observability: Struct page overhead can only be calculated on-paper at
> > boot time (e.g., 1.5% machine capacity). Beyond boot once hugepages are
> > reserved or memory is hotplugged, the computation becomes complex.
> > Per-page metrics will help explain part of the system memory overhead,
> > which shall help guide memory optimizations and memory cgroup sizing.
> >
>
> This depends on kernel configuration options, which is another
> implementation detail. I'd just say that we want to know where our memory
> is going, i.e. be able to describe the amount of memory overhead that is
> going to per-page metadata on the system at any given time.
>
Thanks David. We shall restructure this in the following manner:
- what's missing today, what are we introducing, and how this can be put
into practice.
>
> > Debugging: Tracking the changes or absolute value in struct pages can
> > help detect anomalies as they can be correlated with other metrics in
> > the machine (e.g., memtotal, number of huge pages, etc).
> >
> > page_ext overheads: Some kernel features such as page_owner
> > page_table_check that use page_ext can be optionally enabled via kernel
> > parameters. Having the total per-page metadata information helps users
> > precisely measure impact.
> >
> > For background and results see:
> > lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com
> >
> > Signed-off-by: Sourav Panda <souravpanda@google.com>
> > Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> I'd suggest removing the /proc/meminfo field, it's redundant: the
> information is already available through at least /proc/vmstat.
> That said, why not /proc/zoneinfo instead of any changes to /proc/vmstat?
> Do we want to know what zones this overhead is coming from, i.e. where our
> struct page is allocated from on a node with ZONE_MOVABLE?
We would like to keep it a system-wide metric and if we care what NUMA node
the page overhead is coming from, the per-node vmstat. This is to keep the
implementation and interpretation of the results relatively simple.
>
> > ---
> > Changelog:
> > Fixed positioning of node_stat_item:NR_MEMMAP's comment.
> > Synchronized with 6.9-rc5.
> >
> > v10:
> > lore.kernel.org/all/20240416201335.3551099-1-souravpanda@google.com/
> > ---
> > Documentation/filesystems/proc.rst | 3 +++
> > fs/proc/meminfo.c | 4 ++++
> > include/linux/mmzone.h | 3 +++
> > include/linux/vmstat.h | 4 ++++
> > mm/hugetlb_vmemmap.c | 17 ++++++++++++----
> > mm/mm_init.c | 3 +++
> > mm/page_alloc.c | 1 +
> > mm/page_ext.c | 32 +++++++++++++++++++++---------
> > mm/sparse-vmemmap.c | 8 ++++++++
> > mm/sparse.c | 7 ++++++-
> > mm/vmstat.c | 26 +++++++++++++++++++++++-
> > 11 files changed, 93 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst
> b/Documentation/filesystems/proc.rst
> > index c6a6b9df21049..a7445d49a3bb7 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -993,6 +993,7 @@ Example output. You may not have all of these fields.
> > AnonPages: 4654780 kB
> > Mapped: 266244 kB
> > Shmem: 9976 kB
> > + Memmap: 513419 kB
> > KReclaimable: 517708 kB
> > Slab: 660044 kB
> > SReclaimable: 517708 kB
> > @@ -1095,6 +1096,8 @@ Mapped
> > files which have been mmapped, such as libraries
> > Shmem
> > Total memory used by shared memory (shmem) and tmpfs
> > +Memmap
> > + Memory used for per-page metadata
> > KReclaimable
> > Kernel allocations that the kernel will attempt to reclaim
> > under memory pressure. Includes SReclaimable (below), and
> other
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 45af9a989d404..3d3db55cfeab6 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -39,6 +39,7 @@ static int meminfo_proc_show(struct seq_file *m, void
> *v)
> > long available;
> > unsigned long pages[NR_LRU_LISTS];
> > unsigned long sreclaimable, sunreclaim;
> > + unsigned long nr_memmap;
> > int lru;
> >
> > si_meminfo(&i);
> > @@ -57,6 +58,8 @@ static int meminfo_proc_show(struct seq_file *m, void
> *v)
> > sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> > sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> >
> > + nr_memmap = global_node_page_state_pages(NR_MEMMAP);
> > +
> > show_val_kb(m, "MemTotal: ", i.totalram);
> > show_val_kb(m, "MemFree: ", i.freeram);
> > show_val_kb(m, "MemAvailable: ", available);
> > @@ -104,6 +107,7 @@ static int meminfo_proc_show(struct seq_file *m,
> void *v)
> > show_val_kb(m, "Mapped: ",
> > global_node_page_state(NR_FILE_MAPPED));
> > show_val_kb(m, "Shmem: ", i.sharedram);
> > + show_val_kb(m, "Memmap: ", nr_memmap);
> > show_val_kb(m, "KReclaimable: ", sreclaimable +
> > global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
> > show_val_kb(m, "Slab: ", sreclaimable + sunreclaim);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81efa..87963b13b53ee 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -217,6 +217,9 @@ enum node_stat_item {
> > PGDEMOTE_KSWAPD,
> > PGDEMOTE_DIRECT,
> > PGDEMOTE_KHUGEPAGED,
> > + /* Page metadata size (struct page and page_ext) in pages */
> > + NR_MEMMAP,
> > + NR_MEMMAP_BOOT, /* NR_MEMMAP for bootmem */
> > NR_VM_NODE_STAT_ITEMS
> > };
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 343906a98d6ee..c3785fdd3668d 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -632,4 +632,8 @@ static inline void lruvec_stat_sub_folio(struct
> folio *folio,
> > {
> > lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
> > }
> > +
> > +void __meminit mod_node_early_perpage_metadata(int nid, long delta);
> > +void __meminit store_early_perpage_metadata(void);
> > +
> > #endif /* _LINUX_VMSTAT_H */
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index da177e49d9564..2da8689aeb93f 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -184,10 +184,13 @@ static int vmemmap_remap_range(unsigned long
> start, unsigned long end,
> > */
> > static inline void free_vmemmap_page(struct page *page)
> > {
> > - if (PageReserved(page))
> > + if (PageReserved(page)) {
> > free_bootmem_page(page);
> > - else
> > + mod_node_page_state(page_pgdat(page), NR_MEMMAP_BOOT, -1);
> > + } else {
> > __free_page(page);
> > + mod_node_page_state(page_pgdat(page), NR_MEMMAP, -1);
> > + }
> > }
> >
> > /* Free a list of the vmemmap pages */
> > @@ -338,6 +341,7 @@ static int vmemmap_remap_free(unsigned long start,
> unsigned long end,
> > copy_page(page_to_virt(walk.reuse_page),
> > (void *)walk.reuse_addr);
> > list_add(&walk.reuse_page->lru, vmemmap_pages);
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, 1);
> > }
> >
> > /*
> > @@ -384,14 +388,19 @@ static int alloc_vmemmap_page_list(unsigned long
> start, unsigned long end,
> > unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> > int nid = page_to_nid((struct page *)start);
> > struct page *page, *next;
> > + int i;
> >
> > - while (nr_pages--) {
> > + for (i = 0; i < nr_pages; i++) {
> > page = alloc_pages_node(nid, gfp_mask, 0);
> > - if (!page)
> > + if (!page) {
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, i);
> > goto out;
> > + }
> > list_add(&page->lru, list);
> > }
> >
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, nr_pages);
> > +
> > return 0;
> > out:
> > list_for_each_entry_safe(page, next, list, lru)
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 549e76af8f82a..1a429c73b32e4 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -27,6 +27,7 @@
> > #include <linux/swap.h>
> > #include <linux/cma.h>
> > #include <linux/crash_dump.h>
> > +#include <linux/vmstat.h>
> > #include "internal.h"
> > #include "slab.h"
> > #include "shuffle.h"
> > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct
> pglist_data *pgdat)
> > panic("Failed to allocate %ld bytes for node %d memory
> map\n",
> > size, pgdat->node_id);
> > pgdat->node_mem_map = map + offset;
> > + mod_node_early_perpage_metadata(pgdat->node_id,
> > + DIV_ROUND_UP(size, PAGE_SIZE));
> > pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
> > __func__, pgdat->node_id, (unsigned long)pgdat,
> > (unsigned long)pgdat->node_mem_map);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d3367..aa8dd5bccb7ac 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5650,6 +5650,7 @@ void __init setup_per_cpu_pageset(void)
> > for_each_online_pgdat(pgdat)
> > pgdat->per_cpu_nodestats =
> > alloc_percpu(struct per_cpu_nodestat);
> > + store_early_perpage_metadata();
> > }
> >
> > __meminit void zone_pcp_init(struct zone *zone)
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 4548fcc66d74d..c1e324a1427e0 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
> > return -ENOMEM;
> > NODE_DATA(nid)->node_page_ext = base;
> > total_usage += table_size;
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT,
> > + DIV_ROUND_UP(table_size, PAGE_SIZE));
> > return 0;
> > }
> >
> > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size,
> int nid)
> > void *addr = NULL;
> >
> > addr = alloc_pages_exact_nid(nid, size, flags);
> > - if (addr) {
> > + if (addr)
> > kmemleak_alloc(addr, size, 1, flags);
> > - return addr;
> > - }
> > + else
> > + addr = vzalloc_node(size, nid);
> >
> > - addr = vzalloc_node(size, nid);
> > + if (addr) {
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP,
> > + DIV_ROUND_UP(size, PAGE_SIZE));
> > + }
> >
> > return addr;
> > }
> > @@ -303,18 +308,27 @@ static int __meminit
> init_section_page_ext(unsigned long pfn, int nid)
> >
> > static void free_page_ext(void *addr)
> > {
> > + size_t table_size;
> > + struct page *page;
> > + struct pglist_data *pgdat;
> > +
> > + table_size = page_ext_size * PAGES_PER_SECTION;
> > +
> > if (is_vmalloc_addr(addr)) {
> > + page = vmalloc_to_page(addr);
> > + pgdat = page_pgdat(page);
> > vfree(addr);
> > } else {
> > - struct page *page = virt_to_page(addr);
> > - size_t table_size;
> > -
> > - table_size = page_ext_size * PAGES_PER_SECTION;
> > -
> > + page = virt_to_page(addr);
> > + pgdat = page_pgdat(page);
> > BUG_ON(PageReserved(page));
> > kmemleak_free(addr);
> > free_pages_exact(addr, table_size);
> > }
> > +
> > + mod_node_page_state(pgdat, NR_MEMMAP,
> > + -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> > +
> > }
> >
> > static void __free_page_ext(unsigned long pfn)
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index a2cbe44c48e10..1dda6c53370b0 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -469,5 +469,13 @@ struct page * __meminit
> __populate_section_memmap(unsigned long pfn,
> > if (r < 0)
> > return NULL;
> >
> > + if (system_state == SYSTEM_BOOTING) {
> > + mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(end -
> start,
> > +
> PAGE_SIZE));
> > + } else {
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP,
> > + DIV_ROUND_UP(end - start, PAGE_SIZE));
> > + }
> > +
> > return pfn_to_page(pfn);
> > }
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aed0951b87fa0..684a91773bd76 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -14,7 +14,7 @@
> > #include <linux/swap.h>
> > #include <linux/swapops.h>
> > #include <linux/bootmem_info.h>
> > -
> > +#include <linux/vmstat.h>
> > #include "internal.h"
> > #include <asm/dma.h>
> >
> > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long
> size, int nid)
> > */
> > sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid,
> true);
> > sparsemap_buf_end = sparsemap_buf + size;
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > + mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size,
> PAGE_SIZE));
> > +#endif
> > }
> >
> > static void __init sparse_buffer_fini(void)
> > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long
> pfn, unsigned long nr_pages,
> > unsigned long start = (unsigned long) pfn_to_page(pfn);
> > unsigned long end = start + nr_pages * sizeof(struct page);
> >
> > + mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_MEMMAP,
> > + -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
> > vmemmap_free(start, end, altmap);
> > }
> > static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index db79935e4a543..79466450040e6 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1252,7 +1252,8 @@ const char * const vmstat_text[] = {
> > "pgdemote_kswapd",
> > "pgdemote_direct",
> > "pgdemote_khugepaged",
> > -
> > + "nr_memmap",
> > + "nr_memmap_boot",
> > /* enum writeback_stat_item counters */
> > "nr_dirty_threshold",
> > "nr_dirty_background_threshold",
> > @@ -2279,4 +2280,27 @@ static int __init extfrag_debug_init(void)
> > }
> >
> > module_init(extfrag_debug_init);
> > +
> > #endif
> > +
> > +/*
> > + * Page metadata size (struct page and page_ext) in pages
> > + */
> > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __meminitdata;
> > +
> > +void __meminit mod_node_early_perpage_metadata(int nid, long delta)
> > +{
> > + early_perpage_metadata[nid] += delta;
> > +}
> > +
> > +void __meminit store_early_perpage_metadata(void)
> > +{
> > + int nid;
> > + struct pglist_data *pgdat;
> > +
> > + for_each_online_pgdat(pgdat) {
> > + nid = pgdat->node_id;
> > + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT,
> > + early_perpage_metadata[nid]);
> > + }
> > +}
> > --
> > 2.44.0.769.g3c40516874-goog
> >
> >
> >
>
[-- Attachment #2: Type: text/html, Size: 22603 bytes --]
prev parent reply other threads:[~2024-04-30 17:45 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-27 20:28 Sourav Panda
2024-04-28 8:28 ` David Rientjes
2024-04-30 17:45 ` Sourav Panda [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CANruzcQfhRgst5ho2fUaTH7JA0ftK=8RbKmoKigbGjC386ie5Q@mail.gmail.com' \
--to=souravpanda@google.com \
--cc=Liam.Howlett@oracle.com \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bhelgaas@google.com \
--cc=chenlinxuan@uniontech.com \
--cc=corbet@lwn.net \
--cc=david@redhat.com \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=ivan@cloudflare.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=muchun.song@linux.dev \
--cc=pasha.tatashin@soleen.com \
--cc=rafael@kernel.org \
--cc=rdunlap@infradead.org \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=tomas.mudrunka@gmail.com \
--cc=vbabka@suse.cz \
--cc=wangkefeng.wang@huawei.com \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=yang.yang29@zte.com.cn \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox