Re: [PATCH v6 1/1] mm: report per-page metadata information

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Rientjes <rientjes@google.com>
To: Sourav Panda <souravpanda@google.com>
Cc: corbet@lwn.net, gregkh@linuxfoundation.org, rafael@kernel.org,
	 Andrew Morton <akpm@linux-foundation.org>,
	mike.kravetz@oracle.com,  muchun.song@linux.dev, rppt@kernel.org,
	david@redhat.com,  rdunlap@infradead.org,
	chenlinxuan@uniontech.com, yang.yang29@zte.com.cn,
	 tomas.mudrunka@gmail.com, bhelgaas@google.com,
	ivan@cloudflare.com,  pasha.tatashin@soleen.com,
	yosryahmed@google.com, hannes@cmpxchg.org,  shakeelb@google.com,
	kirill.shutemov@linux.intel.com,  wangkefeng.wang@huawei.com,
	adobriyan@gmail.com,  Vlastimil Babka <vbabka@suse.cz>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	surenb@google.com,  linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,  linux-doc@vger.kernel.org,
	linux-mm@kvack.org,  Matthew Wilcox <willy@infradead.org>,
	weixugc@google.com
Subject: Re: [PATCH v6 1/1] mm: report per-page metadata information
Date: Fri, 5 Jan 2024 16:38:23 -0800 (PST)	[thread overview]
Message-ID: <9266a9f9-71a5-dd88-df7b-facc037aaaff@google.com> (raw)
In-Reply-To: <b425ba6e-50b8-d1a6-7cb1-f94ba9e06c35@google.com>

On Fri, 5 Jan 2024, David Rientjes wrote:

> On Tue, 5 Dec 2023, Sourav Panda wrote:
> 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 49ef12df631b..d5901d04e082 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -993,6 +993,7 @@ Example output. You may not have all of these fields.
> >      AnonPages:       4654780 kB
> >      Mapped:           266244 kB
> >      Shmem:              9976 kB
> > +    PageMetadata:     513419 kB
> >      KReclaimable:     517708 kB
> >      Slab:             660044 kB
> >      SReclaimable:     517708 kB
> > @@ -1095,6 +1096,8 @@ Mapped
> >                files which have been mmapped, such as libraries
> >  Shmem
> >                Total memory used by shared memory (shmem) and tmpfs
> > +PageMetadata
> > +              Memory used for per-page metadata
> >  KReclaimable
> >                Kernel allocations that the kernel will attempt to reclaim
> >                under memory pressure. Includes SReclaimable (below), and other
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 45af9a989d40..f141bb2a550d 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	long available;
> >  	unsigned long pages[NR_LRU_LISTS];
> >  	unsigned long sreclaimable, sunreclaim;
> > +	unsigned long nr_page_metadata;
> 
> Initialize it here (if we actually need this variable)?
> 
> >  	int lru;
> > +	int nid;
> >  
> >  	si_meminfo(&i);
> >  	si_swapinfo(&i);
> > @@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> >  	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> >  
> > +	nr_page_metadata = 0;
> > +	for_each_online_node(nid)
> > +		nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
> 
> Is this intended to be different than 
> global_node_page_state_pages(NR_PAGE_METADATA)?  
> 
> If so, any hint as to why we want to discount page metadata on offline 
> nodes?  We can't make an inference that metadata is always allocated 
> locally, memoryless nodes need things like struct page allocated on nodes 
> with memory.
> 

Sorry, meant a node with only ZONE_MOVABLE here so metadata can't be 
allocated locally.

But would be very interested to learn why this subtlety exists to sum up 
only online nodes.

> So even if a memoryless node is offline, we'd still be including its 
> metadata here with the current implementation.
> 
> Or maybe I'm missing a subtlety here for why this is not already 
> global_node_page_state_pages().
> 
> > +
> >  	show_val_kb(m, "MemTotal:       ", i.totalram);
> >  	show_val_kb(m, "MemFree:        ", i.freeram);
> >  	show_val_kb(m, "MemAvailable:   ", available);
> > @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	show_val_kb(m, "Mapped:         ",
> >  		    global_node_page_state(NR_FILE_MAPPED));
> >  	show_val_kb(m, "Shmem:          ", i.sharedram);
> > +	show_val_kb(m, "PageMetadata:   ", nr_page_metadata);
> >  	show_val_kb(m, "KReclaimable:   ", sreclaimable +
> >  		    global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
> >  	show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 3c25226beeed..ef176152be7c 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -207,6 +207,10 @@ enum node_stat_item {
> >  	PGPROMOTE_SUCCESS,	/* promote successfully */
> >  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> >  #endif
> > +	NR_PAGE_METADATA,	/* Page metadata size (struct page and page_ext)
> > +				 * in pages
> > +				 */
> > +	NR_PAGE_METADATA_BOOT,	/* NR_PAGE_METADATA for bootmem */
> 
> So if some vmemmap pages are freed, then MemTotal could be incremented by 
> a portion of NR_PAGE_METADATA_BOOT and then this stat is decremented?  Is 
> the goal that the sum of MemTotal + SUM(nr_page_metadata_boot) is always 
> constant?
> 
> >  	NR_VM_NODE_STAT_ITEMS
> >  };
> >  
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index fed855bae6d8..af096a881f03 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
> >  {
> >  	lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
> >  }
> > +
> > +void __init mod_node_early_perpage_metadata(int nid, long delta);
> > +void __init store_early_perpage_metadata(void);
> > +
> >  #endif /* _LINUX_VMSTAT_H */
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 87818ee7f01d..5b10d8d2b471 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -230,10 +230,14 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> >   */
> >  static inline void free_vmemmap_page(struct page *page)
> >  {
> > -	if (PageReserved(page))
> > +	if (PageReserved(page)) {
> >  		free_bootmem_page(page);
> > -	else
> > +		mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA_BOOT,
> > +				    -1);
> > +	} else {
> >  		__free_page(page);
> > +		mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
> > +	}
> >  }
> >  
> >  /* Free a list of the vmemmap pages */
> > @@ -389,6 +393,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >  		copy_page(page_to_virt(walk.reuse_page),
> >  			  (void *)walk.reuse_addr);
> >  		list_add(&walk.reuse_page->lru, vmemmap_pages);
> > +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
> >  	}
> >  
> >  	/*
> > @@ -437,14 +442,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> >  	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> >  	int nid = page_to_nid((struct page *)start);
> >  	struct page *page, *next;
> > +	int i;
> >  
> > -	while (nr_pages--) {
> > +	for (i = 0; i < nr_pages; i++) {
> >  		page = alloc_pages_node(nid, gfp_mask, 0);
> > -		if (!page)
> > +		if (!page) {
> > +			mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +					    i);
> >  			goto out;
> > +		}
> >  		list_add(&page->lru, list);
> >  	}
> >  
> > +	mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
> > +
> >  	return 0;
> >  out:
> >  	list_for_each_entry_safe(page, next, list, lru)
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 077bfe393b5e..38f8e1f454a0 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/pgtable.h>
> >  #include <linux/swap.h>
> >  #include <linux/cma.h>
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include "slab.h"
> >  #include "shuffle.h"
> > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
> >  			panic("Failed to allocate %ld bytes for node %d memory map\n",
> >  			      size, pgdat->node_id);
> >  		pgdat->node_mem_map = map + offset;
> > +		mod_node_early_perpage_metadata(pgdat->node_id,
> > +						DIV_ROUND_UP(size, PAGE_SIZE));
> >  	}
> >  	pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
> >  				__func__, pgdat->node_id, (unsigned long)pgdat,
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 733732e7e0ba..dd78017105b0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5636,6 +5636,7 @@ void __init setup_per_cpu_pageset(void)
> >  	for_each_online_pgdat(pgdat)
> >  		pgdat->per_cpu_nodestats =
> >  			alloc_percpu(struct per_cpu_nodestat);
> > +	store_early_perpage_metadata();
> >  }
> >  
> >  __meminit void zone_pcp_init(struct zone *zone)
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 4548fcc66d74..4ca9f298f34e 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
> >  		return -ENOMEM;
> >  	NODE_DATA(nid)->node_page_ext = base;
> >  	total_usage += table_size;
> > +	mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA_BOOT,
> > +			    DIV_ROUND_UP(table_size, PAGE_SIZE));
> >  	return 0;
> >  }
> >  
> > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
> >  	void *addr = NULL;
> >  
> >  	addr = alloc_pages_exact_nid(nid, size, flags);
> > -	if (addr) {
> > +	if (addr)
> >  		kmemleak_alloc(addr, size, 1, flags);
> > -		return addr;
> > -	}
> > +	else
> > +		addr = vzalloc_node(size, nid);
> >  
> > -	addr = vzalloc_node(size, nid);
> > +	if (addr) {
> > +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +				    DIV_ROUND_UP(size, PAGE_SIZE));
> > +	}
> >  
> >  	return addr;
> >  }
> > @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
> >  
> >  static void free_page_ext(void *addr)
> >  {
> > +	size_t table_size;
> > +	struct page *page;
> > +	struct pglist_data *pgdat;
> > +
> > +	table_size = page_ext_size * PAGES_PER_SECTION;
> > +
> >  	if (is_vmalloc_addr(addr)) {
> > +		page = vmalloc_to_page(addr);
> > +		pgdat = page_pgdat(page);
> >  		vfree(addr);
> >  	} else {
> > -		struct page *page = virt_to_page(addr);
> > -		size_t table_size;
> > -
> > -		table_size = page_ext_size * PAGES_PER_SECTION;
> > -
> > +		page = virt_to_page(addr);
> > +		pgdat = page_pgdat(page);
> >  		BUG_ON(PageReserved(page));
> >  		kmemleak_free(addr);
> >  		free_pages_exact(addr, table_size);
> >  	}
> > +
> > +	mod_node_page_state(pgdat, NR_PAGE_METADATA,
> > +			    -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> > +
> >  }
> >  
> >  static void __free_page_ext(unsigned long pfn)
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index a2cbe44c48e1..054b49539843 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -469,5 +469,13 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
> >  	if (r < 0)
> >  		return NULL;
> >  
> > +	if (system_state == SYSTEM_BOOTING) {
> > +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA_BOOT,
> > +				    DIV_ROUND_UP(end - start, PAGE_SIZE));
> > +	} else {
> > +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +				    DIV_ROUND_UP(end - start, PAGE_SIZE));
> > +	}
> > +
> >  	return pfn_to_page(pfn);
> >  }
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 77d91e565045..0c100ae1cf8b 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -14,7 +14,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/bootmem_info.h>
> > -
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include <asm/dma.h>
> >  
> > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
> >  	 */
> >  	sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
> >  	sparsemap_buf_end = sparsemap_buf + size;
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > +	mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> > +#endif
> >  }
> >  
> >  static void __init sparse_buffer_fini(void)
> > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> >  	unsigned long start = (unsigned long) pfn_to_page(pfn);
> >  	unsigned long end = start + nr_pages * sizeof(struct page);
> >  
> > +	mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
> > +			    -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
> >  	vmemmap_free(start, end, altmap);
> >  }
> >  static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 359460deb377..23e88d8c21b7 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1249,7 +1249,8 @@ const char * const vmstat_text[] = {
> >  	"pgpromote_success",
> >  	"pgpromote_candidate",
> >  #endif
> > -
> > +	"nr_page_metadata",
> > +	"nr_page_metadata_boot",
> >  	/* enum writeback_stat_item counters */
> >  	"nr_dirty_threshold",
> >  	"nr_dirty_background_threshold",
> > @@ -2278,4 +2279,27 @@ static int __init extfrag_debug_init(void)
> >  }
> >  
> >  module_init(extfrag_debug_init);
> > +
> >  #endif
> > +
> > +/*
> > + * Page metadata size (struct page and page_ext) in pages
> > + */
> > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
> > +
> > +void __init mod_node_early_perpage_metadata(int nid, long delta)
> > +{
> > +	early_perpage_metadata[nid] += delta;
> > +}
> > +
> > +void __init store_early_perpage_metadata(void)
> > +{
> > +	int nid;
> > +	struct pglist_data *pgdat;
> > +
> > +	for_each_online_pgdat(pgdat) {
> > +		nid = pgdat->node_id;
> > +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA_BOOT,
> > +				    early_perpage_metadata[nid]);
> > +	}
> > +}
> > -- 
> > 2.43.0.472.g3155946c3a-goog
> > 
> > 
> > 
>

next prev parent reply	other threads:[~2024-01-06  0:38 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-05 22:31 [PATCH v6 0/1] " Sourav Panda
2023-12-05 22:31 ` [PATCH v6 1/1] " Sourav Panda
2024-01-05 18:34   ` Pasha Tatashin
2024-01-05 22:19   ` David Rientjes
2024-01-06  0:38     ` David Rientjes [this message]
2024-01-09 21:57       ` Sourav Panda
2023-12-06  2:36 ` [PATCH v6 0/1] " Greg KH
2023-12-06  2:57   ` Pasha Tatashin
2023-12-06  3:12     ` Greg KH
2023-12-06 15:59       ` Andrew Morton
2023-12-07  7:31         ` Greg KH

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9266a9f9-71a5-dd88-df7b-facc037aaaff@google.com \
    --to=rientjes@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=bhelgaas@google.com \
    --cc=chenlinxuan@uniontech.com \
    --cc=corbet@lwn.net \
    --cc=david@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=ivan@cloudflare.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=pasha.tatashin@soleen.com \
    --cc=rafael@kernel.org \
    --cc=rdunlap@infradead.org \
    --cc=rppt@kernel.org \
    --cc=shakeelb@google.com \
    --cc=souravpanda@google.com \
    --cc=surenb@google.com \
    --cc=tomas.mudrunka@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=yang.yang29@zte.com.cn \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox