From: Sourav Panda <souravpanda@google.com>
Date: Tue, 30 Apr 2024 10:45:26 -0700
Subject: Re: [PATCH v11] mm: report per-page metadata information
To: David Rientjes <rientjes@google.com>
Cc: corbet@lwn.net, gregkh@linuxfoundation.org, rafael@kernel.org,
    Andrew Morton, mike.kravetz@oracle.com, muchun.song@linux.dev,
    rppt@kernel.org, david@redhat.com, rdunlap@infradead.org,
    chenlinxuan@uniontech.com, yang.yang29@zte.com.cn,
    tomas.mudrunka@gmail.com, bhelgaas@google.com, ivan@cloudflare.com,
    pasha.tatashin@soleen.com, yosryahmed@google.com, hannes@cmpxchg.org,
    kirill.shutemov@linux.intel.com, wangkefeng.wang@huawei.com,
    adobriyan@gmail.com, Vlastimil Babka, "Liam R. Howlett",
    surenb@google.com, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-mm@kvack.org, Matthew Wilcox, weixugc@google.com
In-Reply-To: <03225c8a-63ff-61c5-fedf-0fe8e9f1767d@google.com>
References: <20240427202840.4123201-1-souravpanda@google.com>
    <03225c8a-63ff-61c5-fedf-0fe8e9f1767d@google.com>
Howlett" , surenb@google.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, Matthew Wilcox , weixugc@google.com Content-Type: multipart/alternative; boundary="0000000000001872b9061753f087" X-Rspamd-Queue-Id: 28C0840005 X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: yihamx3ottpsekcfnwh6fjrgpe4ru9un X-HE-Tag: 1714499139-741063 X-HE-Meta: U2FsdGVkX196XhJHc+CpNV94oXqWBuOS+V+pPcRtmXFivgmAcgH2MsfFuukEaEPs73aXdVHOpAPL11BxZRL7Ez1k62mKMPKHmWu7mh45zRP5gezvjqKN/DCUQ/WVEGDf224yXyBEjTFcW+KT4uX9O42nZCPZoxqBPbyB5VK4mJs/wJsJeXW9Kks+fFe7jwci85326U7wAVOa7L/LVk57SuC2b8nHY9/2rl/wdk68yPADhfMSldq64WENZ2fCaNpQ3gmYXcAkAb8TWLNvHqHeyTsOD0PBY62ddKfKnNl5mkQAlBZ9H91jo01Q9r+goePP3PtfZfiAuoJY6KiiMYLQc72hSyzp3GU+gQeu/BWC4NRoGPtb5sJxY53g8gMosS/tAORqRUjuKrrQdpmOUcWFB2h5HTKvp40MStvkhiOumH+wxyKhrxk5Tx7cBKyyXbgskNyQJHSz8OerVaj/eSNR5TZXzc/v4ct3AQSJUCsvGjRC90fgVkBsIJFCQXeCzIARebd2gXBRln1/zWnT/E12bYFESZdECZV4PYzT7ilzCivqY75pTWFV1N1zYQR5JBs+S8vXT+l8yV4gWuGs19g8pUxkc9cIOVW2GZmNPN4TiF5Bi3aKxZEE3x/deto4BeTK84N4KCBlf+MnFD5Y5878Xe3A+hfpj5m3mEi08Lzkb4jBuJP1heziRb+nnbLZ9FMjYNBaSrpAjxusHvI8UfrdYk4gCnV9RzNwqeCVauWHr+RGRYjh3/WXPgz64H6mXEI5HzTNxcr644CkmZQcZv99AxrxsielNHxxNfKXoIfCik8iP39ZVh9XnEFcqAGcr7ZW5G2sTRLdoQ+A7JwX6mUQgTIXq27ctGl7cTqnPaKLGcdjszikPYYkwHVMk4dSmm74MqwqG8a9hw+MOOTm6gFFTWPOT6iglXkIVyPBnETEX60TzpRuaKHLHoUCSoGNlPR77lI/vXagG3gxdShnImC QnGnlFiD oj6koGKzYlaSrDws= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: --0000000000001872b9061753f087 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, Apr 28, 2024 at 1:28=E2=80=AFAM David Rientjes wrote: > On Sat, 27 Apr 2024, Sourav Panda wrote: > > > Adds a global Memmap field to /proc/meminfo. This information can > > be used by users to see how much memory is being used by per-page > > metadata, which can vary depending on build configuration, machine > > architecture, and system use. > > > > Accounting per-page metadata allocated by boot-allocator: > > /proc/vmstat:nr_memmap_boot * PAGE_SIZE > > > > Accounting per-page metadata allocated by buddy-allocator: > > /proc/vmstat:nr_memmap * PAGE_SIZE > > > > So three things are actually being added: a new field in /proc/meminfo an= d > two new fields in /proc/vmstat. > > For the /proc/vmstat entries, are these also available in > /sys/devices/system/node/nodeX/vmstat? > > > Accounting total Perpage metadata allocated on the machine: > > (/proc/vmstat:nr_memmap_boot + /proc/vmstat:nr_memmap) * PAGE_SIZE > > > > Makes sense :) Does this mean that the new /proc/meminfo field is > redundant, though? > Thanks David for the review. Yep, it is redundant and this was also indicated by Wei before. In v12, I shall restore the changes to fs/proc/meminfo.c and Documentation/filesystems/proc.rst > > > Utility for userspace: > > > > Application Optimization: Depending on the kernel version and command > > line options, the kernel would relinquish a different number of pages > > (that contain struct pages) when a hugetlb page is reserved (e.g., 0, 6 > > or 7 for a 2MB hugepage). This patch allows the userspace application > > to know the exact savings achieved through page metadata deallocation > > without dealing with the intricacies of the kernel. > > > > ... 
> > Utility for userspace:
> >
> > Application Optimization: Depending on the kernel version and command
> > line options, the kernel would relinquish a different number of pages
> > (that contain struct pages) when a hugetlb page is reserved (e.g., 0, 6
> > or 7 for a 2MB hugepage). This patch allows the userspace application
> > to know the exact savings achieved through page metadata deallocation
> > without dealing with the intricacies of the kernel.
> >
>
> ... and for future memdesc optimizations :)  But these are implementation
> details; it feels like the main goal of this is the observability part
> below. I'd just focus on that use case.
>
> > Observability: Struct page overhead can only be calculated on paper at
> > boot time (e.g., 1.5% of machine capacity). Beyond boot, once hugepages
> > are reserved or memory is hotplugged, the computation becomes complex.
> > Per-page metrics will help explain part of the system memory overhead,
> > which shall help guide memory optimizations and memory cgroup sizing.
> >
>
> This depends on kernel configuration options, which is another
> implementation detail. I'd just say that we want to know where our memory
> is going, i.e. be able to describe the amount of memory overhead that is
> going to per-page metadata on the system at any given time.
>

Thanks David. We shall restructure this in the following manner:
 - what's missing today, what we are introducing, and how this can be put
   into practice.

> > Debugging: Tracking the changes or absolute value in struct pages can
> > help detect anomalies, as they can be correlated with other metrics in
> > the machine (e.g., memtotal, number of huge pages, etc.).
> >
> > page_ext overheads: Some kernel features, such as page_owner and
> > page_table_check, that use page_ext can be optionally enabled via kernel
> > parameters. Having the total per-page metadata information helps users
> > precisely measure their impact.
> >
> > For background and results see:
> > lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com
> >
> > Signed-off-by: Sourav Panda <souravpanda@google.com>
> > Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> I'd suggest removing the /proc/meminfo field, it's redundant: the
> information is already available through at least /proc/vmstat.
>
> That said, why not /proc/zoneinfo instead of any changes to /proc/vmstat?
> Do we want to know what zones this overhead is coming from, i.e. where our
> struct page is allocated from on a node with ZONE_MOVABLE?

We would like to keep it a system-wide metric; if we care which NUMA node
the page overhead is coming from, there is the per-node vmstat. This keeps
the implementation and interpretation of the results relatively simple.
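
As a sketch of that (again not part of the patch, and assuming the two
counters are exported per node like the other node_stat_items, which is
what the /sys/devices/system/node/nodeX/vmstat question above refers to),
the per-node breakdown could be read like this:

#include <stdio.h>
#include <string.h>

/*
 * Hypothetical per-node reader: walk /sys/devices/system/node/node<N>/vmstat
 * and print the nr_memmap/nr_memmap_boot counters (values are in pages).
 * For simplicity, this sketch assumes node ids are contiguous from 0.
 */
int main(void)
{
	char path[64], key[64];
	unsigned long long val;
	int nid;

	for (nid = 0; ; nid++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/vmstat", nid);
		f = fopen(path, "r");
		if (!f)
			break;
		while (fscanf(f, "%63s %llu", key, &val) == 2) {
			if (!strcmp(key, "nr_memmap") ||
			    !strcmp(key, "nr_memmap_boot"))
				printf("node%d %s: %llu pages\n",
				       nid, key, val);
		}
		fclose(f);
	}
	return 0;
}
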
> > ---
> > Changelog:
> > Fixed positioning of node_stat_item:NR_MEMMAP's comment.
> > Synchronized with 6.9-rc5.
> >
> > v10:
> > lore.kernel.org/all/20240416201335.3551099-1-souravpanda@google.com/
> > ---
> >  Documentation/filesystems/proc.rst |  3 +++
> >  fs/proc/meminfo.c                  |  4 ++++
> >  include/linux/mmzone.h             |  3 +++
> >  include/linux/vmstat.h             |  4 ++++
> >  mm/hugetlb_vmemmap.c               | 17 ++++++++++++----
> >  mm/mm_init.c                       |  3 +++
> >  mm/page_alloc.c                    |  1 +
> >  mm/page_ext.c                      | 32 +++++++++++++++++++++---------
> >  mm/sparse-vmemmap.c                |  8 ++++++++
> >  mm/sparse.c                        |  7 ++++++-
> >  mm/vmstat.c                        | 26 +++++++++++++++++++++++-
> >  11 files changed, 93 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index c6a6b9df21049..a7445d49a3bb7 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -993,6 +993,7 @@ Example output. You may not have all of these fields.
> >      AnonPages:       4654780 kB
> >      Mapped:           266244 kB
> >      Shmem:              9976 kB
> > +    Memmap:           513419 kB
> >      KReclaimable:     517708 kB
> >      Slab:             660044 kB
> >      SReclaimable:     517708 kB
> > @@ -1095,6 +1096,8 @@ Mapped
> >                files which have been mmapped, such as libraries
> >  Shmem
> >                Total memory used by shared memory (shmem) and tmpfs
> > +Memmap
> > +              Memory used for per-page metadata
> >  KReclaimable
> >                Kernel allocations that the kernel will attempt to reclaim
> >                under memory pressure. Includes SReclaimable (below), and other
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 45af9a989d404..3d3db55cfeab6 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -39,6 +39,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	long available;
> >  	unsigned long pages[NR_LRU_LISTS];
> >  	unsigned long sreclaimable, sunreclaim;
> > +	unsigned long nr_memmap;
> >  	int lru;
> >
> >  	si_meminfo(&i);
> > @@ -57,6 +58,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> >  	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> >
> > +	nr_memmap = global_node_page_state_pages(NR_MEMMAP);
> > +
> >  	show_val_kb(m, "MemTotal:       ", i.totalram);
> >  	show_val_kb(m, "MemFree:        ", i.freeram);
> >  	show_val_kb(m, "MemAvailable:   ", available);
> > @@ -104,6 +107,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	show_val_kb(m, "Mapped:         ",
> >  		    global_node_page_state(NR_FILE_MAPPED));
> >  	show_val_kb(m, "Shmem:          ", i.sharedram);
> > +	show_val_kb(m, "Memmap:         ", nr_memmap);
> >  	show_val_kb(m, "KReclaimable:   ", sreclaimable +
> >  		    global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
> >  	show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81efa..87963b13b53ee 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -217,6 +217,9 @@ enum node_stat_item {
> >  	PGDEMOTE_KSWAPD,
> >  	PGDEMOTE_DIRECT,
> >  	PGDEMOTE_KHUGEPAGED,
> > +	/* Page metadata size (struct page and page_ext) in pages */
> > +	NR_MEMMAP,
> > +	NR_MEMMAP_BOOT,		/* NR_MEMMAP for bootmem */
> >  	NR_VM_NODE_STAT_ITEMS
> >  };
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 343906a98d6ee..c3785fdd3668d 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -632,4 +632,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
> >  {
> >  	lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
> >  }
> > +
> > +void __meminit mod_node_early_perpage_metadata(int nid, long delta);
> > +void __meminit store_early_perpage_metadata(void);
> > +
> >  #endif /* _LINUX_VMSTAT_H */
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index da177e49d9564..2da8689aeb93f 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -184,10 +184,13 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> >   */
> >  static inline void free_vmemmap_page(struct page *page)
> >  {
> > -	if (PageReserved(page))
> > +	if (PageReserved(page)) {
> >  		free_bootmem_page(page);
> > -	else
> > +		mod_node_page_state(page_pgdat(page), NR_MEMMAP_BOOT, -1);
> > +	} else {
> >  		__free_page(page);
> > +		mod_node_page_state(page_pgdat(page), NR_MEMMAP, -1);
> > +	}
> >  }
> >
> >  /* Free a list of the vmemmap pages */
> > @@ -338,6 +341,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >  		copy_page(page_to_virt(walk.reuse_page),
> >  			  (void *)walk.reuse_addr);
> >  		list_add(&walk.reuse_page->lru, vmemmap_pages);
> > +		mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, 1);
> >  	}
> >
> >  	/*
> > @@ -384,14 +388,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> >  	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> >  	int nid = page_to_nid((struct page *)start);
> >  	struct page *page, *next;
> > +	int i;
> >
> > -	while (nr_pages--) {
> > +	for (i = 0; i < nr_pages; i++) {
> >  		page = alloc_pages_node(nid, gfp_mask, 0);
> > -		if (!page)
> > +		if (!page) {
> > +			mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, i);
> >  			goto out;
> > +		}
> >  		list_add(&page->lru, list);
> >  	}
> >
> > +	mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, nr_pages);
> > +
> >  	return 0;
> >  out:
> >  	list_for_each_entry_safe(page, next, list, lru)
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 549e76af8f82a..1a429c73b32e4 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/cma.h>
> >  #include <linux/crash_dump.h>
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include "slab.h"
> >  #include "shuffle.h"
> > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
> >  		panic("Failed to allocate %ld bytes for node %d memory map\n",
> >  		      size, pgdat->node_id);
> >  	pgdat->node_mem_map = map + offset;
> > +	mod_node_early_perpage_metadata(pgdat->node_id,
> > +					DIV_ROUND_UP(size, PAGE_SIZE));
> >  	pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
> >  		 __func__, pgdat->node_id, (unsigned long)pgdat,
> >  		 (unsigned long)pgdat->node_mem_map);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d3367..aa8dd5bccb7ac 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5650,6 +5650,7 @@ void __init setup_per_cpu_pageset(void)
> >  	for_each_online_pgdat(pgdat)
> >  		pgdat->per_cpu_nodestats =
> >  			alloc_percpu(struct per_cpu_nodestat);
> > +	store_early_perpage_metadata();
> >  }
> >
> >  __meminit void zone_pcp_init(struct zone *zone)
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 4548fcc66d74d..c1e324a1427e0 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
> >  		return -ENOMEM;
> >  	NODE_DATA(nid)->node_page_ext = base;
> >  	total_usage += table_size;
> > +	mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT,
> > +			    DIV_ROUND_UP(table_size, PAGE_SIZE));
> >  	return 0;
> >  }
> >
> > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
> >  	void *addr = NULL;
> >
> >  	addr = alloc_pages_exact_nid(nid, size, flags);
> > -	if (addr) {
> > +	if (addr)
> >  		kmemleak_alloc(addr, size, 1, flags);
> > -		return addr;
> > -	}
> > +	else
> > +		addr = vzalloc_node(size, nid);
> >
> > -	addr = vzalloc_node(size, nid);
> > +	if (addr) {
> > +		mod_node_page_state(NODE_DATA(nid), NR_MEMMAP,
> > +				    DIV_ROUND_UP(size, PAGE_SIZE));
> > +	}
> >
> >  	return addr;
> >  }
> > @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
> >
> >  static void free_page_ext(void *addr)
> >  {
> > +	size_t table_size;
> > +	struct page *page;
> > +	struct pglist_data *pgdat;
> > +
> > +	table_size = page_ext_size * PAGES_PER_SECTION;
> > +
> >  	if (is_vmalloc_addr(addr)) {
> > +		page = vmalloc_to_page(addr);
> > +		pgdat = page_pgdat(page);
> >  		vfree(addr);
> >  	} else {
> > -		struct page *page = virt_to_page(addr);
> > -		size_t table_size;
> > -
> > -		table_size = page_ext_size * PAGES_PER_SECTION;
> > -
> > +		page = virt_to_page(addr);
> > +		pgdat = page_pgdat(page);
> >  		BUG_ON(PageReserved(page));
> >  		kmemleak_free(addr);
> >  		free_pages_exact(addr, table_size);
> >  	}
> > +
> > +	mod_node_page_state(pgdat, NR_MEMMAP,
> > +			    -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> > +
> >  }
> >
> >  static void __free_page_ext(unsigned long pfn)
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index a2cbe44c48e10..1dda6c53370b0 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -469,5 +469,13 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
> >  	if (r < 0)
> >  		return NULL;
> >
> > +	if (system_state == SYSTEM_BOOTING) {
> > +		mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(end - start,
> > +								   PAGE_SIZE));
> > +	} else {
> > +		mod_node_page_state(NODE_DATA(nid), NR_MEMMAP,
> > +				    DIV_ROUND_UP(end - start, PAGE_SIZE));
> > +	}
> > +
> >  	return pfn_to_page(pfn);
> >  }
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aed0951b87fa0..684a91773bd76 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -14,7 +14,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/bootmem_info.h>
> > -
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include <asm/dma.h>
> >
> > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
> >  	 */
> >  	sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
> >  	sparsemap_buf_end = sparsemap_buf + size;
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > +	mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> > +#endif
> >  }
> >
> >  static void __init sparse_buffer_fini(void)
> > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> >  	unsigned long start = (unsigned long) pfn_to_page(pfn);
> >  	unsigned long end = start + nr_pages * sizeof(struct page);
> >
> > +	mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_MEMMAP,
> > +			    -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
> >  	vmemmap_free(start, end, altmap);
> >  }
> >  static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index db79935e4a543..79466450040e6 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1252,7 +1252,8 @@ const char * const vmstat_text[] = {
> >  	"pgdemote_kswapd",
> >  	"pgdemote_direct",
> >  	"pgdemote_khugepaged",
> > -
> > +	"nr_memmap",
> > +	"nr_memmap_boot",
> >  	/* enum writeback_stat_item counters */
> >  	"nr_dirty_threshold",
> >  	"nr_dirty_background_threshold",
> > @@ -2279,4 +2280,27 @@ static int __init extfrag_debug_init(void)
> >  }
> >
> >  module_init(extfrag_debug_init);
> > +
> >  #endif
> > +
> > +/*
> > + * Page metadata size (struct page and page_ext) in pages
> > + */
> > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __meminitdata;
> > +
> > +void __meminit mod_node_early_perpage_metadata(int nid, long delta)
> > +{
> > +	early_perpage_metadata[nid] += delta;
> > +}
> > +
> > +void __meminit store_early_perpage_metadata(void)
> > +{
> > +	int nid;
> > +	struct pglist_data *pgdat;
> > +
> > +	for_each_online_pgdat(pgdat) {
> > +		nid = pgdat->node_id;
> > +		mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT,
> > +				    early_perpage_metadata[nid]);
> > +	}
> > +}
> > --
> > 2.44.0.769.g3c40516874-goog
> >