* [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
@ 2023-01-09 7:22 Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page Yin Fengwei
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-01-09 7:22 UTC (permalink / raw)
To: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
Cc: fengwei.yin
In a nutshell: 4k is too small and 2M is too big. We started
asking ourselves whether there was something in the middle that
we could do. This series shows what that middle ground might
look like. It provides some of the benefits of THP while
eliminating some of the downsides.
This series uses "multiple consecutive pages" (mcpages) of between 8K
and 2M, composed of base pages, for anonymous user space mappings.
This leads to less internal fragmentation than 2M mappings, and thus
less memory consumption and less CPU time wasted zeroing memory which
will never be used.
In the implementation, we allocate a high-order page of the mcpage
order (e.g., order 2 for a 16KB mcpage). This ensures physically
contiguous memory is used, which benefits sequential memory access
latency.
The high-order page is then split, so each sub-page of an mcpage is
just a normal 4K page and the current kernel page management applies
to "mc" pages without any changes. Page faults can be batched with
mcpage, reducing the number of page faults.
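As a rough illustration, the core of the fault path is sketched below
(much simplified from patch 2/4; error handling omitted, and
map_one_subpage() is only a placeholder for the real per-PTE setup):

	/*
	 * Simplified sketch only: map_one_subpage() stands in for the
	 * per-PTE setup done by do_anonymous_mcpage() in patch 2/4.
	 */
	static vm_fault_t mcpage_fault_sketch(struct vm_fault *vmf,
					      unsigned int order)
	{
		unsigned long haddr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		struct page *page;
		int i;

		/* One physically contiguous, zeroed, high-order allocation ... */
		page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, order);
		if (!page)
			return VM_FAULT_FALLBACK;

		/* ... split so every sub-page is an ordinary 4K page ... */
		split_page(page, order);

		/* ... then populate all PTEs covering the mcpage in one fault. */
		for (i = 0; i < (1 << order); i++, haddr += PAGE_SIZE)
			map_one_subpage(vmf, &page[i], haddr);

		return 0;
	}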
There are costs with mcpage. Besides lacking the TLB benefit that THP
brings, it increases memory consumption and page allocation latency
compared to a 4K base page.
This series is the first step for mcpage. Future work could enable
mcpage for more components such as the page cache, swapping, etc.
Eventually, most pages in the system could be allocated/freed/reclaimed
at mcpage order.
The series is constructed as follows:
Patch 1 adds the mcpage size related definitions and Kconfig entry
Patch 2 is the main change. It hooks into the anonymous page fault
handling and applies mcpage to anonymous mappings
Patch 3 adds some mcpage statistics (vmstat counters)
Patch 4 is specific to x86_64 and aligns the mmap start address to
the mcpage size
The overall code change is quite straightforward. What I would most
like to hear is whether this is the right direction to pursue further.
This series does not leverage compound pages. This means that
normal kernel code that encounters an 'mcpage' region does not
need to do anything special. It also does not leverage folios,
although trying to leverage folios is something that we would
like to explore. We would welcome input on how that might
happen.
Some performance data were collected with a 16K mcpage size and are
shown in patches 2/4 and 4/4. If you have another workload and would
like to know the impact, just let me know; I can set up the
environment and run the test.
Yin Fengwei (4):
mcpage: add size/mask/shift definition for multiple consecutive page
mcpage: anon page: Use mcpage for anonymous mapping
mcpage: add vmstat counters for mcpages
mcpage: get_unmapped_area return mcpage size aligned addr
arch/x86/kernel/sys_x86_64.c | 8 ++
include/linux/gfp.h | 5 ++
include/linux/mcpage_mm.h | 35 +++++++++
include/linux/mm_types.h | 11 +++
include/linux/vm_event_item.h | 10 +++
mm/Kconfig | 19 +++++
mm/Makefile | 1 +
mm/mcpage_memory.c | 140 ++++++++++++++++++++++++++++++++++
mm/memory.c | 12 +++
mm/mempolicy.c | 51 +++++++++++++
mm/vmstat.c | 7 ++
11 files changed, 299 insertions(+)
create mode 100644 include/linux/mcpage_mm.h
create mode 100644 mm/mcpage_memory.c
base-commit: b7bfaa761d760e72a969d116517eaa12e404c262
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
@ 2023-01-09 7:22 ` Yin Fengwei
2023-01-09 13:24 ` Matthew Wilcox
2023-01-09 7:22 ` [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping Yin Fengwei
` (4 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Yin Fengwei @ 2023-01-09 7:22 UTC (permalink / raw)
To: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
Cc: fengwei.yin
Huge pages in the current kernel can bring an obvious performance
improvement for some workloads through fewer TLB misses and fewer page
faults. But the limited choice of huge page sizes (2M/1G for x86_64) also
brings extra cost, such as larger memory consumption and more CPU cycles
spent zeroing pages.
The idea of the multiple consecutive page (abbreviated "mcpage") is to use
a collection of physically contiguous 4K pages, rather than a huge page,
for anonymous mappings. The goal is to have more choices when trading off
the pros and cons of huge pages. Compared to a huge page, an mcpage does
not gain as much from fewer TLB misses and page faults, but it also does
not pay as much extra cost in memory consumption and in the latency
introduced by page compaction, page zeroing, etc.
The size of an mcpage can be configured. The default value of 16K was
picked arbitrarily. Users should choose the value based on the result of
tuning their workload with different mcpage sizes.
To have physically contiguous pages, a high-order page is allocated (the
order is calculated from the mcpage size). The high-order page is then
split. By doing this, each sub-page of an mcpage is just a normal 4K page,
and the current kernel page management infrastructure applies to "mc"
pages without any change.
To reduce the number of page faults, multiple page table entries are
populated in one page fault with the PFNs of the mcpage's sub-pages. This
also brings a small extra cost in memory consumption.
Update Kconfig to allow the user to define the mcpage order. Define macros
for the mcpage mask/shift/nr/size.
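For example, assuming PAGE_SHIFT is 12 and the default
CONFIG_MCPAGE_ORDER of 2, the definitions below work out as follows:

	/*
	 * Worked example, assuming PAGE_SHIFT == 12, CONFIG_MCPAGE_ORDER == 2:
	 *
	 *   MCPAGE_SIZE  = 1 << (2 + 12) = 16384 (16K)
	 *   MCPAGE_SHIFT = 2 + 12        = 14
	 *   MCPAGE_NR    = 1 << 2        = 4 base pages
	 *   MCPAGE_MASK  = ~(16384 - 1), i.e. the low 14 bits cleared
	 */
	static inline unsigned long mcpage_start(unsigned long addr)
	{
		/* e.g. 0x7f0000005010 & MCPAGE_MASK == 0x7f0000004000 */
		return addr & MCPAGE_MASK;
	}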
In this RFC patch, only Kconfig is used to set the mcpage order, to show
the idea. A runtime parameter will be used if this becomes an official
patch in the future.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/mm_types.h | 11 +++++++++++
mm/Kconfig | 19 +++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..fa561c7b6290 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,17 @@ struct mem_cgroup;
#define _struct_page_alignment __aligned(sizeof(unsigned long))
#endif
+#ifdef CONFIG_MCPAGE_ORDER
+#define MCPAGE_ORDER CONFIG_MCPAGE_ORDER
+#else
+#define MCPAGE_ORDER 0
+#endif
+
+#define MCPAGE_SIZE (1 << (MCPAGE_ORDER + PAGE_SHIFT))
+#define MCPAGE_MASK (~(MCPAGE_SIZE - 1))
+#define MCPAGE_SHIFT (MCPAGE_ORDER + PAGE_SHIFT)
+#define MCPAGE_NR (1 << (MCPAGE_ORDER))
+
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..c202dc99ab6d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -650,6 +650,25 @@ config HUGETLB_PAGE_SIZE_VARIABLE
Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
clamped down to MAX_ORDER - 1.
+config MCPAGE
+ bool "multiple consecutive page <mcpage>"
+ default n
+ help
+	  Enable multiple consecutive pages: an mcpage is a collection of
+	  physically contiguous sub-pages. When mapped to user space, all
+	  the sub-pages are mapped in one page fault handler invocation.
+	  The aim is to trade off the pros and cons of huge pages: less
+	  unnecessary memory zeroing and less memory consumption, but
+	  without the TLB benefit.
+
+config MCPAGE_ORDER
+ int "multiple consecutive page order"
+ default 2
+ depends on X86_64 && MCPAGE
+ help
+ The order of mcpage. Should be chosen carefully by tuning your
+ workload.
+
config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page Yin Fengwei
@ 2023-01-09 7:22 ` Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages Yin Fengwei
` (3 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-01-09 7:22 UTC (permalink / raw)
To: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
Cc: fengwei.yin
If an mcpage fits within the VMA range, try to allocate an mcpage and
set it up for the anonymous mapping.
Try to populate all the surrounding page table entries as well. The
benefit is that the number of page faults is reduced.
Split the mcpage so that each sub-page can be managed as a normal 4K page.
Splitting before setting up the page table entries avoids complicated page
lock, mapcount and refcount handling.
The change is expected to directly impact memory consumption, the number
of page faults, and zone lock and lru lock contention. The memory
consumption and system performance impact are evaluated as follows.
Some system performance data were collected with 16K mcpage size:
===============================================================================
                                               v6.1-rc4-no-thp  v6.1-rc4-thp  mcpage
will-it-scale/malloc1     (higher is better)        100%             2%         17%
will-it-scale/page_fault1 (higher is better)        100%           238%        115%
redis.set_avg_throughput  (higher is better)        100%            99%        102%
redis.get_avg_throughput  (higher is better)        100%            99%        100%
kernel build              (lower is better)         100%            98%         97%

* v6.1-rc4-no-thp: 6.1-rc4 with THP disabled in Kconfig
* v6.1-rc4-thp:    6.1-rc4 with THP set to "always" in Kconfig
* mcpage:          6.1-rc4 + 16KB mcpage

The test results are normalized to the "v6.1-rc4-no-thp" config.
Perf data comparing v6.1-rc4-no-thp and mcpage were collected.
For the kernel build, perf showed a 56% minor page fault drop and a 1.3%
clear_page increase:
v6.1-rc4-no-thp mcpage
5.939e+08 -56.0% 2.61e+08 kbuild.time.minor_page_faults
0.00 +2.2 2.20 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
0.72 -0.7 0.00 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page
For redis, perf showed a 74.6% minor page fault drop and a 0.11% zone lock drop.
v6.1-rc4-no-thp mcpage
401414 -74.6% 102134 redis.time.minor_page_faults
0.00 +0.1 0.11 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
0.22 -0.2 0.00 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.vma_alloc_folio
For will-it-scale/page_fault1, perf showed a 12.8% minor page fault drop,
a 15.97% zone lock drop and a 27% lru lock increase.
v6.1-rc4-no-thp mcpage
7239 -12.8% 6312 will-it-scale.time.minor_page_faults
52.15 -34.4 17.75 perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
3.29 +27.0 30.29 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
4.14 -4.1 0.00 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page
0.00 +13.2 13.20 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
0.00 +18.4 18.43 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages
For will-it-scale/malloc1, the test result is surprising; the regression
is much bigger than expected. Perf showed a 12.3% minor page fault drop
and a 43.6% zone lock increase:
v6.1-rc4-no-thp mcpage
2978027 -82.2% 530847 will-it-scale.128.processes
7249 -12.3% 6360 will-it-scale.time.minor_page_faults
0.00 +43.6 43.62 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.pte_alloc_one.__pte_alloc
0.00 +45.4 45.39 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush
It turned out that the mcpage allocation/free pattern hits a corner case
(high zone lock contention is triggered, impacting pte_alloc) which the
current pcp list bulk free can't handle very well.
The pcp list bulk free issue will be addressed separately. After fixing
that corner case, the will-it-scale/malloc1 result is restored to 56% of
v6.1-rc4-no-thp.
===============================================================================
For the tail latency of page allocation, the following test setup is used:
- alloc_pages() with order 0, 2 and 9 is called 2097152, 2097152 and 32768
  times respectively, in the kernel
- with non-fragmented memory and with fully fragmented memory
- with and without the __GFP_ZERO flag, to separate pure allocation/compaction
  latency from the user-visible latency
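The exact harness is not part of this series; a hypothetical sketch of the
in-kernel measurement loop (timing only the allocation itself) could look
like this:

	#include <linux/gfp.h>
	#include <linux/ktime.h>
	#include <linux/sched.h>

	/* Record per-call allocation latency for later percentile analysis. */
	static void measure_alloc_latency(unsigned int order, unsigned long count,
					  gfp_t gfp, u64 *lat_ns)
	{
		unsigned long i;

		for (i = 0; i < count; i++) {
			ktime_t t0 = ktime_get();
			struct page *page = alloc_pages(gfp, order);

			lat_ns[i] = ktime_to_ns(ktime_sub(ktime_get(), t0));
			if (page)
				__free_pages(page, order);
			cond_resched();
		}
	}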
The results are as follows:
no page zeroing:

  4K page:
    non-fragmented:                        fragmented:
    Number of tests:  2097152              Number of tests:  2097152
    max latency:      26us                 max latency:      27us
    90% tail latency: 1us (1887436th)      90% tail latency: 1us (1887436th)
    95% tail latency: 1us (1992294th)      95% tail latency: 1us (1992294th)
    99% tail latency: 2us (2076180th)      99% tail latency: 3us (2076180th)

  16K mcpage:
    non-fragmented:                        fragmented:
    Number of tests:  2097152              Number of tests:  2097152
    max latency:      26us                 max latency:      9862us
    90% tail latency: 1us (1887436th)      90% tail latency: 1us (1887436th)
    95% tail latency: 1us (1992294th)      95% tail latency: 1us (1992294th)
    99% tail latency: 1us (2076180th)      99% tail latency: 3us (2076180th)

  2M THP:
    non-fragmented:                        fragmented:
    Number of tests:  32768                Number of tests:  32768
    max latency:      40us                 max latency:      12149us
    90% tail latency: 8us (29491th)        90% tail latency: 864us (29491th)
    95% tail latency: 10us (31129th)       95% tail latency: 943us (31129th)
    99% tail latency: 13us (32440th)       99% tail latency: 1067us (32440th)

page zeroing:

  4K page:
    non-fragmented:                        fragmented:
    Number of tests:  2097152              Number of tests:  2097152
    max latency:      18us                 max latency:      46us
    90% tail latency: 1us (1887436th)      90% tail latency: 1us (1887436th)
    95% tail latency: 1us (1992294th)      95% tail latency: 1us (1992294th)
    99% tail latency: 2us (2076180th)      99% tail latency: 4us (2076180th)

  16K mcpage:
    non-fragmented:                        fragmented:
    Number of tests:  2097152              Number of tests:  2097152
    max latency:      31us                 max latency:      5740us
    90% tail latency: 3us (1887436th)      90% tail latency: 3us (1887436th)
    95% tail latency: 3us (1992294th)      95% tail latency: 4us (1992294th)
    99% tail latency: 4us (2076180th)      99% tail latency: 5us (2076180th)

  2M THP:
    non-fragmented:                        fragmented:
    Number of tests:  32768                Number of tests:  32768
    max latency:      530us                max latency:      10494us
    90% tail latency: 366us (29491th)      90% tail latency: 1114us (29491th)
    95% tail latency: 373us (31129th)      95% tail latency: 1263us (31129th)
    99% tail latency: 391us (32440th)      99% tail latency: 1808us (32440th)
With a 16K mcpage, the tail latency of page allocation remains good, while
2M THP has a much worse result in the fragmented memory case.
===============================================================================
For the performance of NUMA interleaving with base pages, mcpage and THP,
the memory latency test from https://github.com/torvalds/test-tlb is used,
on a Cascade Lake box with 96 cores + 258G memory and two NUMA nodes:
node distances:
node 0 1
0: 10 20
1: 20 10
With the memory policy set to MPOL_INTERLEAVE and a 1G memory mapping
accessed with a 128 byte (2x cache line) stride, the memory access latency
(lower is better) is:
random access with 4K page:        142.32 ns
random access with 16K mcpage:     141.21 ns (+0.8%)
random access with 2M THP:         116.56 ns (+18.2%)
sequential access with 4K page:     21.28 ns
sequential access with 16K mcpage:  20.52 ns (+3.6%)
sequential access with 2M THP:      20.36 ns (+4.3%)
mcpage brings a minor memory access latency improvement compared to 4K
pages, but less than the improvement 2M THP brings.
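For reference, a minimal user-space sketch of the access pattern described
above might look like the following (the reported numbers come from
test-tlb itself; the mapping size, stride and node mask here simply restate
the setup, and the program needs to be linked with -lnuma):

	#include <numaif.h>
	#include <sys/mman.h>
	#include <string.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		const size_t len = 1UL << 30;       /* 1G mapping */
		const size_t stride = 128;          /* 2x cache line */
		unsigned long nodemask = 0x3;       /* nodes 0 and 1 */
		volatile uint64_t sum = 0;

		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Interleave the pages of this mapping across both NUMA nodes. */
		if (mbind(p, len, MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8, 0))
			perror("mbind");

		memset(p, 1, len);                  /* fault in the whole mapping */

		/* Sequential walk with a 128-byte stride; time this loop externally. */
		for (size_t off = 0; off < len; off += stride)
			sum += (unsigned char)p[off];

		return (int)sum;
	}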
===============================================================================
Memory consumption was checked by using Firefox to access the "www.lwn.net"
website and collecting the RSS of Firefox, with a 16K mcpage size:
6.1-rc7:              RSS of firefox is 285300 KB
6.1-rc7 + 16K mcpage: RSS of firefox is 295536 KB
That is 3.59% more memory consumption with 16K mcpage.
===============================================================================
In this RFC patch, non-batched updates of the page table entries are used
to show the idea. Batch mode will be used if this becomes an official
patch in the future.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/gfp.h | 5 ++
include/linux/mcpage_mm.h | 35 ++++++++++
mm/Makefile | 1 +
mm/mcpage_memory.c | 134 ++++++++++++++++++++++++++++++++++++++
mm/memory.c | 11 ++++
mm/mempolicy.c | 51 +++++++++++++++
6 files changed, 237 insertions(+)
create mode 100644 include/linux/mcpage_mm.h
create mode 100644 mm/mcpage_memory.c
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..035c5fadd9d4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -265,6 +265,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order);
struct folio *folio_alloc(gfp_t gfp, unsigned order);
struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+ unsigned long addr);
#else
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
@@ -276,7 +278,10 @@ static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
}
#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
folio_alloc(gfp, order)
+#define alloc_mcpages(gfp, order, vma, addr) \
+ alloc_pages(gfp, order)
#endif
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
diff --git a/include/linux/mcpage_mm.h b/include/linux/mcpage_mm.h
new file mode 100644
index 000000000000..4b2fb7319233
--- /dev/null
+++ b/include/linux/mcpage_mm.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MCPAGE_MM_H
+#define _LINUX_MCPAGE_MM_H
+
+#include <linux/mm_types.h>
+
+#ifdef CONFIG_MCPAGE_ORDER
+
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int order)
+{
+ unsigned int mcpage_size = 1 << (order + PAGE_SHIFT);
+ unsigned long haddr = ALIGN_DOWN(addr, mcpage_size);
+
+ return range_in_vma(vma, haddr, haddr + mcpage_size);
+}
+
+extern vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+ unsigned int order);
+
+#else
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int order)
+{
+ return false;
+}
+
+static inline vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+ unsigned int order)
+{
+ return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_MCPAGE_ORDER */
+
+#endif /* _LINUX_MCPAGE_MM_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..efeaa8358953 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_MCPAGE) += mcpage_memory.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
ifdef CONFIG_SWAP
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
new file mode 100644
index 000000000000..ea4be2e25bce
--- /dev/null
+++ b/mm/mcpage_memory.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+#include <linux/gfp.h>
+#include <linux/page_owner.h>
+#include <linux/pgtable.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/rmap.h>
+#include <linux/oom.h>
+#include <linux/vm_event_item.h>
+#include <linux/userfaultfd_k.h>
+
+#include "internal.h"
+
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct page *page = alloc_mcpages(GFP_HIGHUSER_MOVABLE, order,
+ vma, addr);
+
+ if (page) {
+ int i;
+ struct page *it = page;
+
+ for (i = 0; i < (1 << order); i++, it++) {
+ clear_user_highpage(it, addr);
+ cond_resched();
+ }
+ }
+
+ return page;
+}
+#else
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return alloc_mcpages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+ order, vma, addr);
+}
+#endif
+
+static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
+ struct page *page, unsigned long addr)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ vm_fault_t ret = 0;
+ pte_t entry;
+
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
+ ret = VM_FAULT_OOM;
+ goto oom;
+ }
+
+ cgroup_throttle_swaprate(page, GFP_KERNEL);
+ __SetPageUptodate(page);
+
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = pte_sw_mkyoung(entry);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+ if (!pte_none(*vmf->pte)) {
+ ret = VM_FAULT_FALLBACK;
+ update_mmu_cache(vma, addr, vmf->pte);
+ goto release;
+ }
+
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret) {
+ ret = VM_FAULT_FALLBACK;
+ goto release;
+ }
+
+ if (userfaultfd_missing(vma)) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return handle_userfault(vmf, VM_UFFD_MISSING);
+ }
+
+ inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, addr);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+ update_mmu_cache(vma, addr, vmf->pte);
+release:
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+oom:
+ return ret;
+}
+
+vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
+{
+ int i, nr = 1 << order;
+ unsigned int mcpage_size = nr * PAGE_SIZE;
+ vm_fault_t ret = 0, real_ret = 0;
+ bool handled = false;
+ struct page *page;
+ unsigned long haddr = ALIGN_DOWN(vmf->address, mcpage_size);
+
+ page = alloc_zeroed_mcpages(order, vmf->vma, haddr);
+ if (!page)
+ return VM_FAULT_FALLBACK;
+
+ split_page(page, order);
+ for (i = 0; i < nr; i++, haddr += PAGE_SIZE) {
+ ret = do_anonymous_mcpage(vmf, &page[i], haddr);
+ if (haddr == PAGE_ALIGN_DOWN(vmf->address)) {
+ real_ret = ret;
+ handled = true;
+ }
+ if (ret)
+ break;
+ }
+
+ while (i < nr)
+ put_page(&page[i++]);
+
+ /*
+ * If the fault address is not handled, fallback to handle
+ * fault address with normal page.
+ */
+ if (!handled)
+ return VM_FAULT_FALLBACK;
+ else
+ return real_ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..fb7f370f6c67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,6 +77,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/mcpage_mm.h>
#include <trace/events/kmem.h>
@@ -4071,6 +4072,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
+
+ if (allow_mcpage(vma, vmf->address, MCPAGE_ORDER)) {
+ ret = do_anonymous_mcpages(vmf, MCPAGE_ORDER);
+
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
+
+ ret = 0;
+ }
+
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..87ecbdb74fbe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2251,6 +2251,57 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
}
EXPORT_SYMBOL(vma_alloc_folio);
+/**
+ * alloc_mcpages - Allocate a mcpage for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the mcpage.
+ * @vma: Pointer to VMA or NULL if not available.
+ * @addr: Virtual address of the allocation. Must be inside @vma.
+ *
+ * Allocate a mcpage for a specific address in @vma, using the
+ * appropriate NUMA policy. When @vma is not NULL the caller must hold the
+ * mmap_lock of the mm_struct of the VMA to prevent it from going away.
+ * Should be used for all allocations for pages that will be mapped into
+ * user space.
+ *
+ * Return: The page on success or NULL if allocation fails.
+ */
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mempolicy *pol;
+ int node = numa_node_id();
+ struct page *page;
+ int preferred_nid;
+ nodemask_t *nmask;
+
+ pol = get_vma_policy(vma, addr);
+
+ if (pol->mode == MPOL_INTERLEAVE) {
+ unsigned int nid;
+
+ nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+ mpol_cond_put(pol);
+ page = alloc_page_interleave(gfp, order, nid);
+ goto out;
+ }
+
+ if (pol->mode == MPOL_PREFERRED_MANY) {
+ node = policy_node(gfp, pol, node);
+ page = alloc_pages_preferred_many(gfp, order, node, pol);
+ mpol_cond_put(pol);
+ goto out;
+ }
+
+ nmask = policy_nodemask(gfp, pol);
+ preferred_nid = policy_node(gfp, pol, node);
+ page = __alloc_pages(gfp, order, preferred_nid, nmask);
+ mpol_cond_put(pol);
+out:
+ return page;
+}
+EXPORT_SYMBOL(alloc_mcpages);
+
/**
* alloc_pages - Allocate pages.
* @gfp: GFP flags.
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping Yin Fengwei
@ 2023-01-09 7:22 ` Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr Yin Fengwei
` (2 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-01-09 7:22 UTC (permalink / raw)
To: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
Cc: fengwei.yin
MCPAGE_ANON_FAULT_ALLOC: how many times an mcpage was used for an
anonymous mapping.
MCPAGE_ANON_FAULT_FALLBACK: how many times we fell back to normal pages
for an anonymous mapping.
MCPAGE_ANON_FAULT_CHARGE_FAILED: how many times we fell back because of
a memcg charge failure.
MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED: how many times we fell back
because the page table entry was already populated.
MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE: how many times we fell back
because of an unstable address space.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/vm_event_item.h | 10 ++++++++++
mm/mcpage_memory.c | 6 ++++++
mm/memory.c | 1 +
mm/vmstat.c | 7 +++++++
4 files changed, 24 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..9c36bfc4c904 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -119,6 +119,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
#endif
+#ifdef CONFIG_MCPAGE
+ MCPAGE_ANON_FAULT_ALLOC,
+ MCPAGE_ANON_FAULT_FALLBACK,
+ MCPAGE_ANON_FAULT_CHARGE_FAILED,
+ MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED,
+ MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE,
+#endif
#ifdef CONFIG_MEMORY_BALLOON
BALLOON_INFLATE,
BALLOON_DEFLATE,
@@ -159,5 +166,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#define THP_FILE_FALLBACK_CHARGE ({ BUILD_BUG(); 0; })
#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
#endif
+#ifndef CONFIG_MCPAGE
+#define MCPAGE_ANON_FAULT_FALLBACK ({ BUILD_BUG(); 0; })
+#endif
#endif /* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
index ea4be2e25bce..e208cf818ebf 100644
--- a/mm/mcpage_memory.c
+++ b/mm/mcpage_memory.c
@@ -55,6 +55,7 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
+ count_vm_event(MCPAGE_ANON_FAULT_CHARGE_FAILED);
goto oom;
}
@@ -71,12 +72,14 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
if (!pte_none(*vmf->pte)) {
ret = VM_FAULT_FALLBACK;
update_mmu_cache(vma, addr, vmf->pte);
+ count_vm_event(MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED);
goto release;
}
ret = check_stable_address_space(vma->vm_mm);
if (ret) {
ret = VM_FAULT_FALLBACK;
+ count_vm_event(MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE);
goto release;
}
@@ -120,6 +123,9 @@ vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
break;
}
+ if (i == nr)
+ count_vm_event(MCPAGE_ANON_FAULT_ALLOC);
+
while (i < nr)
put_page(&page[i++]);
diff --git a/mm/memory.c b/mm/memory.c
index fb7f370f6c67..b3655be849ae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4079,6 +4079,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!(ret & VM_FAULT_FALLBACK))
return ret;
+ count_vm_event(MCPAGE_ANON_FAULT_FALLBACK);
ret = 0;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..c40e33dee1b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1367,6 +1367,13 @@ const char * const vmstat_text[] = {
"thp_swpout",
"thp_swpout_fallback",
#endif
+#ifdef CONFIG_MCPAGE
+ "mcpage_anon_fault_alloc",
+ "mcpage_anon_fault_fallback",
+ "mcpage_anon_fault_charge_failed",
+ "mcpage_anon_fault_page_table_populated",
+ "mcpage_anon_fault_instable_address_space",
+#endif
#ifdef CONFIG_MEMORY_BALLOON
"balloon_inflate",
"balloon_deflate",
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
` (2 preceding siblings ...)
2023-01-09 7:22 ` [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages Yin Fengwei
@ 2023-01-09 7:22 ` Yin Fengwei
2023-01-09 8:37 ` [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Kirill A. Shutemov
2023-01-09 17:33 ` David Hildenbrand
5 siblings, 0 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-01-09 7:22 UTC (permalink / raw)
To: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
Cc: fengwei.yin
For x86_64, let mmap start from an mcpage size aligned address.
The workload is Firefox with one tab accessing the front page of
"www.lwn.net". With the mcpage order set to 2, the number of times an
mcpage could not be used because it fell outside the VMA range was
collected:
                     run1  run2  run3    avg  stddev
With this patch:     1453  1434  1428   1438    13.0
Without this patch:  1536  1467  1493   1498    34.8
This shows that the chance of using mcpage for anonymous mappings
increases by about 4.2% with this patch.
To check for a general impact from the virtual address space becoming more
sparse, will-it-scale:malloc1, will-it-scale:page_fault1 and a kernel build
were run with and without the change on top of v6.1-rc7. The results show
no performance change introduced by this patch:
malloc1:
v6.1-rc7 v6.1-rc7 + this patch
---------------- ---------------------------
23338 -0.5% 23210 will-it-scale.per_process_ops
page_fault1:
v6.1-rc7 v6.1-rc7 + this patch
---------------- ---------------------------
96322 -0.1% 96222 will-it-scale.per_process_ops
kernel build:
v6.1-rc7 v6.1-rc7 + this patch
---------------- ---------------------------
28.45 +0.2% 28.52 kbuild.buildtime_per_iteration
One drawback of the change is that the number of effective ASLR bits is
reduced by mcpage_order bits.
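As an illustration only (not part of the patch): with a 16K mcpage,
~MCPAGE_MASK is 0x3fff, so raising align_mask to that value makes
vm_unmapped_area() hand out 16K-aligned start addresses, which is what
lets whole mcpages cover the VMA from its first byte:

	/* Illustration only: is an mmap start address mcpage aligned? */
	static inline bool mmap_addr_mcpage_aligned(unsigned long addr)
	{
		/* For MCPAGE_ORDER == 2 this tests the low 14 bits. */
		return (addr & ~MCPAGE_MASK) == 0;
	}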
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
arch/x86/kernel/sys_x86_64.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..9b5617973e81 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -154,6 +154,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.align_mask = get_align_mask();
info.align_offset += get_align_bits();
}
+
+ if (info.align_mask < ~MCPAGE_MASK)
+ info.align_mask = ~MCPAGE_MASK;
+
return vm_unmapped_area(&info);
}
@@ -212,6 +216,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.align_mask = get_align_mask();
info.align_offset += get_align_bits();
}
+
+ if (info.align_mask < ~MCPAGE_MASK)
+ info.align_mask = ~MCPAGE_MASK;
+
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
return addr;
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
` (3 preceding siblings ...)
2023-01-09 7:22 ` [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr Yin Fengwei
@ 2023-01-09 8:37 ` Kirill A. Shutemov
2023-01-11 6:13 ` Yin, Fengwei
2023-01-09 17:33 ` David Hildenbrand
5 siblings, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2023-01-09 8:37 UTC (permalink / raw)
To: Yin Fengwei
Cc: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
On Mon, Jan 09, 2023 at 03:22:28PM +0800, Yin Fengwei wrote:
> In a nutshell: 4k is too small and 2M is too big. We started
> asking ourselves whether there was something in the middle that
> we could do. This series shows what that middle ground might
> look like. It provides some of the benefits of THP while
> eliminating some of the downsides.
>
> This series uses "multiple consecutive pages" (mcpages) of
> between 8K and 2M of base pages for anonymous user space mappings.
> This will lead to less internal fragmentation versus 2M mappings
> and thus less memory consumption and wasted CPU time zeroing
> memory which will never be used.
>
> In the implementation, we allocate high order page with order of
> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
> physical contiguous memory is used and benefit sequential memory
> access latency.
>
> Then split the high order page. By doing this, the sub-page of
> mcpage is just 4K normal page. The current kernel page
> management is applied to "mc" pages without any changes. Batching
> page faults is allowed with mcpage and reduce page faults number.
>
> There are costs with mcpage. Besides no TLB benefit THP brings, it
> increases memory consumption and latency of allocation page
> comparing to 4K base page.
>
> This series is the first step of mcpage. The furture work can be
> enable mcpage for more components like page cache, swapping etc.
> Finally, most pages in system will be allocated/free/reclaimed
> with mcpage order.
It is not worth adding a new path in page fault handling. We need to make
existing mechanisms more flexible.
I think it has to be done on top of folios:
1. Convert anonymous memory to folios. Only order-9 (HPAGE_PMD_ORDER) and
order-0 at first.
2. Remove the assumption of THP being order-9.
3. Start allocating THPs smaller than order-9.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
2023-01-09 7:22 ` [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page Yin Fengwei
@ 2023-01-09 13:24 ` Matthew Wilcox
2023-01-09 16:30 ` Dave Hansen
2023-01-10 2:53 ` Yin, Fengwei
0 siblings, 2 replies; 17+ messages in thread
From: Matthew Wilcox @ 2023-01-09 13:24 UTC (permalink / raw)
To: Yin Fengwei
Cc: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, rppt, dave.hansen, ying.huang,
tim.c.chen
On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
> The idea of the multiple consecutive page (abbr as "mcpage") is using
> collection of physical contiguous 4K page other than huge page for
> anonymous mapping.
This is what folios are for. You have an interesting demonstration
here that shows that moving to larger folios for anonymous memory
is worth doing (thank you!) but you're missing several of the advantages
of folios by going off and doing your own thing.
> The size of mcpage can be configured. The default value of 16K size is
> just picked up arbitrarily. User should choose the value according to the
> result of tuning their workload with different mcpage size.
Uh, no. We don't do these kinds of config options any more (or boot-time
options as you mention later). The size of a folio allocated for a given
VMA should be adaptive based on observing how the program is using memory.
There will likely be many different sizes of folio present in a given VMA.
> To have physical contiguous pages, high order pages is allocated (order
> is calculated according to mcpage size). Then the high order page will
> be split. By doing this, each sub page of mcpage is just normal 4K page.
> The current kernel page management infrastructure is applied to "mc"
> pages without any change.
This is somewhere that you're losing an advantage of folios. By keeping
all the pages together, they get managed as a single unit. That shrinks
the length of the LRU list and reduces lock contention. It also reduces
the number of cache lines which are modified as, eg, we only need to
keep track of one dirty bit for many pages.
> To reduce the page fault number, multiple page table entries are populated
> in one page fault with sub pages pfn of mcpage. This also brings a little
> bit cost of memory consumption.
That needs to be done for folios. It's a long way down my todo list,
so if you wanted to take it on, it would be very much appreciated!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
2023-01-09 13:24 ` Matthew Wilcox
@ 2023-01-09 16:30 ` Dave Hansen
2023-01-09 17:01 ` Matthew Wilcox
2023-01-10 2:53 ` Yin, Fengwei
1 sibling, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2023-01-09 16:30 UTC (permalink / raw)
To: Matthew Wilcox, Yin Fengwei
Cc: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, rppt, ying.huang, tim.c.chen
On 1/9/23 05:24, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
>> The idea of the multiple consecutive page (abbr as "mcpage") is using
>> collection of physical contiguous 4K page other than huge page for
>> anonymous mapping.
> This is what folios are for. You have an interesting demonstration
> here that shows that moving to larger folios for anonymous memory
> is worth doing (thank you!) but you're missing several of the advantages
> of folios by going off and doing your own thing.
It might not have come across in the changelog and cover letter, but
Fengwei and the rest of us *totally* agree with you on this. "Doing
your own thing" just isn't going to cut it and if this is going to go
anywhere, it needs to use folios.
This series is _pure_ RFC and the feedback we're interested in is
whether this demonstration warrants going back and doing it the right
way (with folios).
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
2023-01-09 16:30 ` Dave Hansen
@ 2023-01-09 17:01 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2023-01-09 17:01 UTC (permalink / raw)
To: Dave Hansen
Cc: Yin Fengwei, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, rppt, ying.huang,
tim.c.chen
On Mon, Jan 09, 2023 at 08:30:43AM -0800, Dave Hansen wrote:
> On 1/9/23 05:24, Matthew Wilcox wrote:
> > On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
> >> The idea of the multiple consecutive page (abbr as "mcpage") is using
> >> collection of physical contiguous 4K page other than huge page for
> >> anonymous mapping.
> > This is what folios are for. You have an interesting demonstration
> > here that shows that moving to larger folios for anonymous memory
> > is worth doing (thank you!) but you're missing several of the advantages
> > of folios by going off and doing your own thing.
>
> It might not have come across in the changelog and cover letter, but
> Fengwei and the rest of us *totally* agree with you on this. "Doing
> your own thing" just isn't going to cut it and if this is going to go
> anywhere, it needs to use folios.
>
> This series is _pure_ RFC and the comments we're interested in is
> whether this demonstration warrants going back and doing it the right
> way (with folios).
Ah, yes, that didn't come across. In fact, the opposite came across
with this paragraph:
: This series is the first step of mcpage. The furture work can be
: enable mcpage for more components like page cache, swapping etc.
: Finally, most pages in system will be allocated/free/reclaimed
: with mcpage order.
Since the page cache has been using multipage folios in mainline since
March (and in various trees of mine since Feb 2020!), that indicated
to me either a lack of knowledge of folios, or a rejection of the folio
approach. Happy to hear that's not true!
Most of the results here validate my experience and/or assumptions,
which is good. I'm more than happy for someone else to take on the
hard work of folio-ising the anon VMAs. I see the problems to be
solved as:
- Determining (on a page fault) what the correct allocation size
is for this process at this time. We have the readahead code to
leverage for files, but I don't think we have anything similar for
anon memory
- Inserting multiple PTEs when a multi-page folio is found. This also
needs to be done for file pages, and maybe that's a good place
to start.
- Finding all the places in the anon memory code that assume that
PageCompound() / PageTransHuge() is the same thing as
folio_test_pmd_mappable().
There are probably other things that are going to come up, but I think
starting is the important part. Not everything needs to be done
immediately (#3 before #1, I would think ;-).
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
` (4 preceding siblings ...)
2023-01-09 8:37 ` [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Kirill A. Shutemov
@ 2023-01-09 17:33 ` David Hildenbrand
2023-01-09 19:11 ` Matthew Wilcox
2023-01-10 3:57 ` Yin, Fengwei
5 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-01-09 17:33 UTC (permalink / raw)
To: Yin Fengwei, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, willy, rppt, dave.hansen,
ying.huang, tim.c.chen
On 09.01.23 08:22, Yin Fengwei wrote:
> In a nutshell: 4k is too small and 2M is too big. We started
> asking ourselves whether there was something in the middle that
> we could do. This series shows what that middle ground might
> look like. It provides some of the benefits of THP while
> eliminating some of the downsides.
>
> This series uses "multiple consecutive pages" (mcpages) of
> between 8K and 2M of base pages for anonymous user space mappings.
> This will lead to less internal fragmentation versus 2M mappings
> and thus less memory consumption and wasted CPU time zeroing
> memory which will never be used.
Hi,
what I understand is that this is some form of faultaround for anonymous
memory, with the special-case that we try to allocate the pages
consecutively.
Some thoughts:
(1) Faultaround might be unexpected for some workloads and increase
memory consumption unnecessarily.
Yes, something like that can happen with THP BUT
(a) THP can be disabled or is frequently only enabled for madvised
regions -- for example, exactly for this reason.
(b) Some workloads (especially memory ballooning) rely on memory not
suddenly re-appearing after MADV_DONTNEED. This works even with THP,
because the 4k MADV_DONTNEED will first PTE-map the THP. Because
there is a PTE page table, we won't suddenly get a THP populated
again (unless khugepaged is configured to fill holes).
I strongly assume we will need something similar to force-disable,
selectively-enable etc.
(2) This steals consecutive pages to immediately split them up
I know, everybody thinks it might be valuable for their use case to grab
all higher-order pages :) It will be "fun" once all these cases start
competing. TBH, splitting them up immediately again smells like being
the lowest priority among all higher-order users.
(3) All effort will be lost once page compaction gets active, compacts,
and simply migrates to random 4k pages. This is most probably the
biggest "issue" of the whole approach AFAIKS: it's only temporary
because there is no notion of these pages belonging together
anymore.
>
> In the implementation, we allocate high order page with order of
> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
> physical contiguous memory is used and benefit sequential memory
> access latency.
>
> Then split the high order page. By doing this, the sub-page of
> mcpage is just 4K normal page. The current kernel page
> management is applied to "mc" pages without any changes. Batching
> page faults is allowed with mcpage and reduce page faults number.
>
> There are costs with mcpage. Besides no TLB benefit THP brings, it
> increases memory consumption and latency of allocation page
> comparing to 4K base page.
>
> This series is the first step of mcpage. The furture work can be
> enable mcpage for more components like page cache, swapping etc.
> Finally, most pages in system will be allocated/free/reclaimed
> with mcpage order.
I think avoiding new, hard-to-grasp terminology ("mcpage") might be
better. I know, everybody wants to give their child a name, but the name
is not really future proof: "multiple consecutive pages" might at one
point just be a folio.
I'd summarize the ideas as "faultaround" whereby we try optimizing for
locality.
Note that a similar (but different) concept already exists (hidden) for
hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence
of PTEs that logically map a hugetlb page.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 17:33 ` David Hildenbrand
@ 2023-01-09 19:11 ` Matthew Wilcox
2023-01-10 14:13 ` David Hildenbrand
2023-01-10 3:57 ` Yin, Fengwei
1 sibling, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-01-09 19:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin Fengwei, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, rppt, dave.hansen,
ying.huang, tim.c.chen
On Mon, Jan 09, 2023 at 06:33:09PM +0100, David Hildenbrand wrote:
> (2) This steals consecutive pages to immediately split them up
>
> I know, everybody thinks it might be valuable for their use case to grab all
> higher-order pages :) It will be "fun" once all these cases start competing.
> TBH, splitting up them immediately again smells like being the lowest
> priority among all higher-order users.
Actually, it is good for everybody to allocate higher-order pages, if they
can make use of them. It has the end effect of reducing fragmentation
(imagine if the base unit of allocation were 512 bytes; every page fault
would have to do an order-3 allocation, and it wouldn't be long until
order-0 allocations had fragmented memory such that we could no longer
service a page fault).
Splitting them again is clearly one of the bad things done in this
proof-of-concept. Anything that goes upstream won't do that, but I
suspect it was necessary to avoid fixing all the places in the kernel
that assume anon memory is either order-0 or -9.
> (3) All effort will be lost once page compaction gets active, compacts,
> and simply migrates to random 4k pages. This is most probably the
> biggest "issue" of the whole approach AFAIKS: it's only temporary
> because there is no notion of these pages belonging together
> anymore.
Sure, page compaction / migration is going to have to learn how to handle
order 1-8 folios. Again, not needed for the PoC.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
2023-01-09 13:24 ` Matthew Wilcox
2023-01-09 16:30 ` Dave Hansen
@ 2023-01-10 2:53 ` Yin, Fengwei
1 sibling, 0 replies; 17+ messages in thread
From: Yin, Fengwei @ 2023-01-10 2:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, rppt, dave.hansen, ying.huang,
tim.c.chen
On 1/9/2023 9:24 PM, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 03:22:29PM +0800, Yin Fengwei wrote:
>> The idea of the multiple consecutive page (abbr as "mcpage") is using
>> collection of physical contiguous 4K page other than huge page for
>> anonymous mapping.
>
> This is what folios are for. You have an interesting demonstration
> here that shows that moving to larger folios for anonymous memory
> is worth doing (thank you!) but you're missing several of the advantages
> of folios by going off and doing your own thing.
Yes. Folio and mcpage share some advantages.
>
>> The size of mcpage can be configured. The default value of 16K size is
>> just picked up arbitrarily. User should choose the value according to the
>> result of tuning their workload with different mcpage size.
>
> Uh, no. We don't do these kinds of config options any more (or boot-time
> options as you mention later). The size of a folio allocated for a given
> VMA should be adaptive based on observing how the program is using memory.
> There will likely be many different sizes of folio present in a given VMA.
I had two thoughts about adaptive folio size:
1. Allocating a folio of a large size could have high tail latency, which
is not appreciated by some workloads. It may be good to allow the user to
define the size?
2. Different folio sizes across the system may fragment the whole of memory?
>
>> To have physical contiguous pages, high order pages is allocated (order
>> is calculated according to mcpage size). Then the high order page will
>> be split. By doing this, each sub page of mcpage is just normal 4K page.
>> The current kernel page management infrastructure is applied to "mc"
>> pages without any change.
>
> This is somewhere that you're losing an advantage of folios. By keeping
> all the pages together, they get managed as a single unit. That shrinks
> the length of the LRU list and reduces lock contention. It also reduces
> the number of cache lines which are modified as, eg, we only need to
> keep track of one dirty bit for many pages.
Yes, the lru list/lock benefit is provided by folios.
For the dirty bit, one dirty bit for many pages means that a single dirty
sub-page of a folio requires all sub-pages to be written out, which puts
pressure on storage. But yes, the other bits can benefit from fewer cache
line modifications.
Regards
Yin, Fengwei
>
>> To reduce the page fault number, multiple page table entries are populated
>> in one page fault with sub pages pfn of mcpage. This also brings a little
>> bit cost of memory consumption.
>
> That needs to be done for folios. It's a long way down my todo list,
> so if you wanted to take it on, it would be very much appreciated!
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 17:33 ` David Hildenbrand
2023-01-09 19:11 ` Matthew Wilcox
@ 2023-01-10 3:57 ` Yin, Fengwei
2023-01-10 14:40 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: Yin, Fengwei @ 2023-01-10 3:57 UTC (permalink / raw)
To: David Hildenbrand, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, willy, rppt, dave.hansen,
ying.huang, tim.c.chen
On 1/10/2023 1:33 AM, David Hildenbrand wrote:
> On 09.01.23 08:22, Yin Fengwei wrote:
>> In a nutshell: 4k is too small and 2M is too big. We started
>> asking ourselves whether there was something in the middle that
>> we could do. This series shows what that middle ground might
>> look like. It provides some of the benefits of THP while
>> eliminating some of the downsides.
>>
>> This series uses "multiple consecutive pages" (mcpages) of
>> between 8K and 2M of base pages for anonymous user space mappings.
>> This will lead to less internal fragmentation versus 2M mappings
>> and thus less memory consumption and wasted CPU time zeroing
>> memory which will never be used.
>
> Hi,
>
> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
For this patchset, yes. But mcpage can be enabled for page cache,
swapping etc.
>
> Some thoughts:
>
> (1) Faultaround might be unexpected for some workloads and increase
> memory consumption unnecessarily.
Compared to THP, the memory consumption and latency introduced by
mcpage are minor.
>
> Yes, something like that can happen with THP BUT
>
> (a) THP can be disabled or is frequently only enabled for madvised
> regions -- for example, exactly for this reason.
> (b) Some workloads (especially memory ballooning) rely on memory not
> suddenly re-appearing after MADV_DONTNEED. This works even with THP,
> because the 4k MADV_DONTNEED will first PTE-map the THP. Because
> there is a PTE page table, we won't suddenly get a THP populated
> again (unless khugepaged is configured to fill holes).
>
>
> I strongly assume we will need something similar to force-disable, selectively-enable etc.
Agree.
>
>
> (2) This steals consecutive pages to immediately split them up
>
> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting up them immediately again smells like being the lowest priority among all higher-order users.
>
The motivations for splitting it immediately are:
1. All the sub-pages are just normal 4K pages. No other changes need to be
added to handle them.
2. Splitting it before use avoids complicated page lock handling.
>
> (3) All effort will be lost once page compaction gets active, compacts,
> and simply migrates to random 4k pages. This is most probably the
> biggest "issue" of the whole approach AFAIKS: it's only temporary
> because there is no notion of these pages belonging together
> anymore.
Yes. But I suppose page compaction could be updated to handle mcpage,
e.g., always handling all sub-pages together. We did experiment with
this for reclaim.
>
>>
>> In the implementation, we allocate high order page with order of
>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>> physical contiguous memory is used and benefit sequential memory
>> access latency.
>>
>> Then split the high order page. By doing this, the sub-page of
>> mcpage is just 4K normal page. The current kernel page
>> management is applied to "mc" pages without any changes. Batching
>> page faults is allowed with mcpage and reduce page faults number.
>>
>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>> increases memory consumption and latency of allocation page
>> comparing to 4K base page.
>>
>> This series is the first step of mcpage. The furture work can be
>> enable mcpage for more components like page cache, swapping etc.
>> Finally, most pages in system will be allocated/free/reclaimed
>> with mcpage order.
>
> I think avoiding new, herd-to-get terminology ("mcpage") might be better. I know, everybody wants to give its child a name, but the name us not really future proof: "multiple consecutive pages" might at one point be maybe just a folio.
>
> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>
> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
"cont-pte" on ARM64 has fixed size which match the silicon definition.
mcpage allows software/user to define the size which is not necessary
to be exact same as silicon defined. Thanks.
Regards
Yin, Fengwei
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 19:11 ` Matthew Wilcox
@ 2023-01-10 14:13 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-01-10 14:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yin Fengwei, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, rppt, dave.hansen,
ying.huang, tim.c.chen
On 09.01.23 20:11, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 06:33:09PM +0100, David Hildenbrand wrote:
>> (2) This steals consecutive pages to immediately split them up
>>
>> I know, everybody thinks it might be valuable for their use case to grab all
>> higher-order pages :) It will be "fun" once all these cases start competing.
>> TBH, splitting them up again immediately smells like the lowest
>> priority among all higher-order users.
>
> Actually, it is good for everybody to allocate higher-order pages, if they
> can make use of them. It has the end effect of reducing fragmentation
> (imagine if the base unit of allocation were 512 bytes; every page fault
> would have to do an order-3 allocation, and it wouldn't be long until
> order-0 allocations had fragmented memory such that we could no longer
> service a page fault).
I don't believe that this reasoning is universally true. But I can see
part of it being true if everybody were allocating higher-order pages
and there were no memory pressure.
A simple example of why I am skeptical: our free lists hold an order-9
page and 4 order-0 pages.
It's counter-intuitive to split (fragment!) the order-9 page to allocate
an order-2 page instead of just "consuming the leftover" order-0 pages
and letting somebody else make use of the full order-9 page (e.g., a
proper THP).
Now, reality will tell us whether we end up handing out
higher-order-but-not-THP-order pages too easily and fragmenting the
wrong orders. IMHO, fragmentation is and remains a challenge ... and I
don't expect it to get easier once we have more consumers of
higher-order pages -- especially where they might not be that
beneficial.
I'm happy to be wrong on this one.
>
> Splitting them again is clearly one of the bad things done in this
> proof-of-concept. Anything that goes upstream won't do that, but I
> suspect it was necessary to avoid fixing all the places in the kernel
> that assume anon memory is either order-0 or -9.
Agreed. An upstream version shouldn't perform this split -- which will
require more work.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-10 3:57 ` Yin, Fengwei
@ 2023-01-10 14:40 ` David Hildenbrand
2023-01-11 6:12 ` Yin, Fengwei
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-01-10 14:40 UTC (permalink / raw)
To: Yin, Fengwei, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, willy, rppt, dave.hansen,
ying.huang, tim.c.chen
On 10.01.23 04:57, Yin, Fengwei wrote:
>
>
> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>> On 09.01.23 08:22, Yin Fengwei wrote:
>>> In a nutshell: 4k is too small and 2M is too big. We started
>>> asking ourselves whether there was something in the middle that
>>> we could do. This series shows what that middle ground might
>>> look like. It provides some of the benefits of THP while
>>> eliminating some of the downsides.
>>>
>>> This series uses "multiple consecutive pages" (mcpages) of
>>> between 8K and 2M of base pages for anonymous user space mappings.
>>> This will lead to less internal fragmentation versus 2M mappings
>>> and thus less memory consumption and wasted CPU time zeroing
>>> memory which will never be used.
>>
>> Hi,
Hi,
>>
>> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
> For this patchset, yes. But mcpage can be enabled for page cache,
> swapping etc.
Right, PTE-mapping higher-order pages, in a faultaround fashion. But for
pagecache etc. that doesn't require mcpage IMHO. I think it's the
natural evolution of folios that Willy envisioned at some point.
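(For the pagecache side this is already the direction things are moving: a
filesystem opts in to large folios and the read path can then allocate them.
A sketch, not tied to this series; mapping_set_large_folios() is the existing
helper, the function wrapping it is illustrative:)

#include <linux/fs.h>
#include <linux/pagemap.h>

/* In a filesystem's inode setup: allow the page cache to use large folios. */
static void example_enable_large_folios(struct inode *inode)
{
        mapping_set_large_folios(inode->i_mapping);
}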
>
>>
>> Some thoughts:
>>
>> (1) Faultaround might be unexpected for some workloads and increase
>> memory consumption unnecessarily.
> Compared to THP, the memory consumption and latency introduced by
> mcpage are minor.
But it exists :)
>
>>
>> Yes, something like that can happen with THP BUT
>>
>> (a) THP can be disabled or is frequently only enabled for madvised
>> regions -- for example, exactly for this reason.
>> (b) Some workloads (especially memory ballooning) rely on memory not
>> suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>> because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>> there is a PTE page table, we won't suddenly get a THP populated
>> again (unless khugepaged is configured to fill holes).
>>
>>
>> I strongly assume we will need something similar to force-disable, selectively-enable etc.
> Agree.
Thinking again, we might want to piggy-back on the THP machinery/config
knobs completely, hmm. After all, it's a similar concept to a THP (once
we properly handle folios), just that we are not able to PMD-map the
folio because it is too small.
Some applications that trigger MADV_NOHUGEPAGE don't want to get more
pages populated than actually accessed. userfaultfd users come to mind,
where we might not even have the guarantee of seeing a UFFD registration
before enabling MADV_NOHUGEPAGE and filling out some pages ... if we'd
populate too many PTEs, we could miss uffd faults later ...
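A userspace sketch of that ordering (illustrative only, error handling
omitted): any pages populated before the UFFDIO_REGISTER below will never
raise missing-page events afterwards.

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(buf, len, MADV_NOHUGEPAGE);     /* opt out of huge pages ... */
        buf[0] = 1;                             /* ... but a fault can happen here first */

        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)buf, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);     /* pages populated above won't fault again */
        return 0;
}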
>
>>
>>
>> (2) This steals consecutive pages to immediately split them up
>>
>> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting them up again immediately smells like the lowest priority among all higher-order users.
>>
> The motivations to split it immediately are:
> 1. All the sub-pages are just normal 4K pages. No other changes are needed
> to handle them.
> 2. Splitting it before use doesn't involve complicated page lock handling.
I think for an upstream version we really want to avoid these splits.
>>>
>>> In the implementation, we allocate high order page with order of
>>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>>> physical contiguous memory is used and benefit sequential memory
>>> access latency.
>>>
>>> Then split the high order page. By doing this, the sub-page of
>>> mcpage is just 4K normal page. The current kernel page
>>> management is applied to "mc" pages without any changes. Batching
>>> page faults is allowed with mcpage and reduce page faults number.
>>>
>>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>>> increases memory consumption and latency of allocation page
>>> comparing to 4K base page.
>>>
>>> This series is the first step of mcpage. The furture work can be
>>> enable mcpage for more components like page cache, swapping etc.
>>> Finally, most pages in system will be allocated/free/reclaimed
>>> with mcpage order.
>>
>> I think avoiding new, hard-to-get terminology ("mcpage") might be better. I know, everybody wants to give their child a name, but the name is not really future proof: "multiple consecutive pages" might at one point just be a folio.
>>
>> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>>
>> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
> "cont-pte" on ARM64 has fixed size which match the silicon definition.
> mcpage allows software/user to define the size which is not necessary
> to be exact same as silicon defined. Thanks.
Yes. And the whole concept is abstracted away: it's logically a single,
larger PTE, and we can only map/unmap in that PTE granularity.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-10 14:40 ` David Hildenbrand
@ 2023-01-11 6:12 ` Yin, Fengwei
0 siblings, 0 replies; 17+ messages in thread
From: Yin, Fengwei @ 2023-01-11 6:12 UTC (permalink / raw)
To: David Hildenbrand, linux-mm, akpm, jack, hughd, kirill.shutemov,
mhocko, ak, aarcange, npiggin, mgorman, willy, rppt, dave.hansen,
ying.huang, tim.c.chen
On 1/10/2023 10:40 PM, David Hildenbrand wrote:
> On 10.01.23 04:57, Yin, Fengwei wrote:
>>
>>
>> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>>> On 09.01.23 08:22, Yin Fengwei wrote:
>>>> In a nutshell: 4k is too small and 2M is too big. We started
>>>> asking ourselves whether there was something in the middle that
>>>> we could do. This series shows what that middle ground might
>>>> look like. It provides some of the benefits of THP while
>>>> eliminating some of the downsides.
>>>>
>>>> This series uses "multiple consecutive pages" (mcpages) of
>>>> between 8K and 2M of base pages for anonymous user space mappings.
>>>> This will lead to less internal fragmentation versus 2M mappings
>>>> and thus less memory consumption and wasted CPU time zeroing
>>>> memory which will never be used.
>>>
>>> Hi,
>
> Hi,
>
>>>
>>> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
>> For this patchset, yes. But mcpage can be enabled for page cache,
>> swapping etc.
>
> Right, PTE-mapping higher-order pages, in a faultaround fashion. But for pagecache etc. that doesn't require mcpage IMHO. I think it's the natural evolution of folios that Willy envisioned at some point.
Agree.
>
>>
>>>
>>> Some thoughts:
>>>
>>> (1) Faultaround might be unexpected for some workloads and increase
>>> memory consumption unnecessarily.
>> Compared to THP, the memory consumption and latency introduced by
>> mcpage are minor.
>
> But it exists :)
Yes. There is extra memory consumption, even if it's minor.
>
>>
>>>
>>> Yes, something like that can happen with THP BUT
>>>
>>> (a) THP can be disabled or is frequently only enabled for madvised
>>> regions -- for example, exactly for this reason.
>>> (b) Some workloads (especially memory ballooning) rely on memory not
>>> suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>>> because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>>> there is a PTE page table, we won't suddenly get a THP populated
>>> again (unless khugepaged is configured to fill holes).
>>>
>>>
>>> I strongly assume we will need something similar to force-disable, selectively-enable etc.
>> Agree.
>
> Thinking again, we might want to piggy-back on the THP machinery/config knobs completely, hmm. After all, it's a similar concept to a THP (once we properly handle folios), just that we are not able to PMD-map the folio because it is too small.
>
> Some applications that trigger MADV_NOHUGEPAGE don't want to get more pages populated than actually accessed. userfaultfd users come to mind, where we might not even have the guarantee of seeing a UFFD registration before enabling MADV_NOHUGEPAGE and filling out some pages ... if we'd populate too many PTEs, we could miss uffd faults later ...
This is a good point.
>
>>
>>>
>>>
>>> (2) This steals consecutive pages to immediately split them up
>>>
>>> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting them up again immediately smells like the lowest priority among all higher-order users.
>>>
>> The motivations to split it immediately are:
>> 1. All the sub-pages are just normal 4K pages. No other changes are needed
>> to handle them.
>> 2. Splitting it before use doesn't involve complicated page lock handling.
>
> I think for an upstream version we really want to avoid these splits.
OK.
>
>>>>
>>>> In the implementation, we allocate high order page with order of
>>>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>>>> physical contiguous memory is used and benefit sequential memory
>>>> access latency.
>>>>
>>>> Then split the high order page. By doing this, the sub-page of
>>>> mcpage is just 4K normal page. The current kernel page
>>>> management is applied to "mc" pages without any changes. Batching
>>>> page faults is allowed with mcpage and reduce page faults number.
>>>>
>>>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>>>> increases memory consumption and latency of allocation page
>>>> comparing to 4K base page.
>>>>
>>>> This series is the first step of mcpage. The furture work can be
>>>> enable mcpage for more components like page cache, swapping etc.
>>>> Finally, most pages in system will be allocated/free/reclaimed
>>>> with mcpage order.
>>>
>>> I think avoiding new, hard-to-get terminology ("mcpage") might be better. I know, everybody wants to give their child a name, but the name is not really future proof: "multiple consecutive pages" might at one point just be a folio.
>>>
>>> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>>>
>>> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
>> "cont-pte" on ARM64 has fixed size which match the silicon definition.
>> mcpage allows software/user to define the size which is not necessary
>> to be exact same as silicon defined. Thanks.
>
> Yes. And the whole concept is abstracted away: it's logically a single, larger PTE, and we can only map/unmap in that PTE granularity.
David, thanks a lot for the comments.
Regards
Yin, Fengwei
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
2023-01-09 8:37 ` [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Kirill A. Shutemov
@ 2023-01-11 6:13 ` Yin, Fengwei
0 siblings, 0 replies; 17+ messages in thread
From: Yin, Fengwei @ 2023-01-11 6:13 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, akpm, jack, hughd, kirill.shutemov, mhocko, ak,
aarcange, npiggin, mgorman, willy, rppt, dave.hansen, ying.huang,
tim.c.chen
On 1/9/2023 4:37 PM, Kirill A. Shutemov wrote:
> On Mon, Jan 09, 2023 at 03:22:28PM +0800, Yin Fengwei wrote:
>> In a nutshell: 4k is too small and 2M is too big. We started
>> asking ourselves whether there was something in the middle that
>> we could do. This series shows what that middle ground might
>> look like. It provides some of the benefits of THP while
>> eliminating some of the downsides.
>>
>> This series uses "multiple consecutive pages" (mcpages) of
>> between 8K and 2M of base pages for anonymous user space mappings.
>> This will lead to less internal fragmentation versus 2M mappings
>> and thus less memory consumption and wasted CPU time zeroing
>> memory which will never be used.
>>
>> In the implementation, we allocate high order page with order of
>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>> physical contiguous memory is used and benefit sequential memory
>> access latency.
>>
>> Then split the high order page. By doing this, the sub-page of
>> mcpage is just 4K normal page. The current kernel page
>> management is applied to "mc" pages without any changes. Batching
>> page faults is allowed with mcpage and reduce page faults number.
>>
>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>> increases memory consumption and latency of allocation page
>> comparing to 4K base page.
>>
>> This series is the first step of mcpage. The furture work can be
>> enable mcpage for more components like page cache, swapping etc.
>> Finally, most pages in system will be allocated/free/reclaimed
>> with mcpage order.
>
> It isn't worth adding a new path in page fault handling. We need to make
> existing mechanisms more flexible.
>
> I think it has to be done on top of folios:
>
> 1. Convert anonymous memory to folios. Only order-9 (HPAGE_PMD_ORDER) and
> order-0 at first.
> 2. Remove assumption of THP being order-9.
> 3. Start allocating THPs <order-9.
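As a rough illustration of where step 3 could end up, the fault path might
allocate anonymous folios of an intermediate order directly. The helper below
is a sketch (its name and the choice of GFP flags are assumptions);
vma_alloc_folio() is the existing allocation API:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate one folio of 2^order base pages for an anonymous fault,
 * NUMA-placed against the VMA like any other anonymous page.
 */
static struct folio *alloc_sub_pmd_anon_folio(struct vm_area_struct *vma,
                                              unsigned long addr, int order)
{
        unsigned long haddr = addr & ~((PAGE_SIZE << order) - 1);

        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma, haddr, false);
}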
Thanks a lot for the comments. Really appreciate it.
Regards
Yin, Fengwei
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread (newest: 2023-01-11  6:13 UTC)
Thread overview: 17+ messages
2023-01-09 7:22 [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page Yin Fengwei
2023-01-09 13:24 ` Matthew Wilcox
2023-01-09 16:30 ` Dave Hansen
2023-01-09 17:01 ` Matthew Wilcox
2023-01-10 2:53 ` Yin, Fengwei
2023-01-09 7:22 ` [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages Yin Fengwei
2023-01-09 7:22 ` [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr Yin Fengwei
2023-01-09 8:37 ` [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping Kirill A. Shutemov
2023-01-11 6:13 ` Yin, Fengwei
2023-01-09 17:33 ` David Hildenbrand
2023-01-09 19:11 ` Matthew Wilcox
2023-01-10 14:13 ` David Hildenbrand
2023-01-10 3:57 ` Yin, Fengwei
2023-01-10 14:40 ` David Hildenbrand
2023-01-11 6:12 ` Yin, Fengwei