From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
	David Hildenbrand <david@redhat.com>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Michal Hocko <mhocko@suse.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 Peter Xu <peterx@redhat.com>,
	Rongwei Wang <rongwei.wang@linux.alibaba.com>,
	 SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Arnd Bergmann <arnd@arndb.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Chris Kennelly <ckennelly@google.com>,
	Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	 Hugh Dickins <hughd@google.com>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	 "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matt Turner <mattst88@gmail.com>,
	 Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	 Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>,
	 Pavel Begunkov <asml.silence@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	 "Zach O'Keefe" <zokeefe@google.com>
Subject: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
Date: Fri,  3 Jun 2022 17:39:58 -0700
Message-ID: <20220604004004.954674-10-zokeefe@google.com>
In-Reply-To: <20220604004004.954674-1-zokeefe@google.com>

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* The unpredictable timing of khugepaged collapse is avoided

Immediate users of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED,
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is such an
implementation that could benefit from this[2].
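
A minimal userspace sketch of that allocator flow, for illustration only.
The helper names (chunk_align_up(), chunk_align_down(), chunk_rehuge())
and CHUNK_SIZE are hypothetical and not part of this patch or of TCMalloc;
the 2 MiB hugepage size is an assumption that holds for x86-64 with 4 KiB
base pages, and the fallback MADV_COLLAPSE value is the asm-generic one
added by this patch (parisc uses 73).  The sketch trims the range to whole
hugepages itself before calling madvise(), mirroring the inward rounding
madvise_collapse() does in the diff below.

```c
#define _DEFAULT_SOURCE /* for madvise() under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* asm-generic value from this patch; parisc uses 73 */
#endif

/* Assumption: 2 MiB PMD-sized hugepages (x86-64 with 4 KiB base pages). */
#define CHUNK_SIZE ((size_t)2 << 20)

/* Round an address up to the next hugepage boundary. */
static uintptr_t chunk_align_up(uintptr_t addr)
{
	return (addr + CHUNK_SIZE - 1) & ~(uintptr_t)(CHUNK_SIZE - 1);
}

/* Round an address down to the previous hugepage boundary. */
static uintptr_t chunk_align_down(uintptr_t addr)
{
	return addr & ~(uintptr_t)(CHUNK_SIZE - 1);
}

/*
 * Hypothetical allocator hook: ask the kernel to re-back the whole
 * hugepages inside [addr, addr + len) with THPs.
 * Returns 0 on success (or if the range holds no whole hugepage),
 * otherwise a negative errno value.
 */
static int chunk_rehuge(void *addr, size_t len)
{
	uintptr_t start = chunk_align_up((uintptr_t)addr);
	uintptr_t end = chunk_align_down((uintptr_t)addr + len);

	if (start >= end)
		return 0; /* nothing hugepage-sized to collapse */
	if (madvise((void *)start, end - start, MADV_COLLAPSE) != 0)
		return -errno;
	return 0;
}
```

An allocator would call something like chunk_rehuge() on a chunk it had
earlier partially released with MADV_DONTNEED, once that chunk is hot
again and fully populated.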

Only privately-mapped anon memory is supported for now, but it is
expected that file and shmem support will be added later to support the
use-case of backing executable text by THPs.  Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system,
which might keep services from serving at their full rated load after
(re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
immediately realize iTLB performance prevent page sharing and demand
paging, and losing either increases steady-state memory footprint.  With
MADV_COLLAPSE, we get the best of both worlds: peak upfront performance
and lower RAM footprints.

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.

THP allocation may enter direct reclaim and/or compaction.
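
For callers, the errno values matter more than the internal scan results.
A rough sketch of how userspace might react, based only on the mapping in
madvise_collapse_errno() in this version of the patch (EAGAIN for a
temporarily unavailable kernel resource, ENOMEM for unmapped or
out-of-range addresses, EINVAL otherwise).  The enum and function names
are hypothetical, and the retry policy is an assumption, not something
this patch prescribes.

```c
#include <errno.h>

/* Hypothetical caller-side policy; names are illustrative only. */
enum collapse_action {
	COLLAPSE_DONE,        /* every hugepage in the range is now a THP */
	COLLAPSE_RETRY_LATER, /* transient failure, worth trying again */
	COLLAPSE_GIVE_UP,     /* retrying the same call will not help */
};

/*
 * Classify the (positive) errno left by a failed
 * madvise(..., MADV_COLLAPSE), or 0 for success.
 */
static enum collapse_action collapse_classify(int err)
{
	switch (err) {
	case 0:
		return COLLAPSE_DONE;
	case EAGAIN:
		/* Hugepage allocation or cgroup charge failed. */
		return COLLAPSE_RETRY_LATER;
	case ENOMEM:
		/* Part of the range is not mapped or lies outside the AS. */
	case EINVAL:
		/* Unsupported VMA (e.g. VM_NOHUGEPAGE) or other failure. */
	default:
		return COLLAPSE_GIVE_UP;
	}
}
```

Since THP allocation may enter direct reclaim and/or compaction, a caller
that sees COLLAPSE_RETRY_LATER would probably want to back off rather
than retry in a tight loop.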

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 +++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
 mm/madvise.c                           |   5 +
 8 files changed, 151 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 648cb3ce7099..2ca2f3b41fc8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ad04f552347..073d6bb03b37 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.enforce_page_heuristics = false,
+		.enforce_thp_enabled = false,
+		.last_target_node = NUMA_NO_NODE,
+		.gfp = GFP_TRANSHUGE | __GFP_THISNODE,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, last_fail = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	/* TODO: Support file/shmem */
+	if (!vma->anon_vma || !vma_is_anonymous(vma))
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+
+	/*
+	 * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
+	 * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
+	 * Note that hugepage_vma_check() doesn't enforce that
+	 * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
+	 * must be set (i.e. "never" mode)
+	 */
+	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
+		int result = SCAN_FAIL;
+		bool retry = true;  /* Allow one retry per hugepage */
+retry:
+		if (!mmap_locked) {
+			cond_resched();
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
+			if (result) {
+				last_fail = result;
+				goto out_nolock;
+			}
+		}
+		mmap_assert_locked(mm);
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* Tell caller we dropped mmap_lock */
+
+		switch (result) {
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+			break;
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			last_fail = result;
+			break;
+		case SCAN_PAGE_LRU:
+			if (retry) {
+				lru_add_drain_all();
+				retry = false;
+				goto retry;
+			}
+			fallthrough;
+		default:
+			last_fail = result;
+			/* Other error, exit */
+			goto out_maybelock;
+		}
+	}
+
+out_maybelock:
+	/* Caller expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out_nolock:
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+
+	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
+			: madvise_collapse_errno(last_fail);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 46feb62ce163..eccac2620226 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.36.1.255.ge46751e96f-goog


