From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
David Hildenbrand <david@redhat.com>,
David Rientjes <rientjes@google.com>,
Matthew Wilcox <willy@infradead.org>,
Michal Hocko <mhocko@suse.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Peter Xu <peterx@redhat.com>,
Rongwei Wang <rongwei.wang@linux.alibaba.com>,
SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
Zi Yan <ziy@nvidia.com>,
linux-mm@kvack.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Arnd Bergmann <arnd@arndb.de>,
Axel Rasmussen <axelrasmussen@google.com>,
Chris Kennelly <ckennelly@google.com>,
Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
Hugh Dickins <hughd@google.com>,
Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
Jens Axboe <axboe@kernel.dk>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Matt Turner <mattst88@gmail.com>,
Max Filippov <jcmvbkbc@gmail.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Minchan Kim <minchan@kernel.org>,
Patrick Xia <patrickx@google.com>,
Pavel Begunkov <asml.silence@gmail.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
"Zach O'Keefe" <zokeefe@google.com>
Subject: [mm-unstable v7 02/18] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
Date: Wed, 6 Jul 2022 16:59:20 -0700 [thread overview]
Message-ID: <20220706235936.2197195-3-zokeefe@google.com> (raw)
In-Reply-To: <20220706235936.2197195-1-zokeefe@google.com>
From: Yang Shi <shy828301@gmail.com>
The khugepaged has optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
the next loop. CONFIG_NUMA doesn't do so since the next loop may try to
collapse huge page from a different node, so it doesn't make too much sense
to carry it.
But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, so it means huge page may be allocated
even though there is no suitable range for collapsing. Then the page would
be just freed if khugepaged already made enough progress. This could make
NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y run. This
problem actually makes things worse due to the way more pointless THP
allocations and makes the optimization pointless.
This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried
indefinitely. But if we take one step back, the optimization itself seems
not worth keeping nowadays since:
* Not too many users build NUMA=n kernel nowadays even though the kernel is
actually running on a non-NUMA machine. Some small devices may run NUMA=n
kernel, but I don't think they actually use THP.
* Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists"), THP could be cached by pcp. This actually
somehow does the job done by the optimization.
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Co-developed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
mm/khugepaged.c | 120 +++++++++++-------------------------------------
1 file changed, 26 insertions(+), 94 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5269d15e20f6..196eaadbf415 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -796,29 +796,16 @@ static int khugepaged_find_target_node(void)
last_khugepaged_target_node = target_node;
return target_node;
}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+#else
+static int khugepaged_find_target_node(void)
{
- if (IS_ERR(*hpage)) {
- if (!*wait)
- return false;
-
- *wait = false;
- *hpage = NULL;
- khugepaged_alloc_sleep();
- } else if (*hpage) {
- put_page(*hpage);
- *hpage = NULL;
- }
-
- return true;
+ return 0;
}
+#endif
static struct page *
khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
- VM_BUG_ON_PAGE(*hpage, *hpage);
-
*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -830,74 +817,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
count_vm_event(THP_COLLAPSE_ALLOC);
return *hpage;
}
-#else
-static int khugepaged_find_target_node(void)
-{
- return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
- struct page *page;
-
- page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
- HPAGE_PMD_ORDER);
- if (page)
- prep_transhuge_page(page);
- return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
- struct page *hpage;
-
- do {
- hpage = alloc_khugepaged_hugepage();
- if (!hpage) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- if (!*wait)
- return NULL;
-
- *wait = false;
- khugepaged_alloc_sleep();
- } else
- count_vm_event(THP_COLLAPSE_ALLOC);
- } while (unlikely(!hpage) && likely(hugepage_flags_enabled()));
-
- return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- /*
- * If the hpage allocated earlier was briefly exposed in page cache
- * before collapse_file() failed, it is possible that racing lookups
- * have not yet completed, and would then be unpleasantly surprised by
- * finding the hpage reused for the same mapping at a different offset.
- * Just release the previous allocation if there is any danger of that.
- */
- if (*hpage && page_count(*hpage) > 1) {
- put_page(*hpage);
- *hpage = NULL;
- }
-
- if (!*hpage)
- *hpage = khugepaged_alloc_hugepage(wait);
-
- if (unlikely(!*hpage))
- return false;
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
-{
- VM_BUG_ON(!*hpage);
-
- return *hpage;
-}
-#endif
/*
* If mmap_lock temporarily dropped, revalidate vma
@@ -1148,8 +1067,10 @@ static void collapse_huge_page(struct mm_struct *mm,
out_up_write:
mmap_write_unlock(mm);
out_nolock:
- if (!IS_ERR_OR_NULL(*hpage))
+ if (!IS_ERR_OR_NULL(*hpage)) {
mem_cgroup_uncharge(page_folio(*hpage));
+ put_page(*hpage);
+ }
trace_mm_collapse_huge_page(mm, isolated, result);
return;
}
@@ -1951,8 +1872,10 @@ static void collapse_file(struct mm_struct *mm,
unlock_page(new_page);
out:
VM_BUG_ON(!list_empty(&pagelist));
- if (!IS_ERR_OR_NULL(*hpage))
+ if (!IS_ERR_OR_NULL(*hpage)) {
mem_cgroup_uncharge(page_folio(*hpage));
+ put_page(*hpage);
+ }
/* TODO: tracepoints */
}
@@ -2197,10 +2120,7 @@ static void khugepaged_do_scan(void)
lru_add_drain_all();
- while (progress < pages) {
- if (!khugepaged_prealloc_page(&hpage, &wait))
- break;
-
+ while (true) {
cond_resched();
if (unlikely(kthread_should_stop() || try_to_freeze()))
@@ -2216,10 +2136,22 @@ static void khugepaged_do_scan(void)
else
progress = pages;
spin_unlock(&khugepaged_mm_lock);
- }
- if (!IS_ERR_OR_NULL(hpage))
- put_page(hpage);
+ if (progress >= pages)
+ break;
+
+ if (IS_ERR(hpage)) {
+ /*
+ * If fail to allocate the first time, try to sleep for
+ * a while. When hit again, cancel the scan.
+ */
+ if (!wait)
+ break;
+ wait = false;
+ hpage = NULL;
+ khugepaged_alloc_sleep();
+ }
+ }
}
static bool khugepaged_should_wakeup(void)
--
2.37.0.rc0.161.g10f37bed90-goog
next prev parent reply other threads:[~2022-07-07 0:06 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
2022-07-11 20:38 ` Yang Shi
2022-07-12 17:14 ` Zach O'Keefe
2022-07-06 23:59 ` Zach O'Keefe [this message]
2022-07-06 23:59 ` [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-07-08 21:01 ` Andrew Morton
2022-07-11 18:29 ` Zach O'Keefe
2022-07-11 18:45 ` Andrew Morton
2022-07-12 14:17 ` Zach O'Keefe
2022-07-11 21:51 ` Yang Shi
2022-07-06 23:59 ` [mm-unstable v7 04/18] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 05/18] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior Zach O'Keefe
2022-07-11 20:43 ` Yang Shi
2022-07-12 17:06 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() Zach O'Keefe
2022-07-11 20:57 ` Yang Shi
2022-07-12 16:58 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage Zach O'Keefe
2022-07-11 21:03 ` Yang Shi
2022-07-12 16:50 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-07-11 21:22 ` Yang Shi
2022-07-12 16:54 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 10/18] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint Zach O'Keefe
2022-07-11 21:32 ` Yang Shi
2022-07-12 16:21 ` Zach O'Keefe
2022-07-12 17:05 ` Yang Shi
2022-07-12 17:30 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-07-08 20:47 ` Andrew Morton
2022-07-13 1:05 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps Zach O'Keefe
2022-07-11 21:37 ` Yang Shi
2022-07-12 16:31 ` Zach O'Keefe
2022-07-12 17:27 ` Yang Shi
2022-07-12 17:57 ` Zach O'Keefe
2022-07-13 18:02 ` Andrew Morton
2022-07-13 18:40 ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 14/18] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 15/18] selftests/vm: dedup hugepage allocation logic Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 16/18] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 17/18] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 18/18] selftests/vm: add selftest to verify multi THP collapse Zach O'Keefe
2022-07-14 18:55 ` [RFC] mm: userspace hugepage collapse: file/shmem semantics Zach O'Keefe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220706235936.2197195-3-zokeefe@google.com \
--to=zokeefe@google.com \
--cc=James.Bottomley@HansenPartnership.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@linux.alibaba.com \
--cc=arnd@arndb.de \
--cc=asml.silence@gmail.com \
--cc=axboe@kernel.dk \
--cc=axelrasmussen@google.com \
--cc=chris@zankel.net \
--cc=ckennelly@google.com \
--cc=david@redhat.com \
--cc=deller@gmx.de \
--cc=hughd@google.com \
--cc=ink@jurassic.park.msu.ru \
--cc=jcmvbkbc@gmail.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linmiaohe@huawei.com \
--cc=linux-mm@kvack.org \
--cc=mattst88@gmail.com \
--cc=mhocko@suse.com \
--cc=minchan@kernel.org \
--cc=pasha.tatashin@soleen.com \
--cc=patrickx@google.com \
--cc=peterx@redhat.com \
--cc=rientjes@google.com \
--cc=rongwei.wang@linux.alibaba.com \
--cc=shy828301@gmail.com \
--cc=sj@kernel.org \
--cc=songliubraving@fb.com \
--cc=tsbogend@alpha.franken.de \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox