From: Fengguang Wu <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
Fan Du <fan.du@intel.com>, Jingqi Liu <jingqi.liu@intel.com>,
Fengguang Wu <fengguang.wu@intel.com>,
kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
Yao Yuan <yuan.yao@intel.com>, Peng Dong <dongx.peng@intel.com>,
Huang Ying <ying.huang@intel.com>,
Dong Eddie <eddie.dong@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Zhang Yi <yi.z.zhang@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>
Subject: [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
Date: Wed, 26 Dec 2018 21:15:06 +0800 [thread overview]
Message-ID: <20181226133352.246320288@intel.com> (raw)
In-Reply-To: <20181226131446.330864849@intel.com>
[-- Attachment #1: 0012-vmscan-migrate-anonymous-pages-to-pmem-node-before-s.patch --]
[-- Type: text/plain, Size: 3812 bytes --]
From: Jingqi Liu <jingqi.liu@intel.com>
With PMEM nodes, the demotion path could be:
1) DRAM pages: migrate to the PMEM node
2) PMEM pages: swap out
This patch implements (1) for anonymous pages only, since we cannot
detect the hotness of (unmapped) page cache pages for now.
The user space daemon can do migration in both directions:
- PMEM=>DRAM hot page migration
- DRAM=>PMEM cold page migration
However, it's more natural for user space to do hot page migration
and for the kernel to do cold page migration. In particular, only the
kernel can guarantee on-demand migration when there is memory pressure.
So the big picture looks like this: the user space daemon does regular
hot page migration to DRAM, creating memory pressure on the DRAM nodes,
which in turn triggers kernel cold page migration to the PMEM nodes.
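For reference, the kernel-side check this patch adds can be summarized
by the sketch below. This is illustrative only: demote_to_pmem() is a
made-up name, while is_node_dram() and pgdat->peer_node come from
earlier patches in this series; the authoritative version is in the
diff at the end. The MADV_FREE special case is explained in the notes
that follow.

	/* Sketch of the per-page demotion decision (illustrative). */
	static bool demote_to_pmem(struct page *page,
				   struct pglist_data *pgdat)
	{
		/* Anonymous pages only: hotness of (unmapped) page
		 * cache pages cannot be detected yet. */
		if (!PageAnon(page))
			return false;

		/* Clean MADV_FREE pages (anonymous, !PageSwapBacked)
		 * are cheaper to drop than to migrate; THP remains
		 * eligible for demotion. */
		if (!PageSwapBacked(page) && !PageTransHuge(page))
			return false;

		/* Demote only from a DRAM node, and only when the
		 * peer PMEM node actually exists. */
		return is_node_dram(pgdat->node_id) &&
		       node_online(pgdat->peer_node);
	}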
Du Fan:
- Support multiple NUMA nodes.
- Don't migrate clean MADV_FREE pages to the PMEM node
  (see the user-space sketch after these notes).
  After the madvise(MADV_FREE) syscall, both the vma and its
  page table entries stay alive, but the pages become MADV_FREE
  pages: anonymous, but WITHOUT PageSwapBacked set.
  During page reclaim, clean MADV_FREE pages are freed and
  returned to the buddy system, while dirty ones turn back into
  canonical anonymous pages with PageSwapBacked(page) set and
  are put on the LRU_INACTIVE_FILE list, falling into the
  standard aging routine.
  The point is that clean MADV_FREE pages must not be migrated:
  their user data is stale (useless) once madvise(MADV_FREE) has
  been called, so guard against such scenarios.
P.S. MADV_FREE is heavily used by the jemalloc allocator and by
workloads like redis; refer to [1] for detailed background, use
cases, and benchmark results.
[1] https://lore.kernel.org/patchwork/patch/622179/
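For background, here is a minimal user-space illustration of the
MADV_FREE semantics described above. Nothing in it is specific to
this series; MADV_FREE requires Linux 4.5+ and a libc that exposes
the flag.

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1 << 20;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		memset(buf, 0xab, len);	/* dirty, swap-backed anon pages */

		/*
		 * Lazy-free the range: under memory pressure the kernel
		 * may drop these pages without swapping them out. Such
		 * clean MADV_FREE pages go straight back to the buddy
		 * system, which is why this patch refuses to migrate
		 * them to PMEM.
		 */
		madvise(buf, len, MADV_FREE);

		/* A later write cancels lazy-free for the touched page:
		 * it becomes a canonical swap-backed anon page again. */
		buf[0] = 1;

		munmap(buf, len);
		return 0;
	}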
Fengguang:
- detect and migrate THP and hugetlb pages
- avoid moving pages to a non-existent node
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
mm/vmscan.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
--- linux.orig/mm/vmscan.c 2018-12-23 20:37:58.305551976 +0800
+++ linux/mm/vmscan.c 2018-12-23 20:37:58.305551976 +0800
@@ -1112,6 +1112,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(move_pages);
 	int pgactivate = 0;
 	unsigned nr_unqueued_dirty = 0;
 	unsigned nr_dirty = 0;
@@ -1121,6 +1122,7 @@ static unsigned long shrink_page_list(st
 	unsigned nr_immediate = 0;
 	unsigned nr_ref_keep = 0;
 	unsigned nr_unmap_fail = 0;
+	int page_on_dram = is_node_dram(pgdat->node_id);
 
 	cond_resched();
 
@@ -1275,6 +1277,21 @@ static unsigned long shrink_page_list(st
 		}
 
 		/*
+		 * Check if the page is on a DRAM NUMA node. Skip
+		 * MADV_FREE pages, as they might be freed back to
+		 * the buddy system immediately if clean.
+		 */
+		if (node_online(pgdat->peer_node) &&
+		    PageAnon(page) && (PageSwapBacked(page) || PageTransHuge(page))) {
+			if (page_on_dram) {
+				/* Queue for migration to the peer PMEM node. */
+				list_add(&page->lru, &move_pages);
+				unlock_page(page);
+				continue;
+			}
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1496,6 +1513,22 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Move the queued anonymous pages to the peer PMEM node. */
+	if (!list_empty(&move_pages)) {
+		int err;
+
+		/* Cannot block here, hence MIGRATE_ASYNC. */
+		err = migrate_pages(&move_pages, alloc_new_node_page, NULL,
+				    pgdat->peer_node,
+				    MIGRATE_ASYNC, MR_NUMA_MISPLACED);
+		if (err) {
+			putback_movable_pages(&move_pages);
+
+			/* Splice any stragglers onto the return list. */
+			list_splice(&move_pages, &ret_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);