From: Fengguang Wu <fengguang.wu@intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Linux Memory Management List <linux-mm@kvack.org>,
kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
Fan Du <fan.du@intel.com>, Yao Yuan <yuan.yao@intel.com>,
Peng Dong <dongx.peng@intel.com>,
Huang Ying <ying.huang@intel.com>,
Liu Jingqi <jingqi.liu@intel.com>,
Dong Eddie <eddie.dong@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Zhang Yi <yi.z.zhang@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Fri, 28 Dec 2018 21:15:42 +0800 [thread overview]
Message-ID: <20181228131542.geshbmzvhr3litty@wfg-t540p.sh.intel.com> (raw)
In-Reply-To: <20181228121515.GS16738@dhcp22.suse.cz>
[-- Attachment #1: Type: text/plain, Size: 949 bytes --]
On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
>On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
>[...]
>> Those look like unnecessary complexities for this post. This v2 patchset
>> mainly fulfills our first milestone goal: a minimal viable solution
>> that's relatively clean to backport. Even when preparing for new
>> upstreamable versions, it may be good to keep it simple for the
>> initial upstream inclusion.
>
>On the other hand this is creating a new NUMA semantic and I would like
>to have something long term rather than throw something in now and care
>about long term later. So I would really prefer to talk about long term
>plans first and only care about implementation details later.
That makes good sense. FYI, here are several in-house patches that
try to leverage (but not yet integrate with) NUMA balancing. The last
one is brute-force hacking. They obviously break the original NUMA
balancing logic.
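For background, they piggyback on the existing NUMA balancing machinery:
change_prot_numa() plants PROT_NONE hinting PTEs, the next touch by the
owning task faults into do_numa_page(), and mpol_misplaced() /
migrate_misplaced_page() decide and perform the move in that task's
context. Below is a rough, stubbed-out C model of that pipeline
(illustrative only, not kernel code; addresses and node numbers are made
up):

/*
 * Illustrative model of the NUMA balancing path the patches lean on.
 * All functions are stubs; the names mirror the real kernel helpers,
 * but this is not kernel code.
 */
#include <stdio.h>

/* change_prot_numa(): plant PROT_NONE hinting PTEs over [start, end) */
static void change_prot_numa(unsigned long start, unsigned long end)
{
	printf("hint-protect [%#lx, %#lx)\n", start, end);
}

/* mpol_misplaced(): -1 means "leave the page alone", else target node */
static int mpol_misplaced(int page_nid, int faulting_nid)
{
	return page_nid == faulting_nid ? -1 : faulting_nid;
}

static void migrate_misplaced_page(unsigned long addr, int target_nid)
{
	printf("migrate page at %#lx to node %d\n", addr, target_nid);
}

/* do_numa_page(): runs in the owning task's context on the next touch */
static void do_numa_page(unsigned long addr, int page_nid, int faulting_nid)
{
	int target_nid = mpol_misplaced(page_nid, faulting_nid);

	if (target_nid < 0)
		return;
	migrate_misplaced_page(addr, target_nid);
}

int main(void)
{
	/* a hotness scanner decided the page at this (made up) address is hot */
	change_prot_numa(0x7f0000001000UL, 0x7f0000002000UL);
	/* ...later the owning task touches it from a CPU on node 0 (DRAM) */
	do_numa_page(0x7f0000001000UL, 2 /* page on PMEM node */, 0);
	return 0;
}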
Thanks,
Fengguang
[-- Attachment #2: 0074-migrate-set-PROT_NONE-on-the-PTEs-and-let-NUMA-balan.patch --]
[-- Type: text/x-diff, Size: 1332 bytes --]
From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
From: Liu Jingqi <jingqi.liu@intel.com>
Date: Sat, 29 Sep 2018 23:29:56 +0800
Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA balancing
CONFIG_NUMA_BALANCING needs to be enabled first.
Set PROT_NONE on the PTEs that map to the page,
and do the actual migration in the context of the process that initiated the migration.
Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
mm/migrate.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c
index b27a287081c2..d933f6966601 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
+	if (flags & MPOL_MF_SW_YOUNG) {
+		unsigned long start, end;
+		unsigned long nr_pte_updates = 0;
+
+		start = max(addr, vma->vm_start);
+
+		/* TODO: if huge page */
+		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
+		end = min(end, vma->vm_end);
+		nr_pte_updates = change_prot_numa(vma, start, end);
+
+		err = 0;
+		goto out_putpage;
+	}
+
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
 			/* Check if the page is software young. */
--
2.15.0
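For context, the hook above is meant to be driven from userspace through
move_pages(2) with the MPOL_MF_SW_YOUNG flag added later in this series
(patch 19/21). A minimal sketch of such a caller follows; the flag's
numeric value is a placeholder assumption that must match the patched
kernel's uapi header, and the page address is made up:

/*
 * Minimal userspace sketch: nudge one hot page toward a DRAM node via
 * move_pages(2) with MPOL_MF_SW_YOUNG.  Build with -lnuma.
 * MPOL_MF_SW_YOUNG is not in stock headers; the value below is a
 * placeholder and must match the patched kernel's uapi definition.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_MF_SW_YOUNG
#define MPOL_MF_SW_YOUNG	(1 << 7)	/* placeholder value */
#endif

int main(int argc, char **argv)
{
	int pid = argc > 1 ? atoi(argv[1]) : 0;		/* 0 == calling process */
	void *pages[1] = { (void *)0x7f0000001000UL };	/* hot page (made up) */
	int nodes[1] = { 0 };				/* target DRAM node */
	int status[1];

	/*
	 * With MPOL_MF_SW_YOUNG the kernel does not isolate/migrate here:
	 * it marks the PTE PROT_NONE (or SetPageReferenced() if the page is
	 * already on the target node) and lets NUMA balancing migrate the
	 * page in the owning task's context on the next access.
	 */
	if (move_pages(pid, 1, pages, nodes, status,
		       MPOL_MF_MOVE | MPOL_MF_SW_YOUNG))
		perror("move_pages");
	else
		printf("status[0] = %d\n", status[0]);
	return 0;
}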
[-- Attachment #3: 0075-migrate-consolidate-MPOL_MF_SW_YOUNG-behaviors.patch --]
[-- Type: text/x-diff, Size: 3514 bytes --]
From e617e8c2034387cbed50bafa786cf83528dbe3df Mon Sep 17 00:00:00 2001
From: Fengguang Wu <fengguang.wu@intel.com>
Date: Sun, 30 Sep 2018 10:50:58 +0800
Subject: [PATCH 075/166] migrate: consolidate MPOL_MF_SW_YOUNG behaviors
- if page already in target node: SetPageReferenced
- otherwise: change_prot_numa
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
arch/x86/kvm/Kconfig | 1 +
mm/migrate.c | 65 +++++++++++++++++++++++++++++++---------------------
2 files changed, 40 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4c6dec47fac6..c103373536fc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -100,6 +100,7 @@ config KVM_EPT_IDLE
tristate "KVM EPT idle page tracking"
depends on KVM_INTEL
depends on PROC_PAGE_MONITOR
+ depends on NUMA_BALANCING
---help---
Provides support for walking EPT to get the A bits on Intel
processors equipped with the VT extensions.
diff --git a/mm/migrate.c b/mm/migrate.c
index d933f6966601..d944f031c9ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1500,6 +1500,8 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 {
 	struct vm_area_struct *vma;
 	struct page *page;
+	unsigned long end;
+	unsigned int page_nid;
 	unsigned int follflags;
 	int err;
 	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
@@ -1522,49 +1524,60 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (!page)
 		goto out;
 
-	err = 0;
-	if (page_to_nid(page) == node)
-		goto out_putpage;
+	page_nid = page_to_nid(page);
 
 	err = -EACCES;
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
-	if (flags & MPOL_MF_SW_YOUNG) {
-		unsigned long start, end;
-		unsigned long nr_pte_updates = 0;
-
-		start = max(addr, vma->vm_start);
-
-		/* TODO: if huge page */
-		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
-		end = min(end, vma->vm_end);
-		nr_pte_updates = change_prot_numa(vma, start, end);
-
-		err = 0;
-		goto out_putpage;
-	}
-
+	err = 0;
 	if (PageHuge(page)) {
-		if (PageHead(page)) {
-			/* Check if the page is software young. */
-			if (flags & MPOL_MF_SW_YOUNG)
+		if (!PageHead(page)) {
+			err = -EACCES;
+			goto out_putpage;
+		}
+		if (flags & MPOL_MF_SW_YOUNG) {
+			if (page_nid == node)
 				SetPageReferenced(page);
-			isolate_huge_page(page, pagelist);
-			err = 0;
+			else if (PageAnon(page)) {
+				end = addr + (hpage_nr_pages(page) << PAGE_SHIFT);
+				if (end <= vma->vm_end)
+					change_prot_numa(vma, addr, end);
+			}
+			goto out_putpage;
 		}
+		if (page_nid == node)
+			goto out_putpage;
+		isolate_huge_page(page, pagelist);
 	} else {
 		struct page *head;
 
 		head = compound_head(page);
+
+		if (flags & MPOL_MF_SW_YOUNG) {
+			if (page_nid == node)
+				SetPageReferenced(head);
+			else {
+				unsigned long size;
+				size = hpage_nr_pages(head) << PAGE_SHIFT;
+				end = addr + size;
+				if (unlikely(addr & (size - 1)))
+					err = -EXDEV;
+				else if (likely(end <= vma->vm_end))
+					change_prot_numa(vma, addr, end);
+				else
+					err = -ERANGE;
+			}
+			goto out_putpage;
+		}
+		if (page_nid == node)
+			goto out_putpage;
+
 		err = isolate_lru_page(head);
 		if (err)
 			goto out_putpage;
 
 		err = 0;
-		/* Check if the page is software young. */
-		if (flags & MPOL_MF_SW_YOUNG)
-			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
--
2.15.0
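Restating the consolidated MPOL_MF_SW_YOUNG behavior above as a small,
stubbed C model (illustrative only; helper names mirror the kernel's, and
the huge-page, alignment and error-path details of the real patch are
omitted):

/*
 * Simplified model of add_page_for_migration() after this patch.
 * All helpers are stubs; alignment/range checks, huge-page handling and
 * error codes from the real patch are omitted.
 */
#include <stdbool.h>
#include <stdio.h>

static void SetPageReferenced(const char *page)
{
	printf("%s: mark referenced (already on target node, keep it there)\n", page);
}

static void change_prot_numa(const char *page)
{
	printf("%s: plant hinting PTEs, NUMA balancing migrates on next touch\n", page);
}

static void isolate_for_direct_migration(const char *page)
{
	printf("%s: isolate and queue for direct migration\n", page);
}

static void add_page_for_migration(const char *page, bool sw_young,
				   bool on_target_node)
{
	if (sw_young) {
		/* software-young mode never migrates directly from here */
		if (on_target_node)
			SetPageReferenced(page);
		else
			change_prot_numa(page);
		return;
	}
	if (on_target_node)
		return;			/* nothing to do */
	isolate_for_direct_migration(page);
}

int main(void)
{
	add_page_for_migration("hot page on PMEM node", true, false);
	add_page_for_migration("hot page already on DRAM node", true, true);
	add_page_for_migration("page moved the classic way", false, false);
	return 0;
}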
[-- Attachment #4: 0076-mempolicy-force-NUMA-balancing.patch --]
[-- Type: text/x-diff, Size: 1511 bytes --]
From a2d9740d1639f807868014c16dc9e2620d356f3c Mon Sep 17 00:00:00 2001
From: Fengguang Wu <fengguang.wu@intel.com>
Date: Sun, 30 Sep 2018 19:22:27 +0800
Subject: [PATCH 076/166] mempolicy: force NUMA balancing
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
mm/memory.c | 3 ++-
mm/mempolicy.c | 5 -----
2 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..20c7efdff63b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3775,7 +3775,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(page, vma, addr);
+	return 0;
+	/* return mpol_misplaced(page, vma, addr); */
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..21dc6ba1d062 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2295,8 +2295,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	int ret = -1;
 
 	pol = get_vma_policy(vma, addr);
-	if (!(pol->flags & MPOL_F_MOF))
-		goto out;
 
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
@@ -2336,9 +2334,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
 		polnid = thisnid;
-
-		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
-			goto out;
 	}
 
 	if (curnid != polnid)
--
2.15.0