From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
Paul Turner <pjt@google.com>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Mike Galbraith <efault@gmx.de>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Dan Smith <danms@us.ibm.com>,
Bharata B Rao <bharata.rao@gmail.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...
Date: Fri, 06 Jul 2012 16:04:59 -0400 [thread overview]
Message-ID: <1341605099.14051.23.camel@zaphod.localdomain> (raw)
In-Reply-To: <4FF7147B.1050001@redhat.com>
On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:
> On 03/23/2012 07:50 AM, Mel Gorman wrote:
> > On Fri, Mar 16, 2012 at 03:40:31PM +0100, Peter Zijlstra wrote:
> >> From: Lee Schermerhorn<Lee.Schermerhorn@hp.com>
> >>
> >> This patch adds another mbind() flag to request "lazy migration".
> >> The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> >> pages are simply unmapped from the calling task's page table ['_MOVE]
> >> or from all referencing page tables [_MOVE_ALL]. Anon pages will first
> >> be added to the swap [or migration?] cache, if necessary. The pages
> >> will be migrated in the fault path on "first touch", if the policy
> >> dictates at that time.
> >>
> >> <SNIP>
> >>
> >> @@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
> >> }
> >>
> >> /*
> >> + * Lazy migration: just unmap pages, moving anon pages to swap cache, if
> >> + * necessary. Migration will occur, if policy dictates, when a task faults
> >> + * an unmapped page back into its page table--i.e., on "first touch" after
> >> + * unmapping. Note that migrate-on-fault only migrates pages whose mapping
> >> + * [e.g., file system] supplies a migratepage op, so we skip pages that
> >> + * wouldn't migrate on fault.
> >> + *
> >> + * Pages are placed back on the lru whether or not they were successfully
> >> + * unmapped. Like migrate_pages().
> >> + *
> >> + * Unline migrate_pages(), this function is only called in the context of
> >> + * a task that is unmapping it's own pages while holding its map semaphore
> >> + * for write.
> >> + */
> >> +int migrate_pages_unmap_only(struct list_head *pagelist)
> >
> > I'm not properly reviewing these patches at the moment but am taking a
> > quick look as I play some catch up on linux-mm.
> >
> > I think it's worth pointing out that this potentially will confuse
> > reclaim. Lets say a process is being migrated to another node and it
> > gets unmapped like this then some heuristics will change.
> >
> > 1. If the page was referenced prior to the unmapping then it should be
> > activated if the page reached the end of the LRU due to the checks
> > in page_check_references(). If the process has been unmapped for
> > migrate-on-fault, the pages will instead be reclaimed.
> >
> > 2. The heuristic that applies pressure to slab pages if pages are mapped
> > is changed. Prior to migrate-on-fault sc->nr_scanned is incremented
> > for mapped pages to increase the number of slab pages scanned to
> > avoid swapping. During migrate-on-fault, this pressure is relieved
> >
> > 3. zone_reclaim_mode in default mode will reclaim pages it would
> > previously have skipped over. It potentially will call shrink_zone more
> > for the local node than falling back to other nodes because it thinks
> > most pages are unmapped. This could lead to some trashing.
> >
> > It may not even be a major problem but it's worth thinking about. If it
> > is a problem, it will be necessary to account for migrate-on-fault pages
> > similar to mapped pages during reclaim.
>
> I can see other serious issues with this approach:
>
> 4. Putting a lot of pages in the swap cache ends up allocating
> swap space. This means this NUMA migration scheme will only
> work on systems that have a substantial amount of memory
> represented by swap space. This is highly unlikely on systems
> with memory in the TB range. On smaller systems, it could drive
> the system out of memory (to the OOM killer), by "filling up"
> the overflow swap with migration pages instead.
> 5. In the long run, we want the ability to migrate transparent
> huge pages as one unit. The reason is simple, the performance
> penalty for running on the wrong NUMA node (10-20%) is on the
> same order of magnitude as the performance penalty for running
> with 4kB pages instead of 2MB pages (5-15%).
>
> Breaking up large pages into small ones, and having khugepaged
> reconstitute them on a random NUMA node later on, will negate
> the performance benefits of both NUMA placement and THP.
>
> In short, while this approach made sense when Lee first proposed
> it several years ago (with smaller memory systems, and before Linux
> had transparent huge pages), I do not believe it is an acceptable
> approach to NUMA migration any more.
>
> We really want something like PROT_NONE or PTE_NUMA page table
> (and page directory) entries, so we can avoid filling up swap
> space with migration pages and have the possibility of migrating
> transparent huge pages in one piece at some point.
>
> In other words, NAK to this patch
>
When I originally posted the "migrate on fault" series, I posted a
separate series with a "migration cache" to avoid the use of swap space
for lazy migration: http://markmail.org/message/xgvvrnn2nk4nsn2e.
The migration cache was originally implemented by Marcello Tosatti for
the old memory hotplug project:
http://marc.info/?l=linux-mm&m=109779128211239&w=4.
The idea is that you don't need swap space for lazy migration, just an
"address_space" where you can park an anon VMA's pte's while they're
"unmapped" to cause migration faults. Based on a suggestion from
Christoph Lameter, I had tried to hide the migration cache behind the
swap cache interface to minimize changes mainly in do_swap_page and
vmscan/reclaim. It seemed to work, but the difference in reference
count semantics for the mig cache -- entry removed when last pte
migrated/mapped -- makes coordination with exit teardown, uh, tricky.
Regards,
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-07-06 20:05 UTC|newest]
Thread overview: 152+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-03-16 14:40 [RFC][PATCH 00/26] sched/numa Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 01/26] mm, mpol: Re-implement check_*_range() using walk_page_range() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-07-06 10:32 ` Johannes Weiner
2012-07-06 14:48 ` Minchan Kim
2012-07-06 15:02 ` Peter Zijlstra
2012-07-06 14:54 ` Kyungmin Park
2012-07-06 15:00 ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY Peter Zijlstra
2012-03-23 11:50 ` Mel Gorman
2012-07-06 16:38 ` Rik van Riel
2012-07-06 20:04 ` Lee Schermerhorn [this message]
2012-07-06 20:27 ` Rik van Riel
2012-07-09 11:48 ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 04/26] mm, mpol: add MPOL_MF_NOOP Peter Zijlstra
2012-07-06 18:40 ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 05/26] mm, mpol: Check for misplaced page Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 06/26] mm: Migrate " Peter Zijlstra
2012-04-03 17:32 ` Dan Smith
2012-03-16 14:40 ` [RFC][PATCH 07/26] mm: Handle misplaced anon pages Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 08/26] mm, mpol: Simplify do_mbind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 09/26] sched, mm: Introduce tsk_home_node() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 10/26] mm, mpol: Make mempolicy home-node aware Peter Zijlstra
2012-03-16 18:34 ` Christoph Lameter
2012-03-16 21:12 ` Peter Zijlstra
2012-03-19 13:53 ` Christoph Lameter
2012-03-19 14:05 ` Peter Zijlstra
2012-03-19 15:16 ` Christoph Lameter
2012-03-19 15:23 ` Peter Zijlstra
2012-03-19 15:31 ` Christoph Lameter
2012-03-19 17:09 ` Peter Zijlstra
2012-03-19 17:28 ` Peter Zijlstra
2012-03-19 19:06 ` Christoph Lameter
2012-03-19 20:28 ` Lee Schermerhorn
2012-03-19 21:21 ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 11/26] mm, mpol: Lazy migrate a process/vma Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 12/26] sched, mm: sched_{fork,exec} node assignment Peter Zijlstra
2012-06-15 18:16 ` Tony Luck
2012-06-20 19:12 ` [PATCH] sched: Fix build problems when CONFIG_NUMA=y and CONFIG_SMP=n Luck, Tony
2012-03-16 14:40 ` [RFC][PATCH 13/26] sched: Implement home-node awareness Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 14/26] sched, numa: Numa balancer Peter Zijlstra
2012-07-07 18:26 ` Rik van Riel
2012-07-09 12:05 ` Peter Zijlstra
2012-07-09 12:23 ` Peter Zijlstra
2012-07-09 12:40 ` Peter Zijlstra
2012-07-09 14:50 ` Rik van Riel
2012-07-08 18:35 ` Rik van Riel
2012-07-09 12:25 ` Peter Zijlstra
2012-07-09 14:54 ` Rik van Riel
2012-07-12 22:02 ` Rik van Riel
2012-07-13 14:45 ` Don Morris
2012-07-14 16:20 ` Rik van Riel
2012-03-16 14:40 ` [RFC][PATCH 15/26] sched, numa: Implement hotplug hooks Peter Zijlstra
2012-03-19 12:16 ` Srivatsa S. Bhat
2012-03-19 12:19 ` Peter Zijlstra
2012-03-19 12:27 ` Srivatsa S. Bhat
2012-03-16 14:40 ` [RFC][PATCH 16/26] sched, numa: Abstract the numa_entity Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 17/26] srcu: revert1 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 18/26] srcu: revert2 Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 19/26] srcu: Implement call_srcu() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 20/26] mm, mpol: Introduce vma_dup_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 21/26] mm, mpol: Introduce vma_put_policy() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 22/26] mm, mpol: Split and explose some mempolicy functions Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 23/26] sched, numa: Introduce sys_numa_{t,m}bind() Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 24/26] mm, mpol: Implement numa_group RSS accounting Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 25/26] sched, numa: Only migrate long-running entities Peter Zijlstra
2012-07-08 18:34 ` Rik van Riel
2012-07-09 12:26 ` Peter Zijlstra
2012-07-09 14:53 ` Rik van Riel
2012-07-09 14:55 ` Peter Zijlstra
2012-03-16 14:40 ` [RFC][PATCH 26/26] sched, numa: A few debug bits Peter Zijlstra
2012-03-16 18:25 ` [RFC] AutoNUMA alpha6 Andrea Arcangeli
2012-03-19 18:47 ` Peter Zijlstra
2012-03-19 19:02 ` Andrea Arcangeli
2012-03-20 23:41 ` Dan Smith
2012-03-21 1:00 ` Andrea Arcangeli
2012-03-21 2:12 ` Andrea Arcangeli
2012-03-21 4:01 ` Dan Smith
2012-03-21 12:49 ` Andrea Arcangeli
2012-03-21 22:05 ` Dan Smith
2012-03-21 22:52 ` Andrea Arcangeli
2012-03-21 23:13 ` Dan Smith
2012-03-21 23:41 ` Andrea Arcangeli
2012-03-22 0:17 ` Andrea Arcangeli
2012-03-22 13:58 ` Dan Smith
2012-03-22 14:27 ` Andrea Arcangeli
2012-03-22 18:49 ` Andrea Arcangeli
2012-03-22 18:56 ` Dan Smith
2012-03-22 19:11 ` Andrea Arcangeli
2012-03-23 14:15 ` Andrew Theurer
2012-03-23 16:01 ` Andrea Arcangeli
2012-03-25 13:30 ` Andrea Arcangeli
2012-03-21 7:12 ` Ingo Molnar
2012-03-21 12:08 ` Andrea Arcangeli
2012-03-21 7:53 ` Ingo Molnar
2012-03-21 12:17 ` Andrea Arcangeli
2012-03-19 9:57 ` [RFC][PATCH 00/26] sched/numa Avi Kivity
2012-03-19 11:12 ` Peter Zijlstra
2012-03-19 11:30 ` Peter Zijlstra
2012-03-19 11:39 ` Peter Zijlstra
2012-03-19 11:42 ` Avi Kivity
2012-03-19 11:59 ` Peter Zijlstra
2012-03-19 12:07 ` Avi Kivity
2012-03-19 12:09 ` Peter Zijlstra
2012-03-19 12:16 ` Avi Kivity
2012-03-19 20:03 ` Peter Zijlstra
2012-03-20 10:18 ` Avi Kivity
2012-03-20 10:48 ` Peter Zijlstra
2012-03-20 10:52 ` Avi Kivity
2012-03-20 11:07 ` Peter Zijlstra
2012-03-20 11:48 ` Avi Kivity
2012-03-19 12:20 ` Peter Zijlstra
2012-03-19 12:24 ` Avi Kivity
2012-03-19 15:44 ` Avi Kivity
2012-03-19 13:40 ` Andrea Arcangeli
2012-03-19 20:06 ` Peter Zijlstra
2012-03-19 13:04 ` Andrea Arcangeli
2012-03-19 13:26 ` Peter Zijlstra
2012-03-19 13:57 ` Andrea Arcangeli
2012-03-19 14:06 ` Avi Kivity
2012-03-19 14:30 ` Andrea Arcangeli
2012-03-19 18:42 ` Peter Zijlstra
2012-03-20 22:18 ` Rik van Riel
2012-03-21 16:50 ` Andrea Arcangeli
2012-04-02 16:34 ` Pekka Enberg
2012-04-02 16:55 ` Rik van Riel
2012-04-02 16:54 ` Pekka Enberg
2012-04-02 17:12 ` Pekka Enberg
2012-04-02 17:23 ` Pekka Enberg
2012-03-19 14:07 ` Peter Zijlstra
2012-03-19 14:34 ` Andrea Arcangeli
2012-03-19 18:41 ` Peter Zijlstra
2012-03-19 19:13 ` Peter Zijlstra
2012-03-19 14:07 ` Andrea Arcangeli
2012-03-19 19:05 ` Peter Zijlstra
2012-03-19 13:26 ` Peter Zijlstra
2012-03-19 14:16 ` Andrea Arcangeli
2012-03-19 13:29 ` Peter Zijlstra
2012-03-19 14:19 ` Andrea Arcangeli
2012-03-19 13:39 ` Peter Zijlstra
2012-03-19 14:20 ` Andrea Arcangeli
2012-03-19 20:17 ` Christoph Lameter
2012-03-19 20:28 ` Ingo Molnar
2012-03-19 20:43 ` Christoph Lameter
2012-03-19 21:34 ` Ingo Molnar
2012-03-20 0:05 ` Linus Torvalds
2012-03-20 7:31 ` Ingo Molnar
2012-03-21 22:53 ` Nish Aravamudan
2012-03-22 9:45 ` Peter Zijlstra
2012-03-22 10:34 ` Ingo Molnar
2012-03-24 1:41 ` Nish Aravamudan
2012-03-26 11:42 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1341605099.14051.23.camel@zaphod.localdomain \
--to=lee.schermerhorn@hp.com \
--cc=a.p.zijlstra@chello.nl \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=bharata.rao@gmail.com \
--cc=danms@us.ibm.com \
--cc=efault@gmx.de \
--cc=hannes@cmpxchg.org \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mingo@elte.hu \
--cc=paulmck@linux.vnet.ibm.com \
--cc=pjt@google.com \
--cc=riel@redhat.com \
--cc=suresh.b.siddha@intel.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox