* [PATCH -mm 13/25] Noreclaim LRU Infrastructure
[not found] <20080606202838.390050172@redhat.com>
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
` (5 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-13-lts-noreclaim-ramfs-pages-are-non-reclaimable.patch --]
[-- Type: text/plain, Size: 34028 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.
Kosaki Motohiro added the support for the memory controller noreclaim
lru list.
Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM_LRU.
A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable. Subsequent patches will add the various
!reclaimable tests. We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from nonreclaimable to reclaimable
state, one should test reclaimability under page lock and place
nonreclaimable pages directly on the noreclaim list before dropping the
lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
list. It's OK to use the pagevec caches for reclaimable pages. The new
function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
this transition, including potential page truncation while the page is
unlocked.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2
include/linux/mm_inline.h | 13 ++-
include/linux/mmzone.h | 24 ++++++
include/linux/page-flags.h | 13 +++
include/linux/pagevec.h | 1
include/linux/swap.h | 12 +++
mm/Kconfig | 10 ++
mm/internal.h | 26 +++++++
mm/memcontrol.c | 73 ++++++++++++--------
mm/mempolicy.c | 2
mm/migrate.c | 68 ++++++++++++------
mm/page_alloc.c | 9 ++
mm/swap.c | 52 +++++++++++---
mm/vmscan.c | 164 +++++++++++++++++++++++++++++++++++++++------
14 files changed, 382 insertions(+), 87 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/Kconfig
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM_LRU
+ bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
+ depends on EXPERIMENTAL && 64BIT
+ help
+ Supports tracking of non-reclaimable pages off the [in]active lists
+ to avoid excessive reclaim overhead on large memory systems. Pages
+ may be non-reclaimable because: they are locked into memory, they
+ are anonymous pages for which no swap space exists, or they are anon
+ pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
@@ -94,6 +94,9 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
PG_swapbacked, /* Page is backed by RAM/swap */
+#ifdef CONFIG_NORECLAIM_LRU
+ PG_noreclaim, /* Page is "non-reclaimable" */
+#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -167,6 +170,7 @@ PAGEFLAG(Referenced, referenced) TESTCLE
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
+ TESTCLEARFLAG(Active, active)
__PAGEFLAG(Slab, slab)
PAGEFLAG(Checked, owner_priv_1) /* Used by some filesystems */
PAGEFLAG(Pinned, owner_priv_1) TESTSCFLAG(Pinned, owner_priv_1) /* Xen */
@@ -203,6 +207,15 @@ PAGEFLAG(SwapCache, swapcache)
PAGEFLAG_FALSE(SwapCache)
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
+ TESTCLEARFLAG(Noreclaim, noreclaim)
+#else
+PAGEFLAG_FALSE(Noreclaim) TESTCLEARFLAG_FALSE(Noreclaim)
+ SETPAGEFLAG_NOOP(Noreclaim) CLEARPAGEFLAG_NOOP(Noreclaim)
+ __CLEARPAGEFLAG_NOOP(Noreclaim)
+#endif
+
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PAGEFLAG(Uncached, uncached)
#else
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-06-06 16:05:15.000000000 -0400
@@ -85,6 +85,11 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
+#ifdef CONFIG_NORECLAIM_LRU
+ NR_NORECLAIM, /* " " " " " */
+#else
+ NR_NORECLAIM = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -124,10 +129,18 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM_LRU
+ LRU_NORECLAIM,
+#else
+ LRU_NORECLAIM = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
+ NR_LRU_LISTS
+};
#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
static inline int is_file_lru(enum lru_list l)
{
return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
@@ -138,6 +151,15 @@ static inline int is_active_lru(enum lru
return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
}
+static inline int is_noreclaim_lru(enum lru_list l)
+{
+#ifdef CONFIG_NORECLAIM_LRU
+ return l == LRU_NORECLAIM;
+#else
+ return 0;
+#endif
+}
+
enum lru_list page_lru(struct page *page);
struct per_cpu_pages {
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:05:15.000000000 -0400
@@ -256,6 +256,9 @@ static void bad_page(struct page *page)
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -491,6 +494,9 @@ static inline int free_pages_check(struc
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -642,6 +648,9 @@ static int prep_new_page(struct page *pa
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_slab |
1 << PG_swapcache |
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-06-06 16:05:15.000000000 -0400
@@ -89,11 +89,16 @@ del_page_from_lru(struct zone *zone, str
enum lru_list l = LRU_INACTIVE_ANON;
list_del(&page->lru);
- if (PageActive(page)) {
- __ClearPageActive(page);
- l += LRU_ACTIVE;
+ if (PageNoreclaim(page)) {
+ __ClearPageNoreclaim(page);
+ l = LRU_NORECLAIM;
+ } else {
+ if (PageActive(page)) {
+ __ClearPageActive(page);
+ l += LRU_ACTIVE;
+ }
+ l += page_file_cache(page);
}
- l += page_file_cache(page);
__dec_zone_state(zone, NR_INACTIVE_ANON + l);
}
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:05:15.000000000 -0400
@@ -180,6 +180,8 @@ extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
+extern void add_page_to_noreclaim_list(struct page *page);
+
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
@@ -228,6 +230,16 @@ static inline int zone_reclaim(struct zo
}
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return 1;
+}
+#endif
+
extern int kswapd_run(int nid);
#ifdef CONFIG_MMU
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-06-06 16:05:15.000000000 -0400
@@ -101,7 +101,6 @@ static inline void __pagevec_lru_add_act
____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
}
-
static inline void pagevec_lru_add_file(struct pagevec *pvec)
{
if (pagevec_count(pvec))
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:05:15.000000000 -0400
@@ -106,9 +106,13 @@ enum lru_list page_lru(struct page *page
{
enum lru_list lru = LRU_BASE;
- if (PageActive(page))
- lru += LRU_ACTIVE;
- lru += page_file_cache(page);
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+ else {
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
+ lru += page_file_cache(page);
+ }
return lru;
}
@@ -133,7 +137,8 @@ static void pagevec_move_tail(struct pag
zone = pagezone;
spin_lock(&zone->lru_lock);
}
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) &&
+ !PageNoreclaim(page)) {
int lru = page_file_cache(page);
list_move_tail(&page->lru, &zone->list[lru]);
pgmoved++;
@@ -154,7 +159,7 @@ static void pagevec_move_tail(struct pag
void rotate_reclaimable_page(struct page *page)
{
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
- PageLRU(page)) {
+ !PageNoreclaim(page) && PageLRU(page)) {
struct pagevec *pvec;
unsigned long flags;
@@ -175,7 +180,7 @@ void activate_page(struct page *page)
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
int file = page_file_cache(page);
int lru = LRU_BASE + file;
del_page_from_lru_list(zone, page, lru);
@@ -184,7 +189,7 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);
if (file) {
zone->recent_scanned_file++;
@@ -207,7 +212,8 @@ void activate_page(struct page *page)
*/
void mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ if (!PageActive(page) && !PageNoreclaim(page) &&
+ PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
@@ -235,13 +241,38 @@ void __lru_cache_add(struct page *page,
void lru_cache_add_lru(struct page *page, enum lru_list lru)
{
if (PageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
ClearPageActive(page);
+ } else if (PageNoreclaim(page)) {
+ VM_BUG_ON(PageActive(page));
+ ClearPageNoreclaim(page);
}
- VM_BUG_ON(PageLRU(page) || PageActive(page));
+ VM_BUG_ON(PageLRU(page) || PageActive(page) || PageNoreclaim(page));
__lru_cache_add(page, lru);
}
+/**
+ * add_page_to_noreclaim_list
+ * @page: the page to be added to the noreclaim list
+ *
+ * Add page directly to its zone's noreclaim list. To avoid races with
+ * tasks that might be making the page reclaimble while it's not on the
+ * lru, we want to add the page while it's locked or otherwise "invisible"
+ * to other tasks. This is difficult to do when using the pagevec cache,
+ * so bypass that.
+ */
+void add_page_to_noreclaim_list(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ SetPageNoreclaim(page);
+ SetPageLRU(page);
+ add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -339,6 +370,7 @@ void release_pages(struct page **pages,
if (PageLRU(page)) {
struct zone *pagezone = page_zone(page);
+
if (pagezone != zone) {
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
@@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
{
int i;
struct zone *zone = NULL;
+ VM_BUG_ON(is_noreclaim_lru(lru));
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
@@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
if (is_active_lru(lru))
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-06-06 16:05:15.000000000 -0400
@@ -53,14 +53,9 @@ int migrate_prep(void)
return 0;
}
-static inline void move_to_lru(struct page *page)
-{
- lru_cache_add_lru(page, page_lru(page));
- put_page(page);
-}
-
/*
- * Add isolated pages on the list back to the LRU.
+ * Add isolated pages on the list back to the LRU under page lock
+ * to avoid leaking reclaimable pages back onto noreclaim list.
*
* returns the number of pages put back.
*/
@@ -72,7 +67,9 @@ int putback_lru_pages(struct list_head *
list_for_each_entry_safe(page, page2, l, lru) {
list_del(&page->lru);
- move_to_lru(page);
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
count++;
}
return count;
@@ -340,8 +337,11 @@ static void migrate_page_copy(struct pag
SetPageReferenced(newpage);
if (PageUptodate(page))
SetPageUptodate(newpage);
- if (PageActive(page))
+ if (TestClearPageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
SetPageActive(newpage);
+ } else
+ noreclaim_migrate_page(newpage, page);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
@@ -362,7 +362,6 @@ static void migrate_page_copy(struct pag
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
- ClearPageActive(page);
ClearPagePrivate(page);
set_page_private(page, 0);
page->mapping = NULL;
@@ -541,10 +540,15 @@ static int fallback_migrate_page(struct
*
* The new page will have replaced the old page if this function
* is successful.
+ *
+ * Return value:
+ * < 0 - error code
+ * == 0 - success
*/
static int move_to_new_page(struct page *newpage, struct page *page)
{
struct address_space *mapping;
+ int unlock = 1;
int rc;
/*
@@ -579,10 +583,16 @@ static int move_to_new_page(struct page
if (!rc) {
remove_migration_ptes(page, newpage);
+ /*
+ * Put back on LRU while holding page locked to
+ * handle potential race with, e.g., munlock()
+ */
+ unlock = putback_lru_page(newpage);
} else
newpage->mapping = NULL;
- unlock_page(newpage);
+ if (unlock)
+ unlock_page(newpage);
return rc;
}
@@ -599,18 +609,19 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ int unlock = 1;
if (!newpage)
return -ENOMEM;
if (page_count(page) == 1)
/* page was freed from under us. So we are done. */
- goto move_newpage;
+ goto end_migration;
charge = mem_cgroup_prepare_migration(page, newpage);
if (charge == -ENOMEM) {
rc = -ENOMEM;
- goto move_newpage;
+ goto end_migration;
}
/* prepare cgroup just returns 0 or -ENOMEM */
BUG_ON(charge);
@@ -618,7 +629,7 @@ static int unmap_and_move(new_page_t get
rc = -EAGAIN;
if (TestSetPageLocked(page)) {
if (!force)
- goto move_newpage;
+ goto end_migration;
lock_page(page);
}
@@ -680,8 +691,6 @@ rcu_unlock:
unlock:
- unlock_page(page);
-
if (rc != -EAGAIN) {
/*
* A page that has been migrated has all references
@@ -690,17 +699,30 @@ unlock:
* restored.
*/
list_del(&page->lru);
- move_to_lru(page);
+ if (!page->mapping) {
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ put_page(page); /* just free the old page */
+ goto end_migration;
+ } else
+ unlock = putback_lru_page(page);
}
-move_newpage:
+ if (unlock)
+ unlock_page(page);
+
+end_migration:
if (!charge)
mem_cgroup_end_migration(newpage);
- /*
- * Move the new page to the LRU. If migration was not successful
- * then this will free the page.
- */
- move_to_lru(newpage);
+
+ if (!newpage->mapping) {
+ /*
+ * Migration failed or was never attempted.
+ * Free the newpage.
+ */
+ VM_BUG_ON(page_count(newpage) != 1);
+ put_page(newpage);
+ }
if (result) {
if (rc)
*result = rc;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:05:50.000000000 -0400
@@ -437,6 +437,73 @@ cannot_free:
return 0;
}
+/**
+ * putback_lru_page
+ * @page to be put back to appropriate lru list
+ *
+ * Add previously isolated @page to appropriate LRU list.
+ * Page may still be non-reclaimable for other reasons.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ * Must be called with page locked.
+ *
+ * return 1 if page still locked [not truncated], else 0
+ */
+int putback_lru_page(struct page *page)
+{
+ int lru;
+ int ret = 1;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page);
+ ClearPageNoreclaim(page); /* for page_reclaimable() */
+
+ if (unlikely(!page->mapping)) {
+ /*
+ * page truncated. drop lock as put_page() will
+ * free the page.
+ */
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ ret = 0;
+ } else if (page_reclaimable(page, NULL)) {
+ /*
+ * For reclaimable pages, we can use the cache.
+ * In event of a race, worst case is we end up with a
+ * non-reclaimable page on [in]active list.
+ * We know how to handle that.
+ */
+ lru += page_file_cache(page);
+ lru_cache_add_lru(page, lru);
+ mem_cgroup_move_lists(page, lru);
+ } else {
+ /*
+ * Put non-reclaimable pages directly on zone's noreclaim
+ * list.
+ */
+ add_page_to_noreclaim_list(page);
+ mem_cgroup_move_lists(page, LRU_NORECLAIM);
+ }
+
+ put_page(page); /* drop ref from isolate */
+ return ret; /* ret => "page still locked" */
+}
+
+/*
+ * Cull page that shrink_*_list() has detected to be non-reclaimable
+ * under page lock to close races with other tasks that might be making
+ * the page reclaimable. Avoid stranding a reclaimable page on the
+ * noreclaim list.
+ */
+static inline void cull_nonreclaimable_page(struct page *page)
+{
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -470,6 +537,12 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ if (putback_lru_page(page))
+ unlock_page(page);
+ continue;
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
@@ -566,7 +639,7 @@ static unsigned long shrink_page_list(st
* possible for a page to have PageDirty set, but it is actually
* clean (all its buffers are clean). This happens if the
* buffers were written out directly, with submit_bh(). ext3
- * will do this, as well as the blockdev mapping.
+ * will do this, as well as the blockdev mapping.
* try_to_release_page() will discover that cleanness and will
* drop the buffers and mark the page clean - it can be freed.
*
@@ -598,6 +671,7 @@ activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
remove_exclusive_swap_page_ref(page);
+ VM_BUG_ON(PageActive(page));
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
return ret;
+ /*
+ * Non-reclaimable pages shouldn't make it onto either the active
+ * nor the inactive list. However, when doing lumpy reclaim of
+ * higher order pages we can still run into them.
+ */
+ if (PageNoreclaim(page))
+ return ret;
+
ret = -EBUSY;
if (likely(get_page_unless_zero(page))) {
/*
@@ -758,7 +840,7 @@ static unsigned long isolate_lru_pages(u
/* else it is being freed elsewhere */
list_move(&cursor_page->lru, src);
default:
- break;
+ break; /* ! on LRU or wrong list */
}
}
}
@@ -818,8 +900,9 @@ static unsigned long clear_active_flags(
* Returns -EBUSY if the page was not on an LRU list.
*
* The returned page will have PageLRU() cleared. If it was found on
- * the active list, it will have PageActive set. That flag may need
- * to be cleared by the caller before letting the page go.
+ * the active list, it will have PageActive set. If it was found on
+ * the noreclaim list, it will have the PageNoreclaim bit set. That flag
+ * may need to be cleared by the caller before letting the page go.
*
* The vmstat statistic corresponding to the list on which the page was
* found will be decremented.
@@ -844,7 +927,13 @@ int isolate_lru_page(struct page *page)
ret = 0;
ClearPageLRU(page);
+ /* Calculate the LRU list for normal pages ... */
lru += page_file_cache(page) + !!PageActive(page);
+
+ /* ... except NoReclaim, which has its own list. */
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+
del_page_from_lru_list(zone, page, lru);
}
spin_unlock_irq(&zone->lru_lock);
@@ -959,19 +1048,27 @@ static unsigned long shrink_inactive_lis
int lru = LRU_BASE;
page = lru_to_page(&page_list);
VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
list_del(&page->lru);
- if (page_file_cache(page))
- lru += LRU_FILE;
- if (scan_global_lru(sc)) {
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ cull_nonreclaimable_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ } else {
if (page_file_cache(page))
- zone->recent_rotated_file++;
- else
- zone->recent_rotated_anon++;
+ lru += LRU_FILE;
+ if (scan_global_lru(sc)) {
+ if (page_file_cache(page))
+ zone->recent_rotated_file++;
+ else
+ zone->recent_rotated_anon++;
+ }
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
}
- if (PageActive(page))
- lru += LRU_ACTIVE;
+ SetPageLRU(page);
add_page_to_lru_list(zone, page, lru);
+ mem_cgroup_move_lists(page, lru);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -1065,6 +1162,12 @@ static void shrink_active_list(unsigned
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
+
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ cull_nonreclaimable_page(page);
+ continue;
+ }
+
if (page_referenced(page, 0, sc->mem_cgroup)) {
if (file) {
/* Referenced file pages stay active. */
@@ -1107,7 +1210,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);
list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, false);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1139,7 +1242,7 @@ static void shrink_active_list(unsigned
VM_BUG_ON(!PageActive(page));
list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1277,7 +1380,7 @@ static unsigned long shrink_zone(int pri
get_scan_ratio(zone, sc, percent);
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (scan_global_lru(sc)) {
int file = is_file_lru(l);
int scan;
@@ -1308,7 +1411,7 @@ static unsigned long shrink_zone(int pri
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
(unsigned long)sc->swap_cluster_max);
@@ -1859,8 +1962,8 @@ static unsigned long shrink_all_zones(un
if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
continue;
- for_each_lru(l) {
- /* For pass = 0 we don't shrink the active list */
+ for_each_reclaimable_lru(l) {
+ /* For pass = 0, we don't shrink the active list */
if (pass == 0 &&
(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
continue;
@@ -2197,3 +2300,26 @@ int zone_reclaim(struct zone *zone, gfp_
return ret;
}
#endif
+
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * page_reclaimable - test whether a page is reclaimable
+ * @page: the page to test
+ * @vma: the VMA in which the page is or will be mapped, may be NULL
+ *
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+ VM_BUG_ON(PageNoreclaim(page));
+
+ /* TODO: test page [!]reclaimable conditions */
+
+ return 1;
+}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mempolicy.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mempolicy.c 2008-06-06 16:05:15.000000000 -0400
@@ -2199,7 +2199,7 @@ static void gather_stats(struct page *pa
if (PageSwapCache(page))
md->swapcache++;
- if (PageActive(page))
+ if (PageActive(page) || PageNoreclaim(page))
md->active++;
if (PageWriteback(page))
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
@@ -34,8 +34,15 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}
+/*
+ * in mm/vmscan.c:
+ */
extern int isolate_lru_page(struct page *page);
+extern int putback_lru_page(struct page *page);
+/*
+ * in mm/page_alloc.c
+ */
extern void __free_pages_bootmem(struct page *page, unsigned int order);
/*
@@ -49,6 +56,25 @@ static inline unsigned long page_order(s
return page_private(page);
}
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * noreclaim_migrate_page() called only from migrate_page_copy() to
+ * migrate noreclaim flag to new page.
+ * Note that the old page has been isolated from the LRU lists at this
+ * point so we don't need to worry about LRU statistics.
+ */
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+ if (TestClearPageNoreclaim(old))
+ SetPageNoreclaim(new);
+}
+#else
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+}
+#endif
+
+
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
* so all functions starting at paging_init should be marked __init
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-06-06 16:05:15.000000000 -0400
@@ -161,9 +161,10 @@ struct page_cgroup {
int ref_cnt; /* cached, mapped, migrating */
int flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_NORECLAIM (0x8) /* page is noreclaimable page */
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -283,10 +284,14 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
@@ -299,10 +304,14 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
@@ -310,21 +319,31 @@ static void __mem_cgroup_add_list(struct
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
}
-static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int lru = LRU_FILE * !!file + !!from;
+ int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+ int noreclaim = pc->flags & PAGE_CGROUP_FLAG_NORECLAIM;
+ enum lru_list from = noreclaim ? LRU_NORECLAIM :
+ (LRU_FILE * !!file + !!active);
- MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ if (lru == from)
+ return;
- if (active)
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- else
+ MEM_CGROUP_ZSTAT(mz, from) -= 1;
+
+ if (is_noreclaim_lru(lru)) {
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags |= PAGE_CGROUP_FLAG_NORECLAIM;
+ } else {
+ if (is_active_lru(lru))
+ pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ else
+ pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags &= ~PAGE_CGROUP_FLAG_NORECLAIM;
+ }
- lru = LRU_FILE * !!file + !!active;
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_move(&pc->lru, &mz->lists[lru]);
}
@@ -342,7 +361,7 @@ int task_in_mem_cgroup(struct task_struc
/*
* This routine assumes that the appropriate zone's lru lock is already held
*/
-void mem_cgroup_move_lists(struct page *page, bool active)
+void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
struct mem_cgroup_per_zone *mz;
@@ -362,7 +381,7 @@ void mem_cgroup_move_lists(struct page *
if (pc) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, active);
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(page);
@@ -460,12 +479,10 @@ unsigned long mem_cgroup_isolate_pages(u
/*
* TODO: play better with lumpy reclaim, grabbing anything.
*/
- if (PageActive(page) && !active) {
- __mem_cgroup_move_lists(pc, true);
- continue;
- }
- if (!PageActive(page) && active) {
- __mem_cgroup_move_lists(pc, false);
+ if (PageNoreclaim(page) ||
+ (PageActive(page) && !active) ||
+ (!PageActive(page) && active)) {
+ __mem_cgroup_move_lists(pc, page_lru(page));
continue;
}
Index: linux-2.6.26-rc2-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/memcontrol.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/memcontrol.h 2008-06-06 16:05:15.000000000 -0400
@@ -35,7 +35,7 @@ extern int mem_cgroup_charge(struct page
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page, bool active);
+extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
2008-06-08 20:34 ` Rik van Riel
2008-06-10 20:09 ` Rik van Riel
0 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Fri, 06 Jun 2008 16:28:51 -0400
Rik van Riel <riel@redhat.com> wrote:
>
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> Infrastructure to manage pages excluded from reclaim--i.e., hidden
> from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
> to maintain "nonreclaimable" pages on a separate per-zone LRU list,
> to "hide" them from vmscan.
>
> Kosaki Motohiro added the support for the memory controller noreclaim
> lru list.
>
> Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
> Thus, PG_noreclaim is analogous to and mutually exclusive with
> PG_active--it specifies which LRU list the page is on.
>
> The noreclaim infrastructure is enabled by a new mm Kconfig option
> [CONFIG_]NORECLAIM_LRU.
Having a config option for this really sucks, and needs extra-special
justification, rather than none.
Plus..
akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/pagemap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/swap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/vmstat.h:#ifdef CONFIG_NORECLAIM_LRU
./kernel/sysctl.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/internal.h:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU
> A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
> or not a page is reclaimable. Subsequent patches will add the various
> !reclaimable tests. We'll want to keep these tests light-weight for
> use in shrink_active_list() and, possibly, the fault path.
>
> To avoid races between tasks putting pages [back] onto an LRU list and
> tasks that might be moving the page from nonreclaimable to reclaimable
> state, one should test reclaimability under page lock and place
> nonreclaimable pages directly on the noreclaim list before dropping the
> lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
> list. It's OK to use the pagevec caches for reclaimable pages. The new
> function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
> this transition, including potential page truncation while the page is
> unlocked.
>
The changelog doesn't even mention, let alone explain and justify the
fact that this feature is not available on 32-bit systems. This is a
large drawback - it means that a (hopefully useful) feature is
unavailable to the large majority of Linux systems and that it reduces
the testing coverage and that it adversely impacts MM maintainability.
> Index: linux-2.6.26-rc2-mm1/mm/Kconfig
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
> @@ -205,3 +205,13 @@ config NR_QUICK
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config NORECLAIM_LRU
> + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> + depends on EXPERIMENTAL && 64BIT
> + help
> + Supports tracking of non-reclaimable pages off the [in]active lists
> + to avoid excessive reclaim overhead on large memory systems. Pages
> + may be non-reclaimable because: they are locked into memory, they
> + are anonymous pages for which no swap space exists, or they are anon
> + pages that are expensive to unmap [long anon_vma "related vma" list.]
Aunt Tillie might be struggling with some of that.
> Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> @@ -94,6 +94,9 @@ enum pageflags {
> PG_reclaim, /* To be reclaimed asap */
> PG_buddy, /* Page is free, on buddy lists */
> PG_swapbacked, /* Page is backed by RAM/swap */
> +#ifdef CONFIG_NORECLAIM_LRU
> + PG_noreclaim, /* Page is "non-reclaimable" */
> +#endif
I fear that we're messing up the terminology here.
Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
already means a few different things, but in the vmscan context,
"reclaimable" means that the page is unreferenced, clean and can be
stolen. "reclaimable" also means a lot of other things, and we just
made that worse.
Can we think of a new term which uniquely describes this new concept
and use that, rather than flogging the old horse?
>
> ...
>
> +/**
> + * add_page_to_noreclaim_list
> + * @page: the page to be added to the noreclaim list
> + *
> + * Add page directly to its zone's noreclaim list. To avoid races with
> + * tasks that might be making the page reclaimble while it's not on the
> + * lru, we want to add the page while it's locked or otherwise "invisible"
> + * to other tasks. This is difficult to do when using the pagevec cache,
> + * so bypass that.
> + */
How does a task "make a page reclaimable"? munlock()? fsync()?
exit()?
Choice of terminology matters...
> +void add_page_to_noreclaim_list(struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> +
> + spin_lock_irq(&zone->lru_lock);
> + SetPageNoreclaim(page);
> + SetPageLRU(page);
> + add_page_to_lru_list(zone, page, LRU_NORECLAIM);
> + spin_unlock_irq(&zone->lru_lock);
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -339,6 +370,7 @@ void release_pages(struct page **pages,
>
> if (PageLRU(page)) {
> struct zone *pagezone = page_zone(page);
> +
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irqrestore(&zone->lru_lock,
> @@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
> {
> int i;
> struct zone *zone = NULL;
> + VM_BUG_ON(is_noreclaim_lru(lru));
>
> for (i = 0; i < pagevec_count(pvec); i++) {
> struct page *page = pvec->pages[i];
> @@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
> zone = pagezone;
> spin_lock_irq(&zone->lru_lock);
> }
> + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
If this ever triggers, you'll wish that it had been coded with two
separate assertions.
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> if (is_active_lru(lru))
>
> ...
>
> +/**
> + * putback_lru_page
> + * @page to be put back to appropriate lru list
> + *
> + * Add previously isolated @page to appropriate LRU list.
> + * Page may still be non-reclaimable for other reasons.
> + *
> + * lru_lock must not be held, interrupts must be enabled.
> + * Must be called with page locked.
> + *
> + * return 1 if page still locked [not truncated], else 0
> + */
The kerneldoc function description is missing.
> +int putback_lru_page(struct page *page)
> +{
> + int lru;
> + int ret = 1;
> +
> + VM_BUG_ON(!PageLocked(page));
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page);
> + ClearPageNoreclaim(page); /* for page_reclaimable() */
> +
> + if (unlikely(!page->mapping)) {
> + /*
> + * page truncated. drop lock as put_page() will
> + * free the page.
> + */
> + VM_BUG_ON(page_count(page) != 1);
> + unlock_page(page);
> + ret = 0;
> + } else if (page_reclaimable(page, NULL)) {
> + /*
> + * For reclaimable pages, we can use the cache.
> + * In event of a race, worst case is we end up with a
> + * non-reclaimable page on [in]active list.
> + * We know how to handle that.
> + */
> + lru += page_file_cache(page);
> + lru_cache_add_lru(page, lru);
> + mem_cgroup_move_lists(page, lru);
> + } else {
> + /*
> + * Put non-reclaimable pages directly on zone's noreclaim
> + * list.
> + */
> + add_page_to_noreclaim_list(page);
> + mem_cgroup_move_lists(page, LRU_NORECLAIM);
> + }
> +
> + put_page(page); /* drop ref from isolate */
> + return ret; /* ret => "page still locked" */
> +}
<stares for a while>
<penny drops>
So THAT'S what the magical "return 2" is doing in page_file_cache()!
<looks>
OK, after all the patches are applied, the "2" becomes LRU_FILE and the
enumeration of `enum lru_list' reflects that.
> +/*
> + * Cull page that shrink_*_list() has detected to be non-reclaimable
> + * under page lock to close races with other tasks that might be making
> + * the page reclaimable. Avoid stranding a reclaimable page on the
> + * noreclaim list.
> + */
> +static inline void cull_nonreclaimable_page(struct page *page)
> +{
> + lock_page(page);
> + if (putback_lru_page(page))
> + unlock_page(page);
> +}
Again, the terminology is quite overloaded and confusing. What does
"non-reclaimable" mean in this context? _Any_ page which was dirty or
which had an elevated refcount? Surely not referenced pages, which the
scanner also can treat as non-reclaimable.
Did you check whether all these inlined functions really should have
been inlined? Even ones like this are probably too large.
> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
>
> ...
>
> @@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
> if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
> return ret;
>
> + /*
> + * Non-reclaimable pages shouldn't make it onto either the active
> + * nor the inactive list. However, when doing lumpy reclaim of
> + * higher order pages we can still run into them.
I guess that something along the lines of "when this function is being
called for lumpy reclaim we can still .." would be clearer.
> + */
> + if (PageNoreclaim(page))
> + return ret;
> +
> ret = -EBUSY;
> if (likely(get_page_unless_zero(page))) {
> /*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-07 1:05 ` Andrew Morton
@ 2008-06-08 20:34 ` Rik van Riel
2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:07 ` KOSAKI Motohiro
2008-06-10 20:09 ` Rik van Riel
1 sibling, 2 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-08 20:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:51 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> >
> > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > [CONFIG_]NORECLAIM_LRU.
>
> Having a config option for this really sucks, and needs extra-special
> justification, rather than none.
I believe the justification is that it uses a page flag.
PG_noreclaim would be the 20th page flag used, meaning there are
4 more free if 8 bits are used for zone and node info, which would
give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
for 32 bit x86.
If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
just compile in always.
Please let me know what your preference is.
> > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > @@ -94,6 +94,9 @@ enum pageflags {
> > PG_reclaim, /* To be reclaimed asap */
> > PG_buddy, /* Page is free, on buddy lists */
> > PG_swapbacked, /* Page is backed by RAM/swap */
> > +#ifdef CONFIG_NORECLAIM_LRU
> > + PG_noreclaim, /* Page is "non-reclaimable" */
> > +#endif
>
> I fear that we're messing up the terminology here.
>
> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> already means a few different things, but in the vmscan context,
> "reclaimable" means that the page is unreferenced, clean and can be
> stolen. "reclaimable" also means a lot of other things, and we just
> made that worse.
>
> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?
Want to reuse the BSD term "pinned" instead?
> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimble while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...
Lee? Kosaki-san?
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:34 ` Rik van Riel
@ 2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:32 ` Rik van Riel
2008-06-08 22:03 ` Rik van Riel
2008-06-08 21:07 ` KOSAKI Motohiro
1 sibling, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-08 20:57 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 16:34:13 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Fri, 6 Jun 2008 18:05:06 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Fri, 06 Jun 2008 16:28:51 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> > >
> > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > [CONFIG_]NORECLAIM_LRU.
> >
> > Having a config option for this really sucks, and needs extra-special
> > justification, rather than none.
>
> I believe the justification is that it uses a page flag.
>
> PG_noreclaim would be the 20th page flag used, meaning there are
> 4 more free if 8 bits are used for zone and node info, which would
> give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> for 32 bit x86.
>
> If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> just compile in always.
Seems unlikely to be useful? The only way in which this would be an
advantage if if we hae some other feature which also needs a page flag
but which will never be concurrently enabled with this one.
> Please let me know what your preference is.
Don't use another page flag?
> > > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > > @@ -94,6 +94,9 @@ enum pageflags {
> > > PG_reclaim, /* To be reclaimed asap */
> > > PG_buddy, /* Page is free, on buddy lists */
> > > PG_swapbacked, /* Page is backed by RAM/swap */
> > > +#ifdef CONFIG_NORECLAIM_LRU
> > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > +#endif
> >
> > I fear that we're messing up the terminology here.
> >
> > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > already means a few different things, but in the vmscan context,
> > "reclaimable" means that the page is unreferenced, clean and can be
> > stolen. "reclaimable" also means a lot of other things, and we just
> > made that worse.
> >
> > Can we think of a new term which uniquely describes this new concept
> > and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?
mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
from being reclaimed".
As a starting point: what, in your english-language-paragraph-length
words, does this flag mean?
> > > +/**
> > > + * add_page_to_noreclaim_list
> > > + * @page: the page to be added to the noreclaim list
> > > + *
> > > + * Add page directly to its zone's noreclaim list. To avoid races with
> > > + * tasks that might be making the page reclaimble while it's not on the
> > > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > > + * to other tasks. This is difficult to do when using the pagevec cache,
> > > + * so bypass that.
> > > + */
> >
> > How does a task "make a page reclaimable"? munlock()? fsync()?
> > exit()?
> >
> > Choice of terminology matters...
>
> Lee? Kosaki-san?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:57 ` Andrew Morton
@ 2008-06-08 21:32 ` Rik van Riel
2008-06-08 21:43 ` Ray Lee
2008-06-08 23:22 ` Andrew Morton
2008-06-08 22:03 ` Rik van Riel
1 sibling, 2 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-08 21:32 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> >
> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > [CONFIG_]NORECLAIM_LRU.
> > >
> > > Having a config option for this really sucks, and needs extra-special
> > > justification, rather than none.
> >
> > I believe the justification is that it uses a page flag.
> >
> > PG_noreclaim would be the 20th page flag used, meaning there are
> > 4 more free if 8 bits are used for zone and node info, which would
> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > for 32 bit x86.
> >
> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage if if we hae some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?
I don't see how that would work. We need a way to identify
the status of the page.
> > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > +#endif
> > >
> > > I fear that we're messing up the terminology here.
> > >
> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > already means a few different things, but in the vmscan context,
> > > "reclaimable" means that the page is unreferenced, clean and can be
> > > stolen. "reclaimable" also means a lot of other things, and we just
> > > made that worse.
> > >
> > > Can we think of a new term which uniquely describes this new concept
> > > and use that, rather than flogging the old horse?
> >
> > Want to reuse the BSD term "pinned" instead?
>
> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> from being reclaimed".
>
> As a starting point: what, in your english-language-paragraph-length
> words, does this flag mean?
"Cannot be reclaimed because someone has it locked in memory
through mlock, or the page belongs to something that cannot
be evicted like ramfs."
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 21:32 ` Rik van Riel
@ 2008-06-08 21:43 ` Ray Lee
2008-06-08 23:22 ` Andrew Morton
1 sibling, 0 replies; 49+ messages in thread
From: Ray Lee @ 2008-06-08 21:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Sun, Jun 8, 2008 at 2:32 PM, Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>> >
>> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
>> > > > [CONFIG_]NORECLAIM_LRU.
>> > >
>> > > Having a config option for this really sucks, and needs extra-special
>> > > justification, rather than none.
>> >
>> > I believe the justification is that it uses a page flag.
>> >
>> > PG_noreclaim would be the 20th page flag used, meaning there are
>> > 4 more free if 8 bits are used for zone and node info, which would
>> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
>> > for 32 bit x86.
>> >
>> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
>> > just compile in always.
>>
>> Seems unlikely to be useful? The only way in which this would be an
>> advantage if if we hae some other feature which also needs a page flag
>> but which will never be concurrently enabled with this one.
>>
>> > Please let me know what your preference is.
>>
>> Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.
>
>> > > > +#ifdef CONFIG_NORECLAIM_LRU
>> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > > > +#endif
>> > >
>> > > I fear that we're messing up the terminology here.
>> > >
>> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> > > already means a few different things, but in the vmscan context,
>> > > "reclaimable" means that the page is unreferenced, clean and can be
>> > > stolen. "reclaimable" also means a lot of other things, and we just
>> > > made that worse.
>> > >
>> > > Can we think of a new term which uniquely describes this new concept
>> > > and use that, rather than flogging the old horse?
>> >
>> > Want to reuse the BSD term "pinned" instead?
>>
>> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
>> from being reclaimed".
>>
>> As a starting point: what, in your english-language-paragraph-length
>> words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."
"Unevictable"
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 21:32 ` Rik van Riel
2008-06-08 21:43 ` Ray Lee
@ 2008-06-08 23:22 ` Andrew Morton
2008-06-08 23:34 ` Rik van Riel
1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-06-08 23:22 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 17:32:44 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > >
> > > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > > [CONFIG_]NORECLAIM_LRU.
> > > >
> > > > Having a config option for this really sucks, and needs extra-special
> > > > justification, rather than none.
> > >
> > > I believe the justification is that it uses a page flag.
> > >
> > > PG_noreclaim would be the 20th page flag used, meaning there are
> > > 4 more free if 8 bits are used for zone and node info, which would
> > > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > > for 32 bit x86.
This feature isn't available on 32-bit cpus is it?
> > > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > > just compile in always.
> >
> > Seems unlikely to be useful? The only way in which this would be an
> > advantage if if we hae some other feature which also needs a page flag
> > but which will never be concurrently enabled with this one.
^^this?
> > > Please let me know what your preference is.
> >
> > Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.
We'll run out one day. Then we will have little choice but to increase
the size of the pageframe.
This is a direct downside of adding more lru lists.
The this-is-64-bit-only problem really sucks, IMO. We still don't know
the reason for that decision. Presumably it was because we've already
run out of page flags? If so, the time for the larger pageframe is
upon us.
> > > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > > +#endif
> > > >
> > > > I fear that we're messing up the terminology here.
> > > >
> > > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > > already means a few different things, but in the vmscan context,
> > > > "reclaimable" means that the page is unreferenced, clean and can be
> > > > stolen. "reclaimable" also means a lot of other things, and we just
> > > > made that worse.
> > > >
> > > > Can we think of a new term which uniquely describes this new concept
> > > > and use that, rather than flogging the old horse?
> > >
> > > Want to reuse the BSD term "pinned" instead?
> >
> > mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> > from being reclaimed".
> >
> > As a starting point: what, in your english-language-paragraph-length
> > words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."
Ray's "unevictable" sounds good. It's not a term we've used elsewhere.
It's all a bit arbitrary, but it's just a label which maps onto a
concept and if we all honour that mapping carefully in our code and
writings, VM maintenance becomes that bit easier.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:22 ` Andrew Morton
@ 2008-06-08 23:34 ` Rik van Riel
2008-06-08 23:54 ` Andrew Morton
0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-08 23:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 16:22:08 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> The this-is-64-bit-only problem really sucks, IMO. We still don't know
> the reason for that decision. Presumably it was because we've already
> run out of page flags? If so, the time for the larger pageframe is
> upon us.
32 bit machines are unlikely to have so much memory that they run
into big scalability issues with mlocked memory.
The obvious exception to that are large PAE systems, which run
into other bottlenecks already and will probably hit the wall in
some other way before suffering greatly from the "kswapd is
scanning unevictable pages" problem.
I'll leave it up to you to decide whether you want this feature
64 bit only, or whether you want to use up the page flag on 32
bit systems too.
Please let me know which direction I should take, so I can fix
up the patch set accordingly.
> > > As a starting point: what, in your english-language-paragraph-length
> > > words, does this flag mean?
> >
> > "Cannot be reclaimed because someone has it locked in memory
> > through mlock, or the page belongs to something that cannot
> > be evicted like ramfs."
>
> Ray's "unevictable" sounds good. It's not a term we've used elsewhere.
>
> It's all a bit arbitrary, but it's just a label which maps onto a
> concept and if we all honour that mapping carefully in our code and
> writings, VM maintenance becomes that bit easier.
OK, I'll rename everything to unevictable and will add documentation
to clear up the meaning.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:34 ` Rik van Riel
@ 2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-08 23:54 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:22:08 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > The this-is-64-bit-only problem really sucks, IMO. We still don't know
> > the reason for that decision. Presumably it was because we've already
> > run out of page flags? If so, the time for the larger pageframe is
> > upon us.
>
> 32 bit machines are unlikely to have so much memory that they run
> into big scalability issues with mlocked memory.
>
> The obvious exception to that are large PAE systems, which run
> into other bottlenecks already and will probably hit the wall in
> some other way before suffering greatly from the "kswapd is
> scanning unevictable pages" problem.
>
> I'll leave it up to you to decide whether you want this feature
> 64 bit only, or whether you want to use up the page flag on 32
> bit systems too.
>
> Please let me know which direction I should take, so I can fix
> up the patch set accordingly.
I'm getting rather wobbly about all of this.
This is, afair, by far the most intrusive and high-risk change we've
looked at doing since 2.5.x, for small values of x.
I mean, it's taken many years of work to get reclaim into its current
state (and the reduction in reported problems will in part be due to
the quadrupling-odd of memory over that time). And we're now proposing
radical changes which again will take years to sort out, all on behalf
of a small number of workloads upon a minority of 64-bit machines which
themselves are a minority of the Linux base.
And it will take longer to get those problems sorted out if 32-bt
machines aren't even compiing the new code in.
Are all of thse changes really justified?
ho hum. Can you remind us what problems this patchset actually
addresses? Preferably in order of seriousness? (The [0/n] description
told us about the implementation but forgot to tell us anything about
what it was fixing). Because I guess we should have a think about
alternative approaches.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
@ 2008-06-09 0:56 ` Rik van Riel
2008-06-09 6:10 ` Andrew Morton
2008-06-09 2:58 ` Rik van Riel
2008-06-10 19:17 ` Christoph Lameter
2 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-09 0:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
> > Please let me know which direction I should take, so I can fix
> > up the patch set accordingly.
>
> I'm getting rather wobbly about all of this.
>
> This is, afair, by far the most intrusive and high-risk change we've
> looked at doing since 2.5.x, for small values of x.
Nowhere near as intrusive or risky as eg. the timer changes that went
in a few releases ago.
> I mean, it's taken many years of work to get reclaim into its current
> state (and the reduction in reported problems will in part be due to
> the quadrupling-odd of memory over that time).
Actually, memory is now getting so large that the current code no
longer works right. On machines 16GB and up, we have discovered
really pathetic behaviour by the VM currently upstream.
Things like the VM scanning over the (locked) shared memory segment
over and over and over again, to get at the 1GB of freeable pagecache
memory in the system. Or the system scanning over all anonymous
memory over and over again, despite the fact that there is no more
swap space left.
With heavy anonymous memory workloads, Linux can stall for minutes
once memory runs low and something needs to be swapped out, because
pretty much all memory is anonymous and everything has the referenced
bit set. We have seen systems with 128GB of RAM hang overnight, once
every CPU got wedged in the pageout scanning code. Typically the VM
decides on a first page to swap out in 2-3 minutes though, and then
it will start several gigabytes of swap IO at once...
Definately not acceptable behaviour.
> And we're now proposing radical changes which again will take years to sort
> out, all on behalf of a small number of workloads upon a minority of 64-bit
> machines which themselves are a minority of the Linux base.
Hardware gets larger. 4 years ago few people cared about systems
with more than 4GB of memory, but nowadays people have that in their
desktops.
> And it will take longer to get those problems sorted out if 32-bt
> machines aren't even compiing the new code in.
32 bit systems will still get the file/anon LRU split. The only
thing that is 64 bit only in the current patch set is keeping the
unevictable pages off of the LRU lists.
This means that balancing between file and anon eviction will be
the same on 32 and 64 bit systems and things should get sorted out
on both systems at the same time.
> Are all of thse changes really justified?
People with large Linux servers are experiencing system stalls
of several minutes, or at worst complete livelocks, with the
current VM.
I believe that those issues need to be fixed.
After discussing this for a long time with Larry Woodman,
Lee Schermerhorn and others, I am convinced that they can
not be fixed by putting a bandaid on the current code.
After all, the fundamental problem often is that the file backed
and mem/swap backed pages are on the same LRU.
Think of a case that is becoming more and more common: a database
server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
locked shared memory segment, 30GB of other anonymous memory and
5GB of page cache.
Do you think it is reasonable for the VM to have to scan over
110GB of essentially unevictable memory, just to get at the 5GB
of page cache?
> Because I guess we should have a think about alternative approaches.
We have. We failed to come up with anything that avoids the
problem without actually fixing the fundamental issues.
If you have an idea, please let us know.
Otherwise, please give us a chance to shake things out in -mm.
I will prepare kernel RPMs for Fedora so users in the community can
easily test these patches too, and help find scenarios where these
patches do not perform as well as what the current kernel has.
I have time to track down and fix any issues that people find.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 0:56 ` Rik van Riel
@ 2008-06-09 6:10 ` Andrew Morton
2008-06-09 13:44 ` Rik van Riel
0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-06-09 6:10 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 20:56:29 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
>
> > > Please let me know which direction I should take, so I can fix
> > > up the patch set accordingly.
> >
> > I'm getting rather wobbly about all of this.
> >
> > This is, afair, by far the most intrusive and high-risk change we've
> > looked at doing since 2.5.x, for small values of x.
>
> Nowhere near as intrusive or risky as eg. the timer changes that went
> in a few releases ago.
Well. Intrusiveness doesn't matter much. But no, you're dead wrong -
this stuff is far more risky than timer changes. Because things like
the timer changes are trivial to detect errors in - it either works or
it doesn't.
Whereas reclaim problems can take *years* to identify and are often
very hard for the programmers to understand, reproduce and diagnose.
> > I mean, it's taken many years of work to get reclaim into its current
> > state (and the reduction in reported problems will in part be due to
> > the quadrupling-odd of memory over that time).
>
> Actually, memory is now getting so large that the current code no
> longer works right. On machines 16GB and up, we have discovered
> really pathetic behaviour by the VM currently upstream.
>
> Things like the VM scanning over the (locked) shared memory segment
> over and over and over again, to get at the 1GB of freeable pagecache
> memory in the system.
Earlier discussion about removing these pages from ALL LRUs reached a
quite detailed stage, but nobody seemed to finish any code.
> Or the system scanning over all anonymous
> memory over and over again, despite the fact that there is no more
> swap space left.
We shouldn't rewrite core VM to cater for incorrectly configured
systems.
> With heavy anonymous memory workloads, Linux can stall for minutes
> once memory runs low and something needs to be swapped out, because
> pretty much all memory is anonymous and everything has the referenced
> bit set. We have seen systems with 128GB of RAM hang overnight, once
> every CPU got wedged in the pageout scanning code. Typically the VM
> decides on a first page to swap out in 2-3 minutes though, and then
> it will start several gigabytes of swap IO at once...
>
> Definately not acceptable behaviour.
I see handwavy non-bug-reports loosely associated with a vast pile of
code and vague expressions of hope that one will fix the other.
Where's the meat in this, Rik? This is engineering.
Do you or do you not have a test case which demonstrates this problem?
It doesn't sound terribly hard. Where are the before-and-after test
results?
> > And we're now proposing radical changes which again will take years to sort
> > out, all on behalf of a small number of workloads upon a minority of 64-bit
> > machines which themselves are a minority of the Linux base.
>
> Hardware gets larger. 4 years ago few people cared about systems
> with more than 4GB of memory, but nowadays people have that in their
> desktops.
>
> > And it will take longer to get those problems sorted out if 32-bt
> > machines aren't even compiing the new code in.
>
> 32 bit systems will still get the file/anon LRU split. The only
> thing that is 64 bit only in the current patch set is keeping the
> unevictable pages off of the LRU lists.
>
> This means that balancing between file and anon eviction will be
> the same on 32 and 64 bit systems and things should get sorted out
> on both systems at the same time.
>
> > Are all of thse changes really justified?
>
> People with large Linux servers are experiencing system stalls
> of several minutes, or at worst complete livelocks, with the
> current VM.
>
> I believe that those issues need to be fixed.
I'd love to see hard evidence that they have been. And that doesn't
mean getting palmed off on wikis and random blog pages.
Also, it is incumbent upon us to consider the other design proposals,
such as removing anon pages from the LRUs, removing mlocked pages from
the LRUs.
> After discussing this for a long time with Larry Woodman,
> Lee Schermerhorn and others, I am convinced that they can
> not be fixed by putting a bandaid on the current code.
>
> After all, the fundamental problem often is that the file backed
> and mem/swap backed pages are on the same LRU.
That actually isn't a fundamental problem.
It _becomes_ a problem because we try to treat the two types of pages
differently.
Stupid question: did anyone try setting swappiness=100? What happened?
> Think of a case that is becoming more and more common: a database
> server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> locked shared memory segment, 30GB of other anonymous memory and
> 5GB of page cache.
>
> Do you think it is reasonable for the VM to have to scan over
> 110GB of essentially unevictable memory, just to get at the 5GB
> of page cache?
Well for starters that system was grossly misconfigured. It is
incumbent upon you, in your design document (that thing we call a
changelog) to justify why the VM design needs to be altered to cater
for such misconfigured systems. It just drives me up the wall having
to engage in a 20-email discussion to be able to squeeze these little
revelations out. Only to have them lost again later.
Secondly, I expect that removal of mlocked pages from the LRU (as was
discussed a year or two ago and perhaps implemented by Andrea) along
with swappiness=100 might be get us towards a fix. Don't know.
> > Because I guess we should have a think about alternative approaches.
>
> We have. We failed to come up with anything that avoids the
> problem without actually fixing the fundamental issues.
Unless I missed it, none of your patch descriptions even attempt to
describe these fundamental issues. It's all buried in 20-deep email
threads.
> If you have an idea, please let us know.
I see no fundamental reason why we need to put mlocked or SHM_LOCKED
pages onto a VM LRU at all.
One cause of problms is that we attempt to prioritise anon pages over
file-backed pagecache. And we prioritise mmapped pages, which your patches
don't address, do they? Stopping doing that would, I expect, prevent a
range of these problems. It would introduce others, probably.
> Otherwise, please give us a chance to shake things out in -mm.
-mm isn't a very useful testing place any more, I'm afraid. The
patches would be better off in linux-next, but then they would screw up
all the other pending MM patches, and it's probably a bit early for
getting them into linux-next.
Once I get sections of -mm feeding into linux-next, things will be better.
> I will prepare kernel RPMs for Fedora so users in the community can
> easily test these patches too, and help find scenarios where these
> patches do not perform as well as what the current kernel has.
>
> I have time to track down and fix any issues that people find.
That helps.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 6:10 ` Andrew Morton
@ 2008-06-09 13:44 ` Rik van Riel
0 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-09 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 23:10:53 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> Also, it is incumbent upon us to consider the other design proposals,
> such as removing anon pages from the LRUs, removing mlocked pages from
> the LRUs.
That is certainly an option. We'll still need to keep track of
what kind of page the page is, though, otherwise we won't know
whether or not we can put it back onto the LRU lists at munlock
time.
> > After discussing this for a long time with Larry Woodman,
> > Lee Schermerhorn and others, I am convinced that they can
> > not be fixed by putting a bandaid on the current code.
> >
> > After all, the fundamental problem often is that the file backed
> > and mem/swap backed pages are on the same LRU.
>
> That actually isn't a fundamental problem.
>
> It _becomes_ a problem because we try to treat the two types of pages
> differently.
>
> Stupid question: did anyone try setting swappiness=100? What happened?
The database shared memory segment got swapped out and the
system crawled to a halt.
Swap IO usually is less efficient than page cache IO, because
page cache IO happens in larger chunks and does not involve
a swap-out first and a swap-in later - the data is just read,
which at least halves the disk IO compared to swap.
Readahead tilts the IO cost even more in favor of evicting
page cache pages, vs. swapping something out.
> > Think of a case that is becoming more and more common: a database
> > server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> > locked shared memory segment, 30GB of other anonymous memory and
> > 5GB of page cache.
> >
> > Do you think it is reasonable for the VM to have to scan over
> > 110GB of essentially unevictable memory, just to get at the 5GB
> > of page cache?
>
> Well for starters that system was grossly misconfigured.
Swapping out the database shared memory segment is not an option,
because it is mlocked. Even if it was an option, swapping it out
would be a bad idea because swap IO is simply less efficient than
page cache IO (see above).
> Secondly, I expect that removal of mlocked pages from the LRU (as was
> discussed a year or two ago and perhaps implemented by Andrea) along
> with swappiness=100 might be get us towards a fix. Don't know.
Removing mlocked pages from the LRU can be done, but I suspect
we'll still want to keep track of how many of these pages there
are, right?
> > > Because I guess we should have a think about alternative approaches.
> >
> > We have. We failed to come up with anything that avoids the
> > problem without actually fixing the fundamental issues.
>
> Unless I missed it, none of your patch descriptions even attempt to
> describe these fundamental issues. It's all buried in 20-deep email
> threads.
I'll add more problem descriptions to the next patch submission.
I'm halfway the patch series making all the cleanups and changes
you suggested.
> One cause of problms is that we attempt to prioritise anon pages over
> file-backed pagecache. And we prioritise mmapped pages, which your patches
> don't address, do they? Stopping doing that would, I expect, prevent a
> range of these problems. It would introduce others, probably.
Try running a database with swappiness=100 and then doing a
backup of the system simultaneously. The database will end
up being swapped out, which slows down the database, causes
extra IO and ends up slowing down the backup, too.
The backup does not benefit from having its data cached,
since it only reads everything once.
> > Otherwise, please give us a chance to shake things out in -mm.
>
> -mm isn't a very useful testing place any more, I'm afraid.
That's a problem. I can run tests on the VM patches, but you know
as well as I do that the code needs to be shaken out by lots of
users before we can be truly confident in it...
> > I will prepare kernel RPMs for Fedora so users in the community can
> > easily test these patches too, and help find scenarios where these
> > patches do not perform as well as what the current kernel has.
> >
> > I have time to track down and fix any issues that people find.
>
> That helps.
I sure hope so.
I'll send you a cleaned-up patch series soon. Hopefully tonight
or tomorrow.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
@ 2008-06-09 2:58 ` Rik van Riel
2008-06-09 5:44 ` Andrew Morton
2008-06-10 19:17 ` Christoph Lameter
2 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-09 2:58 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness?
Here are some other problems that my patch series can easily fix,
because file cache and anon/swap backed pages live on separate
LRUs:
http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/
http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/
I do not know for sure whether the patch set does fix it yet for
everyone, or whether it needs some more tuning first, but it is
fairly easily fixable by tweaking the relative pressure on both
sets of LRU lists.
No tricks of skipping over one type of pages while scanning, or
treating the referenced bits differently when the moon is in some
particular phase required - one set of lists for each type of
pages, and variable pressure between the two.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 2:58 ` Rik van Riel
@ 2008-06-09 5:44 ` Andrew Morton
0 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-09 5:44 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 22:58:00 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > ho hum. Can you remind us what problems this patchset actually
> > addresses? Preferably in order of seriousness?
>
> Here are some other problems that my patch series can easily fix,
> because file cache and anon/swap backed pages live on separate
> LRUs:
>
> http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/
>
> http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/
Sorry, but sending us off to look at random bug reports (from people
who didn't report a bug) is not how we discuss or changelog kernel
patches.
It is for good reasons that we like to see an accurate and detailed
analysis of the problems which are being addressed, and a description
of the means by which they were solved.
> I do not know for sure whether the patch set does fix it yet for
> everyone, or whether it needs some more tuning first, but it is
> fairly easily fixable by tweaking the relative pressure on both
> sets of LRU lists.
I expect it will help, yes. On 64-bit systems. It's unclear whether
mlock or SHM_LOCK is part of the issue here - if it is then 32-bit
systems will still be exposed to these things.
I also expect that it will introduce new problems, ones which can take a
very long time to diagnose and fix. Inevitable, but hopefully acceptable,
if the benefit is there.
> No tricks of skipping over one type of pages while scanning, or
> treating the referenced bits differently when the moon is in some
> particular phase required - one set of lists for each type of
> pages, and variable pressure between the two.
For the unevictable pages we have previously considered just taking
them off the LRU and leaving them off - reattach them at
SHM_UNLOCK-time and at munlock()-time (potentially subject to
reexamination of any other vmas which map each page).
I believe that Andrea had code which leaves the anon pages off the LRU
as well, but I forget the details.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
2008-06-09 2:58 ` Rik van Riel
@ 2008-06-10 19:17 ` Christoph Lameter
2008-06-10 19:37 ` Rik van Riel
2 siblings, 1 reply; 49+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Sun, 8 Jun 2008, Andrew Morton wrote:
> And it will take longer to get those problems sorted out if 32-bt
> machines aren't even compiing the new code in.
The problem is going to be less if we dependedn on
CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
I did the pageflags rework in part because of Rik's project.
> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness? (The [0/n] description
> told us about the implementation but forgot to tell us anything about
> what it was fixing). Because I guess we should have a think about
> alternative approaches.
It solves the livelock while reclaiming issues that we see more and more.
There are loads that have lots of unreclaimable pages. These are
frequently and uselessly scanned under memory pressure.
The larger the memory the more problems.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 19:17 ` Christoph Lameter
@ 2008-06-10 19:37 ` Rik van Riel
2008-06-10 21:33 ` Andrew Morton
0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-10 19:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sun, 8 Jun 2008, Andrew Morton wrote:
>
> > And it will take longer to get those problems sorted out if 32-bt
> > machines aren't even compiing the new code in.
>
> The problem is going to be less if we dependedn on
> CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
>
> I did the pageflags rework in part because of Rik's project.
I think your pageflags work freed up a number of bits on 32
bit systems, unless someone compiles a 32 bit system with
support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
nodes (6 bits NODE_SHIFT), in which case we should still
have 24 bits for flags.
Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
a 32 bit system is probably total insanity already. I
suspect very few people compile 32 bit with NUMA at all,
except if it is an architecture that uses DISCONTIGMEM
instead of zones, in which case ZONE_SHIFT is 0, which
will free up space too :)
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 19:37 ` Rik van Riel
@ 2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
` (3 more replies)
0 siblings, 4 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-10 21:33 UTC (permalink / raw)
To: Rik van Riel
Cc: clameter, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney, Paul Mundt, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, 10 Jun 2008 15:37:02 -0400
Rik van Riel <riel@redhat.com> wrote:
> On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Sun, 8 Jun 2008, Andrew Morton wrote:
> >
> > > And it will take longer to get those problems sorted out if 32-bt
> > > machines aren't even compiing the new code in.
> >
> > The problem is going to be less if we dependedn on
> > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> >
> > I did the pageflags rework in part because of Rik's project.
>
> I think your pageflags work freed up a number of bits on 32
> bit systems, unless someone compiles a 32 bit system with
> support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> nodes (6 bits NODE_SHIFT), in which case we should still
> have 24 bits for flags.
>
> Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> a 32 bit system is probably total insanity already. I
> suspect very few people compile 32 bit with NUMA at all,
> except if it is an architecture that uses DISCONTIGMEM
> instead of zones, in which case ZONE_SHIFT is 0, which
> will free up space too :)
Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
maximum node count is. The default for sh NODES_SHIFT is 3.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
@ 2008-06-10 21:48 ` Andi Kleen
2008-06-10 22:05 ` Dave Hansen
` (2 subsequent siblings)
3 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-10 21:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Ingo Molnar,
Andy Whitcroft
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
Actually much more (most 64bit NUMA systems can run 32bit too), it just
doesn't work well because the code is not very good, undertested, many
bugs, weird design and in general 32bit NUMA has a lot of limitations
that don't make it a good idea.
But you don't need to kill it only for this (although imho there are
lots of other good reasons) Just use a different way to look up the
node. Encoding it into the flags is just an optimization.
But a separate hash or similar would also work. It seemed like a good
idea back then.
In fact there's already a hash for this (the pa->node hash) that
can do it. It' just some more instructions and one cache line
more accessed, but since i386 NUMA is a fringe application
that doesn't seem like a big issue.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
@ 2008-06-10 22:05 ` Dave Hansen
2008-06-11 5:09 ` Paul Mundt
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
3 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2008-06-10 22:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar, Andy Whitcroft
On Tue, 2008-06-10 at 14:33 -0700, Andrew Morton wrote:
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
Yeah, IBM sold a couple of these "interesting" 32-bit NUMA machines:
https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/tips0267.html?Open
I think those maxed out at 8 nodes, ever. But, no distro ever turned
NUMA on for i386, so no one actually depends on it working. We do have
a bunch of systems that we use for testing and so forth. It'd be a
shame to make these suck *too* much. The NUMA-Q is probably also so
intertwined with CONFIG_NUMA that we'd likely never get it running
again.
I'd rather just bloat page->flags on these platforms or move the
sparsemem/zone/node bits elsewhere than kill NUMA support.
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
2008-06-10 22:05 ` Dave Hansen
@ 2008-06-11 5:09 ` Paul Mundt
2008-06-11 6:16 ` Andrew Morton
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
3 siblings, 1 reply; 49+ messages in thread
From: Paul Mundt @ 2008-06-11 5:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bt
> > > > machines aren't even compiing the new code in.
> > >
> > > The problem is going to be less if we dependedn on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems, unless someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), in which case we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.
In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
gradually more complex in the SMP cases where we are 3-4 levels deep in
various types of memories that we expose as nodes (ie, 4-8 CPUs with a
dozen different memories or so at various interconnect levels).
As far as testing goes, it's part of the regular build and regression
testing for a number of boards, which we verify on a daily basis
(although admittedly -mm gets far less testing, even though that's where
most of the churn in this area tends to be).
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 5:09 ` Paul Mundt
@ 2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-11 6:16 UTC (permalink / raw)
To: Paul Mundt
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <lethal@linux-sh.org> wrote:
> On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > On Tue, 10 Jun 2008 15:37:02 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> > > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > > >
> > > > > And it will take longer to get those problems sorted out if 32-bt
> > > > > machines aren't even compiing the new code in.
> > > >
> > > > The problem is going to be less if we dependedn on
> > > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > > >
> > > > I did the pageflags rework in part because of Rik's project.
> > >
> > > I think your pageflags work freed up a number of bits on 32
> > > bit systems, unless someone compiles a 32 bit system with
> > > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > > nodes (6 bits NODE_SHIFT), in which case we should still
> > > have 24 bits for flags.
> > >
> > > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > > a 32 bit system is probably total insanity already. I
> > > suspect very few people compile 32 bit with NUMA at all,
> > > except if it is an architecture that uses DISCONTIGMEM
> > > instead of zones, in which case ZONE_SHIFT is 0, which
> > > will free up space too :)
> >
> > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> >
> > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > maximum node count is. The default for sh NODES_SHIFT is 3.
>
> In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> gradually more complex in the SMP cases where we are 3-4 levels deep in
> various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> dozen different memories or so at various interconnect levels).
Thanks.
Andi has suggested that we can remove the node-ID encoding from
page.flags on x86 because that info is available elsewhere, although a
bit more slowly.
<looks at page_zone(), wonders whether we care about performance anyway>
There wouldn't be much point in doing that unless we did it for all
32-bit architectures. How much trouble would it cause sh?
> As far as testing goes, it's part of the regular build and regression
> testing for a number of boards, which we verify on a daily basis
> (although admittedly -mm gets far less testing, even though that's where
> most of the churn in this area tends to be).
Oh well, that's what -rc is for :(
It would be good if someone over there could start testing linux-next.
Once I get my act together that will include most-of-mm anyway.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 6:16 ` Andrew Morton
@ 2008-06-11 6:29 ` Paul Mundt
2008-06-11 12:06 ` Andi Kleen
2008-06-11 14:09 ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2 siblings, 0 replies; 49+ messages in thread
From: Paul Mundt @ 2008-06-11 6:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, Jun 10, 2008 at 11:16:42PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <lethal@linux-sh.org> wrote:
> > On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> > >
> > > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > > maximum node count is. The default for sh NODES_SHIFT is 3.
> >
> > In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> > gradually more complex in the SMP cases where we are 3-4 levels deep in
> > various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> > dozen different memories or so at various interconnect levels).
>
> Thanks.
>
> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>
>
> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?
>
At first glance I don't think that should be too bad. We only do NUMA
through sparsemem anyways, and we have pretty much no overlap in any of
the ranges, so simply setting NODE_NOT_IN_PAGE_FLAGS should be ok there.
Given the relatively small number of pages we have, the added cost of
page_to_nid() referencing section_to_node_table should still be
tolerable. I'll give it a go and see what the numbers look like.
> > As far as testing goes, it's part of the regular build and regression
> > testing for a number of boards, which we verify on a daily basis
> > (although admittedly -mm gets far less testing, even though that's where
> > most of the churn in this area tends to be).
>
> Oh well, that's what -rc is for :(
>
> It would be good if someone over there could start testing linux-next.
> Once I get my act together that will include most-of-mm anyway.
>
Agreed. This is something we're attempting to add in to our automated
testing at present.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
@ 2008-06-11 12:06 ` Andi Kleen
2008-06-11 14:09 ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-11 12:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mundt, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Ingo Molnar, Andy Whitcroft
> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>
It would be just pfn_to_nid(page_pfn(page)) for 32bit && CONFIG_NUMA.
-sh should have that too.
Only trouble is that it needs some reordering because right now page_pfn
is not defined early enough.
> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?
Probably very little from a quick look at the source.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II
2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
2008-06-11 12:06 ` Andi Kleen
@ 2008-06-11 14:09 ` Andi Kleen
2 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-11 14:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mundt, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Ingo Molnar, Andy Whitcroft
After some comptemplation I don't think we need to do anything for this.
Just add more page flags. The ifdef jungle in mm.h should handle it already.
#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#else
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#error "Vmemmap: No space for nodes field in page flags"
#endif
#define NODES_WIDTH 0
#endif
[btw the vmemmap case could be handled easily too by going through
the zone, but it's not used on 32bit]
and then
#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
#define NODE_NOT_IN_PAGE_FLAGS
#endif
and then
#ifdef NODE_NOT_IN_PAGE_FLAGS
extern int page_to_nid(struct page *page);
#else
static inline int page_to_nid(struct page *page)
{
return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif
and the sparse.c page_to_nid does a hash lookup.
So if NR_PAGEFLAGS is big enough it should work.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
` (2 preceding siblings ...)
2008-06-11 5:09 ` Paul Mundt
@ 2008-06-11 19:03 ` Andy Whitcroft
2008-06-11 20:52 ` Andi Kleen
2008-06-11 23:25 ` Christoph Lameter
3 siblings, 2 replies; 49+ messages in thread
From: Andy Whitcroft @ 2008-06-11 19:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar
On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bt
> > > > machines aren't even compiing the new code in.
> > >
> > > The problem is going to be less if we dependedn on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems, unless someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), in which case we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.
I think we can say that although NUMAQ can have up to 64 NUMA nodes, in
fact I don't think we have any more with more than 4 nodes left. From
the other discussion it sounds like we have a maximum if 8 nodes on
other sub-arches. So it would not be unreasonable to reduce the shift
to 3. Which might allow us to reduce the size of the reserve.
The problem will come with SPARSEMEM as that stores the section number
in the reserved field. Which can mean we need the whole reserve, and
there is currently no simple way to remove that.
I have been wondering whether we could make more use of the dynamic
nature of the page bits. As bits only need to exist when used, whether
we could consider letting the page flags grow to 64 bits if necessary.
However, at a quick count we are still only using about 19 bits, and if
memory serves we have 23/24 after the reserve on 32 bit.
-apw
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
@ 2008-06-11 20:52 ` Andi Kleen
2008-06-11 23:25 ` Christoph Lameter
1 sibling, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-11 20:52 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Andrew Morton, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Paul Mundt, Ingo Molnar
> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.
Why do you need that many sections on i386?
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
2008-06-11 20:52 ` Andi Kleen
@ 2008-06-11 23:25 ` Christoph Lameter
1 sibling, 0 replies; 49+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:25 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Andrew Morton, Rik van Riel, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar
On Wed, 11 Jun 2008, Andy Whitcroft wrote:
> I think we can say that although NUMAQ can have up to 64 NUMA nodes, in
> fact I don't think we have any more with more than 4 nodes left. From
> the other discussion it sounds like we have a maximum if 8 nodes on
> other sub-arches. So it would not be unreasonable to reduce the shift
> to 3. Which might allow us to reduce the size of the reserve.
>
> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.
But in that case we can use the section number to look up the node number.
That is done automatically if we have too many page flags.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:32 ` Rik van Riel
@ 2008-06-08 22:03 ` Rik van Riel
1 sibling, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-08 22:03 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage if if we hae some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?
To explain in more detail why we need the page flag:
When we move a page from the active or inactive list onto the
noreclaim list, we need to know what list it was on, in order
to adjust the zone counts for that list (NR_ACTIVE_ANON, etc).
For the same reason, we need to be able to identify whether
a page is already on the noreclaim list, so we can adjust
the statistics for the noreclaim pages, too. We cannot afford
to accidentally move a page onto the noreclaim list twice, or
try to remove it from the noreclaim list twice.
We need to know how many pages of each type there are in
each zone, and we need a way to specify that a page has
just become noreclaim. If a page is sitting a pagevec
somewhere, and it has just become unreclaimable, we want
that page to end up on the noreclaim list once that
pagevec is flushed.
As far as I can see, this requires a page flag.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:34 ` Rik van Riel
2008-06-08 20:57 ` Andrew Morton
@ 2008-06-08 21:07 ` KOSAKI Motohiro
1 sibling, 0 replies; 49+ messages in thread
From: KOSAKI Motohiro @ 2008-06-08 21:07 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, linux-mm, eric.whitney
>> > +#ifdef CONFIG_NORECLAIM_LRU
>> > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > +#endif
>>
>> I fear that we're messing up the terminology here.
>>
>> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> already means a few different things, but in the vmscan context,
>> "reclaimable" means that the page is unreferenced, clean and can be
>> stolen. "reclaimable" also means a lot of other things, and we just
>> made that worse.
>>
>> Can we think of a new term which uniquely describes this new concept
>> and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?
I like this term :)
but I afraid to somebody confuse Xen/KVM term's pinned page.
IOW, I guess somebody imazine from "pinned page" to below flag.
#define PG_pinned PG_owner_priv_1 /* Xen pinned pagetable */
I have no idea....
>> > +/**
>> > + * add_page_to_noreclaim_list
>> > + * @page: the page to be added to the noreclaim list
>> > + *
>> > + * Add page directly to its zone's noreclaim list. To avoid races with
>> > + * tasks that might be making the page reclaimble while it's not on the
>> > + * lru, we want to add the page while it's locked or otherwise "invisible"
>> > + * to other tasks. This is difficult to do when using the pagevec cache,
>> > + * so bypass that.
>> > + */
>>
>> How does a task "make a page reclaimable"? munlock()? fsync()?
>> exit()?
>>
>> Choice of terminology matters...
>
> Lee? Kosaki-san?
IFAIK, moving noreclaim list to reclaim list happend at below situation.
mlock'ed page
- all mlocked process exit.
- all mlocked process call munlock().
- page related vma vanished
(e.g. mumap, mmap, remap_file_page)
SHM_LOCKed page
- sysctl(SHM_UNLOCK) called.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-07 1:05 ` Andrew Morton
2008-06-08 20:34 ` Rik van Riel
@ 2008-06-10 20:09 ` Rik van Riel
1 sibling, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-10 20:09 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > +config NORECLAIM_LRU
> > + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> > + depends on EXPERIMENTAL && 64BIT
> > + help
> > + Supports tracking of non-reclaimable pages off the [in]active lists
> > + to avoid excessive reclaim overhead on large memory systems. Pages
> > + may be non-reclaimable because: they are locked into memory, they
> > + are anonymous pages for which no swap space exists, or they are anon
> > + pages that are expensive to unmap [long anon_vma "related vma" list.]
>
> Aunt Tillie might be struggling with some of that.
I have now Aunt Tillified the description:
+++ linux-2.6.26-rc5-mm2/mm/Kconfig 2008-06-10 14:56:19.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config UNEVICTABLE_LRU
+ bool "Add LRU list to track non-evictable pages"
+ default y
+ help
+ Keeps unevictable pages off of the active and inactive pageout
+ lists, so kswapd will not waste CPU time or have its balancing
+ algorithms thrown off by scanning these pages. Selecting this
+ will use one page flag and increase the code size a little,
+ say Y unless you know what you are doing.
> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?
I have also switched to "unevictable".
> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimble while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...
I have added a linuxdoc function description here and
amended the comment to specify the ways in which a task
can make a page evictable.
> > + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
>
> If this ever triggers, you'll wish that it had been coded with two
> separate assertions.
Good catch. I separated these.
> > +/**
> > + * putback_lru_page
> > + * @page to be put back to appropriate lru list
> The kerneldoc function description is missing.
Added this one, as well as a few others that were missing.
> > + } else if (page_reclaimable(page, NULL)) {
> > + /*
> > + * For reclaimable pages, we can use the cache.
> > + * In event of a race, worst case is we end up with a
> > + * non-reclaimable page on [in]active list.
> > + * We know how to handle that.
> > + */
> > + lru += page_file_cache(page);
> > + lru_cache_add_lru(page, lru);
> > + mem_cgroup_move_lists(page, lru);
> <stares for a while>
>
> <penny drops>
>
> So THAT'S what the magical "return 2" is doing in page_file_cache()!
>
> <looks>
>
> OK, after all the patches are applied, the "2" becomes LRU_FILE and the
> enumeration of `enum lru_list' reflects that.
In most places I have turned this into a call to page_lru(page).
> > +static inline void cull_nonreclaimable_page(struct page *page)
> Did you check whether all these inlined functions really should have
> been inlined? Even ones like this are probably too large.
Turned this into just a "static void" and renamed it
to cull_unevictable_page.
> > + /*
> > + * Non-reclaimable pages shouldn't make it onto either the active
> > + * nor the inactive list. However, when doing lumpy reclaim of
> > + * higher order pages we can still run into them.
>
> I guess that something along the lines of "when this function is being
> called for lumpy reclaim we can still .." would be clearer.
+ /*
+ * When this function is being called for lumpy reclaim, we
+ * initially look into all LRU pages, active, inactive and
+ * unreclaimable; only give shrink_page_list evictable pages.
+ */
+ if (PageUnevictable(page))
+ return ret;
... on to the next patch!
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
[not found] <20080606202838.390050172@redhat.com>
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
` (4 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-15-lts-noreclaim-mlocked-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 4876 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists. When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list. Round and round she goes...
Define new address_space flag [shares address_space flags member
with mapping's gfp mask] to indicate that the address space contains
all non-reclaimable pages. This will provide for efficient testing
of ramdisk pages in page_reclaimable().
Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.
Set the noreclaim state on address_space structures for new
ramdisk inodes. Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.
Similarly, ramfs pages are non-reclaimable. Set the 'noreclaim'
address_space flag for new ramfs inodes.
These changes depend on [CONFIG_]NORECLAIM_LRU.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
drivers/block/brd.c | 13 +++++++++++++
fs/ramfs/inode.c | 1 +
include/linux/pagemap.h | 22 ++++++++++++++++++++++
mm/vmscan.c | 5 +++++
4 files changed, 41 insertions(+)
Index: linux-2.6.26-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagemap.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagemap.h 2008-06-06 16:06:20.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
}
}
+#ifdef CONFIG_NORECLAIM_LRU
+#define AS_NORECLAIM (__GFP_BITS_SHIFT + 2) /* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+ set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ if (mapping && (mapping->flags & AS_NORECLAIM))
+ return 1;
+ return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ return 0;
+}
+#endif
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:05:50.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:20.000000000 -0400
@@ -2311,6 +2311,8 @@ int zone_reclaim(struct zone *zone, gfp_
* lists vs noreclaim list.
*
* Reasons page might not be reclaimable:
+ * (1) page's mapping marked non-reclaimable
+ *
* TODO - later patches
*/
int page_reclaimable(struct page *page, struct vm_area_struct *vma)
@@ -2318,6 +2320,9 @@ int page_reclaimable(struct page *page,
VM_BUG_ON(PageNoreclaim(page));
+ if (mapping_non_reclaimable(page_mapping(page)))
+ return 0;
+
/* TODO: test page [!]reclaimable conditions */
return 1;
Index: linux-2.6.26-rc2-mm1/fs/ramfs/inode.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/ramfs/inode.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/ramfs/inode.c 2008-06-06 16:06:20.000000000 -0400
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_noreclaim(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
Index: linux-2.6.26-rc2-mm1/drivers/block/brd.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/block/brd.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/block/brd.c 2008-06-06 16:06:20.000000000 -0400
@@ -374,8 +374,21 @@ static int brd_ioctl(struct inode *inode
return error;
}
+/*
+ * brd_open():
+ * Just mark the mapping as containing non-reclaimable pages
+ */
+static int brd_open(struct inode *inode, struct file *filp)
+{
+ struct address_space *mapping = inode->i_mapping;
+
+ mapping_set_noreclaim(mapping);
+ return 0;
+}
+
static struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
+ .open = brd_open,
.ioctl = brd_ioctl,
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
2008-06-08 4:32 ` Greg KH
0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney
On Fri, 06 Jun 2008 16:28:53 -0400
Rik van Riel <riel@redhat.com> wrote:
>
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> Christoph Lameter pointed out that ram disk pages also clutter the
> LRU lists. When vmscan finds them dirty and tries to clean them,
> the ram disk writeback function just redirties the page so that it
> goes back onto the active list. Round and round she goes...
>
> Define new address_space flag [shares address_space flags member
> with mapping's gfp mask] to indicate that the address space contains
> all non-reclaimable pages. This will provide for efficient testing
> of ramdisk pages in page_reclaimable().
>
> Also provide wrapper functions to set/test the noreclaim state to
> minimize #ifdefs in ramdisk driver and any other users of this
> facility.
>
> Set the noreclaim state on address_space structures for new
> ramdisk inodes. Test the noreclaim state in page_reclaimable()
> to cull non-reclaimable pages.
>
> Similarly, ramfs pages are non-reclaimable. Set the 'noreclaim'
> address_space flag for new ramfs inodes.
>
> These changes depend on [CONFIG_]NORECLAIM_LRU.
hm
>
> @@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
> inode->i_mapping->a_ops = &ramfs_aops;
> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_noreclaim(inode->i_mapping);
> inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> switch (mode & S_IFMT) {
> default:
That's OK.
> Index: linux-2.6.26-rc2-mm1/drivers/block/brd.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/drivers/block/brd.c 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/drivers/block/brd.c 2008-06-06 16:06:20.000000000 -0400
> @@ -374,8 +374,21 @@ static int brd_ioctl(struct inode *inode
> return error;
> }
>
> +/*
> + * brd_open():
> + * Just mark the mapping as containing non-reclaimable pages
> + */
> +static int brd_open(struct inode *inode, struct file *filp)
> +{
> + struct address_space *mapping = inode->i_mapping;
> +
> + mapping_set_noreclaim(mapping);
> + return 0;
> +}
> +
> static struct block_device_operations brd_fops = {
> .owner = THIS_MODULE,
> + .open = brd_open,
> .ioctl = brd_ioctl,
> #ifdef CONFIG_BLK_DEV_XIP
> .direct_access = brd_direct_access,
But this only works for pagecache in /dev/ramN. afaict the pagecache
for files which are written onto that "blokk device" remain on the LRU.
But that's OK, isn't it? For the ramdisk driver these pages _do_ have
backing store and _can_ be written back and reclaimed, yes?
Still, I'm unsure about the whole implementation. We already maintain
this sort of information in the backing_dev. Would it not be better to
just avoid ever putting such pages onto the LRU in the first place?
Also, I expect there are a whole host of pseudo-filesystems (sysfs?)
which have this problem. Does the patch address all of them? If not,
can we come up with something which _does_ address them all without
having to hunt down and change every such fs?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
2008-06-07 1:05 ` Andrew Morton
@ 2008-06-08 4:32 ` Greg KH
0 siblings, 0 replies; 49+ messages in thread
From: Greg KH @ 2008-06-08 4:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Fri, Jun 06, 2008 at 06:05:10PM -0700, Andrew Morton wrote:
>
> Also, I expect there are a whole host of pseudo-filesystems (sysfs?)
> which have this problem. Does the patch address all of them? If not,
> can we come up with something which _does_ address them all without
> having to hunt down and change every such fs?
sysfs used to have this issue, until the people at IBM rewrote the whole
backing store for sysfs so that now it is reclaimable and pages out
quite nicely when there is memory pressure. That's how they run 20,000
disks on the s390 boxes with no memory :)
But it would be nice to solve the issue "generically" for ram based
filesystems, if possible (usbfs, securityfs, debugfs, etc.)
thanks,
greg k-h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
[not found] <20080606202838.390050172@redhat.com>
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:07 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
` (3 subsequent siblings)
6 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney, Nick Piggin
[-- Attachment #1: rvr-17-lts-noreclaim-handle-mlocked-pages-during-map-unmap.patch --]
[-- Type: text/plain, Size: 46324 bytes --]
Originally
From: Nick Piggin <npiggin@suse.de>
Against: 2.6.26-rc2-mm1
This patch:
1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
stub version of the mlock/noreclaim APIs when it's
not configured. Depends on [CONFIG_]NORECLAIM_LRU.
2) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
nonreclaimable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_noreclaim
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
3) add the mlock/noreclaim infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on noreclaim
LRU list.
4) update vmscan.c:page_reclaimable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull nonreclaimable pages in fault
path" patch is included.
5) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
6) Kosaki: added munlock page table walk to avoid using
get_user_pages() for unlock. get_user_pages() is unreliable
for some vma protections.
Lee: modified to wait for in-flight migration to complete
to close munlock/migration race that could strand pages.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
---
V8:
+ more refinement of rmap interaction, including attempt to
handle mlocked pages in non-linear mappings.
+ cleanup of lockdep reported errors.
+ enhancement of munlock page table walker to detect and
handle pages under migration [migration ptes].
V6:
+ Kosaki-san and Rik van Riel: added check for "page mapped
in vma" to try_to_unlock() processing in try_to_unmap_anon().
+ Kosaki-san added munlock page table walker to avoid use of
get_user_pages() for munlock. get_user_pages() proved to be
unreliable for some types of vmas.
+ added filtering of "special" vmas. Some [_IO||_PFN] we skip
altogether. Others, we just "make_pages_present" to simulate
old behavior--i.e., populate page tables. Clear/don't set
VM_LOCKED in non-mlockable vmas so that we don't try to unlock
at exit/unmap time.
+ rework PG_mlock page flag definitions for new page flags
macros.
+ Clear PageMlocked when COWing a page into a VM_LOCKED vma
so we don't leave an mlocked page in another non-mlocked
vma. If the other vma[s] had the page mlocked, we'll re-mlock
it if/when we try to reclaim it. This is less expensive than
walking the rmap in the COW/fault path.
+ in vmscan:shrink_page_list(), avoid adding anon page to
the swap cache if it's in a VM_LOCKED vma, even tho'
PG_mlocked might not be set. Call try_to_unlock() to
determine this. As a result, we'll never try to unmap
an mlocked anon page.
+ in support of the above change, updated try_to_unlock()
to use same logic as try_to_unmap() when it encounters a
VM_LOCKED vma--call mlock_vma_page() directly. Added
stub try_to_unlock() for vmscan when NORECLAIM_MLOCK
not configured.
V4 -> V5:
+ fixed problem with placement of #ifdef CONFIG_NORECLAIM_MLOCK
in prep_new_page() [Thanks, minchan Kim!].
V3 -> V4:
+ Added #ifdef CONFIG_NORECLAIM_MLOCK, #endif around use of
PG_mlocked in free_page_check(), et al. Not defined for
32-bit builds.
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix page flags macros for *PageMlocked() when not configured.
+ ensure lru_add_drain_all() runs on all cpus when NORECLAIM_MLOCK
configured. Was just for NUMA.
V1 -> V2:
+ moved this patch [and related patches] up to right after
ramdisk/ramfs and SHM_LOCKed patches.
+ add [back] missing put_page() in putback_lru_page().
This solved page leakage as seen by stats in previous
version.
+ fix up munlock_vma_page() to isolate page from lru
before calling try_to_unlock(). Think I detected a
race here.
+ use TestClearPageMlock() on old page in migrate.c's
migrate_page_copy() to clean up old page.
+ live dangerously: remove TestSetPageLocked() in
is_mlocked_vma()--should only be called on new pages in
the fault path--iff we chose to cull there [later patch].
+ Add PG_mlocked to free_pages_check() etc to detect mlock
state mismanagement.
NOTE: temporarily [???] commented out--tripping over it
under load. Why?
Rework of Nick Piggins's "mm: move mlocked pages off the LRU" patch
-- part 1 of 2.
include/linux/mm.h | 5
include/linux/page-flags.h | 16 +
include/linux/rmap.h | 14 +
mm/Kconfig | 14 +
mm/internal.h | 70 ++++++++
mm/memory.c | 19 ++
mm/migrate.c | 2
mm/mlock.c | 386 ++++++++++++++++++++++++++++++++++++++++++---
mm/mmap.c | 1
mm/page_alloc.c | 15 +
mm/rmap.c | 252 +++++++++++++++++++++++++----
mm/swap.c | 2
mm/vmscan.c | 40 +++-
13 files changed, 767 insertions(+), 69 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/Kconfig
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:06:28.000000000 -0400
@@ -215,3 +215,17 @@ config NORECLAIM_LRU
may be non-reclaimable because: they are locked into memory, they
are anonymous pages for which no swap space exists, or they are anon
pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_MLOCK
+ bool "Exclude mlock'ed pages from reclaim"
+ depends on NORECLAIM_LRU
+ help
+ Treats mlock'ed pages as no-reclaimable. Removing these pages from
+ the LRU [in]active lists avoids the overhead of attempting to reclaim
+ them. Pages marked non-reclaimable for this reason will become
+ reclaimable again when the last mlock is removed.
+ when no swap space exists. Removing these pages from the LRU lists
+ avoids the overhead of attempting to reclaim them. Pages marked
+ non-reclaimable for this reason will become reclaimable again when/if
+ sufficient swap space is added to the system.
+
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
@@ -56,6 +56,17 @@ static inline unsigned long page_order(s
return page_private(page);
}
+/*
+ * mlock all pages in this vma range. For mmap()/mremap()/...
+ */
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+
+/*
+ * munlock all pages in vma. For munmap() and exit().
+ */
+extern void munlock_vma_pages_all(struct vm_area_struct *vma);
+
#ifdef CONFIG_NORECLAIM_LRU
/*
* noreclaim_migrate_page() called only from migrate_page_copy() to
@@ -74,6 +85,65 @@ static inline void noreclaim_migrate_pag
}
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Called only in fault path via page_reclaimable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ */
+static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+ VM_BUG_ON(PageLRU(page));
+
+ if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+ return 0;
+
+ SetPageMlocked(page);
+ return 1;
+}
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+/*
+ * Clear the page's PageMlocked(). This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+extern void __clear_page_mlock(struct page *page);
+static inline void clear_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page)))
+ __clear_page_mlock(page);
+}
+
+/*
+ * mlock_migrate_page - called only from migrate_page_copy() to
+ * migrate the Mlocked page flag
+ */
+static inline void mlock_migrate_page(struct page *newpage, struct page *page)
+{
+ if (TestClearPageMlocked(page))
+ SetPageMlocked(newpage);
+}
+
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+ return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+
+#endif /* CONFIG_NORECLAIM_MLOCK */
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-05-15 11:20:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:28.000000000 -0400
@@ -8,10 +8,18 @@
#include <linux/capability.h>
#include <linux/mman.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagemap.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+#include <linux/hugetlb.h>
+
+#include "internal.h"
int can_do_mlock(void)
{
@@ -23,17 +31,354 @@ int can_do_mlock(void)
}
EXPORT_SYMBOL(can_do_mlock);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path; and to support semi-accurate
+ * statistics.
+ *
+ * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will
+ * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
+ * The noreclaim list is an LRU sibling list to the [in]active lists.
+ * PageNoreclaim is set to indicate the non-reclaimable state.
+ *
+ * When lazy mlocking via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have mlocked a page that is being munlocked. So lazy mlock must take
+ * the mmap_sem for read, and verify that the vma really is locked
+ * (see mm/rmap.c).
+ */
+
+/*
+ * LRU accounting for clear_page_mlock()
+ */
+void __clear_page_mlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page)); /* for LRU islolate/putback */
+
+ if (!isolate_lru_page(page)) {
+ putback_lru_page(page);
+ } else {
+ /*
+ * Try hard not to leak this page ...
+ */
+ lru_add_drain_all();
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to noreclaim list.
+ */
+void mlock_vma_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path with page supposedly on the LRU.
+ *
+ * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
+ * [in try_to_unlock()] and then attempt to isolate the page. We must
+ * isolate the page() to keep others from messing with its noreclaim
+ * and mlocked state while trying to unlock. However, we pre-clear the
+ * mlocked state anyway as we might lose the isolation race and we might
+ * not get another chance to clear PageMlocked. If we successfully
+ * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
+ * mapping the page, it will restore the PageMlocked state, unless the page
+ * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
+ * perhaps redundantly.
+ * If we lose the isolation race, and the page is mapped by other VM_LOCKED
+ * vmas, we'll detect this in vmscan--via try_to_unlock() or try_to_unmap()
+ * either of which will restore the PageMlocked state by calling
+ * mlock_vma_page() above, if it can grab the vma's mmap sem.
+ */
+static void munlock_vma_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
+ try_to_unlock(page);
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * mlock a range of pages in the vma.
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start;
+ struct page *pages[16]; /* 16 gives a reasonable batch */
+ int write = !!(vma->vm_flags & VM_WRITE);
+ int nr_pages = (end - start) / PAGE_SIZE;
+ int ret;
+
+ VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+ VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
+ VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+
+ lru_add_drain_all(); /* push cached pages to LRU */
+
+ while (nr_pages > 0) {
+ int i;
+
+ cond_resched();
+
+ /*
+ * get_user_pages makes pages present if we are
+ * setting mlock.
+ */
+ ret = get_user_pages(current, mm, addr,
+ min_t(int, nr_pages, ARRAY_SIZE(pages)),
+ write, 0, pages, NULL);
+ /*
+ * This can happen for, e.g., VM_NONLINEAR regions before
+ * a page has been allocated and mapped at a given offset,
+ * or for addresses that map beyond end of a file.
+ * We'll mlock the the pages if/when they get faulted in.
+ */
+ if (ret < 0)
+ break;
+ if (ret == 0) {
+ /*
+ * We know the vma is there, so the only time
+ * we cannot get a single page should be an
+ * error (ret < 0) case.
+ */
+ WARN_ON(1);
+ break;
+ }
+
+ lru_add_drain(); /* push cached pages to LRU */
+
+ for (i = 0; i < ret; i++) {
+ struct page *page = pages[i];
+
+ /*
+ * page might be truncated or migrated out from under
+ * us. Check after acquiring page lock.
+ */
+ lock_page(page);
+ if (page->mapping)
+ mlock_vma_page(page);
+ unlock_page(page);
+ put_page(page); /* ref from get_user_pages() */
+
+ /*
+ * here we assume that get_user_pages() has given us
+ * a list of virtually contiguous pages.
+ */
+ addr += PAGE_SIZE; /* for next get_user_pages() */
+ nr_pages--;
+ }
+ }
+
+ lru_add_drain_all(); /* to update stats */
+
+ return 0; /* count entire vma as locked_vm */
+}
+
+/*
+ * private structure for munlock page table walk
+ */
+struct munlock_page_walk {
+ struct vm_area_struct *vma;
+ pmd_t *pmd; /* for migration_entry_wait() */
+};
+
+/*
+ * munlock normal pages for present ptes
+ */
+static int __munlock_pte_handler(pte_t *ptep, unsigned long addr,
+ unsigned long end, void *private)
+{
+ struct munlock_page_walk *mpw = private;
+ swp_entry_t entry;
+ struct page *page;
+ pte_t pte;
+
+retry:
+ pte = *ptep;
+ /*
+ * If it's a swap pte, we might be racing with page migration.
+ */
+ if (unlikely(!pte_present(pte))) {
+ if (!is_swap_pte(pte))
+ goto out;
+ entry = pte_to_swp_entry(pte);
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mpw->vma->vm_mm, mpw->pmd, addr);
+ goto retry;
+ }
+ goto out;
+ }
+
+ page = vm_normal_page(mpw->vma, addr, pte);
+ if (!page)
+ goto out;
+
+ lock_page(page);
+ if (!page->mapping) {
+ unlock_page(page);
+ goto retry;
+ }
+ munlock_vma_page(page);
+ unlock_page(page);
+
+out:
+ return 0;
+}
+
+/*
+ * Save pmd for pte handler for waiting on migration entries
+ */
+static int __munlock_pmd_handler(pmd_t *pmd, unsigned long addr,
+ unsigned long end, void *private)
+{
+ struct munlock_page_walk *mpw = private;
+
+ mpw->pmd = pmd;
+ return 0;
+}
+
+static struct mm_walk munlock_page_walk = {
+ .pmd_entry = __munlock_pmd_handler,
+ .pte_entry = __munlock_pte_handler,
+};
+
+/*
+ * munlock a range of pages in the vma using standard page table walk.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct munlock_page_walk mpw;
+
+ VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+ VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+ VM_BUG_ON(start < vma->vm_start);
+ VM_BUG_ON(end > vma->vm_end);
+
+ lru_add_drain_all(); /* push cached pages to LRU */
+ mpw.vma = vma;
+ (void)walk_page_range(mm, start, end, &munlock_page_walk, &mpw);
+ lru_add_drain_all(); /* to update stats */
+
+}
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * Just make pages present if VM_LOCKED. No-op if unlocking.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ if (vma->vm_flags & VM_LOCKED)
+ make_pages_present(start, end);
+ return 0;
+}
+
+/*
+ * munlock a range of pages in the vma -- no-op.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * mlock all pages in this vma range. For mmap()/mremap()/...
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ int nr_pages = (end - start) / PAGE_SIZE;
+ BUG_ON(!(vma->vm_flags & VM_LOCKED));
+
+ /*
+ * filter unlockable vmas
+ */
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ goto no_mlock;
+
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current))
+ goto make_present;
+
+ return __mlock_vma_pages_range(vma, start, end);
+
+make_present:
+ /*
+ * User mapped kernel pages or huge pages:
+ * make these pages present to populate the ptes, but
+ * fall thru' to reset VM_LOCKED--no need to unlock, and
+ * return nr_pages so these don't get counted against task's
+ * locked limit. huge pages are already counted against
+ * locked vm limit.
+ */
+ make_pages_present(start, end);
+
+no_mlock:
+ vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
+ return nr_pages; /* pages NOT mlocked */
+}
+
+
+/*
+ * munlock all pages in vma. For munmap() and exit().
+ */
+void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+ vma->vm_flags &= ~VM_LOCKED;
+ __munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
+
+/*
+ * mlock_fixup - handle mlock[all]/munlock[all] requests.
+ *
+ * Filters out "special" vmas -- VM_LOCKED never gets set for these, and
+ * munlock is a no-op. However, for some special vmas, we go ahead and
+ * populate the ptes via make_pages_present().
+ *
+ * For vmas that pass the filters, merge/split as appropriate.
+ */
static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, unsigned int newflags)
{
- struct mm_struct * mm = vma->vm_mm;
+ struct mm_struct *mm = vma->vm_mm;
pgoff_t pgoff;
- int pages;
+ int nr_pages;
int ret = 0;
+ int lock = newflags & VM_LOCKED;
- if (newflags == vma->vm_flags) {
- *prev = vma;
- goto out;
+ if (newflags == vma->vm_flags ||
+ (vma->vm_flags & (VM_IO | VM_PFNMAP)))
+ goto out; /* don't set VM_LOCKED, don't count */
+
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current)) {
+ if (lock)
+ make_pages_present(start, end);
+ goto out; /* don't set VM_LOCKED, don't count */
}
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -44,8 +389,6 @@ static int mlock_fixup(struct vm_area_st
goto success;
}
- *prev = vma;
-
if (start != vma->vm_start) {
ret = split_vma(mm, vma, start, 1);
if (ret)
@@ -60,24 +403,31 @@ static int mlock_fixup(struct vm_area_st
success:
/*
+ * Keep track of amount of locked VM.
+ */
+ nr_pages = (end - start) >> PAGE_SHIFT;
+ if (!lock)
+ nr_pages = -nr_pages;
+ mm->locked_vm += nr_pages;
+
+ /*
* vm_flags is protected by the mmap_sem held in write mode.
* It's okay if try_to_unmap_one unmaps a page just after we
- * set VM_LOCKED, make_pages_present below will bring it back.
+ * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
vma->vm_flags = newflags;
- /*
- * Keep track of amount of locked VM.
- */
- pages = (end - start) >> PAGE_SHIFT;
- if (newflags & VM_LOCKED) {
- pages = -pages;
- if (!(newflags & VM_IO))
- ret = make_pages_present(start, end);
- }
+ if (lock) {
+ ret = __mlock_vma_pages_range(vma, start, end);
+ if (ret > 0) {
+ mm->locked_vm -= ret;
+ ret = 0;
+ }
+ } else
+ __munlock_vma_pages_range(vma, start, end);
- mm->locked_vm -= pages;
out:
+ *prev = vma;
if (ret == -ENOMEM)
ret = -EAGAIN;
return ret;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:28.000000000 -0400
@@ -537,11 +537,8 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
- if (unlikely(!page_reclaimable(page, NULL))) {
- if (putback_lru_page(page))
- unlock_page(page);
- continue;
- }
+ if (unlikely(!page_reclaimable(page, NULL)))
+ goto cull_mlocked;
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
@@ -578,9 +575,19 @@ static unsigned long shrink_page_list(st
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page))
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ switch (try_to_unlock(page)) {
+ case SWAP_FAIL: /* shouldn't happen */
+ case SWAP_AGAIN:
+ goto keep_locked;
+ case SWAP_MLOCK:
+ goto cull_mlocked;
+ case SWAP_SUCCESS:
+ ; /* fall thru'; add to swap cache */
+ }
if (!add_to_swap(page, GFP_ATOMIC))
goto activate_locked;
+ }
#endif /* CONFIG_SWAP */
mapping = page_mapping(page);
@@ -595,6 +602,8 @@ static unsigned long shrink_page_list(st
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+ case SWAP_MLOCK:
+ goto cull_mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -667,6 +676,11 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;
+cull_mlocked:
+ if (putback_lru_page(page))
+ unlock_page(page);
+ continue;
+
activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
@@ -678,7 +692,7 @@ keep_locked:
unlock_page(page);
keep:
list_add(&page->lru, &ret_pages);
- VM_BUG_ON(PageLRU(page));
+ VM_BUG_ON(PageLRU(page) || PageNoreclaim(page));
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
@@ -2308,12 +2322,13 @@ int zone_reclaim(struct zone *zone, gfp_
* @vma: the VMA in which the page is or will be mapped, may be NULL
*
* Test whether page is reclaimable--i.e., should be placed on active/inactive
- * lists vs noreclaim list.
+ * lists vs noreclaim list. The vma argument is !NULL when called from the
+ * fault path to determine how to instantate a new page.
*
* Reasons page might not be reclaimable:
* (1) page's mapping marked non-reclaimable
+ * (2) page is part of an mlocked VMA
*
- * TODO - later patches
*/
int page_reclaimable(struct page *page, struct vm_area_struct *vma)
{
@@ -2323,13 +2338,16 @@ int page_reclaimable(struct page *page,
if (mapping_non_reclaimable(page_mapping(page)))
return 0;
- /* TODO: test page [!]reclaimable conditions */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+ return 0;
+#endif
return 1;
}
/**
- * check_move_noreclaim_page - check page for reclaimability and move to appropriate zone lru list
+ * check_move_noreclaim_page - check page for reclaimability and move to appropriate lru list
* @page: page to check reclaimability and move to appropriate lru list
* @zone: zone page is in
*
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:06:28.000000000 -0400
@@ -96,6 +96,9 @@ enum pageflags {
PG_swapbacked, /* Page is backed by RAM/swap */
#ifdef CONFIG_NORECLAIM_LRU
PG_noreclaim, /* Page is "non-reclaimable" */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ PG_mlocked, /* Page is vma mlocked */
+#endif
#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
@@ -210,12 +213,25 @@ PAGEFLAG_FALSE(SwapCache)
#ifdef CONFIG_NORECLAIM_LRU
PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
TESTCLEARFLAG(Noreclaim, noreclaim)
+
+#ifdef CONFIG_NORECLAIM_MLOCK
+#define MLOCK_PAGES 1
+PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
+ TESTSCFLAG(Mlocked, mlocked)
+#endif
+
#else
PAGEFLAG_FALSE(Noreclaim) TESTCLEARFLAG_FALSE(Noreclaim)
SETPAGEFLAG_NOOP(Noreclaim) CLEARPAGEFLAG_NOOP(Noreclaim)
__CLEARPAGEFLAG_NOOP(Noreclaim)
#endif
+#if !defined(CONFIG_NORECLAIM_MLOCK)
+#define MLOCK_PAGES 0
+PAGEFLAG_FALSE(Mlocked)
+ SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)
+#endif
+
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PAGEFLAG(Uncached, uncached)
#else
Index: linux-2.6.26-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/rmap.h 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/rmap.h 2008-06-06 16:06:28.000000000 -0400
@@ -97,6 +97,19 @@ unsigned long page_address_in_vma(struct
*/
int page_mkclean(struct page *);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_unlock(struct page *);
+#else
+static inline int try_to_unlock(struct page *page)
+{
+ return 0; /* a.k.a. SWAP_SUCCESS */
+}
+#endif
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
@@ -120,5 +133,6 @@ static inline int page_mkclean(struct pa
#define SWAP_SUCCESS 0
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
+#define SWAP_MLOCK 3
#endif /* _LINUX_RMAP_H */
Index: linux-2.6.26-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/rmap.c 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/rmap.c 2008-06-06 16:06:28.000000000 -0400
@@ -52,6 +52,8 @@
#include <asm/tlbflush.h>
+#include "internal.h"
+
static struct kmem_cache *anon_vma_cachep;
static inline struct anon_vma *anon_vma_alloc(void)
@@ -273,6 +275,32 @@ pte_t *page_check_address(struct page *p
return NULL;
}
+/**
+ * page_mapped_in_vma - check whether a page is really mapped in a VMA
+ * @page: the page to test
+ * @vma: the VMA to test
+ *
+ * Returns 1 if the page is mapped into the page tables of the VMA, 0
+ * if the page is not mapped into the page tables of this VMA. Only
+ * valid for normal file or anonymous VMAs.
+ */
+static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+{
+ unsigned long address;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ address = vma_address(page, vma);
+ if (address == -EFAULT) /* out of vma range */
+ return 0;
+ pte = page_check_address(page, vma->vm_mm, address, &ptl);
+ if (!pte) /* the page is not in this mm */
+ return 0;
+ pte_unmap_unlock(pte, ptl);
+
+ return 1;
+}
+
/*
* Subfunctions of page_referenced: page_referenced_one called
* repeatedly from either page_referenced_anon or page_referenced_file.
@@ -294,10 +322,17 @@ static int page_referenced_one(struct pa
if (!pte)
goto out;
+ /*
+ * Don't want to elevate referenced for mlocked page that gets this far,
+ * in order that it progresses to try_to_unmap and is moved to the
+ * noreclaim list.
+ */
if (vma->vm_flags & VM_LOCKED) {
- referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ goto out_unmap;
+ }
+
+ if (ptep_clear_flush_young(vma, address, pte))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -306,6 +341,7 @@ static int page_referenced_one(struct pa
rwsem_is_locked(&mm->mmap_sem))
referenced++;
+out_unmap:
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
out:
@@ -395,11 +431,6 @@ static int page_referenced_file(struct p
*/
if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
- if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
- == (VM_LOCKED|VM_MAYSHARE)) {
- referenced++;
- break;
- }
referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
break;
@@ -726,10 +757,15 @@ static int try_to_unmap_one(struct page
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
- ret = SWAP_FAIL;
- goto out_unmap;
+ if (!migration) {
+ if (vma->vm_flags & VM_LOCKED) {
+ ret = SWAP_MLOCK;
+ goto out_unmap;
+ }
+ if (ptep_clear_flush_young(vma, address, pte)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}
/* Nuke the page table entry. */
@@ -811,12 +847,17 @@ out:
* For very sparsely populated VMAs this is a little inefficient - chances are
* there there won't be many ptes located within the scan cluster. In this case
* maybe we could scan further - to the end of the pte page, perhaps.
+ *
+ * Mlocked pages: check VM_LOCKED under mmap_sem held for read, if we can
+ * acquire it without blocking. If vma locked, mlock the pages in the cluster,
+ * rather than unmapping them. If we encounter the "check_page" that vmscan is
+ * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN.
*/
#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE)
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))
-static void try_to_unmap_cluster(unsigned long cursor,
- unsigned int *mapcount, struct vm_area_struct *vma)
+static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
+ struct vm_area_struct *vma, struct page *check_page)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
@@ -828,6 +869,8 @@ static void try_to_unmap_cluster(unsigne
struct page *page;
unsigned long address;
unsigned long end;
+ int ret = SWAP_AGAIN;
+ int locked_vma = 0;
address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -838,15 +881,26 @@ static void try_to_unmap_cluster(unsigne
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- return;
+ return ret;
pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return;
+ return ret;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return;
+ return ret;
+
+ /*
+ * MLOCK_PAGES => feature is configured.
+ * if we can acquire the mmap_sem for read, and vma is VM_LOCKED,
+ * keep the sem while scanning the cluster for mlocking pages.
+ */
+ if (MLOCK_PAGES && down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ locked_vma = (vma->vm_flags & VM_LOCKED);
+ if (!locked_vma)
+ up_read(&vma->vm_mm->mmap_sem); /* don't need it */
+ }
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -859,6 +913,13 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
+ if (locked_vma) {
+ mlock_vma_page(page); /* no-op if already mlocked */
+ if (page == check_page)
+ ret = SWAP_MLOCK;
+ continue; /* don't unmap */
+ }
+
if (ptep_clear_flush_young(vma, address, pte))
continue;
@@ -880,39 +941,104 @@ static void try_to_unmap_cluster(unsigne
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
+ if (locked_vma)
+ up_read(&vma->vm_mm->mmap_sem);
+ return ret;
}
-static int try_to_unmap_anon(struct page *page, int migration)
+/*
+ * common handling for pages mapped in VM_LOCKED vmas
+ */
+static int try_to_mlock_page(struct page *page, struct vm_area_struct *vma)
+{
+ int mlocked = 0;
+
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ mlocked++; /* really mlocked the page */
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ return mlocked;
+}
+
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * anonymous pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * 'LOCKED.
+ */
+static int try_to_unmap_anon(struct page *page, int unlock, int migration)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
+ unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ if (MLOCK_PAGES && unlikely(unlock))
+ ret = SWAP_SUCCESS; /* default for try_to_unlock() */
+
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma, migration);
- if (ret == SWAP_FAIL || !page_mapped(page))
- break;
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!((vma->vm_flags & VM_LOCKED) &&
+ page_mapped_in_vma(page, vma)))
+ continue; /* must visit all unlocked vmas */
+ ret = SWAP_MLOCK; /* saw at least one mlocked vma */
+ } else {
+ ret = try_to_unmap_one(page, vma, migration);
+ if (ret == SWAP_FAIL || !page_mapped(page))
+ break;
+ }
+ if (ret == SWAP_MLOCK) {
+ mlocked = try_to_mlock_page(page, vma);
+ if (mlocked)
+ break; /* stop if actually mlocked page */
+ }
}
page_unlock_anon_vma(anon_vma);
+
+ if (mlocked)
+ ret = SWAP_MLOCK; /* actually mlocked the page */
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
+
return ret;
}
/**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
- * @migration: migration flag
+ * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
*
- * This function is only called from try_to_unmap for object-based pages.
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * object-based pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file(struct page *page, int unlock, int migration)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -923,20 +1049,44 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
+ unsigned int mlocked = 0;
+
+ if (MLOCK_PAGES && unlikely(unlock))
+ ret = SWAP_SUCCESS; /* default for try_to_unlock() */
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma, migration);
- if (ret == SWAP_FAIL || !page_mapped(page))
- goto out;
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ ret = SWAP_MLOCK;
+ } else {
+ ret = try_to_unmap_one(page, vma, migration);
+ if (ret == SWAP_FAIL || !page_mapped(page))
+ goto out;
+ }
+ if (ret == SWAP_MLOCK) {
+ mlocked = try_to_mlock_page(page, vma);
+ if (mlocked)
+ break; /* stop if actually mlocked page */
+ }
}
+ if (mlocked)
+ goto out;
+
if (list_empty(&mapping->i_mmap_nonlinear))
goto out;
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ ret = SWAP_MLOCK; /* leave mlocked == 0 */
+ goto out; /* no need to look further */
+ }
+ if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -946,7 +1096,7 @@ static int try_to_unmap_file(struct page
max_nl_size = cursor;
}
- if (max_nl_size == 0) { /* any nonlinears locked or reserved */
+ if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */
ret = SWAP_FAIL;
goto out;
}
@@ -970,12 +1120,16 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
+ if (!MLOCK_PAGES && !migration &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
- try_to_unmap_cluster(cursor, &mapcount, vma);
+ ret = try_to_unmap_cluster(cursor, &mapcount,
+ vma, page);
+ if (ret == SWAP_MLOCK)
+ mlocked = 2; /* to return below */
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
if ((int)mapcount <= 0)
@@ -996,6 +1150,10 @@ static int try_to_unmap_file(struct page
vma->vm_private_data = NULL;
out:
spin_unlock(&mapping->i_mmap_lock);
+ if (mlocked)
+ ret = SWAP_MLOCK; /* actually mlocked the page */
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
return ret;
}
@@ -1011,6 +1169,7 @@ out:
* SWAP_SUCCESS - we succeeded in removing all mappings
* SWAP_AGAIN - we missed a mapping, try again later
* SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, int migration)
{
@@ -1019,12 +1178,33 @@ int try_to_unmap(struct page *page, int
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, migration);
+ ret = try_to_unmap_anon(page, 0, migration);
else
- ret = try_to_unmap_file(page, migration);
-
- if (!page_mapped(page))
+ ret = try_to_unmap_file(page, 0, migration);
+ if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * try_to_unlock - Check page's rmap for other vma's holding page locked.
+ * @page: the page to be unlocked. will be returned with PG_mlocked
+ * cleared if no vmas are VM_LOCKED.
+ *
+ * Return values are:
+ *
+ * SWAP_SUCCESS - no vma's holding page locked.
+ * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
+ * SWAP_MLOCK - page is now mlocked.
+ */
+int try_to_unlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page) || PageLRU(page));
+
+ if (PageAnon(page))
+ return try_to_unmap_anon(page, 1, 0);
+ else
+ return try_to_unmap_file(page, 1, 0);
+}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-06-06 16:06:28.000000000 -0400
@@ -359,6 +359,8 @@ static void migrate_page_copy(struct pag
__set_page_dirty_nobuffers(newpage);
}
+ mlock_migrate_page(newpage, page);
+
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-06-06 16:05:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:06:28.000000000 -0400
@@ -258,6 +258,9 @@ static void bad_page(struct page *page)
1 << PG_active |
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
#endif
1 << PG_dirty |
1 << PG_reclaim |
@@ -497,6 +500,9 @@ static inline int free_pages_check(struc
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -650,6 +656,9 @@ static int prep_new_page(struct page *pa
1 << PG_active |
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
#endif
1 << PG_dirty |
1 << PG_slab |
@@ -669,7 +678,11 @@ static int prep_new_page(struct page *pa
page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+ 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk
+#ifdef CONFIG_NORECLAIM_MLOCK
+ | 1 << PG_mlocked
+#endif
+ );
set_page_private(page, 0);
set_page_refcounted(page);
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:06:28.000000000 -0400
@@ -307,7 +307,7 @@ void lru_add_drain(void)
put_cpu();
}
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_NORECLAIM_MLOCK)
static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
lru_add_drain();
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-06-06 16:06:28.000000000 -0400
@@ -61,6 +61,8 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include "internal.h"
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
@@ -1734,6 +1736,15 @@ gotten:
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
if (!new_page)
goto oom;
+ /*
+ * Don't let another task, with possibly unlocked vma,
+ * keep the mlocked page.
+ */
+ if (vma->vm_flags & VM_LOCKED) {
+ lock_page(old_page); /* for LRU manipulation */
+ clear_page_mlock(old_page);
+ unlock_page(old_page);
+ }
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);
@@ -2176,7 +2187,7 @@ static int do_swap_page(struct mm_struct
page_add_anon_rmap(page, vma, address);
swap_free(entry);
- if (vm_swap_full())
+ if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
remove_exclusive_swap_page(page);
unlock_page(page);
@@ -2316,6 +2327,12 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
+ /*
+ * Don't let another task, with possibly unlocked vma,
+ * keep the mlocked page.
+ */
+ if (vma->vm_flags & VM_LOCKED)
+ clear_page_mlock(vmf.page);
copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else {
Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c 2008-05-15 11:20:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c 2008-06-06 16:06:28.000000000 -0400
@@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
* If the vma has a ->close operation then the driver probably needs to release
* per-vma resources, so we don't attempt to merge those.
*/
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags)
Index: linux-2.6.26-rc2-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm.h 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm.h 2008-06-06 16:06:28.000000000 -0400
@@ -126,6 +126,11 @@ extern unsigned int kobjsize(const void
#define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)
/*
+ * special vmas that are non-mergable, non-mlock()able
+ */
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+
+/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
*/
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
@ 2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-07 1:07 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney, npiggin
On Fri, 06 Jun 2008 16:28:55 -0400
Rik van Riel <riel@redhat.com> wrote:
> Originally
> From: Nick Piggin <npiggin@suse.de>
>
> Against: 2.6.26-rc2-mm1
>
> This patch:
>
> 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> stub version of the mlock/noreclaim APIs when it's
> not configured. Depends on [CONFIG_]NORECLAIM_LRU.
Oh sob.
akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM | wc -l
51
why oh why? Must we really really do this to ourselves? Cheerfully
unchangeloggedly?
> 2) add yet another page flag--PG_mlocked--to indicate that
> the page is locked for efficient testing in vmscan and,
> optionally, fault path. This allows early culling of
> nonreclaimable pages, preventing them from getting to
> page_referenced()/try_to_unmap(). Also allows separate
> accounting of mlock'd pages, as Nick's original patch
> did.
>
> Note: Nick's original mlock patch used a PG_mlocked
> flag. I had removed this in favor of the PG_noreclaim
> flag + an mlock_count [new page struct member]. I
> restored the PG_mlocked flag to eliminate the new
> count field.
How many page flags are left? I keep on asking this and I end up
either a) not being told or b) forgetting. I thought that we had
a whopping big comment somewhere which describes how all these
flags are allocated but I can't immediately locate it.
> 3) add the mlock/noreclaim infrastructure to mm/mlock.c,
> with internal APIs in mm/internal.h. This is a rework
> of Nick's original patch to these files, taking into
> account that mlocked pages are now kept on noreclaim
> LRU list.
>
> 4) update vmscan.c:page_reclaimable() to check PageMlocked()
> and, if vma passed in, the vm_flags. Note that the vma
> will only be passed in for new pages in the fault path;
> and then only if the "cull nonreclaimable pages in fault
> path" patch is included.
>
> 5) add try_to_unlock() to rmap.c to walk a page's rmap and
> ClearPageMlocked() if no other vmas have it mlocked.
> Reuses as much of try_to_unmap() as possible. This
> effectively replaces the use of one of the lru list links
> as an mlock count. If this mechanism let's pages in mlocked
> vmas leak through w/o PG_mlocked set [I don't know that it
> does], we should catch them later in try_to_unmap(). One
> hopes this will be rare, as it will be relatively expensive.
>
> 6) Kosaki: added munlock page table walk to avoid using
> get_user_pages() for unlock. get_user_pages() is unreliable
> for some vma protections.
> Lee: modified to wait for in-flight migration to complete
> to close munlock/migration race that could strand pages.
None of which is available on 32-bit machines. That's pretty significant.
Do we do per-zone or global number-of-mlocked-pages accounting for
/proc/meminfo or /proc/vmstat, etc? Seems not..
> --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:06:28.000000000 -0400
> @@ -215,3 +215,17 @@ config NORECLAIM_LRU
> may be non-reclaimable because: they are locked into memory, they
> are anonymous pages for which no swap space exists, or they are anon
> pages that are expensive to unmap [long anon_vma "related vma" list.]
> +
> +config NORECLAIM_MLOCK
> + bool "Exclude mlock'ed pages from reclaim"
> + depends on NORECLAIM_LRU
> + help
> + Treats mlock'ed pages as no-reclaimable. Removing these pages from
> + the LRU [in]active lists avoids the overhead of attempting to reclaim
> + them. Pages marked non-reclaimable for this reason will become
> + reclaimable again when the last mlock is removed.
> + when no swap space exists. Removing these pages from the LRU lists
> + avoids the overhead of attempting to reclaim them. Pages marked
> + non-reclaimable for this reason will become reclaimable again when/if
> + sufficient swap space is added to the system.
The sentence "when no swap space exists." a) lacks capitalisation and
b) makes no sense.
The paramedics are caring for Aunt Tillie.
> Index: linux-2.6.26-rc2-mm1/mm/internal.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
> @@ -56,6 +56,17 @@ static inline unsigned long page_order(s
> return page_private(page);
> }
>
> +/*
> + * mlock all pages in this vma range. For mmap()/mremap()/...
> + */
> +extern int mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end);
> +
> +/*
> + * munlock all pages in vma. For munmap() and exit().
> + */
> +extern void munlock_vma_pages_all(struct vm_area_struct *vma);
I don't think it's desirable that interfaces be documented in two
places. The documentation which you have at the definition site is
more complete than this, and is at the place where people will expect
to find it.
> #ifdef CONFIG_NORECLAIM_LRU
> /*
> * noreclaim_migrate_page() called only from migrate_page_copy() to
> @@ -74,6 +85,65 @@ static inline void noreclaim_migrate_pag
> }
> #endif
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/*
> + * Called only in fault path via page_reclaimable() for a new page
> + * to determine if it's being mapped into a LOCKED vma.
> + * If so, mark page as mlocked.
> + */
> +static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
> +{
> + VM_BUG_ON(PageLRU(page));
> +
> + if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> + return 0;
> +
> + SetPageMlocked(page);
> + return 1;
> +}
bool? If you like that sort of thing. It makes sense here...
> +/*
> + * must be called with vma's mmap_sem held for read, and page locked.
> + */
> +extern void mlock_vma_page(struct page *page);
> +
> +/*
> + * Clear the page's PageMlocked(). This can be useful in a situation where
> + * we want to unconditionally remove a page from the pagecache -- e.g.,
> + * on truncation or freeing.
> + *
> + * It is legal to call this function for any page, mlocked or not.
> + * If called for a page that is still mapped by mlocked vmas, all we do
> + * is revert to lazy LRU behaviour -- semantics are not broken.
> + */
> +extern void __clear_page_mlock(struct page *page);
> +static inline void clear_page_mlock(struct page *page)
> +{
> + if (unlikely(TestClearPageMlocked(page)))
> + __clear_page_mlock(page);
> +}
> +
> +/*
> + * mlock_migrate_page - called only from migrate_page_copy() to
> + * migrate the Mlocked page flag
> + */
So maybe just nuke it and open-code those two lines in mm/migrate.c?
> +static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> +{
> + if (TestClearPageMlocked(page))
> + SetPageMlocked(newpage);
> +}
> +
> +
> +#else /* CONFIG_NORECLAIM_MLOCK */
> +static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> +{
> + return 0;
> +}
> +static inline void clear_page_mlock(struct page *page) { }
> +static inline void mlock_vma_page(struct page *page) { }
> +static inline void mlock_migrate_page(struct page *new, struct page *old) { }
It would be neater if the arguments to the two versions of
mlock_migrate_page() had the same names.
> +#endif /* CONFIG_NORECLAIM_MLOCK */
>
> /*
> * FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
> Index: linux-2.6.26-rc2-mm1/mm/mlock.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-05-15 11:20:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:28.000000000 -0400
> @@ -8,10 +8,18 @@
> #include <linux/capability.h>
> #include <linux/mman.h>
> #include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +#include <linux/pagemap.h>
> #include <linux/mempolicy.h>
> #include <linux/syscalls.h>
> #include <linux/sched.h>
> #include <linux/module.h>
> +#include <linux/rmap.h>
> +#include <linux/mmzone.h>
> +#include <linux/hugetlb.h>
> +
> +#include "internal.h"
>
> int can_do_mlock(void)
> {
> @@ -23,17 +31,354 @@ int can_do_mlock(void)
> }
> EXPORT_SYMBOL(can_do_mlock);
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/*
> + * Mlocked pages are marked with PageMlocked() flag for efficient testing
> + * in vmscan and, possibly, the fault path; and to support semi-accurate
> + * statistics.
> + *
> + * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will
> + * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
> + * The noreclaim list is an LRU sibling list to the [in]active lists.
> + * PageNoreclaim is set to indicate the non-reclaimable state.
> + *
> + * When lazy mlocking via vmscan, it is important to ensure that the
> + * vma's VM_LOCKED status is not concurrently being modified, otherwise we
> + * may have mlocked a page that is being munlocked. So lazy mlock must take
> + * the mmap_sem for read, and verify that the vma really is locked
> + * (see mm/rmap.c).
> + */
That's a useful comment.
Where would the reader (and indeed the reviewer) go to find out about
"lazy mlocking"? "grep -i 'lazy mlock' */*.c" doesn't work...
> +/*
> + * LRU accounting for clear_page_mlock()
> + */
> +void __clear_page_mlock(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page)); /* for LRU islolate/putback */
typo
> +
> + if (!isolate_lru_page(page)) {
> + putback_lru_page(page);
> + } else {
> + /*
> + * Try hard not to leak this page ...
> + */
> + lru_add_drain_all();
> + if (!isolate_lru_page(page))
> + putback_lru_page(page);
> + }
> +}
When I review code I often come across stuff which I don't understand
(at least, which I don't understand sufficiently easily). So I'll ask
questions, and I do think the best way in which those questions should
be answered is by adding a code comment to fix the problem for ever.
When I look at the isolate_lru_page()-failed cases above I wonder what
just happened. We now have a page which is still on the LRU (how did
it get there in the first place?). Well no. I _think_ what happened is
that this function is using isolate_lru_page() and putback_lru_page()
to move a page off a now-inappropriate LRU list and to put it back onto
the proper one. But heck, maybe I just don't know what this function
is doing at all?
If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
what circumstances?) then... what? We now have a page which is on an
inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
etc.
> +/*
> + * Mark page as mlocked if not already.
> + * If page on LRU, isolate and putback to move to noreclaim list.
> + */
> +void mlock_vma_page(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> +
> + if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
> + putback_lru_page(page);
> +}
extra tab.
> +/*
> + * called from munlock()/munmap() path with page supposedly on the LRU.
> + *
> + * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> + * [in try_to_unlock()] and then attempt to isolate the page. We must
> + * isolate the page() to keep others from messing with its noreclaim
page()?
> + * and mlocked state while trying to unlock. However, we pre-clear the
"unlock"? (See exhasperated comment against try_to_unlock(), below)
> + * mlocked state anyway as we might lose the isolation race and we might
> + * not get another chance to clear PageMlocked. If we successfully
> + * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
> + * mapping the page, it will restore the PageMlocked state, unless the page
> + * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> + * perhaps redundantly.
> + * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> + * vmas, we'll detect this in vmscan--via try_to_unlock() or try_to_unmap()
> + * either of which will restore the PageMlocked state by calling
> + * mlock_vma_page() above, if it can grab the vma's mmap sem.
> + */
OK, you officially lost me here. Two hours are up and I guess I need
to have another run at [patch 17/25]
I must say that having tried to absorb the above, my confidence in the
overall correctness of this code is not great. Hopefully wrong, but
gee.
> +static void munlock_vma_page(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> +
> + if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
> + try_to_unlock(page);
> + putback_lru_page(page);
> + }
> +}
> +
> +/*
> + * mlock a range of pages in the vma.
> + *
> + * This takes care of making the pages present too.
> + *
> + * vma->vm_mm->mmap_sem must be held for write.
> + */
> +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long addr = start;
> + struct page *pages[16]; /* 16 gives a reasonable batch */
> + int write = !!(vma->vm_flags & VM_WRITE);
> + int nr_pages = (end - start) / PAGE_SIZE;
> + int ret;
> +
> + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> + VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
> + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> +
> + lru_add_drain_all(); /* push cached pages to LRU */
> +
> + while (nr_pages > 0) {
> + int i;
> +
> + cond_resched();
> +
> + /*
> + * get_user_pages makes pages present if we are
> + * setting mlock.
> + */
> + ret = get_user_pages(current, mm, addr,
> + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> + write, 0, pages, NULL);
Doesn't mlock already do a make_pages_present(), or did that get
removed and moved to here?
> + /*
> + * This can happen for, e.g., VM_NONLINEAR regions before
> + * a page has been allocated and mapped at a given offset,
> + * or for addresses that map beyond end of a file.
> + * We'll mlock the the pages if/when they get faulted in.
> + */
> + if (ret < 0)
> + break;
> + if (ret == 0) {
> + /*
> + * We know the vma is there, so the only time
> + * we cannot get a single page should be an
> + * error (ret < 0) case.
> + */
> + WARN_ON(1);
> + break;
> + }
> +
> + lru_add_drain(); /* push cached pages to LRU */
> +
> + for (i = 0; i < ret; i++) {
> + struct page *page = pages[i];
> +
> + /*
> + * page might be truncated or migrated out from under
> + * us. Check after acquiring page lock.
> + */
> + lock_page(page);
> + if (page->mapping)
> + mlock_vma_page(page);
> + unlock_page(page);
> + put_page(page); /* ref from get_user_pages() */
> +
> + /*
> + * here we assume that get_user_pages() has given us
> + * a list of virtually contiguous pages.
> + */
Good assumption, that ;)
> + addr += PAGE_SIZE; /* for next get_user_pages() */
Could be moved outside the loop I guess.
> + nr_pages--;
Ditto.
> + }
> + }
> +
> + lru_add_drain_all(); /* to update stats */
> +
> + return 0; /* count entire vma as locked_vm */
> +}
>
> ...
>
> +/*
> + * munlock a range of pages in the vma using standard page table walk.
> + *
> + * vma->vm_mm->mmap_sem must be held for write.
> + */
> +static void __munlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct munlock_page_walk mpw;
> +
> + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> + VM_BUG_ON(start < vma->vm_start);
> + VM_BUG_ON(end > vma->vm_end);
> +
> + lru_add_drain_all(); /* push cached pages to LRU */
> + mpw.vma = vma;
> + (void)walk_page_range(mm, start, end, &munlock_page_walk, &mpw);
The (void) is un-kernely.
> + lru_add_drain_all(); /* to update stats */
> +
random newline.
> +}
> +
> +#else /* CONFIG_NORECLAIM_MLOCK */
>
> ...
>
> +int mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + int nr_pages = (end - start) / PAGE_SIZE;
> + BUG_ON(!(vma->vm_flags & VM_LOCKED));
> +
> + /*
> + * filter unlockable vmas
> + */
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> + goto no_mlock;
> +
> + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> + is_vm_hugetlb_page(vma) ||
> + vma == get_gate_vma(current))
> + goto make_present;
> +
> + return __mlock_vma_pages_range(vma, start, end);
Invert the `if' expression, remove the goto?
> +make_present:
> + /*
> + * User mapped kernel pages or huge pages:
> + * make these pages present to populate the ptes, but
> + * fall thru' to reset VM_LOCKED--no need to unlock, and
> + * return nr_pages so these don't get counted against task's
> + * locked limit. huge pages are already counted against
> + * locked vm limit.
> + */
> + make_pages_present(start, end);
> +
> +no_mlock:
> + vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
> + return nr_pages; /* pages NOT mlocked */
> +}
> +
> +
>
> ...
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/**
> + * try_to_unlock - Check page's rmap for other vma's holding page locked.
> + * @page: the page to be unlocked. will be returned with PG_mlocked
> + * cleared if no vmas are VM_LOCKED.
I think kerneldoc will barf over the newline in @page's description.
> + * Return values are:
> + *
> + * SWAP_SUCCESS - no vma's holding page locked.
> + * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
> + * SWAP_MLOCK - page is now mlocked.
> + */
> +int try_to_unlock(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page) || PageLRU(page));
> +
> + if (PageAnon(page))
> + return try_to_unmap_anon(page, 1, 0);
> + else
> + return try_to_unmap_file(page, 1, 0);
> +}
> +#endif
OK, this function is clear as mud. My first reaction was "what's wrong
with just doing unlock_page()?". The term "unlock" is waaaaaaaaaaay
overloaded in this context and its use here was an awful decision.
Can we please come up with a more specific name and add some comments
which give the reader some chance of working out what it is that is
actually being unlocked?
>
> ...
>
> @@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
> * If the vma has a ->close operation then the driver probably needs to release
> * per-vma resources, so we don't attempt to merge those.
> */
> -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
>
> static inline int is_mergeable_vma(struct vm_area_struct *vma,
> struct file *file, unsigned long vm_flags)
hm, so the old definition of VM_SPECIAL managed to wedge itself between
is_mergeable_vma() and is_mergeable_vma()'s comment. Had me confused
there.
pls remove the blank line between the comment and the start of
is_mergeable_vma() so people don't go sticking more things in there.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
@ 2008-06-07 5:38 ` KOSAKI Motohiro
2008-06-10 3:31 ` Nick Piggin
2008-06-11 1:00 ` Rik van Riel
2 siblings, 0 replies; 49+ messages in thread
From: KOSAKI Motohiro @ 2008-06-07 5:38 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Rik van Riel, linux-kernel, lee.schermerhorn,
linux-mm, eric.whitney, npiggin
Hi
> > + if (!isolate_lru_page(page)) {
> > + putback_lru_page(page);
> > + } else {
> > + /*
> > + * Try hard not to leak this page ...
> > + */
> > + lru_add_drain_all();
> > + if (!isolate_lru_page(page))
> > + putback_lru_page(page);
> > + }
> > +}
>
> When I review code I often come across stuff which I don't understand
> (at least, which I don't understand sufficiently easily). So I'll ask
> questions, and I do think the best way in which those questions should
> be answered is by adding a code comment to fix the problem for ever.
>
> When I look at the isolate_lru_page()-failed cases above I wonder what
> just happened. We now have a page which is still on the LRU (how did
> it get there in the first place?). Well no. I _think_ what happened is
> that this function is using isolate_lru_page() and putback_lru_page()
> to move a page off a now-inappropriate LRU list and to put it back onto
> the proper one. But heck, maybe I just don't know what this function
> is doing at all?
>
> If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
> what circumstances?) then... what? We now have a page which is on an
> inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
I think this code is OK,
but "Try hard not to leak this page ..." is wrong comment and not true.
isolate_lru_page() failure mean this page is isolated by another one.
later, Another one put back page to proper LRU by putback_lru_page().
(putback_lru_page() alway put back right LRU.)
no leak happebnd.
> > +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + unsigned long addr = start;
> > + struct page *pages[16]; /* 16 gives a reasonable batch */
> > + int write = !!(vma->vm_flags & VM_WRITE);
> > + int nr_pages = (end - start) / PAGE_SIZE;
> > + int ret;
> > +
> > + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> > + VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
> > + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> > +
> > + lru_add_drain_all(); /* push cached pages to LRU */
> > +
> > + while (nr_pages > 0) {
> > + int i;
> > +
> > + cond_resched();
> > +
> > + /*
> > + * get_user_pages makes pages present if we are
> > + * setting mlock.
> > + */
> > + ret = get_user_pages(current, mm, addr,
> > + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> > + write, 0, pages, NULL);
>
> Doesn't mlock already do a make_pages_present(), or did that get
> removed and moved to here?
I think,
vanilla: call make_pages_present() when mlock.
this series: call __mlock_vma_pages_range() when mlock.
thus, this code is right.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
@ 2008-06-10 3:31 ` Nick Piggin
2008-06-10 12:50 ` Rik van Riel
2008-06-10 21:14 ` Rik van Riel
2008-06-11 1:00 ` Rik van Riel
2 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2008-06-10 3:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Fri, Jun 06, 2008 at 06:07:46PM -0700, Andrew Morton wrote:
> On Fri, 06 Jun 2008 16:28:55 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > Originally
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Against: 2.6.26-rc2-mm1
> >
> > This patch:
> >
> > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > stub version of the mlock/noreclaim APIs when it's
> > not configured. Depends on [CONFIG_]NORECLAIM_LRU.
>
> Oh sob.
>
> akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM | wc -l
> 51
>
> why oh why? Must we really really do this to ourselves? Cheerfully
> unchangeloggedly?
>
> > 2) add yet another page flag--PG_mlocked--to indicate that
> > the page is locked for efficient testing in vmscan and,
> > optionally, fault path. This allows early culling of
> > nonreclaimable pages, preventing them from getting to
> > page_referenced()/try_to_unmap(). Also allows separate
> > accounting of mlock'd pages, as Nick's original patch
> > did.
> >
> > Note: Nick's original mlock patch used a PG_mlocked
> > flag. I had removed this in favor of the PG_noreclaim
> > flag + an mlock_count [new page struct member]. I
> > restored the PG_mlocked flag to eliminate the new
> > count field.
>
> How many page flags are left? I keep on asking this and I end up
> either a) not being told or b) forgetting. I thought that we had
> a whopping big comment somewhere which describes how all these
> flags are allocated but I can't immediately locate it.
>
> > 3) add the mlock/noreclaim infrastructure to mm/mlock.c,
> > with internal APIs in mm/internal.h. This is a rework
> > of Nick's original patch to these files, taking into
> > account that mlocked pages are now kept on noreclaim
> > LRU list.
> >
> > 4) update vmscan.c:page_reclaimable() to check PageMlocked()
> > and, if vma passed in, the vm_flags. Note that the vma
> > will only be passed in for new pages in the fault path;
> > and then only if the "cull nonreclaimable pages in fault
> > path" patch is included.
> >
> > 5) add try_to_unlock() to rmap.c to walk a page's rmap and
> > ClearPageMlocked() if no other vmas have it mlocked.
> > Reuses as much of try_to_unmap() as possible. This
> > effectively replaces the use of one of the lru list links
> > as an mlock count. If this mechanism let's pages in mlocked
> > vmas leak through w/o PG_mlocked set [I don't know that it
> > does], we should catch them later in try_to_unmap(). One
> > hopes this will be rare, as it will be relatively expensive.
> >
> > 6) Kosaki: added munlock page table walk to avoid using
> > get_user_pages() for unlock. get_user_pages() is unreliable
> > for some vma protections.
> > Lee: modified to wait for in-flight migration to complete
> > to close munlock/migration race that could strand pages.
>
> None of which is available on 32-bit machines. That's pretty significant.
It should definitely be enabled for 32-bit machines, and enabled by default.
The argument is that 32 bit machines won't have much memory so it won't
be a problem, but a) it also has to work well on other machines without
much memory, and b) it is a nightmare to have significant behaviour changes
like this. For kernel development as well as kernel running.
If we eventually run out of page flags on 32 bit, then sure this might be
one we could look at geting rid of. Once the code has proven itself.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 3:31 ` Nick Piggin
@ 2008-06-10 12:50 ` Rik van Riel
2008-06-10 21:14 ` Rik van Riel
1 sibling, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-10 12:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 05:31:30 +0200
Nick Piggin <npiggin@suse.de> wrote:
> It should definitely be enabled for 32-bit machines, and enabled by default.
> The argument is that 32 bit machines won't have much memory so it won't
> be a problem, but a) it also has to work well on other machines without
> much memory, and b) it is a nightmare to have significant behaviour changes
> like this. For kernel development as well as kernel running.
>
> If we eventually run out of page flags on 32 bit, then sure this might be
> one we could look at geting rid of. Once the code has proven itself.
Alternatively, we tell the 32 bit people not to compile their kernel
with support for 64 NUMA nodes :)
The number of page flags on 32 bits is (32 - ZONE_SHIFT - NODE_SHIFT)
after Christoph's cleanup and no longer a fixed number.
Does anyone compile a 32 bit kernel with a large (ZONE_SHIFT + NODE_SHIFT)?
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 3:31 ` Nick Piggin
2008-06-10 12:50 ` Rik van Riel
@ 2008-06-10 21:14 ` Rik van Riel
2008-06-10 21:43 ` Lee Schermerhorn
1 sibling, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-10 21:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 05:31:30 +0200
Nick Piggin <npiggin@suse.de> wrote:
> If we eventually run out of page flags on 32 bit, then sure this might be
> one we could look at geting rid of. Once the code has proven itself.
Yes, after the code has proven stable, we can probably get
rid of the PG_mlocked bit and use only PG_unevictable to mark
these pages.
Lee, Kosaki-san, do you see any problem with that approach?
Is the PG_mlocked bit really necessary for non-debugging
purposes?
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:14 ` Rik van Riel
@ 2008-06-10 21:43 ` Lee Schermerhorn
2008-06-10 21:57 ` Andrew Morton
2008-06-10 23:48 ` Rik van Riel
0 siblings, 2 replies; 49+ messages in thread
From: Lee Schermerhorn @ 2008-06-10 21:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> On Tue, 10 Jun 2008 05:31:30 +0200
> Nick Piggin <npiggin@suse.de> wrote:
>
> > If we eventually run out of page flags on 32 bit, then sure this might be
> > one we could look at geting rid of. Once the code has proven itself.
>
> Yes, after the code has proven stable, we can probably get
> rid of the PG_mlocked bit and use only PG_unevictable to mark
> these pages.
>
> Lee, Kosaki-san, do you see any problem with that approach?
> Is the PG_mlocked bit really necessary for non-debugging
> purposes?
>
Well, it does speed up the check for mlocked pages in page_reclaimable()
[now page_evictable()?] as we don't have to walk the reverse map to
determine that a page is mlocked. In many places where we currently
test page_reclaimable(), we really don't want to and maybe can't walk
the reverse map.
Unless you're evisioning even larger rework, the PG_unevictable flag
[formerly PG_noreclaim, right?] is analogous to PG_active. It's only
set when the page is on the corresponding lru list or being held
isolated from it, temporarily. See isolate_lru_page() and
putback_lru_page() and users thereof--such as mlock_vma_page(). Again,
I have seen what changes you're making here, so maybe that's all
changing. But, currently, PG_unevictable would not be a replacement for
PG_mlocked.
Anyway, let's see what you come up with before we tackle this.
Couple of related items:
+ 26-rc5-mm1 + a small fix to the double unlock_page() in
shrink_page_list() has been running for a couple of hours on my 32G,
16cpu ia64 numa platform w/o error. Seems to have survived the merge
into -mm, despite the issues Andrew has raised.
+ on same platform, Mel Gorman's mminit debug code is reporting that
we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
configured.
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:43 ` Lee Schermerhorn
@ 2008-06-10 21:57 ` Andrew Morton
2008-06-11 16:01 ` Lee Schermerhorn
2008-06-10 23:48 ` Rik van Riel
1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-06-10 21:57 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: riel, npiggin, linux-kernel, kosaki.motohiro, linux-mm, eric.whitney
On Tue, 10 Jun 2008 17:43:17 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> Couple of related items:
>
> + 26-rc5-mm1 + a small fix to the double unlock_page() in
> shrink_page_list() has been running for a couple of hours on my 32G,
> 16cpu ia64 numa platform w/o error. Seems to have survived the merge
> into -mm, despite the issues Andrew has raised.
oh goody, thanks. Johannes's bootmem rewrite is holding up
surprisingly well.
gee test.kernel.org takes a long time.
> + on same platform, Mel Gorman's mminit debug code is reporting that
> we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
> configured.
what is "Mel Gorman's mminit debug code"?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:57 ` Andrew Morton
@ 2008-06-11 16:01 ` Lee Schermerhorn
0 siblings, 0 replies; 49+ messages in thread
From: Lee Schermerhorn @ 2008-06-11 16:01 UTC (permalink / raw)
To: Andrew Morton
Cc: riel, npiggin, linux-kernel, kosaki.motohiro, linux-mm, eric.whitney
On Tue, 2008-06-10 at 14:57 -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 17:43:17 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > Couple of related items:
> >
> > + 26-rc5-mm1 + a small fix to the double unlock_page() in
> > shrink_page_list() has been running for a couple of hours on my 32G,
> > 16cpu ia64 numa platform w/o error. Seems to have survived the merge
> > into -mm, despite the issues Andrew has raised.
>
> oh goody, thanks.
I should have mentioned that it's running a fairly heavy stress load to
exercise the vm scalability changes. Lots of IO, page cache activity,
swapping, mlocking and shmlocking various sized regions, up to 16GB on
32GB machine, migrating of mlocked/shmlocked segments between
nodes, ... So far today, the load has been up for ~19.5 hours with no
errors, no softlockups, no oom-kills or such.
> Johannes's bootmem rewrite is holding up
> surprisingly well.
Well, I am seeing a lot of "potential offnode page_structs" messages for
our funky cache-line interleaved pseudo-node. I had to limit the prints
to boot at all. Still investigating. Looks like slub can't allocate
its initial per node data on that node either.
>
> gee test.kernel.org takes a long time.
>
> > + on same platform, Mel Gorman's mminit debug code is reporting that
> > we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
> > configured.
>
> what is "Mel Gorman's mminit debug code"?
mminit_loglevel={0|1|2} [I use 3 :)] shows page flag layout, zone
lists, ...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:43 ` Lee Schermerhorn
2008-06-10 21:57 ` Andrew Morton
@ 2008-06-10 23:48 ` Rik van Riel
2008-06-11 15:29 ` Lee Schermerhorn
1 sibling, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-06-10 23:48 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 17:43:17 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> > On Tue, 10 Jun 2008 05:31:30 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> >
> > > If we eventually run out of page flags on 32 bit, then sure this might be
> > > one we could look at geting rid of. Once the code has proven itself.
> >
> > Yes, after the code has proven stable, we can probably get
> > rid of the PG_mlocked bit and use only PG_unevictable to mark
> > these pages.
> >
> > Lee, Kosaki-san, do you see any problem with that approach?
> > Is the PG_mlocked bit really necessary for non-debugging
> > purposes?
>
> Well, it does speed up the check for mlocked pages in page_reclaimable()
> [now page_evictable()?] as we don't have to walk the reverse map to
> determine that a page is mlocked. In many places where we currently
> test page_reclaimable(), we really don't want to and maybe can't walk
> the reverse map.
There are a few places:
1) the pageout code, which calls page_referenced() anyway; we can
change page_referenced() to return PAGE_MLOCKED and do the right
thing from there
2) when the page is moved from a per-cpu pagevec onto an LRU list,
we may be able to simply skip the check there on the theory that
the pagevecs are small and the pageout code will eventually catch
these (few?) pages - actually, setting PG_noreclaim on a page
that is in a pagevec but not on an LRU list might catch that
Does that seem reasonable/possible?
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 23:48 ` Rik van Riel
@ 2008-06-11 15:29 ` Lee Schermerhorn
0 siblings, 0 replies; 49+ messages in thread
From: Lee Schermerhorn @ 2008-06-11 15:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 2008-06-10 at 19:48 -0400, Rik van Riel wrote:
> On Tue, 10 Jun 2008 17:43:17 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> > > On Tue, 10 Jun 2008 05:31:30 +0200
> > > Nick Piggin <npiggin@suse.de> wrote:
> > >
> > > > If we eventually run out of page flags on 32 bit, then sure this might be
> > > > one we could look at geting rid of. Once the code has proven itself.
> > >
> > > Yes, after the code has proven stable, we can probably get
> > > rid of the PG_mlocked bit and use only PG_unevictable to mark
> > > these pages.
> > >
> > > Lee, Kosaki-san, do you see any problem with that approach?
> > > Is the PG_mlocked bit really necessary for non-debugging
> > > purposes?
> >
> > Well, it does speed up the check for mlocked pages in page_reclaimable()
> > [now page_evictable()?] as we don't have to walk the reverse map to
> > determine that a page is mlocked. In many places where we currently
> > test page_reclaimable(), we really don't want to and maybe can't walk
> > the reverse map.
>
> There are a few places:
> 1) the pageout code, which calls page_referenced() anyway; we can
> change page_referenced() to return PAGE_MLOCKED and do the right
> thing from there
In vmscan, true. try_to_unmap() will catch it too. By then, we'll have
let the page ride through the active list to the inactive list and won't
catch it until shrink_page_list(). But, this only happens once per page
and then it's hidden on the nor^H^H^Hunevictable list.
We might want to kill the "cull in fault path" patch, tho'.
> 2) when the page is moved from a per-cpu pagevec onto an LRU list,
> we may be able to simply skip the check there on the theory that
> the pagevecs are small and the pageout code will eventually catch
> these (few?) pages - actually, setting PG_noreclaim on a page
> that is in a pagevec but not on an LRU list might catch that
>
> Does that seem reasonable/possible?
Not sure. The most recent patches that I posted do not use the pagevec
for the noreclaim/unevictable list. They put nonreclaimable/unevictable
pages directly onto the noreclaim/unevictable list to avoid race
conditions that could strand a page. Kosaki-san and I spent a lot of
time analyzing and testing the current code for potential page leaks
onto the noreclaim/unevictable list. It currently depends on the atomic
TestSet/TestClear of the PG_mlocked bit, along with page lock and lru
isolation/putback to resolve all of the potential races. I attempted to
describe this aspect in the doc. Have to rethink all of that.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
2008-06-10 3:31 ` Nick Piggin
@ 2008-06-11 1:00 ` Rik van Riel
2 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2008-06-11 1:00 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney, npiggin
On Fri, 6 Jun 2008 18:07:46 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:55 -0400
> Rik van Riel <riel@redhat.com> wrote:
> > Originally
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Against: 2.6.26-rc2-mm1
> >
> > This patch:
> >
> > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > stub version of the mlock/noreclaim APIs when it's
> > not configured. Depends on [CONFIG_]NORECLAIM_LRU.
>
> Oh sob.
OK, I just removed CONFIG_NORECLAIM_MLOCK.
> > 2) add yet another page flag--PG_mlocked--to indicate that
> > the page is locked for efficient testing in vmscan and,
> > optionally, fault path. This allows early culling of
> > nonreclaimable pages, preventing them from getting to
> > page_referenced()/try_to_unmap(). Also allows separate
> > accounting of mlock'd pages, as Nick's original patch
> > did.
> >
> > Note: Nick's original mlock patch used a PG_mlocked
> > flag. I had removed this in favor of the PG_noreclaim
> > flag + an mlock_count [new page struct member]. I
> > restored the PG_mlocked flag to eliminate the new
> > count field.
>
> How many page flags are left?
Depends on what CONFIG_ZONE_SHIFT and CONFIG_NODE_SHIFT
are set to.
I suspect we'll be able to get rid of the PG_mlocked page
flag in the future, since mlock is just one reason for
the page being PG_noreclaim.
> > +/*
> > + * mlock all pages in this vma range. For mmap()/mremap()/...
> > + */
> > +extern int mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end);
> > +
> > +/*
> > + * munlock all pages in vma. For munmap() and exit().
> > + */
> > +extern void munlock_vma_pages_all(struct vm_area_struct *vma);
>
> I don't think it's desirable that interfaces be documented in two
> places. The documentation which you have at the definition site is
> more complete than this, and is at the place where people will expect
> to find it.
I removed these comments.
> > + if (!isolate_lru_page(page)) {
> > + putback_lru_page(page);
> > + } else {
> > + /*
> > + * Try hard not to leak this page ...
> > + */
> > + lru_add_drain_all();
> > + if (!isolate_lru_page(page))
> > + putback_lru_page(page);
> > + }
> > +}
>
> When I review code I often come across stuff which I don't understand
> (at least, which I don't understand sufficiently easily). So I'll ask
> questions, and I do think the best way in which those questions should
> be answered is by adding a code comment to fix the problem for ever.
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
/*
* Page not on the LRU yet. Flush all pagevecs and retry.
*/
lru_add_drain_all();
if (!isolate_lru_page(page))
putback_lru_page(page);
}
> If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
> what circumstances?) then... what? We now have a page which is on an
> inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
It is OK because we will run into the page later on in the pageout
code, detect that the page is unevictable and move it to the
unevictable LRU.
> > +/*
> > + * called from munlock()/munmap() path with page supposedly on the LRU.
> > + *
> > + * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > + * [in try_to_unlock()] and then attempt to isolate the page. We must
> > + * isolate the page() to keep others from messing with its noreclaim
>
> page()?
Fixed.
> > + * and mlocked state while trying to unlock. However, we pre-clear the
>
> "unlock"? (See exhasperated comment against try_to_unlock(), below)
Renamed that one to try_to_munlock() and adjusted all the callers and
comments.
> > +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + ret = get_user_pages(current, mm, addr,
> > + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> > + write, 0, pages, NULL);
>
> Doesn't mlock already do a make_pages_present(), or did that get
> removed and moved to here?
make_pages_present does not work right for PROT_NONE and does
not add pages to the unevictable LRU. Now that we have a
separate function for unlocking, we may be able to just add
a few lines to make_pages_present and use that again.
Also, make_pages_present works on some other types of VMAs
that this code does not work on. I do not know whether
merging this with make_pages_present would make things
cleaner or uglier.
Lee? Kosaki-san? Either of you interested in investigating
this after Andrew has the patches merged with the fast cleanups
that I'm doing now?
> > + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> > + is_vm_hugetlb_page(vma) ||
> > + vma == get_gate_vma(current))
> > + goto make_present;
> > +
> > + return __mlock_vma_pages_range(vma, start, end);
>
> Invert the `if' expression, remove the goto?
Done, thanks.
> > +/**
> > + * try_to_unlock - Check page's rmap for other vma's holding page locked.
> > + * @page: the page to be unlocked. will be returned with PG_mlocked
> > + * cleared if no vmas are VM_LOCKED.
>
> I think kerneldoc will barf over the newline in @page's description.
Cleaned this up.
> > + * Return values are:
> > + *
> > + * SWAP_SUCCESS - no vma's holding page locked.
> > + * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
> > + * SWAP_MLOCK - page is now mlocked.
> > + */
> > +int try_to_unlock(struct page *page)
> > +{
> > + VM_BUG_ON(!PageLocked(page) || PageLRU(page));
> > +
> > + if (PageAnon(page))
> > + return try_to_unmap_anon(page, 1, 0);
> > + else
> > + return try_to_unmap_file(page, 1, 0);
> > +}
> > +#endif
>
> OK, this function is clear as mud. My first reaction was "what's wrong
> with just doing unlock_page()?". The term "unlock" is waaaaaaaaaaay
> overloaded in this context and its use here was an awful decision.
>
> Can we please come up with a more specific name and add some comments
> which give the reader some chance of working out what it is that is
> actually being unlocked?
try_to_munlock - I have fixed the documentation for this function too
> > ...
> >
> > @@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
> > * If the vma has a ->close operation then the driver probably needs to release
> > * per-vma resources, so we don't attempt to merge those.
> > */
> > -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> >
> > static inline int is_mergeable_vma(struct vm_area_struct *vma,
> > struct file *file, unsigned long vm_flags)
>
> hm, so the old definition of VM_SPECIAL managed to wedge itself between
> is_mergeable_vma() and is_mergeable_vma()'s comment. Had me confused
> there.
>
> pls remove the blank line between the comment and the start of
> is_mergeable_vma() so people don't go sticking more things in there.
Done.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap
[not found] <20080606202838.390050172@redhat.com>
` (2 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel, Lee Schermerhorn
` (2 subsequent siblings)
6 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-19-lts-noreclaim-cull-non-reclaimable-anon-pages-in-fault-path.patch --]
[-- Type: text/plain, Size: 11868 bytes --]
Originally
From: Nick Piggin <npiggin@suse.de>
Against: 2.6.26-rc2-mm1
Remove mlocked pages from the LRU using "NoReclaim infrastructure"
during mmap(), munmap(), mremap() and truncate(). Try to move back
to normal LRU lists on munmap() when last mlocked mapping removed.
Removed PageMlocked() status when page truncated from file.
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
V6:
+ munlock page in range of VM_LOCKED vma being covered by
remap_file_pages(), as this is an implied unmap of the
range.
+ in support of special vma filtering, don't account for
non-mlockable vmas as locked_vm.
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]
V1 -> V2:
+ modified mmap.c:mmap_region() to return error if mlock_vma_pages_range()
does. This can only occur if the vma gets removed/changed while
we're switching mmap_sem lock modes. Most callers don't care, but
sys_remap_file_pages() appears to.
Rework of Nick Piggins's "mm: move mlocked pages off the LRU" patch
-- part 2 0f 2.
mm/fremap.c | 26 +++++++++++++++++---
mm/internal.h | 13 ++++++++--
mm/mlock.c | 10 ++++---
mm/mmap.c | 75 ++++++++++++++++++++++++++++++++++++++++++++--------------
mm/mremap.c | 8 +++---
mm/truncate.c | 4 +++
6 files changed, 106 insertions(+), 30 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c 2008-06-06 16:06:35.000000000 -0400
@@ -32,6 +32,8 @@
#include <asm/tlb.h>
#include <asm/mmu_context.h>
+#include "internal.h"
+
#ifndef arch_mmap_check
#define arch_mmap_check(addr, len, flags) (0)
#endif
@@ -961,6 +963,7 @@ unsigned long do_mmap_pgoff(struct file
return -EPERM;
vm_flags |= VM_LOCKED;
}
+
/* mlock MCL_FUTURE? */
if (vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
@@ -1121,10 +1124,12 @@ munmap_back:
* The VM_SHARED test is necessary because shmem_zero_setup
* will create the file object for a shared anonymous map below.
*/
- if (!file && !(vm_flags & VM_SHARED) &&
- vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, NULL, pgoff, NULL))
- goto out;
+ if (!file && !(vm_flags & VM_SHARED)) {
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+ NULL, NULL, pgoff, NULL);
+ if (vma)
+ goto out;
+ }
/*
* Determine the object being mapped and call the appropriate
@@ -1206,10 +1211,14 @@ out:
mm->total_vm += len >> PAGE_SHIFT;
vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
- mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
- }
- if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+ /*
+ * makes pages present; downgrades, drops, reacquires mmap_sem
+ */
+ int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
+ if (nr_pages < 0)
+ return nr_pages; /* vma gone! */
+ mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
+ } else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
return addr;
@@ -1682,8 +1691,11 @@ find_extend_vma(struct mm_struct *mm, un
return vma;
if (!prev || expand_stack(prev, addr))
return NULL;
- if (prev->vm_flags & VM_LOCKED)
- make_pages_present(addr, prev->vm_end);
+ if (prev->vm_flags & VM_LOCKED) {
+ int nr_pages = mlock_vma_pages_range(prev, addr, prev->vm_end);
+ if (nr_pages < 0)
+ return NULL; /* vma gone! */
+ }
return prev;
}
#else
@@ -1709,8 +1721,11 @@ find_extend_vma(struct mm_struct * mm, u
start = vma->vm_start;
if (expand_stack(vma, addr))
return NULL;
- if (vma->vm_flags & VM_LOCKED)
- make_pages_present(addr, start);
+ if (vma->vm_flags & VM_LOCKED) {
+ int nr_pages = mlock_vma_pages_range(vma, addr, start);
+ if (nr_pages < 0)
+ return NULL; /* vma gone! */
+ }
return vma;
}
#endif
@@ -1895,6 +1910,18 @@ int do_munmap(struct mm_struct *mm, unsi
vma = prev? prev->vm_next: mm->mmap;
/*
+ * unlock any mlock()ed ranges before detaching vmas
+ */
+ if (mm->locked_vm) {
+ struct vm_area_struct *tmp = vma;
+ while (tmp && tmp->vm_start < end) {
+ if (tmp->vm_flags & VM_LOCKED)
+ munlock_vma_pages_all(tmp);
+ tmp = tmp->vm_next;
+ }
+ }
+
+ /*
* Remove the vma's, and unmap the actual pages
*/
detach_vmas_to_be_unmapped(mm, vma, prev, end);
@@ -2006,8 +2033,9 @@ unsigned long do_brk(unsigned long addr,
return -ENOMEM;
/* Can we just expand an old private anonymous mapping? */
- if (vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL))
+ vma = vma_merge(mm, prev, addr, addr + len, flags,
+ NULL, NULL, pgoff, NULL);
+ if (vma)
goto out;
/*
@@ -2029,8 +2057,9 @@ unsigned long do_brk(unsigned long addr,
out:
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
- mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
+ if (nr_pages >= 0)
+ mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
}
return addr;
}
@@ -2041,13 +2070,25 @@ EXPORT_SYMBOL(do_brk);
void exit_mmap(struct mm_struct *mm)
{
struct mmu_gather *tlb;
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
unsigned long end;
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ if (mm->locked_vm) {
+ vma = mm->mmap;
+ while (vma) {
+ if (vma->vm_flags & VM_LOCKED)
+ munlock_vma_pages_all(vma);
+ vma = vma->vm_next;
+ }
+ }
+
+ vma = mm->mmap;
+
+
lru_add_drain();
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
Index: linux-2.6.26-rc2-mm1/mm/mremap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mremap.c 2008-05-15 11:20:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mremap.c 2008-06-06 16:06:35.000000000 -0400
@@ -23,6 +23,8 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include "internal.h"
+
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
@@ -232,8 +234,8 @@ static unsigned long move_vma(struct vm_
if (vm_flags & VM_LOCKED) {
mm->locked_vm += new_len >> PAGE_SHIFT;
if (new_len > old_len)
- make_pages_present(new_addr + old_len,
- new_addr + new_len);
+ mlock_vma_pages_range(new_vma, new_addr + old_len,
+ new_addr + new_len);
}
return new_addr;
@@ -373,7 +375,7 @@ unsigned long do_mremap(unsigned long ad
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
mm->locked_vm += pages;
- make_pages_present(addr + old_len,
+ mlock_vma_pages_range(vma, addr + old_len,
addr + new_len);
}
ret = addr;
Index: linux-2.6.26-rc2-mm1/mm/truncate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/truncate.c 2008-05-15 11:20:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/truncate.c 2008-06-06 16:06:35.000000000 -0400
@@ -18,6 +18,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/buffer_head.h> /* grr. try_to_release_page,
do_invalidatepage */
+#include "internal.h"
/**
@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
cancel_dirty_page(page, PAGE_CACHE_SIZE);
remove_from_page_cache(page);
+ clear_page_mlock(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
@@ -128,6 +130,7 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;
+ clear_page_mlock(page);
ret = remove_mapping(mapping, page);
return ret;
@@ -353,6 +356,7 @@ invalidate_complete_page2(struct address
if (PageDirty(page))
goto failed;
+ clear_page_mlock(page);
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-06-06 16:06:32.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:35.000000000 -0400
@@ -270,7 +270,8 @@ static void __munlock_vma_pages_range(st
struct munlock_page_walk mpw;
VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
- VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+ VM_BUG_ON((!rwsem_is_locked(&vma->vm_mm->mmap_sem)) &&
+ (atomic_read(&mm->mm_users) != 0));
VM_BUG_ON(start < vma->vm_start);
VM_BUG_ON(end > vma->vm_end);
@@ -354,12 +355,13 @@ no_mlock:
/*
- * munlock all pages in vma. For munmap() and exit().
+ * munlock all pages in the vma range. For mremap(), munmap() and exit().
*/
-void munlock_vma_pages_all(struct vm_area_struct *vma)
+void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
vma->vm_flags &= ~VM_LOCKED;
- __munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+ __munlock_vma_pages_range(vma, start, end);
}
/*
Index: linux-2.6.26-rc2-mm1/mm/fremap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/fremap.c 2008-05-15 11:20:43.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/fremap.c 2008-06-06 16:06:35.000000000 -0400
@@ -20,6 +20,8 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include "internal.h"
+
static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
@@ -214,13 +216,29 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
+ if (vma->vm_flags & VM_LOCKED) {
+ /*
+ * drop PG_Mlocked flag for over-mapped range
+ */
+ unsigned int saved_flags = vma->vm_flags;
+ munlock_vma_pages_range(vma, start, start + size);
+ vma->vm_flags = saved_flags;
+ }
+
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
- if (unlikely(has_write_lock)) {
- downgrade_write(&mm->mmap_sem);
- has_write_lock = 0;
+ if (vma->vm_flags & VM_LOCKED) {
+ /*
+ * might be mapping previously unmapped range of file
+ */
+ mlock_vma_pages_range(vma, start, start + size);
+ } else {
+ if (unlikely(has_write_lock)) {
+ downgrade_write(&mm->mmap_sem);
+ has_write_lock = 0;
+ }
+ make_pages_present(start, start+size);
}
- make_pages_present(start, start+size);
}
/*
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:35.000000000 -0400
@@ -63,9 +63,18 @@ extern int mlock_vma_pages_range(struct
unsigned long start, unsigned long end);
/*
- * munlock all pages in vma. For munmap() and exit().
+ * munlock all pages in vma range. For mremap().
*/
-extern void munlock_vma_pages_all(struct vm_area_struct *vma);
+extern void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+
+/*
+ * munlock all pages in vma. For munmap and exit().
+ */
+static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+ munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
#ifdef CONFIG_NORECLAIM_LRU
/*
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* [PATCH -mm 21/25] Cull non-reclaimable pages in fault path
[not found] <20080606202838.390050172@redhat.com>
` (3 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel, Lee Schermerhorn
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel, Lee Schermerhorn
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
6 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel, Rik van Riel, Lee Schermerhorn @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-21-lts-noreclaim-optional-scan-noreclaim-list-for-reclaimable-pages.patch --]
[-- Type: text/plain, Size: 6366 bytes --]
Against: 2.6.26-rc2-mm1
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series.
V1 -> V2:
+ no changes
"Optional" part of "noreclaim infrastructure"
In the fault paths that install new anonymous pages, check whether
the page is reclaimable or not using lru_cache_add_active_or_noreclaim().
If the page is reclaimable, just add it to the active lru list [via
the pagevec cache], else add it to the noreclaim list.
This "proactive" culling in the fault path mimics the handling of
mlocked pages in Nick Piggin's series to keep mlocked pages off
the lru lists.
Notes:
1) This patch is optional--e.g., if one is concerned about the
additional test in the fault path. We can defer the moving of
nonreclaimable pages until when vmscan [shrink_*_list()]
encounters them. Vmscan will only need to handle such pages
once.
2) The 'vma' argument to page_reclaimable() is require to notice that
we're faulting a page into an mlock()ed vma w/o having to scan the
page's rmap in the fault path. Culling mlock()ed anon pages is
currently the only reason for this patch.
3) We can't cull swap pages in read_swap_cache_async() because the
vma argument doesn't necessarily correspond to the swap cache
offset passed in by swapin_readahead(). This could [did!] result
in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
cull in this path.
4) Move set_pte_at() to after where we add page to lru to keep it
hidden from other tasks that might walk the page table.
We already do it in this order in do_anonymous() page. And,
these are COW'd anon pages. Is this safe?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
include/linux/swap.h | 2 ++
mm/memory.c | 20 ++++++++++++--------
mm/swap.c | 21 +++++++++++++++++++++
3 files changed, 35 insertions(+), 8 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-06-06 16:06:44.000000000 -0400
@@ -1774,12 +1774,15 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush(vma, address, page_table);
- set_pte_at(mm, address, page_table, entry);
- update_mmu_cache(vma, address, entry);
+
SetPageSwapBacked(new_page);
- lru_cache_add_active_anon(new_page);
+ lru_cache_add_active_or_noreclaim(new_page, vma);
page_add_new_anon_rmap(new_page, vma, address);
+//TODO: is this safe? do_anonymous_page() does it this way.
+ set_pte_at(mm, address, page_table, entry);
+ update_mmu_cache(vma, address, entry);
+
/* Free the old page.. */
new_page = old_page;
ret |= VM_FAULT_WRITE;
@@ -2246,7 +2249,7 @@ static int do_anonymous_page(struct mm_s
goto release;
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active_anon(page);
+ lru_cache_add_active_or_noreclaim(page, vma);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);
@@ -2390,12 +2393,11 @@ static int __do_fault(struct mm_struct *
entry = mk_pte(page, vma->vm_page_prot);
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte_at(mm, address, page_table, entry);
if (anon) {
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active_anon(page);
- page_add_new_anon_rmap(page, vma, address);
+ lru_cache_add_active_or_noreclaim(page, vma);
+ page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
@@ -2404,6 +2406,8 @@ static int __do_fault(struct mm_struct *
get_page(dirty_page);
}
}
+//TODO: is this safe? do_anonymous_page() does it this way.
+ set_pte_at(mm, address, page_table, entry);
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:06:44.000000000 -0400
@@ -173,6 +173,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void __lru_cache_add(struct page *, enum lru_list lru);
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_cache_add_active_or_noreclaim(struct page *,
+ struct vm_area_struct *);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:06:44.000000000 -0400
@@ -31,6 +31,8 @@
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
+#include "internal.h"
+
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -273,6 +275,25 @@ void add_page_to_noreclaim_list(struct p
spin_unlock_irq(&zone->lru_lock);
}
+/**
+ * lru_cache_add_active_or_noreclaim
+ * @page: the page to be added to LRU
+ * @vma: vma in which page is mapped for determining reclaimability
+ *
+ * place @page on active or noreclaim LRU list, depending on
+ * page_reclaimable(). Note that if the page is not reclaimable,
+ * it goes directly back onto it's zone's noreclaim list. It does
+ * NOT use a per cpu pagevec.
+ */
+void lru_cache_add_active_or_noreclaim(struct page *page,
+ struct vm_area_struct *vma)
+{
+ if (page_reclaimable(page, vma))
+ lru_cache_add_lru(page, LRU_ACTIVE + page_file_cache(page));
+ else
+ add_page_to_noreclaim_list(page);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* [PATCH -mm 23/25] Noreclaim LRU scan sysctl
[not found] <20080606202838.390050172@redhat.com>
` (4 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel, Lee Schermerhorn
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel, Lee Schermerhorn
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
6 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel, Rik van Riel, Lee Schermerhorn @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-23-lts-noreclaim-lru-scan-sysctl.patch --]
[-- Type: text/plain, Size: 11638 bytes --]
Against: 2.6.26-rc2-mm1
V6:
+ moved to end of series as optional debug patch
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series
New in V2
This patch adds a function to scan individual or all zones' noreclaim
lists and move any pages that have become reclaimable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.
Adds sysctl to scan all nodes, and per node attributes to individual
nodes' zones.
Kosaki:
If reclaimable page found in noreclaim lru when write
/proc/sys/vm/scan_noreclaim_pages, print filename and file offset of
these pages.
TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
drivers/base/node.c | 5 +
include/linux/rmap.h | 3
include/linux/swap.h | 15 ++++
kernel/sysctl.c | 10 +++
mm/rmap.c | 4 -
mm/vmscan.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 196 insertions(+), 2 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-06-06 16:06:44.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:06:52.000000000 -0400
@@ -7,6 +7,7 @@
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>
+#include <linux/node.h>
#include <asm/atomic.h>
#include <asm/page.h>
@@ -235,15 +236,29 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM_LRU
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
extern void scan_mapping_noreclaim_pages(struct address_space *);
+
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+extern int scan_noreclaim_register_node(struct node *node);
+extern void scan_noreclaim_unregister_node(struct node *node);
#else
static inline int page_reclaimable(struct page *page,
struct vm_area_struct *vma)
{
return 1;
}
+
static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
{
}
+
+static inline int scan_noreclaim_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void scan_noreclaim_unregister_node(struct node *node) { }
#endif
extern int kswapd_run(int nid);
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:48.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:52.000000000 -0400
@@ -39,6 +39,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/memcontrol.h>
+#include <linux/sysctl.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2355,6 +2356,37 @@ int page_reclaimable(struct page *page,
return 1;
}
+static void show_page_path(struct page *page)
+{
+ char buf[256];
+ if (page_file_cache(page)) {
+ struct address_space *mapping = page->mapping;
+ struct dentry *dentry;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ spin_lock(&mapping->i_mmap_lock);
+ dentry = d_find_alias(mapping->host);
+ printk(KERN_INFO "rescued: %s %lu\n",
+ dentry_path(dentry, buf, 256), pgoff);
+ spin_unlock(&mapping->i_mmap_lock);
+ } else {
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ printk(KERN_INFO "rescued: anon %s\n",
+ vma->vm_mm->owner->comm);
+ break;
+ }
+ page_unlock_anon_vma(anon_vma);
+ }
+}
+
+
/**
* check_move_noreclaim_page - check page for reclaimability and move to appropriate lru list
* @page: page to check reclaimability and move to appropriate lru list
@@ -2372,6 +2404,9 @@ static void check_move_noreclaim_page(st
ClearPageNoreclaim(page); /* for page_reclaimable() */
if (page_reclaimable(page, NULL)) {
enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+
+ show_page_path(page);
+
__dec_zone_state(zone, NR_NORECLAIM);
list_move(&page->lru, &zone->list[l]);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
@@ -2452,4 +2487,130 @@ void scan_mapping_noreclaim_pages(struct
}
}
+
+/**
+ * scan_zone_noreclaim_pages - check noreclaim list for reclaimable pages
+ * @zone - zone of which to scan the noreclaim list
+ *
+ * Scan @zone's noreclaim LRU lists to check for pages that have become
+ * reclaimable. Move those that have to @zone's inactive list where they
+ * become candidates for reclaim, unless shrink_inactive_zone() decides
+ * to reactivate them. Pages that are still non-reclaimable are rotated
+ * back onto @zone's noreclaim list.
+ */
+#define SCAN_NORECLAIM_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
+void scan_zone_noreclaim_pages(struct zone *zone)
+{
+ struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
+ unsigned long scan;
+ unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
+
+ while (nr_to_scan > 0) {
+ unsigned long batch_size = min(nr_to_scan,
+ SCAN_NORECLAIM_BATCH_SIZE);
+
+ spin_lock_irq(&zone->lru_lock);
+ for (scan = 0; scan < batch_size; scan++) {
+ struct page *page = lru_to_page(l_noreclaim);
+
+ if (TestSetPageLocked(page))
+ continue;
+
+ prefetchw_prev_lru_page(page, l_noreclaim, flags);
+
+ if (likely(PageLRU(page) && PageNoreclaim(page)))
+ check_move_noreclaim_page(page, zone);
+
+ unlock_page(page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+
+ nr_to_scan -= batch_size;
+ }
+}
+
+
+/**
+ * scan_all_zones_noreclaim_pages - scan all noreclaim lists for reclaimable pages
+ *
+ * A really big hammer: scan all zones' noreclaim LRU lists to check for
+ * pages that have become reclaimable. Move those back to the zones'
+ * inactive list where they become candidates for reclaim.
+ * This occurs when, e.g., we have unswappable pages on the noreclaim lists,
+ * and we add swap to the system. As such, it runs in the context of a task
+ * that has possibly/probably made some previously non-reclaimable pages
+ * reclaimable.
+ */
+void scan_all_zones_noreclaim_pages(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ scan_zone_noreclaim_pages(zone);
+ }
+}
+
+/*
+ * scan_noreclaim_pages [vm] sysctl handler. On demand re-scan of
+ * all nodes' noreclaim lists for reclaimable pages
+ */
+unsigned long scan_noreclaim_pages;
+
+int scan_noreclaim_handler(struct ctl_table *table, int write,
+ struct file *file, void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+ if (write && *(unsigned long *)table->data)
+ scan_all_zones_noreclaim_pages();
+
+ scan_noreclaim_pages = 0;
+ return 0;
+}
+
+/*
+ * per node 'scan_noreclaim_pages' attribute. On demand re-scan of
+ * a specified node's per zone noreclaim lists for reclaimable pages.
+ */
+
+static ssize_t read_scan_noreclaim_node(struct sys_device *dev, char *buf)
+{
+ return sprintf(buf, "0\n"); /* always zero; should fit... */
+}
+
+static ssize_t write_scan_noreclaim_node(struct sys_device *dev,
+ const char *buf, size_t count)
+{
+ struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
+ struct zone *zone;
+ unsigned long res;
+ unsigned long req = strict_strtoul(buf, 10, &res);
+
+ if (!req)
+ return 1; /* zero is no-op */
+
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!populated_zone(zone))
+ continue;
+ scan_zone_noreclaim_pages(zone);
+ }
+ return 1;
+}
+
+
+static SYSDEV_ATTR(scan_noreclaim_pages, S_IRUGO | S_IWUSR,
+ read_scan_noreclaim_node,
+ write_scan_noreclaim_node);
+
+int scan_noreclaim_register_node(struct node *node)
+{
+ return sysdev_create_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+void scan_noreclaim_unregister_node(struct node *node)
+{
+ sysdev_remove_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
#endif
Index: linux-2.6.26-rc2-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/kernel/sysctl.c 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/kernel/sysctl.c 2008-06-06 16:06:52.000000000 -0400
@@ -1151,6 +1151,16 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "scan_noreclaim_pages",
+ .data = &scan_noreclaim_pages,
+ .maxlen = sizeof(scan_noreclaim_pages),
+ .mode = 0644,
+ .proc_handler = &scan_noreclaim_handler,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-06-06 16:06:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-06-06 16:06:52.000000000 -0400
@@ -13,6 +13,7 @@
#include <linux/nodemask.h>
#include <linux/cpu.h>
#include <linux/device.h>
+#include <linux/swap.h>
static struct sysdev_class node_class = {
.name = "node",
@@ -190,6 +191,8 @@ int register_node(struct node *node, int
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+
+ scan_noreclaim_register_node(node);
}
return error;
}
@@ -209,6 +212,8 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ scan_noreclaim_unregister_node(node);
+
sysdev_unregister(&node->sysdev);
}
Index: linux-2.6.26-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/rmap.h 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/rmap.h 2008-06-06 16:06:52.000000000 -0400
@@ -55,6 +55,9 @@ void anon_vma_unlink(struct vm_area_stru
void anon_vma_link(struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
+extern struct anon_vma *page_lock_anon_vma(struct page *page);
+extern void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
/*
* rmap interfaces called when adding or removing pte of page
*/
Index: linux-2.6.26-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/rmap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/rmap.c 2008-06-06 16:06:52.000000000 -0400
@@ -168,7 +168,7 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -188,7 +188,7 @@ out:
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread* [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation
[not found] <20080606202838.390050172@redhat.com>
` (5 preceding siblings ...)
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel, Lee Schermerhorn
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel
6 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm, Eric Whitney
[-- Attachment #1: rvr-25-lts-noreclaim-mlock-documentation.patch --]
[-- Type: text/plain, Size: 36246 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation for noreclaim lru list and its usage.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Documentation/vm/noreclaim-lru.txt | 609 +++++++++++++++++++++++++++++++++++++
1 file changed, 609 insertions(+)
Index: linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt 2008-06-06 16:07:01.000000000 -0400
@@ -0,0 +1,609 @@
+
+This document describes the Linux memory management "Noreclaim LRU"
+infrastructure and the use of this infrastructure to manage several types
+of "non-reclaimable" pages. The document attempts to provide the overall
+rationale behind this mechanism and the rationale for some of the design
+decisions that drove the implementation. The latter design rationale is
+discussed in the context of an implementation description. Admittedly, one
+can obtain the implementation details--the "what does it do?"--by reading the
+code. One hopes that the descriptions below add value by provide the answer
+to "why does it do that?".
+
+Noreclaim LRU Infrastructure:
+
+The Noreclaim LRU adds an additional LRU list to track non-reclaimable pages
+and to hide these pages from vmscan. This mechanism is based on a patch by
+Larry Woodman of Red Hat to address several scalability problems with page
+reclaim in Linux. The problems have been observed at customer sites on large
+memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB
+of main memory will have over 32 million 4k pages in a single zone. When a
+large fraction of these pages are not reclaimable for any reason [see below],
+vmscan will spend a lot of time scanning the LRU lists looking for the small
+fraction of pages that are reclaimable. This can result in a situation where
+all cpus are spending 100% of their time in vmscan for hours or days on end,
+with the system completely unresponsive.
+
+The Noreclaim LRU infrastructure addresses the following classes of
+non-reclaimable pages:
+
++ page owned by ram disks or ramfs
++ page mapped into SHM_LOCKed shared memory regions
++ page mapped into VM_LOCKED [mlock()ed] vmas
+
+The infrastructure might be able to handle other conditions that make pages
+nonreclaimable, either by definition or by circumstance, in the future.
+
+
+The Noreclaim LRU List
+
+The Noreclaim LRU infrastructure consists of an additional, per-zone, LRU list
+called the "noreclaim" list and an associated page flag, PG_noreclaim, to
+indicate that the page is being managed on the noreclaim list. The PG_noreclaim
+flag is analogous to, and mutually exclusive with, the PG_active flag in that
+it indicates on which LRU list a page resides when PG_lru is set. The
+noreclaim LRU list is source configurable based on the NORECLAIM_LRU Kconfig
+option.
+
+Why maintain nonreclaimable pages on an additional LRU list? The Linux memory
+management subsystem has well established protocols for managing pages on the
+LRU. Vmscan is based on LRU lists. LRU list exist per zone, and we want to
+maintain pages relative to their "home zone". All of these make the use of
+an additional list, parallel to the LRU active and inactive lists, a natural
+mechanism to employ. Note, however, that the noreclaim list does not
+differentiate between file backed and swap backed [anon] pages. This
+differentiation is only important while the pages are, in fact, reclaimable.
+
+The noreclaim LRU list benefits from the "arrayification" of the per-zone
+LRU lists and statistics originally proposed and posted by Christoph Lameter.
+
+Note that the noreclaim list does not use the lru pagevec mechanism. Rather,
+nonreclaimable pages are placed directly on the page's zone's noreclaim
+list under the zone lru_lock. The reason for this is to prevent stranding
+of pages on the noreclaim list when one task has the page isolated from the
+lru and other tasks are changing the "reclaimability" state of the page.
+
+
+Noreclaim LRU and Memory Controller Interaction
+
+The memory controller data structure automatically gets a per zone noreclaim
+lru list as a result of the "arrayification" of the per-zone LRU lists. The
+memory controller tracks the movement of pages to and from the noreclaim list.
+When a memory control group comes under memory pressure, the controller will
+not attempt to reclaim pages on the noreclaim list. This has a couple of
+effects. Because the pages are "hidden" from reclaim on the noreclaim list,
+the reclaim process can be more efficient, dealing only with pages that have
+a chance of being reclaimed. On the other hand, if too many of the pages
+charged to the control group are non-reclaimable, the reclaimable portion of the
+working set of the tasks in the control group may not fit into the available
+memory. This can cause the control group to thrash or to oom-kill tasks.
+
+
+Noreclaim LRU: Detecting Non-reclaimable Pages
+
+The function page_reclaimable(page, vma) in vmscan.c determines whether a
+page is reclaimable or not. For ramfs and ram disk [brd] pages and pages in
+SHM_LOCKed regions, page_reclaimable() tests a new address space flag,
+AS_NORECLAIM, in the page's address space using a wrapper function.
+Wrapper functions are used to set, clear and test the flag to reduce the
+requirement for #ifdef's throughout the source code. AS_NORECLAIM is set on
+ramfs inode/mapping when it is created and on ram disk inode/mappings at open
+time. This flag remains for the life of the inode.
+
+For shared memory regions, AS_NORECLAIM is set when an application successfully
+SHM_LOCKs the region and is removed when the region is SHM_UNLOCKed. Note that
+shmctl(SHM_LOCK, ...) does not populate the page tables for the region as does,
+for example, mlock(). So, we make no special effort to push any pages in the
+SHM_LOCKed region to the noreclaim list. Vmscan will do this when/if it
+encounters the pages during reclaim. On SHM_UNLOCK, shmctl() scans the pages
+in the region and "rescues" them from the noreclaim list if no other condition
+keeps them non-reclaimable. If a SHM_LOCKed region is destroyed, the pages
+are also "rescued" from the noreclaim list in the process of freeing them.
+
+page_reclaimable() detects mlock()ed pages by testing an additional page flag,
+PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a
+non-NULL vma is supplied, page_reclaimable() will check whether the vma is
+VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
+update the appropriate statistics if the vma is VM_LOCKED. This method allows
+efficient "culling" of pages in the fault path that are being faulted in to
+VM_LOCKED vmas.
+
+
+Non-reclaimable Pages and Vmscan [shrink_*_list()]
+
+If non-reclaimable pages are culled in the fault path, or moved to the
+noreclaim list at mlock() or mmap() time, vmscan will never encounter the pages
+until they have become reclaimable again, for example, via munlock() and have
+been "rescued" from the noreclaim list. However, there may be situations where
+we decide, for the sake of expediency, to leave a non-reclaimable page on one of
+the regular active/inactive LRU lists for vmscan to deal with. Vmscan checks
+for such pages in all of the shrink_{active|inactive|page}_list() functions and
+will "cull" such pages that it encounters--that is, it diverts those pages to
+the noreclaim list for the zone being scanned.
+
+There may be situations where a page is mapped into a VM_LOCKED vma, but the
+page is not marked as PageMlocked. Such pages will make it all the way to
+shrink_page_list() where they will be detected when vmscan walks the reverse
+map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list()
+will cull the page at that point.
+
+Note that for anonymous pages, shrink_page_list() attempts to add the page to
+the swap cache before it tries to unmap the page. To avoid this unnecessary
+consumption of swap space, shrink_page_list() calls try_to_unlock() to check
+whether any VM_LOCKED vmas map the page without attempting to unmap the page.
+If try_to_unlock() returns SWAP_MLOCK, shrink_page_list() will cull the page
+without consuming swap space. try_to_unlock() will be described below.
+
+
+Mlocked Page: Prior Work
+
+The "Noreclaim Mlocked Pages" infrastructure is based on work originally posted
+by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". Nick's
+posted his patch as an alternative to a patch posted by Christoph Lameter to
+achieve the same objective--hiding mlocked pages from vmscan. In Nick's patch,
+he used one of the struct page lru list link fields as a count of VM_LOCKED
+vmas that map the page. This use of the link field for a count prevent the
+management of the pages on an LRU list. When Nick's patch was integrated with
+the Noreclaim LRU work, the count was replaced by walking the reverse map to
+determine whether any VM_LOCKED vmas mapped the page. More on this below.
+The primary reason for wanting to keep mlocked pages on an LRU list is that
+mlocked pages are migratable, and the LRU list is used to arbitrate tasks
+attempting to migrate the same page. Whichever task succeeds in "isolating"
+the page from the LRU performs the migration.
+
+
+Mlocked Pages: Basic Management
+
+Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of
+nonreclaimable pages. When such a page has been "noticed" by the memory
+management subsystem, the page is marked with the PG_mlocked [PageMlocked()]
+flag. A PageMlocked() page will be placed on the noreclaim LRU list when
+it is added to the LRU. Pages can be "noticed" by memory management in
+several places:
+
+1) in the mlock()/mlockall() system call handlers.
+2) in the mmap() system call handler when mmap()ing a region with the
+ MAP_LOCKED flag, or mmap()ing a region in a task that has called
+ mlockall() with the MCL_FUTURE flag. Both of these conditions result
+ in the VM_LOCKED flag being set for the vma.
+3) in the fault path, if mlocked pages are "culled" in the fault path,
+ and when a VM_LOCKED stack segment is expanded.
+4) as mentioned above, in vmscan:shrink_page_list() with attempting to
+ reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_unlock().
+
+Mlocked pages become unlocked and rescued from the noreclaim list when:
+
+1) mapped in a range unlocked via the munlock()/munlockall() system calls.
+2) munmapped() out of the last VM_LOCKED vma that maps the page, including
+ unmapping at task exit.
+3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file.
+4) before a page is COWed in a VM_LOCKED vma.
+
+
+Mlocked Pages: mlock()/mlockall() System Call Handling
+
+Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
+for each vma in the range specified by the call. In the case of mlockall(),
+this is the entire active address space of the task. Note that mlock_fixup()
+is used for both mlock()ing and munlock()ing a range of memory. A call to
+mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED
+is treated as a no-op--mlock_fixup() simply returns.
+
+If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas"
+below, mlock_fixup() will attempt to merge the vma with its neighbors or split
+off a subset of the vma if the range does not cover the entire vma. Once the
+vma has been merged or split or neither, mlock_fixup() will call
+__mlock_vma_pages_range() to fault in the pages via get_user_pages() and
+to mark the pages as mlocked via mlock_vma_page().
+
+Note that the vma being mlocked might be mapped with PROT_NONE. In this case,
+get_user_pages() will be unable to fault in the pages. That's OK. If pages
+do end up getting faulted into this VM_LOCKED vma, we'll handle them in the
+fault path or in vmscan.
+
+Also note that a page returned by get_user_pages() could be truncated or
+migrated out from under us, while we're trying to mlock it. To detect
+this, __mlock_vma_pages_range() tests the page_mapping after acquiring
+the page lock. If the page is still associated with its mapping, we'll
+go ahead and call mlock_vma_page(). If the mapping is gone, we just
+unlock the page and move on. Worse case, this results in page mapped
+in a VM_LOCKED vma remaining on a normal LRU list without being
+PageMlocked(). Again, vmscan will detect and cull such pages.
+
+mlock_vma_page(), called with the page locked [N.B., not "mlocked"] will
+TestSetPageMlocked() for each page returned by get_user_pages(). We use
+TestSetPageMlocked() because the page might already be mlocked by another
+task/vma and we don't want to do extra work. We especially do not want to
+count an mlocked page more than once in the statistics. If the page was
+already mlocked, mlock_vma_page() is done.
+
+If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
+page from the LRU, as it is likely on the appropriate active or inactive list
+at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will
+putback the page--putback_lru_page()--which will notice that the page is now
+mlocked and divert the page to the zone's noreclaim LRU list. If
+mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
+it later if/when it attempts to reclaim the page.
+
+
+Mlocked Pages: Filtering Vmas
+
+mlock_fixup() filters several classes of "special" vmas:
+
+1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind
+ these mappings are inherently pinned, so we don't need to mark them as
+ mlocked. In any case, most of the pages have no struct page in which to
+ so mark the page. Because of this, get_user_pages() will fail for these
+ vmas, so there is no sense in attempting to visit them.
+
+2) vmas mapping hugetlbfs page are already effectively pinned into memory.
+ We don't need nor want to mlock() these pages. However, to preserve the
+ prior behavior of mlock()--before the noreclaim/mlock changes--mlock_fixup()
+ will call make_pages_present() in the hugetlbfs vma range to allocate the
+ huge pages and populate the ptes.
+
+3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of
+ kernel pages, such as the vdso page, relay channel pages, etc. These pages
+ are inherently non-reclaimable and are not managed on the LRU lists.
+ mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls
+ make_pages_present() to populate the ptes.
+
+Note that for all of these special vmas, mlock_fixup() does not set the
+VM_LOCKED flag. Therefore, we won't have to deal with them later during
+munlock() or munmap()--for example, at task exit. Neither does mlock_fixup()
+account these vmas against the task's "locked_vm".
+
+Mlocked Pages: Downgrading the Mmap Semaphore.
+
+mlock_fixup() must be called with the mmap semaphore held for write, because
+it may have to merge or split vmas. However, mlocking a large region of
+memory can take a long time--especially if vmscan must reclaim pages to
+satisfy the regions requirements. Faulting in a large region with the mmap
+semaphore held for write can hold off other faults on the address space, in
+the case of a multi-threaded task. It can also hold off scans of the task's
+address space via /proc. While testing under heavy load, it was observed that
+the ps(1) command could be held off for many minutes while a large segment was
+mlock()ed down.
+
+To address this issue, and to make the system more responsive during mlock()ing
+of large segments, mlock_fixup() downgrades the mmap semaphore to read mode
+during the call to __mlock_vma_pages_range(). This works fine. However, the
+callers of mlock_fixup() expect the semaphore to be returned in write mode.
+So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not
+support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore
+and reacquire it in write mode. In a multi-threaded task, it is possible for
+the task memory map to change while the semaphore is dropped. Therefore,
+mlock_fixup() looks up the vma at the range start address after reacquiring
+the semaphore in write mode and verifies that it still covers the original
+range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of
+mlock_fixup() have been changed to deal with this new error condition.
+
+Note: when munlocking a region, all of the pages should already be resident--
+unless we have racing threads mlocking() and munlocking() regions. So,
+unlocking should not have to wait for page allocations nor faults of any kind.
+Therefore mlock_fixup() does not downgrade the semaphore for munlock().
+
+
+Mlocked Pages: munlock()/munlockall() System Call Handling
+
+The munlock() and munlockall() system calls are handled by the same functions--
+do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock
+vs lock operation indicated by an argument. So, these system calls are also
+handled by mlock_fixup(). Again, if called for an already munlock()ed vma,
+mlock_fixup() simply returns. Because of the vma filtering discussed above,
+VM_LOCKED will not be set in any "special" vmas. So, these vmas will be
+ignored for munlock.
+
+If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off
+the specified range. The range is then munlocked via the function
+__munlock_vma_pages_range(). Because the vma access protections could have
+been changed to PROT_NONE after faulting in and mlocking some pages,
+get_user_pages() is unreliable for visiting these pages for munlocking. We
+don't want to leave pages mlocked(), so __munlock_vma_pages_range() uses a
+custom page table walker to find all pages mapped into the specified range.
+Note that this again assumes that all pages in the mlocked() range are resident
+and mapped by the task's page table.
+
+As with __mlock_vma_pages_range(), unlocking can race with truncation and
+migration. It is very important that munlock of a page succeeds, lest we
+leak pages by stranding them in the mlocked state on the noreclaim list.
+The munlock page walk pte handler resolves the race with page migration
+by checking the pte for a special swap pte indicating that the page is
+being migrated. If this is the case, the pte handler will wait for the
+migration entry to be replaced and then refetch the pte for the new page.
+Once the pte handler has locked the page, it checks the page_mapping to
+ensure that it still exists. If not, the handler unlocks the page and
+retries the entire process after refetching the pte.
+
+The munlock page walk pte handler unlocks individual pages by calling
+munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
+flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page()
+use the Test*PageMlocked() function to handle the case where the page might
+have already been unlocked by another task. If the page was mlocked,
+munlock_vma_page() updates that zone statistics for the number of mlocked
+pages. Note, however, that at this point we haven't checked whether the page
+is mapped by other VM_LOCKED vmas.
+
+We can't call try_to_unlock(), the function that walks the reverse map to check
+for other VM_LOCKED vmas, without first isolating the page from the LRU.
+try_to_unlock() is a variant of try_to_unmap() and thus requires that the page
+not be on an lru list. [More on these below.] However, the call to
+isolate_lru_page() could fail, in which case we couldn't try_to_unlock().
+So, we go ahead and clear PG_mlocked up front, as this might be the only chance
+we have. If we can successfully isolate the page, we go ahead and
+try_to_unlock(), which will restore the PG_mlocked flag and update the zone
+page statistics if it finds another vma holding the page mlocked. If we fail
+to isolate the page, we'll have left a potentially mlocked page on the LRU.
+This is fine, because we'll catch it later when/if vmscan tries to reclaim the
+page. This should be relatively rare.
+
+Mlocked Pages: Migrating Them...
+
+A page that is being migrated has been isolated from the lru lists and is
+held locked across unmapping of the page, updating the page's mapping
+[address_space] entry and copying the contents and state, until the
+page table entry has been replaced with an entry that refers to the new
+page. Linux supports migration of mlocked pages and other non-reclaimable
+pages. This involves simply moving the PageMlocked and PageNoreclaim states
+from the old page to the new page.
+
+Note that page migration can race with mlocking or munlocking of the same
+page. This has been discussed from the mlock/munlock perspective in the
+respective sections above. Both processes [migration, m[un]locking], hold
+the page locked. This provides the first level of synchronization. Page
+migration zeros out the page_mapping of the old page before unlocking it,
+so m[un]lock can skip these pages. However, as discussed above, munlock
+must wait for a migrating page to be replaced with the new page to prevent
+the new page from remaining mlocked outside of any VM_LOCKED vma.
+
+To ensure that we don't strand pages on the noreclaim list because of a
+race between munlock and migration, we must also prevent the munlock pte
+handler from acquiring the old or new page lock from the time that the
+migration subsystem acquires the old page lock, until either migration
+succeeds and the new page is added to the lru or migration fails and
+the old page is putback to the lru. The achieve this coordination,
+the migration subsystem places the new page on success, or the old
+page on failure, back on the lru lists before dropping the respective
+page's lock. It uses the putback_lru_page() function to accomplish this,
+which rechecks the page's overall reclaimability and adjusts the page
+flags accordingly. To free the old page on success or the new page on
+failure, the migration subsystem just drops what it knows to be the last
+page reference via put_page().
+
+
+Mlocked Pages: mmap(MAP_LOCKED) System Call Handling
+
+In addition the the mlock()/mlockall() system calls, an application can request
+that a region of memory be mlocked using the MAP_LOCKED flag with the mmap()
+call. Furthermore, any mmap() call or brk() call that expands the heap by a
+task that has previously called mlockall() with the MCL_FUTURE flag will result
+in the newly mapped memory being mlocked. Before the noreclaim/mlock changes,
+the kernel simply called make_pages_present() to allocate pages and populate
+the page table.
+
+To mlock a range of memory under the noreclaim/mlock infrastructure, the
+mmap() handler and task address space expansion functions call
+mlock_vma_pages_range() specifying the vma and the address range to mlock.
+mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in
+"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will
+have already been set by the caller, in filtered vmas. Thus these vma's need
+not be visited for munlock when the region is unmapped.
+
+For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
+fault/allocate the pages and mlock them. Again, like mlock_fixup(),
+mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
+attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore
+back to write mode before returning.
+
+The callers of mlock_vma_pages_range() will have already added the memory
+range to be mlocked to the task's "locked_vm". To account for filtered vmas,
+mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
+callers then subtract a non-negative return value from the task's locked_vm.
+A negative return value represent an error--for example, from get_user_pages()
+attempting to fault in a vma with PROT_NONE access. In this case, we leave
+the memory range accounted as locked_vm, as the protections could be changed
+later and pages allocated into that region.
+
+
+Mlocked Pages: munmap()/exit()/exec() System Call Handling
+
+When unmapping an mlocked region of memory, whether by an explicit call to
+munmap() or via an internal unmap from exit() or exec() processing, we must
+munlock the pages if we're removing the last VM_LOCKED vma that maps the pages.
+Before the noreclaim/mlock changes, mlocking did not mark the pages in any way,
+so unmapping them required no processing.
+
+To munlock a range of memory under the noreclaim/mlock infrastructure, the
+munmap() hander and task address space tear down function call
+munlock_vma_pages_all(). The name reflects the observation that one always
+specifies the entire vma range when munlock()ing during unmap of a region.
+Because of the vma filtering when mlocking() regions, only "normal" vmas that
+actually contain mlocked pages will be passed to munlock_vma_pages_all().
+
+munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup()
+for the munlock case, calls __munlock_vma_pages_range() to walk the page table
+for the vma's memory range and munlock_vma_page() each resident page mapped by
+the vma. This effectively munlocks the page, only if this is the last
+VM_LOCKED vma that maps the page.
+
+
+Mlocked Page: try_to_unmap()
+
+[Note: the code changes represented by this section are really quite small
+compared to the text to describe what happening and why, and to discuss the
+implications.]
+
+Pages can, of course, be mapped into multiple vmas. Some of these vmas may
+have VM_LOCKED flag set. It is possible for a page mapped into one or more
+VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one
+of the active or inactive LRU lists. This could happen if, for example, a
+task in the process of munlock()ing the page could not isolate the page from
+the LRU. As a result, vmscan/shrink_page_list() might encounter such a page
+as described in "Non-reclaimable Pages and Vmscan [shrink_*_list()]". To
+handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED
+vmas while it is walking a page's reverse map.
+
+try_to_unmap() is always called, by either vmscan for reclaim or for page
+migration, with the argument page locked and isolated from the LRU. BUG_ON()
+assertions enforce this requirement. Separate functions handle anonymous and
+mapped file pages, as these types of pages have different reverse map
+mechanisms.
+
+ try_to_unmap_anon()
+
+To unmap anonymous pages, each vma in the list anchored in the anon_vma must be
+visited--at least until a VM_LOCKED vma is encountered. If the page is being
+unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked
+pages are migratable. However, for reclaim, if the page is mapped into a
+VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap
+semphore of the mm_struct to which the vma belongs in read mode. If this is
+successful, try_to_unmap() will mlock the page via mlock_vma_page()--we
+wouldn't have gotten to try_to_unmap() if the page were already mlocked--and
+will return SWAP_MLOCK, indicating that the page is nonreclaimable. If the
+mmap semaphore cannot be acquired, we are not sure whether the page is really
+nonreclaimable or not. In this case, try_to_unmap() will return SWAP_AGAIN.
+
+ try_to_unmap_file() -- linear mappings
+
+Unmapping of a mapped file page works the same, except that the scan visits
+all vmas that maps the page's index/page offset in the page's mapping's
+reverse map priority search tree. It must also visit each vma in the page's
+mapping's non-linear list, if the list is non-empty. As for anonymous pages,
+on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will
+attempt to acquire the associated mm_struct's mmap semaphore to mlock the page,
+returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not.
+
+ try_to_unmap_file() -- non-linear mappings
+
+If a page's mapping contains a non-empty non-linear mapping vma list, then
+try_to_un{map|lock}() must also visit each vma in that list to determine
+whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit
+all vmas in the non-linear list to ensure that the pages is not/should not be
+mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate.
+However, there is no easy way to determine whether the page is actually mapped
+in a given vma--either for unmapping or testing whether the VM_LOCKED vma
+actually pins the page.
+
+So, try_to_unmap_file() handles non-linear mappings by scanning a certain
+number of pages--a "cluster"--in each non-linear vma associated with the page's
+mapping, for each file mapped page that vmscan tries to unmap. If this happens
+to unmap the page we're trying to unmap, try_to_unmap() will notice this on
+return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it
+will return SWAP_AGAIN, causing vmscan to recirculate this page. We take
+advantage of the cluster scan in try_to_unmap_cluster() as follows:
+
+For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap
+semaphore of the associated mm_struct for read without blocking. If this
+attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will
+retain the mmap semaphore for the scan; otherwise it drops it here. Then,
+for each page in the cluster, if we're holding the mmap semaphore for a locked
+vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This
+call is a no-op if the page is already locked, but will mlock any pages in
+the non-linear mapping that happen to be unlocked. If one of the pages so
+mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will
+return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan
+to cull the page, rather than recirculating it on the inactive list. Again,
+if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns
+SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but
+couldn't be mlocked.
+
+
+Mlocked pages: try_to_unlock() Reverse Map Scan
+
+TODO/FIXME: a better name might be page_mlocked()--analogous to the
+page_referenced() reverse map walker--especially if we continue to call this
+from shrink_page_list(). See related TODO/FIXME below.
+
+When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System
+Call Handling" above--tries to munlock a page, or when shrink_page_list()
+encounters an anonymous page that is not yet in the swap cache, they need to
+determine whether or not the page is mapped by any VM_LOCKED vma, without
+actually attempting to unmap all ptes from the page. For this purpose, the
+noreclaim/mlock infrastructure introduced a variant of try_to_unmap() called
+try_to_unlock().
+
+try_to_unlock() calls the same functions as try_to_unmap() for anonymous and
+mapped file pages with an additional argument specifing unlock versus unmap
+processing. Again, these functions walk the respective reverse maps looking
+for VM_LOCKED vmas. When such a vma is found for anonymous pages and file
+pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
+attempt to acquire the associated mmap semphore, mlock the page via
+mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs
+shrink_page_list() that the anonymous page should be culled rather than added
+to the swap cache in preparation for a try_to_unmap() that will almost
+certainly fail.
+
+If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap
+semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list()
+to recycle the page on the inactive list and hope that it has better luck
+with the page next time.
+
+For file pages mapped into non-linear vmas, the try_to_unlock() logic works
+slightly differently. On encountering a VM_LOCKED non-linear vma that might
+map the page, try_to_unlock() returns SWAP_AGAIN without actually mlocking
+the page. munlock_vma_page() will just leave the page unlocked and let
+vmscan deal with it--the usual fallback position.
+
+Note that try_to_unlock()'s reverse map walk must visit every vma in a pages'
+reverse map to determine that a page is NOT mapped into any VM_LOCKED vma.
+However, the scan can terminate when it encounters a VM_LOCKED vma and can
+successfully acquire the vma's mmap semphore for read and mlock the page.
+Although try_to_unlock() can be called many [very many!] times when
+munlock()ing a large region or tearing down a large address space that has been
+mlocked via mlockall(), overall this is a fairly rare event. In addition,
+although shrink_page_list() calls try_to_unlock() for every anonymous page that
+it handles that is not yet in the swap cache, on average anonymous pages will
+have very short reverse map lists.
+
+Mlocked Page: Page Reclaim in shrink_*_list()
+
+shrink_active_list() culls any obviously nonreclaimable pages--i.e.,
+!page_reclaimable(page, NULL)--diverting these to the noreclaim lru
+list. However, shrink_active_list() only sees nonreclaimable pages that
+made it onto the active/inactive lru lists. Note that these pages do not
+have PageNoreclaim set--otherwise, they would be on the noreclaim list and
+shrink_active_list would never see them.
+
+Some examples of these nonreclaimable pages on the LRU lists are:
+
+1) ramfs and ram disk pages that have been placed on the lru lists when
+ first allocated.
+
+2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to
+ allocate or fault in the pages in the shared memory region. This happens
+ when an application accesses the page the first time after SHM_LOCKing
+ the segment.
+
+3) Mlocked pages that could not be isolated from the lru and moved to the
+ noreclaim list in mlock_vma_page().
+
+3) Pages mapped into multiple VM_LOCKED vmas, but try_to_unlock() couldn't
+ acquire the vma's mmap semaphore to test the flags and set PageMlocked.
+ munlock_vma_page() was forced to let the page back on to the normal
+ LRU list for vmscan to handle.
+
+shrink_inactive_list() also culls any nonreclaimable pages that it finds
+on the inactive lists, again diverting them to the appropriate zone's noreclaim
+lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became
+SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or
+pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from
+the lru to recheck via try_to_unlock(). shrink_inactive_list() won't notice
+the latter, but will pass on to shrink_page_list().
+
+shrink_page_list() again culls obviously nonreclaimable pages that it could
+encounter for similar reason to shrink_inactive_list(). As already discussed,
+shrink_page_list() proactively looks for anonymous pages that should have
+PG_mlocked set but don't--these would not be detected by page_reclaimable()--to
+avoid adding them to the swap cache unnecessarily. File pages mapped into
+VM_LOCKED vmas but without PG_mlocked set will make it all the way to
+try_to_unmap(). shrink_page_list() will divert them to the noreclaim list when
+try_to_unmap() returns SWAP_MLOCK, as discussed above.
+
+TODO/FIXME: If we can enhance the swap cache to reliably remove entries
+with page_count(page) > 2, as long as all ptes are mapped to the page and
+not the swap entry, we can probably remove the call to try_to_unlock() in
+shrink_page_list() and just remove the page from the swap cache when
+try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page()
+doesn't seem to allow that.
+
+
--
All Rights Reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 49+ messages in thread