From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm <linux-mm@kvack.org>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Rik van Riel <riel@redhat.com>, Nick Piggin <npiggin@suse.de>
Subject: [PATCH] Update Unevictable LRU and Mlocked Pages documentation
Date: Wed, 30 Jul 2008 17:13:59 -0400
Message-ID: <1217452439.7676.26.camel@lts-notebook>
Against: [27-rc1+]mmotm-080730-0356
Update to: doc-unevictable-lru-and-mlocked-pages-documentation.patch
Update the unevictable LRU documentation based on review feedback and on
the rework and fixes that came out of testing.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation/vm/unevictable-lru.txt | 170 +++++++++++++++++------------------
1 file changed, 84 insertions(+), 86 deletions(-)
Index: linux-2.6.27-rc1-mmotm-30jul/Documentation/vm/unevictable-lru.txt
===================================================================
--- linux-2.6.27-rc1-mmotm-30jul.orig/Documentation/vm/unevictable-lru.txt 2008-07-30 16:17:07.000000000 -0400
+++ linux-2.6.27-rc1-mmotm-30jul/Documentation/vm/unevictable-lru.txt 2008-07-30 16:37:31.000000000 -0400
@@ -26,7 +26,7 @@ with the system completely unresponsive.
The Unevictable LRU infrastructure addresses the following classes of
unevictable pages:
-+ page owned by ram disks or ramfs
++ page owned by ramfs
+ page mapped into SHM_LOCKed shared memory regions
+ page mapped into VM_LOCKED [mlock()ed] vmas
@@ -44,14 +44,21 @@ it indicates on which LRU list a page re
unevictable LRU list is source configurable based on the UNEVICTABLE_LRU Kconfig
option.
-Why maintain unevictable pages on an additional LRU list? The Linux memory
-management subsystem has well established protocols for managing pages on the
-LRU. Vmscan is based on LRU lists. LRU list exist per zone, and we want to
-maintain pages relative to their "home zone". All of these make the use of
-an additional list, parallel to the LRU active and inactive lists, a natural
-mechanism to employ. Note, however, that the unevictable list does not
-differentiate between file backed and swap backed [anon] pages. This
-differentiation is only important while the pages are, in fact, evictable.
+Why maintain unevictable pages on an additional LRU list? Primarily because
+we want to be able to migrate unevictable pages between nodes--for memory
+defragmentation, workload management and memory hotplug. The Linux kernel can
+only migrate pages that it can successfully isolate from the lru lists.
+Therefore, we want to keep the unevictable pages on an lru-like list, where
+they can be found by isolate_lru_page().
+
+Secondarily, the Linux memory management subsystem has well established
+protocols for managing pages on the LRU. Vmscan is based on LRU lists.
+LRU lists exist per zone, and we want to maintain pages relative to their
+"home zone". All of these make the use of an additional list, parallel to
+the LRU active and inactive lists, a natural mechanism to employ. Note,
+however, that the unevictable list does not differentiate between file backed
+and swap backed [anon] pages. This differentiation is only important while
+the pages are, in fact, evictable.
The unevictable LRU list benefits from the "arrayification" of the per-zone
LRU lists and statistics originally proposed and posted by Christoph Lameter.
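
Aside (illustration only, not part of the patch): the migration rationale above
is visible from userspace via move_pages(2). The sketch below assumes libnuma
is available (compile with -lnuma), that a second NUMA node (node 1) exists,
and that RLIMIT_MEMLOCK permits the mlock(); it migrates a page that is
mlock()ed, and therefore unevictable:

    #include <numaif.h>               /* move_pages(); link with -lnuma */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            void *buf;
            void *pages[1];
            int nodes[1]  = { 1 };    /* target node; assumes node 1 exists */
            int status[1] = { -1 };

            if (posix_memalign(&buf, 4096, 4096))   /* one 4 KiB page    */
                    return 1;
            *(char *)buf = 1;                       /* fault the page in */
            if (mlock(buf, 4096))                   /* now unevictable   */
                    perror("mlock");

            /* Migrate the mlock()ed page to another node.  This works only
             * because the page sits on an (unevictable) LRU list from which
             * the kernel can isolate it. */
            pages[0] = buf;
            if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
                    perror("move_pages");
            printf("page status/node after move: %d\n", status[0]);

            munlock(buf, 4096);
            free(buf);
            return 0;
    }

If the page could not be isolated from an LRU-like list, the move_pages()
request above would simply fail for that page.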
@@ -81,23 +88,23 @@ memory. This can cause the control grou
Unevictable LRU: Detecting Unevictable Pages
The function page_evictable(page, vma) in vmscan.c determines whether a
-page is evictable or not. For ramfs and ram disk [brd] pages and pages in
-SHM_LOCKed regions, page_evictable() tests a new address space flag,
-AS_UNEVICTABLE, in the page's address space using a wrapper function.
-Wrapper functions are used to set, clear and test the flag to reduce the
-requirement for #ifdef's throughout the source code. AS_UNEVICTABLE is set on
-ramfs inode/mapping when it is created and on ram disk inode/mappings at open
-time. This flag remains for the life of the inode.
-
-For shared memory regions, AS_UNEVICTABLE is set when an application successfully
-SHM_LOCKs the region and is removed when the region is SHM_UNLOCKed. Note that
-shmctl(SHM_LOCK, ...) does not populate the page tables for the region as does,
-for example, mlock(). So, we make no special effort to push any pages in the
-SHM_LOCKed region to the unevictable list. Vmscan will do this when/if it
-encounters the pages during reclaim. On SHM_UNLOCK, shmctl() scans the pages
-in the region and "rescues" them from the unevictable list if no other condition
-keeps them unevictable. If a SHM_LOCKed region is destroyed, the pages
-are also "rescued" from the unevictable list in the process of freeing them.
+page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions,
+page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's
+address space using a wrapper function. Wrapper functions are used to set,
+clear and test the flag to reduce the requirement for #ifdef's throughout the
+source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created.
+This flag remains for the life of the inode.
+
+For shared memory regions, AS_UNEVICTABLE is set when an application
+successfully SHM_LOCKs the region and is removed when the region is
+SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page
+tables for the region as does, for example, mlock(). So, we make no special
+effort to push any pages in the SHM_LOCKed region to the unevictable list.
+Vmscan will do this when/if it encounters the pages during reclaim. On
+SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the
+unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed
+region is destroyed, the pages are also "rescued" from the unevictable list in
+the process of freeing them.
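
Aside (illustration only, not part of the patch): the SHM_LOCK/SHM_UNLOCK
transitions described above can be exercised with a few lines of userspace
code. Nothing below touches the segment's pages, so the lock itself populates
nothing; it may require CAP_IPC_LOCK or sufficient RLIMIT_MEMLOCK:

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
            /* Create a 16-page private segment. */
            int id = shmget(IPC_PRIVATE, 16 * 4096, IPC_CREAT | 0600);
            if (id < 0) {
                    perror("shmget");
                    return 1;
            }

            /* Sets AS_UNEVICTABLE on the segment's mapping; does NOT
             * populate page tables or touch any pages. */
            if (shmctl(id, SHM_LOCK, NULL))
                    perror("shmctl(SHM_LOCK)");

            /* Clears AS_UNEVICTABLE and "rescues" any resident pages
             * from the unevictable list. */
            if (shmctl(id, SHM_UNLOCK, NULL))
                    perror("shmctl(SHM_UNLOCK)");

            shmctl(id, IPC_RMID, NULL);
            return 0;
    }
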
page_evictable() detects mlock()ed pages by testing an additional page flag,
PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a
@@ -110,13 +117,13 @@ VM_LOCKED vmas.
Unevictable Pages and Vmscan [shrink_*_list()]
-If unevictable pages are culled in the fault path, or moved to the
-unevictable list at mlock() or mmap() time, vmscan will never encounter the pages
-until they have become evictable again, for example, via munlock() and have
-been "rescued" from the unevictable list. However, there may be situations where
-we decide, for the sake of expediency, to leave a unevictable page on one of
-the regular active/inactive LRU lists for vmscan to deal with. Vmscan checks
-for such pages in all of the shrink_{active|inactive|page}_list() functions and
+If unevictable pages are culled in the fault path, or moved to the unevictable
+list at mlock() or mmap() time, vmscan will never encounter the pages until
+they have become evictable again, for example, via munlock() and have been
+"rescued" from the unevictable list. However, there may be situations where we
+decide, for the sake of expediency, to leave an unevictable page on one of the
+regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for
+such pages in all of the shrink_{active|inactive|page}_list() functions and
will "cull" such pages that it encounters--that is, it diverts those pages to
the unevictable list for the zone being scanned.
@@ -133,22 +140,30 @@ whether any VM_LOCKED vmas map the page
If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page
without consuming swap space. try_to_munlock() will be described below.
+To "cull" an unevictable page, vmscan simply puts the page back on the lru
+list using putback_lru_page()--the inverse operation to isolate_lru_page()--
+after dropping the page lock. Because the condition which makes the page
+unevictable may change once the page is unlocked, putback_lru_page() will
+recheck the unevictable state of a page that it places on the unevictable lru
+list. If the page has become evictable, putback_lru_page() removes it from
+the list and retries, including the page_evictable() test. Because such a
+race is a rare event and movement of pages onto the unevictable list should be
+rare, these extra evictability checks should not occur in the majority of calls
+to putback_lru_page().
+
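Aside (not part of the patch): the place-then-recheck-then-retry flow described
above, stripped to its control flow. This is a self-contained userspace model,
not the mm/vmscan.c source; struct page_model, model_page_evictable() and
model_putback_lru_page() are stand-ins invented for the illustration:

    #include <stdbool.h>
    #include <stdio.h>

    /* Minimal stand-ins; the real code operates on struct page and the
     * per-zone LRU lists. */
    enum lru_list { LRU_EVICTABLE, LRU_UNEVICTABLE };

    struct page_model {
            bool mlocked;             /* models PG_mlocked      */
            bool mapping_unevictable; /* models AS_UNEVICTABLE  */
            enum lru_list lru;
    };

    static bool model_page_evictable(const struct page_model *p)
    {
            return !p->mlocked && !p->mapping_unevictable;
    }

    /* Models the placement + recheck + retry described in the text: after
     * placing a page on the unevictable list, re-test its evictability;
     * if it has become evictable in the meantime, pull it back off
     * ("isolate" it) and redo the placement. */
    static void model_putback_lru_page(struct page_model *p)
    {
    redo:
            if (model_page_evictable(p)) {
                    p->lru = LRU_EVICTABLE;   /* active/inactive list */
                    return;
            }
            p->lru = LRU_UNEVICTABLE;
            if (model_page_evictable(p)) {
                    /* raced: another task cleared the condition while the
                     * page was unlocked; isolate and retry */
                    goto redo;
            }
    }

    int main(void)
    {
            struct page_model p = { .mlocked = true };

            model_putback_lru_page(&p);
            printf("page landed on %s list\n",
                   p.lru == LRU_UNEVICTABLE ? "the unevictable"
                                            : "an evictable");
            return 0;
    }
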
Mlocked Page: Prior Work
-The "Unevictable Mlocked Pages" infrastructure is based on work originally posted
-by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". Nick's
-posted his patch as an alternative to a patch posted by Christoph Lameter to
-achieve the same objective--hiding mlocked pages from vmscan. In Nick's patch,
-he used one of the struct page lru list link fields as a count of VM_LOCKED
-vmas that map the page. This use of the link field for a count prevent the
-management of the pages on an LRU list. When Nick's patch was integrated with
-the Unevictable LRU work, the count was replaced by walking the reverse map to
-determine whether any VM_LOCKED vmas mapped the page. More on this below.
-The primary reason for wanting to keep mlocked pages on an LRU list is that
-mlocked pages are migratable, and the LRU list is used to arbitrate tasks
-attempting to migrate the same page. Whichever task succeeds in "isolating"
-the page from the LRU performs the migration.
+The "Unevictable Mlocked Pages" infrastructure is based on work originally
+posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
+Nick posted his patch as an alternative to a patch posted by Christoph
+Lameter to achieve the same objective--hiding mlocked pages from vmscan.
+In Nick's patch, he used one of the struct page lru list link fields as a count
+of VM_LOCKED vmas that map the page. This use of the link field for a count
+prevented the management of the pages on an LRU list. When Nick's patch was
+integrated with the Unevictable LRU work, the count was replaced by walking the
+reverse map to determine whether any VM_LOCKED vmas mapped the page. More on
+this below.
Mlocked Pages: Basic Management
@@ -209,7 +224,7 @@ unlock the page and move on. Worse case
in a VM_LOCKED vma remaining on a normal LRU list without being
PageMlocked(). Again, vmscan will detect and cull such pages.
-mlock_vma_page(), called with the page locked [N.B., not "mlocked"] will
+mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will
TestSetPageMlocked() for each page returned by get_user_pages(). We use
TestSetPageMlocked() because the page might already be mlocked by another
task/vma and we don't want to do extra work. We especially do not want to
@@ -225,7 +240,7 @@ mlock_vma_page() is unable to isolate th
it later if/when it attempts to reclaim the page.
-Mlocked Pages: Filtering Vmas
+Mlocked Pages: Filtering Special Vmas
mlock_fixup() filters several classes of "special" vmas:
@@ -295,26 +310,17 @@ ignored for munlock.
If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off
the specified range. The range is then munlocked via the function
-__munlock_vma_pages_range(). Because the vma access protections could have
-been changed to PROT_NONE after faulting in and mlocking some pages,
-get_user_pages() is unreliable for visiting these pages for munlocking. We
-don't want to leave pages mlocked(), so __munlock_vma_pages_range() uses a
-custom page table walker to find all pages mapped into the specified range.
-Note that this again assumes that all pages in the mlocked() range are resident
-and mapped by the task's page table.
-
-As with __mlock_vma_pages_range(), unlocking can race with truncation and
-migration. It is very important that munlock of a page succeeds, lest we
-leak pages by stranding them in the mlocked state on the unevictable list.
-The munlock page walk pte handler resolves the race with page migration
-by checking the pte for a special swap pte indicating that the page is
-being migrated. If this is the case, the pte handler will wait for the
-migration entry to be replaced and then refetch the pte for the new page.
-Once the pte handler has locked the page, it checks the page_mapping to
-ensure that it still exists. If not, the handler unlocks the page and
-retries the entire process after refetching the pte.
+__mlock_vma_pages_range()--the same function used to mlock a vma range--
+passing a flag to indicate that munlock() is being performed.
+
+Because the vma access protections could have been changed to PROT_NONE after
+faulting in and mlocking some pages, get_user_pages() is unreliable for
+visiting these pages for munlocking. Since we don't want to leave pages
+mlocked(), get_user_pages() was enhanced to accept a flag to ignore the
+permissions when fetching the pages--all of which should be resident as a
+result of previous mlock()ing.
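
Aside (illustration only, not part of the patch): the PROT_NONE case is easy to
reproduce from userspace. In the sketch below (which assumes 4 KiB pages and
the default RLIMIT_MEMLOCK), the final munlock() must still clear PG_mlocked on
the resident pages even though the vma can no longer be faulted through:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 4 * 4096;            /* assumes 4 KiB pages */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            if (mlock(p, len))                /* faults pages in, mlocks them */
                    perror("mlock");
            if (mprotect(p, len, PROT_NONE))  /* vma is now inaccessible      */
                    perror("mprotect");

            /* munlock() must still find and clear PG_mlocked on every
             * resident page in the range, despite PROT_NONE; otherwise the
             * pages would be stranded on the unevictable list. */
            if (munlock(p, len))
                    perror("munlock");

            munmap(p, len);
            return 0;
    }
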
-The munlock page walk pte handler unlocks individual pages by calling
+For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page()
use the Test*PageMlocked() function to handle the case where the page might
@@ -351,23 +357,16 @@ page. This has been discussed from the
respective sections above. Both processes [migration, m[un]locking], hold
the page locked. This provides the first level of synchronization. Page
migration zeros out the page_mapping of the old page before unlocking it,
-so m[un]lock can skip these pages. However, as discussed above, munlock
-must wait for a migrating page to be replaced with the new page to prevent
-the new page from remaining mlocked outside of any VM_LOCKED vma.
-
-To ensure that we don't strand pages on the unevictable list because of a
-race between munlock and migration, we must also prevent the munlock pte
-handler from acquiring the old or new page lock from the time that the
-migration subsystem acquires the old page lock, until either migration
-succeeds and the new page is added to the lru or migration fails and
-the old page is putback to the lru. The achieve this coordination,
-the migration subsystem places the new page on success, or the old
-page on failure, back on the lru lists before dropping the respective
-page's lock. It uses the putback_lru_page() function to accomplish this,
-which rechecks the page's overall evictability and adjusts the page
-flags accordingly. To free the old page on success or the new page on
-failure, the migration subsystem just drops what it knows to be the last
-page reference via put_page().
+so m[un]lock can skip these pages by testing the page mapping under page
+lock.
+
+When completing page migration, we place the new and old pages back onto the
+lru after dropping the page lock. The "unneeded" page--old page on success,
+new page on failure--will be freed when the reference count held by the
+migration process is released. To ensure that we don't strand pages on the
+unevictable list because of a race between munlock and migration, page
+migration uses the putback_lru_page() function to add migrated pages back to
+the lru.
Mlocked Pages: mmap(MAP_LOCKED) System Call Handling
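
Aside (illustration only, not part of the patch): for reference, the
mmap(MAP_LOCKED) handling discussed in this section is triggered by a mapping
like the one below; the sketch assumes 4 KiB pages and the default
RLIMIT_MEMLOCK:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 4 * 4096;            /* assumes 4 KiB pages */

            /* MAP_LOCKED asks mmap() itself to mlock the new mapping, so
             * its pages are handled like mlock()ed pages from the start
             * (subject to RLIMIT_MEMLOCK). */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap(MAP_LOCKED)");
                    return 1;
            }
            p[0] = 1;
            munmap(p, len);
            return 0;
    }
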
@@ -566,8 +565,7 @@ shrink_active_list would never see them.
Some examples of these unevictable pages on the LRU lists are:
-1) ramfs and ram disk pages that have been placed on the lru lists when
- first allocated.
+1) ramfs pages that have been placed on the lru lists when first allocated.
2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to
allocate or fault in the pages in the shared memory region. This happens
--