* [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-25 18:37 UTC
To: linux-kernel
Cc: linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm
In AIM7 runs, recent kernels start swapping out anonymous pages
well before they should. This is due to shrink_list falling
through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
when all we really wanted to do was pre-age some anonymous pages to
give them extra time to be referenced while on the inactive list.
The obvious fix is to make sure that shrink_list does not fall
through to scanning/reclaiming inactive pages when we called it
to scan one of the active lists.
This change should be safe because the loop in shrink_zone ensures
that we will still shrink the anon and file inactive lists whenever
we should.
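
For reference, the shrink_zone() loop referred to above looks roughly like
this in 2.6.32-era mm/vmscan.c (paraphrased; exact names such as
sc->swap_cluster_max may differ slightly between versions):

        /* paraphrased from shrink_zone(), not part of this patch */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(l) {
                        if (nr[l]) {
                                nr_to_scan = min(nr[l], sc->swap_cluster_max);
                                nr[l] -= nr_to_scan;
                                nr_reclaimed += shrink_list(l, nr_to_scan,
                                                        zone, sc, priority);
                        }
                }
        }

        /*
         * Even when no anon pages get reclaimed, shrink_zone() still
         * rebalances the anon active/inactive ratio at the end.
         */
        if (inactive_anon_is_low(zone, sc))
                shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);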
Reported-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..ec4dfda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
-		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+	if (lru == LRU_ACTIVE_FILE) {
+		if (inactive_file_is_low(zone, sc))
+			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 
-	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
-		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+	if (lru == LRU_ACTIVE_ANON) {
+		if (inactive_file_is_low(zone, sc))
+			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Johannes Weiner @ 2009-11-25 20:35 UTC
To: Rik van Riel
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

Hello all,

On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:

> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should.  This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do was pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.

I do not quite understand what changed 'recently'.

That fall-through logic to keep eating inactives when the ratio is off
came in a year ago with the second-chance-for-anon-pages patch..?

> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

It was not so obvious to me ;)

At first, I thought it would make sense to actively rebalance between
the lists if the inactive one grows too large (the fall-through case).

But shrink_zone() does not know about this and although we scan
inactive pages, we do not account for them and decrease the 'nr[lru]'
for active pages instead, effectively shifting the 'active todo' over
to the 'inactive todo'.  I can imagine this going wrong!

So I agree, we should use the inactive_*_is_low() predicate only
passively.

> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-25 20:47 UTC
To: Johannes Weiner
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

On 11/25/2009 03:35 PM, Johannes Weiner wrote:
> Hello all,
>
> On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:
>
>> In AIM7 runs, recent kernels start swapping out anonymous pages
>> well before they should.  This is due to shrink_list falling
>> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
>> when all we really wanted to do was pre-age some anonymous pages to
>> give them extra time to be referenced while on the inactive list.
>
> I do not quite understand what changed 'recently'.
>
> That fall-through logic to keep eating inactives when the ratio is off
> came in a year ago with the second-chance-for-anon-pages patch..?

The confusion comes from my use of the word "recently" here.  Larry
started testing with RHEL 5 as the baseline.

And yeah - I believe the code may well have been wrong ever since it
was merged.  The indentation just looked so pretty that no one spotted
the bug.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you.

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: KOSAKI Motohiro @ 2009-11-26 2:50 UTC
To: Rik van Riel
Cc: kosaki.motohiro, linux-kernel, linux-mm, lwoodman, Tomasz Chmielewski, akpm

> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should.  This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do was pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.
>
> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

Good catch!

> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..ec4dfda 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  {
>  	int file = is_file_lru(lru);
>
> -	if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_FILE) {
> +		if (inactive_file_is_low(zone, sc))
> +			shrink_active_list(nr_to_scan, zone, sc, priority, file);
>  		return 0;
>  	}
>
> -	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_ANON) {
> +		if (inactive_file_is_low(zone, sc))

This inactive_file_is_low() should be inactive_anon_is_low().
Cut-n-paste programming often causes this kind of mistake.  Probably we
need to clean this function up some more.  How about this?
(this is an incremental patch on top of yours)

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8f61c0..80e94a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1467,22 +1467,25 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 	return low;
 }
 
+static int inactive_list_is_low(struct zone *zone, struct scan_control *sc, int file)
+{
+	if (file)
+		return inactive_file_is_low(zone, sc);
+	else
+		return inactive_anon_is_low(zone, sc);
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_FILE) {
-		if (inactive_file_is_low(zone, sc))
+	if (is_active_lru(lru)) {
+		if (inactive_list_is_low(zone, sc, file))
 			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 
-	if (lru == LRU_ACTIVE_ANON) {
-		if (inactive_file_is_low(zone, sc))
-			shrink_active_list(nr_to_scan, zone, sc, priority, file);
-		return 0;
-	}
 	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
 }
-- 
1.6.5.2

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-26 2:57 UTC
To: KOSAKI Motohiro
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

On 11/25/2009 09:50 PM, KOSAKI Motohiro wrote:
>
>> -	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
>> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> +	if (lru == LRU_ACTIVE_ANON) {
>> +		if (inactive_file_is_low(zone, sc))
>
> This inactive_file_is_low() should be inactive_anon_is_low().
> Cut-n-paste programming often causes this kind of mistake.  Probably we
> need to clean this function up some more.
>
> How about this? (this is an incremental patch on top of yours)

Doh!  Nice catch...

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

* [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-11-30 22:00 UTC
To: linux-kernel
Cc: linux-mm, akpm

[-- Attachment #1: Type: text/plain, Size: 4499 bytes --]

While running workloads that do lots of process forking, process exiting
and page reclamation (AIM 7) on large systems, very high system time
(100%) and lots of lock contention were observed.

CPU5:
 [<ffffffff814afb48>] ? _spin_lock+0x27/0x48
 [<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
 [<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
 [<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
 [<ffffffff8105ea07>] ? do_fork+0x151/0x330
 [<ffffffff81058407>] ? default_wake_function+0x0/0x36
 [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
 [<ffffffff810121d3>] ? stub_clone+0x13/0x20
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU4:
 [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
 [<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
 [<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
 [<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
 [<ffffffff8105ce4c>] ? mmput+0x55/0xd9
 [<ffffffff81061afd>] ? exit_mm+0x109/0x129
 [<ffffffff81063846>] ? do_exit+0x1d7/0x712
 [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
 [<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
 [<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU0:
 [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
 [<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
 [<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
 [<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
 [<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
 [<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
 [<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
 [<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
 [<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
 [<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
 [<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
 [<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
 [<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
 [<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
 [<ffffffff81130db8>] ? user_path_at+0x65/0xa3
 [<ffffffff8105ea07>] ? do_fork+0x151/0x330
 [<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
 [<ffffffff810121d3>] ? stub_clone+0x13/0x20
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

------------------------------------------------------------------------------
   PerfTop:     864 irqs/sec  kernel:99.7% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   RIP                kernel function
             _______   _____   ________________   ________________

             3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
              670.00 - 15.6% - ffffffff81101a33 : page_check_address
              165.00 -  3.8% - ffffffffa01cbc39 : rpc_sleep_on   [sunrpc]
               40.00 -  0.9% - ffffffff81102113 : try_to_unmap_one
               29.00 -  0.7% - ffffffff81101c65 : page_referenced_one
               27.00 -  0.6% - ffffffff81101964 : vma_address
                8.00 -  0.2% - ffffffff8125a5a0 : clear_page_c
                6.00 -  0.1% - ffffffff8125a5f0 : copy_page_c
                6.00 -  0.1% - ffffffff811023ca : try_to_unmap_anon
                5.00 -  0.1% - ffffffff810fb014 : copy_page_range
                5.00 -  0.1% - ffffffff810e4d18 : get_page_from_freelist

The cause was determined to be the unconditional call to page_referenced()
for every mapped page encountered in shrink_active_list().
page_referenced() takes the anon_vma->lock and calls page_referenced_one()
for each vma.  page_referenced_one() then calls page_check_address(), which
takes the pte_lockptr spinlock.  If several CPUs are doing this at the same
time there is a lot of pte_lockptr spinlock contention with the
anon_vma->lock held.  This causes contention on the anon_vma->lock,
stalling the fork and exit paths, and very high system time.

Before the splitLRU patch shrink_active_list() would only call
page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
set when the priority worked its way from 12 all the way down to 7.  This
prevented page_referenced() from being called from shrink_active_list()
until the system was really struggling to reclaim memory.

One way to prevent this is to change page_check_address() to execute a
spin_trylock(ptl) when it is called by shrink_active_list() and simply
fail if it cannot get the pte_lockptr spinlock.  This will make
shrink_active_list() consider the page not referenced and allow the
anon_vma->lock to be dropped much more quickly.

The attached patch does just that, thoughts???

[-- Attachment #2: pageout.diff --]
[-- Type: text/x-patch, Size: 3419 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..65b841d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **, int);
+				unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 1888b2d..35be29d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,7 +188,7 @@ retry:
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
+		pte = page_check_address(page, mm, address, &ptl, 1, 0);
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
diff --git a/mm/ksm.c b/mm/ksm.c
index 5575f86..8abb14b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	if (addr == -EFAULT)
 		goto out;
 
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
+	ptep = page_check_address(page, mm, addr, &ptl, 0, 0);
 	if (!ptep)
 		goto out;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index dd43373..4e4eb8e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
  * On success returns with pte mapped and locked.
  */
 pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp, int sync)
+			  unsigned long address, spinlock_t **ptlp, int sync, int try)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	}
 
 	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	if (try) {
+		if (!spin_trylock(ptl)) {
+			pte_unmap(pte);
+			return NULL;
+		}
+	} else
+		spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
 		return pte;
@@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	address = vma_address(page, vma);
 	if (address == -EFAULT)		/* out of vma range */
 		return 0;
-	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1);
+	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0);
 	if (!pte)			/* the page is not in this mm */
 		return 0;
 	pte_unmap_unlock(pte, ptl);
@@ -352,7 +358,7 @@ static int page_referenced_one(struct page *page,
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
+	pte = page_check_address(page, mm, address, &ptl, 0, 1);
 	if (!pte)
 		goto out;
 
@@ -547,7 +553,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 1);
+	pte = page_check_address(page, mm, address, &ptl, 1, 0);
 	if (!pte)
 		goto out;
 
@@ -774,7 +780,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
+	pte = page_check_address(page, mm, address, &ptl, 0, 0);
 	if (!pte)
 		goto out;
 

* Re: [RFC] high system time & lock contention running large mixed workload
From: Andrea Arcangeli @ 2009-12-01 10:04 UTC
To: Larry Woodman
Cc: linux-kernel, linux-mm, akpm

On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> set when the priority worked its way from 12 all the way down to 7.  This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.

page_referenced should never be called and nobody should touch ptes
until priority went down to 7.  This is a regression in splitLRU that
should be fixed.  With light VM pressure we should never touch ptes ever.

> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it is called by shrink_active_list() and simply
> fail if it cannot get the pte_lockptr spinlock.  This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much more quickly.
>
> The attached patch does just that, thoughts???

Just stop calling page_referenced there...

Even if we ignore the above, one problem later in skipping over the PT
lock is also to assume the page is not referenced when it actually is,
so it won't be activated again when page_referenced is called again to
move the page back to the active list...  Not the end of the world to
lose a young bit sometimes, though.

There may be all the reasons in the world why we have to mess with ptes
when there's light VM pressure, for whatever terabyte machine or
whatever workload that performs better that way, but I know in 100% of
my systems I don't ever want the VM to touch ptes when there's light
VM pressure, no matter what.  So if you want the default to be messing
with ptes, just give me a sysctl knob to let me run faster.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-01 12:31 UTC
To: Andrea Arcangeli
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Rik van Riel

> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> > Before the splitLRU patch shrink_active_list() would only call
> > page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> > set when the priority worked its way from 12 all the way down to 7.  This
> > prevented page_referenced() from being called from shrink_active_list()
> > until the system was really struggling to reclaim memory.
>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7.  This is a regression in splitLRU that
> should be fixed.  With light VM pressure we should never touch ptes ever.

Ummm, I can't agree with this.  7 is too low a priority; if a large
system gets down to prio==7, it already has an unacceptably big latency
problem.  If it were only prio==DEF_PRIORITY or so, I could probably
agree with you.

> > One way to prevent this is to change page_check_address() to execute a
> > spin_trylock(ptl) when it is called by shrink_active_list() and simply
> > fail if it cannot get the pte_lockptr spinlock.  This will make
> > shrink_active_list() consider the page not referenced and allow the
> > anon_vma->lock to be dropped much more quickly.
> >
> > The attached patch does just that, thoughts???
>
> Just stop calling page_referenced there...
>
> Even if we ignore the above, one problem later in skipping over the PT
> lock is also to assume the page is not referenced when it actually is,
> so it won't be activated again when page_referenced is called again to
> move the page back to the active list...  Not the end of the world to
> lose a young bit sometimes, though.
>
> There may be all the reasons in the world why we have to mess with ptes
> when there's light VM pressure, for whatever terabyte machine or
> whatever workload that performs better that way, but I know in 100% of
> my systems I don't ever want the VM to touch ptes when there's light
> VM pressure, no matter what.  So if you want the default to be messing
> with ptes, just give me a sysctl knob to let me run faster.

Um.  Avoiding lock contention under light VM pressure is more important
than strict LRU order.  I guess we don't need a knob.

* Re: [RFC] high system time & lock contention running large mixed workload
From: Andrea Arcangeli @ 2009-12-01 12:46 UTC
To: KOSAKI Motohiro
Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Rik van Riel

On Tue, Dec 01, 2009 at 09:31:09PM +0900, KOSAKI Motohiro wrote:
> Ummm, I can't agree with this.  7 is too low a priority; if a large
> system gets down to prio==7, it already has an unacceptably big latency
> problem.  If it were only prio==DEF_PRIORITY or so, I could probably
> agree with you.

I took the number 7 purely as mentioned by Larry about the old code,
but I don't mind what the actual breakpoint level is where we start to
send the ipi flood to destroy all userland tlbs mapping the page so the
young bit can be set by the cpu on the old pte.  If you agree with me
that at the lowest priority we shouldn't flood ipis and destroy tlbs
when there's plenty of unmapped clean cache, we already agree ;).  If
that's 7 or DEF_PRIORITY-1, that's ok.  All I care about is that it
escalates gradually: first clean cache and re-activate mapped pages,
then when we're low on clean cache we start to check ptes and unmap
whatever is not found referenced.

> Um.  Avoiding lock contention under light VM pressure is more important
> than strict LRU order.  I guess we don't need a knob.

Hope so indeed.  It's not just the lock contention, which is exacerbated
by certain workloads; even in the total absence of any lock contention
I generally dislike the cpu waste itself of the pte loop to clear the
young bit, and the interruption of userland as well when it receives a
tlb flush for no good reason, because 99% of the time plenty of
unmapped clean cache is available.  I know this performs best, even if
there will always be someone who wants mapped and unmapped cache to be
treated totally equal in lru terms (which then makes me wonder why
there are still & VM_EXEC magics in vmscan.c if all pages shall be
treated equal in the lru...  especially given VM_EXEC is often
meaningless [because potentially randomly set] unlike page_mapcount
[which is never randomly set]), which is the reason I mentioned the
knob.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-02 2:02 UTC
To: Andrea Arcangeli
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Rik van Riel

Hi

> > Avoiding lock contention under light VM pressure is more important
> > than strict LRU order.  I guess we don't need a knob.
>
> Hope so indeed.  It's not just the lock contention, which is exacerbated
> by certain workloads; even in the total absence of any lock contention
> I generally dislike the cpu waste itself of the pte loop to clear the
> young bit, and the interruption of userland as well when it receives a
> tlb flush for no good reason, because 99% of the time plenty of
> unmapped clean cache is available.  I know this performs best, even if
> there will always be someone who wants mapped and unmapped cache to be
> treated totally equal in lru terms (which then makes me wonder why
> there are still & VM_EXEC magics in vmscan.c if all pages shall be
> treated equal in the lru...  especially given VM_EXEC is often
> meaningless [because potentially randomly set] unlike page_mapcount
> [which is never randomly set]), which is the reason I mentioned the
> knob.

Umm??  I'm puzzled.  If most pages on the LRU are unmapped file cache,
the pte walk is not costly.  Conversely, if most pages on the LRU are
mapped, we have to do the pte walk; otherwise no pages get deactivated
and the system runs into big latency trouble.  I don't want vmscan to
focus on peak speed and ignore the worst case; that isn't proper
behavior in a memory shortage situation.  So I'd like to focus on
solving the lock contention issue.

Of course, if this causes any trouble for KVM or other real-world
usage, I'll fix it.  If you have seen any trouble with the current VM,
please tell us.

[I (and Hugh at least) dislike the VM_EXEC logic too, but that seems
off topic.]

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:04 UTC
To: Andrea Arcangeli
Cc: KOSAKI Motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki

Andrea Arcangeli wrote:
> I took the number 7 purely as mentioned by Larry about the old code,
> but I don't mind what the actual breakpoint level is where we start to
> send the ipi flood to destroy all userland tlbs mapping the page so the
> young bit can be set by the cpu on the old pte.  If you agree with me
> that at the lowest priority we shouldn't flood ipis and destroy tlbs
> when there's plenty of unmapped clean cache, we already agree ;).  If
> that's 7 or DEF_PRIORITY-1, that's ok.  All I care about is that it
> escalates gradually: first clean cache and re-activate mapped pages,
> then when we're low on clean cache we start to check ptes and unmap
> whatever is not found referenced.

The code already does what you propose.

It takes a heavy AIM7 run for Larry to run into the lock contention
issue.  I suspect that the page cache was already very small by the
time the lock contention issue was triggered.

Larry, do you have any more info on the state of the VM when you see
the lock contention?

Also, do you have the latest patches to shrink_list() by Kosaki and me
applied?

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:00 UTC
To: Andrea Arcangeli
Cc: Larry Woodman, linux-kernel, linux-mm, akpm

On 12/01/2009 05:04 AM, Andrea Arcangeli wrote:
> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
>
>> Before the splitLRU patch shrink_active_list() would only call
>> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
>> set when the priority worked its way from 12 all the way down to 7.  This
>> prevented page_referenced() from being called from shrink_active_list()
>> until the system was really struggling to reclaim memory.
>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7.  This is a regression in splitLRU that
> should be fixed.  With light VM pressure we should never touch ptes ever.

You appear to have not read the code, either.

The VM should not look at the active anon list much, unless it has a
good reason to start evicting anonymous pages.  Yes, there was a bug in
shrink_list(), but Kosaki and I just posted patches to fix that.

As for page_referenced not being called until priority goes down to 7 -
that is one of the root causes the old VM did not scale.  The number of
pages that need to be scanned to get down to that point is staggeringly
huge on systems with 1TB of RAM - a much larger number than we should
EVER scan in the pageout code.

There is no way we could go back to that heuristic.  It fell apart
before and it would continue to fall apart if we reintroduced it.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-01 12:23 UTC
To: Larry Woodman
Cc: kosaki.motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrea Arcangeli

(cc to some related people)

> The cause was determined to be the unconditional call to page_referenced()
> for every mapped page encountered in shrink_active_list().
> page_referenced() takes the anon_vma->lock and calls page_referenced_one()
> for each vma.  page_referenced_one() then calls page_check_address(), which
> takes the pte_lockptr spinlock.  If several CPUs are doing this at the same
> time there is a lot of pte_lockptr spinlock contention with the
> anon_vma->lock held.  This causes contention on the anon_vma->lock,
> stalling the fork and exit paths, and very high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> set when the priority worked its way from 12 all the way down to 7.  This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it is called by shrink_active_list() and simply
> fail if it cannot get the pte_lockptr spinlock.  This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much more quickly.
>
> The attached patch does just that, thoughts???

At first look,

 - We certainly have to fix this issue.
 - But your patch is a bit risky.

Your patch treats trylock(pte-lock) failure as "not accessed".  But
lock contention generally implies there is a contending peer; IOW, the
page typically does have its referenced bit set.  Then the next
shrink_inactive_list() moves it back to the active list again.  That's
a suboptimal result.

However, we can't simply treat lock contention as page-is-referenced
either; if we did, the system could easily go into OOM.

So, is

	if (priority < DEF_PRIORITY - 2)
		page_referenced()
	else
		page_referenced_trylock()

better?  On typical workloads vmscan almost always runs at
DEF_PRIORITY.  So if the priority==DEF_PRIORITY case doesn't cause
heavy lock contention, the system doesn't need to mind the contention;
anyway, we can't avoid contention when the system is under heavy memory
pressure.

BTW, the current shrink_active_list() has an unnecessary
page_mapping_inuse() call.  It prevents dropping the page reference bit
from unmapped cache pages, which means we protect unmapped cache pages
more than mapped pages.  That is strange.

Unfortunately, I don't have enough development time today.  I'll work
on it tomorrow.
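
In code, the suggestion above amounts to something like the sketch below
(illustrative only; page_referenced_trylock() is the hypothetical variant
named in the mail, not an existing function):

        /* sketch of the priority gate, not a patch from this thread */
        if (priority < DEF_PRIORITY - 2)
                /* real reclaim pressure: take the pte lock as usual */
                referenced = page_referenced(page, 0, sc->mem_cgroup, &vm_flags);
        else
                /* light pressure: give up on pte lock contention */
                referenced = page_referenced_trylock(page, 0, sc->mem_cgroup, &vm_flags);

Larry's follow-up patch below implements the same gate by passing
"priority < DEF_PRIORITY-2 ? 0 : 1" as an extra argument to
page_referenced() rather than adding a separate helper.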

* Re: [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-12-01 16:41 UTC
To: KOSAKI Motohiro
Cc: linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Arcangeli

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

On Tue, 2009-12-01 at 21:23 +0900, KOSAKI Motohiro wrote:
> (cc to some related people)
>
> At first look,
>
>  - We certainly have to fix this issue.
>  - But your patch is a bit risky.
>
> Your patch treats trylock(pte-lock) failure as "not accessed".  But
> lock contention generally implies there is a contending peer; IOW, the
> page typically does have its referenced bit set.  Then the next
> shrink_inactive_list() moves it back to the active list again.  That's
> a suboptimal result.
>
> However, we can't simply treat lock contention as page-is-referenced
> either; if we did, the system could easily go into OOM.
>
> So, is
>
> 	if (priority < DEF_PRIORITY - 2)
> 		page_referenced()
> 	else
> 		page_referenced_trylock()
>
> better?  On typical workloads vmscan almost always runs at
> DEF_PRIORITY.  So if the priority==DEF_PRIORITY case doesn't cause
> heavy lock contention, the system doesn't need to mind the contention;
> anyway, we can't avoid contention when the system is under heavy memory
> pressure.

Agreed.  The attached updated patch only does a trylock in the
page_referenced() call from shrink_active_list() and only for anonymous
pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
have never seen a problem like this with active pagecache pages and it
does not alter the existing shrink_page_list() behavior.  What do you
think about this???

[-- Attachment #2: pageout.diff --]
[-- Type: text/x-patch, Size: 6796 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..d7eaeca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -80,7 +80,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
+			struct mem_cgroup *cnt, unsigned long *vm_flags, int try);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
 	TTU_MIGRATION = 1,		/* migration mode */
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-			unsigned long, spinlock_t **, int);
+			unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
*/ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..d8afe1a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int try) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (try) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return NULL; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,7 +359,7 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, try); if (!pte) goto out; @@ -396,7 +403,8 @@ out: static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -417,7 +425,7 @@ static int page_referenced_anon(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, try); if (!mapcount) break; } @@ -482,7 +490,7 @@ static int page_referenced_file(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } @@ -504,7 +512,8 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { int referenced = 0; @@ -515,7 +524,7 @@ int page_referenced(struct page *page, if (page_mapped(page) && page->mapping) { if (PageAnon(page)) referenced += page_referenced_anon(page, mem_cont, - vm_flags); + vm_flags, try); else if (is_locked) referenced += page_referenced_file(page, mem_cont, vm_flags); @@ -547,7 +556,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = 
page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +783,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..fff63a0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,7 +643,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. * If page which have PG_mlocked lost isoltation race, @@ -1355,7 +1355,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, /* page_referenced clears PageReferenced */ if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + page_referenced(page, 0, sc->mem_cgroup, &vm_flags, + priority<DEF_PRIORITY-2?0:1)) { nr_rotated++; /* * Identify referenced, file-backed active pages and ^ permalink raw reply [flat|nested] 28+ messages in thread

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:20 UTC
To: Larry Woodman
Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

On 12/01/2009 11:41 AM, Larry Woodman wrote:
>
> Agreed.  The attached updated patch only does a trylock in the
> page_referenced() call from shrink_active_list() and only for anonymous
> pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
> have never seen a problem like this with active pagecache pages and it
> does not alter the existing shrink_page_list() behavior.  What do you
> think about this???

This is reasonable, except for the fact that pages that are moved to
the inactive list without having the referenced bit cleared are
guaranteed to be moved back to the active list.

You'll be better off without that excess list movement, by simply
moving pages directly back onto the active list if the trylock fails.

Yes, this means that page_referenced can now return 3 different return
values (not accessed, accessed, lock contended), which should probably
be an enum so we can test for the values symbolically in the calling
functions.

That way only pages where we did manage to clear the referenced bit
will be moved onto the inactive list.  This not only reduces the amount
of excess list movement, it also makes sure that the pages which do get
onto the inactive list get a fair chance at being referenced again,
instead of potentially being flooded out by pages where the trylock
failed.

A minor nitpick: maybe it would be good to rename the "try" parameter
to "noblock".  That more closely matches the requested behaviour.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-02 2:41 UTC
To: Rik van Riel
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli

> On 12/01/2009 11:41 AM, Larry Woodman wrote:
> >
> > Agreed.  The attached updated patch only does a trylock in the
> > page_referenced() call from shrink_active_list() and only for anonymous
> > pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
> > have never seen a problem like this with active pagecache pages and it
> > does not alter the existing shrink_page_list() behavior.  What do you
> > think about this???
>
> This is reasonable, except for the fact that pages that are moved to
> the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock fails.
>
> Yes, this means that page_referenced can now return 3 different return
> values (not accessed, accessed, lock contended), which should probably
> be an enum so we can test for the values symbolically in the calling
> functions.
>
> That way only pages where we did manage to clear the referenced bit
> will be moved onto the inactive list.  This not only reduces the amount
> of excess list movement, it also makes sure that the pages which do get
> onto the inactive list get a fair chance at being referenced again,
> instead of potentially being flooded out by pages where the trylock
> failed.

Agreed.

> A minor nitpick: maybe it would be good to rename the "try" parameter
> to "noblock".  That more closely matches the requested behaviour.

Another minor nit: we probably also have to rename page_referenced().
The name implies testing the referenced bit, but we use it to clear the
referenced bit in shrink_active_list().

* Re: [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-12-03 22:14 UTC
To: Rik van Riel
Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

[-- Attachment #1: Type: text/plain, Size: 1926 bytes --]

On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote:

> This is reasonable, except for the fact that pages that are moved to
> the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock fails.

The attached patch addresses this issue by changing page_check_address()
to return -1 if the spin_trylock() fails, and page_referenced_one() to
return 1 in that path, so the page gets moved back to the active list.

Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute has
196491 isolated_anon pages.  This means that ~6140 processes are somewhere
down in try_to_free_pages(), since we only isolate 32 pages at a time;
this is out of 9000 processes...

---------------------------------------------------------------------
active_anon:2140361 inactive_anon:453356 isolated_anon:196491
 active_file:3438 inactive_file:1100 isolated_file:0
 unevictable:2802 dirty:153 writeback:0 unstable:0
 free:578920 slab_reclaimable:49214 slab_unreclaimable:93268
 mapped:1105 shmem:0 pagetables:139100 bounce:0
Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB
 active_anon:7835452kB inactive_anon:785764kB active_file:13672kB
 inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB
 isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB
 writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB
 slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB
 unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
202895 total pagecache pages
197629 pages in swap cache
Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005
Free swap  = 65881196kB
Total swap = 67354616kB
3997696 pages RAM
207046 pages reserved
1688629 pages shared
3016248 pages non-shared

[-- Attachment #2: page_referenced.patch --]
[-- Type: text/x-patch, Size: 7402 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..03a10f7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -80,7 +80,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
+			struct mem_cgroup *cnt, unsigned long *vm_flags, int trylock);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
 	TTU_MIGRATION = 1,		/* migration mode */
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-			unsigned long, spinlock_t **, int);
+			unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
@@ -135,7 +135,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); static inline int page_referenced(struct page *page, int is_locked, struct mem_cgroup *cnt, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { *vm_flags = 0; return TestClearPageReferenced(page); diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 1888b2d..35be29d 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ retry: address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..e066833 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int trylock) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (trylock) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return (pte_t *)-1; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, trylock); if (!pte) goto out; + else if (pte == (pte_t *)-1) + return 1; /* * Don't want to elevate referenced for mlocked page that gets this far, @@ -396,7 +405,8 @@ out: static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -417,7 +427,7 @@ static int page_referenced_anon(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, trylock); if (!mapcount) break; } @@ -482,7 +492,7 @@ static int 
page_referenced_file(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } @@ -497,6 +507,7 @@ static int page_referenced_file(struct page *page, * @is_locked: caller holds lock on the page * @mem_cont: target memory controller * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * @trylock: use spin_trylock to prevent high lock contention. * * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. @@ -504,7 +515,8 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { int referenced = 0; @@ -515,7 +527,7 @@ int page_referenced(struct page *page, if (page_mapped(page) && page->mapping) { if (PageAnon(page)) referenced += page_referenced_anon(page, mem_cont, - vm_flags); + vm_flags, trylock); else if (is_locked) referenced += page_referenced_file(page, mem_cont, vm_flags); @@ -547,7 +559,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +786,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..fff63a0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,7 +643,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. * If page which have PG_mlocked lost isoltation race, @@ -1355,7 +1355,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, /* page_referenced clears PageReferenced */ if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + page_referenced(page, 0, sc->mem_cgroup, &vm_flags, + priority<DEF_PRIORITY-2?0:1)) { nr_rotated++; /* * Identify referenced, file-backed active pages and ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-03 22:14 ` Larry Woodman @ 2009-12-04 0:29 ` Rik van Riel 2009-12-04 21:26 ` Larry Woodman 2009-12-04 0:36 ` KOSAKI Motohiro 1 sibling, 1 reply; 28+ messages in thread From: Rik van Riel @ 2009-12-04 0:29 UTC (permalink / raw) To: Larry Woodman Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/03/2009 05:14 PM, Larry Woodman wrote: > The attached patch addresses this issue by changing page_check_address() > to return -1 if the spin_trylock() fails and page_referenced_one() to > return 1 in that path so the page gets moved back to the active list. Your patch forgot to add the code to vmscan.c to actually move the page back to the active list. Also, please use an enum for the page_referenced return values, so the code in vmscan.c can use symbolic names. enum page_reference { NOT_REFERENCED, REFERENCED, LOCK_CONTENDED, }; -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
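(For illustration only: a minimal sketch of the vmscan.c side Rik is asking for, combining the enum above with the page_referenced() signature from Larry's patch. This is not from any patch posted in this thread; the exact placement inside shrink_active_list(), the ClearPageActive() step, and the priority test are assumptions taken from the surrounding code.)

	/* hypothetical caller change in shrink_active_list() -- sketch only */
	referenced = page_referenced(page, 0, sc->mem_cgroup, &vm_flags,
				     priority < DEF_PRIORITY - 2 ? 0 : 1);

	if (referenced == LOCK_CONTENDED) {
		/* pte lock contended: nothing was learned about the page,
		 * so rotate it back onto the active list instead of
		 * deactivating it with a stale referenced bit */
		list_add(&page->lru, &l_active);
		continue;
	}

	if (referenced == REFERENCED) {
		nr_rotated++;
		/* executable file pages keep their extra round, as before */
		if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
			list_add(&page->lru, &l_active);
			continue;
		}
	}

	ClearPageActive(page);	/* we are de-activating */
	list_add(&page->lru, &l_inactive);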
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 0:29 ` Rik van Riel @ 2009-12-04 21:26 ` Larry Woodman 2009-12-06 21:04 ` Rik van Riel 0 siblings, 1 reply; 28+ messages in thread From: Larry Woodman @ 2009-12-04 21:26 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli [-- Attachment #1: Type: text/plain, Size: 665 bytes --] On Thu, 2009-12-03 at 19:29 -0500, Rik van Riel wrote: > On 12/03/2009 05:14 PM, Larry Woodman wrote: > > > The attached patch addresses this issue by changing page_check_address() > > to return -1 if the spin_trylock() fails and page_referenced_one() to > > return 1 in that path so the page gets moved back to the active list. > > Your patch forgot to add the code to vmscan.c to actually move > the page back to the active list. Right > > Also, please use an enum for the page_referenced return > values, so the code in vmscan.c can use symbolic names. > > enum page_reference { > NOT_REFERENCED, > REFERENCED, > LOCK_CONTENDED, > }; > Here it is: [-- Attachment #2: page_referenced.patch --] [-- Type: text/x-patch, Size: 11269 bytes --] diff --git a/include/linux/rmap.h b/include/linux/rmap.h index cb0ba70..2b931c8 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -76,11 +76,17 @@ static inline void page_dup_rmap(struct page *page) atomic_inc(&page->_mapcount); } +enum page_referenced { + NOT_REFERENCED, + REFERENCED, + LOCK_CONTENDED, +}; + /* * Called from mm/vmscan.c to handle paging out */ int page_referenced(struct page *, int is_locked, - struct mem_cgroup *cnt, unsigned long *vm_flags); + struct mem_cgroup *cnt, unsigned long *vm_flags, int trylock); enum ttu_flags { TTU_UNMAP = 0, /* unmap mode */ TTU_MIGRATION = 1, /* migration mode */ @@ -99,7 +105,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags); * Called from mm/filemap_xip.c to unmap empty zero page */ pte_t *page_check_address(struct page *, struct mm_struct *, - unsigned long, spinlock_t **, int); + unsigned long, spinlock_t **, int, int); /* * Used by swapoff to help locate where page is expected in vma. @@ -135,10 +141,11 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); static inline int page_referenced(struct page *page, int is_locked, struct mem_cgroup *cnt, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { *vm_flags = 0; - return TestClearPageReferenced(page); + return (TestClearPageReferenced(page)?REFERENCED:NOT_REFERENCED); } #define try_to_unmap(page, refs) SWAP_FAIL diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 1888b2d..35be29d 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ retry: address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..2b15a18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int trylock) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (trylock) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return (pte_t *)-1; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, trylock); if (!pte) goto out; + else if (pte == (pte_t *)-1) + return LOCK_CONTENDED; /* * Don't want to elevate referenced for mlocked page that gets this far, @@ -391,21 +400,24 @@ out_unmap: out: if (referenced) *vm_flags |= vma->vm_flags; - return referenced; + return (referenced?REFERENCED:NOT_REFERENCED); } static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { unsigned int mapcount; struct anon_vma *anon_vma; struct vm_area_struct *vma; int referenced = 0; + int locked = 0; + int ret; anon_vma = page_lock_anon_vma(page); if (!anon_vma) - return referenced; + return NOT_REFERENCED; mapcount = page_mapcount(page); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { @@ -416,14 +428,19 @@ static int page_referenced_anon(struct page *page, */ if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; - referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + ret = page_referenced_one(page, vma, + &mapcount, vm_flags, trylock); + if (ret == LOCK_CONTENDED) + locked++; + else if (ret == REFERENCED) + referenced++; + if (!mapcount) break; } page_unlock_anon_vma(anon_vma); - return referenced; + return (locked?LOCK_CONTENDED:referenced?REFERENCED:NOT_REFERENCED); } /** @@ -482,13 +499,13 @@ static int page_referenced_file(struct page *page, if (mem_cont && 
!mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } spin_unlock(&mapping->i_mmap_lock); - return referenced; + return (referenced?REFERENCED:NOT_REFERENCED); } /** @@ -497,6 +514,7 @@ static int page_referenced_file(struct page *page, * @is_locked: caller holds lock on the page * @mem_cont: target memory controller * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * @trylock: use spin_trylock to prevent high lock contention. * * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. @@ -504,9 +522,11 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { int referenced = 0; + int ret = NOT_REFERENCED; if (TestClearPageReferenced(page)) referenced++; @@ -514,25 +534,28 @@ int page_referenced(struct page *page, *vm_flags = 0; if (page_mapped(page) && page->mapping) { if (PageAnon(page)) - referenced += page_referenced_anon(page, mem_cont, - vm_flags); + ret = page_referenced_anon(page, mem_cont, + vm_flags, trylock); else if (is_locked) - referenced += page_referenced_file(page, mem_cont, + ret = page_referenced_file(page, mem_cont, vm_flags); else if (!trylock_page(page)) referenced++; else { - if (page->mapping) - referenced += page_referenced_file(page, + if (page->mapping) { + ret = page_referenced_file(page, mem_cont, vm_flags); + } unlock_page(page); } } + if (ret == REFERENCED) + referenced++; if (page_test_and_clear_young(page)) referenced++; - return referenced; + return (ret == LOCK_CONTENDED?LOCK_CONTENDED:(referenced?REFERENCED:NOT_REFERENCED)); } static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) @@ -547,7 +570,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +797,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..5543278 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,14 +643,14 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. 
* If page which have PG_mlocked lost isoltation race, * try_to_unmap moves it to unevictable list */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page) + (referenced == REFERENCED) && page_mapping_inuse(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; @@ -686,7 +686,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (PageDirty(page)) { - if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced) + if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && (referenced == REFERENCED)) goto keep_locked; if (!may_enter_fs) goto keep_locked; @@ -1320,6 +1320,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, struct page *page; struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); unsigned long nr_rotated = 0; + int referenced; lru_add_drain(); spin_lock_irq(&zone->lru_lock); @@ -1354,21 +1355,27 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { - nr_rotated++; - /* - * Identify referenced, file-backed active pages and - * give them one more trip around the active list. So - * that executable code get better chances to stay in - * memory under moderate memory pressure. Anon pages - * are not likely to be evicted by use-once streaming - * IO, plus JVM can create lots of anon VM_EXEC pages, - * so we ignore them here. - */ - if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) { - list_add(&page->lru, &l_active); - continue; + if (page_mapping_inuse(page)) { + + referenced = page_referenced(page, 0, sc->mem_cgroup, + &vm_flags, priority<DEF_PRIORITY-2?0:1); + + if (referenced != NOT_REFERENCED) { + nr_rotated++; + /* + * Identify referenced, file-backed active pages and + * give them one more trip around the active list. So + * that executable code get better chances to stay in + * memory under moderate memory pressure. Anon pages + * are not likely to be evicted by use-once streaming + * IO, plus JVM can create lots of anon VM_EXEC pages, + * so we ignore them here. + */ + if (((vm_flags & VM_EXEC) && page_is_file_cache(page)) || + (referenced == LOCK_CONTENDED)) { + list_add(&page->lru, &l_active); + continue; + } } } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 21:26 ` Larry Woodman @ 2009-12-06 21:04 ` Rik van Riel 0 siblings, 0 replies; 28+ messages in thread From: Rik van Riel @ 2009-12-06 21:04 UTC (permalink / raw) To: Larry Woodman Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/04/2009 04:26 PM, Larry Woodman wrote: > > Here it is: Kosaki's patch series looks a lot more complete. Does Kosaki's patch series resolve your issue? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-03 22:14 ` Larry Woodman 2009-12-04 0:29 ` Rik van Riel @ 2009-12-04 0:36 ` KOSAKI Motohiro 2009-12-04 19:31 ` Larry Woodman 1 sibling, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-04 0:36 UTC (permalink / raw) To: Larry Woodman Cc: kosaki.motohiro, Rik van Riel, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote: > > > This is reasonable, except for the fact that pages that are moved > > to the inactive list without having the referenced bit cleared are > > guaranteed to be moved back to the active list. > > > > You'll be better off without that excess list movement, by simply > > moving pages directly back onto the active list if the trylock > > fails. > > > > > The attached patch addresses this issue by changing page_check_address() > to return -1 if the spin_trylock() fails and page_referenced_one() to > return 1 in that path so the page gets moved back to the active list. > > Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute > has 196491 isolated_anon pages. This means that ~6140 processes are > somewhere down in try_to_free_pages() since we only isolate 32 pages at > a time, this is out of 9000 processes... > > > --------------------------------------------------------------------- > active_anon:2140361 inactive_anon:453356 isolated_anon:196491 > active_file:3438 inactive_file:1100 isolated_file:0 > unevictable:2802 dirty:153 writeback:0 unstable:0 > free:578920 slab_reclaimable:49214 slab_unreclaimable:93268 > mapped:1105 shmem:0 pagetables:139100 bounce:0 > > Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB > active_anon:7835452kB inactive_anon:785764kB active_file:13672kB > inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB > isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB > writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB > slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > > 202895 total pagecache pages > 197629 pages in swap cache > Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005 > Free swap = 65881196kB > Total swap = 67354616kB > 3997696 pages RAM > 207046 pages reserved > 1688629 pages shared > 3016248 pages non-shared This seems we have to improve reclaim bale out logic. the system already have 1.5GB free pages. IOW, the system don't need swap-out anymore. > @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, > if (address == -EFAULT) > goto out; > > - pte = page_check_address(page, mm, address, &ptl, 0); > + pte = page_check_address(page, mm, address, &ptl, 0, trylock); > if (!pte) > goto out; > + else if (pte == (pte_t *)-1) > + return 1; > > /* > * Don't want to elevate referenced for mlocked page that gets this far, Sorry, NAK. I have to say the same thing of Rik's previous mention. shrink_active_list() ignore the return value of page_referenced(). then above 'return 1' is meaningless. Umm, ok, I'll make the patch myself. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 0:36 ` KOSAKI Motohiro @ 2009-12-04 19:31 ` Larry Woodman 0 siblings, 0 replies; 28+ messages in thread From: Larry Woodman @ 2009-12-04 19:31 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On Fri, 2009-12-04 at 09:36 +0900, KOSAKI Motohiro wrote: > > > > Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute > > has 196491 isolated_anon pages. This means that ~6140 processes are > > somewhere down in try_to_free_pages() since we only isolate 32 pages at > > a time, this is out of 9000 processes... > > > > > > --------------------------------------------------------------------- > > active_anon:2140361 inactive_anon:453356 isolated_anon:196491 > > active_file:3438 inactive_file:1100 isolated_file:0 > > unevictable:2802 dirty:153 writeback:0 unstable:0 > > free:578920 slab_reclaimable:49214 slab_unreclaimable:93268 > > mapped:1105 shmem:0 pagetables:139100 bounce:0 > > > > Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB > > active_anon:7835452kB inactive_anon:785764kB active_file:13672kB > > inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB > > isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB > > writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB > > slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB > > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > > > > 202895 total pagecache pages > > 197629 pages in swap cache > > Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005 > > Free swap = 65881196kB > > Total swap = 67354616kB > > 3997696 pages RAM > > 207046 pages reserved > > 1688629 pages shared > > 3016248 pages non-shared > > This seems we have to improve reclaim bale out logic. the system already > have 1.5GB free pages. IOW, the system don't need swap-out anymore. > > Whats going on here is there are about 7500 runable processes and the memory is already low. A process runs, requests memory and eventually ends up in try_to_free_pages. Since the page reclaim code calls cond_resched()in several places so the scheduler eventually puts that process on the run queue and runs another process which does the same thing. Eventually you end up with thousands of runnable processes in the page reclaim code continuing to reclaim even though there is plenty of free memory. procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 7480 421 1474396 2240060 7640 12988 23 27 24 40 14 40 12 45 43 1 0 7524 405 1474772 2224504 7644 13460 759 689 764 689 8932 11076 8 92 0 0 0 7239 401 1474164 2210572 7656 13524 596 592 596 596 8809 10494 9 91 0 0 0 BTW, this is easy to reproduce. Fork thousands of processes that collectively overcommit memory and keep them allocating... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
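(To make the "bail out" idea above concrete: one possible shape of such a check, sketched only from this discussion. The helper name enough_memory_freed(), its use of the low watermark, and where it would be called from are assumptions, not code from any posted patch.)

	/* hypothetical helper for mm/vmscan.c -- sketch only */
	static int enough_memory_freed(struct zonelist *zonelist,
				       struct scan_control *sc)
	{
		struct zoneref *z;
		struct zone *zone;

		/* If an eligible zone is back above its low watermark,
		 * other reclaimers have already freed enough memory and
		 * this task can stop scanning. */
		for_each_zone_zonelist(zone, z, zonelist,
				       gfp_zone(sc->gfp_mask)) {
			if (zone_watermark_ok(zone, sc->order,
					      low_wmark_pages(zone),
					      zone_idx(zone), 0))
				return 1;
		}
		return 0;
	}

A direct reclaimer could test something like this at the top of the priority loop in do_try_to_free_pages() and return early, so thousands of runnable processes do not keep scanning after free memory has already recovered.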
* [PATCH] Clear reference bit although page isn't mapped. 2009-12-01 12:23 ` KOSAKI Motohiro 2009-12-01 16:41 ` Larry Woodman @ 2009-12-02 2:55 ` KOSAKI Motohiro 2009-12-02 3:07 ` Rik van Riel 1 sibling, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-02 2:55 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. > it prevent to drop page reference bit from unmapped cache page. it mean > we protect unmapped cache page than mapped page. it is strange. How about this? --------------------------------- SplitLRU VM replacement algorithm assume shrink_active_list() clear the page's reference bit. but unnecessary page_mapping_inuse() test prevent it. This patch remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/vmscan.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 00156f2..3e942b5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1347,8 +1347,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { nr_rotated++; /* * Identify referenced, file-backed active pages and -- 1.6.5.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Clear reference bit although page isn't mapped. 2009-12-02 2:55 ` [PATCH] Clear reference bit although page isn't mapped KOSAKI Motohiro @ 2009-12-02 3:07 ` Rik van Riel 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Rik van Riel @ 2009-12-02 3:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: >> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. >> it prevent to drop page reference bit from unmapped cache page. it mean >> we protect unmapped cache page than mapped page. it is strange. >> > How about this? > > --------------------------------- > SplitLRU VM replacement algorithm assume shrink_active_list() clear > the page's reference bit. but unnecessary page_mapping_inuse() test > prevent it. > > This patch remove it. > Shrink_page_list ignores the referenced bit on pages that are !page_mapping_inuse(). if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced && page_mapping_inuse(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; The reason we leave the referenced bit on unmapped pages is that we want the next reference to a deactivated page cache page to move that page back to the active list. We do not want to require that such a page gets accessed twice before being reactivated while on the inactive list, because (1) we know it was a frequently accessed page already and (2) ongoing streaming IO might evict it from the inactive list before it gets accessed twice. Arguably, we should just replace the page_mapping_inuse() in both places with page_mapped() to simplify things. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:07 ` Rik van Riel @ 2009-12-02 3:28 ` KOSAKI Motohiro 2009-12-02 4:57 ` Rik van Riel 2009-12-02 11:07 ` Johannes Weiner 0 siblings, 2 replies; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-02 3:28 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: > >> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. > >> it prevent to drop page reference bit from unmapped cache page. it mean > >> we protect unmapped cache page than mapped page. it is strange. > >> > > How about this? > > > > --------------------------------- > > SplitLRU VM replacement algorithm assume shrink_active_list() clear > > the page's reference bit. but unnecessary page_mapping_inuse() test > > prevent it. > > > > This patch remove it. > > > Shrink_page_list ignores the referenced bit on pages > that are !page_mapping_inuse(). > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > referenced && > page_mapping_inuse(page) > && !(vm_flags & VM_LOCKED)) > goto activate_locked; > > The reason we leave the referenced bit on unmapped > pages is that we want the next reference to a deactivated > page cache page to move that page back to the active > list. We do not want to require that such a page gets > accessed twice before being reactivated while on the > inactive list, because (1) we know it was a frequently > accessed page already and (2) ongoing streaming IO > might evict it from the inactive list before it gets accessed > twice. > > Arguably, we should just replace the page_mapping_inuse() > in both places with page_mapped() to simplify things. Ah, yes. /me was slept. thanks correct me. From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Date: Wed, 2 Dec 2009 12:05:26 +0900 Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() page reclaim logic need to distingish mapped and unmapped pages. However page_mapping_inuse() don't provide proper test way. it test the address space (i.e. file) is mmpad(). Why `page' reclaim need care unrelated page's mapped state? it's unrelated. Thus, This patch replace page_mapping_inuse() with page_mapped() Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/vmscan.c | 25 ++----------------------- 1 files changed, 2 insertions(+), 23 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 00156f2..350b9cc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -277,27 +277,6 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, return ret; } -/* Called without lock on whether page is mapped, so answer is unstable */ -static inline int page_mapping_inuse(struct page *page) -{ - struct address_space *mapping; - - /* Page is in somebody's page tables. */ - if (page_mapped(page)) - return 1; - - /* Be more reluctant to reclaim swapcache than pagecache */ - if (PageSwapCache(page)) - return 1; - - mapping = page_mapping(page); - if (!mapping) - return 0; - - /* File is mmap'd by somebody? 
*/ - return mapping_mapped(mapping); -} - static inline int is_page_cache_freeable(struct page *page) { /* @@ -663,7 +642,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * try_to_unmap moves it to unevictable list */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page) + referenced && page_mapped(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; @@ -1347,7 +1326,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && + if (page_mapped(page) && page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { nr_rotated++; /* -- 1.6.5.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro @ 2009-12-02 4:57 ` Rik van Riel 2009-12-02 11:07 ` Johannes Weiner 1 sibling, 0 replies; 28+ messages in thread From: Rik van Riel @ 2009-12-02 4:57 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/01/2009 10:28 PM, KOSAKI Motohiro wrote: >> On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: >> >>>> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. >>>> it prevent to drop page reference bit from unmapped cache page. it mean >>>> we protect unmapped cache page than mapped page. it is strange. >>>> >>>> >>> How about this? >>> >>> --------------------------------- >>> SplitLRU VM replacement algorithm assume shrink_active_list() clear >>> the page's reference bit. but unnecessary page_mapping_inuse() test >>> prevent it. >>> >>> This patch remove it. >>> >>> >> Shrink_page_list ignores the referenced bit on pages >> that are !page_mapping_inuse(). >> >> if (sc->order<= PAGE_ALLOC_COSTLY_ORDER&& >> referenced&& >> page_mapping_inuse(page) >> && !(vm_flags& VM_LOCKED)) >> goto activate_locked; >> >> The reason we leave the referenced bit on unmapped >> pages is that we want the next reference to a deactivated >> page cache page to move that page back to the active >> list. We do not want to require that such a page gets >> accessed twice before being reactivated while on the >> inactive list, because (1) we know it was a frequently >> accessed page already and (2) ongoing streaming IO >> might evict it from the inactive list before it gets accessed >> twice. >> >> Arguably, we should just replace the page_mapping_inuse() >> in both places with page_mapped() to simplify things. >> > Ah, yes. /me was slept. thanks correct me. > > > From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 > From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> > Date: Wed, 2 Dec 2009 12:05:26 +0900 > Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() > > page reclaim logic need to distingish mapped and unmapped pages. > However page_mapping_inuse() don't provide proper test way. it test > the address space (i.e. file) is mmpad(). Why `page' reclaim need > care unrelated page's mapped state? it's unrelated. > > Thus, This patch replace page_mapping_inuse() with page_mapped() > > Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> > Reviewed-by: Rik van Riel <riel@redhat.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro 2009-12-02 4:57 ` Rik van Riel @ 2009-12-02 11:07 ` Johannes Weiner 1 sibling, 0 replies; 28+ messages in thread From: Johannes Weiner @ 2009-12-02 11:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On Wed, Dec 02, 2009 at 12:28:26PM +0900, KOSAKI Motohiro wrote: > From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Date: Wed, 2 Dec 2009 12:05:26 +0900 > Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() > > page reclaim logic need to distingish mapped and unmapped pages. > However page_mapping_inuse() don't provide proper test way. it test > the address space (i.e. file) is mmpad(). Why `page' reclaim need > care unrelated page's mapped state? it's unrelated. > > Thus, This patch replace page_mapping_inuse() with page_mapped() > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload
  2009-11-30 22:00 ` [RFC] high system time & lock contention running large mixed workload Larry Woodman
  2009-12-01 10:04 ` Andrea Arcangeli
  2009-12-01 12:23 ` KOSAKI Motohiro
@ 2009-12-02  1:55 ` Rik van Riel
  2 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2009-12-02 1:55 UTC (permalink / raw)
To: Larry Woodman; +Cc: linux-kernel, linux-mm, akpm

On 11/30/2009 05:00 PM, Larry Woodman wrote:
> While running workloads that do lots of forking processes, exiting
> processes and page reclamation (AIM 7) on large systems, very high system
> time (100%) and lots of lock contention was observed.
>
>
> CPU5:
> [<ffffffff814afb48>] ? _spin_lock+0x27/0x48
> [<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
> [<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
> [<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff81058407>] ? default_wake_function+0x0/0x36
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU4:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
> [<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
> [<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
> [<ffffffff8105ce4c>] ? mmput+0x55/0xd9
> [<ffffffff81061afd>] ? exit_mm+0x109/0x129
> [<ffffffff81063846>] ? do_exit+0x1d7/0x712
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
> [<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU0:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
> [<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
> [<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
> [<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
> [<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
> [<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
> [<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
> [<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
> [<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
> [<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
> [<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
> [<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
> [<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
> [<ffffffff81130db8>] ? user_path_at+0x65/0xa3
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
>
> ------------------------------------------------------------------------------
>    PerfTop:  864 irqs/sec  kernel:99.7%  [100000 cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------
>
>   samples    pcnt         RIP          kernel function
>   _______   _______   ________________   _______________
>
>   3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
>    670.00 - 15.6% - ffffffff81101a33 : page_check_address
>    165.00 -  3.8% - ffffffffa01cbc39 : rpc_sleep_on  [sunrpc]
>     40.00 -  0.9% - ffffffff81102113 : try_to_unmap_one
>     29.00 -  0.7% - ffffffff81101c65 : page_referenced_one
>     27.00 -  0.6% - ffffffff81101964 : vma_address
>      8.00 -  0.2% - ffffffff8125a5a0 : clear_page_c
>      6.00 -  0.1% - ffffffff8125a5f0 : copy_page_c
>      6.00 -  0.1% - ffffffff811023ca : try_to_unmap_anon
>      5.00 -  0.1% - ffffffff810fb014 : copy_page_range
>      5.00 -  0.1% - ffffffff810e4d18 : get_page_from_freelist
>
>
> The cause was determined to be the unconditional call to
> page_referenced() for every mapped page encountered in
> shrink_active_list(). page_referenced() takes the anon_vma->lock and
> calls page_referenced_one() for each vma. page_referenced_one() then
> calls page_check_address() which takes the pte_lockptr spinlock. If
> several CPUs are doing this at the same time there is a lot of
> pte_lockptr spinlock contention with the anon_vma->lock held. This
> causes contention on the anon_vma->lock, stalling in the fork path and
> very high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> set when the priority worked its way from 12 all the way to 7. This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it was called by shrink_active_list() and simply
> fail if it could not get the pte_lockptr spinlock. This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much quicker.
>
> The attached patch does just that, thoughts???

My first thought is that you haven't read the code you are
trying to patch.

The purpose of calling page_referenced on anonymous pages
from shrink_active_list is to CLEAR the referenced bit, not
to test it!

Not clearing the referenced bit will break page replacement,
because pages that were not recently referenced will appear
to be, causing them to get another round on the active list,
which in turn could increase the list movement...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 28+ messages in thread
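(For reference, this is roughly what page_referenced_one() in mm/rmap.c did with each pte at the time of this thread, heavily simplified -- the VM_LOCKED and sequential-read special cases are omitted. It illustrates why the call both tests and clears the accessed bit in a single step, so skipping it leaves stale referenced bits behind:)

	/* simplified sketch of the core of page_referenced_one() */
	pte = page_check_address(page, mm, address, &ptl, 0);
	if (!pte)
		goto out;

	/* test AND clear the hardware accessed bit in one operation */
	if (ptep_clear_flush_young_notify(vma, address, pte))
		referenced++;

	pte_unmap_unlock(pte, ptl);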
end of thread, other threads:[~2009-12-06 21:04 UTC | newest]

Thread overview: 28+ messages
2009-11-25 18:37 [PATCH] vmscan: do not evict inactive pages when skipping an active list scan Rik van Riel
2009-11-25 20:35 ` Johannes Weiner
2009-11-25 20:47 ` Rik van Riel
2009-11-26  2:50 ` KOSAKI Motohiro
2009-11-26  2:57 ` Rik van Riel
2009-11-30 22:00 ` [RFC] high system time & lock contention running large mixed workload Larry Woodman
2009-12-01 10:04 ` Andrea Arcangeli
2009-12-01 12:31 ` KOSAKI Motohiro
2009-12-01 12:46 ` Andrea Arcangeli
2009-12-02  2:02 ` KOSAKI Motohiro
2009-12-02  2:04 ` Rik van Riel
2009-12-02  2:00 ` Rik van Riel
2009-12-01 12:23 ` KOSAKI Motohiro
2009-12-01 16:41 ` Larry Woodman
2009-12-02  2:20 ` Rik van Riel
2009-12-02  2:41 ` KOSAKI Motohiro
2009-12-03 22:14 ` Larry Woodman
2009-12-04  0:29 ` Rik van Riel
2009-12-04 21:26 ` Larry Woodman
2009-12-06 21:04 ` Rik van Riel
2009-12-04  0:36 ` KOSAKI Motohiro
2009-12-04 19:31 ` Larry Woodman
2009-12-02  2:55 ` [PATCH] Clear reference bit although page isn't mapped KOSAKI Motohiro
2009-12-02  3:07 ` Rik van Riel
2009-12-02  3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro
2009-12-02  4:57 ` Rik van Riel
2009-12-02 11:07 ` Johannes Weiner
2009-12-02  1:55 ` [RFC] high system time & lock contention running large mixed workload Rik van Riel