* [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-25 18:37 UTC
To: linux-kernel
Cc: linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm
In AIM7 runs, recent kernels start swapping out anonymous pages
well before they should. This is due to shrink_list falling
through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
when all we really wanted to do was pre-age some anonymous pages to
give them extra time to be referenced while on the inactive list.
The obvious fix is to make sure that shrink_list does not fall
through to scanning/reclaiming inactive pages when we called it
to scan one of the active lists.
This change should be safe because the loop in shrink_zone ensures
that we will still shrink the anon and file inactive lists whenever
we should.
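
For reference, the shrink_zone() loop referred to above looks roughly like
this in 2.6.32-era mm/vmscan.c (paraphrased; exact names such as
sc->swap_cluster_max may differ slightly between versions):

        /* paraphrased from shrink_zone(), not part of this patch */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(l) {
                        if (nr[l]) {
                                nr_to_scan = min(nr[l], sc->swap_cluster_max);
                                nr[l] -= nr_to_scan;
                                nr_reclaimed += shrink_list(l, nr_to_scan,
                                                        zone, sc, priority);
                        }
                }
        }

        /*
         * Even when no anon pages get reclaimed, shrink_zone() still
         * rebalances the anon active/inactive ratio at the end.
         */
        if (inactive_anon_is_low(zone, sc))
                shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);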
Reported-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..ec4dfda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
-		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+	if (lru == LRU_ACTIVE_FILE) {
+		if (inactive_file_is_low(zone, sc))
+			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 
-	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
-		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+	if (lru == LRU_ACTIVE_ANON) {
+		if (inactive_file_is_low(zone, sc))
+			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Johannes Weiner @ 2009-11-25 20:35 UTC
To: Rik van Riel
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

Hello all,

On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:

> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should.  This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do was pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.

I do not quite understand what changed 'recently'.

That fall-through logic to keep eating inactives when the ratio is off
came in a year ago with the second-chance-for-anon-pages patch..?

> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

It was not so obvious to me ;)

At first, I thought it would make sense to actively rebalance between
the lists if the inactive one grows too large (the fall-through case).

But shrink_zone() does not know about this and although we scan
inactive pages, we do not account for them and decrease the 'nr[lru]'
for active pages instead, effectively shifting the 'active todo' over
to the 'inactive todo'.  I can imagine this going wrong!

So I agree, we should use the inactive_*_is_low() predicate only
passively.

> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-25 20:47 UTC
To: Johannes Weiner
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

On 11/25/2009 03:35 PM, Johannes Weiner wrote:
> Hello all,
>
> On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:
>
>> In AIM7 runs, recent kernels start swapping out anonymous pages
>> well before they should.  This is due to shrink_list falling
>> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
>> when all we really wanted to do was pre-age some anonymous pages to
>> give them extra time to be referenced while on the inactive list.
>
> I do not quite understand what changed 'recently'.
>
> That fall-through logic to keep eating inactives when the ratio is off
> came in a year ago with the second-chance-for-anon-pages patch..?

The confusion comes from my use of the word "recently" here.  Larry
started testing with RHEL 5 as the baseline.

And yeah - I believe the code may well have been wrong ever since it
was merged.  The indentation just looked so pretty that no one spotted
the bug.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you.

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: KOSAKI Motohiro @ 2009-11-26 2:50 UTC
To: Rik van Riel
Cc: kosaki.motohiro, linux-kernel, linux-mm, lwoodman, Tomasz Chmielewski, akpm

> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should.  This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do was pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.
>
> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

Good catch!

> Reported-by: Larry Woodman <lwoodman@redhat.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..ec4dfda 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  {
>  	int file = is_file_lru(lru);
>
> -	if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_FILE) {
> +		if (inactive_file_is_low(zone, sc))
> +			shrink_active_list(nr_to_scan, zone, sc, priority, file);
>  		return 0;
>  	}
>
> -	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_ANON) {
> +		if (inactive_file_is_low(zone, sc))

This inactive_file_is_low() should be inactive_anon_is_low().
Cut-n-paste programming often causes this kind of mistake.  Probably we
need to clean this function up some more.  How about this?
(this is an incremental patch on top of yours)

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8f61c0..80e94a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1467,22 +1467,25 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 	return low;
 }
 
+static int inactive_list_is_low(struct zone *zone, struct scan_control *sc, int file)
+{
+	if (file)
+		return inactive_file_is_low(zone, sc);
+	else
+		return inactive_anon_is_low(zone, sc);
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_FILE) {
-		if (inactive_file_is_low(zone, sc))
+	if (is_active_lru(lru)) {
+		if (inactive_list_is_low(zone, sc, file))
 			shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 
-	if (lru == LRU_ACTIVE_ANON) {
-		if (inactive_file_is_low(zone, sc))
-			shrink_active_list(nr_to_scan, zone, sc, priority, file);
-		return 0;
-	}
 	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
 }
-- 
1.6.5.2

* Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan
From: Rik van Riel @ 2009-11-26 2:57 UTC
To: KOSAKI Motohiro
Cc: linux-kernel, linux-mm, lwoodman, kosaki.motohiro, Tomasz Chmielewski, akpm

On 11/25/2009 09:50 PM, KOSAKI Motohiro wrote:
>
>> -	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
>> -		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> +	if (lru == LRU_ACTIVE_ANON) {
>> +		if (inactive_file_is_low(zone, sc))
>
> This inactive_file_is_low() should be inactive_anon_is_low().
> Cut-n-paste programming often causes this kind of mistake.  Probably we
> need to clean this function up some more.
>
> How about this? (this is an incremental patch on top of yours)

Doh!  Nice catch...

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

* [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-11-30 22:00 UTC
To: linux-kernel
Cc: linux-mm, akpm

[-- Attachment #1: Type: text/plain, Size: 4499 bytes --]

While running workloads that do lots of process forking, process exiting
and page reclamation (AIM 7) on large systems, very high system time
(100%) and lots of lock contention were observed.

CPU5:
 [<ffffffff814afb48>] ? _spin_lock+0x27/0x48
 [<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
 [<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
 [<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
 [<ffffffff8105ea07>] ? do_fork+0x151/0x330
 [<ffffffff81058407>] ? default_wake_function+0x0/0x36
 [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
 [<ffffffff810121d3>] ? stub_clone+0x13/0x20
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU4:
 [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
 [<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
 [<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
 [<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
 [<ffffffff8105ce4c>] ? mmput+0x55/0xd9
 [<ffffffff81061afd>] ? exit_mm+0x109/0x129
 [<ffffffff81063846>] ? do_exit+0x1d7/0x712
 [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
 [<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
 [<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU0:
 [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
 [<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
 [<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
 [<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
 [<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
 [<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
 [<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
 [<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
 [<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
 [<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
 [<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
 [<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
 [<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
 [<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
 [<ffffffff81130db8>] ? user_path_at+0x65/0xa3
 [<ffffffff8105ea07>] ? do_fork+0x151/0x330
 [<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
 [<ffffffff810121d3>] ? stub_clone+0x13/0x20
 [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

------------------------------------------------------------------------------
   PerfTop:     864 irqs/sec  kernel:99.7% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   RIP                kernel function
             _______   _____   ________________   ________________

             3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
              670.00 - 15.6% - ffffffff81101a33 : page_check_address
              165.00 -  3.8% - ffffffffa01cbc39 : rpc_sleep_on   [sunrpc]
               40.00 -  0.9% - ffffffff81102113 : try_to_unmap_one
               29.00 -  0.7% - ffffffff81101c65 : page_referenced_one
               27.00 -  0.6% - ffffffff81101964 : vma_address
                8.00 -  0.2% - ffffffff8125a5a0 : clear_page_c
                6.00 -  0.1% - ffffffff8125a5f0 : copy_page_c
                6.00 -  0.1% - ffffffff811023ca : try_to_unmap_anon
                5.00 -  0.1% - ffffffff810fb014 : copy_page_range
                5.00 -  0.1% - ffffffff810e4d18 : get_page_from_freelist

The cause was determined to be the unconditional call to page_referenced()
for every mapped page encountered in shrink_active_list().
page_referenced() takes the anon_vma->lock and calls page_referenced_one()
for each vma.  page_referenced_one() then calls page_check_address(), which
takes the pte_lockptr spinlock.  If several CPUs are doing this at the same
time there is a lot of pte_lockptr spinlock contention with the
anon_vma->lock held.  This causes contention on the anon_vma->lock,
stalling the fork and exit paths, and very high system time.

Before the splitLRU patch shrink_active_list() would only call
page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
set when the priority worked its way from 12 all the way down to 7.  This
prevented page_referenced() from being called from shrink_active_list()
until the system was really struggling to reclaim memory.

One way to prevent this is to change page_check_address() to execute a
spin_trylock(ptl) when it is called by shrink_active_list() and simply
fail if it cannot get the pte_lockptr spinlock.  This will make
shrink_active_list() consider the page not referenced and allow the
anon_vma->lock to be dropped much more quickly.

The attached patch does just that, thoughts???

[-- Attachment #2: pageout.diff --]
[-- Type: text/x-patch, Size: 3419 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..65b841d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **, int);
+				unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 1888b2d..35be29d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,7 +188,7 @@ retry:
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
+		pte = page_check_address(page, mm, address, &ptl, 1, 0);
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
diff --git a/mm/ksm.c b/mm/ksm.c
index 5575f86..8abb14b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	if (addr == -EFAULT)
 		goto out;
 
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
+	ptep = page_check_address(page, mm, addr, &ptl, 0, 0);
 	if (!ptep)
 		goto out;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index dd43373..4e4eb8e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
  * On success returns with pte mapped and locked.
  */
 pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp, int sync)
+			  unsigned long address, spinlock_t **ptlp, int sync, int try)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	}
 
 	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	if (try) {
+		if (!spin_trylock(ptl)) {
+			pte_unmap(pte);
+			return NULL;
+		}
+	} else
+		spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
 		return pte;
@@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	address = vma_address(page, vma);
 	if (address == -EFAULT)		/* out of vma range */
 		return 0;
-	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1);
+	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0);
 	if (!pte)			/* the page is not in this mm */
 		return 0;
 	pte_unmap_unlock(pte, ptl);
@@ -352,7 +358,7 @@ static int page_referenced_one(struct page *page,
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
+	pte = page_check_address(page, mm, address, &ptl, 0, 1);
 	if (!pte)
 		goto out;
 
@@ -547,7 +553,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 1);
+	pte = page_check_address(page, mm, address, &ptl, 1, 0);
 	if (!pte)
 		goto out;
 
@@ -774,7 +780,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
+	pte = page_check_address(page, mm, address, &ptl, 0, 0);
 	if (!pte)
 		goto out;
 

* Re: [RFC] high system time & lock contention running large mixed workload
From: Andrea Arcangeli @ 2009-12-01 10:04 UTC
To: Larry Woodman
Cc: linux-kernel, linux-mm, akpm

On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> set when the priority worked its way from 12 all the way down to 7.  This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.

page_referenced should never be called and nobody should touch ptes
until priority went down to 7.  This is a regression in splitLRU that
should be fixed.  With light VM pressure we should never touch ptes ever.

> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it is called by shrink_active_list() and simply
> fail if it cannot get the pte_lockptr spinlock.  This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much more quickly.
>
> The attached patch does just that, thoughts???

Just stop calling page_referenced there...

Even if we ignore the above, one problem later in skipping over the PT
lock is also to assume the page is not referenced when it actually is,
so it won't be activated again when page_referenced is called again to
move the page back to the active list...  Not the end of the world to
lose a young bit sometimes, though.

There may be all the reasons in the world why we have to mess with ptes
when there's light VM pressure, for whatever terabyte machine or
whatever workload that performs better that way, but I know in 100% of
my systems I don't ever want the VM to touch ptes when there's light
VM pressure, no matter what.  So if you want the default to be messing
with ptes, just give me a sysctl knob to let me run faster.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-01 12:31 UTC
To: Andrea Arcangeli
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Rik van Riel

> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> > Before the splitLRU patch shrink_active_list() would only call
> > page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> > set when the priority worked its way from 12 all the way down to 7.  This
> > prevented page_referenced() from being called from shrink_active_list()
> > until the system was really struggling to reclaim memory.
>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7.  This is a regression in splitLRU that
> should be fixed.  With light VM pressure we should never touch ptes ever.

Ummm, I can't agree with this.  7 is too low a priority; if a large
system gets down to prio==7, it already has an unacceptably big latency
problem.  If it were only prio==DEF_PRIORITY or so, I could probably
agree with you.

> > One way to prevent this is to change page_check_address() to execute a
> > spin_trylock(ptl) when it is called by shrink_active_list() and simply
> > fail if it cannot get the pte_lockptr spinlock.  This will make
> > shrink_active_list() consider the page not referenced and allow the
> > anon_vma->lock to be dropped much more quickly.
> >
> > The attached patch does just that, thoughts???
>
> Just stop calling page_referenced there...
>
> Even if we ignore the above, one problem later in skipping over the PT
> lock is also to assume the page is not referenced when it actually is,
> so it won't be activated again when page_referenced is called again to
> move the page back to the active list...  Not the end of the world to
> lose a young bit sometimes, though.
>
> There may be all the reasons in the world why we have to mess with ptes
> when there's light VM pressure, for whatever terabyte machine or
> whatever workload that performs better that way, but I know in 100% of
> my systems I don't ever want the VM to touch ptes when there's light
> VM pressure, no matter what.  So if you want the default to be messing
> with ptes, just give me a sysctl knob to let me run faster.

Um.  Avoiding lock contention under light VM pressure is more important
than strict LRU order.  I guess we don't need a knob.

* Re: [RFC] high system time & lock contention running large mixed workload
From: Andrea Arcangeli @ 2009-12-01 12:46 UTC
To: KOSAKI Motohiro
Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Rik van Riel

On Tue, Dec 01, 2009 at 09:31:09PM +0900, KOSAKI Motohiro wrote:
> Ummm, I can't agree with this.  7 is too low a priority; if a large
> system gets down to prio==7, it already has an unacceptably big latency
> problem.  If it were only prio==DEF_PRIORITY or so, I could probably
> agree with you.

I took the number 7 purely as mentioned by Larry about the old code,
but I don't mind what the actual breakpoint level is where we start to
send the ipi flood to destroy all userland tlbs mapping the page so the
young bit can be set by the cpu on the old pte.  If you agree with me
that at the lowest priority we shouldn't flood ipis and destroy tlbs
when there's plenty of unmapped clean cache, we already agree ;).  If
that's 7 or DEF_PRIORITY-1, that's ok.  All I care about is that it
escalates gradually: first clean cache and re-activate mapped pages,
then when we're low on clean cache we start to check ptes and unmap
whatever is not found referenced.

> Um.  Avoiding lock contention under light VM pressure is more important
> than strict LRU order.  I guess we don't need a knob.

Hope so indeed.  It's not just the lock contention, which is exacerbated
by certain workloads; even in the total absence of any lock contention
I generally dislike the cpu waste itself of the pte loop to clear the
young bit, and the interruption of userland as well when it receives a
tlb flush for no good reason, because 99% of the time plenty of
unmapped clean cache is available.  I know this performs best, even if
there will always be someone who wants mapped and unmapped cache to be
treated totally equal in lru terms (which then makes me wonder why
there are still & VM_EXEC magics in vmscan.c if all pages shall be
treated equal in the lru...  especially given VM_EXEC is often
meaningless [because potentially randomly set] unlike page_mapcount
[which is never randomly set]), which is the reason I mentioned the
knob.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-02 2:02 UTC
To: Andrea Arcangeli
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Rik van Riel

Hi

> > Avoiding lock contention under light VM pressure is more important
> > than strict LRU order.  I guess we don't need a knob.
>
> Hope so indeed.  It's not just the lock contention, which is exacerbated
> by certain workloads; even in the total absence of any lock contention
> I generally dislike the cpu waste itself of the pte loop to clear the
> young bit, and the interruption of userland as well when it receives a
> tlb flush for no good reason, because 99% of the time plenty of
> unmapped clean cache is available.  I know this performs best, even if
> there will always be someone who wants mapped and unmapped cache to be
> treated totally equal in lru terms (which then makes me wonder why
> there are still & VM_EXEC magics in vmscan.c if all pages shall be
> treated equal in the lru...  especially given VM_EXEC is often
> meaningless [because potentially randomly set] unlike page_mapcount
> [which is never randomly set]), which is the reason I mentioned the
> knob.

Umm??  I'm puzzled.  If most pages on the LRU are unmapped file cache,
the pte walk is not costly.  Conversely, if most pages on the LRU are
mapped, we have to do the pte walk; otherwise no pages get deactivated
and the system runs into big latency trouble.  I don't want vmscan to
focus on peak speed and ignore the worst case; that isn't proper
behavior in a memory shortage situation.  So I'd like to focus on
solving the lock contention issue.

Of course, if this causes any trouble for KVM or other real-world
usage, I'll fix it.  If you have seen any trouble with the current VM,
please tell us.

[I (and Hugh at least) dislike the VM_EXEC logic too, but that seems
off topic.]

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:04 UTC
To: Andrea Arcangeli
Cc: KOSAKI Motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki

Andrea Arcangeli wrote:
> I took the number 7 purely as mentioned by Larry about the old code,
> but I don't mind what the actual breakpoint level is where we start to
> send the ipi flood to destroy all userland tlbs mapping the page so the
> young bit can be set by the cpu on the old pte.  If you agree with me
> that at the lowest priority we shouldn't flood ipis and destroy tlbs
> when there's plenty of unmapped clean cache, we already agree ;).  If
> that's 7 or DEF_PRIORITY-1, that's ok.  All I care about is that it
> escalates gradually: first clean cache and re-activate mapped pages,
> then when we're low on clean cache we start to check ptes and unmap
> whatever is not found referenced.

The code already does what you propose.

It takes a heavy AIM7 run for Larry to run into the lock contention
issue.  I suspect that the page cache was already very small by the
time the lock contention issue was triggered.

Larry, do you have any more info on the state of the VM when you see
the lock contention?

Also, do you have the latest patches to shrink_list() by Kosaki and me
applied?

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:00 UTC
To: Andrea Arcangeli
Cc: Larry Woodman, linux-kernel, linux-mm, akpm

On 12/01/2009 05:04 AM, Andrea Arcangeli wrote:
> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
>
>> Before the splitLRU patch shrink_active_list() would only call
>> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
>> set when the priority worked its way from 12 all the way down to 7.  This
>> prevented page_referenced() from being called from shrink_active_list()
>> until the system was really struggling to reclaim memory.
>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7.  This is a regression in splitLRU that
> should be fixed.  With light VM pressure we should never touch ptes ever.

You appear to have not read the code, either.

The VM should not look at the active anon list much, unless it has a
good reason to start evicting anonymous pages.  Yes, there was a bug in
shrink_list(), but Kosaki and I just posted patches to fix that.

As for page_referenced not being called until priority goes down to 7 -
that is one of the root causes the old VM did not scale.  The number of
pages that need to be scanned to get down to that point is staggeringly
huge on systems with 1TB of RAM - a much larger number than we should
EVER scan in the pageout code.

There is no way we could go back to that heuristic.  It fell apart
before and it would continue to fall apart if we reintroduced it.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-01 12:23 UTC
To: Larry Woodman
Cc: kosaki.motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrea Arcangeli

(cc to some related people)

> The cause was determined to be the unconditional call to page_referenced()
> for every mapped page encountered in shrink_active_list().
> page_referenced() takes the anon_vma->lock and calls page_referenced_one()
> for each vma.  page_referenced_one() then calls page_check_address(), which
> takes the pte_lockptr spinlock.  If several CPUs are doing this at the same
> time there is a lot of pte_lockptr spinlock contention with the
> anon_vma->lock held.  This causes contention on the anon_vma->lock,
> stalling the fork and exit paths, and very high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> set when the priority worked its way from 12 all the way down to 7.  This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it is called by shrink_active_list() and simply
> fail if it cannot get the pte_lockptr spinlock.  This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much more quickly.
>
> The attached patch does just that, thoughts???

At first look,

 - We certainly have to fix this issue.
 - But your patch is a bit risky.

Your patch treats trylock(pte-lock) failure as "not accessed".  But
lock contention generally implies there is a contending peer; IOW, the
page typically does have its referenced bit set.  Then the next
shrink_inactive_list() moves it back to the active list again.  That's
a suboptimal result.

However, we can't simply treat lock contention as page-is-referenced
either; if we did, the system could easily go into OOM.

So, is

	if (priority < DEF_PRIORITY - 2)
		page_referenced()
	else
		page_referenced_trylock()

better?  On typical workloads vmscan almost always runs at
DEF_PRIORITY.  So if the priority==DEF_PRIORITY case doesn't cause
heavy lock contention, the system doesn't need to mind the contention;
anyway, we can't avoid contention when the system is under heavy memory
pressure.

BTW, the current shrink_active_list() has an unnecessary
page_mapping_inuse() call.  It prevents dropping the page reference bit
from unmapped cache pages, which means we protect unmapped cache pages
more than mapped pages.  That is strange.

Unfortunately, I don't have enough development time today.  I'll work
on it tomorrow.
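
In code, the suggestion above amounts to something like the sketch below
(illustrative only; page_referenced_trylock() is the hypothetical variant
named in the mail, not an existing function):

        /* sketch of the priority gate, not a patch from this thread */
        if (priority < DEF_PRIORITY - 2)
                /* real reclaim pressure: take the pte lock as usual */
                referenced = page_referenced(page, 0, sc->mem_cgroup, &vm_flags);
        else
                /* light pressure: give up on pte lock contention */
                referenced = page_referenced_trylock(page, 0, sc->mem_cgroup, &vm_flags);

Larry's follow-up patch below implements the same gate by passing
"priority < DEF_PRIORITY-2 ? 0 : 1" as an extra argument to
page_referenced() rather than adding a separate helper.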

* Re: [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-12-01 16:41 UTC
To: KOSAKI Motohiro
Cc: linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Arcangeli

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

On Tue, 2009-12-01 at 21:23 +0900, KOSAKI Motohiro wrote:
> (cc to some related people)
>
> At first look,
>
>  - We certainly have to fix this issue.
>  - But your patch is a bit risky.
>
> Your patch treats trylock(pte-lock) failure as "not accessed".  But
> lock contention generally implies there is a contending peer; IOW, the
> page typically does have its referenced bit set.  Then the next
> shrink_inactive_list() moves it back to the active list again.  That's
> a suboptimal result.
>
> However, we can't simply treat lock contention as page-is-referenced
> either; if we did, the system could easily go into OOM.
>
> So, is
>
> 	if (priority < DEF_PRIORITY - 2)
> 		page_referenced()
> 	else
> 		page_referenced_trylock()
>
> better?  On typical workloads vmscan almost always runs at
> DEF_PRIORITY.  So if the priority==DEF_PRIORITY case doesn't cause
> heavy lock contention, the system doesn't need to mind the contention;
> anyway, we can't avoid contention when the system is under heavy memory
> pressure.

Agreed.  The attached updated patch only does a trylock in the
page_referenced() call from shrink_active_list() and only for anonymous
pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
have never seen a problem like this with active pagecache pages and it
does not alter the existing shrink_page_list() behavior.  What do you
think about this???

[-- Attachment #2: pageout.diff --]
[-- Type: text/x-patch, Size: 6796 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..d7eaeca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -80,7 +80,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
+			struct mem_cgroup *cnt, unsigned long *vm_flags, int try);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
 	TTU_MIGRATION = 1,		/* migration mode */
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-			unsigned long, spinlock_t **, int);
+			unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
*/ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..d8afe1a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int try) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (try) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return NULL; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,7 +359,7 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, try); if (!pte) goto out; @@ -396,7 +403,8 @@ out: static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -417,7 +425,7 @@ static int page_referenced_anon(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, try); if (!mapcount) break; } @@ -482,7 +490,7 @@ static int page_referenced_file(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } @@ -504,7 +512,8 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int try) { int referenced = 0; @@ -515,7 +524,7 @@ int page_referenced(struct page *page, if (page_mapped(page) && page->mapping) { if (PageAnon(page)) referenced += page_referenced_anon(page, mem_cont, - vm_flags); + vm_flags, try); else if (is_locked) referenced += page_referenced_file(page, mem_cont, vm_flags); @@ -547,7 +556,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = 
page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +783,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..fff63a0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,7 +643,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. * If page which have PG_mlocked lost isoltation race, @@ -1355,7 +1355,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, /* page_referenced clears PageReferenced */ if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + page_referenced(page, 0, sc->mem_cgroup, &vm_flags, + priority<DEF_PRIORITY-2?0:1)) { nr_rotated++; /* * Identify referenced, file-backed active pages and ^ permalink raw reply [flat|nested] 28+ messages in thread

* Re: [RFC] high system time & lock contention running large mixed workload
From: Rik van Riel @ 2009-12-02 2:20 UTC
To: Larry Woodman
Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

On 12/01/2009 11:41 AM, Larry Woodman wrote:
>
> Agreed.  The attached updated patch only does a trylock in the
> page_referenced() call from shrink_active_list() and only for anonymous
> pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
> have never seen a problem like this with active pagecache pages and it
> does not alter the existing shrink_page_list() behavior.  What do you
> think about this???

This is reasonable, except for the fact that pages that are moved to
the inactive list without having the referenced bit cleared are
guaranteed to be moved back to the active list.

You'll be better off without that excess list movement, by simply
moving pages directly back onto the active list if the trylock fails.

Yes, this means that page_referenced can now return 3 different return
values (not accessed, accessed, lock contended), which should probably
be an enum so we can test for the values symbolically in the calling
functions.

That way only pages where we did manage to clear the referenced bit
will be moved onto the inactive list.  This not only reduces the amount
of excess list movement, it also makes sure that the pages which do get
onto the inactive list get a fair chance at being referenced again,
instead of potentially being flooded out by pages where the trylock
failed.

A minor nitpick: maybe it would be good to rename the "try" parameter
to "noblock".  That more closely matches the requested behaviour.

* Re: [RFC] high system time & lock contention running large mixed workload
From: KOSAKI Motohiro @ 2009-12-02 2:41 UTC
To: Rik van Riel
Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm,
	Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli

> On 12/01/2009 11:41 AM, Larry Woodman wrote:
> >
> > Agreed.  The attached updated patch only does a trylock in the
> > page_referenced() call from shrink_active_list() and only for anonymous
> > pages when the priority is either 10, 11 or 12 (DEF_PRIORITY-2).  I
> > have never seen a problem like this with active pagecache pages and it
> > does not alter the existing shrink_page_list() behavior.  What do you
> > think about this???
>
> This is reasonable, except for the fact that pages that are moved to
> the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock fails.
>
> Yes, this means that page_referenced can now return 3 different return
> values (not accessed, accessed, lock contended), which should probably
> be an enum so we can test for the values symbolically in the calling
> functions.
>
> That way only pages where we did manage to clear the referenced bit
> will be moved onto the inactive list.  This not only reduces the amount
> of excess list movement, it also makes sure that the pages which do get
> onto the inactive list get a fair chance at being referenced again,
> instead of potentially being flooded out by pages where the trylock
> failed.

Agreed.

> A minor nitpick: maybe it would be good to rename the "try" parameter
> to "noblock".  That more closely matches the requested behaviour.

Another minor nit: we probably also have to rename page_referenced().
The name implies testing the referenced bit, but we use it to clear the
referenced bit in shrink_active_list().

* Re: [RFC] high system time & lock contention running large mixed workload
From: Larry Woodman @ 2009-12-03 22:14 UTC
To: Rik van Riel
Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins,
	KAMEZAWA Hiroyuki, Andrea Arcangeli

[-- Attachment #1: Type: text/plain, Size: 1926 bytes --]

On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote:

> This is reasonable, except for the fact that pages that are moved to
> the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock fails.

The attached patch addresses this issue by changing page_check_address()
to return -1 if the spin_trylock() fails, and page_referenced_one() to
return 1 in that path, so the page gets moved back to the active list.

Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute has
196491 isolated_anon pages.  This means that ~6140 processes are somewhere
down in try_to_free_pages(), since we only isolate 32 pages at a time;
this is out of 9000 processes...

---------------------------------------------------------------------
active_anon:2140361 inactive_anon:453356 isolated_anon:196491
 active_file:3438 inactive_file:1100 isolated_file:0
 unevictable:2802 dirty:153 writeback:0 unstable:0
 free:578920 slab_reclaimable:49214 slab_unreclaimable:93268
 mapped:1105 shmem:0 pagetables:139100 bounce:0
Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB
 active_anon:7835452kB inactive_anon:785764kB active_file:13672kB
 inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB
 isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB
 writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB
 slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB
 unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
202895 total pagecache pages
197629 pages in swap cache
Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005
Free swap  = 65881196kB
Total swap = 67354616kB
3997696 pages RAM
207046 pages reserved
1688629 pages shared
3016248 pages non-shared

[-- Attachment #2: page_referenced.patch --]
[-- Type: text/x-patch, Size: 7402 bytes --]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cb0ba70..03a10f7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -80,7 +80,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
+			struct mem_cgroup *cnt, unsigned long *vm_flags, int trylock);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
 	TTU_MIGRATION = 1,		/* migration mode */
@@ -99,7 +99,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags);
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-			unsigned long, spinlock_t **, int);
+			unsigned long, spinlock_t **, int, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
@@ -135,7 +135,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); static inline int page_referenced(struct page *page, int is_locked, struct mem_cgroup *cnt, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { *vm_flags = 0; return TestClearPageReferenced(page); diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 1888b2d..35be29d 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ retry: address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..e066833 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int trylock) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (trylock) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return (pte_t *)-1; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, trylock); if (!pte) goto out; + else if (pte == (pte_t *)-1) + return 1; /* * Don't want to elevate referenced for mlocked page that gets this far, @@ -396,7 +405,8 @@ out: static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -417,7 +427,7 @@ static int page_referenced_anon(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, trylock); if (!mapcount) break; } @@ -482,7 +492,7 @@ static int 
page_referenced_file(struct page *page, if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } @@ -497,6 +507,7 @@ static int page_referenced_file(struct page *page, * @is_locked: caller holds lock on the page * @mem_cont: target memory controller * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * @trylock: use spin_trylock to prevent high lock contention. * * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. @@ -504,7 +515,8 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { int referenced = 0; @@ -515,7 +527,7 @@ int page_referenced(struct page *page, if (page_mapped(page) && page->mapping) { if (PageAnon(page)) referenced += page_referenced_anon(page, mem_cont, - vm_flags); + vm_flags, trylock); else if (is_locked) referenced += page_referenced_file(page, mem_cont, vm_flags); @@ -547,7 +559,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +786,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..fff63a0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,7 +643,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. * If page which have PG_mlocked lost isoltation race, @@ -1355,7 +1355,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, /* page_referenced clears PageReferenced */ if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + page_referenced(page, 0, sc->mem_cgroup, &vm_flags, + priority<DEF_PRIORITY-2?0:1)) { nr_rotated++; /* * Identify referenced, file-backed active pages and ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-03 22:14 ` Larry Woodman @ 2009-12-04 0:29 ` Rik van Riel 2009-12-04 21:26 ` Larry Woodman 2009-12-04 0:36 ` KOSAKI Motohiro 1 sibling, 1 reply; 28+ messages in thread From: Rik van Riel @ 2009-12-04 0:29 UTC (permalink / raw) To: Larry Woodman Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/03/2009 05:14 PM, Larry Woodman wrote: > The attached patch addresses this issue by changing page_check_address() > to return -1 if the spin_trylock() fails and page_referenced_one() to > return 1 in that path so the page gets moved back to the active list. Your patch forgot to add the code to vmscan.c to actually move the page back to the active list. Also, please use an enum for the page_referenced return values, so the code in vmscan.c can use symbolic names. enum page_reference { NOT_REFERENCED, REFERENCED, LOCK_CONTENDED, }; -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
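(For illustration only: a minimal sketch of the vmscan.c side Rik is asking for, combining the enum above with the page_referenced() signature from Larry's patch. This is not from any patch posted in this thread; the exact placement inside shrink_active_list(), the ClearPageActive() step, and the priority test are assumptions taken from the surrounding code.)

	/* hypothetical caller change in shrink_active_list() -- sketch only */
	referenced = page_referenced(page, 0, sc->mem_cgroup, &vm_flags,
				     priority < DEF_PRIORITY - 2 ? 0 : 1);

	if (referenced == LOCK_CONTENDED) {
		/* pte lock contended: nothing was learned about the page,
		 * so rotate it back onto the active list instead of
		 * deactivating it with a stale referenced bit */
		list_add(&page->lru, &l_active);
		continue;
	}

	if (referenced == REFERENCED) {
		nr_rotated++;
		/* executable file pages keep their extra round, as before */
		if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
			list_add(&page->lru, &l_active);
			continue;
		}
	}

	ClearPageActive(page);	/* we are de-activating */
	list_add(&page->lru, &l_inactive);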
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 0:29 ` Rik van Riel @ 2009-12-04 21:26 ` Larry Woodman 2009-12-06 21:04 ` Rik van Riel 0 siblings, 1 reply; 28+ messages in thread From: Larry Woodman @ 2009-12-04 21:26 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli [-- Attachment #1: Type: text/plain, Size: 665 bytes --] On Thu, 2009-12-03 at 19:29 -0500, Rik van Riel wrote: > On 12/03/2009 05:14 PM, Larry Woodman wrote: > > > The attached patch addresses this issue by changing page_check_address() > > to return -1 if the spin_trylock() fails and page_referenced_one() to > > return 1 in that path so the page gets moved back to the active list. > > Your patch forgot to add the code to vmscan.c to actually move > the page back to the active list. Right > > Also, please use an enum for the page_referenced return > values, so the code in vmscan.c can use symbolic names. > > enum page_reference { > NOT_REFERENCED, > REFERENCED, > LOCK_CONTENDED, > }; > Here it is: [-- Attachment #2: page_referenced.patch --] [-- Type: text/x-patch, Size: 11269 bytes --] diff --git a/include/linux/rmap.h b/include/linux/rmap.h index cb0ba70..2b931c8 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -76,11 +76,17 @@ static inline void page_dup_rmap(struct page *page) atomic_inc(&page->_mapcount); } +enum page_referenced { + NOT_REFERENCED, + REFERENCED, + LOCK_CONTENDED, +}; + /* * Called from mm/vmscan.c to handle paging out */ int page_referenced(struct page *, int is_locked, - struct mem_cgroup *cnt, unsigned long *vm_flags); + struct mem_cgroup *cnt, unsigned long *vm_flags, int trylock); enum ttu_flags { TTU_UNMAP = 0, /* unmap mode */ TTU_MIGRATION = 1, /* migration mode */ @@ -99,7 +105,7 @@ int try_to_unmap(struct page *, enum ttu_flags flags); * Called from mm/filemap_xip.c to unmap empty zero page */ pte_t *page_check_address(struct page *, struct mm_struct *, - unsigned long, spinlock_t **, int); + unsigned long, spinlock_t **, int, int); /* * Used by swapoff to help locate where page is expected in vma. @@ -135,10 +141,11 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); static inline int page_referenced(struct page *page, int is_locked, struct mem_cgroup *cnt, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { *vm_flags = 0; - return TestClearPageReferenced(page); + return (TestClearPageReferenced(page)?REFERENCED:NOT_REFERENCED); } #define try_to_unmap(page, refs) SWAP_FAIL diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 1888b2d..35be29d 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ retry: address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); diff --git a/mm/ksm.c b/mm/ksm.c index 5575f86..8abb14b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -623,7 +623,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (addr == -EFAULT) goto out; - ptep = page_check_address(page, mm, addr, &ptl, 0); + ptep = page_check_address(page, mm, addr, &ptl, 0, 0); if (!ptep) goto out; diff --git a/mm/rmap.c b/mm/rmap.c index dd43373..2b15a18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -270,7 +270,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) * On success returns with pte mapped and locked. */ pte_t *page_check_address(struct page *page, struct mm_struct *mm, - unsigned long address, spinlock_t **ptlp, int sync) + unsigned long address, spinlock_t **ptlp, int sync, int trylock) { pgd_t *pgd; pud_t *pud; @@ -298,7 +298,13 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm, } ptl = pte_lockptr(mm, pmd); - spin_lock(ptl); + if (trylock) { + if (!spin_trylock(ptl)) { + pte_unmap(pte); + return (pte_t *)-1; + } + } else + spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; @@ -325,7 +331,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) address = vma_address(page, vma); if (address == -EFAULT) /* out of vma range */ return 0; - pte = page_check_address(page, vma->vm_mm, address, &ptl, 1); + pte = page_check_address(page, vma->vm_mm, address, &ptl, 1, 0); if (!pte) /* the page is not in this mm */ return 0; pte_unmap_unlock(pte, ptl); @@ -340,7 +346,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) static int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned int *mapcount, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, trylock); if (!pte) goto out; + else if (pte == (pte_t *)-1) + return LOCK_CONTENDED; /* * Don't want to elevate referenced for mlocked page that gets this far, @@ -391,21 +400,24 @@ out_unmap: out: if (referenced) *vm_flags |= vma->vm_flags; - return referenced; + return (referenced?REFERENCED:NOT_REFERENCED); } static int page_referenced_anon(struct page *page, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { unsigned int mapcount; struct anon_vma *anon_vma; struct vm_area_struct *vma; int referenced = 0; + int locked = 0; + int ret; anon_vma = page_lock_anon_vma(page); if (!anon_vma) - return referenced; + return NOT_REFERENCED; mapcount = page_mapcount(page); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { @@ -416,14 +428,19 @@ static int page_referenced_anon(struct page *page, */ if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont)) continue; - referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + ret = page_referenced_one(page, vma, + &mapcount, vm_flags, trylock); + if (ret == LOCK_CONTENDED) + locked++; + else if (ret == REFERENCED) + referenced++; + if (!mapcount) break; } page_unlock_anon_vma(anon_vma); - return referenced; + return (locked?LOCK_CONTENDED:referenced?REFERENCED:NOT_REFERENCED); } /** @@ -482,13 +499,13 @@ static int page_referenced_file(struct page *page, if (mem_cont && 
!mm_match_cgroup(vma->vm_mm, mem_cont)) continue; referenced += page_referenced_one(page, vma, - &mapcount, vm_flags); + &mapcount, vm_flags, 0); if (!mapcount) break; } spin_unlock(&mapping->i_mmap_lock); - return referenced; + return (referenced?REFERENCED:NOT_REFERENCED); } /** @@ -497,6 +514,7 @@ static int page_referenced_file(struct page *page, * @is_locked: caller holds lock on the page * @mem_cont: target memory controller * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * @trylock: use spin_trylock to prevent high lock contention. * * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. @@ -504,9 +522,11 @@ static int page_referenced_file(struct page *page, int page_referenced(struct page *page, int is_locked, struct mem_cgroup *mem_cont, - unsigned long *vm_flags) + unsigned long *vm_flags, + int trylock) { int referenced = 0; + int ret = NOT_REFERENCED; if (TestClearPageReferenced(page)) referenced++; @@ -514,25 +534,28 @@ int page_referenced(struct page *page, *vm_flags = 0; if (page_mapped(page) && page->mapping) { if (PageAnon(page)) - referenced += page_referenced_anon(page, mem_cont, - vm_flags); + ret = page_referenced_anon(page, mem_cont, + vm_flags, trylock); else if (is_locked) - referenced += page_referenced_file(page, mem_cont, + ret = page_referenced_file(page, mem_cont, vm_flags); else if (!trylock_page(page)) referenced++; else { - if (page->mapping) - referenced += page_referenced_file(page, + if (page->mapping) { + ret = page_referenced_file(page, mem_cont, vm_flags); + } unlock_page(page); } } + if (ret == REFERENCED) + referenced++; if (page_test_and_clear_young(page)) referenced++; - return referenced; + return (ret == LOCK_CONTENDED?LOCK_CONTENDED:(referenced?REFERENCED:NOT_REFERENCED)); } static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) @@ -547,7 +570,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 1); + pte = page_check_address(page, mm, address, &ptl, 1, 0); if (!pte) goto out; @@ -774,7 +797,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl, 0); + pte = page_check_address(page, mm, address, &ptl, 0, 0); if (!pte) goto out; diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..5543278 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,14 +643,14 @@ static unsigned long shrink_page_list(struct list_head *page_list, } referenced = page_referenced(page, 1, - sc->mem_cgroup, &vm_flags); + sc->mem_cgroup, &vm_flags, 0); /* * In active use or really unfreeable? Activate it. 
* If page which have PG_mlocked lost isoltation race, * try_to_unmap moves it to unevictable list */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page) + (referenced == REFERENCED) && page_mapping_inuse(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; @@ -686,7 +686,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (PageDirty(page)) { - if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced) + if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && (referenced == REFERENCED)) goto keep_locked; if (!may_enter_fs) goto keep_locked; @@ -1320,6 +1320,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, struct page *page; struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); unsigned long nr_rotated = 0; + int referenced; lru_add_drain(); spin_lock_irq(&zone->lru_lock); @@ -1354,21 +1355,27 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { - nr_rotated++; - /* - * Identify referenced, file-backed active pages and - * give them one more trip around the active list. So - * that executable code get better chances to stay in - * memory under moderate memory pressure. Anon pages - * are not likely to be evicted by use-once streaming - * IO, plus JVM can create lots of anon VM_EXEC pages, - * so we ignore them here. - */ - if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) { - list_add(&page->lru, &l_active); - continue; + if (page_mapping_inuse(page)) { + + referenced = page_referenced(page, 0, sc->mem_cgroup, + &vm_flags, priority<DEF_PRIORITY-2?0:1); + + if (referenced != NOT_REFERENCED) { + nr_rotated++; + /* + * Identify referenced, file-backed active pages and + * give them one more trip around the active list. So + * that executable code get better chances to stay in + * memory under moderate memory pressure. Anon pages + * are not likely to be evicted by use-once streaming + * IO, plus JVM can create lots of anon VM_EXEC pages, + * so we ignore them here. + */ + if (((vm_flags & VM_EXEC) && page_is_file_cache(page)) || + (referenced == LOCK_CONTENDED)) { + list_add(&page->lru, &l_active); + continue; + } } } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 21:26 ` Larry Woodman @ 2009-12-06 21:04 ` Rik van Riel 0 siblings, 0 replies; 28+ messages in thread From: Rik van Riel @ 2009-12-06 21:04 UTC (permalink / raw) To: Larry Woodman Cc: KOSAKI Motohiro, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/04/2009 04:26 PM, Larry Woodman wrote: > > Here it is: Kosaki's patch series looks a lot more complete. Does Kosaki's patch series resolve your issue? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-03 22:14 ` Larry Woodman 2009-12-04 0:29 ` Rik van Riel @ 2009-12-04 0:36 ` KOSAKI Motohiro 2009-12-04 19:31 ` Larry Woodman 1 sibling, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-04 0:36 UTC (permalink / raw) To: Larry Woodman Cc: kosaki.motohiro, Rik van Riel, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote: > > > This is reasonable, except for the fact that pages that are moved > > to the inactive list without having the referenced bit cleared are > > guaranteed to be moved back to the active list. > > > > You'll be better off without that excess list movement, by simply > > moving pages directly back onto the active list if the trylock > > fails. > > > > > The attached patch addresses this issue by changing page_check_address() > to return -1 if the spin_trylock() fails and page_referenced_one() to > return 1 in that path so the page gets moved back to the active list. > > Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute > has 196491 isolated_anon pages. This means that ~6140 processes are > somewhere down in try_to_free_pages() since we only isolate 32 pages at > a time, this is out of 9000 processes... > > > --------------------------------------------------------------------- > active_anon:2140361 inactive_anon:453356 isolated_anon:196491 > active_file:3438 inactive_file:1100 isolated_file:0 > unevictable:2802 dirty:153 writeback:0 unstable:0 > free:578920 slab_reclaimable:49214 slab_unreclaimable:93268 > mapped:1105 shmem:0 pagetables:139100 bounce:0 > > Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB > active_anon:7835452kB inactive_anon:785764kB active_file:13672kB > inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB > isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB > writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB > slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > > 202895 total pagecache pages > 197629 pages in swap cache > Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005 > Free swap = 65881196kB > Total swap = 67354616kB > 3997696 pages RAM > 207046 pages reserved > 1688629 pages shared > 3016248 pages non-shared This seems we have to improve reclaim bale out logic. the system already have 1.5GB free pages. IOW, the system don't need swap-out anymore. > @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page, > if (address == -EFAULT) > goto out; > > - pte = page_check_address(page, mm, address, &ptl, 0); > + pte = page_check_address(page, mm, address, &ptl, 0, trylock); > if (!pte) > goto out; > + else if (pte == (pte_t *)-1) > + return 1; > > /* > * Don't want to elevate referenced for mlocked page that gets this far, Sorry, NAK. I have to say the same thing of Rik's previous mention. shrink_active_list() ignore the return value of page_referenced(). then above 'return 1' is meaningless. Umm, ok, I'll make the patch myself. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload 2009-12-04 0:36 ` KOSAKI Motohiro @ 2009-12-04 19:31 ` Larry Woodman 0 siblings, 0 replies; 28+ messages in thread From: Larry Woodman @ 2009-12-04 19:31 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On Fri, 2009-12-04 at 09:36 +0900, KOSAKI Motohiro wrote: > > > > Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute > > has 196491 isolated_anon pages. This means that ~6140 processes are > > somewhere down in try_to_free_pages() since we only isolate 32 pages at > > a time, this is out of 9000 processes... > > > > > > --------------------------------------------------------------------- > > active_anon:2140361 inactive_anon:453356 isolated_anon:196491 > > active_file:3438 inactive_file:1100 isolated_file:0 > > unevictable:2802 dirty:153 writeback:0 unstable:0 > > free:578920 slab_reclaimable:49214 slab_unreclaimable:93268 > > mapped:1105 shmem:0 pagetables:139100 bounce:0 > > > > Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB > > active_anon:7835452kB inactive_anon:785764kB active_file:13672kB > > inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB > > isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB > > writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB > > slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB > > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > > > > 202895 total pagecache pages > > 197629 pages in swap cache > > Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005 > > Free swap = 65881196kB > > Total swap = 67354616kB > > 3997696 pages RAM > > 207046 pages reserved > > 1688629 pages shared > > 3016248 pages non-shared > > This seems we have to improve reclaim bale out logic. the system already > have 1.5GB free pages. IOW, the system don't need swap-out anymore. > > Whats going on here is there are about 7500 runable processes and the memory is already low. A process runs, requests memory and eventually ends up in try_to_free_pages. Since the page reclaim code calls cond_resched()in several places so the scheduler eventually puts that process on the run queue and runs another process which does the same thing. Eventually you end up with thousands of runnable processes in the page reclaim code continuing to reclaim even though there is plenty of free memory. procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 7480 421 1474396 2240060 7640 12988 23 27 24 40 14 40 12 45 43 1 0 7524 405 1474772 2224504 7644 13460 759 689 764 689 8932 11076 8 92 0 0 0 7239 401 1474164 2210572 7656 13524 596 592 596 596 8809 10494 9 91 0 0 0 BTW, this is easy to reproduce. Fork thousands of processes that collectively overcommit memory and keep them allocating... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
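(To make the "bail out" idea above concrete: one possible shape of such a check, sketched only from this discussion. The helper name enough_memory_freed(), its use of the low watermark, and where it would be called from are assumptions, not code from any posted patch.)

	/* hypothetical helper for mm/vmscan.c -- sketch only */
	static int enough_memory_freed(struct zonelist *zonelist,
				       struct scan_control *sc)
	{
		struct zoneref *z;
		struct zone *zone;

		/* If an eligible zone is back above its low watermark,
		 * other reclaimers have already freed enough memory and
		 * this task can stop scanning. */
		for_each_zone_zonelist(zone, z, zonelist,
				       gfp_zone(sc->gfp_mask)) {
			if (zone_watermark_ok(zone, sc->order,
					      low_wmark_pages(zone),
					      zone_idx(zone), 0))
				return 1;
		}
		return 0;
	}

A direct reclaimer could test something like this at the top of the priority loop in do_try_to_free_pages() and return early, so thousands of runnable processes do not keep scanning after free memory has already recovered.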
* [PATCH] Clear reference bit although page isn't mapped. 2009-12-01 12:23 ` KOSAKI Motohiro 2009-12-01 16:41 ` Larry Woodman @ 2009-12-02 2:55 ` KOSAKI Motohiro 2009-12-02 3:07 ` Rik van Riel 1 sibling, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-02 2:55 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. > it prevent to drop page reference bit from unmapped cache page. it mean > we protect unmapped cache page than mapped page. it is strange. How about this? --------------------------------- SplitLRU VM replacement algorithm assume shrink_active_list() clear the page's reference bit. but unnecessary page_mapping_inuse() test prevent it. This patch remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/vmscan.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 00156f2..3e942b5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1347,8 +1347,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && - page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { + if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { nr_rotated++; /* * Identify referenced, file-backed active pages and -- 1.6.5.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Clear reference bit although page isn't mapped. 2009-12-02 2:55 ` [PATCH] Clear reference bit although page isn't mapped KOSAKI Motohiro @ 2009-12-02 3:07 ` Rik van Riel 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Rik van Riel @ 2009-12-02 3:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: >> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. >> it prevent to drop page reference bit from unmapped cache page. it mean >> we protect unmapped cache page than mapped page. it is strange. >> > How about this? > > --------------------------------- > SplitLRU VM replacement algorithm assume shrink_active_list() clear > the page's reference bit. but unnecessary page_mapping_inuse() test > prevent it. > > This patch remove it. > Shrink_page_list ignores the referenced bit on pages that are !page_mapping_inuse(). if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced && page_mapping_inuse(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; The reason we leave the referenced bit on unmapped pages is that we want the next reference to a deactivated page cache page to move that page back to the active list. We do not want to require that such a page gets accessed twice before being reactivated while on the inactive list, because (1) we know it was a frequently accessed page already and (2) ongoing streaming IO might evict it from the inactive list before it gets accessed twice. Arguably, we should just replace the page_mapping_inuse() in both places with page_mapped() to simplify things. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:07 ` Rik van Riel @ 2009-12-02 3:28 ` KOSAKI Motohiro 2009-12-02 4:57 ` Rik van Riel 2009-12-02 11:07 ` Johannes Weiner 0 siblings, 2 replies; 28+ messages in thread From: KOSAKI Motohiro @ 2009-12-02 3:28 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli > On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: > >> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. > >> it prevent to drop page reference bit from unmapped cache page. it mean > >> we protect unmapped cache page than mapped page. it is strange. > >> > > How about this? > > > > --------------------------------- > > SplitLRU VM replacement algorithm assume shrink_active_list() clear > > the page's reference bit. but unnecessary page_mapping_inuse() test > > prevent it. > > > > This patch remove it. > > > Shrink_page_list ignores the referenced bit on pages > that are !page_mapping_inuse(). > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > referenced && > page_mapping_inuse(page) > && !(vm_flags & VM_LOCKED)) > goto activate_locked; > > The reason we leave the referenced bit on unmapped > pages is that we want the next reference to a deactivated > page cache page to move that page back to the active > list. We do not want to require that such a page gets > accessed twice before being reactivated while on the > inactive list, because (1) we know it was a frequently > accessed page already and (2) ongoing streaming IO > might evict it from the inactive list before it gets accessed > twice. > > Arguably, we should just replace the page_mapping_inuse() > in both places with page_mapped() to simplify things. Ah, yes. /me was slept. thanks correct me. From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Date: Wed, 2 Dec 2009 12:05:26 +0900 Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() page reclaim logic need to distingish mapped and unmapped pages. However page_mapping_inuse() don't provide proper test way. it test the address space (i.e. file) is mmpad(). Why `page' reclaim need care unrelated page's mapped state? it's unrelated. Thus, This patch replace page_mapping_inuse() with page_mapped() Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/vmscan.c | 25 ++----------------------- 1 files changed, 2 insertions(+), 23 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 00156f2..350b9cc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -277,27 +277,6 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, return ret; } -/* Called without lock on whether page is mapped, so answer is unstable */ -static inline int page_mapping_inuse(struct page *page) -{ - struct address_space *mapping; - - /* Page is in somebody's page tables. */ - if (page_mapped(page)) - return 1; - - /* Be more reluctant to reclaim swapcache than pagecache */ - if (PageSwapCache(page)) - return 1; - - mapping = page_mapping(page); - if (!mapping) - return 0; - - /* File is mmap'd by somebody? 
*/ - return mapping_mapped(mapping); -} - static inline int is_page_cache_freeable(struct page *page) { /* @@ -663,7 +642,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * try_to_unmap moves it to unevictable list */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page) + referenced && page_mapped(page) && !(vm_flags & VM_LOCKED)) goto activate_locked; @@ -1347,7 +1326,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } /* page_referenced clears PageReferenced */ - if (page_mapping_inuse(page) && + if (page_mapped(page) && page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) { nr_rotated++; /* -- 1.6.5.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro @ 2009-12-02 4:57 ` Rik van Riel 2009-12-02 11:07 ` Johannes Weiner 1 sibling, 0 replies; 28+ messages in thread From: Rik van Riel @ 2009-12-02 4:57 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On 12/01/2009 10:28 PM, KOSAKI Motohiro wrote: >> On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote: >> >>>> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call. >>>> it prevent to drop page reference bit from unmapped cache page. it mean >>>> we protect unmapped cache page than mapped page. it is strange. >>>> >>>> >>> How about this? >>> >>> --------------------------------- >>> SplitLRU VM replacement algorithm assume shrink_active_list() clear >>> the page's reference bit. but unnecessary page_mapping_inuse() test >>> prevent it. >>> >>> This patch remove it. >>> >>> >> Shrink_page_list ignores the referenced bit on pages >> that are !page_mapping_inuse(). >> >> if (sc->order<= PAGE_ALLOC_COSTLY_ORDER&& >> referenced&& >> page_mapping_inuse(page) >> && !(vm_flags& VM_LOCKED)) >> goto activate_locked; >> >> The reason we leave the referenced bit on unmapped >> pages is that we want the next reference to a deactivated >> page cache page to move that page back to the active >> list. We do not want to require that such a page gets >> accessed twice before being reactivated while on the >> inactive list, because (1) we know it was a frequently >> accessed page already and (2) ongoing streaming IO >> might evict it from the inactive list before it gets accessed >> twice. >> >> Arguably, we should just replace the page_mapping_inuse() >> in both places with page_mapped() to simplify things. >> > Ah, yes. /me was slept. thanks correct me. > > > From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 > From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> > Date: Wed, 2 Dec 2009 12:05:26 +0900 > Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() > > page reclaim logic need to distingish mapped and unmapped pages. > However page_mapping_inuse() don't provide proper test way. it test > the address space (i.e. file) is mmpad(). Why `page' reclaim need > care unrelated page's mapped state? it's unrelated. > > Thus, This patch replace page_mapping_inuse() with page_mapped() > > Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> > Reviewed-by: Rik van Riel <riel@redhat.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Replace page_mapping_inuse() with page_mapped() 2009-12-02 3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro 2009-12-02 4:57 ` Rik van Riel @ 2009-12-02 11:07 ` Johannes Weiner 1 sibling, 0 replies; 28+ messages in thread From: Johannes Weiner @ 2009-12-02 11:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Larry Woodman, linux-kernel, linux-mm, akpm, Hugh Dickins, KAMEZAWA Hiroyuki, Andrea Arcangeli On Wed, Dec 02, 2009 at 12:28:26PM +0900, KOSAKI Motohiro wrote: > From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001 > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Date: Wed, 2 Dec 2009 12:05:26 +0900 > Subject: [PATCH] Replace page_mapping_inuse() with page_mapped() > > page reclaim logic need to distingish mapped and unmapped pages. > However page_mapping_inuse() don't provide proper test way. it test > the address space (i.e. file) is mmpad(). Why `page' reclaim need > care unrelated page's mapped state? it's unrelated. > > Thus, This patch replace page_mapping_inuse() with page_mapped() > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] high system time & lock contention running large mixed workload
  2009-11-30 22:00 ` [RFC] high system time & lock contention running large mixed workload Larry Woodman
  2009-12-01 10:04 ` Andrea Arcangeli
  2009-12-01 12:23 ` KOSAKI Motohiro
@ 2009-12-02  1:55 ` Rik van Riel
  2 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2009-12-02 1:55 UTC (permalink / raw)
To: Larry Woodman; +Cc: linux-kernel, linux-mm, akpm

On 11/30/2009 05:00 PM, Larry Woodman wrote:
> While running workloads that do lots of forking processes, exiting
> processes and page reclamation (AIM 7) on large systems, very high system
> time (100%) and lots of lock contention was observed.
>
>
> CPU5:
> [<ffffffff814afb48>] ? _spin_lock+0x27/0x48
> [<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
> [<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
> [<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff81058407>] ? default_wake_function+0x0/0x36
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU4:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
> [<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
> [<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
> [<ffffffff8105ce4c>] ? mmput+0x55/0xd9
> [<ffffffff81061afd>] ? exit_mm+0x109/0x129
> [<ffffffff81063846>] ? do_exit+0x1d7/0x712
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
> [<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU0:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
> [<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
> [<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
> [<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
> [<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
> [<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
> [<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
> [<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
> [<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
> [<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
> [<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
> [<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
> [<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
> [<ffffffff81130db8>] ? user_path_at+0x65/0xa3
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
>
> ------------------------------------------------------------------------------
>    PerfTop:  864 irqs/sec  kernel:99.7%  [100000 cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------
>
>   samples    pcnt         RIP          kernel function
>   _______   _______   ________________   _______________
>
>   3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
>    670.00 - 15.6% - ffffffff81101a33 : page_check_address
>    165.00 -  3.8% - ffffffffa01cbc39 : rpc_sleep_on  [sunrpc]
>     40.00 -  0.9% - ffffffff81102113 : try_to_unmap_one
>     29.00 -  0.7% - ffffffff81101c65 : page_referenced_one
>     27.00 -  0.6% - ffffffff81101964 : vma_address
>      8.00 -  0.2% - ffffffff8125a5a0 : clear_page_c
>      6.00 -  0.1% - ffffffff8125a5f0 : copy_page_c
>      6.00 -  0.1% - ffffffff811023ca : try_to_unmap_anon
>      5.00 -  0.1% - ffffffff810fb014 : copy_page_range
>      5.00 -  0.1% - ffffffff810e4d18 : get_page_from_freelist
>
>
> The cause was determined to be the unconditional call to
> page_referenced() for every mapped page encountered in
> shrink_active_list(). page_referenced() takes the anon_vma->lock and
> calls page_referenced_one() for each vma. page_referenced_one() then
> calls page_check_address() which takes the pte_lockptr spinlock. If
> several CPUs are doing this at the same time there is a lot of
> pte_lockptr spinlock contention with the anon_vma->lock held. This
> causes contention on the anon_vma->lock, stalling in the fork path and
> very high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> set when the priority worked its way from 12 all the way to 7. This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> One way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it was called by shrink_active_list() and simply
> fail if it could not get the pte_lockptr spinlock. This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much quicker.
>
> The attached patch does just that, thoughts???

My first thought is that you haven't read the code you are
trying to patch.

The purpose of calling page_referenced on anonymous pages
from shrink_active_list is to CLEAR the referenced bit, not
to test it!

Not clearing the referenced bit will break page replacement,
because pages that were not recently referenced will appear
to be, causing them to get another round on the active list,
which in turn could increase the list movement...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 28+ messages in thread
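(For reference, this is roughly what page_referenced_one() in mm/rmap.c did with each pte at the time of this thread, heavily simplified -- the VM_LOCKED and sequential-read special cases are omitted. It illustrates why the call both tests and clears the accessed bit in a single step, so skipping it leaves stale referenced bits behind:)

	/* simplified sketch of the core of page_referenced_one() */
	pte = page_check_address(page, mm, address, &ptl, 0);
	if (!pte)
		goto out;

	/* test AND clear the hardware accessed bit in one operation */
	if (ptep_clear_flush_young_notify(vma, address, pte))
		referenced++;

	pte_unmap_unlock(pte, ptl);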
end of thread, other threads:[~2009-12-06 21:04 UTC | newest]

Thread overview: 28+ messages
2009-11-25 18:37 [PATCH] vmscan: do not evict inactive pages when skipping an active list scan Rik van Riel
2009-11-25 20:35 ` Johannes Weiner
2009-11-25 20:47 ` Rik van Riel
2009-11-26  2:50 ` KOSAKI Motohiro
2009-11-26  2:57 ` Rik van Riel
2009-11-30 22:00 ` [RFC] high system time & lock contention running large mixed workload Larry Woodman
2009-12-01 10:04 ` Andrea Arcangeli
2009-12-01 12:31 ` KOSAKI Motohiro
2009-12-01 12:46 ` Andrea Arcangeli
2009-12-02  2:02 ` KOSAKI Motohiro
2009-12-02  2:04 ` Rik van Riel
2009-12-02  2:00 ` Rik van Riel
2009-12-01 12:23 ` KOSAKI Motohiro
2009-12-01 16:41 ` Larry Woodman
2009-12-02  2:20 ` Rik van Riel
2009-12-02  2:41 ` KOSAKI Motohiro
2009-12-03 22:14 ` Larry Woodman
2009-12-04  0:29 ` Rik van Riel
2009-12-04 21:26 ` Larry Woodman
2009-12-06 21:04 ` Rik van Riel
2009-12-04  0:36 ` KOSAKI Motohiro
2009-12-04 19:31 ` Larry Woodman
2009-12-02  2:55 ` [PATCH] Clear reference bit although page isn't mapped KOSAKI Motohiro
2009-12-02  3:07 ` Rik van Riel
2009-12-02  3:28 ` [PATCH] Replace page_mapping_inuse() with page_mapped() KOSAKI Motohiro
2009-12-02  4:57 ` Rik van Riel
2009-12-02 11:07 ` Johannes Weiner
2009-12-02  1:55 ` [RFC] high system time & lock contention running large mixed workload Rik van Riel