From: Andrew Morton <akpm@linux-foundation.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
Date: Fri, 26 Jul 2013 15:45:33 -0700
Message-ID: <20130726154533.aebd39c603ffe8de3b2c76fb@linux-foundation.org>
In-Reply-To: <1374267325-22865-1-git-send-email-hannes@cmpxchg.org>

On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
> 
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>    nodes falling behind for a full reclaim cycle relative to the other
>    nodes in the system
> 
> #3 fixes an interaction where kswapd and a continuous stream of page
>    allocations keep the preferred zone of a task between the high and
>    low watermark (allocations succeed + kswapd does not go to sleep)
>    indefinitely, completely underutilizing the lower zones and
>    thrashing on the preferred zone
> 
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing.  Andrea recommended to submit them
> separately as they are bugfixes in their own right.
> 
> The following test ran a foreground workload (memcachetest) with
> background IO of various sizes on a 4 node 8G system (similar results
> were observed with single-node 4G systems):
> 
> parallelio
>                                               BASE                   FAIRALLOC
> Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
> Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
> Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
> Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
> Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
> Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
> Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
> Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
> Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
> Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
> Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
> Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
> Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
> Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
> Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
> Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
> Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
> Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
> Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
> Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)

A reminder whether positive numbers are good or bad would be useful ;)
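
For what it's worth, the percentages in the Ops table look like gains
relative to BASE with the direction flipped per metric, so that a
positive number reads as "better": throughput rows use (new - base),
the duration, swap and fault rows use (base - new).  Below is a minimal
Python sketch that reproduces four of the rows.  gain_pct() and its
higher_is_better flag are hypothetical names chosen for illustration,
not part of the benchmark tooling, and the rows whose FAIRALLOC value is
0.00 are printed as 0.00% in the table and do not follow this formula.

```python
def gain_pct(base, new, higher_is_better):
    """Percentage gain relative to base; positive means 'better'."""
    if base == 0:
        return 0.0  # avoid dividing by zero for the all-zero rows
    delta = (new - base) if higher_is_better else (base - new)
    return 100.0 * delta / base


# (row name, BASE, FAIRALLOC, higher_is_better) copied from the table
rows = [
    ("memcachetest-791M", 4740.0, 5293.0, True),
    ("io-duration-791M", 55.0, 18.0, False),
    ("swaptotal-2639M", 468069.0, 108778.0, False),
    ("minorfaults-2639M", 1606035.0, 1799717.0, False),
]

for name, base, new, higher_is_better in rows:
    print(f"{name:20s} {gain_pct(base, new, higher_is_better):7.2f}%")
```

Under that reading the one negative entry, minorfaults-2639M, is a
regression: minor faults went up by about 12%.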

>                 BASE   FAIRALLOC
> User          287.78      460.97
> System       2151.67     3142.51
> Elapsed      9737.00     8879.34

Confused.  Why would the amount of user time increase so much?

And that's a tremendous increase in system time.  Am I interpreting
this correctly?
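
To put numbers on the question, here is a small Python sketch of the
relative change in each row of the time table.  The values are copied
from the quoted table; rel_change_pct() is a hypothetical helper name,
and the sketch says nothing about why the CPU time moved.

```python
def rel_change_pct(base, new):
    """Relative change from base to new, in percent (positive = increase)."""
    return 100.0 * (new - base) / base


# (BASE, FAIRALLOC) in seconds, copied from the quoted table
times = {
    "User": (287.78, 460.97),
    "System": (2151.67, 3142.51),
    "Elapsed": (9737.00, 8879.34),
}

for name, (base, new) in times.items():
    print(f"{name:8s} {rel_change_pct(base, new):+7.2f}%")
```

That works out to roughly 60% more user time and 46% more system time,
against about 9% less elapsed time.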
 
>                                   BASE   FAIRALLOC
> Minor Faults                  53721925    57188551
> Major Faults                    392195       15157
> Swap Ins                       2994854      112770
> Swap Outs                      4907092      134982
> Direct pages scanned                 0       41824
> Kswapd pages scanned          32975063     8128269
> Kswapd pages reclaimed         6323069     7093495
> Direct pages reclaimed               0       41824
> Kswapd efficiency                  19%         87%
> Kswapd velocity               3386.573     915.414
> Direct efficiency                 100%        100%
> Direct velocity                  0.000       4.710
> Percentage direct scans             0%          0%
> Zone normal velocity          2011.338     550.661
> Zone dma32 velocity           1365.623     369.221
> Zone dma velocity                9.612       0.242
> Page writes by reclaim    18732404.000  614807.000
> Page writes file              13825312      479825
> Page writes anon               4907092      134982
> Page reclaim immediate           85490        5647
> Sector Reads                  12080532      483244
> Sector Writes                 88740508    65438876
> Page rescued immediate               0           0
> Slabs scanned                    82560       12160
> Direct inode steals                  0           0
> Kswapd inode steals              24401       40013
> Kswapd skipped wait                  0           0
> THP fault alloc                      6           8
> THP collapse alloc                5481        5812
> THP splits                          75          22
> THP fault fallback                   0           0
> THP collapse fail                    0           0
> Compaction stalls                    0          54
> Compaction success                   0          45
> Compaction failures                  0           9
> Page migrate success            881492       82278
> Page migrate failure                 0           0
> Compaction pages isolated            0       60334
> Compaction migrate scanned           0       53505
> Compaction free scanned              0     1537605
> Compaction cost                    914          86
> NUMA PTE updates              46738231    41988419
> NUMA hint faults              31175564    24213387
> NUMA hint local faults        10427393     6411593
> NUMA pages migrated             881492       55344
> AutoNUMA cost                   156221      121361

Some nice numbers there.

> The overall runtime was reduced; throughput improved for both the
> foreground workload and the background IO; major faults, swapping
> and reclaim activity shrank significantly; and reclaim efficiency
> more than quadrupled.
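
The derived kswapd rows can be cross-checked from the raw counters.  The
quoted figures are consistent with efficiency being pages reclaimed over
pages scanned, and velocity being pages scanned per second of elapsed
time.  That reading is inferred from the numbers rather than stated in
the report, and the function names in this Python sketch are
hypothetical.

```python
def efficiency_pct(reclaimed, scanned):
    """Share of scanned pages that were actually reclaimed, in percent."""
    return 100.0 * reclaimed / scanned if scanned else 100.0


def velocity(scanned, elapsed_seconds):
    """Pages scanned per second of elapsed wall-clock time."""
    return scanned / elapsed_seconds


# Kswapd counters and elapsed times copied from the quoted tables
runs = {
    "BASE": {"scanned": 32975063, "reclaimed": 6323069, "elapsed": 9737.00},
    "FAIRALLOC": {"scanned": 8128269, "reclaimed": 7093495, "elapsed": 8879.34},
}

for name, r in runs.items():
    eff = efficiency_pct(r["reclaimed"], r["scanned"])
    vel = velocity(r["scanned"], r["elapsed"])
    print(f"{name:10s} efficiency {eff:5.1f}%  velocity {vel:9.3f} pages/s")

ratio = (efficiency_pct(7093495, 8128269)
         / efficiency_pct(6323069, 32975063))
print(f"efficiency ratio: {ratio:.2f}x")
```

The ratio comes out a little above 4.5, which matches the "more than
quadrupled" claim.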

