From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>, Shaohua Li <shaohua.li@intel.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"cl@linux.com" <cl@linux.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
Date: Wed, 13 Oct 2010 15:32:08 +0900 (JST)
Message-ID: <20101013152922.ADC6.A69D9226@jp.fujitsu.com>
In-Reply-To: <20101013151723.ADBD.A69D9226@jp.fujitsu.com>
Shaohua Li reported that commit aa45484031 (mm: page allocator: calculate a
better estimate of NR_FREE_PAGES when memory is low and kswapd is awake)
caused a performance regression.
| In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10%
| cpu time according to perf when memory pressure is high. The workload
| does something like:
| for i in `seq 1 $nr_cpu`
| do
|         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
|         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
| done
| this simply reads a sparse file for each CPU. Apparently the
| zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot()
| makes a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for
| reference.
| Is there any way to reduce the overhead?
|
| Node 3, zone   Normal
|   pages free     2055926
|         min      1441
|         low      1801
|         high     2161
|         scanned  0
|         spanned  2097152
|         present  2068480
|   vm stats threshold: 98
This means zone_page_state_snapshot() is more costly than we expected. This
patch takes a very different approach: we reserve the max-drift pages up
front instead of calculating the free pages at runtime.

However, this technique can't be used on systems with many CPUs and little
memory. On such systems, we still need to use zone_page_state_snapshot().
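
To make the cost concrete, here is a minimal user-space sketch of the two ways of reading a free-page count. It is only a model: `zone_model`, `read_cheap()`, `read_snapshot()` and `NR_CPUS_MODEL` are hypothetical names made up for illustration, not kernel symbols. The point is that the exact read has to visit every CPU's pending delta, which on a 64 CPU machine means pulling in many remote cache lines on each call.

```c
#include <stdint.h>

#define NR_CPUS_MODEL 4	/* hypothetical; kept small for illustration */

/* Simplified stand-in for a zone's free-page accounting. */
struct zone_model {
	long global_free;			/* shared counter, may lag behind */
	int8_t cpu_diff[NR_CPUS_MODEL];		/* per-CPU deltas not yet folded in */
};

/* Cheap read: looks at the shared counter only, so it can be off by
 * up to (threshold x number of CPUs) pages. */
static long read_cheap(const struct zone_model *z)
{
	return z->global_free;
}

/* Exact read: shared counter plus every CPU's pending delta.
 * The loop is what becomes expensive when there are many CPUs. */
static long read_snapshot(const struct zone_model *z)
{
	long x = z->global_free;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		x += z->cpu_diff[cpu];

	return x < 0 ? 0 : x;	/* never report a negative count */
}
```

Reserving the worst-case drift in the watermark lets the allocator rely on the cheap read, because even a maximally stale counter cannot hide a real dip below the original minimum.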
Example1: typical desktop
  CPU: 2
  MEM: 2GB
  old) zone->min = sqrt(2x1024x1024x16) = 5792 KB = 1448 pages
  new) max-drift = 2 x fls(2) x fls(2x1024/128) x 2 = 40 pages
       zone->min = 1448 + 40 = 1488 pages
Example2: relatively large server
  CPU: 64
  MEM: 8GBx4 (=32GB)
  old) zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
  new) max-drift = 2 x fls(64) x fls(8x1024/128) x 64 = 6272 pages
       zone->min = 1448 + 6272 = 7720 pages

  Hmm, zone->min became more than 5 times larger. Is that acceptable? I
  think so. Today, we can buy 8GB of DRAM for $20, so wasting 6272 pages
  (=24.5MB) costs about 6 cents. That is a good deal for getting good
  performance.
Example3: extremely big server
  CPU: 2048
  MEM: 64GBx256 (=16TB)
  old) zone->min = sqrt(16x1024x1024x1024x16)/256 = 2048 KB = 512 pages
       (Wow! It's smaller than the desktop)
  new) max-drift = 125 x 2048 = 256000 pages = 1000MB (greater than 64GB/100)
       zone->min = 512 pages (unchanged, because the drift exceeds 1% of
       the zone and is therefore not reserved)
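
The arithmetic in the three examples can be reproduced with the following sketch. It rests on several assumptions: 4 KB pages, memory split evenly across nodes, a per-CPU threshold capped at 125 (as Example3 implies), and the two logarithm-like factors computed as a bit length (the way the kernel's fls() counts), which is the reading that yields 40 and 6272. All function names here are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

#define PAGE_KB 4ULL	/* assumption: 4 KB pages, as the examples imply */

/* Bit length of x: 1-based position of the highest set bit, 0 for x == 0. */
static unsigned int bit_length(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Integer square root, rounded down. */
static uint64_t isqrt(uint64_t x)
{
	uint64_t r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Worst-case drift for one zone: per-CPU threshold (capped at 125,
 * an assumption taken from Example3) times the number of CPUs. */
static uint64_t max_drift_pages(uint64_t cpus, uint64_t zone_mb)
{
	uint64_t t = 2ULL * bit_length(cpus) * bit_length(zone_mb / 128);

	if (t > 125)
		t = 125;
	return t * cpus;
}

/* "old)" line of the examples: sqrt(total_kb x 16) / nodes, in pages. */
static uint64_t old_zone_min_pages(uint64_t total_mb, uint64_t nodes)
{
	return isqrt(total_mb * 1024 * 16) / nodes / PAGE_KB;
}

/* "new)" line: add the drift only if it is below 1% of the zone. */
static uint64_t new_zone_min_pages(uint64_t cpus, uint64_t total_mb,
				   uint64_t nodes)
{
	uint64_t zone_mb = total_mb / nodes;
	uint64_t present_pages = zone_mb * 1024 / PAGE_KB;
	uint64_t min = old_zone_min_pages(total_mb, nodes);
	uint64_t drift = max_drift_pages(cpus, zone_mb);

	if (drift < present_pages / 100)
		min += drift;
	return min;
}
```

For instance, new_zone_min_pages(64, 32768, 4) gives 7720 as in Example2, while new_zone_min_pages(2048, 16777216, 256) stays at 512 because 256000 pages is more than 1% of a 64GB zone.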
Reported-by: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/page_alloc.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53627fa..194bdaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4897,6 +4897,15 @@ static void setup_per_zone_wmarks(void)
 	for_each_zone(zone) {
 		u64 tmp;
 
+		/*
+		 * If max drift are less than 1%, reserve max drift pages
+		 * instead costly runtime calculation.
+		 */
+		if (zone->percpu_drift_mark < (zone->present_pages/100)) {
+			pages_min += zone->percpu_drift_mark;
+			zone->percpu_drift_mark = 0;
+		}
+
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
--
1.6.5.2