From: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>,
Pekka Enberg <penberg@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mel@csn.ul.ie>, Jens Axboe <jaxboe@fusionio.com>,
Linux Netdev List <netdev@vger.kernel.org>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: Re: slow performance on disk/network i/o full speed after drop_caches
Date: Thu, 01 Sep 2011 07:41:57 +0200
Message-ID: <4E5F1B25.8040800@profihost.ag>
In-Reply-To: <20110901041458.GA30123@localhost>
Thanks!
On 01.09.2011 06:14, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> You're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> also
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that in our case, on these machines?
>>
>> Can nobody explain why the kernel sets it to 1 in this case?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>         /*
>          * If another node is sufficiently far away then it is better
>          * to reclaim pages in a zone before going off node.
>          */
>         if (distance > RECLAIM_DISTANCE)
>                 zone_reclaim_mode = 1;
>
> As of Linux v3.0, RECLAIM_DISTANCE has been increased from 20 to 30 by the commit below.
> It may well help your case, too.
>
> commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Wed Jun 15 15:08:20 2011 -0700
>
> mm: increase RECLAIM_DISTANCE to 30
>
> Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
> that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
> Xeon E5520 + Intel S5520UR MB). He is using Cyrus IMAPd and it's built on
> a very traditional single-process model.
>
> * a master process which reads config files and manages the other
> processes
> * multiple imapd processes, one per connection
> * multiple pop3d processes, one per connection
> * multiple lmtpd processes, one per connection
> * periodical "cleanup" processes.
>
> There are thousands of independent processes. The problem is that recent
> Intel motherboards turn on zone_reclaim_mode by default, and traditional
> prefork-model software doesn't work well with it. Unfortunately, such models
> are still typical even in the 21st century. We can't ignore them.
>
> This patch raises the zone_reclaim_mode threshold to 30. The value 30 has
> no specific meaning, but 20 corresponds to one-hop QPI/HyperTransport, and
> such relatively cheap 2-4 socket machines are often used for traditional
> servers like the one above. The intention is that these machines don't use
> zone_reclaim_mode.
>
> Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
> This patch doesn't change such high-end NUMA machine behavior.
>
> Dave Hansen said:
>
> : I know specifically of pieces of x86 hardware that set the information
> : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> : behavior which that implies.
> :
> : They've done performance testing and run very large and scary benchmarks
> : to make sure that they _want_ this turned on. What this means for them
> : is that they'll probably be de-optimized, at least on newer versions of
> : the kernel.
> :
> : If you want to do this for particular systems, maybe _that_'s what we
> : should do. Have a list of specific configurations that need the
> : defaults overridden either because they're buggy, or they have an
> : unusual hardware configuration not really reflected in the distance
> : table.
>
> And later said:
>
> : The original change in the hardware tables was for the benefit of a
> : benchmark. Said benchmark isn't going to get run on mainline until the
> : next batch of enterprise distros drops, at which point the hardware where
> : this was done will be irrelevant for the benchmark. I'm sure any new
> : hardware will just set this distance to another yet arbitrary value to
> : make the kernel do what it wants. :)
> :
> : Also, when the hardware got _set_ to this initially, I complained. So, I
> : guess I'm getting my way now, with this patch. I'm cool with it.
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>   * (in whatever arch specific measurement units returned by node_distance())
>   * then switch on zone reclaim on boot.
>   */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
>  #endif
>  #ifndef PENALTY_FOR_NODE_WITH_CPUS
>  #define PENALTY_FOR_NODE_WITH_CPUS (1)
>
> Thanks,
> Fengguang
>