Date: Tue, 8 Apr 2014 17:58:21 -0500 (CDT)
From: Christoph Lameter
Subject: Re: [PATCH 0/2] Disable zone_reclaim_mode by default
To: Robert Haas
Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Josh Berkus, Andres Freund,
    Linux-MM, LKML, sivanich@sgi.com

On Tue, 8 Apr 2014, Robert Haas wrote:

> Well, as Josh quite rightly said, the hit from accessing remote memory
> is never going to be as large as the hit from disk. If and when there
> is a machine where remote memory is more expensive to access than
> disk, that's a good argument for zone_reclaim_mode. But I don't
> believe that's anywhere close to being true today, even on an 8-socket
> machine with an SSD.

I am not sure how disk figures into this. The tradeoff is the cost of
zone reclaim vs. the aggregate performance degradation from remote
memory accesses. That depends on the cacheability of the application
and the scale of its memory accesses.

The reason that zone reclaim is on by default is that off-node accesses
are a big performance hit on large-scale NUMA systems (like ScaleMP and
SGI). Zone reclaim was written *because* those systems experienced
severe performance degradation. On tightly coupled 4- and 8-node
systems there does not seem to be a benefit, from what I hear.

> Now, perhaps the fear is that if we access that remote memory
> *repeatedly* the aggregate cost will exceed what it would have cost to
> fault that page into the local node just once. But it takes a lot of
> accesses for that to be true, and most of the time you won't get them.
> Even if you do, I bet many workloads will prefer even performance
> across all the accesses over a very slow first access followed by
> slightly faster subsequent accesses.

Many HPC workloads prefer the opposite.

> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts. In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.

Shifting pages involves tradeoffs similar to those of zone reclaim vs.
remote allocations.
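
For anyone who wants to see which policy their machines are actually
running with, here is a minimal user-space sketch (not part of the patch
series) that reads /proc/sys/vm/zone_reclaim_mode and decodes the bits.
The bit meanings are taken from Documentation/sysctl/vm.txt as of the
3.x kernels (1 = zone reclaim on, 2 = write dirty pages during reclaim,
4 = swap pages during reclaim); treat it as an illustration only.

    /* zone_reclaim_check.c
     *
     * Read /proc/sys/vm/zone_reclaim_mode and decode the bitmask
     * (bit values assumed from the 3.x-era Documentation/sysctl/vm.txt).
     *
     * Build: cc -o zone_reclaim_check zone_reclaim_check.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
            int mode;

            if (!f) {
                    perror("fopen");
                    return EXIT_FAILURE;
            }
            if (fscanf(f, "%d", &mode) != 1) {
                    fprintf(stderr, "could not parse zone_reclaim_mode\n");
                    fclose(f);
                    return EXIT_FAILURE;
            }
            fclose(f);

            printf("zone_reclaim_mode = %d\n", mode);
            printf("  zone reclaim enabled : %s\n", (mode & 1) ? "yes" : "no");
            printf("  writes dirty pages   : %s\n", (mode & 2) ? "yes" : "no");
            printf("  swaps pages          : %s\n", (mode & 4) ? "yes" : "no");
            return 0;
    }

Toggling the default at runtime is the usual sysctl dance, e.g.
"echo 0 > /proc/sys/vm/zone_reclaim_mode" to turn zone reclaim off while
measuring a workload.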