Subject: Re: [PATCH] mm: vmstat: Use zeroed stats for unpopulated zones
From: Vlastimil Babka
To: Michal Hocko
Cc: Sandipan Das, akpm@linux-foundation.org, linux-mm@kvack.org,
 khlebnikov@yandex-team.ru, kirill@shutemov.name, aneesh.kumar@linux.ibm.com,
 srikar@linux.vnet.ibm.com
Date: Wed, 6 May 2020 17:50:28 +0200
In-Reply-To: <20200506152408.GD6345@dhcp22.suse.cz>
References: <20200504070304.127361-1-sandipan@linux.ibm.com>
 <20200504102441.GM22838@dhcp22.suse.cz>
 <959f15af-28a8-371b-c5c3-cd7489d2a7fb@suse.cz>
 <20200506140241.GB6345@dhcp22.suse.cz>
 <20200506152408.GD6345@dhcp22.suse.cz>

On 5/6/20 5:24 PM, Michal Hocko wrote:
>> Yes, if we allocate from cpus 0-3 then it should be a miss on node 0. But
>> the zonelists are optimized so that they don't include empty zones -
>> build_zonerefs_node() checks managed_zone(). As a result, node 0's zonelist
>> has no node 0 zones, which confuses the stats code. We should probably
>> document that numa stats are bogus on systems with memoryless nodes. This
>> patch makes it somewhat more obvious by presenting nice zeroes on the
>> memoryless node itself, but node 1 now includes stats from node 0.
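For context, the zonelist construction referred to above is roughly the
following - a condensed sketch of managed_zone() and build_zonerefs_node(),
not a verbatim excerpt, so details may differ between kernel versions:

static inline bool managed_zone(struct zone *zone)
{
	/* a zone counts only if the buddy allocator manages pages in it */
	return zone_managed_pages(zone);
}

/*
 * Add pgdat's populated zones to a zonelist. Zones with no managed pages
 * are skipped entirely, so a memoryless node contributes nothing and its
 * own zonelist starts with zones from other nodes.
 */
static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
{
	enum zone_type zone_type = MAX_NR_ZONES;
	int nr_zones = 0;

	do {
		struct zone *zone = pgdat->node_zones + --zone_type;

		if (managed_zone(zone))
			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
	} while (zone_type);

	return nr_zones;
}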
> 
> Thanks for the clarification. So the underlying problem is that
> zone_statistics() operates on a preferred zone rather than a node. This
> would be fixable, but I am not sure whether it is worth bothering with.
> Maybe it would just be more convenient to document the unfortunate
> memoryless-node stats situation and be done with it. Or do we have any
> consumers that really do care?
> 
>> >> NUMA_OTHER uses numa_node_id(), which would mean that node 0's cpus have
>> >> node 1 in their numa_node_id()? Is that correct?
>> > 
>> > numa_node_id should reflect the real node the CPU is associated with.
>> 
>> You're right, numa_node_id() is probably fine. But NUMA_OTHER is actually
>> incremented at the zone where the allocation succeeds. This probably
>> doesn't match Documentation/admin-guide/numastat.rst, even on systems
>> without memoryless nodes:
>> 
>> other_node A process ran on this node and got memory from another node.
> 
> Yeah, the documentation doesn't match the implementation. Maybe we
> should just fix the documentation, because this has been the case for
> ages.
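For reference, zone_statistics() currently looks roughly like this - again a
condensed sketch rather than a verbatim excerpt. Every counter except
NUMA_FOREIGN is incremented on the zone where the allocation actually
succeeded (z), which is why other_node behaves as described above:

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
	enum numa_stat_item local_stat = NUMA_LOCAL;

	/* local_node vs. other_node depends on the allocating CPU's node */
	if (zone_to_nid(z) != numa_node_id())
		local_stat = NUMA_OTHER;

	/* hit vs. miss/foreign depends on the preferred zone's node */
	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
		__inc_numa_state(z, NUMA_HIT);
	else {
		__inc_numa_state(z, NUMA_MISS);
		/* the only counter bumped on the preferred node */
		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
	}
	__inc_numa_state(z, local_stat);
#endif
}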

How about something like this:

diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
index aaf1667489f8..08ec2c2bdce3 100644
--- a/Documentation/admin-guide/numastat.rst
+++ b/Documentation/admin-guide/numastat.rst
@@ -6,6 +6,21 @@ Numa policy hit/miss statistics
 
 All units are pages. Hugepages have separate counters.
 
+The numa_hit, numa_miss and numa_foreign counters reflect how well processes
+are able to allocate memory from nodes they prefer. If they succeed, numa_hit
+is incremented on the preferred node; otherwise numa_foreign is incremented on
+the preferred node and numa_miss on the node where the allocation succeeded.
+
+Usually the preferred node is the one local to the CPU where the process
+executes, but restrictions such as mempolicies can change that, so there are
+also two counters based on the CPU-local node. local_node is similar to
+numa_hit and is incremented on allocation from a node by a CPU on the same
+node. other_node is similar to numa_miss and is incremented on the node where
+allocation succeeds from a CPU on a different node. Note that there is no
+counter analogous to numa_foreign.
+
+In more detail:
+
 =============== ============================================================
 numa_hit        A process wanted to allocate memory from this node,
                 and succeeded.
@@ -14,11 +29,13 @@ numa_miss       A process wanted to allocate memory from another node,
                 but ended up with memory from this node.
 
 numa_foreign    A process wanted to allocate on this node,
-                but ended up with memory from another one.
+                but ended up with memory from another node.
 
-local_node      A process ran on this node and got memory from it.
+local_node      A process ran on this node's CPU,
+                and got memory from this node.
 
-other_node      A process ran on this node and got memory from another node.
+other_node      A process ran on a different node's CPU,
+                and got memory from this node.
 
 interleave_hit  Interleaving wanted to allocate from this node
                 and succeeded.
@@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
 (http://oss.sgi.com/projects/libnuma/). Note that it only works well right now
 on machines with a small number of CPUs.
 
+Note that on systems with memoryless nodes (where a node has CPUs but no
+memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
+heavily. In the current kernel implementation, if a process prefers a
+memoryless node (e.g. because it is running on one of that node's CPUs), the
+implementation actually treats one of the nearest nodes with memory as the
+preferred node. As a result, such an allocation will not increase the
+numa_foreign counter on the memoryless node, and will skew the numa_hit,
+numa_miss and numa_foreign statistics of the nearest node.
-- 
2.26.2
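For anyone who wants to observe these counters directly, they are also
exported per node via sysfs, one "<counter> <pages>" pair per line. A minimal
userspace sketch (illustrative only; error handling kept to a bare minimum):

#include <stdio.h>

int main(void)
{
	char path[64], line[128];
	int node;

	for (node = 0; node < 1024; node++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/numastat", node);
		f = fopen(path, "r");
		if (!f)
			continue; /* node absent; node numbers may be sparse */
		printf("node%d:\n", node);
		while (fgets(line, sizeof(line), f))
			printf("  %s", line);
		fclose(f);
	}
	return 0;
}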