Subject: Re: [PATCH] mm: vmstat: Use zeroed stats for unpopulated zones
From: Sandipan Das
Date: Thu, 7 May 2020 14:35:15 +0530
To: Michal Hocko, Vlastimil Babka
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, khlebnikov@yandex-team.ru,
    kirill@shutemov.name, aneesh.kumar@linux.ibm.com, srikar@linux.vnet.ibm.com
In-Reply-To: <20200507070924.GE6345@dhcp22.suse.cz>
References: <20200504070304.127361-1-sandipan@linux.ibm.com>
 <20200504102441.GM22838@dhcp22.suse.cz>
 <959f15af-28a8-371b-c5c3-cd7489d2a7fb@suse.cz>
 <20200506140241.GB6345@dhcp22.suse.cz>
 <20200506152408.GD6345@dhcp22.suse.cz>
 <20200507070924.GE6345@dhcp22.suse.cz>

On 07/05/20 12:39 pm, Michal Hocko wrote:
> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>> [...]
>>
>> How about something like this:
>>
>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>> index aaf1667489f8..08ec2c2bdce3 100644
>> --- a/Documentation/admin-guide/numastat.rst
>> +++ b/Documentation/admin-guide/numastat.rst
>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>
>>  All units are pages. Hugepages have separate counters.
>>
>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
>> +is incremented on the preferred node, otherwise numa_foreign is incremented on
>> +the preferred node and numa_miss on the node where allocation succeeded.
>> +
>> +Usually preferred node is the one local to the CPU where the process executes,
>> +but restrictions such as mempolicies can change that, so there are also two
>> +counters based on CPU local node. local_node is similar to numa_hit and is
>> +incremented on allocation from a node by CPU on the same node. other_node is
>> +similar to numa_miss and is incremented on the node where allocation succeeds
>> +from a CPU from a different node. Note there is no counter analogical to
>> +numa_foreign.
>> +
>> +In more detail:
>> +
>>  =============== ============================================================
>>  numa_hit        A process wanted to allocate memory from this node,
>>                  and succeeded.
>> @@ -14,11 +29,13 @@ numa_miss       A process wanted to allocate memory from another node,
>>                  but ended up with memory from this node.
>>
>>  numa_foreign    A process wanted to allocate on this node,
>> -                but ended up with memory from another one.
>> +                but ended up with memory from another node.
>>
>> -local_node      A process ran on this node and got memory from it.
>> +local_node      A process ran on this node's CPU,
>> +                and got memory from this node.
>>
>> -other_node      A process ran on this node and got memory from another node.
>> +other_node      A process ran on a different node's CPU
>> +                and got memory from this node.
>>
>>  interleave_hit  Interleaving wanted to allocate from this node
>>                  and succeeded.
>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>  well right now on machines with a small number of CPUs.
>>
>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>> +heavily. In the current kernel implementation, if a process prefers a
>> +memoryless node (i.e. because it is running on one of its local CPU), the
>> +implementation actually treats one of the nearest nodes with memory as the
>> +preferred node. As a result, such allocation will not increase the numa_foreign
>> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
>> +numa_foreign statistics of the nearest node.
>
> This is certainly an improvement. Thanks! The question whether we can
> identify where bogus numbers came from would be interesting as well.
> Maybe those are not worth fixing but it would be great to understand
> them at least. I have to say that the explanation via boot_pageset is
> not really clear to me.
>

The documentation update will definitely help. Thanks for that.

I collected some stack traces on a ppc64 guest for calls to zone_statistics()
on zones that are still using the boot_pageset. Most of them originate from
kmem_cache_init(), with eventual calls to allocate_slab():

[    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0
[    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120
[    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100
[    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0
[    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580
[    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610
...

In the remaining cases, the sources are ftrace_init() and early_trace_init().

Unless these counts are useful, can we avoid incrementing the stats in
zone_statistics() for zones that are still using the boot_pageset? A rough
sketch of what I have in mind is below.
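This is only a hand-written sketch against mm/page_alloc.c, not compile-tested;
the surrounding function body and the field names (zone->pageset, boot_pageset)
are written from memory rather than copied from the tree:

static void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
	enum numa_stat_item local_stat = NUMA_LOCAL;

	/*
	 * Zones that never get their own per-cpu pagesets keep pointing at
	 * the early boot_pageset (as do all zones before
	 * setup_per_cpu_pageset() has run).  Skip the NUMA counters for them
	 * so that early boot allocations, e.g. from kmem_cache_init(),
	 * ftrace_init() and early_trace_init(), do not show up in their
	 * stats.
	 */
	if (z->pageset == &boot_pageset)
		return;

	/* ... rest of the existing hit/miss/foreign accounting unchanged ... */
#endif
}

If comparing against boot_pageset turns out to be unreliable, checking
populated_zone() at this point might be an alternative, although that would
still let the very early (pre-setup_per_cpu_pageset()) allocations through.

- Sandipan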