Date: Tue, 31 Mar 2020 10:55:13 +0200
From: Michal Hocko
To: Mike Rapoport
Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka,
    Oscar Salvador, Pavel Tatashin, Alexander Duyck, Benjamin Herrenschmidt,
    Paul Mackerras, Michael Ellerman, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, "H. Peter Anvin", "David S. Miller", Heiko Carstens,
    Vasily Gorbik, Christian Borntraeger, "open list:MEMORY MANAGEMENT",
    linux-arm-kernel@lists.infradead.org, linux-s390@vger.kernel.org,
    sparclinux@vger.kernel.org, x86@kernel.org,
    linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
    lho@amperecomputing.com, mmorana@amperecomputing.com
Subject: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
Message-ID: <20200331085513.GE30449@dhcp22.suse.cz>
References: <1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com>
 <20200330074246.GA14243@dhcp22.suse.cz>
 <20200330175100.GD30942@linux.ibm.com>
 <20200330182301.GM14243@dhcp22.suse.cz>
 <20200331081423.GE30942@linux.ibm.com>
In-Reply-To: <20200331081423.GE30942@linux.ibm.com>

On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote:
> > On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > > In a NUMA layout where nodes have memory ranges that span across
> > > > > other nodes, the mm driver can detect the memory node id
> > > > > incorrectly.
> > > > >
> > > > > For example, with the layout below
> > > > > Node 0 address: 0000 xxxx 0000 xxxx
> > > > > Node 1 address: xxxx 1111 xxxx 1111
> > > > >
> > > > > Note:
> > > > > - Memory from low to high
> > > > > - 0/1: Node id
> > > > > - x: Invalid memory of a node
> > > > >
> > > > > When mm probes the memory map without the
> > > > > CONFIG_NODES_SPAN_OTHER_NODES config, mm only checks the memory
> > > > > validity but not the node id.
> > > > > Because of that, Node 1 also detects the memory from node 0 as
> > > > > below when it scans from the start address to the end address of
> > > > > node 1.
> > > > >
> > > > > Node 0 address: 0000 xxxx xxxx xxxx
> > > > > Node 1 address: xxxx 1111 1111 1111
> > > > >
> > > > > This layout could occur on any architecture. Most of them enable
> > > > > this config by default with CONFIG_NUMA. This patch, by default,
> > > > > enables CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid()
> > > > > for NUMA.
> > > >
> > > > I am not opposed to this at all. It reduces the config space and
> > > > that is a good thing on its own. The history has shown that memory
> > > > layout might be really wild wrt NUMA. The config is only used for
> > > > early_pfn_in_nid, which is clearly an overkill.
> > > >
> > > > Your description doesn't really explain why this is safe, though.
> > > > The history of this config is somewhat messy. Mike tried to remove
> > > > it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > > NODES_SPAN_OTHER_NODES") just to have it reintroduced by
> > > > 7516795739bd ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for
> > > > powerpc") without any reasoning whatsoever. This doesn't make it
> > > > easy to see whether the reasons for the reintroduction are still
> > > > there. Maybe there are some subtle dependencies. I do not see any
> > > > TBH, but that might be buried deep in arch specific code.
> > >
> > > I've looked at this a bit more and it seems that the check for
> > > early_pfn_in_nid() in memmap_init_zone() can simply be removed.
> > >
> > > The commits you've mentioned were way before the addition of
> > > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates
> > > zone sizes and boundaries based on the memblock node map. So,
> > > memmap_init_zone() is called when zone boundaries are already within
> > > a node.
> >
> > But zones from different nodes might overlap in the pfn range.
> > And this check is there to skip over those overlapping areas.
>
> Maybe I mis-read the code, but I don't see how this could happen. In the
> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> calculate_node_totalpages(), which ensures that node->node_zones are
> entirely within the node because this is checked in
> zone_spanned_pages_in_node().

zone_spanned_pages_in_node does check that the zone boundaries are within
the node boundaries. But that doesn't really tell anything about other
potential zones interleaving with the physical memory range.
zone->spanned_pages simply gives the physical range for the zone,
including holes. Interleaving nodes are essentially a hole
(__absent_pages_in_range is going to skip those). That means that
free_area_init_core simply goes over the whole physical zone range,
including holes, and that is why we need to check both for physical and
logical holes (aka other nodes).

The life would be so much easier if the whole thing simply iterated over
memblocks...

-- 
Michal Hocko
SUSE Labs