From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 23 Jun 2020 09:12:11 -0700
From: Ben Widawsky
To: Michal Hocko
Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
 Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
 Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
 Lee Schermerhorn, Li Xinhai, Mel Gorman, Mike Kravetz, Mina Almasry,
 Tejun Heo, Vlastimil Babka, linux-api@vger.kernel.org
Subject: Re: [PATCH 00/18] multiple preferred nodes
Message-ID: <20200623161211.qjup5km5eiisy5wy@intel.com>
References: <20200619162425.1052382-1-ben.widawsky@intel.com> <20200622070957.GB31426@dhcp22.suse.cz>
 <20200623112048.GR31426@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
In-Reply-To: <20200623112048.GR31426@dhcp22.suse.cz>

On 20-06-23 13:20:48, Michal Hocko wrote:
> On Mon 22-06-20 09:10:00, Michal Hocko wrote:
> [...]
> > > The goal of the new mode is to enable some use-cases when using tiered memory
> > > usage models which I've lovingly named.
> > > 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> > > requirements allowing preference to be given to all nodes with "fast" memory.
> > > 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> > > perhaps slow memory), but doesn't care which node it runs on. The application
> > > can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> > > etc). This reverses how nodes are chosen today, where the kernel attempts to use
> > > memory local to the CPU whenever possible. This will attempt to use the
> > > accelerator local to the memory.
> > > 2. The Tortoise - The administrator (or the application itself) is aware it only
> > > needs slow memory, and so can prefer that.
> > >
> > > Much of this is almost achievable with the bind interface, but the bind
> > > interface suffers from an inability to fall back to another set of nodes if
> > > binding fails for all nodes in the nodemask.
>
> Yes, and probably worth mentioning explicitly that this might lead to
> the OOM killer invocation, so a failure would be disruptive to any
> workload which is allowed to allocate from the specific node mask (so
> even tasks without any mempolicy).

Thanks.
I don't believe I mention this fact in any of the commit messages or comments
(and perhaps this is an indication I should have). I'll find a place to
mention this outside of the cover letter.

>
> > > Like MPOL_BIND, a nodemask is given. Inherently this removes ordering from the
> > > preference.
> > >
> > > > /* Set first two nodes as preferred in an 8 node system. */
> > > > const unsigned long nodes = 0x3;
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > >
> > > > /* Mimic interleave policy, but have fallback. */
> > > > const unsigned long nodes = 0xaa;
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > >
> > > Some internal discussion took place around the interface. There are two
> > > alternatives which we have discussed, plus one I stuck in:
> > > 1. Ordered list of nodes. Currently it's believed that the added complexity is
> > >    not needed for expected use cases.
>
> There is no ordering in MPOL_BIND either, and even though NUMA APIs tend
> to be screwed up from multiple aspects, this is not a problem I have ever
> stumbled over.
>
> > > 2. A flag for bind to allow falling back to other nodes. This confuses the
> > >    notion of binding and is less flexible than the current solution.
>
> Agreed.
>
> > > 3. Create flags or new modes that help with some ordering. This offers both a
> > >    friendlier API as well as a solution for more customized usage. It's unknown
> > >    if it's worth the complexity to support this. Here is sample code for how
> > >    this might work:
> > >
> > > > // Default
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > > > // which is the same as
> > > > set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> OK
>
> > > > // The Hare
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> > > >
> > > > // The Tortoise
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> > > >
> > > > // Prefer the fast memory of the first two sockets
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> > > >
> > > > // Prefer specific nodes for something wacky
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);
>
> I am not so sure about these though. It would be much easier to
> start without additional modifiers and provide MPOL_PREFER_MANY without
> any additional restrictions first (btw. I would like MPOL_PREFER_MASK
> more, but I do understand that naming is not the top priority now).

True. In fact, this is the same as making MPOL_F_PREFER_ORDER_TYPE_CUSTOM the
implicit default, and adding the others later. Luckily for me, this is
effectively what I have already :-).

It's a new domain for me, so I'm very flexible on the name. MASK seems like an
altogether better name to me as well, but I've been using "MANY" long enough
now that it seems natural.

>
> It would be also great to provide a high level semantic description
> here. I have very quickly glanced through the patches and they are not
> really trivial to follow with many incremental steps, so the higher level
> intention is lost easily.
>
> Do I get it right that the default semantic is essentially
> - allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
>   semantic)
> - fallback to numa unrestricted allocation with the default
>   numa policy on the failure
>
> Or are there any usecases to modify how hard to keep the preference over
> the fallback?

tl;dr is: yes, and no usecases.

Longer answer: Internal APIs (specifically, __alloc_pages_nodemask()) keep all
the same semantics for trying to allocate, with the exception that it will
first try the preferred nodes, and next try the bound nodes. It should be
noted here that an empty preferred mask is the same as saying: traverse nodes
in distance order starting from local. Therefore, both for the preferred mask
and the bound mask, the universe set is equivalent to the empty set
(∅ == U). [1]

| prefmask | bindmask | how                                    |
|----------|----------|----------------------------------------|
| ∅        | ∅        | Page allocation without policy         |
| ∅        | N ≠ ∅    | MPOL_BIND                              |
| N ≠ ∅    | ∅        | MPOL_PREFERRED* or internal preference |
| N ≠ ∅    | N ≠ ∅    | MPOL_BIND + internal preference        |
|----------|----------|----------------------------------------|

At the end of this patch series, there is never a case (that I can contrive,
anyway) where prefmask is multiple nodes and bindmask is multiple nodes. In
the future, if internal callers wanted to try to get clever, this could be the
case. The UAPI won't allow having both a bind and a preferred node. "This
system call defines the default policy for the thread. The thread policy
governs allocation of pages in the process's address space outside of memory
ranges controlled by a more specific policy set by mbind(2)."

To your second question: there isn't any usecase. Sans bugs and oversights,
preferred nodes are always tried before fallback. I consider that almost the
hardest level of preference.
The one thing I can think of that would be "harder" would be some sort of
mechanism to try all preferred nodes before any tricks are used, like reclaim.
I fear doing this will make the already scary get_page_from_freelist() even
more scary.

On this topic, I haven't changed anything for fragmentation. In the code right
now, fragmentation is enabled as soon as the zone chosen for allocation doesn't
match ac->preferred_zoneref->zone:

```
if (no_fallback && nr_online_nodes > 1 &&
    zone != ac->preferred_zoneref->zone) {
```

What might be more optimal is to move on to the next node and not allow
fragmentation yet, unless zone ∉ prefmask. Like the above, I think this will
add a decent amount of complexity.

The last thing, which I mention in a commit message but not here: OOM will scan
all nodes, and not just preferred nodes first. This seemed like a premature
optimization to me.

[1] There is an underlying assumption that the geodesic distance between any
two nodes is the same for all zonelists. IOW, if you have nodes M, N, P, each
with zones A and B, the zonelists will be as follows:

M zonelist: MA -> MB -> NA -> NB -> PA -> PB
N zonelist: NA -> NB -> PA -> PB -> MA -> MB
P zonelist: PA -> PB -> MA -> MB -> NA -> NB