From: Wei Xu
Date: Wed, 27 Apr 2022 11:27:14 -0700
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
To: Aneesh Kumar K V
Cc: "ying.huang@intel.com", Jagdish Gediya, Yang Shi, Dave Hansen,
 Dan Williams, Davidlohr Bueso, Linux MM, Linux Kernel Mailing List,
 Andrew Morton, Baolin Wang, Greg Thelen, Michal Hocko, Brice Goglin

On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V wrote:
>
> On 4/25/22 10:26 PM, Wei Xu wrote:
> > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > wrote:
> >>
> > ....
>
> >> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> >>
> >> Node 0 & 2 are cpu + dram nodes and node 1 is a slow
> >> memory node near node 0,
> >>
> >> available: 3 nodes (0-2)
> >> node 0 cpus: 0 1
> >> node 0 size: n MB
> >> node 0 free: n MB
> >> node 1 cpus:
> >> node 1 size: n MB
> >> node 1 free: n MB
> >> node 2 cpus: 2 3
> >> node 2 size: n MB
> >> node 2 free: n MB
> >> node distances:
> >> node   0   1   2
> >>   0:  10  40  20
> >>   1:  40  10  80
> >>   2:  20  80  10
> >>
> >> We have 2 choices,
> >>
> >> a)
> >> node    demotion targets
> >> 0       1
> >> 2       1
> >>
> >> b)
> >> node    demotion targets
> >> 0       1
> >> 2       X
> >>
> >> a) is good to take advantage of PMEM. b) is good to reduce cross-socket
> >> traffic. Both are OK as default configuration. But some users may
> >> prefer the other one. So we need a user space ABI to override the
> >> default configuration.
> >
> > I think 2(a) should be the system-wide configuration and 2(b) can be
> > achieved with NUMA mempolicy (which needs to be added to demotion).
> >
> > In general, we can view the demotion order in a way similar to
> > allocation fallback order (after all, if we don't demote or demotion
> > lags behind, the allocations will go to these demotion target nodes
> > according to the allocation fallback order anyway). If we initialize
> > the demotion order in that way (i.e.
> > every node can demote to any node
> > in the next tier, and the priority of the target nodes is sorted for
> > each source node), we don't need per-node demotion order override from
> > the userspace. What we need is to specify what nodes should be in
> > each tier and support NUMA mempolicy in demotion.
> >
>
> I have been wondering how we would handle this. For example: if an
> application has specified an MPOL_BIND policy and restricted the
> allocation to be from Node0 and Node1, should we demote pages allocated
> by that application to Node10? The other alternative for that demotion
> is swapping. So from the page point of view, we either demote to a slow
> memory or pageout to swap. But then if we demote we are also breaking
> the MPOL_BIND rule.

IMHO, the MPOL_BIND policy should be respected and demotion should be
skipped in such cases. Such MPOL_BIND policies can be an important
tool for applications to override and control their memory placement
when transparent memory tiering is enabled. If the application
doesn't want swapping, there are other ways to achieve that (e.g.
mlock, disabling swap globally, setting memcg parameters, etc.).

> The above says we would need some kind of mem policy interaction, but
> what I am not sure about is how to find the memory policy in the
> demotion path.

This is indeed an important and challenging problem. One possible
approach is to retrieve the allowed demotion nodemask from
page_referenced(), similar to how vm_flags is returned.

>
> > Cross-socket demotion should not be too big a problem in practice
> > because we can optimize the code to do the demotion from the local CPU
> > node (i.e. local writes to the target node and remote read from the
> > source node). The bigger issue is cross-socket memory access onto the
> > demoted pages from the applications, which is why NUMA mempolicy is
> > important here.
> >
> >
> -aneesh
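
To make the tier-based ordering and the MPOL_BIND check discussed above
a bit more concrete, here is a rough Python sketch. It is illustrative
only and not kernel code: the names build_demotion_order() and
pick_demotion_target(), the DISTANCE and TIERS tables, and the
representation of the policy as a plain set of node ids are all
assumptions made for this example. The distance matrix is the one from
the 3-node example quoted above, with nodes 0 and 2 assumed to be in the
top tier and node 1 in the next tier.

```python
from typing import Dict, List, Optional, Set

# Node distances from the 3-node example in the thread (nodes 0, 1, 2).
DISTANCE: List[List[int]] = [
    [10, 40, 20],
    [40, 10, 80],
    [20, 80, 10],
]

# Assumed tier assignment: tier 0 = CPU + DRAM nodes, tier 1 = slow memory.
TIERS: List[List[int]] = [[0, 2], [1]]


def build_demotion_order(distance: List[List[int]],
                         tiers: List[List[int]]) -> Dict[int, List[int]]:
    """Let every node demote to any node in the next tier, nearest first."""
    order: Dict[int, List[int]] = {}
    for level, nodes in enumerate(tiers):
        next_tier = tiers[level + 1] if level + 1 < len(tiers) else []
        for src in nodes:
            # Sort candidate targets by distance from the source node.
            order[src] = sorted(next_tier, key=lambda dst: distance[src][dst])
    return order


def pick_demotion_target(order: Dict[int, List[int]], src: int,
                         allowed: Optional[Set[int]] = None) -> Optional[int]:
    """Return the preferred target, or None if demotion should be skipped.

    `allowed` stands in for an MPOL_BIND-style nodemask; None means the
    page is not restricted by any policy.
    """
    for dst in order.get(src, []):
        if allowed is None or dst in allowed:
            return dst
    return None


if __name__ == "__main__":
    order = build_demotion_order(DISTANCE, TIERS)
    print(order)
    # Unrestricted page on node 2: demoted across sockets, as in 2(a).
    print(pick_demotion_target(order, 2))
    # Page bound to nodes {0, 2}: no allowed target, so demotion is skipped.
    print(pick_demotion_target(order, 2, allowed={0, 2}))
```

With this shape, 2(a) falls out as the system-wide default, and an
application that binds its memory to nodes 0 and 2 effectively gets the
2(b) behavior without any per-node override of the demotion order.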