From: Yang Shi <shy828301@gmail.com>
Date: Thu, 28 Apr 2022 10:14:29 -0700
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
To: Wei Xu
Cc: "ying.huang@intel.com", Aneesh Kumar K V, Jagdish Gediya, Dave Hansen,
 Dan Williams, Davidlohr
 Bueso, Linux MM, Linux Kernel Mailing List, Andrew Morton,
 Baolin Wang, Greg Thelen, Michal Hocko, Brice Goglin, Feng Tang

On Wed, Apr 27, 2022 at 9:11 PM Wei Xu wrote:
>
> On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com wrote:
> >
> > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V wrote:
> > > >
> > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com wrote:
> > > > > >
> > > > ....
> > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for
> > > > > > example,
> > > > > >
> > > > > > Nodes 0 & 2 are CPU + DRAM nodes and node 1 is a slow memory
> > > > > > node near node 0:
> > > > > >
> > > > > > available: 3 nodes (0-2)
> > > > > > node 0 cpus: 0 1
> > > > > > node 0 size: n MB
> > > > > > node 0 free: n MB
> > > > > > node 1 cpus:
> > > > > > node 1 size: n MB
> > > > > > node 1 free: n MB
> > > > > > node 2 cpus: 2 3
> > > > > > node 2 size: n MB
> > > > > > node 2 free: n MB
> > > > > > node distances:
> > > > > > node   0   1   2
> > > > > >   0:  10  40  20
> > > > > >   1:  40  10  80
> > > > > >   2:  20  80  10
> > > > > >
> > > > > > We have 2 choices:
> > > > > >
> > > > > > a)
> > > > > > node    demotion targets
> > > > > > 0       1
> > > > > > 2       1
> > > > > >
> > > > > > b)
> > > > > > node    demotion targets
> > > > > > 0       1
> > > > > > 2       X
> > > > > >
> > > > > > a) is good to take advantage of PMEM. b) is good to reduce
> > > > > > cross-socket traffic. Both are OK as the default configuration.
> > > > > > But some users may prefer the other one. So we need a user space
> > > > > > ABI to override the default configuration.
> > > > >
> > > > > I think 2(a) should be the system-wide configuration and 2(b) can
> > > > > be achieved with NUMA mempolicy (which needs to be added to
> > > > > demotion).
> > > > >
> > > > > In general, we can view the demotion order in a way similar to the
> > > > > allocation fallback order (after all, if we don't demote or
> > > > > demotion lags behind, the allocations will go to these demotion
> > > > > target nodes according to the allocation fallback order anyway).
> > > > > If we initialize the demotion order in that way (i.e. every node
> > > > > can demote to any node in the next tier, and the priority of the
> > > > > target nodes is sorted for each source node), we don't need
> > > > > per-node demotion order override from the userspace. What we need
> > > > > is to specify what nodes should be in each tier and to support
> > > > > NUMA mempolicy in demotion.
> > > >
> > > > I have been wondering how we would handle this. For example: if an
> > > > application has specified an MPOL_BIND policy and restricted its
> > > > allocations to Node0 and Node1, should we demote pages allocated by
> > > > that application to Node10? The other alternative for that demotion
> > > > is swapping.
> > > > So from the page's point of view, we either demote to slow memory
> > > > or page out to swap. But then if we demote, we are also breaking
> > > > the MPOL_BIND rule.
> > >
> > > IMHO, the MPOL_BIND policy should be respected and demotion should
> > > be skipped in such cases. Such MPOL_BIND policies can be an
> > > important tool for applications to override and control their memory
> > > placement when transparent memory tiering is enabled. If the
> > > application doesn't want swapping, there are other ways to achieve
> > > that (e.g. mlock, disabling swap globally, setting memcg parameters,
> > > etc).
> > >
> > > > The above says we would need some kind of mempolicy interaction,
> > > > but what I am not sure about is how to find the memory policy in
> > > > the demotion path.
> > >
> > > This is indeed an important and challenging problem. One possible
> > > approach is to retrieve the allowed demotion nodemask from
> > > page_referenced(), similar to vm_flags.
> >
> > This works for the mempolicy in struct vm_area_struct, but not for
> > the one in struct task_struct. Multiple threads in a process may have
> > different mempolicies.
>
> From vm_area_struct, we can get to mm_struct and then to the owner
> task_struct, which has the process mempolicy.
>
> It is indeed a problem when a page is shared by different threads or
> different processes that have different thread default mempolicy
> values.

Sorry for chiming in late. This was a known issue when we were working
on demotion. Yes, it is hard to handle shared pages and multiple
threads, since mempolicy is applied per thread, so each thread may have
a different mempolicy. And I don't think this case is rare. And not
only mempolicy but also cpuset settings may cause a similar problem:
with cgroup v1, different threads may have different cpuset settings.

If this is really a problem for real-life workloads, we may consider
tackling it for exclusively owned pages first. Thanks to David's
patches, we now have dedicated flags to identify exclusively owned
pages.

>
> On the other hand, it can already support most interesting use cases
> for demotion (e.g. selecting the demotion node, mbind to prevent
> demotion) by respecting cpuset and vma mempolicies.
>
> > Best Regards,
> > Huang, Ying
> >
> > > > > Cross-socket demotion should not be too big a problem in
> > > > > practice because we can optimize the code to do the demotion
> > > > > from the local CPU node (i.e. local writes to the target node
> > > > > and remote reads from the source node). The bigger issue is
> > > > > cross-socket memory access to the demoted pages from the
> > > > > applications, which is why NUMA mempolicy is important here.
> > > >
> > > > -aneesh
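
For concreteness, here is a minimal userspace sketch (plain C, not
kernel code) of the two demotion-target policies a) and b) quoted
above. The distance matrix and the slow-memory node come from the
quoted numactl output; the "same socket" distance threshold of 40 and
all names here are illustrative assumptions, not an existing kernel
interface.

/*
 * Sketch of demotion-target selection for the example 3-node topology:
 * policy a) picks the nearest slow node even across sockets; policy b)
 * refuses remote-socket targets, leaving the node with no target
 * ("X", i.e. page out to swap instead).
 */
#include <stdio.h>

#define NR_NODES  3
#define NO_TARGET -1
#define LOCAL_MAX 40	/* assumed largest same-socket distance */

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 40, 20 },
	{ 40, 10, 80 },
	{ 20, 80, 10 },
};

/* Node 1 is the only slow (PMEM) node in this example. */
static const int is_slow[NR_NODES] = { 0, 1, 0 };

static int demotion_target(int node, int cross_socket_ok)
{
	int best = NO_TARGET;

	if (is_slow[node])
		return NO_TARGET;	/* already in the slow tier */

	/* Nearest slow node, by distance. */
	for (int t = 0; t < NR_NODES; t++) {
		if (!is_slow[t])
			continue;
		if (best == NO_TARGET ||
		    distance[node][t] < distance[node][best])
			best = t;
	}

	/* Policy b): reject cross-socket targets. */
	if (!cross_socket_ok && best != NO_TARGET &&
	    distance[node][best] > LOCAL_MAX)
		best = NO_TARGET;

	return best;
}

int main(void)
{
	for (int policy = 1; policy >= 0; policy--) {
		printf("policy %c)\n", policy ? 'a' : 'b');
		for (int n = 0; n < NR_NODES; n++) {
			int t = demotion_target(n, policy);

			if (t == NO_TARGET)
				printf("  node %d -> X\n", n);
			else
				printf("  node %d -> %d\n", n, t);
		}
	}
	return 0;
}

This reproduces the two tables above: a) gives 0 -> 1 and 2 -> 1,
while b) gives 0 -> 1 and 2 -> X. A knob toggling something like
cross_socket_ok is one shape the proposed user space ABI could take.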
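
And a hedged sketch of the MPOL_BIND case Aneesh raises: a task binds
its allocations to nodes 0-1 via the set_mempolicy(2) syscall
(declared in libnuma's <numaif.h>; link with -lnuma). Under the
behavior Wei proposes, demotion for this task's pages would be skipped
(falling back to swap) rather than violating the binding. The node
numbers and allocation size are assumptions taken from the example
topology, not from the patch set.

/*
 * Restrict this thread's allocations to nodes 0 and 1 with MPOL_BIND.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* Bit mask with nodes 0 and 1 set. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/* All future allocations of this thread must come from nodes 0-1. */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy(MPOL_BIND)");
		return 1;
	}

	/* Fault in 64 MiB; first-touch now lands only on nodes 0 or 1. */
	size_t sz = 64UL << 20;
	char *buf = malloc(sz);
	if (!buf)
		return 1;
	memset(buf, 1, sz);

	/*
	 * If demotion respects MPOL_BIND (as discussed in this thread),
	 * these pages can be reclaimed to swap but never demoted to a
	 * node outside {0, 1}.
	 */
	printf("%zu MiB bound to nodes 0-1\n", sz >> 20);
	free(buf);
	return 0;
}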