From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=X5fG=JF=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS
	autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9A592C433B4
	for <linux-mm@archiver.kernel.org>; Thu,  8 Apr 2021 20:50:46 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 4A04C61165
	for <linux-mm@archiver.kernel.org>; Thu,  8 Apr 2021 20:50:46 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4A04C61165
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id D689D6B0036; Thu,  8 Apr 2021 16:50:45 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D3EAE6B006C; Thu,  8 Apr 2021 16:50:45 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C2E166B006E; Thu,  8 Apr 2021 16:50:45 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0133.hostedemail.com [216.40.44.133])
	by kanga.kvack.org (Postfix) with ESMTP id A6F3B6B0036
	for <linux-mm@kvack.org>; Thu,  8 Apr 2021 16:50:45 -0400 (EDT)
Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with ESMTP id 68C748248047
	for <linux-mm@kvack.org>; Thu,  8 Apr 2021 20:50:45 +0000 (UTC)
X-FDA: 78010393650.28.BE7BF5A
Received: from mail-ej1-f45.google.com (mail-ej1-f45.google.com [209.85.218.45])
	by imf23.hostedemail.com (Postfix) with ESMTP id CAF35A00038C
	for <linux-mm@kvack.org>; Thu,  8 Apr 2021 20:50:43 +0000 (UTC)
Received: by mail-ej1-f45.google.com with SMTP id u21so5222809ejo.13
        for <linux-mm@kvack.org>; Thu, 08 Apr 2021 13:50:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=nSkcR4vgYfQ9bCiomxJK4YRcFfPacWHOv81qx13uXCk=;
        b=S+jRaaDesoRNqSTt0/e0qy7JOw05Ck9EsUF9SiuA3sBxm9U/qNUCvFFH6486LuDeXI
         wbNusF9mGCJxP1RKbUuQqZLicwrX7SWzwTHnbP4TQzIvUITOnNeD0Fni+HQXSFGGTyI9
         UnCIiMhdLvmiMjTDDyue4KlFrxDiRrdRhN+RVSF/MKyJ7FwmZ75epJ9C8ekNhEqt+qcm
         FDppNgO5h5AyDiEFW/JaQ/zL+dZOjnAAnd1shc9Fa7oDWIgJP25tIqTbDj+63aw5aP5Y
         /BknCfTYUOySH6gHm0TbwTzxonlU2WA5Brrv6BOpjoKpDvBbRSE08NPegH/Tyl7DnsEm
         OcUw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=nSkcR4vgYfQ9bCiomxJK4YRcFfPacWHOv81qx13uXCk=;
        b=e8MWKYIH5rk1U1XpVH3ru+NZgyJ9ic0ZtNNrf8ourY8TOtnb/A1SOcyyfIDqowoxtA
         nJfZZbdevON1RYSFWFR0w91HXqYK3ywWNQhwFLVVHAIXB2VVSe6v2Ag9u6lt2TFEzf0r
         douSlB5tVa8hN8L/InU7QnynGNNe/y6qBvGbpMJQVwkDjehexq5aYA7w2/LIkYJ36vEQ
         CAF1f/n2UBANrA/Dxxx0Xe1+op7016GkYJpRQ1/bMiqZjGVHYjHrj84ZN0hOUDgviEYb
         1pwaOMgChx618l2lJTToZZ5iPKnf4byfFY1B7ikh0zxEzw6SmsrjT40fChLRlHrP0pxw
         XOaQ==
X-Gm-Message-State: AOAM533ghNUFiJFp8wKaF2zAuqPq+gE94x9RSeYCHGMOXni6CFZc3MJW
	DzX6/Jpp90leaYNaeYk5/Iit9h61+zZWPpri3VE=
X-Google-Smtp-Source: ABdhPJzDT4h97XJaHF/PbtM16viIv5D560gqJE9uxih1mtbScheWKyhXhZv6JlID3X0eZrK6FPC8CHmEXn9gQ6gyxJg=
X-Received: by 2002:a17:906:c143:: with SMTP id dp3mr11934552ejc.499.1617915043821;
 Thu, 08 Apr 2021 13:50:43 -0700 (PDT)
MIME-Version: 1.0
References: <cover.1617642417.git.tim.c.chen@linux.intel.com>
 <CALvZod7StYJCPnWRNLnYQV8S5CBLtE0w4r2rH-wZzNs9jGJSRg@mail.gmail.com>
 <CAHbLzkrPD6s9vRy89cgQ36e+1cs6JbLqV84se7nnvP9MByizXA@mail.gmail.com> <CALvZod69-GcS2W57hAUvjbWBCD6B2dTeVsFbtpQuZOM2DphwCQ@mail.gmail.com>
In-Reply-To: <CALvZod69-GcS2W57hAUvjbWBCD6B2dTeVsFbtpQuZOM2DphwCQ@mail.gmail.com>
From: Yang Shi <shy828301@gmail.com>
Date: Thu, 8 Apr 2021 13:50:32 -0700
Message-ID: <CAHbLzkoce41b-pJ5x=6nRhex_xBdC-+cYACBw9HKtA87H71A-Q@mail.gmail.com>
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
To: Shakeel Butt <shakeelb@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>, Michal Hocko <mhocko@suse.cz>, 
	Johannes Weiner <hannes@cmpxchg.org>, Andrew Morton <akpm@linux-foundation.org>, 
	Dave Hansen <dave.hansen@intel.com>, Ying Huang <ying.huang@intel.com>, 
	Dan Williams <dan.j.williams@intel.com>, David Rientjes <rientjes@google.com>, 
	Linux MM <linux-mm@kvack.org>, Cgroups <cgroups@vger.kernel.org>, 
	LKML <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
X-Rspamd-Server: rspam03
X-Rspamd-Queue-Id: CAF35A00038C
X-Stat-Signature: 1rbnw5hpnxisg7bkbreos5ksmcitjbq3
Received-SPF: none (gmail.com>: No applicable sender policy available) receiver=imf23; identity=mailfrom; envelope-from="<shy828301@gmail.com>"; helo=mail-ej1-f45.google.com; client-ip=209.85.218.45
X-HE-DKIM-Result: pass/pass
X-HE-Tag: 1617915043-716712
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > Hi Tim,
> > >
> > > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > > >
> > > > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > > > others NUMA wise, but a byte of media has about the same cost whether it
> > > > is close or far.  But, with new memory tiers such as Persistent Memory
> > > > (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> > > > PMEM.
> > > >
> > > > The fast/expensive memory lives in the top tier of the memory hierarchy.
> > > >
> > > > Previously, the patchset
> > > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> > > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > > >
> > > > And the patchset
> > > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > > > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> > > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > > leveraging autonuma.
> > > >
> > > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > > in PMEM.
> > >
> > > Thanks for working on this as this is becoming more and more important
> > > particularly in the data centers where memory is a big portion of the
> > > cost.
> > >
> > > I see you have responded to Michal and I will add my more specific
> > > response there. Here I wanted to give my high level concern regarding
> > > using v1's soft limit like semantics for top tier memory.
> > >
> > > This patch series aims to distribute/partition top tier memory between
> > > jobs of different priorities. We want high priority jobs to have
> > > preferential access to the top tier memory and we don't want low
> > > priority jobs to hog the top tier memory.
> > >
> > > Using v1's soft limit like behavior can potentially cause high
> > > priority jobs to stall to make enough space on top tier memory on
> > > their allocation path and I think this patchset is aiming to reduce
> > > that impact by making kswapd do that work. However I think the more
> > > concerning issue is the low priority job hogging the top tier memory.
> > >
> > > The possible ways the low priority job can hog the top tier memory are
> > > by allocating non-movable memory or by mlocking the memory. (Oh there
> > > is also pinning the memory but I don't know if there is a user api to
> > > pin memory?) For the mlocked memory, you need to either modify the
> > > reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
> >
>
> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > >
> > > Basically I am saying we should put the upfront control (limit) on the
> > > usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
> >
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.

I don't get why low priority jobs can *not* use top tier memory. I can
see that it may incur latency overhead for high priority jobs. But if
that is not allowed, it could be restricted by cpuset without
introducing any new interfaces.
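
To make the cpuset point concrete, here is a minimal sketch of pinning
a low priority job's memory to the second tier nodes. It only writes
the existing cpuset.mems file. The cgroup path and the node numbers
(2-3 as the PMEM nodes) are assumptions for illustration, and the
helper names are made up. On cgroup v2 the cpuset controller also has
to be enabled in the parent's cgroup.subtree_control.

```python
import os


def format_node_list(nodes):
    """Format node IDs in the kernel list format, e.g. [2, 3, 5] -> "2-3,5"."""
    ranges = []
    start = prev = None
    for n in sorted(set(nodes)):
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            ranges.append((start, prev))
            start = prev = n
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def restrict_to_nodes(cgroup_dir, nodes):
    """Write the allowed memory nodes into <cgroup_dir>/cpuset.mems."""
    path = os.path.join(cgroup_dir, "cpuset.mems")
    with open(path, "w") as f:
        f.write(format_node_list(nodes))
    return path


# Hypothetical usage (needs root and an existing cgroup):
#   restrict_to_nodes("/sys/fs/cgroup/lowprio", [2, 3])
```

This is the "extreme" variant discussed above: the job never touches
top tier memory at all.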

I suppose memory utilization could be maximized by allowing all jobs
to allocate memory from all applicable nodes, then letting the
reclaimer (or something new if needed) migrate the memory to the
proper nodes over time. That way we could achieve some kind of balance
between memory utilization and resource isolation.
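
As a toy illustration only (this is not how the kernel reclaimer
works, and every name here is made up), the idea is: let pages land
anywhere, then have a periodic pass keep the hottest pages in the top
tier and push the rest down. Hotness is modeled as a plain access
count, which is an assumption of the sketch.

```python
def rebalance(pages, top_capacity):
    """Return a new placement with the hottest pages in the top tier.

    pages: dict mapping page id -> (tier, access_count)
    top_capacity: number of pages the top tier can hold
    """
    # Rank pages from hottest to coldest; ties keep insertion order.
    ranked = sorted(pages, key=lambda p: pages[p][1], reverse=True)
    hot = set(ranked[:top_capacity])
    return {
        p: ("top" if p in hot else "second", pages[p][1])
        for p in pages
    }


placement = {
    "a": ("second", 90),  # hot page that was allocated on the slow tier
    "b": ("top", 1),      # cold page occupying the fast tier
    "c": ("top", 50),
}
placement = rebalance(placement, top_capacity=2)
```

After the pass, "a" and "c" sit in the top tier and "b" is demoted,
regardless of which job allocated them first.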

>
> I have some thoughts on NUMA node limits which I will share in the other thread.

Look forward to reading it.