From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
"Huang, Ying" <ying.huang@intel.com>,
Yang Shi <shy828301@gmail.com>, Wei Xu <weixugc@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: memcg reclaim demotion wrt. isolation
Date: Wed, 14 Dec 2022 18:40:19 +0100 [thread overview]
Message-ID: <Y5oKg4KFsFIYOYuZ@cmpxchg.org> (raw)
In-Reply-To: <Y5nrwrP0twm9IIDl@dhcp22.suse.cz>
Hey Michal,
On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion. All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any numa based balancing is a real challenge with memcg semantic. I do
> not see per numa node memcg limits without a major overhaul of how we do
> charging though. I am not sure this is on the table even long term.
> Unless I am really missing something here we have to live with the
> existing semantic for a foreseeable future.
Yes, I think you're quite right.
We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:
a) The NUMA distances are close enough that we ignore it and pretend
all memory is (mostly) fungible.
b) The NUMA distances are big enough that it matters, in which case
the best option is to avoid sharing, and use bindings to keep
workloads/containers isolated to their own CPU+memory domains.
Tiered memory forces the issue by providing memory that must be shared
between workloads/containers, but is not fungible. At least not
without incurring priority inversions between containers, where a
lopri container promotes itself to the top and demotes the hipri
workload, while staying happily within its global memory allowance.
This applies to mbind() cases as much as it does to NUMA balancing.
If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - applies not just to "RAM as a
whole", but to each memory tier individually.
Whether we need the full memcg interface per tier or per node, I'm not
sure. It might be enough to automatically apportion global allowances
to nodes; so if you have 32G toptier and 16G lowtier, and a cgroup has
a 20G allowance, it gets 13G on top and 7G on low.
(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)
Either way, I expect this issue will keep coming up as we try to use
containers on such systems.
Thread overview: 13+ messages
2022-12-13 15:41 Michal Hocko
2022-12-13 16:14 ` Johannes Weiner
2022-12-14 9:42 ` Michal Hocko
2022-12-14 12:40 ` Johannes Weiner
2022-12-14 15:29 ` Michal Hocko
2022-12-14 17:40 ` Johannes Weiner [this message]
2022-12-15 6:17 ` Huang, Ying
2022-12-15 8:22 ` Johannes Weiner
2022-12-16 3:16 ` Huang, Ying
2022-12-13 22:26 ` Dave Hansen
2022-12-14 9:45 ` Michal Hocko
2022-12-14 2:57 ` Huang, Ying
2022-12-14 9:49 ` Michal Hocko