Date: Tue, 8 Mar 2022 13:53:19 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen,
 linux-mm@kvack.org, Yosry Ahmed, Wei Xu, Greg Thelen
Subject: Re: [RFC] Mechanism to induce memory reclaim
References: <5df21376-7dd1-bf81-8414-32a73cea45dd@google.com>
 <20220307183141.npa4627fpbsbgwvv@google.com>

On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > [...]
> > > > Some questions to get discussion going:
> > > >
> > > > - Overall feedback or suggestions for the proposal in general?
> > >
> > > Do we really need this interface? What would be the usecases which
> > > cannot use the existing interfaces we have for that? Most notably
> > > memcg and its high limit?
> >
> > Let me take a stab at this.
> > The specific reasons why the high limit is not a good interface for
> > implementing proactive reclaim:
> >
> > 1) It can cause allocations from the target application to get
> >    throttled.
> >
> > 2) It leaves state (the high limit) in the kernel which needs to be
> >    reset by the userspace part of the proactive reclaimer.
> >
> > If I remember correctly, Facebook actually tried to use the high
> > limit to implement proactive reclaim, but due to exactly these
> > limitations [1] they went the route [2] aligned with this proposal.
> >
> > To further explain why the above limitations are pretty bad:
> > proactive reclaimers usually use a feedback loop to decide how much
> > to squeeze from the target applications without impacting their
> > performance, or impacting it only within a tolerable range. The
> > metrics used for the feedback loop are either refaults or PSI, and
> > both become messy when the application gets throttled by the high
> > limit.
> >
> > As for (2), the high limit is a very awkward interface to use for
> > proactive reclaim. If the userspace proactive reclaimer fails or
> > crashes for whatever reason while triggering reclaim in an
> > application, it can leave the application in a bad state (under
> > memory pressure and throttled) for a long time.
>
> Yes.
>
> In addition to the proactive reclaimer crashing, we also had problems
> with it simply not responding quickly enough.
>
> Because there is a delay between reclaim (action) and refaults
> (feedback), there is a very real upper limit on how many pages you
> can reasonably reclaim per second without risking pressure spikes
> that far exceed tolerances. A fixed memory.high limit can easily
> exceed that safe reclaim rate when the workload expands abruptly.
> Even if the proactive reclaimer process is alive, it's almost
> impossible to step between a rapidly allocating process and its
> cgroup limit in time.
>
> The semantics of writing to memory.high also require that the new
> limit is met before returning to userspace. This can take a long
> time, during which the reclaimer cannot re-evaluate the optimal
> target size based on observed pressure. We routinely saw the
> reclaimer get stuck in the kernel, hammering a suffering workload
> down to a stale target.
>
> We tried for quite a while to make this work, but the limit semantics
> turned out to not be a good fit for proactive reclaim.

Thanks for sharing your experience, Johannes. This is a useful insight.

> A mechanism to request a fixed number of pages to reclaim turned out
> to work much, much better in practice. We've been using a simple
> per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

Could you share more details here, please? How have you managed to find
the reclaim target, and how have you overcome the challenge of reacting
in time to keep some headroom for the actual reclaim?

> With tiered memory systems coming up, I can see the need for
> restricting reclaim to specific NUMA nodes. Demoting from DRAM to CXL
> has a different cost function than evicting RAM/CXL to storage, and
> those two things probably need to happen at different rates.

Yes, in the absence of per-node watermarks I can see how a per-node
reclaim trigger could be useful. The question is whether a per-node
watermark interface wouldn't be a better fit.
-- 
Michal Hocko
SUSE Labs
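
[The memory.high dance criticized in the thread can be made concrete with
a minimal sketch. This is illustrative only: a scratch directory stands in
for a real cgroup directory such as /sys/fs/cgroup/<app> so it can run
unprivileged, and the 512M target is an arbitrary example value.]

```shell
#!/bin/sh
# Sketch of proactive reclaim via memory.high, as discussed above.
# A scratch directory stands in for a real cgroup directory, so this
# runs without privileges and without an actual memcg.
CG=$(mktemp -d)
echo max > "$CG/memory.high"

# Step 1: lower memory.high below current usage.  On a real cgroup the
# kernel reclaims down to this limit, and the workload's allocations
# may be throttled in the meantime (problem 1 above).
echo $((512 * 1024 * 1024)) > "$CG/memory.high"

# Step 2: the reclaimer must remember to lift the limit again.  If it
# crashes before reaching this point, the stale limit stays behind in
# the kernel and keeps throttling the workload (problem 2 above).
echo max > "$CG/memory.high"

rm -r "$CG"
```

[On a real cgroup the write in step 1 also blocks until the new limit is
met, which is the stale-target problem Johannes describes.]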
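
[By contrast, the fixed-amount request model Johannes describes (the
per-cgroup knob linked above, which was later merged upstream as cgroup
v2's memory.reclaim) leaves no state behind. Again a minimal sketch: a
scratch directory stands in for a real cgroup so it can run unprivileged,
and the 64M request size is an arbitrary example.]

```shell
#!/bin/sh
# Sketch of a one-shot reclaim request in the style of the per-cgroup
# knob linked above.  A scratch directory stands in for a real cgroup
# directory, so this runs without privileges.
CG=$(mktemp -d)

# A single write asks the kernel to reclaim (up to) 64M from this
# cgroup.  It is a request, not a limit: the write returns once
# reclaim has been attempted and leaves no state in the kernel, so a
# crashed reclaimer cannot strand the workload under a stale limit.
echo $((64 * 1024 * 1024)) > "$CG/memory.reclaim"

# Nothing to reset: the next iteration of the feedback loop simply
# issues another request, sized from the latest refault/PSI data.
rm -r "$CG"
```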