From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 8 Mar 2022 09:44:35 -0500
From: Dan Schatzberg <schatzberg.dan@gmail.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Shakeel Butt <shakeelb@google.com>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Dave Hansen <dave.hansen@linux.intel.com>, linux-mm@kvack.org,
	Yosry Ahmed <yosryahmed@google.com>, Wei Xu <weixugc@google.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [RFC] Mechanism to induce memory reclaim
Message-ID: <Yidr0x4FPLwjKsep@dschatzberg-fedora-PC0Y6AEN.fios-router.home>
References: <5df21376-7dd1-bf81-8414-32a73cea45dd@google.com>
 <YiYZqemRVlk2joYn@dhcp22.suse.cz>
 <20220307183141.npa4627fpbsbgwvv@google.com>
 <YiZqau8LQyNoLSd7@cmpxchg.org>
 <YidRv7Lx9kG4npSX@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <YidRv7Lx9kG4npSX@dhcp22.suse.cz>
List-ID: <linux-mm.kvack.org>

On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> > On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > > [...]
> > > > > Some questions to get discussion going:
> > > > >
> > > > >  - Overall feedback or suggestions for the proposal in general?
> > > 
> > > > Do we really need this interface? What would be usecases which cannot
> > > > use an existing interfaces we have for that? Most notably memcg and
> > > > their high limit?
> > > 
> > > 
> > > Let me take a stab at this. The specific reasons why high limit is not a
> > > good interface to implement proactive reclaim:
> > > 
> > > 1) It can cause allocations from the target application to get
> > > throttled.
> > > 
> > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > by the userspace part of proactive reclaimer.
> > > 
> > > If I remember correctly, Facebook actually tried to use high limit to
> > > implement the proactive reclaim but due to exactly these limitations [1]
> > > they went the route [2] aligned with this proposal.
> > > 
> > > To further explain why the above limitations are pretty bad: The
> > > proactive reclaimers usually use feedback loop to decide how much to
> > > squeeze from the target applications without impacting their performance
> > > or impacting within a tolerable range. The metrics used for the feedback
> > > loop are either refaults or PSI and these metrics becomes messy due to
> > > application getting throttled due to high limit.
> > > 
> > > For (2), the high limit interface is a very awkward interface to use to
> > > do proactive reclaim. If the userspace proactive reclaimer fails/crashed
> > > due to whatever reason during triggering the reclaim in an application,
> > > it can leave the application in a bad state (memory pressure state and
> > > throttled) for a long time.
> > 
> > Yes.
> > 
> > In addition to the proactive reclaimer crashing, we also had problems
> > of it simply not responding quickly enough.
> > 
> > Because there is a delay between reclaim (action) and refaults
> > (feedback), there is a very real upper limit of pages you can
> > reasonably reclaim per second, without risking pressure spikes that
> > far exceed tolerances. A fixed memory.high limit can easily exceed
> > that safe reclaim rate when the workload expands abruptly. Even if the
> > proactive reclaimer process is alive, it's almost impossible to step
> > between a rapidly allocating process and its cgroup limit in time.
> > 
> > The semantics of writing to memory.high also require that the new
> > limit is met before returning to userspace. This can take a long time,
> > during which the reclaimer cannot re-evaluate the optimal target size
> > based on observed pressure. We routinely saw the reclaimer get stuck
> > in the kernel hammering a suffering workload down to a stale target.
> > 
> > We tried for quite a while to make this work, but the limit semantics
> > turned out to not be a good fit for proactive reclaim.
> 
> Thanks for sharing your experience, Johannes. This is a useful insight.

Just to add another issue with memory.high - there's a race window
between reading memory.current and setting memory.high if you want to
reclaim just a little bit of memory. On a fast-expanding workload this
could result in reclaiming much more than intended.
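
To put rough numbers on that race, here is a small sketch of the
arithmetic. It is a simplified model, not code from any real agent: it
assumes the kernel reclaims exactly down to the new limit, and the
function and parameter names are made up for illustration.

```python
def bytes_over_new_high(current_at_read, want, growth_before_write):
    """Bytes that end up above the new memory.high at write time.

    Simplified model (an assumption): the agent reads memory.current,
    computes high = current - want, and the workload grows by
    growth_before_write bytes before the write to memory.high lands.
    """
    new_high = current_at_read - want
    current_at_write = current_at_read + growth_before_write
    # Everything above the new limit is subject to reclaim.
    return max(0, current_at_write - new_high)


# Intended to reclaim 10 units; the workload grew by 200 in the window.
print(bytes_over_new_high(1000, 10, 200))
```

In this model the amount over the limit is want plus whatever the
workload grew in between, so the error scales with the growth rate
rather than with the size of the request.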

> 
> > A mechanism to request a fixed number of pages to reclaim turned out
> > to work much, much better in practice. We've been using a simple
> > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
> 
> Could you share more details here please? How have you managed to find
> the reclaim target and how have you overcome challenges to react in time
> to have some head room for the actual reclaim?

We have a userspace agent that just repeatedly triggers proactive
reclaim and monitors PSI metrics to maintain some constant but low
pressure. In the complete absence of pressure we will reclaim some
configurable percentage of the workload's memory. This reclaim amount
tapers down to zero as PSI approaches the target threshold.
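
As a minimal sketch of that taper (not the actual agent): the linear
shape, the function name and all parameter names below are assumptions
made for illustration. Only the endpoints come from the description
above: the full configured percentage at zero pressure, and nothing
once PSI reaches the target.

```python
def reclaim_bytes(psi_now, psi_target, max_fraction, memory_current):
    """Bytes to request for reclaim in this interval.

    Assumes a linear taper: max_fraction of the workload's memory when
    there is no pressure, falling to zero as psi_now reaches psi_target.
    """
    if psi_target <= 0:
        raise ValueError("psi_target must be positive")
    # Fraction of the pressure budget still unused, clamped at zero.
    headroom = max(0.0, 1.0 - psi_now / psi_target)
    return int(memory_current * max_fraction * headroom)


# No pressure: full configured fraction. At or above target: nothing.
print(reclaim_bytes(0.0, 0.1, 0.5, 1000))
print(reclaim_bytes(0.1, 0.1, 0.5, 1000))
```

Clamping at zero means the loop simply stops requesting reclaim while
pressure sits at or above the target, and resumes as it falls.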

I don't follow your question regarding head-room. Could you elaborate?