Date: Tue, 8 Mar 2022 13:53:19 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen,
 linux-mm@kvack.org, Yosry Ahmed, Wei Xu, Greg Thelen
Subject: Re: [RFC] Mechanism to induce memory reclaim
References: <5df21376-7dd1-bf81-8414-32a73cea45dd@google.com>
 <20220307183141.npa4627fpbsbgwvv@google.com>

On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > [...]
> > > > Some questions to get discussion going:
> > > >
> > > > - Overall feedback or suggestions for the proposal in general?
> > >
> > > Do we really need this interface? What would be the usecases which
> > > cannot use the existing interfaces we have for that? Most notably
> > > memcg and its high limit?
> >
> > Let me take a stab at this.
> > The specific reasons why the high limit is not a good interface for
> > implementing proactive reclaim:
> >
> > 1) It can cause allocations from the target application to get
> >    throttled.
> >
> > 2) It leaves state (the high limit) in the kernel which needs to be
> >    reset by the userspace part of the proactive reclaimer.
> >
> > If I remember correctly, Facebook actually tried to use the high
> > limit to implement proactive reclaim, but due to exactly these
> > limitations [1] they went the route [2] aligned with this proposal.
> >
> > To further explain why the above limitations are pretty bad:
> > proactive reclaimers usually use a feedback loop to decide how much
> > to squeeze from the target applications without impacting their
> > performance, or impacting it only within a tolerable range. The
> > metrics used for the feedback loop are either refaults or PSI, and
> > both become messy when the application gets throttled by the high
> > limit.
> >
> > As for (2), the high limit is a very awkward interface to use for
> > proactive reclaim. If the userspace proactive reclaimer fails or
> > crashes for whatever reason while triggering reclaim in an
> > application, it can leave the application in a bad state (under
> > memory pressure and throttled) for a long time.
>
> Yes.
>
> In addition to the proactive reclaimer crashing, we also had problems
> with it simply not responding quickly enough.
>
> Because there is a delay between reclaim (action) and refaults
> (feedback), there is a very real upper limit on how many pages you
> can reasonably reclaim per second without risking pressure spikes
> that far exceed tolerances. A fixed memory.high limit can easily
> exceed that safe reclaim rate when the workload expands abruptly.
> Even if the proactive reclaimer process is alive, it's almost
> impossible to step between a rapidly allocating process and its
> cgroup limit in time.
>
> The semantics of writing to memory.high also require that the new
> limit is met before returning to userspace. This can take a long
> time, during which the reclaimer cannot re-evaluate the optimal
> target size based on observed pressure. We routinely saw the
> reclaimer get stuck in the kernel, hammering a suffering workload
> down to a stale target.
>
> We tried for quite a while to make this work, but the limit semantics
> turned out to not be a good fit for proactive reclaim.

Thanks for sharing your experience, Johannes. This is a useful insight.

> A mechanism to request a fixed number of pages to reclaim turned out
> to work much, much better in practice. We've been using a simple
> per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

Could you share more details here, please? How have you managed to find
the reclaim target, and how have you overcome the challenge of reacting
in time to keep some headroom for the actual reclaim?

> With tiered memory systems coming up, I can see the need for
> restricting reclaim to specific NUMA nodes. Demoting from DRAM to CXL
> has a different cost function than evicting RAM/CXL to storage, and
> those two things probably need to happen at different rates.

Yes, in the absence of per-node watermarks I can see how a per-node
reclaim trigger could be useful. The question is whether a per-node
watermark interface wouldn't be a better fit.
-- 
Michal Hocko
SUSE Labs
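
[The memory.high dance criticized in the thread can be made concrete with
a minimal sketch. This is illustrative only: a scratch directory stands in
for a real cgroup directory such as /sys/fs/cgroup/<app> so it can run
unprivileged, and the 512M target is an arbitrary example value.]

```shell
#!/bin/sh
# Sketch of proactive reclaim via memory.high, as discussed above.
# A scratch directory stands in for a real cgroup directory, so this
# runs without privileges and without an actual memcg.
CG=$(mktemp -d)
echo max > "$CG/memory.high"

# Step 1: lower memory.high below current usage.  On a real cgroup the
# kernel reclaims down to this limit, and the workload's allocations
# may be throttled in the meantime (problem 1 above).
echo $((512 * 1024 * 1024)) > "$CG/memory.high"

# Step 2: the reclaimer must remember to lift the limit again.  If it
# crashes before reaching this point, the stale limit stays behind in
# the kernel and keeps throttling the workload (problem 2 above).
echo max > "$CG/memory.high"

rm -r "$CG"
```

[On a real cgroup the write in step 1 also blocks until the new limit is
met, which is the stale-target problem Johannes describes.]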
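
[By contrast, the fixed-amount request model Johannes describes (the
per-cgroup knob linked above, which was later merged upstream as cgroup
v2's memory.reclaim) leaves no state behind. Again a minimal sketch: a
scratch directory stands in for a real cgroup so it can run unprivileged,
and the 64M request size is an arbitrary example.]

```shell
#!/bin/sh
# Sketch of a one-shot reclaim request in the style of the per-cgroup
# knob linked above.  A scratch directory stands in for a real cgroup
# directory, so this runs without privileges.
CG=$(mktemp -d)

# A single write asks the kernel to reclaim (up to) 64M from this
# cgroup.  It is a request, not a limit: the write returns once
# reclaim has been attempted and leaves no state in the kernel, so a
# crashed reclaimer cannot strand the workload under a stale limit.
echo $((64 * 1024 * 1024)) > "$CG/memory.reclaim"

# Nothing to reset: the next iteration of the feedback loop simply
# issues another request, sized from the latest refault/PSI data.
rm -r "$CG"
```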