Date: Thu, 11 Jan 2024 09:59:11 -0800
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Shakeel Butt, Muchun Song, Tejun Heo, Dan Schatzberg, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
In-Reply-To: <20240111132902.389862-1-hannes@cmpxchg.org>
References: <20240111132902.389862-1-hannes@cmpxchg.org>

On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> While investigating hosts with high cgroup memory pressures, Tejun
> found culprit zombie tasks that were holding on to a lot of
> memory, had SIGKILL pending, but were stuck in memory.high reclaim.
>
> In the past, we used to always force-charge allocations from tasks
> that were exiting in order to accelerate them dying and freeing up
> their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> prohibit unconditional exceeding the limit of dying tasks"); it noted
> that this can cause (userspace inducible) containment failures, so it
> added a mandatory reclaim and OOM kill cycle before forcing charges.
> At the time, memory.high enforcement was handled in the userspace
> return path, which isn't reached by dying tasks, and so memory.high
> was still never enforced by dying tasks.
>
> When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> overcharges") added synchronous reclaim for memory.high, it added
> unconditional memory.high enforcement for dying tasks as well. The
> callstack shows that this is the path where the zombie is stuck.
>
> We need to accelerate dying tasks getting past memory.high, but we
> cannot do it quite the same way as we do for memory.max: memory.max is
> enforced strictly, and tasks aren't allowed to move past it without
> FIRST reclaiming and OOM killing if necessary. This ensures very small
> levels of excess.
> With memory.high, though, enforcement happens lazily
> after the charge, and OOM killing is never triggered. A lot of
> concurrent threads could have pushed, or could actively be pushing,
> the cgroup into excess. The dying task will enter reclaim on every
> allocation attempt, with little hope of restoring balance.
>
> To fix this, skip synchronous memory.high enforcement on dying tasks
> altogether again. Update memory.high path documentation while at it.

It makes total sense to me.

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

However, if tasks can be stuck for a long time in the "high reclaim" state,
shouldn't we also handle the case when tasks are being killed during the
reclaim? E.g. something like this (completely untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4c422c81f93..9f971fc6aae8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		    READ_ONCE(memcg->memory.high))
 			continue;
 
+		if (task_is_dying())
+			break;
+
 		memcg_memory_event(memcg, MEMCG_HIGH);
 
 		psi_memstall_enter(&pflags);
@@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	current->memcg_nr_pages_over_high = 0;
 
 retry_reclaim:
+	if (task_is_dying())
+		return;
+
 	/*
 	 * The allocating task should reclaim at least the batch size, but for
 	 * subsequent retries we only want to do what's necessary to prevent oom
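
(For reference, the task_is_dying() helper the hunks above rely on is the one
already defined in mm/memcontrol.c; from memory, and worth double-checking
against the current tree, it looks roughly like this:

static bool task_is_dying(void)
{
	/* true for OOM victims, tasks with a pending fatal signal, and exiting tasks */
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}

so both call sites would stop throttling not only tasks that are already
exiting, but also tasks that have just been killed while sitting in the
high-reclaim loop.)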