From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <109029e0-1772-4102-a2a8-ab9076462454@linux.dev>
Date: Wed, 22 Nov 2023 17:38:25 +0800
MIME-Version: 1.0
From: Chengming Zhou <chengming.zhou@linux.dev>
To: LKML, linux-mm
Cc: jack@suse.cz, Tejun Heo, Johannes Weiner, Christoph Hellwig, shr@devkernel.io, neilb@suse.de, Michal Hocko
Subject: Question: memcg dirty throttle caused by low per-memcg dirty thresh
Content-Type: text/plain; charset=UTF-8

Hello all,

Sorry to bother you. We encountered a problem related to the memcg dirty
throttle after migrating from cgroup v1 to v2, so we would like to ask
for some comments or suggestions.

1. Problem

We have the "containerd" service running under system.slice, with its
memory.max set to 5GB. It is constantly throttled in
balance_dirty_pages(), since the memcg has more dirty memory than the
memcg dirty thresh.

We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
per-memcg writeback or a per-memcg dirty thresh; only the global dirty
thresh is checked in balance_dirty_pages().

2. Thinking

So we wonder if we could support a per-memcg dirty thresh interface.
Right now the memcg dirty thresh is just calculated as memcg max *
ratio, where the ratio is the global one set via
/proc/sys/vm/dirty_ratio. As a workaround we have had to raise it from
the default 20 to 60, but we worry about the potential side effects of
changing the global value.

If we could support a per-memcg dirty thresh interface, we could set a
much higher dirty_ratio for some containers, especially for hungry
dirtier workloads like "containerd".

3. Solution?

But we couldn't think of a good way to support this. The current memcg
dirty thresh is calculated by a fairly complex rule:

	memcg dirty thresh = memcg avail * dirty_ratio

where memcg avail is derived from a combination of memcg max/high and
memcg file pages, capped by the system-wide clean memory excluding the
amount being used in the memcg.

Even if we found a way to calculate the per-memcg dirty thresh, we
couldn't use it directly, since we would still need to distribute that
dirty thresh into per-wb dirty thresh shares:

	R - A - B
	     \-- C

For example, if we know the dirty thresh of A, but the wb is in C, we
have no way to distribute the dirty thresh shares down to the wb in C.
Yet we have to get the dirty thresh of the wb in C, since we need it to
control the throttling of that wb in balance_dirty_pages().

I may have missed something above, but the problem seems clear IMHO.
Looking forward to any comment or suggestion. Thanks!
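P.S. To make the thresh rule in section 3 concrete, here is a toy model
of the calculation. This is NOT the kernel code: the function names and
the example numbers below are hypothetical, and the real path
(mem_cgroup_wb_stats() feeding into the dirty limits code) handles more
corner cases (dirty_bytes, min thresh, highmem, etc.). It is only meant
to show why a memory.max = 5GB memcg ends up with a small dirty thresh
under the default dirty_ratio of 20.

```python
# Toy model of the per-memcg dirty thresh described above.
# All names and numbers are illustrative, not taken from the kernel.

def memcg_avail(memcg_max: int, memcg_used: int, memcg_file: int,
                sys_clean: int) -> int:
    """Headroom the memcg could still dirty: free room under max plus
    reclaimable file pages, capped by system-wide clean memory."""
    headroom = max(memcg_max - memcg_used, 0) + memcg_file
    return min(headroom, sys_clean)

def memcg_dirty_thresh(avail: int, dirty_ratio: int) -> int:
    """memcg dirty thresh = memcg avail * dirty_ratio / 100."""
    return avail * dirty_ratio // 100

GIB = 1 << 30
# A mostly-full 5GB container (made-up usage/file/clean numbers):
avail = memcg_avail(memcg_max=5 * GIB, memcg_used=4 * GIB,
                    memcg_file=512 << 20, sys_clean=16 * GIB)
print(memcg_dirty_thresh(avail, dirty_ratio=20))  # default ratio
print(memcg_dirty_thresh(avail, dirty_ratio=60))  # our workaround
```

With a nearly-full memcg, avail shrinks toward just the file pages, so
the dirty thresh under ratio=20 becomes small enough that a heavy
dirtier is throttled almost continuously; raising the global ratio to 60
triples the thresh but affects every memcg, which is the side effect we
worry about.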