Date: Wed, 22 Nov 2023 15:49:32 +0100
From: Jan Kara
To: Chengming Zhou
Cc: LKML, linux-mm, jack@suse.cz, Tejun Heo, Johannes Weiner,
	Christoph Hellwig, shr@devkernel.io, neilb@suse.de, Michal Hocko
Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh
Message-ID: <20231122144932.m44oiw5lufwkc5pw@quack3>
In-Reply-To: <109029e0-1772-4102-a2a8-ab9076462454@linux.dev>
Hello!

On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
> Sorry to bother you, but we encountered a problem related to the memcg
> dirty throttle after migrating from cgroup v1 to v2, so we want to ask
> for some comments or suggestions.
>
> 1. Problem
>
> We have the "containerd" service running under system.slice, with its
> memory.max set to 5GB. It is constantly throttled in
> balance_dirty_pages() since the memcg has more dirty memory than the
> memcg dirty thresh.
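As a rough sanity check of the numbers above, here is a back-of-the-envelope
sketch of the ratio-based rule only (the kernel additionally caps the memcg
thresh by available clean memory, so treat this as an upper bound; the 20
here is the default value of /proc/sys/vm/dirty_ratio):

```shell
# Sketch: memcg dirty thresh ~= memory.max * vm.dirty_ratio / 100.
# Upper bound only; the real calculation is further capped by clean memory.
memory_max=$((5 * 1024 * 1024 * 1024))  # memory.max = 5G
dirty_ratio=20                          # default /proc/sys/vm/dirty_ratio
dirty_thresh_bytes=$((memory_max * dirty_ratio / 100))
echo "memcg dirty thresh: $((dirty_thresh_bytes >> 20)) MiB"
```

So with a 5GB limit the memcg starts throttling at roughly 1GiB of dirty
pages, which matches the "beyond say 1GB" point below.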
> We didn't have this problem on cgroup v1, because cgroup v1 has neither
> per-memcg writeback nor a per-memcg dirty thresh; only the global dirty
> thresh is checked in balance_dirty_pages().

As Michal writes, if you allow too many memcg pages to become dirty, you
might be facing issues with page reclaim, so there are actually good
reasons why you want the amount of dirty pages in each memcg reasonably
limited. Also, generally, increasing the number of available dirty pages
beyond say 1GB is not going to bring any benefit to the overall writeback
performance. It may still be useful in case you generate a lot of (or
large) temporary files which get quickly deleted, and thus with a high
enough dirty limit they don't have to be written to the disk at all.
Similarly, if the generation of dirty data is very bursty (i.e. you
generate a lot of dirty data in a short while and then don't dirty
anything for a long time), having a higher dirty limit may be useful.
What is your usecase that makes you think you'll benefit from a higher
dirty limit?

> 2. Thinking
>
> So we wonder if we can support a per-memcg dirty thresh interface?
> Right now the memcg dirty thresh is just calculated as memcg max *
> ratio, where the ratio can be set from /proc/sys/vm/dirty_ratio.
>
> We have had to set it to 60 instead of the default 20 as a workaround,
> but we worry about the potential side effects.
>
> If we could support a per-memcg dirty thresh interface, we could set
> some containers to a much higher dirty_ratio, especially for hungry
> dirtier workloads like "containerd".

As Michal wrote, if this ought to be configurable per memcg, then
configuring the dirty amount directly in bytes may be more sensible.

> 3. Solution?
>
> But we couldn't think of a good solution to support this.
> The current memcg dirty thresh is calculated from a complex rule:
>
>   memcg dirty thresh = memcg avail * dirty_ratio
>
> where memcg avail is derived from a combination of memcg max/high and
> memcg file pages, and is capped by the system-wide clean memory
> excluding the amount already in use by the memcg.
>
> Although we may find a way to calculate the per-memcg dirty thresh,
> we can't use it directly, since we still need to calculate/distribute
> the dirty thresh into per-wb dirty thresh shares.
>
>   R - A - B
>        \-- C
>
> For example, even if we know the dirty thresh of A, when the wb is in
> C we have no way to distribute the dirty thresh shares to that wb.
>
> But we have to get the dirty thresh of the wb in C, since we need it
> to control the throttling of the wb in balance_dirty_pages().
>
> I may have missed something above, but the problem seems clear IMHO.
> Looking forward to any comment or suggestion.

I'm not sure I follow what the problem is here. In balance_dirty_pages()
we have the global dirty threshold (tracked in gdtc) and the memcg dirty
threshold (tracked in mdtc). This can get further scaled down based on
the device throughput (that is the difference between 'thresh' and
'wb_thresh'), but that is independent of the way mdtc->thresh is
calculated. So if we provide a different way of calculating mdtc->thresh,
technically everything should keep working as is.

								Honza
-- 
Jan Kara
SUSE Labs, CR