Message-ID: <63898e7a-0846-3105-96b5-76c89635e499@suse.cz>
Date: Wed, 13 Oct 2021 17:39:36 +0200
From: Vlastimil Babka <vbabka@suse.cz>
To: Mel Gorman, Linux-MM
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, "Darrick J. Wong",
 Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
 Johannes Weiner, Jonathan Corbet, Linux-fsdevel, LKML
Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback
 completes if congested
In-Reply-To: <20211008135332.19567-2-mgorman@techsingularity.net>
References: <20211008135332.19567-1-mgorman@techsingularity.net>
 <20211008135332.19567-2-mgorman@techsingularity.net>

On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on wait_iff_congested under the following
> conditions:
> 
> o kswapd is encountering pages under writeback that are marked for
>   immediate reclaim, implying that pages are cycling through the LRU
>   faster than they can be cleaned.
> 
> o Direct reclaim will stall if all dirty pages are backed by congested
>   inodes.
> 
> wait_iff_congested is almost completely broken, with few exceptions.
> This patch adds a new node-based waitqueue and tracks the number of
> throttled tasks and the number of pages written back since throttling
> started. If enough pages belonging to the node are written back then
> the throttled tasks wake early. If not, the throttled tasks sleep
> until the timeout expires.
> 
> [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
> [hdanton@sina.com: Avoid race when reclaim starts]
> Signed-off-by: Mel Gorman

Seems mostly OK. I just have some doubts about the NR_THROTTLED_WRITTEN
mechanics, which may ultimately just be a matter of adding some comments.

...

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping,
>  		unlock_page(page);
>  }
>  
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +		 long timeout)
> +{
> +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +	/*
> +	 * Do not throttle IO workers, kthreads other than kswapd or
> +	 * workqueues. They may be required for reclaim to make
> +	 * forward progress (e.g. journalling workqueues or kthreads).
> +	 */
> +	if (!current_is_kswapd() &&
> +	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
> +		return;
> +
> +	if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
> +		WRITE_ONCE(pgdat->nr_reclaim_start,
> +			   node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> +	}
> +
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	ret = schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> +	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> +				  jiffies_to_usecs(timeout - ret),
> +				  reason);
> +}
> +
> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> +			      int nr_throttled)
> +{
> +	unsigned long nr_written;
> +
> +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);

Is this intentionally using the __ version that normally expects irqs to
be disabled? AFAIK they are not disabled on this path. This is a rarely
used cold path, so it doesn't seem worth trading correctness for speed
here.
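Concretely, what I'd have expected here is the plain irq-safe variant; an
untested one-liner sketch of what I mean:

	/* irq-safe variant, usable where irqs are not already disabled */
	inc_node_page_state(page, NR_THROTTLED_WRITTEN);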
> +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> +		     READ_ONCE(pgdat->nr_reclaim_start);

Even if the inc above was safe, node_page_state() returns only the global
counter, so the value we read here will only actually advance when some
cpu's per-cpu delta overflows its threshold and is folded into the node
counter, making it "bursty". Maybe that's ok, but it seems worth
documenting; a suggested comment is at the end of this mail.

> +
> +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> +		wake_up_all(&pgdat->reclaim_wait);

Hm, it seems a bit odd that the more tasks are throttled, the longer we
wait, and then we wake them all at once. Theoretically this leads to even
more bursty, thundering-herd behaviour. Could it be better to wake a
single task for each SWAP_CLUSTER_MAX pages written and bump
nr_reclaim_start (see the sketch at the end of this mail)? But maybe it
is not a problem in practice because the HZ/10 timeouts are short enough?

> +}
> +
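For the burstiness above, the comment I have in mind would be something
along these lines (wording is only a suggestion):

	/*
	 * NR_THROTTLED_WRITTEN is maintained via per-cpu deltas, so the
	 * value read through node_page_state() only advances when a
	 * cpu's delta exceeds its threshold and is folded into the node
	 * counter. Wakeup checks are therefore bursty rather than
	 * happening for every page that completes writeback.
	 */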
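And for the wake-up granularity, an untested sketch of the alternative I
had in mind, waking one waiter per SWAP_CLUSTER_MAX pages written and
restarting the window:

	if (nr_written > SWAP_CLUSTER_MAX) {
		/* Restart the window so the next wakeup needs fresh progress. */
		WRITE_ONCE(pgdat->nr_reclaim_start,
			   node_page_state(pgdat, NR_THROTTLED_WRITTEN));
		wake_up(&pgdat->reclaim_wait);
	}

Note this would also need prepare_to_wait_exclusive() on the sleeping side
so that wake_up() really wakes a single task rather than all of them.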