Date: Mon, 30 Dec 2024 12:04:02 -0800
From: Shakeel Butt
To: Joanne Koong
Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, jefflexu@linux.alibaba.com, josef@toxicpanda.com, linux-mm@kvack.org, kernel-team@meta.com, Matthew Wilcox, Oscar Salvador, Michal Hocko
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand wrote:

Thanks David for the response.

> >
> > >> BTW, I just looked at NFS out of interest, in particular
> > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > >> privileged user that mounts it can set higher ones. I guess one could run
> > >> into similar writeback issues?
> > >
> >
> > Hi,
> >
> > sorry for the late reply.
> >
> > > Yes, I think so.
> > >
> > >>
> > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > >
> > > I feel like INDETERMINATE in the name is the main cause of confusion.
> >
> > We are adding logic that says "unconditionally, never wait on writeback
> > for these folios, not even any sync migration". That's the main problem
> > I have.
> >
> > Your explanation below is helpful. Because ...
> >
> > > So, let me explain why it is required (but later I will tell you how it
> > > can be avoided). The FUSE thread which is actively handling writeback of
> > > a given folio can cause memory allocation either through syscall or page
> > > fault. That memory allocation can trigger global reclaim synchronously
> > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > folio whose writeback it is supposed to end, causing a deadlock.
> > > So, AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > >
> > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > allocations. The userspace fs can also use a similar approach which is
> > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > told that it is hard to use as it is a per-thread flag and has to be set
> > > for all the threads handling writeback which can be error prone if the
> > > threadpool is dynamic. Second it is very coarse such that all the
> > > allocations from those threads (e.g. page faults) become NOFS which
> > > makes userspace very unreliable on a highly utilized machine as NOFS can
> > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > >
> >
> > ... now I understand that we want to prevent a deadlock in one specific
> > scenario only?
> >
> > What sounds plausible for me is:
> >
> > a) Make this only affect the actual deadlock path: sync migration
> >    during compaction. Communicate it either using some "context"
> >    information or with a new MIGRATE_SYNC_COMPACTION.
> > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> >    that very deadlock problem.
> > c) Leave all other sync migration users alone for now
>
> The deadlock path is separate from sync migration. The deadlock arises
> from a corner case where cgroupv1 reclaim waits on a folio under
> writeback where that writeback itself is blocked on reclaim.
>

Joanne, let's drop the patch to migrate.c completely, rename the flag
to something like what David is suggesting, and only handle it in the
reclaim path.

> >
> > Would that prevent the deadlock? Even *better* would be to be able to
> > ask the fs if starting writeback on a specific folio could deadlock.
> > Because in most cases, as I understand, we'll not actually run into the
> > deadlock and would just want to wait for writeback to just complete
> > (esp. compaction).
> >
> > (I still think having folios under writeback for a long time might be a
> > problem, but that's indeed something to sort out separately in the
> > future, because I suspect NFS has similar issues. We'd want to "wait
> > with timeout" and e.g., cancel writeback during memory
> > offlining/alloc_cma ...)

Thanks David, and yes, let's handle the folios-under-writeback issue
separately.

>
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other.

I think the fragmentation due to movable pages becoming unmovable is
worse, as that situation is unexpected and the kernel can waste a lot of
CPU trying to defrag the block containing those folios. For non-movable
blocks, the kernel will not even try to defrag. Now we can have a
situation where almost all memory is backed by non-movable blocks and
higher order allocations start failing even when there is enough free
memory. For such situations, either the system needs to be restarted (or
the workloads restarted, if they are the cause of the high non-movable
memory) or the admin needs to set up ZONE_MOVABLE, where non-movable
allocations don't go.

> Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> issue where a buggy/malicious server may never complete writeback.

So, these temp pages are not an issue for fragmenting the movable
blocks, but if there is no limit on temp pages, the whole system can
become non-movable (there is a case where movable blocks outside
ZONE_MOVABLE can be converted into non-movable blocks under low memory).
ZONE_MOVABLE will avoid such a scenario, but tuning the right size of
ZONE_MOVABLE is not easy.

> This has the same effect of fragmenting memory and has a worse memory
> cost to the system in terms of memory used.
> With not having temp pages
> though, now in this scenario, pages allocated in a movable page block
> can't be compacted and that memory is fragmented. My (basic and maybe
> incorrect) understanding is that memory gets allocated through a buddy
> allocator and movable vs nonmovable pages get allocated to
> corresponding blocks that match their type, but there's no other
> difference otherwise. Is this understanding correct? Or is there some
> substantial difference between fragmentation for movable vs nonmovable
> blocks?

The main difference is the fallback of a high-order allocation, which
can trigger compaction or background compaction through kcompactd. The
kernel will only try to defrag the movable blocks.
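
As an aside on the prctl(PR_SET_IO_FLUSHER, 1) alternative mentioned
earlier in the thread, here is a minimal C sketch of what each
writeback-handling thread in a userspace server would have to do. The
helper names set_io_flusher() and get_io_flusher() are made up for
illustration and are not part of FUSE or libfuse. The call needs
CAP_SYS_RESOURCE and a kernel that knows the flag (Linux 5.6 or later),
so the sketch reports failure rather than assuming success.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Older libc headers may lack these constants (added in Linux 5.6). */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark (or unmark) the calling thread as an IO
 * flusher. The flag is per-thread, so every thread that can end
 * writeback has to call this itself.
 * Returns 0 on success or a negative errno, e.g. -EPERM without
 * CAP_SYS_RESOURCE or -EINVAL on kernels that predate the flag.
 */
static int set_io_flusher(int enable)
{
	if (prctl(PR_SET_IO_FLUSHER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: query the calling thread's IO flusher state.
 * Returns 1 or 0 for the current state, or a negative errno.
 */
static int get_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

Each worker would call set_io_flusher(1) once at startup, before it
handles any request. That per-thread requirement is the part that gets
error prone with a dynamic threadpool, and once the flag is set, every
allocation made from that thread (page faults included) is affected.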