From: Joanne Koong <joannelkoong@gmail.com>
Date: Mon, 30 Dec 2024 10:38:16 -0800
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
To: David Hildenbrand
Cc: Shakeel Butt, Bernd Schubert, Zi Yan, miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, jefflexu@linux.alibaba.com, josef@toxicpanda.com, linux-mm@kvack.org, kernel-team@meta.com, Matthew Wilcox, Oscar Salvador, Michal Hocko
In-Reply-To: <0ed5241e-10af-43ee-baaf-87a5b4dc9694@redhat.com>
On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand wrote:
>
> >> BTW, I just looked at NFS out of interest, in particular
> >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> >> canceling writeback.
> >> IIUC, there are default timeouts for UDP and TCP,
> >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> >> privileged user that mounts it can set higher ones. I guess one could run
> >> into similar writeback issues?
> >
> > Hi, sorry for the late reply.
> >
> > Yes, I think so.
> >
> >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> >
> > I feel like INDETERMINATE in the name is the main cause of confusion.
>
> We are adding logic that says "unconditionally, never wait on writeback
> for these folios, not even any sync migration". That's the main problem
> I have.
>
> Your explanation below is helpful. Because ...
>
> > So, let me explain why it is required (but later I will tell you how it
> > can be avoided). The FUSE thread which is actively handling writeback of
> > a given folio can cause memory allocation, either through a syscall or a
> > page fault. That memory allocation can trigger global reclaim
> > synchronously, and in cgroup-v1 that FUSE thread can wait on the
> > writeback of the same folio whose writeback it is supposed to end,
> > causing a deadlock. So AS_WRITEBACK_INDETERMINATE is used just to avoid
> > this deadlock.
> >
> > The in-kernel filesystems avoid this situation through the use of
> > GFP_NOFS allocations. A userspace filesystem can use a similar approach,
> > namely prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I
> > have been told that it is hard to use, as it is a per-thread flag and
> > has to be set for all the threads handling writeback, which can be error
> > prone if the threadpool is dynamic. Second, it is very coarse, such that
> > all the allocations from those threads (e.g. page faults) become NOFS,
> > which makes userspace very unreliable on a highly utilized machine,
> > since NOFS cannot reclaim potentially a lot of memory and cannot trigger
> > the oom-killer.
>
> ... now I understand that we want to prevent a deadlock in one specific
> scenario only?
>
> What sounds plausible for me is:
>
> a) Make this only affect the actual deadlock path: sync migration
>    during compaction. Communicate it either using some "context"
>    information or with a new MIGRATE_SYNC_COMPACTION.
> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
>    that very deadlock problem.
> c) Leave all other sync migration users alone for now.

The deadlock path is separate from sync migration. The deadlock arises
from a corner case where cgroup-v1 reclaim waits on a folio under
writeback while that writeback itself is blocked on reclaim.

>
> Would that prevent the deadlock? Even *better* would be to be able to
> ask the fs if starting writeback on a specific folio could deadlock.
> Because in most cases, as I understand, we'll not actually run into the
> deadlock and would just want to wait for writeback to complete
> (esp. compaction).
>
> (I still think having folios under writeback for a long time might be a
> problem, but that's indeed something to sort out separately in the
> future, because I suspect NFS has similar issues. We'd want to "wait
> with timeout" and e.g. cancel writeback during memory
> offlining/alloc_cma ...)

I'm looking back at some of the discussions in v2 [1] and I'm still not
clear on how memory fragmentation for non-movable pages differs from
memory fragmentation for movable pages, and whether one is worse than
the other. Currently fuse uses movable temp pages (allocated with gfp
flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same issue
where a buggy/malicious server may never complete writeback. This has
the same effect of fragmenting memory, and has a worse memory cost to
the system in terms of memory used. Without temp pages, though, in this
scenario pages allocated in a movable page block can't be compacted and
that memory is fragmented.
My (basic and maybe incorrect) understanding is that memory gets
allocated through a buddy allocator, and movable vs. non-movable pages
get allocated to corresponding blocks that match their type, but
otherwise there's no difference. Is this understanding correct? Or is
there some substantial difference between fragmentation for movable vs.
non-movable blocks?

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/T/#m7637e26a559db86348461ebc1104352083085d6d

>
> >> Not sure if I grasped all details about NFS and writeback and when it
> >> would redirty+end writeback, and if there is some other handling in
> >> there.
> >>
> [...]
> >>>
> >>> Please note that such filesystems are mostly used in environments like
> >>> data centers or hyperscalers, and usually have more advanced mechanisms
> >>> to handle and avoid situations like long delays. For such environments,
> >>> network unavailability is a larger issue than some CMA allocation
> >>> failure. My point is: let's not assume the disastrous situation is
> >>> normal and overcomplicate the solution.
> >>
> >> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be
> >> used for movable allocations.
> >>
> >> Mechanisms that possibly turn these folios unmovable for a
> >> long/indeterminate time must either fail or migrate these folios out of
> >> these regions, otherwise we start violating the very semantics why
> >> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
> >>
> >> Yes, there are corner cases where we cannot guarantee movability (e.g.,
> >> OOM when allocating a migration destination), but these are not cases
> >> that can be triggered by (unprivileged) user space easily.
> >>
> >> That's why FOLL_LONGTERM pinning does exactly that: even if user space
> >> would promise that this is really only "short-term", we will treat it as
> >> "possibly forever", because it's under user-space control.
> >>
> >> Instead of having more subsystems violate these semantics because
> >> "performance" ... I would hope we would do better. Maybe it's an issue
> >> for NFS as well ("at least" only for privileged user space)? In which
> >> case, again, I would hope we would do better.
> >>
> >> Anyhow, I'm hoping there will be more feedback from other MM folks, but
> >> likely right now a lot of people are out (just like I should ;) ).
> >>
> >> If I end up being the only one with these concerns, then likely people
> >> can feel free to ignore them. ;)
> >
> > I agree we should do better, but IMHO it should be an iterative process.
>
> I think your concerns are valid, so let's push the discussion towards
> resolving those concerns.
>
> > I think the concerns can be resolved by better handling of the lifetime
> > of folios under writeback. The amount of such folios is already handled
> > through the existing dirty throttling mechanism.
> >
> > We should start with a baseline, i.e. the distribution of the lifetime
> > of folios under writeback for traditional storage devices (spinning
> > disks and SSDs), as we don't want an unrealistic goal for ourselves. I
> > think this data will drive the appropriate timeout values (if we decide
> > a timeout-based approach is the right one).
> >
> > At the moment we have a timeout-based approach to limit the lifetime of
> > folios under writeback. Any other ideas?
>
> See above, maybe we could limit the deadlock avoidance to the actual
> deadlock path and sort out the "infinite writeback in some corner cases"
> problem separately.
>
> --
> Cheers,
>
> David / dhildenb
>