linux-mm.kvack.org archive mirror
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: David Hildenbrand <david@redhat.com>,
	 Bernd Schubert <bernd.schubert@fastmail.fm>,
	Zi Yan <ziy@nvidia.com>,
	miklos@szeredi.hu,  linux-fsdevel@vger.kernel.org,
	jefflexu@linux.alibaba.com, josef@toxicpanda.com,
	 linux-mm@kvack.org, kernel-team@meta.com,
	Matthew Wilcox <willy@infradead.org>,
	 Oscar Salvador <osalvador@suse.de>,
	Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Mon, 30 Dec 2024 12:04:02 -0800
Message-ID: <xucuoi4ywape4ftgzgahqqgzk6xhvotzdu67crq37ccmyl53oa@oiq354b6sfu7>
In-Reply-To: <CAJnrk1ZYV3hXz_fdssk=tCWPzD_fpHyMW1L_+VRJtK8fFGD-1g@mail.gmail.com>

On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:

Thanks David for the response.

> >
> > >> BTW, I just looked at NFS out of interest, in particular
> > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > >> privileged user that mounts it can set higher ones. I guess one could run
> > >> into similar writeback issues?
> > >
> >
> > Hi,
> >
> > sorry for the late reply.
> >
> > > Yes, I think so.
> > >
> > >>
> > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > >
> > > I feel like INDETERMINATE in the name is the main cause of confusion.
> >
> > We are adding logic that says "unconditionally, never wait on writeback
> > for these folios, not even any sync migration". That's the main problem
> > I have.
> >
> > Your explanation below is helpful. Because ...
> >
> > > So, let me explain why it is required (but later I will tell you how it
> > > can be avoided). The FUSE thread which is actively handling writeback of
> > > a given folio can trigger a memory allocation, either through a syscall
> > > or a page fault. That memory allocation can trigger global reclaim
> > > synchronously, and in cgroup-v1 that FUSE thread can end up waiting on
> > > writeback of the very folio whose writeback it is supposed to end,
> > > causing a deadlock. So, AS_WRITEBACK_INDETERMINATE is used just to
> > > avoid this deadlock.
> > >
> > > The in-kernel filesystems avoid this situation through the use of
> > > GFP_NOFS allocations. A userspace fs can use a similar approach,
> > > prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have
> > > been told that it is hard to use: first, it is a per-thread flag and
> > > has to be set on all the threads handling writeback, which is error
> > > prone if the threadpool is dynamic. Second, it is very coarse: all the
> > > allocations from those threads (e.g. page faults) become NOFS, which
> > > makes userspace very unreliable on a highly utilized machine, as NOFS
> > > allocations cannot reclaim a potentially large amount of memory and
> > > cannot trigger the OOM killer.
> > >
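For reference, a minimal C sketch of the per-thread PR_SET_IO_FLUSHER approach mentioned above, assuming a hypothetical FUSE server thread pool: writeback_worker(), run_writeback_pool() and struct worker_result are illustrative names, not libfuse API. Per prctl(2) the option needs CAP_SYS_RESOURCE and Linux 5.6 or later, so the sketch records errno instead of aborting. Note that every worker has to make the call on itself, which is exactly the burden when the threadpool is dynamic.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/prctl.h>

/* Older libc headers may lack these; values are from <linux/prctl.h>. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

#define MAX_WORKERS 16

struct worker_result {
    int flagged; /* 1 if this thread ended up in the IO_FLUSHER state */
    int err;     /* errno from PR_SET_IO_FLUSHER, 0 on success */
};

/* Hypothetical writeback worker. The flag is per-thread, so each worker
 * must set it on itself before serving any writeback request. */
static void *writeback_worker(void *arg)
{
    struct worker_result *res = arg;

    if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) == 0) {
        res->err = 0;
        res->flagged = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL) == 1;
    } else {
        /* e.g. EPERM without CAP_SYS_RESOURCE, EINVAL before Linux 5.6 */
        res->err = errno;
        res->flagged = 0;
    }
    /* ... read and complete writeback requests here ... */
    return NULL;
}

/* Start n workers, wait for all of them, return how many actually ran. */
static int run_writeback_pool(int n, struct worker_result *results)
{
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    assert(n >= 0 && n <= MAX_WORKERS);
    while (started < n &&
           pthread_create(&tids[started], NULL, writeback_worker,
                          &results[started]) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return started;
}
```

Any thread spawned later (e.g. when the pool grows) must go through writeback_worker() too; forgetting that in one code path silently leaves a thread exposed to the deadlock.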
> >
> > ... now I understand that we want to prevent a deadlock in one specific
> > scenario only?
> >
> > What sounds plausible for me is:
> >
> > a) Make this only affect the actual deadlock path: sync migration
> >     during compaction. Communicate it either using some "context"
> >     information or with a new MIGRATE_SYNC_COMPACTION.
> > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> >      that very deadlock problem.
> > c) Leave all others sync migration users alone for now
> 
> The deadlock path is separate from sync migration. The deadlock arises
> from a corner case where cgroupv1 reclaim waits on a folio under
> writeback where that writeback itself is blocked on reclaim.
> 

Joanne, let's drop the patch to migrate.c completely, rename the flag to
something like what David is suggesting, and handle it only in the
reclaim path.
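A rough user-space model of what handling the flag only in the reclaim path could look like. Everything here is hypothetical: folio_model, reclaim_model and reclaim_should_wait() stand in for the much richer writeback cases in shrink_folio_list(), and mapping_might_deadlock is a placeholder for whatever the renamed AS_* flag ends up being called. Migration is left untouched; only the cgroup-v1 wait is skipped for flagged mappings.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; not kernel data structures. */
struct folio_model {
    bool under_writeback;
    bool mapping_might_deadlock; /* placeholder for the renamed AS_* flag */
};

struct reclaim_model {
    bool cgroup_v1_memcg; /* reclaim driven by a cgroup-v1 memcg limit */
    bool may_enter_fs;    /* allocation context allows __GFP_FS */
};

/*
 * Simplified: only cgroup-v1 memcg reclaim in an FS-capable context
 * blocks on writeback. For flagged mappings it skips the folio instead,
 * because the task in reclaim may be the FUSE server thread that has to
 * end this very writeback.
 */
static bool reclaim_should_wait(const struct folio_model *folio,
                                const struct reclaim_model *rc)
{
    if (!folio->under_writeback)
        return false;   /* nothing to wait for */
    if (!rc->cgroup_v1_memcg || !rc->may_enter_fs)
        return false;   /* other reclaim paths never block here */
    return !folio->mapping_might_deadlock;
}
```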

> >
> > Would that prevent the deadlock? Even *better* would be to be able to
> > ask the fs if starting writeback on a specific folio could deadlock.
> > Because in most cases, as I understand, we'll not actually run into the
> > deadlock and would just want to wait for writeback to complete
> > (esp. compaction).
> >
> > (I still think having folios under writeback for a long time might be a
> > problem, but that's indeed something to sort out separately in the
> > future, because I suspect NFS has similar issues. We'd want to "wait
> > with timeout" and e.g., cancel writeback during memory
> > offlining/alloc_cma ...)
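To illustrate the "wait with timeout" idea, a user-space analogue built on POSIX pthread_cond_timedwait(): the waiter gives up after a deadline and can then cancel or skip, instead of blocking forever on writeback that a buggy or malicious server never completes. struct wb_state, wb_end() and wb_wait_timeout() are hypothetical names; wb_end() plays the role folio_end_writeback() plays in the kernel, and this is not a description of any existing kernel interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* User-space stand-in for a folio's writeback state (hypothetical). */
struct wb_state {
    pthread_mutex_t lock;
    pthread_cond_t done;
    bool under_writeback;
};

/* Called by whoever finishes the write (the "server" side). */
static void wb_end(struct wb_state *wb)
{
    pthread_mutex_lock(&wb->lock);
    wb->under_writeback = false;
    pthread_cond_broadcast(&wb->done);
    pthread_mutex_unlock(&wb->lock);
}

/* Wait at most timeout_ms. Returns 0 if writeback finished, ETIMEDOUT if
 * the caller should give up (e.g. cancel writeback, pick another folio). */
static int wb_wait_timeout(struct wb_state *wb, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    /* Default condvars time out against CLOCK_REALTIME. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&wb->lock);
    /* Loop to absorb spurious wakeups; stop on timeout or error. */
    while (wb->under_writeback && rc == 0)
        rc = pthread_cond_timedwait(&wb->done, &wb->lock, &deadline);
    if (!wb->under_writeback)
        rc = 0; /* completed, possibly right at the deadline */
    pthread_mutex_unlock(&wb->lock);
    return rc;
}
```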

Thanks David and yes let's handle the folios under writeback issue
separately.

> 
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other.

I think the fragmentation due to movable pages becoming unmovable is
worse, as that situation is unexpected and the kernel can waste a lot of
CPU trying to defrag the blocks containing those folios. For non-movable
blocks, the kernel will not even try to defrag. Now, we can have a
situation where almost all memory is backed by non-movable blocks and
higher-order allocations start failing even when there is enough free
memory. In such situations, either the system needs to be restarted (or
the workloads restarted, if they are the cause of the high non-movable
memory) or the admin needs to set up ZONE_MOVABLE, where non-movable
allocations don't go.
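A toy model of the point above (not the kernel's data structures or the algorithm in mm/compaction.c): compaction only considers movable pageblocks, so a movable block holding a folio stuck under writeback gets scanned but can never be emptied, while unmovable blocks are never tried at all. struct pageblock, stuck_pages and toy_compact() are hypothetical, and the model assumes migrated pages always fit somewhere else.

```c
#include <stddef.h>

/* Toy pageblock model; not the kernel's data structures. */
enum block_type { BLOCK_MOVABLE, BLOCK_UNMOVABLE };

struct pageblock {
    enum block_type type;
    int used_pages;   /* pages currently allocated in the block */
    int stuck_pages;  /* movable pages that cannot migrate right now,
                         e.g. folios stuck under writeback */
};

struct compact_result {
    int blocks_scanned; /* blocks compaction spent effort on */
    int blocks_freed;   /* blocks fully emptied for a high-order request */
};

/* Try to empty whole blocks. Only movable blocks are candidates. */
static struct compact_result toy_compact(struct pageblock *blocks, size_t n)
{
    struct compact_result res = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        struct pageblock *b = &blocks[i];

        if (b->type != BLOCK_MOVABLE)
            continue;                   /* unmovable blocks are never tried */
        res.blocks_scanned++;
        b->used_pages = b->stuck_pages; /* everything else migrated away */
        if (b->stuck_pages == 0)
            res.blocks_freed++;
    }
    return res;
}
```

With two clean movable blocks' worth of work split against blocks holding stuck folios, blocks_scanned grows while blocks_freed does not; that gap is the wasted CPU described above.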

> Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> issue where a buggy/malicious server may never complete writeback.

So, these temp pages are not an issue for fragmenting the movable blocks,
but if there is no limit on temp pages, the whole system can become
non-movable (there is a case where movable blocks outside ZONE_MOVABLE
can be converted into non-movable blocks under low memory). ZONE_MOVABLE
will avoid such a scenario, but tuning the right size of ZONE_MOVABLE is
not easy.

> This has the same effect of fragmenting memory and has a worse memory
> cost to the system in terms of memory used. With not having temp pages
> though, now in this scenario, pages allocated in a movable page block
> can't be compacted and that memory is fragmented. My (basic and maybe
> incorrect) understanding is that memory gets allocated through a buddy
> allocator and movable vs nonmovable pages get allocated to
> corresponding blocks that match their type, but there's no other
> difference otherwise. Is this understanding correct? Or is there some
> substantial difference between fragmentation for movable vs nonmovable
> blocks?

The main difference is the fallback path of high-order allocations, which
can trigger direct compaction or background compaction through kcompactd.
The kernel will only try to defrag the movable blocks.




Thread overview: 124+ messages
2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-11-22 23:23 ` [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2024-11-22 23:23 ` [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2024-11-22 23:23 ` [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings Joanne Koong
2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
2024-12-19 13:05   ` David Hildenbrand
2024-12-19 14:19     ` Zi Yan
2024-12-19 15:08       ` Zi Yan
2024-12-19 15:39         ` David Hildenbrand
2024-12-19 15:47           ` Zi Yan
2024-12-19 15:50             ` David Hildenbrand
2024-12-19 15:43     ` Shakeel Butt
2024-12-19 15:47       ` David Hildenbrand
2024-12-19 15:53         ` Shakeel Butt
2024-12-19 15:55           ` Zi Yan
2024-12-19 15:56             ` Bernd Schubert
2024-12-19 16:00               ` Zi Yan
2024-12-19 16:02                 ` Zi Yan
2024-12-19 16:09                   ` Bernd Schubert
2024-12-19 16:14                     ` Zi Yan
2024-12-19 16:26                       ` Shakeel Butt
2024-12-19 16:31                         ` David Hildenbrand
2024-12-19 16:53                           ` Shakeel Butt
2024-12-19 16:22             ` Shakeel Butt
2024-12-19 16:29               ` David Hildenbrand
2024-12-19 16:40                 ` Shakeel Butt
2024-12-19 16:41                   ` David Hildenbrand
2024-12-19 17:14                     ` Shakeel Butt
2024-12-19 17:26                       ` David Hildenbrand
2024-12-19 17:30                         ` Bernd Schubert
2024-12-19 17:37                           ` Shakeel Butt
2024-12-19 17:40                             ` Bernd Schubert
2024-12-19 17:44                             ` Joanne Koong
2024-12-19 17:54                               ` Shakeel Butt
2024-12-20 11:44                                 ` David Hildenbrand
2024-12-20 12:15                                   ` Bernd Schubert
2024-12-20 14:49                                     ` David Hildenbrand
2024-12-20 15:26                                       ` Bernd Schubert
2024-12-20 18:01                                       ` Shakeel Butt
2024-12-21  2:28                                         ` Jingbo Xu
2024-12-21 16:23                                           ` David Hildenbrand
2024-12-22  2:47                                             ` Jingbo Xu
2024-12-24 11:32                                               ` David Hildenbrand
2024-12-21 16:18                                         ` David Hildenbrand
2024-12-23 22:14                                           ` Shakeel Butt
2024-12-24 12:37                                             ` David Hildenbrand
2024-12-26 15:11                                               ` Zi Yan
2024-12-26 20:13                                               ` Shakeel Butt
2024-12-26 22:02                                                 ` Bernd Schubert
2024-12-27 20:08                                                 ` Joanne Koong
2024-12-27 20:32                                                   ` Bernd Schubert
2024-12-30 17:52                                                     ` Joanne Koong
2024-12-30 10:16                                                 ` David Hildenbrand
2024-12-30 18:38                                                   ` Joanne Koong
2024-12-30 19:52                                                     ` David Hildenbrand
2024-12-30 20:11                                                       ` Shakeel Butt
2025-01-02 18:54                                                         ` Joanne Koong
2025-01-03 20:31                                                           ` David Hildenbrand
2025-01-06 10:19                                                             ` Miklos Szeredi
2025-01-06 18:17                                                               ` Shakeel Butt
2025-01-07  8:34                                                                 ` David Hildenbrand
2025-01-07 18:07                                                                   ` Shakeel Butt
2025-01-09 11:22                                                                     ` David Hildenbrand
2025-01-10 20:28                                                                       ` Jeff Layton
2025-01-10 21:13                                                                         ` David Hildenbrand
2025-01-10 22:00                                                                           ` Shakeel Butt
2025-01-13 15:27                                                                             ` David Hildenbrand
2025-01-13 21:44                                                                               ` Jeff Layton
2025-01-14  8:38                                                                                 ` Miklos Szeredi
2025-01-14  9:40                                                                                   ` Miklos Szeredi
2025-01-14  9:55                                                                                     ` Bernd Schubert
2025-01-14 10:07                                                                                       ` Miklos Szeredi
2025-01-14 18:07                                                                                         ` Joanne Koong
2025-01-14 18:58                                                                                           ` Miklos Szeredi
2025-01-14 19:12                                                                                             ` Joanne Koong
2025-01-14 20:00                                                                                               ` Miklos Szeredi
2025-01-14 20:29                                                                                               ` Jeff Layton
2025-01-14 21:40                                                                                                 ` Bernd Schubert
2025-01-23 16:06                                                                                                   ` Pavel Begunkov
2025-01-14 20:51                                                                                         ` Joanne Koong
2025-01-24 12:25                                                                                           ` David Hildenbrand
2025-01-14 15:49                                                                                     ` Jeff Layton
2025-01-24 12:29                                                                                       ` David Hildenbrand
2025-01-28 10:16                                                                                         ` Miklos Szeredi
2025-01-14 15:44                                                                                   ` Jeff Layton
2025-01-14 18:58                                                                                     ` Joanne Koong
2025-01-10 23:11                                                                           ` Jeff Layton
2025-01-10 20:16                                                                   ` Jeff Layton
2025-01-10 20:20                                                                     ` David Hildenbrand
2025-01-10 20:43                                                                       ` Jeff Layton
2025-01-10 21:00                                                                         ` David Hildenbrand
2025-01-10 21:07                                                                           ` Jeff Layton
2025-01-10 21:21                                                                             ` David Hildenbrand
2025-01-07 16:15                                                                 ` Miklos Szeredi
2025-01-08  1:40                                                                   ` Jingbo Xu
2024-12-30 20:04                                                     ` Shakeel Butt [this message]
2025-01-02 19:59                                                       ` Joanne Koong
2025-01-02 20:26                                                         ` Zi Yan
2024-12-20 21:01                                       ` Joanne Koong
2024-12-21 16:25                                         ` David Hildenbrand
2024-12-21 21:59                                           ` Bernd Schubert
2024-12-23 19:00                                             ` Joanne Koong
2024-12-26 22:44                                               ` Bernd Schubert
2024-12-27 18:25                                                 ` Joanne Koong
2024-12-19 17:55                         ` Joanne Koong
2024-12-19 18:04                           ` Bernd Schubert
2024-12-19 18:11                             ` Shakeel Butt
2024-12-20  7:55                     ` Jingbo Xu
2025-04-02 21:34     ` Joanne Koong
2025-04-03  3:31       ` Jingbo Xu
2025-04-03  9:18         ` David Hildenbrand
2025-04-03  9:25           ` Bernd Schubert
2025-04-03  9:35             ` Christian Brauner
2025-04-03 19:09           ` Joanne Koong
2025-04-03 20:44             ` David Hildenbrand
2025-04-03 22:04               ` Joanne Koong
2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-25  9:46   ` Jingbo Xu
2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
2024-12-13 11:52 ` Miklos Szeredi
2024-12-13 16:47   ` Shakeel Butt
2024-12-18 17:37     ` Joanne Koong
2024-12-18 17:44       ` Shakeel Butt
2024-12-18 17:53         ` Joanne Koong
