Date: Wed, 1 Apr 2026 18:33:04 +0100
From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Shakeel Butt, Yosry Ahmed, Zi Yan,
 "Liam R. Howlett", Usama Arif, Kiryl Shutsemau, Dave Chinner,
 Roman Gushchin, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 7/7] mm: switch deferred split shrinker to list_lru
References: <20260318200352.1039011-1-hannes@cmpxchg.org>
 <20260318200352.1039011-8-hannes@cmpxchg.org>
 <0cf8a859-b142-4e53-9113-94872dd68f40@lucifer.local>

On Mon, Mar 30, 2026 at 12:40:22PM -0400, Johannes Weiner wrote:
> Hey Lorenzo,
>
> Thanks for taking an in-depth look at this!
>
> Apologies for the late reply, I was traveling last week.
>
> On Tue, Mar 24, 2026 at 01:48:06PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 18, 2026 at 03:53:25PM -0400, Johannes Weiner wrote:
> > > The deferred split queue handles cgroups in a suboptimal fashion. The
> > > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > > on a cgrouped system, a node-restricted allocation entering reclaim
> > > can end up splitting large pages on other nodes:
> > >
> > >   alloc/unmap
> > >     deferred_split_folio()
> > >       list_add_tail(memcg->split_queue)
> > >       set_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > So here it's:
> >
> > __do_huge_pmd_anonymous_page() / do_huge_zero_wp_pmd()
> > -> map_anon_folio_pmd_pf()
> > -> map_anon_folio_pmd_nopf()
> > -> deferred_split_folio()
> > -> set_shrinker_bit()
> >
> > Yeah it makes sense to make the first bit succinct anyway :)
>
> Yep I was trying to keep the focus on those parts that interact with
> the other stacks listed below.
> Not biblically accurate deferred_split_folio() callsites ;)
>
> > >   for_each_zone_zonelist_nodemask(restricted_nodes)
> > >     mem_cgroup_iter()
> > >       shrink_slab(node, memcg)
> > >         shrink_slab_memcg(node, memcg)
> > >           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > Hmm there's no such function - is this kind of pseudocode, essentially?
> >
> > Wouldn't it be clearer to reference:
> >
> > shrink_slab_memcg() -> do_shrink_slab() -> shrinker->scan_objects ->
> > deferred_split_scan()?
> >
> > Though I get that it adds verbosity :)
> >
> > Sorry not (just) being pedantic here, also so I can understand myself :)
>
> Yes, it's pseudocode. Direct reclaim (shrink_zones) and kswapd
> (wake_all_kswapds) do the "for all nodes" loops. Nested within them
> are the loops that iterate the cgroup tree (shrink_node_memcgs,
> shrink_many, depending on LRU vs MGLRU). And then nested within
> *those* we have the shrink_slab() calls.
>
> Again, I tried to keep the boilerplate low and to the point. The most
> important thing is: 1) the shrinker bit gets set for each (node, memcg)
> pair that has objects, 2) the shrinker visits all nodes x memcgs with
> it set, 3) but the shrinker scans happen on the per-memcg queue, which
> is not node-aware.
>
> Let me know if that works as is, or if you would like me to rephrase
> that. I would like to keep it as simple as possible, but no simpler
> than that ;)

Yeah that's fine, maybe just add a little note to say 'simplified for
clarity' or something so very-literal nutters like me don't get confused ;)

> > >   deferred_split_scan()
> > >     walks memcg->split_queue
> >
> > Ok so overall there's a per-memcg memcg->split_queue, but we set a bit in
> > memcg->nodeinfo[nid]->shrinker_info->unit[blah]->map for it, and when we
> > enter shrink_slab_memcg(), we figure out the shrink_control from the
> > for_each_set_bit() across memcg->...->unit->map?
> >
> > > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > > has a single large page on the node of interest, all large pages owned
> > > by that memcg, including those on other nodes, will be split.
> >
> > So at this point, regardless of node, we are invoking deferred_split_scan()
> > and it's the same memcg->split_queue we are walking, regardless of node?
> >
> > Do let me know if my understanding is correct here!
> >
> > Hmm that does sound sub-optimal.
>
> That is exactly what's happening. We have all this dance to be precise
> about the shrinker bit, but the queue itself is not node-aware.
>
> That makes for some pretty erratic behavior.

Yeah that sucks!

> > > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > > streamlines a lot of the list operations and reclaim walks. It's used
> > > widely by other major shrinkers already. Convert the deferred split
> > > queue as well.
> >
> > It's odd that it wasn't used before?
>
> The list_lru API was initially explicitly for *slab objects*, doing
> virt_to_page() and mem_cgroup_from_slab_obj() on the passed in list
> heads to derive the proper queue. It was only extended more recently
> in 0a97c01cd20b ("list_lru: allow explicit memcg and NUMA node
> selection") to allow passing in the queue routing information and
> embedding the list heads in other things (such as folios).
>
> The only missing piece now was the support for callside locking of the
> list_lru structures.
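Side note for anyone following along: the interface difference is roughly
the below - a from-memory sketch (the slab-side lru/object names are made
up), so check include/linux/list_lru.h for the exact signatures:

	/* slab-object style: node/memcg are derived from the object itself */
	list_lru_add_obj(&some_slab_lru, &obj->list);

	/*
	 * Since 0a97c01cd20b the caller can supply the routing explicitly,
	 * which is what lets a list head embedded in a folio be queued.
	 */
	list_lru_add(&deferred_split_lru, &folio->_deferred_list,
		     folio_nid(folio), folio_memcg(folio));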
Ack > > > > include/linux/huge_mm.h | 6 +- > > > include/linux/memcontrol.h | 4 - > > > include/linux/mmzone.h | 12 -- > > > mm/huge_memory.c | 342 ++++++++++++------------------------- > > > mm/internal.h | 2 +- > > > mm/khugepaged.c | 7 + > > > mm/memcontrol.c | 12 +- > > > mm/memory.c | 52 +++--- > > > mm/mm_init.c | 15 -- > > > 9 files changed, 151 insertions(+), 301 deletions(-) > > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > index bd7f0e1d8094..8d801ed378db 100644 > > > --- a/include/linux/huge_mm.h > > > +++ b/include/linux/huge_mm.h > > > @@ -414,10 +414,9 @@ static inline int split_huge_page(struct page *page) > > > { > > > return split_huge_page_to_list_to_order(page, NULL, 0); > > > } > > > + > > > +extern struct list_lru deferred_split_lru; > > > > It might be nice for the sake of avoiding a global to instead expose this > > as a getter? > > > > Or actually better, since every caller outside of huge_memory.c that > > references this uses folio_memcg_list_lru_alloc(), do something like: > > > > int folio_memcg_alloc_deferred(struct folio *folio, gfp_t gfp); > > > > in mm/huge_memory.c: > > > > /** > > * blah blah blah put on error blah > > */ > > int folio_memcg_alloc_deferred(struct folio *folio, gfp_t gfp) > > { > > int err; > > > > err = folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfP); > > if (err) { > > folio_put(folio); > > return err; > > } > > > > return 0; > > } > > > > And then the callers can just invoke this, and you can make > > deferred_split_lru static in mm/huge_memory.c? > > That sounds reasonable. Let me make this change. Thanks! > > > > void deferred_split_folio(struct folio *folio, bool partially_mapped); > > > -#ifdef CONFIG_MEMCG > > > -void reparent_deferred_split_queue(struct mem_cgroup *memcg); > > > -#endif > > > > > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > > > unsigned long address, bool freeze); > > > @@ -650,7 +649,6 @@ static inline int try_folio_split_to_order(struct folio *folio, > > > } > > > > > > static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {} > > > -static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {} > > > #define split_huge_pmd(__vma, __pmd, __address) \ > > > do { } while (0) > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 086158969529..0782c72a1997 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -277,10 +277,6 @@ struct mem_cgroup { > > > struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; > > > #endif > > > > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > - struct deferred_split deferred_split_queue; > > > -#endif > > > - > > > #ifdef CONFIG_LRU_GEN_WALKS_MMU > > > /* per-memcg mm_struct list */ > > > struct lru_gen_mm_list mm_list; > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > index 7bd0134c241c..232b7a71fd69 100644 > > > --- a/include/linux/mmzone.h > > > +++ b/include/linux/mmzone.h > > > @@ -1429,14 +1429,6 @@ struct zonelist { > > > */ > > > extern struct page *mem_map; > > > > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > -struct deferred_split { > > > - spinlock_t split_queue_lock; > > > - struct list_head split_queue; > > > - unsigned long split_queue_len; > > > -}; > > > -#endif > > > - > > > #ifdef CONFIG_MEMORY_FAILURE > > > /* > > > * Per NUMA node memory failure handling statistics. 
> > > @@ -1562,10 +1554,6 @@ typedef struct pglist_data { > > > unsigned long first_deferred_pfn; > > > #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ > > > > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > - struct deferred_split deferred_split_queue; > > > -#endif > > > - > > > #ifdef CONFIG_NUMA_BALANCING > > > /* start time in ms of current promote rate limit period */ > > > unsigned int nbp_rl_start; > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > index 3fc02913b63e..e90d08db219d 100644 > > > --- a/mm/huge_memory.c > > > +++ b/mm/huge_memory.c > > > @@ -14,6 +14,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -67,6 +68,8 @@ unsigned long transparent_hugepage_flags __read_mostly = > > > (1< > > (1< > > > > > +static struct lock_class_key deferred_split_key; > > > +struct list_lru deferred_split_lru; > > > static struct shrinker *deferred_split_shrinker; > > > static unsigned long deferred_split_count(struct shrinker *shrink, > > > struct shrink_control *sc); > > > @@ -919,6 +922,13 @@ static int __init thp_shrinker_init(void) > > > if (!deferred_split_shrinker) > > > return -ENOMEM; > > > > > > + if (list_lru_init_memcg_key(&deferred_split_lru, > > > + deferred_split_shrinker, > > > + &deferred_split_key)) { > > > + shrinker_free(deferred_split_shrinker); > > > + return -ENOMEM; > > > + } > > > + > > > > It's kind of out of scope for the series, but I hate that the huge zero > > folio stuff has an early exit for persistent but if not, falls through to > > trying to set that up - it'd be nice to split out the huge zero folio stuff > > into another function. > > > > But probably one for a follow up! > > It doesn't bother me personally, but I wouldn't object to it being > moved to its own function and have the caller do necessary cleanups. > > But yeah sounds like an unrelated refactor :) Yeah it is entirely unrelated for sure, can be an optional follow up! > > > > deferred_split_shrinker->count_objects = deferred_split_count; > > > deferred_split_shrinker->scan_objects = deferred_split_scan; > > > shrinker_register(deferred_split_shrinker); > > > @@ -939,6 +949,7 @@ static int __init thp_shrinker_init(void) > > > > > > huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero"); > > > if (!huge_zero_folio_shrinker) { > > > + list_lru_destroy(&deferred_split_lru); > > > shrinker_free(deferred_split_shrinker); > > > > Presumably no probably-impossible-in-reality race on somebody entering the > > shrinker and referencing the deferred_split_lru before the shrinker is freed? > > Ah right, I think for clarity it would indeed be better to destroy the > shrinker, then the queue. Let me re-order this one. > > But yes, in practice, none of the above fails. If we have trouble > doing a couple of small kmallocs during a subsys_initcall(), that > machine is unlikely to finish booting, let alone allocate enough > memory to enter the THP shrinker. Yeah I thought that might be the case, but seems more logical killing shrinker first, thanks! > > > > @@ -953,6 +964,7 @@ static int __init thp_shrinker_init(void) > > > static void __init thp_shrinker_exit(void) > > > { > > > shrinker_free(huge_zero_folio_shrinker); > > > + list_lru_destroy(&deferred_split_lru); > > > > Same question above as to race/ordering. > > ... and this one as well. 
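(Presumably that just ends up as something like the below - same code, only
the teardown order changed so the shrinker can no longer run against a
destroyed lru:

	static void __init thp_shrinker_exit(void)
	{
		shrinker_free(huge_zero_folio_shrinker);
		shrinker_free(deferred_split_shrinker);
		list_lru_destroy(&deferred_split_lru);
	}

with the same shrinker-then-list_lru order in the thp_shrinker_init() error
path.)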
> > > > shrinker_free(deferred_split_shrinker); > > > } > > > > @@ -3854,34 +3761,34 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n > > > struct folio *end_folio = folio_next(folio); > > > struct folio *new_folio, *next; > > > int old_order = folio_order(folio); > > > + struct list_lru_one *l; > > > > Nit, and maybe this is a convention, but hate single letter variable names, > > 'lru' or something might be nicer? > > Yeah I stuck with the list_lru internal naming, which uses `lru` for > the struct list_lru, and `l` for struct list_lru_one. I suppose that > was fine for the very domain-specific code and short functions in > there, but it's grating in large, general MM functions like these. > > Since `lru` is taken, any preferences? llo? ljs? ;) Could be list? > > > > + bool dequeue_deferred; > > > int ret = 0; > > > - struct deferred_split *ds_queue; > > > > > > VM_WARN_ON_ONCE(!mapping && end); > > > /* Prevent deferred_split_scan() touching ->_refcount */ > > > - ds_queue = folio_split_queue_lock(folio); > > > + dequeue_deferred = folio_test_anon(folio) && old_order > 1; > > > > Why anon? (This review is partly me learning about the shrinker, an area > > I'm weak on :) > > This is the type of folios that can be queued through > deferred_split_folio(). The order check is inside that function > itself; and the function is only called for anon pages. > > File pages are split differently, they don't use the deferred split > shrinker and we don't allocate list_lru heads for them. However, they > still come through this split path here, and so I need to gate whether > it's safe to do list_lru_lock(), __list_lru_del() etc. > > > > + if (dequeue_deferred) { > > > + rcu_read_lock(); > > > + l = list_lru_lock(&deferred_split_lru, > > > + folio_nid(folio), folio_memcg(folio)); > > > > Hm don't adore this sort of almost 'hidden' RCU lock here, but this > > function is pretty disgusting and needs serious work in general. > > > > And any function that took the RCU lock and list_lru lock/did the unlock > > equivalent would be equally horrible so yeah, I guess needs deferring to a > > refactor. > > > > OTOH, this could be a good excuse for us to pay down some technical debt > > and split out for instance the folio_ref_freeze() bits? > > > > Could we do something like: > > > > bool frozen; > > > > ... > > > > dequeue_deferred = folio_test_anon(folio) && old_order > 1; > > frozen = folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1); > > > > if (dequeue_deferred && frozen) { > > struct list_lru_one *lru; > > > > rcu_read_lock(); > > lru = list_lru_lock(&deferred_split_lru, > > folio_nid(folio), folio_memcg(folio)); > > __list_lru_del(&deferred_split_lru, lru, > > &folio->_deferred_list, folio_nid(folio)); > > if (folio_test_partially_mapped(folio)) { > > folio_clear_partially_mapped(folio); > > mod_mthp_stat(old_order, > > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); > > } > > list_lru_unlock(lru); > > rcu_read_unlock(); > > } > > > > if (!frozen) > > return -EAGAIN; > > > > ... rest of logic now one indent level lower ... > > > > Or maybe factor that out into a helper function or something? > > > > static void execute_deferred_dequeue(...) { ... } > > > > With this implemented either way you'd be able to get rid of the else block > > too. > > > > obviously only valid if you are able to do the freezing earlier? > > The reason it locks the list_lru first is because there can be a race > between the shrinker and a synchronous split attempt. Ahh! 
> > This is what that comment above it is about: > > /* Prevent deferred_split_scan() touching ->_refcount */ I _may_ have glossed over this :) > > If the shrinker fails to "get" the folio while holding the shrinker > lock, it thinks it beat the free path to the list_lru lock and will > feel responsible for cleaning up the deferred split state: > > /* caller holds the list_lru lock already */ > static enum lru_status deferred_split_isolate(struct list_head *item, > struct list_lru_one *lru, > void *cb_arg) > { > struct folio *folio = container_of(item, struct folio, _deferred_list); > struct list_head *freeable = cb_arg; > > if (folio_try_get(folio)) { > list_lru_isolate_move(lru, item, freeable); > return LRU_REMOVED; > } > > /* We lost race with folio_put() */ > list_lru_isolate(lru, item); > if (folio_test_partially_mapped(folio)) { > folio_clear_partially_mapped(folio); > mod_mthp_stat(folio_order(folio), > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); > } > return LRU_REMOVED; > } > > So if the split path freezes before acquiring the list_lru lock, we > get them both stepping on the deferred split state. So we're sort of intentionally doing a lock inversion on the other path (give me some rope here, I mean broadly speaking :) to avoid this? > > This may or may not be safe given the current manifestation of > list_lru_del() and the folio_test_partially_mapped() tests. But it > feels pretty hairy to let them race and rely on individual tests into > what should be a coherent aggregate state. Yeah sure best not. But, and I _know_ it's nitty sorry, but maybe worth expanding that comment to explain that e.g. 'we must take the folio look prior to the list_lru lock to avoid racing with deferred_split_scan() in accessing the folio reference count' or similar? > > > > + } > > > if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) { > > > struct swap_cluster_info *ci = NULL; > > > struct lruvec *lruvec; > > > > > > - if (old_order > 1) { > > > > Before was this also applicable to non-anon folios? > > This branch, yes. But non-anon pages would then fail this: > > > > - if (!list_empty(&folio->_deferred_list)) { > > > - ds_queue->split_queue_len--; > > > - /* > > > - * Reinitialize page_deferred_list after removing the > > > - * page from the split_queue, otherwise a subsequent > > > - * split will see list corruption when checking the > > > - * page_deferred_list. > > > - */ > > > - list_del_init(&folio->_deferred_list); > > > - } > > ..since they aren't ever added. This was okay because accessing that > list head is always safe. Now I need to be explicit to determine if > it's safe to call __list_lru_del() which does all these list_lru > references, which aren't allocated for file. > > > > + if (dequeue_deferred) { > > > + __list_lru_del(&deferred_split_lru, l, > > > + &folio->_deferred_list, folio_nid(folio)); > > > if (folio_test_partially_mapped(folio)) { > > > folio_clear_partially_mapped(folio); > > > mod_mthp_stat(old_order, > > > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); > > > } > > And honestly, I think it's nicer to be explicit here about the > expectations of which pages get which treatment. Yeah for sure. > > Filtering file pages on !list_empty() was not very obvious. 
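As an aside - not something the patch needs - a trivial predicate would make
the "which folios ever carry deferred-split state" rule self-documenting.
Purely illustrative, the helper name is invented:

	/*
	 * Only anon folios of order > 1 are ever queued by
	 * deferred_split_folio(), so only they have list_lru state to
	 * tear down on the split/free paths.
	 */
	static inline bool folio_may_be_deferred_split(struct folio *folio)
	{
		return folio_test_anon(folio) && folio_order(folio) > 1;
	}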
Yes :) > > > > + int nid = folio_nid(folio); > > > unsigned long flags; > > > bool unqueued = false; > > > > > > WARN_ON_ONCE(folio_ref_count(folio)); > > > WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio)); > > > > > > - ds_queue = folio_split_queue_lock_irqsave(folio, &flags); > > > - if (!list_empty(&folio->_deferred_list)) { > > > - ds_queue->split_queue_len--; > > > + rcu_read_lock(); > > > + l = list_lru_lock_irqsave(&deferred_split_lru, nid, folio_memcg(folio), &flags); > > > + if (__list_lru_del(&deferred_split_lru, l, &folio->_deferred_list, nid)) { > > > > Maybe worth factoring __list_lru_del() into something that explicitly > > references &folio->_deferred_list rather than open codingin both places? > > Hm, I wouldn't want to encode this into list_lru API, but we could do > a huge_memory.c-local helper? > > folio_deferred_split_del(folio, l, nid) Well, I kind of hate how we're using the global deferred_split_lru all over the place, so a helper woudl be preferable but one that also could be used for khugepaged.c and memory.c also? > > > > @@ -4406,7 +4320,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped) > > > if (folio_test_swapcache(folio)) > > > return; > > > > > > - ds_queue = folio_split_queue_lock_irqsave(folio, &flags); > > > + nid = folio_nid(folio); > > > + > > > + rcu_read_lock(); > > > + memcg = folio_memcg(folio); > > > + l = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags); > > > > Do we really need to hold the lock over all the below, but hmm we can't do > > an irqsave/restore list_lru_add(), and maybe not worth adding one either, > > OK. > > > > Just seems odd to have <__unlocked_add_variant> , > > instinctively feels like it should be just . > > It protects that auxiliary folio_test_partially_mapped() state that > the shrinker also touches. LRU + that page flag + the counts is what > that lock protects. Ack thanks > > > > @@ -4473,45 +4375,47 @@ static bool thp_underused(struct folio *folio) > > > return false; > > > } > > > > > > +static enum lru_status deferred_split_isolate(struct list_head *item, > > > + struct list_lru_one *lru, > > > + void *cb_arg) > > > +{ > > > + struct folio *folio = container_of(item, struct folio, _deferred_list); > > > + struct list_head *freeable = cb_arg; > > > + > > > + if (folio_try_get(folio)) { > > > + list_lru_isolate_move(lru, item, freeable); > > > + return LRU_REMOVED; > > > + } > > > + > > > + /* We lost race with folio_put() */ > > > > Hmm, in the original code, this comment is associated with the partially > > mapped logic, BUT this seems actually correct, because folio_try_get() > > because it does folio_ref_add_unless_zero() only fails if the folio lost > > the race. > > > > So I think you're more correct right? > > I think the original placement was because that else if made it > awkward to place where it should be. But yes, the above is more > correct: the comment refers to what happens when the "get" fails. 
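Going back to the helper idea above, something like the below is roughly
what I mean - an untested sketch against the callside-locking API this
series introduces, naming entirely up for grabs:

	static bool folio_deferred_split_del(struct folio *folio,
					     struct list_lru_one *lru)
	{
		return __list_lru_del(&deferred_split_lru, lru,
				      &folio->_deferred_list,
				      folio_nid(folio));
	}

That keeps the &deferred_split_lru global behind huge_memory.c on the delete
side; with the folio_memcg_alloc_deferred() wrapper suggested earlier
covering the khugepaged.c/memory.c allocation sites, the lru could then be
made static as well.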
Right yeah > > > > - ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags); > > > - /* Take pin on all head pages to avoid freeing them under us */ > > > - list_for_each_entry_safe(folio, next, &ds_queue->split_queue, > > > - _deferred_list) { > > > - if (folio_try_get(folio)) { > > > - folio_batch_add(&fbatch, folio); > > > - } else if (folio_test_partially_mapped(folio)) { > > > - /* We lost race with folio_put() */ > > > - folio_clear_partially_mapped(folio); > > > - mod_mthp_stat(folio_order(folio), > > > - MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); > > > - } > > > > @@ -4534,64 +4438,32 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, > > > } > > > folio_unlock(folio); > > > next: > > > - if (did_split || !folio_test_partially_mapped(folio)) > > > - continue; > > > /* > > > * Only add back to the queue if folio is partially mapped. > > > * If thp_underused returns false, or if split_folio fails > > > * in the case it was underused, then consider it used and > > > * don't add it back to split_queue. > > > */ > > > - fqueue = folio_split_queue_lock_irqsave(folio, &flags); > > > - if (list_empty(&folio->_deferred_list)) { > > > - list_add_tail(&folio->_deferred_list, &fqueue->split_queue); > > > - fqueue->split_queue_len++; > > > + if (!did_split && folio_test_partially_mapped(folio)) { > > > + rcu_read_lock(); > > > + l = list_lru_lock_irqsave(&deferred_split_lru, > > > + folio_nid(folio), > > > + folio_memcg(folio), > > > + &flags); > > > + __list_lru_add(&deferred_split_lru, l, > > > + &folio->_deferred_list, > > > + folio_nid(folio), folio_memcg(folio)); > > > + list_lru_unlock_irqrestore(l, &flags); > > > > Hmm this does make me think it'd be nice to have a list_lru_add() variant > > for irqsave/restore then, since it's a repeating pattern! > > Yeah, this site calls for it the most :( I tried to balance callsite > prettiness with the need to extend the list_lru api; it's just one > caller. And the possible mutations and variants with these locks is > seemingly endless once you open that can of worms... True... > > Case in point: this is process context and we could use > spin_lock_irq() here. I'm just using list_lru_lock_irqsave() because > that's the common variant used by the add and del paths already. > > If I went with a helper, I could do list_lru_add_irq(). > > I think it would actually nicely mirror the list_lru_shrink_walk_irq() > a few lines up. Yeah, I mean I'm pretty sure this repeats quite a few times so is worthy of a helper. > > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -1081,6 +1081,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru > > > } > > > > > > count_vm_event(THP_COLLAPSE_ALLOC); > > > + > > > if (unlikely(mem_cgroup_charge(folio, mm, gfp))) { > > > folio_put(folio); > > > *foliop = NULL; > > > @@ -1089,6 +1090,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru > > > > > > count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1); > > > > Do we want to put this stat counter underneath the below so it's not > > incremented on fail? > > We currently increment it also when the cgroup charge fails, which is > common. The list_lru allocation (a smaller kmalloc) in turn should > basically never fail - we just managed to get a PMD-sized folio. > > Executive summar being: the way we bump them currently is weird. 
I > tried to stay out of it to keep this series on track ;) Ack ok fair enough :) > > > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) { > > > + folio_put(folio); > > > + *foliop = NULL; > > > + return SCAN_CGROUP_CHARGE_FAIL; > > > > Do we not need to uncharge here? > > folio_put() does that. We could do it for symmetry, but there is no > need. And the hard and fast rule is that folio_memcg() is immutable > until the put hits 0, so it would look weird/unexpected. So it does! void __folio_put(struct folio *folio) { ... mem_cgroup_uncharge(folio); ... } I didn't _think_ it would but I guess I didn't realise how much work __folio_put() was doing :) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index a47fb68dd65f..f381cb6bdff1 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -4015,11 +4015,6 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) > > > for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > > memcg->cgwb_frn[i].done = > > > __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq); > > > -#endif > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > - spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); > > > - INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); > > > - memcg->deferred_split_queue.split_queue_len = 0; > > > #endif > > > lru_gen_init_memcg(memcg); > > > return memcg; > > > @@ -4167,11 +4162,10 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) > > > zswap_memcg_offline_cleanup(memcg); > > > > > > memcg_offline_kmem(memcg); > > > - reparent_deferred_split_queue(memcg); > > > /* > > > - * The reparenting of objcg must be after the reparenting of the > > > - * list_lru and deferred_split_queue above, which ensures that they will > > > - * not mistakenly get the parent list_lru and deferred_split_queue. > > > + * The reparenting of objcg must be after the reparenting of > > > + * the list_lru in memcg_offline_kmem(), which ensures that > > > + * they will not mistakenly get the parent list_lru. > > > */ > > > memcg_reparent_objcgs(memcg); > > > reparent_shrinker_deferred(memcg); > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 219b9bf6cae0..e68ceb4aa624 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -4651,13 +4651,19 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > while (orders) { > > > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > > folio = vma_alloc_folio(gfp, order, vma, addr); > > > - if (folio) { > > > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > > > - gfp, entry)) > > > - return folio; > > > + if (!folio) > > > + goto next; > > > + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) { > > > count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); > > > folio_put(folio); > > > + goto next; > > > } > > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) { > > > > Do we need to uncharge here? > > Same here. It's folio state that the put cleans up. Ack, indeed! 
> > > > + folio_put(folio); > > > + goto fallback; > > > + } > > > + return folio; > > > +next: > > > count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); > > > order = next_order(&orders, order); > > > } > > > @@ -5169,24 +5175,28 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > > while (orders) { > > > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > > folio = vma_alloc_folio(gfp, order, vma, addr); > > > - if (folio) { > > > - if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) { > > > - count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); > > > - folio_put(folio); > > > - goto next; > > > - } > > > - folio_throttle_swaprate(folio, gfp); > > > - /* > > > - * When a folio is not zeroed during allocation > > > - * (__GFP_ZERO not used) or user folios require special > > > - * handling, folio_zero_user() is used to make sure > > > - * that the page corresponding to the faulting address > > > - * will be hot in the cache after zeroing. > > > - */ > > > - if (user_alloc_needs_zeroing()) > > > - folio_zero_user(folio, vmf->address); > > > - return folio; > > > + if (!folio) > > > + goto next; > > > > This applies to the above equivalent refactorings, but maybe worth > > separating out this: > > > > if (folio) { ... big branch ... } -> > > if (!folio) goto next; ... what was big branch ... > > > > Refactoring into a separate patch? Makes it easier to see pertinent logic > > changes + helps review/bisectability/fixes/etc. etc. > > Will do. Thanks! > > > > + if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) { > > > + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); > > > + folio_put(folio); > > > + goto next; > > > } > > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) { > > > > Again, do we need to uncharge here? > > folio_put() does it :) Yeah, I should have just checked __folio_put() :>) > > > > + folio_put(folio); > > > + goto fallback; > > > + } > > > + folio_throttle_swaprate(folio, gfp); > > > + /* > > > + * When a folio is not zeroed during allocation > > > + * (__GFP_ZERO not used) or user folios require special > > > + * handling, folio_zero_user() is used to make sure > > > + * that the page corresponding to the faulting address > > > + * will be hot in the cache after zeroing. 
> > > + */ > > > + if (user_alloc_needs_zeroing()) > > > + folio_zero_user(folio, vmf->address); > > > + return folio; > > > next: > > > count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK); > > > order = next_order(&orders, order); > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > > index cec7bb758bdd..f293a62e652a 100644 > > > --- a/mm/mm_init.c > > > +++ b/mm/mm_init.c > > > @@ -1388,19 +1388,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat, > > > pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages); > > > } > > > > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > -static void pgdat_init_split_queue(struct pglist_data *pgdat) > > > -{ > > > - struct deferred_split *ds_queue = &pgdat->deferred_split_queue; > > > - > > > - spin_lock_init(&ds_queue->split_queue_lock); > > > - INIT_LIST_HEAD(&ds_queue->split_queue); > > > - ds_queue->split_queue_len = 0; > > > -} > > > -#else > > > -static void pgdat_init_split_queue(struct pglist_data *pgdat) {} > > > -#endif > > > - > > > #ifdef CONFIG_COMPACTION > > > static void pgdat_init_kcompactd(struct pglist_data *pgdat) > > > { > > > @@ -1416,8 +1403,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) > > > > > > pgdat_resize_init(pgdat); > > > pgdat_kswapd_lock_init(pgdat); > > > - > > > - pgdat_init_split_queue(pgdat); > > > pgdat_init_kcompactd(pgdat); > > > > > > init_waitqueue_head(&pgdat->kswapd_wait); > > > -- > > > 2.53.0 > > > > > > > Overall a lovely amount of code deletion here :) > > > > Thanks for doing this, Cheers Lorenzo > > Thanks for your feedback! No worries, thanks for the series :) Cheers, Lorenzo